  1. May 08, 2015
    • [SPARK-7498] [ML] removed varargs annotation from Params.setDefaults · 29926238
      Joseph K. Bradley authored
      In SPARK-7429 and PR https://github.com/apache/spark/pull/5960, I added the varargs annotation to Params.setDefault which takes a variable number of ParamPairs. It worked locally and on Jenkins for me.
      However, mengxr reported issues compiling on his machine. So I'm reverting the change introduced in https://github.com/apache/spark/pull/5960 by removing varargs.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6021 from jkbradley/revert-varargs and squashes the following commits:
      
      098ed39 [Joseph K. Bradley] removed varargs annotation from Params.setDefaults taking multiple ParamPairs
    • [SPARK-7262] [ML] Binary LogisticRegression with L1/L2 (elastic net) using OWLQN in new ML package · 86ef4cfd
      DB Tsai authored
      1) Handle scaling and addBias internally.
      2) L1/L2 elasticnet using OWLQN optimizer.
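
As a rough illustration of the penalty that item 2 refers to, the sketch below computes an elastic-net term that mixes L1 and L2 regularization. The names `reg_param` and `alpha` and the exact weighting are assumptions made for illustration; the patch's own parameterization may differ, and the OWLQN optimizer itself is not implemented here.

```python
def elastic_net_penalty(weights, reg_param, alpha):
    """Mix of L1 and L2 penalties: alpha=1 is pure L1, alpha=0 is pure L2.

    Hypothetical helper for illustration only, not Spark's API.
    """
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return reg_param * (alpha * l1 + (1.0 - alpha) * 0.5 * l2)
```

The L1 part is not differentiable at zero, which is why an orthant-wise quasi-Newton method such as OWLQN is used rather than plain L-BFGS.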
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #5967 from dbtsai/lor and squashes the following commits:
      
      fa029bb [DB Tsai] made the bound smaller
      0806002 [DB Tsai] better initial intercept and more test
      5c31824 [DB Tsai] fix import
      c387e25 [DB Tsai] Merge branch 'master' into lor
      c84e931 [DB Tsai] Made MultiClassSummarizer private
      f98e711 [DB Tsai] address feedback
      a784321 [DB Tsai] fix style
      8ec65d2 [DB Tsai] remove new line
      f3f8c88 [DB Tsai] add more tests and they match R which is good. fix a bug
      34705bc [DB Tsai] first commit
    • [SPARK-7375] [SQL] Avoid row copying in exchange when sort.serializeMapOutputs takes effect · cde54838
      Josh Rosen authored
      This patch refactors the SQL `Exchange` operator's logic for determining whether map outputs need to be copied before being shuffled. As part of this change, we'll now avoid unnecessary copies in cases where sort-based shuffle operates on serialized map outputs (as in #4450 / SPARK-4550).
      
      This patch also includes a change to copy the input to RangePartitioner partition bounds calculation, which is necessary because this calculation buffers mutable Java objects.
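
Why buffering mutable objects forces a copy can be shown with a small sketch. This is a Python analogy, not the Spark code: `rows_reusing_buffer` is a hypothetical producer that overwrites one mutable row in place, the way an operator might reuse a row object.

```python
def rows_reusing_buffer(values):
    """Yield the same mutable list every time, overwritten in place."""
    buffer = [None]
    for v in values:
        buffer[0] = v
        yield buffer


def sample_without_copy(rows):
    # Every buffered element aliases the one shared row object.
    return [row for row in rows]


def sample_with_copy(rows):
    # Copying each row before buffering preserves its value.
    return [list(row) for row in rows]


aliased = sample_without_copy(rows_reusing_buffer([3, 1, 2]))
copied = sample_with_copy(rows_reusing_buffer([3, 1, 2]))
```

After the loop, every entry of `aliased` shows the last value written, while `copied` keeps the three distinct rows. A consumer that streams rows one at a time does not need the copy, which is the case the patch exploits.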
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5948 from JoshRosen/SPARK-7375 and squashes the following commits:
      
      f305ff3 [Josh Rosen] Reduce scope of some variables in Exchange
      899e1d7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-7375
      6a6bfce [Josh Rosen] Fix issue related to RangePartitioning:
      ad006a4 [Josh Rosen] [SPARK-7375] Avoid defensive copying in exchange operator when sort.serializeMapOutputs takes effect.
    • [SPARK-7231] [SPARKR] Changes to make SparkR DataFrame dplyr friendly. · 0a901dd3
      Shivaram Venkataraman authored
      Changes include
      1. Rename sortDF to arrange
      2. Add new aliases `group_by` and `sample_frac`, `summarize`
      3. Add more user friendly column addition (mutate), rename
      4. Support mean as an alias for avg in Scala and also support n_distinct, n as in dplyr
      
      Using these changes we can pretty much run the examples as described in http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html with the same syntax.
      
      The only thing missing in SparkR is automatic resolution of column names used in an expression, i.e. something like `select(flights, delay)` works in dplyr, but right now we need `select(flights, flights$delay)` or `select(flights, "delay")`. This is a complicated change, so I'll file a new issue for it.
      
      cc sun-rui rxin
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #6005 from shivaram/sparkr-df-api and squashes the following commits:
      
      5e0716a [Shivaram Venkataraman] Fix some roxygen bugs
      1254953 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/apache/spark into sparkr-df-api
      0521149 [Shivaram Venkataraman] Changes to make SparkR DataFrame dplyr friendly. Changes include 1. Rename sortDF to arrange 2. Add new aliases `group_by` and `sample_frac`, `summarize` 3. Add more user friendly column addition (mutate), rename 4. Support mean as an alias for avg in Scala and also support n_distinct, n as in dplyr
    • [SPARK-7451] [YARN] Preemption of executors is counted as failure causing Spark job to fail · b6c797b0
      Ashwin Shankar authored
      Added a check to handle container exit status for the preemption scenario, log an INFO message in such cases and move on.
      andrewor14
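
The check described above can be sketched roughly as follows. This is a Python illustration, not the patch itself; the function name is hypothetical, and the value `-102` for a preempted container is an assumption based on YARN's `ContainerExitStatus.PREEMPTED`.

```python
import logging

# Assumption: YARN reports preempted containers with exit status -102
# (ContainerExitStatus.PREEMPTED). Treat the value as illustrative.
PREEMPTED = -102


def count_executor_failures(exit_statuses):
    """Count failed containers, skipping clean exits and preemptions."""
    failures = 0
    for status in exit_statuses:
        if status == 0:
            continue
        if status == PREEMPTED:
            # Preemption is the scheduler's decision, not an executor fault.
            logging.info("Container preempted by scheduler; not counted as a failure")
            continue
        failures += 1
    return failures
```

Keeping preemptions out of the failure count means a busy cluster that reclaims containers does not push the job over its allowed number of executor failures.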
      
      Author: Ashwin Shankar <ashankar@netflix.com>
      
      Closes #5993 from ashwinshankar77/SPARK-7451 and squashes the following commits:
      
      90900cf [Ashwin Shankar] Fix log info message
      cf8b6cf [Ashwin Shankar] Stop counting preemption of executors as failure
    • [SPARK-7488] [ML] Feature Parity in PySpark for ml.recommendation · 84bf931f
      Burak Yavuz authored
      Adds a Python API for `ALS` under `ml.recommendation` in PySpark. Also adds seed as a settable parameter in the Scala implementation of ALS.
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #6015 from brkyvz/ml-rec and squashes the following commits:
      
      be6e931 [Burak Yavuz] addressed comments
      eaed879 [Burak Yavuz] readd numFeatures
      0bd66b1 [Burak Yavuz] fixed seed
      7f6d964 [Burak Yavuz] merged master
      52e2bda [Burak Yavuz] added ALS
    • [SPARK-7237] Clean function in several RDD methods · 54e6fa05
      tedyu authored
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #5959 from ted-yu/master and squashes the following commits:
      
      f83d445 [tedyu] Move cleaning outside of mapPartitionsWithIndex
      56d7c92 [tedyu] Consolidate import of Random
      f6014c0 [tedyu] Remove cleaning in RDD#filterWith
      36feb6c [tedyu] Try to get correct syntax
      55d01eb [tedyu] Try to get correct syntax
      c2786df [tedyu] Correct syntax
      d92bfcf [tedyu] Correct syntax in test
      164d3e4 [tedyu] Correct variable name
      8b50d93 [tedyu] Address Andrew's review comments
      0c8d47e [tedyu] Add test for mapWith()
      6846e40 [tedyu] Add test for flatMapWith()
      6c124a9 [tedyu] Clean function in several RDD methods
    • [SPARK-7469] [SQL] DAG visualization: show SQL query operators · bd61f070
      Andrew Or authored
      The DAG visualization currently displays only low-level Spark primitives (e.g. `map`, `reduceByKey`, `filter` etc.). For SQL, these aren't particularly useful. Instead, we should display higher level physical operators (e.g. `Filter`, `Exchange`, `ShuffleHashJoin`). cc marmbrus
      
      -----------------
      **Before**
      <img src="https://issues.apache.org/jira/secure/attachment/12731586/before.png" width="600px"/>
      -----------------
      **After** (Pay attention to the words)
      <img src="https://issues.apache.org/jira/secure/attachment/12731587/after.png" width="600px"/>
      -----------------
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #5999 from andrewor14/dag-viz-sql and squashes the following commits:
      
      0db23a4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-sql
      1e211db [Andrew Or] Update comment
      0d49fd6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-sql
      ffd237a [Andrew Or] Fix style
      202dac1 [Andrew Or] Make ignoreParent false by default
      e61b1ab [Andrew Or] Visualize SQL operators, not low-level Spark primitives
      569034a [Andrew Or] Add a flag to ignore parent settings and scopes
    • [SPARK-6955] Perform port retries at NettyBlockTransferService level · ffdc40ce
      Aaron Davidson authored
      Currently we're doing port retries at the TransportServer level, but this is not specified by the TransportContext API, and it has further-reaching impacts, such as causing undesirable behavior for the YARN and Standalone shuffle services.
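
The idea of moving the retry loop up to the caller can be sketched as below. This is a Python analogy using plain sockets; `bind_with_retries` and its parameters are hypothetical and do not mirror Spark's classes. The low-level bind is attempted exactly once per port, and only the caller decides whether to try the next port.

```python
import socket


def bind_with_retries(start_port, max_retries=16, host="127.0.0.1"):
    """Try start_port, start_port + 1, ... and return (socket, port) on success."""
    last_error = None
    for offset in range(max_retries + 1):
        port = start_port + offset
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            # A single bind attempt: no hidden retries at this level.
            sock.bind((host, port))
            return sock, port
        except OSError as err:
            sock.close()
            last_error = err
    raise OSError(f"could not bind after {max_retries} retries") from last_error
```

A service that must listen on one fixed, well-known port (as a shuffle service would) can simply call the single-attempt bind and fail fast instead of silently ending up on a neighbouring port.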
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #5575 from aarondav/port-bind and squashes the following commits:
      
      3c2d6ed [Aaron Davidson] Oops, never do it.
      a5d9432 [Aaron Davidson] Remove shouldHostShuffleServiceIfEnabled
      e901eb2 [Aaron Davidson] fix local-cluster mode for ExternalShuffleServiceSuite
      59e5e38 [Aaron Davidson] [SPARK-6955] Perform port retries at NettyBlockTransferService level
    • updated ec2 instance types · 1c78f686
      Brendan Collins authored
      I needed to run some d2 instances, so I updated the spark_ec2.py accordingly
      
      Author: Brendan Collins <bcollins@blueraster.com>
      
      Closes #6014 from brendancol/ec2-instance-types-update and squashes the following commits:
      
      d7b4191 [Brendan Collins] Merge branch 'ec2-instance-types-update' of github.com:brendancol/spark into ec2-instance-types-update
      6366c45 [Brendan Collins] added back cc1.4xlarge
      fc2931f [Brendan Collins] updated ec2 instance types
      80c2aa6 [Brendan Collins] vertically aligned whitespace
      85c6236 [Brendan Collins] vertically aligned whitespace
      1657c26 [Brendan Collins] updated ec2 instance types
    • [SPARK-5913] [MLLIB] Python API for ChiSqSelector · 35c9599b
      Yanbo Liang authored
      Add a Python API for mllib.feature.ChiSqSelector
      https://issues.apache.org/jira/browse/SPARK-5913
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #5939 from yanboliang/spark-5913 and squashes the following commits:
      
      cdaac99 [Yanbo Liang] Python API for ChiSqSelector
    • [SPARK-4699] [SQL] Make caseSensitive configurable in spark sql analyzer · 6dad76e5
      Jacky Li authored
      based on #3558
      
      Author: Jacky Li <jacky.likun@huawei.com>
      Author: wangfei <wangfei1@huawei.com>
      Author: scwf <wangfei1@huawei.com>
      
      Closes #5806 from scwf/case and squashes the following commits:
      
      cd51712 [wangfei] fix compile
      d4b724f [wangfei] address michael's comment
      af512c7 [wangfei] fix conflicts
      4ef1be7 [wangfei] fix conflicts
      269cf21 [scwf] fix conflicts
      b73df6c [scwf] style issue
      9e11752 [scwf] improve SimpleCatalystConf
      b35529e [scwf] minor style
      a3f7659 [scwf] remove unsed imports
      2a56515 [scwf] fix conflicts
      6db4bf5 [scwf] also fix for HiveContext
      7fc4a98 [scwf] fix test case
      d5a9933 [wangfei] fix style
      eee75ba [wangfei] fix EmptyConf
      6ef31cf [wangfei] revert pom changes
      5d7c456 [wangfei] set CASE_SENSITIVE false in TestHive
      966e719 [wangfei] set CASE_SENSITIVE false in hivecontext
      fd30e25 [wangfei] added override
      69b3b70 [wangfei] fix AnalysisSuite
      5472b08 [wangfei] fix compile issue
      56034ca [wangfei] fix conflicts and improve for catalystconf
      664d1e9 [Jacky Li] Merge branch 'master' of https://github.com/apache/spark into case
      12eca9a [Jacky Li] solve conflict with master
      39e369c [Jacky Li] fix confilct after DataFrame PR
      dee56e9 [Jacky Li] fix test case failure
      05b09a3 [Jacky Li] fix conflict base on the latest master branch
      73c16b1 [Jacky Li] fix bug in sql/hive
      9bf4cc7 [Jacky Li] fix bug in catalyst
      005c56d [Jacky Li] make SQLContext caseSensitivity configurable
      6332e0f [Jacky Li] fix bug
      fcbf0d9 [Jacky Li] fix scalastyle check
      e7bca31 [Jacky Li] make caseSensitive configuration in Analyzer and Catalog
      91b1b96 [Jacky Li] make caseSensitive configurable in Analyzer
      f57f15c [Jacky Li] add testcase
      578d167 [Jacky Li] make caseSensitive configurable
    • [SPARK-7390] [SQL] Only merge other CovarianceCounter when its count is greater than zero · 90527f56
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-7390
      
      Also fix a minor typo.
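
The guard named in the commit title can be illustrated with a small mergeable co-moment counter. This is a Python sketch under assumptions: the field names and the pairwise merge formula are a standard formulation chosen for illustration, not a copy of the Scala class.

```python
class CovarianceCounter:
    """Running co-moment of (x, y) pairs that can be merged across partitions."""

    def __init__(self):
        self.count = 0
        self.x_avg = 0.0
        self.y_avg = 0.0
        self.ck = 0.0  # sum of (x - x_avg) * (y - y_avg)

    def add(self, x, y):
        self.count += 1
        dx = x - self.x_avg
        self.x_avg += dx / self.count
        self.y_avg += (y - self.y_avg) / self.count
        self.ck += dx * (y - self.y_avg)
        return self

    def merge(self, other):
        # Without this guard, merging two empty counters divides by zero.
        if other.count > 0:
            total = self.count + other.count
            dx = self.x_avg - other.x_avg
            dy = self.y_avg - other.y_avg
            self.ck += other.ck + dx * dy * self.count * other.count / total
            self.x_avg = (self.x_avg * self.count + other.x_avg * other.count) / total
            self.y_avg = (self.y_avg * self.count + other.y_avg * other.count) / total
            self.count = total
        return self
```

Empty partitions produce counters with `count == 0`, so skipping them leaves the running statistics untouched.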
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #5931 from viirya/fix_covariancecounter and squashes the following commits:
      
      352eda6 [Liang-Chi Hsieh] Only merge other CovarianceCounter when its count is greater than zero.
    • [SPARK-7378] [CORE] Handle deep links to unloaded apps. · 5467c34c
      Marcelo Vanzin authored
      The code was treating deep links as if they were attempt IDs, so
      for example if you tried to load "/history/app1/jobs" directly,
      that would fail because the code would treat "jobs" as an attempt id.
      
      This change modifies the code to try both cases - first without an
      attempt id, then with it, so that deep links are handled correctly.
      This assumes that the links in the Spark UI do not clash with the
      attempt id namespace, though, which is the case for YARN at least,
      which is the only backend that currently publishes attempt IDs.
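
The two-step lookup described above might look roughly like this. It is a Python sketch, not the History Server code; `resolve_history_path` and the `load_app_ui` callback are hypothetical names introduced for illustration.

```python
def resolve_history_path(path, load_app_ui):
    """Return (app_id, attempt_id, remainder) for a '/history/...' path.

    load_app_ui(app_id, attempt_id) is a hypothetical callback that returns
    True when that application (and attempt, if given) can be loaded.
    """
    parts = [p for p in path.split("/") if p]
    if len(parts) < 2 or parts[0] != "history":
        return None
    app_id, rest = parts[1], parts[2:]
    # First case: no attempt id, so everything after the app id is a UI link.
    if load_app_ui(app_id, None):
        return app_id, None, "/".join(rest)
    # Second case: the next path segment is an attempt id.
    if rest and load_app_ui(app_id, rest[0]):
        return app_id, rest[0], "/".join(rest[1:])
    return None
```

With this ordering, "/history/app1/jobs" resolves to the `jobs` page of `app1` instead of failing on a nonexistent attempt named "jobs".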
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #5922 from vanzin/SPARK-7378 and squashes the following commits:
      
      96f648b [Marcelo Vanzin] Fix comparison.
      ed3bcd4 [Marcelo Vanzin] Merge branch 'master' into SPARK-7378
      23483e4 [Marcelo Vanzin] Fat fingers.
      b728f08 [Marcelo Vanzin] [SPARK-7378] [core] Handle deep links to unloaded apps.
    • [MINOR] [CORE] Allow History Server to read kerberos opts from config file. · 9042f8f3
      Marcelo Vanzin authored
      Order of initialization code was wrong.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #5998 from vanzin/hs-conf-fix and squashes the following commits:
      
      00b6b6b [Marcelo Vanzin] [minor] [core] Allow History Server to read kerberos opts from config file.
    • [SPARK-7466] DAG visualization: fix orphan nodes · 3b0c5e71
      Andrew Or authored
      Simple fix. We were comparing an option with `null`.
      
      Before:
      <img src="https://issues.apache.org/jira/secure/attachment/12731383/before.png" width="250px"/>
      After:
      <img src="https://issues.apache.org/jira/secure/attachment/12731384/after.png" width="250px"/>
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6002 from andrewor14/dag-viz-orphan-nodes and squashes the following commits:
      
      a1468dc [Andrew Or] Fix null check
    • [MINOR] Defeat early garbage collection of test suite variable · 31da40df
      Tim Ellison authored
      The JVM is free to collect references to variables that no longer participate in a computation.  This simple patch adds an operation to the variable 'rdd' to ensure it is not collected early in the test suite's explicit calls to GC.
      
      ref: http://bugs.java.com/view_bug.do?bug_id=6721588
      
      Author: Tim Ellison <t.p.ellison@gmail.com>
      
      Closes #6010 from tellison/master and squashes the following commits:
      
      77d1c8f [Tim Ellison] Defeat early garbage collection of test suite variable by aggressive JVMs
    • [SPARK-7489] [SPARK SHELL] Spark shell crashes when compiled with scala 2.11 · 4e7360e1
      vinodkc authored
      Spark shell crashes when compiled with Scala 2.11 and SPARK_PREPEND_CLASSES=true.
      
      There is a similar resolved JIRA issue, SPARK-7470, and a PR, https://github.com/apache/spark/pull/5997, which handled the same issue only for Scala 2.10.
      
      Author: vinodkc <vinod.kc.in@gmail.com>
      
      Closes #6013 from vinodkc/fix_sqlcontext_exception_scala_2.11 and squashes the following commits:
      
      119061c [vinodkc] Spark shell crashes when compiled with scala 2.11
    • [WEBUI] Remove debug feature for vis.js · c45c09b0
      Kousuke Saruta authored
      `vis.min.js` refers to `vis.map`, which in turn refers to `vis.js`; both are used only for debugging `vis.js`, and this debug feature is not needed for Spark itself.
      
      This issue is really minor, so I did not file it in JIRA.
      
      /CC andrewor14
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #5994 from sarutak/remove-debug-feature-for-vis and squashes the following commits:
      
      8be038f [Kousuke Saruta] Remove vis.map entry from .rat-exclude
      7404945 [Kousuke Saruta] Removed debug feature for vis.js
    • [MINOR] Ignore python/lib/pyspark.zip · dc71e47f
      zsxwing authored
      Add `python/lib/pyspark.zip` to `.gitignore`. After merging #5580, `python/lib/pyspark.zip` will be generated when building Spark.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6017 from zsxwing/gitignore and squashes the following commits:
      
      39b10c4 [zsxwing] Ignore python/lib/pyspark.zip
    • [SPARK-7490] [CORE] [Minor] MapOutputTracker.deserializeMapStatuses: close input streams · 25889d8d
      Evan Jones authored
      GZIPInputStream allocates native memory that is not freed until close() or
      when the finalizer runs. It is best to close() these streams explicitly.
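
The same "close the stream explicitly" pattern, sketched in Python rather than in Spark's Scala: the `with` block closes the gzip stream as soon as deserialization finishes or fails. The function names and the use of `pickle` are assumptions made for illustration.

```python
import gzip
import io
import pickle


def serialize_statuses(statuses):
    """Compress and serialize a list of statuses into bytes."""
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
        pickle.dump(statuses, gz)
    return buffer.getvalue()


def deserialize_statuses(data):
    """Inverse of serialize_statuses; the stream is always closed."""
    # The with-block closes the gzip stream even if unpickling raises.
    with gzip.GzipFile(fileobj=io.BytesIO(data), mode="rb") as gz:
        return pickle.load(gz)
```

Relying on a finalizer instead would leave the release of the stream's resources to whenever the garbage collector happens to run.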
      
      stephenh made the same change for serializeMapStatuses in commit b0d884f0. This is the same change for deserialize.
      
      (I ran the unit test suite! it seems to have passed. I did not make a JIRA since this seems "trivial", and the guidelines suggest it is not required for trivial changes)
      
      Author: Evan Jones <ejones@twitter.com>
      
      Closes #5982 from evanj/master and squashes the following commits:
      
      0d76e85 [Evan Jones] [CORE] MapOutputTracker.deserializeMapStatuses: close input streams
    • [SPARK-6627] Finished rename to ShuffleBlockResolver · 4b3bb0e4
      Kay Ousterhout authored
      The previous cleanup-commit for SPARK-6627 renamed ShuffleBlockManager
      to ShuffleBlockResolver, but didn't rename the associated subclasses and
      variables; this commit does that.
      
      I'm unsure whether it's ok to rename ExternalShuffleBlockManager, since that's technically a public class?
      
      cc pwendell
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #5764 from kayousterhout/SPARK-6627 and squashes the following commits:
      
      43add1e [Kay Ousterhout] Spacing fix
      96080bf [Kay Ousterhout] Test fixes
      d8a5d36 [Kay Ousterhout] [SPARK-6627] Finished rename to ShuffleBlockResolver
    • [SPARK-7133] [SQL] Implement struct, array, and map field accessor · 2d05f325
      Wenchen Fan authored
      This is the first step: generalize UnresolvedGetField to support map, struct, and array types.
      TODO: add `apply` in Scala and `__getitem__` in Python, and unify the `getItem` and `getField` methods into a single API (or should we keep them for compatibility?).
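
What a single generalized accessor means can be sketched on plain Python values. This is an analogy only: `get_field` is a hypothetical helper, with a dict standing in for a map, a list for an array, and a namedtuple for a struct. It does not reflect how Catalyst resolves expressions.

```python
from collections import namedtuple


def get_field(value, key):
    """Resolve one accessor against a map (dict), array (list/tuple) or struct."""
    if isinstance(value, dict):
        # Map access: a missing key yields None.
        return value.get(key)
    if isinstance(value, (list, tuple)) and isinstance(key, int):
        # Array access by ordinal: out of range yields None.
        return value[key] if 0 <= key < len(value) else None
    if isinstance(key, str) and hasattr(value, key):
        # Struct access by field name.
        return getattr(value, key)
    raise TypeError(f"cannot access {key!r} on {type(value).__name__}")
```

One entry point that dispatches on the container type is what lets a single `apply` or `__getitem__` cover all three cases.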
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #5744 from cloud-fan/generalize and squashes the following commits:
      
      715c589 [Wenchen Fan] address comments
      7ea5b31 [Wenchen Fan] fix python test
      4f0833a [Wenchen Fan] add python test
      f515d69 [Wenchen Fan] add apply method and test cases
      8df6199 [Wenchen Fan] fix python test
      239730c [Wenchen Fan] fix test compile
      2a70526 [Wenchen Fan] use _bin_op in dataframe.py
      6bf72bc [Wenchen Fan] address comments
      3f880c3 [Wenchen Fan] add java doc
      ab35ab5 [Wenchen Fan] fix python test
      b5961a9 [Wenchen Fan] fix style
      c9d85f5 [Wenchen Fan] generalize UnresolvedGetField to support all map, struct, and array
    • [SPARK-7298] Harmonize style of new visualizations · a1ec08f7
      Matei Zaharia authored
      - Colors on the timeline now match the rest of the UI
      - The expandable buttons to show timeline view, DAG, etc are now more visible
      - Timeline text is smaller
      - DAG visualization text and colors are more consistent throughout
      - Fix some JavaScript style issues
      - Various small fixes throughout (e.g. inconsistent capitalization, some confusing names, HTML escaping, etc)
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #5942 from mateiz/ui and squashes the following commits:
      
      def38d0 [Matei Zaharia] Add some tooltips
      4c5a364 [Matei Zaharia] Reduce stage and rank separation slightly
      43dcbe3 [Matei Zaharia] Some updates to DAG
      fac734a [Matei Zaharia] tweaks
      6a6705d [Matei Zaharia] More fixes
      67629f5 [Matei Zaharia] Various small tweaks
    • [SPARK-7436] Fixed instantiation of custom recovery mode factory and added tests · 35d6a99c
      Jacek Lewandowski authored
      Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>
      
      Closes #5977 from jacek-lewandowski/SPARK-7436 and squashes the following commits:
      
      ff0a3c2 [Jacek Lewandowski] SPARK-7436: Fixed instantiation of custom recovery mode factory and added tests
    • [SPARK-6824] Fill the docs for DataFrame API in SparkR · 008a60dd
      hqzizania authored
      This patch also stops the RDD docs from being built as part of roxygen, simply by deleting the "'" from each "#'" comment prefix.
      
      Author: hqzizania <qian.huang@intel.com>
      Author: qhuang <qian.huang@intel.com>
      
      Closes #5969 from hqzizania/R1 and squashes the following commits:
      
      6d27696 [qhuang] fixes in NAMESPACE
      eb4b095 [qhuang] remove more docs
      6394579 [qhuang] remove RDD docs in generics.R
      6813860 [hqzizania] Fill the docs for DataFrame API in SparkR
      857220f [hqzizania] remove the pairRDD docs from being built as a part of roxygen
      c045d64 [hqzizania] remove the RDD docs from being built as a part of roxygen
    • [SPARK-7474] [MLLIB] update ParamGridBuilder doctest · 65afd3ce
      Xiangrui Meng authored
      Multiline commands are properly handled in this PR. oefirouz
      
      ![screen shot 2015-05-07 at 10 53 25 pm](https://cloud.githubusercontent.com/assets/829644/7531290/02ad2fd4-f50c-11e4-8c04-e58d1a61ad69.png)
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6001 from mengxr/SPARK-7474 and squashes the following commits:
      
      b94b11d [Xiangrui Meng] update ParamGridBuilder doctest
    • [SPARK-7383] [ML] Feature Parity in PySpark for ml.features · f5ff4a84
      Burak Yavuz authored
      Implemented Python wrappers for the Scala functions that did not yet have them in `ml.features`.
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #5991 from brkyvz/ml-feat-PR and squashes the following commits:
      
      adcca55 [Burak Yavuz] add regex tokenizer to __all__
      b91cb44 [Burak Yavuz] addressed comments
      bd39fd2 [Burak Yavuz] remove addition
      b82bd7c [Burak Yavuz] Parity in PySpark for ml.features
    • [SPARK-3454] separate json endpoints for data in the UI · c796be70
      Imran Rashid authored
      Exposes data available in the UI as json over http.  Key points:
      
      * new endpoints, handled independently of existing XyzPage classes.  Root entrypoint is `JsonRootResource`
      * Uses jersey + jackson for routing & converting POJOs into json
      * tests against known results in `HistoryServerSuite`
      * also fixes some minor issues w/ the UI -- synchronizing on access to `StorageListener` & `StorageStatusListener`, and fixing some inconsistencies w/ the way we handle retained jobs & stages.
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #5940 from squito/SPARK-3454_better_test_files and squashes the following commits:
      
      1a72ed6 [Imran Rashid] rats
      85fdb3e [Imran Rashid] Merge branch 'no_php' into SPARK-3454
      1fc65b0 [Imran Rashid] Revert "Revert "[SPARK-3454] separate json endpoints for data in the UI""
      1276900 [Imran Rashid] get rid of giant event file, replace w/ smaller one; check both shuffle read & shuffle write
      4e12013 [Imran Rashid] just use test case name for expectation file name
      863ef64 [Imran Rashid] rename json files to avoid strange file names and not look like php
    • [SPARK-6869] [PYSPARK] Add pyspark archives path to PYTHONPATH · ebff7327
      Lianhui Wang authored
      Based on https://github.com/apache/spark/pull/5478, which provides a PYSPARK_ARCHIVES_PATH environment variable. With this PR, we only need to export PYSPARK_ARCHIVES_PATH=/user/spark/pyspark.zip,/user/spark/python/lib/py4j-0.8.2.1-src.zip in conf/spark-env.sh when PySpark is not installed on each node of YARN. I ran a Python application successfully on yarn-client and yarn-cluster with this PR.
      andrewor14 sryza Sephiroth-Lin Can you take a look at this? Thanks.
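
Why a zip archive on the Python path is enough can be shown with a local sketch. Python can import modules directly from a zip file listed in `sys.path` (or `PYTHONPATH`). The helper name, the demo archive, and the module inside it are all hypothetical; the sketch handles only local files and does not cover how Spark ships archives to YARN containers.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def add_archives_to_path(archives):
    """Append each existing archive in a comma-separated list to sys.path."""
    for archive in archives.split(","):
        if archive and os.path.exists(archive) and archive not in sys.path:
            sys.path.append(archive)


# Build a throwaway archive containing one hypothetical module.
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "demo_lib.zip")
with zipfile.ZipFile(archive_path, "w") as zf:
    zf.writestr("demo_archive_mod.py", "ANSWER = 42\n")

add_archives_to_path(archive_path)
importlib.invalidate_caches()
demo = importlib.import_module("demo_archive_mod")
```

The same mechanism is what makes `pyspark.zip` and the py4j source zip usable on nodes where PySpark is not installed.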
      
      Author: Lianhui Wang <lianhuiwang09@gmail.com>
      
      Closes #5580 from lianhuiwang/SPARK-6869 and squashes the following commits:
      
      66ffa43 [Lianhui Wang] Update Client.scala
      c2ad0f9 [Lianhui Wang] Update Client.scala
      1c8f664 [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
      008850a [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
      f0b4ed8 [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
      150907b [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
      20402cd [Lianhui Wang] use ZipEntry
      9d87c3f [Lianhui Wang] update scala style
      e7bd971 [Lianhui Wang] address vanzin's comments
      4b8a3ed [Lianhui Wang] use pyArchivesEnvOpt
      e6b573b [Lianhui Wang] address vanzin's comments
      f11f84a [Lianhui Wang] zip pyspark archives
      5192cca [Lianhui Wang] update import path
      3b1e4c8 [Lianhui Wang] address tgravescs's comments
      9396346 [Lianhui Wang] put zip to make-distribution.sh
      0d2baf7 [Lianhui Wang] update import paths
      e0179be [Lianhui Wang] add zip pyspark archives in build or sparksubmit
      31e8e06 [Lianhui Wang] update code style
      9f31dac [Lianhui Wang] update code and add comments
      f72987c [Lianhui Wang] add archives path to PYTHONPATH
    • [SPARK-7392] [CORE] bugfix: Kryo buffer size cannot be larger than 2M · c2f0821a
      Zhang, Liye authored
      Author: Zhang, Liye <liye.zhang@intel.com>
      
      Closes #5934 from liyezhang556520/kryoBufSize and squashes the following commits:
      
      5707e04 [Zhang, Liye] fix import order
      8693288 [Zhang, Liye] replace multiplier with ByteUnit methods
      9bf93e9 [Zhang, Liye] add tests
      d91e5ed [Zhang, Liye] change kb to mb
    • [SPARK-7232] [SQL] Add a Substitution batch for spark sql analyzer · f496bf3c
      wangfei authored
      Added a new batch named `Substitution` before the `Resolution` batch. The motivation is that in some cases we want to perform substitutions on the parsed logical plan before resolving it.
      Consider these two cases:
      1 CTE: for a CTE we first build a raw logical plan
      ```
      'With Map(q1 -> 'Subquery q1
                         'Project ['key]
                            'UnresolvedRelation [src], None)
       'Project [*]
        'Filter ('key = 5)
         'UnresolvedRelation [q1], None
      ```
      In the `With` logical plan there is a map that stores (`q1 -> subquery`); we first want to strip off the `With` command and substitute the `q1` of `UnresolvedRelation` with the `subquery`.
      
      2 Another example is window functions: the user may define named windows, and we also need to substitute the window name in the child with the concrete window definition. This should also be done in the Substitution batch.
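
The CTE case above can be sketched on a toy plan representation. This is a Python illustration with plans modelled as nested tuples; the node shapes and the function `substitute_ctes` are assumptions made for the example, not Catalyst's classes or rules.

```python
def substitute_ctes(plan, cte_map=None):
    """Strip `With` nodes and replace matching UnresolvedRelation nodes.

    Toy node shapes (assumed for illustration):
      ("With", {alias: subplan}, child)
      ("UnresolvedRelation", name)
      (kind, argument, child) for every other operator
    """
    cte_map = cte_map or {}
    kind = plan[0]
    if kind == "With":
        _, definitions, child = plan
        scope = dict(cte_map)
        scope.update(definitions)
        # The With node itself disappears; only its child survives.
        return substitute_ctes(child, scope)
    if kind == "UnresolvedRelation":
        # A relation named after a CTE alias becomes the CTE's subplan.
        return cte_map.get(plan[1], plan)
    return plan[:-1] + (substitute_ctes(plan[-1], cte_map),)


subquery = ("Subquery", "q1", ("Project", ["key"], ("UnresolvedRelation", "src")))
parsed = (
    "With",
    {"q1": subquery},
    ("Project", ["*"], ("Filter", "key = 5", ("UnresolvedRelation", "q1"))),
)
substituted = substitute_ctes(parsed)
```

After substitution the plan contains no `With` node and no reference to `q1` as a relation, so the later resolution step only has to deal with real tables such as `src`.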
      
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #5776 from scwf/addbatch and squashes the following commits:
      
      d4b962f [wangfei] added WindowsSubstitution
      70f6932 [wangfei] Merge branch 'master' of https://github.com/apache/spark into addbatch
      ecaeafb [wangfei] address yhuai's comments
      553005a [wangfei] fix test case
      0c54798 [wangfei] address comments
      29aaaaf [wangfei] fix compile
      1c9a092 [wangfei] added Substitution bastch
    • [SPARK-7470] [SQL] Spark shell SQLContext crashes without hive · 714db2ef
      Andrew Or authored
      This only happens if you have `SPARK_PREPEND_CLASSES` set. Then I built it with `build/sbt clean assembly compile` and just ran it with `bin/spark-shell`.
      ```
      ...
      15/05/07 17:07:30 INFO EventLoggingListener: Logging events to file:/tmp/spark-events/local-1431043649919
      15/05/07 17:07:30 INFO SparkILoop: Created spark context..
      Spark context available as sc.
      java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf
      	at java.lang.Class.getDeclaredConstructors0(Native Method)
      	at java.lang.Class.privateGetDeclaredConstructors(Class.java:2493)
      	at java.lang.Class.getConstructor0(Class.java:2803)
      ...
      Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf
      	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
      	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
      	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
      	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
      	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
      	... 52 more
      
      <console>:10: error: not found: value sqlContext
             import sqlContext.implicits._
                    ^
      <console>:10: error: not found: value sqlContext
             import sqlContext.sql
                    ^
      ```
      yhuai marmbrus
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #5997 from andrewor14/sql-shell-crash and squashes the following commits:
      
      61147e6 [Andrew Or] Also expect NoClassDefFoundError
  2. May 07, 2015
    • [SPARK-6986] [SQL] Use Serializer2 in more cases. · 3af423c9
      Yin Huai authored
      With https://github.com/apache/spark/commit/0a2b15ce43cf6096e1a7ae060b7c8a4010ce3b92, the serialization and deserialization streams have enough information to determine whether they are handling a key-value pair, a key, or a value. It is therefore safe to use `SparkSqlSerializer2` in more cases.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #5849 from yhuai/serializer2MoreCases and squashes the following commits:
      
      53a5eaa [Yin Huai] Josh's comments.
      487f540 [Yin Huai] Use BufferedOutputStream.
      8385f95 [Yin Huai] Always create a new row at the deserialization side to work with sort merge join.
      c7e2129 [Yin Huai] Update tests.
      4513d13 [Yin Huai] Use Serializer2 in more places.
      3af423c9
    • Shuo Xiang's avatar
      [SPARK-7452] [MLLIB] fix bug in topBykey and update test · 92f8f803
      Shuo Xiang authored
      The `toArray` function of `BoundedPriorityQueue` does not necessarily preserve order. Add a counter-example as a test, which the original implementation would fail.
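
      The ordering pitfall is not specific to Spark. As a rough illustration, `java.util.PriorityQueue` (a stand-in here, not Spark's `BoundedPriorityQueue`) documents that `toArray` returns elements in no particular order, so a top-k helper has to sort explicitly. The names `TopKSketch` and `topK` are hypothetical:

      ```java
      import java.util.Arrays;
      import java.util.Collections;
      import java.util.PriorityQueue;

      final class TopKSketch {
          // Keeps the k largest values in a min-heap, then sorts them explicitly.
          static Integer[] topK(int[] values, int k) {
              PriorityQueue<Integer> heap = new PriorityQueue<>();
              for (int v : values) {
                  heap.add(v);
                  if (heap.size() > k) {
                      heap.poll(); // evict the smallest element seen so far
                  }
              }
              // toArray reflects the heap's internal layout, not sorted order.
              Integer[] result = heap.toArray(new Integer[0]);
              // Sort largest-first so callers get a deterministic ranking.
              Arrays.sort(result, Collections.reverseOrder());
              return result;
          }

          public static void main(String[] args) {
              System.out.println(Arrays.toString(topK(new int[]{3, 9, 1, 7, 5, 8}, 3)));
          }
      }
      ```

      Skipping the explicit sort would still return the right elements, just not necessarily in rank order, which is the kind of failure a counter-example test is meant to catch.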
      
      Author: Shuo Xiang <shuoxiangpub@gmail.com>
      
      Closes #5990 from coderxiang/topbykey-test and squashes the following commits:
      
      98804c9 [Shuo Xiang] fix bug in topBykey and update test
      92f8f803
    • Michael Armbrust's avatar
      [SPARK-6908] [SQL] Use isolated Hive client · cd1d4110
      Michael Armbrust authored
      This PR switches Spark SQL's Hive support to use the isolated hive client interface introduced by #5851, instead of directly interacting with the client.  By using this isolated client we can now allow users to dynamically configure the version of Hive that they are connecting to by setting `spark.sql.hive.metastore.version` without the need to recompile.  This also greatly reduces the surface area for our interaction with the hive libraries, hopefully making it easier to support other versions in the future.
      
      Jars for the desired hive version can be configured using `spark.sql.hive.metastore.jars`, which accepts the following options:
       - a colon-separated list of jar files or directories for hive and hadoop.
       - `builtin` - attempt to discover the jars that were used to load Spark SQL and use those. This
                  option is only valid when using the execution version of Hive.
       - `maven` - download the correct version of hive on demand from maven.
      
      By default, `builtin` is used for Hive 13.
      
      This PR also removes the test step for building against Hive 12, as this will no longer be required to talk to Hive 12 metastores.  However, the full removal of the Shim is deferred until a later PR.
      
      Remaining TODOs:
       - Remove the Hive Shims and inline code for Hive 13.
       - Several HiveCompatibility tests are not yet passing.
        - `nullformatCTAS` - As detailed below, we now are handling CTAS parsing ourselves instead of hacking into the Hive semantic analyzer.  However, we currently only handle the common cases and not things like CTAS where the null format is specified.
        - `combine1` now leaks state about compression somehow, breaking all subsequent tests.  As such, we currently add it to the blacklist.
        - `part_inherit_tbl_props` and `part_inherit_tbl_props_with_star` do not work anymore.  We are correctly propagating the information
        - "load_dyn_part14.*" - These tests pass when run on their own, but fail when run with all other tests.  It seems our `RESET` mechanism may not be as robust as it used to be?
      
      Other required changes:
       -  `CreateTableAsSelect` no longer carries parts of the HiveQL AST with it through the query execution pipeline.  Instead, we parse CTAS during the HiveQL conversion and construct a `HiveTable`.  The full parsing here is not yet complete as detailed above in the remaining TODOs.  Since the operator is Hive specific, it is moved to the hive package.
       - `Command` is simplified to be a trait that simply acts as a marker for a LogicalPlan that should be eagerly evaluated.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #5876 from marmbrus/useIsolatedClient and squashes the following commits:
      
      258d000 [Michael Armbrust] really really correct path handling
      e56fd4a [Michael Armbrust] getAbsolutePath
      5a259f5 [Michael Armbrust] fix typos
      81bb366 [Michael Armbrust] comments from vanzin
      5f3945e [Michael Armbrust] Merge remote-tracking branch 'origin/master' into useIsolatedClient
      4b5cd41 [Michael Armbrust] yin's comments
      f5de7de [Michael Armbrust] cleanup
      11e9c72 [Michael Armbrust] better coverage in versions suite
      7e8f010 [Michael Armbrust] better error messages and jar handling
      e7b3941 [Michael Armbrust] more permisive checking for function registration
      da91ba7 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into useIsolatedClient
      5fe5894 [Michael Armbrust] fix serialization suite
      81711c4 [Michael Armbrust] Initial support for running without maven
      1d8ae44 [Michael Armbrust] fix final tests?
      1c50813 [Michael Armbrust] more comments
      a3bee70 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into useIsolatedClient
      a6f5df1 [Michael Armbrust] style
      ab07f7e [Michael Armbrust] WIP
      4d8bf02 [Michael Armbrust] Remove hive 12 compilation
      8843a25 [Michael Armbrust] [SPARK-6908] [SQL] Use isolated Hive client
      cd1d4110
    • zsxwing's avatar
      [SPARK-7305] [STREAMING] [WEBUI] Make BatchPage show friendly information when... · 22ab70e0
      zsxwing authored
      [SPARK-7305] [STREAMING] [WEBUI] Make BatchPage show friendly information when jobs are dropped by SparkListener
      
      If jobs are dropped by SparkListener, at least we can show the job ids in BatchPage. Screenshot:
      
      ![b1](https://cloud.githubusercontent.com/assets/1000778/7434968/f19aa784-eff3-11e4-8f86-36a073873574.png)
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #5840 from zsxwing/SPARK-7305 and squashes the following commits:
      
      aca0ba6 [zsxwing] Fix the code style
      718765e [zsxwing] Make generateNormalJobRow private
      8073b03 [zsxwing] Merge branch 'master' into SPARK-7305
      83dec11 [zsxwing] Make BatchPage show friendly information when jobs are dropped by SparkListener
      22ab70e0
    • tedyu's avatar
      [SPARK-7450] Use UNSAFE.getLong() to speed up BitSetMethods#anySet() · 88063c62
      tedyu authored
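
      No description accompanies this commit. As a loose sketch of the idea in the title (checking a whole 64-bit word per step), the following uses a plain `long[]` in place of the raw memory that `UNSAFE.getLong()` reads. `BitSetSketch` and its `anySet` are hypothetical and are not Spark's `BitSetMethods`:

      ```java
      final class BitSetSketch {
          private BitSetSketch() {}

          // Returns true if any bit is set, comparing one 64-bit word at a time.
          static boolean anySet(long[] words) {
              for (long word : words) {
                  if (word != 0L) {
                      return true; // at least one of this word's 64 bits is set
                  }
              }
              return false;
          }

          public static void main(String[] args) {
              long[] bits = new long[4];        // 256 bits, all clear
              System.out.println(anySet(bits));
              bits[2] |= 1L << 17;              // set a single bit in the third word
              System.out.println(anySet(bits));
          }
      }
      ```
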
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #5897 from tedyu/master and squashes the following commits:
      
      473bf9d [tedyu] Address Josh's review comments
      1719c5b [tedyu] Correct upper bound in for loop
      b51dcaf [tedyu] Add unit test in BitSetSuite for BitSet#anySet()
      83f9f87 [tedyu] Merge branch 'master' of github.com:apache/spark
      817e3f9 [tedyu] Replace constant 8 with SIZE_OF_LONG
      75a467b [tedyu] Correct offset for UNSAFE.getLong()
      855374b [tedyu] Remove second loop since bitSetWidthInBytes is WORD aligned
      093b7a4 [tedyu] Use UNSAFE.getLong() to speed up BitSetMethods#anySet()
      63ee050 [tedyu] Use UNSAFE.getLong() to speed up BitSetMethods#anySet()
      4ca0ef6 [tedyu] Use UNSAFE.getLong() to speed up BitSetMethods#anySet()
      3e9b6919 [tedyu] Use UNSAFE.getLong() to speed up BitSetMethods#anySet()
      88063c62
    • Wenchen Fan's avatar
      [SPARK-2155] [SQL] [WHEN D THEN E] [ELSE F] add CaseKeyWhen for "CASE a WHEN b THEN c * END" · 35f0173b
      Wenchen Fan authored
      Avoid translating to `CaseWhen`, which would evaluate the key expression many times.
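
      A minimal sketch of the single-evaluation idea, assuming simple equality matching. `CaseKeySketch`, `caseKeyWhen`, and the flat `branches` layout are hypothetical and are not Spark's `CaseKeyWhen` expression:

      ```java
      import java.util.concurrent.atomic.AtomicInteger;
      import java.util.function.Supplier;

      final class CaseKeySketch {
          // Models CASE key WHEN b1 THEN v1 WHEN b2 THEN v2 ... ELSE otherwise END.
          // branches is laid out as [b1, v1, b2, v2, ...].
          static Object caseKeyWhen(Supplier<Object> key, Object[] branches, Object otherwise) {
              Object k = key.get(); // the key expression is evaluated exactly once
              if (k != null) {      // a NULL key matches no branch
                  for (int i = 0; i + 1 < branches.length; i += 2) {
                      if (k.equals(branches[i])) {
                          return branches[i + 1];
                      }
                  }
              }
              return otherwise;
          }

          public static void main(String[] args) {
              AtomicInteger evaluations = new AtomicInteger();
              Supplier<Object> key = () -> {
                  evaluations.incrementAndGet(); // count how often the key is computed
                  return 2;
              };
              Object result = caseKeyWhen(key, new Object[]{1, "one", 2, "two", 3, "three"}, "other");
              System.out.println(result + " after " + evaluations.get() + " key evaluation(s)");
          }
      }
      ```

      Rewriting the same statement as a chain of `key = b1`, `key = b2`, ... conditions would call the key supplier once per branch tested.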
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #5979 from cloud-fan/condition and squashes the following commits:
      
      3ce54e1 [Wenchen Fan] add CaseKeyWhen
      35f0173b
    • Iulian Dragos's avatar
      [SPARK-5281] [SQL] Registering table on RDD is giving MissingRequirementError · 937ba798
      Iulian Dragos authored
      Go through the context classloader when reflecting on user types in ScalaReflection.
      
      Replaced calls to `typeOf` with `typeTag[T].in(mirror)`. The convenience method assumes
      all types can be found in the classloader that loaded scala-reflect (the primordial
      classloader). This assumption is not valid in all contexts (sbt console, Eclipse launchers).
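
      The patch itself is Scala and relies on scala-reflect mirrors. As a rough Java analogue of the same principle (resolve user types through the thread's context classloader rather than the loader of the library doing the reflection), with `ContextLoaderSketch` and `loadUserClass` as hypothetical names:

      ```java
      final class ContextLoaderSketch {
          private ContextLoaderSketch() {}

          // Looks up a class by name via the current thread's context classloader,
          // falling back to this class's own loader if no context loader is set.
          static Class<?> loadUserClass(String name) throws ClassNotFoundException {
              ClassLoader loader = Thread.currentThread().getContextClassLoader();
              if (loader == null) {
                  loader = ContextLoaderSketch.class.getClassLoader();
              }
              // initialize=false: resolve the class without running static initializers.
              return Class.forName(name, false, loader);
          }

          public static void main(String[] args) throws ClassNotFoundException {
              System.out.println(loadUserClass("java.util.ArrayList").getName());
          }
      }
      ```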
      
      Fixed SPARK-5281
      
      Author: Iulian Dragos <jaguarul@gmail.com>
      
      Closes #5981 from dragos/issue/mirrors-missing-requirement-error and squashes the following commits:
      
      d103e70 [Iulian Dragos] Go through the context classloader when reflecting on user types in ScalaReflection
      937ba798