Skip to content
Snippets Groups Projects
  1. Sep 12, 2016
    • Davies Liu's avatar
      [SPARK-17474] [SQL] fix python udf in TakeOrderedAndProjectExec · a91ab705
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      When there is any Python UDF in the Project between Sort and Limit, it will be collected into TakeOrderedAndProjectExec, ExtractPythonUDFs failed to pull the Python UDFs out because QueryPlan.expressions does not include the expression inside Option[Seq[Expression]].
      
      Ideally, we should fix the `QueryPlan.expressions`, but tried with no luck (it always run into infinite loop). In PR, I changed the TakeOrderedAndProjectExec to no use Option[Seq[Expression]] to workaround it. cc JoshRosen
      
      ## How was this patch tested?
      
      Added regression test.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #15030 from davies/all_expr.
      a91ab705
    • Josh Rosen's avatar
      [SPARK-17485] Prevent failed remote reads of cached blocks from failing entire job · f9c580f1
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      In Spark's `RDD.getOrCompute` we first try to read a local copy of a cached RDD block, then a remote copy, and only fall back to recomputing the block if no cached copy (local or remote) can be read. This logic works correctly in the case where no remote copies of the block exist, but if there _are_ remote copies and reads of those copies fail (due to network issues or internal Spark bugs) then the BlockManager will throw a `BlockFetchException` that will fail the task (and which could possibly fail the whole job if the read failures keep occurring).
      
      In the cases of TorrentBroadcast and task result fetching we really do want to fail the entire job in case no remote blocks can be fetched, but this logic is inappropriate for reads of cached RDD blocks because those can/should be recomputed in case cached blocks are unavailable.
      
      Therefore, I think that the `BlockManager.getRemoteBytes()` method should never throw on remote fetch errors and, instead, should handle failures by returning `None`.
      
      ## How was this patch tested?
      
      Block manager changes should be covered by modified tests in `BlockManagerSuite`: the old tests expected exceptions to be thrown on failed remote reads, while the modified tests now expect `None` to be returned from the `getRemote*` method.
      
      I also manually inspected all usages of `BlockManager.getRemoteValues()`, `getRemoteBytes()`, and `get()` to verify that they correctly pattern-match on the result and handle `None`. Note that these `None` branches are already exercised because the old `getRemoteBytes` returned `None` when no remote locations for the block could be found (which could occur if an executor died and its block manager de-registered with the master).
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #15037 from JoshRosen/SPARK-17485.
      f9c580f1
    • Josh Rosen's avatar
      [SPARK-14818] Post-2.0 MiMa exclusion and build changes · 7c51b99a
      Josh Rosen authored
      This patch makes a handful of post-Spark-2.0 MiMa exclusion and build updates. It should be merged to master and a subset of it should be picked into branch-2.0 in order to test Spark 2.0.1-SNAPSHOT.
      
      - Remove the ` sketch`, `mllibLocal`, and `streamingKafka010` from the list of excluded subprojects so that MiMa checks them.
      - Remove now-unnecessary special-case handling of the Kafka 0.8 artifact in `mimaSettings`.
      - Move the exclusion added in SPARK-14743 from `v20excludes` to `v21excludes`, since that patch was only merged into master and not branch-2.0.
      - Add exclusions for an API change introduced by SPARK-17096 / #14675.
      - Add missing exclusions for the `o.a.spark.internal` and `o.a.spark.sql.internal` packages.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #15061 from JoshRosen/post-2.0-mima-changes.
      7c51b99a
    • Josh Rosen's avatar
      [SPARK-17483] Refactoring in BlockManager status reporting and block removal · 3d40896f
      Josh Rosen authored
      This patch makes three minor refactorings to the BlockManager:
      
      - Move the `if (info.tellMaster)` check out of `reportBlockStatus`; this fixes an issue where a debug logging message would incorrectly claim to have reported a block status to the master even though no message had been sent (in case `info.tellMaster == false`). This also makes it easier to write code which unconditionally sends block statuses to the master (which is necessary in another patch of mine).
      - Split  `removeBlock()` into two methods, the existing method and an internal `removeBlockInternal()` method which is designed to be called by internal code that already holds a write lock on the block. This is also needed by a followup patch.
      - Instead of calling `getCurrentBlockStatus()` in `removeBlock()`, just pass `BlockStatus.empty`; the block status should always be empty following complete removal of a block.
      
      These changes were originally authored as part of a bug fix patch which is targeted at branch-2.0 and master; I've split them out here into their own separate PR in order to make them easier to review and so that the behavior-changing parts of my other patch can be isolated to their own PR.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #15036 from JoshRosen/cache-failure-race-conditions-refactorings-only.
      3d40896f
    • Sean Zhong's avatar
      [SPARK-17503][CORE] Fix memory leak in Memory store when unable to cache the whole RDD in memory · 1742c3ab
      Sean Zhong authored
      ## What changes were proposed in this pull request?
      
         MemoryStore may throw OutOfMemoryError when trying to cache a super big RDD that cannot fit in memory.
         ```
         scala> sc.parallelize(1 to 1000000000, 100).map(x => new Array[Long](1000)).cache().count()
      
         java.lang.OutOfMemoryError: Java heap space
      	at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:24)
      	at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:23)
      	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
      	at scala.collection.Iterator$JoinIterator.next(Iterator.scala:232)
      	at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:683)
      	at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
      	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684)
      	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
      	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
      	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915)
      	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915)
      	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
      	at org.apache.spark.scheduler.Task.run(Task.scala:86)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
         ```
      
      Spark MemoryStore uses SizeTrackingVector as a temporary unrolling buffer to store all input values that it has read so far before transferring the values to storage memory cache. The problem is that when the input RDD is too big for caching in memory, the temporary unrolling memory SizeTrackingVector is not garbage collected in time. As SizeTrackingVector can occupy all available storage memory, it may cause the executor JVM to run out of memory quickly.
      
      More info can be found at https://issues.apache.org/jira/browse/SPARK-17503
      
      ## How was this patch tested?
      
      Unit test and manual test.
      
      ### Before change
      
      Heap memory consumption
      <img width="702" alt="screen shot 2016-09-12 at 4 16 15 pm" src="https://cloud.githubusercontent.com/assets/2595532/18429524/60d73a26-7906-11e6-9768-6f286f5c58c8.png">
      
      Heap dump
      <img width="1402" alt="screen shot 2016-09-12 at 4 34 19 pm" src="https://cloud.githubusercontent.com/assets/2595532/18429577/cbc1ef20-7906-11e6-847b-b5903f450b3b.png">
      
      ### After change
      
      Heap memory consumption
      <img width="706" alt="screen shot 2016-09-12 at 4 29 10 pm" src="https://cloud.githubusercontent.com/assets/2595532/18429503/4abe9342-7906-11e6-844a-b2f815072624.png">
      
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #15056 from clockfly/memory_store_leak.
      1742c3ab
    • WeichenXu's avatar
      [SPARK CORE][MINOR] fix "default partitioner cannot partition array keys"... · 8087ecf8
      WeichenXu authored
      [SPARK CORE][MINOR] fix "default partitioner cannot partition array keys" error message in PairRDDfunctions
      
      ## What changes were proposed in this pull request?
      
      In order to avoid confusing user,
      error message in `PairRDDfunctions`
      `Default partitioner cannot partition array keys.`
      is updated,
      the one in `partitionBy` is replaced with
      `Specified partitioner cannot partition array keys.`
      other is replaced with
      `Specified or default partitioner cannot partition array keys.`
      
      ## How was this patch tested?
      
      N/A
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #15045 from WeichenXu123/fix_partitionBy_error_message.
      8087ecf8
    • Gaetan Semet's avatar
      [SPARK-16992][PYSPARK] use map comprehension in doc · b3c22912
      Gaetan Semet authored
      Code is equivalent, but map comprehency is most of the time faster than a map.
      
      Author: Gaetan Semet <gaetan@xeberon.net>
      
      Closes #14863 from Stibbons/map_comprehension.
      b3c22912
    • codlife's avatar
      [SPARK-17447] Performance improvement in Partitioner.defaultPartitioner without sortBy · 4efcdb7f
      codlife authored
      ## What changes were proposed in this pull request?
      
      if there are many rdds in some situations,the sort will loss he performance servely,actually we needn't sort the rdds , we can just scan the rdds one time to gain the same goal.
      
      ## How was this patch tested?
      
      manual tests
      
      Author: codlife <1004910847@qq.com>
      
      Closes #15039 from codlife/master.
      4efcdb7f
    • cenyuhai's avatar
      [SPARK-17171][WEB UI] DAG will list all partitions in the graph · cc87280f
      cenyuhai authored
      ## What changes were proposed in this pull request?
      DAG will list all partitions in the graph, it is too slow and hard to see all graph.
      Always we don't want to see all partitions,we just want to see the relations of DAG graph.
      So I just show 2 root nodes for Rdds.
      
      Before this PR, the DAG graph looks like [dag1.png](https://issues.apache.org/jira/secure/attachment/12824702/dag1.png), [dag3.png](https://issues.apache.org/jira/secure/attachment/12825456/dag3.png), after this PR, the DAG graph looks like [dag2.png](https://issues.apache.org/jira/secure/attachment/12824703/dag2.png),[dag4.png](https://issues.apache.org/jira/secure/attachment/12825457/dag4.png)
      
      Author: cenyuhai <cenyuhai@didichuxing.com>
      Author: 岑玉海 <261810726@qq.com>
      
      Closes #14737 from cenyuhai/SPARK-17171.
      cc87280f
  2. Sep 11, 2016
    • Josh Rosen's avatar
      [SPARK-17486] Remove unused TaskMetricsUIData.updatedBlockStatuses field · 72eec70b
      Josh Rosen authored
      The `TaskMetricsUIData.updatedBlockStatuses` field is assigned to but never read, increasing the memory consumption of the web UI. We should remove this field.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #15038 from JoshRosen/remove-updated-block-statuses-from-TaskMetricsUIData.
      72eec70b
    • Sameer Agarwal's avatar
      [SPARK-17415][SQL] Better error message for driver-side broadcast join OOMs · 767d4807
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This is a trivial patch that catches all `OutOfMemoryError` while building the broadcast hash relation and rethrows it by wrapping it in a nice error message.
      
      ## How was this patch tested?
      
      Existing Tests
      
      Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
      
      Closes #14979 from sameeragarwal/broadcast-join-error.
      767d4807
    • Yanbo Liang's avatar
      [SPARK-17389][FOLLOW-UP][ML] Change KMeans k-means|| default init steps from 5 to 2. · 883c7631
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      #14956 reduced default k-means|| init steps to 2 from 5 only for spark.mllib package, we should also do same change for spark.ml and PySpark.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15050 from yanboliang/spark-17389.
      883c7631
    • Bryan Cutler's avatar
      [SPARK-17336][PYSPARK] Fix appending multiple times to PYTHONPATH from spark-config.sh · c76baff0
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      During startup of Spark standalone, the script file spark-config.sh appends to the PYTHONPATH and can be sourced many times, causing duplicates in the path.  This change adds a env flag that is set when the PYTHONPATH is appended so it will happen only one time.
      
      ## How was this patch tested?
      Manually started standalone master/worker and verified PYTHONPATH has no duplicate entries.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #15028 from BryanCutler/fix-duplicate-pythonpath-SPARK-17336.
      c76baff0
    • tone-zhang's avatar
      [SPARK-17330][SPARK UT] Clean up spark-warehouse in UT · bf222173
      tone-zhang authored
      ## What changes were proposed in this pull request?
      
      Check the database warehouse used in Spark UT, and remove the existing database file before run the UT (SPARK-8368).
      
      ## How was this patch tested?
      
      Run Spark UT with the command for several times:
      ./build/sbt -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver "test-only *HiveSparkSubmitSuit*"
      Without the patch, the test case can be passed only at the first time, and always failed from the second time.
      With the patch the test case always can be passed correctly.
      
      Author: tone-zhang <tone.zhang@linaro.org>
      
      Closes #14894 from tone-zhang/issue1.
      bf222173
    • Timothy Hunter's avatar
      [SPARK-17439][SQL] Fixing compression issues with approximate quantiles and adding more tests · 180796ec
      Timothy Hunter authored
      ## What changes were proposed in this pull request?
      
      This PR build on #14976 and fixes a correctness bug that would cause the wrong quantile to be returned for small target errors.
      
      ## How was this patch tested?
      
      This PR adds 8 unit tests that were failing without the fix.
      
      Author: Timothy Hunter <timhunter@databricks.com>
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15002 from thunterdb/ml-1783.
      180796ec
    • Sean Owen's avatar
      [SPARK-17389][ML][MLLIB] KMeans speedup with better choice of k-means|| init steps = 2 · 29ba9578
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Reduce default k-means|| init steps to 2 from 5. See JIRA for discussion.
      See also https://github.com/apache/spark/pull/14948
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14956 from srowen/SPARK-17389.2.
      29ba9578
  3. Sep 10, 2016
    • Xin Ren's avatar
      [SPARK-16445][MLLIB][SPARKR] Fix @return description for sparkR mlp summary() method · 71b7d42f
      Xin Ren authored
      ## What changes were proposed in this pull request?
      
      Fix summary() method's `return` description for spark.mlp
      
      ## How was this patch tested?
      
      Ran tests locally on my laptop.
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #15015 from keypointt/SPARK-16445-2.
      71b7d42f
    • Ryan Blue's avatar
      [SPARK-17396][CORE] Share the task support between UnionRDD instances. · 6ea5055f
      Ryan Blue authored
      ## What changes were proposed in this pull request?
      
      Share the ForkJoinTaskSupport between UnionRDD instances to avoid creating a huge number of threads if lots of RDDs are created at the same time.
      
      ## How was this patch tested?
      
      This uses existing UnionRDD tests.
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #14985 from rdblue/SPARK-17396-use-shared-pool.
      6ea5055f
    • Yanbo Liang's avatar
      [SPARK-15509][FOLLOW-UP][ML][SPARKR] R MLlib algorithms should support input... · bcdd259c
      Yanbo Liang authored
      [SPARK-15509][FOLLOW-UP][ML][SPARKR] R MLlib algorithms should support input columns "features" and "label"
      
      ## What changes were proposed in this pull request?
      #13584 resolved the issue of features and label columns conflict with ```RFormula``` default ones when loading libsvm data, but it still left some issues should be resolved:
      1, It’s not necessary to check and rename label column.
      Since we have considerations on the design of ```RFormula```, it can handle the case of label column already exists(with restriction of the existing label column should be numeric/boolean type). So it’s not necessary to change the column name to avoid conflict. If the label column is not numeric/boolean type, ```RFormula``` will throw exception.
      
      2, We should rename features column name to new one if there is conflict, but appending a random value is enough since it was used internally only. We done similar work when implementing ```SQLTransformer```.
      
      3, We should set correct new features column for the estimators. Take ```GLM``` as example:
      ```GLM``` estimator should set features column with the changed one(rFormula.getFeaturesCol) rather than the default “features”. Although it’s same when training model, but it involves problems when predicting. The following is the prediction result of GLM before this PR:
      ![image](https://cloud.githubusercontent.com/assets/1962026/18308227/84c3c452-74a8-11e6-9caa-9d6d846cc957.png)
      We should drop the internal used feature column name, otherwise, it will appear on the prediction DataFrame which will confused users. And this behavior is same as other scenarios which does not exist column name conflict.
      After this PR:
      ![image](https://cloud.githubusercontent.com/assets/1962026/18308240/92082a04-74a8-11e6-9226-801f52b856d9.png)
      
      ## How was this patch tested?
      Existing unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #14993 from yanboliang/spark-15509.
      bcdd259c
    • Yves Raimond's avatar
      [SPARK-11496][GRAPHX] Parallel implementation of personalized pagerank · 1fec3ce4
      Yves Raimond authored
      (Updated version of [PR-9457](https://github.com/apache/spark/pull/9457), rebased on latest Spark master, and using mllib-local).
      
      This implements a parallel version of personalized pagerank, which runs all propagations for a list of source vertices in parallel.
      
      I ran a few benchmarks on the full [DBpedia](http://dbpedia.org/) graph. When running personalized pagerank for only one source node, the existing implementation is twice as fast as the parallel one (because of the SparseVector overhead). However for 10 source nodes, the parallel implementation is four times as fast. When increasing the number of source nodes, this difference becomes even greater.
      
      ![image](https://cloud.githubusercontent.com/assets/2491/10927702/dd82e4fa-8256-11e5-89a8-4799b407f502.png)
      
      Author: Yves Raimond <yraimond@netflix.com>
      
      Closes #14998 from moustaki/parallel-ppr.
      1fec3ce4
  4. Sep 09, 2016
    • Tejas Patil's avatar
      [SPARK-15453][SQL] FileSourceScanExec to extract `outputOrdering` information · 33549170
      Tejas Patil authored
      ## What changes were proposed in this pull request?
      
      Jira : https://issues.apache.org/jira/browse/SPARK-15453
      
      Extracting sort ordering information in `FileSourceScanExec` so that planner can make use of it. My motivation to make this change was to get Sort Merge join in par with Hive's Sort-Merge-Bucket join when the source tables are bucketed + sorted.
      
      Query:
      
      ```
      val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", "k").coalesce(1)
      df.write.bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("table8")
      df.write.bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("table9")
      context.sql("SELECT * FROM table8 a JOIN table9 b ON a.j=b.j AND a.k=b.k").explain(true)
      ```
      
      Before:
      
      ```
      == Physical Plan ==
      *SortMergeJoin [j#120, k#121], [j#123, k#124], Inner
      :- *Sort [j#120 ASC, k#121 ASC], false, 0
      :  +- *Project [i#119, j#120, k#121]
      :     +- *Filter (isnotnull(k#121) && isnotnull(j#120))
      :        +- *FileScan orc default.table8[i#119,j#120,k#121] Batched: false, Format: ORC, InputPaths: file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table8, PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: struct<i:int,j:int,k:string>
      +- *Sort [j#123 ASC, k#124 ASC], false, 0
      +- *Project [i#122, j#123, k#124]
      +- *Filter (isnotnull(k#124) && isnotnull(j#123))
       +- *FileScan orc default.table9[i#122,j#123,k#124] Batched: false, Format: ORC, InputPaths: file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table9, PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: struct<i:int,j:int,k:string>
      ```
      
      After:  (note that the `Sort` step is no longer there)
      
      ```
      == Physical Plan ==
      *SortMergeJoin [j#49, k#50], [j#52, k#53], Inner
      :- *Project [i#48, j#49, k#50]
      :  +- *Filter (isnotnull(k#50) && isnotnull(j#49))
      :     +- *FileScan orc default.table8[i#48,j#49,k#50] Batched: false, Format: ORC, InputPaths: file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table8, PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: struct<i:int,j:int,k:string>
      +- *Project [i#51, j#52, k#53]
         +- *Filter (isnotnull(j#52) && isnotnull(k#53))
            +- *FileScan orc default.table9[i#51,j#52,k#53] Batched: false, Format: ORC, InputPaths: file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table9, PartitionFilters: [], PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: struct<i:int,j:int,k:string>
      ```
      
      ## How was this patch tested?
      
      Added a test case in `JoinSuite`. Ran all other tests in `JoinSuite`
      
      Author: Tejas Patil <tejasp@fb.com>
      
      Closes #14864 from tejasapatil/SPARK-15453_smb_optimization.
      33549170
    • hyukjinkwon's avatar
      [SPARK-17354] [SQL] Partitioning by dates/timestamps should work with Parquet vectorized reader · f7d21437
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR fixes `ColumnVectorUtils.populate` so that Parquet vectorized reader can read partitioned table with dates/timestamps. This works fine with Parquet normal reader.
      
      This is being only called within [VectorizedParquetRecordReader.java#L185](https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L185).
      
      When partition column types are explicitly given to `DateType` or `TimestampType` (rather than inferring the type of partition column), this fails with the exception below:
      
      ```
      16/09/01 10:30:07 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 6)
      java.lang.ClassCastException: java.lang.Integer cannot be cast to java.sql.Date
      	at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:89)
      	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:185)
      	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:204)
      	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:362)
      ...
      ```
      
      ## How was this patch tested?
      
      Unit tests in `SQLQuerySuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14919 from HyukjinKwon/SPARK-17354.
      f7d21437
    • Thomas Graves's avatar
      [SPARK-17433] YarnShuffleService doesn't handle moving credentials levelDb · a3981c28
      Thomas Graves authored
      The secrets leveldb isn't being moved if you run spark shuffle services without yarn nm recovery on and then turn it on.  This fixes that.  I unfortunately missed this when I ported the patch from our internal branch 2 to master branch due to the changes for the recovery path.  Note this only applies to master since it is the only place the yarn nm recovery dir is used.
      
      Unit tests ran and tested on 8 node cluster.  Fresh startup with NM recovery, fresh startup no nm recovery, switching between no nm recovery and recovery.  Also tested running applications to make sure wasn't affected by rolling upgrade.
      
      Author: Thomas Graves <tgraves@prevailsail.corp.gq1.yahoo.com>
      Author: Tom Graves <tgraves@apache.org>
      
      Closes #14999 from tgravescs/SPARK-17433.
      a3981c28
    • Satendra Kumar's avatar
      Streaming doc correction. · 7098a129
      Satendra Kumar authored
      ## What changes were proposed in this pull request?
      
      (Please fill in changes proposed in this fix)
      Streaming doc correction.
      
      ## How was this patch tested?
      
      (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
      
      (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
      
      Author: Satendra Kumar <satendra@knoldus.com>
      
      Closes #14996 from satendrakumar06/patch-1.
      7098a129
    • Yanbo Liang's avatar
      [SPARK-17464][SPARKR][ML] SparkR spark.als argument reg should be 0.1 by default. · 2ed60121
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      SparkR ```spark.als``` arguments ```reg``` should be 0.1 by default, which need to be consistent with ML.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15021 from yanboliang/spark-17464.
      2ed60121
    • Joseph K. Bradley's avatar
      [SPARK-17456][CORE] Utility for parsing Spark versions · 65b814bf
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      This patch adds methods for extracting major and minor versions as Int types in Scala from a Spark version string.
      
      Motivation: There are many hacks within Spark's codebase to identify and compare Spark versions. We should add a simple utility to standardize these code paths, especially since there have been mistakes made in the past. This will let us add unit tests as well.  Currently, I want this functionality to check Spark versions to provide backwards compatibility for ML model persistence.
      
      ## How was this patch tested?
      
      Unit tests
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #15017 from jkbradley/version-parsing.
      65b814bf
  5. Sep 08, 2016
    • Gurvinder Singh's avatar
      [SPARK-15487][WEB UI] Spark Master UI to reverse proxy Application and Workers UI · 92ce8d48
      Gurvinder Singh authored
      ## What changes were proposed in this pull request?
      
      This pull request adds the functionality to enable accessing worker and application UI through master UI itself. Thus helps in accessing SparkUI when running spark cluster in closed networks e.g. Kubernetes. Cluster admin needs to expose only spark master UI and rest of the UIs can be in the private network, master UI will reverse proxy the connection request to corresponding resource. It adds the path for workers/application UIs as
      
      WorkerUI: <http/https>://master-publicIP:<port>/target/workerID/
      ApplicationUI: <http/https>://master-publicIP:<port>/target/appID/
      
      This makes it easy for users to easily protect the Spark master cluster access by putting some reverse proxy e.g. https://github.com/bitly/oauth2_proxy
      
      ## How was this patch tested?
      
      The functionality has been tested manually and there is a unit test too for testing access to worker UI with reverse proxy address.
      
      pwendell bomeng BryanCutler can you please review it, thanks.
      
      Author: Gurvinder Singh <gurvinder.singh@uninett.no>
      
      Closes #13950 from gurvindersingh/rproxy.
      92ce8d48
    • Eric Liang's avatar
      [SPARK-17405] RowBasedKeyValueBatch should use default page size to prevent OOMs · 722afbb2
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      Before this change, we would always allocate 64MB per aggregation task for the first-level hash map storage, even when running in low-memory situations such as local mode. This changes it to use the memory manager default page size, which is automatically reduced from 64MB in these situations.
      
      cc ooq JoshRosen
      
      ## How was this patch tested?
      
      Tested manually with `bin/spark-shell --master=local[32]` and verifying that `(1 to math.pow(10, 3).toInt).toDF("n").withColumn("m", 'n % 2).groupBy('m).agg(sum('n)).show` does not crash.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #15016 from ericl/sc-4483.
      722afbb2
    • hyukjinkwon's avatar
      [SPARK-17200][PROJECT INFRA][BUILD][SPARKR] Automate building and testing on... · 78d5d4dd
      hyukjinkwon authored
      [SPARK-17200][PROJECT INFRA][BUILD][SPARKR] Automate building and testing on Windows (currently SparkR only)
      
      ## What changes were proposed in this pull request?
      
      This PR adds the build automation on Windows with [AppVeyor](https://www.appveyor.com/) CI tool.
      
      Currently, this only runs the tests for SparkR as we have been having some issues with testing Windows-specific PRs (e.g. https://github.com/apache/spark/pull/14743 and https://github.com/apache/spark/pull/13165) and hard time to verify this.
      
      One concern is, this build is dependent on [steveloughran/winutils](https://github.com/steveloughran/winutils) for pre-built Hadoop bin package (who is a Hadoop PMC member).
      
      ## How was this patch tested?
      
      Manually, https://ci.appveyor.com/project/HyukjinKwon/spark/build/88-SPARK-17200-build-profile
      This takes roughly 40 mins.
      
      Some tests are already being failed and this was found in https://github.com/apache/spark/pull/14743#issuecomment-241405287.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14859 from HyukjinKwon/SPARK-17200-build.
      78d5d4dd
    • Felix Cheung's avatar
      [SPARK-17442][SPARKR] Additional arguments in write.df are not passed to data source · f0d21b7f
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      additional options were not passed down in write.df.
      
      ## How was this patch tested?
      
      unit tests
      falaki shivaram
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #15010 from felixcheung/testreadoptions.
      f0d21b7f
    • Wenchen Fan's avatar
      [SPARK-17432][SQL] PreprocessDDL should respect case sensitivity when checking duplicated columns · 3ced39df
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      In `PreprocessDDL` we will check if table columns are duplicated. However, this checking ignores case sensitivity config(it's always case-sensitive) and lead to different result between `HiveExternalCatalog` and `InMemoryCatalog`. `HiveExternalCatalog` will throw exception because hive metastore is always case-nonsensitive, and `InMemoryCatalog` is fine.
      
      This PR fixes it.
      
      ## How was this patch tested?
      
      a new test in DDLSuite
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #14994 from cloud-fan/check-dup.
      3ced39df
  6. Sep 07, 2016
    • gatorsmile's avatar
      [SPARK-17052][SQL] Remove Duplicate Test Cases auto_join from HiveCompatibilitySuite.scala · b230fb92
      gatorsmile authored
      ### What changes were proposed in this pull request?
      The original [JIRA Hive-1642](https://issues.apache.org/jira/browse/HIVE-1642) delivered the test cases `auto_joinXYZ` for verifying the results when the joins are automatically converted to map-join. Basically, most of them are just copied from the corresponding `joinXYZ`.
      
      After comparison between `auto_joinXYZ` and `joinXYZ`, below is a list of duplicate cases:
      ```
          "auto_join0",
          "auto_join1",
          "auto_join10",
          "auto_join11",
          "auto_join12",
          "auto_join13",
          "auto_join14",
          "auto_join14_hadoop20",
          "auto_join15",
          "auto_join17",
          "auto_join18",
          "auto_join2",
          "auto_join20",
          "auto_join21",
          "auto_join23",
          "auto_join24",
          "auto_join3",
          "auto_join4",
          "auto_join5",
          "auto_join6",
          "auto_join7",
          "auto_join8",
          "auto_join9"
      ```
      
      We can remove all of them without affecting the test coverage.
      
      ### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14635 from gatorsmile/removeAuto.
      b230fb92
    • Eric Liang's avatar
      [SPARK-17370] Shuffle service files not invalidated when a slave is lost · 649fa4bf
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      DAGScheduler invalidates shuffle files when an executor loss event occurs, but not when the external shuffle service is enabled. This is because when shuffle service is on, the shuffle file lifetime can exceed the executor lifetime.
      
      However, it also doesn't invalidate shuffle files when the shuffle service itself is lost (due to whole slave loss). This can cause long hangs when slaves are lost since the file loss is not detected until a subsequent stage attempts to read the shuffle files.
      
      The proposed fix is to also invalidate shuffle files when an executor is lost due to a `SlaveLost` event.
      
      ## How was this patch tested?
      
      Unit tests, also verified on an actual cluster that slave loss invalidates shuffle files immediately as expected.
      
      cc mateiz
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #14931 from ericl/sc-4439.
      649fa4bf
    • Srinivasa Reddy Vundela's avatar
      [MINOR][SQL] Fixing the typo in unit test · 76ad89e9
      Srinivasa Reddy Vundela authored
      ## What changes were proposed in this pull request?
      
      Fixing the typo in the unit test of CodeGenerationSuite.scala
      
      ## How was this patch tested?
      Ran the unit test after fixing the typo and it passes
      
      Author: Srinivasa Reddy Vundela <vsr@cloudera.com>
      
      Closes #14989 from vundela/typo_fix.
      76ad89e9
    • Daoyuan Wang's avatar
      [SPARK-17427][SQL] function SIZE should return -1 when parameter is null · 6f4aeccf
      Daoyuan Wang authored
      ## What changes were proposed in this pull request?
      
      `select size(null)` returns -1 in Hive. In order to be compatible, we should return `-1`.
      
      ## How was this patch tested?
      
      unit test in `CollectionFunctionsSuite` and `DataFrameFunctionsSuite`.
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #14991 from adrian-wang/size.
      6f4aeccf
    • hyukjinkwon's avatar
      [SPARK-17339][SPARKR][CORE] Fix some R tests and use Path.toUri in... · 6b41195b
      hyukjinkwon authored
      [SPARK-17339][SPARKR][CORE] Fix some R tests and use Path.toUri in SparkContext for Windows paths in SparkR
      
      ## What changes were proposed in this pull request?
      
      This PR fixes the Windows path issues in several APIs. Please refer https://issues.apache.org/jira/browse/SPARK-17339 for more details.
      
      ## How was this patch tested?
      
      Tests via AppVeyor CI - https://ci.appveyor.com/project/HyukjinKwon/spark/build/82-SPARK-17339-fix-r
      
      Also, manually,
      
      ![2016-09-06 3 14 38](https://cloud.githubusercontent.com/assets/6477701/18263406/b93a98be-7444-11e6-9521-b28ee65a4771.png)
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14960 from HyukjinKwon/SPARK-17339.
      6b41195b
    • Liwei Lin's avatar
      [SPARK-17359][SQL][MLLIB] Use ArrayBuffer.+=(A) instead of... · 3ce3a282
      Liwei Lin authored
      [SPARK-17359][SQL][MLLIB] Use ArrayBuffer.+=(A) instead of ArrayBuffer.append(A) in performance critical paths
      
      ## What changes were proposed in this pull request?
      
      We should generally use `ArrayBuffer.+=(A)` rather than `ArrayBuffer.append(A)`, because `append(A)` would involve extra boxing / unboxing.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #14914 from lw-lin/append_to_plus_eq_v2.
      3ce3a282
    • Clark Fitzgerald's avatar
      [SPARK-16785] R dapply doesn't return array or raw columns · 9fccde4f
      Clark Fitzgerald authored
      ## What changes were proposed in this pull request?
      
      Fixed bug in `dapplyCollect` by changing the `compute` function of `worker.R` to explicitly handle raw (binary) vectors.
      
      cc shivaram
      
      ## How was this patch tested?
      
      Unit tests
      
      Author: Clark Fitzgerald <clarkfitzg@gmail.com>
      
      Closes #14783 from clarkfitzg/SPARK-16785.
      9fccde4f
  7. Sep 06, 2016
    • Tathagata Das's avatar
      [SPARK-17372][SQL][STREAMING] Avoid serialization issues by using Arrays to... · eb1ab88a
      Tathagata Das authored
      [SPARK-17372][SQL][STREAMING] Avoid serialization issues by using Arrays to save file names in FileStreamSource
      
      ## What changes were proposed in this pull request?
      
      When we create a filestream on a directory that has partitioned subdirs (i.e. dir/x=y/), then ListingFileCatalog.allFiles returns the files in the dir as Seq[String] which internally is a Stream[String]. This is because of this [line](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala#L93), where a LinkedHashSet.values.toSeq returns Stream. Then when the [FileStreamSource](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L79) filters this Stream[String] to remove the seen files, it creates a new Stream[String], which has a filter function that has a $outer reference to the FileStreamSource (in Scala 2.10). Trying to serialize this Stream[String] causes NotSerializableException. This will happened even if there is just one file in the dir.
      
      Its important to note that this behavior is different in Scala 2.11. There is no $outer reference to FileStreamSource, so it does not throw NotSerializableException. However, with a large sequence of files (tested with 10000 files), it throws StackOverflowError. This is because how Stream class is implemented. Its basically like a linked list, and attempting to serialize a long Stream requires *recursively* going through linked list, thus resulting in StackOverflowError.
      
      In short, across both Scala 2.10 and 2.11, serialization fails when both the following conditions are true.
      - file stream defined on a partitioned directory
      - directory has 10k+ files
      
      The right solution is to convert the seq to an array before writing to the log. This PR implements this fix in two ways.
      - Changing all uses for HDFSMetadataLog to ensure Array is used instead of Seq
      - Added a `require` in HDFSMetadataLog such that it is never used with type Seq
      
      ## How was this patch tested?
      
      Added unit test that test that ensures the file stream source can handle with 10000 files. This tests fails in both Scala 2.10 and 2.11 with different failures as indicated above.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #14987 from tdas/SPARK-17372.
      eb1ab88a
    • Wenchen Fan's avatar
      [SPARK-17238][SQL] simplify the logic for converting data source table into hive compatible format · d6eede9a
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Previously we have 2 conditions to decide whether a data source table is hive-compatible:
      
      1. the data source is file-based and has a corresponding Hive serde
      2. have a `path` entry in data source options/storage properties
      
      However, if condition 1 is true, condition 2 must be true too, as we will put the default table path into data source options/storage properties for managed data source tables.
      
      There is also a potential issue: we will set the `locationUri` even for managed table.
      
      This PR removes the condition 2 and only set the `locationUri` for external data source tables.
      
      Note: this is also a first step to unify the `path` of data source tables and `locationUri` of hive serde tables. For hive serde tables, `locationUri` is only set for external table. For data source tables, `path` is always set. We can make them consistent after this PR.
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #14809 from cloud-fan/minor2.
      d6eede9a
Loading