  1. Jun 28, 2016
    • Yin Huai's avatar
      [SPARK-15863][SQL][DOC][FOLLOW-UP] Update SQL programming guide. · dd6b7dbe
      Yin Huai authored
      ## What changes were proposed in this pull request?
      This PR makes several updates to SQL programming guide.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #13938 from yhuai/doc.
      dd6b7dbe
    • Dongjoon Hyun's avatar
      [SPARK-16221][SQL] Redirect Parquet JUL logger via SLF4J for WRITE operations · a0da854f
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
[SPARK-8118](https://github.com/apache/spark/pull/8196) implements redirecting the Parquet JUL logger via SLF4J, but the redirection is currently applied only when READ operations occur. If users perform only WRITE operations, many Parquet logs still appear.
      
      This PR makes the redirection work on WRITE operations, too.
      
      **Before**
      ```scala
      scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
      SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
      SLF4J: Defaulting to no-operation (NOP) logger implementation
      SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
      Jun 26, 2016 9:04:38 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
      ............ about 70 lines Parquet Log .............
      scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
      ............ about 70 lines Parquet Log .............
      ```
      
      **After**
      ```scala
      scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
      SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
      SLF4J: Defaulting to no-operation (NOP) logger implementation
      SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
      scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
      ```
      
      This PR also fixes some typos.
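For context, a minimal sketch of quieting a JUL logger programmatically (the logger name is illustrative; Spark's actual fix wires Parquet's JUL output into SLF4J rather than merely silencing it):

```scala
// Sketch: silencing a java.util.logging (JUL) logger so its records no
// longer propagate to the default console handler. "org.apache.parquet"
// is an assumed logger name for illustration.
import java.util.logging.Logger

val parquetLogger = Logger.getLogger("org.apache.parquet")
parquetLogger.getHandlers.foreach(parquetLogger.removeHandler)
parquetLogger.setUseParentHandlers(false) // stop propagation to the root handler
```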
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13918 from dongjoon-hyun/SPARK-16221.
      a0da854f
  2. Jun 27, 2016
  3. Jun 26, 2016
    • Felix Cheung's avatar
      [SPARK-16184][SPARKR] conf API for SparkSession · 30b182bc
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Add `conf` method to get Runtime Config from SparkSession
      
      ## How was this patch tested?
      
      unit tests, manual tests
      
This is how it works in the SparkR shell:
      ```
       SparkSession available as 'spark'.
      > conf()
      $hive.metastore.warehouse.dir
      [1] "file:/opt/spark-2.0.0-bin-hadoop2.6/R/spark-warehouse"
      
      $spark.app.id
      [1] "local-1466749575523"
      
      $spark.app.name
      [1] "SparkR"
      
      $spark.driver.host
      [1] "10.0.2.1"
      
      $spark.driver.port
      [1] "45629"
      
      $spark.executorEnv.LD_LIBRARY_PATH
      [1] "$LD_LIBRARY_PATH:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/jre/lib/amd64/server"
      
      $spark.executor.id
      [1] "driver"
      
      $spark.home
      [1] "/opt/spark-2.0.0-bin-hadoop2.6"
      
      $spark.master
      [1] "local[*]"
      
      $spark.sql.catalogImplementation
      [1] "hive"
      
      $spark.submit.deployMode
      [1] "client"
      
      > conf("spark.master")
      $spark.master
      [1] "local[*]"
      
      ```
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #13885 from felixcheung/rconf.
      30b182bc
  4. Jun 25, 2016
    • Sean Owen's avatar
      [SPARK-16193][TESTS] Address flaky ExternalAppendOnlyMapSuite spilling tests · e8774158
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
Make spill tests wait until the job has completed before returning the number of stages that spilled
      
      ## How was this patch tested?
      
      Existing Jenkins tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #13896 from srowen/SPARK-16193.
      e8774158
    • Alex Bozarth's avatar
      [SPARK-1301][WEB UI] Added anchor links to Accumulators and Tasks on StagePage · 3ee9695d
      Alex Bozarth authored
      ## What changes were proposed in this pull request?
      
Sometimes the "Aggregated Metrics by Executor" table on the Stage page can get very long, so anchor links to the Accumulators and Tasks tables below it have been added to the summary at the top of the page. This has been done in the same way as on the Jobs and Stages pages. Note: the Accumulators link is only displayed when the table exists.
      
      ## How was this patch tested?
      
      Manually Tested and dev/run-tests
      
      ![justtasks](https://cloud.githubusercontent.com/assets/13952758/15165269/6e8efe8c-16c9-11e6-9784-cffe966fdcf0.png)
      ![withaccumulators](https://cloud.githubusercontent.com/assets/13952758/15165270/7019ec9e-16c9-11e6-8649-db69ed7a317d.png)
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      
      Closes #13037 from ajbozarth/spark1301.
      3ee9695d
    • Sital Kedia's avatar
      [SPARK-15958] Make initial buffer size for the Sorter configurable · bf665a95
      Sital Kedia authored
      ## What changes were proposed in this pull request?
      
Currently the initial buffer size in the sorter is hard-coded and is too small for large workloads. As a result, the sorter spends significant time expanding the buffer and copying the data. It would be useful to make it configurable.
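A hedged sketch of the idea (the configuration key and default here are illustrative, not necessarily the ones the patch introduces):

```scala
// Sketch: reading an initial sorter buffer size from configuration instead of
// a hard-coded constant. The key name is hypothetical.
val DefaultInitialBufferSize = 4096

def initialBufferSize(conf: Map[String, String]): Int =
  conf.get("spark.shuffle.sort.initialBufferSize") // illustrative key
    .map(_.toInt)
    .getOrElse(DefaultInitialBufferSize)
```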
      
      ## How was this patch tested?
      
      Tested by running a job on the cluster.
      
      Author: Sital Kedia <skedia@fb.com>
      
      Closes #13699 from sitalkedia/config_sort_buffer_upstream.
      bf665a95
    • José Antonio's avatar
      [MLLIB] org.apache.spark.mllib.util.SVMDataGenerator generates... · a3c7b418
      José Antonio authored
      [MLLIB] org.apache.spark.mllib.util.SVMDataGenerator generates ArrayIndexOutOfBoundsException. I have found the bug and tested the solution.
      
      ## What changes were proposed in this pull request?
      
Just adjust the size of an array in line 58 so it does not cause an ArrayIndexOutOfBoundsException in line 66.
      
      ## How was this patch tested?
      
Manual tests. I have recompiled the entire project with the fix; it built successfully and I have run the code, also with good results.
      
Line 66, `val yD = blas.ddot(trueWeights.length, x, 1, trueWeights, 1) + rnd.nextGaussian() * 0.1`,
crashes because `trueWeights` has length `nfeatures + 1` while `x` has length `nfeatures`, and they should have the same length.

To fix this, just make `trueWeights` the same length as `x`.
      
      I have recompiled the project with the change and it is working now:
      [spark-1.6.1]$ spark-submit --master local[*] --class org.apache.spark.mllib.util.SVMDataGenerator mllib/target/spark-mllib_2.11-1.6.1.jar local /home/user/test
      
      And it generates the data successfully now in the specified folder.
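A toy illustration of the invariant behind the fix (plain Scala stand-in for the BLAS dot product; names are illustrative):

```scala
// The weight vector must be exactly as long as the feature vector for the
// dot product to be valid; allocating it with one extra element caused the
// out-of-bounds access.
val nfeatures = 3
val rnd = new scala.util.Random(42)
val trueWeights = Array.fill(nfeatures)(rnd.nextDouble()) // was nfeatures + 1
val x = Array.fill(nfeatures)(rnd.nextDouble())
require(trueWeights.length == x.length, "weights and features must align")
val yD = x.zip(trueWeights).map { case (a, b) => a * b }.sum + rnd.nextGaussian() * 0.1
```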
      
      Author: José Antonio <joseanmunoz@gmail.com>
      
      Closes #13895 from j4munoz/patch-2.
      a3c7b418
    • Dongjoon Hyun's avatar
      [SPARK-16186] [SQL] Support partition batch pruning with `IN` predicate in InMemoryTableScanExec · a7d29499
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
One of the most frequent usage patterns for Spark SQL is querying **cached tables**. This PR improves `InMemoryTableScanExec` to handle `IN` predicates efficiently by pruning partition batches. Of course, the performance improvement varies with the query and the dataset, but for the following simple query, the query duration in the Spark UI goes from 9 seconds to 50~90 ms, over 100 times faster.
      
      **Before**
      ```scala
      $ bin/spark-shell --driver-memory 6G
      scala> val df = spark.range(2000000000)
      scala> df.createOrReplaceTempView("t")
      scala> spark.catalog.cacheTable("t")
      scala> sql("select id from t where id = 1").collect()    // About 2 mins
      scala> sql("select id from t where id = 1").collect()    // less than 90ms
      scala> sql("select id from t where id in (1,2,3)").collect()  // 9 seconds
      ```
      
      **After**
      ```scala
      scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms
      ```
      
This PR impacts over 35 TPC-DS queries when the tables are cached.
Note that this optimization applies only to `IN`. To apply it to an `IN` predicate with more than 10 items, the `spark.sql.optimizer.inSetConversionThreshold` option should be increased.
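A simplified model of how batch pruning with an `IN` list works (the names and per-batch statistics here are illustrative, not Spark's internal API):

```scala
// A batch can be skipped when no value in the IN list falls within the
// batch's min/max statistics; only surviving batches are scanned.
case class BatchStats(min: Long, max: Long)

def mayContain(stats: BatchStats, inList: Seq[Long]): Boolean =
  inList.exists(v => v >= stats.min && v <= stats.max)

val batches = Seq(BatchStats(0L, 999L), BatchStats(1000L, 1999L), BatchStats(2000L, 2999L))
val kept = batches.filter(b => mayContain(b, Seq(1L, 2L, 3L)))
```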
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (including new testcases).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13887 from dongjoon-hyun/SPARK-16186.
      a7d29499
  5. Jun 24, 2016
    • Takeshi YAMAMURO's avatar
      [SPARK-16192][SQL] Add type checks in CollectSet · d2e44d7d
      Takeshi YAMAMURO authored
      ## What changes were proposed in this pull request?
`CollectSet` cannot have map-typed data because `MapTypeData` does not implement `equals`.
So, this PR adds type checks in `CheckAnalysis`.
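A toy sketch of such a type check (the types and error message are illustrative stand-ins, not Spark's actual classes):

```scala
// A CheckAnalysis-style rule rejecting map-typed input to collect_set,
// since map values lack usable equality for set semantics.
sealed trait DataType
case object LongType extends DataType
case class MapType(key: DataType, value: DataType) extends DataType

def checkCollectSet(childType: DataType): Either[String, Unit] = childType match {
  case _: MapType => Left("collect_set() cannot have map type data")
  case _          => Right(())
}
```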
      
      ## How was this patch tested?
      Added tests to check failures when we found map-typed data in `CollectSet`.
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #13892 from maropu/SPARK-16192.
      d2e44d7d
    • Dilip Biswal's avatar
      [SPARK-16195][SQL] Allow users to specify empty over clause in window... · 9053054c
      Dilip Biswal authored
      [SPARK-16195][SQL] Allow users to specify empty over clause in window expressions through dataset API
      
      ## What changes were proposed in this pull request?
Allow users to specify an empty over clause in window expressions through the Dataset API.
      
In SQL, it is allowed to specify an empty OVER clause in a window expression.
      
      ```SQL
      select area, sum(product) over () as c from windowData
      where product > 3 group by area, product
      having avg(month) > 0 order by avg(month), product
      ```
In this case the analytic function `sum` is computed over all the rows of the result set.

Currently this is not allowed through the Dataset API; this PR adds that support.
      
      ## How was this patch tested?
      
      Added a new test in DataframeWindowSuite
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #13897 from dilipbiswal/spark-empty-over.
      9053054c
    • Dongjoon Hyun's avatar
      [SPARK-16173] [SQL] Can't join describe() of DataFrame in Scala 2.10 · e5d0928e
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
This PR fixes `DataFrame.describe()` by forcing materialization to make the `Seq` serializable. Currently, the result of `describe()` throws a `Task not serializable` Spark exception when joined in Scala 2.10.
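The idea behind the fix can be sketched in plain Scala (a minimal stand-in, not the actual patch):

```scala
// A lazily evaluated Seq (here a Stream) keeps a reference to its producer,
// which may not be serializable; forcing it into a strict List yields a
// plain, serializable collection.
val lazySeq: Seq[Int] = Stream.from(1).take(3).map(_ * 2)
val materialized: List[Int] = lazySeq.toList // force materialization
```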
      
      ## How was this patch tested?
      
      Manual. (After building with Scala 2.10, test on `bin/spark-shell` and `bin/pyspark`.)
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13900 from dongjoon-hyun/SPARK-16173.
      e5d0928e
    • Davies Liu's avatar
      Revert "[SPARK-16186] [SQL] Support partition batch pruning with `IN`... · 20768dad
      Davies Liu authored
      Revert "[SPARK-16186] [SQL] Support partition batch pruning with `IN` predicate in InMemoryTableScanExec"
      
      This reverts commit a65bcbc2.
      20768dad
    • Dongjoon Hyun's avatar
      [SPARK-16186] [SQL] Support partition batch pruning with `IN` predicate in InMemoryTableScanExec · a65bcbc2
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
One of the most frequent usage patterns for Spark SQL is querying **cached tables**. This PR improves `InMemoryTableScanExec` to handle `IN` predicates efficiently by pruning partition batches. Of course, the performance improvement varies with the query and the dataset, but for the following simple query, the query duration in the Spark UI goes from 9 seconds to 50~90 ms, over 100 times faster.
      
      **Before**
      ```scala
      $ bin/spark-shell --driver-memory 6G
      scala> val df = spark.range(2000000000)
      scala> df.createOrReplaceTempView("t")
      scala> spark.catalog.cacheTable("t")
      scala> sql("select id from t where id = 1").collect()    // About 2 mins
      scala> sql("select id from t where id = 1").collect()    // less than 90ms
      scala> sql("select id from t where id in (1,2,3)").collect()  // 9 seconds
      ```
      
      **After**
      ```scala
      scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms
      ```
      
This PR impacts over 35 TPC-DS queries when the tables are cached.
Note that this optimization applies only to `IN`. To apply it to an `IN` predicate with more than 10 items, the `spark.sql.optimizer.inSetConversionThreshold` option should be increased.
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (including new testcases).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13887 from dongjoon-hyun/SPARK-16186.
      a65bcbc2
    • Davies Liu's avatar
      [SPARK-16179][PYSPARK] fix bugs for Python udf in generate · 4435de1b
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
This PR fixes the bug that occurs when a Python UDF is used in `explode` (a generator). `GenerateExec` requires that all attributes in its expressions be resolvable from its children at creation time, so we should replace the children first, then replace its expressions.
      
      ```
      >>> df.select(explode(f(*df))).show()
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/home/vlad/dev/spark/python/pyspark/sql/dataframe.py", line 286, in show
          print(self._jdf.showString(n, truncate))
        File "/home/vlad/dev/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
        File "/home/vlad/dev/spark/python/pyspark/sql/utils.py", line 63, in deco
          return f(*a, **kw)
        File "/home/vlad/dev/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
      py4j.protocol.Py4JJavaError: An error occurred while calling o52.showString.
      : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree:
      Generate explode(<lambda>(_1#0L)), false, false, [col#15L]
      +- Scan ExistingRDD[_1#0L]
      
      	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387)
      	at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:69)
      	at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:45)
      	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:177)
      	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144)
      	at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:153)
      	at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
      	at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
      	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
      	at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113)
      	at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93)
      	at org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:95)
      	at org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:95)
      	at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
      	at scala.collection.immutable.List.foldLeft(List.scala:84)
      	at org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:95)
      	at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:85)
      	at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:85)
      	at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2557)
      	at org.apache.spark.sql.Dataset.head(Dataset.scala:1923)
      	at org.apache.spark.sql.Dataset.take(Dataset.scala:2138)
      	at org.apache.spark.sql.Dataset.showString(Dataset.scala:239)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
      	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
      	at py4j.Gateway.invoke(Gateway.java:280)
      	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
      	at py4j.commands.CallCommand.execute(CallCommand.java:79)
      	at py4j.GatewayConnection.run(GatewayConnection.java:211)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: java.lang.reflect.InvocationTargetException
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
      	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$13.apply(TreeNode.scala:413)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$13.apply(TreeNode.scala:413)
      	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:412)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:387)
      	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
      	... 42 more
      Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF0#20
      	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
      	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
      	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
      	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268)
      	at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
      	at org.apache.spark.sql.execution.GenerateExec.<init>(GenerateExec.scala:63)
      	... 52 more
      Caused by: java.lang.RuntimeException: Couldn't find pythonUDF0#20 in [_1#0L]
      	at scala.sys.package$.error(package.scala:27)
      	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:94)
      	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:88)
      	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
      	... 67 more
      ```
      
      ## How was this patch tested?
      
      Added regression tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13883 from davies/udf_in_generate.
      4435de1b
    • Reynold Xin's avatar
      [SQL][MINOR] Simplify data source predicate filter translation. · 5f8de216
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This is a small patch to rewrite the predicate filter translation in DataSourceStrategy. The original code used excessive functional constructs (e.g. unzip) and was very difficult to understand.
      
      ## How was this patch tested?
      Should be covered by existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13889 from rxin/simplify-predicate-filter.
      5f8de216
    • Davies Liu's avatar
      [SPARK-16077] [PYSPARK] catch the exception from pickle.whichmodule() · d4893540
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
In the case that we don't know which module an object came from, we call `pickle.whichmodule()` to go through all the loaded modules to find the object, which can fail because of some modules, for example six; see https://bitbucket.org/gutworth/six/issues/63/importing-six-breaks-pickling

We should ignore the exception here and use `__main__` as the module name (it means we can't find the module).
      
      ## How was this patch tested?
      
Manually tested. Can't have a unit test for this.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13788 from davies/whichmodule.
      d4893540
    • Liwei Lin's avatar
      [SPARK-15963][CORE] Catch `TaskKilledException` correctly in Executor.TaskRunner · a4851ed0
      Liwei Lin authored
      ## The problem
      
Before this change, if either of the following cases happened to a task, the task would be marked as `FAILED` instead of `KILLED`:
      - the task was killed before it was deserialized
      - `executor.kill()` marked `taskRunner.killed`, but before calling `task.killed()` the worker thread threw the `TaskKilledException`
      
      The reason is, in the `catch` block of the current [Executor.TaskRunner](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L362)'s implementation, we are mistakenly catching:
      ```scala
      case _: TaskKilledException | _: InterruptedException if task.killed => ...
      ```
      the semantics of which is:
      - **(**`TaskKilledException` **OR** `InterruptedException`**)** **AND** `task.killed`
      
      Then when `TaskKilledException` is thrown but `task.killed` is not marked, we would mark the task as `FAILED` (which should really be `KILLED`).
      
      ## What changes were proposed in this pull request?
      
      This patch alters the catch condition's semantics from:
      - **(**`TaskKilledException` **OR** `InterruptedException`**)** **AND** `task.killed`
      
      to
      
      - `TaskKilledException` **OR** **(**`InterruptedException` **AND** `task.killed`**)**
      
      so that we can catch `TaskKilledException` correctly and mark the task as `KILLED` correctly.
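The two catch conditions can be modeled side by side (a toy model; `killed` stands in for `task.killed`):

```scala
// Before: TaskKilledException only maps to KILLED when task.killed is set.
// After: TaskKilledException always maps to KILLED.
class TaskKilledException extends RuntimeException

def classifyBefore(e: Throwable, killed: Boolean): String = e match {
  case (_: TaskKilledException | _: InterruptedException) if killed => "KILLED"
  case _ => "FAILED"
}

def classifyAfter(e: Throwable, killed: Boolean): String = e match {
  case _: TaskKilledException => "KILLED"
  case _: InterruptedException if killed => "KILLED"
  case _ => "FAILED"
}
```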
      
      ## How was this patch tested?
      
      Added unit test which failed before the change, ran new test 1000 times manually
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #13685 from lw-lin/fix-task-killed.
      a4851ed0
    • GayathriMurali's avatar
      [SPARK-15997][DOC][ML] Update user guide for HashingTF, QuantileVectorizer and CountVectorizer · be88383e
      GayathriMurali authored
      ## What changes were proposed in this pull request?
      
Made changes to HashingTF, QuantileVectorizer, and CountVectorizer
      
      Author: GayathriMurali <gayathri.m@intel.com>
      
      Closes #13745 from GayathriMurali/SPARK-15997.
      be88383e
    • Sean Owen's avatar
      [SPARK-16129][CORE][SQL] Eliminate direct use of commons-lang classes in favor of commons-lang3 · 158af162
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
Replace use of `commons-lang` in favor of `commons-lang3` and forbid the former via scalastyle; remove `NotImplementedException` from `commons-lang` in favor of the JDK's `UnsupportedOperationException`
      
      ## How was this patch tested?
      
      Jenkins tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #13843 from srowen/SPARK-16129.
      158af162
    • peng.zhang's avatar
      [SPARK-16125][YARN] Fix not test yarn cluster mode correctly in YarnClusterSuite · f4fd7432
      peng.zhang authored
      ## What changes were proposed in this pull request?
      
Since SPARK-13220 (Deprecate "yarn-client" and "yarn-cluster"), YarnClusterSuite hasn't tested "yarn cluster" mode correctly.
      This pull request fixes it.
      
      ## How was this patch tested?
      Unit test
      
      
      Author: peng.zhang <peng.zhang@xiaomi.com>
      
      Closes #13836 from renozhang/SPARK-16125-test-yarn-cluster-mode.
      f4fd7432
    • Cheng Lian's avatar
      [SPARK-13709][SQL] Initialize deserializer with both table and partition... · 2d2f607b
      Cheng Lian authored
      [SPARK-13709][SQL] Initialize deserializer with both table and partition properties when reading partitioned tables
      
      ## What changes were proposed in this pull request?
      
When reading partitions of a partitioned Hive SerDe table, we only initialize the deserializer using partition properties. However, for SerDes like `AvroSerDe`, essential properties (e.g. Avro schema information) may be defined in table properties. We should merge both table properties and partition properties before initializing the deserializer.
      
      Note that an individual partition may have different properties than the one defined in the table properties (e.g. partitions within a table can have different SerDes). Thus, for any property key defined in both partition and table properties, the value set in partition properties wins.
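The merge rule described above can be sketched with plain maps (the property keys here are illustrative):

```scala
// Start from table properties and overlay partition properties, so
// partition-level values win on conflicting keys while table-only keys
// (e.g. the Avro schema) are preserved.
val tableProps = Map(
  "avro.schema.literal" -> "<table schema>",
  "serialization.lib"   -> "AvroSerDe")
val partitionProps = Map("serialization.lib" -> "LazySimpleSerDe")
val merged = tableProps ++ partitionProps // right operand wins on duplicates
```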
      
      ## How was this patch tested?
      
      New test case added in `QueryPartitionSuite`.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #13865 from liancheng/spark-13709-partitioned-avro-table.
      2d2f607b
  6. Jun 23, 2016
    • Yuhao Yang's avatar
      [SPARK-16133][ML] model loading backward compatibility for ml.feature · cc6778ee
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      
model loading backward compatibility for ml.feature
      
      ## How was this patch tested?
      
      existing ut and manual test for loading 1.6 models.
      
      Author: Yuhao Yang <yuhao.yang@intel.com>
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #13844 from hhbyyh/featureComp.
      cc6778ee
    • Xiangrui Meng's avatar
      [SPARK-16142][R] group naiveBayes method docs in a single Rd · 4a40d43b
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      This PR groups `spark.naiveBayes`, `summary(NB)`, `predict(NB)`, and `write.ml(NB)` into a single Rd.
      
      ## How was this patch tested?
      
      Manually checked generated HTML doc. See attached screenshots.
      
      ![screen shot 2016-06-23 at 2 11 00 pm](https://cloud.githubusercontent.com/assets/829644/16320452/a5885e92-394c-11e6-994f-2ab5cddad86f.png)
      
      ![screen shot 2016-06-23 at 2 11 15 pm](https://cloud.githubusercontent.com/assets/829644/16320455/aad1f6d8-394c-11e6-8ef4-13bee989f52f.png)
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #13877 from mengxr/SPARK-16142.
      4a40d43b
    • Yuhao Yang's avatar
      [SPARK-16177][ML] model loading backward compatibility for ml.regression · 14bc5a7f
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      jira: https://issues.apache.org/jira/browse/SPARK-16177
      model loading backward compatibility for ml.regression
      
      ## How was this patch tested?
      
      existing ut and manual test for loading 1.6 models.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #13879 from hhbyyh/regreComp.
      14bc5a7f
    • Wenchen Fan's avatar
      [SQL][MINOR] ParserUtils.operationNotAllowed should throw exception directly · 6a3c6276
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
It's weird that `ParserUtils.operationNotAllowed` returns an exception and the caller throws it.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13874 from cloud-fan/style.
      6a3c6276
    • Sameer Agarwal's avatar
      [SPARK-16123] Avoid NegativeArraySizeException while reserving additional... · cc71d4fa
      Sameer Agarwal authored
      [SPARK-16123] Avoid NegativeArraySizeException while reserving additional capacity in VectorizedColumnReader
      
      ## What changes were proposed in this pull request?
      
      This patch fixes an overflow bug in vectorized parquet reader where both off-heap and on-heap variants of `ColumnVector.reserve()` can unfortunately overflow while reserving additional capacity during reads.
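The overflow can be sketched in a few lines (a hypothetical capacity-growth helper, not the reader's actual code):

```scala
// Doubling an Int capacity near Int.MaxValue wraps negative, which is what
// triggers the NegativeArraySizeException at allocation time. Widening to
// Long before doubling and clamping to Int.MaxValue avoids the wrap.
def grownCapacity(current: Int, required: Int): Int = {
  val doubled = current.toLong * 2L
  math.min(math.max(doubled, required.toLong), Int.MaxValue.toLong).toInt
}
```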
      
      ## How was this patch tested?
      
      Manual Tests
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #13832 from sameeragarwal/negative-array.
      cc71d4fa
    • Dongjoon Hyun's avatar
      [SPARK-16165][SQL] Fix the update logic for InMemoryTableScanExec.readBatches · 264bc636
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
Currently, the `readBatches` accumulator of `InMemoryTableScanExec` is updated only when `spark.sql.inMemoryColumnarStorage.partitionPruning` is true. Although this metric is used only for testing purposes, it should be correct regardless of SQL options.
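The shape of the fix can be sketched as follows (all names are stand-ins, not Spark's actual internals): the metric update moves outside the option-guarded pruning branch, so it is recorded whether or not partition pruning is enabled.

```scala
// Hypothetical scan sketch: `readBatches` stands in for the Spark accumulator.
class ScanSketch {
  var readBatches = 0L
  def scan(batches: Seq[Int], pruningEnabled: Boolean): Seq[Int] = {
    // Pruning may drop some batches (even-numbered ones, for illustration).
    val selected = if (pruningEnabled) batches.filter(_ % 2 == 0) else batches
    readBatches += selected.size  // updated unconditionally after the fix
    selected
  }
}
```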
      
      ## How was this patch tested?
      
Passes the Jenkins tests (including a new test case).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13870 from dongjoon-hyun/SPARK-16165.
      264bc636
    • Shixiong Zhu's avatar
      [SPARK-15443][SQL] Fix 'explain' for streaming Dataset · 0e4bdebe
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      - Fix the `explain` command for streaming Dataset/DataFrame. E.g.,
      ```
      == Parsed Logical Plan ==
      'SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7]
      +- 'MapElements <function1>, obj#6: java.lang.String
         +- 'DeserializeToObject unresolveddeserializer(createexternalrow(getcolumnbyordinal(0, StringType).toString, StructField(value,StringType,true))), obj#5: org.apache.spark.sql.Row
            +- Filter <function1>.apply
               +- StreamingRelation FileSource[/Users/zsx/stream], [value#0]
      
      == Analyzed Logical Plan ==
      value: string
      SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7]
      +- MapElements <function1>, obj#6: java.lang.String
         +- DeserializeToObject createexternalrow(value#0.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row
            +- Filter <function1>.apply
               +- StreamingRelation FileSource[/Users/zsx/stream], [value#0]
      
      == Optimized Logical Plan ==
      SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7]
      +- MapElements <function1>, obj#6: java.lang.String
         +- DeserializeToObject createexternalrow(value#0.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row
            +- Filter <function1>.apply
               +- StreamingRelation FileSource[/Users/zsx/stream], [value#0]
      
      == Physical Plan ==
      *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7]
      +- *MapElements <function1>, obj#6: java.lang.String
         +- *DeserializeToObject createexternalrow(value#0.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row
            +- *Filter <function1>.apply
               +- StreamingRelation FileSource[/Users/zsx/stream], [value#0]
      ```
      
      - Add `StreamingQuery.explain` to display the last execution plan. E.g.,
      ```
      == Parsed Logical Plan ==
      SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7]
      +- MapElements <function1>, obj#6: java.lang.String
         +- DeserializeToObject createexternalrow(value#12.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row
            +- Filter <function1>.apply
               +- Relation[value#12] text
      
      == Analyzed Logical Plan ==
      value: string
      SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7]
      +- MapElements <function1>, obj#6: java.lang.String
         +- DeserializeToObject createexternalrow(value#12.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row
            +- Filter <function1>.apply
               +- Relation[value#12] text
      
      == Optimized Logical Plan ==
      SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7]
      +- MapElements <function1>, obj#6: java.lang.String
         +- DeserializeToObject createexternalrow(value#12.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row
            +- Filter <function1>.apply
               +- Relation[value#12] text
      
      == Physical Plan ==
      *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7]
      +- *MapElements <function1>, obj#6: java.lang.String
         +- *DeserializeToObject createexternalrow(value#12.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row
            +- *Filter <function1>.apply
            +- *Scan text [value#12] Format: org.apache.spark.sql.execution.datasources.text.TextFileFormat@1836ab91, InputPaths: file:/Users/zsx/stream/a.txt, file:/Users/zsx/stream/b.txt, file:/Users/zsx/stream/c.txt, PushedFilters: [], ReadSchema: struct<value:string>
      ```
      
      ## How was this patch tested?
      
      The added unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13815 from zsxwing/sdf-explain.
      0e4bdebe