Skip to content
Snippets Groups Projects
  1. Jun 14, 2016
    • Jeff Zhang's avatar
      doc fix of HiveThriftServer · 53bb0308
      Jeff Zhang authored
      ## What changes were proposed in this pull request?
      
      Just minor doc fix.
      
      \cc yhuai
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #13659 from zjffdu/doc_fix.
      53bb0308
    • Adam Roberts's avatar
      [SPARK-15821][DOCS] Include parallel build info · a431e3f1
      Adam Roberts authored
      ## What changes were proposed in this pull request?
      
      We should mention that users can build Spark using multiple threads to decrease build times; either here or in "Building Spark"
      
      ## How was this patch tested?
      
      Built on machines with between one core to 192 cores using mvn -T 1C and observed faster build times with no loss in stability
      
      In response to the question here https://issues.apache.org/jira/browse/SPARK-15821 I think we should suggest this option as we know it works for Spark and can result in faster builds
      
      Author: Adam Roberts <aroberts@uk.ibm.com>
      
      Closes #13562 from a-roberts/patch-3.
      a431e3f1
    • Shixiong Zhu's avatar
      [SPARK-15935][PYSPARK] Enable test for sql/streaming.py and fix these tests · 96c3500c
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR just enables tests for sql/streaming.py and also fixes the failures.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13655 from zsxwing/python-streaming-test.
      96c3500c
    • Mortada Mehyar's avatar
      [DOCUMENTATION] fixed typos in python programming guide · a87a56f5
      Mortada Mehyar authored
      ## What changes were proposed in this pull request?
      
      minor typo
      
      ## How was this patch tested?
      
      minor typo in the doc, should be self explanatory
      
      Author: Mortada Mehyar <mortada.mehyar@gmail.com>
      
      Closes #13639 from mortada/typo.
      a87a56f5
    • Wenchen Fan's avatar
      [SPARK-15932][SQL][DOC] document the contract of encoder serializer expressions · 688b6ef9
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      In our encoder framework, we imply that serializer expressions should use `BoundReference` to refer to the input object, and a lot of codes depend on this contract(e.g. ExpressionEncoder.tuple).  This PR adds some document and assert in `ExpressionEncoder` to make it clearer.
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13648 from cloud-fan/comment.
      688b6ef9
  2. Jun 13, 2016
    • Sandeep Singh's avatar
      [SPARK-15663][SQL] SparkSession.catalog.listFunctions shouldn't include the... · 1842cdd4
      Sandeep Singh authored
      [SPARK-15663][SQL] SparkSession.catalog.listFunctions shouldn't include the list of built-in functions
      
      ## What changes were proposed in this pull request?
      SparkSession.catalog.listFunctions currently returns all functions, including the list of built-in functions. This makes the method not as useful because anytime it is run the result set contains over 100 built-in functions.
      
      ## How was this patch tested?
      CatalogSuite
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      
      Closes #13413 from techaddict/SPARK-15663.
      1842cdd4
    • Liang-Chi Hsieh's avatar
      [SPARK-15364][ML][PYSPARK] Implement PySpark picklers for ml.Vector and... · baa3e633
      Liang-Chi Hsieh authored
      [SPARK-15364][ML][PYSPARK] Implement PySpark picklers for ml.Vector and ml.Matrix under spark.ml.python
      
      ## What changes were proposed in this pull request?
      
      Now we have PySpark picklers for new and old vector/matrix, individually. However, they are all implemented under `PythonMLlibAPI`. To separate spark.mllib from spark.ml, we should implement the picklers of new vector/matrix under `spark.ml.python` instead.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #13219 from viirya/pyspark-pickler-ml.
      baa3e633
    • gatorsmile's avatar
      [SPARK-15808][SQL] File Format Checking When Appending Data · 5827b65e
      gatorsmile authored
      #### What changes were proposed in this pull request?
      **Issue:** Got wrong results or strange errors when append data to a table with mismatched file format.
      
      _Example 1: PARQUET -> CSV_
      ```Scala
      createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc")
      createDF(10, 19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc")
      ```
      
      Error we got:
      ```
      Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.RuntimeException: file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-00000-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [79, 82, 67, 23]
      ```
      
      _Example 2: Json -> CSV_
      ```Scala
      createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV")
      createDF(10, 19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV")
      ```
      
      No exception, but wrong results:
      ```
      +----+----+
      |  c1|  c2|
      +----+----+
      |null|null|
      |null|null|
      |null|null|
      |null|null|
      |   0|str0|
      |   1|str1|
      |   2|str2|
      |   3|str3|
      |   4|str4|
      |   5|str5|
      |   6|str6|
      |   7|str7|
      |   8|str8|
      |   9|str9|
      +----+----+
      ```
      _Example 3: Json -> Text_
      ```Scala
      createDF(0, 9).write.format("json").saveAsTable("appendJsonToText")
      createDF(10, 19).write.mode(SaveMode.Append).format("text").saveAsTable("appendJsonToText")
      ```
      
      Error we got:
      ```
      Text data source supports only a single column, and you have 2 columns.
      ```
      
      This PR is to issue an exception with appropriate error messages.
      
      #### How was this patch tested?
      Added test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13546 from gatorsmile/fileFormatCheck.
      5827b65e
    • Sean Zhong's avatar
      [SPARK-15910][SQL] Check schema consistency when using Kryo encoder to convert DataFrame to Dataset · 7b9071ee
      Sean Zhong authored
      ## What changes were proposed in this pull request?
      
      This PR enforces schema check when converting DataFrame to Dataset using Kryo encoder. For example.
      
      **Before the change:**
      
      Schema is NOT checked when converting DataFrame to Dataset using kryo encoder.
      ```
      scala> case class B(b: Int)
      scala> implicit val encoder = Encoders.kryo[B]
      scala> val df = Seq((1)).toDF("b")
      scala> val ds = df.as[B] // Schema compatibility is NOT checked
      ```
      
      **After the change:**
      Report AnalysisException since the schema is NOT compatible.
      ```
      scala> val ds = Seq((1)).toDF("b").as[B]
      org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`b` AS BINARY)' due to data type mismatch: cannot cast IntegerType to BinaryType;
      ...
      ```
      
      ## How was this patch tested?
      
      Unit test.
      
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #13632 from clockfly/spark-15910.
      7b9071ee
    • Josh Rosen's avatar
      [SPARK-15929] Fix portability of DataFrameSuite path globbing tests · a6babca1
      Josh Rosen authored
      The DataFrameSuite regression tests for SPARK-13774 fail in my environment because they attempt to glob over all of `/mnt` and some of the subdirectories restrictive permissions which cause the test to fail.
      
      This patch rewrites those tests to remove all environment-specific assumptions; the tests now create their own unique temporary paths for use in the tests.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #13649 from JoshRosen/SPARK-15929.
      a6babca1
    • Cheng Lian's avatar
      [SPARK-15925][SQL][SPARKR] Replaces registerTempTable with createOrReplaceTempView · ced8d669
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR replaces `registerTempTable` with `createOrReplaceTempView` as a follow-up task of #12945.
      
      ## How was this patch tested?
      
      Existing SparkR tests.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #13644 from liancheng/spark-15925-temp-view-for-r.
      ced8d669
    • Wenchen Fan's avatar
      [SPARK-15887][SQL] Bring back the hive-site.xml support for Spark 2.0 · c4b1ad02
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Right now, Spark 2.0 does not load hive-site.xml. Based on users' feedback, it seems make sense to still load this conf file.
      
      This PR adds a `hadoopConf` API in `SharedState`, which is `sparkContext.hadoopConfiguration` by default. When users are under hive context, `SharedState.hadoopConf` will load hive-site.xml and append its configs to `sparkContext.hadoopConfiguration`.
      
      When we need to read hadoop config in spark sql, we should call `SessionState.newHadoopConf`, which contains `sparkContext.hadoopConfiguration`, hive-site.xml and sql configs.
      
      ## How was this patch tested?
      
      new test in `HiveDataFrameSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13611 from cloud-fan/hive-site.
      c4b1ad02
    • Tathagata Das's avatar
      [SPARK-15889][SQL][STREAMING] Add a unique id to ContinuousQuery · c654ae21
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      ContinuousQueries have names that are unique across all the active ones. However, when queries are rapidly restarted with same name, it causes races conditions with the listener. A listener event from a stopped query can arrive after the query has been restarted, leading to complexities in monitoring infrastructure.
      
      Along with this change, I have also consolidated all the messy code paths to start queries with different sinks.
      
      ## How was this patch tested?
      Added unit tests, and existing unit tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #13613 from tdas/SPARK-15889.
      c654ae21
    • Takeshi YAMAMURO's avatar
      [SPARK-15530][SQL] Set #parallelism for file listing in listLeafFilesInParallel · 5ad4e32d
      Takeshi YAMAMURO authored
      ## What changes were proposed in this pull request?
      This pr is to set the number of parallelism to prevent file listing in `listLeafFilesInParallel` from generating many tasks in case of large #defaultParallelism.
      
      ## How was this patch tested?
      Manually checked
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #13444 from maropu/SPARK-15530.
      5ad4e32d
    • gatorsmile's avatar
      [SPARK-15676][SQL] Disallow Column Names as Partition Columns For Hive Tables · 3b7fb84c
      gatorsmile authored
      #### What changes were proposed in this pull request?
      When creating a Hive Table (not data source tables), a common error users might make is to specify an existing column name as a partition column. Below is what Hive returns in this case:
      ```
      hive> CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (data string, part string);
      FAILED: SemanticException [Error 10035]: Column repeated in partitioning columns
      ```
      Currently, the error we issued is very confusing:
      ```
      org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:For direct MetaStore DB connections, we don't support retries at the client level.);
      ```
      This PR is to fix the above issue by capturing the usage error in `Parser`.
      
      #### How was this patch tested?
      Added a test case to `DDLCommandSuite`
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13415 from gatorsmile/partitionColumnsInTableSchema.
      3b7fb84c
    • Tathagata Das's avatar
      [HOTFIX][MINOR][SQL] Revert " Standardize 'continuous queries' to 'streaming D… · a6a18a45
      Tathagata Das authored
      This reverts commit d32e2277.
      Broke build - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.0-compile-maven-hadoop-2.3/326/console
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #13645 from tdas/build-break.
      a6a18a45
    • Liwei Lin's avatar
      [MINOR][SQL] Standardize 'continuous queries' to 'streaming Datasets/DataFrames' · d32e2277
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      
      This patch does some replacing (as `streaming Datasets/DataFrames` is the term we've chosen in [SPARK-15593](https://github.com/apache/spark/commit/00c310133df4f3893dd90d801168c2ab9841b102)):
       - `continuous queries` -> `streaming Datasets/DataFrames`
       - `non-continuous queries` -> `non-streaming Datasets/DataFrames`
      
      This patch also adds `test("check foreach() can only be called on streaming Datasets/DataFrames")`.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #13595 from lw-lin/continuous-queries-to-streaming-dss-dfs.
      d32e2277
    • Prashant Sharma's avatar
      [SPARK-15697][REPL] Unblock some of the useful repl commands. · 4134653e
      Prashant Sharma authored
      ## What changes were proposed in this pull request?
      
      Unblock some of the useful repl commands. like, "implicits", "javap", "power", "type", "kind". As they are useful and fully functional and part of scala/scala project, I see no harm in having them either.
      
      Verbatim paste form JIRA description.
      "implicits", "javap", "power", "type", "kind" commands in repl are blocked. However, they work fine in all cases I have tried. It is clear we don't support them as they are part of the scala/scala repl project. What is the harm in unblocking them, given they are useful ?
      In previous versions of spark we disabled these commands because it was difficult to support them without customization and the associated maintenance. Since the code base of scala repl was actually ported and maintained under spark source. Now that is not the situation and one can benefit from these commands in Spark REPL as much as in scala repl.
      
      ## How was this patch tested?
      Existing tests and manual, by trying out all of the above commands.
      
      P.S. Symantics of reset are to be discussed in a separate issue.
      
      Author: Prashant Sharma <prashsh1@in.ibm.com>
      
      Closes #13437 from ScrapCodes/SPARK-15697/repl-unblock-commands.
      4134653e
    • Dongjoon Hyun's avatar
      [SPARK-15913][CORE] Dispatcher.stopped should be enclosed by synchronized block. · 938434dc
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      `Dispatcher.stopped` is guarded by `this`, but it is used without synchronization in `postMessage` function. This PR fixes this and also the exception message became more accurate.
      
      ## How was this patch tested?
      
      Pass the existing Jenkins tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13634 from dongjoon-hyun/SPARK-15913.
      938434dc
    • Wenchen Fan's avatar
      [SPARK-15814][SQL] Aggregator can return null result · cd47e233
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      It's similar to the bug fixed in https://github.com/apache/spark/pull/13425, we should consider null object and wrap the `CreateStruct` with `If` to do null check.
      
      This PR also improves the test framework to test the objects of `Dataset[T]` directly, instead of calling `toDF` and compare the rows.
      
      ## How was this patch tested?
      
      new test in `DatasetAggregatorSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13553 from cloud-fan/agg-null.
      cd47e233
    • Peter Ableda's avatar
      [SPARK-15813] Improve Canceling log message to make it less ambiguous · d681742b
      Peter Ableda authored
      ## What changes were proposed in this pull request?
      Add new desired executor number to make the log message less ambiguous.
      
      ## How was this patch tested?
      This is a trivial change
      
      Author: Peter Ableda <abledapeter@gmail.com>
      
      Closes #13552 from peterableda/patch-1.
      d681742b
  3. Jun 12, 2016
    • Wenchen Fan's avatar
      [SPARK-15898][SQL] DataFrameReader.text should return DataFrame · e2ab79d5
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      We want to maintain API compatibility for DataFrameReader.text, and will introduce a new API called DataFrameReader.textFile which returns Dataset[String].
      
      affected PRs:
      https://github.com/apache/spark/pull/11731
      https://github.com/apache/spark/pull/13104
      https://github.com/apache/spark/pull/13184
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13604 from cloud-fan/revert.
      e2ab79d5
    • Herman van Hövell tot Westerflier's avatar
      [SPARK-15370][SQL] Fix count bug · 1f8f2b5c
      Herman van Hövell tot Westerflier authored
      # What changes were proposed in this pull request?
      This pull request fixes the COUNT bug in the `RewriteCorrelatedScalarSubquery` rule.
      
      After this change, the rule tests the expression at the root of the correlated subquery to determine whether the expression returns `NULL` on empty input. If the expression does not return `NULL`, the rule generates additional logic in the `Project` operator above the rewritten subquery. This additional logic intercepts `NULL` values coming from the outer join and replaces them with the value that the subquery's expression would return on empty input.
      
      This PR takes over https://github.com/apache/spark/pull/13155. It only fixes an issue with `Literal` construction and style issues.  All credits should go frreiss.
      
      # How was this patch tested?
      Added regression tests to cover all branches of the updated rule (see changes to `SubquerySuite`).
      Ran all existing automated regression tests after merging with latest trunk.
      
      Author: frreiss <frreiss@us.ibm.com>
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #13629 from hvanhovell/SPARK-15370-cleanup.
      1f8f2b5c
    • Wenchen Fan's avatar
    • Takuya UESHIN's avatar
      [SPARK-15870][SQL] DataFrame can't execute after uncacheTable. · caebd7f2
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      If a cached `DataFrame` executed more than once and then do `uncacheTable` like the following:
      
      ```
          val selectStar = sql("SELECT * FROM testData WHERE key = 1")
          selectStar.createOrReplaceTempView("selectStar")
      
          spark.catalog.cacheTable("selectStar")
          checkAnswer(
            selectStar,
            Seq(Row(1, "1")))
      
          spark.catalog.uncacheTable("selectStar")
          checkAnswer(
            selectStar,
            Seq(Row(1, "1")))
      ```
      
      , then the uncached `DataFrame` can't execute because of `Task not serializable` exception like:
      
      ```
      org.apache.spark.SparkException: Task not serializable
      	at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
      	at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
      	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
      	at org.apache.spark.SparkContext.clean(SparkContext.scala:2038)
      	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
      	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1912)
      	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:884)
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
      	at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
      	at org.apache.spark.rdd.RDD.collect(RDD.scala:883)
      	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:290)
      ...
      Caused by: java.lang.UnsupportedOperationException: Accumulator must be registered before send to executor
      	at org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:153)
      	at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1118)
      	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1136)
      	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
      	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
      	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
      ...
      ```
      
      Notice that `DataFrame` uncached with `DataFrame.unpersist()` works, but with `spark.catalog.uncacheTable` doesn't work.
      
      This pr reverts a part of cf38fe04 not to unregister `batchStats` accumulator, which is not needed to be unregistered here because it will be done by `ContextCleaner` after it is collected by GC.
      
      ## How was this patch tested?
      
      Added a test to check if DataFrame can execute after uncacheTable and other existing tests.
      But I made a test to check if the accumulator was cleared as `ignore` because the test would be flaky.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #13596 from ueshin/issues/SPARK-15870.
      caebd7f2
    • Herman van Hovell's avatar
      [SPARK-15370][SQL] Revert PR "Update RewriteCorrelatedSuquery rule" · 20b8f2c3
      Herman van Hovell authored
      This reverts commit 9770f6ee.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #13626 from hvanhovell/SPARK-15370-revert.
      20b8f2c3
    • hyukjinkwon's avatar
      [SPARK-15892][ML] Incorrectly merged AFTAggregator with zero total count · e3554605
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently, `AFTAggregator` is not being merged correctly. For example, if there is any single empty partition in the data, this creates an `AFTAggregator` with zero total count which causes the exception below:
      
      ```
      IllegalArgumentException: u'requirement failed: The number of instances should be greater than 0.0, but got 0.'
      ```
      
      Please see [AFTSurvivalRegression.scala#L573-L575](https://github.com/apache/spark/blob/6ecedf39b44c9acd58cdddf1a31cf11e8e24428c/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala#L573-L575) as well.
      
      Just to be clear, the python example `aft_survival_regression.py` seems using 5 rows. So, if there exist partitions more than 5, it throws the exception above since it contains empty partitions which results in an incorrectly merged `AFTAggregator`.
      
      Executing `bin/spark-submit examples/src/main/python/ml/aft_survival_regression.py` on a machine with CPUs more than 5 is being failed because it creates tasks with some empty partitions with defualt  configurations (AFAIK, it sets the parallelism level to the number of CPU cores).
      
      ## How was this patch tested?
      
      An unit test in `AFTSurvivalRegressionSuite.scala` and manually tested by `bin/spark-submit examples/src/main/python/ml/aft_survival_regression.py`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: Hyukjin Kwon <gurwls223@gmail.com>
      
      Closes #13619 from HyukjinKwon/SPARK-15892.
      e3554605
    • Ioana Delaney's avatar
      [SPARK-15832][SQL] Embedded IN/EXISTS predicate subquery throws TreeNodeException · 0ff8a68b
      Ioana Delaney authored
      ## What changes were proposed in this pull request?
      Queries with embedded existential sub-query predicates throws exception when building the physical plan.
      
      Example failing query:
      ```SQL
      scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t1")
      scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t2")
      scala> sql("select c1 from t1 where (case when c2 in (select c2 from t2) then 2 else 3 end) IN (select c2 from t1)").show()
      
      Binding attribute, tree: c2#239
      org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: c2#239
        at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
        at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
      
        ...
        at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
        at org.apache.spark.sql.execution.joins.HashJoin$$anonfun$4.apply(HashJoin.scala:66)
        at org.apache.spark.sql.execution.joins.HashJoin$$anonfun$4.apply(HashJoin.scala:66)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.immutable.List.map(List.scala:285)
        at org.apache.spark.sql.execution.joins.HashJoin$class.org$apache$spark$sql$execution$joins$HashJoin$$x$8(HashJoin.scala:66)
        at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.org$apache$spark$sql$execution$joins$HashJoin$$x$8$lzycompute(BroadcastHashJoinExec.scala:38)
        at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.org$apache$spark$sql$execution$joins$HashJoin$$x$8(BroadcastHashJoinExec.scala:38)
        at org.apache.spark.sql.execution.joins.HashJoin$class.buildKeys(HashJoin.scala:63)
        at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.buildKeys$lzycompute(BroadcastHashJoinExec.scala:38)
        at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.buildKeys(BroadcastHashJoinExec.scala:38)
        at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.requiredChildDistribution(BroadcastHashJoinExec.scala:52)
      ```
      
      **Problem description:**
      When the left hand side expression of an existential sub-query predicate contains another embedded sub-query predicate, the RewritePredicateSubquery optimizer rule does not resolve the embedded sub-query expressions into existential joins.For example, the above query has the following optimized plan, which fails during physical plan build.
      
      ```SQL
      == Optimized Logical Plan ==
      Project [_1#224 AS c1#227]
      +- Join LeftSemi, (CASE WHEN predicate-subquery#255 [(_2#225 = c2#239)] THEN 2 ELSE 3 END = c2#228#262)
         :  +- SubqueryAlias predicate-subquery#255 [(_2#225 = c2#239)]
         :     +- LocalRelation [c2#239]
         :- LocalRelation [_1#224, _2#225]
         +- LocalRelation [c2#228#262]
      
      == Physical Plan ==
      org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: c2#239
      ```
      
      **Solution:**
      In RewritePredicateSubquery, before rewriting the outermost predicate sub-query, resolve any embedded existential sub-queries. The Optimized plan for the above query after the changes looks like below.
      
      ```SQL
      == Optimized Logical Plan ==
      Project [_1#224 AS c1#227]
      +- Join LeftSemi, (CASE WHEN exists#285 THEN 2 ELSE 3 END = c2#228#284)
         :- Join ExistenceJoin(exists#285), (_2#225 = c2#239)
         :  :- LocalRelation [_1#224, _2#225]
         :  +- LocalRelation [c2#239]
         +- LocalRelation [c2#228#284]
      
      == Physical Plan ==
      *Project [_1#224 AS c1#227]
      +- *BroadcastHashJoin [CASE WHEN exists#285 THEN 2 ELSE 3 END], [c2#228#284], LeftSemi, BuildRight
         :- *BroadcastHashJoin [_2#225], [c2#239], ExistenceJoin(exists#285), BuildRight
         :  :- LocalTableScan [_1#224, _2#225]
         :  +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
         :     +- LocalTableScan [c2#239]
         +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
            +- LocalTableScan [c2#228#284]
            +- LocalTableScan [c222#36], [[111],[222]]
      ```
      
      ## How was this patch tested?
      Added new test cases in SubquerySuite.scala
      
      Author: Ioana Delaney <ioanamdelaney@gmail.com>
      
      Closes #13570 from ioana-delaney/fixEmbedSubPredV1.
      0ff8a68b
    • frreiss's avatar
      [SPARK-15370][SQL] Update RewriteCorrelatedScalarSubquery rule to fix COUNT bug · 9770f6ee
      frreiss authored
      ## What changes were proposed in this pull request?
      This pull request fixes the COUNT bug in the `RewriteCorrelatedScalarSubquery` rule.
      
      After this change, the rule tests the expression at the root of the correlated subquery to determine whether the expression returns NULL on empty input. If the expression does not return NULL, the rule generates additional logic in the Project operator above the rewritten subquery.  This additional logic intercepts NULL values coming from the outer join and replaces them with the value that the subquery's expression would return on empty input.
      
      ## How was this patch tested?
      Added regression tests to cover all branches of the updated rule (see changes to `SubquerySuite.scala`).
      Ran all existing automated regression tests after merging with latest trunk.
      
      Author: frreiss <frreiss@us.ibm.com>
      
      Closes #13155 from frreiss/master.
      9770f6ee
    • Sean Owen's avatar
      [SPARK-15876][CORE] Remove support for "zk://" master URL · 0a6f0908
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Remove deprecated support for `zk://` master (`mesos://zk//` remains supported)
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #13625 from srowen/SPARK-15876.
      0a6f0908
    • Sean Owen's avatar
      [SPARK-15086][CORE][STREAMING] Deprecate old Java accumulator API · f51dfe61
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      - Deprecate old Java accumulator API; should use Scala now
      - Update Java tests and examples
      - Don't bother testing old accumulator API in Java 8 (too)
      - (fix a misspelling too)
      
      ## How was this patch tested?
      
      Jenkins tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #13606 from srowen/SPARK-15086.
      f51dfe61
    • bomeng's avatar
      [SPARK-15806][DOCUMENTATION] update doc for SPARK_MASTER_IP · 50248dcf
      bomeng authored
      ## What changes were proposed in this pull request?
      
      SPARK_MASTER_IP is a deprecated environment variable. It is replaced by SPARK_MASTER_HOST according to MasterArguments.scala.
      
      ## How was this patch tested?
      
      Manually verified.
      
      Author: bomeng <bmeng@us.ibm.com>
      
      Closes #13543 from bomeng/SPARK-15806.
      50248dcf
    • bomeng's avatar
      [SPARK-15781][DOCUMENTATION] remove deprecated environment variable doc · 3fd3ee03
      bomeng authored
      ## What changes were proposed in this pull request?
      
      Like `SPARK_JAVA_OPTS` and `SPARK_CLASSPATH`, we will remove the document for `SPARK_WORKER_INSTANCES` to discourage user not to use them. If they are actually used, SparkConf will show a warning message as before.
      
      ## How was this patch tested?
      
      Manually tested.
      
      Author: bomeng <bmeng@us.ibm.com>
      
      Closes #13533 from bomeng/SPARK-15781.
      3fd3ee03
    • Imran Rashid's avatar
      [SPARK-15878][CORE][TEST] fix cleanup in EventLoggingListenerSuite and ReplayListenerSuite · 8cc22b00
      Imran Rashid authored
      ## What changes were proposed in this pull request?
      
      These tests weren't properly using `LocalSparkContext` so weren't cleaning up correctly when tests failed.
      
      ## How was this patch tested?
      
      Jenkins.
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #13602 from squito/SPARK-15878_cleanup_replaylistener.
      8cc22b00
    • hyukjinkwon's avatar
      [SPARK-15840][SQL] Add two missing options in documentation and some option related changes · 9e204c62
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR
      
      1. Adds the documentations for some missing options, `inferSchema` and `mergeSchema` for Python and Scala.
      
      2. Fiixes `[[DataFrame]]` to ```:class:`DataFrame` ``` so that this can be shown
      
        - from
          ![2016-06-09 9 31 16](https://cloud.githubusercontent.com/assets/6477701/15929721/8b864734-2e89-11e6-83f6-207527de4ac9.png)
      
        - to (with class link)
          ![2016-06-09 9 31 00](https://cloud.githubusercontent.com/assets/6477701/15929717/8a03d728-2e89-11e6-8a3f-08294964db22.png)
      
        (Please refer [the latest documentation](https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.sql.html))
      
      3. Moves `mergeSchema` option to `ParquetOptions` with removing unused options, `metastoreSchema` and `metastoreTableName`.
      
        They are not used anymore. They were removed in https://github.com/apache/spark/commit/e720dda42e806229ccfd970055c7b8a93eb447bf and there are no use cases as below:
      
        ```bash
        grep -r -e METASTORE_SCHEMA -e \"metastoreSchema\" -e \"metastoreTableName\" -e METASTORE_TABLE_NAME .
        ```
      
        ```
        ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:  private[sql] val METASTORE_SCHEMA = "metastoreSchema"
        ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:  private[sql] val METASTORE_TABLE_NAME = "metastoreTableName"
        ./sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:        ParquetFileFormat.METASTORE_TABLE_NAME -> TableIdentifier(
      ```
      
        It only sets `metastoreTableName` in the last case but does not use the table name.
      
      4. Sets the correct default values (in the documentation) for `compression` option for ORC(`snappy`, see [OrcOptions.scala#L33-L42](https://github.com/apache/spark/blob/3ded5bc4db2badc9ff49554e73421021d854306b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcOptions.scala#L33-L42)) and Parquet(`the value specified in SQLConf`, see [ParquetOptions.scala#L38-L47](https://github.com/apache/spark/blob/3ded5bc4db2badc9ff49554e73421021d854306b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala#L38-L47)) and `columnNameOfCorruptRecord` for JSON(`the value specified in SQLConf`, see [JsonFileFormat.scala#L53-L55](https://github.com/apache/spark/blob/4538443e276597530a27c6922e48503677b13956/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala#L53-L55) and [JsonFileFormat.scala#L105-L106](https://github.com/apache/spark/blob/4538443e276597530a27c6922e48503677b13956/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala#L105-L106)).
      
      ## How was this patch tested?
      
      Existing tests should cover this.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: Hyukjin Kwon <gurwls223@gmail.com>
      
      Closes #13576 from HyukjinKwon/SPARK-15840.
      9e204c62
    • Eric Liang's avatar
      [SPARK-15860] Metrics for codegen size and perf · e1f986c7
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      Adds codahale metrics for the codegen source text size and how long it takes to compile. The size is particularly interesting, since the JVM does have hard limits on how large methods can get.
      
      To simplify, I added the metrics under a statically-initialized source that is always registered with SparkEnv.
      
      ## How was this patch tested?
      
      Unit tests
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #13586 from ericl/spark-15860.
      e1f986c7
  4. Jun 11, 2016
    • Dongjoon Hyun's avatar
      [SPARK-15807][SQL] Support varargs for dropDuplicates in Dataset/DataFrame · 3fd2ff4d
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      This PR adds `varargs`-types `dropDuplicates` functions in `Dataset/DataFrame`. Currently, `dropDuplicates` supports only `Seq` or `Array`.
      
      **Before**
      ```scala
      scala> val ds = spark.createDataFrame(Seq(("a", 1), ("b", 2), ("a", 2)))
      ds: org.apache.spark.sql.DataFrame = [_1: string, _2: int]
      
      scala> ds.dropDuplicates(Seq("_1", "_2"))
      res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: string, _2: int]
      
      scala> ds.dropDuplicates("_1", "_2")
      <console>:26: error: overloaded method value dropDuplicates with alternatives:
        (colNames: Array[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
        (colNames: Seq[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
        ()org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
       cannot be applied to (String, String)
             ds.dropDuplicates("_1", "_2")
                ^
      ```
      
      **After**
      ```scala
      scala> val ds = spark.createDataFrame(Seq(("a", 1), ("b", 2), ("a", 2)))
      ds: org.apache.spark.sql.DataFrame = [_1: string, _2: int]
      
      scala> ds.dropDuplicates("_1", "_2")
      res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: string, _2: int]
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests with new testcases.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13545 from dongjoon-hyun/SPARK-15807.
      3fd2ff4d
    • Eric Liang's avatar
      [SPARK-14851][CORE] Support radix sort with nullable longs · c06c58bb
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      This adds support for radix sort of nullable long fields. When a sort field is null and radix sort is enabled, we keep nulls in a separate region of the sort buffer so that radix sort does not need to deal with them. This also has performance benefits when sorting smaller integer types, since the current representation of nulls in two's complement (Long.MIN_VALUE) otherwise forces a full-width radix sort.
      
      This strategy for nulls does mean the sort is no longer stable. cc davies
      
      ## How was this patch tested?
      
      Existing randomized sort tests for correctness. I also tested some TPCDS queries and there does not seem to be any significant regression for non-null sorts.
      
      Some test queries (best of 5 runs each).
      Before change:
      scala> val start = System.nanoTime; spark.range(5000000).selectExpr("if(id > 5, cast(hash(id) as long), NULL) as h").coalesce(1).orderBy("h").collect(); (System.nanoTime - start) / 1e6
      start: Long = 3190437233227987
      res3: Double = 4716.471091
      
      After change:
      scala> val start = System.nanoTime; spark.range(5000000).selectExpr("if(id > 5, cast(hash(id) as long), NULL) as h").coalesce(1).orderBy("h").collect(); (System.nanoTime - start) / 1e6
      start: Long = 3190367870952791
      res4: Double = 2981.143045
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #13161 from ericl/sc-2998.
      c06c58bb
    • Wenchen Fan's avatar
      [SPARK-15856][SQL] Revert API breaking changes made in SQLContext.range · 75705e8d
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      It's easy for users to call `range(...).as[Long]` to get typed Dataset, and don't worth an API breaking change. This PR reverts it.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13605 from cloud-fan/range.
      75705e8d
    • Eric Liang's avatar
      [SPARK-15881] Update microbenchmark results for WideSchemaBenchmark · 5bb4564c
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      These were not updated after performance improvements. To make updating them easier, I also moved the results from inline comments out into a file, which is auto-generated when the benchmark is re-run.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #13607 from ericl/sc-3538.
      5bb4564c
Loading