  1. Jun 24, 2016
    • Davies Liu's avatar
      Revert "[SPARK-16186] [SQL] Support partition batch pruning with `IN`... · 20768dad
      Davies Liu authored
      Revert "[SPARK-16186] [SQL] Support partition batch pruning with `IN` predicate in InMemoryTableScanExec"
      
      This reverts commit a65bcbc2.
      20768dad
    • Dongjoon Hyun's avatar
      [SPARK-16186] [SQL] Support partition batch pruning with `IN` predicate in InMemoryTableScanExec · a65bcbc2
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      One of the most frequent usage patterns for Spark SQL is using **cached tables**. This PR improves `InMemoryTableScanExec` to handle `IN` predicates efficiently by pruning partition batches. The performance improvement of course varies across queries and datasets, but for the following simple query the duration shown in the Spark UI drops from 9 seconds to 50~90ms, roughly 100 times faster.
      
      **Before**
      ```scala
      $ bin/spark-shell --driver-memory 6G
      scala> val df = spark.range(2000000000)
      scala> df.createOrReplaceTempView("t")
      scala> spark.catalog.cacheTable("t")
      scala> sql("select id from t where id = 1").collect()    // About 2 mins
      scala> sql("select id from t where id = 1").collect()    // less than 90ms
      scala> sql("select id from t where id in (1,2,3)").collect()  // 9 seconds
      ```
      
      **After**
      ```scala
      scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms
      ```
      
      This PR has an impact on over 35 TPC-DS queries if the tables are cached.
      Note that this optimization applies only to `IN` predicates. To benefit from it with an `IN` list of more than 10 items, the `spark.sql.optimizer.inSetConversionThreshold` option should be increased, as sketched below.
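
      For example, a minimal sketch of raising that threshold before running a larger `IN` query against the cached table above (the value 20 is an arbitrary illustration, not part of this PR):

      ```scala
      // Keep an 11-item IN list as an `In` predicate (instead of converting it to
      // InSet at the 10-item threshold mentioned above) so batch pruning can still apply.
      spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "20")
      sql("select id from t where id in (1,2,3,4,5,6,7,8,9,10,11)").collect()
      ```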
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (including new testcases).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13887 from dongjoon-hyun/SPARK-16186.
      a65bcbc2
    • Davies Liu's avatar
      [SPARK-16179][PYSPARK] fix bugs for Python udf in generate · 4435de1b
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR fixes a bug when a Python UDF is used in explode (a generator). `GenerateExec` requires that all the attributes in its expressions be resolvable from its children at creation time, so we should replace the children first and then replace its expressions.
      
      ```
      >>> df.select(explode(f(*df))).show()
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/home/vlad/dev/spark/python/pyspark/sql/dataframe.py", line 286, in show
          print(self._jdf.showString(n, truncate))
        File "/home/vlad/dev/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
        File "/home/vlad/dev/spark/python/pyspark/sql/utils.py", line 63, in deco
          return f(*a, **kw)
        File "/home/vlad/dev/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
      py4j.protocol.Py4JJavaError: An error occurred while calling o52.showString.
      : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree:
      Generate explode(<lambda>(_1#0L)), false, false, [col#15L]
      +- Scan ExistingRDD[_1#0L]
      
      	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387)
      	at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:69)
      	at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:45)
      	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:177)
      	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144)
      	at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:153)
      	at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
      	at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
      	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
      	at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113)
      	at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93)
      	at org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:95)
      	at org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:95)
      	at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
      	at scala.collection.immutable.List.foldLeft(List.scala:84)
      	at org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:95)
      	at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:85)
      	at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:85)
      	at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2557)
      	at org.apache.spark.sql.Dataset.head(Dataset.scala:1923)
      	at org.apache.spark.sql.Dataset.take(Dataset.scala:2138)
      	at org.apache.spark.sql.Dataset.showString(Dataset.scala:239)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
      	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
      	at py4j.Gateway.invoke(Gateway.java:280)
      	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
      	at py4j.commands.CallCommand.execute(CallCommand.java:79)
      	at py4j.GatewayConnection.run(GatewayConnection.java:211)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: java.lang.reflect.InvocationTargetException
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
      	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$13.apply(TreeNode.scala:413)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$13.apply(TreeNode.scala:413)
      	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:412)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:387)
      	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
      	... 42 more
      Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF0#20
      	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
      	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
      	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
      	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268)
      	at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
      	at org.apache.spark.sql.execution.GenerateExec.<init>(GenerateExec.scala:63)
      	... 52 more
      Caused by: java.lang.RuntimeException: Couldn't find pythonUDF0#20 in [_1#0L]
      	at scala.sys.package$.error(package.scala:27)
      	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:94)
      	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:88)
      	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
      	... 67 more
      ```
      
      ## How was this patch tested?
      
      Added regression tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13883 from davies/udf_in_generate.
      4435de1b
    • Reynold Xin's avatar
      [SQL][MINOR] Simplify data source predicate filter translation. · 5f8de216
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This is a small patch to rewrite the predicate filter translation in DataSourceStrategy. The original code used excessive functional constructs (e.g. unzip) and was very difficult to understand.
      
      ## How was this patch tested?
      Should be covered by existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13889 from rxin/simplify-predicate-filter.
      5f8de216
    • Davies Liu's avatar
      [SPARK-16077] [PYSPARK] catch the exception from pickle.whichmodule() · d4893540
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      When we don't know which module an object came from, we call pickle.whichmodule() to go through all the loaded modules to find the object. This can fail for some modules, for example six; see https://bitbucket.org/gutworth/six/issues/63/importing-six-breaks-pickling
      
      We should ignore the exception here and use `__main__` as the module name (meaning we could not find the module).
      
      ## How was this patch tested?
      
      Manually tested; a unit test is not feasible for this.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13788 from davies/whichmodule.
      d4893540
    • Liwei Lin's avatar
      [SPARK-15963][CORE] Catch `TaskKilledException` correctly in Executor.TaskRunner · a4851ed0
      Liwei Lin authored
      ## The problem
      
      Before this change, if either of the following happened to a task, the task would be marked as `FAILED` instead of `KILLED`:
      - the task was killed before it was deserialized
      - `executor.kill()` marked `taskRunner.killed`, but before calling `task.killed()` the worker thread threw the `TaskKilledException`
      
      The reason is that, in the `catch` block of the current [Executor.TaskRunner](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L362) implementation, we are mistakenly catching:
      ```scala
      case _: TaskKilledException | _: InterruptedException if task.killed => ...
      ```
      the semantics of which is:
      - **(**`TaskKilledException` **OR** `InterruptedException`**)** **AND** `task.killed`
      
      So when `TaskKilledException` is thrown but `task.killed` is not yet set, we mark the task as `FAILED` when it should really be `KILLED`.
      
      ## What changes were proposed in this pull request?
      
      This patch alters the catch condition's semantics from:
      - **(**`TaskKilledException` **OR** `InterruptedException`**)** **AND** `task.killed`
      
      to
      
      - `TaskKilledException` **OR** **(**`InterruptedException` **AND** `task.killed`**)**
      
      so that `TaskKilledException` is caught correctly and the task is marked as `KILLED`, as sketched below.
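
      A self-contained toy sketch of the new semantics (stand-in types and names, not the actual Executor code):

      ```scala
      // TaskKilledException alone now yields KILLED; InterruptedException still
      // needs the task's killed flag to be set, otherwise the failure path is taken.
      class TaskKilledException extends RuntimeException

      def taskEndState(error: Throwable, taskKilled: Boolean): String = error match {
        case _: TaskKilledException                => "KILLED"
        case _: InterruptedException if taskKilled => "KILLED"
        case _                                     => "FAILED"
      }

      // taskEndState(new TaskKilledException, taskKilled = false) == "KILLED"
      // taskEndState(new InterruptedException, taskKilled = false) == "FAILED"
      ```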
      
      ## How was this patch tested?
      
      Added a unit test that failed before the change; ran the new test 1000 times manually.
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #13685 from lw-lin/fix-task-killed.
      a4851ed0
    • GayathriMurali's avatar
      [SPARK-15997][DOC][ML] Update user guide for HashingTF, QuantileVectorizer and CountVectorizer · be88383e
      GayathriMurali authored
      ## What changes were proposed in this pull request?
      
      Made user guide changes to HashingTF, QuantileVectorizer, and CountVectorizer.
      
      Author: GayathriMurali <gayathri.m@intel.com>
      
      Closes #13745 from GayathriMurali/SPARK-15997.
      be88383e
    • Sean Owen's avatar
      [SPARK-16129][CORE][SQL] Eliminate direct use of commons-lang classes in favor of commons-lang3 · 158af162
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Replace uses of `commons-lang` with `commons-lang3` and forbid the former via scalastyle; replace `commons-lang`'s `NotImplementedException` with the JDK's `UnsupportedOperationException`.
      
      ## How was this patch tested?
      
      Jenkins tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #13843 from srowen/SPARK-16129.
      158af162
    • peng.zhang's avatar
      [SPARK-16125][YARN] Fix not test yarn cluster mode correctly in YarnClusterSuite · f4fd7432
      peng.zhang authored
      ## What changes were proposed in this pull request?
      
      Since SPARK-13220 (Deprecate "yarn-client" and "yarn-cluster"), YarnClusterSuite has not been testing yarn-cluster mode correctly.
      This pull request fixes that.
      
      ## How was this patch tested?
      Unit test
      
      
      Author: peng.zhang <peng.zhang@xiaomi.com>
      
      Closes #13836 from renozhang/SPARK-16125-test-yarn-cluster-mode.
      f4fd7432
    • Cheng Lian's avatar
      [SPARK-13709][SQL] Initialize deserializer with both table and partition... · 2d2f607b
      Cheng Lian authored
      [SPARK-13709][SQL] Initialize deserializer with both table and partition properties when reading partitioned tables
      
      ## What changes were proposed in this pull request?
      
      When reading partitions of a partitioned Hive SerDe table, we initialize the deserializer using only the partition properties. However, for SerDes like `AvroSerDe`, essential properties (e.g. Avro schema information) may be defined in the table properties. We should merge both table properties and partition properties before initializing the deserializer.
      
      Note that an individual partition may have different properties than those defined in the table properties (e.g. partitions within a table can use different SerDes). Thus, for any property key defined in both partition and table properties, the value set in the partition properties wins.
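
      A hedged sketch of the merge order using plain Scala maps (the real code deals with Hive `Properties` objects):

      ```scala
      // Start from the table properties and overlay the partition properties, so a
      // key defined in both takes its value from the partition.
      def mergeProperties(tableProps: Map[String, String],
                          partitionProps: Map[String, String]): Map[String, String] =
        tableProps ++ partitionProps

      // mergeProperties(Map("avro.schema.url" -> "/tmp/avro/test.avsc", "k" -> "table"),
      //                 Map("k" -> "partition"))("k")  // returns "partition"
      ```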
      
      ## How was this patch tested?
      
      New test case added in `QueryPartitionSuite`.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #13865 from liancheng/spark-13709-partitioned-avro-table.
      2d2f607b
  2. Jun 23, 2016
    • Yuhao Yang's avatar
      [SPARK-16133][ML] model loading backward compatibility for ml.feature · cc6778ee
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      
      Model loading backward compatibility for ml.feature.
      
      ## How was this patch tested?
      
      Existing unit tests and a manual test for loading 1.6 models.
      
      Author: Yuhao Yang <yuhao.yang@intel.com>
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #13844 from hhbyyh/featureComp.
      cc6778ee
    • Xiangrui Meng's avatar
      [SPARK-16142][R] group naiveBayes method docs in a single Rd · 4a40d43b
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      This PR groups `spark.naiveBayes`, `summary(NB)`, `predict(NB)`, and `write.ml(NB)` into a single Rd.
      
      ## How was this patch tested?
      
      Manually checked generated HTML doc. See attached screenshots.
      
      ![screen shot 2016-06-23 at 2 11 00 pm](https://cloud.githubusercontent.com/assets/829644/16320452/a5885e92-394c-11e6-994f-2ab5cddad86f.png)
      
      ![screen shot 2016-06-23 at 2 11 15 pm](https://cloud.githubusercontent.com/assets/829644/16320455/aad1f6d8-394c-11e6-8ef4-13bee989f52f.png)
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #13877 from mengxr/SPARK-16142.
      4a40d43b
    • Yuhao Yang's avatar
      [SPARK-16177][ML] model loading backward compatibility for ml.regression · 14bc5a7f
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      jira: https://issues.apache.org/jira/browse/SPARK-16177
      Model loading backward compatibility for ml.regression.
      
      ## How was this patch tested?
      
      Existing unit tests and a manual test for loading 1.6 models.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #13879 from hhbyyh/regreComp.
      14bc5a7f
    • Wenchen Fan's avatar
      [SQL][MINOR] ParserUtils.operationNotAllowed should throw exception directly · 6a3c6276
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      It is odd that `ParserUtils.operationNotAllowed` returns an exception and leaves it to the caller to throw it.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13874 from cloud-fan/style.
      6a3c6276
    • Sameer Agarwal's avatar
      [SPARK-16123] Avoid NegativeArraySizeException while reserving additional... · cc71d4fa
      Sameer Agarwal authored
      [SPARK-16123] Avoid NegativeArraySizeException while reserving additional capacity in VectorizedColumnReader
      
      ## What changes were proposed in this pull request?
      
      This patch fixes an overflow bug in the vectorized Parquet reader where both the off-heap and on-heap variants of `ColumnVector.reserve()` can overflow while reserving additional capacity during reads.
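
      A simplified sketch of the kind of guard involved (not the actual `ColumnVector` code): the requested capacity is computed in `Long` and clamped, because doubling a large `Int` wraps negative.

      ```scala
      // Naively, 1500000000 * 2 == -1294967296 as an Int; compute in Long and clamp.
      def newCapacity(currentCapacity: Int, requiredCapacity: Int): Int = {
        val doubled = currentCapacity.toLong * 2L
        math.max(requiredCapacity.toLong, math.min(doubled, Int.MaxValue.toLong)).toInt
      }
      ```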
      
      ## How was this patch tested?
      
      Manual Tests
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #13832 from sameeragarwal/negative-array.
      cc71d4fa
    • Dongjoon Hyun's avatar
      [SPARK-16165][SQL] Fix the update logic for InMemoryTableScanExec.readBatches · 264bc636
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, the `readBatches` accumulator of `InMemoryTableScanExec` is updated only when `spark.sql.inMemoryColumnarStorage.partitionPruning` is true. Although this metric is used only for testing purposes, we had better have a correct metric regardless of the SQL options.
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (including a new testcase).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13870 from dongjoon-hyun/SPARK-16165.
      264bc636
    • Shixiong Zhu's avatar
      [SPARK-15443][SQL] Fix 'explain' for streaming Dataset · 0e4bdebe
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      - Fix the `explain` command for streaming Dataset/DataFrame. E.g.,
      ```
      == Parsed Logical Plan ==
      'SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7]
      +- 'MapElements <function1>, obj#6: java.lang.String
         +- 'DeserializeToObject unresolveddeserializer(createexternalrow(getcolumnbyordinal(0, StringType).toString, StructField(value,StringType,true))), obj#5: org.apache.spark.sql.Row
            +- Filter <function1>.apply
               +- StreamingRelation FileSource[/Users/zsx/stream], [value#0]
      
      == Analyzed Logical Plan ==
      value: string
      SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7]
      +- MapElements <function1>, obj#6: java.lang.String
         +- DeserializeToObject createexternalrow(value#0.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row
            +- Filter <function1>.apply
               +- StreamingRelation FileSource[/Users/zsx/stream], [value#0]
      
      == Optimized Logical Plan ==
      SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7]
      +- MapElements <function1>, obj#6: java.lang.String
         +- DeserializeToObject createexternalrow(value#0.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row
            +- Filter <function1>.apply
               +- StreamingRelation FileSource[/Users/zsx/stream], [value#0]
      
      == Physical Plan ==
      *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7]
      +- *MapElements <function1>, obj#6: java.lang.String
         +- *DeserializeToObject createexternalrow(value#0.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row
            +- *Filter <function1>.apply
               +- StreamingRelation FileSource[/Users/zsx/stream], [value#0]
      ```
      
      - Add `StreamingQuery.explain` to display the last execution plan. E.g.,
      ```
      == Parsed Logical Plan ==
      SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7]
      +- MapElements <function1>, obj#6: java.lang.String
         +- DeserializeToObject createexternalrow(value#12.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row
            +- Filter <function1>.apply
               +- Relation[value#12] text
      
      == Analyzed Logical Plan ==
      value: string
      SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7]
      +- MapElements <function1>, obj#6: java.lang.String
         +- DeserializeToObject createexternalrow(value#12.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row
            +- Filter <function1>.apply
               +- Relation[value#12] text
      
      == Optimized Logical Plan ==
      SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7]
      +- MapElements <function1>, obj#6: java.lang.String
         +- DeserializeToObject createexternalrow(value#12.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row
            +- Filter <function1>.apply
               +- Relation[value#12] text
      
      == Physical Plan ==
      *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7]
      +- *MapElements <function1>, obj#6: java.lang.String
         +- *DeserializeToObject createexternalrow(value#12.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row
            +- *Filter <function1>.apply
               +- *Scan text [value#12] Format: org.apache.spark.sql.execution.datasources.text.TextFileFormat1836ab91, InputPaths: file:/Users/zsx/stream/a.txt, file:/Users/zsx/stream/b.txt, file:/Users/zsx/stream/c.txt, PushedFilters: [], ReadSchema: struct<value:string>
      ```
      
      ## How was this patch tested?
      
      The added unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13815 from zsxwing/sdf-explain.
      0e4bdebe
    • Dongjoon Hyun's avatar
      [SPARK-16164][SQL] Update `CombineFilters` to try to construct predicates with... · 91b1ef28
      Dongjoon Hyun authored
      [SPARK-16164][SQL] Update `CombineFilters` to try to construct predicates with child predicate first
      
      ## What changes were proposed in this pull request?
      
      This PR changes `CombineFilters` to compose the final predicate condition as (`child predicate` AND `parent predicate`) instead of (`parent predicate` AND `child predicate`). This is a best-effort approach; some other optimization rules may later destroy this order by reorganizing conjunctive predicates.
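
      A toy sketch of the intended rewrite (stand-in plan nodes, not Catalyst's classes):

      ```scala
      sealed trait Plan
      case class Filter(condition: String, child: Plan) extends Plan
      case class Relation(name: String) extends Plan

      // Combine adjacent filters with the child's condition first in the conjunction.
      def combineFilters(plan: Plan): Plan = plan match {
        case Filter(parentCond, Filter(childCond, grandChild)) =>
          Filter(s"($childCond) AND ($parentCond)", grandChild)
        case other => other
      }

      // combineFilters(Filter("idx > 2.0", Filter("isnotnull(value)", Relation("t"))))
      //   == Filter("(isnotnull(value)) AND (idx > 2.0)", Relation("t"))
      ```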
      
      **Reported Error Scenario**
      Chris McCubbin reported a bug when he used StringIndexer in an ML pipeline with additional filters. It seems that during filter pushdown, we changed the ordering in the logical plan.
      ```scala
      import org.apache.spark.ml.feature._
      val df1 = (0 until 3).map(_.toString).toDF
      val indexer = new StringIndexer()
        .setInputCol("value")
        .setOutputCol("idx")
        .setHandleInvalid("skip")
        .fit(df1)
      val df2 = (0 until 5).map(_.toString).toDF
      val predictions = indexer.transform(df2)
      predictions.show() // this is okay
      predictions.where('idx > 2).show() // this will throw an exception
      ```
      
      Please see the notebook at https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/2159162931615821/588180/latest.html for error messages.
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (including a new testcase).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13872 from dongjoon-hyun/SPARK-16164.
      91b1ef28
    • Ryan Blue's avatar
      [SPARK-13723][YARN] Change behavior of --num-executors with dynamic allocation. · 738f134b
      Ryan Blue authored
      ## What changes were proposed in this pull request?
      
      This changes the behavior of --num-executors and spark.executor.instances when using dynamic allocation. Instead of turning dynamic allocation off, the value is used as the initial number of executors.
      
      This change was discussed on [SPARK-13723](https://issues.apache.org/jira/browse/SPARK-13723). I highly recommend making it while we can still change the behavior for 2.0.0. In practice, the 1.x behavior surprises users (it is not clear that the setting disables dynamic allocation) and wastes cluster resources because users rarely notice the log message.
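
      A hedged sketch of the new interpretation (the config keys are the standard ones; the helper shape is assumed rather than copied from `Utils`):

      ```scala
      import org.apache.spark.SparkConf

      // With dynamic allocation on, --num-executors / spark.executor.instances only
      // seeds the initial executor count: take the max of the three settings.
      def dynamicAllocationInitialExecutors(conf: SparkConf): Int =
        Seq(
          conf.getInt("spark.dynamicAllocation.minExecutors", 0),
          conf.getInt("spark.dynamicAllocation.initialExecutors", 0),
          conf.getInt("spark.executor.instances", 0)
        ).max
      ```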
      
      ## How was this patch tested?
      
      This patch updates tests and adds a test for Utils.getDynamicAllocationInitialExecutors.
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #13338 from rdblue/SPARK-13723-num-executors-with-dynamic-allocation.
      738f134b
    • Ryan Blue's avatar
      [SPARK-15725][YARN] Ensure ApplicationMaster sleeps for the min interval. · a410814c
      Ryan Blue authored
      ## What changes were proposed in this pull request?
      
      Update `ApplicationMaster` to sleep for at least the minimum allocation interval before calling `allocateResources`. This prevents the overloading of the `YarnAllocator` that happens because the allocation thread is triggered whenever an executor is killed and its connections die. In YARN, this keeps the app from overwhelming the allocator and becoming unstable.
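
      A minimal sketch of the sleep bound (names assumed, not the actual ApplicationMaster code):

      ```scala
      // Even if an executor loss wakes the allocation thread early, wait out the
      // remainder of the minimum allocation interval before allocating again.
      def sleepBeforeAllocate(lastAllocateMillis: Long, minIntervalMillis: Long): Unit = {
        val elapsed = System.currentTimeMillis() - lastAllocateMillis
        val remaining = minIntervalMillis - elapsed
        if (remaining > 0) Thread.sleep(remaining)
      }
      ```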
      
      ## How was this patch tested?
      
      Tested that this allows an app to recover instead of hanging. It is still possible for the YarnAllocator to be overwhelmed by requests, but this prevents the issue for the most common cause.
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #13482 from rdblue/SPARK-15725-am-sleep-work-around.
      a410814c
    • Davies Liu's avatar
      [SPARK-16163] [SQL] Cache the statistics for logical plans · 10396d95
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      The calculation of statistics is not trivial anymore; it can be very slow for a large query (for example, TPC-DS Q64 took several minutes to plan).
      
      During the planning of a query, the statistics of any logical plan should not change (even InMemoryRelation), so we should use `lazy val` to cache the statistics.
      
      For InMemoryRelation, the statistics can still be updated after materialization, which is only useful when the relation is reused in another query (before that query is planned), because once planning is finished the statistics are not used anymore.
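
      A toy sketch of the caching pattern (stand-in types, not Catalyst's `Statistics`):

      ```scala
      // The expensive statistics computation runs at most once per plan instance.
      abstract class ToyLogicalPlan {
        protected def computeStatistics(): BigInt     // e.g. walks the whole plan tree
        lazy val statistics: BigInt = computeStatistics()
      }
      ```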
      
      ## How was this patch tested?
      
      Tested with TPC-DS Q64; it can be planned in a second after the patch.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13871 from davies/fix_statistics.
      10396d95
    • Yuhao Yang's avatar
      [SPARK-16130][ML] model loading backward compatibility for ml.classfication.LogisticRegression · 60398dab
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      jira: https://issues.apache.org/jira/browse/SPARK-16130
      Model loading backward compatibility for ml.classification.LogisticRegression.
      
      ## How was this patch tested?
      Existing unit tests and a manual test for loading old models.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #13841 from hhbyyh/lrcomp.
      60398dab
    • Shixiong Zhu's avatar
      [SPARK-16116][SQL] ConsoleSink should not require checkpointLocation · d85bb10c
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      When the user uses `ConsoleSink`, we should use a temp location if `checkpointLocation` is not specified.
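
      A hedged sketch of the fallback (ordinary Java NIO temp directory; not the exact sink code):

      ```scala
      import java.nio.file.Files

      // Use a throwaway checkpoint directory when the user did not specify one.
      def resolveCheckpointLocation(userSpecified: Option[String]): String =
        userSpecified.getOrElse(
          Files.createTempDirectory("console-sink-checkpoint").toString)
      ```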
      
      ## How was this patch tested?
      
      The added unit test.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13817 from zsxwing/console-checkpoint.
      d85bb10c
    • Felix Cheung's avatar
      [SPARK-16088][SPARKR] update setJobGroup, cancelJobGroup, clearJobGroup · b5a99766
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Updated setJobGroup, cancelJobGroup, and clearJobGroup to no longer require sc/SparkContext as a parameter.
      Also updated roxygen2 doc and R programming guide on deprecations.
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #13838 from felixcheung/rjobgroup.
      b5a99766
    • Xiangrui Meng's avatar
      [SPARK-16154][MLLIB] Update spark.ml and spark.mllib package docs · 65d1f0f7
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      Since we decided to switch the spark.mllib package into maintenance mode in 2.0, it would be nice to update the package docs to reflect this change.
      
      ## How was this patch tested?
      
      Manually checked generated APIs.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #13859 from mengxr/SPARK-16154.
      65d1f0f7
    • Peter Ableda's avatar
      [SPARK-16138] Try to cancel executor requests only if we have at least 1 · 5bf2889b
      Peter Ableda authored
      ## What changes were proposed in this pull request?
      Adds an additional check to the if statement, as sketched below.
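
      A hedged sketch of the guard (variable names assumed):

      ```scala
      // Only cancel container requests when there are actually pending ones,
      // which also avoids the misleading "Canceling requests for ..." log line.
      def containersToCancel(missing: Int, numPendingAllocate: Int): Int =
        if (missing < 0 && numPendingAllocate > 0) math.min(-missing, numPendingAllocate)
        else 0
      ```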
      
      ## How was this patch tested?
      I built and deployed to an internal cluster to observe the behaviour. After the change, the invalid logging is gone:
      
      ```
      16/06/22 08:46:36 INFO yarn.YarnAllocator: Driver requested a total number of 1 executor(s).
      16/06/22 08:46:36 INFO yarn.YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 1 executors.
      16/06/22 08:46:36 INFO yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
      16/06/22 08:47:36 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 1.
      ```
      
      Author: Peter Ableda <abledapeter@gmail.com>
      
      Closes #13850 from peterableda/patch-2.
      5bf2889b
    • Dongjoon Hyun's avatar
      [SPARK-15660][CORE] Update RDD `variance/stdev` description and add popVariance/popStdev · 5eef1e6c
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      In SPARK-11490, `variance`/`stdev` were redefined as the **sample** `variance`/`stdev` instead of the population ones. This PR updates the remaining old documentation to prevent users from misunderstanding. It will update the following Scala/Java API docs.
      
      - http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.api.java.JavaDoubleRDD
      - http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.rdd.DoubleRDDFunctions
      - http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.util.StatCounter
      - http://spark.apache.org/docs/2.0.0-preview/api/java/org/apache/spark/api/java/JavaDoubleRDD.html
      - http://spark.apache.org/docs/2.0.0-preview/api/java/org/apache/spark/rdd/DoubleRDDFunctions.html
      - http://spark.apache.org/docs/2.0.0-preview/api/java/org/apache/spark/util/StatCounter.html
      
      Also, this PR explicitly adds the `popVariance` and `popStdev` functions; see the sketch below.
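
      The distinction in a nutshell, as a standalone hedged sketch (not `StatCounter` itself): the sample statistics divide by n − 1, the population statistics by n.

      ```scala
      // Sample variance uses (n - 1) in the denominator; population variance uses n.
      def sampleVariance(xs: Seq[Double]): Double = {
        val n = xs.size; val mean = xs.sum / n
        xs.map(x => (x - mean) * (x - mean)).sum / (n - 1)   // assumes n >= 2
      }

      def popVariance(xs: Seq[Double]): Double = {
        val n = xs.size; val mean = xs.sum / n
        xs.map(x => (x - mean) * (x - mean)).sum / n          // assumes n >= 1
      }

      def sampleStdev(xs: Seq[Double]): Double = math.sqrt(sampleVariance(xs))
      def popStdev(xs: Seq[Double]): Double    = math.sqrt(popVariance(xs))
      ```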
      
      ## How was this patch tested?
      
      Pass the updated Jenkins tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13403 from dongjoon-hyun/SPARK-15660.
      5eef1e6c
    • Brian Cho's avatar
      [SPARK-16162] Remove dead code OrcTableScan. · 4374a46b
      Brian Cho authored
      ## What changes were proposed in this pull request?
      
      SPARK-14535 removed all calls to class OrcTableScan. This removes the dead code.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: Brian Cho <bcho@fb.com>
      
      Closes #13869 from dafrista/clean-up-orctablescan.
      4374a46b
    • Cheng Lian's avatar
      [SQL][MINOR] Fix minor formatting issues in SHOW CREATE TABLE output · f34b5c62
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR fixes two minor formatting issues appearing in `SHOW CREATE TABLE` output.
      
      Before:
      
      ```
      CREATE EXTERNAL TABLE ...
      ...
      WITH SERDEPROPERTIES ('serialization.format' = '1'
      )
      ...
      TBLPROPERTIES ('avro.schema.url' = '/tmp/avro/test.avsc',
        'transient_lastDdlTime' = '1466638180')
      ```
      
      After:
      
      ```
      CREATE EXTERNAL TABLE ...
      ...
      WITH SERDEPROPERTIES (
        'serialization.format' = '1'
      )
      ...
      TBLPROPERTIES (
        'avro.schema.url' = '/tmp/avro/test.avsc',
        'transient_lastDdlTime' = '1466638180'
      )
      ```
      
      ## How was this patch tested?
      
      Manually tested.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #13864 from liancheng/show-create-table-format-fix.
      f34b5c62
  3. Jun 22, 2016
    • bomeng's avatar
      [SPARK-15230][SQL] distinct() does not handle column name with dot properly · 925884a6
      bomeng authored
      ## What changes were proposed in this pull request?
      
      When a table is created with a column name containing a dot, distinct() fails to run. For example,
      ```scala
      val rowRDD = sparkContext.parallelize(Seq(Row(1), Row(1), Row(2)))
      val schema = StructType(Array(StructField("column.with.dot", IntegerType, nullable = false)))
      val df = spark.createDataFrame(rowRDD, schema)
      ```
      running the following will have no problem:
      ```scala
      df.select(new Column("`column.with.dot`"))
      ```
      but running the query with an additional distinct() will cause an exception:
      ```scala
      df.select(new Column("`column.with.dot`")).distinct()
      ```
      
      The issue is that distinct() tries to resolve the column name, but the column name in the schema is not wrapped in backticks. So the solution is to add backticks before passing the column name to resolve().
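
      A minimal sketch of the quoting step (the helper name is illustrative):

      ```scala
      // Wrap the name in backticks so resolve() treats the dots as part of a single
      // column name rather than as a nested-field path.
      def quoteIfNeeded(columnName: String): String =
        if (columnName.contains(".")) s"`$columnName`" else columnName

      // quoteIfNeeded("column.with.dot") == "`column.with.dot`"
      ```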
      
      ## How was this patch tested?
      
      Added a new test case.
      
      Author: bomeng <bmeng@us.ibm.com>
      
      Closes #13140 from bomeng/SPARK-15230.
      925884a6
    • Reynold Xin's avatar
      [SPARK-16159][SQL] Move RDD creation logic from FileSourceStrategy.apply · 37f3be5d
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      We embed the partitioning logic in FileSourceStrategy.apply, making the function very long. This is a small refactoring to move it into its own functions. Eventually we should be able to move the partitioning functions into a physical operator, rather than doing this work during physical planning.
      
      ## How was this patch tested?
      This is a simple code move.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13862 from rxin/SPARK-16159.
      37f3be5d
    • gatorsmile's avatar
      [SPARK-16024][SQL][TEST] Verify Column Comment for Data Source Tables · 9f990fa3
      gatorsmile authored
      #### What changes were proposed in this pull request?
      This PR is to improve test coverage. It verifies whether the `Comment` of a `Column` is handled appropriately.
      
      The test cases verify the related parts in the parser, in both the SQL and DataFrameWriter interfaces, and against both the Hive metastore catalog and the in-memory catalog.
      
      #### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13764 from gatorsmile/dataSourceComment.
      9f990fa3
    • Brian Cho's avatar
      [SPARK-15956][SQL] When unwrapping ORC avoid pattern matching at runtime · 4f869f88
      Brian Cho authored
      ## What changes were proposed in this pull request?
      
      Extend the approach of returning unwrapper functions from primitive types to all types.
      
      This PR is based on https://github.com/apache/spark/pull/13676. It only fixes a bug with scala-2.10 compilation. All credit should go to dafrista.
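
      A simplified, hedged illustration of the idea (toy type tags rather than Hive object inspectors): resolve the type once and return a closure, so the per-value hot path does no pattern matching.

      ```scala
      // Match on the type a single time, up front, and return a specialized unwrapper.
      def unwrapperFor(typeName: String): Any => Any = typeName match {
        case "string" => (v: Any) => v.toString
        case "int"    => (v: Any) => v.asInstanceOf[Number].intValue()
        case "double" => (v: Any) => v.asInstanceOf[Number].doubleValue()
        case _        => (v: Any) => v
      }

      // val unwrapInt = unwrapperFor("int")   // chosen once
      // values.map(unwrapInt)                 // applied per value, no further matching
      ```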
      
      ## How was this patch tested?
      
      The patch should pass all unit tests. With this change, reading ORC files with non-primitive types reduced the read time by ~15%.
      
      Author: Brian Cho <bcho@fb.com>
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #13854 from hvanhovell/SPARK-15956-scala210.
      4f869f88
    • Prajwal Tuladhar's avatar
      [SPARK-16131] initialize internal logger lazily in Scala preferred way · 044971ec
      Prajwal Tuladhar authored
      ## What changes were proposed in this pull request?
      
      Initialize the logger instance lazily, in the Scala-preferred way.
      
      ## How was this patch tested?
      
      By running `./build/mvn clean test` locally
      
      Author: Prajwal Tuladhar <praj@infynyxx.com>
      
      Closes #13842 from infynyxx/spark_internal_logger.
      044971ec
    • Xiangrui Meng's avatar
      [SPARK-16155][DOC] remove package grouping in Java docs · 857ecff1
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      In 1.4 and earlier releases, we have package grouping in the generated Java API docs. See http://spark.apache.org/docs/1.4.0/api/java/index.html. However, this disappeared in 1.5.0: http://spark.apache.org/docs/1.5.0/api/java/index.html.
      
      Rather than fixing it, I'd suggest removing the grouping, because it might take some time to fix and updating the grouping in `SparkBuild.scala` is a manual process. I didn't find anyone complaining on Google about the missing groups since 1.5.0.
      
      Manually checked the generated Java API docs and confirmed that they are the same as in master.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #13856 from mengxr/SPARK-16155.
      857ecff1
    • Xiangrui Meng's avatar
      [SPARK-16153][MLLIB] switch to multi-line doc to avoid a genjavadoc bug · 00cc5cca
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      We recently deprecated setLabelCol in ChiSqSelectorModel (#13823):
      
      ~~~scala
        /** group setParam */
        Since("1.6.0")
        deprecated("labelCol is not used by ChiSqSelectorModel.", "2.0.0")
        def setLabelCol(value: String): this.type = set(labelCol, value)
      ~~~
      
      This unfortunately hit a genjavadoc bug and broke doc generation. This is the generated Java code:
      
      ~~~java
        /** group setParam */
        public  org.apache.spark.ml.feature.ChiSqSelectorModel setOutputCol (java.lang.String value)  { throw new RuntimeException(); }
         *
         * deprecated labelCol is not used by ChiSqSelectorModel. Since 2.0.0.
        */
        public  org.apache.spark.ml.feature.ChiSqSelectorModel setLabelCol (java.lang.String value)  { throw new RuntimeException(); }
      ~~~
      
      Switching to a multi-line doc comment is a workaround.
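
      A toy, self-contained illustration of the workaround (a stand-in member, not the real `ChiSqSelectorModel`): the only point is that the Scaladoc comment now spans multiple lines.

      ```scala
      object MultilineDocWorkaround {
        /**
         * group setParam
         */
        @deprecated("labelCol is not used by ChiSqSelectorModel.", "2.0.0")
        def setLabelCol(value: String): String = value   // stand-in body
      }
      ```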
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #13855 from mengxr/SPARK-16153.
      00cc5cca
    • Davies Liu's avatar
      [SPARK-16078][SQL] from_utc_timestamp/to_utc_timestamp should not depends on local timezone · 20d411bc
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Currently, we use the local timezone to parse or format a timestamp (TimestampType), and then use a Long as the microseconds since the epoch in UTC.
      
      from_utc_timestamp() and to_utc_timestamp() did not take the local timezone into account, so they could return different results under different local timezones.
      
      This PR does the conversion based on human time (in the local timezone), so it should return the same result in whatever timezone the JVM runs in. But because the mapping from absolute timestamps to human time is not exactly one-to-one, it will still return wrong results in some timezones (also around the beginning or end of DST).
      
      This PR is a best-effort fix. In the long term, we should make TimestampType timezone-aware to fix this completely.
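
      A hedged, simplified sketch of the conversion idea using `java.util.TimeZone` (not Spark's `DateTimeUtils`, and deliberately ignoring the DST edge cases mentioned above):

      ```scala
      import java.util.TimeZone

      // Shift the microseconds-since-epoch value by the target zone's offset, i.e.
      // convert based on "human time" rather than the JVM's local timezone.
      def fromUtcMicros(micros: Long, zoneId: String): Long = {
        val offsetMillis = TimeZone.getTimeZone(zoneId).getOffset(micros / 1000L)
        micros + offsetMillis * 1000L
      }

      def toUtcMicros(micros: Long, zoneId: String): Long = {
        val offsetMillis = TimeZone.getTimeZone(zoneId).getOffset(micros / 1000L)
        micros - offsetMillis * 1000L
      }
      ```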
      
      ## How was this patch tested?
      
      Tested these functions in all timezones.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13784 from davies/convert_tz.
      20d411bc
    • Kai Jiang's avatar
      [SPARK-15672][R][DOC] R programming guide update · 43b04b7e
      Kai Jiang authored
      ## What changes were proposed in this pull request?
      Guide for
      - UDFs with dapply, dapplyCollect
      - spark.lapply for running parallel R functions
      
      ## How was this patch tested?
      build locally
      <img width="654" alt="screen shot 2016-06-14 at 03 12 56" src="https://cloud.githubusercontent.com/assets/3419881/16039344/12a3b6a0-31de-11e6-8d77-fe23308075c0.png">
      
      Author: Kai Jiang <jiangkai@gmail.com>
      
      Closes #13660 from vectorijk/spark-15672-R-guide-update.
      43b04b7e
    • Eric Liang's avatar
      [SPARK-16003] SerializationDebugger runs into infinite loop · 6f915c9e
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      This fixes SerializationDebugger to not recurse forever when `writeReplace` returns an object of the same class, which is the case for at least the `SQLMetrics` class.
      
      See also the OpenJDK unit tests on the behavior of recursive `writeReplace()`:
      https://github.com/openjdk-mirror/jdk7u-jdk/blob/f4d80957e89a19a29bb9f9807d2a28351ed7f7df/test/java/io/Serializable/nestedReplace/NestedReplace.java
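
      A toy sketch of the termination guard (not the debugger's actual visitor): stop following `writeReplace` once it hands back an instance of the same class.

      ```scala
      // If writeReplace returns an object of the same class (as SQLMetrics does),
      // keep the original instead of recursing into the replacement forever.
      def resolveReplacement(obj: AnyRef, writeReplaceOf: AnyRef => AnyRef): AnyRef = {
        val replaced = writeReplaceOf(obj)
        if (replaced.getClass == obj.getClass) obj else replaced
      }
      ```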
      
      cc davies cloud-fan
      
      ## How was this patch tested?
      
      Unit tests for SerializationDebugger.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #13814 from ericl/spark-16003.
      6f915c9e
    • Herman van Hovell's avatar
      [SPARK-15956][SQL] Revert "[] When unwrapping ORC avoid pattern matching… · 472d611a
      Herman van Hovell authored
      This reverts commit 0a9c0275, which breaks the 2.10 build; I'll fix this in a different PR.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #13853 from hvanhovell/SPARK-15956-revert.
      472d611a