  1. Mar 09, 2016
    • Andrew Or's avatar
      [SPARK-13747][SQL] Fix concurrent query with fork-join pool · 37fcda3e
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Fix this use case, which was already fixed in SPARK-10548 in 1.6 but was broken in master due to #9264:
      
      ```
      (1 to 100).par.foreach { _ => sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count() }
      ```
      
      This threw `IllegalArgumentException` consistently before this patch. For more detail, see the JIRA.
      
      ## How was this patch tested?
      
      New test in `SQLExecutionSuite`.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #11586 from andrewor14/fix-concurrent-sql.
      37fcda3e
    • Sameer Agarwal's avatar
      [SPARK-13781][SQL] Use ExpressionSets in ConstraintPropagationSuite · dbf2a7cf
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This PR is a small follow-up to https://github.com/apache/spark/pull/11338 (https://issues.apache.org/jira/browse/SPARK-13092), using `ExpressionSet` as part of the verification logic in `ConstraintPropagationSuite`.
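      For illustration only (the attribute and constraints below are hypothetical, not taken from the suite), `ExpressionSet` makes the comparison semantic and order-insensitive:
      
      ```scala
      import org.apache.spark.sql.catalyst.expressions._
      import org.apache.spark.sql.types.IntegerType
      
      // ExpressionSet compares elements semantically, so listing the same
      // constraints in a different order still yields an equal set.
      val a = AttributeReference("a", IntegerType)()
      val expected = ExpressionSet(Seq(IsNotNull(a), GreaterThan(a, Literal(10))))
      val actual   = ExpressionSet(Seq(GreaterThan(a, Literal(10)), IsNotNull(a)))
      assert(expected == actual)
      ```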
      ## How was this patch tested?
      
      No new tests added. Just changes the verification logic in `ConstraintPropagationSuite`.
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #11611 from sameeragarwal/expression-set.
      dbf2a7cf
    • sethah's avatar
      [SPARK-11861][ML] Add feature importances for decision trees · e1772d3f
      sethah authored
      This patch adds an API entry point for single decision tree feature importances.
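      A minimal usage sketch (the training DataFrame and column names are assumed, not part of this patch description):
      
      ```scala
      import org.apache.spark.ml.classification.DecisionTreeClassifier
      
      // After fitting, the single-tree model exposes a featureImportances vector,
      // analogous to the existing random forest API.
      val dt = new DecisionTreeClassifier()
        .setLabelCol("label")
        .setFeaturesCol("features")
      val model = dt.fit(trainingData)   // trainingData: DataFrame, assumed
      println(model.featureImportances)  // per-feature importance weights
      ```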
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #9912 from sethah/SPARK-11861.
      e1772d3f
    • gatorsmile's avatar
      [SPARK-13527][SQL] Prune Filters based on Constraints · c6aa356c
      gatorsmile authored
      #### What changes were proposed in this pull request?
      
      Remove all the deterministic conditions in a [[Filter]] that are already contained in its child's constraints.
      
      For example, the first query can be simplified to the second one.
      
      ```scala
          val queryWithUselessFilter = tr1
            .where("tr1.a".attr > 10 || "tr1.c".attr < 10)
            .join(tr2.where('d.attr < 100), Inner, Some("tr1.a".attr === "tr2.a".attr))
            .where(
              ("tr1.a".attr > 10 || "tr1.c".attr < 10) &&
              'd.attr < 100 &&
              "tr2.a".attr === "tr1.a".attr)
      ```
      ```scala
          val query = tr1
            .where("tr1.a".attr > 10 || "tr1.c".attr < 10)
            .join(tr2.where('d.attr < 100), Inner, Some("tr1.a".attr === "tr2.a".attr))
      ```
      #### How was this patch tested?
      
      Six test cases are added.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #11406 from gatorsmile/FilterRemoval.
      c6aa356c
    • Davies Liu's avatar
      [SPARK-13523] [SQL] Reuse exchanges in a query · 3dc9ae2e
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      It's possible for a query to have common parts (for example, a self join); it would be good to avoid spending the same CPU and memory (broadcast or cache) on the duplicated part twice.
      
      An Exchange materializes the underlying RDD by shuffle or collect, so it's a good point at which to check for duplicates and reuse them. Duplicated exchanges are exchanges that generate exactly the same result inside a query.
      
      In order to find the duplicated exchanges, we should be able to compare SparkPlans to check whether they produce the same result. We already have that for LogicalPlan, so we should move it into QueryPlan to make it available for SparkPlan.
      
      Once we can find the duplicated exchanges, we should replace all of them with the same SparkPlan object (possibly wrapped in a ReusedExchange for explain), and the plan tree then becomes a DAG. Since the planner rules only work with trees, this rule should be the last one in the entire planning.
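      For illustration (the exact query is not part of this description), the kind of repeated self join that benefits looks roughly like this:
      
      ```scala
      // Hypothetical example: the same range joined twice, so the broadcast
      // exchanges on the right side would otherwise compute identical results.
      val df = sqlContext.range(1024)
      val joined = df.join(df, "id").join(df, "id").select("id")
      joined.explain()
      ```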
      
      After the rule, the plan will look like:
      
      ```
      WholeStageCodegen
      :  +- Project [id#0L]
      :     +- BroadcastHashJoin [id#0L], [id#2L], Inner, BuildRight, None
      :        :- Project [id#0L]
      :        :  +- BroadcastHashJoin [id#0L], [id#1L], Inner, BuildRight, None
      :        :     :- Range 0, 1, 4, 1024, [id#0L]
      :        :     +- INPUT
      :        +- INPUT
      :- BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))
      :  +- WholeStageCodegen
      :     :  +- Range 0, 1, 4, 1024, [id#1L]
      +- ReusedExchange [id#2L], BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))
      ```
      
      ![bjoin](https://cloud.githubusercontent.com/assets/40902/13414787/209e8c5c-df0a-11e5-8a0f-edff69d89e83.png)
      
      For a three-way SortMergeJoin:
      ```
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [id#0L]
      :     +- SortMergeJoin [id#0L], [id#4L], None
      :        :- INPUT
      :        +- INPUT
      :- WholeStageCodegen
      :  :  +- Project [id#0L]
      :  :     +- SortMergeJoin [id#0L], [id#3L], None
      :  :        :- INPUT
      :  :        +- INPUT
      :  :- WholeStageCodegen
      :  :  :  +- Sort [id#0L ASC], false, 0
      :  :  :     +- INPUT
      :  :  +- Exchange hashpartitioning(id#0L, 200), None
      :  :     +- WholeStageCodegen
      :  :        :  +- Range 0, 1, 4, 33554432, [id#0L]
      :  +- WholeStageCodegen
      :     :  +- Sort [id#3L ASC], false, 0
      :     :     +- INPUT
      :     +- ReusedExchange [id#3L], Exchange hashpartitioning(id#0L, 200), None
      +- WholeStageCodegen
         :  +- Sort [id#4L ASC], false, 0
         :     +- INPUT
         +- ReusedExchange [id#4L], Exchange hashpartitioning(id#0L, 200), None
      ```
      ![sjoin](https://cloud.githubusercontent.com/assets/40902/13414790/27aea61c-df0a-11e5-8cbf-fbc985c31d95.png)
      
      If execute()/executeBroadcast() of the same ShuffleExchange or BroadcastExchange is called by different parents, the exchange should cache the RDD/Broadcast and return the same one to all of the parents.
      
      ## How was this patch tested?
      
      Added some unit tests for this. Also did some manual tests on TPCDS queries Q59 and Q64; we can see that some exchanges are reused (this requires a change in PhysicalRDD to support sameResult, done in #11514).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11403 from davies/dedup.
      3dc9ae2e
    • Yanbo Liang's avatar
      [SPARK-13615][ML] GeneralizedLinearRegression supports save/load · 0dd06485
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      ```GeneralizedLinearRegression``` supports ```save/load```.
      cc mengxr
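      A minimal sketch of the new persistence support (dataset, settings, and paths are hypothetical):
      
      ```scala
      import org.apache.spark.ml.regression.{GeneralizedLinearRegression, GeneralizedLinearRegressionModel}
      
      val glr = new GeneralizedLinearRegression()
        .setFamily("gaussian")
        .setLink("identity")
      val model = glr.fit(dataset)                 // dataset: DataFrame, assumed
      model.save("/tmp/glr-model")                 // new MLWritable support
      val loaded = GeneralizedLinearRegressionModel.load("/tmp/glr-model")
      ```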
      ## How was this patch tested?
      unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11465 from yanboliang/spark-13615.
      0dd06485
    • hyukjinkwon's avatar
      [SPARK-13728][SQL] Fix ORC PPD test so that pushed filters can be checked. · cad29a40
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      https://issues.apache.org/jira/browse/SPARK-13728
      
      https://github.com/apache/spark/pull/11509 made the test output a single ORC file.
      It used to be 10 files, but after that change only a single file is written, so ORC could no longer skip stripes based on the pushed-down filters.
      So, this PR simply repartitions the data into 10 partitions so that the test can pass.
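      Roughly speaking (a sketch; the actual test code differs), the fix is of this shape:
      
      ```scala
      // Write 10 files again so that ORC predicate push-down has stripes to skip.
      df.repartition(10).write.orc(path)   // df and path come from the test setup
      ```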
      ## How was this patch tested?
      
      Unit test, and `./dev/run_tests` for the code style check.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #11593 from HyukjinKwon/SPARK-13728.
      cad29a40
    • gatorsmile's avatar
      [SPARK-13763][SQL] Remove Project when its Child's Output is Nil · 23369c3b
      gatorsmile authored
      #### What changes were proposed in this pull request?
      
      As shown in another PR (https://github.com/apache/spark/pull/11596), we use `SELECT 1` as a dummy table for SQL statements in which a table reference is required but the contents of the table are not important. For example,
      
      ```SQL
      SELECT value FROM (select 1) dummyTable Lateral View explode(array(1,2,3)) adTable as value
      ```
      Before this PR, the optimized plan contained a useless `Project` after the Optimizer executed the `ColumnPruning` rule, as shown below:
      
      ```
      == Analyzed Logical Plan ==
      value: int
      Project [value#22]
      +- Generate explode(array(1, 2, 3)), true, false, Some(adtable), [value#22]
         +- SubqueryAlias dummyTable
            +- Project [1 AS 1#21]
               +- OneRowRelation$
      
      == Optimized Logical Plan ==
      Generate explode([1,2,3]), false, false, Some(adtable), [value#22]
      +- Project
         +- OneRowRelation$
      ```
      
      After the fix, the optimized plan removed the useless `Project`, as shown below:
      ```
      == Optimized Logical Plan ==
      Generate explode([1,2,3]), false, false, Some(adtable), [value#22]
      +- OneRowRelation$
      ```
      
      This PR removes a `Project` when its child's output is Nil.
      
      #### How was this patch tested?
      
      Added a new unit test case to the suite `ColumnPruningSuite.scala`.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #11599 from gatorsmile/projectOneRowRelation.
      23369c3b
    • Sean Owen's avatar
      [SPARK-13595][BUILD] Move docker, extras modules into external · 256704c7
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Move `docker` dirs out of top level into `external/`; move `extras/*` into `external/`
      
      ## How was this patch tested?
      
      This is tested with Jenkins tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11523 from srowen/SPARK-13595.
      256704c7
    • Davies Liu's avatar
      [SPARK-13242] [SQL] codegen fallback in case-when if there many branches · 9634e17d
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      If there are many branches in a CaseWhen expression, the generated code could go above the 64KB limit for a single Java method and fail to compile. This PR changes it to fall back to interpreted mode if there are more than 20 branches.
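      For illustration (a sketch; `df` and the column name are hypothetical), this is the kind of expression that used to blow past the method-size limit:
      
      ```scala
      import org.apache.spark.sql.functions._
      
      // Build a single CASE WHEN with 50 branches; previously the generated Java
      // method for such an expression could exceed the 64KB bytecode limit.
      val manyBranches = (2 to 50).foldLeft(when(col("x") === 1, 1)) {
        (acc, i) => acc.when(col("x") === i, i)
      }.otherwise(-1)
      df.select(manyBranches.as("bucket"))
      ```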
      
      This PR is based on #11243 and #11221, thanks to joehalliwell
      
      Closes #11243
      Closes #11221
      
      ## How was this patch tested?
      
      Add a test with 50 branches.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11592 from davies/fix_when.
      9634e17d
    • Dilip Biswal's avatar
      [SPARK-13698][SQL] Fix Analysis Exceptions when Using Backticks in Generate · 53ba6d6e
      Dilip Biswal authored
      ## What changes were proposed in this pull request?
      An analysis exception occurs while running the following query:
      ```
      SELECT ints FROM nestedArray LATERAL VIEW explode(a.b) `a` AS `ints`
      ```
      ```
      Failed to analyze query: org.apache.spark.sql.AnalysisException: cannot resolve '`ints`' given input columns: [a, `ints`]; line 1 pos 7
      'Project ['ints]
      +- Generate explode(a#0.b), true, false, Some(a), [`ints`#8]
         +- SubqueryAlias nestedarray
            +- LocalRelation [a#0], [[[[1,2,3]]]]
      ```
      
      ## How was this patch tested?
      
      Added new unit tests in SQLQuerySuite and HiveQlSuite
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #11538 from dilipbiswal/SPARK-13698.
      53ba6d6e
    • Ahmed Kamal's avatar
      [SPARK-13769][CORE] Update Java Doc in Spark Submit · 8e8633e0
      Ahmed Kamal authored
      JIRA : https://issues.apache.org/jira/browse/SPARK-13769
      
      The Java doc here (https://github.com/apache/spark/blob/e97fc7f176f8bf501c9b3afd8410014e3b0e1602/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L51)
      needs to be updated from "The latter two operations are currently supported only for standalone cluster mode." to "The latter two operations are currently supported only for standalone and mesos cluster modes."
      
      Author: Ahmed Kamal <ahmed.kamal@badrit.com>
      
      Closes #11600 from AhmedKamal/SPARK-13769.
      8e8633e0
    • Dongjoon Hyun's avatar
      [SPARK-13702][CORE][SQL][MLLIB] Use diamond operator for generic instance creation in Java code. · c3689bc2
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      In order to make `docs/examples` (and other related code) simpler, more readable, and more user-friendly, this PR replaces existing code like the following with the diamond operator.
      
      ```
      -    final ArrayList<Product2<Object, Object>> dataToWrite =
      -      new ArrayList<Product2<Object, Object>>();
      +    final ArrayList<Product2<Object, Object>> dataToWrite = new ArrayList<>();
      ```
      
      Java 7 or higher supports the **diamond** operator, which replaces the type arguments required to invoke the constructor of a generic class with an empty set of type parameters (<>). Currently, Spark Java code uses it inconsistently.
      
      ## How was this patch tested?
      
      Manual.
      Pass the existing tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11541 from dongjoon-hyun/SPARK-13702.
      c3689bc2
    • Andy Sloane's avatar
      [SPARK-13631][CORE] Thread-safe getLocationsWithLargestOutputs · cbff2803
      Andy Sloane authored
      ## What changes were proposed in this pull request?
      
      If a job is being scheduled in one thread which has a dependency on an
      RDD currently executing a shuffle in another thread, Spark would throw a
      NullPointerException. This patch synchronizes access to `mapStatuses` and
      skips null status entries (which are in-progress shuffle tasks).
      
      ## How was this patch tested?
      
      Our client code unit test suite, which was reliably reproducing the race
      condition with 10 threads, shows that this fixes it. I have not found a minimal
      test case to add to Spark, but I will attempt to do so if desired.
      
      The same test case was tripping up on SPARK-4454, which was fixed by
      making other DAGScheduler code thread-safe.
      
      shivaram srowen
      
      Author: Andy Sloane <asloane@tetrationanalytics.com>
      
      Closes #11505 from a1k0n/SPARK-13631.
      cbff2803
    • Takuya UESHIN's avatar
      [SPARK-13640][SQL] Synchronize ScalaReflection.mirror method. · 2c5af7d4
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      The `ScalaReflection.mirror` method should be synchronized when the Scala version is `2.10` because `universe.runtimeMirror` is not thread-safe.
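      Conceptually the change is along these lines (an illustrative sketch, not the exact patch):
      
      ```scala
      import scala.reflect.runtime.{universe => ru}
      
      object SafeReflection {
        // Serialize mirror creation: ru.runtimeMirror is not thread-safe on Scala 2.10.
        def mirror: ru.Mirror = this.synchronized {
          ru.runtimeMirror(Thread.currentThread().getContextClassLoader)
        }
      }
      ```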
      
      ## How was this patch tested?
      
      I added a test to check the thread safety of the `ScalaReflection.mirror` method in `ScalaReflectionSuite`, which will throw the following exception in Scala `2.10` without this patch:
      
      ```
      [info] - thread safety of mirror *** FAILED *** (49 milliseconds)
      [info]   java.lang.UnsupportedOperationException: tail of empty list
      [info]   at scala.collection.immutable.Nil$.tail(List.scala:339)
      [info]   at scala.collection.immutable.Nil$.tail(List.scala:334)
      [info]   at scala.reflect.internal.SymbolTable.popPhase(SymbolTable.scala:172)
      [info]   at scala.reflect.internal.Symbols$Symbol.unsafeTypeParams(Symbols.scala:1477)
      [info]   at scala.reflect.internal.Symbols$TypeSymbol.tpe(Symbols.scala:2777)
      [info]   at scala.reflect.internal.Mirrors$RootsBase.init(Mirrors.scala:235)
      [info]   at scala.reflect.runtime.JavaMirrors$class.createMirror(JavaMirrors.scala:34)
      [info]   at scala.reflect.runtime.JavaMirrors$class.runtimeMirror(JavaMirrors.scala:61)
      [info]   at scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:12)
      [info]   at scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:12)
      [info]   at org.apache.spark.sql.catalyst.ScalaReflection$.mirror(ScalaReflection.scala:36)
      [info]   at org.apache.spark.sql.catalyst.ScalaReflectionSuite$$anonfun$12$$anonfun$apply$mcV$sp$1$$anonfun$apply$1$$anonfun$apply$2.apply(ScalaReflectionSuite.scala:256)
      [info]   at org.apache.spark.sql.catalyst.ScalaReflectionSuite$$anonfun$12$$anonfun$apply$mcV$sp$1$$anonfun$apply$1$$anonfun$apply$2.apply(ScalaReflectionSuite.scala:252)
      [info]   at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
      [info]   at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
      [info]   at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
      [info]   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
      [info]   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
      [info]   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
      [info]   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
      ```
      
      Notice that the test passes when the Scala version is `2.11`.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #11487 from ueshin/issues/SPARK-13640.
      2c5af7d4
    • Dongjoon Hyun's avatar
      [SPARK-13692][CORE][SQL] Fix trivial Coverity/Checkstyle defects · f3201aee
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This issue fixes the following potential bugs and Java coding style detected by Coverity and Checkstyle.
      
      - Implement both null and type checking in equals functions.
      - Fix wrong type casting logic in SimpleJavaBean2.equals.
      - Add `implements Cloneable` to `UTF8String` and `SortedIterator`.
      - Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`.
      - Fix coding style: Add '{}' to single `for` statement in mllib examples.
      - Remove unused imports in `ColumnarBatch` and `JavaKinesisStreamSuite`.
      - Remove unused fields in `ChunkFetchIntegrationSuite`.
      - Add `stop()` to prevent resource leak.
      
      Please note that the last two checkstyle errors exist on newly added commits after [SPARK-13583](https://issues.apache.org/jira/browse/SPARK-13583).
      
      ## How was this patch tested?
      
      Manual, via `./dev/lint-java` and the Coverity site.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11530 from dongjoon-hyun/SPARK-13692.
      f3201aee
  2. Mar 08, 2016
    • Jakob Odersky's avatar
      [SPARK-7286][SQL] Deprecate !== in favour of =!= · 035d3acd
      Jakob Odersky authored
      This PR replaces #9925 which had issues with CI. **Please see the original PR for any previous discussions.**
      
      ## What changes were proposed in this pull request?
      Deprecate the SparkSQL column operator !== and use =!= as an alternative.
      Fixes subtle issues related to operator precedence (basically, !== does not have the same precedence as its logical negation, ===).
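      For illustration (columns are hypothetical): Scala treats an operator that ends in `=` but does not start with `=` as an assignment-like operator with the lowest precedence, so `!==` and `===` bind differently:
      
      ```scala
      // !== gets assignment-like (lowest) precedence, so this parses as
      //   df("a") !== (df("b") || df("c") === 1)   -- probably not what was meant
      df.filter(df("a") !== df("b") || df("c") === 1)
      
      // =!= starts with '=', so it keeps ordinary comparison precedence:
      //   (df("a") =!= df("b")) || (df("c") === 1)
      df.filter(df("a") =!= df("b") || df("c") === 1)
      ```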
      
      ## How was this patch tested?
      All currently existing tests.
      
      Author: Jakob Odersky <jodersky@gmail.com>
      
      Closes #11588 from jodersky/SPARK-7286.
      035d3acd
    • Hossein's avatar
      [SPARK-13754] Keep old data source name for backwards compatibility · cc4ab37e
      Hossein authored
      ## Motivation
      CSV data source was contributed by Databricks. It is the inlined version of https://github.com/databricks/spark-csv. The data source name was `com.databricks.spark.csv`. As a result, there are many tables created on older versions of Spark with that name as the source. For backwards compatibility we should keep the old name.
      
      ## Proposed changes
      `com.databricks.spark.csv` was added to the `backwardCompatibilityMap` in `ResolvedDataSource.scala`.
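      With the mapping in place, both names resolve to the same built-in source (a sketch; the file path is hypothetical):
      
      ```scala
      // The old Databricks package name keeps working alongside the short name.
      sqlContext.read.format("com.databricks.spark.csv").load("people.csv")
      sqlContext.read.format("csv").load("people.csv")
      ```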
      
      ## Tests
      A unit test was added to `CSVSuite` to parse a csv file using the old name.
      
      Author: Hossein <hossein@databricks.com>
      
      Closes #11589 from falaki/SPARK-13754.
      cc4ab37e
    • Davies Liu's avatar
      [SPARK-13750][SQL] fix sizeInBytes of HadoopFsRelation · 982ef2b8
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR fixes the sizeInBytes of HadoopFsRelation.
      
      ## How was this patch tested?
      
      Added regression test for that.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11590 from davies/fix_sizeInBytes.
      982ef2b8
    • Bryan Cutler's avatar
      [SPARK-13625][PYSPARK][ML] Added a check to see if an attribute is a property... · d8813fa0
      Bryan Cutler authored
      [SPARK-13625][PYSPARK][ML] Added a check to see if an attribute is a property when getting param list
      
      ## What changes were proposed in this pull request?
      
      Added a check in pyspark.ml.param.Param.params() to see if an attribute is a property (decorated with `property`) before checking if it is a `Param` instance.  This prevents the property from being invoked to 'get' this attribute, which could possibly cause an error.
      
      ## How was this patch tested?
      
      Added a test case with a class that has a property which raises an error when invoked, and then called `Param.params` to verify that the property is not invoked but that another `Param` in the class is still found. Also ran the pyspark-ml tests before the fix, which triggered the error, and again after the fix to verify that the error was resolved and the method works properly.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #11476 from BryanCutler/pyspark-ml-property-attr-SPARK-13625.
      d8813fa0
    • Josh Rosen's avatar
      [SPARK-13755] Escape quotes in SQL plan visualization node labels · 81f54acc
      Josh Rosen authored
      When generating Graphviz DOT files in the SQL query visualization we need to escape double-quotes inside node labels. This is a follow-up to #11309, which fixed a similar issue in Spark Core's DAG visualization.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11587 from JoshRosen/graphviz-escaping.
      81f54acc
    • Sameer Agarwal's avatar
      [SPARK-13668][SQL] Reorder filter/join predicates to short-circuit isNotNull checks · e430614e
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      If a filter predicate or a join condition consists of `IsNotNull` checks, we should reorder these checks such that these non-nullability checks are evaluated before the rest of the predicates.
      
      For example, if a filter predicate is of the form `a > 5 && isNotNull(b)`, we should rewrite this as `isNotNull(b) && a > 5` during physical plan generation.
      
      ## How was this patch tested?
      
      new unit tests that verify the physical plan for both filters and joins in `ReorderedPredicateSuite`
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #11511 from sameeragarwal/reorder-isnotnull.
      e430614e
    • Michael Armbrust's avatar
      [SPARK-13738][SQL] Cleanup Data Source resolution · 1e288405
      Michael Armbrust authored
      Follow-up to #11509, that simply refactors the interface that we use when resolving a pluggable `DataSource`.
       - Multiple functions share the same set of arguments so we make this a case class, called `DataSource`.  Actual resolution is now done by calling a function on this class.
       - Instead of having multiple methods named `apply` (some of which do writing, some of which do reading), we now explicitly have `resolveRelation()` and `write(mode, df)`; a rough sketch follows below.
       - Get rid of `Array[String]` since this is an internal API and was forcing us to awkwardly call `toArray` in a bunch of places.
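      A rough sketch of the new shape (parameter names are approximate, not the exact signature):
      
      ```scala
      import org.apache.spark.sql.SaveMode
      
      // Resolution and writing both go through the new DataSource case class.
      val dataSource = DataSource(
        sqlContext,
        className = "parquet",
        paths = Seq("/path/to/table"))
      val relation = dataSource.resolveRelation()
      dataSource.write(SaveMode.Append, df)   // df: DataFrame, assumed
      ```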
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #11572 from marmbrus/dataSourceResolution.
      1e288405
    • Dongjoon Hyun's avatar
      [SPARK-13400] Stop using deprecated Octal escape literals · 076009b9
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This removes the remaining deprecated octal escape literals. The following are the warnings on those two lines.
      ```
      LiteralExpressionSuite.scala:99: Octal escape literals are deprecated, use \u0000 instead.
      HiveQlSuite.scala:74: Octal escape literals are deprecated, use \u002c instead.
      ```
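      The change is of this shape (illustrative, not the exact diff):
      
      ```scala
      val before = "\0"      // octal escape literal, now deprecated
      val after  = "\u0000"  // equivalent Unicode escape, no warning
      ```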
      
      ## How was this patch tested?
      
      Manual.
      During the build, there should be no warnings about `Octal escape literals`.
      ```
      mvn -DskipTests clean install
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11584 from dongjoon-hyun/SPARK-13400.
      076009b9
    • Wenchen Fan's avatar
      [SPARK-13593] [SQL] improve the `createDataFrame` to accept data type string and verify the data · d57daf1f
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This PR improves the `createDataFrame` method to also accept a data type string, so users can convert a Python RDD to a DataFrame easily, for example, `df = rdd.toDF("a: int, b: string")`.
      It also supports flat schemas, so users can convert an RDD of int to a DataFrame directly; we automatically wrap the int in a row for users.
      If a schema is given, we now check whether the real data matches the given schema, and throw an error if it doesn't.
      
      ## How was this patch tested?
      
      new tests in `test.py` and doc test in `types.py`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11444 from cloud-fan/pyrdd.
      d57daf1f
    • Wenchen Fan's avatar
      [SPARK-13740][SQL] add null check for _verify_type in types.py · d5ce6172
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This PR adds null check in `_verify_type` according to the nullability information.
      
      ## How was this patch tested?
      
      new doc tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11574 from cloud-fan/py-null-check.
      d5ce6172
    • Yanbo Liang's avatar
      [ML] testEstimatorAndModelReadWrite should call checkModelData · 9740954f
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Although we defined ```checkModelData``` in the [```read/write``` test](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L994) of ML estimators/models and pass it to ```testEstimatorAndModelReadWrite```, ```testEstimatorAndModelReadWrite``` never calls ```checkModelData``` to check the equality of model data. So we currently do not actually run the model-data equality check for any of these test cases; we should fix that.
      BTW, this also fixes a bug in the LDA read/write test, which did not set ```docConcentration```. This bug should have made the test fail, but it did not complain because ```checkModelData``` was never actually run.
      cc jkbradley mengxr
      ## How was this patch tested?
      No new unit tests; the existing ones should pass.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11513 from yanboliang/ml-check-model-data.
      9740954f
    • Wenchen Fan's avatar
      [SPARK-12727][SQL] support SQL generation for aggregate with multi-distinct · 46881b4e
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This PR adds SQL generation support for aggregates with multiple DISTINCT clauses, by simply moving the `DistinctAggregationRewriter` rule to the optimizer.
      
      More discussion is needed, as this breaks an important contract: an analyzed plan should be able to run without optimization. However, the `ComputeCurrentTime` rule has kind of broken it already, and I think maybe we should add a new phase for this kind of rule, because strictly speaking such rules don't belong to analysis and are coupled with the physical plan implementation.
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11579 from cloud-fan/distinct.
      46881b4e
    • Josh Rosen's avatar
      [SPARK-13695] Don't cache MEMORY_AND_DISK blocks as bytes in memory after spills · ad3c9a97
      Josh Rosen authored
      When a cached block is spilled to disk and read back in serialized form (i.e. as bytes), the current BlockManager implementation will attempt to re-insert the serialized block into the MemoryStore even if the block's storage level requests deserialized caching.
      
      This behavior adds some complexity to the MemoryStore but I don't think it offers many performance benefits and I'd like to remove it in order to simplify a larger refactoring patch. Therefore, this patch changes the behavior so that disk store reads will only cache bytes in the memory store for blocks with serialized storage levels.
      
      There are two places where we request serialized bytes from the BlockStore:
      
      1. getLocalBytes(), which is only called when reading local copies of TorrentBroadcast pieces. Broadcast pieces are always cached using a serialized storage level, so this won't lead to a mismatch in serialization forms if spilled bytes read from disk are cached as bytes in the memory store.
      2. the non-shuffle-block branch in getBlockData(), which is only called by the NettyBlockRpcServer when responding to requests to read remote blocks. Caching the serialized bytes in memory will only benefit us if those cached bytes are read before they're evicted and the likelihood of that happening seems low since the frequency of remote reads of non-broadcast cached blocks seems very low. Caching these bytes when they have a low probability of being read is bad if it risks the eviction of blocks which are cached in their expected serialized/deserialized forms, since those blocks seem more likely to be read in local computation.
      
      Given the argument above, I think this change is unlikely to cause performance regressions.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11533 from JoshRosen/remove-memorystore-level-mismatch.
      ad3c9a97
    • Davies Liu's avatar
      [SPARK-13657] [SQL] Support parsing very long AND/OR expressions · 78d3b605
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      In order to avoid a StackOverflow when parsing an expression with hundreds of ORs, we should use a loop instead of recursive functions to flatten the tree into a list. This PR also builds a balanced tree to reduce the depth of the generated And/Or expressions, to avoid a StackOverflow in the analyzer/optimizer.
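      The balancing idea, as a rough sketch (the helper name is hypothetical):
      
      ```scala
      import org.apache.spark.sql.catalyst.expressions.{Expression, Or}
      
      // Combine predicates pairwise so the resulting Or tree has depth O(log n)
      // instead of O(n), which keeps later tree traversals from overflowing.
      def balancedOr(preds: Seq[Expression]): Expression = {
        require(preds.nonEmpty)
        if (preds.length == 1) {
          preds.head
        } else {
          val (left, right) = preds.splitAt(preds.length / 2)
          Or(balancedOr(left), balancedOr(right))
        }
      }
      ```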
      
      ## How was this patch tested?
      
      Added new unit tests. Manually tested with TPCDS Q3, which has hundreds of predicates in it [1]. These predicates help to reduce the number of partitions, so the query time went from 60 seconds to 8 seconds.
      
      [1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11501 from davies/long_or.
      78d3b605
    • Sean Owen's avatar
      [SPARK-13715][MLLIB] Remove last usages of jblas in tests · 54040f8d
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Remove last usage of jblas, in tests
      
      ## How was this patch tested?
      
      Jenkins tests -- the same ones that are being modified.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11560 from srowen/SPARK-13715.
      54040f8d
    • jerryshao's avatar
      [HOTFIX][YARN] Fix yarn cluster mode fire and forget regression · ca1a7b9d
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      Fire-and-forget was disabled by default; with patch #10205 it is enabled by default, so this is a regression that should be fixed.
      
      ## How was this patch tested?
      
      Manually verified this change.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #11577 from jerryshao/hot-fix-yarn-cluster.
      ca1a7b9d
    • Wenchen Fan's avatar
      [SPARK-13637][SQL] use more information to simplify the code in Expand builder · 7d05d02b
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      The code in `Expand.apply` can be simplified by existing information:
      
      * the `groupByExprs` parameter contains only `Attribute`s
      * the `child` parameter is a `Project` that appends aliased group-by expressions to its child's output
      
      ## How was this patch tested?
      
      by existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11485 from cloud-fan/expand.
      7d05d02b
    • jerryshao's avatar
      [SPARK-13675][UI] Fix wrong historyserver url link for application running in yarn cluster mode · 9e86e6ef
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      Current URL for each application to access history UI is like:
      http://localhost:18080/history/application_1457058760338_0016/1/jobs/ or http://localhost:18080/history/application_1457058760338_0016/2/jobs/
      
      Here **1** or **2** represents the attempt number in `historypage.js`, but it is parsed as the attempt id in `HistoryServer`, while the correct attempt id should look like "appattempt_1457058760338_0016_000002", so it fails to parse into a correct attempt id in HistoryServer.
      
      This is OK in yarn client mode, since we don't need this attempt id to fetch the app cache, but it fails in yarn cluster mode, where attempt id "1" or "2" is actually wrong.
      
      So here we fix this URL to parse the correct application id and attempt id. Also, the suffix "jobs/" is not needed.
      
      Here is the screenshot:
      
      ![screen shot 2016-02-29 at 3 57 32 pm](https://cloud.githubusercontent.com/assets/850797/13524377/d4b44348-e235-11e5-8b3e-bc06de306e87.png)
      
      ## How was this patch tested?
      
      This patch is tested manually, with different master and deploy mode.
      
      ![image](https://cloud.githubusercontent.com/assets/850797/13524419/118be5a0-e236-11e5-8022-3ff613ccde46.png)
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #11518 from jerryshao/SPARK-13675.
      9e86e6ef
    • Devaraj K's avatar
      [SPARK-13117][WEB UI] WebUI should use the local ip not 0.0.0.0 · 9bf76ddd
      Devaraj K authored
      ## What changes were proposed in this pull request?
      
      In the WebUI, the Jetty server now starts with the SPARK_LOCAL_IP config value if it
      is configured; otherwise it starts with the default value '0.0.0.0'.
      
      It is continuation as per the closed PR https://github.com/apache/spark/pull/11133 for the JIRA SPARK-13117 and discussion in SPARK-13117.
      
      ## How was this patch tested?
      
      This has been verified using the command 'netstat -tnlp | grep <PID>' to check which IP/hostname each process binds to, following the steps below.
      
      In the below results, mentioned PID in the command is the corresponding process id.
      
      #### Without the patch changes,
      The Web UI (Jetty server) does not take the value configured for SPARK_LOCAL_IP and listens on all interfaces.
      ###### Master
      ```
      [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 3930
      tcp6       0      0 :::8080                 :::*                    LISTEN      3930/java
      ```
      
      ###### Worker
      ```
      [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 4090
      tcp6       0      0 :::8081                 :::*                    LISTEN      4090/java
      ```
      
      ###### History Server Process,
      ```
      [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 2471
      tcp6       0      0 :::18080                :::*                    LISTEN      2471/java
      ```
      ###### Driver
      ```
      [devarajstobdtserver2 spark-master]$ netstat -tnlp | grep 6556
      tcp6       0      0 :::4040                 :::*                    LISTEN      6556/java
      ```
      
      #### With the patch changes
      
      ##### i. With SPARK_LOCAL_IP configured
      If SPARK_LOCAL_IP is configured, then each process's Web UI (Jetty server) binds to the configured value.
      ###### Master
      ```
      [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 1561
      tcp6       0      0 x.x.x.x:8080       :::*                    LISTEN      1561/java
      ```
      ###### Worker
      ```
      [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 2229
      tcp6       0      0 x.x.x.x:8081       :::*                    LISTEN      2229/java
      ```
      ###### History Server
      ```
      [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 3747
      tcp6       0      0 x.x.x.x:18080      :::*                    LISTEN      3747/java
      ```
      ###### Driver
      ```
      [devarajstobdtserver2 spark-master]$ netstat -tnlp | grep 6013
      tcp6       0      0 x.x.x.x:4040       :::*                    LISTEN      6013/java
      ```
      
      ##### ii. Without SPARK_LOCAL_IP configured
      If SPARK_LOCAL_IP is not configured, then each process's Web UI (Jetty server) starts with the default value '0.0.0.0'.
      ###### Master
      ```
      [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 4573
      tcp6       0      0 :::8080                 :::*                    LISTEN      4573/java
      ```
      
      ###### Worker
      ```
      [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 4703
      tcp6       0      0 :::8081                 :::*                    LISTEN      4703/java
      ```
      
      ###### History Server
      ```
      [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 4846
      tcp6       0      0 :::18080                :::*                    LISTEN      4846/java
      ```
      
      ###### Driver
      ```
      [devarajstobdtserver2 sbin]$ netstat -tnlp | grep 5437
      tcp6       0      0 :::4040                 :::*                    LISTEN      5437/java
      ```
      
      Author: Devaraj K <devaraj@apache.org>
      
      Closes #11490 from devaraj-kavali/SPARK-13117-v1.
      9bf76ddd
    • Dongjoon Hyun's avatar
      [HOT-FIX][BUILD] Use the new location of `checkstyle-suppressions.xml` · 7771c731
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes `dev/lint-java` and `mvn checkstyle:check` failures due to the recent file location change.
      The following is the error message of current master.
      ```
      Checkstyle checks failed at following occurrences:
      [ERROR] Failed to execute goal org.apache.maven.plugins:maven-checkstyle-plugin:2.17:check (default-cli) on project spark-parent_2.11: Failed during checkstyle configuration: cannot initialize module SuppressionFilter - Cannot set property 'file' to 'checkstyle-suppressions.xml' in module SuppressionFilter: InvocationTargetException: Unable to find: checkstyle-suppressions.xml -> [Help 1]
      ```
      
      ## How was this patch tested?
      
      Manual. The following command should run correctly.
      ```
      ./dev/lint-java
      mvn checkstyle:check
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11567 from dongjoon-hyun/hotfix_checkstyle_suppression.
      7771c731
  3. Mar 07, 2016
    • Josh Rosen's avatar
      [SPARK-13659] Refactor BlockStore put*() APIs to remove returnValues · e52e597d
      Josh Rosen authored
      In preparation for larger refactoring, this patch removes the confusing `returnValues` option from the BlockStore put() APIs: returning the value is only useful in one place (caching) and in other situations, such as block replication, it's simpler to put() and then get().
      
      As part of this change, I needed to refactor `BlockManager.doPut()`'s block replication code. I also changed `doPut()` to access the memory and disk stores directly rather than calling them through the BlockStore interface; this is in anticipation of a followup patch to remove the BlockStore interface so that the disk store can expose a binary-data-oriented API which is not concerned with Java objects or serialization.
      
      These changes should be covered by the existing storage unit tests. The best way to review this patch is probably to look at the individual commits, all of which are small and have useful descriptions to guide the review.
      
      /cc davies for review.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11502 from JoshRosen/remove-returnvalues.
      e52e597d
    • Shixiong Zhu's avatar
      [SPARK-13711][CORE] Don't call SparkUncaughtExceptionHandler in AppClient as it's in driver · 017cdf2b
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      AppClient runs on the driver side. It should not call `Utils.tryOrExit`, as that will send the exception to SparkUncaughtExceptionHandler and call `System.exit`. This PR just removes `Utils.tryOrExit`.
      
      ## How was this patch tested?
      
      manual tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11566 from zsxwing/SPARK-13711.
      017cdf2b
    • Davies Liu's avatar
      [SPARK-13404] [SQL] Create variables for input row when it's actually used · 25bba58d
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR changes the way we generate the code for the output variables passed from a plan to its parent.
      
      Right now, they are generated before calling consume() on the parent. This is not efficient: if the parent is a Filter or Join, which could filter out most of the rows, the time spent accessing columns that are not used by the Filter or Join is wasted.
      
      This PR tries to improve this by deferring access to columns until they are actually used by a plan. After this PR, a plan does not need to generate code to evaluate its output variables; it just passes the ExprCode to its parent via `consume()`. In `parent.consumeChild()`, it checks the output from the child and `usedInputs`, and generates code for the columns that are part of `usedInputs` before calling `doConsume()`.
      
      This PR also change the `if` from
      ```
      if (cond) {
        xxx
      }
      ```
      to
      ```
      if (!cond) continue;
      xxx
      ```
      The new one could help to reduce the nested indents for multiple levels of Filter and BroadcastHashJoin.
      
      It also added some comments for operators.
      
      ## How was this patch tested?
      
      Unit tests. Manually ran TPCDS Q55; this PR improves performance by about 30% (scale=10, from 2.56s to 1.96s).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11274 from davies/gen_defer.
      25bba58d