Commits · a2dce22e0a25922e2052318d32f32877b7c27ec2 · cs525-sp18-g07 / spark

Nov 20, 2015

Revert "[SPARK-11689][ML] Add user guide and example code for LDA under spark.ml" · a2dce22e
Xiangrui Meng authored 9 years ago
```
This reverts commit e359d5dc.
```
a2dce22e
[HOTFIX] Fix Java Dataset Tests · 47815878
Michael Armbrust authored 9 years ago

47815878
[SPARK-11890][SQL] Fix compilation for Scala 2.11 · 68ed0468
Michael Armbrust authored 9 years ago
```
Author: Michael Armbrust <michael@databricks.com>

Closes #9871 from marmbrus/scala211-break.
```
68ed0468

[SPARK-11889][SQL] Fix type inference for GroupedDataset.agg in REPL · 968acf3b

In this PR I delete a method that breaks type inference for aggregators (only in the REPL).

The error when this method is present is:
```
<console>:38: error: missing parameter type for expanded function ((x$2) => x$2._2)
              ds.groupBy(_._1).agg(sum(_._2), sum(_._3)).collect()
```

Author: Michael Armbrust <michael@databricks.com>

Closes #9870 from marmbrus/dataset-repl-agg.

968acf3b

[SPARK-11787][SPARK-11883][SQL][FOLLOW-UP] Cleanup for this patch. · 58b4e4f8

Nong Li authored 9 years ago

This mainly moves SqlNewHadoopRDD to the sql package. Some state is shared
between core and sql, and I've left that in core. This allows some other
associated minor cleanup.

Author: Nong Li <nong@databricks.com>

Closes #9845 from nongli/spark-11787.

58b4e4f8

[SPARK-11549][DOCS] Replace example code in mllib-evaluation-metrics.md using include_example · ed47b1e6
Vikas Nelamangala authored 9 years ago
```
Author: Vikas Nelamangala <vikasnelamangala@Vikass-MacBook-Pro.local>

Closes #9689 from vikasnp/master.
```
ed47b1e6

[SPARK-11636][SQL] Support classes defined in the REPL with Encoders · 4b84c72d

Michael Armbrust authored 9 years ago

#theScaryParts (i.e. changes to the repl, executor classloaders and codegen)...

Author: Michael Armbrust <michael@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #9825 from marmbrus/dataset-replClasses2.

4b84c72d

[SPARK-11756][SPARKR] Fix use of aliases - SparkR can not output help information for SparkR:::summary correctly · a6239d58

felixcheung authored 9 years ago

Fix the use of aliases and change the uses of rdname and seealso.
`aliases` is the hint for `?`; it should not be linked to some other name. Such links belong in seealso.
https://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html

Clean up the usage of family, as multiple uses of family with the same rdname cause duplicated See Also HTML blocks (like http://spark.apache.org/docs/latest/api/R/count.html).
Also change some rdname values for dplyr-like variants for better visibility to R users in the R docs, e.g. rbind, summary, mutate, summarize.

shivaram yanboliang

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9750 from felixcheung/rdocaliases.

a6239d58

[SPARK-11716][SQL] UDFRegistration just drops the input type when re-creating the UserDefinedFunction · 03ba56d7

Jean-Baptiste Onofré authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-11716

This one is #9739 plus a regression test. When committing it, please make sure the author is jbonofre.

You can find the original PR at https://github.com/apache/spark/pull/9739

closes #9739

Author: Jean-Baptiste Onofré <jbonofre@apache.org>
Author: Yin Huai <yhuai@databricks.com>

Closes #9868 from yhuai/SPARK-11716.

03ba56d7

[SPARK-11887] Close PersistenceEngine at the end of PersistenceEngineSuite tests · 89fd9bd0

Josh Rosen authored 9 years ago

In PersistenceEngineSuite, we do not call `close()` on the PersistenceEngine at the end of the test. For the ZooKeeperPersistenceEngine, this causes us to leak a ZooKeeper client, causing the logs of unrelated tests to be periodically spammed with connection error messages from that client:

```
15/11/20 05:13:35.789 pool-1-thread-1-ScalaTest-running-PersistenceEngineSuite-SendThread(localhost:15741) INFO ClientCnxn: Opening socket connection to server localhost/127.0.0.1:15741. Will not attempt to authenticate using SASL (unknown error)
15/11/20 05:13:35.790 pool-1-thread-1-ScalaTest-running-PersistenceEngineSuite-SendThread(localhost:15741) WARN ClientCnxn: Session 0x15124ff48dd0000 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
	at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
```

This patch fixes this by using a `finally` block.
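
As an illustration only, the same shape can be sketched in a few lines of Python. `FakeEngine`, `run_suite`, and `failing_test` are hypothetical stand-ins, not the Scala classes touched by this patch; the point is simply that `close()` runs even when the test body throws.

```python
class FakeEngine:
    """Hypothetical stand-in for a persistence engine that holds a client."""

    def __init__(self):
        self.closed = False

    def close(self):
        # Release the underlying client so it stops retrying connections.
        self.closed = True


def run_suite(engine, test_body):
    """Run test_body against engine, always closing the engine afterwards."""
    try:
        test_body(engine)
    finally:
        engine.close()


def failing_test(engine):
    raise AssertionError("simulated test failure")


engine = FakeEngine()
try:
    run_suite(engine, failing_test)
except AssertionError:
    pass  # the failure still propagates; the engine is closed regardless
```

Because the cleanup sits in `finally`, a failing assertion no longer leaves a live client behind for later suites.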

Author: Josh Rosen <joshrosen@databricks.com>

Closes #9864 from JoshRosen/close-zookeeper-client-in-tests.

89fd9bd0

[SPARK-11870][STREAMING][PYSPARK] Rethrow the exceptions in TransformFunction and TransformFunctionSerializer · be7a2cfd

Shixiong Zhu authored 9 years ago

TransformFunction and TransformFunctionSerializer don't rethrow exceptions, so when any exception happens, they just return None. This causes confusing NPEs later on.
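
The difference between the two behaviours can be sketched in plain Python. This is not the PySpark code itself: `call_swallowing`, `call_rethrowing`, and `broken_transform` are hypothetical names used only to contrast swallowing an exception with rethrowing it.

```python
def call_swallowing(func, *args):
    """Problematic shape: any exception is swallowed and None is returned."""
    try:
        return func(*args)
    except Exception:
        return None


def call_rethrowing(func, *args):
    """Shape after the fix: the caller sees the original failure."""
    try:
        return func(*args)
    except Exception:
        # Logging or cleanup could go here; the key part is the bare raise.
        raise


def broken_transform(data):
    raise ValueError("bad transform")


# The swallowed version hands back None, which fails later and far away.
swallowed = call_swallowing(broken_transform, [1, 2, 3])

# The rethrowing version surfaces the real error at the call site.
try:
    call_rethrowing(broken_transform, [1, 2, 3])
    rethrown = None
except ValueError as exc:
    rethrown = str(exc)
```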

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9847 from zsxwing/pyspark-streaming-exception.

be7a2cfd

[SPARK-11724][SQL] Change casting between int and timestamp to consistently treat int in seconds. · 9ed4ad42

Nong Li authored 9 years ago

Hive has since changed this behavior as well. https://issues.apache.org/jira/browse/HIVE-3454
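
The "int in seconds" convention from the title can be illustrated with the Python standard library. This is only a sketch of the convention, not Spark's cast implementation; the helper names are hypothetical and UTC is assumed so the result is deterministic.

```python
from datetime import datetime, timezone


def int_to_timestamp(seconds):
    """Interpret an integer as seconds since the Unix epoch (UTC assumed)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def timestamp_to_int(ts):
    """Inverse direction: whole seconds since the epoch."""
    return int(ts.timestamp())


# 86400 seconds is exactly one day after the epoch.
ts = int_to_timestamp(86400)
round_trip = timestamp_to_int(ts)
```

Using the same unit in both directions is what makes the round trip return the original integer.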

Author: Nong Li <nong@databricks.com>
Author: Nong Li <nongli@gmail.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #9685 from nongli/spark-11724.

9ed4ad42

[SPARK-11650] Reduce RPC timeouts to speed up slow AkkaUtilsSuite test · 652def31

Josh Rosen authored 9 years ago

This patch reduces some RPC timeouts in order to speed up the slow "AkkaUtilsSuite.remote fetch ssl on - untrusted server", which used to take two minutes to run.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #9869 from JoshRosen/SPARK-11650.

652def31

[SPARK-11819][SQL] nice error message for missing encoder · 3b9d2a34

Wenchen Fan authored 9 years ago

Before this PR, when users tried to get an encoder for an unsupported class, they only got a very simple error message like `Encoder for type xxx is not supported`.

After this PR, the error message becomes more friendly, for example:
```
No Encoder found for abc.xyz.NonEncodable
- array element class: "abc.xyz.NonEncodable"
- field (class: "scala.Array", name: "arrayField")
- root class: "abc.xyz.AnotherClass"
```

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9810 from cloud-fan/error-message.

3b9d2a34

[SPARK-11817][SQL] Truncating the fractional seconds to prevent inserting a NULL · 60bfb113

Liang-Chi Hsieh authored 9 years ago

JIRA: https://issues.apache.org/jira/browse/SPARK-11817

Instead of returning None, we should truncate the fractional seconds to prevent inserting NULL.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #9834 from viirya/truncate-fractional-sec.

60bfb113

[SPARK-11876][SQL] Support printSchema in DataSet API · bef361c5

gatorsmile authored 9 years ago

The DataSet APIs look great! However, I get lost when doing multi-level joins. For example,
```
val ds1 = Seq(("a", 1), ("b", 2)).toDS().as("a")
val ds2 = Seq(("a", 1), ("b", 2)).toDS().as("b")
val ds3 = Seq(("a", 1), ("b", 2)).toDS().as("c")

ds1.joinWith(ds2, $"a._2" === $"b._2").as("ab").joinWith(ds3, $"ab._1._2" === $"c._2").printSchema()
```

The printed schema is like
```
root
 |-- _1: struct (nullable = true)
 |    |-- _1: struct (nullable = true)
 |    |    |-- _1: string (nullable = true)
 |    |    |-- _2: integer (nullable = true)
 |    |-- _2: struct (nullable = true)
 |    |    |-- _1: string (nullable = true)
 |    |    |-- _2: integer (nullable = true)
 |-- _2: struct (nullable = true)
 |    |-- _1: string (nullable = true)
 |    |-- _2: integer (nullable = true)
```

Personally, I think we need the printSchema function. Sometimes, I do not know how to specify the column, especially when their data types are mixed. For example, if I want to write the following select for the above multi-level join, I have to know the schema:
```
newDS.select(expr("_1._2._2 + 1").as[Int]).collect()
```

marmbrus rxin cloud-fan  Do you have the same feeling?

Author: gatorsmile <gatorsmile@gmail.com>

Closes #9855 from gatorsmile/printSchemaDataSet.

bef361c5

[SPARK-11689][ML] Add user guide and example code for LDA under spark.ml · e359d5dc

Yuhao Yang authored 9 years ago

jira: https://issues.apache.org/jira/browse/SPARK-11689

Add simple user guide for LDA under spark.ml and example code under examples/. Use include_example to include example code in the user guide markdown. Check SPARK-11606 for instructions.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #9722 from hhbyyh/ldaMLExample.

e359d5dc

[SPARK-11852][ML] StandardScaler minor refactor · 9ace2e5c

Yanbo Liang authored 9 years ago

```withStd``` and ```withMean``` should be params of ```StandardScaler``` and ```StandardScalerModel```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9839 from yanboliang/standardScaler-refactor.

9ace2e5c

[SPARK-11877] Prevent agg. fallback conf. from leaking across test suites · a66142de

Josh Rosen authored 9 years ago

This patch fixes an issue where the `spark.sql.TungstenAggregate.testFallbackStartsAt` SQLConf setting was not properly reset / cleared at the end of `TungstenAggregationQueryWithControlledFallbackSuite`. This ended up causing test failures in HiveCompatibilitySuite in Maven builds by causing spilling to occur way too frequently.

This configuration leak was inadvertently introduced during test cleanup in #9618.
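
A general way to keep such a setting from leaking is to restore it in a `finally` block. The Python sketch below shows the idea under stated assumptions: `conf` is a plain dict standing in for a shared configuration object, and `temporary_setting` is a hypothetical helper, not part of Spark's test utilities.

```python
from contextlib import contextmanager

conf = {}  # hypothetical stand-in for a shared configuration object


@contextmanager
def temporary_setting(settings, key, value):
    """Set key for the duration of the block, then restore the old state."""
    missing = object()
    previous = settings.get(key, missing)
    settings[key] = value
    try:
        yield
    finally:
        # Runs even if the body raises, so later suites see the old state.
        if previous is missing:
            settings.pop(key, None)
        else:
            settings[key] = previous


KEY = "spark.sql.TungstenAggregate.testFallbackStartsAt"
with temporary_setting(conf, KEY, "2"):
    inside = conf[KEY]
```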

Author: Josh Rosen <joshrosen@databricks.com>

Closes #9857 from JoshRosen/clear-fallback-prop-in-test-teardown.

a66142de

[SPARK-11867] Add save/load for kmeans and naive bayes · 3e1d120c

Xusen Yin authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-11867

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9849 from yinxusen/SPARK-11867.

3e1d120c

[SPARK-11869][ML] Clean up TempDirectory properly in ML tests · 0fff8eb3

Joseph K. Bradley authored 9 years ago

Need to remove parent directory (```className```) rather than just tempDir (```className/random_name```)
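
The effect of removing only the inner directory can be reproduced with the Python standard library. The directory names below are placeholders mirroring the `className/random_name` layout described above; nothing here is taken from the actual test code.

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()                    # scratch root for this sketch
parent = os.path.join(base, "SomeSuite")     # plays the role of className
leaf = os.path.join(parent, "random_name")   # plays the role of tempDir
os.makedirs(leaf)

# Removing only the leaf leaves the parent directory behind.
shutil.rmtree(leaf)
parent_left_behind = os.path.isdir(parent)

# Removing the parent cleans up the whole per-suite tree.
shutil.rmtree(parent)
parent_removed = not os.path.exists(parent)

shutil.rmtree(base)
```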

I tested this with IDFSuite, which has 2 read/write tests, and it fixes the problem.

CC: mengxr Can you confirm this is fine? I believe it is since the same ```random_name``` is used for all tests in a suite; we basically have an extra unneeded level of nesting.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9851 from jkbradley/tempdir-cleanup.

0fff8eb3

[SPARK-11875][ML][PYSPARK] Update doc for PySpark HasCheckpointInterval · 7216f405

Yanbo Liang authored 9 years ago

* Update the doc for PySpark ```HasCheckpointInterval``` so that users can understand how to disable checkpointing.
* Update the doc for PySpark ```cacheNodeIds``` of ```DecisionTreeParams``` to explain the relationship between ```cacheNodeIds``` and ```checkpointInterval```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9856 from yanboliang/spark-11875.

7216f405

[SPARK-11829][ML] Add read/write to estimators under ml.feature (II) · 3b7f056d

Yanbo Liang authored 9 years ago

Add read/write support to the following estimators under spark.ml:
* ChiSqSelector
* PCA
* VectorIndexer
* Word2Vec

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9838 from yanboliang/spark-11829.

3b7f056d

[SPARK-11846] Add save/load for AFTSurvivalRegression and IsotonicRegression · 4114ce20

Xusen Yin authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-11846

mengxr

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9836 from yinxusen/SPARK-11846.

4114ce20

Nov 19, 2015

[SPARK-11544][SQL][TEST-HADOOP1.0] sqlContext doesn't use PathFilter · 7ee7d5a3

Dilip Biswal authored 9 years ago

Apply the user-supplied PathFilter while retrieving the files from the file system.

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #9830 from dilipbiswal/spark-11544.

7ee7d5a3

[SPARK-11864][SQL] Improve performance of max/min · ee214077

Davies Liu authored 9 years ago

This PR contains the following optimizations:

1) greatest/least already do the null check, so the `If` and `IsNull` are not necessary.

2) greatest/least should initialize the result using the first child (removing one block).

3) For primitive types, the generated greater-than expression is too complicated (`(a > b ? 1 : (a < b) ? -1 : 0) > 0`); it should be as simple as `a > b`.

Combined, these optimizations could improve the performance of the `ss_max` query by 30%.
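
For the third point, the equivalence of the two forms can be checked with a small Python sketch. The real change is in generated Java code; the function names here are hypothetical, and only plain integers are compared.

```python
import itertools


def greater_via_compare(a, b):
    """Three-way compare first, then test the sign (the complicated form)."""
    cmp = 1 if a > b else (-1 if a < b else 0)
    return cmp > 0


def greater_direct(a, b):
    """The simplified form for primitive values."""
    return a > b


# Check every ordered pair drawn from a few sample integers.
values = [-3, 0, 1, 7]
all_agree = all(
    greater_via_compare(a, b) == greater_direct(a, b)
    for a, b in itertools.product(values, repeat=2)
)
```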

Author: Davies Liu <davies@databricks.com>

Closes #9846 from davies/improve_max.

ee214077

[SPARK-11845][STREAMING][TEST] Added unit test to verify TrackStateRDD is correctly checkpointed · b2cecb80

Tathagata Das authored 9 years ago

To make sure that all lineage is correctly truncated for TrackStateRDD when checkpointed.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #9831 from tdas/SPARK-11845.

b2cecb80

[SPARK-4134][CORE] Lower severity of some executor loss logs. · 880128f3

Marcelo Vanzin authored 9 years ago

Don't log ERROR messages when executors are explicitly killed or when
the exit reason is not yet known.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9780 from vanzin/SPARK-11789.

880128f3

[SPARK-11275][SQL] Incorrect results when using rollup/cube · 37cff1b1

Andrew Ray authored 9 years ago

Fixes bug with grouping sets (including cube/rollup) where aggregates that included grouping expressions would return the wrong (null) result.

Also simplifies the analyzer rule a bit and leaves column pruning to the optimizer.

Added multiple unit tests to DataFrameAggregateSuite and verified it passes hive compatibility suite:
```
build/sbt -Phive -Dspark.hive.whitelist='groupby.*_grouping.*' 'test-only org.apache.spark.sql.hive.execution.HiveCompatibilitySuite'
```

This is an alternative to PR https://github.com/apache/spark/pull/9419, but I think it's better as it simplifies the analyzer rule instead of adding another special case to it.

Author: Andrew Ray <ray.andrew@gmail.com>

Closes #9815 from aray/groupingset-agg-fix.

37cff1b1

[SPARK-11746][CORE] Use cache-aware method dependencies · 01403aa9

hushan authored 9 years ago

a small change

Author: hushan <hushan@xiaomi.com>

Closes #9691 from suyanNone/unify-getDependency.

01403aa9

[SPARK-11828][CORE] Register DAGScheduler metrics source after app id is known. · f7135ed7
Marcelo Vanzin authored 9 years ago
```
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9820 from vanzin/SPARK-11828.
```
f7135ed7

[SPARK-11799][CORE] Make it explicit in executor logs that uncaught exceptions are thrown during executor shutdown · 3bd77b21

Srinivasa Reddy Vundela authored 9 years ago

This commit makes sure that uncaught exceptions are prepended with [Container in shutdown] when the JVM is shutting down.

Author: Srinivasa Reddy Vundela <vsr@cloudera.com>

Closes #9809 from vundela/master_11799.

3bd77b21

[SPARK-11831][CORE][TESTS] Use port 0 to avoid port conflicts in tests · 90d384dc

Shixiong Zhu authored 9 years ago

Use port 0 to fix port-contention-related flakiness
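
The mechanism is the same in any language with a sockets API; here is a minimal Python sketch rather than the Scala test code. Binding to port 0 lets the operating system choose a free port, which is then read back from the socket.

```python
import socket

# Binding to port 0 asks the operating system for any free ephemeral port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)

# The port that was actually assigned has to be read back from the socket.
assigned_port = server.getsockname()[1]
server.close()
```

Tests that hard-code a port can collide when run in parallel; asking the OS for a port avoids that.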

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9841 from zsxwing/SPARK-11831.

90d384dc

[SPARK-11858][SQL] Move sql.columnar into sql.execution. · 014c0f7a

Reynold Xin authored 9 years ago

In addition, tightened visibility of a lot of classes in the columnar package from private[sql] to private[columnar].

Author: Reynold Xin <rxin@databricks.com>

Closes #9842 from rxin/SPARK-11858.

014c0f7a

[SPARK-11812][PYSPARK] invFunc=None works properly with python's reduceByKeyAndWindow · 599a8c6e

David Tolpin authored 9 years ago

invFunc is optional and can be None. Instead of invFunc (the parameter), invReduceFunc (a local function) was checked for truthiness (that is, not None, in this context). A local function is never None, so the case of invFunc=None (a common one when inverse reduction is not defined) was handled incorrectly, resulting in loss of data.

In addition, the docstring used the wrong parameter names; this is also fixed.
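
The bug pattern is easy to reproduce in isolation. The sketch below is not the PySpark source: the two `choose_strategy_*` functions and the `"incremental"` / `"recompute"` labels are hypothetical, and only the truthiness check mirrors the description above.

```python
def choose_strategy_buggy(func, invFunc=None):
    def invReduceFunc(a, b):
        return invFunc(a, b)

    # Bug: a locally defined function is always truthy, so this branch
    # is taken even when invFunc is None.
    if invReduceFunc:
        return "incremental"
    return "recompute"


def choose_strategy_fixed(func, invFunc=None):
    def invReduceFunc(a, b):
        return invFunc(a, b)

    # Fix: test the parameter the caller actually passed.
    if invFunc:
        return "incremental"
    return "recompute"


def add(a, b):
    return a + b


# No inverse function is supplied in either call.
buggy = choose_strategy_buggy(add)
fixed = choose_strategy_fixed(add)
```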

Author: David Tolpin <david.tolpin@gmail.com>

Closes #9775 from dtolpin/master.

599a8c6e

[SPARK-11778][SQL] parse table name before it is passed to lookupRelation · 47000745

Huaxin Gao authored 9 years ago

Fix a bug in DataFrameReader.table (a table name qualified with a schema name, such as "db_name.table", doesn't work).
Use SqlParser.parseTableIdentifier to parse the table name before lookupRelation.

Author: Huaxin Gao <huaxing@oc0558782468.ibm.com>

Closes #9773 from huaxingao/spark-11778.

47000745

[SPARK-11750][SQL] revert SPARK-11727 and code clean up · 47d1c232

Wenchen Fan authored 9 years ago

After some experimentation, I found it's not convenient to have separate encoder builders: `FlatEncoder` and `ProductEncoder`. For example, when creating encoders for `ScalaUDF`, we have no idea whether the type `T` is flat or not. So I reverted the splitting change in https://github.com/apache/spark/pull/9693, while still keeping the bug fixes and tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9726 from cloud-fan/follow.

47d1c232

[SPARK-11848][SQL] Support EXPLAIN in DataSet APIs · 7d4aba18

gatorsmile authored 9 years ago

When debugging the DataSet API, I always need to print the logical and physical plans.

I am wondering if we should provide a simple API for EXPLAIN?

Author: gatorsmile <gatorsmile@gmail.com>

Closes #9832 from gatorsmile/explainDS.

7d4aba18

[SPARK-11633][SQL] LogicalRDD throws TreeNode Exception : Failed to Copy Node · 276a7e13

gatorsmile authored 9 years ago

When handling self joins, the implementation did not consider the case insensitivity of HiveContext. It could cause an exception as shown in the JIRA:
```
TreeNodeException: Failed to copy node.
```

The fix is low risk. It avoids unnecessary attribute replacement. It should not affect the existing behavior of self joins. Also added a test case to cover this scenario.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #9762 from gatorsmile/joinMakeCopy.

276a7e13

[SPARK-11830][CORE] Make NettyRpcEnv bind to the specified host · 72d150c2

zsxwing authored 9 years ago

This PR includes the following changes:

1. Bind NettyRpcEnv to the specified host.
2. Fix the port information in the log for NettyRpcEnv.
3. Fix the service name of NettyRpcEnv.

Author: zsxwing <zsxwing@gmail.com>
Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9821 from zsxwing/SPARK-11830.

72d150c2