  1. Nov 23, 2015
    • [SPARK-9866][SQL] Speed up VersionsSuite by using persistent Ivy cache · 9db5f601
      Josh Rosen authored
      This patch attempts to speed up VersionsSuite by storing fetched Hive JARs in an Ivy cache that persists across test runs. If `SPARK_VERSIONS_SUITE_IVY_PATH` is set, that path will be used for the cache; if it is not set, VersionsSuite will create a temporary Ivy cache which is deleted after the test completes.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9624 from JoshRosen/SPARK-9866.
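      As a rough sketch of the behavior described above (the helper layout is illustrative, not the actual VersionsSuite code):
      ```scala
      // Use a persistent Ivy cache if SPARK_VERSIONS_SUITE_IVY_PATH is set;
      // otherwise fall back to a temporary directory cleaned up after the run.
      val ivyCachePath: String = sys.env.get("SPARK_VERSIONS_SUITE_IVY_PATH") match {
        case Some(path) => path  // persists across test runs, so Hive JARs are fetched once
        case None =>
          val tmp = java.nio.file.Files.createTempDirectory("hive-ivy-cache")
          tmp.toFile.deleteOnExit()  // deleted after the test completes
          tmp.toString
      }
      ```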
    • [SPARK-11140][CORE] Transfer files using network lib when using NettyRpcEnv. · c2467dad
      Marcelo Vanzin authored
      This change abstracts the code that serves jars / files to executors so that
      each RpcEnv can have its own implementation: the akka version uses the existing
      HTTP-based file serving mechanism, while the netty version uses the new
      stream support added to the network lib. This lets file transfers benefit
      from the easier security configuration of the network library, and should also
      reduce overhead overall.
      
      The change includes a small fix to TransportChannelHandler so that it propagates
      user events to downstream handlers.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #9530 from vanzin/SPARK-11140.
    • [SPARK-11865][NETWORK] Avoid returning inactive client in TransportClientFactory. · 7cfa4c6b
      Marcelo Vanzin authored
      There's a very narrow race here where it would be possible for the timeout handler
      to close a channel after the client factory verified that the channel was still
      active. This change makes sure the client is marked as being recently in use so
      that the timeout handler does not close it until a new timeout cycle elapses.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #9853 from vanzin/SPARK-11865.
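      A minimal sketch of the idea, using hypothetical types rather than the real TransportClientFactory code:
      ```scala
      // A pooled client tracks when it was last handed out; the factory marks it
      // as recently used *before* returning it, so the timeout handler will not
      // close it until a fresh timeout cycle elapses.
      class PooledClient {
        @volatile private var lastUsedNs: Long = System.nanoTime()
        def markRecentlyUsed(): Unit = { lastUsedNs = System.nanoTime() }
        def isIdle(timeoutNs: Long): Boolean = System.nanoTime() - lastUsedNs > timeoutNs
      }

      def checkout(client: PooledClient): PooledClient = {
        client.markRecentlyUsed()  // closes the race with the idle-timeout handler
        client
      }
      ```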
    • [SPARK-11910][STREAMING][DOCS] Update twitter4j dependency version · 242be7da
      Luciano Resende authored
      Author: Luciano Resende <lresende@apache.org>
      
      Closes #9892 from lresende/SPARK-11910.
    • [SPARK-11836][SQL] udf/cast should not create new SQLContext · 1d912020
      Davies Liu authored
      They should use the existing SQLContext.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9914 from davies/create_udf.
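      Presumably via something like the pattern below (a sketch, not the patch itself; `SQLContext.getOrCreate` is the 1.x API for reusing an active context):
      ```scala
      import org.apache.spark.SparkContext
      import org.apache.spark.sql.SQLContext

      // Reuse the existing context for this SparkContext instead of
      // constructing a fresh SQLContext inside udf/cast handling.
      def existingContext(sc: SparkContext): SQLContext = SQLContext.getOrCreate(sc)
      ```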
    • [SPARK-4424] Remove spark.driver.allowMultipleContexts override in tests · 1b6e938b
      Josh Rosen authored
      This patch removes `spark.driver.allowMultipleContexts=true` from our test configuration. The multiple SparkContexts check was originally disabled because certain tests suites in SQL needed to create multiple contexts. As far as I know, this configuration change is no longer necessary, so we should remove it in order to make it easier to find test cleanup bugs.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9865 from JoshRosen/SPARK-4424.
    • [SPARK-11837][EC2] python3 compatibility for launching ec2 m3 instances · f6dcc6e9
      Mortada Mehyar authored
      This currently breaks for python3 because the `string` module no longer has `letters`; `string.ascii_letters` should be used instead.
      
      Author: Mortada Mehyar <mortada.mehyar@gmail.com>
      
      Closes #9797 from mortada/python3_fix.
    • [SPARK-11920][ML][DOC] ML LinearRegression should use correct dataset in examples and user guide doc · 98d7ec7d
      Yanbo Liang authored
      ML `LinearRegression` uses `data/mllib/sample_libsvm_data.txt` as the dataset in examples and the user guide doc, but it's actually a classification dataset rather than a regression dataset. We should use `data/mllib/sample_linear_regression_data.txt` instead.
      The deeper cause is that `LinearRegression` with the "normal" solver cannot solve this dataset correctly, possibly due to ill-conditioning and unreasonable labels. This issue has been reported at [SPARK-11918](https://issues.apache.org/jira/browse/SPARK-11918).
      It will confuse users if they run the example code but get an exception, so we should make this change, which clearly illustrates the usage of the `LinearRegression` algorithm.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9905 from yanboliang/spark-11920.
    • [SPARK-11762][NETWORK] Account for active streams when counting outstanding requests. · 5231cd5a
      Marcelo Vanzin authored
      This way the timeout handling code can correctly close "hung" channels that are
      processing streams.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #9747 from vanzin/SPARK-11762.
    • [SPARK-7173][YARN] Add label expression support for application master · 5fd86e4f
      jerryshao authored
      Add label expression support for the AM to restrict it to run on a specific set of nodes. I tested it locally and it works fine.
      
      sryza and vanzin please help to review, thanks a lot.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #9800 from jerryshao/SPARK-7173.
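      Usage would look roughly like the following; the property name `spark.yarn.am.nodeLabelExpression` is my recollection of what this patch adds, so treat it as an assumption:
      ```scala
      import org.apache.spark.SparkConf

      // Restrict the YARN application master to nodes carrying a given label.
      // Property name assumed by analogy with the executor-side
      // spark.yarn.executor.nodeLabelExpression convention.
      val conf = new SparkConf()
        .set("spark.yarn.am.nodeLabelExpression", "stable-nodes")
      ```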
    • [SPARK-11913][SQL] support typed aggregate with complex buffer schema · 946b4065
      Wenchen Fan authored
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9898 from cloud-fan/agg.
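      A hedged sketch of what a complex buffer schema means here, written against the 1.6-era `Aggregator` API as I recall it (names are illustrative, not from the patch):
      ```scala
      import org.apache.spark.sql.expressions.Aggregator

      case class AvgBuffer(sum: Double, count: Long)  // a non-primitive buffer schema

      object TypedAvg extends Aggregator[Double, AvgBuffer, Double] {
        def zero: AvgBuffer = AvgBuffer(0.0, 0L)
        def reduce(b: AvgBuffer, a: Double): AvgBuffer = AvgBuffer(b.sum + a, b.count + 1)
        def merge(b1: AvgBuffer, b2: AvgBuffer): AvgBuffer =
          AvgBuffer(b1.sum + b2.sum, b1.count + b2.count)
        def finish(r: AvgBuffer): Double = if (r.count == 0) 0.0 else r.sum / r.count
      }
      ```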
    • [SPARK-11921][SQL] fix `nullable` of encoder schema · f2996e0d
      Wenchen Fan authored
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9906 from cloud-fan/nullable.
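      A small illustration of the expected behavior, assuming the 1.6-era `Encoders.product` and `Encoder.schema` API:
      ```scala
      import org.apache.spark.sql.Encoders

      case class Person(name: String, age: Int)

      // Primitive fields should be reported as non-nullable, object fields as nullable.
      Encoders.product[Person].schema.foreach { f =>
        println(s"${f.name}: nullable = ${f.nullable}")  // name: true, age: false
      }
      ```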
    • [SPARK-11894][SQL] fix isNull for GetInternalRowField · 1a5baaa6
      Wenchen Fan authored
      We should use `InternalRow.isNullAt` to check whether the field is null before calling `InternalRow.getXXX`.
      
      Thanks gatorsmile who discovered this bug.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9904 from cloud-fan/null.
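      The general pattern being enforced, as a sketch:
      ```scala
      import org.apache.spark.sql.catalyst.InternalRow

      // Check isNullAt before the typed getter; calling getInt on a null slot
      // silently returns a garbage value instead of signalling null.
      def readIntField(row: InternalRow, ordinal: Int): Option[Int] =
        if (row.isNullAt(ordinal)) None else Some(row.getInt(ordinal))
      ```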
    • [SPARK-11628][SQL] support column datatype of char(x) to recognize HiveChar · 94ce65df
      Xiu Guo authored
      Can someone review my code to make sure I'm not missing anything? Thanks!
      
      Author: Xiu Guo <xguo27@gmail.com>
      Author: Xiu Guo <guoxi@us.ibm.com>
      
      Closes #9612 from xguo27/SPARK-11628.
    • [SPARK-11902][ML] Unhandled case in VectorAssembler#transform · 4be360d4
      BenFradet authored
      There is an unhandled case in the transform method of VectorAssembler if one of the input columns doesn't have one of the supported types: DoubleType, NumericType, BooleanType, or VectorUDT.

      So, if you try to transform a column of StringType you get a cryptic "scala.MatchError: StringType".

      This PR fixes this by throwing a SparkException when dealing with an unknown column type.
      
      Author: BenFradet <benjamin.fradet@gmail.com>
      
      Closes #9885 from BenFradet/SPARK-11902.
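      A sketch of the shape of the fix (illustrative, not the actual VectorAssembler code; `VectorUDT` is omitted from the match because it is private to MLlib):
      ```scala
      import org.apache.spark.SparkException
      import org.apache.spark.sql.types._

      // Give the type match a default case so an unsupported input column raises
      // a descriptive SparkException instead of a cryptic scala.MatchError.
      def checkInputColumnType(dataType: DataType): Unit = dataType match {
        case _: NumericType | BooleanType => ()  // supported scalar types
        case other =>
          throw new SparkException(s"VectorAssembler does not support the $other type")
      }
      ```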
  2. Nov 22, 2015
  3. Nov 21, 2015
  4. Nov 20, 2015
    • Xiangrui Meng
    • [HOTFIX] Fix Java Dataset Tests · 47815878
      Michael Armbrust authored
    • [SPARK-11890][SQL] Fix compilation for Scala 2.11 · 68ed0468
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #9871 from marmbrus/scala211-break.
    • [SPARK-11889][SQL] Fix type inference for GroupedDataset.agg in REPL · 968acf3b
      Michael Armbrust authored
      In this PR I delete a method that breaks type inference for aggregators (only in the REPL).
      
      The error when this method is present is:
      ```
      <console>:38: error: missing parameter type for expanded function ((x$2) => x$2._2)
                    ds.groupBy(_._1).agg(sum(_._2), sum(_._3)).collect()
      ```
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #9870 from marmbrus/dataset-repl-agg.
    • [SPARK-11787][SPARK-11883][SQL][FOLLOW-UP] Cleanup for this patch. · 58b4e4f8
      Nong Li authored
      This mainly moves SqlNewHadoopRDD to the sql package. There is some state that is
      shared with core, and I've left that in core. This allows some other associated
      minor cleanup.
      
      Author: Nong Li <nong@databricks.com>
      
      Closes #9845 from nongli/spark-11787.
    • [SPARK-11549][DOCS] Replace example code in mllib-evaluation-metrics.md using include_example · ed47b1e6
      Vikas Nelamangala authored
      Author: Vikas Nelamangala <vikasnelamangala@Vikass-MacBook-Pro.local>
      
      Closes #9689 from vikasnp/master.
    • [SPARK-11636][SQL] Support classes defined in the REPL with Encoders · 4b84c72d
      Michael Armbrust authored
      #theScaryParts (i.e. changes to the repl, executor classloaders and codegen)...
      
      Author: Michael Armbrust <michael@databricks.com>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #9825 from marmbrus/dataset-replClasses2.
    • [SPARK-11756][SPARKR] Fix use of aliases - SparkR can not output help information for SparkR:::summary correctly · a6239d58
      felixcheung authored
      Fix use of aliases and change uses of rdname and seealso.
      `aliases` is the hint for `?` - it should not be linked to some other name; those links should be seealso.
      https://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html

      Clean up usage of family, as multiple uses of family with the same rdname cause duplicated See Also html blocks (like http://spark.apache.org/docs/latest/api/R/count.html)
      Also change some rdnames to the dplyr-like variant for better R user visibility in the R doc, e.g. rbind, summary, mutate, summarize
      
      shivaram yanboliang
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #9750 from felixcheung/rdocaliases.
    • [SPARK-11716][SQL] UDFRegistration just drops the input type when re-creating the UserDefinedFunction · 03ba56d7
      Jean-Baptiste Onofré authored
      https://issues.apache.org/jira/browse/SPARK-11716

      This one is #9739 plus a regression test. When committing it, please make sure the author is jbonofre.
      
      You can find the original PR at https://github.com/apache/spark/pull/9739
      
      closes #9739
      
      Author: Jean-Baptiste Onofré <jbonofre@apache.org>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #9868 from yhuai/SPARK-11716.
    • [SPARK-11887] Close PersistenceEngine at the end of PersistenceEngineSuite tests · 89fd9bd0
      Josh Rosen authored
      In PersistenceEngineSuite, we do not call `close()` on the PersistenceEngine at the end of the test. For the ZooKeeperPersistenceEngine, this causes us to leak a ZooKeeper client, causing the logs of unrelated tests to be periodically spammed with connection error messages from that client:
      
      ```
      15/11/20 05:13:35.789 pool-1-thread-1-ScalaTest-running-PersistenceEngineSuite-SendThread(localhost:15741) INFO ClientCnxn: Opening socket connection to server localhost/127.0.0.1:15741. Will not attempt to authenticate using SASL (unknown error)
      15/11/20 05:13:35.790 pool-1-thread-1-ScalaTest-running-PersistenceEngineSuite-SendThread(localhost:15741) WARN ClientCnxn: Session 0x15124ff48dd0000 for server null, unexpected error, closing socket connection and attempting reconnect
      java.net.ConnectException: Connection refused
      	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
      	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
      	at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
      	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
      ```
      
      This patch fixes this by using a `finally` block.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9864 from JoshRosen/close-zookeeper-client-in-tests.
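      The shape of the fix, sketched with a self-contained stand-in for the engine type:
      ```scala
      // Always run close(), even when assertions in the test body throw; for
      // ZooKeeperPersistenceEngine this releases the leaked ZooKeeper client.
      trait Closeable { def close(): Unit }

      def withEngine[E <: Closeable, T](engine: E)(testBody: E => T): T =
        try testBody(engine) finally engine.close()
      ```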
    • [SPARK-11870][STREAMING][PYSPARK] Rethrow the exceptions in TransformFunction and TransformFunctionSerializer · be7a2cfd
      Shixiong Zhu authored
      TransformFunction and TransformFunctionSerializer don't rethrow exceptions, so when any exception happens they just return None. This causes weird NPEs and confuses people.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #9847 from zsxwing/pyspark-streaming-exception.
    • [SPARK-11724][SQL] Change casting between int and timestamp to consistently treat int in seconds. · 9ed4ad42
      Nong Li authored
      Hive has since changed this behavior as well. https://issues.apache.org/jira/browse/HIVE-3454
      
      Author: Nong Li <nong@databricks.com>
      Author: Nong Li <nongli@gmail.com>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #9685 from nongli/spark-11724.
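      A hedged illustration of the consistent semantics, assuming a `sqlContext` in scope (as in spark-shell): the int is treated as seconds since the epoch in both directions, so the round trip is the identity.
      ```scala
      // Cast int -> timestamp -> int; with both directions in seconds this should
      // print 60. Before the change the two directions disagreed about the unit.
      sqlContext.sql("SELECT CAST(CAST(60 AS TIMESTAMP) AS INT)").show()
      ```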
    • [SPARK-11650] Reduce RPC timeouts to speed up slow AkkaUtilsSuite test · 652def31
      Josh Rosen authored
      This patch reduces some RPC timeouts in order to speed up the slow "AkkaUtilsSuite.remote fetch ssl on - untrusted server", which used to take two minutes to run.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9869 from JoshRosen/SPARK-11650.
    • [SPARK-11819][SQL] nice error message for missing encoder · 3b9d2a34
      Wenchen Fan authored
      Before this PR, when users tried to get an encoder for an unsupported class, they would only get a very simple error message like `Encoder for type xxx is not supported`.

      After this PR, the error message becomes friendlier, for example:
      ```
      No Encoder found for abc.xyz.NonEncodable
      - array element class: "abc.xyz.NonEncodable"
      - field (class: "scala.Array", name: "arrayField")
      - root class: "abc.xyz.AnotherClass"
      ```
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9810 from cloud-fan/error-message.
    • [SPARK-11817][SQL] Truncating the fractional seconds to prevent inserting a NULL · 60bfb113
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-11817
      
      Instead of returning None, we should truncate the fractional seconds to prevent inserting NULL.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #9834 from viirya/truncate-fractional-sec.
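      A hedged illustration, assuming a `sqlContext` in scope (as in spark-shell) and Catalyst's microsecond timestamp precision:
      ```scala
      // Fractional seconds beyond microsecond precision should now be truncated
      // rather than turning the whole value into NULL.
      sqlContext.sql("SELECT CAST('2015-11-20 10:00:00.1234567' AS TIMESTAMP)").show()
      // expected: 2015-11-20 10:00:00.123456, not NULL
      ```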
    • [SPARK-11876][SQL] Support printSchema in DataSet API · bef361c5
      gatorsmile authored
      DataSet APIs look great! However, I am lost when doing multi-level joins. For example,
      ```
      val ds1 = Seq(("a", 1), ("b", 2)).toDS().as("a")
      val ds2 = Seq(("a", 1), ("b", 2)).toDS().as("b")
      val ds3 = Seq(("a", 1), ("b", 2)).toDS().as("c")
      
      ds1.joinWith(ds2, $"a._2" === $"b._2").as("ab").joinWith(ds3, $"ab._1._2" === $"c._2").printSchema()
      ```
      
      The printed schema is like
      ```
      root
       |-- _1: struct (nullable = true)
       |    |-- _1: struct (nullable = true)
       |    |    |-- _1: string (nullable = true)
       |    |    |-- _2: integer (nullable = true)
       |    |-- _2: struct (nullable = true)
       |    |    |-- _1: string (nullable = true)
       |    |    |-- _2: integer (nullable = true)
       |-- _2: struct (nullable = true)
       |    |-- _1: string (nullable = true)
       |    |-- _2: integer (nullable = true)
      ```
      
      Personally, I think we need the printSchema function. Sometimes, I do not know how to specify the column, especially when their data types are mixed. For example, if I want to write the following select for the above multi-level join, I have to know the schema:
      ```
      newDS.select(expr("_1._2._2 + 1").as[Int]).collect()
      ```
      
      marmbrus rxin cloud-fan  Do you have the same feeling?
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #9855 from gatorsmile/printSchemaDataSet.