  1. Nov 06, 2015
  2. Nov 05, 2015
    • Michael Armbrust's avatar
      [SPARK-11528] [SQL] Typed aggregations for Datasets · 363a476c
      Michael Armbrust authored
      This PR adds the ability to do typed SQL aggregations.  We will likely also want to provide an interface to allow users to do aggregations on objects, but this is deferred to another PR.
      
      ```scala
      val ds = Seq(("a", 10), ("a", 20), ("b", 1), ("b", 2), ("c", 1)).toDS()
      ds.groupBy(_._1).agg(sum("_2").as[Int]).collect()
      
res0: Array[(String, Int)] = Array((a,30), (b,3), (c,1))
      ```
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #9499 from marmbrus/dataset-agg.
      363a476c
    • Davies Liu's avatar
      [SPARK-7542][SQL] Support off-heap index/sort buffer · eec74ba8
      Davies Liu authored
This brings support for off-heap memory for the arrays inside BytesToBytesMap and InMemorySorter, so that all execution memory can be allocated off-heap.
      
      Closes #8068
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9477 from davies/unsafe_timsort.
      eec74ba8
    • Reynold Xin's avatar
      [SPARK-11540][SQL] API audit for QueryExecutionListener. · 3cc2c053
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9509 from rxin/SPARK-11540.
      3cc2c053
    • Marcelo Vanzin's avatar
      [SPARK-11538][BUILD] Force guava 14 in sbt build. · 5e31db70
      Marcelo Vanzin authored
      sbt's version resolution code always picks the most recent version, and we
      don't want that for guava.
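For reference, a minimal sketch of the usual sbt way to pin a transitive dependency; the coordinates below are illustrative and may not match the exact change in this PR:

```scala
// build.sbt (illustrative): force guava 14 so sbt's latest-wins resolution
// does not pull in a newer version from transitive dependencies
dependencyOverrides += "com.google.guava" % "guava" % "14.0.1"
```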
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #9508 from vanzin/SPARK-11538.
      5e31db70
    • jerryshao's avatar
      [SPARK-11457][STREAMING][YARN] Fix incorrect AM proxy filter conf recovery from checkpoint · 468ad0ae
      jerryshao authored
Currently the YARN AM proxy filter configuration is recovered from the checkpoint file when a Spark Streaming application is restarted, which leads to some unwanted behaviors:

1. Wrong RM address if the RM is redeployed after a failure.
2. Wrong proxyBase, since the app id changes on restart and the old app id used for proxyBase is stale.

So instead of recovering them from the checkpoint file, these configurations should be reloaded each time the application starts.
      
This problem only exists in YARN cluster mode; in YARN client mode, these configurations are updated through the RPC message `AddWebUIFilter`.
      
Please help to review, tdas harishreedharan vanzin, thanks a lot.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #9412 from jerryshao/SPARK-11457.
      468ad0ae
    • Yu ISHIKAWA's avatar
      [SPARK-11514][ML] Pass random seed to spark.ml DecisionTree* · 8fa8c837
      Yu ISHIKAWA authored
      cc jkbradley
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #9486 from yu-iskw/SPARK-11514.
      8fa8c837
    • Reynold Xin's avatar
      6091e91f
    • Davies Liu's avatar
      [SPARK-11537] [SQL] fix negative hours/minutes/seconds · 07414afa
      Davies Liu authored
Currently, if the Timestamp is before the epoch (1970/01/01), the hours, minutes, and seconds will be negative (and also rounded up).
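A minimal sketch of the underlying idea (not the actual DateTimeUtils code): use a floored modulus so pre-epoch timestamps still map into the expected ranges.

```scala
// floored modulus: always returns a value in [0, m), even for negative x
def floorMod(x: Long, m: Long): Long = ((x % m) + m) % m

// hour-of-day from seconds since epoch; plain % would give a negative result for x < 0
def hourOfDay(secondsSinceEpoch: Long): Long =
  floorMod(secondsSinceEpoch, 86400L) / 3600L

hourOfDay(-1L)    // 23 (1969-12-31 23:59:59 UTC), not a negative hour
hourOfDay(3661L)  // 1
```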
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9502 from davies/neg_hour.
      07414afa
    • Davies Liu's avatar
[SPARK-11542] [SPARKR] fix glm with long formula · 24401062
      Davies Liu authored
Because `deparse()` breaks a long formula string into multiple lines, deserialization fails.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9510 from davies/fix_glm.
      24401062
    • Reynold Xin's avatar
      [SPARK-11536][SQL] Remove the internal implicit conversion from Expression to... · b6974f8f
      Reynold Xin authored
      [SPARK-11536][SQL] Remove the internal implicit conversion from Expression to Column in functions.scala
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9505 from rxin/SPARK-11536.
      b6974f8f
    • Wenchen Fan's avatar
      [SPARK-10656][SQL] completely support special chars in DataFrame · d9e30c59
      Wenchen Fan authored
The main problem is that we interpret column names with special handling of `.` for DataFrame. This enables us to write something like `df("a.b")` to get the field `b` of column `a`. However, we don't need this feature in `DataFrame.apply("*")` or `DataFrame.withColumnRenamed`: in these two cases the column name is already the final name, so no extra interpretation is needed.
      
The solution is simple: use `queryExecution.analyzed.output` to get the resolved columns directly, instead of using `DataFrame.resolve`.
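A toy sketch of the two behaviors described above (column names are made up, not taken from the PR's tests; assumes an existing SQLContext named `sqlContext`):

```scala
import sqlContext.implicits._

val df = Seq((1, 2)).toDF("a.b", "c")

// apply() keeps its special handling of `.`, so a literal "a.b" column
// is addressed with backticks when selecting by expression
df.select(df("`a.b`"))

// withColumnRenamed and apply("*") treat the name as final: no `.` interpretation
df.withColumnRenamed("a.b", "a_b")
df.select(df("*"))
```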
      
      close https://github.com/apache/spark/pull/8811
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9462 from cloud-fan/special-chars.
      d9e30c59
    • adrian555's avatar
      [SPARK-11260][SPARKR] with() function support · b9455d1f
      adrian555 authored
      Author: adrian555 <wzhuang@us.ibm.com>
      Author: Adrian Zhuang <adrian555@users.noreply.github.com>
      
      Closes #9443 from adrian555/with.
      b9455d1f
    • Reynold Xin's avatar
      [SPARK-11532][SQL] Remove implicit conversion from Expression to Column · 8a5314ef
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9500 from rxin/SPARK-11532.
      8a5314ef
    • Travis Hegner's avatar
      [SPARK-10648] Oracle dialect to handle nonspecific numeric types · 14ee0f57
      Travis Hegner authored
      This is the alternative/agreed upon solution to PR #8780.
      
This creates an OracleDialect to handle the nonspecific numeric types that can be defined in Oracle.
      
      Author: Travis Hegner <thegner@trilliumit.com>
      
      Closes #9495 from travishegner/OracleDialect.
      14ee0f57
    • Ehsan M.Kermani's avatar
      [SPARK-10265][DOCUMENTATION, ML] Fixed @Since annotation to ml.regression · f80f7b69
      Ehsan M.Kermani authored
      Here is my first commit.
      
      Author: Ehsan M.Kermani <ehsanmo1367@gmail.com>
      
      Closes #8728 from ehsanmok/SinceAnn.
      f80f7b69
    • Reynold Xin's avatar
      [SPARK-11513][SQL] Remove implicit conversion from LogicalPlan to DataFrame · 6b87acd6
      Reynold Xin authored
      This internal implicit conversion has been a source of confusion for a lot of new developers.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9479 from rxin/SPARK-11513.
      6b87acd6
    • Srinivasa Reddy Vundela's avatar
      [SPARK-11484][WEBUI] Using proxyBase set by spark AM · c76865c6
      Srinivasa Reddy Vundela authored
Use the proxyBase set by the AM; if it is not found, fall back to the environment variable. This fixes the issue where somebody accidentally sets APPLICATION_WEB_PROXY_BASE to a wrong proxyBase.
      
      Author: Srinivasa Reddy Vundela <vsr@cloudera.com>
      
      Closes #9448 from vundela/master.
      c76865c6
    • Yanbo Liang's avatar
      [SPARK-11473][ML] R-like summary statistics with intercept for OLS via normal equation solver · 9da7ceed
      Yanbo Liang authored
Follow-up to [SPARK-9836](https://issues.apache.org/jira/browse/SPARK-9836): we should also support summary statistics for the ```intercept```.
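A hedged usage sketch of what this enables (dataset and column names are assumed, not taken from the PR):

```scala
import org.apache.spark.ml.regression.LinearRegression

// `training` is an assumed DataFrame with "label" and "features" columns
val lr = new LinearRegression()
  .setSolver("normal")     // normal equation solver (SPARK-9836)
  .setFitIntercept(true)

val model = lr.fit(training)
val summary = model.summary
summary.coefficientStandardErrors  // with this change, statistics also cover the intercept
summary.tValues
summary.pValues
```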
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9485 from yanboliang/spark-11473.
      9da7ceed
    • Huaxin Gao's avatar
      [SPARK-11474][SQL] change fetchSize to fetchsize · b072ff4d
      Huaxin Gao authored
In DefaultDataSource.scala, `createRelation` is declared as
override def createRelation(
    sqlContext: SQLContext,
    parameters: Map[String, String]): BaseRelation
where `parameters` is a `CaseInsensitiveMap`. After the line
parameters.foreach(kv => properties.setProperty(kv._1, kv._2))
`properties` holds all lower-case key/value pairs, so `fetchSize` becomes `fetchsize`.
However, the `compute` method in `JDBCRDD` reads
val fetchSize = properties.getProperty("fetchSize", "0").toInt
so the `fetchSize` value is always 0 and never gets set correctly.
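A hedged sketch of a read path that exercises this code (the URL and table name are placeholders):

```scala
// the user-facing option is case-insensitive; the CaseInsensitiveMap stores it as "fetchsize"
val df = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host/db")
  .option("dbtable", "some_table")
  .option("fetchSize", "1000")
  .load()
```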
      
      Author: Huaxin Gao <huaxing@oc0558782468.ibm.com>
      
      Closes #9473 from huaxingao/spark-11474.
      b072ff4d
    • Nishkam Ravi's avatar
      [SPARK-11501][CORE][YARN] Propagate spark.rpc config to executors · a4b5cefc
      Nishkam Ravi authored
spark.rpc is supposed to be configurable but currently is not (it doesn't get propagated to executors because RpcEnv.create is called before driver properties are fetched).
      
      Author: Nishkam Ravi <nishkamravi@gmail.com>
      
      Closes #9460 from nishkamravi2/master_akka.
      a4b5cefc
    • Yanbo Liang's avatar
      [SPARK-11527][ML][PYSPARK] PySpark AFTSurvivalRegressionModel should expose... · 2e86cf1b
      Yanbo Liang authored
      [SPARK-11527][ML][PYSPARK] PySpark AFTSurvivalRegressionModel should expose coefficients/intercept/scale
      
      PySpark ```AFTSurvivalRegressionModel``` should expose coefficients/intercept/scale. mengxr vectorijk
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9492 from yanboliang/spark-11527.
      2e86cf1b
    • Yanbo Liang's avatar
      [MINOR][ML][DOC] Rename weights to coefficients in user guide · 72634f27
      Yanbo Liang authored
We should use ```coefficients``` rather than ```weights``` in the user guide so that newcomers learn the conventional name at the outset. mengxr vectorijk
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9493 from yanboliang/docs-coefficients.
      72634f27
    • Cheng Lian's avatar
      [MINOR][SQL] A minor log line fix · 77488fb8
      Cheng Lian authored
      `jars` in the log line is an array, so `$jars` doesn't print its content.
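An illustration of the problem and the usual fix (not the exact log call touched by the patch):

```scala
val jars = Array("a.jar", "b.jar")
println(s"jars: $jars")                   // prints something like "[Ljava.lang.String;@1b6d3586"
println(s"jars: ${jars.mkString(", ")}")  // prints "jars: a.jar, b.jar"
```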
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #9494 from liancheng/minor.log-fix.
      77488fb8
    • a1singh's avatar
      [SPARK-11506][MLLIB] Removed redundant operation in Online LDA implementation · a94671a0
      a1singh authored
      In file LDAOptimizer.scala:
      
Line 441: since `idx` was never used, replaced the unnecessary `zipWithIndex.foreach` with `foreach`.
      
      -      nonEmptyDocs.zipWithIndex.foreach { case ((_, termCounts: Vector), idx: Int) =>
      +      nonEmptyDocs.foreach { case (_, termCounts: Vector) =>
      
      Author: a1singh <a1singh@ucsd.edu>
      
      Closes #9456 from a1singh/master.
      a94671a0
    • Herman van Hovell's avatar
      [SPARK-11449][CORE] PortableDataStream should be a factory · 7bdc9219
      Herman van Hovell authored
      ```PortableDataStream``` maintains some internal state. This makes it tricky to reuse a stream (one needs to call ```close``` on both the ```PortableDataStream``` and the ```InputStream``` it produces).
      
      This PR removes all state from ```PortableDataStream``` and effectively turns it into an ```InputStream```/```Array[Byte]``` factory. This makes the user responsible for managing the ```InputStream``` it returns.
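A hedged sketch of the post-change contract, assuming an existing SparkContext `sc`: the caller opens and closes the stream.

```scala
sc.binaryFiles("hdfs:///path/to/files").foreach { case (path, pds) =>
  val in = pds.open()  // PortableDataStream acts as a factory; open() returns a fresh stream
  try {
    // ... read from `in` ...
  } finally {
    in.close()         // closing is now the caller's responsibility
  }
}
```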
      
      cc srowen
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #9417 from hvanhovell/SPARK-11449.
      7bdc9219
    • Nick Evans's avatar
      [SPARK-11378][STREAMING] make StreamingContext.awaitTerminationOrTimeout return properly · 859dff56
      Nick Evans authored
      This adds a failing test checking that `awaitTerminationOrTimeout` returns the expected value, and then fixes that failing test with the addition of a `return`.
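A hedged usage sketch, assuming `ssc` is a started StreamingContext: with the fix, the returned Boolean actually reflects whether the context stopped within the timeout.

```scala
val stopped: Boolean = ssc.awaitTerminationOrTimeout(10000L)  // wait up to 10 seconds
if (!stopped) {
  // timed out: the streaming context is still running
}
```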
      
      tdas zsxwing
      
      Author: Nick Evans <me@nicolasevans.org>
      
      Closes #9336 from manygrams/fix_await_termination_or_timeout.
      859dff56
    • Sean Owen's avatar
      [SPARK-11440][CORE][STREAMING][BUILD] Declare rest of @Experimental items... · 6f81eae2
      Sean Owen authored
      [SPARK-11440][CORE][STREAMING][BUILD] Declare rest of @Experimental items non-experimental if they've existed since 1.2.0
      
      Remove `Experimental` annotations in core, streaming for items that existed in 1.2.0 or before. The changes are:
      
      * SparkContext
        * binary{Files,Records} : 1.2.0
        * submitJob : 1.0.0
      * JavaSparkContext
        * binary{Files,Records} : 1.2.0
      * DoubleRDDFunctions, JavaDoubleRDD
        * {mean,sum}Approx : 1.0.0
      * PairRDDFunctions, JavaPairRDD
        * sampleByKeyExact : 1.2.0
        * countByKeyApprox : 1.0.0
      * PairRDDFunctions
        * countApproxDistinctByKey : 1.1.0
      * RDD
        * countApprox, countByValueApprox, countApproxDistinct : 1.0.0
      * JavaRDDLike
        * countApprox : 1.0.0
      * PythonHadoopUtil.Converter : 1.1.0
      * PortableDataStream : 1.2.0 (related to binaryFiles)
      * BoundedDouble : 1.0.0
      * PartialResult : 1.0.0
      * StreamingContext, JavaStreamingContext
        * binaryRecordsStream : 1.2.0
      * HiveContext
        * analyze : 1.2.0
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #9396 from srowen/SPARK-11440.
      6f81eae2
  3. Nov 04, 2015
    • Davies Liu's avatar
      [SPARK-11425] [SPARK-11486] Improve hybrid aggregation · 81498dd5
      Davies Liu authored
After aggregation, the dataset could be smaller than the inputs, so it's better to do hash-based aggregation for all inputs, then use sort-based aggregation to merge them.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9383 from davies/fix_switch.
      81498dd5
    • Josh Rosen's avatar
      [SPARK-11307] Reduce memory consumption of OutputCommitCoordinator · d0b56339
      Josh Rosen authored
      OutputCommitCoordinator uses a map in a place where an array would suffice, increasing its memory consumption for result stages with millions of tasks.
      
      This patch replaces that map with an array. The only tricky part of this is reasoning about the range of possible array indexes in order to make sure that we never index out of bounds.
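An illustrative sketch of the map-to-array idea, using hypothetical names rather than the actual OutputCommitCoordinator fields:

```scala
// per-stage committer state indexed by partition id; -1 means "no authorized committer yet"
class StageState(numPartitions: Int) {
  private val authorizedCommitters = Array.fill(numPartitions)(-1)

  def canCommit(partition: Int, attempt: Int): Boolean = {
    require(partition >= 0 && partition < numPartitions)  // the tricky part: stay in bounds
    if (authorizedCommitters(partition) == -1) {
      authorizedCommitters(partition) = attempt  // first attempt to ask wins
      true
    } else {
      authorizedCommitters(partition) == attempt // only the recorded attempt may commit
    }
  }
}
```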
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9274 from JoshRosen/SPARK-11307.
      d0b56339
    • Zhenhua Wang's avatar
      [SPARK-11398] [SQL] unnecessary def dialectClassName in HiveContext, and... · a752ddad
      Zhenhua Wang authored
      [SPARK-11398] [SQL] unnecessary def dialectClassName in HiveContext, and misleading dialect conf at the start of spark-sql
      
      1. def dialectClassName in HiveContext is unnecessary.
In HiveContext, if conf.dialect == "hiveql", getSQLDialect() returns new HiveQLDialect(this);
otherwise it uses super.getSQLDialect(). super.getSQLDialect() then calls dialectClassName, which is overridden in HiveContext but still returns super.dialectClassName in that case.
So we never reach the `classOf[HiveQLDialect].getCanonicalName` branch of `def dialectClassName` in HiveContext.
      
      2. When we start bin/spark-sql, the default context is HiveContext, and the corresponding dialect is hiveql.
However, if we type "set spark.sql.dialect;", the result is "sql", which is inconsistent with the actual dialect and is misleading. For example, we can run SQL such as "create table" that is only allowed in hiveql, yet this dialect conf still shows "sql".
Although this problem does not cause any execution error, it is misleading to Spark SQL users. Therefore I think we should fix it.
In this PR, while processing "set spark.sql.dialect" in SetCommand, I use "conf.dialect" instead of "getConf()" for the case of key == SQLConf.DIALECT.key, so that it returns the right dialect conf.
      
      Author: Zhenhua Wang <wangzhenhua@huawei.com>
      
      Closes #9349 from wzhfy/dialect.
      a752ddad
    • Josh Rosen's avatar
      [SPARK-11491] Update build to use Scala 2.10.5 · ce5e6a28
      Josh Rosen authored
      Spark should build against Scala 2.10.5, since that includes a fix for Scaladoc that will fix doc snapshot publishing: https://issues.scala-lang.org/browse/SI-8479
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9450 from JoshRosen/upgrade-to-scala-2.10.5.
      ce5e6a28
    • Reynold Xin's avatar
      [SPARK-11510][SQL] Remove SQL aggregation tests for higher order statistics · b6e0a5ae
      Reynold Xin authored
      We have some aggregate function tests in both DataFrameAggregateSuite and SQLQuerySuite. The two have almost the same coverage and we should just remove the SQL one.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9475 from rxin/SPARK-11510.
      b6e0a5ae