  1. Nov 05, 2015
    • Wenchen Fan's avatar
      [SPARK-10656][SQL] completely support special chars in DataFrame · d9e30c59
      Wenchen Fan authored
      the main problem is: we interpret column names with special handling of `.` for DataFrame. This enables us to write something like `df("a.b")` to get the field `b` of `a`. However, we don't need this feature in `DataFrame.apply("*")` or `DataFrame.withColumnRenamed`. In these 2 cases, the column name is already the final name, so no extra interpretation is needed.
      
      The solution is simple: use `queryExecution.analyzed.output` to get the resolved columns directly, instead of using `DataFrame.resolve`.
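
      A minimal sketch of the user-facing behavior (hypothetical local setup, Spark 1.x-era API): a column whose name literally contains a dot can be renamed without back-tick escaping.

      ```
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.SQLContext

      val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("special-chars"))
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      // "a.b" is the literal column name, not a nested field reference
      val df = Seq((1, 2)).toDF("a.b", "c")
      // withColumnRenamed treats the existing name as final; the dot is not interpreted
      df.withColumnRenamed("a.b", "ab").printSchema()
      ```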
      
      close https://github.com/apache/spark/pull/8811
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9462 from cloud-fan/special-chars.
      d9e30c59
    • adrian555's avatar
      [SPARK-11260][SPARKR] with() function support · b9455d1f
      adrian555 authored
      Author: adrian555 <wzhuang@us.ibm.com>
      Author: Adrian Zhuang <adrian555@users.noreply.github.com>
      
      Closes #9443 from adrian555/with.
      b9455d1f
    • Reynold Xin's avatar
      [SPARK-11532][SQL] Remove implicit conversion from Expression to Column · 8a5314ef
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9500 from rxin/SPARK-11532.
      8a5314ef
    • Travis Hegner's avatar
      [SPARK-10648] Oracle dialect to handle nonspecific numeric types · 14ee0f57
      Travis Hegner authored
      This is the alternative/agreed upon solution to PR #8780.
      
      This creates an OracleDialect to handle the nonspecific numeric types that can be defined in Oracle.
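
      A hedged sketch of the dialect approach (not the merged implementation; the exact type mapping below is illustrative): register a JdbcDialect that maps Oracle's unconstrained NUMBER columns to a concrete Catalyst type.

      ```
      import java.sql.Types
      import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
      import org.apache.spark.sql.types._

      object OracleNumberDialect extends JdbcDialect {
        override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

        override def getCatalystType(
            sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
          // Oracle can report NUMBER columns with no usable precision/scale;
          // fall back to a fixed decimal type in that case (illustrative choice).
          if (sqlType == Types.NUMERIC && size == 0) Some(DecimalType(38, 10)) else None
        }
      }

      JdbcDialects.registerDialect(OracleNumberDialect)
      ```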
      
      Author: Travis Hegner <thegner@trilliumit.com>
      
      Closes #9495 from travishegner/OracleDialect.
      14ee0f57
    • Ehsan M.Kermani's avatar
      [SPARK-10265][DOCUMENTATION, ML] Fixed @Since annotation to ml.regression · f80f7b69
      Ehsan M.Kermani authored
      Here is my first commit.
      
      Author: Ehsan M.Kermani <ehsanmo1367@gmail.com>
      
      Closes #8728 from ehsanmok/SinceAnn.
      f80f7b69
    • Reynold Xin's avatar
      [SPARK-11513][SQL] Remove implicit conversion from LogicalPlan to DataFrame · 6b87acd6
      Reynold Xin authored
      This internal implicit conversion has been a source of confusion for a lot of new developers.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9479 from rxin/SPARK-11513.
      6b87acd6
    • Srinivasa Reddy Vundela's avatar
      [SPARK-11484][WEBUI] Using proxyBase set by spark AM · c76865c6
      Srinivasa Reddy Vundela authored
      Use the proxyBase set by the AM; if it is not found, fall back to the environment variable. This fixes the issue where somebody accidentally sets APPLICATION_WEB_PROXY_BASE to the wrong proxyBase.
      
      Author: Srinivasa Reddy Vundela <vsr@cloudera.com>
      
      Closes #9448 from vundela/master.
      c76865c6
    • Yanbo Liang's avatar
      [SPARK-11473][ML] R-like summary statistics with intercept for OLS via normal equation solver · 9da7ceed
      Yanbo Liang authored
      Following up on [SPARK-9836](https://issues.apache.org/jira/browse/SPARK-9836), we should also support summary statistics for the ```intercept```.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9485 from yanboliang/spark-11473.
      9da7ceed
    • Huaxin Gao's avatar
      [SPARK-11474][SQL] change fetchSize to fetchsize · b072ff4d
      Huaxin Gao authored
      In DefaultDataSource.scala, createRelation is declared as
      override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation
      where parameters is a CaseInsensitiveMap.
      After the line
      parameters.foreach(kv => properties.setProperty(kv._1, kv._2))
      properties holds all-lower-case key/value pairs, so fetchSize becomes fetchsize.
      However, the compute method in JDBCRDD reads
      val fetchSize = properties.getProperty("fetchSize", "0").toInt
      so the fetchSize value is always 0 and never gets set correctly.
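
      A minimal sketch of the mismatch (illustrative names, not the exact Spark internals): once the case-insensitive map has lower-cased the keys, only the lower-cased lookup succeeds.

      ```
      import java.util.Properties

      // keys coming out of the case-insensitive map are already lower-cased
      val parameters = Map("url" -> "jdbc:h2:mem:test", "fetchsize" -> "100")
      val properties = new Properties()
      parameters.foreach(kv => properties.setProperty(kv._1, kv._2))

      val broken = properties.getProperty("fetchSize", "0").toInt // 0: the mixed-case key never matches
      val fixed  = properties.getProperty("fetchsize", "0").toInt // 100: matches the stored lower-case key
      println(s"broken=$broken, fixed=$fixed")
      ```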
      
      Author: Huaxin Gao <huaxing@oc0558782468.ibm.com>
      
      Closes #9473 from huaxingao/spark-11474.
      b072ff4d
    • Nishkam Ravi's avatar
      [SPARK-11501][CORE][YARN] Propagate spark.rpc config to executors · a4b5cefc
      Nishkam Ravi authored
      spark.rpc is supposed to be configurable but is not currently (doesn't get propagated to executors because RpcEnv.create is done before driver properties are fetched).
      
      Author: Nishkam Ravi <nishkamravi@gmail.com>
      
      Closes #9460 from nishkamravi2/master_akka.
      a4b5cefc
    • Yanbo Liang's avatar
      [SPARK-11527][ML][PYSPARK] PySpark AFTSurvivalRegressionModel should expose... · 2e86cf1b
      Yanbo Liang authored
      [SPARK-11527][ML][PYSPARK] PySpark AFTSurvivalRegressionModel should expose coefficients/intercept/scale
      
      PySpark ```AFTSurvivalRegressionModel``` should expose coefficients/intercept/scale. mengxr vectorijk
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9492 from yanboliang/spark-11527.
      2e86cf1b
    • Yanbo Liang's avatar
      [MINOR][ML][DOC] Rename weights to coefficients in user guide · 72634f27
      Yanbo Liang authored
      We should use ```coefficients``` rather than ```weights``` in the user guide so that newcomers pick up the correct conventional name from the outset. mengxr vectorijk
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9493 from yanboliang/docs-coefficients.
      72634f27
    • Cheng Lian's avatar
      [MINOR][SQL] A minor log line fix · 77488fb8
      Cheng Lian authored
      `jars` in the log line is an array, so `$jars` doesn't print its content.
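
      An illustration of the underlying issue (generic example, not the actual log line): interpolating an array prints its object reference, so the elements have to be joined explicitly.

      ```
      val jars = Array("a.jar", "b.jar")
      println(s"jars: $jars")                  // prints something like "jars: [Ljava.lang.String;@3f91beef"
      println(s"jars: ${jars.mkString(", ")}") // prints "jars: a.jar, b.jar"
      ```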
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #9494 from liancheng/minor.log-fix.
      77488fb8
    • a1singh's avatar
      [SPARK-11506][MLLIB] Removed redundant operation in Online LDA implementation · a94671a0
      a1singh authored
      In file LDAOptimizer.scala:
      
      line 441: since "idx" was never used, the unneeded zipWithIndex.foreach is replaced with plain foreach.
      
      -      nonEmptyDocs.zipWithIndex.foreach { case ((_, termCounts: Vector), idx: Int) =>
      +      nonEmptyDocs.foreach { case (_, termCounts: Vector) =>
      
      Author: a1singh <a1singh@ucsd.edu>
      
      Closes #9456 from a1singh/master.
      a94671a0
    • Herman van Hovell's avatar
      [SPARK-11449][CORE] PortableDataStream should be a factory · 7bdc9219
      Herman van Hovell authored
      ```PortableDataStream``` maintains some internal state. This makes it tricky to reuse a stream (one needs to call ```close``` on both the ```PortableDataStream``` and the ```InputStream``` it produces).
      
      This PR removes all state from ```PortableDataStream``` and effectively turns it into an ```InputStream```/```Array[Byte]``` factory. This makes the user responsible for managing the ```InputStream``` it returns.
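
      A hedged usage sketch under the new contract (hypothetical path and local setup): the caller opens the stream and is responsible for closing it.

      ```
      import org.apache.spark.{SparkConf, SparkContext}

      val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("pds"))
      // hypothetical directory of binary files
      val files = sc.binaryFiles("/tmp/some-binary-dir") // RDD[(String, PortableDataStream)]
      val firstBytes = files.map { case (path, pds) =>
        val in = pds.open()    // the caller now owns this InputStream
        try {
          (path, in.read())
        } finally {
          in.close()           // and must close it explicitly
        }
      }
      firstBytes.collect().foreach(println)
      ```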
      
      cc srowen
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #9417 from hvanhovell/SPARK-11449.
      7bdc9219
    • Nick Evans's avatar
      [SPARK-11378][STREAMING] make StreamingContext.awaitTerminationOrTimeout return properly · 859dff56
      Nick Evans authored
      This adds a failing test checking that `awaitTerminationOrTimeout` returns the expected value, and then fixes that failing test with the addition of a `return`.
      
      tdas zsxwing
      
      Author: Nick Evans <me@nicolasevans.org>
      
      Closes #9336 from manygrams/fix_await_termination_or_timeout.
      859dff56
    • Sean Owen's avatar
      [SPARK-11440][CORE][STREAMING][BUILD] Declare rest of @Experimental items... · 6f81eae2
      Sean Owen authored
      [SPARK-11440][CORE][STREAMING][BUILD] Declare rest of @Experimental items non-experimental if they've existed since 1.2.0
      
      Remove `Experimental` annotations in core, streaming for items that existed in 1.2.0 or before. The changes are:
      
      * SparkContext
        * binary{Files,Records} : 1.2.0
        * submitJob : 1.0.0
      * JavaSparkContext
        * binary{Files,Records} : 1.2.0
      * DoubleRDDFunctions, JavaDoubleRDD
        * {mean,sum}Approx : 1.0.0
      * PairRDDFunctions, JavaPairRDD
        * sampleByKeyExact : 1.2.0
        * countByKeyApprox : 1.0.0
      * PairRDDFunctions
        * countApproxDistinctByKey : 1.1.0
      * RDD
        * countApprox, countByValueApprox, countApproxDistinct : 1.0.0
      * JavaRDDLike
        * countApprox : 1.0.0
      * PythonHadoopUtil.Converter : 1.1.0
      * PortableDataStream : 1.2.0 (related to binaryFiles)
      * BoundedDouble : 1.0.0
      * PartialResult : 1.0.0
      * StreamingContext, JavaStreamingContext
        * binaryRecordsStream : 1.2.0
      * HiveContext
        * analyze : 1.2.0
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #9396 from srowen/SPARK-11440.
      6f81eae2
  2. Nov 04, 2015
    • Davies Liu's avatar
      [SPARK-11425] [SPARK-11486] Improve hybrid aggregation · 81498dd5
      Davies Liu authored
      After aggregation, the dataset could be smaller than the inputs, so it's better to do hash-based aggregation over all inputs first, then use sort-based aggregation to merge them.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9383 from davies/fix_switch.
      81498dd5
    • Josh Rosen's avatar
      [SPARK-11307] Reduce memory consumption of OutputCommitCoordinator · d0b56339
      Josh Rosen authored
      OutputCommitCoordinator uses a map in a place where an array would suffice, increasing its memory consumption for result stages with millions of tasks.
      
      This patch replaces that map with an array. The only tricky part of this is reasoning about the range of possible array indexes in order to make sure that we never index out of bounds.
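
      A schematic sketch of the idea (not the actual OutputCommitCoordinator code): when the keys are dense partition ids 0..n-1, an array indexed by partition id replaces the map and avoids per-entry overhead.

      ```
      val numPartitions = 4

      // before: a map keyed by partition id
      val byMap = scala.collection.mutable.Map.empty[Int, Int]
      byMap(2) = 7 // task attempt 7 authorized to commit partition 2

      // after: an array indexed directly by partition id; -1 means "no committer yet"
      val byArray = Array.fill(numPartitions)(-1)
      byArray(2) = 7
      ```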
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9274 from JoshRosen/SPARK-11307.
      d0b56339
    • Zhenhua Wang's avatar
      [SPARK-11398] [SQL] unnecessary def dialectClassName in HiveContext, and... · a752ddad
      Zhenhua Wang authored
      [SPARK-11398] [SQL] unnecessary def dialectClassName in HiveContext, and misleading dialect conf at the start of spark-sql
      
      1. def dialectClassName in HiveContext is unnecessary.
      In HiveContext, if conf.dialect == "hiveql", getSQLDialect() will return new HiveQLDialect(this);
      else it will use super.getSQLDialect(). Then super.getSQLDialect() calls dialectClassName, which is overridden in HiveContext and still returns super.dialectClassName.
      So we'll never reach the code "classOf[HiveQLDialect].getCanonicalName" of def dialectClassName in HiveContext.
      
      2. When we start bin/spark-sql, the default context is HiveContext, and the corresponding dialect is hiveql.
      However, if we type "set spark.sql.dialect;", the result is "sql", which is inconsistent with the actual dialect and is misleading. For example, we can run SQL such as "create table", which is only allowed in hiveql, yet this dialect conf reports "sql".
      Although this problem will not cause any execution error, it's misleading to Spark SQL users. Therefore I think we should fix it.
      In this PR, while processing "set spark.sql.dialect" in SetCommand, I use "conf.dialect" instead of "getConf()" for the case key == SQLConf.DIALECT.key, so that it returns the right dialect conf.
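
      An illustrative way to observe the reported behavior (hypothetical local HiveContext setup): before the fix, asking a HiveContext for the dialect printed "sql" even though the active dialect was hiveql.

      ```
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.hive.HiveContext

      val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("dialect"))
      val hiveContext = new HiveContext(sc)
      // with this change, the reported value matches the actual dialect (hiveql)
      hiveContext.sql("SET spark.sql.dialect").show()
      ```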
      
      Author: Zhenhua Wang <wangzhenhua@huawei.com>
      
      Closes #9349 from wzhfy/dialect.
      a752ddad
    • Josh Rosen's avatar
      [SPARK-11491] Update build to use Scala 2.10.5 · ce5e6a28
      Josh Rosen authored
      Spark should build against Scala 2.10.5, since that includes a fix for Scaladoc that will fix doc snapshot publishing: https://issues.scala-lang.org/browse/SI-8479
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9450 from JoshRosen/upgrade-to-scala-2.10.5.
      ce5e6a28
    • Reynold Xin's avatar
      [SPARK-11510][SQL] Remove SQL aggregation tests for higher order statistics · b6e0a5ae
      Reynold Xin authored
      We have some aggregate function tests in both DataFrameAggregateSuite and SQLQuerySuite. The two have almost the same coverage and we should just remove the SQL one.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9475 from rxin/SPARK-11510.
      b6e0a5ae
    • Yu ISHIKAWA's avatar
      [SPARK-10028][MLLIB][PYTHON] Add Python API for PrefixSpan · 411ff6af
      Yu ISHIKAWA authored
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #9469 from yu-iskw/SPARK-10028.
      411ff6af
    • Davies Liu's avatar
      [SPARK-11493] remove bitset from BytesToBytesMap · 1b6a5d4a
      Davies Liu authored
      Since we store 4 bytes for the number of records at the beginning of a page, the address can not be zero, so we do not need the bitset.
      
      Performance-wise, the bitset could speed up a failed lookup when the slot is empty (the bitset is smaller than the longArray, so the cache-hit rate is higher). In practice the map is filled to 35% - 70% (call it 50% on average), so only half of the failed lookups can benefit from it; all the others pay the cost of loading the bitset and still need to access the longArray anyway.
      
      For aggregation, we always need to access the longArray (we insert a new key after a failed lookup), which a benchmark also confirmed.
      
      For broadcast hash join, there could be a regression, but a simple benchmark showed that there may not be one (most lookups are misses):
      
      ```
      import time  # assumes a PySpark shell, where sqlContext is already defined
      sqlContext.range(1<<20).write.parquet("small")
      df = sqlContext.read.parquet('small')
      for i in range(3):
          t = time.time()
          df2 = sqlContext.range(1<<26).selectExpr("id * 1111111111 % 987654321 as id2")
          df2.join(df, df.id == df2.id2).count()
          print time.time() - t
      ```
      
      Having bitset (used time in seconds):
      ```
      17.5404241085
      10.2758829594
      10.5786800385
      ```
      After removing bitset (used time in seconds):
      ```
      21.8939979076
      12.4132959843
      9.97224712372
      ```
      
      cc rxin nongli
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9452 from davies/remove_bitset.
      1b6a5d4a
    • Adam Roberts's avatar
      [SPARK-10949] Update Snappy version to 1.1.2 · 701fb505
      Adam Roberts authored
      This is an updated version of #8995 by a-roberts. Original description follows:
      
      Snappy now supports concatenation of serialized streams; this patch contains a version number change, and the "does not support" test is now a "supports" test.
      
      Snappy 1.1.2 changelog mentions:
      
      > snappy-java-1.1.2 (22 September 2015)
      > This is a backward compatible release for 1.1.x.
      > Add AIX (32-bit) support.
      > There is no upgrade for the native libraries of the other platforms.
      
      > A major change since 1.1.1 is a support for reading concatenated results of SnappyOutputStream(s)
      > snappy-java-1.1.2-RC2 (18 May 2015)
      > Fix #107: SnappyOutputStream.close() is not idempotent
      > snappy-java-1.1.2-RC1 (13 May 2015)
      > SnappyInputStream now supports reading concatenated compressed results of SnappyOutputStream
      > There has been no compressed format change since 1.0.5.x, so you can read the compressed results interchangeably between these versions.
      > Fixes a problem when java.io.tmpdir does not exist.
      
      Closes #8995.
      
      Author: Adam Roberts <aroberts@uk.ibm.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9439 from JoshRosen/update-snappy.
      701fb505
    • Reynold Xin's avatar
      [SPARK-11505][SQL] Break aggregate functions into multiple files · d19f4fda
      Reynold Xin authored
      functions.scala was getting pretty long. I broke it into multiple files.
      
      I also added explicit data types for some public vals, and renamed aggregate function pretty names to lower case, which is more consistent with the rest of the functions.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9471 from rxin/SPARK-11505.
      d19f4fda
    • Reynold Xin's avatar
      [SPARK-11504][SQL] API audit for distributeBy and localSort · abf5e428
      Reynold Xin authored
      1. Renamed localSort -> sortWithinPartitions to avoid ambiguity in "local"
      2. distributeBy -> repartition to match the existing repartition.
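
      A brief usage sketch of the renamed APIs (toy data, hypothetical local setup): repartition by expressions takes over from distributeBy, and sortWithinPartitions takes over from localSort.

      ```
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.SQLContext

      val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("api-audit"))
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      val df = Seq(("a", 3), ("b", 2), ("a", 1)).toDF("key", "value")
      // hash-partition by key, then sort rows inside each partition (no global sort)
      df.repartition($"key").sortWithinPartitions($"value").show()
      ```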
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9470 from rxin/SPARK-11504.
      abf5e428
    • Liang-Chi Hsieh's avatar
      [SPARK-10304][SQL] Following up checking valid dir structure for partition discovery · de289bf2
      Liang-Chi Hsieh authored
      This patch follows up #8840.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #9459 from viirya/detect_invalid_part_dir_following.
      de289bf2
    • Reynold Xin's avatar
      Closes #9464 · 987df4bf
      Reynold Xin authored
      987df4bf
    • Reynold Xin's avatar
      [SPARK-11490][SQL] variance should alias var_samp instead of var_pop. · 3bd6f5d2
      Reynold Xin authored
      stddev is an alias for stddev_samp. variance should be consistent with stddev.
      
      Also took the chance to remove internal Stddev and Variance, and only kept StddevSamp/StddevPop and VarianceSamp/VariancePop.
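
      An illustrative check of the new aliasing (toy data, hypothetical local setup): variance should now agree with var_samp, mirroring stddev and stddev_samp.

      ```
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.SQLContext

      val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("variance"))
      val sqlContext = new SQLContext(sc)
      sqlContext.range(10).registerTempTable("t")
      // variance and var_samp should return the same value; var_pop differs
      sqlContext.sql("SELECT variance(id), var_samp(id), var_pop(id) FROM t").show()
      ```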
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9449 from rxin/SPARK-11490.
      3bd6f5d2
    • Wenchen Fan's avatar
      [SPARK-11197][SQL] add doc for run SQL on files directly · e0fc9c7e
      Wenchen Fan authored
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9467 from cloud-fan/doc.
      e0fc9c7e
    • Reynold Xin's avatar
      [SPARK-11485][SQL] Make DataFrameHolder and DatasetHolder public. · cd1df662
      Reynold Xin authored
      These two classes should be public, since they are used in public code.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9445 from rxin/SPARK-11485.
      cd1df662
    • Marcelo Vanzin's avatar
      [SPARK-11235][NETWORK] Add ability to stream data using network lib. · 27feafcc
      Marcelo Vanzin authored
      The current interface used to fetch shuffle data is not very efficient for
      large buffers; it requires the receiver to buffer the entirety of the
      contents being downloaded in memory before processing the data.
      
      To use the network library to transfer large files (such as those that
      can be added using SparkContext addJar / addFile), this change adds a
      more efficient way of downloading data, by streaming the data and feeding
      it to a callback as data arrives.
      
      This is achieved by a custom frame decoder that replaces the current netty
      one; this decoder allows entering a mode where framing is skipped and data
      is instead provided directly to a callback. The existing netty classes
      (ByteToMessageDecoder and LengthFieldBasedFrameDecoder) could not be reused
      since their semantics do not allow for the interception approach the new
      decoder uses.
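
      A purely schematic sketch of the callback idea (not the actual network-common API): the receiver consumes chunks as they arrive instead of buffering the whole payload.

      ```
      import java.nio.ByteBuffer

      // schematic interface: invoked by the transport layer as stream data arrives
      trait StreamCallback {
        def onData(streamId: String, buf: ByteBuffer): Unit    // called once per received chunk
        def onComplete(streamId: String): Unit                 // called when the stream ends
        def onFailure(streamId: String, cause: Throwable): Unit
      }

      // example callback that just counts bytes as they stream in
      class ByteCountingCallback extends StreamCallback {
        private var total = 0L
        override def onData(streamId: String, buf: ByteBuffer): Unit = total += buf.remaining()
        override def onComplete(streamId: String): Unit = println(s"$streamId: $total bytes")
        override def onFailure(streamId: String, cause: Throwable): Unit = cause.printStackTrace()
      }
      ```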
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #9206 from vanzin/SPARK-11235.
      27feafcc
    • Marcelo Vanzin's avatar
      [SPARK-10622][CORE][YARN] Differentiate dead from "mostly dead" executors. · 8790ee6d
      Marcelo Vanzin authored
      In YARN mode, when preemption is enabled, we may leave executors in a
      zombie state while we wait to retrieve the reason for which the executor
      exited. This is so that we don't account for failed tasks that were
      running on a preempted executor.
      
      The issue is that while we wait for this information, the scheduler
      might decide to schedule tasks on the executor, which will never be
      able to run them. Other side effects include the block manager still
      considering the executor available to cache blocks, for example.
      
      So, when we know that an executor went down but we don't know why,
      stop everything related to the executor, except its running tasks.
      Only when we know the reason for the exit (or give up waiting for
      it) do we update the running tasks.
      
      This is achieved by a new `disableExecutor()` method in the
      `Schedulable` interface. For managers that do not behave like this
      (i.e. every one but YARN), the existing `executorLost()` method
      will behave the same way it did before.
      
      On top of that change, a few minor changes that made debugging easier,
      and fixed some other minor issues:
      - The cluster-mode AM was printing a misleading log message every
        time an executor disconnected from the driver (because the akka
        actor system was shared between driver and AM).
      - Avoid sending unnecessary requests for an executor's exit reason
        when we already know it was explicitly disabled / killed. This
        avoids both multiple requests, and unnecessary requests that would
        just cause warning messages on the AM (in the explicit kill case).
      - Tone down a log message about the executor being lost when it
        exited normally (e.g. preemption)
      - Wake up the AM monitor thread when requests for executor loss
        reasons arrive too, so that we can more quickly remove executors
        from this zombie state.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #8887 from vanzin/SPARK-10622.
      8790ee6d
    • Xusen Yin's avatar
      [SPARK-11443] Reserve space lines · 9b214cea
      Xusen Yin authored
      The trim_codeblock(lines) function in include_example.rb removes some blank lines in the code.
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #9400 from yinxusen/SPARK-11443.
      9b214cea
    • Pravin Gadakh's avatar
      [SPARK-11380][DOCS] Replace example code in mllib-frequent-pattern-mining.md using include_example · 820064e6
      Pravin Gadakh authored
      Author: Pravin Gadakh <pravingadakh177@gmail.com>
      Author: Pravin Gadakh <prgadakh@in.ibm.com>
      
      Closes #9340 from pravingadakh/SPARK-11380.
      820064e6
    • Yanbo Liang's avatar
      [SPARK-9492][ML][R] LogisticRegression in R should provide model statistics · e328b69c
      Yanbo Liang authored
      Like ml ```LinearRegression```, ```LogisticRegression``` should provide a training summary including feature names and their coefficients.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9303 from yanboliang/spark-9492.
      e328b69c
    • tedyu's avatar
      [SPARK-11442] Reduce numSlices for local metrics test of SparkListenerSuite · c09e5139
      tedyu authored
      In the thread, http://search-hadoop.com/m/q3RTtcQiFSlTxeP/test+failed+due+to+OOME&subj=test+failed+due+to+OOME, it was discussed that memory consumption for SparkListenerSuite should be brought down.
      
      This is an attempt in that direction, reducing numSlices for the local metrics test.
      
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #9384 from tedyu/master.
      c09e5139
    • jerryshao's avatar
      [SPARK-2960][DEPLOY] Support executing Spark from symlinks (reopen) · 8aff36e9
      jerryshao authored
      This PR is based on the work of roji to support running Spark scripts from symlinks. Thanks for the great work, roji. Would you mind taking a look at this PR? Thanks a lot.
      
      For releases like HDP and others, the Spark executables are normally exposed as symlinks placed on `PATH`, but Spark's current scripts do not resolve the real path from a symlink recursively, so Spark fails to execute when launched via a symlink. This PR tries to solve the issue by finding the absolute path from the symlink.
      
      Unlike the earlier PR (https://github.com/apache/spark/pull/2386), this does not use `readlink -f`, because `-f` is not supported on Mac; instead, the path is resolved manually in a loop.
      
      I've tested on Mac and Linux (CentOS); looks fine.
      
      This PR did not fix the scripts under the `sbin` folder; not sure whether those need to be fixed as well?
      
      Please help to review; any comment is greatly appreciated.
      
      Author: jerryshao <sshao@hortonworks.com>
      Author: Shay Rojansky <roji@roji.org>
      
      Closes #8669 from jerryshao/SPARK-2960.
      8aff36e9
  3. Nov 03, 2015