Commits · a0f1a11837bfffb76582499d36fbaf21a1d628cb · cs525-sp18-g07 / spark

Nov 25, 2015

[SPARK-11981][SQL] Move implementations of methods back to DataFrame from Queryable · a0f1a118
Reynold Xin authored 9 years ago
```
Also added show methods to Dataset.

Author: Reynold Xin <rxin@databricks.com>

Closes #9964 from rxin/SPARK-11981.
```

[SPARK-11970][SQL] Adding JoinType into JoinWith and support Sample in Dataset API · 2610e061

gatorsmile authored 9 years ago

Besides inner joins, other join types may also be useful when calling joinWith. This change therefore adds a joinType parameter to the existing joinWith call in the Dataset API.

It also provides another joinWith interface for cartesian-join-like functionality.

Please provide your opinions. marmbrus rxin cloud-fan Thank you!

Author: gatorsmile <gatorsmile@gmail.com>

Closes #9921 from gatorsmile/joinWith.

[SPARK-11979][STREAMING] Empty TrackStateRDD cannot be checkpointed and recovered from checkpoint file · 21698868

Tathagata Das authored 9 years ago

This fixes the following exception, which is thrown when an empty state RDD is checkpointed and then recovered. The root cause is that an empty OpenHashMapBasedStateMap cannot be deserialized because its initialCapacity is set to zero.
```
Job aborted due to stage failure: Task 0 in stage 6.0 failed 1 times, most recent failure: Lost task 0.0 in stage 6.0 (TID 20, localhost): java.lang.IllegalArgumentException: requirement failed: Invalid initial capacity
	at scala.Predef$.require(Predef.scala:233)
	at org.apache.spark.streaming.util.OpenHashMapBasedStateMap.<init>(StateMap.scala:96)
	at org.apache.spark.streaming.util.OpenHashMapBasedStateMap.<init>(StateMap.scala:86)
	at org.apache.spark.streaming.util.OpenHashMapBasedStateMap.readObject(StateMap.scala:291)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
	at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:181)
	at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:921)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:921)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:88)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
```
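
The failure mode can be sketched in Python. This is only an illustration under stated assumptions: `StateMap`, `deserialize_buggy`, `deserialize_fixed`, and the `min_capacity` guard are hypothetical names, not Spark code, and the guard shown is one possible fix rather than necessarily the one in this patch.

```python
class StateMap:
    """Hypothetical stand-in for a hash-map-backed state store (not Spark code)."""

    def __init__(self, initial_capacity=64):
        # Mirrors the kind of precondition that fails in the stack trace above.
        if initial_capacity <= 0:
            raise ValueError("Invalid initial capacity")
        self.initial_capacity = initial_capacity
        self.data = {}

    def serialize(self):
        # Only the entries are written out; capacity is derived again on read.
        return list(self.data.items())

    @classmethod
    def deserialize_buggy(cls, items):
        # Sizing by entry count alone yields capacity 0 for an empty map.
        restored = cls(initial_capacity=len(items))
        restored.data.update(items)
        return restored

    @classmethod
    def deserialize_fixed(cls, items, min_capacity=64):
        # Assumed guard: never drop below a positive minimum capacity.
        restored = cls(initial_capacity=max(len(items), min_capacity))
        restored.data.update(items)
        return restored
```

With an empty map, `deserialize_buggy` raises `ValueError`, while `deserialize_fixed` returns an empty map.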

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #9958 from tdas/SPARK-11979.

Nov 24, 2015

[SPARK-10621][SQL] Consistent naming for functions in SQL, Python, Scala · 151d7c2b
Reynold Xin authored 9 years ago
```
Author: Reynold Xin <rxin@databricks.com>

Closes #9948 from rxin/SPARK-10621.
```

[STREAMING][FLAKY-TEST] Catch execution context race condition in `FileBasedWriteAheadLog.close()` · a5d98876

Burak Yavuz authored 9 years ago

There is a race condition in `FileBasedWriteAheadLog.close()`: if deletes of old log files are still in progress, the write-ahead log may close and cause a `RejectedExecutionException`. This is okay and should be handled gracefully.

Example test failures:
https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.6-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/95/testReport/junit/org.apache.spark.streaming.util/BatchedWriteAheadLogWithCloseFileAfterWriteSuite/BatchedWriteAheadLog___clean_old_logs/

The test fails because `writeAheadLog.close` is called in `afterEach` while async deletes may still be in flight.
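
A rough Python analogue of handling this gracefully is sketched below. It assumes a hypothetical `WriteAheadLogSketch` class. Python's `ThreadPoolExecutor.submit` raises `RuntimeError` after shutdown, which stands in here for Java's `RejectedExecutionException`.

```python
from concurrent.futures import ThreadPoolExecutor


class WriteAheadLogSketch:
    """Hypothetical log that deletes old files on a background pool."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)
        self.deleted = []
        self.skipped = []

    def clean(self, old_files):
        for path in old_files:
            try:
                # Recording the path stands in for an asynchronous file delete.
                self._pool.submit(self.deleted.append, path)
            except RuntimeError:
                # The pool was already shut down by close(); skip quietly.
                self.skipped.append(path)

    def close(self):
        # Waits for deletes that were already submitted, then rejects new ones.
        self._pool.shutdown(wait=True)
```

Calling `clean` after `close` then records the skipped files instead of propagating an exception to the caller.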

tdas zsxwing

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #9953 from brkyvz/flaky-ss.

[SPARK-11947][SQL] Mark deprecated methods with "This will be removed in Spark 2.0." · 4d6bbbc0
Reynold Xin authored 9 years ago
```
Also fixed some documentation issues as I came across them.

Author: Reynold Xin <rxin@databricks.com>

Closes #9930 from rxin/SPARK-11947.
```

[SPARK-11967][SQL] Consistent use of varargs for multiple paths in DataFrameReader · 25bbd3c1

Reynold Xin authored 9 years ago

This patch makes it consistent to use varargs in all DataFrameReader methods, including Parquet, JSON, text, and the generic load function.

Also added a few more API tests for the Java API.

Author: Reynold Xin <rxin@databricks.com>

Closes #9945 from rxin/SPARK-11967.

[SPARK-11914][SQL] Support coalesce and repartition in Dataset APIs · 238ae51b

gatorsmile authored 9 years ago

This PR adds two common operations, `coalesce` and `repartition`, to the Dataset API.

After reading the comments on SPARK-9999, I am unclear about the plan for supporting repartitioning in the Dataset API. Currently, both the RDD and DataFrame APIs give users the flexibility to control the number of partitions.

Most traditional RDBMSs expose the number of partitions, the partitioning columns, and the table partitioning methods to DBAs for performance tuning and storage planning. These parameters can substantially affect query performance. Since actual performance depends on the workload type, I think it is almost impossible to automate the discovery of the best partitioning strategy for all scenarios.

Is the Dataset API planning to hide these operations from users? Feel free to reject my PR if it does not match the plan.

Thank you for your answers. marmbrus rxin cloud-fan

Author: gatorsmile <gatorsmile@gmail.com>

Closes #9899 from gatorsmile/coalesce.

[SPARK-11783][SQL] Fixes execution Hive client when using remote Hive metastore · c7f95df5

Cheng Lian authored 9 years ago

When using a remote Hive metastore, `hive.metastore.uris` is set to the metastore URI. However, it unexpectedly overrides `javax.jdo.option.ConnectionURL`, so the execution Hive client connects to the actual remote Hive metastore instead of the Derby metastore created in the temporary directory. Clearing this configuration for the execution Hive client fixes the issue.

Author: Cheng Lian <lian@databricks.com>

Closes #9895 from liancheng/spark-11783.clean-remote-metastore-config.

Added a line of comment to explain why the extra sort exists in pivot. · 34ca392d
Reynold Xin authored 9 years ago

[SPARK-11805] free the array in UnsafeExternalSorter during spilling · 58d9b260

Davies Liu authored 9 years ago

After spill() is called on SortedIterator, the array inside InMemorySorter is no longer needed and should be freed during spilling. This could help when joining multiple tables with limited memory.

Author: Davies Liu <davies@databricks.com>

Closes #9793 from davies/free_array.

[SPARK-11929][CORE] Make the repl log4j configuration override the root logger. · e6dd2374

Marcelo Vanzin authored 9 years ago

In the default Spark distribution, there are currently two separate
log4j config files, with different default values for the root logger,
so that when running the shell you have a different default log level.
This makes the shell more usable, since the logs don't overwhelm the
output.

But if you install a custom log4j.properties, you lose that, because
then it's going to be used no matter whether you're running a regular
app or the shell.

With this change, the overriding of the log level is done differently;
the log level of the repl's main class (org.apache.spark.repl.Main) is used
to define the root logger's level when running the shell, defaulting
to WARN if it's not set explicitly.
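
The same idea can be sketched with Python's standard `logging` module as a loose analogue of log4j. The helper name `apply_shell_log_level` is hypothetical, and this is not how Spark itself implements the change.

```python
import logging


def apply_shell_log_level(repl_logger_name="org.apache.spark.repl.Main"):
    """Set the root level from the REPL logger, defaulting to WARNING."""
    repl_logger = logging.getLogger(repl_logger_name)
    # Logger.level is NOTSET (0) when no level was configured explicitly.
    if repl_logger.level != logging.NOTSET:
        level = repl_logger.level
    else:
        level = logging.WARNING
    logging.getLogger().setLevel(level)
    return level
```

A custom configuration only needs to set a level on the REPL logger to change the shell's verbosity; regular applications keep using the root level from the shared configuration.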

In a somewhat related change, the shell output about the "sc" variable
was changed a bit to contain a little more useful information about
the application, since when the root logger's log level is WARN, that
information is never shown to the user.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9816 from vanzin/shell-logging.

[SPARK-11946][SQL] Audit pivot API for 1.6. · f3152722

Reynold Xin authored 9 years ago

Currently pivot's signature looks like

```scala
@scala.annotation.varargs
def pivot(pivotColumn: Column, values: Column*): GroupedData

@scala.annotation.varargs
def pivot(pivotColumn: String, values: Any*): GroupedData
```

I think we can remove the one that takes "Column" types, since callers should always be passing in literals. It'd also be clearer if the values were not varargs, but rather Seq or java.util.List.

I also made similar changes for Python.

Author: Reynold Xin <rxin@databricks.com>

Closes #9929 from rxin/SPARK-11946.

[SPARK-11872] Prevent the call to SparkContext#stop() in the listener bus's thread · 81012546

tedyu authored 9 years ago

This is a continuation of SPARK-11761.

Andrew suggested adding this protection. See tail of https://github.com/apache/spark/pull/9741

Author: tedyu <yuzhihong@gmail.com>

Closes #9852 from tedyu/master.

[SPARK-11926][SQL] unify GetStructField and GetInternalRowField · 19530da6
Wenchen Fan authored 9 years ago
```
Author: Wenchen Fan <wenchen@databricks.com>

Closes #9909 from cloud-fan/get-struct.
```

[SPARK-11847][ML] Model export/import for spark.ml: LDA · 52bc25c8

Yuhao Yang authored 9 years ago

Add read/write support to LDA, similar to ALS.

save/load for ml.LocalLDAModel is done.
For DistributedLDAModel, I'm not sure if we can invoke save on the mllib.DistributedLDAModel directly. I'll send an update after some testing.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #9894 from hhbyyh/ldaMLsave.

[SPARK-11521][ML][DOC] Document that Logistic, Linear Regression summaries ignore weight col · 9e24ba66

Joseph K. Bradley authored 9 years ago

Document for 1.6 that the summaries mostly ignore the weight column.
To be corrected in 1.7.

CC: mengxr thunterdb

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9927 from jkbradley/linregsummary-doc.

[SPARK-11952][ML] Remove duplicate ml examples · 56a0aba0

Yanbo Liang authored 9 years ago

Remove duplicate ml examples (only for ml).  mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9933 from yanboliang/SPARK-11685.

[SPARK-11942][SQL] fix encoder life cycle for CoGroup · e5aaae6e

Wenchen Fan authored 9 years ago

We should pass resolved encoders to the logical `CoGroup` and bind them in the physical `CoGroup`.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9928 from cloud-fan/cogroup.

[SPARK-11818][REPL] Fix ExecutorClassLoader to lookup resources from parent class loader · be9dd155

Jungtaek Lim authored 9 years ago

Without this patch, two additional tests in ExecutorClassLoaderSuite fail:

- "resource from parent"
- "resources from parent"

A detailed explanation is available at https://issues.apache.org/jira/browse/SPARK-11818?focusedCommentId=15011202&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15011202

Author: Jungtaek Lim <kabhwan@gmail.com>

Closes #9812 from HeartSaVioR/SPARK-11818.

[SPARK-11592][SQL] flush spark-sql command line history to history file · 5889880f

Daoyuan Wang authored 9 years ago

Currently, `spark-sql` does not flush its command history when exiting.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #9563 from adrian-wang/jline.

[SPARK-11043][SQL] BugFix:Set the operator log in the thrift server. · d4a5e6f7

huangzhaowei authored 9 years ago

`SessionManager` sets the `operationLog` if the configuration `hive.server2.logging.operation.enabled` is true in Hive 1.2.1.
Spark did not adapt to this change, so the Spark thrift server always logs the warning message regardless of whether the configuration is enabled.
PS: if `hive.server2.logging.operation.enabled` is false, it should still log the warning message (the same as the Hive thrift server).

Author: huangzhaowei <carlmartinmax@gmail.com>

Closes #9056 from SaintBacchus/SPARK-11043.

[SPARK-11906][WEB UI] Speculation Tasks Cause ProgressBar UI Overflow · 800bd799

Forest Fang authored 9 years ago

When there are speculative tasks in the stage, the running progress bar can overflow and wrap onto a new line, where it is hidden:
![image](https://cloud.githubusercontent.com/assets/4317392/11326841/5fd3482e-9142-11e5-8ca5-cb2f0c0c8964.png)
3 completed / 2 running (including 1 speculative) out of 4 total tasks

This is a simple fix that caps the started tasks at `total - completed` tasks:
![image](https://cloud.githubusercontent.com/assets/4317392/11326842/6bb67260-9142-11e5-90f0-37f9174878ec.png)
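
The capping arithmetic can be sketched as follows. The function name `progress_segments` and the percentage-based return value are assumptions made for illustration, not the actual Spark UI code, and `total` is assumed to be positive.

```python
def progress_segments(completed, started, total):
    """Return (completed_pct, running_pct) widths for a stage progress bar."""
    # Cap running tasks so completed + running never exceeds the total.
    capped_started = min(started, total - completed)
    return (100.0 * completed / total, 100.0 * capped_started / total)
```

For the case above (3 completed, 2 running, 4 total), the running segment is capped at 1 task, so the two segments together fill exactly 100% of the bar.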

I should note that my preferred way to fix it is via CSS:
```css
.progress { display: flex; }
```
which shifts the correction burden from the driver to the web browser. However, I couldn't get a Selenium test to measure the position and dimensions of the progress bar correctly, so I could not unit test this approach.

It also has the side effect that the widths are recalibrated, so the running portion occupies 2 / 5 instead of 1 / 4.
![image](https://cloud.githubusercontent.com/assets/4317392/11326848/7b03e9f0-9142-11e5-89ad-bd99cb0647cf.png)

All in all, since this cosmetic bug is minor enough, I suppose the original simple fix should be good enough.

Author: Forest Fang <forest.fang@outlook.com>

Closes #9896 from saurfang/progressbar.

[SPARK-11897][SQL] Add @scala.annotations.varargs to sql functions · 12eea834
Xiu Guo authored 9 years ago
```
Author: Xiu Guo <xguo27@gmail.com>

Closes #9918 from xguo27/SPARK-11897.
```

[SPARK-10707][SQL] Fix nullability computation in union output · 4021a28a
Mikhail Bautin authored 9 years ago
```
Author: Mikhail Bautin <mbautin@gmail.com>

Closes #9308 from mbautin/SPARK-10707.
```

[SPARK-11903] Remove --skip-java-test · 6cf51a70

Nicholas Chammas authored 9 years ago

Per [pwendell's comments on SPARK-11903](https://issues.apache.org/jira/browse/SPARK-11903?focusedCommentId=15021511&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15021511) I'm removing this dead code.

If we are concerned about preserving compatibility, I can instead leave the option in and add a warning.

For example:

```sh
echo "Warning: '--skip-java-test' is deprecated and has no effect."
;;
```

cc pwendell, srowen

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #9924 from nchammas/make-distribution.

[SPARK-11933][SQL] Rename mapGroup -> mapGroups and flatMapGroup -> flatMapGroups. · 8d575246

Reynold Xin authored 9 years ago

Based on feedback from Matei, this is more consistent with mapPartitions in Spark.

Also addresses some of the cleanups from a previous commit that renames the type variables.

Author: Reynold Xin <rxin@databricks.com>

Closes #9919 from rxin/SPARK-11933.

Nov 23, 2015

Updated sql programming guide to include jdbc fetch size · 026ea2ea
Stephen Samuel authored 9 years ago
```
Author: Stephen Samuel <sam@sksamuel.com>

Closes #9377 from sksamuel/master.
```

[SPARK-10560][PYSPARK][MLLIB][DOCS] Make StreamingLogisticRegressionWithSGD Python API equal to Scala one · 10574564

Bryan Cutler authored 9 years ago

This brings the API documentation of StreamingLogisticRegressionWithSGD and StreamingLinearRegressionWithSGD in line with the Scala versions.

- Fixed the algorithm descriptions
- Added default values to parameter descriptions
- Changed StreamingLogisticRegressionWithSGD regParam to default to 0, as in the Scala version

Author: Bryan Cutler <bjcutler@us.ibm.com>

Closes #9141 from BryanCutler/StreamingLogisticRegressionWithSGD-python-api-sync.

[SPARK-9866][SQL] Speed up VersionsSuite by using persistent Ivy cache · 9db5f601

Josh Rosen authored 9 years ago

This patch attempts to speed up VersionsSuite by storing fetched Hive JARs in an Ivy cache that persists across test runs. If `SPARK_VERSIONS_SUITE_IVY_PATH` is set, that path will be used for the cache; if it is not set, VersionsSuite will create a temporary Ivy cache which is deleted after the test completes.
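
The selection logic can be sketched in Python as below. Only the environment variable name comes from the patch description; the `ivy_cache_dir` helper and its `env` parameter are hypothetical, and the real suite is written in Scala.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def ivy_cache_dir(env=os.environ):
    """Yield a cache directory, cleaning it up only if it was temporary."""
    configured = env.get("SPARK_VERSIONS_SUITE_IVY_PATH")
    if configured:
        # Persistent cache: reuse it across runs and leave it in place.
        os.makedirs(configured, exist_ok=True)
        yield configured
    else:
        # No path configured: use a throwaway directory for this run only.
        temp_dir = tempfile.mkdtemp(prefix="ivy-cache-")
        try:
            yield temp_dir
        finally:
            shutil.rmtree(temp_dir, ignore_errors=True)
```

Wrapping the test body in `with ivy_cache_dir() as path:` keeps the cleanup decision in one place: a configured path survives the run, while a temporary one is removed on exit.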

Author: Josh Rosen <joshrosen@databricks.com>

Closes #9624 from JoshRosen/SPARK-9866.

[SPARK-11140][CORE] Transfer files using network lib when using NettyRpcEnv. · c2467dad

Marcelo Vanzin authored 9 years ago

This change abstracts the code that serves jars / files to executors so that
each RpcEnv can have its own implementation; the akka version uses the existing
HTTP-based file serving mechanism, while the netty version uses the new
stream support added to the network lib, which makes file transfers benefit
from the easier security configuration of the network library, and should also
reduce overhead overall.

The change includes a small fix to TransportChannelHandler so that it propagates
user events to downstream handlers.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9530 from vanzin/SPARK-11140.

[SPARK-11865][NETWORK] Avoid returning inactive client in TransportClientFactory. · 7cfa4c6b

Marcelo Vanzin authored 9 years ago

There's a very narrow race here where it would be possible for the timeout handler
to close a channel after the client factory verified that the channel was still
active. This change makes sure the client is marked as being recently in use so
that the timeout handler does not close it until a new timeout cycle elapses.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9853 from vanzin/SPARK-11865.

[SPARK-11910][STREAMING][DOCS] Update twitter4j dependency version · 242be7da
Luciano Resende authored 9 years ago
```
Author: Luciano Resende <lresende@apache.org>

Closes #9892 from lresende/SPARK-11910.
```

[SPARK-11836][SQL] udf/cast should not create new SQLContext · 1d912020

Davies Liu authored 9 years ago

They should use the existing SQLContext.

Author: Davies Liu <davies@databricks.com>

Closes #9914 from davies/create_udf.

[SPARK-4424] Remove spark.driver.allowMultipleContexts override in tests · 1b6e938b

Josh Rosen authored 9 years ago

This patch removes `spark.driver.allowMultipleContexts=true` from our test configuration. The multiple SparkContexts check was originally disabled because certain test suites in SQL needed to create multiple contexts. As far as I know, this configuration change is no longer necessary, so we should remove it in order to make it easier to find test cleanup bugs.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #9865 from JoshRosen/SPARK-4424.

[SPARK-11837][EC2] python3 compatibility for launching ec2 m3 instances · f6dcc6e9

Mortada Mehyar authored 9 years ago

This currently breaks on Python 3 because the `string` module no longer has `letters`; `ascii_letters` should be used instead.
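
A minimal sketch of the portable form follows. The `random_suffix` helper is hypothetical and is not taken from the EC2 script.

```python
import random
import string


def random_suffix(length=8, seed=None):
    """Build a random alphabetic suffix that works on Python 2 and 3."""
    rng = random.Random(seed)
    # `string.letters` was removed in Python 3; `ascii_letters` exists in both.
    return "".join(rng.choice(string.ascii_letters) for _ in range(length))
```

On Python 3, `string.letters` raises `AttributeError`, whereas `string.ascii_letters` is available in both major versions.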

Author: Mortada Mehyar <mortada.mehyar@gmail.com>

Closes #9797 from mortada/python3_fix.

[SPARK-11920][ML][DOC] ML LinearRegression should use correct dataset in examples and user guide doc · 98d7ec7d

Yanbo Liang authored 9 years ago

ML `LinearRegression` uses `data/mllib/sample_libsvm_data.txt` as the dataset in the examples and user guide doc, but it is actually a classification dataset rather than a regression dataset. We should use `data/mllib/sample_linear_regression_data.txt` instead.
The deeper cause is that `LinearRegression` with the "normal" solver cannot solve this dataset correctly, possibly because the data is ill-conditioned and the labels are unreasonable. This issue has been reported at [SPARK-11918](https://issues.apache.org/jira/browse/SPARK-11918).
Users will be confused if they run the example code and get an exception, so we should make this change, which clearly illustrates the usage of the `LinearRegression` algorithm.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9905 from yanboliang/spark-11920.

[SPARK-11762][NETWORK] Account for active streams when counting outstanding requests. · 5231cd5a

Marcelo Vanzin authored 9 years ago

This way the timeout handling code can correctly close "hung" channels that are
processing streams.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9747 from vanzin/SPARK-11762.

[SPARK-7173][YARN] Add label expression support for application master · 5fd86e4f

jerryshao authored 9 years ago

Add label expression support for the AM to restrict it to running on a specific set of nodes. I tested it locally and it works fine.

sryza and vanzin please help to review, thanks a lot.

Author: jerryshao <sshao@hortonworks.com>

Closes #9800 from jerryshao/SPARK-7173.

[SPARK-11913][SQL] support typed aggregate with complex buffer schema · 946b4065
Wenchen Fan authored 9 years ago
```
Author: Wenchen Fan <wenchen@databricks.com>

Closes #9898 from cloud-fan/agg.
```