Commits · 7c0ed13d298d9cf66842c667602e2dccb8f5605b · cs525-sp18-g07 / spark

Dec 22, 2014

[SPARK-4079] [CORE] Consolidates Errors if a CompressionCodec is not available · 7c0ed13d

Kostas Sakellis authored 10 years ago

This commit consolidates some of the exceptions thrown if compression codecs are not available. If a bad configuration string was passed in, a ClassNotFoundException was thrown. Also, if Snappy was not available, an InvocationTargetException was thrown when the codec was being used (not when it was being initialized). Now, an IllegalArgumentException is thrown at creation time when a codec is not available, either because the class does not exist or because the codec itself is not available on the system. This allows for a better error message and faster failure.
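The fail-fast idea can be sketched as follows. This is a minimal illustration and not the code from this commit: `Codec`, `IdentityCodec`, and `createCodec` are hypothetical stand-ins, and only the pattern (translate reflection failures into one IllegalArgumentException at creation time) is taken from the description above.

```scala
object CodecFactorySketch {
  // Hypothetical stand-in for a codec interface; not Spark's CompressionCodec.
  trait Codec { def name: String }

  class IdentityCodec extends Codec { val name = "identity" }

  // Instantiates a codec by class name. Any reflection failure (missing class,
  // missing constructor, failing constructor) or a wrong type is reported as a
  // single IllegalArgumentException when the codec is created, not when it is used.
  def createCodec(className: String): Codec = {
    try {
      Class.forName(className).getConstructor().newInstance().asInstanceOf[Codec]
    } catch {
      case e @ (_: ReflectiveOperationException | _: ClassCastException) =>
        throw new IllegalArgumentException(s"Codec [$className] is not available.", e)
    }
  }
}
```

A caller that passes a misspelled class name sees the problem immediately, with the original exception preserved as the cause.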

Author: Kostas Sakellis <kostas@cloudera.com>

Closes #3119 from ksakellis/kostas-spark-4079 and squashes the following commits:

9709c7c [Kostas Sakellis] Removed unnecessary Logging class
63bfdd0 [Kostas Sakellis] Removed isAvailable to preserve binary compatibility
1d0ef2f [Kostas Sakellis] [SPARK-4079] [CORE] Added more information to exception
64f3d27 [Kostas Sakellis] [SPARK-4079] [CORE] Code review feedback
52dfa8f [Kostas Sakellis] [SPARK-4079] [CORE] Default to LZF if Snappy not available

SPARK-4447. Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha · d62da642

Sandy Ryza authored 10 years ago

Author: Sandy Ryza <sandy@cloudera.com>

Closes #3652 from sryza/sandy-spark-4447 and squashes the following commits:

2791158 [Sandy Ryza] Review feedback
c23507b [Sandy Ryza] Strip margin from client arguments help string
18be7ba [Sandy Ryza] SPARK-4447

[SPARK-4733] Add missing parameter comments in ShuffleDependency · fb8e85e8

Takeshi Yamamuro authored 10 years ago

Add missing Javadoc comments in ShuffleDependency.

Author: Takeshi Yamamuro <linguin.m.s@gmail.com>

Closes #3594 from maropu/DependencyJavadocFix and squashes the following commits:

32129b4 [Takeshi Yamamuro] Fix comments in @aggregator and @mapSideCombine
303c75d [Takeshi Yamamuro] [SPARK-4733] Add missing parameter comments in ShuffleDependency

[Minor] Improve some code in BroadcastTest for short · 1d9788e4

carlmartin authored 10 years ago

Using
    val arr1 = (0 until num).toArray
instead of
    val arr1 = new Array[Int](num)
    for (i <- 0 until arr1.length) {
      arr1(i) = i
    }
for brevity.

Author: carlmartin <carlmartinmax@gmail.com>

Closes #3750 from SaintBacchus/BroadcastTest and squashes the following commits:

43adb70 [carlmartin] Improve some code in BroadcastTest for short

[SPARK-4883][Shuffle] Add a name to the directoryCleaner thread · 8773705f

zsxwing authored 10 years ago

Author: zsxwing <zsxwing@gmail.com>

Closes #3734 from zsxwing/SPARK-4883 and squashes the following commits:

e6f2b61 [zsxwing] Fix the name
cc74727 [zsxwing] Add a name to the directoryCleaner thread

[SPARK-4870] Add spark version to driver log · 39272c8c

Zhang, Liye authored 10 years ago

Author: Zhang, Liye <liye.zhang@intel.com>

Closes #3717 from liyezhang556520/version2Log and squashes the following commits:

ccd30d7 [Zhang, Liye] delete log in sparkConf
330f70c [Zhang, Liye] move the log from SparkConf to SparkContext
96dc115 [Zhang, Liye] remove curly brace
e833330 [Zhang, Liye] add spark version to driver log

[SPARK-4915][YARN] Fix classname to be specified for external shuffle service. · 96606f69

Tsuyoshi Ozawa authored 10 years ago

Author: Tsuyoshi Ozawa <ozawa.tsuyoshi@lab.ntt.co.jp>

Closes #3757 from oza/SPARK-4915 and squashes the following commits:

3b0d6d6 [Tsuyoshi Ozawa] Fix classname to be specified for external shuffle service.

[SPARK-4918][Core] Reuse Text in saveAsTextFile · 93b2f3a8

zsxwing authored 10 years ago

Reuse Text in saveAsTextFile to reduce GC.
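A minimal sketch of the object-reuse idea, assuming a hypothetical `MutableText` holder in place of Hadoop's `Text` (this is not the patched `saveAsTextFile` code). Reuse is only safe when each element is consumed before the next one is produced.

```scala
object ReuseSketch {
  // Hypothetical mutable holder standing in for a writable text object.
  final class MutableText {
    private var value = ""
    def set(s: String): Unit = { value = s }
    override def toString: String = value
  }

  // Allocates one holder per element.
  def withFreshObjects(lines: Iterator[String]): Iterator[MutableText] =
    lines.map { s => val t = new MutableText; t.set(s); t }

  // Allocates a single holder and overwrites it for every element,
  // which reduces garbage when the consumer writes each element out immediately.
  def withReuse(lines: Iterator[String]): Iterator[MutableText] = {
    val t = new MutableText
    lines.map { s => t.set(s); t }
  }
}
```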

/cc rxin

Author: zsxwing <zsxwing@gmail.com>

Closes #3762 from zsxwing/SPARK-4918 and squashes the following commits:

59f03eb [zsxwing] Reuse Text in saveAsTextFile

[SPARK-2075][Core] Make the compiler generate the same bytecode for Hadoop 1.+ and Hadoop 2.+ · 6ee6aa70

zsxwing authored 10 years ago

`NullWritable` is a `Comparable` rather than `Comparable[NullWritable]` in Hadoop 1.+, so the compiler cannot find an implicit Ordering for it. As a result, it generates different anonymous classes for `saveAsTextFile` in Hadoop 1.+ and Hadoop 2.+. Therefore, we provide an Ordering for NullWritable here so that the compiler generates the same code.

I used the following commands to confirm that the generated bytecode is the same.
```
mvn -Dhadoop.version=1.2.1 -DskipTests clean package -pl core -am
javap -private -c -classpath core/target/scala-2.10/classes org.apache.spark.rdd.RDD > ~/hadoop1.txt

mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package -pl core -am
javap -private -c -classpath core/target/scala-2.10/classes org.apache.spark.rdd.RDD > ~/hadoop2.txt

diff ~/hadoop1.txt ~/hadoop2.txt
```

However, the compiler still generates different code for classes that call methods of `JobContext/TaskAttemptContext`. `JobContext/TaskAttemptContext` is a class in Hadoop 1.+, so calling its methods uses `invokevirtual`, while it is an interface in Hadoop 2.+, where the same call uses `invokeinterface`.

To fix it, we can use reflection to call `JobContext/TaskAttemptContext.getConfiguration`.
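The reflective call can be sketched like this. The context classes below are hypothetical stand-ins for the Hadoop types, and the helper name is illustrative; the point is only that a lookup by method name does not depend on whether the receiver's type is a class or an interface.

```scala
object ReflectiveCallSketch {
  // Hypothetical stand-ins: the same method exposed by a plain class and via an interface.
  class ClassStyleContext { def getConfiguration: String = "conf-from-class" }
  trait InterfaceStyleContext { def getConfiguration: String }
  class InterfaceImpl extends InterfaceStyleContext {
    def getConfiguration: String = "conf-from-interface"
  }

  // Looks the method up by name at runtime, so this call site compiles to the
  // same bytecode regardless of how the receiver's type is declared.
  def getConfiguration(context: AnyRef): String = {
    val method = context.getClass.getMethod("getConfiguration")
    method.invoke(context).asInstanceOf[String]
  }
}
```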

Author: zsxwing <zsxwing@gmail.com>

Closes #3740 from zsxwing/SPARK-2075 and squashes the following commits:

39d9df2 [zsxwing] Fix the code style
e4ad8b5 [zsxwing] Use null for the implicit Ordering
734bac9 [zsxwing] Explicitly set the implicit parameters
ca03559 [zsxwing] Use reflection to access JobContext/TaskAttemptContext.getConfiguration
fa40db0 [zsxwing] Add an Ordering for NullWritable to make the compiler generate same byte codes for RDD

Dec 21, 2014

SPARK-4910 [CORE] build failed (use of FileStatus.isFile in Hadoop 1.x) · c6a3c0d5

Sean Owen authored 10 years ago

Fix small Hadoop 1 compile error from SPARK-2261. In Hadoop 1.x, all we have is FileStatus.isDir, so these "is file" assertions are changed to "is not a dir". This is how similar checks are done so far in the code base.

Author: Sean Owen <sowen@cloudera.com>

Closes #3754 from srowen/SPARK-4910 and squashes the following commits:

52c5e4e [Sean Owen] Fix small Hadoop 1 compile error from SPARK-2261

Dec 20, 2014

[Minor] Build Failed: value defaultProperties not found · a764960b

huangzhaowei authored 10 years ago

Maven build failed: value defaultProperties not found. Maybe related to this PR:
https://github.com/apache/spark/commit/1d648123a77bbcd9b7a34cc0d66c14fa85edfecd
andrewor14 can you look at this problem?

Author: huangzhaowei <carlmartinmax@gmail.com>

Closes #3749 from SaintBacchus/Mvn-Build-Fail and squashes the following commits:

8e2917c [huangzhaowei] Build Failed: value defaultProperties not found

Dec 19, 2014

[SPARK-4140] Document dynamic allocation · 15c03e1e

Andrew Or authored 10 years ago

Once the external shuffle service is also documented, the dynamic allocation section will link to it. Let me know if the whole dynamic allocation section should be moved to its own page; I personally think the organization might be cleaner that way.

This patch builds on top of oza's work in #3689.

aarondav pwendell

Author: Andrew Or <andrew@databricks.com>
Author: Tsuyoshi Ozawa <ozawa.tsuyoshi@gmail.com>

Closes #3731 from andrewor14/document-dynamic-allocation and squashes the following commits:

1281447 [Andrew Or] Address a few comments
b9843f2 [Andrew Or] Document the configs as well
246fb44 [Andrew Or] Merge branch 'SPARK-4839' of github.com:oza/spark into document-dynamic-allocation
8c64004 [Andrew Or] Add documentation for dynamic allocation (without configs)
6827b56 [Tsuyoshi Ozawa] Fixing a documentation of spark.dynamicAllocation.enabled.
53cff58 [Tsuyoshi Ozawa] Adding a documentation about dynamic resource allocation.

[SPARK-4831] Do not include SPARK_CLASSPATH if empty · 7cb3f547

Daniel Darabos authored 10 years ago

My guess for fixing https://issues.apache.org/jira/browse/SPARK-4831.

Author: Daniel Darabos <darabos.daniel@gmail.com>

Closes #3678 from darabos/patch-1 and squashes the following commits:

36e1243 [Daniel Darabos] Do not include SPARK_CLASSPATH if empty.

SPARK-2641: Passing num executors to spark arguments from properties file · 1d648123

Kanwaljit Singh authored 10 years ago

Since Spark executor memory and executor cores can be set through a properties file, it should also be possible to set the number of executor instances the same way.

Author: Kanwaljit Singh <kanwaljit.singh@guavus.com>

Closes #1657 from kjsingh/branch-1.0 and squashes the following commits:

d8a5a12 [Kanwaljit Singh] SPARK-2641: Fixing how spark arguments are loaded from properties file for num executors

Conflicts:
core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala

[SPARK-3060] spark-shell.cmd doesn't accept application options in Windows OS · 8d932475

Masayoshi TSUZUKI authored 10 years ago

Added a module equivalent to utils.sh and modified spark-shell2.cmd to use it to parse options.

Now we can use application options,
e.g. `bin\spark-shell.cmd --master spark://master:7077 -i path\to\script.txt`

Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>

Closes #3350 from tsudukim/feature/SPARK-3060 and squashes the following commits:

4551e56 [Masayoshi TSUZUKI] Modified too long line which defines the submission options to pass findstr command.
3a11361 [Masayoshi TSUZUKI] [SPARK-3060] spark-shell.cmd doesn't accept application options in Windows OS

change signature of example to match released code · c25c669d

Eran Medan authored 10 years ago

The signature of registerKryoClasses actually takes Array[Class[_]], not Seq.

Author: Eran Medan <ehrann.mehdan@gmail.com>

Closes #3747 from eranation/patch-1 and squashes the following commits:

ee9885d [Eran Medan] change signature of example to match released code

[SPARK-2261] Make event logger use a single file. · 45645191

Marcelo Vanzin authored 10 years ago

Currently the event logger uses a directory and several files to
describe an app's event log, all but one of which are empty. This
is not very HDFS-friendly, since creating lots of nodes in HDFS
(especially when they don't contain any data) is frowned upon due
to the node metadata being kept in the NameNode's memory.

Instead, add a header section to the event log file that contains metadata
needed to read the events. This metadata includes things like the Spark
version (for future code that may need it for backwards compatibility) and
the compression codec used for the event data.

With the new approach, aside from reducing the load on the NN, there
are also far fewer remote calls needed when reading the log directory.
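A minimal sketch of a single-file layout with a plain-text header, working on in-memory lines. The `key=value` header format, the end-of-header marker, and the function names are assumptions made for illustration and are not taken from the patch.

```scala
object EventLogHeaderSketch {
  // Hypothetical marker separating header metadata from event lines.
  val HeaderEnd = "=== HEADER END ==="

  // Builds the lines of one log file: metadata first, then the marker, then events.
  def write(metadata: Map[String, String], events: Seq[String]): Seq[String] =
    metadata.toSeq.sorted.map { case (k, v) => s"$k=$v" } ++ Seq(HeaderEnd) ++ events

  // Splits a log back into (metadata, events) by scanning up to the marker.
  def read(lines: Seq[String]): (Map[String, String], Seq[String]) = {
    val (header, rest) = lines.span(_ != HeaderEnd)
    val metadata = header.filter(_.contains("=")).map { line =>
      val idx = line.indexOf('=')
      line.substring(0, idx) -> line.substring(idx + 1)
    }.toMap
    (metadata, rest.drop(1)) // drop the marker line itself
  }
}
```

A reader only needs to open one file to learn both the metadata (for example, which codec to use) and the events.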

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #1222 from vanzin/hist-server-single-log and squashes the following commits:

cc8f5de [Marcelo Vanzin] Store header in plain text.
c7e6123 [Marcelo Vanzin] Update comment.
59c561c [Marcelo Vanzin] Review feedback.
216c5a3 [Marcelo Vanzin] Review comments.
dce28e9 [Marcelo Vanzin] Fix log overwrite test.
f91c13e [Marcelo Vanzin] Handle "spark.eventLog.overwrite", and add unit test.
346f0b4 [Marcelo Vanzin] Review feedback.
ed0023e [Marcelo Vanzin] Merge branch 'master' into hist-server-single-log
3f4500f [Marcelo Vanzin] Unit test for SPARK-3697.
45c7a1f [Marcelo Vanzin] Version of SPARK-3697 for this branch.
b3ee30b [Marcelo Vanzin] Merge branch 'master' into hist-server-single-log
a6d5c50 [Marcelo Vanzin] Merge branch 'master' into hist-server-single-log
16fd491 [Marcelo Vanzin] Use unique log directory for each codec.
0ef3f70 [Marcelo Vanzin] Merge branch 'master' into hist-server-single-log
d93c44a [Marcelo Vanzin] Add a newline to make the header more readable.
9e928ba [Marcelo Vanzin] Add types.
bd6ba8c [Marcelo Vanzin] Review feedback.
a624a89 [Marcelo Vanzin] Merge branch 'master' into hist-server-single-log
04364dc [Marcelo Vanzin] Merge branch 'master' into hist-server-single-log
bb7c2d3 [Marcelo Vanzin] Fix scalastyle warning.
16661a3 [Marcelo Vanzin] Simplify some internal code.
cc6bce4 [Marcelo Vanzin] Some review feedback.
a722184 [Marcelo Vanzin] Do not encode metadata in log file name.
3700586 [Marcelo Vanzin] Restore log flushing.
f677930 [Marcelo Vanzin] Fix botched rebase.
ae571fa [Marcelo Vanzin] Fix end-to-end event logger test.
9db0efd [Marcelo Vanzin] Show prettier name in UI.
8f42274 [Marcelo Vanzin] Make history server parse old-style log directories.
6251dd7 [Marcelo Vanzin] Make event logger use a single file.

[SPARK-4890] Upgrade Boto to 2.34.0; automatically download Boto from PyPi instead of packaging it · c28083f4

Josh Rosen authored 10 years ago

This patch upgrades `spark-ec2`'s Boto version to 2.34.0, since this is blocking several features. Newer versions of Boto don't work properly when they're loaded from a zipfile since they try to read a JSON file from a path relative to the Boto library sources.

Therefore, this patch also changes spark-ec2 to automatically download Boto from PyPi if it's not present in `SPARK_EC2_DIR/lib`, similar to what we do in the `sbt/sbt` script. This shouldn't be an issue for users since they already need to have an internet connection to launch an EC2 cluster. By performing the downloading in spark_ec2.py instead of the Bash script, this should also work for Windows users.

I've tested this with Python 2.6, too.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3737 from JoshRosen/update-boto and squashes the following commits:

0aa43cc [Josh Rosen] Remove unused setup_standalone_cluster() method.
f02935d [Josh Rosen] Enable Python deprecation warnings and fix one Boto warning:
587ae89 [Josh Rosen] [SPARK-4890] Upgrade Boto to 2.34.0; automatically download Boto from PyPi instead of packaging it

[SPARK-4896] don’t redundantly overwrite executor JAR deps · 7981f969

Ryan Williams authored 10 years ago

Author: Ryan Williams <ryan.blake.williams@gmail.com>

Closes #2848 from ryan-williams/fetch-file and squashes the following commits:

c14daff [Ryan Williams] Fix copy that was changed to a move inadvertently
8e39c16 [Ryan Williams] code review feedback
788ed41 [Ryan Williams] don’t redundantly overwrite executor JAR deps

[SPARK-4889] update history server example cmds · cdb2c645

Ryan Williams authored 10 years ago

Author: Ryan Williams <ryan.blake.williams@gmail.com>

Closes #3736 from ryan-williams/hist and squashes the following commits:

421d8ff [Ryan Williams] add another random typo fix
76d6a4c [Ryan Williams] remove hdfs example
a2d0f82 [Ryan Williams] code review feedback
9ca7629 [Ryan Williams] [SPARK-4889] update history server example cmds

Small refactoring to pass SparkEnv into Executor rather than creating SparkEnv in Executor. · 336cd341

Reynold Xin authored 10 years ago

This consolidates some code paths and makes constructor arguments simpler for a few classes.

Author: Reynold Xin <rxin@databricks.com>

Closes #3738 from rxin/sparkEnvDepRefactor and squashes the following commits:

82e02cc [Reynold Xin] Fixed couple bugs.
217062a [Reynold Xin] Code review feedback.
bd00af7 [Reynold Xin] Small refactoring to pass SparkEnv into Executor rather than creating SparkEnv in Executor.

[Build] Remove spark-staging-1038 · 8e253ebb

scwf authored 10 years ago

Author: scwf <wangfei1@huawei.com>

Closes #3743 from scwf/abc and squashes the following commits:

7d98bc8 [scwf] removing spark-staging-1038

[SPARK-4901] [SQL] Hot fix for BytesWritable.copyBytes · 5479450c

Cheng Hao authored 10 years ago

HiveInspectors.scala failed to compile with Hadoop 1 because BytesWritable.copyBytes is not available there.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3742 from chenghao-intel/settable_oi_hotfix and squashes the following commits:

bb04d1f [Cheng Hao] hot fix for ByteWritables.copyBytes

SPARK-3428. TaskMetrics for running tasks is missing GC time metrics · 283263ff

Sandy Ryza authored 10 years ago

Author: Sandy Ryza <sandy@cloudera.com>

Closes #3684 from sryza/sandy-spark-3428 and squashes the following commits:

cb827fe [Sandy Ryza] SPARK-3428. TaskMetrics for running tasks is missing GC time metrics

Dec 18, 2014

[SPARK-4674] Refactor getCallSite · d7fc69a8

Liang-Chi Hsieh authored 10 years ago

The current version of `getCallSite` visits the collection of `StackTraceElement` twice. This is unnecessary, since the work can be done in a single pass. We also do not need to keep the filtered `StackTraceElement`s.
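The single-pass traversal might look roughly like the sketch below. `Frame`, `CallSite`, and the `isInternal` rule are hypothetical simplifications; this is not the refactored `getCallSite` itself.

```scala
object CallSiteSketch {
  // Hypothetical simplified stack frame and result types.
  case class Frame(className: String, methodName: String, line: Int)
  case class CallSite(lastInternalMethod: String, firstUserFrame: Option[Frame])

  // Hypothetical rule for what counts as library-internal code.
  def isInternal(f: Frame): Boolean = f.className.startsWith("lib.")

  // One traversal: remember the most recent internal method until the first
  // user frame is reached, without building a filtered copy of the trace.
  def callSite(trace: Seq[Frame]): CallSite = {
    var lastInternal = "<unknown>"
    var firstUser: Option[Frame] = None
    val it = trace.iterator
    while (firstUser.isEmpty && it.hasNext) {
      val f = it.next()
      if (isInternal(f)) lastInternal = f.methodName else firstUser = Some(f)
    }
    CallSite(lastInternal, firstUser)
  }
}
```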

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #3532 from viirya/refactor_getCallSite and squashes the following commits:

62aa124 [Liang-Chi Hsieh] Fix style.
e741017 [Liang-Chi Hsieh] Refactor getCallSite.

[SPARK-4728][MLLib] Add exponential, gamma, and log normal sampling to MLlib data generators · ee1fb97a

RJ Nowling authored 10 years ago

This patch adds:

* Exponential, gamma, and log normal generators that wrap Apache Commons math3 to the private API
* Functions for generating exponential, gamma, and log normal RDDs and vector RDDs
* Tests for the above

Author: RJ Nowling <rnowling@gmail.com>

Closes #3680 from rnowling/spark4728 and squashes the following commits:

455f50a [RJ Nowling] Add tests for exponential, gamma, and log normal samplers to JavaRandomRDDsSuite
3e1134a [RJ Nowling] Fix val/var, unnecessary creation of Distribution objects when setting seeds, and import line longer than line wrap limits
58f5b97 [RJ Nowling] Fix bounds in tests so they scale with variance, not stdev
84fd98d [RJ Nowling] Add more values for testing distributions.
9f96232 [RJ Nowling] [SPARK-4728] Add exponential, gamma, and log normal sampling to MLlib data generators

[SPARK-4861][SQL] Refactor command in Spark SQL · c3d91da5

wangfei authored 10 years ago

Remove `Command` and use `RunnableCommand` instead.

Author: wangfei <wangfei1@huawei.com>
Author: scwf <wangfei1@huawei.com>

Closes #3712 from scwf/cmd and squashes the following commits:

51a82f2 [wangfei] fix test failure
0e03be8 [wangfei] address comments
4033bed [scwf] remove CreateTableAsSelect in hivestrategy
5d20010 [wangfei] address comments
125f542 [scwf] factory command in spark sql

[SPARK-4573] [SQL] Add SettableStructObjectInspector support in "wrap" function · ae9f1286

Cheng Hao authored 10 years ago

A Hive UDAF may create a customized object constructed by a SettableStructObjectInspector. Supporting this is critical when integrating Hive UDAFs with the refactored UDAF interface.

Adding more match cases introduces a performance issue in `wrap/unwrap`, which will be addressed in another PR.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3429 from chenghao-intel/settable_oi and squashes the following commits:

9f0aff3 [Cheng Hao] update code style issues as feedbacks
2b0561d [Cheng Hao] Add more scala doc
f5a40e8 [Cheng Hao] add scala doc
2977e9b [Cheng Hao] remove the timezone setting for test suite
3ed284c [Cheng Hao] fix the date type comparison
f1b6749 [Cheng Hao] Update the comment
932940d [Cheng Hao] Add more unit test
72e4332 [Cheng Hao] Add settable StructObjectInspector support

[SPARK-2554][SQL] Supporting SumDistinct partial aggregation · 7687415c

ravipesala authored 10 years ago

Adds support for partial aggregation of SumDistinct.

Author: ravipesala <ravindra.pesala@huawei.com>

Closes #3348 from ravipesala/SPARK-2554 and squashes the following commits:

fd28e4d [ravipesala] Fixed review comments
e60e67f [ravipesala] Fixed test cases and made it as nullable
32fe234 [ravipesala] Supporting SumDistinct partial aggregation

Conflicts:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala

[SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references · e7de7e5f

YanTangZhai authored 10 years ago

The SQL "select * from spark_test::for_test where abs(20141202) is not null" has predicates=List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)) and partitionKeyIds=AttributeSet(). PruningPredicates is List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)). As a result, the exception "java.lang.IllegalArgumentException: requirement failed: Partition pruning predicates only supported for partitioned tables." is thrown.

The SQL "select * from spark_test::for_test_partitioned_table where abs(20141202) is not null and type_id=11 and platform = 3", on a table partitioned by the key insert_date, has predicates=List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202), (type_id#12 = 11), (platform#8 = 3)) and partitionKeyIds=AttributeSet(insert_date#24). PruningPredicates is List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)), even though that predicate references no partition key.
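One plausible reading of the problem is that an empty reference set is vacuously a subset of the partition keys. The sketch below illustrates that reading with plain sets of column names; `Predicate` and `split` are hypothetical simplifications and not the patched HiveStrategies code.

```scala
object PruningPredicateSketch {
  // Hypothetical simplification: a predicate is a label plus the column names it references.
  case class Predicate(sql: String, references: Set[String])

  // A predicate is used for partition pruning only if it references at least one
  // column and every referenced column is a partition key. Without the nonEmpty
  // check, a predicate with no references would always pass the subset test.
  def split(predicates: Seq[Predicate], partitionKeys: Set[String]): (Seq[Predicate], Seq[Predicate]) =
    predicates.partition(p => p.references.nonEmpty && p.references.subsetOf(partitionKeys))
}
```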

Author: YanTangZhai <hakeemzhai@tencent.com>
Author: yantangzhai <tyz0303@163.com>

Closes #3556 from YanTangZhai/SPARK-4693 and squashes the following commits:

620ebe3 [yantangzhai] [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
37cfdf5 [yantangzhai] [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
70a3544 [yantangzhai] [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
efa9b03 [YanTangZhai] Update HiveQuerySuite.scala
72accf1 [YanTangZhai] Update HiveQuerySuite.scala
e572b9a [YanTangZhai] Update HiveStrategies.scala
6e643f8 [YanTangZhai] Merge pull request #11 from apache/master
e249846 [YanTangZhai] Merge pull request #10 from apache/master
d26d982 [YanTangZhai] Merge pull request #9 from apache/master
76d4027 [YanTangZhai] Merge pull request #8 from apache/master
03b62b0 [YanTangZhai] Merge pull request #7 from apache/master
8a00106 [YanTangZhai] Merge pull request #6 from apache/master
cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
cdef539 [YanTangZhai] Merge pull request #1 from apache/master

[SPARK-4756][SQL] FIX: sessionToActivePool grow infinitely, even as sessions expire · 22ddb6e0

guowei2 authored 10 years ago

**sessionToActivePool** in **SparkSQLOperationManager** grows without bound, even as sessions expire.
We should remove the pool value when a session is closed, even though not every session has an entry in **sessionToActivePool**.
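A minimal sketch of the cleanup, assuming a hypothetical `Registry` keyed by a session id string (the real manager's types and method names differ):

```scala
import scala.collection.mutable

object SessionPoolSketch {
  // Hypothetical session-to-pool registry.
  class Registry {
    private val sessionToActivePool = mutable.Map.empty[String, String]

    def setPool(sessionId: String, pool: String): Unit = {
      sessionToActivePool(sessionId) = pool
    }

    // Removing the entry on close keeps the map bounded by the number of open
    // sessions; remove is a no-op for sessions that never set a pool.
    def closeSession(sessionId: String): Unit = {
      sessionToActivePool.remove(sessionId)
    }

    def size: Int = sessionToActivePool.size
  }
}
```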

Author: guowei2 <guowei2@asiainfo.com>

Closes #3617 from guowei2/SPARK-4756 and squashes the following commits:

e9b97b8 [guowei2] fix compile bug with Shim12
cf0f521 [guowei2] Merge remote-tracking branch 'apache/master' into SPARK-4756
e070998 [guowei2] fix: remove active pool of the session when it expired

[SPARK-3928][SQL] Support wildcard matches on Parquet files. · b68bc6d2

Thu Kyaw authored 10 years ago

...arquetFile accept hadoop glob pattern in path.

Author: Thu Kyaw <trk007@gmail.com>

Closes #3407 from tkyaw/master and squashes the following commits:

19115ad [Thu Kyaw] Merge https://github.com/apache/spark
ceded32 [Thu Kyaw] [SPARK-3928][SQL] Support wildcard matches on Parquet files.
d322c28 [Thu Kyaw] [SPARK-3928][SQL] Support wildcard matches on Parquet files.
ce677c6 [Thu Kyaw] [SPARK-3928][SQL] Support wildcard matches on Parquet files.

[SPARK-2663] [SQL] Support the Grouping Set · f728e0fe

Cheng Hao authored 10 years ago

Add support for `GROUPING SETS`, `ROLLUP`, `CUBE` and the virtual column `GROUPING__ID`.

More details on how to use `GROUPING SETS` can be found at:
https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup
https://issues.apache.org/jira/secure/attachment/12676811/grouping_set.pdf

The general idea of the implementation is:
1. Replace `ROLLUP` and `CUBE` with `GROUPING SETS`
2. Explode each input row, and then feed the resulting rows to `Aggregate`
  * Each grouping set is represented as a bit mask over the `GroupBy Expression List`; for each bit, `1` means the expression is selected, otherwise `0` (left is the lower bit, and right is the higher bit in the `GroupBy Expression List`)
  * Several projections are constructed according to the grouping sets, and within each projection (`Seq[Expression]`), we replace an expression with `Literal(null)` if it is not selected in the grouping set (based on the bit mask)
  * The output schema of `Explode` is `child.output :+ grouping__id`
  * The GroupBy expressions of `Aggregate` are `GroupBy Expression List :+ grouping__id`
  * The `Aggregation expressions` are kept the same for the `Aggregate`
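The steps above can be sketched on plain values. This is an illustration under stated assumptions, not the patch's analyzer or physical operator: the function names are hypothetical, bit `i` of a mask is assumed to correspond to the `i`-th expression in the list, and rows are modeled as `Seq[Any]`.

```scala
object GroupingSetsSketch {
  // Bit i of a mask is 1 when the i-th GROUP BY expression is selected.
  // ROLLUP over n expressions keeps prefixes: all n, the first n-1, ..., none.
  def rollup(n: Int): Seq[Int] = (n to 0 by -1).map(k => (1 << k) - 1)

  // CUBE keeps every subset of the n expressions.
  def cube(n: Int): Seq[Int] = 0 until (1 << n)

  // Explodes one row of group-by values into one row per grouping set,
  // replacing unselected values with null and appending the mask as grouping__id.
  def expand(row: Seq[Any], masks: Seq[Int]): Seq[Seq[Any]] =
    masks.map { mask =>
      row.zipWithIndex.map { case (v, i) => if ((mask & (1 << i)) != 0) v else null } :+ mask
    }
}
```

For two expressions, `rollup(2)` yields the masks 3, 1, 0 and `cube(2)` yields 0 through 3; the exploded rows can then be grouped by the original expressions plus the appended mask.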

The expression substitutions happen during Logical Plan analysis, so we benefit from Logical Plan optimization (e.g. expression constant folding and map-side aggregation). Only an `Explosive` operator is added to the Physical Plan, which explodes the rows according to the pre-set projections.

A known issue that will be addressed in a follow-up PR:
* The `ColumnPruning` optimization is not yet supported for the `Explosive` node.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #1567 from chenghao-intel/grouping_sets and squashes the following commits:

fe65fcc [Cheng Hao] Remove the extra space
3547056 [Cheng Hao] Add more doc and Simplify the Expand
a7c869d [Cheng Hao] update code as feedbacks
d23c672 [Cheng Hao] Add GroupingExpression to replace the Seq[Expression]
414b165 [Cheng Hao] revert the unnecessary changes
ec276c6 [Cheng Hao] Support Rollup/Cube/GroupingSets

[SPARK-4754] Refactor SparkContext into ExecutorAllocationClient · 9804a759

Andrew Or authored 10 years ago

This is so that the `ExecutorAllocationManager` does not take in the `SparkContext`, with all of its dependencies, as an argument. It prevents future developers of this class from tying it even more tightly to the `SparkContext`, which has really become quite a monstrous object.
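The decoupling can be sketched as follows. All names here (`AllocationClient`, `AllocationManager`, `FakeClient`, and the method signatures) are hypothetical stand-ins chosen for illustration, not the trait introduced by this commit.

```scala
object AllocationClientSketch {
  // Hypothetical narrow interface exposing only what the manager needs.
  trait AllocationClient {
    def requestExecutors(numAdditional: Int): Boolean
    def killExecutor(executorId: String): Boolean
  }

  // The manager depends on the narrow trait rather than on a whole context object.
  class AllocationManager(client: AllocationClient) {
    def scaleUp(n: Int): Boolean = client.requestExecutors(n)
    def scaleDown(executorId: String): Boolean = client.killExecutor(executorId)
  }

  // A large context class could mix the trait in; a test can pass a small fake instead.
  class FakeClient extends AllocationClient {
    var requested = 0
    var killed = List.empty[String]
    def requestExecutors(numAdditional: Int): Boolean = { requested += numAdditional; true }
    def killExecutor(executorId: String): Boolean = { killed = executorId :: killed; true }
  }
}
```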

cc'ing pwendell who originally suggested this, and JoshRosen who may have thoughts about the trait mix-in style of `SparkContext`.

Author: Andrew Or <andrew@databricks.com>

Closes #3614 from andrewor14/dynamic-allocation-sc and squashes the following commits:

187070d [Andrew Or] Merge branch 'master' of github.com:apache/spark into dynamic-allocation-sc
59baf6c [Andrew Or] Merge branch 'master' of github.com:apache/spark into dynamic-allocation-sc
347a348 [Andrew Or] Refactor SparkContext into ExecutorAllocationClient

[SPARK-4837] NettyBlockTransferService should use spark.blockManager.port config · 105293a7

Aaron Davidson authored 10 years ago

This is used in NioBlockTransferService here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/network/nio/NioBlockTransferService.scala#L66

Author: Aaron Davidson <aaron@databricks.com>

Closes #3688 from aarondav/SPARK-4837 and squashes the following commits:

ebd2007 [Aaron Davidson] [SPARK-4837] NettyBlockTransferService should use spark.blockManager.port config

SPARK-4743 - Use SparkEnv.serializer instead of closureSerializer in aggregateByKey and foldByKey · f9f58b9a

Ivan Vergiliev authored 10 years ago

Author: Ivan Vergiliev <ivan@leanplum.com>

Closes #3605 from IvanVergiliev/change-serializer and squashes the following commits:

a49b7cf [Ivan Vergiliev] Use serializer instead of closureSerializer in aggregate/foldByKey.

[SPARK-4884]: Improve Partition docs · d5a596d4

Madhu Siddalingaiah authored 10 years ago

Rewording was based on this discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-data-flow-td9804.html
This is the associated JIRA ticket: https://issues.apache.org/jira/browse/SPARK-4884

Author: Madhu Siddalingaiah <madhu@madhu.com>

Closes #3722 from msiddalingaiah/master and squashes the following commits:

79e679f [Madhu Siddalingaiah] [DOC]: improve documentation
51d14b9 [Madhu Siddalingaiah] Merge remote-tracking branch 'upstream/master'
38faca4 [Madhu Siddalingaiah] Merge remote-tracking branch 'upstream/master'
cbccbfe [Madhu Siddalingaiah] Documentation: replace <b> with <code> (again)
332f7a2 [Madhu Siddalingaiah] Documentation: replace <b> with <code>
cd2b05a [Madhu Siddalingaiah] Merge remote-tracking branch 'upstream/master'
0fc12d7 [Madhu Siddalingaiah] Documentation: add description for repartitionAndSortWithinPartitions

[SPARK-4880] remove spark.locality.wait in Analytics · a7ed6f3c

Ernest authored 10 years ago

spark.locality.wait is set to 100000 in examples/graphx/Analytics.scala.
This should be left to the user.

Author: Ernest <earneyzxl@gmail.com>

Closes #3730 from Earne/SPARK-4880 and squashes the following commits:

d79ed04 [Ernest] remove spark.locality.wait in Analytics

[SPARK-4887][MLlib] Fix a bad unittest in LogisticRegressionSuite · 59a49db5

DB Tsai authored 10 years ago

The original test doesn't make sense since if you step in, the lossSum is already NaN,
and the coefficients are diverging. That's because the step size is too large for SGD,
so it doesn't work.

The correct behavior is that you should get smaller coefficients than the ones
without regularization. Comparing the values using a relative error of 20000.0
doesn't make sense either.

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #3735 from dbtsai/mlortestfix and squashes the following commits:

b1a3c42 [DB Tsai] first commit

[SPARK-3607] ConnectionManager threads.max configs on the thread pools don't work · 3720057b

Ilya Ganelin authored 10 years ago

Hi all - cleaned up the code to get rid of the unused parameter and added some discussion of the ThreadPoolExecutor parameters to explain why we can use a single threadCount instead of providing a min/max.
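A likely reason, based on the documented behaviour of `java.util.concurrent.ThreadPoolExecutor`, is that with an unbounded work queue the pool never grows past its core size, so a separate maximum has no effect. The sketch below demonstrates that behaviour; the object and method names are hypothetical and this is not the ConnectionManager code.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

object ThreadPoolSketch {
  // Runs `tasks` trivial tasks on a pool with an unbounded queue and reports the
  // largest number of threads the pool ever had. Extra tasks are queued rather
  // than triggering new threads, so the result never exceeds `core`.
  def largestPoolSize(core: Int, max: Int, tasks: Int): Int = {
    val pool = new ThreadPoolExecutor(
      core, max, 60L, TimeUnit.SECONDS, new LinkedBlockingQueue[Runnable]())
    val done = new CountDownLatch(tasks)
    (1 to tasks).foreach { _ =>
      pool.execute(new Runnable { def run(): Unit = done.countDown() })
    }
    done.await()
    pool.shutdown()
    pool.getLargestPoolSize
  }

  // A single threadCount used for both core and max; idle threads may time out.
  def fixedPool(threadCount: Int): ThreadPoolExecutor = {
    val pool = new ThreadPoolExecutor(
      threadCount, threadCount, 60L, TimeUnit.SECONDS, new LinkedBlockingQueue[Runnable]())
    pool.allowCoreThreadTimeOut(true)
    pool
  }
}
```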

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

Closes #3664 from ilganeli/SPARK-3607C and squashes the following commits:

3c05690 [Ilya Ganelin] Updated documentation and refactored code to extract shared variables
