Commits · 18c77cb533a2b7b2b9b8447009e31259caa2194b · cs525-sp18-g07 / spark

May 25, 2014

HOTFIX: Add no-arg SparkContext constructor in Java · 18c77cb5

Patrick Wendell authored 10 years ago


Self explanatory.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #878 from pwendell/java-constructor and squashes the following commits:

2cc1605 [Patrick Wendell] HOTFIX: Add no-arg SparkContext constructor in Java

(cherry picked from commit b6d22af0)
Signed-off-by: Aaron Davidson <aaron@databricks.com>


[SQL] Minor: Introduce SchemaRDD#aggregate() for simple aggregations · a3976a27

Aaron Davidson authored 10 years ago


```scala
rdd.aggregate(Sum('val))
```
is just shorthand for

```scala
rdd.groupBy()(Sum('val))
```

but seems to be more natural than doing a groupBy with no grouping expressions when you really just want an aggregation over all rows.
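
As a rough illustration of why the two forms are equivalent, here is a toy model in plain Python. It is not the Spark API: `Table`, `group_by`, `aggregate` and `sum_of` are hypothetical names. Grouping with no keys puts every row into a single group, so an aggregation over all rows is just a group-by with an empty key list.

```python
from collections import defaultdict


def sum_of(column):
    """Return an aggregate function that sums `column` over a list of rows."""
    return lambda rows: sum(row[column] for row in rows)


class Table:
    """Toy stand-in for a SchemaRDD: a list of dict rows (hypothetical)."""

    def __init__(self, rows):
        self.rows = list(rows)

    def group_by(self, *keys):
        """Group rows by `keys`; return a function that applies an aggregate."""
        groups = defaultdict(list)
        for row in self.rows:
            groups[tuple(row[k] for k in keys)].append(row)

        def apply(agg):
            return {key: agg(rows) for key, rows in groups.items()}

        return apply

    def aggregate(self, agg):
        """Shorthand for group_by() with no grouping keys."""
        return self.group_by()(agg)


table = Table([
    {"key": "a", "val": 1},
    {"key": "b", "val": 2},
    {"key": "a", "val": 3},
])
print(table.aggregate(sum_of("val")))        # one group, keyed by the empty tuple
print(table.group_by("key")(sum_of("val")))  # one group per distinct key
```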

Did not add a JavaSchemaRDD or Python API, as these seem to be lacking several other methods like groupBy() already -- leaving that cleanup for future patches.

Author: Aaron Davidson <aaron@databricks.com>

Closes #874 from aarondav/schemardd and squashes the following commits:

e9e68ee [Aaron Davidson] Add comment
db6afe2 [Aaron Davidson] Introduce SchemaRDD#aggregate() for simple aggregations

(cherry picked from commit c3576ffc)
Signed-off-by: Reynold Xin <rxin@apache.org>


SPARK-1903 Document Spark's network connections · 5107a6f8

Andrew Ash authored 10 years ago

https://issues.apache.org/jira/browse/SPARK-1903



Author: Andrew Ash <andrew@andrewash.com>

Closes #856 from ash211/SPARK-1903 and squashes the following commits:

6e7782a [Andrew Ash] Add the technology used on each port
1d9b5d3 [Andrew Ash] Document port for history server
56193ee [Andrew Ash] spark.ui.port becomes worker.ui.port and master.ui.port
a774c07 [Andrew Ash] Wording in network section
90e8237 [Andrew Ash] Use real :toc instead of the hand-written one
edaa337 [Andrew Ash] Master -> Standalone Cluster Master
57e8869 [Andrew Ash] Port -> Default Port
3d4d289 [Andrew Ash] Title to title case
c7d42d9 [Andrew Ash] [WIP] SPARK-1903 Add initial port listing for documentation
a416ae9 [Andrew Ash] Word wrap to 100 lines

(cherry picked from commit 06595296)
Signed-off-by: Reynold Xin <rxin@apache.org>


Fix PEP8 violations in Python mllib. · 07f34ca0

Reynold Xin authored 10 years ago


Author: Reynold Xin <rxin@apache.org>

Closes #871 from rxin/mllib-pep8 and squashes the following commits:

848416f [Reynold Xin] Fixed a typo in the previous cleanup (c -> sc).
a8db4cd [Reynold Xin] Fix PEP8 violations in Python mllib.

(cherry picked from commit d33d3c61)
Signed-off-by: Reynold Xin <rxin@apache.org>


Python docstring update for sql.py. · 8891495a

Reynold Xin authored 10 years ago


Mostly related to the following two rules in PEP8 and PEP257:
- Line length < 72 chars.
- First line should be a concise description of the function/class.
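
A docstring that follows these two rules might look like the sketch below. `count_words` is a made-up function for illustration, not code from `sql.py`.

```python
def count_words(text):
    """Count the whitespace-separated words in a string.

    The summary line above is a single concise sentence, and every
    line of this docstring stays under 72 characters.
    """
    return len(text.split())


# Check the docstring against both rules.
doc_lines = count_words.__doc__.splitlines()
summary = doc_lines[0].strip()
longest = max(len(line) for line in doc_lines)
print(summary)
print(longest < 72)
```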

Author: Reynold Xin <rxin@apache.org>

Closes #869 from rxin/docstring-schemardd and squashes the following commits:

7cf0cbc [Reynold Xin] Updated sql.py for pep8 docstring.
0a4aef9 [Reynold Xin] Merge branch 'master' into docstring-schemardd
6678937 [Reynold Xin] Python docstring update for sql.py.

(cherry picked from commit 14f0358b)
Signed-off-by: Reynold Xin <rxin@apache.org>


Fix PEP8 violations in examples/src/main/python. · 33683974

Reynold Xin authored 10 years ago


Author: Reynold Xin <rxin@apache.org>

Closes #870 from rxin/examples-python-pep8 and squashes the following commits:

2829e84 [Reynold Xin] Fix PEP8 violations in examples/src/main/python.

(cherry picked from commit d79c2b28)
Signed-off-by: Reynold Xin <rxin@apache.org>


[maven-release-plugin] prepare for next development iteration · 832dc594
Tathagata Das authored 10 years ago

[maven-release-plugin] prepare release v1.0.0-rc11 · 2f1dc868
Tathagata Das authored 10 years ago


Added license header for tox.ini. · 7273bfc0

Reynold Xin authored 10 years ago


(cherry picked from commit fa541f32c5b92e6868a9c99cbb2c87115d624d23)
Signed-off-by: Reynold Xin <rxin@apache.org>


SPARK-1822: Some minor cleanup work on SchemaRDD.count() · aeffc200

Reynold Xin authored 10 years ago


Minor cleanup following #841.

Author: Reynold Xin <rxin@apache.org>

Closes #868 from rxin/schema-count and squashes the following commits:

5442651 [Reynold Xin] SPARK-1822: Some minor cleanup work on SchemaRDD.count()

(cherry picked from commit d66642e3)
Signed-off-by: Reynold Xin <rxin@apache.org>


Added PEP8 style configuration file. · 291567d6

Reynold Xin authored 10 years ago


This sets the max line length to 100 as a PEP8 exception.

Author: Reynold Xin <rxin@apache.org>

Closes #872 from rxin/pep8 and squashes the following commits:

2f26029 [Reynold Xin] Added PEP8 style configuration file.

(cherry picked from commit 5c7faecd)
Signed-off-by: Reynold Xin <rxin@apache.org>


[SPARK-1822] SchemaRDD.count() should use query optimizer · 64d0fb52

Kan Zhang authored 10 years ago


Author: Kan Zhang <kzhang@apache.org>

Closes #841 from kanzhang/SPARK-1822 and squashes the following commits:

2f8072a [Kan Zhang] [SPARK-1822] Minor style update
cf4baa4 [Kan Zhang] [SPARK-1822] Adding Scaladoc
e67c910 [Kan Zhang] [SPARK-1822] SchemaRDD.count() should use optimizer

(cherry picked from commit 6052db9d)
Signed-off-by: Reynold Xin <rxin@apache.org>


spark-submit: add exec at the end of the script · 7e59335e

Colin Patrick Mccabe authored 10 years ago


Add an 'exec' at the end of the spark-submit script, to avoid keeping a
bash process hanging around while it runs.  This makes ps look a little
bit nicer.

Author: Colin Patrick Mccabe <cmccabe@cloudera.com>

Closes #858 from cmccabe/SPARK-1907 and squashes the following commits:

7023b64 [Colin Patrick Mccabe] spark-submit: add exec at the end of the script

(cherry picked from commit 6e9fb632)
Signed-off-by: Reynold Xin <rxin@apache.org>


May 24, 2014

[SPARK-1886] check executor id existence when executor exit · b5e96869

Zhen Peng authored 10 years ago


Author: Zhen Peng <zhenpeng01@baidu.com>

Closes #827 from zhpengg/bugfix-executor-id-not-found and squashes the following commits:

cd8bb65 [Zhen Peng] bugfix: check executor id existence when executor exit

(cherry picked from commit 4e4831b8)
Signed-off-by: Aaron Davidson <aaron@databricks.com>


Revert "[maven-release-plugin] prepare release v1.0.0-rc10" · 9ff42249
Tathagata Das authored 10 years ago
```
This reverts commit d8070234.
```
Revert "[maven-release-plugin] prepare for next development iteration" · f856b8ca
Tathagata Das authored 10 years ago
```
This reverts commit 67dd53d2.
```
Updated CHANGES.txt · 84060927
Tathagata Das authored 10 years ago


SPARK-1911: Emphasize that Spark jars should be built with Java 6. · 217bd562

Patrick Wendell authored 10 years ago


This commit requires the user to manually say "yes" when building Spark
without Java 6. The prompt can be bypassed with a flag (e.g. if the user
is scripting around make-distribution).

Author: Patrick Wendell <pwendell@gmail.com>

Closes #859 from pwendell/java6 and squashes the following commits:

4921133 [Patrick Wendell] Adding Pyspark Notice
fee8c9e [Patrick Wendell] SPARK-1911: Emphasize that Spark jars should be built with Java 6.

(cherry picked from commit 75a03277)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>


[SPARK-1900 / 1918] PySpark on YARN is broken · 12f5ecc8

Andrew Or authored 10 years ago


If I run the following on a YARN cluster
```
bin/spark-submit sheep.py --master yarn-client
```
it fails because of a mismatch in paths: `spark-submit` thinks that `sheep.py` resides on HDFS, and balks when it can't find the file there. A natural workaround is to add the `file:` prefix to the file:
```
bin/spark-submit file:/path/to/sheep.py --master yarn-client
```
However, this also fails. This time it is because Python does not understand URI schemes.

This PR fixes this by automatically resolving all paths passed as command line arguments to `spark-submit`. This has the added benefit of keeping file and jar paths consistent across different cluster modes. For Python, we strip the URI scheme before we actually try to run the file.
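
The two steps (resolve every path to a URI, then strip the scheme before handing the file to Python) can be sketched roughly as follows. This is an illustration only, not the code from the PR: `resolve_uri` and `strip_scheme` are hypothetical helpers, and the sketch assumes POSIX-style paths (no Windows handling).

```python
from urllib.parse import urlparse


def resolve_uri(path, cwd="/"):
    """Return `path` as a URI, adding the `file:` scheme to plain paths.

    Hypothetical helper; handles POSIX-style paths only.
    """
    if urlparse(path).scheme:
        return path  # already a URI such as hdfs:/... or file:/...
    if not path.startswith("/"):
        path = cwd.rstrip("/") + "/" + path  # make relative paths absolute
    return "file:" + path


def strip_scheme(uri):
    """Drop the URI scheme so the result can be handed to `python`."""
    return urlparse(uri).path


print(resolve_uri("sheep.py", cwd="/path/to"))
print(resolve_uri("hdfs:/apps/sheep.py"))
print(strip_scheme("file:/path/to/sheep.py"))
```

Resolving everything to a URI first means a local file is never mistaken for an HDFS path, while the scheme is removed only at the last moment, where the Python interpreter needs a plain path.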

Much of the code was originally written by @mengxr. Tested on a YARN cluster. More tests pending.

Author: Andrew Or <andrewor14@gmail.com>

Closes #853 from andrewor14/submit-paths and squashes the following commits:

0bb097a [Andrew Or] Format path correctly before adding it to PYTHONPATH
323b45c [Andrew Or] Include --py-files on PYTHONPATH for pyspark shell
3c36587 [Andrew Or] Improve error messages (minor)
854aa6a [Andrew Or] Guard against NPE if user gives pathological paths
6638a6b [Andrew Or] Fix spark-shell jar paths after #849 went in
3bb0359 [Andrew Or] Update more comments (minor)
2a1f8a0 [Andrew Or] Update comments (minor)
6af2c77 [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-paths
a68c4d1 [Andrew Or] Handle Windows python file path correctly
427a250 [Andrew Or] Resolve paths properly for Windows
a591a4a [Andrew Or] Update tests for resolving URIs
6c8621c [Andrew Or] Move resolveURIs to Utils
db8255e [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-paths
f542dce [Andrew Or] Fix outdated tests
691c4ce [Andrew Or] Ignore special primary resource names
5342ac7 [Andrew Or] Add missing space in error message
02f77f3 [Andrew Or] Resolve command line arguments to spark-submit properly

(cherry picked from commit 5081a0a9)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>


May 23, 2014

Update LBFGSSuite.scala · 9be103a7

baishuo(白硕) authored 10 years ago

the same reason as https://github.com/apache/spark/pull/588



Author: baishuo(白硕) <vc_java@hotmail.com>

Closes #815 from baishuo/master and squashes the following commits:

6876c1e [baishuo(白硕)] Update LBFGSSuite.scala

(cherry picked from commit a08262d8)
Signed-off-by: Reynold Xin <rxin@apache.org>


May 22, 2014

Updated scripts for auditing releases · 6541ca24

Tathagata Das authored 10 years ago


- Added script to automatically generate change list CHANGES.txt
- Added test for verifying linking against maven distributions of `spark-sql` and `spark-hive`
- Added SBT projects for testing functionality of `spark-sql` and `spark-hive`
- Fixed issues in existing tests that might have come up because of changes in Spark 1.0

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #844 from tdas/update-dev-scripts and squashes the following commits:

25090ba [Tathagata Das] Added missing license
e2e20b3 [Tathagata Das] Updated tests for auditing releases.

(cherry picked from commit b2bdd0e5)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>


[SPARK-1896] Respect spark.master (and --master) before MASTER in spark-shell · c3b40651

Andrew Or authored 10 years ago

The hierarchy for configuring the Spark master in the shell is as follows:
```
MASTER > --master > spark.master (spark-defaults.conf)
```
This is inconsistent with the way we run normal applications, which is:
```
--master > spark.master (spark-defaults.conf) > MASTER
```

I was trying to run a shell locally on a standalone cluster launched through the ec2 scripts, which automatically set `MASTER` in spark-env.sh. It was surprising to me that `--master` didn't take effect, considering that this is the way we tell users to set their masters [here](http://people.apache.org/~pwendell/spark-1.0.0-rc7-docs/scala-programming-guide.html#initializing-spark).
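
The precedence used for normal applications, `--master > spark.master > MASTER`, amounts to taking the first source that is set. A minimal sketch, assuming a hypothetical `resolve_master` helper and an assumed fallback of `local[*]` when nothing is configured (neither is taken from the patch):

```python
def resolve_master(cli_master=None, conf_master=None, env_master=None,
                   default="local[*]"):
    """Pick the master URL using the precedence for normal applications.

    Order: --master, then spark.master (spark-defaults.conf), then the
    MASTER environment variable. `default` is an assumed fallback.
    """
    for candidate in (cli_master, conf_master, env_master):
        if candidate:
            return candidate
    return default


# MASTER is set (e.g. by spark-env.sh), but --master still wins.
print(resolve_master(cli_master="spark://host:7077", env_master="local[2]"))
# With no --master and no spark.master, MASTER is used.
print(resolve_master(env_master="local[2]"))
```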

Author: Andrew Or <andrewor14@gmail.com>

Closes #846 from andrewor14/shell-master and squashes the following commits:

2cb81c9 [Andrew Or] Respect spark.master before MASTER in REPL

(cherry picked from commit cce77457)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>


[SPARK-1897] Respect spark.jars (and --jars) in spark-shell · 23cc40e3

Andrew Or authored 10 years ago

Spark shell currently overwrites `spark.jars` with `ADD_JARS`. In all modes except yarn-cluster, this means the `--jars` flag passed to `bin/spark-shell` is also discarded. However, in the [docs](http://people.apache.org/~pwendell/spark-1.0.0-rc7-docs/scala-programming-guide.html#initializing-spark), we explicitly tell the users to add the jars this way.

Author: Andrew Or <andrewor14@gmail.com>

Closes #849 from andrewor14/shell-jars and squashes the following commits:

928a7e6 [Andrew Or] ',' -> "," (minor)
afc357c [Andrew Or] Handle spark.jars == "" in SparkILoop, not SparkSubmit
c6da113 [Andrew Or] Do not set spark.jars to ""
d8549f7 [Andrew Or] Respect spark.jars and --jars in spark-shell

(cherry picked from commit 8edbee7d)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>


Fix UISuite unit test that fails under Jenkins contention · a5662162

Aaron Davidson authored 10 years ago


Perhaps due to zombie processes on Jenkins, it seems that at least 10
Spark ports are in use. It also doesn't matter whether the port number
increases when a port is taken; it could in fact go down. The only part
that matters is that it selects a different port rather than failing to bind.
Changed test to match this.
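
The property the test should check can be sketched with plain sockets. This is an illustration, not the UISuite code: `bind_first_free` is a hypothetical helper, and the wrap-around within 1024-65535 is an assumption chosen to show that the selected port may be lower than the requested one.

```python
import socket


def bind_first_free(preferred_port, max_tries=100, host="127.0.0.1"):
    """Bind to `preferred_port`, or to another port if it is taken.

    Hypothetical helper: candidates wrap around within 1024-65535, so the
    port that ends up bound may be lower than the preferred one.
    """
    for offset in range(max_tries):
        port = 1024 + (preferred_port - 1024 + offset) % (65536 - 1024)
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            return sock, port
        except OSError:
            sock.close()  # port in use; try the next candidate
    raise OSError("no free port found after %d tries" % max_tries)


# Occupy a port, then ask for the same one again.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))  # let the OS choose a free port
taken_port = first.getsockname()[1]

second, chosen_port = bind_first_free(taken_port)
# Assert only that a different port was selected, not that it is higher.
assert chosen_port != taken_port
first.close()
second.close()
```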

Thanks to @andrewor14 for helping diagnose this.

Author: Aaron Davidson <aaron@databricks.com>

Closes #857 from aarondav/tiny and squashes the following commits:

c199ec8 [Aaron Davidson] Fix UISuite unit test that fails under Jenkins contention

(cherry picked from commit f9f5fd5f)
Signed-off-by: Reynold Xin <rxin@apache.org>


[SPARK-1870] Make spark-submit --jars work in yarn-cluster mode. · 79cd26c5

Xiangrui Meng authored 10 years ago


Send secondary jars to the distributed cache of all containers and add the cached jars to the classpath before executors start. Tested on a YARN cluster (CDH-5.0).

`spark-submit --jars` also works in standalone server and `yarn-client`. Thanks to @andrewor14 for testing!

I removed "Doesn't work for drivers in standalone mode with "cluster" deploy mode." from `spark-submit`'s help message, though we haven't tested mesos yet.

CC: @dbtsai @sryza

Author: Xiangrui Meng <meng@databricks.com>

Closes #848 from mengxr/yarn-classpath and squashes the following commits:

23e7df4 [Xiangrui Meng] rename spark.jar to __spark__.jar and app.jar to __app__.jar to avoid conflicts; append $CWD/ and $CWD/* to the classpath; remove unused methods
a40f6ed [Xiangrui Meng] standalone -> cluster
65e04ad [Xiangrui Meng] update spark-submit help message and add a comment for yarn-client
11e5354 [Xiangrui Meng] minor changes
3e7e1c4 [Xiangrui Meng] use sparkConf instead of hadoop conf
dc3c825 [Xiangrui Meng] add secondary jars to classpath in yarn

(cherry picked from commit dba31402)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>


May 21, 2014

Configuration documentation updates · 75af8bd3

Reynold Xin authored 10 years ago

1. Add `<code>` to configuration options
2. List env variables in tabular format to be consistent with other pages.
3. Moved Viewing Spark Properties section up.

This is against branch-1.0, but should be cherry picked into master as well.

Author: Reynold Xin <rxin@apache.org>

Closes #851 from rxin/doc-config and squashes the following commits:

28ac0d3 [Reynold Xin] Add <code> to configuration options, and list env variables in a table.


[SPARK-1889] [SQL] Apply splitConjunctivePredicates to join condition while finding join ke... · 6e7934ed

Takuya UESHIN authored 10 years ago


...ys.

When tables are equi-joined on multiple keys, `HashJoin` should be used, but `CartesianProduct` followed by `Filter` is used instead.
The join keys are combined by an `And` expression, so we need to apply `splitConjunctivePredicates` to the join condition while finding join keys.
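
To see why the split matters, here is a small Python model of the idea. It is not Catalyst code: the expression classes and `find_join_keys` are hypothetical, and only `split_conjunctive_predicates` mirrors the name used above. With a multi-key join the top-level node is an `And`, not an equality, so no keys are found unless the condition is first flattened into its conjuncts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attr:
    table: str
    name: str


@dataclass(frozen=True)
class Equals:
    left: object
    right: object


@dataclass(frozen=True)
class GreaterThan:
    left: object
    right: object


@dataclass(frozen=True)
class And:
    left: object
    right: object


def split_conjunctive_predicates(condition):
    """Flatten nested And nodes into a list of individual predicates."""
    if isinstance(condition, And):
        return (split_conjunctive_predicates(condition.left)
                + split_conjunctive_predicates(condition.right))
    return [condition]


def find_join_keys(condition, left_table, right_table):
    """Split a condition into equi-join key pairs and leftover predicates."""
    keys, others = [], []
    for pred in split_conjunctive_predicates(condition):
        if (isinstance(pred, Equals) and isinstance(pred.left, Attr)
                and isinstance(pred.right, Attr)):
            sides = {pred.left.table: pred.left, pred.right.table: pred.right}
            if set(sides) == {left_table, right_table}:
                # Equality between one attribute from each side: a join key.
                keys.append((sides[left_table], sides[right_table]))
                continue
        others.append(pred)
    return keys, others


# x JOIN y ON x.a = y.a AND y.b = x.b AND x.c > 0
condition = And(
    And(Equals(Attr("x", "a"), Attr("y", "a")),
        Equals(Attr("y", "b"), Attr("x", "b"))),
    GreaterThan(Attr("x", "c"), 0),
)
keys, others = find_join_keys(condition, "x", "y")
print(keys)    # two key pairs, each ordered (left side, right side)
print(others)  # the non-equality predicate remains as a filter
```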

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #836 from ueshin/issues/SPARK-1889 and squashes the following commits:

fe1c387 [Takuya UESHIN] Apply splitConjunctivePredicates to join condition while finding join keys.

(cherry picked from commit bb88875a)
Signed-off-by: Reynold Xin <rxin@apache.org>


[SPARK-1519] Support minPartitions param of wholeTextFiles() in PySpark · 30d1df5e

Kan Zhang authored 10 years ago


Author: Kan Zhang <kzhang@apache.org>

Closes #697 from kanzhang/SPARK-1519 and squashes the following commits:

4f8d1ed [Kan Zhang] [SPARK-1519] Support minPartitions param of wholeTextFiles() in PySpark

(cherry picked from commit f18fd05b)
Signed-off-by: Reynold Xin <rxin@apache.org>


[Typo] Stoped -> Stopped · 9b8f7725

Andrew Or authored 10 years ago


Author: Andrew Or <andrewor14@gmail.com>

Closes #847 from andrewor14/yarn-typo and squashes the following commits:

c1906af [Andrew Or] Stoped -> Stopped

(cherry picked from commit ba5d4a99)
Signed-off-by: Reynold Xin <rxin@apache.org>


[Minor] Move JdbcRDDSuite to the correct package · bc6bbfa6

Andrew Or authored 10 years ago


It was in the wrong package

Author: Andrew Or <andrewor14@gmail.com>

Closes #839 from andrewor14/jdbc-suite and squashes the following commits:

f948c5a [Andrew Or] cache -> cache()
b215279 [Andrew Or] Move JdbcRDDSuite to the correct package

(cherry picked from commit 7c79ef7d)
Signed-off-by: Reynold Xin <rxin@apache.org>


[Docs] Correct example of creating a new SparkConf · 7295dd94

Andrew Or authored 10 years ago


The example code on the configuration page currently does not compile.

Author: Andrew Or <andrewor14@gmail.com>

Closes #842 from andrewor14/conf-docs and squashes the following commits:

aabff57 [Andrew Or] Correct example of creating a new SparkConf

(cherry picked from commit 1014668f)
Signed-off-by: Reynold Xin <rxin@apache.org>


[SPARK-1250] Fixed misleading comments in bin/pyspark, bin/spark-class · 364c14af

Sumedh Mungee authored 10 years ago


Fixed a couple of misleading comments in bin/pyspark and bin/spark-class. The comments make it seem like the script is looking for the Scala installation when in fact it is looking for Spark.

Author: Sumedh Mungee <smungee@gmail.com>

Closes #843 from smungee/spark-1250-fix-comments and squashes the following commits:

26870f3 [Sumedh Mungee] [SPARK-1250] Fixed misleading comments in bin/pyspark and bin/spark-class

(cherry picked from commit 6e337380)
Signed-off-by: Reynold Xin <rxin@apache.org>


May 20, 2014

[maven-release-plugin] prepare for next development iteration · 67dd53d2
Tathagata Das authored 10 years ago

[maven-release-plugin] prepare release v1.0.0-rc10 · d8070234
Tathagata Das authored 10 years ago


[Hotfix] Blacklisted flaky HiveCompatibility test · b4d93d38

Tathagata Das authored 10 years ago


`lateral_view_outer` query sometimes returns a different set of 10 rows.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #838 from tdas/hive-test-fix2 and squashes the following commits:

9128a0d [Tathagata Das] Blacklisted flaky HiveCompatibility test.

(cherry picked from commit 7f0cfe47)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>


Revert "[maven-release-plugin] prepare release v1.0.0-rc9" · 0d988421
Tathagata Das authored 10 years ago
```
This reverts commit 920f947e.
```
Revert "[maven-release-plugin] prepare for next development iteration" · 3f3e988c
Tathagata Das authored 10 years ago
```
This reverts commit f8e61195.
```
Updated CHANGES.txt · 1c00f2a2
Tathagata Das authored 10 years ago


[Spark 1877] ClassNotFoundException when loading RDD with serialized objects · 6cbe2a37

Tathagata Das authored 10 years ago


Updated version of #821

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Ghidireac <bogdang@u448a5b0a73d45358d94a.ant.amazon.com>

Closes #835 from tdas/SPARK-1877 and squashes the following commits:

f346f71 [Tathagata Das] Addressed Patrick's comments.
fee0c5d [Ghidireac] SPARK-1877: ClassNotFoundException when loading RDD with serialized objects

(cherry picked from commit 52eb54d0)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>


May 19, 2014

[SPARK-1874][MLLIB] Clean up MLlib sample data · 1c6c8b5b

Xiangrui Meng authored 10 years ago


1. Added synthetic datasets for `MovieLensALS`, `LinearRegression`, `BinaryClassification`.
2. Embedded instructions in the help message of those example apps.

Per discussion with Matei on the JIRA page, new example data is under `data/mllib`.

Author: Xiangrui Meng <meng@databricks.com>

Closes #833 from mengxr/mllib-sample-data and squashes the following commits:

59f0a18 [Xiangrui Meng] add sample binary classification data
3c2f92f [Xiangrui Meng] add linear regression data
050f1ca [Xiangrui Meng] add a sample dataset for MovieLensALS example

(cherry picked from commit bcb9dce6)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
