- May 28, 2014
-
David Lemieux authored
The changes could be ported back to 0.9 as well. Changing in.read to in.readFully to read the whole input stream rather than the first 1020 bytes. This should be OK considering that Flume caps the body size to 32K by default. Author: David Lemieux <david.lemieux@radialpoint.com> Closes #865 from lemieud/SPARK-1916 and squashes the following commits: a265673 [David Lemieux] Updated SparkFlumeEvent to read the whole stream rather than the first X bytes.
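A minimal sketch of the difference, assuming a `DataInputStream` over the serialized event body (names and sizes here are illustrative):

```scala
import java.io.{ByteArrayInputStream, DataInputStream}

val payload: Array[Byte] = Array.fill[Byte](4096)(1)  // stand-in for an event body
val in = new DataInputStream(new ByteArrayInputStream(payload))
val buf = new Array[Byte](payload.length)

// read(buf) may return after filling only part of buf (which is how only the
// first bytes of a large event were read); readFully blocks until buf is
// completely filled, or throws EOFException, so the whole body is consumed.
in.readFully(buf)
```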
-
Patrick Wendell authored
This PR improves and organizes the config option page and makes a few other changes to config docs. See a preview here: http://people.apache.org/~pwendell/config-improvements/configuration.html The biggest changes are:
1. The configs for the standalone master/workers were moved to the standalone page and out of the general config doc.
2. SPARK_LOCAL_DIRS was missing from the standalone docs.
3. Expanded discussion of injecting configs with spark-submit, including an example.
4. Config options were organized into the following categories: Runtime Environment, Shuffle Behavior, Spark UI, Compression and Serialization, Execution Behavior, Networking, Scheduling, Security, and Spark Streaming.
Author: Patrick Wendell <pwendell@gmail.com> Closes #880 from pwendell/config-cleanup and squashes the following commits: 93f56c3 [Patrick Wendell] Feedback from Matei 6f66efc [Patrick Wendell] More feedback 16ae776 [Patrick Wendell] Adding back header section d9c264f [Patrick Wendell] Small fix e0c1728 [Patrick Wendell] Response to Matei's review 27d57db [Patrick Wendell] Reverting changes to index.html (covered in #896) e230ef9 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into config-cleanup a374369 [Patrick Wendell] Line wrapping fixes fdff7fc [Patrick Wendell] Merge remote-tracking branch 'apache/master' into config-cleanup 3289ea4 [Patrick Wendell] Pulling in changes from #856 106ee31 [Patrick Wendell] Small link fix f7e79bc [Patrick Wendell] Re-organizing config options. 54b184d [Patrick Wendell] Adding standalone configs to the standalone page 592e94a [Patrick Wendell] Stash 29b5446 [Patrick Wendell] Better discussion of spark-submit in configuration docs 2d719ef [Patrick Wendell] Small fix 4af9e07 [Patrick Wendell] Adding SPARK_LOCAL_DIRS docs 204b248 [Patrick Wendell] Small fixes (cherry picked from commit 7801d44f) Signed-off-by:
Patrick Wendell <pwendell@gmail.com>
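As a companion to the spark-submit discussion, a hedged sketch of setting options programmatically through `SparkConf` (the property names are real Spark options; the values are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Properties set explicitly in code take precedence over values supplied
// externally (spark-submit arguments or spark-defaults.conf).
val conf = new SparkConf()
  .setAppName("ConfigExample")
  .set("spark.ui.port", "4040")            // Spark UI category
  .set("spark.shuffle.compress", "true")   // Shuffle Behavior category
val sc = new SparkContext(conf)
```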
-
jmu authored
Usage: NetworkWordCount <master> <hostname> <port> --> Usage: NetworkWordCount <hostname> <port> Usage: JavaNetworkWordCount <master> <hostname> <port> --> Usage: JavaNetworkWordCount <hostname> <port> Author: jmu <jmujmu@gmail.com> Closes #826 from jmu/master and squashes the following commits: 9fb7980 [jmu] Merge branch 'master' of https://github.com/jmu/spark b9a6b02 [jmu] Fix doc for NetworkWordCount/JavaNetworkWordCount Usage: NetworkWordCount <master> <hostname> <port> --> Usage: NetworkWordCount <hostname> <port> (cherry picked from commit 82eadc3b) Signed-off-by:
Patrick Wendell <pwendell@gmail.com>
-
Takuya UESHIN authored
`ApproxCountDistinctMergeFunction` should return an `Int` value because the `dataType` of `ApproxCountDistinct` is `IntegerType`. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #893 from ueshin/issues/SPARK-1938 and squashes the following commits: 3970e88 [Takuya UESHIN] Remove a superfluous line. 5ad7ec1 [Takuya UESHIN] Make dataType for each of CountDistinct, ApproxCountDistinctMerge and ApproxCountDistinct LongType. cbe7c71 [Takuya UESHIN] Revert a change. fc3ac0f [Takuya UESHIN] Fix evaluated value type of ApproxCountDistinctMergeFunction to Int. (cherry picked from commit 9df86835) Signed-off-by:
Reynold Xin <rxin@apache.org>
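A toy illustration of why the squashed commits settle on `LongType` (the values are made up; this is not Catalyst code):

```scala
// A distinct count can exceed Int.MaxValue, so declaring the result as Int
// would overflow; the evaluated value and the declared dataType must agree,
// hence both were moved to Long.
val approxCount: Long = 3000000000L         // plausible count > Int.MaxValue
assert(approxCount > Int.MaxValue.toLong)   // would not fit in an Int
```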
-
- May 27, 2014
-
LY Lai authored
Allow underscore in column name of a struct field https://issues.apache.org/jira/browse/SPARK-1922 . Author: LY Lai <ly.lai@vpon.com> Closes #873 from lyuanlai/master and squashes the following commits: 2253263 [LY Lai] Allow underscore in struct field column name (cherry picked from commit 06825674) Signed-off-by:
Reynold Xin <rxin@apache.org>
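An example of the kind of query this change permits (table and field names are hypothetical):

```scala
// `zip_code` is a column of a struct field; the underscore was previously rejected.
val query = "SELECT address.zip_code FROM users"
```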
-
Takuya UESHIN authored
Average values differ depending on whether the calculation is done partially or not, because `AverageFunction` (used in the non-partial calculation) counts rows even if the evaluated value is null. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #862 from ueshin/issues/SPARK-1915 and squashes the following commits: b1ff3c0 [Takuya UESHIN] Modify AverageFunction not to count if the evaluated value is null. (cherry picked from commit 3b0babad) Signed-off-by:
Reynold Xin <rxin@apache.org>
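A minimal sketch of the corrected semantics (plain Scala, not Catalyst code): nulls must be excluded from the denominator, matching SQL AVG.

```scala
val values: Seq[Option[Double]] = Seq(Some(1.0), None, Some(3.0))
val nonNull = values.flatten
val avg = nonNull.sum / nonNull.size  // 2.0; counting the null row would give 4/3
```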
-
Takuya UESHIN authored
Nullability of `Max`/`Min`/`First` should be `true` because they return `null` if there are no rows. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #881 from ueshin/issues/SPARK-1926 and squashes the following commits: 322610f [Takuya UESHIN] Fix nullability of Min/Max/First. (cherry picked from commit d1375a2b) Signed-off-by:
Reynold Xin <rxin@apache.org>
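A short illustration of why the nullability must be `true` (plain Scala stand-in):

```scala
// Aggregating over zero rows produces no value, which SQL surfaces as null.
val rows = Seq.empty[Int]
val max: Option[Int] = rows.reduceOption(_ max _)
println(max)  // None -- rendered as null, so the schema must allow it
```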
-
lianhuiwang authored
Bugfix: the worker's DriverStateChanged state should match DriverState.FAILED. Author: lianhuiwang <lianhuiwang09@gmail.com> Closes #864 from lianhuiwang/master and squashes the following commits: 480ce94 [lianhuiwang] address aarondav comments f2b5970 [lianhuiwang] bugfix worker DriverStateChanged state should match DriverState.FAILED
-
zsxwing authored
`var cachedPeers: Seq[BlockManagerId] = null` is used in `def replicate(blockId: BlockId, data: ByteBuffer, level: StorageLevel)` without proper protection. Two places call `replicate(blockId, bytesAfterPut, level)`:
* https://github.com/apache/spark/blob/17f3075bc4aa8cbed165f7b367f70e84b1bc8db9/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L644 runs in `connectionManager.futureExecContext`
* https://github.com/apache/spark/blob/17f3075bc4aa8cbed165f7b367f70e84b1bc8db9/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L752 `doPut` runs in `connectionManager.handleMessageExecutor`; `org.apache.spark.storage.BlockManagerWorker` calls `blockManager.putBytes` in `connectionManager.handleMessageExecutor`.
As they run in different `Executor`s, this is a race condition that may leave the memory pointed to by `cachedPeers` inconsistent even when `cachedPeers != null`. The race condition on `onReceiveCallback` is that it is set in `BlockManagerWorker` but read in a different thread in `ConnectionManager.handleMessageExecutor`. Author: zsxwing <zsxwing@gmail.com> Closes #887 from zsxwing/SPARK-1932 and squashes the following commits: 524f69c [zsxwing] SPARK-1932: Fix race conditions in onReceiveCallback and cachedPeers (cherry picked from commit 549830b0) Signed-off-by:
Aaron Davidson <aaron@databricks.com>
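A hedged sketch of the hazard and one standard remedy (a simplification, not the actual BlockManager code):

```scala
// Two thread pools read and write the cache without synchronization, so one
// thread may observe a stale or partially published value. Marking the field
// @volatile (or guarding it with a lock) publishes the write safely.
class PeerCache(fetch: () => Seq[String]) {
  @volatile private var cachedPeers: Seq[String] = null
  def peers: Seq[String] = {
    if (cachedPeers == null) cachedPeers = fetch()  // worst case: fetch twice
    cachedPeers
  }
}
```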
-
Reynold Xin authored
https://issues.apache.org/jira/browse/SPARK-1933 Author: Reynold Xin <rxin@apache.org> Closes #888 from rxin/addfile and squashes the following commits: 8c402a3 [Reynold Xin] Updated comment. ff6c162 [Reynold Xin] SPARK-1933: Throw a more meaningful exception when a directory is passed to addJar/addFile. (cherry picked from commit 90e281b5) Signed-off-by:
Reynold Xin <rxin@apache.org>
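A sketch of the kind of guard this change adds (names are illustrative, not the actual SparkContext code):

```scala
import java.io.File

// Fail fast with a descriptive message instead of a confusing downstream error.
def validateNotDirectory(path: String): Unit = {
  val f = new File(path)
  if (f.isDirectory)
    throw new IllegalArgumentException(
      s"Directory $path is not allowed; addFile/addJar expect a single file")
}
```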
-
- May 26, 2014
-
Reynold Xin authored
Author: Reynold Xin <rxin@apache.org> Closes #875 from rxin/pep8-dev-scripts and squashes the following commits: 04b084f [Reynold Xin] Made dev Python scripts PEP8 compliant. (cherry picked from commit 9ed37190) Signed-off-by:
Reynold Xin <rxin@apache.org>
-
Ankur Dave authored
905173df introduced a bug in partitionBy where, after repartitioning the edges, it reuses the VertexRDD without updating the routing tables to reflect the new edge layout. Subsequent accesses of the triplets contain nulls for many vertex properties. This commit adds a test for this bug and fixes it by introducing `VertexRDD#withEdges` and calling it in `partitionBy`. Author: Ankur Dave <ankurdave@gmail.com> Closes #885 from ankurdave/SPARK-1931 and squashes the following commits: 3930cdd [Ankur Dave] Note how to set up VertexRDD for efficient joins 9bdbaa4 [Ankur Dave] [SPARK-1931] Reconstruct routing tables in Graph.partitionBy (cherry picked from commit 56c771cb) Signed-off-by:
Reynold Xin <rxin@apache.org>
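A conceptual sketch of the rebuild step (not the GraphX source): after edges move, the vertex-side record of which partitions hold each vertex's edges must be recomputed.

```scala
// edge = (srcId, dstId, partitionId after repartitioning)
case class RoutingTable(vertexToEdgeParts: Map[Long, Set[Int]])

def rebuildRoutingTable(edges: Seq[(Long, Long, Int)]): RoutingTable =
  RoutingTable(
    edges
      .flatMap { case (src, dst, part) => Seq(src -> part, dst -> part) }
      .groupBy(_._1)
      .map { case (v, pairs) => v -> pairs.map(_._2).toSet })
```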
-
zsxwing authored
JIRA: https://issues.apache.org/jira/browse/SPARK-1925 Author: zsxwing <zsxwing@gmail.com> Closes #879 from zsxwing/SPARK-1925 and squashes the following commits: 5cf5a6d [zsxwing] SPARK-1925: Replace '&' with '&&' (cherry picked from commit cb7fe503) Signed-off-by:
Reynold Xin <rxin@apache.org>
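Why the substitution matters in Scala: `&&` short-circuits while `&` always evaluates both operands.

```scala
def safe(s: String): Boolean = s != null && s.nonEmpty   // never NPEs
def unsafe(s: String): Boolean = s != null & s.nonEmpty  // right side runs even when s == null

// safe(null) returns false; unsafe(null) throws NullPointerException.
```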
-
Takuya UESHIN authored
`CountFunction` should count up only if the child's evaluated value is not null. Because it traverses and evaluates all child expressions, it counts up if any of the children is not null, even when the child being counted is null. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #861 from ueshin/issues/SPARK-1914 and squashes the following commits: 3b37315 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-1914 2afa238 [Takuya UESHIN] Simplify CountFunction not to traverse to evaluate all child expressions. (cherry picked from commit d6395d86) Signed-off-by:
Reynold Xin <rxin@apache.org>
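A sketch of the corrected semantics in plain Scala: COUNT(expr) counts a row only when the expression evaluates to a non-null value.

```scala
val col: Seq[Option[String]] = Seq(Some("a"), None, Some("b"))
val count = col.count(_.isDefined)  // 2, not 3
```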
-
Tathagata Das authored
-
Tathagata Das authored
-
Tathagata Das authored
This reverts commit 2f1dc868.
-
Tathagata Das authored
This reverts commit 832dc594.
-
- May 25, 2014
-
-
Self-explanatory. Author: Patrick Wendell <pwendell@gmail.com> Closes #878 from pwendell/java-constructor and squashes the following commits: 2cc1605 [Patrick Wendell] HOTFIX: Add no-arg SparkContext constructor in Java (cherry picked from commit b6d22af0) Signed-off-by:
Aaron Davidson <aaron@databricks.com>
-
Aaron Davidson authored
```scala
rdd.aggregate(Sum('val))
```
is just shorthand for
```scala
rdd.groupBy()(Sum('val))
```
but seems more natural than doing a groupBy with no grouping expressions when you really just want an aggregation over all rows. Did not add a JavaSchemaRDD or Python API, as these seem to be lacking several other methods like groupBy() already -- leaving that cleanup for future patches. Author: Aaron Davidson <aaron@databricks.com> Closes #874 from aarondav/schemardd and squashes the following commits: e9e68ee [Aaron Davidson] Add comment db6afe2 [Aaron Davidson] Introduce SchemaRDD#aggregate() for simple aggregations (cherry picked from commit c3576ffc) Signed-off-by:
Reynold Xin <rxin@apache.org>
-
Andrew Ash authored
https://issues.apache.org/jira/browse/SPARK-1903 Author: Andrew Ash <andrew@andrewash.com> Closes #856 from ash211/SPARK-1903 and squashes the following commits: 6e7782a [Andrew Ash] Add the technology used on each port 1d9b5d3 [Andrew Ash] Document port for history server 56193ee [Andrew Ash] spark.ui.port becomes worker.ui.port and master.ui.port a774c07 [Andrew Ash] Wording in network section 90e8237 [Andrew Ash] Use real :toc instead of the hand-written one edaa337 [Andrew Ash] Master -> Standalone Cluster Master 57e8869 [Andrew Ash] Port -> Default Port 3d4d289 [Andrew Ash] Title to title case c7d42d9 [Andrew Ash] [WIP] SPARK-1903 Add initial port listing for documentation a416ae9 [Andrew Ash] Word wrap to 100 lines (cherry picked from commit 06595296) Signed-off-by:
Reynold Xin <rxin@apache.org>
-
Reynold Xin authored
Author: Reynold Xin <rxin@apache.org> Closes #871 from rxin/mllib-pep8 and squashes the following commits: 848416f [Reynold Xin] Fixed a typo in the previous cleanup (c -> sc). a8db4cd [Reynold Xin] Fix PEP8 violations in Python mllib. (cherry picked from commit d33d3c61) Signed-off-by:
Reynold Xin <rxin@apache.org>
-
Reynold Xin authored
Mostly related to the following two rules in PEP8 and PEP257: - Line length < 72 chars. - First line should be a concise description of the function/class. Author: Reynold Xin <rxin@apache.org> Closes #869 from rxin/docstring-schemardd and squashes the following commits: 7cf0cbc [Reynold Xin] Updated sql.py for pep8 docstring. 0a4aef9 [Reynold Xin] Merge branch 'master' into docstring-schemardd 6678937 [Reynold Xin] Python docstring update for sql.py. (cherry picked from commit 14f0358b) Signed-off-by:
Reynold Xin <rxin@apache.org>
-
Reynold Xin authored
Author: Reynold Xin <rxin@apache.org> Closes #870 from rxin/examples-python-pep8 and squashes the following commits: 2829e84 [Reynold Xin] Fix PEP8 violations in examples/src/main/python. (cherry picked from commit d79c2b28) Signed-off-by:
Reynold Xin <rxin@apache.org>
-
Tathagata Das authored
-
Tathagata Das authored
-
Reynold Xin authored
(cherry picked from commit fa541f32c5b92e6868a9c99cbb2c87115d624d23) Signed-off-by:
Reynold Xin <rxin@apache.org>
-
Reynold Xin authored
Minor cleanup following #841. Author: Reynold Xin <rxin@apache.org> Closes #868 from rxin/schema-count and squashes the following commits: 5442651 [Reynold Xin] SPARK-1822: Some minor cleanup work on SchemaRDD.count() (cherry picked from commit d66642e3) Signed-off-by:
Reynold Xin <rxin@apache.org>
-
Reynold Xin authored
This sets the max line length to 100 as a PEP8 exception. Author: Reynold Xin <rxin@apache.org> Closes #872 from rxin/pep8 and squashes the following commits: 2f26029 [Reynold Xin] Added PEP8 style configuration file. (cherry picked from commit 5c7faecd) Signed-off-by:
Reynold Xin <rxin@apache.org>
-
Kan Zhang authored
Author: Kan Zhang <kzhang@apache.org> Closes #841 from kanzhang/SPARK-1822 and squashes the following commits: 2f8072a [Kan Zhang] [SPARK-1822] Minor style update cf4baa4 [Kan Zhang] [SPARK-1822] Adding Scaladoc e67c910 [Kan Zhang] [SPARK-1822] SchemaRDD.count() should use optimizer (cherry picked from commit 6052db9d) Signed-off-by:
Reynold Xin <rxin@apache.org>
-
Colin Patrick Mccabe authored
Add an 'exec' at the end of the spark-submit script, to avoid keeping a bash process hanging around while it runs. This makes ps look a little bit nicer. Author: Colin Patrick Mccabe <cmccabe@cloudera.com> Closes #858 from cmccabe/SPARK-1907 and squashes the following commits: 7023b64 [Colin Patrick Mccabe] spark-submit: add exec at the end of the script (cherry picked from commit 6e9fb632) Signed-off-by:
Reynold Xin <rxin@apache.org>
-
- May 24, 2014
-
Zhen Peng authored
Author: Zhen Peng <zhenpeng01@baidu.com> Closes #827 from zhpengg/bugfix-executor-id-not-found and squashes the following commits: cd8bb65 [Zhen Peng] bugfix: check executor id existence when executor exit (cherry picked from commit 4e4831b8) Signed-off-by:
Aaron Davidson <aaron@databricks.com>
-
Tathagata Das authored
This reverts commit d8070234.
-
Tathagata Das authored
This reverts commit 67dd53d2.
-
Tathagata Das authored
-
Patrick Wendell authored
This commit requires the user to manually say "yes" when building Spark without Java 6. The prompt can be bypassed with a flag (e.g. if the user is scripting around make-distribution). Author: Patrick Wendell <pwendell@gmail.com> Closes #859 from pwendell/java6 and squashes the following commits: 4921133 [Patrick Wendell] Adding Pyspark Notice fee8c9e [Patrick Wendell] SPARK-1911: Emphasize that Spark jars should be built with Java 6. (cherry picked from commit 75a03277) Signed-off-by:
Tathagata Das <tathagata.das1565@gmail.com>
-
Andrew Or authored
If I run the following on a YARN cluster
```
bin/spark-submit sheep.py --master yarn-client
```
it fails because of a mismatch in paths: `spark-submit` thinks that `sheep.py` resides on HDFS, and balks when it can't find the file there. A natural workaround is to add the `file:` prefix to the file:
```
bin/spark-submit file:/path/to/sheep.py --master yarn-client
```
However, this also fails. This time it is because python does not understand URI schemes. This PR fixes this by automatically resolving all paths passed as command line arguments to `spark-submit` properly. This has the added benefit of keeping file and jar paths consistent across different cluster modes. For python, we strip the URI scheme before we actually try to run it. Much of the code was originally written by @mengxr. Tested on YARN cluster. More tests pending. Author: Andrew Or <andrewor14@gmail.com> Closes #853 from andrewor14/submit-paths and squashes the following commits: 0bb097a [Andrew Or] Format path correctly before adding it to PYTHONPATH 323b45c [Andrew Or] Include --py-files on PYTHONPATH for pyspark shell 3c36587 [Andrew Or] Improve error messages (minor) 854aa6a [Andrew Or] Guard against NPE if user gives pathological paths 6638a6b [Andrew Or] Fix spark-shell jar paths after #849 went in 3bb0359 [Andrew Or] Update more comments (minor) 2a1f8a0 [Andrew Or] Update comments (minor) 6af2c77 [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-paths a68c4d1 [Andrew Or] Handle Windows python file path correctly 427a250 [Andrew Or] Resolve paths properly for Windows a591a4a [Andrew Or] Update tests for resolving URIs 6c8621c [Andrew Or] Move resolveURIs to Utils db8255e [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-paths f542dce [Andrew Or] Fix outdated tests 691c4ce [Andrew Or] Ignore special primary resource names 5342ac7 [Andrew Or] Add missing space in error message 02f77f3 [Andrew Or] Resolve command line arguments to spark-submit properly (cherry picked from commit 5081a0a9) Signed-off-by:
Tathagata Das <tathagata.das1565@gmail.com>
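A hedged sketch of the resolution step (illustrative; the real `Utils` code must also handle Windows paths and strings that are not valid URIs):

```scala
import java.io.File
import java.net.URI

// Bare local paths get an explicit file: scheme so the JVM and Python sides
// agree on where a resource lives.
def resolveURI(raw: String): URI = {
  val uri = new URI(raw)
  if (uri.getScheme != null) uri            // already a URI (hdfs:, file:, ...)
  else new File(raw).getAbsoluteFile.toURI  // assume a local file
}
```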
-
- May 23, 2014
-
baishuo(白硕) authored
The same reason as https://github.com/apache/spark/pull/588. Author: baishuo(白硕) <vc_java@hotmail.com> Closes #815 from baishuo/master and squashes the following commits: 6876c1e [baishuo(白硕)] Update LBFGSSuite.scala (cherry picked from commit a08262d8) Signed-off-by:
Reynold Xin <rxin@apache.org>
-
- May 22, 2014
-
Tathagata Das authored
- Added script to automatically generate change list CHANGES.txt
- Added test for verifying linking against maven distributions of `spark-sql` and `spark-hive`
- Added SBT projects for testing functionality of `spark-sql` and `spark-hive`
- Fixed issues in existing tests that might have come up because of changes in Spark 1.0
Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #844 from tdas/update-dev-scripts and squashes the following commits: 25090ba [Tathagata Das] Added missing license e2e20b3 [Tathagata Das] Updated tests for auditing releases. (cherry picked from commit b2bdd0e5) Signed-off-by:
Tathagata Das <tathagata.das1565@gmail.com>
-