Commits · e08ea7393df46567f552aa67c60a690c231775e4 · cs525-sp18-g07 / spark

Aug 18, 2014

[SPARK-3114] [PySpark] Fix Python UDFs in Spark SQL. · 1f1819b2

Josh Rosen authored 11 years ago

This fixes SPARK-3114, an issue where we inadvertently broke Python UDFs in Spark SQL.

This PR modifies the test runner script to always run the PySpark SQL tests, irrespective of whether SparkSQL itself has been modified. It also includes Davies' fix for the bug.

Closes #2026.

Author: Josh Rosen <joshrosen@apache.org>
Author: Davies Liu <davies.liu@gmail.com>

Closes #2027 from JoshRosen/pyspark-sql-fix and squashes the following commits:

9af2708 [Davies Liu] bugfix: disable compression of command
0d8d3a4 [Josh Rosen] Always run Python Spark SQL tests.

1f1819b2

Aug 16, 2014

[SPARK-1065] [PySpark] improve supporting for large broadcast · 2fc8aca0

Davies Liu authored 11 years ago

Passing large objects through py4j is very slow and costs a lot of memory, so broadcast objects are now passed via files (similar to parallelize()).

Add an option to keep the object in the driver (False by default); leaving it off saves memory in the driver.
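The file-based hand-off described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this commit: the helper names `dump_to_file` and `load_from_file` and the `.broadcast` suffix are hypothetical, and plain `pickle` stands in for whatever serializer PySpark actually uses.

```python
import os
import pickle
import tempfile


def dump_to_file(value):
    """Pickle `value` into a temporary file and return the file's path.

    Hypothetical helper: the other side is handed only the short path
    string instead of the (possibly very large) serialized object.
    """
    fd, path = tempfile.mkstemp(suffix=".broadcast")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(value, f, protocol=pickle.HIGHEST_PROTOCOL)
    return path


def load_from_file(path):
    """Read a pickled value back from `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round-trip a moderately large object through the filesystem.
path = dump_to_file(list(range(100000)))
restored = load_from_file(path)
os.remove(path)
```

Only the path crosses the process boundary in this sketch, which is the property the commit message relies on.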

Author: Davies Liu <davies.liu@gmail.com>

Closes #1912 from davies/broadcast and squashes the following commits:

e06df4a [Davies Liu] load broadcast from disk in driver automatically
db3f232 [Davies Liu] fix serialization of accumulator
631a827 [Davies Liu] Merge branch 'master' into broadcast
c7baa8c [Davies Liu] compress serialized broadcast and command
9a7161f [Davies Liu] fix doc tests
e93cf4b [Davies Liu] address comments: add test
6226189 [Davies Liu] improve large broadcast

2fc8aca0

Jul 29, 2014

[SPARK-2580] [PySpark] keep silent in worker if JVM close the socket · ccd5ab5f

Davies Liu authored 11 years ago

During rdd.take(n), the JVM closes the socket once it has received enough data; the Python worker should stay silent in this case.

At the same time, the worker should not print the traceback to stderr if it has sent the traceback to the JVM successfully.

Author: Davies Liu <davies.liu@gmail.com>

Closes #1625 from davies/error and squashes the following commits:

4fbcc6d [Davies Liu] disable log4j during testing when exception is expected.
cc14202 [Davies Liu] keep silent in worker if JVM close the socket

ccd5ab5f

Jul 22, 2014

[SPARK-2470] PEP8 fixes to PySpark · 5d16d5bb

Nicholas Chammas authored 11 years ago

This pull request aims to resolve all outstanding PEP8 violations in PySpark.

Author: Nicholas Chammas <nicholas.chammas@gmail.com>
Author: nchammas <nicholas.chammas@gmail.com>

Closes #1505 from nchammas/master and squashes the following commits:

98171af [Nicholas Chammas] [SPARK-2470] revert PEP 8 fixes to cloudpickle
cba7768 [Nicholas Chammas] [SPARK-2470] wrap expression list in parentheses
e178dbe [Nicholas Chammas] [SPARK-2470] style - change position of line break
9127d2b [Nicholas Chammas] [SPARK-2470] wrap expression lists in parentheses
22132a4 [Nicholas Chammas] [SPARK-2470] wrap conditionals in parentheses
24639bc [Nicholas Chammas] [SPARK-2470] fix whitespace for doctest
7d557b7 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to tests.py
8f8e4c0 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to storagelevel.py
b3b96cf [Nicholas Chammas] [SPARK-2470] PEP8 fixes to statcounter.py
d644477 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to worker.py
aa3a7b6 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to sql.py
1916859 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to shell.py
95d1d95 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to serializers.py
a0fec2e [Nicholas Chammas] [SPARK-2470] PEP8 fixes to mllib
c85e1e5 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to join.py
d14f2f1 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to __init__.py
81fcb20 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to resultiterable.py
1bde265 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to java_gateway.py
7fc849c [Nicholas Chammas] [SPARK-2470] PEP8 fixes to daemon.py
ca2d28b [Nicholas Chammas] [SPARK-2470] PEP8 fixes to context.py
f4e0039 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to conf.py
a6d5e4b [Nicholas Chammas] [SPARK-2470] PEP8 fixes to cloudpickle.py
f0a7ebf [Nicholas Chammas] [SPARK-2470] PEP8 fixes to rddsampler.py
4dd148f [nchammas] Merge pull request #5 from apache/master
f7e4581 [Nicholas Chammas] unrelated pep8 fix
a36eed0 [Nicholas Chammas] name ec2 instances and security groups consistently
de7292a [nchammas] Merge pull request #4 from apache/master
2e4fe00 [nchammas] Merge pull request #3 from apache/master
89fde08 [nchammas] Merge pull request #2 from apache/master
69f6e22 [Nicholas Chammas] PEP8 fixes
2627247 [Nicholas Chammas] broke up lines before they hit 100 chars
6544b7e [Nicholas Chammas] [SPARK-2065] give launched instances names
69da6cf [nchammas] Merge pull request #1 from apache/master

5d16d5bb

May 10, 2014

Add Python includes to path before depickling broadcast values · 3776f2f2

Bouke van der Bijl authored 11 years ago

This fixes https://issues.apache.org/jira/browse/SPARK-1731 by adding the Python includes to the PYTHONPATH before depickling the broadcast values.

@airhorns

Author: Bouke van der Bijl <boukevanderbijl@gmail.com>

Closes #656 from bouk/python-includes-before-broadcast and squashes the following commits:

7b0dfe4 [Bouke van der Bijl] Add Python includes to path before depickling broadcast values

3776f2f2

Feb 26, 2014

SPARK-1115: Catch depickling errors · 12738c1a

Bouke van der Bijl authored 11 years ago

This surrounds the complete worker code in a try/except block so that we catch any error that arises, for example a depickling failure.
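A minimal sketch of the idea, assuming a length-prefixed output stream: the function `run_worker`, the marker constant `EXCEPTION_MARKER` and its value are hypothetical names chosen for illustration and are not taken from this commit.

```python
import io
import struct
import traceback

# Hypothetical marker telling the reader that a traceback follows.
EXCEPTION_MARKER = -2


def run_worker(task, outfile):
    """Run `task`; on any error, write the traceback to `outfile` instead of crashing."""
    try:
        result = task()
        payload = repr(result).encode("utf-8")
        outfile.write(struct.pack("!i", len(payload)))
        outfile.write(payload)
    except Exception:
        # Everything, including deserialization of the input, sits inside
        # the try block, so the failure is reported rather than lost.
        tb = traceback.format_exc().encode("utf-8")
        outfile.write(struct.pack("!i", EXCEPTION_MARKER))
        outfile.write(struct.pack("!i", len(tb)))
        outfile.write(tb)


def failing_task():
    raise ValueError("could not depickle input")


out = io.BytesIO()
run_worker(failing_task, out)

# The reading side sees the marker first, then the traceback text.
out.seek(0)
(marker,) = struct.unpack("!i", out.read(4))
(length,) = struct.unpack("!i", out.read(4))
message = out.read(length).decode("utf-8")
```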

@JoshRosen

Author: Bouke van der Bijl <boukevanderbijl@gmail.com>

Closes #644 from bouk/catch-depickling-errors and squashes the following commits:

f0f67cc [Bouke van der Bijl] Lol indentation
0e4d504 [Bouke van der Bijl] Surround the complete python worker with the try block

12738c1a

Feb 22, 2014

Fixed minor typo in worker.py · 3ff077d4

jyotiska authored 11 years ago

Fixed minor typo in worker.py

Author: jyotiska <jyotiska123@gmail.com>

Closes #630 from jyotiska/pyspark_code and squashes the following commits:

ee44201 [jyotiska] typo fixed in worker.py

3ff077d4

Jan 28, 2014

Switch from MUTF8 to UTF8 in PySpark serializers. · 1381fc72

Josh Rosen authored 11 years ago

This fixes SPARK-1043, a bug introduced in 0.9.0
where PySpark couldn't serialize strings > 64kB.

This fix was written by @tyro89 and @bouk in #512.
This commit squashes and rebases their pull request
in order to fix some merge conflicts.
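To illustrate where a 64 kB limit of this kind can come from: Java's `DataOutputStream.writeUTF` prefixes the modified-UTF-8 bytes with a two-byte length, so it cannot encode more than 65535 bytes. The sketch below assumes that was the cause here and contrasts a two-byte prefix with a four-byte one; the function names are hypothetical and this is not the code from the commit.

```python
import struct


def write_with_short_prefix(s):
    """Frame `s` with a 2-byte unsigned length; fails above 65535 bytes."""
    data = s.encode("utf-8")
    return struct.pack("!H", len(data)) + data


def write_with_int_prefix(s):
    """Frame `s` as plain UTF-8 with a 4-byte signed length."""
    data = s.encode("utf-8")
    return struct.pack("!i", len(data)) + data


def read_with_int_prefix(buf):
    """Inverse of write_with_int_prefix."""
    (length,) = struct.unpack("!i", buf[:4])
    return buf[4:4 + length].decode("utf-8")


big = "x" * 70000  # larger than a 2-byte length can describe

try:
    write_with_short_prefix(big)
    short_prefix_ok = True
except struct.error:
    short_prefix_ok = False

round_tripped = read_with_int_prefix(write_with_int_prefix(big))
```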

1381fc72

Jan 12, 2014

Log Python exceptions to stderr as well · 5741078c

Matei Zaharia authored 11 years ago

This helps in case the exception happened while serializing a record to
be sent to Java, leaving the stream to Java in an inconsistent state
where PythonRDD won't be able to read the error.

5741078c

Nov 10, 2013

FramedSerializer: _dumps => dumps, _loads => loads. · 13122ceb

Josh Rosen authored 11 years ago

13122ceb

Send PySpark commands as bytes instead of strings. · ffa5bedf

Josh Rosen authored 11 years ago

ffa5bedf

Add custom serializer support to PySpark. · cbb7f04a

Josh Rosen authored 11 years ago

For now, this only adds MarshalSerializer, but it lays the groundwork
for supporting other custom serializers.  Many of these mechanisms
can also be used to support deserialization of different data formats
sent by Java, such as data encoded by MsgPack.
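As a rough sketch of what a pluggable serializer layer can look like: the class names FramedSerializer and MarshalSerializer appear in this history, but the bodies below, the four-byte framing, and PickleSerializer are assumptions made for illustration rather than the actual PySpark implementation.

```python
import io
import marshal
import pickle
import struct


class FramedSerializer:
    """Writes each object as a 4-byte length followed by its serialized bytes."""

    def dumps(self, obj):
        raise NotImplementedError

    def loads(self, data):
        raise NotImplementedError

    def write(self, obj, stream):
        data = self.dumps(obj)
        stream.write(struct.pack("!i", len(data)))
        stream.write(data)

    def read(self, stream):
        (length,) = struct.unpack("!i", stream.read(4))
        return self.loads(stream.read(length))


class MarshalSerializer(FramedSerializer):
    """Fast, but limited to the built-in types that marshal supports."""

    def dumps(self, obj):
        return marshal.dumps(obj)

    def loads(self, data):
        return marshal.loads(data)


class PickleSerializer(FramedSerializer):
    """Slower, but handles a much wider range of Python objects."""

    def dumps(self, obj):
        return pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)

    def loads(self, data):
        return pickle.loads(data)


# Either serializer can be swapped in without changing the framing code.
buf = io.BytesIO()
serializer = MarshalSerializer()
serializer.write({"a": [1, 2, 3]}, buf)
buf.seek(0)
restored = serializer.read(buf)
```

Keeping the framing in the base class means a new format only has to provide `dumps` and `loads`.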

This also fixes a bug in SparkContext.union().

cbb7f04a

Nov 03, 2013

Remove Pickle-wrapping of Java objects in PySpark. · 7d68a81a

Josh Rosen authored 11 years ago

If we support custom serializers, the Python
worker will know what type of input to expect,
so we won't need to wrap Tuple2 and Strings into
pickled tuples and strings.

7d68a81a

Replace magic lengths with constants in PySpark. · a48d88d2

Josh Rosen authored 11 years ago

Write the length of the accumulators section up-front rather
than terminating it with a negative length.  I find this
easier to read.
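The up-front count can be sketched like this. It is an illustration only: the function names and the exact wire layout (a four-byte count, then each item as a four-byte length plus bytes) are assumptions, not the format used by this commit.

```python
import io
import struct


def write_int(value, stream):
    """Write a 4-byte big-endian signed integer."""
    stream.write(struct.pack("!i", value))


def read_int(stream):
    """Read a 4-byte big-endian signed integer."""
    (value,) = struct.unpack("!i", stream.read(4))
    return value


def write_section(items, stream):
    """Write the item count first, then each item as length + bytes."""
    write_int(len(items), stream)
    for item in items:
        write_int(len(item), stream)
        stream.write(item)


def read_section(stream):
    """Read exactly `count` items; no negative-length sentinel is needed."""
    count = read_int(stream)
    return [stream.read(read_int(stream)) for _ in range(count)]


buf = io.BytesIO()
write_section([b"a", b"bcd"], buf)
buf.seek(0)
items = read_section(buf)
```

With the count written first, the reader loops a known number of times instead of checking every length for a terminator value.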

a48d88d2

Sep 01, 2013
- Allow PySpark to launch worker.py directly on Windows · 6550e5e6
  Matei Zaharia authored 11 years ago
  
  6550e5e6
Aug 16, 2013

Implementing SPARK-878 for PySpark: adding zip and egg files to context and... · c7e348fa

Andre Schumacher authored 12 years ago

Implementing SPARK-878 for PySpark: adding zip and egg files to context and passing it down to workers which add these to their sys.path
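The worker-side half of this relies on standard Python behaviour: a zip archive placed on sys.path becomes importable. A small self-contained sketch, in which the archive name `deps.zip` and the module name `shipped_helper` are hypothetical:

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Build a small zip archive containing one module (hypothetical name).
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "deps.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("shipped_helper.py", "def double(x):\n    return 2 * x\n")

# A worker that receives the archive only needs to put it on sys.path;
# Python's zip import support does the rest.
sys.path.insert(0, archive)
importlib.invalidate_caches()
shipped_helper = importlib.import_module("shipped_helper")
result = shipped_helper.double(21)
```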

c7e348fa

Jul 16, 2013
- Add Apache license headers and LICENSE and NOTICE files · af3c9d50
  Matei Zaharia authored 12 years ago
  
  af3c9d50
Jun 21, 2013
- Fix reporting of PySpark exceptions · c75bed0e
  Jey Kottalam authored 12 years ago
  
  c75bed0e
- Add tests and fixes for Python daemon shutdown · 62c47814
  Jey Kottalam authored 12 years ago
  
  62c47814
- Prefork Python worker processes · c79a6078
  Jey Kottalam authored 12 years ago
  
  c79a6078
- Add Python timing instrumentation · 40afe0d2
  Jey Kottalam authored 12 years ago
  
  40afe0d2
Feb 01, 2013
- Fix stdout redirection in PySpark. · 57b64d0d
  Josh Rosen authored 12 years ago
  
  57b64d0d
Jan 31, 2013

SPARK-673: Capture and re-throw Python exceptions · 3446d5c8

Patrick Wendell authored 12 years ago

This patch alters the Python <-> executor protocol to pass on
exception data when they occur in user Python code.

3446d5c8

Jan 23, 2013
- Allow PySpark's SparkFiles to be used from driver · ae2ed294
  Josh Rosen authored 12 years ago
  
  Fix minor documentation formatting issues.
  ae2ed294
Jan 22, 2013
- Fix sys.path bug in PySpark SparkContext.addPyFile · 35168d9c
  Josh Rosen authored 12 years ago
  
  35168d9c
Jan 21, 2013

Don't download files to master's working directory. · ef711902

Josh Rosen authored 12 years ago

This should avoid exceptions caused by existing
files with different contents.

I also removed some unused code.

ef711902

Jan 20, 2013
- Added accumulators to PySpark · 8e7f098a
  Matei Zaharia authored 12 years ago
  
  8e7f098a
Jan 08, 2013
- Add mapPartitionsWithSplit() to PySpark. · b57dd0f1
  Josh Rosen authored 12 years ago
  
  b57dd0f1
Jan 01, 2013
- Rename top-level 'pyspark' directory to 'python' · b58340db
  Josh Rosen authored 12 years ago
  
  b58340db
Dec 24, 2012

Use filesystem to collect RDDs in PySpark. · 4608902f

Josh Rosen authored 12 years ago

Passing large volumes of data through Py4J seems
to be slow.  It appears to be faster to write the
data to the local filesystem and read it back from
Python.

4608902f

Oct 19, 2012
- Update Python API for v0.6.0 compatibility. · 52989c8a
  Josh Rosen authored 12 years ago
  
  52989c8a
Aug 27, 2012
- Simplify Python worker; pipeline the map step of partitionBy(). · 200d248d
  Josh Rosen authored 12 years ago
  
  200d248d
- Use local combiners in Python API combineByKey(). · 6904cb77
  Josh Rosen authored 12 years ago
  
  6904cb77
- Add broadcast variables to Python API. · f79a1e4d
  Josh Rosen authored 12 years ago
  
  f79a1e4d
Aug 24, 2012
- Refactor Python MappedRDD to use iterator pipelines. · f3b852ce
  Josh Rosen authored 12 years ago
  
  f3b852ce
Aug 22, 2012
- Use numpy in Python k-means example. · 607b53ab
  Josh Rosen authored 12 years ago
  
  607b53ab
Aug 21, 2012

Use only cPickle for serialization in Python API. · fd94e544

Josh Rosen authored 13 years ago

Objects serialized with JSON can be compared for equality, but JSON can be slow
to serialize and only supports a limited range of data types.

fd94e544

Aug 19, 2012
- Bundle cloudpickle with pyspark. · 13b95149
  Josh Rosen authored 13 years ago
  
  13b95149
- Add Python API. · 886b39de
  Josh Rosen authored 13 years ago
  
  886b39de