Commits · ad4e60ee7e2c49c24a9972312915f7f7253c7679 · cs525-sp18-g07 / spark

May 06, 2014

[SPARK-1549] Add Python support to spark-submit · 951a5d93

Matei Zaharia authored 11 years ago

This PR updates spark-submit to allow submitting Python scripts (currently only with deploy-mode=client, but that's all that was supported before) and updates the PySpark code to properly find various paths, etc. One significant change is that we assume we can always find the Python files either from the Spark assembly JAR (which will happen with the Maven assembly build in make-distribution.sh) or from SPARK_HOME (which will exist in local mode even if you use sbt assembly, and should be enough for testing). This means we no longer need a weird hack to modify the environment for YARN.

This patch also updates the Python worker manager to run python with -u, which makes output unbuffered (it goes to our logs right away instead of sitting in a buffer for a while after being written); this should simplify debugging.
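As a rough illustration of what the -u flag does, the sketch below starts a child Python process with -u and reads its output. The helper name `launch_worker` and the inline script are hypothetical and are not taken from the Spark worker manager; only the `-u` interpreter flag itself comes from the commit message.

```python
import subprocess
import sys


def launch_worker(script: str) -> subprocess.Popen:
    """Start a child Python interpreter with -u (unbuffered stdout/stderr).

    Hypothetical helper for illustration; not the Spark implementation.
    """
    return subprocess.Popen(
        [sys.executable, "-u", "-c", script],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream
        text=True,
    )


# The child writes one line; with -u it is flushed as soon as it is written.
proc = launch_worker("print('worker started')")
output, _ = proc.communicate()
print(output.strip())
```

Without -u, output written to a pipe is typically block-buffered, so a line may not reach the reader until the buffer fills or the process exits.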

In addition, it fixes https://issues.apache.org/jira/browse/SPARK-1709, setting the main class from a JAR's Main-Class attribute if not specified by the user, and fixes a few help strings and style issues in spark-submit.
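The Main-Class fallback can be sketched roughly as follows. This is a simplified Python illustration, not the spark-submit code: the function names are hypothetical, the demo JAR is built in memory, and manifest continuation lines are ignored for brevity.

```python
import io
import zipfile
from typing import Optional


def main_class_from_jar(jar_bytes: bytes) -> Optional[str]:
    """Return the Main-Class attribute from a JAR manifest, or None if absent."""
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        try:
            manifest = jar.read("META-INF/MANIFEST.MF").decode("utf-8")
        except KeyError:
            return None  # the archive has no manifest
    for line in manifest.splitlines():
        name, sep, value = line.partition(":")
        # Manifest attribute names are case-insensitive.
        if sep and name.strip().lower() == "main-class":
            return value.strip()
    return None


def resolve_main_class(user_main_class: Optional[str], jar_bytes: bytes) -> Optional[str]:
    """Prefer the class given by the user; otherwise fall back to the manifest."""
    return user_main_class or main_class_from_jar(jar_bytes)


def build_demo_jar(main_class: str) -> bytes:
    """Build a tiny in-memory JAR whose manifest names main_class (demo only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as jar:
        jar.writestr(
            "META-INF/MANIFEST.MF",
            f"Manifest-Version: 1.0\r\nMain-Class: {main_class}\r\n",
        )
    return buf.getvalue()


demo_jar = build_demo_jar("com.example.Main")
print(resolve_main_class(None, demo_jar))
print(resolve_main_class("com.example.Other", demo_jar))
```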

In the future we may want to make the `pyspark` shell use spark-submit as well, but it seems unnecessary for 1.0.

Author: Matei Zaharia <matei@databricks.com>

Closes #664 from mateiz/py-submit and squashes the following commits:

15e9669 [Matei Zaharia] Fix some uses of path.separator property
051278c [Matei Zaharia] Small style fixes
0afe886 [Matei Zaharia] Add license headers
4650412 [Matei Zaharia] Add pyFiles to PYTHONPATH in executors, remove old YARN stuff, add tests
15f8e1e [Matei Zaharia] Set PYTHONPATH in PythonWorkerFactory in case it wasn't set from outside
47c0655 [Matei Zaharia] More work to make spark-submit work with Python:
d4375bd [Matei Zaharia] Clean up description of spark-submit args a bit and add Python ones


Apr 30, 2014

SPARK-1004. PySpark on YARN · ff5be9a4

Sandy Ryza authored 11 years ago

This reopens https://github.com/apache/incubator-spark/pull/640 against the new repo

Author: Sandy Ryza <sandy@cloudera.com>

Closes #30 from sryza/sandy-spark-1004 and squashes the following commits:

89889d4 [Sandy Ryza] Move unzipping py4j to the generate-resources phase so that it gets included in the jar the first time
5165a02 [Sandy Ryza] Fix docs
fd0df79 [Sandy Ryza] PySpark on YARN


Jan 23, 2014
- Fix for SPARK-1025: PySpark hang on missing files. · f8306849
  Josh Rosen authored 11 years ago
  
- Fix SPARK-978: ClassCastException in PySpark cartesian. · 61569906
  Josh Rosen authored 11 years ago
  
- Fix SPARK-1034: Py4JException on PySpark Cartesian Result · 0035dbbc
  Josh Rosen authored 11 years ago
  
Dec 24, 2013
- Fixed Python API for sc.setCheckpointDir. Also other fixes based on Reynold's comments on PR 289. · d4dfab50
  Tathagata Das authored 11 years ago
  
Nov 29, 2013
- Fix UnicodeEncodeError in PySpark saveAsTextFile(). · 3787f514
  Josh Rosen authored 11 years ago
  
  Fixes SPARK-970.
Nov 10, 2013

Add custom serializer support to PySpark. · cbb7f04a

Josh Rosen authored 11 years ago

For now, this only adds MarshalSerializer, but it lays the groundwork
for supporting other custom serializers.  Many of these mechanisms
can also be used to support deserialization of different data formats
sent by Java, such as data encoded by MsgPack.
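To give a feel for what a pluggable serializer might look like, here is a minimal sketch built on Python's standard `marshal` module. The class name, the `dumps`/`loads` interface, and the length-prefixed framing are assumptions made for this illustration; they are not copied from the PySpark source.

```python
import io
import marshal
import struct


class MarshalSerializerSketch:
    """Serialize objects with the standard-library marshal module.

    marshal handles only built-in types (numbers, strings, tuples, lists,
    dicts, ...), so it is less general than pickle.
    """

    def dumps(self, obj) -> bytes:
        return marshal.dumps(obj)

    def loads(self, data: bytes):
        return marshal.loads(data)


def write_framed(stream, serializer, obj) -> None:
    """Write one object as a 4-byte big-endian length followed by the payload."""
    payload = serializer.dumps(obj)
    stream.write(struct.pack("!i", len(payload)))
    stream.write(payload)


def read_framed(stream, serializer):
    """Read one length-prefixed object written by write_framed."""
    header = stream.read(4)
    if len(header) < 4:
        raise EOFError("no more framed objects")
    (length,) = struct.unpack("!i", header)
    return serializer.loads(stream.read(length))


# Round-trip two records through an in-memory stream.
buf = io.BytesIO()
ser = MarshalSerializerSketch()
write_framed(buf, ser, (1, "a"))
write_framed(buf, ser, [2.5, None])
buf.seek(0)
first = read_framed(buf, ser)
second = read_framed(buf, ser)
print(first, second)
```

Because the framing helpers only call `dumps` and `loads`, any object with those two methods could be swapped in, which is the kind of extension point the commit message describes.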

This also fixes a bug in SparkContext.union().


Aug 16, 2013

Implementing SPARK-878 for PySpark: adding zip and egg files to context and... · c7e348fa

Andre Schumacher authored 12 years ago

Implementing SPARK-878 for PySpark: adding zip and egg files to context and passing it down to workers which add these to their sys.path
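The worker-side half of this idea relies on Python's ability to import modules directly from zip archives (eggs are zip files as well) once the archive is on sys.path. A small self-contained sketch, with a hypothetical helper name and a throwaway archive created just for the demo:

```python
import importlib
import os
import sys
import tempfile
import zipfile


def add_archive_to_path(archive_path: str) -> None:
    """Make modules inside a .zip or .egg importable (hypothetical helper)."""
    if archive_path not in sys.path:
        sys.path.insert(0, archive_path)


# Build a throwaway archive that contains a single module.
tmp_dir = tempfile.mkdtemp()
archive = os.path.join(tmp_dir, "deps_demo.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("demo_dep_module.py", "VALUE = 42\n")

add_archive_to_path(archive)
importlib.invalidate_caches()  # make sure the new path entry is picked up
demo = importlib.import_module("demo_dep_module")
print(demo.VALUE)
```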


Aug 14, 2013
- Fix PySpark unit tests on Python 2.6. · 7a9abb9d
  Josh Rosen authored 12 years ago
  
Jul 16, 2013
- Add Apache license headers and LICENSE and NOTICE files · af3c9d50
  Matei Zaharia authored 12 years ago
  
Jun 21, 2013
- Add tests and fixes for Python daemon shutdown · 62c47814
  Jey Kottalam authored 12 years ago
  
Feb 01, 2013

Do not launch JavaGateways on workers (SPARK-674). · 9cc6ff9c

Josh Rosen authored 12 years ago

The problem was that the gateway was being initialized whenever the
pyspark.context module was loaded.  The fix uses lazy initialization
that occurs only when SparkContext instances are actually constructed.

I also made the gateway and jvm variables private.

This change results in ~3-4x performance improvement when running the
PySpark unit tests.
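The lazy-initialization pattern described in this commit can be sketched as follows. All names here are hypothetical stand-ins, and a plain `object()` takes the place of the real gateway; the point is only that importing the module does no expensive work, while constructing a context does it once.

```python
class GatewayHolder:
    """Holds a shared, lazily created gateway (stand-in for a JVM gateway)."""

    _gateway = None   # private: created on first use, not at import time
    launches = 0      # counts how many times the expensive launch ran

    @classmethod
    def ensure_gateway(cls):
        if cls._gateway is None:
            cls.launches += 1
            cls._gateway = object()  # placeholder for the expensive launch
        return cls._gateway


class SparkContextSketch:
    """Only constructing a context triggers gateway creation."""

    def __init__(self):
        self._gateway = GatewayHolder.ensure_gateway()


# Defining the classes above (the analogue of importing the module) launches nothing.
launches_at_import = GatewayHolder.launches
first = SparkContextSketch()
second = SparkContextSketch()  # reuses the gateway created for `first`
print(launches_at_import, GatewayHolder.launches)
```

Worker processes that import the module but never construct a context therefore never pay the launch cost.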


Fix stdout redirection in PySpark. · 57b64d0d

Josh Rosen authored 12 years ago


Jan 25, 2013
- Replace old 'master' term with 'driver'. · 7dfb82a9
  Stephen Haberman authored 12 years ago
  
Jan 23, 2013
- Allow PySpark's SparkFiles to be used from driver · ae2ed294
  Josh Rosen authored 12 years ago
  
  Fix minor documentation formatting issues.
Jan 22, 2013
- Fix sys.path bug in PySpark SparkContext.addPyFile · 35168d9c
  Josh Rosen authored 12 years ago
  
Jan 20, 2013
- Clean up setup code in PySpark checkpointing tests · 00d70cd6
  Josh Rosen authored 12 years ago
  
- Add checkpointFile() and more tests to PySpark. · d0ba80dc
  Josh Rosen authored 12 years ago
  
- Add RDD checkpointing to Python API. · 7ed1bf4b
  Josh Rosen authored 12 years ago
  