Skip to content
Snippets Groups Projects
  1. May 06, 2014
    • Matei Zaharia's avatar
      [SPARK-1549] Add Python support to spark-submit · 951a5d93
      Matei Zaharia authored
      This PR updates spark-submit to allow submitting Python scripts (currently only with deploy-mode=client, but that's all that was supported before) and updates the PySpark code to properly find various paths, etc. One significant change is that we assume we can always find the Python files either from the Spark assembly JAR (which will happen with the Maven assembly build in make-distribution.sh) or from SPARK_HOME (which will exist in local mode even if you use sbt assembly, and should be enough for testing). This means we no longer need a weird hack to modify the environment for YARN.
      
      This patch also updates the Python worker manager to run python with -u, which means unbuffered output (send it to our logs right away instead of waiting a while after stuff was written); this should simplify debugging.
      
      In addition, it fixes https://issues.apache.org/jira/browse/SPARK-1709, setting the main class from a JAR's Main-Class attribute if not specified by the user, and fixes a few help strings and style issues in spark-submit.
      
      In the future we may want to make the `pyspark` shell use spark-submit as well, but it seems unnecessary for 1.0.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #664 from mateiz/py-submit and squashes the following commits:
      
      15e9669 [Matei Zaharia] Fix some uses of path.separator property
      051278c [Matei Zaharia] Small style fixes
      0afe886 [Matei Zaharia] Add license headers
      4650412 [Matei Zaharia] Add pyFiles to PYTHONPATH in executors, remove old YARN stuff, add tests
      15f8e1e [Matei Zaharia] Set PYTHONPATH in PythonWorkerFactory in case it wasn't set from outside
      47c0655 [Matei Zaharia] More work to make spark-submit work with Python:
      d4375bd [Matei Zaharia] Clean up description of spark-submit args a bit and add Python ones
      951a5d93
  2. Apr 30, 2014
    • Sandy Ryza's avatar
      SPARK-1004. PySpark on YARN · ff5be9a4
      Sandy Ryza authored
      This reopens https://github.com/apache/incubator-spark/pull/640 against the new repo
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #30 from sryza/sandy-spark-1004 and squashes the following commits:
      
      89889d4 [Sandy Ryza] Move unzipping py4j to the generate-resources phase so that it gets included in the jar the first time
      5165a02 [Sandy Ryza] Fix docs
      fd0df79 [Sandy Ryza] PySpark on YARN
      ff5be9a4
  3. Jan 23, 2014
  4. Dec 24, 2013
  5. Nov 29, 2013
  6. Nov 10, 2013
    • Josh Rosen's avatar
      Add custom serializer support to PySpark. · cbb7f04a
      Josh Rosen authored
      For now, this only adds MarshalSerializer, but it lays the groundwork
      for other supporting custom serializers.  Many of these mechanisms
      can also be used to support deserialization of different data formats
      sent by Java, such as data encoded by MsgPack.
      
      This also fixes a bug in SparkContext.union().
      cbb7f04a
  7. Aug 16, 2013
  8. Aug 14, 2013
  9. Jul 16, 2013
  10. Jun 21, 2013
  11. Feb 01, 2013
    • Josh Rosen's avatar
      Do not launch JavaGateways on workers (SPARK-674). · 9cc6ff9c
      Josh Rosen authored
      The problem was that the gateway was being initialized whenever the
      pyspark.context module was loaded.  The fix uses lazy initialization
      that occurs only when SparkContext instances are actually constructed.
      
      I also made the gateway and jvm variables private.
      
      This change results in ~3-4x performance improvement when running the
      PySpark unit tests.
      9cc6ff9c
    • Josh Rosen's avatar
      Fix stdout redirection in PySpark. · 57b64d0d
      Josh Rosen authored
      57b64d0d
  12. Jan 25, 2013
  13. Jan 23, 2013
  14. Jan 22, 2013
  15. Jan 20, 2013
Loading