  1. May 06, 2014
    • [SPARK-1549] Add Python support to spark-submit · d7ddb26e
      Matei Zaharia authored
      This PR updates spark-submit to allow submitting Python scripts (currently only with deploy-mode=client, but that's all that was supported before) and updates the PySpark code to properly find various paths, etc. One significant change is that we assume we can always find the Python files either from the Spark assembly JAR (which will happen with the Maven assembly build in make-distribution.sh) or from SPARK_HOME (which will exist in local mode even if you use sbt assembly, and should be enough for testing). This means we no longer need a weird hack to modify the environment for YARN.
      
      This patch also updates the Python worker manager to run python with -u, which means unbuffered output (send it to our logs right away instead of waiting a while after stuff was written); this should simplify debugging.
      
      In addition, it fixes https://issues.apache.org/jira/browse/SPARK-1709, setting the main class from a JAR's Main-Class attribute if not specified by the user, and fixes a few help strings and style issues in spark-submit.
      
      In the future we may want to make the `pyspark` shell use spark-submit as well, but it seems unnecessary for 1.0.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #664 from mateiz/py-submit and squashes the following commits:
      
      15e9669 [Matei Zaharia] Fix some uses of path.separator property
      051278c [Matei Zaharia] Small style fixes
      0afe886 [Matei Zaharia] Add license headers
      4650412 [Matei Zaharia] Add pyFiles to PYTHONPATH in executors, remove old YARN stuff, add tests
      15f8e1e [Matei Zaharia] Set PYTHONPATH in PythonWorkerFactory in case it wasn't set from outside
      47c0655 [Matei Zaharia] More work to make spark-submit work with Python:
      d4375bd [Matei Zaharia] Clean up description of spark-submit args a bit and add Python ones
      
      (cherry picked from commit 951a5d93)
      Signed-off-by: Matei Zaharia <matei@databricks.com>
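The Main-Class fallback described in the commit message above can be illustrated with a small Python sketch. This is not the spark-submit code (which lives on the JVM side); `read_main_class`, `resolve_main_class`, and `make_jar` are hypothetical names introduced here, and the sketch only assumes that a JAR is a zip archive whose manifest sits at `META-INF/MANIFEST.MF`.

```python
import io
import zipfile


def read_main_class(jar_bytes):
    """Return the Main-Class attribute from a JAR's manifest, or None."""
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        try:
            manifest = jar.read("META-INF/MANIFEST.MF").decode("utf-8")
        except KeyError:
            return None  # the archive has no manifest entry

    # Manifest continuation lines start with a single space; join them first.
    logical_lines = []
    for line in manifest.splitlines():
        if line.startswith(" ") and logical_lines:
            logical_lines[-1] += line[1:]
        else:
            logical_lines.append(line)

    for line in logical_lines:
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "main-class":
            return value.strip()
    return None


def resolve_main_class(user_main_class, jar_bytes):
    """Prefer the user's explicit choice; fall back to the manifest."""
    return user_main_class or read_main_class(jar_bytes)


def make_jar(manifest_text=None):
    """Build a tiny in-memory JAR (hypothetical helper for the example)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as jar:
        if manifest_text is not None:
            jar.writestr("META-INF/MANIFEST.MF", manifest_text)
        jar.writestr("placeholder.txt", "x")
    return buf.getvalue()
```

A class name passed explicitly always wins; the manifest is consulted only when none is given.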
  2. Apr 24, 2014
    • [SPARK-986]: Job cancelation for PySpark · 7b6d7748
      Ahir Reddy authored
      
      * Additions to the PySpark API to cancel jobs
      * Monitor Thread in PythonRDD to kill Python workers if a task is interrupted
      
      Author: Ahir Reddy <ahirreddy@gmail.com>
      
      Closes #541 from ahirreddy/python-cancel and squashes the following commits:
      
      dfdf447 [Ahir Reddy] Changed success -> completed and made logging message clearer
      6c860ab [Ahir Reddy] PR Comments
      4b4100a [Ahir Reddy] Success flag
      adba6ed [Ahir Reddy] Destroy python workers
      27a2f8f [Ahir Reddy] Start the writer thread...
      d422f7b [Ahir Reddy] Remove unnecessary vals
      adda337 [Ahir Reddy] Busy wait on the context.interrupted flag, and then kill the python worker
      d9e472f [Ahir Reddy] Revert "removed unnecessary vals"
      5b9cae5 [Ahir Reddy] removed unnecessary vals
      07b54d9 [Ahir Reddy] Fix canceling unit test
      8ae9681 [Ahir Reddy] Don't interrupt worker
      7722342 [Ahir Reddy] Monitor Thread for python workers
      db04e16 [Ahir Reddy] Added canceling api to PySpark
      
      (cherry picked from commit e53eb4f0)
      Signed-off-by: Matei Zaharia <matei@databricks.com>
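The monitor-thread idea in this commit (poll an interrupted flag, then kill the Python worker) can be sketched in plain Python. This is an illustration only, not the PythonRDD code; `start_monitor`, `run_and_cancel`, and the two `threading.Event` flags are hypothetical names standing in for the task's interrupted and completed state.

```python
import subprocess
import sys
import threading
import time


def start_monitor(worker, interrupted, completed, poll_interval=0.01):
    """Kill `worker` if `interrupted` is set before `completed` is."""
    def watch():
        while not completed.is_set():
            if interrupted.is_set():
                worker.kill()
                return
            time.sleep(poll_interval)

    thread = threading.Thread(target=watch, daemon=True)
    thread.start()
    return thread


def run_and_cancel():
    """Start a long-running worker, cancel it, and return its exit code."""
    worker = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    interrupted = threading.Event()
    completed = threading.Event()
    monitor = start_monitor(worker, interrupted, completed)

    interrupted.set()  # simulate the task being cancelled
    monitor.join(timeout=5)
    return worker.wait(timeout=5)
```

Polling keeps the sketch simple; the worker is killed within roughly one poll interval of the flag being set.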
  3. Apr 18, 2014
  4. Apr 04, 2014
    • SPARK-1305: Support persisting RDD's directly to Tachyon · b50ddfde
      Haoyuan Li authored
      Moves PR #468 from apache-incubator-spark to apache-spark:
      "Adding an option to persist Spark RDD blocks into Tachyon."
      
      Author: Haoyuan Li <haoyuan@cs.berkeley.edu>
      Author: RongGu <gurongwalker@gmail.com>
      
      Closes #158 from RongGu/master and squashes the following commits:
      
      72b7768 [Haoyuan Li] merge master
      9f7fa1b [Haoyuan Li] fix code style
      ae7834b [Haoyuan Li] minor cleanup
      a8b3ec6 [Haoyuan Li] merge master branch
      e0f4891 [Haoyuan Li] better check offheap.
      55b5918 [RongGu] address matei's comment on the replication of offHeap storagelevel
      7cd4600 [RongGu] remove some logic code for tachyonstore's replication
      51149e7 [RongGu] address aaron's comment on returning value of the remove() function in tachyonstore
      8adfcfa [RongGu] address aaron's comment on inTachyonSize
      120e48a [RongGu] changed the root-level dir name in Tachyon
      5cc041c [Haoyuan Li] address aaron's comments
      9b97935 [Haoyuan Li] address aaron's comments
      d9a6438 [Haoyuan Li] fix for pyspark
      77d2703 [Haoyuan Li] change python api
      3dcace4 [Haoyuan Li] address matei's comments
      91fa09d [Haoyuan Li] address patrick's comments
      589eafe [Haoyuan Li] use TRY_CACHE instead of MUST_CACHE
      64348b2 [Haoyuan Li] update conf docs.
      ed73e19 [Haoyuan Li] Merge branch 'master' of github.com:RongGu/spark-1
      619a9a8 [RongGu] set number of directories in TachyonStore back to 64; added a TODO tag for duplicated code from the DiskStore
      be79d77 [RongGu] find a way to clean up some unnecessary methods and classes to make the code simpler
      49cc724 [Haoyuan Li] update docs with off_heap option
      4572f9f [RongGu] reserving the old apply function API of StorageLevel
      04301d3 [RongGu] rename StorageLevel.TACHYON to Storage.OFF_HEAP
      c9aeabf [RongGu] rename the StorageLevel.TACHYON as StorageLevel.OFF_HEAP
      76805aa [RongGu] unifies the config properties name prefix; add the configs into docs/configuration.md
      e700d9c [RongGu] add the SparkTachyonHdfsLR example and some comments
      fd84156 [RongGu] use randomUUID to generate sparkapp directory name on tachyon;minor code style fix
      939e467 [Haoyuan Li] 0.4.1-thrift from maven central
      86a2eab [Haoyuan Li] tachyon 0.4.1-thrift is in the staging repo. but jenkins failed to download it. temporarily revert it back to 0.4.1
      16c5798 [RongGu] make the dependency on tachyon as tachyon-0.4.1-thrift
      eacb2e8 [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
      bbeb4de [RongGu] fix the JsonProtocolSuite test failure problem
      6adb58f [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
      d827250 [RongGu] fix JsonProtocolSuite test failure
      716e93b [Haoyuan Li] revert the version
      ca14469 [Haoyuan Li] bump tachyon version to 0.4.1-thrift
      2825a13 [RongGu] up-merging to the current master branch of the apache spark
      6a22c1a [Haoyuan Li] fix scalastyle
      8968b67 [Haoyuan Li] exclude more libraries from tachyon dependency to be the same as referencing tachyon-client.
      77be7e8 [RongGu] address mateiz's comment about the temp folder name problem. The implementation followed mateiz's advice.
      1dcadf9 [Haoyuan Li] typo
      bf278fa [Haoyuan Li] fix python tests
      e82909c [Haoyuan Li] minor cleanup
      776a56c [Haoyuan Li] address patrick's and ali's comments from the previous PR
      8859371 [Haoyuan Li] various minor fixes and clean up
      e3ddbba [Haoyuan Li] add doc to use Tachyon cache mode.
      fcaeab2 [Haoyuan Li] address Aaron's comment
      e554b1e [Haoyuan Li] add python code
      47304b3 [Haoyuan Li] make tachyonStore in BlockManager lazy val; add more comments to StorageLevels.
      dc8ef24 [Haoyuan Li] add old storelevel constructor
      e01a271 [Haoyuan Li] update tachyon 0.4.1
      8011a96 [RongGu] fix a brought-in mistake in StorageLevel
      70ca182 [RongGu] a bit change in comment
      556978b [RongGu] fix the scalastyle errors
      791189b [RongGu] "Adding an option to persist Spark RDD blocks into Tachyon." move the PR#468 of apache-incubator-spark to the apache-spark
    • SPARK-1414. Python API for SparkContext.wholeTextFiles · 60e18ce7
      Matei Zaharia authored
      Also clarified comment on each file having to fit in memory
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #327 from mateiz/py-whole-files and squashes the following commits:
      
      9ad64a5 [Matei Zaharia] SPARK-1414. Python API for SparkContext.wholeTextFiles
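To make the "each file has to fit in memory" remark concrete, here is a pure-Python sketch of the record shape that a whole-file read produces: one (path, content) pair per file. It does not use Spark; `whole_text_files` and `demo` are hypothetical names for this illustration, and a local directory stands in for a distributed file system.

```python
import os
import tempfile


def whole_text_files(directory):
    """Return a list of (path, content) pairs, one per file in `directory`."""
    records = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, encoding="utf-8") as handle:
                # The whole file is read in one call, so it must fit in memory.
                records.append((path, handle.read()))
    return records


def demo():
    """Write two small files and return (file name, content) pairs."""
    with tempfile.TemporaryDirectory() as directory:
        for name, text in [("a.txt", "one\ntwo\n"), ("b.txt", "three")]:
            with open(os.path.join(directory, name), "w", encoding="utf-8") as f:
                f.write(text)
        return [(os.path.basename(p), c) for p, c in whole_text_files(directory)]
```

Unlike a line-oriented read, each record here carries an entire file, which is why very large files are a poor fit for this style of API.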
  5. Mar 10, 2014
  6. Mar 06, 2014
    • SPARK-1187, Added missing Python APIs · 3d3acef0
      Prabin Banka authored
      The following Python APIs are added:
      RDD.id()
      SparkContext.setJobGroup()
      SparkContext.setLocalProperty()
      SparkContext.getLocalProperty()
      SparkContext.sparkUser()
      
      This was raised earlier as part of apache/incubator-spark#486
      
      Author: Prabin Banka <prabin.banka@imaginea.com>
      
      Closes #75 from prabinb/python-api-backup and squashes the following commits:
      
      cc3c6cd [Prabin Banka] Added missing Python APIs
  7. Feb 20, 2014
    • SPARK-1114: Allow PySpark to use existing JVM and Gateway · 59b13795
      Ahir Reddy authored
      Patch to allow PySpark to use existing JVM and Gateway. Changes to PySpark implementation of SparkConf to take existing SparkConf JVM handle. Change to PySpark SparkContext to allow subclass specific context initialization.
      
      Author: Ahir Reddy <ahirreddy@gmail.com>
      
      Closes #622 from ahirreddy/pyspark-existing-jvm and squashes the following commits:
      
      a86f457 [Ahir Reddy] Patch to allow PySpark to use existing JVM and Gateway. Changes to PySpark implementation of SparkConf to take existing SparkConf JVM handle. Change to PySpark SparkContext to allow subclass specific context initialization.
  8. Jan 28, 2014
    • Switch from MUTF8 to UTF8 in PySpark serializers. · 1381fc72
      Josh Rosen authored
      This fixes SPARK-1043, a bug introduced in 0.9.0
      where PySpark couldn't serialize strings > 64kB.
      
      This fix was written by @tyro89 and @bouk in #512.
      This commit squashes and rebases their pull request
      in order to fix some merge conflicts.
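The 64 kB limit mentioned above comes from framing a string with a 16-bit length, as Java's `DataOutputStream.writeUTF` does. The sketch below contrasts that framing with a 32-bit length followed by plain UTF-8 bytes. It is a minimal illustration under stated assumptions: the function names are hypothetical, and the short-prefix writer models only the length limit, not the modified UTF-8 encoding itself.

```python
import io
import struct


def write_with_short_prefix(stream, text):
    """Frame `text` with an unsigned 16-bit length (models the old limit)."""
    data = text.encode("utf-8")
    stream.write(struct.pack(">H", len(data)))  # raises above 65535 bytes
    stream.write(data)


def write_utf8(stream, text):
    """Frame `text` with a signed 32-bit big-endian length, then UTF-8 bytes."""
    data = text.encode("utf-8")
    stream.write(struct.pack(">i", len(data)))
    stream.write(data)


def read_utf8(stream):
    """Read one string written by `write_utf8`."""
    (length,) = struct.unpack(">i", stream.read(4))
    return stream.read(length).decode("utf-8")


def round_trip(text):
    """Write `text` with the 32-bit framing and read it back."""
    buf = io.BytesIO()
    write_utf8(buf, text)
    buf.seek(0)
    return read_utf8(buf)
```

With the 32-bit prefix, a 70,000-character string round-trips, whereas the 16-bit variant fails with `struct.error` before writing anything.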
  9. Jan 01, 2014
  10. Dec 30, 2013
  11. Dec 29, 2013
  12. Dec 28, 2013
  13. Dec 24, 2013
  14. Dec 18, 2013
  15. Nov 10, 2013
  16. Nov 03, 2013
  17. Oct 22, 2013
    • Pass self to SparkContext._ensure_initialized. · 317a9eb1
      Ewen Cheslack-Postava authored
      The constructor for SparkContext should pass in self so that we track
      the current context and produce errors if another one is created. Add
      a doctest to make sure creating multiple contexts triggers the
      exception.
    • Add classmethod to SparkContext to set system properties. · 56d230e6
      Ewen Cheslack-Postava authored
      Add a new classmethod to SparkContext to set system properties, as is
      possible in Scala/Java. Unlike the Java/Scala implementations, there's
      no access to System until the JVM bridge is created. Since
      SparkContext handles that, move the initialization of the JVM
      connection to a separate classmethod that can safely be called
      repeatedly as long as the same instance (or no instance) is provided.
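The two commits above describe one pattern: a classmethod that lazily creates a shared bridge, can be called repeatedly with the same instance (or none), and rejects a second live instance. A rough Python sketch follows. `Context` is a hypothetical stand-in, not the PySpark class, and a plain dict takes the place of the JVM bridge.

```python
import threading


class Context:
    """Hypothetical stand-in for a context that owns a shared bridge."""

    _lock = threading.Lock()
    _bridge = None  # shared resource, created once on first use
    _active = None  # the single live instance, if any

    @classmethod
    def _ensure_initialized(cls, instance=None):
        """Create the bridge if needed and register `instance` as active."""
        with cls._lock:
            if cls._bridge is None:
                cls._bridge = {}  # placeholder for launching the real bridge
            if instance is not None:
                if cls._active is not None and cls._active is not instance:
                    raise ValueError("Cannot run multiple contexts at once")
                cls._active = instance

    @classmethod
    def set_system_property(cls, key, value):
        """Set a property before any instance exists."""
        cls._ensure_initialized()
        cls._bridge[key] = value

    def __init__(self):
        # Passing self is what lets the class track the current context.
        Context._ensure_initialized(self)

    def stop(self):
        """Release the active slot so a new context can be created."""
        with Context._lock:
            Context._active = None
```

Calling `Context.set_system_property(...)` before constructing a `Context` works because the classmethod initializes the bridge on demand without registering an instance.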
  18. Sep 08, 2013
  19. Sep 07, 2013
  20. Sep 06, 2013
  21. Sep 01, 2013
  22. Aug 16, 2013
  23. Jul 29, 2013
    • SPARK-815. Python parallelize() should split lists before batching · feba7ee5
      Matei Zaharia authored
      One unfortunate consequence of this fix is that we materialize any
      collections that are given to us as generators, but this seems necessary
      to get reasonable behavior on small collections. We could add a
      batchSize parameter later to bypass auto-computation of batch size if
      this becomes a problem (e.g. if users really want to parallelize big
      generators nicely)
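The ordering issue behind this fix can be shown with a small Python sketch. It is an illustration, not the PySpark implementation: `batch_then_split` and `split_then_batch` are hypothetical names, and the batch-size rule (the smaller of the per-slice length and a maximum) is an assumption made for the example.

```python
def batch_then_split(items, num_slices, batch_size):
    """Batch first, then distribute whole batches across slices."""
    data = list(items)
    batches = [data[j:j + batch_size] for j in range(0, len(data), batch_size)]
    return [
        batches[i * len(batches) // num_slices:(i + 1) * len(batches) // num_slices]
        for i in range(num_slices)
    ]


def split_then_batch(items, num_slices, max_batch_size):
    """Split into slices first, then batch within each slice."""
    data = list(items)  # materializes generators, as the commit message notes
    batch_size = max(1, min(len(data) // num_slices, max_batch_size))
    slices = []
    for i in range(num_slices):
        start = i * len(data) // num_slices
        end = (i + 1) * len(data) // num_slices
        chunk = data[start:end]
        slices.append(
            [chunk[j:j + batch_size] for j in range(0, len(chunk), batch_size)]
        )
    return slices
```

For a four-element input, two slices, and a batch size of 1024, batching first puts everything into one batch and leaves a slice empty, while splitting first gives each slice two elements.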
  24. Jul 16, 2013
  25. Feb 03, 2013
  26. Feb 01, 2013
  27. Jan 23, 2013
  28. Jan 22, 2013