  1. Oct 23, 2014
    • [SPARK-3993] [PySpark] fix bug while reuse worker after take() · e595c8d0
      Davies Liu authored
      After take(), there may be garbage left in the socket, so the next task assigned to this worker can hang because of corrupted data.
      
      We should make sure the socket is clean before reusing it: write END_OF_STREAM at the end, and check for it after reading out all results from Python.
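
      As a minimal sketch of the idea (the sentinel value and helper names are illustrative, not necessarily Spark's actual wire protocol):

      ```python
      import struct

      END_OF_STREAM = -4  # illustrative sentinel; any value that cannot be a normal length works

      def write_int(value, stream):
          stream.write(struct.pack("!i", value))

      def read_int(stream):
          data = stream.read(4)
          if len(data) < 4:
              raise EOFError
          return struct.unpack("!i", data)[0]

      # Worker side: after all results, append the sentinel so the reader can tell a
      # cleanly finished stream from a truncated one (e.g. after take()).
      def finish_task(outfile):
          write_int(END_OF_STREAM, outfile)
          outfile.flush()

      # Reader side (sketched in Python): only return the worker to the pool if the
      # sentinel is the last thing on the wire; otherwise discard the worker.
      def safe_to_reuse(infile):
          try:
              return read_int(infile) == END_OF_STREAM
          except EOFError:
              return False
      ```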
      
      Author: Davies Liu <davies.liu@gmail.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2838 from davies/fix_reuse and squashes the following commits:
      
      8872914 [Davies Liu] fix tests
      660875b [Davies Liu] fix bug while reuse worker after take()
      e595c8d0
  2. Sep 30, 2014
    • [SPARK-3478] [PySpark] Profile the Python tasks · c5414b68
      Davies Liu authored
      This patch adds profiling support for PySpark. It shows the profiling results
      before the driver exits; here is one example:
      
      ```
      ============================================================
      Profile of RDD<id=3>
      ============================================================
               5146507 function calls (5146487 primitive calls) in 71.094 seconds
      
         Ordered by: internal time, cumulative time
      
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        5144576   68.331    0.000   68.331    0.000 statcounter.py:44(merge)
             20    2.735    0.137   71.071    3.554 statcounter.py:33(__init__)
             20    0.017    0.001    0.017    0.001 {cPickle.dumps}
           1024    0.003    0.000    0.003    0.000 t.py:16(<lambda>)
             20    0.001    0.000    0.001    0.000 {reduce}
             21    0.001    0.000    0.001    0.000 {cPickle.loads}
             20    0.001    0.000    0.001    0.000 copy_reg.py:95(_slotnames)
             41    0.001    0.000    0.001    0.000 serializers.py:461(read_int)
             40    0.001    0.000    0.002    0.000 serializers.py:179(_batched)
             62    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
             20    0.000    0.000   71.072    3.554 rdd.py:863(<lambda>)
             20    0.000    0.000    0.001    0.000 serializers.py:198(load_stream)
          40/20    0.000    0.000   71.072    3.554 rdd.py:2093(pipeline_func)
             41    0.000    0.000    0.002    0.000 serializers.py:130(load_stream)
             40    0.000    0.000   71.072    1.777 rdd.py:304(func)
             20    0.000    0.000   71.094    3.555 worker.py:82(process)
      ```
      
      Users can also show the profile results manually with `sc.show_profiles()` or dump them to disk
      with `sc.dump_profiles(path)`, for example:
      
      ```python
      >>> sc._conf.set("spark.python.profile", "true")
      >>> rdd = sc.parallelize(range(100)).map(str)
      >>> rdd.count()
      100
      >>> sc.show_profiles()
      ============================================================
      Profile of RDD<id=1>
      ============================================================
               284 function calls (276 primitive calls) in 0.001 seconds
      
         Ordered by: internal time, cumulative time
      
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
              4    0.000    0.000    0.000    0.000 serializers.py:198(load_stream)
              4    0.000    0.000    0.000    0.000 {reduce}
           12/4    0.000    0.000    0.001    0.000 rdd.py:2092(pipeline_func)
              4    0.000    0.000    0.000    0.000 {cPickle.loads}
              4    0.000    0.000    0.000    0.000 {cPickle.dumps}
            104    0.000    0.000    0.000    0.000 rdd.py:852(<genexpr>)
              8    0.000    0.000    0.000    0.000 serializers.py:461(read_int)
             12    0.000    0.000    0.000    0.000 rdd.py:303(func)
      ```
      Profiling is disabled by default and can be enabled with "spark.python.profile=true".
      
      Users can also dump the results to disk automatically for later analysis with "spark.python.profile.dump=path_to_dump".
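
      Putting the two settings together, a minimal sketch (the dump path is illustrative, and the profiler must be enabled before the jobs you want to measure run):

      ```python
      from pyspark import SparkConf, SparkContext

      # Enable the Python profiler and (optionally) a directory to dump profiles into.
      conf = (SparkConf()
              .set("spark.python.profile", "true")
              .set("spark.python.profile.dump", "/tmp/pyspark_profiles"))  # path is illustrative
      sc = SparkContext(conf=conf)

      rdd = sc.parallelize(range(100)).map(str)
      rdd.count()

      sc.show_profiles()   # print accumulated profiles (they are cleared after showing)
      # alternatively: sc.dump_profiles("/tmp/pyspark_profiles")
      ```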
      
      This is a bugfix of #2351. cc JoshRosen
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2556 from davies/profiler and squashes the following commits:
      
      e68df5a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      858e74c [Davies Liu] compatitable with python 2.6
      7ef2aa0 [Davies Liu] bugfix, add tests for show_profiles and dump_profiles()
      2b0daf2 [Davies Liu] fix docs
      7a56c24 [Davies Liu] bugfix
      cba9463 [Davies Liu] move show_profiles and dump_profiles to SparkContext
      fb9565b [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      116d52a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      09d02c3 [Davies Liu] Merge branch 'master' into profiler
      c23865c [Davies Liu] Merge branch 'master' into profiler
      15d6f18 [Davies Liu] add docs for two configs
      dadee1a [Davies Liu] add docs string and clear profiles after show or dump
      4f8309d [Davies Liu] address comment, add tests
      0a5b6eb [Davies Liu] fix Python UDF
      4b20494 [Davies Liu] add profile for python
      c5414b68
  3. Sep 26, 2014
    • Revert "[SPARK-3478] [PySpark] Profile the Python tasks" · f872e4fb
      Josh Rosen authored
      This reverts commit 1aa549ba.
      f872e4fb
    • [SPARK-3478] [PySpark] Profile the Python tasks · 1aa549ba
      Davies Liu authored
      This patch adds profiling support for PySpark. It shows the profiling results
      before the driver exits; here is one example:
      
      ```
      ============================================================
      Profile of RDD<id=3>
      ============================================================
               5146507 function calls (5146487 primitive calls) in 71.094 seconds
      
         Ordered by: internal time, cumulative time
      
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        5144576   68.331    0.000   68.331    0.000 statcounter.py:44(merge)
             20    2.735    0.137   71.071    3.554 statcounter.py:33(__init__)
             20    0.017    0.001    0.017    0.001 {cPickle.dumps}
           1024    0.003    0.000    0.003    0.000 t.py:16(<lambda>)
             20    0.001    0.000    0.001    0.000 {reduce}
             21    0.001    0.000    0.001    0.000 {cPickle.loads}
             20    0.001    0.000    0.001    0.000 copy_reg.py:95(_slotnames)
             41    0.001    0.000    0.001    0.000 serializers.py:461(read_int)
             40    0.001    0.000    0.002    0.000 serializers.py:179(_batched)
             62    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
             20    0.000    0.000   71.072    3.554 rdd.py:863(<lambda>)
             20    0.000    0.000    0.001    0.000 serializers.py:198(load_stream)
          40/20    0.000    0.000   71.072    3.554 rdd.py:2093(pipeline_func)
             41    0.000    0.000    0.002    0.000 serializers.py:130(load_stream)
             40    0.000    0.000   71.072    1.777 rdd.py:304(func)
             20    0.000    0.000   71.094    3.555 worker.py:82(process)
      ```
      
      Users can also show the profile results manually with `sc.show_profiles()` or dump them to disk
      with `sc.dump_profiles(path)`, for example:
      
      ```python
      >>> sc._conf.set("spark.python.profile", "true")
      >>> rdd = sc.parallelize(range(100)).map(str)
      >>> rdd.count()
      100
      >>> sc.show_profiles()
      ============================================================
      Profile of RDD<id=1>
      ============================================================
               284 function calls (276 primitive calls) in 0.001 seconds
      
         Ordered by: internal time, cumulative time
      
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
              4    0.000    0.000    0.000    0.000 serializers.py:198(load_stream)
              4    0.000    0.000    0.000    0.000 {reduce}
           12/4    0.000    0.000    0.001    0.000 rdd.py:2092(pipeline_func)
              4    0.000    0.000    0.000    0.000 {cPickle.loads}
              4    0.000    0.000    0.000    0.000 {cPickle.dumps}
            104    0.000    0.000    0.000    0.000 rdd.py:852(<genexpr>)
              8    0.000    0.000    0.000    0.000 serializers.py:461(read_int)
             12    0.000    0.000    0.000    0.000 rdd.py:303(func)
      ```
      Profiling is disabled by default and can be enabled with "spark.python.profile=true".
      
      Users can also dump the results to disk automatically for later analysis with "spark.python.profile.dump=path_to_dump".
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2351 from davies/profiler and squashes the following commits:
      
      7ef2aa0 [Davies Liu] bugfix, add tests for show_profiles and dump_profiles()
      2b0daf2 [Davies Liu] fix docs
      7a56c24 [Davies Liu] bugfix
      cba9463 [Davies Liu] move show_profiles and dump_profiles to SparkContext
      fb9565b [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      116d52a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      09d02c3 [Davies Liu] Merge branch 'master' into profiler
      c23865c [Davies Liu] Merge branch 'master' into profiler
      15d6f18 [Davies Liu] add docs for two configs
      dadee1a [Davies Liu] add docs string and clear profiles after show or dump
      4f8309d [Davies Liu] address comment, add tests
      0a5b6eb [Davies Liu] fix Python UDF
      4b20494 [Davies Liu] add profile for python
      1aa549ba
  4. Sep 24, 2014
    • [SPARK-3634] [PySpark] User's module should take precedence over system modules · c854b9fc
      Davies Liu authored
      Python modules added through addPyFile should take precedence over system modules.
      
      This patch puts the paths of user-added modules at the front of sys.path (just after '').
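
      The ordering change itself is tiny; a sketch of the idea (not the actual PySpark code):

      ```python
      import sys

      def prepend_user_path(path):
          # Put user-added modules right after '' (the current directory) so they
          # shadow identically named modules from system site-packages.
          sys.path.insert(1, path)
      ```

      With this ordering, a dependency shipped via `sc.addPyFile(...)` wins over an older copy already installed on the workers.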
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2492 from davies/path and squashes the following commits:
      
      4a2af78 [Davies Liu] fix tests
      f7ff4da [Davies Liu] ad license header
      6b0002f [Davies Liu] add tests
      c16c392 [Davies Liu] put addPyFile in front of sys.path
      c854b9fc
  5. Sep 18, 2014
    • [SPARK-3554] [PySpark] use broadcast automatically for large closure · e77fa81a
      Davies Liu authored
      Py4j cannot handle large strings efficiently, so we should use broadcast automatically for large closures. (Broadcast uses the local filesystem to pass the data through.)
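
      The decision can be sketched as a simple size check (the threshold and helper below are assumptions for illustration, not PySpark's actual internals):

      ```python
      import pickle

      LARGE_COMMAND_THRESHOLD = 1 << 20  # 1 MB, assumed for illustration

      def prepare_command(sc, command):
          """Ship the task command inline if small, via broadcast if large."""
          pickled = pickle.dumps(command)
          if len(pickled) > LARGE_COMMAND_THRESHOLD:
              # Large payloads go through a broadcast (backed by local files) and only
              # the small broadcast handle crosses the Py4J bridge.
              return ("broadcast", sc.broadcast(pickled))
          return ("inline", pickled)
      ```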
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2417 from davies/command and squashes the following commits:
      
      fbf4e97 [Davies Liu] bugfix
      aefd508 [Davies Liu] use broadcast automatically for large closure
      e77fa81a
  6. Sep 14, 2014
  7. Sep 13, 2014
    • [SPARK-3030] [PySpark] Reuse Python worker · 2aea0da8
      Davies Liu authored
      Reuse the Python worker to avoid the overhead of fork()ing a Python process for each task. It also tracks the broadcasts held by each worker, to avoid sending the same broadcast repeatedly.
      
      This can reduce the time for a dummy task from 22ms to 13ms (-40%). It can help reduce the latency of Spark Streaming.
      
      For a job with a broadcast (43M after compression):
      ```
          b = sc.broadcast(set(range(30000000)))
          print sc.parallelize(range(24000), 100).filter(lambda x: x in b.value).count()
      ```
      It finishes in 281s without worker reuse and in 65s with a reused worker (4 CPUs). Reusing the worker saves about 9 seconds of transferring and deserializing the broadcast for each task.
      
      It's enabled by default and can be disabled with `spark.python.worker.reuse = false`.
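
      For reference, a sketch of turning reuse off when per-task isolation matters more than latency (the setting must be applied before the SparkContext is created):

      ```python
      from pyspark import SparkConf, SparkContext

      conf = SparkConf().set("spark.python.worker.reuse", "false")  # default is "true"
      sc = SparkContext(conf=conf)
      ```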
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2259 from davies/reuse-worker and squashes the following commits:
      
      f11f617 [Davies Liu] Merge branch 'master' into reuse-worker
      3939f20 [Davies Liu] fix bug in serializer in mllib
      cf1c55e [Davies Liu] address comments
      3133a60 [Davies Liu] fix accumulator with reused worker
      760ab1f [Davies Liu] do not reuse worker if there are any exceptions
      7abb224 [Davies Liu] refactor: sychronized with itself
      ac3206e [Davies Liu] renaming
      8911f44 [Davies Liu] synchronized getWorkerBroadcasts()
      6325fc1 [Davies Liu] bugfix: bid >= 0
      e0131a2 [Davies Liu] fix name of config
      583716e [Davies Liu] only reuse completed and not interrupted worker
      ace2917 [Davies Liu] kill python worker after timeout
      6123d0f [Davies Liu] track broadcasts for each worker
      8d2f08c [Davies Liu] reuse python worker
      2aea0da8
  8. Aug 18, 2014
    • [SPARK-3114] [PySpark] Fix Python UDFs in Spark SQL. · 1f1819b2
      Josh Rosen authored
      This fixes SPARK-3114, an issue where we inadvertently broke Python UDFs in Spark SQL.
      
      This PR modifies the test runner script to always run the PySpark SQL tests, irrespective of whether Spark SQL itself has been modified. It also includes Davies' fix for the bug.
      
      Closes #2026.
      
      Author: Josh Rosen <joshrosen@apache.org>
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2027 from JoshRosen/pyspark-sql-fix and squashes the following commits:
      
      9af2708 [Davies Liu] bugfix: disable compression of command
      0d8d3a4 [Josh Rosen] Always run Python Spark SQL tests.
      1f1819b2
  9. Aug 16, 2014
    • [SPARK-1065] [PySpark] improve supporting for large broadcast · 2fc8aca0
      Davies Liu authored
      Passing large objects through py4j is very slow (and costs a lot of memory), so pass broadcast objects via files (similar to parallelize()).
      
      Add an option to keep the object in the driver (False by default) to save memory in the driver.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1912 from davies/broadcast and squashes the following commits:
      
      e06df4a [Davies Liu] load broadcast from disk in driver automatically
      db3f232 [Davies Liu] fix serialization of accumulator
      631a827 [Davies Liu] Merge branch 'master' into broadcast
      c7baa8c [Davies Liu] compress serrialized broadcast and command
      9a7161f [Davies Liu] fix doc tests
      e93cf4b [Davies Liu] address comments: add test
      6226189 [Davies Liu] improve large broadcast
      2fc8aca0
  10. Jul 29, 2014
    • [SPARK-2580] [PySpark] keep silent in worker if JVM close the socket · ccd5ab5f
      Davies Liu authored
      During rdd.take(n), the JVM will close the socket once it has received enough data; the Python worker should keep silent in this case.
      
      At the same time, the worker should not print the traceback to stderr if it sent the traceback to the JVM successfully.
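
      A self-contained sketch of that error-handling order (the marker value and framing are illustrative, not Spark's exact protocol):

      ```python
      import struct
      import sys
      import traceback

      PYTHON_EXCEPTION_THROWN = -2  # illustrative marker value

      def report_error(outfile, message):
          # Send the traceback to the JVM as a marker plus a length-prefixed UTF-8 string.
          data = message.encode("utf-8")
          outfile.write(struct.pack("!i", PYTHON_EXCEPTION_THROWN))
          outfile.write(struct.pack("!i", len(data)))
          outfile.write(data)
          outfile.flush()

      def run_task(process, outfile):
          try:
              process()                        # compute and write results to outfile
          except IOError:
              # The JVM closed the socket early (e.g. take(n) had enough rows): stay silent.
              return
          except Exception:
              err = traceback.format_exc()
              try:
                  report_error(outfile, err)   # prefer sending the traceback to the JVM
              except IOError:
                  sys.stderr.write(err)        # fall back to stderr only if that failed
      ```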
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1625 from davies/error and squashes the following commits:
      
      4fbcc6d [Davies Liu] disable log4j during testing when exception is expected.
      cc14202 [Davies Liu] keep silent in worker if JVM close the socket
      ccd5ab5f
  11. Jul 22, 2014
    • [SPARK-2470] PEP8 fixes to PySpark · 5d16d5bb
      Nicholas Chammas authored
      This pull request aims to resolve all outstanding PEP8 violations in PySpark.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      Author: nchammas <nicholas.chammas@gmail.com>
      
      Closes #1505 from nchammas/master and squashes the following commits:
      
      98171af [Nicholas Chammas] [SPARK-2470] revert PEP 8 fixes to cloudpickle
      cba7768 [Nicholas Chammas] [SPARK-2470] wrap expression list in parentheses
      e178dbe [Nicholas Chammas] [SPARK-2470] style - change position of line break
      9127d2b [Nicholas Chammas] [SPARK-2470] wrap expression lists in parentheses
      22132a4 [Nicholas Chammas] [SPARK-2470] wrap conditionals in parentheses
      24639bc [Nicholas Chammas] [SPARK-2470] fix whitespace for doctest
      7d557b7 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to tests.py
      8f8e4c0 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to storagelevel.py
      b3b96cf [Nicholas Chammas] [SPARK-2470] PEP8 fixes to statcounter.py
      d644477 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to worker.py
      aa3a7b6 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to sql.py
      1916859 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to shell.py
      95d1d95 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to serializers.py
      a0fec2e [Nicholas Chammas] [SPARK-2470] PEP8 fixes to mllib
      c85e1e5 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to join.py
      d14f2f1 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to __init__.py
      81fcb20 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to resultiterable.py
      1bde265 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to java_gateway.py
      7fc849c [Nicholas Chammas] [SPARK-2470] PEP8 fixes to daemon.py
      ca2d28b [Nicholas Chammas] [SPARK-2470] PEP8 fixes to context.py
      f4e0039 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to conf.py
      a6d5e4b [Nicholas Chammas] [SPARK-2470] PEP8 fixes to cloudpickle.py
      f0a7ebf [Nicholas Chammas] [SPARK-2470] PEP8 fixes to rddsampler.py
      4dd148f [nchammas] Merge pull request #5 from apache/master
      f7e4581 [Nicholas Chammas] unrelated pep8 fix
      a36eed0 [Nicholas Chammas] name ec2 instances and security groups consistently
      de7292a [nchammas] Merge pull request #4 from apache/master
      2e4fe00 [nchammas] Merge pull request #3 from apache/master
      89fde08 [nchammas] Merge pull request #2 from apache/master
      69f6e22 [Nicholas Chammas] PEP8 fixes
      2627247 [Nicholas Chammas] broke up lines before they hit 100 chars
      6544b7e [Nicholas Chammas] [SPARK-2065] give launched instances names
      69da6cf [nchammas] Merge pull request #1 from apache/master
      5d16d5bb
  12. May 10, 2014
  13. Feb 26, 2014
    • SPARK-1115: Catch depickling errors · 12738c1a
      Bouke van der Bijl authored
      This surrounds the complete worker code with a try/except block so we catch any error that arrives. An example would be depickling failing for some reason.
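
      A sketch of the structural point, that deserialization sits inside the same try/except as the rest of the task (simplified, not the actual worker code):

      ```python
      import pickle
      import sys
      import traceback

      def run_worker(command_bytes, records):
          try:
              func = pickle.loads(command_bytes)   # depickling can fail here too
              return [func(record) for record in records]
          except Exception:
              # Any failure, including a bad pickle, is reported instead of
              # crashing the worker with an unhandled exception.
              traceback.print_exc(file=sys.stderr)
              raise
      ```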
      
      @JoshRosen
      
      Author: Bouke van der Bijl <boukevanderbijl@gmail.com>
      
      Closes #644 from bouk/catch-depickling-errors and squashes the following commits:
      
      f0f67cc [Bouke van der Bijl] Lol indentation
      0e4d504 [Bouke van der Bijl] Surround the complete python worker with the try block
      12738c1a
  14. Feb 22, 2014
    • Fixed minor typo in worker.py · 3ff077d4
      jyotiska authored
      Fixed minor typo in worker.py
      
      Author: jyotiska <jyotiska123@gmail.com>
      
      Closes #630 from jyotiska/pyspark_code and squashes the following commits:
      
      ee44201 [jyotiska] typo fixed in worker.py
      3ff077d4
  15. Jan 28, 2014
    • Switch from MUTF8 to UTF8 in PySpark serializers. · 1381fc72
      Josh Rosen authored
      This fixes SPARK-1043, a bug introduced in 0.9.0
      where PySpark couldn't serialize strings > 64kB.
      
      This fix was written by @tyro89 and @bouk in #512.
      This commit squashes and rebases their pull request
      in order to fix some merge conflicts.
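
      The framing difference behind the limit, as a sketch: Java's writeUTF (modified UTF-8) prefixes the payload with an unsigned 16-bit length, capping strings at 64 kB, whereas a plain UTF-8 payload behind a 32-bit length prefix has no such limit.

      ```python
      import struct

      def write_utf8(s, stream):
          data = s.encode("utf-8")
          stream.write(struct.pack("!i", len(data)))  # 4-byte length prefix, then raw bytes
          stream.write(data)

      def read_utf8(stream):
          (length,) = struct.unpack("!i", stream.read(4))
          return stream.read(length).decode("utf-8")
      ```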
      1381fc72
  16. Jan 12, 2014
    • Log Python exceptions to stderr as well · 5741078c
      Matei Zaharia authored
      This helps in case the exception happened while serializing a record to
      be sent to Java, leaving the stream to Java in an inconsistent state
      where PythonRDD won't be able to read the error.
      5741078c
  17. Nov 10, 2013
  18. Nov 03, 2013
  19. Sep 01, 2013
  20. Aug 16, 2013
  21. Jul 16, 2013
  22. Jun 21, 2013
  23. Feb 01, 2013
  24. Jan 31, 2013
  25. Jan 23, 2013
  26. Jan 22, 2013
  27. Jan 21, 2013
  28. Jan 20, 2013
  29. Jan 08, 2013
  30. Jan 01, 2013
  31. Dec 24, 2012
  32. Oct 19, 2012
  33. Aug 27, 2012