  1. Nov 04, 2014
    • [SPARK-3886] [PySpark] simplify serializer, use AutoBatchedSerializer by default. · e4f42631
      Davies Liu authored
      This PR simplifies the serializers: a batched serializer (AutoBatchedSerializer by default) is always used, even when the batch size is 1 (see the sketch after this entry).
      
      Author: Davies Liu <davies@databricks.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Josh Rosen <joshrosen@databricks.com>
      
      Closes #2920 from davies/fix_autobatch and squashes the following commits:
      
      e544ef9 [Davies Liu] revert unrelated change
      6880b14 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      1d557fc [Davies Liu] fix tests
      8180907 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      76abdce [Davies Liu] clean up
      53fa60b [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      d7ac751 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      2cc2497 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      b4292ce [Davies Liu] fix bug in master
      d79744c [Davies Liu] recover hive tests
      be37ece [Davies Liu] refactor
      eb3938d [Davies Liu] refactor serializer in scala
      8d77ef2 [Davies Liu] simplify serializer, use AutoBatchedSerializer by default.
      e4f42631
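      A hedged illustration of the batched serialization this PR standardizes on, assuming the 1.x-era `pyspark.serializers` API (`AutoBatchedSerializer` wraps another serializer and tunes the batch size toward a target byte size):

      ```python
      from io import BytesIO

      from pyspark.serializers import AutoBatchedSerializer, PickleSerializer

      # Wraps PickleSerializer; batches grow or shrink toward a target
      # size in bytes instead of using a fixed element count.
      ser = AutoBatchedSerializer(PickleSerializer())

      buf = BytesIO()
      ser.dump_stream(iter(range(1000)), buf)   # writes pickled batches
      buf.seek(0)
      assert list(ser.load_stream(buf)) == list(range(1000))
      ```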
  2. Nov 03, 2014
    • [SPARK-4192][SQL] Internal API for Python UDT · 04450d11
      Xiangrui Meng authored
      Following #2919, this PR adds Python UDT (for internal use only) with tests under "pyspark.tests". Before `SQLContext.applySchema`, we check whether user-type instances need to be converted into SQL-recognizable data. In the current implementation, a Python UDT must be paired with a Scala UDT for serialization on the JVM side (see the sketch after this entry). A follow-up PR will add VectorUDT in MLlib for both Scala and Python.
      
      marmbrus jkbradley davies
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3068 from mengxr/SPARK-4192-sql and squashes the following commits:
      
      acff637 [Xiangrui Meng] merge master
      dba5ea7 [Xiangrui Meng] only use pyClass for Python UDT output sqlType as well
      2c9d7e4 [Xiangrui Meng] move import to global setup; update needsConversion
      7c4a6a9 [Xiangrui Meng] address comments
      75223db [Xiangrui Meng] minor update
      f740379 [Xiangrui Meng] remove UDT from default imports
      e98d9d0 [Xiangrui Meng] fix py style
      4e84fce [Xiangrui Meng] remove local hive tests and add more tests
      39f19e0 [Xiangrui Meng] add tests
      b7f666d [Xiangrui Meng] add Python UDT
      04450d11
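      A sketch of what the Python/Scala UDT pairing looks like, modeled loosely on the test suite of that era; the class names, the import location of `UserDefinedType`, and the Scala class are illustrative assumptions:

      ```python
      from pyspark.sql import ArrayType, DoubleType, UserDefinedType

      class ExamplePointUDT(UserDefinedType):
          """Python side of a UDT; it names a paired Scala UDT."""

          @classmethod
          def sqlType(cls):
              return ArrayType(DoubleType(), False)  # SQL-side storage type

          @classmethod
          def module(cls):
              return "__main__"

          @classmethod
          def scalaUDT(cls):
              # fully-qualified name of the paired Scala UDT (assumed)
              return "org.apache.spark.sql.test.ExamplePointUDT"

          def serialize(self, obj):
              return [obj.x, obj.y]

          def deserialize(self, datum):
              return ExamplePoint(datum[0], datum[1])

      class ExamplePoint(object):
          __UDT__ = ExamplePointUDT()

          def __init__(self, x, y):
              self.x, self.y = x, y
      ```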
    • [SPARK-3594] [PySpark] [SQL] take more rows to infer schema or sampling · 24544fbc
      Davies Liu authored
      This patch tries to infer the schema of an RDD whose first row contains empty values (None, [], {}). It scans the first 100 rows, merges their types into one schema, and also merges the fields of StructTypes together. If the schema still contains a NullType, it shows a warning telling the user to retry with sampling.

      If a sampling ratio is given, the schema is inferred from all the sampled rows.

      Also adds a samplingRatio parameter to jsonFile() and jsonRDD() (illustrated after this entry).
      
      Author: Davies Liu <davies.liu@gmail.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2716 from davies/infer and squashes the following commits:
      
      e678f6d [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
      34b5c63 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
      567dc60 [Davies Liu] update docs
      9767b27 [Davies Liu] Merge branch 'master' into infer
      e48d7fb [Davies Liu] fix tests
      29e94d5 [Davies Liu] let NullType inherit from PrimitiveType
      ee5d524 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
      540d1d5 [Davies Liu] merge fields for StructType
      f93fd84 [Davies Liu] add more tests
      3603e00 [Davies Liu] take more rows to infer schema, or infer the schema by sampling the RDD
      24544fbc
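      Illustrative usage under the behavior described above, assuming a 1.x-era SQLContext and an existing SparkContext `sc` (exact signatures may differ):

      ```python
      from pyspark.sql import Row, SQLContext

      sqlCtx = SQLContext(sc)

      # First row has an empty value; the first 100 rows are scanned
      # and their types merged into one schema.
      rdd = sc.parallelize([Row(a=None)] + [Row(a=i) for i in range(200)])
      srdd = sqlCtx.inferSchema(rdd)

      # With a sampling ratio, the schema is inferred from sampled rows.
      jsons = sc.parallelize(['{"a": 1}', '{"a": 2}'] * 100)
      srdd2 = sqlCtx.jsonRDD(jsons, samplingRatio=0.5)
      ```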
    • [SPARK-4148][PySpark] fix seed distribution and add some tests for rdd.sample · 3cca1962
      Xiangrui Meng authored
      The current way of distributing seeds makes the random sequences from partitions i and i+1 offset by one draw, as this session shows:
      
      ~~~
      In [14]: import random
      
      In [15]: r1 = random.Random(10)
      
      In [16]: r1.randint(0, 1)
      Out[16]: 1
      
      In [17]: r1.random()
      Out[17]: 0.4288890546751146
      
      In [18]: r1.random()
      Out[18]: 0.5780913011344704
      
      In [19]: r2 = random.Random(10)
      
      In [20]: r2.randint(0, 1)
      Out[20]: 1
      
      In [21]: r2.randint(0, 1)
      Out[21]: 0
      
      In [22]: r2.random()
      Out[22]: 0.5780913011344704
      ~~~
      
      Note: The new tests are not for this bug fix.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3010 from mengxr/SPARK-4148 and squashes the following commits:
      
      869ae4b [Xiangrui Meng] move tests tests.py
      c1bacd9 [Xiangrui Meng] fix seed distribution and add some tests for rdd.sample
      3cca1962
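      One way to decorrelate the per-partition generators, as a hedged sketch (the helper below is illustrative, not necessarily the exact patch): mix the seed with the split, then discard a few draws, since generators seeded with nearby values start out correlated.

      ```python
      import random

      def init_partition_rng(seed, split):
          rng = random.Random(seed ^ split)
          # Nearby seeds yield correlated early output; burn a few
          # draws so partitions i and i+1 diverge.
          for _ in range(10):
              rng.random()
          return rng

      r0 = init_partition_rng(10, 0)
      r1 = init_partition_rng(10, 1)
      print([round(r0.random(), 3) for _ in range(3)])
      print([round(r1.random(), 3) for _ in range(3)])
      ```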
  3. Oct 28, 2014
    • [SPARK-4133] [SQL] [PySpark] type conversion for Python UDF · 8c0bfd08
      Davies Liu authored
      Python UDFs can now be called on ArrayType/MapType/PrimitiveType values, and the returnType can also be ArrayType/MapType/PrimitiveType.

      A StructType argument is passed as a tuple (without attributes); if the returnType is StructType, the return value should also be a tuple. (Usage is sketched after this entry.)
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2973 from davies/udf_array and squashes the following commits:
      
      306956e [Davies Liu] Merge branch 'master' of github.com:apache/spark into udf_array
      2c00e43 [Davies Liu] fix merge
      11395fa [Davies Liu] Merge branch 'master' of github.com:apache/spark into udf_array
      9df50a2 [Davies Liu] address comments
      79afb4e [Davies Liu] type conversionfor python udf
      8c0bfd08
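      Illustrative usage with the 1.x-era API (`registerFunction`, `inferSchema`, and `registerTempTable` are that era's names; assumes an existing SparkContext `sc`):

      ```python
      from pyspark.sql import ArrayType, IntegerType, Row, SQLContext

      sqlCtx = SQLContext(sc)

      # A Python UDF whose returnType is an ArrayType.
      sqlCtx.registerFunction("twice", lambda x: [x, x],
                              ArrayType(IntegerType()))

      srdd = sqlCtx.inferSchema(sc.parallelize([Row(x=1), Row(x=2)]))
      srdd.registerTempTable("t")
      print(sqlCtx.sql("SELECT twice(x) FROM t").collect())
      ```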
  4. Oct 24, 2014
    • [SPARK-4051] [SQL] [PySpark] Convert Row into dictionary · d60a9d44
      Davies Liu authored
      Adds a method to Row that turns a row into a dict:
      
      ```
      >>> row = Row(a=1)
      >>> row.asDict()
      {'a': 1}
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2896 from davies/dict and squashes the following commits:
      
      8d97366 [Davies Liu] convert Row into dict
      d60a9d44
  5. Oct 23, 2014
    • [SPARK-3993] [PySpark] fix bug while reusing worker after take() · e595c8d0
      Davies Liu authored
      After take(), garbage may be left in the socket, and the next task assigned to this worker can then hang on the corrupted data.

      We should make sure the socket is clean before reusing it: write END_OF_STREAM at the end, and check for it after reading out all results from Python (sketched after this entry).
      
      Author: Davies Liu <davies.liu@gmail.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2838 from davies/fix_reuse and squashes the following commits:
      
      8872914 [Davies Liu] fix tests
      660875b [Davies Liu] fix bug while reuse worker after take()
      e595c8d0
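      A rough sketch of the sentinel idea using the framing helpers from `pyspark.serializers`; the END_OF_STREAM value is an assumption, and the real protocol (worker.py / PythonRDD.scala) carries more cases:

      ```python
      from io import BytesIO

      from pyspark.serializers import read_int, write_int

      END_OF_STREAM = -4  # assumed to match SpecialLengths.END_OF_STREAM

      # Worker side: after all results, append the sentinel.
      stream = BytesIO()
      write_int(END_OF_STREAM, stream)

      # Reader side: after draining the results, verify the sentinel;
      # anything else means the socket is dirty and the worker must
      # not be reused.
      stream.seek(0)
      assert read_int(stream) == END_OF_STREAM
      ```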
  6. Oct 22, 2014
    • Fix for sampling error in NumPy v1.9 [SPARK-3995][PYSPARK] · 97cf19f6
      freeman authored
      Changes the maximum value of the default seed during RDD sampling so that it is strictly less than 2 ** 32. This avoids a bug in the most recent version of NumPy, which cannot accept random seeds at or above this bound (see the snippet after this entry).
      
      Adds an extra test that uses the default seed (instead of setting it manually, as in the docstrings).
      
      mengxr
      
      Author: freeman <the.freeman.lab@gmail.com>
      
      Closes #2889 from freeman-lab/pyspark-sampling and squashes the following commits:
      
      dc385ef [freeman] Change maximum value for default seed
      97cf19f6
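      The constraint in isolation: `numpy.random.RandomState` rejects seeds at or above 2 ** 32, so default seeds have to be drawn from a bounded range.

      ```python
      import random

      import numpy

      seed = random.randint(0, 2 ** 32 - 1)   # strictly below 2 ** 32
      rng = numpy.random.RandomState(seed)    # accepted

      try:
          numpy.random.RandomState(2 ** 32)   # rejected on NumPy >= 1.9
      except ValueError as exc:
          print(exc)
      ```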
  7. Oct 17, 2014
    • [SPARK-3855][SQL] Preserve the result attribute of python UDFs through transformations · adcb7d33
      Michael Armbrust authored
      In the current implementation it was possible for the reference to change after analysis.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2717 from marmbrus/pythonUdfResults and squashes the following commits:
      
      da14879 [Michael Armbrust] Fix test
      6343bcb [Michael Armbrust] add test
      9533286 [Michael Armbrust] Correctly preserve the result attribute of python UDFs though transformations
      adcb7d33
  8. Oct 11, 2014
    • [SPARK-3867][PySpark] ./python/run-tests failed when run with Python 2.6 and unittest2 is not installed · 81015a2b
      cocoatomo authored
      
      ./python/run-tests searches for a Python 2.6 executable on PATH and uses it if available. When running under Python 2.6, the tests import the unittest2 module, which is not part of the Python 2.6 standard library, so they fail with an ImportError (the guard pattern is sketched after this entry).
      
      Author: cocoatomo <cocoatomo77@gmail.com>
      
      Closes #2759 from cocoatomo/issues/3867-unittest2-import-error and squashes the following commits:
      
      f068eb5 [cocoatomo] [SPARK-3867] ./python/run-tests failed when it run with Python 2.6 and unittest2 is not installed
      81015a2b
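      The usual guard for this, roughly as the tests ended up doing it: fall back to `unittest2` only on Python 2.6, and fail with a clear message when it is missing.

      ```python
      import sys

      if sys.version_info[:2] <= (2, 6):
          try:
              import unittest2 as unittest  # backport of unittest for 2.6
          except ImportError:
              sys.stderr.write(
                  "Please install unittest2 to test with Python 2.6\n")
              sys.exit(1)
      else:
          import unittest
      ```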
  9. Oct 06, 2014
    • [SPARK-3786] [PySpark] speedup tests · 4f01265f
      Davies Liu authored
      This patch speeds up the PySpark tests: it reuses the SparkContext in tests.py and mllib/tests.py to cut the overhead of creating a SparkContext (see the sketch after this entry), and removes some test cases that did not make sense. It also improves the performance of some cases, such as MergerTests and SortTests.
      
      before this patch:
      
      real	21m27.320s
      user	4m42.967s
      sys	0m17.343s
      
      after this patch:
      
      real	9m47.541s
      user	2m12.947s
      sys	0m14.543s
      
      It almost cuts the time in half.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2646 from davies/tests and squashes the following commits:
      
      c54de60 [Davies Liu] revert change about memory limit
      6a2a4b0 [Davies Liu] refactor of tests, speedup 100%
      4f01265f
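      The reuse pattern, sketched (class name assumed): create one SparkContext per test class instead of one per test method.

      ```python
      import unittest

      from pyspark import SparkContext

      class ReusedPySparkTestCase(unittest.TestCase):

          @classmethod
          def setUpClass(cls):
              # One context (and JVM handshake) for the whole class.
              cls.sc = SparkContext("local[4]", cls.__name__)

          @classmethod
          def tearDownClass(cls):
              cls.sc.stop()

          def test_count(self):
              self.assertEqual(self.sc.parallelize(range(10)).count(), 10)
      ```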
  10. Oct 01, 2014
    • [SPARK-3749] [PySpark] fix bugs in broadcast large closure of RDD · abf588f4
      Davies Liu authored
      1. the broadcast is triggered unexpectedly
      2. a file descriptor is leaked in the JVM (also leaked in parallelize())
      3. the broadcast is not unpersisted in the JVM after the RDD is no longer used

      cc JoshRosen, sorry for these stupid bugs.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2603 from davies/fix_broadcast and squashes the following commits:
      
      080a743 [Davies Liu] fix bugs in broadcast large closure of RDD
      abf588f4
  11. Sep 30, 2014
    • [SPARK-3478] [PySpark] Profile the Python tasks · c5414b68
      Davies Liu authored
      This patch adds profiling support for PySpark. It shows the profiling results before the driver exits; here is one example:
      
      ```
      ============================================================
      Profile of RDD<id=3>
      ============================================================
               5146507 function calls (5146487 primitive calls) in 71.094 seconds
      
         Ordered by: internal time, cumulative time
      
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        5144576   68.331    0.000   68.331    0.000 statcounter.py:44(merge)
             20    2.735    0.137   71.071    3.554 statcounter.py:33(__init__)
             20    0.017    0.001    0.017    0.001 {cPickle.dumps}
           1024    0.003    0.000    0.003    0.000 t.py:16(<lambda>)
             20    0.001    0.000    0.001    0.000 {reduce}
             21    0.001    0.000    0.001    0.000 {cPickle.loads}
             20    0.001    0.000    0.001    0.000 copy_reg.py:95(_slotnames)
             41    0.001    0.000    0.001    0.000 serializers.py:461(read_int)
             40    0.001    0.000    0.002    0.000 serializers.py:179(_batched)
             62    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
             20    0.000    0.000   71.072    3.554 rdd.py:863(<lambda>)
             20    0.000    0.000    0.001    0.000 serializers.py:198(load_stream)
          40/20    0.000    0.000   71.072    3.554 rdd.py:2093(pipeline_func)
             41    0.000    0.000    0.002    0.000 serializers.py:130(load_stream)
             40    0.000    0.000   71.072    1.777 rdd.py:304(func)
             20    0.000    0.000   71.094    3.555 worker.py:82(process)
      ```
      
      Users can also show the profile results manually with `sc.show_profiles()`, or dump them to disk with `sc.dump_profiles(path)`:
      
      ```python
      >>> sc._conf.set("spark.python.profile", "true")
      >>> rdd = sc.parallelize(range(100)).map(str)
      >>> rdd.count()
      100
      >>> sc.show_profiles()
      ============================================================
      Profile of RDD<id=1>
      ============================================================
               284 function calls (276 primitive calls) in 0.001 seconds
      
         Ordered by: internal time, cumulative time
      
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
              4    0.000    0.000    0.000    0.000 serializers.py:198(load_stream)
              4    0.000    0.000    0.000    0.000 {reduce}
           12/4    0.000    0.000    0.001    0.000 rdd.py:2092(pipeline_func)
              4    0.000    0.000    0.000    0.000 {cPickle.loads}
              4    0.000    0.000    0.000    0.000 {cPickle.dumps}
            104    0.000    0.000    0.000    0.000 rdd.py:852(<genexpr>)
              8    0.000    0.000    0.000    0.000 serializers.py:461(read_int)
             12    0.000    0.000    0.000    0.000 rdd.py:303(func)
      ```
      Profiling is disabled by default; it can be enabled with "spark.python.profile=true".

      Users can also have the results dumped to disk automatically for later analysis, with "spark.python.profile.dump=path_to_dump".
      
      This is a bug fix of #2351. cc JoshRosen
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2556 from davies/profiler and squashes the following commits:
      
      e68df5a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      858e74c [Davies Liu] compatitable with python 2.6
      7ef2aa0 [Davies Liu] bugfix, add tests for show_profiles and dump_profiles()
      2b0daf2 [Davies Liu] fix docs
      7a56c24 [Davies Liu] bugfix
      cba9463 [Davies Liu] move show_profiles and dump_profiles to SparkContext
      fb9565b [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      116d52a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      09d02c3 [Davies Liu] Merge branch 'master' into profiler
      c23865c [Davies Liu] Merge branch 'master' into profiler
      15d6f18 [Davies Liu] add docs for two configs
      dadee1a [Davies Liu] add docs string and clear profiles after show or dump
      4f8309d [Davies Liu] address comment, add tests
      0a5b6eb [Davies Liu] fix Python UDF
      4b20494 [Davies Liu] add profile for python
      c5414b68
  12. Sep 27, 2014
    • [SPARK-3681] [SQL] [PySpark] fix serialization of List and Map in SchemaRDD · 0d8cdf0e
      Davies Liu authored
      Currently, the schema of objects in an ArrayType or MapType is attached lazily; this gives better performance but introduces issues during serialization or when accessing nested objects.

      This patch applies the schema to objects in an ArrayType or MapType immediately when they are accessed. This is a little slower, but much more robust.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2526 from davies/nested and squashes the following commits:
      
      2399ae5 [Davies Liu] fix serialization of List and Map in SchemaRDD
      0d8cdf0e
  13. Sep 26, 2014
    • Revert "[SPARK-3478] [PySpark] Profile the Python tasks" · f872e4fb
      Josh Rosen authored
      This reverts commit 1aa549ba.
      f872e4fb
    • [SPARK-3478] [PySpark] Profile the Python tasks · 1aa549ba
      Davies Liu authored
      This patch adds profiling support for PySpark. It shows the profiling results before the driver exits; here is one example:
      
      ```
      ============================================================
      Profile of RDD<id=3>
      ============================================================
               5146507 function calls (5146487 primitive calls) in 71.094 seconds
      
         Ordered by: internal time, cumulative time
      
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        5144576   68.331    0.000   68.331    0.000 statcounter.py:44(merge)
             20    2.735    0.137   71.071    3.554 statcounter.py:33(__init__)
             20    0.017    0.001    0.017    0.001 {cPickle.dumps}
           1024    0.003    0.000    0.003    0.000 t.py:16(<lambda>)
             20    0.001    0.000    0.001    0.000 {reduce}
             21    0.001    0.000    0.001    0.000 {cPickle.loads}
             20    0.001    0.000    0.001    0.000 copy_reg.py:95(_slotnames)
             41    0.001    0.000    0.001    0.000 serializers.py:461(read_int)
             40    0.001    0.000    0.002    0.000 serializers.py:179(_batched)
             62    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
             20    0.000    0.000   71.072    3.554 rdd.py:863(<lambda>)
             20    0.000    0.000    0.001    0.000 serializers.py:198(load_stream)
          40/20    0.000    0.000   71.072    3.554 rdd.py:2093(pipeline_func)
             41    0.000    0.000    0.002    0.000 serializers.py:130(load_stream)
             40    0.000    0.000   71.072    1.777 rdd.py:304(func)
             20    0.000    0.000   71.094    3.555 worker.py:82(process)
      ```
      
      Users can also show the profile results manually with `sc.show_profiles()`, or dump them to disk with `sc.dump_profiles(path)`:
      
      ```python
      >>> sc._conf.set("spark.python.profile", "true")
      >>> rdd = sc.parallelize(range(100)).map(str)
      >>> rdd.count()
      100
      >>> sc.show_profiles()
      ============================================================
      Profile of RDD<id=1>
      ============================================================
               284 function calls (276 primitive calls) in 0.001 seconds
      
         Ordered by: internal time, cumulative time
      
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
              4    0.000    0.000    0.000    0.000 serializers.py:198(load_stream)
              4    0.000    0.000    0.000    0.000 {reduce}
           12/4    0.000    0.000    0.001    0.000 rdd.py:2092(pipeline_func)
              4    0.000    0.000    0.000    0.000 {cPickle.loads}
              4    0.000    0.000    0.000    0.000 {cPickle.dumps}
            104    0.000    0.000    0.000    0.000 rdd.py:852(<genexpr>)
              8    0.000    0.000    0.000    0.000 serializers.py:461(read_int)
             12    0.000    0.000    0.000    0.000 rdd.py:303(func)
      ```
      Profiling is disabled by default; it can be enabled with "spark.python.profile=true".

      Users can also have the results dumped to disk automatically for later analysis, with "spark.python.profile.dump=path_to_dump".
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2351 from davies/profiler and squashes the following commits:
      
      7ef2aa0 [Davies Liu] bugfix, add tests for show_profiles and dump_profiles()
      2b0daf2 [Davies Liu] fix docs
      7a56c24 [Davies Liu] bugfix
      cba9463 [Davies Liu] move show_profiles and dump_profiles to SparkContext
      fb9565b [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      116d52a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      09d02c3 [Davies Liu] Merge branch 'master' into profiler
      c23865c [Davies Liu] Merge branch 'master' into profiler
      15d6f18 [Davies Liu] add docs for two configs
      dadee1a [Davies Liu] add docs string and clear profiles after show or dump
      4f8309d [Davies Liu] address comment, add tests
      0a5b6eb [Davies Liu] fix Python UDF
      4b20494 [Davies Liu] add profile for python
      1aa549ba
  14. Sep 24, 2014
    • [SPARK-3679] [PySpark] pickle the exact globals of functions · bb96012b
      Davies Liu authored
      function.func_code.co_names holds all the names used in a function, including the names of attributes. Some unnecessary globals get pickled when a global shares its name with an attribute (in co_names).

      This regression was introduced by #2144; this reverts part of the changes in that PR.
      
      cc JoshRosen
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2522 from davies/globals and squashes the following commits:
      
      dfbccf5 [Davies Liu] fix bug while pickle globals of function
      bb96012b
    • [SPARK-3634] [PySpark] User's module should take precedence over system modules · c854b9fc
      Davies Liu authored
      Python modules added through addPyFile should take precedence over system modules.

      This patch puts the paths of user-added modules at the front of sys.path (just after ''), as sketched after this entry.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2492 from davies/path and squashes the following commits:
      
      4a2af78 [Davies Liu] fix tests
      f7ff4da [Davies Liu] ad license header
      6b0002f [Davies Liu] add tests
      c16c392 [Davies Liu] put addPyFile in front of sys.path
      c854b9fc
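      The core of the change, in isolation: '' (the current directory) stays first, and user-added paths go right after it, ahead of system paths.

      ```python
      import sys

      def add_user_path(path):
          # Inserting at index 1 keeps '' first but lets modules added
          # via addPyFile shadow system modules of the same name.
          sys.path.insert(1, path)
      ```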
  15. Sep 19, 2014
    • [SPARK-3592] [SQL] [PySpark] support applySchema to RDD of Row · a95ad99e
      Davies Liu authored
      Fixes the issue when applying applySchema() to an RDD of Row.

      Also adds a type mapping for BinaryType.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2448 from davies/row and squashes the following commits:
      
      dd220cf [Davies Liu] fix test
      3f3f188 [Davies Liu] add more test
      f559746 [Davies Liu] add tests, fix serialization
      9688fd2 [Davies Liu] support applySchema to RDD of Row
      a95ad99e
  16. Sep 18, 2014
    • [SPARK-3554] [PySpark] use broadcast automatically for large closure · e77fa81a
      Davies Liu authored
      Py4J cannot handle large strings efficiently, so we should use broadcast for large closures automatically. (Broadcast passes the data through the local filesystem.) A sketch of the decision follows this entry.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2417 from davies/command and squashes the following commits:
      
      fbf4e97 [Davies Liu] bugfix
      aefd508 [Davies Liu] use broadcast automatically for large closure
      e77fa81a
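      A hedged sketch of the decision (the threshold, names, and serializer calls are assumptions, not the exact patch): serialize the command, and if it is large, ship it as a broadcast instead of through Py4J.

      ```python
      def prepare_command(sc, serializer, command, threshold=1 << 20):
          pickled = serializer.dumps(command)
          if len(pickled) > threshold:
              # Large closure: pass it through the local filesystem via
              # a broadcast rather than as a giant Py4J string.
              broadcast = sc.broadcast(pickled)
              pickled = serializer.dumps(broadcast)
          return pickled
      ```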
  17. Sep 16, 2014
    • [SPARK-3519] add distinct(n) to PySpark · 9d5fa763
      Matthew Farrellee authored
      Adds the missing rdd.distinct(numPartitions) and associated tests (see the sketch after this entry).
      
      Author: Matthew Farrellee <matt@redhat.com>
      
      Closes #2383 from mattf/SPARK-3519 and squashes the following commits:
      
      30b837a [Matthew Farrellee] Combine test cases to save on JVM startups
      6bc4a2c [Matthew Farrellee] [SPARK-3519] add distinct(n) to SchemaRDD in PySpark
      7a17f2b [Matthew Farrellee] [SPARK-3519] add distinct(n) to PySpark
      9d5fa763
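      distinct is a shuffle in disguise, so threading the partition count through is natural; a sketch of the classic formulation:

      ```python
      def distinct(rdd, numPartitions=None):
          # Deduplicate by keying each element, merging duplicates in
          # the shuffle, then dropping the dummy values.
          return (rdd.map(lambda x: (x, None))
                     .reduceByKey(lambda x, _: x, numPartitions)
                     .map(lambda kv: kv[0]))

      # e.g. distinct(sc.parallelize([1, 1, 2, 3]), 2).collect()
      # -> [1, 2, 3] in some order
      ```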
  18. Sep 15, 2014
    • [SPARK-2951] [PySpark] support unpickle array.array for Python 2.6 · da33acb8
      Davies Liu authored
      Pyrolite cannot unpickle an array.array pickled by Python 2.6; this patch fixes that by extending Pyrolite.

      There is also a bug in Pyrolite when unpickling arrays of float/double; this patch works around it by reversing the endianness for float/double. The workaround should be removed once Pyrolite publishes a release that fixes the issue.

      I have sent a PR to Pyrolite to fix it: https://github.com/irmen/Pyrolite/pull/11
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2365 from davies/pickle and squashes the following commits:
      
      f44f771 [Davies Liu] enable tests about array
      3908f5c [Davies Liu] Merge branch 'master' into pickle
      c77c87b [Davies Liu] cleanup debugging code
      60e4e2f [Davies Liu] support unpickle array.array for Python 2.6
      da33acb8
  19. Sep 13, 2014
    • [SPARK-3030] [PySpark] Reuse Python worker · 2aea0da8
      Davies Liu authored
      Reuses Python workers to avoid the overhead of forking a Python process for each task. It also tracks the broadcasts held by each worker, avoiding repeated sends of the same broadcast.

      This reduces the time for a dummy task from 22ms to 13ms (-40%), which can help reduce latency for Spark Streaming.
      
      For a job with broadcast (43M after compress):
      ```
          b = sc.broadcast(set(range(30000000)))
          print sc.parallelize(range(24000), 100).filter(lambda x: x in b.value).count()
      ```
      It finishes in 281s without worker reuse and in 65s with it (4 CPUs). Reusing the worker saves about 9 seconds of broadcast transfer and deserialization per task.
      
      It is enabled by default and can be disabled with `spark.python.worker.reuse = false`.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2259 from davies/reuse-worker and squashes the following commits:
      
      f11f617 [Davies Liu] Merge branch 'master' into reuse-worker
      3939f20 [Davies Liu] fix bug in serializer in mllib
      cf1c55e [Davies Liu] address comments
      3133a60 [Davies Liu] fix accumulator with reused worker
      760ab1f [Davies Liu] do not reuse worker if there are any exceptions
      7abb224 [Davies Liu] refactor: sychronized with itself
      ac3206e [Davies Liu] renaming
      8911f44 [Davies Liu] synchronized getWorkerBroadcasts()
      6325fc1 [Davies Liu] bugfix: bid >= 0
      e0131a2 [Davies Liu] fix name of config
      583716e [Davies Liu] only reuse completed and not interrupted worker
      ace2917 [Davies Liu] kill python worker after timeout
      6123d0f [Davies Liu] track broadcasts for each worker
      8d2f08c [Davies Liu] reuse python worker
      2aea0da8
  20. Sep 12, 2014
    • [SPARK-3500] [SQL] use JavaSchemaRDD as SchemaRDD._jschema_rdd · 885d1621
      Davies Liu authored
      Currently, SchemaRDD._jschema_rdd is a SchemaRDD, so the Scala API (coalesce(), repartition()) cannot easily be called from Python: there is no way to pass the implicit parameter `ord`. Since _jrdd is a JavaRDD, _jschema_rdd should likewise be a JavaSchemaRDD.

      This patch changes _jschema_rdd to a JavaSchemaRDD and adds an assert for it. If a method is missing from JavaSchemaRDD, it is called via _jschema_rdd.baseSchemaRDD().xxx().

      BTW, do we need JavaSQLContext?
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2369 from davies/fix_schemardd and squashes the following commits:
      
      abee159 [Davies Liu] use JavaSchemaRDD as SchemaRDD._jschema_rdd
      885d1621
    • [SPARK-3094] [PySpark] compatible with PyPy · 71af030b
      Davies Liu authored
      After this patch, we can run PySpark under PyPy (tested with PyPy 2.3.1 on Mac OS X 10.9), for example:
      
      ```
      PYSPARK_PYTHON=pypy ./bin/spark-submit wordcount.py
      ```
      
      The performance speedup depends on the workload (from 20% to 3000%). Here are some benchmarks:
      
       Job | CPython 2.7 | PyPy 2.3.1  | Speed up
       ------- | ------------ | ------------- | -------
       Word Count | 41s   | 15s  | 2.7x
       Sort | 46s |  44s | 1.05x
       Stats | 174s | 3.6s | 48x
      
      Here is the code used for benchmark:
      
      ```python
      rdd = sc.textFile("text")
      def wordcount():
          rdd.flatMap(lambda x:x.split('/'))\
              .map(lambda x:(x,1)).reduceByKey(lambda x,y:x+y).collectAsMap()
      def sort():
          rdd.sortBy(lambda x:x, 1).count()
      def stats():
          sc.parallelize(range(1024), 20).flatMap(lambda x: xrange(5024)).stats()
      ```
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2144 from davies/pypy and squashes the following commits:
      
      9aed6c5 [Davies Liu] use protocol 2 in CloudPickle
      4bc1f04 [Davies Liu] refactor
      b20ab3a [Davies Liu] pickle sys.stdout and stderr in portable way
      3ca2351 [Davies Liu] Merge branch 'master' into pypy
      fae8b19 [Davies Liu] improve attrgetter, add tests
      591f830 [Davies Liu] try to run tests with PyPy in run-tests
      c8d62ba [Davies Liu] cleanup
      f651fd0 [Davies Liu] fix tests using array with PyPy
      1b98fb3 [Davies Liu] serialize itemgetter/attrgetter in portable ways
      3c1dbfe [Davies Liu] Merge branch 'master' into pypy
      42fb5fa [Davies Liu] Merge branch 'master' into pypy
      cb2d724 [Davies Liu] fix tests
      9986692 [Davies Liu] Merge branch 'master' into pypy
      25b4ca7 [Davies Liu] support PyPy
      71af030b
  21. Sep 09, 2014
    • [SPARK-3458] enable python "with" statements for SparkContext · 25b5b867
      Matthew Farrellee authored
      Allows best-practice code,
      
      ```
      try:
        sc = SparkContext()
        app(sc)
      finally:
        sc.stop()
      ```
      
      to be written using a "with" statement (the enabling hooks are sketched after this entry),
      
      ```
      with SparkContext() as sc:
        app(sc)
      ```
      
      Author: Matthew Farrellee <matt@redhat.com>
      
      Closes #2335 from mattf/SPARK-3458 and squashes the following commits:
      
      5b4e37c [Matthew Farrellee] [SPARK-3458] enable python "with" statements for SparkContext
      25b5b867
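      The enabling change is tiny; a self-contained sketch of the hooks (SparkContext itself just needs `__enter__`/`__exit__` delegating to stop()):

      ```python
      class ManagedContext(object):
          """Stand-in for SparkContext, showing only the hooks."""

          def stop(self):
              print("stopped")

          def __enter__(self):
              return self

          def __exit__(self, exc_type, exc_value, traceback):
              # Runs even if the body raised; returning None (falsy)
              # lets any exception propagate after stop().
              self.stop()

      with ManagedContext() as ctx:
          pass  # use ctx here
      ```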
  22. Sep 08, 2014
    • SPARK-2978. Transformation with MR shuffle semantics · 16a73c24
      Sandy Ryza authored
      I didn't add this to the transformations list in the docs because it's kind of obscure, but I would be happy to do so if others think it would be helpful. (Illustrative usage follows this entry.)
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #2274 from sryza/sandy-spark-2978 and squashes the following commits:
      
      4a5332a [Sandy Ryza] Fix Java test
      c04b447 [Sandy Ryza] Fix Python doc and add back deleted code
      433ad5b [Sandy Ryza] Add Java test
      4c25a54 [Sandy Ryza] Add s at the end and a couple other fixes
      9b0ba99 [Sandy Ryza] Fix compilation
      36e0571 [Sandy Ryza] Fix import ordering
      48c12c2 [Sandy Ryza] Add Java version and additional doc
      e5381cd [Sandy Ryza] Fix python style warnings
      f147634 [Sandy Ryza] SPARK-2978. Transformation with MR shuffle semantics
      16a73c24
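      The transformation in question is repartitionAndSortWithinPartitions(); illustrative PySpark usage along the lines of its docstring (assumes an existing SparkContext `sc`):

      ```python
      rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
      # Partition by key parity, sorting each partition by key.
      parted = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, True)
      print(parted.glom().collect())
      # [[(0, 5), (0, 8), (2, 6)], [(1, 3), (3, 8), (3, 8)]]
      ```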
  23. Sep 07, 2014
    • [SPARK-3415] [PySpark] removes SerializingAdapter code · ecfa76cd
      Ward Viaene authored
      This removes the SerializingAdapter code that was copied from PiCloud.
      
      Author: Ward Viaene <ward.viaene@bigdatapartnership.com>
      
      Closes #2287 from wardviaene/feature/pythonsys and squashes the following commits:
      
      5f0d426 [Ward Viaene] SPARK-3415: modified test class to do dump and load
      5f5d559 [Ward Viaene] SPARK-3415: modified test class name and call cloudpickle.dumps instead using StringIO
      afc4a9a [Ward Viaene] SPARK-3415: added newlines to pass lint
      aaf10b7 [Ward Viaene] SPARK-3415: removed references to SerializingAdapter and rewrote test
      65ffeff [Ward Viaene] removed duplicate test
      a958866 [Ward Viaene] SPARK-3415: test script
      e263bf5 [Ward Viaene] SPARK-3415: removes legacy SerializingAdapter code
      ecfa76cd
  24. Sep 06, 2014
    • [SPARK-2334] fix AttributeError when calling PipelinedRDD.id() · 110fb8b2
      Davies Liu authored
      The underlying JavaRDD for a PipelinedRDD is created lazily; it is delayed until _jrdd is called.

      The id of the JavaRDD is cached as `_id`, which saves a Py4J RPC call on later invocations (sketched after this entry).
      
      closes #1276
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2296 from davies/id and squashes the following commits:
      
      e197958 [Davies Liu] fix style
      9721716 [Davies Liu] fix id of PipelineRDD
      110fb8b2
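      A compact sketch of the caching pattern described above (names follow the message; the real class lives in pyspark/rdd.py):

      ```python
      class LazyIdRDD(object):
          def __init__(self, make_jrdd):
              self._make_jrdd = make_jrdd   # builds the Java RDD on demand
              self._jrdd_val = None
              self._id = None

          @property
          def _jrdd(self):
              if self._jrdd_val is None:
                  self._jrdd_val = self._make_jrdd()   # lazy creation
              return self._jrdd_val

          def id(self):
              if self._id is None:
                  self._id = self._jrdd.id()   # one RPC, then cached
              return self._id
      ```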
  25. Sep 02, 2014
    • [SPARK-2871] [PySpark] add countApproxDistinct() API · e2c901b4
      Davies Liu authored
      RDD.countApproxDistinct(relativeSD=0.05):
      
              :: Experimental ::
              Return approximate number of distinct elements in the RDD.
      
              The algorithm used is based on streamlib's implementation of
              "HyperLogLog in Practice: Algorithmic Engineering of a State
              of The Art Cardinality Estimation Algorithm", available
              <a href="http://dx.doi.org/10.1145/2452376.2452456">here</a>.
      
        This supports all types of objects that Pyrolite supports,
        which is nearly all built-in types.
      
              param relativeSD Relative accuracy. Smaller values create
                                 counters that require more space.
                                 It must be greater than 0.000017.
      
              >>> n = sc.parallelize(range(1000)).map(str).countApproxDistinct()
              >>> 950 < n < 1050
              True
              >>> n = sc.parallelize([i % 20 for i in range(1000)]).countApproxDistinct()
              >>> 18 < n < 22
              True
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2142 from davies/countApproxDistinct and squashes the following commits:
      
      e20da47 [Davies Liu] remove the correction in Python
      c38c4e4 [Davies Liu] fix doc tests
      2ab157c [Davies Liu] fix doc tests
      9d2565f [Davies Liu] add commments and link for hash collision correction
      d306492 [Davies Liu] change range of hash of tuple to [0, maxint]
      ded624f [Davies Liu] calculate hash in Python
      4cba98f [Davies Liu] add more tests
      a85a8c6 [Davies Liu] Merge branch 'master' into countApproxDistinct
      e97e342 [Davies Liu] add countApproxDistinct()
      e2c901b4
  26. Aug 26, 2014
    • [SPARK-3073] [PySpark] use external sort in sortBy() and sortByKey() · f1e71d4c
      Davies Liu authored
      Uses an external sort to support sorting large datasets in the reduce stage (the technique is sketched after this entry).
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1978 from davies/sort and squashes the following commits:
      
      bbcd9ba [Davies Liu] check spilled bytes in tests
      b125d2f [Davies Liu] add test for external sort in rdd
      eae0176 [Davies Liu] choose different disks from different processes and instances
      1f075ed [Davies Liu] Merge branch 'master' into sort
      eb53ca6 [Davies Liu] Merge branch 'master' into sort
      644abaf [Davies Liu] add license in LICENSE
      19f7873 [Davies Liu] improve tests
      55602ee [Davies Liu] use external sort in sortBy() and sortByKey()
      f1e71d4c
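      The underlying technique, sketched generically (this is not the pyspark.shuffle implementation): sort fixed-size chunks, spill each sorted run to disk, then lazily merge the runs.

      ```python
      import heapq
      import os
      import pickle
      import tempfile

      def _spill(sorted_chunk):
          """Write one sorted run to a temp file; return a reader."""
          fd, path = tempfile.mkstemp()
          with os.fdopen(fd, "wb") as f:
              for item in sorted_chunk:
                  pickle.dump(item, f)

          def read():
              with open(path, "rb") as f:
                  while True:
                      try:
                          yield pickle.load(f)
                      except EOFError:
                          break
              os.unlink(path)
          return read()

      def external_sort(iterator, chunk_size=10000):
          """Yield items in sorted order, one chunk in memory at a time."""
          runs, chunk = [], []
          for item in iterator:
              chunk.append(item)
              if len(chunk) >= chunk_size:
                  runs.append(_spill(sorted(chunk)))
                  chunk = []
          runs.append(iter(sorted(chunk)))
          return heapq.merge(*runs)
      ```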
    • [SPARK-2871] [PySpark] add histogram() API · 3cedc4f4
      Davies Liu authored
      RDD.histogram(buckets)
      
              Compute a histogram using the provided buckets. The buckets
              are all open to the right except for the last which is closed.
              e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50],
              which means 1<=x<10, 10<=x<20, 20<=x<=50. And on the input of 1
              and 50 we would have a histogram of 1,0,1.
      
        If your histogram is evenly spaced (e.g. [0, 10, 20, 30]),
        this can be switched from an O(log n) insertion to O(1) per
        element (where n = # buckets).

        Buckets must be sorted, must not contain duplicates, and must
        have at least two elements.

        If `buckets` is a number, it generates that many buckets, evenly
        spaced between the minimum and maximum of the RDD. For example,
        if the min value is 0 and the max is 100, given 2 buckets, the
        resulting buckets will be [0,50) [50,100]. `buckets` must be at
        least 1. An exception is raised if the RDD contains infinity or
        NaN. If the elements of the RDD do not vary (max == min), a
        single bucket is always returned.

        It returns a tuple of buckets and histogram.
      
              >>> rdd = sc.parallelize(range(51))
              >>> rdd.histogram(2)
              ([0, 25, 50], [25, 26])
              >>> rdd.histogram([0, 5, 25, 50])
              ([0, 5, 25, 50], [5, 20, 26])
              >>> rdd.histogram([0, 15, 30, 45, 60], True)
              ([0, 15, 30, 45, 60], [15, 15, 15, 6])
              >>> rdd = sc.parallelize(["ab", "ac", "b", "bd", "ef"])
              >>> rdd.histogram(("a", "b", "c"))
              (('a', 'b', 'c'), [2, 2])
      
      Closes #122; it's a duplicate.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2091 from davies/histgram and squashes the following commits:
      
      a322f8a [Davies Liu] fix deprecation of e.message
      84e85fa [Davies Liu] remove evenBuckets, add more tests (including str)
      d9a0722 [Davies Liu] address comments
      0e18a2d [Davies Liu] add histgram() API
      3cedc4f4
  27. Aug 19, 2014
    • [SPARK-2790] [PySpark] fix zip with serializers which have different batch sizes. · d7e80c25
      Davies Liu authored
      If two RDDs' serializers have different batch sizes, the RDD with the smaller batch size is re-serialized before calling RDD.zip() in Spark.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1894 from davies/zip and squashes the following commits:
      
      c4652ea [Davies Liu] add more test cases
      6d05fc8 [Davies Liu] Merge branch 'master' into zip
      813b1e4 [Davies Liu] add more tests for failed cases
      a4aafda [Davies Liu] fix zip with serializers which have different batch sizes.
      d7e80c25
  28. Aug 18, 2014
    • [SPARK-3103] [SQL] [PySpark] fix saveAsTextFile() with utf-8 · d1d0ee41
      Davies Liu authored
      Bug fix: an exception is raised when trying to encode non-ASCII strings into unicode; only unicode should be encoded as "utf-8" (sketched after this entry).
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2018 from davies/fix_utf8 and squashes the following commits:
      
      4db7967 [Davies Liu] fix saveAsTextFile() with utf-8
      d1d0ee41
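      The Python 2 subtlety behind the bug: calling .encode() on a non-ASCII byte string first decodes it implicitly as ASCII, which raises. The guard, sketched (Python 2 semantics; `unicode` does not exist in Python 3):

      ```python
      def to_bytes(x):
          if isinstance(x, unicode):
              return x.encode("utf-8")   # unicode -> UTF-8 bytes
          return x                       # byte strings pass through untouched
      ```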
  29. Aug 16, 2014
    • [SPARK-1065] [PySpark] improve support for large broadcast · 2fc8aca0
      Davies Liu authored
      Passing large objects through Py4J is very slow (and costs a lot of memory), so broadcast objects are passed via files (similar to parallelize()).

      Adds an option to keep the object in the driver (False by default) to save memory there.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1912 from davies/broadcast and squashes the following commits:
      
      e06df4a [Davies Liu] load broadcast from disk in driver automatically
      db3f232 [Davies Liu] fix serialization of accumulator
      631a827 [Davies Liu] Merge branch 'master' into broadcast
      c7baa8c [Davies Liu] compress serrialized broadcast and command
      9a7161f [Davies Liu] fix doc tests
      e93cf4b [Davies Liu] address comments: add test
      6226189 [Davies Liu] improve large broadcast
      2fc8aca0
  30. Aug 11, 2014
    • [PySpark] [SPARK-2954] [SPARK-2948] [SPARK-2910] [SPARK-2101] Python 2.6 Fixes · db06a81f
      Josh Rosen authored
      - Modify python/run-tests to test with Python 2.6
      - Use unittest2 when running on Python 2.6.
      - Fix issue with namedtuple.
      - Skip TestOutputFormat.test_newhadoop on Python 2.6 until SPARK-2951 is fixed.
      - Fix MLlib _deserialize_double on Python 2.6.
      
      Closes #1868.  Closes #1042.
      
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #1874 from JoshRosen/python2.6 and squashes the following commits:
      
      983d259 [Josh Rosen] [SPARK-2954] Fix MLlib _deserialize_double on Python 2.6.
      5d18fd7 [Josh Rosen] [SPARK-2948] [SPARK-2910] [SPARK-2101] Python 2.6 fixes
      db06a81f
  31. Aug 06, 2014
    • [SPARK-2627] [PySpark] have the build enforce PEP 8 automatically · d614967b
      Nicholas Chammas authored
      As described in [SPARK-2627](https://issues.apache.org/jira/browse/SPARK-2627), we'd like Python code to automatically be checked for PEP 8 compliance by Jenkins. This pull request aims to do that.
      
      Notes:
      * We may need to install [`pep8`](https://pypi.python.org/pypi/pep8) on the build server.
      * I'm expecting tests to fail now that PEP 8 compliance is being checked as part of the build. I'm fine with cleaning up any remaining PEP 8 violations as part of this pull request.
      * I did not understand why the RAT and scalastyle reports are saved to text files. I did the same for the PEP 8 check, but only so that the console output style can match those for the RAT and scalastyle checks. The PEP 8 report is removed right after the check is complete.
      * Updates to the ["Contributing to Spark"](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark) guide will be submitted elsewhere, as I don't believe that text is part of the Spark repo.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      Author: nchammas <nicholas.chammas@gmail.com>
      
      Closes #1744 from nchammas/master and squashes the following commits:
      
      274b238 [Nicholas Chammas] [SPARK-2627] [PySpark] minor indentation changes
      983d963 [nchammas] Merge pull request #5 from apache/master
      1db5314 [nchammas] Merge pull request #4 from apache/master
      0e0245f [Nicholas Chammas] [SPARK-2627] undo erroneous whitespace fixes
      bf30942 [Nicholas Chammas] [SPARK-2627] PEP8: comment spacing
      6db9a44 [nchammas] Merge pull request #3 from apache/master
      7b4750e [Nicholas Chammas] merge upstream changes
      91b7584 [Nicholas Chammas] [SPARK-2627] undo unnecessary line breaks
      44e3e56 [Nicholas Chammas] [SPARK-2627] use tox.ini to exclude files
      b09fae2 [Nicholas Chammas] don't wrap comments unnecessarily
      bfb9f9f [Nicholas Chammas] [SPARK-2627] keep up with the PEP 8 fixes
      9da347f [nchammas] Merge pull request #2 from apache/master
      aa5b4b5 [Nicholas Chammas] [SPARK-2627] follow Spark bash style for if blocks
      d0a83b9 [Nicholas Chammas] [SPARK-2627] check that pep8 downloaded fine
      dffb5dd [Nicholas Chammas] [SPARK-2627] download pep8 at runtime
      a1ce7ae [Nicholas Chammas] [SPARK-2627] space out test report sections
      21da538 [Nicholas Chammas] [SPARK-2627] it's PEP 8, not PEP8
      6f4900b [Nicholas Chammas] [SPARK-2627] more misc PEP 8 fixes
      fe57ed0 [Nicholas Chammas] removing merge conflict backups
      9c01d4c [nchammas] Merge pull request #1 from apache/master
      9a66cb0 [Nicholas Chammas] resolving merge conflicts
      a31ccc4 [Nicholas Chammas] [SPARK-2627] miscellaneous PEP 8 fixes
      beaa9ac [Nicholas Chammas] [SPARK-2627] fail check on non-zero status
      723ed39 [Nicholas Chammas] always delete the report file
      0541ebb [Nicholas Chammas] [SPARK-2627] call Python linter from run-tests
      12440fa [Nicholas Chammas] [SPARK-2627] add Scala linter
      61c07b9 [Nicholas Chammas] [SPARK-2627] add Python linter
      75ad552 [Nicholas Chammas] make check output style consistent
      d614967b