Skip to content
Snippets Groups Projects
  1. Mar 06, 2016
    • Shixiong Zhu's avatar
      [SPARK-13697] [PYSPARK] Fix the missing module name of TransformFunctionSerializer.loads · ee913e6e
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Set the function's module name to `__main__` if it's missing in `TransformFunctionSerializer.loads`.
      
      ## How was this patch tested?
      
      Manually test in the shell.
      
      Before this patch:
      ```
      >>> from pyspark.streaming import StreamingContext
      >>> from pyspark.streaming.util import TransformFunction
      >>> ssc = StreamingContext(sc, 1)
      >>> func = TransformFunction(sc, lambda x: x, sc.serializer)
      >>> func.rdd_wrapper(lambda x: x)
      TransformFunction(<function <lambda> at 0x106ac8b18>)
      >>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers)))
      >>> func2 = ssc._transformerSerializer.loads(bytes)
      >>> print(func2.func.__module__)
      None
      >>> print(func2.rdd_wrap_func.__module__)
      None
      >>>
      ```
      After this patch:
      ```
      >>> from pyspark.streaming import StreamingContext
      >>> from pyspark.streaming.util import TransformFunction
      >>> ssc = StreamingContext(sc, 1)
      >>> func = TransformFunction(sc, lambda x: x, sc.serializer)
      >>> func.rdd_wrapper(lambda x: x)
      TransformFunction(<function <lambda> at 0x108bf1b90>)
      >>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers)))
      >>> func2 = ssc._transformerSerializer.loads(bytes)
      >>> print(func2.func.__module__)
      __main__
      >>> print(func2.rdd_wrap_func.__module__)
      __main__
      >>>
      ```
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11535 from zsxwing/loads-module.
      ee913e6e
  2. Sep 14, 2015
  3. Jul 30, 2015
    • Davies Liu's avatar
      [SPARK-9116] [SQL] [PYSPARK] support Python only UDT in __main__ · e044705b
      Davies Liu authored
      Also we could create a Python UDT without having a Scala one, it's important for Python users.
      
      cc mengxr JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7453 from davies/class_in_main and squashes the following commits:
      
      4dfd5e1 [Davies Liu] add tests for Python and Scala UDT
      793d9b2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
      dc65f19 [Davies Liu] address comment
      a9a3c40 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
      a86e1fc [Davies Liu] fix serialization
      ad528ba [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
      63f52ef [Davies Liu] fix pylint check
      655b8a9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
      316a394 [Davies Liu] support Python UDT with UTF
      0bcb3ef [Davies Liu] fix bug in mllib
      de986d6 [Davies Liu] fix test
      83d65ac [Davies Liu] fix bug in StructType
      55bb86e [Davies Liu] support Python UDT in __main__ (without Scala one)
      e044705b
  4. Apr 16, 2015
    • Davies Liu's avatar
      [SPARK-4897] [PySpark] Python 3 support · 04e44b37
      Davies Liu authored
      This PR update PySpark to support Python 3 (tested with 3.4).
      
      Known issue: unpickle array from Pyrolite is broken in Python 3, those tests are skipped.
      
      TODO: ec2/spark-ec2.py is not fully tested with python3.
      
      Author: Davies Liu <davies@databricks.com>
      Author: twneale <twneale@gmail.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5173 from davies/python3 and squashes the following commits:
      
      d7d6323 [Davies Liu] fix tests
      6c52a98 [Davies Liu] fix mllib test
      99e334f [Davies Liu] update timeout
      b716610 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      cafd5ec [Davies Liu] adddress comments from @mengxr
      bf225d7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      179fc8d [Davies Liu] tuning flaky tests
      8c8b957 [Davies Liu] fix ResourceWarning in Python 3
      5c57c95 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      4006829 [Davies Liu] fix test
      2fc0066 [Davies Liu] add python3 path
      71535e9 [Davies Liu] fix xrange and divide
      5a55ab4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      125f12c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ed498c8 [Davies Liu] fix compatibility with python 3
      820e649 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      e8ce8c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ad7c374 [Davies Liu] fix mllib test and warning
      ef1fc2f [Davies Liu] fix tests
      4eee14a [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      20112ff [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      59bb492 [Davies Liu] fix tests
      1da268c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ca0fdd3 [Davies Liu] fix code style
      9563a15 [Davies Liu] add imap back for python 2
      0b1ec04 [Davies Liu] make python examples work with Python 3
      d2fd566 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      a716d34 [Davies Liu] test with python 3.4
      f1700e8 [Davies Liu] fix test in python3
      671b1db [Davies Liu] fix test in python3
      692ff47 [Davies Liu] fix flaky test
      7b9699f [Davies Liu] invalidate import cache for Python 3.3+
      9c58497 [Davies Liu] fix kill worker
      309bfbf [Davies Liu] keep compatibility
      5707476 [Davies Liu] cleanup, fix hash of string in 3.3+
      8662d5b [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      f53e1f0 [Davies Liu] fix tests
      70b6b73 [Davies Liu] compile ec2/spark_ec2.py in python 3
      a39167e [Davies Liu] support customize class in __main__
      814c77b [Davies Liu] run unittests with python 3
      7f4476e [Davies Liu] mllib tests passed
      d737924 [Davies Liu] pass ml tests
      375ea17 [Davies Liu] SQL tests pass
      6cc42a9 [Davies Liu] rename
      431a8de [Davies Liu] streaming tests pass
      78901a7 [Davies Liu] fix hash of serializer in Python 3
      24b2f2e [Davies Liu] pass all RDD tests
      35f48fe [Davies Liu] run future again
      1eebac2 [Davies Liu] fix conflict in ec2/spark_ec2.py
      6e3c21d [Davies Liu] make cloudpickle work with Python3
      2fb2db3 [Josh Rosen] Guard more changes behind sys.version; still doesn't run
      1aa5e8f [twneale] Turned out `pickle.DictionaryType is dict` == True, so swapped it out
      7354371 [twneale] buffer --> memoryview  I'm not super sure if this a valid change, but the 2.7 docs recommend using memoryview over buffer where possible, so hoping it'll work.
      b69ccdf [twneale] Uses the pure python pickle._Pickler instead of c-extension _pickle.Pickler. It appears pyspark 2.7 uses the pure python pickler as well, so this shouldn't degrade pickling performance (?).
      f40d925 [twneale] xrange --> range
      e104215 [twneale] Replaces 2.7 types.InstsanceType with 3.4 `object`....could be horribly wrong depending on how types.InstanceType is used elsewhere in the package--see http://bugs.python.org/issue8206
      79de9d0 [twneale] Replaces python2.7 `file` with 3.4 _io.TextIOWrapper
      2adb42d [Josh Rosen] Fix up some import differences between Python 2 and 3
      854be27 [Josh Rosen] Run `futurize` on Python code:
      7c5b4ce [Josh Rosen] Remove Python 3 check in shell.py.
      04e44b37
  5. Sep 24, 2014
    • Davies Liu's avatar
      [SPARK-3679] [PySpark] pickle the exact globals of functions · bb96012b
      Davies Liu authored
      function.func_code.co_names has all the names used in the function, including name of attributes. It will pickle some unnecessary globals if there is a global having the same name with attribute (in co_names).
      
      There is a regression introduced by #2144, revert part of changes in that PR.
      
      cc JoshRosen
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2522 from davies/globals and squashes the following commits:
      
      dfbccf5 [Davies Liu] fix bug while pickle globals of function
      bb96012b
  6. Sep 12, 2014
    • Davies Liu's avatar
      [SPARK-3094] [PySpark] compatitable with PyPy · 71af030b
      Davies Liu authored
      After this patch, we can run PySpark in PyPy (testing with PyPy 2.3.1 in Mac 10.9), for example:
      
      ```
      PYSPARK_PYTHON=pypy ./bin/spark-submit wordcount.py
      ```
      
      The performance speed up will depend on work load (from 20% to 3000%). Here are some benchmarks:
      
       Job | CPython 2.7 | PyPy 2.3.1  | Speed up
       ------- | ------------ | ------------- | -------
       Word Count | 41s   | 15s  | 2.7x
       Sort | 46s |  44s | 1.05x
       Stats | 174s | 3.6s | 48x
      
      Here is the code used for benchmark:
      
      ```python
      rdd = sc.textFile("text")
      def wordcount():
          rdd.flatMap(lambda x:x.split('/'))\
              .map(lambda x:(x,1)).reduceByKey(lambda x,y:x+y).collectAsMap()
      def sort():
          rdd.sortBy(lambda x:x, 1).count()
      def stats():
          sc.parallelize(range(1024), 20).flatMap(lambda x: xrange(5024)).stats()
      ```
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2144 from davies/pypy and squashes the following commits:
      
      9aed6c5 [Davies Liu] use protocol 2 in CloudPickle
      4bc1f04 [Davies Liu] refactor
      b20ab3a [Davies Liu] pickle sys.stdout and stderr in portable way
      3ca2351 [Davies Liu] Merge branch 'master' into pypy
      fae8b19 [Davies Liu] improve attrgetter, add tests
      591f830 [Davies Liu] try to run tests with PyPy in run-tests
      c8d62ba [Davies Liu] cleanup
      f651fd0 [Davies Liu] fix tests using array with PyPy
      1b98fb3 [Davies Liu] serialize itemgetter/attrgetter in portable ways
      3c1dbfe [Davies Liu] Merge branch 'master' into pypy
      42fb5fa [Davies Liu] Merge branch 'master' into pypy
      cb2d724 [Davies Liu] fix tests
      9986692 [Davies Liu] Merge branch 'master' into pypy
      25b4ca7 [Davies Liu] support PyPy
      71af030b
  7. Sep 07, 2014
    • Ward Viaene's avatar
      [SPARK-3415] [PySpark] removes SerializingAdapter code · ecfa76cd
      Ward Viaene authored
      This code removes the SerializingAdapter code that was copied from PiCloud
      
      Author: Ward Viaene <ward.viaene@bigdatapartnership.com>
      
      Closes #2287 from wardviaene/feature/pythonsys and squashes the following commits:
      
      5f0d426 [Ward Viaene] SPARK-3415: modified test class to do dump and load
      5f5d559 [Ward Viaene] SPARK-3415: modified test class name and call cloudpickle.dumps instead using StringIO
      afc4a9a [Ward Viaene] SPARK-3415: added newlines to pass lint
      aaf10b7 [Ward Viaene] SPARK-3415: removed references to SerializingAdapter and rewrote test
      65ffeff [Ward Viaene] removed duplicate test
      a958866 [Ward Viaene] SPARK-3415: test script
      e263bf5 [Ward Viaene] SPARK-3415: removes legacy SerializingAdapter code
      ecfa76cd
  8. Jul 29, 2014
    • Davies Liu's avatar
      [SPARK-791] [PySpark] fix pickle itemgetter with cloudpickle · 92ef0262
      Davies Liu authored
      fix the problem with pickle operator.itemgetter with multiple index.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1627 from davies/itemgetter and squashes the following commits:
      
      aabd7fa [Davies Liu] fix pickle itemgetter with cloudpickle
      92ef0262
  9. Jul 15, 2014
  10. May 31, 2014
  11. Jan 01, 2013
  12. Aug 19, 2012
Loading