  1. Apr 11, 2017
    • [SPARK-19505][PYTHON] AttributeError on Exception.message in Python3 · 6297697f
      David Gingrich authored
      ## What changes were proposed in this pull request?
      
      Added `util._message_exception` helper to use `str(e)` when `e.message` is unavailable (Python3).  Grepped for all occurrences of `.message` in `pyspark/` and these were the only occurrences.
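
      A minimal sketch of what such a helper can look like (the exact body of `util._message_exception` here is an assumption, not copied from the patch):

      ```python
      def _message_exception(excp):
          # Python 2 exceptions expose a .message attribute; Python 3 removed
          # it, so fall back to str(). (Illustrative sketch of the helper.)
          if hasattr(excp, "message"):
              return excp.message
          return str(excp)
      ```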
      
      ## How was this patch tested?
      
      - Doctests for helper function
      
      ## Legal
      
      This is my original work and I license the work to the project under the project’s open source license.
      
      Author: David Gingrich <david@textio.com>
      
      Closes #16845 from dgingrich/topic-spark-19505-py3-exceptions.
  2. Jan 17, 2017
    • [SPARK-19019] [PYTHON] Fix hijacked `collections.namedtuple` and port cloudpickle changes for PySpark to work with Python 3.6.0 · 20e62806
      hyukjinkwon authored
      
      ## What changes were proposed in this pull request?
      
      Currently, PySpark does not work with Python 3.6.0.
      
      Running `./bin/pyspark` throws the error below, and PySpark does not work at all:
      
      ```
      Traceback (most recent call last):
        File ".../spark/python/pyspark/shell.py", line 30, in <module>
          import pyspark
        File ".../spark/python/pyspark/__init__.py", line 46, in <module>
          from pyspark.context import SparkContext
        File ".../spark/python/pyspark/context.py", line 36, in <module>
          from pyspark.java_gateway import launch_gateway
        File ".../spark/python/pyspark/java_gateway.py", line 31, in <module>
          from py4j.java_gateway import java_import, JavaGateway, GatewayClient
        File "<frozen importlib._bootstrap>", line 961, in _find_and_load
        File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
        File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
        File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 18, in <module>
        File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py", line 62, in <module>
          import pkgutil
        File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py", line 22, in <module>
          ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg')
        File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple
          cls = _old_namedtuple(*args, **kwargs)
      TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
      ```
      
      The root cause is that some arguments of `namedtuple` became keyword-only arguments as of Python 3.6.0 (see https://bugs.python.org/issue25628).
      
      We currently copy this function via `types.FunctionType`, which does not set the default values of keyword-only arguments (that is, `namedtuple.__kwdefaults__`), leaving those arguments unbound inside the copied function.
      
      This PR works around the problem by setting the defaults manually via `kwargs`, since `types.FunctionType` does not appear to support setting them.
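
      A minimal sketch of the workaround (illustrative; the real code lives in `pyspark/serializers.py` and differs in detail): capture the original `namedtuple`'s keyword-only defaults and merge them back into `kwargs`:

      ```python
      import collections

      _old_namedtuple = collections.namedtuple
      # The keyword-only defaults that a types.FunctionType copy drops, e.g.
      # {'verbose': False, 'rename': False, 'module': None} on Python 3.6.
      _kwdefaults = _old_namedtuple.__kwdefaults__ or {}

      def namedtuple(*args, **kwargs):
          for k, v in _kwdefaults.items():
              kwargs.setdefault(k, v)  # restore the missing defaults
          return _old_namedtuple(*args, **kwargs)
      ```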
      
      Also, this PR ports the changes in cloudpickle for compatibility for Python 3.6.0.
      
      ## How was this patch tested?
      
      Manually tested with Python 2.7.6 and Python 3.6.0.
      
      ```
      ./bin/pyspark
      ```
      
      as well as manual creation of `namedtuple` both locally and in an RDD with Python 3.6.0,
      
      and Jenkins tests for other Python versions.
      
      Also,
      
      ```
      ./run-tests --python-executables=python3.6
      ```
      
      ```
      Will test against the following Python executables: ['python3.6']
      Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
      Finished test(python3.6): pyspark.sql.tests (192s)
      Finished test(python3.6): pyspark.accumulators (3s)
      Finished test(python3.6): pyspark.mllib.tests (198s)
      Finished test(python3.6): pyspark.broadcast (3s)
      Finished test(python3.6): pyspark.conf (2s)
      Finished test(python3.6): pyspark.context (14s)
      Finished test(python3.6): pyspark.ml.classification (21s)
      Finished test(python3.6): pyspark.ml.evaluation (11s)
      Finished test(python3.6): pyspark.ml.clustering (20s)
      Finished test(python3.6): pyspark.ml.linalg.__init__ (0s)
      Finished test(python3.6): pyspark.streaming.tests (240s)
      Finished test(python3.6): pyspark.tests (240s)
      Finished test(python3.6): pyspark.ml.recommendation (19s)
      Finished test(python3.6): pyspark.ml.feature (36s)
      Finished test(python3.6): pyspark.ml.regression (37s)
      Finished test(python3.6): pyspark.ml.tuning (28s)
      Finished test(python3.6): pyspark.mllib.classification (26s)
      Finished test(python3.6): pyspark.mllib.evaluation (18s)
      Finished test(python3.6): pyspark.mllib.clustering (44s)
      Finished test(python3.6): pyspark.mllib.linalg.__init__ (0s)
      Finished test(python3.6): pyspark.mllib.feature (26s)
      Finished test(python3.6): pyspark.mllib.fpm (23s)
      Finished test(python3.6): pyspark.mllib.random (8s)
      Finished test(python3.6): pyspark.ml.tests (92s)
      Finished test(python3.6): pyspark.mllib.stat.KernelDensity (0s)
      Finished test(python3.6): pyspark.mllib.linalg.distributed (25s)
      Finished test(python3.6): pyspark.mllib.stat._statistics (15s)
      Finished test(python3.6): pyspark.mllib.recommendation (24s)
      Finished test(python3.6): pyspark.mllib.regression (26s)
      Finished test(python3.6): pyspark.profiler (9s)
      Finished test(python3.6): pyspark.mllib.tree (16s)
      Finished test(python3.6): pyspark.shuffle (1s)
      Finished test(python3.6): pyspark.mllib.util (18s)
      Finished test(python3.6): pyspark.serializers (11s)
      Finished test(python3.6): pyspark.rdd (20s)
      Finished test(python3.6): pyspark.sql.conf (8s)
      Finished test(python3.6): pyspark.sql.catalog (17s)
      Finished test(python3.6): pyspark.sql.column (18s)
      Finished test(python3.6): pyspark.sql.context (18s)
      Finished test(python3.6): pyspark.sql.group (27s)
      Finished test(python3.6): pyspark.sql.dataframe (33s)
      Finished test(python3.6): pyspark.sql.functions (35s)
      Finished test(python3.6): pyspark.sql.types (6s)
      Finished test(python3.6): pyspark.sql.streaming (13s)
      Finished test(python3.6): pyspark.streaming.util (0s)
      Finished test(python3.6): pyspark.sql.session (16s)
      Finished test(python3.6): pyspark.sql.window (4s)
      Finished test(python3.6): pyspark.sql.readwriter (35s)
      Tests passed in 433 seconds
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16429 from HyukjinKwon/SPARK-19019.
  3. Sep 14, 2016
    • [SPARK-17472] [PYSPARK] Better error message for serialization failures of large objects in Python · dbfc7aa4
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      For large objects, pickle does not raise useful error messages. However, we can wrap the errors to make them slightly more user-friendly:
      
      Example 1:
      ```
      def run():
        import numpy.random as nr
        b = nr.bytes(8 * 1000000000)
        sc.parallelize(range(1000), 1000).map(lambda x: len(b)).count()
      
      run()
      ```
      
      Before:
      ```
      error: 'i' format requires -2147483648 <= number <= 2147483647
      ```
      
      After:
      ```
      pickle.PicklingError: Object too large to serialize: 'i' format requires -2147483648 <= number <= 2147483647
      ```
      
      Example 2:
      ```
      def run():
        import numpy.random as nr
        b = sc.broadcast(nr.bytes(8 * 1000000000))
        sc.parallelize(range(1000), 1000).map(lambda x: len(b.value)).count()
      
      run()
      ```
      
      Before:
      ```
      SystemError: error return without exception set
      ```
      
      After:
      ```
      cPickle.PicklingError: Could not serialize broadcast: SystemError: error return without exception set
      ```
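
      A minimal sketch of this kind of wrapping (illustrative; not the exact PySpark code or exception set):

      ```python
      import pickle
      import struct

      def dumps_wrapped(obj, protocol=2):
          # Re-raise opaque low-level size errors with a clearer message.
          try:
              return pickle.dumps(obj, protocol)
          except (pickle.PicklingError, struct.error, SystemError,
                  OverflowError) as e:
              raise pickle.PicklingError(
                  "Object too large to serialize: %s" % e)
      ```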
      
      ## How was this patch tested?
      
      Manually tried out these cases.
      
      cc davies
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #15026 from ericl/spark-17472.
  4. Mar 06, 2016
    • [SPARK-13697] [PYSPARK] Fix the missing module name of TransformFunctionSerializer.loads · ee913e6e
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Set the function's module name to `__main__` if it's missing in `TransformFunctionSerializer.loads`.
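
      A minimal sketch of the fallback (the helper name is hypothetical; the actual change is inside `TransformFunctionSerializer.loads`):

      ```python
      def _ensure_module(func):
          # A function defined in the shell can deserialize with
          # __module__ set to None; treat it as defined in __main__.
          if getattr(func, "__module__", None) is None:
              func.__module__ = "__main__"
          return func
      ```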
      
      ## How was this patch tested?
      
      Manually tested in the shell.
      
      Before this patch:
      ```
      >>> from pyspark.streaming import StreamingContext
      >>> from pyspark.streaming.util import TransformFunction
      >>> ssc = StreamingContext(sc, 1)
      >>> func = TransformFunction(sc, lambda x: x, sc.serializer)
      >>> func.rdd_wrapper(lambda x: x)
      TransformFunction(<function <lambda> at 0x106ac8b18>)
      >>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers)))
      >>> func2 = ssc._transformerSerializer.loads(bytes)
      >>> print(func2.func.__module__)
      None
      >>> print(func2.rdd_wrap_func.__module__)
      None
      >>>
      ```
      After this patch:
      ```
      >>> from pyspark.streaming import StreamingContext
      >>> from pyspark.streaming.util import TransformFunction
      >>> ssc = StreamingContext(sc, 1)
      >>> func = TransformFunction(sc, lambda x: x, sc.serializer)
      >>> func.rdd_wrapper(lambda x: x)
      TransformFunction(<function <lambda> at 0x108bf1b90>)
      >>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers)))
      >>> func2 = ssc._transformerSerializer.loads(bytes)
      >>> print(func2.func.__module__)
      __main__
      >>> print(func2.rdd_wrap_func.__module__)
      __main__
      >>>
      ```
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11535 from zsxwing/loads-module.
  5. Jul 30, 2015
    • [SPARK-9116] [SQL] [PYSPARK] support Python only UDT in __main__ · e044705b
      Davies Liu authored
      Also, we can now create a Python UDT without having a Scala one, which is important for Python users.
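
      A minimal sketch of a Python-only UDT (class and field names are illustrative assumptions, modeled on the `ExamplePoint` pattern in PySpark's tests):

      ```python
      from pyspark.sql.types import ArrayType, DoubleType, UserDefinedType

      class PointUDT(UserDefinedType):
          @classmethod
          def sqlType(cls):
              return ArrayType(DoubleType(), False)

          @classmethod
          def module(cls):
              return "__main__"

          def serialize(self, obj):
              return [obj.x, obj.y]

          def deserialize(self, datum):
              return Point(datum[0], datum[1])

      class Point(object):
          __UDT__ = PointUDT()

          def __init__(self, x, y):
              self.x, self.y = x, y
      ```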
      
      cc mengxr JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7453 from davies/class_in_main and squashes the following commits:
      
      4dfd5e1 [Davies Liu] add tests for Python and Scala UDT
      793d9b2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
      dc65f19 [Davies Liu] address comment
      a9a3c40 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
      a86e1fc [Davies Liu] fix serialization
      ad528ba [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
      63f52ef [Davies Liu] fix pylint check
      655b8a9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
      316a394 [Davies Liu] support Python UDT with UTF
      0bcb3ef [Davies Liu] fix bug in mllib
      de986d6 [Davies Liu] fix test
      83d65ac [Davies Liu] fix bug in StructType
      55bb86e [Davies Liu] support Python UDT in __main__ (without Scala one)
  6. Apr 16, 2015
    • [SPARK-4897] [PySpark] Python 3 support · 04e44b37
      Davies Liu authored
      This PR updates PySpark to support Python 3 (tested with 3.4).
      
      Known issue: unpickling arrays from Pyrolite is broken in Python 3, so those tests are skipped.
      
      TODO: `ec2/spark-ec2.py` is not fully tested with Python 3.
      
      Author: Davies Liu <davies@databricks.com>
      Author: twneale <twneale@gmail.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5173 from davies/python3 and squashes the following commits:
      
      d7d6323 [Davies Liu] fix tests
      6c52a98 [Davies Liu] fix mllib test
      99e334f [Davies Liu] update timeout
      b716610 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      cafd5ec [Davies Liu] adddress comments from @mengxr
      bf225d7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      179fc8d [Davies Liu] tuning flaky tests
      8c8b957 [Davies Liu] fix ResourceWarning in Python 3
      5c57c95 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      4006829 [Davies Liu] fix test
      2fc0066 [Davies Liu] add python3 path
      71535e9 [Davies Liu] fix xrange and divide
      5a55ab4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      125f12c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ed498c8 [Davies Liu] fix compatibility with python 3
      820e649 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      e8ce8c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ad7c374 [Davies Liu] fix mllib test and warning
      ef1fc2f [Davies Liu] fix tests
      4eee14a [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      20112ff [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      59bb492 [Davies Liu] fix tests
      1da268c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ca0fdd3 [Davies Liu] fix code style
      9563a15 [Davies Liu] add imap back for python 2
      0b1ec04 [Davies Liu] make python examples work with Python 3
      d2fd566 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      a716d34 [Davies Liu] test with python 3.4
      f1700e8 [Davies Liu] fix test in python3
      671b1db [Davies Liu] fix test in python3
      692ff47 [Davies Liu] fix flaky test
      7b9699f [Davies Liu] invalidate import cache for Python 3.3+
      9c58497 [Davies Liu] fix kill worker
      309bfbf [Davies Liu] keep compatibility
      5707476 [Davies Liu] cleanup, fix hash of string in 3.3+
      8662d5b [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      f53e1f0 [Davies Liu] fix tests
      70b6b73 [Davies Liu] compile ec2/spark_ec2.py in python 3
      a39167e [Davies Liu] support customize class in __main__
      814c77b [Davies Liu] run unittests with python 3
      7f4476e [Davies Liu] mllib tests passed
      d737924 [Davies Liu] pass ml tests
      375ea17 [Davies Liu] SQL tests pass
      6cc42a9 [Davies Liu] rename
      431a8de [Davies Liu] streaming tests pass
      78901a7 [Davies Liu] fix hash of serializer in Python 3
      24b2f2e [Davies Liu] pass all RDD tests
      35f48fe [Davies Liu] run future again
      1eebac2 [Davies Liu] fix conflict in ec2/spark_ec2.py
      6e3c21d [Davies Liu] make cloudpickle work with Python3
      2fb2db3 [Josh Rosen] Guard more changes behind sys.version; still doesn't run
      1aa5e8f [twneale] Turned out `pickle.DictionaryType is dict` == True, so swapped it out
      7354371 [twneale] buffer --> memoryview  I'm not super sure if this a valid change, but the 2.7 docs recommend using memoryview over buffer where possible, so hoping it'll work.
      b69ccdf [twneale] Uses the pure python pickle._Pickler instead of c-extension _pickle.Pickler. It appears pyspark 2.7 uses the pure python pickler as well, so this shouldn't degrade pickling performance (?).
      f40d925 [twneale] xrange --> range
      e104215 [twneale] Replaces 2.7 types.InstsanceType with 3.4 `object`....could be horribly wrong depending on how types.InstanceType is used elsewhere in the package--see http://bugs.python.org/issue8206
      79de9d0 [twneale] Replaces python2.7 `file` with 3.4 _io.TextIOWrapper
      2adb42d [Josh Rosen] Fix up some import differences between Python 2 and 3
      854be27 [Josh Rosen] Run `futurize` on Python code:
      7c5b4ce [Josh Rosen] Remove Python 3 check in shell.py.
  7. Sep 24, 2014
    • [SPARK-3679] [PySpark] pickle the exact globals of functions · bb96012b
      Davies Liu authored
      `function.func_code.co_names` contains all of the names used in a function, including attribute names. As a result, unnecessary globals get pickled whenever a global shares a name with an attribute that appears in `co_names`.
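
      A minimal sketch of the underlying idea (illustrative, not the cloudpickle internals; Python 3's `dis` API is used for brevity): keep only the names actually loaded via `LOAD_GLOBAL` instead of everything in `co_names`:

      ```python
      import dis

      def referenced_globals(func):
          # co_names mixes true globals with attribute names; only names
          # loaded via LOAD_GLOBAL are real global references.
          names = {ins.argval for ins in dis.get_instructions(func)
                   if ins.opname == "LOAD_GLOBAL"}
          return {n: func.__globals__[n] for n in names if n in func.__globals__}
      ```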
      
      This is a regression introduced by #2144; this PR reverts part of the changes in that PR.
      
      cc JoshRosen
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2522 from davies/globals and squashes the following commits:
      
      dfbccf5 [Davies Liu] fix bug while pickle globals of function
  8. Sep 12, 2014
    • [SPARK-3094] [PySpark] compatible with PyPy · 71af030b
      Davies Liu authored
      After this patch, we can run PySpark on PyPy (tested with PyPy 2.3.1 on Mac OS X 10.9), for example:
      
      ```
      PYSPARK_PYTHON=pypy ./bin/spark-submit wordcount.py
      ```
      
      The performance speedup depends on the workload, ranging from 20% to 3000%. Here are some benchmarks:
      
       Job | CPython 2.7 | PyPy 2.3.1  | Speed up
       ------- | ------------ | ------------- | -------
       Word Count | 41s   | 15s  | 2.7x
       Sort | 46s |  44s | 1.05x
       Stats | 174s | 3.6s | 48x
      
      Here is the code used for the benchmark:
      
      ```python
      rdd = sc.textFile("text")

      def wordcount():
          rdd.flatMap(lambda x: x.split('/')) \
             .map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).collectAsMap()

      def sort():
          rdd.sortBy(lambda x: x, 1).count()

      def stats():
          sc.parallelize(range(1024), 20).flatMap(lambda x: xrange(5024)).stats()
      ```
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2144 from davies/pypy and squashes the following commits:
      
      9aed6c5 [Davies Liu] use protocol 2 in CloudPickle
      4bc1f04 [Davies Liu] refactor
      b20ab3a [Davies Liu] pickle sys.stdout and stderr in portable way
      3ca2351 [Davies Liu] Merge branch 'master' into pypy
      fae8b19 [Davies Liu] improve attrgetter, add tests
      591f830 [Davies Liu] try to run tests with PyPy in run-tests
      c8d62ba [Davies Liu] cleanup
      f651fd0 [Davies Liu] fix tests using array with PyPy
      1b98fb3 [Davies Liu] serialize itemgetter/attrgetter in portable ways
      3c1dbfe [Davies Liu] Merge branch 'master' into pypy
      42fb5fa [Davies Liu] Merge branch 'master' into pypy
      cb2d724 [Davies Liu] fix tests
      9986692 [Davies Liu] Merge branch 'master' into pypy
      25b4ca7 [Davies Liu] support PyPy
  9. Sep 07, 2014
    • [SPARK-3415] [PySpark] removes SerializingAdapter code · ecfa76cd
      Ward Viaene authored
      This removes the `SerializingAdapter` code that was copied from PiCloud.
      
      Author: Ward Viaene <ward.viaene@bigdatapartnership.com>
      
      Closes #2287 from wardviaene/feature/pythonsys and squashes the following commits:
      
      5f0d426 [Ward Viaene] SPARK-3415: modified test class to do dump and load
      5f5d559 [Ward Viaene] SPARK-3415: modified test class name and call cloudpickle.dumps instead using StringIO
      afc4a9a [Ward Viaene] SPARK-3415: added newlines to pass lint
      aaf10b7 [Ward Viaene] SPARK-3415: removed references to SerializingAdapter and rewrote test
      65ffeff [Ward Viaene] removed duplicate test
      a958866 [Ward Viaene] SPARK-3415: test script
      e263bf5 [Ward Viaene] SPARK-3415: removes legacy SerializingAdapter code
  10. Jul 29, 2014
    • [SPARK-791] [PySpark] fix pickle itemgetter with cloudpickle · 92ef0262
      Davies Liu authored
      Fix the problem of pickling `operator.itemgetter` with multiple indices.
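
      A minimal sketch of one portable way to pickle `operator.itemgetter` (the probe trick mirrors cloudpickle's approach, but this code is illustrative):

      ```python
      import copyreg
      from operator import itemgetter

      def _reduce_itemgetter(obj):
          class Probe(object):
              # Record every index the itemgetter asks for.
              def __init__(self):
                  self.items = []
              def __getitem__(self, item):
                  self.items.append(item)
          probe = Probe()
          obj(probe)  # replay the getter to recover its indices
          return itemgetter, tuple(probe.items)

      copyreg.pickle(itemgetter, _reduce_itemgetter)
      ```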
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1627 from davies/itemgetter and squashes the following commits:
      
      aabd7fa [Davies Liu] fix pickle itemgetter with cloudpickle