Skip to content
Snippets Groups Projects
  • Davies Liu's avatar
    71af030b
    [SPARK-3094] [PySpark] compatitable with PyPy · 71af030b
    Davies Liu authored
    After this patch, we can run PySpark in PyPy (testing with PyPy 2.3.1 in Mac 10.9), for example:
    
    ```
    PYSPARK_PYTHON=pypy ./bin/spark-submit wordcount.py
    ```
    
    The performance speed up will depend on work load (from 20% to 3000%). Here are some benchmarks:
    
     Job | CPython 2.7 | PyPy 2.3.1  | Speed up
     ------- | ------------ | ------------- | -------
     Word Count | 41s   | 15s  | 2.7x
     Sort | 46s |  44s | 1.05x
     Stats | 174s | 3.6s | 48x
    
    Here is the code used for benchmark:
    
    ```python
    rdd = sc.textFile("text")
    def wordcount():
        rdd.flatMap(lambda x:x.split('/'))\
            .map(lambda x:(x,1)).reduceByKey(lambda x,y:x+y).collectAsMap()
    def sort():
        rdd.sortBy(lambda x:x, 1).count()
    def stats():
        sc.parallelize(range(1024), 20).flatMap(lambda x: xrange(5024)).stats()
    ```
    
    Author: Davies Liu <davies.liu@gmail.com>
    
    Closes #2144 from davies/pypy and squashes the following commits:
    
    9aed6c5 [Davies Liu] use protocol 2 in CloudPickle
    4bc1f04 [Davies Liu] refactor
    b20ab3a [Davies Liu] pickle sys.stdout and stderr in portable way
    3ca2351 [Davies Liu] Merge branch 'master' into pypy
    fae8b19 [Davies Liu] improve attrgetter, add tests
    591f830 [Davies Liu] try to run tests with PyPy in run-tests
    c8d62ba [Davies Liu] cleanup
    f651fd0 [Davies Liu] fix tests using array with PyPy
    1b98fb3 [Davies Liu] serialize itemgetter/attrgetter in portable ways
    3c1dbfe [Davies Liu] Merge branch 'master' into pypy
    42fb5fa [Davies Liu] Merge branch 'master' into pypy
    cb2d724 [Davies Liu] fix tests
    9986692 [Davies Liu] Merge branch 'master' into pypy
    25b4ca7 [Davies Liu] support PyPy
    71af030b
    History
    [SPARK-3094] [PySpark] compatitable with PyPy
    Davies Liu authored
    After this patch, we can run PySpark in PyPy (testing with PyPy 2.3.1 in Mac 10.9), for example:
    
    ```
    PYSPARK_PYTHON=pypy ./bin/spark-submit wordcount.py
    ```
    
    The performance speed up will depend on work load (from 20% to 3000%). Here are some benchmarks:
    
     Job | CPython 2.7 | PyPy 2.3.1  | Speed up
     ------- | ------------ | ------------- | -------
     Word Count | 41s   | 15s  | 2.7x
     Sort | 46s |  44s | 1.05x
     Stats | 174s | 3.6s | 48x
    
    Here is the code used for benchmark:
    
    ```python
    rdd = sc.textFile("text")
    def wordcount():
        rdd.flatMap(lambda x:x.split('/'))\
            .map(lambda x:(x,1)).reduceByKey(lambda x,y:x+y).collectAsMap()
    def sort():
        rdd.sortBy(lambda x:x, 1).count()
    def stats():
        sc.parallelize(range(1024), 20).flatMap(lambda x: xrange(5024)).stats()
    ```
    
    Author: Davies Liu <davies.liu@gmail.com>
    
    Closes #2144 from davies/pypy and squashes the following commits:
    
    9aed6c5 [Davies Liu] use protocol 2 in CloudPickle
    4bc1f04 [Davies Liu] refactor
    b20ab3a [Davies Liu] pickle sys.stdout and stderr in portable way
    3ca2351 [Davies Liu] Merge branch 'master' into pypy
    fae8b19 [Davies Liu] improve attrgetter, add tests
    591f830 [Davies Liu] try to run tests with PyPy in run-tests
    c8d62ba [Davies Liu] cleanup
    f651fd0 [Davies Liu] fix tests using array with PyPy
    1b98fb3 [Davies Liu] serialize itemgetter/attrgetter in portable ways
    3c1dbfe [Davies Liu] Merge branch 'master' into pypy
    42fb5fa [Davies Liu] Merge branch 'master' into pypy
    cb2d724 [Davies Liu] fix tests
    9986692 [Davies Liu] Merge branch 'master' into pypy
    25b4ca7 [Davies Liu] support PyPy