    [SPARK-4477] [PySpark] remove numpy from RDDSampler · d39f2e9c
    Davies Liu authored
    RDDSampler tries to use numpy to speed up poisson(), but the pure-Python implementation of poisson() calls random() only (1 + fraction) * N times, so numpy brings little performance gain.
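
To illustrate the (1 + fraction) * N call count, here is a minimal sketch of a multiplication-based (Knuth-style) Poisson sampler in pure Python. It is an assumption about the general shape of such a sampler, not a copy of the RDDSampler code; the names `poisson_sample` and `sample_with_replacement` are hypothetical. Each draw that returns k uses exactly k + 1 calls to random(), so the expected total over N elements is (1 + fraction) * N.

```python
import math
import random


def poisson_sample(mean, rng):
    """Return (k, calls): a Poisson(mean) draw and the number of
    rng.random() calls it consumed (always k + 1)."""
    limit = math.exp(-mean)  # only suitable for small means; underflows for large ones
    p = rng.random()
    calls = 1
    k = 0
    # Multiply uniforms until the product drops to exp(-mean) or below.
    while p > limit:
        k += 1
        p *= rng.random()
        calls += 1
    return k, calls


def sample_with_replacement(items, fraction, seed=42):
    """Hypothetical helper: emit each item Poisson(fraction) times and
    report how many random() calls were needed in total."""
    rng = random.Random(seed)
    out = []
    total_calls = 0
    for item in items:
        k, calls = poisson_sample(fraction, rng)
        total_calls += calls
        out.extend([item] * k)
    return out, total_calls


if __name__ == "__main__":
    n = 1 << 16
    sampled, calls = sample_with_replacement(range(n), 0.9)
    print("calls per element:", calls / n)  # expected to be close to 1 + 0.9
```

With fraction = 0.9 this averages about 1.9 calls to random() per element, which leaves little for a vectorized numpy call to save.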
    
    numpy is not a dependency of PySpark, so relying on it can cause problems, for example when numpy is installed on the master but not on the slaves, as reported in SPARK-927.
    
    It also complicates the code considerably, so we should remove numpy from RDDSampler.
    
    I also ran a benchmark to verify this:
    ```
    >>> from pyspark.mllib.random import RandomRDDs
    >>> rdd = RandomRDDs.uniformRDD(sc, 1 << 20, 1).cache()
    >>> rdd.count()  # cache it
    >>> rdd.sample(True, 0.9).count()    # measure this line
    ```
    The results:
    
    | withReplacement | random | numpy.random |
    | --------------- | ------ | ------------ |
    | True            | 1.5 s  | 1.4 s        |
    | False           | 0.6 s  | 0.8 s        |
    
    closes #2313
    
    Note: this patch includes some commits that are not yet mirrored to GitHub; it will be OK once the mirror catches up.
    
    Author: Davies Liu <davies@databricks.com>
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #3351 from davies/numpy and squashes the following commits:
    
    5c438d7 [Davies Liu] fix comment
    c5b9252 [Davies Liu] Merge pull request #1 from mengxr/SPARK-4477
    98eb31b [Xiangrui Meng] make poisson sampling slightly faster
    ee17d78 [Davies Liu] remove = for float
    13f7b05 [Davies Liu] Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/spark into numpy
    f583023 [Davies Liu] fix tests
    51649f5 [Davies Liu] remove numpy in RDDSampler
    78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no performance gain
    f5fdf63 [Davies Liu] fix bug with int in weights
    4dfa2cd [Davies Liu] refactor
    f866bcf [Davies Liu] remove unneeded change
    c7a2007 [Davies Liu] switch to python implementation
    95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into randomSplit
    0d9b256 [Davies Liu] refactor
    1715ee3 [Davies Liu] address comments
    41fce54 [Davies Liu] randomSplit()