Skip to content
Snippets Groups Projects
  • Liang-Chi Hsieh's avatar
    1e35e969
    [SPARK-17817] [PYSPARK] [FOLLOWUP] PySpark RDD Repartitioning Results in... · 1e35e969
    Liang-Chi Hsieh authored
    [SPARK-17817] [PYSPARK] [FOLLOWUP] PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes
    
    ## What changes were proposed in this pull request?
    
    This change is a followup for #15389 which calls `_to_java_object_rdd()` to solve this issue. Due to the concern of the possible expensive cost of the call, we can choose to decrease the batch size to solve this issue too.
    
    Simple benchmark:
    
        import time
        num_partitions = 20000
        a = sc.parallelize(range(int(1e6)), 2)
        start = time.time()
        l = a.repartition(num_partitions).glom().map(len).collect()
        end = time.time()
        print(end - start)
    
    Before: 419.447577953
    _to_java_object_rdd(): 421.916361094
    decreasing the batch size: 423.712255955
    
    ## How was this patch tested?
    
    Jenkins tests.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #15445 from viirya/repartition-batch-size.
    1e35e969
    History
    [SPARK-17817] [PYSPARK] [FOLLOWUP] PySpark RDD Repartitioning Results in...
    Liang-Chi Hsieh authored
    [SPARK-17817] [PYSPARK] [FOLLOWUP] PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes
    
    ## What changes were proposed in this pull request?
    
    This change is a followup for #15389 which calls `_to_java_object_rdd()` to solve this issue. Due to the concern of the possible expensive cost of the call, we can choose to decrease the batch size to solve this issue too.
    
    Simple benchmark:
    
        import time
        num_partitions = 20000
        a = sc.parallelize(range(int(1e6)), 2)
        start = time.time()
        l = a.repartition(num_partitions).glom().map(len).collect()
        end = time.time()
        print(end - start)
    
    Before: 419.447577953
    _to_java_object_rdd(): 421.916361094
    decreasing the batch size: 423.712255955
    
    ## How was this patch tested?
    
    Jenkins tests.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #15445 from viirya/repartition-batch-size.