    [SPARK-17817][PYSPARK] PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes · 07508bd0
    Liang-Chi Hsieh authored
    ## What changes were proposed in this pull request?
    
    Quoted from JIRA description:
    
    Calling repartition on a PySpark RDD to increase the number of partitions results in highly skewed partition sizes, with most having 0 rows. The repartition method should evenly spread out the rows across the partitions, and this behavior is correctly seen on the Scala side.
    
    Please reference the following code for a reproducible example of this issue:
    
        num_partitions = 20000
        a = sc.parallelize(range(int(1e6)), 2)  # start with 2 even partitions
        l = a.repartition(num_partitions).glom().map(len).collect()  # get length of each partition
        min(l), max(l), sum(l)/len(l), len(l)  # skewed!
    
    Scala's `repartition` code distributes elements evenly across the output partitions. However, an RDD coming from Python is serialized so that many rows reach the JVM as a single binary blob. The JVM then spreads blobs, not rows, across partitions, so the distribution fails. We need to convert the Python RDD to Java objects before repartitioning.
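
    The effect can be illustrated without a Spark cluster. The following is a minimal pure-Python sketch, not the code in this patch: it assumes a plain round-robin assignment and a hypothetical batch size of 1024 rows per pickled blob, and the helper names are made up for illustration. It compares distributing opaque blobs against distributing individual rows.

```python
import pickle


def round_robin(records, num_partitions):
    """Assign records to partitions in turn (simplified stand-in for an even repartition)."""
    partitions = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        partitions[i % num_partitions].append(record)
    return partitions


def row_counts_when_batched(rows, batch_size, num_partitions):
    """Rows per partition when the distributor only sees pickled batches."""
    # Each blob is opaque to the distributor and hides up to batch_size rows.
    blobs = [
        pickle.dumps(rows[start:start + batch_size])
        for start in range(0, len(rows), batch_size)
    ]
    partitions = round_robin(blobs, num_partitions)
    return [sum(len(pickle.loads(blob)) for blob in part) for part in partitions]


def row_counts_per_row(rows, num_partitions):
    """Rows per partition when the distributor sees each row individually."""
    return [len(part) for part in round_robin(rows, num_partitions)]


rows = list(range(10_000))
# batch_size=1024 is an assumed value chosen for illustration only.
skewed = row_counts_when_batched(rows, batch_size=1024, num_partitions=200)
even = row_counts_per_row(rows, num_partitions=200)

print("batched:", min(skewed), max(skewed), skewed.count(0), "empty partitions")
print("per row:", min(even), max(even))
```

    With only 10 blobs to hand out, at most 10 of the 200 partitions can receive any rows, which mirrors the "most having 0 rows" symptom described above. Distributing row by row gives every partition the same count.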
    
    ## How was this patch tested?
    
    Jenkins tests.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #15389 from viirya/pyspark-rdd-repartition.
tests.py 83.98 KiB