Skip to content
Snippets Groups Projects
Commit 07508bd0 authored by Liang-Chi Hsieh's avatar Liang-Chi Hsieh Committed by Felix Cheung
Browse files

[SPARK-17817][PYSPARK] PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes

## What changes were proposed in this pull request?

Quoted from JIRA description:

Calling repartition on a PySpark RDD to increase the number of partitions results in highly skewed partition sizes, with most having 0 rows. The repartition method should evenly spread out the rows across the partitions, and this behavior is correctly seen on the Scala side.

Please reference the following code for a reproducible example of this issue:

    num_partitions = 20000
    a = sc.parallelize(range(int(1e6)), 2)  # start with 2 even partitions
    l = a.repartition(num_partitions).glom().map(len).collect()  # get length of each partition
    min(l), max(l), sum(l)/len(l), len(l)  # skewed!

In Scala's `repartition` code, we will distribute elements evenly across output partitions. However, the RDD from Python is serialized as a single binary data, so the distribution fails. We need to convert the RDD in Python to java object before repartitioning.

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #15389 from viirya/pyspark-rdd-repartition.
parent 75b9e351
No related branches found
No related tags found
No related merge requests found
Loading
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment