f46e02fc
SPARK-2203: PySpark defaults to use same num reduce partitions as map side
Aaron Davidson authored
For shuffle-based operators, such as rdd.groupBy() or rdd.sortByKey(), PySpark always assumes that the default parallelism to use for the reduce side is ctx.defaultParallelism, a constant typically determined by the number of cores in the cluster.
    
In contrast, Spark's Partitioner#defaultPartitioner uses the same number of reduce partitions as map partitions unless the defaultParallelism config is explicitly set. This tends to be a better default for avoiding OOMs, and PySpark should follow the same behavior.
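For reference, a minimal Python sketch of the decision described above; the function name, signature, and conf handling are illustrative, not the actual PySpark internals:

    # Hypothetical sketch: pick the reduce-side partition count the way
    # Spark's Partitioner#defaultPartitioner does -- reuse the map-side
    # count unless spark.default.parallelism was explicitly configured.

    def default_reduce_partitions(map_side_partitions, conf, default_parallelism):
        """Return the number of reduce partitions for a shuffle.

        map_side_partitions -- partition count of the parent (map-side) RDD
        conf                -- dict-like view of the Spark configuration
        default_parallelism -- ctx.defaultParallelism (typically cluster cores)
        """
        if "spark.default.parallelism" in conf:
            # The user explicitly requested a parallelism level; honor it.
            return default_parallelism
        # Otherwise mirror the map side, which avoids collapsing the data
        # into a few huge reduce partitions (and the OOMs that follow).
        return map_side_partitions

    # Example: a 200-partition map side with no explicit setting keeps 200
    # reduce partitions instead of shrinking to the core count.
    print(default_reduce_partitions(200, {}, 16))                                   # 200
    print(default_reduce_partitions(200, {"spark.default.parallelism": "64"}, 64))  # 64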
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-2203
    
    Author: Aaron Davidson <aaron@databricks.com>
    
    Closes #1138 from aarondav/pyfix and squashes the following commits:
    
    1bd5751 [Aaron Davidson] SPARK-2203: PySpark defaults to use same num reduce partitions as map partitions