    [SPARK-9821] [PYSPARK] pyspark-reduceByKey-should-take-a-custom-partitioner · 1cd67415
    Holden Karau authored
    from the issue:
    
    In Scala, I can supply a custom partitioner to reduceByKey (and to other aggregation/repartitioning methods such as aggregateByKey and combineByKey), but as far as I can tell from the PySpark API, there's no way to do the same in Python.
    Here's an example of my code in Scala:
    weblogs.map(s => (getFileType(s), 1)).reduceByKey(new FileTypePartitioner(), _ + _)
    But I can't figure out how to do the same in Python. The closest I can get is to call partitionBy before reduceByKey, like so:
    weblogs.map(lambda s: (getFileType(s), 1)).partitionBy(3, hash_filetype).reduceByKey(lambda v1, v2: v1 + v2).collect()
    But that defeats the purpose, because I'm shuffling twice instead of once, so my performance is worse instead of better.
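
For illustration, here is a minimal sketch of how the single-shuffle version could look in Python once reduceByKey accepts a partition function. It relies on the `numPartitions` and `partitionFunc` keyword arguments of PySpark's `RDD.reduceByKey`. The bodies of `getFileType` and `hash_filetype` are hypothetical (the issue names them but never shows them), and `count_by_filetype` is a made-up wrapper that assumes `weblogs` is an RDD whose elements are request paths.

```python
from operator import add

NUM_PARTITIONS = 3  # same partition count as the partitionBy(3, ...) call in the issue


def getFileType(line):
    """Hypothetical helper: lowercase file extension of a request path, or "other"."""
    path = line.split("?", 1)[0]      # drop any query string
    name = path.rsplit("/", 1)[-1]    # keep the last path segment
    if "." in name:
        return name.rsplit(".", 1)[-1].lower()
    return "other"


def hash_filetype(file_type):
    """Hypothetical partition function: map a key to an integer.

    Spark takes the result modulo the number of partitions to pick a partition.
    """
    if file_type in ("html", "htm"):
        return 0
    if file_type in ("jpg", "png", "gif"):
        return 1
    return 2


def count_by_filetype(weblogs):
    """Count requests per file type with one shuffle.

    Assumes `weblogs` is a PySpark RDD of request-path strings. The partition
    function is handed to reduceByKey directly, so no separate partitionBy
    step is needed beforehand.
    """
    return (weblogs
            .map(lambda s: (getFileType(s), 1))
            .reduceByKey(add,
                         numPartitions=NUM_PARTITIONS,
                         partitionFunc=hash_filetype))
```

Calling `count_by_filetype(weblogs).collect()` would then stand in for the partitionBy-then-reduceByKey chain quoted above.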
    
    Author: Holden Karau <holden@pigscanfly.ca>
    
    Closes #8569 from holdenk/SPARK-9821-pyspark-reduceByKey-should-take-a-custom-partitioner.