python/pyspark/rdd.py · 5e035403d41527f092705c68f0dd1b4b946014d6 · cs525-sp18-g07 / spark

9 years ago

[SPARK-9821] [PYSPARK] pyspark-reduceByKey-should-take-a-custom-partitioner · 1cd67415

Holden Karau authored 9 years ago

from the issue:

In Scala, I can supply a custom partitioner to reduceByKey (and other aggregation/repartitioning methods like aggregateByKey and combinedByKey), but as far as I can tell from the Pyspark API, there's no way to do the same in Python.
Here's an example of my code in Scala:
weblogs.map(s => (getFileType(s), 1)).reduceByKey(new FileTypePartitioner(),_+_)
But I can't figure out how to do the same in Python. The closest I can get is to call repartition before reduceByKey like so:
weblogs.map(lambda s: (getFileType(s), 1)).partitionBy(3,hash_filetype).reduceByKey(lambda v1,v2: v1+v2).collect()
But that defeats the purpose, because I'm shuffling twice instead of once, so my performance is worse instead of better.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8569 from holdenk/SPARK-9821-pyspark-reduceByKey-should-take-a-custom-partitioner.

1cd67415

History

[SPARK-9821] [PYSPARK] pyspark-reduceByKey-should-take-a-custom-partitioner

Holden Karau authored 9 years ago

from the issue:

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8569 from holdenk/SPARK-9821-pyspark-reduceByKey-should-take-a-custom-partitioner.