Use filesystem to collect RDDs in PySpark.
Passing large volumes of data through Py4J seems to be slow. It appears to be faster to write the data to the local filesystem and read it back from Python.
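The idea can be sketched in plain Python, without Spark or Py4J. This is a minimal illustration, not the code from this commit: the function names (`dump_records_to_file`, `read_records_from_file`, `collect_through_file`) and the length-prefixed pickle framing are assumptions chosen for the example, and the writer here is Python standing in for the JVM side that would produce the file in PySpark.

```python
import os
import pickle
import struct
import tempfile


def write_with_length(payload, stream):
    # Frame each record as a 4-byte big-endian length followed by the bytes.
    stream.write(struct.pack("!i", len(payload)))
    stream.write(payload)


def read_with_length(stream):
    # Read one framed record; raise EOFError when the file is exhausted.
    header = stream.read(4)
    if len(header) < 4:
        raise EOFError
    (length,) = struct.unpack("!i", header)
    payload = stream.read(length)
    if len(payload) < length:
        raise EOFError
    return payload


def dump_records_to_file(records, path):
    # Hypothetical stand-in for the producer side: pickle each record to disk.
    with open(path, "wb") as out:
        for record in records:
            write_with_length(pickle.dumps(record, protocol=2), out)


def read_records_from_file(path):
    # Lazily yield unpickled records until the end of the file is reached.
    with open(path, "rb") as infile:
        try:
            while True:
                yield pickle.loads(read_with_length(infile))
        except EOFError:
            return


def collect_through_file(records):
    # Round-trip the records through a temporary file on the local filesystem.
    fd, path = tempfile.mkstemp(prefix="collect-", suffix=".bin")
    os.close(fd)
    try:
        dump_records_to_file(records, path)
        return list(read_records_from_file(path))
    finally:
        os.remove(path)
```

Calling `collect_through_file([1, "a", {"k": [1, 2]}])` returns the same records after a round trip through disk. The temporary file is removed in a `finally` clause so that a failed read does not leave data behind.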
Showing 5 changed files:
- core/src/main/scala/spark/api/python/PythonRDD.scala: 24 additions, 42 deletions
- pyspark/pyspark/context.py: 4 additions, 5 deletions
- pyspark/pyspark/rdd.py: 28 additions, 6 deletions
- pyspark/pyspark/serializers.py: 8 additions, 0 deletions
- pyspark/pyspark/worker.py: 2 additions, 10 deletions