[SPARK-4860][pyspark][sql] speeding up `sample()` and `takeSample()`
This PR modifies the Python `SchemaRDD` to use `sample()` and `takeSample()` from Scala instead of the slower Python implementations in `rdd.py`. This is worthwhile because the `Row` objects are already serialized as Java objects. To use the faster `takeSample()`, a `takeSampleToPython()` method was implemented in `SchemaRDD.scala`, following the pattern of `collectToPython()`.

Author: jbencook <jbenjamincook@gmail.com>
Author: J. Benjamin Cook <jbenjamincook@gmail.com>

Closes #3764 from jbencook/master and squashes the following commits:

6fbc769 [J. Benjamin Cook] [SPARK-4860][pyspark][sql] fixing sloppy indentation for takeSampleToPython() arguments
5170da2 [J. Benjamin Cook] [SPARK-4860][pyspark][sql] fixing typo: from RDD to SchemaRDD
de22f70 [jbencook] [SPARK-4860][pyspark][sql] using sample() method from JavaSchemaRDD
b916442 [jbencook] [SPARK-4860][pyspark][sql] adding sample() to JavaSchemaRDD
020cbdf [jbencook] [SPARK-4860][pyspark][sql] using Scala implementations of `sample()` and `takeSample()`
Changed files:
- python/pyspark/sql.py: 28 additions, 0 deletions
- sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala: 15 additions, 0 deletions
- sql/core/src/main/scala/org/apache/spark/sql/api/java/JavaSchemaRDD.scala: 6 additions, 0 deletions
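
The delegation pattern described in the commit message can be sketched in plain Python. This is not the code from the PR: `FakeJvmSchemaRDD`, `SchemaRDDSketch`, and the snake_case method names on the backing object are hypothetical stand-ins, and the real change calls into the JVM rather than into another Python class. The sketch only shows the shape of the idea, where the Python wrapper chooses a seed and forwards the call instead of sampling the rows itself.

```python
import random


class FakeJvmSchemaRDD:
    """Hypothetical stand-in for the JVM-side SchemaRDD (not Spark code)."""

    def __init__(self, rows):
        self._rows = list(rows)

    def sample(self, with_replacement, fraction, seed):
        # Only Bernoulli sampling without replacement is modeled in this sketch.
        if with_replacement:
            raise NotImplementedError("sketch models sampling without replacement only")
        rng = random.Random(seed)
        return FakeJvmSchemaRDD(r for r in self._rows if rng.random() < fraction)

    def take_sample_to_python(self, with_replacement, num, seed):
        # Stand-in for a takeSampleToPython()-style method: returns rows directly.
        rng = random.Random(seed)
        if not self._rows:
            return []
        if with_replacement:
            return [rng.choice(self._rows) for _ in range(num)]
        return rng.sample(self._rows, min(num, len(self._rows)))

    def collect_to_python(self):
        return list(self._rows)


class SchemaRDDSketch:
    """Python-side wrapper that delegates sampling instead of reimplementing it."""

    def __init__(self, backing):
        self._backing = backing

    def sample(self, with_replacement, fraction, seed=None):
        if fraction < 0.0:
            raise ValueError("Negative fraction value: %s" % fraction)
        # Pick a seed on the Python side so the backing call is reproducible.
        seed = seed if seed is not None else random.randint(0, 2**31 - 1)
        return SchemaRDDSketch(self._backing.sample(with_replacement, fraction, seed))

    def takeSample(self, with_replacement, num, seed=None):
        seed = seed if seed is not None else random.randint(0, 2**31 - 1)
        return self._backing.take_sample_to_python(with_replacement, num, seed)

    def collect(self):
        return self._backing.collect_to_python()
```

With a fixed seed, repeated calls on the same wrapper return the same sample, which is the property that makes forwarding the seed to the backing implementation useful.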