    [SPARK-4860][pyspark][sql] speeding up `sample()` and `takeSample()` · fd41eb95
    jbencook authored
    This PR modifies the Python `SchemaRDD` to use the Scala implementations of `sample()` and `takeSample()` instead of the slower Python implementations in `rdd.py`. This is worthwhile because the `Row` objects are already serialized as Java objects.
    
    To use the faster `takeSample()`, a `takeSampleToPython()` method was added to `SchemaRDD.scala`, following the pattern of `collectToPython()`.
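
The delegation described above can be illustrated with a small pure-Python sketch. This is not the code from the patch: `FakeJvmSchemaRDD`, `PySchemaRDD`, and all method names below are hypothetical stand-ins, and the "JVM side" is simulated with an in-memory list so the example runs without Spark. It only shows the shape of the change, where the Python wrapper forwards sampling calls to the object that already holds the rows rather than re-implementing sampling itself.

```python
import random


class FakeJvmSchemaRDD:
    """Hypothetical stand-in for the JVM-side SchemaRDD (not a Spark class)."""

    def __init__(self, rows):
        self.rows = list(rows)

    def sample(self, with_replacement, fraction, seed):
        rng = random.Random(seed)
        if with_replacement:
            # Simplification: draw roughly fraction * n rows with replacement.
            k = int(round(fraction * len(self.rows))) if self.rows else 0
            return FakeJvmSchemaRDD(rng.choice(self.rows) for _ in range(k))
        # Without replacement: keep each row independently with probability `fraction`.
        return FakeJvmSchemaRDD(r for r in self.rows if rng.random() < fraction)

    def take_sample_to_python(self, with_replacement, num, seed):
        # Assumed analogue of takeSampleToPython(): returns rows to the caller.
        rng = random.Random(seed)
        if not self.rows:
            return []
        if with_replacement:
            return [rng.choice(self.rows) for _ in range(num)]
        return rng.sample(self.rows, min(num, len(self.rows)))


class PySchemaRDD:
    """Hypothetical Python-side wrapper that forwards sampling to the JVM object."""

    def __init__(self, jschema_rdd):
        self._jschema_rdd = jschema_rdd

    @staticmethod
    def _resolve_seed(seed):
        # Pick a random seed when the caller does not supply one.
        return seed if seed is not None else random.randint(0, 2**31 - 1)

    def sample(self, with_replacement, fraction, seed=None):
        # Delegate instead of sampling row by row in Python.
        jrdd = self._jschema_rdd.sample(
            with_replacement, fraction, self._resolve_seed(seed))
        return PySchemaRDD(jrdd)

    def take_sample(self, with_replacement, num, seed=None):
        # Delegate; only the sampled rows come back to Python.
        return self._jschema_rdd.take_sample_to_python(
            with_replacement, num, self._resolve_seed(seed))

    def count(self):
        return len(self._jschema_rdd.rows)


if __name__ == "__main__":
    rdd = PySchemaRDD(FakeJvmSchemaRDD(range(100)))
    print(rdd.take_sample(False, 5, seed=1))
    print(rdd.sample(False, 0.1, seed=1).count())
```

The design point is that the wrapper never iterates over the rows itself; it passes the arguments through and wraps or returns whatever the backing object produces.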
    
    Author: jbencook <jbenjamincook@gmail.com>
    Author: J. Benjamin Cook <jbenjamincook@gmail.com>
    
    Closes #3764 from jbencook/master and squashes the following commits:
    
    6fbc769 [J. Benjamin Cook] [SPARK-4860][pyspark][sql] fixing sloppy indentation for takeSampleToPython() arguments
    5170da2 [J. Benjamin Cook] [SPARK-4860][pyspark][sql] fixing typo: from RDD to SchemaRDD
    de22f70 [jbencook] [SPARK-4860][pyspark][sql] using sample() method from JavaSchemaRDD
    b916442 [jbencook] [SPARK-4860][pyspark][sql] adding sample() to JavaSchemaRDD
    020cbdf [jbencook] [SPARK-4860][pyspark][sql] using Scala implementations of `sample()` and `takeSample()`