    [SPARK-17790][SPARKR] Support for parallelizing R data.frame larger than 2GB · 5cc503f4
    Hossein authored
    ## What changes were proposed in this pull request?
## What changes were proposed in this pull request?
If the serialized R data structure being parallelized is larger than `INT_MAX` bytes, we use a file to transfer the data to the JVM. The serialization protocol mimics Python pickling, which allows us to simply call `PythonRDD.readRDDFromFile` to create the RDD.
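
For illustration, the size check and file-based handoff described above can be sketched as follows. This is a hedged, simplified sketch, not the actual SparkR internals: the slicing, the `sparkr-` temp-file name, and the length-prefixed framing are assumptions made for the example.

```r
# Sketch: decide between in-memory and file-based transfer of slices.
slices <- split(largeVec, rep(1:2, length.out = length(largeVec)))
serialized <- lapply(slices, serialize, connection = NULL)
totalSize <- sum(vapply(serialized, function(b) as.numeric(length(b)),
                        numeric(1)))

if (totalSize > .Machine$integer.max) {
  # Too large for a single in-memory byte array: write each slice as a
  # length-prefixed record to a temp file (framing mimics Python pickling),
  # so the JVM side can read it back with PythonRDD.readRDDFromFile.
  fileName <- tempfile(pattern = "sparkr-", fileext = ".tmp")
  conn <- file(fileName, open = "wb")
  for (bytes in serialized) {
    writeBin(length(bytes), conn, endian = "big")  # 4-byte length prefix
    writeBin(bytes, conn)                          # raw serialized slice
  }
  close(conn)
}
```
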
    
    I tested this on my MacBook. The following code works with this patch:
    ```R
    intMax <- .Machine$integer.max
    largeVec <- 1:intMax
    rdd <- SparkR:::parallelize(sc, largeVec, 2)
    ```
    
    ## How was this patch tested?
    * [x] Unit tests
    
    Author: Hossein <hossein@databricks.com>
    
    Closes #15375 from falaki/SPARK-17790.