[SPARK-17790][SPARKR] Support for parallelizing R data.frame larger than 2GB
## What changes were proposed in this pull request?

If the R data structure being parallelized is larger than `INT_MAX` bytes, we use files to transfer the data to the JVM. The serialization protocol mimics Python pickling, which lets us simply call `PythonRDD.readRDDFromFile` to create the RDD. I tested this on my MacBook; the following code works with this patch:

```R
intMax <- .Machine$integer.max
largeVec <- 1:intMax
rdd <- SparkR:::parallelize(sc, largeVec, 2)
```

## How was this patch tested?

* [x] Unit tests

Author: Hossein <hossein@databricks.com>

Closes #15375 from falaki/SPARK-17790.
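The core of the change is a size-based dispatch inside `SparkR:::parallelize`: small collections keep the existing socket path, while anything whose serialized form would exceed `INT_MAX` bytes is spilled to a temp file for the JVM to read back. The following is a minimal sketch of that idea, not the patch's actual code; the backend method names passed to `SparkR:::callJStatic` and the length-prefixed file framing are assumptions:

```R
# Hedged sketch of the size-based fallback (helper/method names are assumptions).
parallelizeSketch <- function(sc, coll, numSlices) {
  # Estimate the in-memory size; anything past INT_MAX bytes cannot go
  # through the existing byte-array path over the socket.
  objectSize <- as.numeric(object.size(coll))

  # Split the collection into numSlices pieces and serialize each one.
  slices <- split(coll, rep(seq_len(numSlices), length.out = length(coll)))
  serialized <- lapply(slices, serialize, connection = NULL)

  if (objectSize < .Machine$integer.max) {
    # Small data: ship the serialized slices directly, as before.
    SparkR:::callJStatic("org.apache.spark.api.r.RRDD",
                         "createRDDFromArray", sc, serialized)
  } else {
    # Large data: spill length-prefixed records to a temp file so the JVM
    # can rebuild the RDD via PythonRDD.readRDDFromFile, as the PR describes.
    fileName <- tempfile(pattern = "sparkr-parallelize-", fileext = ".tmp")
    con <- file(fileName, open = "wb")
    lapply(serialized, function(bytes) {
      writeBin(length(bytes), con, endian = "big")  # record length header
      writeBin(bytes, con)                          # record payload
    })
    close(con)
    rdd <- SparkR:::callJStatic("org.apache.spark.api.r.RRDD",
                                "createRDDFromFile", sc,
                                fileName, as.integer(numSlices))
    file.remove(fileName)
    rdd
  }
}
```

The point of the file fallback is that the socket protocol carries payload lengths as 32-bit integers, so the spill-to-disk path sidesteps that limit entirely rather than widening the wire format.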
Showing 4 changed files with 68 additions and 3 deletions:
- R/pkg/R/context.R (43 additions, 2 deletions)
- R/pkg/inst/tests/testthat/test_sparkSQL.R (11 additions, 0 deletions)
- core/src/main/scala/org/apache/spark/api/r/RBackendHandler.scala (1 addition, 1 deletion)
- core/src/main/scala/org/apache/spark/api/r/RRDD.scala (13 additions, 0 deletions)