R · 444c2d22e38a8a78135adf0d3a3774f0e9fc866c · cs525-sp18-g07 / spark

[SPARK-17790][SPARKR] Support for parallelizing R data.frame larger than 2GB

Hossein authored 8 years ago

## What changes were proposed in this pull request?
If the R data structure being parallelized is larger than `INT_MAX` bytes (roughly 2GB), we use files to transfer the data to the JVM. The serialization protocol mimics Python pickling, which lets us simply call `PythonRDD.readRDDFromFile` to create the RDD.
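
As a rough illustration of the file-based transfer, here is a minimal Python sketch of a length-prefixed record file. It assumes the framing is a 4-byte big-endian length followed by the raw bytes of each serialized slice, which is my understanding of what `PythonRDD.readRDDFromFile` consumes; the helper names `write_framed` and `read_framed` are hypothetical and are not part of this patch.

```python
import io
import struct


def write_framed(stream, chunks):
    """Write each chunk as a 4-byte big-endian length followed by its bytes.

    Assumption: this is the framing the JVM side expects; it is not
    copied from the patch itself.
    """
    for chunk in chunks:
        stream.write(struct.pack(">i", len(chunk)))
        stream.write(chunk)


def read_framed(stream):
    """Read length-prefixed records until a clean end of stream."""
    records = []
    while True:
        header = stream.read(4)
        if not header:
            break  # clean end of stream: no more records
        if len(header) < 4:
            raise EOFError("truncated length header")
        (length,) = struct.unpack(">i", header)
        payload = stream.read(length)
        if len(payload) != length:
            raise EOFError("truncated record payload")
        records.append(payload)
    return records


# Round trip through an in-memory buffer standing in for the temp file.
buf = io.BytesIO()
write_framed(buf, [b"abc", b"", b"hello"])
buf.seek(0)
slices = read_framed(buf)
```

Because each record carries its own length, no single length field has to describe the whole payload; only an individual slice must stay under the 32-bit limit.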

I tested this on my MacBook. The following code works with this patch:
```R
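# Largest representable R integer (2^31 - 1).
intMax <- .Machine$integer.max
# Integer vector with intMax elements, well over 2GB once materialized.
largeVec <- 1:intMax
# `sc` is an existing Spark context; split the vector into 2 slices.
rdd <- SparkR:::parallelize(sc, largeVec, 2)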
```

## How was this patch tested?
* [x] Unit tests

Author: Hossein <hossein@databricks.com>

Closes #15375 from falaki/SPARK-17790.

5cc503f4
