diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R
index 2e99aa026da5557985be38b438a2ee9a8a160e32..a4733313ed16c9640f78187b7ae26120fc1828a0 100644
--- a/R/pkg/R/DataFrame.R
+++ b/R/pkg/R/DataFrame.R
@@ -35,7 +35,7 @@ setOldClass("structType")
 #' @slot env An R environment that stores bookkeeping states of the SparkDataFrame
 #' @slot sdf A Java object reference to the backing Scala DataFrame
 #' @seealso \link{createDataFrame}, \link{read.json}, \link{table}
-#' @seealso \url{https://spark.apache.org/docs/latest/sparkr.html#sparkdataframe}
+#' @seealso \url{https://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes}
 #' @export
 #' @examples
 #'\dontrun{
diff --git a/R/pkg/R/sparkR.R b/R/pkg/R/sparkR.R
index ff5297ffd51cbb2802824a500aaae9b0c4ec5028..524f7c4a26b67bbab4745440e8e66b1dd87b92ed 100644
--- a/R/pkg/R/sparkR.R
+++ b/R/pkg/R/sparkR.R
@@ -28,14 +28,6 @@ connExists <- function(env) {
   })
 }
 
-#' @rdname sparkR.session.stop
-#' @name sparkR.stop
-#' @export
-#' @note sparkR.stop since 1.4.0
-sparkR.stop <- function() {
-  sparkR.session.stop()
-}
-
 #' Stop the Spark Session and Spark Context
 #'
 #' Stop the Spark Session and Spark Context.
@@ -90,6 +82,14 @@ sparkR.session.stop <- function() {
   clearJobjs()
 }
 
+#' @rdname sparkR.session.stop
+#' @name sparkR.stop
+#' @export
+#' @note sparkR.stop since 1.4.0
+sparkR.stop <- function() {
+  sparkR.session.stop()
+}
+
 #' (Deprecated) Initialize a new Spark Context
 #'
 #' This function initializes a new SparkContext.
diff --git a/docs/sparkr.md b/docs/sparkr.md
index dfa5278ef84910d969a2e98433d78f9c44197177..4bbc362c52086cbacb4b80ee5fac9486ff1c5f7b 100644
--- a/docs/sparkr.md
+++ b/docs/sparkr.md
@@ -322,8 +322,59 @@ head(ldf, 3)
 
 Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to that key. The groups are chosen from `SparkDataFrame`s column(s). The output of function should be a `data.frame`. Schema specifies the row format of the resulting
-`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of the returned `data.frame` are set by user. Below is the data type mapping between R
-and Spark.
+`SparkDataFrame`. It must represent the R function's output schema on the basis of Spark [data types](#data-type-mapping-between-r-and-spark). The column names of the returned `data.frame` are set by the user.
+
+<div data-lang="r" markdown="1">
+{% highlight r %}
+
+# Determine six waiting times with the largest eruption time in minutes.
+schema <- structType(structField("waiting", "double"), structField("max_eruption", "double"))
+result <- gapply(
+    df,
+    "waiting",
+    function(key, x) {
+        y <- data.frame(key, max(x$eruptions))
+    },
+    schema)
+head(collect(arrange(result, "max_eruption", decreasing = TRUE)))
+
+##    waiting   max_eruption
+##1      64          5.100
+##2      69          5.067
+##3      71          5.033
+##4      87          5.000
+##5      63          4.933
+##6      89          4.900
+{% endhighlight %}
+</div>
+
+##### gapplyCollect
+Like `gapply`, applies a function to each group of a `SparkDataFrame` and collects the result back to an R `data.frame`. The output of the function should be a `data.frame`, but the schema is not required to be passed. Note that `gapplyCollect` can fail if the output of the UDF run on all groups cannot be pulled to the driver and fit in driver memory.
+
+<div data-lang="r" markdown="1">
+{% highlight r %}
+
+# Determine six waiting times with the largest eruption time in minutes.
+result <- gapplyCollect(
+    df,
+    "waiting",
+    function(key, x) {
+        y <- data.frame(key, max(x$eruptions))
+        colnames(y) <- c("waiting", "max_eruption")
+        y
+    })
+head(result[order(result$max_eruption, decreasing = TRUE), ])
+
+##    waiting   max_eruption
+##1      64          5.100
+##2      69          5.067
+##3      71          5.033
+##4      87          5.000
+##5      63          4.933
+##6      89          4.900
+
+{% endhighlight %}
+</div>
 
 #### Data type mapping between R and Spark
 <table class="table">
@@ -394,58 +445,6 @@ and Spark.
 </tr>
 </table>
 
-<div data-lang="r" markdown="1">
-{% highlight r %}
-
-# Determine six waiting times with the largest eruption time in minutes.
-schema <- structType(structField("waiting", "double"), structField("max_eruption", "double"))
-result <- gapply(
-    df,
-    "waiting",
-    function(key, x) {
-        y <- data.frame(key, max(x$eruptions))
-    },
-    schema)
-head(collect(arrange(result, "max_eruption", decreasing = TRUE)))
-
-##    waiting   max_eruption
-##1      64          5.100
-##2      69          5.067
-##3      71          5.033
-##4      87          5.000
-##5      63          4.933
-##6      89          4.900
-{% endhighlight %}
-</div>
-
-##### gapplyCollect
-Like `gapply`, applies a function to each partition of a `SparkDataFrame` and collect the result back to R data.frame. The output of the function should be a `data.frame`. But, the schema is not required to be passed. Note that `gapplyCollect` can fail if the output of UDF run on all the partition cannot be pulled to the driver and fit in driver memory.
-
-<div data-lang="r" markdown="1">
-{% highlight r %}
-
-# Determine six waiting times with the largest eruption time in minutes.
-result <- gapplyCollect(
-    df,
-    "waiting",
-    function(key, x) {
-        y <- data.frame(key, max(x$eruptions))
-        colnames(y) <- c("waiting", "max_eruption")
-        y
-    })
-head(result[order(result$max_eruption, decreasing = TRUE), ])
-
-##    waiting   max_eruption
-##1      64          5.100
-##2      69          5.067
-##3      71          5.033
-##4      87          5.000
-##5      63          4.933
-##6      89          4.900
-
-{% endhighlight %}
-</div>
-
 #### Run local R functions distributed using `spark.lapply`
 
 ##### spark.lapply