  1. Oct 20, 2016
    • Felix Cheung's avatar
      [SPARKR] fix warnings · 3180272d
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Fix for a bunch of test warnings that were added recently.
      We need to investigate why warnings are not turning into errors.
      
      ```
      Warnings -----------------------------------------------------------------------
      1. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Sepal_Length instead of Sepal.Length  as column name
      
      2. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Sepal_Width instead of Sepal.Width  as column name
      
      3. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Petal_Length instead of Petal.Length  as column name
      
      4. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Petal_Width instead of Petal.Width  as column name
      
      Consider adding
        importFrom("utils", "object.size")
      to your NAMESPACE file.
      ```
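
      For context, a minimal sketch (not part of this PR) of how user code can avoid the column-name warnings above: rename dotted column names before calling `createDataFrame`.

      ```r
      # Assumes a running SparkR session; illustrative only, not from the actual patch.
      localIris <- iris
      names(localIris) <- gsub("\\.", "_", names(localIris))
      df <- createDataFrame(localIris)  # no "Use Sepal_Length instead of Sepal.Length" warning
      ```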
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #15560 from felixcheung/rwarnings.
      3180272d
  2. Oct 12, 2016
    • Hossein's avatar
      [SPARK-17790][SPARKR] Support for parallelizing R data.frame larger than 2GB · 5cc503f4
      Hossein authored
      ## What changes were proposed in this pull request?
      If the R data structure being parallelized is larger than `INT_MAX`, we use files to transfer the data to the JVM. The serialization protocol mimics Python pickling, which allows us to simply call `PythonRDD.readRDDFromFile` to create the RDD.
      
      I tested this on my MacBook. The following code works with this patch:
      ```R
      intMax <- .Machine$integer.max
      largeVec <- 1:intMax
      rdd <- SparkR:::parallelize(sc, largeVec, 2)
      ```
      
      ## How was this patch tested?
      * [x] Unit tests
      
      Author: Hossein <hossein@databricks.com>
      
      Closes #15375 from falaki/SPARK-17790.
      5cc503f4
  3. Oct 11, 2016
    • Wenchen Fan's avatar
      [SPARK-17720][SQL] introduce static SQL conf · b9a14718
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      SQLConf is session-scoped and mutable. However, we also need static SQL confs, which are global and immutable, e.g. the `schemaStringThreshold` in `HiveExternalCatalog`, the flag to enable/disable Hive support, and the global temp view database in https://github.com/apache/spark/pull/14897.
      
      We have effectively implemented static SQL confs implicitly via `SparkConf` already; this PR just makes them explicit and exposes them to users, so that they can see the config values via SQL commands or `SparkSession.conf`, and it forbids users from setting/unsetting static SQL confs.
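
      For illustration, a hedged SparkR sketch of the new user-facing behavior (assumption: `spark.sql.catalogImplementation` is one of the static confs; the exact error message may differ):

      ```r
      library(SparkR)
      sparkR.session()
      # Reading a static conf still works, e.g. via a SQL command.
      collect(sql("SET spark.sql.catalogImplementation"))
      # Setting a static conf is now expected to be rejected.
      tryCatch(sql("SET spark.sql.catalogImplementation=in-memory"),
               error = function(e) message("rejected: ", conditionMessage(e)))
      ```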
      
      ## How was this patch tested?
      
      new tests in SQLConfSuite
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15295 from cloud-fan/global-conf.
      b9a14718
    • Yanbo Liang's avatar
      [SPARK-15153][ML][SPARKR] Fix SparkR spark.naiveBayes error when label is numeric type · 23405f32
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Fix the SparkR ```spark.naiveBayes``` error when the response variable of the dataset is of numeric type.
      See details and how to reproduce this bug at [SPARK-15153](https://issues.apache.org/jira/browse/SPARK-15153).
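
      For reference, a hedged repro sketch based on the JIRA description (data and column names are made up): a numeric response with categorical features should now work.

      ```r
      # Assumes a running SparkR session; illustrative data only.
      local <- data.frame(label = c(0, 0, 1, 1),
                          f1 = c("a", "a", "b", "b"),
                          stringsAsFactors = FALSE)
      df <- createDataFrame(local)
      model <- spark.naiveBayes(df, label ~ f1)  # previously errored because label is numeric
      summary(model)
      ```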
      
      ## How was this patch tested?
      Add unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15431 from yanboliang/spark-15153-2.
      23405f32
  4. Oct 07, 2016
    • hyukjinkwon's avatar
      [SPARK-17665][SPARKR] Support options/mode all for read/write APIs and options in other types · 9d8ae853
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR includes the changes below:
      
        - Support `mode`/`options` in `read.parquet`, `write.parquet`, `read.orc`, `write.orc`, `read.text`, `write.text`, `read.json` and `write.json` APIs
      
        - Support other types (logical, numeric and string) as options for `write.df`, `read.df`, `read.parquet`, `write.parquet`, `read.orc`, `write.orc`, `read.text`, `write.text`, `read.json` and `write.json`
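
      A hedged usage sketch of the expanded APIs above (paths and option choices are illustrative, assuming a running SparkR session):

      ```r
      df <- createDataFrame(mtcars)
      write.json(df, "/tmp/mtcars_json", mode = "overwrite")        # mode now supported here
      write.parquet(df, "/tmp/mtcars_parquet", mode = "overwrite")
      write.df(df, "/tmp/mtcars_csv", source = "csv", mode = "overwrite", header = TRUE)
      df2 <- read.df("/tmp/mtcars_csv", source = "csv",
                     header = TRUE, inferSchema = TRUE)             # logical options, not only strings
      ```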
      
      ## How was this patch tested?
      
      Unit tests in `test_sparkSQL.R`/ `utils.R`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15239 from HyukjinKwon/SPARK-17665.
      9d8ae853
  5. Oct 05, 2016
    • hyukjinkwon's avatar
      [SPARK-17658][SPARKR] read.df/write.df API taking path optionally in SparkR · c9fe10d4
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      The `write.df`/`read.df` APIs require a path, which is not actually always necessary in Spark. This only affects data sources implementing `CreatableRelationProvider`; Spark currently has no internal data sources implementing it, but it would affect external data sources.
      
      In addition, we would be able to use this in Spark's JDBC data source once https://github.com/apache/spark/pull/12601 is merged.
      
      **Before**
      
       - `read.df`
      
        ```r
      > read.df(source = "json")
      Error in dispatchFunc("read.df(path = NULL, source = NULL, schema = NULL, ...)",  :
        argument "x" is missing with no default
      ```
      
        ```r
      > read.df(path = c(1, 2))
      Error in dispatchFunc("read.df(path = NULL, source = NULL, schema = NULL, ...)",  :
        argument "x" is missing with no default
      ```
      
        ```r
      > read.df(c(1, 2))
      Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
        java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.String
      	at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:300)
      	at
      ...
      In if (is.na(object)) { :
      ...
      ```
      
       - `write.df`
      
        ```r
      > write.df(df, source = "json")
      Error in (function (classes, fdef, mtable)  :
        unable to find an inherited method for function ‘write.df’ for signature ‘"function", "missing"’
      ```
      
        ```r
      > write.df(df, source = c(1, 2))
      Error in (function (classes, fdef, mtable)  :
        unable to find an inherited method for function ‘write.df’ for signature ‘"SparkDataFrame", "missing"’
      ```
      
        ```r
      > write.df(df, mode = TRUE)
      Error in (function (classes, fdef, mtable)  :
        unable to find an inherited method for function ‘write.df’ for signature ‘"SparkDataFrame", "missing"’
      ```
      
      **After**
      
      - `read.df`
      
        ```r
      > read.df(source = "json")
      Error in loadDF : analysis error - Unable to infer schema for JSON at . It must be specified manually;
      ```
      
        ```r
      > read.df(path = c(1, 2))
      Error in f(x, ...) : path should be charactor, null or omitted.
      ```
      
        ```r
      > read.df(c(1, 2))
      Error in f(x, ...) : path should be charactor, null or omitted.
      ```
      
      - `write.df`
      
        ```r
      > write.df(df, source = "json")
      Error in save : illegal argument - 'path' is not specified
      ```
      
        ```r
      > write.df(df, source = c(1, 2))
      Error in .local(df, path, ...) :
        source should be charactor, null or omitted. It is 'parquet' by default.
      ```
      
        ```r
      > write.df(df, mode = TRUE)
      Error in .local(df, path, ...) :
        mode should be charactor or omitted. It is 'error' by default.
      ```
      
      ## How was this patch tested?
      
      Unit tests in `test_sparkSQL.R`
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15231 from HyukjinKwon/write-default-r.
      c9fe10d4
  6. Oct 04, 2016
  7. Sep 27, 2016
    • hyukjinkwon's avatar
      [SPARK-17499][SPARKR][FOLLOWUP] Check null first for layers in spark.mlp to... · 4a833956
      hyukjinkwon authored
      [SPARK-17499][SPARKR][FOLLOWUP] Check null first for layers in spark.mlp to avoid warnings in test results
      
      ## What changes were proposed in this pull request?
      
      Some tests in `test_mllib.r` are as below:
      
      ```r
      expect_error(spark.mlp(df, layers = NULL), "layers must be a integer vector with length > 1.")
      expect_error(spark.mlp(df, layers = c()), "layers must be a integer vector with length > 1.")
      ```
      
      The problem is that `is.na` is internally called via `na.omit` in `spark.mlp`, which causes warnings like the ones below:
      
      ```
      Warnings -----------------------------------------------------------------------
      1. spark.mlp (test_mllib.R#400) - is.na() applied to non-(list or vector) of type 'NULL'
      
      2. spark.mlp (test_mllib.R#401) - is.na() applied to non-(list or vector) of type 'NULL'
      ```
      
      ## How was this patch tested?
      
      Manually tested. Also, Jenkins tests and AppVeyor.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15232 from HyukjinKwon/remove-warnnings.
      4a833956
  8. Sep 26, 2016
  9. Sep 23, 2016
    • Jeff Zhang's avatar
      [SPARK-17210][SPARKR] sparkr.zip is not distributed to executors when running sparkr in RStudio · f62ddc59
      Jeff Zhang authored
      ## What changes were proposed in this pull request?
      
      Spark adds sparkr.zip to the archives only in YARN mode (SparkSubmit.scala).
      ```scala
          if (args.isR && clusterManager == YARN) {
            val sparkRPackagePath = RUtils.localSparkRPackagePath
            if (sparkRPackagePath.isEmpty) {
              printErrorAndExit("SPARK_HOME does not exist for R application in YARN mode.")
            }
            val sparkRPackageFile = new File(sparkRPackagePath.get, SPARKR_PACKAGE_ARCHIVE)
            if (!sparkRPackageFile.exists()) {
              printErrorAndExit(s"$SPARKR_PACKAGE_ARCHIVE does not exist for R application in YARN mode.")
            }
            val sparkRPackageURI = Utils.resolveURI(sparkRPackageFile.getAbsolutePath).toString
      
            // Distribute the SparkR package.
            // Assigns a symbol link name "sparkr" to the shipped package.
            args.archives = mergeFileLists(args.archives, sparkRPackageURI + "#sparkr")
      
            // Distribute the R package archive containing all the built R packages.
            if (!RUtils.rPackages.isEmpty) {
              val rPackageFile =
                RPackageUtils.zipRLibraries(new File(RUtils.rPackages.get), R_PACKAGE_ARCHIVE)
              if (!rPackageFile.exists()) {
                printErrorAndExit("Failed to zip all the built R packages.")
              }
      
              val rPackageURI = Utils.resolveURI(rPackageFile.getAbsolutePath).toString
              // Assigns a symbol link name "rpkg" to the shipped package.
              args.archives = mergeFileLists(args.archives, rPackageURI + "#rpkg")
            }
          }
      ```
      So it is necessary to pass spark.master from the R process to the JVM; otherwise sparkr.zip won't be distributed to the executors. Besides that, I also pass spark.yarn.keytab/spark.yarn.principal to the Spark side, because the JVM process needs them to access a secured cluster.
      
      ## How was this patch tested?
      
      Verified manually in RStudio using the following code.
      ```
      Sys.setenv(SPARK_HOME="/Users/jzhang/github/spark")
      .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
      library(SparkR)
      sparkR.session(master="yarn-client", sparkConfig = list(spark.executor.instances="1"))
      df <- as.DataFrame(mtcars)
      head(df)
      
      ```
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #14784 from zjffdu/SPARK-17210.
      f62ddc59
    • WeichenXu's avatar
      [SPARK-17499][SPARKR][ML][MLLIB] make the default params in sparkR spark.mlp... · f89808b0
      WeichenXu authored
      [SPARK-17499][SPARKR][ML][MLLIB] make the default params in sparkR spark.mlp consistent with MultilayerPerceptronClassifier
      
      ## What changes were proposed in this pull request?
      
      Update the `MultilayerPerceptronClassifierWrapper.fit` parameter types:
      `layers: Array[Int]`
      `seed: String`
      
      Update several default params in SparkR `spark.mlp`:
      `tol` --> 1e-6
      `stepSize` --> 0.03
      `seed` --> NULL (when seed == NULL, the Scala-side wrapper regards it as a `null` value and the default seed is used)
      The R-side `seed` only supports 32-bit integers.
      
      Remove the default value of `layers` and move it in front of the parameters that have default values.
      Add a validation check for the `layers` parameter.
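
      A hedged sketch of calling `spark.mlp` with the updated interface (the data path follows the standard MLlib sample data; argument layout and defaults are as summarized above):

      ```r
      # Assumes a running SparkR session started from SPARK_HOME.
      df <- read.df("data/mllib/sample_multiclass_classification_data.txt", source = "libsvm")
      # layers no longer has a default and now comes right after the data argument.
      model <- spark.mlp(df, layers = c(4, 5, 4, 3),
                         tol = 1e-6, stepSize = 0.03,  # the new defaults, written out explicitly
                         seed = 1L)                    # R-side seed is a 32-bit integer; NULL keeps the Scala default
      summary(model)
      ```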
      
      ## How was this patch tested?
      
      tests added.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #15051 from WeichenXu123/update_py_mlp_default.
      f89808b0
  10. Sep 22, 2016
    • Shivaram Venkataraman's avatar
      Skip building R vignettes if Spark is not built · 9f24a17c
      Shivaram Venkataraman authored
      ## What changes were proposed in this pull request?
      
      When we build the docs separately, we don't have the JAR files from the Spark build in the same tree. As the SparkR vignettes need to launch a SparkContext to be built, we skip building them if the JAR files don't exist.
      
      ## How was this patch tested?
      
      To test this we can run the following:
      ```
      build/mvn -DskipTests -Psparkr clean
      ./R/create-docs.sh
      ```
      You should see the line `Skipping R vignettes as Spark JARs not found` at the end.
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #15200 from shivaram/sparkr-vignette-skip.
      9f24a17c
  11. Sep 21, 2016
  12. Sep 19, 2016
  13. Sep 14, 2016
  14. Sep 13, 2016
    • junyangq's avatar
      [SPARK-17317][SPARKR] Add SparkR vignette · a454a4d8
      junyangq authored
      ## What changes were proposed in this pull request?
      
      This PR adds a SparkR vignette that serves as a friendly guide to the functionality provided by SparkR.
      
      ## How was this patch tested?
      
      Manual test.
      
      Author: junyangq <qianjunyang@gmail.com>
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      Author: Junyang Qian <junyangq@databricks.com>
      
      Closes #14980 from junyangq/SPARKR-vignette.
      a454a4d8
  15. Sep 10, 2016
  16. Sep 09, 2016
  17. Sep 08, 2016
  18. Sep 07, 2016
  19. Sep 03, 2016
    • Junyang Qian's avatar
      [SPARK-17315][SPARKR] Kolmogorov-Smirnov test SparkR wrapper · abb2f921
      Junyang Qian authored
      ## What changes were proposed in this pull request?
      
      This PR adds a Kolmogorov-Smirnov test wrapper to SparkR. The wrapper currently supports only the one-sample test against the normal distribution.
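
      A hedged usage sketch (the wrapper name `spark.kstest` and its arguments are assumed from the JIRA, not quoted from this PR):

      ```r
      # Assumes a running SparkR session.
      df <- createDataFrame(data.frame(test = rnorm(100)))
      ksResult <- spark.kstest(df, "test", "norm", c(0, 1))  # one-sample test against N(0, 1)
      summary(ksResult)
      ```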
      
      ## How was this patch tested?
      
      R unit test.
      
      Author: Junyang Qian <junyangq@databricks.com>
      
      Closes #14881 from junyangq/SPARK-17315.
      abb2f921
  20. Sep 02, 2016
    • Junyang Qian's avatar
      [SPARKR][MINOR] Fix docs for sparkR.session and count · d2fde6b7
      Junyang Qian authored
      ## What changes were proposed in this pull request?
      
      This PR adds some more explanation to `sparkR.session`. It also modifies the doc for `count` so that, when grouped in one doc, the description doesn't confuse users.
      
      ## How was this patch tested?
      
      Manual test.
      
      ![screen shot 2016-09-02 at 1 21 36 pm](https://cloud.githubusercontent.com/assets/15318264/18217198/409613ac-7110-11e6-8dae-cb0c8df557bf.png)
      
      Author: Junyang Qian <junyangq@databricks.com>
      
      Closes #14942 from junyangq/fixSparkRSessionDoc.
      d2fde6b7
    • Srinath Shankar's avatar
      [SPARK-17298][SQL] Require explicit CROSS join for cartesian products · e6132a6c
      Srinath Shankar authored
      ## What changes were proposed in this pull request?
      
      Require the use of CROSS join syntax in SQL (and a new crossJoin
      DataFrame API) to specify explicit cartesian products between relations.
      By cartesian product we mean a join between relations R and S where
      there is no join condition involving columns from both R and S.
      
      If a cartesian product is detected in the absence of an explicit CROSS
      join, an error must be thrown. Turning on the
      "spark.sql.crossJoin.enabled" configuration flag will disable this check
      and allow cartesian products without an explicit CROSS join.
      
      The new crossJoin DataFrame API must be used to specify explicit cross joins. The existing join(DataFrame) method will produce an INNER join that will require a subsequent join condition. That is, `df1.join(df2)` is equivalent to `select * from df1, df2`.
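
      As an illustration, a hedged sketch of the new behavior from SparkR via SQL (the config key is quoted from this description; the exact error text may differ):

      ```r
      # Assumes a running SparkR session.
      df1 <- createDataFrame(data.frame(a = 1:3))
      df2 <- createDataFrame(data.frame(b = 4:6))
      createOrReplaceTempView(df1, "t1")
      createOrReplaceTempView(df2, "t2")
      # An implicit cartesian product is now rejected at planning time...
      tryCatch(collect(sql("SELECT * FROM t1, t2")),
               error = function(e) message("rejected: ", conditionMessage(e)))
      # ...while an explicit CROSS JOIN is allowed.
      collect(sql("SELECT * FROM t1 CROSS JOIN t2"))
      ```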
      
      ## How was this patch tested?
      
      Added cross-join.sql to the SQLQueryTestSuite to test the check for cartesian products. Added a couple of tests to the DataFrameJoinSuite to test the crossJoin API. Modified various other test suites to explicitly specify a cross join where an INNER join or a comma-separated list was previously used.
      
      Author: Srinath Shankar <srinath@databricks.com>
      
      Closes #14866 from srinathshankar/crossjoin.
      e6132a6c
    • Felix Cheung's avatar
      [SPARK-17376][SPARKR] followup - change since version · eac1d0e9
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Change the `since` version in the docs.
      
      ## How was this patch tested?
      
      manual
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #14939 from felixcheung/rsparkversion2.
      eac1d0e9
    • Felix Cheung's avatar
      [SPARKR][DOC] regexp_extract should doc that it returns empty string when match fails · 419eefd8
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Doc change - see https://issues.apache.org/jira/browse/SPARK-16324
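
      A hedged illustration of the documented behavior (made-up data, assuming a running SparkR session):

      ```r
      df <- createDataFrame(data.frame(s = c("a1", "b"), stringsAsFactors = FALSE))
      # The second row has no digit, so the extracted value is "" rather than NA.
      head(select(df, regexp_extract(df$s, "(\\d+)", 1)))
      ```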
      
      ## How was this patch tested?
      
      manual check
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #14934 from felixcheung/regexpextractdoc.
      419eefd8
    • Felix Cheung's avatar
      [SPARK-17376][SPARKR] Spark version should be available in R · 812333e4
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Add sparkR.version() API.
      
      ```
      > sparkR.version()
      [1] "2.1.0-SNAPSHOT"
      ```
      
      ## How was this patch tested?
      
      manual, unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #14935 from felixcheung/rsparksessionversion.
      812333e4
    • wm624@hotmail.com's avatar
      [SPARK-16883][SPARKR] SQL decimal type is not properly cast to number when... · 0f30cded
      wm624@hotmail.com authored
      [SPARK-16883][SPARKR] SQL decimal type is not properly cast to number when collecting SparkDataFrame
      
      ## What changes were proposed in this pull request?
      
      registerTempTable(createDataFrame(iris), "iris")
      str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y  from iris limit 5")))
      
      'data.frame':	5 obs. of  2 variables:
       $ x: num  1 1 1 1 1
       $ y:List of 5
        ..$ : num 2
        ..$ : num 2
        ..$ : num 2
        ..$ : num 2
        ..$ : num 2
      
      The problem is that Spark returns the column type `decimal(10, 0)` instead of `decimal`, so `decimal(10, 0)` is not handled correctly. It should be handled as "double".
      
      As discussed in the JIRA thread, there are two potential fixes:
      1) A Scala-side fix that adds a new case when writing the object back. However, I can't use spark.sql.types._ in Spark Core due to dependency issues, and I don't see a way to do the type-case match.
      
      2) A SparkR-side fix: add a helper function that checks special types like `"decimal(10, 0)"` and replaces them with `double`, which is a primitive type. This helper is generic, so handling for new types can be added in the future.
      
      I opened this PR to discuss the pros and cons of both approaches. If we go with the Scala-side fix, we need to find a way to match DecimalType and StructType in Spark Core.
      
      ## How was this patch tested?
      
      Manual test:
      > str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y  from iris limit 5")))
      'data.frame':	5 obs. of  2 variables:
       $ x: num  1 1 1 1 1
       $ y: num  2 2 2 2 2
      R Unit tests
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #14613 from wangmiao1981/type.
      0f30cded
  21. Aug 31, 2016
  22. Aug 29, 2016
    • Shivaram Venkataraman's avatar
      [SPARK-16581][SPARKR] Make JVM backend calling functions public · 736a7911
      Shivaram Venkataraman authored
      ## What changes were proposed in this pull request?
      
      This change exposes a public API in SparkR to create objects and call methods on the Spark driver JVM.
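
      A hedged sketch of the kind of calls the new public API enables (the names `sparkR.newJObject`, `sparkR.callJMethod` and `sparkR.callJStatic` are taken from the JIRA discussion, so treat them as assumptions here):

      ```r
      sparkR.session()
      arrList <- sparkR.newJObject("java.util.ArrayList")   # construct a JVM object from R
      sparkR.callJMethod(arrList, "add", 1L)                 # invoke an instance method
      sparkR.callJStatic("java.lang.Math", "max", 1L, 2L)    # invoke a static method
      ```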
      
      ## How was this patch tested?
      
      Unit tests, CRAN checks
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #14775 from shivaram/sparkr-java-api.
      736a7911
    • Junyang Qian's avatar
      [SPARKR][MINOR] Fix LDA doc · 6a0fda2c
      Junyang Qian authored
      ## What changes were proposed in this pull request?
      
      This PR fixes the name of the `SparkDataFrame` used in the example. It also gives a reference URL for an example data file that users can play with.
      
      ## How was this patch tested?
      
      Manual test.
      
      Author: Junyang Qian <junyangq@databricks.com>
      
      Closes #14853 from junyangq/SPARKR-FixLDADoc.
      6a0fda2c
  23. Aug 26, 2016
    • Junyang Qian's avatar
      [SPARKR][MINOR] Fix example of spark.naiveBayes · 18832162
      Junyang Qian authored
      ## What changes were proposed in this pull request?
      
      The original example doesn't work because the features are not categorical. This PR fixes this by changing to another dataset.
      
      ## How was this patch tested?
      
      Manual test.
      
      Author: Junyang Qian <junyangq@databricks.com>
      
      Closes #14820 from junyangq/SPARK-FixNaiveBayes.
      18832162
  24. Aug 24, 2016
    • Junyang Qian's avatar
      [SPARKR][MINOR] Add installation message for remote master mode and improve other messages · 3a60be4b
      Junyang Qian authored
      ## What changes were proposed in this pull request?
      
      This PR gives an informative message to users when they try to connect to a remote master but don't have the Spark package on their local machine.
      
      As a clarification, for now automatic installation only happens if they start SparkR in the R console (rather than from sparkr-shell) and connect to a local master. In remote master mode, a local Spark package is still needed, but we do not trigger the install.spark function because the versions have to match those on the cluster, which involves more user input. Instead, we try to provide a detailed message that may help users.
      
      Some of the other messages have also been slightly changed.
      
      ## How was this patch tested?
      
      Manual test.
      
      Author: Junyang Qian <junyangq@databricks.com>
      
      Closes #14761 from junyangq/SPARK-16579-V1.
      3a60be4b
    • Junyang Qian's avatar
      [SPARKR][MINOR] Add more examples to window function docs · 18708f76
      Junyang Qian authored
      ## What changes were proposed in this pull request?
      
      This PR adds more examples to the window function docs to make them more accessible to users.
      
      It also fixes default value issues for `lag` and `lead`.
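
      A hedged sketch of the kind of example being added (column choices are illustrative, assuming a running SparkR session):

      ```r
      df <- createDataFrame(mtcars)
      ws <- orderBy(windowPartitionBy("cyl"), "mpg")
      # lag/lead are evaluated over a window specification via over().
      out <- select(df, df$cyl, df$mpg, over(lag(df$mpg, 1), ws), over(lead(df$mpg, 1), ws))
      head(out)
      ```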
      
      ## How was this patch tested?
      
      Manual test, R unit test.
      
      Author: Junyang Qian <junyangq@databricks.com>
      
      Closes #14779 from junyangq/SPARKR-FixWindowFunctionDocs.
      18708f76
    • Felix Cheung's avatar
      [MINOR][SPARKR] fix R MLlib parameter documentation · 945c04bc
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Fixed several misplaced param tags; they should be on the spark.* method generics.
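
      A hedged sketch of the documentation pattern in question (the generic and tags here are illustrative, not the actual diff): `@param` tags belong on the `spark.*` generic, not on the S4 method.

      ```r
      #' @param data a SparkDataFrame of training observations.
      #' @param formula a symbolic description of the model to be fitted.
      #' @rdname spark.naiveBayes
      #' @export
      setGeneric("spark.naiveBayes",
                 function(data, formula, ...) { standardGeneric("spark.naiveBayes") })
      ```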
      
      ## How was this patch tested?
      
      run knitr
      junyangq
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #14792 from felixcheung/rdocmllib.
      945c04bc