  1. Feb 14, 2017
    • [SPARK-19387][SPARKR] Tests do not run with SparkR source package in CRAN check · 7763b0b8
      Felix Cheung authored
      
      ## What changes were proposed in this pull request?
      
- this is caused by the changes in SPARK-18444 and SPARK-18643, where we no longer install Spark when `master = ""` (the default), but it is also related to SPARK-18449, since the real `master` value is not known at the time the R code in `sparkR.session` is run (`master` cannot default to "local" since it could be overridden by the spark-submit command line or spark config)
- as a result, while running SparkR as a package in an IDE works fine, the CRAN check does not, since it launches the tests via a non-interactive script
- the fix is to add a check to the beginning of each test and vignette (a sketch follows below); the same would also work by changing `sparkR.session()` to `sparkR.session(master = "local")` in tests, but I think being more explicit is better.
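A minimal R sketch of such a guard, assuming a check on `SPARK_HOME` plus the conventional `NOT_CRAN` variable (names and placement are illustrative, not the exact patch):
```
library(SparkR)

# Only start Spark when a distribution is actually available; under a
# non-interactive CRAN check neither SPARK_HOME nor NOT_CRAN is set,
# so Spark-dependent tests and vignette chunks can be skipped instead.
if (nzchar(Sys.getenv("SPARK_HOME")) || identical(Sys.getenv("NOT_CRAN"), "true")) {
  sparkR.session(master = "local[1]", enableHiveSupport = FALSE)
} else {
  message("SPARK_HOME not set; skipping Spark-dependent code")
}
```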
      
      ## How was this patch tested?
      
Tested this by reverting the version to 2.1, since the check needs to download the release jar with a matching version. But because there are changes in 2.2 (specifically around SparkR ML) that are incompatible with 2.1, some tests fail in this config. Will need to port this to branch-2.1 and retest with the 2.1 release jar.
      
      manually as:
      ```
      # modify DESCRIPTION to revert version to 2.1.0
      SPARK_HOME=/usr/spark R CMD build pkg
      # run cran check without SPARK_HOME
      R CMD check --as-cran SparkR_2.1.0.tar.gz
      ```
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16720 from felixcheung/rcranchecktest.
      
      (cherry picked from commit a3626ca3)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
  2. Feb 12, 2017
    • [SPARK-19319][BACKPORT-2.1][SPARKR] SparkR Kmeans summary returns error when... · 06e77e00
      wm624@hotmail.com authored
      [SPARK-19319][BACKPORT-2.1][SPARKR] SparkR Kmeans summary returns error when the cluster size doesn't equal to k
      
      ## What changes were proposed in this pull request?
      
      Backport fix of #16666
      
      ## How was this patch tested?
      
      Backport unit tests
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #16761 from wangmiao1981/kmeansport.
    • [SPARK-19342][SPARKR] bug fixed in collect method for collecting timestamp column · 173c2387
      titicaca authored
      
      ## What changes were proposed in this pull request?
      
Fix a bug in the collect method for collecting a timestamp column. The bug can be reproduced with the following code and output:
      
      ```
      library(SparkR)
      sparkR.session(master = "local")
      df <- data.frame(col1 = c(0, 1, 2),
                       col2 = c(as.POSIXct("2017-01-01 00:00:01"), NA, as.POSIXct("2017-01-01 12:00:01")))
      
      sdf1 <- createDataFrame(df)
      print(dtypes(sdf1))
      df1 <- collect(sdf1)
      print(lapply(df1, class))
      
      sdf2 <- filter(sdf1, "col1 > 0")
      print(dtypes(sdf2))
      df2 <- collect(sdf2)
      print(lapply(df2, class))
      ```
      
As we can see from the printed output, the column type of col2 in df2 is unexpectedly converted to numeric when an NA sits at the top of the column.

This is caused by `do.call(c, list)`: when a list is combined this way, e.g. `do.call(c, list(NA, as.POSIXct("2017-01-01 12:00:01")))`, the class of the result is numeric instead of POSIXct.

Therefore, we need to cast the data type of the vector explicitly.
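A small base-R illustration of the class-dropping behavior and the explicit cast (illustrative only, not the actual patch):
```
combined <- do.call(c, list(NA, as.POSIXct("2017-01-01 12:00:01")))
class(combined)   # "numeric" -- the POSIXct class is dropped because the first element is a plain NA

# casting the combined vector explicitly restores the intended type
restored <- as.POSIXct(combined, origin = "1970-01-01")
class(restored)   # "POSIXct" "POSIXt"
```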
      
      ## How was this patch tested?
      
      The patch can be tested manually with the same code above.
      
      Author: titicaca <fangzhou.yang@hotmail.com>
      
      Closes #16689 from titicaca/sparkr-dev.
      
      (cherry picked from commit bc0a0e63)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
  3. Jan 31, 2017
  4. Jan 27, 2017
    • [SPARK-19324][SPARKR] Spark JVM stdout output is getting dropped in SparkR · 9a49f9af
      Felix Cheung authored
      
      ## What changes were proposed in this pull request?
      
This mostly affects running a job from the driver in client mode, when results are expected to come through stdout (which should be somewhat rare, but possible).
      
      Before:
      ```
      > a <- as.DataFrame(cars)
      > b <- group_by(a, "dist")
      > c <- count(b)
      > sparkR.callJMethod(c$countjc, "explain", TRUE)
      NULL
      ```
      
      After:
      ```
      > a <- as.DataFrame(cars)
      > b <- group_by(a, "dist")
      > c <- count(b)
      > sparkR.callJMethod(c$countjc, "explain", TRUE)
      count#11L
      NULL
      ```
      
Now, `column.explain()` doesn't seem very useful (we can get more extensive output with `DataFrame.explain()`), but there are other, more complex examples with calls to `println` on the Scala/JVM side that are getting dropped.
      
      ## How was this patch tested?
      
      manual
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16670 from felixcheung/rjvmstdout.
      
      (cherry picked from commit a7ab6f9a)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    • [SPARK-19333][SPARKR] Add Apache License headers to R files · 4002ee97
      Felix Cheung authored
      
      ## What changes were proposed in this pull request?
      
      add header
      
      ## How was this patch tested?
      
Manual run to check that the vignettes HTML is created properly
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16709 from felixcheung/rfilelicense.
      
      (cherry picked from commit 385d7384)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
  5. Jan 26, 2017
  6. Jan 24, 2017
  7. Jan 18, 2017
  8. Jan 17, 2017
  9. Jan 16, 2017
  10. Jan 13, 2017
  11. Jan 11, 2017
  12. Jan 10, 2017
  13. Jan 08, 2017
  14. Dec 17, 2016
  15. Dec 16, 2016
    • [SPARK-18897][SPARKR] Fix SparkR SQL Test to drop test table · df589be5
      Dongjoon Hyun authored
      
      ## What changes were proposed in this pull request?
      
The SparkR test script, `R/run-tests.sh`, succeeds only once because `test_sparkSQL.R` does not clean up the test table, `people`.

As a result, rows accumulate in the `people` table on every run and the test cases fail.
      
      The following is the failure result for the second run.
      
      ```r
      Failed -------------------------------------------------------------------------
      1. Failure: create DataFrame from RDD (test_sparkSQL.R#204) -------------------
      collect(sql("SELECT age from people WHERE name = 'Bob'"))$age not equal to c(16).
      Lengths differ: 2 vs 1
      
      2. Failure: create DataFrame from RDD (test_sparkSQL.R#206) -------------------
      collect(sql("SELECT height from people WHERE name ='Bob'"))$height not equal to c(176.5).
      Lengths differ: 2 vs 1
      ```
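A hedged sketch of the kind of cleanup this implies at the end of the SQL tests (the exact statement and placement in `test_sparkSQL.R` may differ):
```
# drop the table registered by earlier test cases so a second run of
# R/run-tests.sh starts from a clean state instead of accumulating rows
sql("DROP TABLE IF EXISTS people")
```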
      
      ## How was this patch tested?
      
      Manual. Run `run-tests.sh` twice and check if it passes without failures.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #16310 from dongjoon-hyun/SPARK-18897.
      
      (cherry picked from commit 1169db44)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
  16. Dec 15, 2016
  17. Dec 14, 2016
  18. Dec 13, 2016
  19. Dec 12, 2016
    • [SPARK-18810][SPARKR] SparkR install.spark does not work for RCs, snapshots · 1aeb7f42
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
Support overriding the download URL (including the version directory) via an environment variable, `SPARKR_RELEASE_DOWNLOAD_URL`.
      
      ## How was this patch tested?
      
      unit test, manually testing
      - snapshot build url
        - download when spark jar not cached
        - when spark jar is cached
      - RC build url
        - download when spark jar not cached
        - when spark jar is cached
      - multiple cached spark versions
      - starting with sparkR shell
      
      To use this,
```
export SPARKR_RELEASE_DOWNLOAD_URL=http://this_is_the_url_to_spark_release_tgz
R
```
      then in R,
      ```
      library(SparkR) # or specify lib.loc
      sparkR.session()
      ```
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16248 from felixcheung/rinstallurl.
      
      (cherry picked from commit 8a51cfdc)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
  20. Dec 09, 2016
    • [SPARK-18807][SPARKR] Should suppress output print for calls to JVM methods with void return values · 8bf56cc4
      Felix Cheung authored
      
      ## What changes were proposed in this pull request?
      
Several SparkR APIs that call into JVM methods with void return values get their `NULL` results printed out, especially when running in a REPL or IDE.
Example:
      ```
      > setLogLevel("WARN")
      NULL
      ```
      We should fix this to make the result more clear.
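A hedged sketch of the general pattern for such a fix, wrapping the JVM call in `invisible()` so its `NULL` result is not echoed (the internal helpers shown are illustrative):
```
setLogLevel <- function(level) {
  sc <- getSparkContext()   # internal SparkR accessor, shown for illustration
  invisible(callJMethod(sc, "setLogLevel", level))
}
```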
      
Also found a small change to the return value of dropTempView in 2.1 - adding a doc and test for it.
      
      ## How was this patch tested?
      
Manually - I didn't find an expect_*() method in testthat for this
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16237 from felixcheung/rinvis.
      
      (cherry picked from commit 3e11d5bf)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    • [SPARK-18349][SPARKR] Update R API documentation on ml model summary · 4ceed95b
      wm624@hotmail.com authored
      
      ## What changes were proposed in this pull request?
In this PR, the documentation of the `summary` method is improved to follow the format:

returns summary information of the fitted model, which is a list. The list includes .......

Since `summary` in R is mainly about the model, which is not the same as the `summary` object on the Scala side (if there is one), the Scala API doc is not linked here.

In the current documentation, some `return` entries end with `.` and some don't; `.` is added to the ones missing it.

Since the spark.logit `summary` is undergoing a big refactoring, this PR doesn't cover it. It will be changed when the `spark.logit` PR is merged.
      
      ## How was this patch tested?
      
      Manual build.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #16150 from wangmiao1981/audit2.
      
      (cherry picked from commit 86a96034)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
  21. Dec 08, 2016
    • [SPARK-18590][SPARKR] build R source package when making distribution · d69df907
      Felix Cheung authored
This PR has 2 key changes. One, we are building a source package (aka bundle package) for SparkR which could be released on CRAN. Two, the official Spark binary distributions should include SparkR installed from this source package instead (which would have the help/vignettes rds files needed for those to work when the SparkR package is loaded in R, whereas the earlier approach with devtools did not).
      
      But, because of various differences in how R performs different tasks, this PR is a fair bit more complicated. More details below.
      
      This PR also includes a few minor fixes.
      
These are the additional steps in make-distribution; please see [here](https://github.com/apache/spark/blob/master/R/CRAN_RELEASE.md) for what goes into a CRAN release, which is now run during make-distribution.sh.
1. The package needs to be installed because the first code block in the vignettes is `library(SparkR)` without a lib path.
2. `R CMD build` will build the vignettes (this process runs Spark/SparkR code and captures the output into the pdf documentation).
3. `R CMD check` on the source package will install the package and build the vignettes again (this time from the source package) - this is a key step required to release an R package on CRAN.
 (Tests are skipped here, but tests will need to pass for the CRAN release process to succeed - ideally, during release signoff we should install from the R source package and run the tests.)
4. `R CMD INSTALL` on the source package (this is the only way to generate the doc/vignettes rds files correctly, not in step # 1).
 (The output of this step is what we package into the Spark dist and sparkr.zip.)
      
Alternatively,
   R CMD build should already be installing the package in a temp directory, so we might just find that location and set it as the lib.loc parameter; another approach is perhaps to call `R CMD INSTALL --build pkg` instead.
 But in any case, despite installing the package multiple times, this is relatively fast.
Building the vignettes takes a while though.
      
      Manually, CI.
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16014 from felixcheung/rdist.
      
      (cherry picked from commit c3d3a9d0)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    • 48aa6775
      Patrick Wendell authored
    • Preparing Spark release v2.1.0-rc2 · 08071749
      Patrick Wendell authored
  22. Dec 07, 2016
    • [SPARK-18326][SPARKR][ML] Review SparkR ML wrappers API for 2.1 · 1c3f1da8
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Reviewing SparkR ML wrappers API for 2.1 release, mainly two issues:
* Remove ```probabilityCol``` from the argument list of ```spark.logit``` and ```spark.randomForest```, since it is used when making predictions and should be an argument of ```predict```; we will work on this at [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) in the next release cycle.
* Fix ```spark.als``` params to make them consistent with MLlib.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16169 from yanboliang/spark-18326.
      
      (cherry picked from commit 97255497)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
    • [SPARK-18678][ML] Skewed reservoir sampling in SamplingUtils · 51754d6d
      Sean Owen authored
      
      ## What changes were proposed in this pull request?
      
Fix reservoir sampling bias for small k. An off-by-one error meant that the probability of replacement was slightly too high -- k/(l-1) after the l-th element instead of k/l, which matters for small k.
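A short R sketch of reservoir sampling with the corrected replacement probability k/l (illustrative only; the actual fix is in the Scala `SamplingUtils`):
```
reservoir_sample <- function(stream, k) {
  n <- length(stream)
  reservoir <- stream[seq_len(min(k, n))]
  if (n <= k) return(reservoir)
  for (l in (k + 1):n) {
    # keep element l with probability k/l (the bug effectively used k/(l-1))
    j <- sample.int(l, 1)
    if (j <= k) reservoir[j] <- stream[l]
  }
  reservoir
}
```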
      
      ## How was this patch tested?
      
      Existing test plus new test case.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16129 from srowen/SPARK-18678.
      
      (cherry picked from commit 79f5f281)
Signed-off-by: Sean Owen <sowen@cloudera.com>
    • [SPARK-18686][SPARKR][ML] Several cleanup and improvements for spark.logit. · 340e9aea
      Yanbo Liang authored
      
      ## What changes were proposed in this pull request?
      Several cleanup and improvements for ```spark.logit```:
* ```summary``` should return the coefficients matrix, and should output labels for each class if the model is a multinomial logistic regression model.
* ```summary``` should not return ```areaUnderROC, roc, pr, ...```, since most of them are DataFrames, which are less important for R users. Meanwhile, these metrics ignore instance weights (setting all to 1.0), which will change in a later Spark version. To avoid introducing breaking changes, we do not expose them currently.
* SparkR test improvement: compare the training result with native R glmnet.
* Remove the argument ```aggregationDepth``` from ```spark.logit```, since it's an expert Param (related to Spark architecture and job execution) that would rarely be used by R users.
      
      ## How was this patch tested?
      Unit tests.
      
      The ```summary``` output after this change:
      multinomial logistic regression:
      ```
      > df <- suppressWarnings(createDataFrame(iris))
      > model <- spark.logit(df, Species ~ ., regParam = 0.5)
      > summary(model)
      $coefficients
                   versicolor  virginica   setosa
      (Intercept)  1.514031    -2.609108   1.095077
      Sepal_Length 0.02511006  0.2649821   -0.2900921
      Sepal_Width  -0.5291215  -0.02016446 0.549286
      Petal_Length 0.03647411  0.1544119   -0.190886
      Petal_Width  0.000236092 0.4195804   -0.4198165
      ```
      binomial logistic regression:
      ```
      > df <- suppressWarnings(createDataFrame(iris))
      > training <- df[df$Species %in% c("versicolor", "virginica"), ]
      > model <- spark.logit(training, Species ~ ., regParam = 0.5)
      > summary(model)
      $coefficients
                   Estimate
      (Intercept)  -6.053815
      Sepal_Length 0.2449379
      Sepal_Width  0.1648321
      Petal_Length 0.4730718
      Petal_Width  1.031947
      ```
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16117 from yanboliang/spark-18686.
      
      (cherry picked from commit 90b59d1b)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
  23. Dec 04, 2016
    • [SPARK-18643][SPARKR] SparkR hangs at session start when installed as a package without Spark · c13c2939
      Felix Cheung authored
      
      ## What changes were proposed in this pull request?
      
If SparkR is running as a package and it has previously downloaded the Spark jar, it should be able to run as before without having to set SPARK_HOME. Basically, with this bug the Spark auto-install only works in the first session.

This seems to be a regression from the earlier behavior.

The fix is to always try to install or check for the cached Spark if running in an interactive session (see the sketch below).
As discussed before, we should probably install Spark only if running in an interactive session (R shell, RStudio, etc.).
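A hedged R sketch of the described logic (the helper name is hypothetical; the actual change lives in `sparkR.session`'s setup path):
```
resolveSparkHome <- function(sparkHome = Sys.getenv("SPARK_HOME")) {
  # if SPARK_HOME is unset and we are in an interactive session (R shell,
  # RStudio, ...), install Spark or reuse the previously cached download
  if (!nzchar(sparkHome) && interactive()) {
    sparkHome <- install.spark()   # returns the cached or newly installed Spark directory
  }
  sparkHome
}
```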
      
      ## How was this patch tested?
      
      Manually
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16077 from felixcheung/rsessioninteractive.
      
      (cherry picked from commit b019b3a8)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>