Commits · 708129187a460aca30790281e9221c0cd5e271df · cs525-sp18-g07 / spark

Dec 07, 2015

[SPARK-12034][SPARKR] Eliminate warnings in SparkR test cases. · 39d677c8

Sun Rui authored 9 years ago

This PR:
1. Suppress all known warnings.
2. Clean up test cases and fix some errors in test cases.
3. Fix errors in HiveContext-related test cases. These test cases were not actually run previously due to a bug in creating TestHiveContext.
4. Support 'testthat' package version 0.11.0, which prefers that test cases be under 'tests/testthat'.
5. Make sure the default Hadoop file system is local when running test cases.
6. Turn warnings into errors.
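Turning warnings into errors can be illustrated with base R alone. The sketch below is not taken from the PR; it assumes the standard `warn` option (a value of 2 makes R raise warnings as errors) and uses a hypothetical `parse_count` helper as the code under test.

```r
# Hypothetical helper standing in for code under test.
parse_count <- function(x) as.integer(x)

old <- options(warn = 2)  # warn = 2: warnings are converted to errors

result <- tryCatch(
  parse_count("not a number"),            # would normally warn and return NA
  error = function(e) conditionMessage(e) # reached because the warning became an error
)

options(old)  # restore the previous warning level
print(result)
```

With the default warning level the call would return `NA` and only emit a warning, so a test suite could pass while hiding the problem.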

Author: Sun Rui <rui.sun@intel.com>

Closes #10030 from sun-rui/SPARK-12034.

39d677c8

Dec 06, 2015

[SPARK-12044][SPARKR] Fix usage of isnan, isNaN · b6e8e63a

Yanbo Liang authored 9 years ago

1. Add ```isNaN``` to ```Column``` for SparkR. ```Column``` should have three related functions: ```isNaN, isNull, isNotNull```.
2. Replace ```DataFrame.isNaN``` with ```DataFrame.isnan``` on the SparkR side, because ```DataFrame.isNaN``` has been deprecated and will be removed in Spark 2.0.
<del>3. Add ```isnull``` to ```DataFrame``` for SparkR. ```DataFrame``` should have two related functions: ```isnan, isnull```.</del>
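As a rough base R analogy (not SparkR code, and not part of the PR), the difference between a NaN check and a missing-value check can be seen with `is.nan` and `is.na` on a local vector. One caveat: base R's `is.na` also reports `NaN`, whereas in Spark a NaN is an ordinary double value rather than a null.

```r
x <- c(1.5, NaN, NA)

nan_flags <- is.nan(x)       # TRUE only for the NaN element
missing_flags <- is.na(x)    # TRUE for both NaN and NA in base R
present_flags <- !is.na(x)   # the "not missing" counterpart

print(nan_flags)
print(missing_flags)
print(present_flags)
```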

cc shivaram sun-rui felixcheung

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10037 from yanboliang/spark-12044.

b6e8e63a

Dec 05, 2015

[SPARK-12115][SPARKR] Change numPartitions() to getNumPartitions() to be... · 6979edf4

Yanbo Liang authored 9 years ago

[SPARK-12115][SPARKR] Change numPartitions() to getNumPartitions() to be consistent with Scala/Python

Change ```numPartitions()``` to ```getNumPartitions()``` to be consistent with Scala/Python.
<del>Note: If we cannot catch up with the 1.6 release, it will be a breaking change for 1.7 that we also need to explain in the release notes.</del>

cc sun-rui felixcheung shivaram

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10123 from yanboliang/spark-12115.

6979edf4

[SPARK-11715][SPARKR] Add R support corr for Column Aggregation · 895b6c47

felixcheung authored 9 years ago

Need to match existing method signature

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9680 from felixcheung/rcorr.

895b6c47

[SPARK-11774][SPARKR] Implement struct(), encode(), decode() functions in SparkR. · c8d0e160

Sun Rui authored 9 years ago

Author: Sun Rui <rui.sun@intel.com>

Closes #9804 from sun-rui/SPARK-11774.

c8d0e160

Dec 03, 2015

[SPARK-12104][SPARKR] collect() does not handle multiple columns with same name. · 5011f264

Sun Rui authored 9 years ago

Author: Sun Rui <rui.sun@intel.com>

Closes #10118 from sun-rui/SPARK-12104.

5011f264

[SPARK-12019][SPARKR] Support character vector for sparkR.init(), check param and fix doc · 2213441e

felixcheung authored 9 years ago

and add tests.
Spark submit expects comma-separated list

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #10034 from felixcheung/sparkrinitdoc.

2213441e

Nov 29, 2015

[SPARK-11781][SPARKR] SparkR has problem in inferring type of raw type. · cc7a1bc9

Sun Rui authored 9 years ago

Author: Sun Rui <rui.sun@intel.com>

Closes #9769 from sun-rui/SPARK-11781.

cc7a1bc9

Nov 28, 2015

[SPARK-9319][SPARKR] Add support for setting column names, types · c793d2d9

felixcheung authored 9 years ago

Add support for colnames, colnames<-, coltypes<-
Also added tests for names, names<-, which had no tests previously.

I merged with PR 8984 (coltypes). Clicked the wrong thing, screwed up the PR. Recreated it here. Was #9218

shivaram sun-rui

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9654 from felixcheung/colnamescoltypes.

c793d2d9

[SPARK-12029][SPARKR] Improve column functions signature, param check, tests,... · 28e46ab4

felixcheung authored 9 years ago

[SPARK-12029][SPARKR] Improve column functions signature, param check, tests, fix doc and add examples

shivaram sun-rui

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #10019 from felixcheung/rfunctionsdoc.

28e46ab4

Nov 27, 2015

[SPARK-12025][SPARKR] Rename some window rank function names for SparkR · ba02f6cb

Yanbo Liang authored 9 years ago

Change ```cumeDist -> cume_dist, denseRank -> dense_rank, percentRank -> percent_rank, rowNumber -> row_number``` on the SparkR side.
There are two reasons why we should make this change:
* We should follow the [naming convention rule of R](http://www.inside-r.org/node/230645)
* Spark DataFrame has deprecated the old convention (such as ```cumeDist```) and will remove it in Spark 2.0.

It's better to fix this issue before the 1.6 release; otherwise we will be making a breaking API change.
cc shivaram sun-rui

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10016 from yanboliang/SPARK-12025.

ba02f6cb

Nov 20, 2015

[SPARK-11756][SPARKR] Fix use of aliases - SparkR can not output help... · a6239d58

felixcheung authored 9 years ago

[SPARK-11756][SPARKR] Fix use of aliases - SparkR can not output help information for SparkR:::summary correctly

Fix use of aliases and change uses of rdname and seealso.
`aliases` is the hint for `?`; it should not be linked to some other name. Those should be seealso.
https://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html

Clean up usage of family, as multiple uses of family with the same rdname cause duplicated See Also html blocks (like http://spark.apache.org/docs/latest/api/R/count.html)
Also change some rdname for dplyr-like variants for better R user visibility in the R doc, e.g. rbind, summary, mutate, summarize

shivaram yanboliang

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9750 from felixcheung/rdocaliases.

a6239d58

Nov 19, 2015

[SPARK-11339][SPARKR] Document the list of functions in R base package that... · 1a93323c

felixcheung authored 9 years ago

[SPARK-11339][SPARKR] Document the list of functions in R base package that are masked by functions with same name in SparkR

Added tests for functions that are reported as masked, to make sure the base:: or stats:: function can be called.

For those we can't call, added them to the SparkR programming guide.
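A minimal base R sketch of what calling the masked function explicitly looks like. The local `sample` definition is a hypothetical stand-in for a package function with the same name; it is not SparkR's implementation.

```r
# Hypothetical stand-in for a package function that masks base::sample.
sample <- function(x, withReplacement, fraction) {
  stop("this stand-in expects a distributed data set, not a vector")
}

# The fully qualified name bypasses the mask and reaches base R.
picked <- base::sample(1:10, 3)
print(length(picked))

rm(sample)  # remove the stand-in so later code sees base::sample again
```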

It seems to me that `table, sample, subset, filter, cov` not working is not actually expected. I investigated and experimented with them but couldn't get them to work. It looks like, as they are defined in base or stats, they are missing the S3 generic, e.g.
```
> methods("transform")
[1] transform,ANY-method       transform.data.frame
[3] transform,DataFrame-method transform.default
see '?methods' for accessing help and source code
> methods("subset")
[1] subset.data.frame       subset,DataFrame-method subset.default
[4] subset.matrix
see '?methods' for accessing help and source code
Warning message:
In .S3methods(generic.function, class, parent.frame()) :
  function 'subset' appears not to be S3 generic; found functions that look like S3 methods
```
Any idea?

More information on masking:
http://www.ats.ucla.edu/stat/r/faq/referencing_objects.htm
http://www.sfu.ca/~sweldon/howTo/guide4.pdf

This is what the output doc looks like (minus css):
![image](https://cloud.githubusercontent.com/assets/8969467/11229714/2946e5de-8d4d-11e5-94b0-dda9696b6fdd.png)

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9785 from felixcheung/rmasked.

1a93323c

Nov 18, 2015

[SPARK-11684][R][ML][DOC] Update SparkR glm API doc, user guide and example codes · e222d758

Yanbo Liang authored 9 years ago

This PR includes:
* Update SparkR:::glm, SparkR:::summary API docs.
* Update SparkR machine learning user guide and example code to show:
  * supporting feature interaction in R formula.
  * summary for gaussian GLM model.
  * coefficients for binomial GLM model.

mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9727 from yanboliang/spark-11684.

e222d758

[SPARK-11773][SPARKR] Implement collection functions in SparkR. · 224723e6

Sun Rui authored 9 years ago

Author: Sun Rui <rui.sun@intel.com>

Closes #9764 from sun-rui/SPARK-11773.

224723e6

[SPARK-11281][SPARKR] Add tests covering the issue. · a97d6f3a

zero323 authored 9 years ago

The goal of this PR is to add tests covering the issue to ensure that it was resolved by [SPARK-11086](https://issues.apache.org/jira/browse/SPARK-11086).

Author: zero323 <matthew.szymkiewicz@gmail.com>

Closes #9743 from zero323/SPARK-11281-tests.

a97d6f3a

[SPARK-11755][R] SparkR should export "predict" · 8fb775ba

Yanbo Liang authored 9 years ago

This fixes the bug described at [SPARK-11755](https://issues.apache.org/jira/browse/SPARK-11755). After exporting ```predict```, we can get the help information from both the SparkR and base R packages, like the following:
```
> help(predict)
Help on topic ‘predict’ was found in the following packages:

  Package               Library
  SparkR                /Users/yanboliang/data/trunk2/spark/R/lib
  stats                 /Library/Frameworks/R.framework/Versions/3.2/Resources/library

Choose one

1: Make predictions from a model {SparkR}
2: Model Predictions {stats}
```

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9732 from yanboliang/spark-11755.

8fb775ba

Nov 15, 2015

[SPARK-10500][SPARKR] sparkr.zip cannot be created if /R/lib is unwritable · 835a79d7

Sun Rui authored 9 years ago

The basic idea is that:
The archive of the SparkR package itself, that is sparkr.zip, is created during the build process and is contained in the Spark binary distribution. It is not changed after the distribution is installed, as the directory it resides in ($SPARK_HOME/R/lib) may not be writable.

When there is R source code contained in jars or Spark packages specified with the "--jars" or "--packages" command line option, a temporary directory is created by calling Utils.createTempDir(), where the R packages built from the R source code will be installed. The temporary directory is writable, does not interfere with those of other SparkR sessions, and is deleted when the SparkR session ends. The R binary packages installed in the temporary directory are then packed into an archive named rpkg.zip.

sparkr.zip and rpkg.zip are distributed to the cluster in YARN modes.

The distribution of rpkg.zip in Standalone modes is not supported in this PR, and will be addressed in another PR.

Various R files are updated to accept multiple lib paths (one for the SparkR package, the other for other R packages) so that these packages can be accessed in R.
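On the R side, making packages in a second library directory visible generally comes down to extending the library search path. The following base R sketch is an assumption about the general mechanism, not the code changed in this PR; `session_lib` is a hypothetical stand-in for the per-session temporary directory.

```r
# Hypothetical per-session library directory (stand-in for the temp dir above).
session_lib <- tempfile(pattern = "rpkg-")
dir.create(session_lib)

old_paths <- .libPaths()
.libPaths(c(session_lib, old_paths))  # search the temporary library first

# .libPaths() stores normalized paths, so compare against the normalized form.
on_path <- normalizePath(session_lib, "/") %in% .libPaths()
print(on_path)

# Clean up, as would happen when the session ends.
.libPaths(old_paths)
unlink(session_lib, recursive = TRUE)
```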

Author: Sun Rui <rui.sun@intel.com>

Closes #9390 from sun-rui/SPARK-10500.

835a79d7

[SPARK-11086][SPARKR] Use dropFactors column-wise instead of nested loop when createDataFrame · d7d9fa0b

zero323 authored 9 years ago

Use `dropFactors` column-wise instead of nested loop when `createDataFrame` from a `data.frame`

At this moment SparkR createDataFrame uses a nested loop to convert factors to character when called on a local data.frame. It works but is incredibly slow, especially with data.table (~2 orders of magnitude slower than the PySpark / Pandas version on a DataFrame of 1M rows x 2 columns).

A simple improvement is to apply `dropFactors` column-wise and then reshape the output list.
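The column-wise idea can be sketched in base R as follows. This is an illustration under assumptions, not the code from the PR: `drop_factor` is a hypothetical helper, and the reshape step is one of several possible ways to turn columns into rows.

```r
# Hypothetical helper: convert a factor column to character, leave others as-is.
drop_factor <- function(col) if (is.factor(col)) as.character(col) else col

df <- data.frame(id = 1:3, label = factor(c("a", "b", "a")))

# One pass per column instead of visiting every cell in a nested loop.
cols <- lapply(df, drop_factor)

# Reshape the list of columns into a list of rows.
rows <- lapply(seq_len(nrow(df)), function(i) lapply(cols, `[[`, i))

str(rows[[2]])
```

The conversion cost now scales with the number of columns for the type check, while the per-element work is done by vectorized `as.character`.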

It should at least partially address [SPARK-8277](https://issues.apache.org/jira/browse/SPARK-8277).

Author: zero323 <matthew.szymkiewicz@gmail.com>

Closes #9099 from zero323/SPARK-11086.

d7d9fa0b

Nov 12, 2015

[SPARK-11263][SPARKR] lintr Throws Warnings on Commented Code in Documentation · ed04846e

felixcheung authored 9 years ago

Clean out hundreds of `style: Commented code should be removed.` from lintr

Like these:
```
/opt/spark-1.6.0-bin-hadoop2.6/R/pkg/R/DataFrame.R:513:3: style: Commented code should be removed.
# sc <- sparkR.init()
  ^~~~~~~~~~~~~~~~~~~
/opt/spark-1.6.0-bin-hadoop2.6/R/pkg/R/DataFrame.R:514:3: style: Commented code should be removed.
# sqlContext <- sparkRSQL.init(sc)
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/spark-1.6.0-bin-hadoop2.6/R/pkg/R/DataFrame.R:515:3: style: Commented code should be removed.
# path <- "path/to/file.json"
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~
```

tried without export or rdname, neither works
instead, added `#' @noRd` to suppress .Rd file generation

also updated `family` for DataFrame functions to use longer descriptive text instead of `dataframe_funcs`
![image](https://cloud.githubusercontent.com/assets/8969467/10933937/17bf5b1e-8291-11e5-9777-40fc632105dc.png)

this covers *most* of 'Commented code' but I left out a few that look legitimate.

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9463 from felixcheung/rlintr.

ed04846e

[SPARK-11420] Updating Stddev support via Imperative Aggregate · d292f748

JihongMa authored 9 years ago

switched stddev support from DeclarativeAggregate to ImperativeAggregate.

Author: JihongMa <linlin200605@gmail.com>

Closes #9380 from JihongMA/SPARK-11420.

d292f748

Nov 11, 2015

[SPARK-11468] [SPARKR] add stddev/variance agg functions for Column · 1a8e0468

felixcheung authored 9 years ago

Checked names, none of them should conflict with anything in base

shivaram davies rxin

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9489 from felixcheung/rstddev.

1a8e0468

Nov 10, 2015

[ML][R] SparkR::glm summary result to compare with native R · f14e9511

Yanbo Liang authored 9 years ago

Follow-up to #9561. Since [SPARK-11587](https://issues.apache.org/jira/browse/SPARK-11587) has been fixed, we should compare the SparkR::glm summary result with native R output rather than a hard-coded one. mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9590 from yanboliang/glm-r-test.

f14e9511

[SPARK-10863][SPARKR] Method coltypes() (New version) · 47735cdc

Oscar D. Lara Yejas authored 9 years ago

This is a follow-up to PR #8984, as the corresponding branch for that PR was damaged.

Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu>

Closes #9579 from olarayej/SPARK-10863_NEW14.

47735cdc

[SPARK-9830][SQL] Remove AggregateExpression1 and Aggregate Operator used to... · e0701c75

Yin Huai authored 9 years ago

[SPARK-9830][SQL] Remove AggregateExpression1 and Aggregate Operator used to evaluate AggregateExpression1s

https://issues.apache.org/jira/browse/SPARK-9830

This PR contains the following main changes.
* Removing `AggregateExpression1`.
* Removing `Aggregate` operator, which is used to evaluate `AggregateExpression1`.
* Removing planner rule used to plan `Aggregate`.
* Linking `MultipleDistinctRewriter` to analyzer.
* Renaming `AggregateExpression2` to `AggregateExpression` and `AggregateFunction2` to `AggregateFunction`.
* Updating places where we create aggregate expression. The way to create aggregate expressions is `AggregateExpression(aggregateFunction, mode, isDistinct)`.
* Changing `val`s in `DeclarativeAggregate`s that touch children of this function to `lazy val`s (when we create aggregate expression in DataFrame API, children of an aggregate function can be unresolved).

Author: Yin Huai <yhuai@databricks.com>

Closes #9556 from yhuai/removeAgg1.

e0701c75

Nov 09, 2015

[SPARK-11587][SPARKR] Fix the summary generic to match base R · c4e19b38

Shivaram Venkataraman authored 9 years ago

The signature is summary(object, ...) as defined in
https://stat.ethz.ch/R-manual/R-devel/library/base/html/summary.html
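The base signature can be checked directly, and a method that keeps the same leading arguments dispatches cleanly. The sketch below uses a hypothetical S3 class `toy_model` for brevity; it does not reproduce how SparkR itself declares the generic.

```r
# The base generic takes (object, ...).
generic_args <- names(formals(summary))
print(generic_args)

# Hypothetical S3 class whose method keeps the same leading arguments.
summary.toy_model <- function(object, ...) {
  list(n_coefficients = length(object$coefficients))
}

model <- structure(list(coefficients = c(a = 1, b = 2)), class = "toy_model")
s <- summary(model)  # dispatches to summary.toy_model
print(s$n_coefficients)
```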

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #9582 from shivaram/summary-fix.

c4e19b38

[SPARK-9865][SPARKR] Flaky SparkR test: test_sparkSQL.R: sample on a DataFrame · cd174882

felixcheung authored 9 years ago

Make sample test less flaky by setting the seed

Tested with
```
repeat {  if (count(sample(df, FALSE, 0.1)) == 3) { break } }
```
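The effect of fixing the seed can be shown with base R's local `sample`, as an analogy only; the SparkR test operates on a distributed DataFrame, and `draw` here is a hypothetical helper.

```r
# Hypothetical helper: reset the RNG state, then sample.
draw <- function(seed) {
  set.seed(seed)       # fix the RNG state before sampling
  sample(1:100, 10)
}

first <- draw(42)
second <- draw(42)

# Same seed, same draw, so an assertion on the result is stable across runs.
print(identical(first, second))
```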

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9549 from felixcheung/rsample.

cd174882

[SPARK-11494][ML][R] Expose R-like summary statistics in SparkR::glm for linear regression · 8c0e1b50

Yanbo Liang authored 9 years ago

Expose R-like summary statistics in SparkR::glm for linear regression. The output of ```summary``` looks like:
```
$DevianceResiduals
 Min        Max
 -0.9509607 0.7291832

$Coefficients
                   Estimate   Std. Error t value   Pr(>|t|)
(Intercept)        1.6765     0.2353597  7.123139  4.456124e-11
Sepal_Length       0.3498801  0.04630128 7.556598  4.187317e-12
Species_versicolor -0.9833885 0.07207471 -13.64402 0
Species_virginica  -1.00751   0.09330565 -10.79796 0
```

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9561 from yanboliang/spark-11494.

8c0e1b50

Nov 06, 2015

[SPARK-10116][CORE] XORShiftRandom.hashSeed is random in high bits · 49f1a820

Imran Rashid authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-10116

This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`.

mengxr mkolod

Author: Imran Rashid <irashid@cloudera.com>

Closes #8314 from squito/SPARK-10116.

49f1a820

Nov 05, 2015

[SPARK-11542] [SPARKR] fix glm with long formula · 24401062

Davies Liu authored 9 years ago

Because deparse() will break a long string into multiple lines, the deserialization will fail.
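The behaviour is easy to reproduce in base R. The sketch below builds a deliberately long formula with made-up term names and shows one way to get a single string back; whether the PR uses exactly this approach is an assumption.

```r
# Made-up term names, long enough to exceed deparse()'s default line width.
terms <- paste0("feature_number_", 1:20)
f <- as.formula(paste("y ~", paste(terms, collapse = " + ")))

pieces <- deparse(f)   # a character vector with one element per output line
print(length(pieces))

# Collapse the pieces so the formula travels as one string.
single <- paste(pieces, collapse = "")
print(length(single))
```

The collapsed string still parses back to a formula with the same variables, since the extra whitespace from the line continuation is ignored by the parser.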

Author: Davies Liu <davies@databricks.com>

Closes #9510 from davies/fix_glm.

24401062

[SPARK-11260][SPARKR] with() function support · b9455d1f

adrian555 authored 9 years ago

Author: adrian555 <wzhuang@us.ibm.com>
Author: Adrian Zhuang <adrian555@users.noreply.github.com>

Closes #9443 from adrian555/with.

b9455d1f

Nov 04, 2015

[SPARK-9492][ML][R] LogisticRegression in R should provide model statistics · e328b69c

Yanbo Liang authored 9 years ago

Like ml ```LinearRegression```, ```LogisticRegression``` should provide a training summary including feature names and their coefficients.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9303 from yanboliang/spark-9492.

e328b69c

Nov 03, 2015

[DOC] Missing link to R DataFrame API doc · d648a4ad

lewuathe authored 9 years ago

Author: lewuathe <lewuathe@me.com>
Author: Lewuathe <lewuathe@me.com>

Closes #9394 from Lewuathe/missing-link-to-R-dataframe.

d648a4ad

Nov 02, 2015

[SPARK-10592] [ML] [PySpark] Deprecate weights and use coefficients instead in ML models · c020f7d9

vectorijk authored 9 years ago

Deprecated in `LogisticRegression` and `LinearRegression`

Author: vectorijk <jiangkai@gmail.com>

Closes #9311 from vectorijk/spark-10592.

c020f7d9

Oct 30, 2015

[SPARK-11340][SPARKR] Support setting driver properties when starting Spark... · bb5a2af0

felixcheung authored 9 years ago

[SPARK-11340][SPARKR] Support setting driver properties when starting Spark from R programmatically or from RStudio

Mapping spark.driver.memory from sparkEnvir to spark-submit commandline arguments.
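A minimal sketch of such a mapping in base R. The helper `to_submit_args` and the lookup table are hypothetical, written only to illustrate the idea; `--driver-memory` is the spark-submit option that corresponds to `spark.driver.memory`.

```r
# Assumed lookup from Spark property names to spark-submit flags.
driver_flags <- c("spark.driver.memory" = "--driver-memory")

# Hypothetical helper: turn matching entries of `env` into command-line text.
to_submit_args <- function(env) {
  known <- intersect(names(env), names(driver_flags))
  parts <- vapply(
    known,
    function(key) paste(driver_flags[[key]], env[[key]]),
    character(1)
  )
  paste(parts, collapse = " ")
}

submit_args <- to_submit_args(
  list(spark.driver.memory = "2g", spark.app.name = "demo")
)
print(submit_args)
```

Properties without an entry in the table are left out of the command line, which leaves them to be set through configuration instead.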

shivaram suggested that we possibly add other spark.driver.* properties - do we want to add all of those? I thought those could be set in SparkConf?
sun-rui

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9290 from felixcheung/rdrivermem.

bb5a2af0

[SPARK-11210][SPARKR] Add window functions into SparkR [step 2]. · 40c77fb2

Sun Rui authored 9 years ago

Author: Sun Rui <rui.sun@intel.com>

Closes #9196 from sun-rui/SPARK-11210.

40c77fb2

Oct 29, 2015

[SPARK-11409][SPARKR] Enable url link in R doc for Persist · d89be0bf

felixcheung authored 9 years ago

Quick one line doc fix
link is not clickable
![image](https://cloud.githubusercontent.com/assets/8969467/10833041/4e91dd7c-7e4c-11e5-8905-713b986dbbde.png)

shivaram

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9363 from felixcheung/rpersistdoc.

d89be0bf

Oct 28, 2015

[SPARK-11369][ML][R] SparkR glm should support setting standardize · fba9e954

Yanbo Liang authored 9 years ago

SparkR glm currently supports:
```formula, family = c("gaussian", "binomial"), data, lambda = 0, alpha = 0```
We should also support setting standardize, which is defined in the [design documentation](https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit).

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9331 from yanboliang/spark-11369.

fba9e954

Oct 26, 2015

[SPARK-11209][SPARKR] Add window functions into SparkR [step 1]. · dc3220ce

Sun Rui authored 9 years ago

Author: Sun Rui <rui.sun@intel.com>

Closes #9193 from sun-rui/SPARK-11209.

dc3220ce

[SPARK-10979][SPARKR] Sparkrmerge: Add merge to DataFrame with R signature · 3689beb9

Narine Kokhlikyan authored 9 years ago

Add a merge function to DataFrame that supports the R signature.
https://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html
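For reference, this is how the base R signature behaves on local data.frames. The example uses only base `merge` with made-up data; how closely the SparkR method follows each argument is not stated here and should not be inferred from it.

```r
left <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"))
right <- data.frame(id = c(2, 3, 4), score = c(20, 30, 40))

# Inner join on the shared key column.
inner <- merge(left, right, by = "id")

# Keep every row of the left side; unmatched rows get NA on the right columns.
left_outer <- merge(left, right, by = "id", all.x = TRUE)

print(inner)
print(left_outer)
```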

Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>

Closes #9012 from NarineK/sparkrmerge.

3689beb9