  1. Sep 14, 2017
    • [SPARK-21513][SQL][FOLLOWUP] Allow UDF to_json support converting MapType to json for PySpark and SparkR · a28728a9
      goldmedal authored
      
      ## What changes were proposed in this pull request?
      In the previous work, SPARK-21513, we allowed `MapType` and `ArrayType` of `MapType`s to be converted to a JSON string, but only for the Scala API. In this follow-up PR, we make Spark SQL support it for PySpark and SparkR, too. We also fix some small bugs and comments from the previous work.
      
      ### For PySpark
      ```
      >>> data = [(1, {"name": "Alice"})]
      >>> df = spark.createDataFrame(data, ("key", "value"))
      >>> df.select(to_json(df.value).alias("json")).collect()
      [Row(json=u'{"name":"Alice"}')]
      >>> data = [(1, [{"name": "Alice"}, {"name": "Bob"}])]
      >>> df = spark.createDataFrame(data, ("key", "value"))
      >>> df.select(to_json(df.value).alias("json")).collect()
      [Row(json=u'[{"name":"Alice"},{"name":"Bob"}]')]
      ```
      ### For SparkR
      ```
      # Converts a map into a JSON object
      df2 <- sql("SELECT map('name', 'Bob')) as people")
      df2 <- mutate(df2, people_json = to_json(df2$people))
      # Converts an array of maps into a JSON array
      df2 <- sql("SELECT array(map('name', 'Bob'), map('name', 'Alice')) as people")
      df2 <- mutate(df2, people_json = to_json(df2$people))
      ```
      ## How was this patch tested?
      Add unit test cases.
      
      cc viirya HyukjinKwon
      
      Author: goldmedal <liugs963@gmail.com>
      
      Closes #19223 from goldmedal/SPARK-21513-fp-PySaprkAndSparkR.
  2. Sep 03, 2017
    • [SPARK-21897][PYTHON][R] Add unionByName API to DataFrame in Python and R · 07fd68a2
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to add a wrapper for the `unionByName` API to R and Python as well.
      
      **Python**
      
      ```python
      df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
      df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])
      df1.unionByName(df2).show()
      ```
      
      ```
      +----+----+----+
      |col0|col1|col2|
      +----+----+----+
      |   1|   2|   3|
      |   6|   4|   5|
      +----+----+----+
      ```
      
      **R**
      
      ```R
      df1 <- select(createDataFrame(mtcars), "carb", "am", "gear")
      df2 <- select(createDataFrame(mtcars), "am", "gear", "carb")
      head(unionByName(limit(df1, 2), limit(df2, 2)))
      ```
      
      ```
        carb am gear
      1    4  1    4
      2    4  1    4
      3    4  1    4
      4    4  1    4
      ```
      
      ## How was this patch tested?
      
      Doctests for Python and a unit test added in `test_sparkSQL.R` for R.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19105 from HyukjinKwon/unionByName-r-python.
  3. Aug 22, 2017
    • [SPARK-21584][SQL][SPARKR] Update R method for summary to call new implementation · 5c9b3017
      Andrew Ray authored
      ## What changes were proposed in this pull request?
      
      SPARK-21100 introduced a new `summary` method to the Scala/Java Dataset API that included expanded statistics (vs `describe`) and control over which statistics to compute. Currently, in the R API, `summary` acts as an alias for `describe`. This patch updates the R API to call the new `summary` method in the JVM, which includes the additional statistics and the ability to select which ones to compute.
      
      This does not break the current interface, as the present `summary` method does not take additional arguments the way `describe` does, and its output was never meant to be used programmatically.
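      
      As a hedged illustration (not taken from the PR), selecting statistics could look like the following; the dataset is just a convenient built-in, and the call assumes the R wrapper forwards the requested statistic names to the JVM `summary` implementation:
      
      ```r
      # A minimal sketch: summary() with no extra arguments computes the
      # expanded default statistics; passing names computes only those.
      df <- createDataFrame(faithful)
      head(summary(df))                               # expanded defaults
      head(summary(df, "count", "min", "25%", "max")) # selected statistics
      ```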
      
      ## How was this patch tested?
      
      Modified and additional unit tests.
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #18786 from aray/summary-r.
  4. Aug 03, 2017
    • [SPARK-21602][R] Add map_keys and map_values functions to R · 97ba4918
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR adds `map_values` and `map_keys` to the R API.
      
      ```r
      > df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))
      > tmp <- mutate(df, v = create_map(df$model, df$cyl))
      > head(select(tmp, map_keys(tmp$v)))
      ```
      ```
              map_keys(v)
      1         Mazda RX4
      2     Mazda RX4 Wag
      3        Datsun 710
      4    Hornet 4 Drive
      5 Hornet Sportabout
      6           Valiant
      ```
      ```r
      > head(select(tmp, map_values(tmp$v)))
      ```
      ```
        map_values(v)
      1             6
      2             6
      3             4
      4             6
      5             8
      6             6
      ```
      
      ## How was this patch tested?
      
      Manual tests and unit tests in `R/pkg/tests/fulltests/test_sparkSQL.R`
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18809 from HyukjinKwon/map-keys-values-r.
  5. Jul 31, 2017
    • [SPARK-21381][SPARKR] SparkR: pass on setHandleInvalid for classification algorithms · 9570e81a
      wangmiao1981 authored
      ## What changes were proposed in this pull request?
      
      SPARK-20307 added the `handleInvalid` option to `RFormula` for tree-based classification algorithms. We should add this parameter to the other classification algorithms in SparkR as well.
      
      This is a follow-up PR for SPARK-20307.
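      
      As a hedged sketch (not from the PR), usage could look like this; the dataset and formula are illustrative, and it assumes the classifier wrappers accept the same values ("error", "skip", "keep") as `RFormula`:
      
      ```r
      # A minimal sketch: pass handleInvalid through a SparkR classification
      # wrapper so rows with invalid/unseen categorical values are skipped
      # instead of raising an error at fit time.
      df <- createDataFrame(iris)
      model <- spark.logit(df, Species ~ ., handleInvalid = "skip")
      summary(model)
      ```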
      
      ## How was this patch tested?
      
      New unit tests are added.
      
      Author: wangmiao1981 <wm624@hotmail.com>
      
      Closes #18605 from wangmiao1981/class.
  6. Jul 13, 2017
    • [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10 · 425c4ada
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      - Remove Scala 2.10 build profiles and support
      - Replace some 2.10 support in scripts with commented placeholders for 2.12 later
      - Remove deprecated API calls from 2.10 support
      - Remove usages of deprecated context bounds where possible
      - Remove Scala 2.10 workarounds like ScalaReflectionLock
      - Other minor Scala warning fixes
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17150 from srowen/SPARK-19810.
  7. Jul 10, 2017
    • [SPARK-21266][R][PYTHON] Support schema a DDL-formatted string in dapply/gapply/from_json · 2bfd5acc
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR supports a schema in a DDL-formatted string for `from_json` in R/Python and for `dapply` and `gapply` in R, which is commonly used and/or consistent with the Scala APIs.
      
      Additionally, this PR exposes `structType` in R to allow working around other possible corner cases.
      
      **Python**
      
      `from_json`
      
      ```python
      from pyspark.sql.functions import from_json
      
      data = [(1, '''{"a": 1}''')]
      df = spark.createDataFrame(data, ("key", "value"))
      df.select(from_json(df.value, "a INT").alias("json")).show()
      ```
      
      **R**
      
      `from_json`
      
      ```R
      df <- sql("SELECT named_struct('name', 'Bob') as people")
      df <- mutate(df, people_json = to_json(df$people))
      head(select(df, from_json(df$people_json, "name STRING")))
      ```
      
      `structType.character`
      
      ```R
      structType("a STRING, b INT")
      ```
      
      `dapply`
      
      ```R
      dapply(createDataFrame(list(list(1.0)), "a"), function(x) {x}, "a DOUBLE")
      ```
      
      `gapply`
      
      ```R
      gapply(createDataFrame(list(list(1.0)), "a"), "a", function(key, x) { x }, "a DOUBLE")
      ```
      
      ## How was this patch tested?
      
      Doc tests for `from_json` in Python and unit tests in `test_sparkSQL.R` for R.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18498 from HyukjinKwon/SPARK-21266.
  8. Jun 25, 2017
    • [SPARK-21093][R] Terminate R's worker processes in the parent of R's daemon to prevent a leak · 6b3d0228
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      `mcfork` in R appears to open a pipe ahead of forking, but the existing logic does not properly close it when executed hot (repeatedly in quick succession). This eventually makes further forking fail because the limit on the number of open files is reached.
      
      This hot execution path shows up particularly with `gapply`/`gapplyCollect`. For unknown reasons, this happens more easily on CentOS, and it could be reproduced on Mac too.
      
      All the details are described in https://issues.apache.org/jira/browse/SPARK-21093
      
      This PR proposes simply to terminate R's worker processes in the parent of R's daemon to prevent a leak.
      
      ## How was this patch tested?
      
      I ran the code below on both CentOS and Mac with that configuration disabled/enabled.
      
      ```r
      df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
      collect(gapply(df, "a", function(key, x) { x }, schema(df)))
      collect(gapply(df, "a", function(key, x) { x }, schema(df)))
      ...  # 30 times
      ```
      
      Also, now it passes R tests on CentOS as below:
      
      ```
      SparkSQL functions: Spark package found in SPARK_HOME: .../spark
      ..............................................................................................................................................................
      ..............................................................................................................................................................
      ..............................................................................................................................................................
      ..............................................................................................................................................................
      ..............................................................................................................................................................
      ....................................................................................................................................
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18320 from HyukjinKwon/SPARK-21093.
  9. Jun 20, 2017
    • [SPARK-20929][ML] LinearSVC should use its own threshold param · cc67bd57
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      LinearSVC should use its own threshold param, rather than the shared one, since it applies to rawPrediction instead of probability.  This PR changes the param in the Scala, Python and R APIs.
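      
      As a hedged sketch (not from the PR), in the SparkR wrapper any real value becomes a valid threshold, since it cuts on the raw prediction rather than a probability bounded in [0, 1]; the dataset and formula are illustrative:
      
      ```r
      # A minimal sketch: threshold applies to rawPrediction, so a negative
      # value is legal and simply shifts the decision boundary.
      df <- createDataFrame(iris[iris$Species != "setosa", ])
      model <- spark.svmLinear(df, Species ~ ., regParam = 0.01, threshold = -0.5)
      ```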
      
      ## How was this patch tested?
      
      New unit test to make sure the threshold can be set to any Double value.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #18151 from jkbradley/ml-2.2-linearsvc-cleanup.
  10. Jun 18, 2017
    • [SPARK-20892][SPARKR] Add SQL trunc function to SparkR · 110ce1f2
      actuaryzhang authored
      ## What changes were proposed in this pull request?
      
      Add the SQL `trunc` function to SparkR.
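      
      A hedged example of the new function (not from the PR); the input date is illustrative:
      
      ```r
      # A minimal sketch: trunc() truncates a date column to the unit named
      # by the format string, e.g. "year" or "month".
      df <- createDataFrame(data.frame(d = as.Date("2017-06-18")))
      head(select(df, trunc(df$d, "year"), trunc(df$d, "month")))
      ```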
      
      ## How was this patch tested?
      standard test
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #18291 from actuaryzhang/sparkRTrunc2.
    • [SPARK-21128][R] Remove both "spark-warehouse" and "metastore_db" before listing files in R tests · 05f83c53
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to list the files in the test _after_ removing both "spark-warehouse" and "metastore_db", so that the next run of the R tests passes. The leftover directories are sometimes a bit annoying.
      
      ## How was this patch tested?
      
      Manually running multiple times R tests via `./R/run-tests.sh`.
      
      **Before**
      
      Second run:
      
      ```
      SparkSQL functions: Spark package found in SPARK_HOME: .../spark
      ...............................................................................................................................................................
      ...............................................................................................................................................................
      ...............................................................................................................................................................
      ...............................................................................................................................................................
      ...............................................................................................................................................................
      ....................................................................................................1234.......................
      
      Failed -------------------------------------------------------------------------
      1. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384)
      length(list1) not equal to length(list2).
      1/1 mismatches
      [1] 25 - 23 == 2
      
      2. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384)
      sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
      10/25 mismatches
      x[16]: "metastore_db"
      y[16]: "pkg"
      
      x[17]: "pkg"
      y[17]: "R"
      
      x[18]: "R"
      y[18]: "README.md"
      
      x[19]: "README.md"
      y[19]: "run-tests.sh"
      
      x[20]: "run-tests.sh"
      y[20]: "SparkR_2.2.0.tar.gz"
      
      x[21]: "metastore_db"
      y[21]: "pkg"
      
      x[22]: "pkg"
      y[22]: "R"
      
      x[23]: "R"
      y[23]: "README.md"
      
      x[24]: "README.md"
      y[24]: "run-tests.sh"
      
      x[25]: "run-tests.sh"
      y[25]: "SparkR_2.2.0.tar.gz"
      
      3. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388)
      length(list1) not equal to length(list2).
      1/1 mismatches
      [1] 25 - 23 == 2
      
      4. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388)
      sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
      10/25 mismatches
      x[16]: "metastore_db"
      y[16]: "pkg"
      
      x[17]: "pkg"
      y[17]: "R"
      
      x[18]: "R"
      y[18]: "README.md"
      
      x[19]: "README.md"
      y[19]: "run-tests.sh"
      
      x[20]: "run-tests.sh"
      y[20]: "SparkR_2.2.0.tar.gz"
      
      x[21]: "metastore_db"
      y[21]: "pkg"
      
      x[22]: "pkg"
      y[22]: "R"
      
      x[23]: "R"
      y[23]: "README.md"
      
      x[24]: "README.md"
      y[24]: "run-tests.sh"
      
      x[25]: "run-tests.sh"
      y[25]: "SparkR_2.2.0.tar.gz"
      
      DONE ===========================================================================
      ```
      
      **After**
      
      Second run:
      
      ```
      SparkSQL functions: Spark package found in SPARK_HOME: .../spark
      ...............................................................................................................................................................
      ...............................................................................................................................................................
      ...............................................................................................................................................................
      ...............................................................................................................................................................
      ...............................................................................................................................................................
      ...............................................................................................................................
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18335 from HyukjinKwon/SPARK-21128.
  11. Jun 16, 2017
    • [MINOR][DOCS] Improve Running R Tests docs · 45824fb6
      Yuming Wang authored
      ## What changes were proposed in this pull request?
      
      Update the Running R Tests dependency packages to:
      ```bash
      R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 'survival'), repos='http://cran.us.r-project.org')"
      ```
      
      ## How was this patch tested?
      manual tests
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #18271 from wangyum/building-spark.
  12. Jun 15, 2017
    • [SPARK-20980][SQL] Rename `wholeFile` to `multiLine` for both CSV and JSON · 20514281
      Xiao Li authored
      ### What changes were proposed in this pull request?
      The current option name `wholeFile` is misleading for CSV users: it does not mean one record per whole file, since a single file can contain multiple records. Thus, we should rename it; the proposal is `multiLine`.
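      
      A hedged sketch of the renamed option in use from R (not from the PR); the path is illustrative:
      
      ```r
      # A minimal sketch: read JSON where a single record spans multiple
      # lines, using the new option name.
      df <- read.json("examples/src/main/resources/people.json", multiLine = TRUE)
      ```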
      
      ### How was this patch tested?
      N/A
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #18202 from gatorsmile/renameCVSOption.
  13. Jun 11, 2017
    • [SPARK-20877][SPARKR][FOLLOWUP] clean up after test move · 9f4ff955
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Clean up after the big test move.
      
      ## How was this patch tested?
      
      unit tests, jenkins
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #18267 from felixcheung/rtestset2.
    • [SPARK-20877][SPARKR] refactor tests to basic tests only for CRAN · dc4c3518
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Move all existing tests to a non-installed directory so that they are never run by installing the SparkR package.
      
      For a follow-up PR:
      - remove all skip_on_cran() calls in tests
      - clean up test timer
      - improve or change basic tests that do run on CRAN (if anyone has suggestion)
      
      It looks like `R CMD build pkg` will still put `pkg/tests` (i.e. the full tests) into the source package, but `R CMD INSTALL` on such a source package does not install these tests (and so `R CMD check` does not run them).
      
      ## How was this patch tested?
      
      - [x] unit tests, Jenkins
      - [x] AppVeyor
      - [x] make a source package, install it, `R CMD check` it - verify the full tests are not installed or run
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #18264 from felixcheung/rtestset.
  14. Jun 09, 2017
    • [SPARK-21042][SQL] Document Dataset.union is resolution by position · b78e3849
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      Document that `Dataset.union` resolves columns by position, not by name, since this has been a confusing point for many users.
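      
      A hedged R illustration of the documented behavior (not from the PR), contrasting it with `unionByName` from SPARK-21897 above:
      
      ```r
      # A minimal sketch: union() matches columns by position, so df2's
      # reordered schema silently mixes data; unionByName matches by name.
      df1 <- createDataFrame(data.frame(a = 1, b = 2))
      df2 <- createDataFrame(data.frame(b = 3, a = 4))
      head(union(df1, df2))        # positional: second row is a=3, b=4
      head(unionByName(df1, df2))  # by name:    second row is a=4, b=3
      ```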
      
      ## How was this patch tested?
      N/A - doc only change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #18256 from rxin/SPARK-21042.