- Mar 21, 2017
Felix Cheung authored
## What changes were proposed in this pull request?

When SparkR is installed as an R package, there might not be any Java runtime available. If it is missing, SparkR's `sparkR.session()` blocks until the connection times out, hanging the R IDE/shell without any notification or message.

## How was this patch tested?

Manually.

- [x] need to test on Windows

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16596 from felixcheung/rcheckjava.
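The fail-fast idea behind this change can be sketched outside R. The snippet below is illustrative only and is not the SparkR implementation: `find_java` and `require_java` are hypothetical names, and the lookup order (`JAVA_HOME` first, then `PATH`) is an assumption.

```python
import os
import shutil


def find_java(env=None):
    """Return a path to a java executable, or None if none is found.

    Assumption: JAVA_HOME is consulted first, then PATH.
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if java_home:
        candidate = os.path.join(java_home, "bin", "java")
        if os.path.isfile(candidate):
            return candidate
    # An empty PATH yields None rather than searching the current directory.
    return shutil.which("java", path=env.get("PATH", ""))


def require_java(env=None):
    """Raise immediately with a clear message instead of waiting for a timeout."""
    java = find_java(env)
    if java is None:
        raise RuntimeError("Java runtime not found; set JAVA_HOME or add java to PATH")
    return java
```

Calling `require_java()` before launching a backend process turns a silent hang into an immediate, readable error.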
- Mar 20, 2017
Wenchen Fan authored
## What changes were proposed in this pull request?

Currently JSON and CSV have exactly the same logic for handling bad records. This PR abstracts that logic and moves it to an upper level to reduce code duplication.

The overall idea is to make the JSON and CSV parsers throw a `BadRecordException`; the upper level, `FailureSafeParser`, then handles bad records according to the parse mode.

Behavior changes:

1. With PERMISSIVE mode, if the number of tokens doesn't match the schema, the CSV parser previously treated it as a legal record and parsed as many tokens as possible. After this PR, it is treated as an illegal record and the raw record string is put in a special column, but as many tokens as possible are still parsed.
2. All logging is removed, as it is not very useful in practice.

## How was this patch tested?

Existing tests.

Author: Wenchen Fan <wenchen@databricks.com>
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Wenchen Fan <cloud0fan@gmail.com>

Closes #17315 from cloud-fan/bad-record2.
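The layering described above (a low-level parser that only raises, and one wrapper that applies the mode policy) can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's code: `parse_csv_line` and `failure_safe_parse` are hypothetical names, and the `DROPMALFORMED` and `FAILFAST` mode names are Spark's documented parse modes that the description above does not spell out.

```python
class BadRecordException(Exception):
    """Raised by a low-level parser; carries the raw record and any partial result."""

    def __init__(self, record, partial=None):
        super().__init__(f"bad record: {record!r}")
        self.record = record
        self.partial = partial


def parse_csv_line(line, num_fields):
    """Toy parser: split on commas and reject lines with the wrong token count."""
    tokens = line.split(",")
    if len(tokens) != num_fields:
        # Keep as many tokens as fit the schema, padding with None.
        partial = (tokens + [None] * num_fields)[:num_fields]
        raise BadRecordException(line, partial)
    return tokens


def failure_safe_parse(lines, num_fields, mode="PERMISSIVE"):
    """Apply the mode-specific policy in one place, above the raw parser."""
    rows = []
    for line in lines:
        try:
            # Good rows get None in the extra "raw record" column.
            rows.append(parse_csv_line(line, num_fields) + [None])
        except BadRecordException as e:
            if mode == "PERMISSIVE":
                # Partial tokens plus the raw record in the extra column.
                rows.append(e.partial + [e.record])
            elif mode == "DROPMALFORMED":
                continue
            elif mode == "FAILFAST":
                raise
            else:
                raise ValueError(f"unknown mode: {mode}")
    return rows
```

Because the parser itself knows nothing about modes, a second format only needs to raise the same exception to get identical bad-record handling.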
Felix Cheung authored
## What changes were proposed in this pull request? Add checkpoint, setCheckpointDir API to R ## How was this patch tested? unit tests, manual tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17351 from felixcheung/rdfcheckpoint.
hyukjinkwon authored
## What changes were proposed in this pull request?

This PR proposes to support an array of struct type in `to_json` as below:

```scala
import org.apache.spark.sql.functions._

val df = Seq(Tuple1(Tuple1(1) :: Nil)).toDF("a")
df.select(to_json($"a").as("json")).show()
```

```
+----------+
|      json|
+----------+
|[{"_1":1}]|
+----------+
```

Currently, it throws an exception as below (a newline manually inserted for readability):

```
org.apache.spark.sql.AnalysisException: cannot resolve 'structtojson(`array`)' due to data type mismatch:
structtojson requires that the expression is a struct expression.;;
```

This allows the roundtrip with `from_json` as below:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
val df = Seq("""[{"a":1}, {"a":2}]""").toDF("json").select(from_json($"json", schema).as("array"))
df.show()

// Read back.
df.select(to_json($"array").as("json")).show()
```

```
+----------+
|     array|
+----------+
|[[1], [2]]|
+----------+

+-----------------+
|             json|
+-----------------+
|[{"a":1},{"a":2}]|
+-----------------+
```

Also, this PR proposes to rename `StructToJson` to `StructsToJson` and `JsonToStruct` to `JsonToStructs`.

## How was this patch tested?

Unit tests in `JsonFunctionsSuite` and `JsonExpressionsSuite` for Scala, doctest for Python and test in `test_sparkSQL.R` for R.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17192 from HyukjinKwon/SPARK-19849.
- Mar 19, 2017
Felix Cheung authored
## What changes were proposed in this pull request?

Passes R `tempdir()` (this is the R session temp dir, shared with other temp files/dirs) to the JVM and sets the system property for the Derby home dir, so that `derby.log` is moved there.

## How was this patch tested?

Manually, unit tests.

With this, these are relocated to under /tmp:

```
# ls /tmp/RtmpG2M0cB/
derby.log
```

They are removed automatically when the R session ends.

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16330 from felixcheung/rderby.
- Mar 18, 2017
Felix Cheung authored
## What changes were proposed in this pull request? Add "experimental" API for SS in R ## How was this patch tested? manual, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16982 from felixcheung/rss.
- Mar 14, 2017
hyukjinkwon authored
## What changes were proposed in this pull request?

Since we cannot directly define the array type in R, this PR proposes to support array types in R as string types that are used in `structField`, as below:

```R
jsonArr <- "[{\"name\":\"Bob\"}, {\"name\":\"Alice\"}]"
df <- as.DataFrame(list(list("people" = jsonArr)))
collect(select(df, alias(from_json(df$people, "array<struct<name:string>>"), "arrcol")))
```

prints

```R
      arrcol
1 Bob, Alice
```

## How was this patch tested?

Unit tests in `test_sparkSQL.R`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17178 from HyukjinKwon/SPARK-19828.
actuaryzhang authored
## What changes were proposed in this pull request? Port Tweedie GLM #16344 to SparkR felixcheung yanboliang ## How was this patch tested? new test in SparkR Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #16729 from actuaryzhang/sparkRTweedie.
- Mar 12, 2017
Xin Ren authored
## What changes were proposed in this pull request?

The RandomForest and GBT R wrappers now return the param `maxDepth` to R models. The following 4 R wrappers are changed:

* `RandomForestClassificationWrapper`
* `RandomForestRegressionWrapper`
* `GBTClassificationWrapper`
* `GBTRegressionWrapper`

## How was this patch tested?

Tested manually on my local machine.

Author: Xin Ren <iamshrek@126.com>

Closes #17207 from keypointt/SPARK-19282.
- Mar 08, 2017
Xiao Li authored
### What changes were proposed in this pull request?

As observed by felixcheung in https://github.com/apache/spark/pull/16739, users of the shuffle-enabled `repartition` API expect to get exactly the number of partitions they asked for, even if they call the shuffle-disabled `coalesce` later. Currently, the `CollapseRepartition` rule does not consider whether shuffle is enabled or not, which produces the following unexpected result:

```scala
val df = spark.range(0, 10000, 1, 5)
val df2 = df.repartition(10)
assert(df2.coalesce(13).rdd.getNumPartitions == 5)
assert(df2.coalesce(7).rdd.getNumPartitions == 5)
assert(df2.coalesce(3).rdd.getNumPartitions == 3)
```

This PR fixes the issue by preserving the shuffle-enabled Repartition.

### How was this patch tested?

Added a test case.

Author: Xiao Li <gatorsmile@gmail.com>

Closes #16933 from gatorsmile/CollapseRepartition.
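The arithmetic behind the unexpected counts can be reproduced with a toy model of partition counts. This is a simplified sketch, not the optimizer rule itself: the function names are hypothetical, and it assumes only that a `coalesce` without shuffle can keep or reduce the partition count but never increase it.

```python
def coalesce_no_shuffle(current, requested):
    """Without a shuffle, coalesce can only keep or reduce the partition count."""
    return min(current, requested)


def partitions_before_fix(source, repartition_to, coalesce_to):
    # The rule collapsed repartition(n).coalesce(m) into a single coalesce(m)
    # applied to the source, so the shuffle-enabled repartition was lost.
    return coalesce_no_shuffle(source, coalesce_to)


def partitions_after_fix(source, repartition_to, coalesce_to):
    # The shuffle-enabled repartition is preserved, then coalesce is applied.
    return coalesce_no_shuffle(repartition_to, coalesce_to)


for m in (13, 7, 3):
    print(m, partitions_before_fix(5, 10, m), partitions_after_fix(5, 10, m))
```

With a 5-partition source repartitioned to 10, the "before" column matches the asserts in the description (5, 5, 3); under this model, preserving the repartition gives 10, 7 and 3 instead.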
- Mar 06, 2017
actuaryzhang authored
## What changes were proposed in this pull request? Added checks for name consistency of input data frames in union. ## How was this patch tested? new test. Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #17159 from actuaryzhang/sparkRUnion.
- Mar 05, 2017
Felix Cheung authored
## What changes were proposed in this pull request? Add column functions: to_json, from_json, and tests covering error cases. ## How was this patch tested? unit tests, manual Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17134 from felixcheung/rtojson.
- Mar 01, 2017
actuaryzhang authored
Update R doc:

1. `columns`, `names` and `colnames` return a vector of strings, not a **list** as in the current doc.
2. `colnames<-` does allow subset assignment, so the length of `value` can be less than the number of columns, e.g., `colnames(df)[1] <- "a"`.

felixcheung

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #17115 from actuaryzhang/sparkRMinorDoc.
- Feb 23, 2017
actuaryzhang authored
## What changes were proposed in this pull request?

The `[[` method is supposed to take a single index and return a column. This differs from base R, which takes a vector index. We should check for this and issue a warning or an error when a vector index is supplied (which is very likely, given the behavior in base R).

Currently I'm issuing a warning message and just taking the first element of the vector index. We could change this to an error if that's better.

## How was this patch tested?

New tests.

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #17017 from actuaryzhang/sparkRSubsetter.
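The "warn and take the first element" policy can be sketched as a small Python analogue. It is illustrative only and is not the SparkR code: `single_index` is a hypothetical helper and the warning text is invented for the example.

```python
import warnings


def single_index(index):
    """Return one index; warn and keep the first element if several are given."""
    if isinstance(index, (list, tuple)):
        if not index:
            raise ValueError("index must not be empty")
        if len(index) > 1:
            warnings.warn("subset index has length > 1; only the first element is used")
        return index[0]
    return index
```

Switching from a warning to an error, as the description suggests might be preferable, would only require replacing the `warnings.warn` call with a `raise`.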
- Feb 22, 2017
wm624@hotmail.com authored
## What changes were proposed in this pull request?

This is a follow-up PR of #16800.

When doing SPARK-19456, we found that "" should be considered a NULL column name and should not be set. aggregationDepth should be exposed as an expert parameter.

## How was this patch tested?

Existing tests.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16945 from wangmiao1981/svc.
- Feb 17, 2017
Yanbo Liang authored
## What changes were proposed in this pull request? SparkR `approxQuantile` supports multiple columns as input. ## How was this patch tested? Unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #16951 from yanboliang/spark-19619.
- Feb 15, 2017
Felix Cheung authored
## What changes were proposed in this pull request? Add coalesce on DataFrame for down partitioning without shuffle and coalesce on Column ## How was this patch tested? manual, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16739 from felixcheung/rcoalesce.
wm624@hotmail.com authored
## What changes were proposed in this pull request? Linear SVM classifier is newly added into ML and python API has been added. This JIRA is to add R side API. Marked as WIP, as I am designing unit tests. ## How was this patch tested? Please review http://spark.apache.org/contributing.html before opening a pull request. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16800 from wangmiao1981/svc.
- Feb 12, 2017
titicaca authored
## What changes were proposed in this pull request?

Fix a bug in the `collect` method when collecting a timestamp column. The bug can be reproduced with the following code:

```
library(SparkR)
sparkR.session(master = "local")
df <- data.frame(col1 = c(0, 1, 2),
                 col2 = c(as.POSIXct("2017-01-01 00:00:01"), NA, as.POSIXct("2017-01-01 12:00:01")))

sdf1 <- createDataFrame(df)
print(dtypes(sdf1))
df1 <- collect(sdf1)
print(lapply(df1, class))

sdf2 <- filter(sdf1, "col1 > 0")
print(dtypes(sdf2))
df2 <- collect(sdf2)
print(lapply(df2, class))
```

As the printed output shows, the type of col2 in df2 is unexpectedly converted to numeric when NA is at the top of the column.

This is caused by `do.call(c, list)`: when converting a list such as `do.call(c, list(NA, as.POSIXct("2017-01-01 12:00:01")))`, the class of the result is numeric instead of POSIXct.

Therefore, we need to cast the data type of the vector explicitly.

## How was this patch tested?

The patch can be tested manually with the same code above.

Author: titicaca <fangzhou.yang@hotmail.com>

Closes #16689 from titicaca/sparkr-dev.
- Feb 08, 2017
Dongjoon Hyun authored
## What changes were proposed in this pull request?

After SPARK-19464, **SparkPullRequestBuilder** fails because it still tries to use hadoop2.3.

**BEFORE**

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72595/console

```
========================================================================
Building Spark
========================================================================
[error] Could not find hadoop2.3 in the list. Valid options are ['hadoop2.6', 'hadoop2.7']
Attempting to post to Github...
 > Post successful.
```

**AFTER**

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72595/console

```
========================================================================
Building Spark
========================================================================
[info] Building Spark (w/Hive 1.2.1) using SBT with these arguments: -Phadoop-2.6 -Pmesos -Pkinesis-asl -Pyarn -Phive-thriftserver -Phive test:package streaming-kafka-0-8-assembly/assembly streaming-flume-assembly/assembly streaming-kinesis-asl-assembly/assembly
Using /usr/java/jdk1.8.0_60 as default JAVA_HOME.
Note, this will be overridden by -java-home if it is set.
```

## How was this patch tested?

Pass the existing test.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #16858 from dongjoon-hyun/hotfix_run-tests.
- Feb 07, 2017
anabranch authored
## What changes were proposed in this pull request?

This pull request adds two new user-facing functions:

- `to_date`, which accepts an expression and a format and returns a date.
- `to_timestamp`, which accepts an expression and a format and returns a timestamp.

For example, given a date in the format `2016-21-05` (YYYY-dd-MM):

### Date Function

*Previously*

```
to_date(unix_timestamp(lit("2016-21-05"), "yyyy-dd-MM").cast("timestamp"))
```

*Current*

```
to_date(lit("2016-21-05"), "yyyy-dd-MM")
```

### Timestamp Function

*Previously*

```
unix_timestamp(lit("2016-21-05"), "yyyy-dd-MM").cast("timestamp")
```

*Current*

```
to_timestamp(lit("2016-21-05"), "yyyy-dd-MM")
```

### Tasks

- [X] Add `to_date` to Scala Functions
- [x] Add `to_date` to Python Functions
- [x] Add `to_date` to SQL Functions
- [X] Add `to_timestamp` to Scala Functions
- [x] Add `to_timestamp` to Python Functions
- [x] Add `to_timestamp` to SQL Functions
- [x] Add function to R

## How was this patch tested?

- [x] Add Functions to `DateFunctionsSuite`
- Test new `ParseToTimestamp` Expression (*not necessary*)
- Test new `ParseToDate` Expression (*not necessary*)
- [x] Add test for R
- [x] Add test for Python in test.py

Author: anabranch <wac.chambers@gmail.com>
Author: Bill Chambers <bill@databricks.com>
Author: anabranch <bill@databricks.com>

Closes #16138 from anabranch/SPARK-16609.
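To make the year-day-month pattern concrete, here is the same parse done with Python's standard library. This is an analogue for illustration, not the Spark API: Python's `strptime` directives `%Y`, `%d` and `%m` stand in for the `yyyy`, `dd` and `MM` pattern letters used above.

```python
from datetime import datetime

raw = "2016-21-05"  # year-day-month, as in the example above

# Python's strptime uses %Y/%d/%m where the Spark pattern uses yyyy/dd/MM.
as_timestamp = datetime.strptime(raw, "%Y-%d-%m")
as_date = as_timestamp.date()

print(as_timestamp)  # 2016-05-21 00:00:00
print(as_date)       # 2016-05-21
```

The string therefore denotes 21 May 2016, which is why a default year-month-day parse cannot be used for it.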
- Feb 05, 2017
actuaryzhang authored
## What changes were proposed in this pull request? The names method fails to check for validity of the assignment values. This can be fixed by calling colnames within names. ## How was this patch tested? new tests. Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #16794 from actuaryzhang/sparkRNames.
- Jan 31, 2017
wm624@hotmail.com authored
## What changes were proposed in this pull request?

When k-means uses `initMode = "random"` with certain random seeds, the actual number of clusters may not equal the configured `k`.

In this case, `summary(model)` returns an error because the number of columns of the coefficient matrix doesn't equal `k`.

Example:

    > col1 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
    > col2 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
    > col3 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
    > cols <- as.data.frame(cbind(col1, col2, col3))
    > df <- createDataFrame(cols)
    >
    > model2 <- spark.kmeans(data = df, ~ ., k = 5, maxIter = 10, initMode = "random", seed = 22222, tol = 1E-5)
    >
    > summary(model2)
    Error in `colnames<-`(`*tmp*`, value = c("col1", "col2", "col3")) :
      length of 'dimnames' [2] not equal to array extent
    In addition: Warning message:
    In matrix(coefficients, ncol = k) :
      data length [9] is not a sub-multiple or multiple of the number of rows [2]

Fix: Get the actual cluster size in the summary and use it to build the coefficient matrix.

## How was this patch tested?

Add unit tests.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16666 from wangmiao1981/kmeans.
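The fix amounts to deriving the matrix shape from the coefficients actually returned rather than from the requested `k`. A minimal Python sketch of that reshaping follows; `coefficient_matrix` is a hypothetical helper and the cluster-by-cluster layout of the flat list is an assumption. The warning above reports 9 coefficients for 3 feature columns, which suggests 3 actual clusters rather than the requested 5.

```python
def coefficient_matrix(flat, num_features):
    """Reshape a flat list of cluster centers into one row per actual cluster."""
    if num_features <= 0 or len(flat) % num_features != 0:
        raise ValueError("coefficient count is not a multiple of the feature count")
    # Derive the cluster count from the data instead of trusting the configured k.
    actual_k = len(flat) // num_features
    return [flat[i * num_features:(i + 1) * num_features] for i in range(actual_k)]


centers = coefficient_matrix(list(range(9)), num_features=3)
print(len(centers))  # 3
```

Each row then has exactly one entry per feature, so attaching the feature names as column names no longer fails.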
actuaryzhang authored
## What changes were proposed in this pull request?

The `coefficients` component in the model summary should be a 'matrix', but the underlying structure is in fact a list. This affects several models, except for 'AFTSurvivalRegressionModel', which has the correct implementation. The fix is to first `unlist` the coefficients returned from `callJMethod` before converting to a matrix. An example illustrates the issue:

```
data(iris)
df <- createDataFrame(iris)
model <- spark.glm(df, Sepal_Length ~ Sepal_Width, family = "gaussian")
s <- summary(model)

> str(s$coefficients)
List of 8
 $ : num 6.53
 $ : num -0.223
 $ : num 0.479
 $ : num 0.155
 $ : num 13.6
 $ : num -1.44
 $ : num 0
 $ : num 0.152
 - attr(*, "dim")= int [1:2] 2 4
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:2] "(Intercept)" "Sepal_Width"
  ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
> s$coefficients[, 2]
$`(Intercept)`
[1] 0.4788963

$Sepal_Width
[1] 0.1550809
```

This shows that the underlying structure of coefficients is still `list`.

felixcheung wangmiao1981

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #16730 from actuaryzhang/sparkRCoef.
- Jan 27, 2017
Felix Cheung authored
## What changes were proposed in this pull request?

This mostly affects running a job from the driver in client mode when results are expected to come through stdout (which should be somewhat rare, but possible).

Before:

```
> a <- as.DataFrame(cars)
> b <- group_by(a, "dist")
> c <- count(b)
> sparkR.callJMethod(c$countjc, "explain", TRUE)
NULL
```

After:

```
> a <- as.DataFrame(cars)
> b <- group_by(a, "dist")
> c <- count(b)
> sparkR.callJMethod(c$countjc, "explain", TRUE)
count#11L
NULL
```

Now, `column.explain()` doesn't seem very useful (we can get more extensive output with `DataFrame.explain()`), but there are other, more complex examples with calls to `println` on the Scala/JVM side whose output is getting dropped.

## How was this patch tested?

Manual.

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16670 from felixcheung/rjvmstdout.
- Jan 26, 2017
Felix Cheung authored
## What changes were proposed in this pull request? With doc to say this would convert DF into RDD ## How was this patch tested? unit tests, manual tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16668 from felixcheung/rgetnumpartitions.
wm624@hotmail.com authored
## What changes were proposed in this pull request? Add R wrapper for bisecting Kmeans. As JIRA is down, I will update title to link with corresponding JIRA later. ## How was this patch tested? Add new unit tests. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16566 from wangmiao1981/bk.
- Jan 24, 2017
Felix Cheung authored
## What changes were proposed in this pull request?

Support for

```
df[[myname]] <- 1
df[[2]] <- df$eruptions
```

## How was this patch tested?

Manual tests, unit tests.

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16663 from felixcheung/rcolset.
- Jan 21, 2017
Yanbo Liang authored
## What changes were proposed in this pull request? `spark.gaussianMixture` can now output the total log-likelihood for the model, like R's `mvnormalmixEM`. ## How was this patch tested? R unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #16646 from yanboliang/spark-19291.
- Jan 16, 2017
wm624@hotmail.com authored
## What changes were proposed in this pull request? spark.lda passes the optimizer "em" or "online" as a string to the backend. However, LDAWrapper doesn't set optimizer based on the value from R. Therefore, for optimizer "em", the `isDistributed` field is FALSE, which should be TRUE based on scala code. In addition, the `summary` method should bring back the results related to `DistributedLDAModel`. ## How was this patch tested? Manual tests by comparing with scala example. Modified the current unit test: fix the incorrect unit test and add necessary tests for `summary` method. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16464 from wangmiao1981/new.
- Jan 13, 2017
Felix Cheung authored
## What changes were proposed in this pull request? To allow specifying number of partitions when the DataFrame is created ## How was this patch tested? manual, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16512 from felixcheung/rnumpart.
wm624@hotmail.com authored
## What changes were proposed in this pull request? spark.kmeans has no interface for setting initSteps, seed and tol. As the Spark k-means algorithm doesn't take the same set of parameters as R's kmeans, we should maintain a different interface in spark.kmeans. Add the missing parameters and the corresponding documentation. Modified existing unit tests to take the additional parameters. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16523 from wangmiao1981/kmeans.
- Jan 11, 2017
Felix Cheung authored
## What changes were proposed in this pull request?

```
df$foo <- 1
```

instead of

```
df$foo <- lit(1)
```

## How was this patch tested?

Unit tests.

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16510 from felixcheung/rlitcol.
- Jan 10, 2017
Felix Cheung authored
## What changes were proposed in this pull request? R family is a longer list than what Spark supports. ## How was this patch tested? manual Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16511 from felixcheung/rdocglmfamily.
- Jan 08, 2017
Yanbo Liang authored
## What changes were proposed in this pull request?

SparkR `mllib.R` is getting bigger as we add more ML wrappers. I'd like to split it into multiple files to make it easier to maintain:

* mllib_classification.R
* mllib_clustering.R
* mllib_recommendation.R
* mllib_regression.R
* mllib_stat.R
* mllib_tree.R
* mllib_utils.R

Note: Only reorg, no actual code change.

## How was this patch tested?

Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16312 from yanboliang/spark-18862.
- Dec 22, 2016
Felix Cheung authored
## What changes were proposed in this pull request? It would make it easier to integrate with other component expecting row-based JSON format. This replaces the non-public toJSON RDD API. ## How was this patch tested? manual, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16368 from felixcheung/rJSON.
- Dec 21, 2016
Felix Cheung authored
## What changes were proposed in this pull request? API for SparkUI URL from SparkContext ## How was this patch tested? manual, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16367 from felixcheung/rwebui.
- Dec 16, 2016
Dongjoon Hyun authored
## What changes were proposed in this pull request?

SparkR tests, `R/run-tests.sh`, succeed only once because `test_sparkSQL.R` does not clean up the test table, `people`.

As a result, the rows in the `people` table accumulate at every run and the test cases fail.

The following is the failure result for the second run.

```r
Failed -------------------------------------------------------------------------
1. Failure: create DataFrame from RDD (test_sparkSQL.R#204) -------------------
collect(sql("SELECT age from people WHERE name = 'Bob'"))$age not equal to c(16).
Lengths differ: 2 vs 1

2. Failure: create DataFrame from RDD (test_sparkSQL.R#206) -------------------
collect(sql("SELECT height from people WHERE name ='Bob'"))$height not equal to c(176.5).
Lengths differ: 2 vs 1
```

## How was this patch tested?

Manual. Run `run-tests.sh` twice and check that it passes without failures.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #16310 from dongjoon-hyun/SPARK-18897.
- Dec 13, 2016
wm624@hotmail.com authored
## What changes were proposed in this pull request?

While adding vignettes for kstest, I found some errors in the example:

1. There is a typo of kstest;
2. print.summary.KStest doesn't work with the example.

Fix the example errors and add a new unit test for print.summary.KStest.

## How was this patch tested?

Manual test; add new unit test.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16259 from wangmiao1981/ks.
- Dec 12, 2016
Felix Cheung authored
## What changes were proposed in this pull request?

Support overriding the download url (including the version directory) with an environment variable, `SPARKR_RELEASE_DOWNLOAD_URL`.

## How was this patch tested?

Unit test, manual testing:

- snapshot build url
  - download when spark jar not cached
  - when spark jar is cached
- RC build url
  - download when spark jar not cached
  - when spark jar is cached
- multiple cached spark versions
- starting with sparkR shell

To use this,

```
SPARKR_RELEASE_DOWNLOAD_URL=http://this_is_the_url_to_spark_release_tgz R
```

then in R,

```
library(SparkR) # or specify lib.loc
sparkR.session()
```

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16248 from felixcheung/rinstallurl.
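The override pattern (use the environment variable when set, otherwise fall back to a default) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SparkR install code: `release_url` and `DEFAULT_BASE_URL` are hypothetical, and the fallback URL layout is invented for the example.

```python
import os

# Hypothetical default; the real mirror selection logic is not shown in the source.
DEFAULT_BASE_URL = "https://example.invalid/spark"


def release_url(version, env=None):
    """Return the override from the environment if set, else a default URL."""
    env = os.environ if env is None else env
    override = env.get("SPARKR_RELEASE_DOWNLOAD_URL")
    if override:
        return override
    return f"{DEFAULT_BASE_URL}/spark-{version}.tgz"
```

Passing the environment as a parameter keeps the function easy to test against snapshot and RC build URLs without touching the real process environment.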