  1. Apr 24, 2017
    • [SPARK-20438][R] SparkR wrappers for split and repeat · 8a272ddc
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Add wrappers for `o.a.s.sql.functions`:
      
      - `split` as `split_string`
      - `repeat` as `repeat_string`
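
      For illustration, a minimal usage sketch (input data and an active SparkR session are assumed):

      ```r
      # Hypothetical input; both wrappers operate on string columns.
      df <- createDataFrame(data.frame(value = "a,b,c", stringsAsFactors = FALSE))
      head(select(df, split_string(df$value, ","), repeat_string(df$value, 2)))
      ```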
      
      ## How was this patch tested?
      
      Existing tests, additional unit tests, `check-cran.sh`
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17729 from zero323/SPARK-20438.
  2. Apr 21, 2017
    • [SPARK-20371][R] Add wrappers for collect_list and collect_set · fd648bff
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Adds wrappers for `collect_list` and `collect_set`.
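
      As a minimal sketch (data illustrative):

      ```r
      # Collect grouped values into a list (with duplicates) and a set (distinct).
      df <- createDataFrame(data.frame(g = c("a", "a", "b"), v = c(1, 1, 2)))
      gdf <- groupBy(df, df$g)
      head(agg(gdf, alias(collect_list(df$v), "vals"), alias(collect_set(df$v), "distinct_vals")))
      ```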
      
      ## How was this patch tested?
      
      Unit tests, `check-cran.sh`
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17672 from zero323/SPARK-20371.
  3. Apr 19, 2017
    • [SPARK-20375][R] R wrappers for array and map · 46c57497
      zero323 authored
      ## What changes were proposed in this pull request?
      
Adds wrappers for `o.a.s.sql.functions.array` and `o.a.s.sql.functions.map`.
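
      As a usage sketch, assuming the wrappers are exposed on the R side as `create_array` and `create_map` (data illustrative):

      ```r
      df <- createDataFrame(data.frame(x = 1, y = 2))
      # Build an array column from two numeric columns, and a map column from
      # alternating key/value columns.
      head(select(df, create_array(df$x, df$y), create_map(lit("x"), df$x, lit("y"), df$y)))
      ```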
      
      ## How was this patch tested?
      
      Unit tests, `check-cran.sh`
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17674 from zero323/SPARK-20375.
    • [SPARK-20397][SPARKR][SS] Fix flaky test: test_streaming.R.Terminated by error · 4fea7848
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
Checking a source parameter is asynchronous. When the query is created, it is not guaranteed that the source has been created yet. This PR simply increases the `awaitTermination` timeout to ensure the parsing error is thrown.
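
      The test pattern looks roughly like this (`jsonDir`, `jsonSchema`, and the option value are illustrative, not the verbatim test):

      ```r
      # The bad source option is validated asynchronously, so the failure only
      # surfaces through awaitTermination; the fix is a generous timeout here.
      df <- read.stream("json", path = jsonDir, schema = jsonSchema, maxFilesPerTrigger = -1)
      q <- write.stream(df, "memory", queryName = "terminated")
      expect_error(awaitTermination(q, 5 * 1000))
      ```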
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17687 from zsxwing/SPARK-20397.
  4. Apr 18, 2017
    • [SPARK-20208][R][DOCS] Document R fpGrowth support · 702d85af
      zero323 authored
      ## What changes were proposed in this pull request?
      
Document fpGrowth in:
      
      - vignettes
      - programming guide
      - code example
      
      ## How was this patch tested?
      
      Manual tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17557 from zero323/SPARK-20208.
  5. Apr 17, 2017
    • [SPARK-19828][R][FOLLOWUP] Rename asJsonArray to as.json.array in from_json function in R · 24f09b39
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
This was suggested to be `as.json.array` in the first place in the PR for SPARK-19828, but we could not do this because the lint check emitted an error for multiple dots in variable names.
      
After SPARK-20278, we are now able to use `multiple.dots.in.names`. The `asJsonArray` argument in the `from_json` function can still be changed, as 2.2 has not been released yet.
      
      So, this PR proposes to rename `asJsonArray` to `as.json.array`.
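
      A sketch of the renamed argument (mirroring the SPARK-19828 example):

      ```r
      jsonArr <- "[{\"name\":\"Bob\"}, {\"name\":\"Alice\"}]"
      df <- as.DataFrame(list(list("people" = jsonArr)))
      schema <- structType(structField("name", "string"))
      # Previously `asJsonArray = TRUE`; after this PR:
      head(select(df, from_json(df$people, schema, as.json.array = TRUE)))
      ```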
      
      ## How was this patch tested?
      
      Jenkins tests, local tests with `./R/run-tests.sh` and manual `./dev/lint-r`. Existing tests should cover this.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17653 from HyukjinKwon/SPARK-19828-followup.
  6. Apr 16, 2017
    • [SPARK-20278][R] Disable 'multiple_dots_linter' lint rule that is against project's code style · 86d251c5
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
Currently, dot-separated variable names in R are not allowed. For example,
      
      ```diff
       setMethod("from_json", signature(x = "Column", schema = "structType"),
      -          function(x, schema, asJsonArray = FALSE, ...) {
      +          function(x, schema, as.json.array = FALSE, ...) {
                   if (asJsonArray) {
                     jschema <- callJStatic("org.apache.spark.sql.types.DataTypes",
                                            "createArrayType",
      ```
      
      produces an error as below:
      
      ```
      R/functions.R:2462:31: style: Words within variable and function names should be separated by '_' rather than '.'.
                function(x, schema, as.json.array = FALSE, ...) {
                                    ^~~~~~~~~~~~~
      ```
      
This seems to go against https://google.github.io/styleguide/Rguide.xml#identifiers, which says
      
      > The preferred form for variable names is all lower case letters and words separated with dots
      
This appears to be because lintr by default (https://github.com/jimhester/lintr) follows http://r-pkgs.had.co.nz/style.html, as written in its README.md, and a few of its checks deviate from Google's guide as "a few tweaks".
      
Per [SPARK-6813](https://issues.apache.org/jira/browse/SPARK-6813), we follow Google's R Style Guide, https://google.github.io/styleguide/Rguide.xml, with a few exceptions. This is also merged into Spark's website - https://github.com/apache/spark-website/pull/43
      
Also, it looks like we place no such limit on function names, while this rule also affects function names, as written in the README.md:
      
      > `multiple_dots_linter`: check that function and variable names are separated by _ rather than ..
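
      The change itself presumably amounts to excluding this rule in the project's lintr configuration, roughly along these lines (a sketch, not the verbatim file):

      ```
      linters: with_defaults(multiple_dots_linter = NULL, ...)
      ```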
      
      ## How was this patch tested?
      
Manually tested `./dev/lint-r` with the manual change below in `R/functions.R`:
      
      ```diff
       setMethod("from_json", signature(x = "Column", schema = "structType"),
      -          function(x, schema, asJsonArray = FALSE, ...) {
      +          function(x, schema, as.json.array = FALSE, ...) {
                   if (asJsonArray) {
                     jschema <- callJStatic("org.apache.spark.sql.types.DataTypes",
                                            "createArrayType",
      ```
      
      **Before**
      
      ```R
      R/functions.R:2462:31: style: Words within variable and function names should be separated by '_' rather than '.'.
                function(x, schema, as.json.array = FALSE, ...) {
                                    ^~~~~~~~~~~~~
      ```
      
      **After**
      
      ```
      lintr checks passed.
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17590 from HyukjinKwon/disable-dot-in-name.
  7. Apr 12, 2017
  8. Apr 07, 2017
  9. Apr 06, 2017
  10. Apr 05, 2017
    • [SPARKR][DOC] update doc for fpgrowth · c1b8b667
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      minor update
      
      zero323
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17526 from felixcheung/rfpgrowthfollowup.
  11. Apr 04, 2017
    • [MINOR][R] Reorder `Collate` fields in DESCRIPTION file · 0e2ee820
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
It seems the CRAN check script corrects `R/pkg/DESCRIPTION` so that files follow the order given in the `Collate` field.

This PR proposes to fix the position of `catalog.R` so that running the script does not produce a small diff in this file every time.
      
      ## How was this patch tested?
      
      Manually via `./R/check-cran.sh`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17528 from HyukjinKwon/minor-reorder-description.
    • [SPARK-19825][R][ML] spark.ml R API for FPGrowth · b34f7665
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Adds SparkR API for FPGrowth: [SPARK-19825](https://issues.apache.org/jira/browse/SPARK-19825):
      
- `spark.fpGrowth` - model training.
- `freqItemsets` and `associationRules` methods with new corresponding generics.
- Scala helper: `org.apache.spark.ml.r.FPGrowthWrapper`
      - unit tests.
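
      A minimal end-to-end sketch (data illustrative; the accessor generics are assumed to carry the `spark.` prefix):

      ```r
      df <- selectExpr(createDataFrame(data.frame(raw = c("1,2,5", "1,2,3,5", "1,2"))),
                       "split(raw, ',') AS items")
      model <- spark.fpGrowth(df, itemsCol = "items", minSupport = 0.5, minConfidence = 0.6)
      head(spark.freqItemsets(model))      # frequent itemsets with their frequencies
      head(spark.associationRules(model))  # antecedent/consequent rules with confidence
      ```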
      
      ## How was this patch tested?
      
Feature-specific unit tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17170 from zero323/SPARK-19825.
  12. Apr 02, 2017
  13. Mar 27, 2017
  14. Mar 26, 2017
  15. Mar 21, 2017
  16. Mar 20, 2017
    • [SPARK-19949][SQL] unify bad record handling in CSV and JSON · 68d65fae
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
Currently JSON and CSV have exactly the same logic for handling bad records. This PR abstracts that logic and moves it to an upper level to reduce code duplication.

The overall idea is: the JSON and CSV parsers throw a `BadRecordException`, and the upper level, `FailureSafeParser`, handles bad records according to the parse mode.

Behavior changes:
1. With PERMISSIVE mode, if the number of tokens doesn't match the schema, the CSV parser previously treated the row as a legal record and parsed as many tokens as possible. After this PR, we treat it as an illegal record and put the raw record string in a special column, while still parsing as many tokens as possible (see the sketch below).
2. All logging is removed, as it is not very useful in practice.
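
      A sketch of behavior change 1 via the R reader API (`csvPath` and `schema` are assumptions; option names are the standard reader options):

      ```r
      # With a user-provided schema, a row whose token count does not match is
      # now treated as illegal: its raw text lands in the designated
      # corrupt-record column (assumed to be declared as a string field in
      # `schema`), while the matching tokens are still parsed.
      df <- read.df(csvPath, "csv", schema = schema, mode = "PERMISSIVE",
                    columnNameOfCorruptRecord = "_corrupt_record")
      ```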
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: Wenchen Fan <cloud0fan@gmail.com>
      
      Closes #17315 from cloud-fan/bad-record2.
    • [SPARK-20020][SPARKR][FOLLOWUP] DataFrame checkpoint API fix version tag · f14f81e9
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      doc only change
      
      ## How was this patch tested?
      
      manual
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17356 from felixcheung/rdfcheckpoint2.
    • [SPARK-20020][SPARKR] DataFrame checkpoint API · c4059772
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
Add `checkpoint` and `setCheckpointDir` APIs to R
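
      A minimal sketch of the new API (the checkpoint directory is illustrative):

      ```r
      setCheckpointDir("/tmp/spark-checkpoints")
      df <- createDataFrame(mtcars)
      df2 <- checkpoint(df)   # truncates the lineage; data persisted to the dir above
      ```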
      
      ## How was this patch tested?
      
      unit tests, manual tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17351 from felixcheung/rdfcheckpoint.
    • [SPARK-19849][SQL] Support ArrayType in to_json to produce JSON array · 0cdcf911
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to support an array of struct type in `to_json` as below:
      
      ```scala
      import org.apache.spark.sql.functions._
      
      val df = Seq(Tuple1(Tuple1(1) :: Nil)).toDF("a")
      df.select(to_json($"a").as("json")).show()
      ```
      
      ```
      +----------+
      |      json|
      +----------+
      |[{"_1":1}]|
      +----------+
      ```
      
      Currently, it throws an exception as below (a newline manually inserted for readability):
      
      ```
      org.apache.spark.sql.AnalysisException: cannot resolve 'structtojson(`array`)' due to data type
      mismatch: structtojson requires that the expression is a struct expression.;;
      ```
      
      This allows the roundtrip with `from_json` as below:
      
      ```scala
      import org.apache.spark.sql.functions._
      import org.apache.spark.sql.types._
      
      val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
      val df = Seq("""[{"a":1}, {"a":2}]""").toDF("json").select(from_json($"json", schema).as("array"))
      df.show()
      
      // Read back.
      df.select(to_json($"array").as("json")).show()
      ```
      
      ```
      +----------+
      |     array|
      +----------+
      |[[1], [2]]|
      +----------+
      
      +-----------------+
      |             json|
      +-----------------+
      |[{"a":1},{"a":2}]|
      +-----------------+
      ```
      
Also, this PR proposes to rename `StructToJson` to `StructsToJson` and `JsonToStruct` to `JsonToStructs`.
      
      ## How was this patch tested?
      
      Unit tests in `JsonFunctionsSuite` and `JsonExpressionsSuite` for Scala, doctest for Python and test in `test_sparkSQL.R` for R.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17192 from HyukjinKwon/SPARK-19849.
  17. Mar 19, 2017
    • [SPARK-18817][SPARKR][SQL] change derby log output to temp dir · 422aa67d
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
Passes R `tempdir()` (the R session temp dir, shared with other temp files/dirs) to the JVM, and sets a system property for the Derby home dir so that `derby.log` is moved there.
      
      ## How was this patch tested?
      
      Manually, unit tests
      
With this, these files are relocated under /tmp:
      ```
      # ls /tmp/RtmpG2M0cB/
      derby.log
      ```
      And they are removed automatically when the R session is ended.
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16330 from felixcheung/rderby.
    • [MINOR][R] Reorder `Collate` fields in DESCRIPTION file · 60262bc9
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
It seems the CRAN check script corrects `R/pkg/DESCRIPTION` so that files follow the order given in the `Collate` field.

This PR proposes to fix this so that running the script does not produce a diff in this file.
      
      ## How was this patch tested?
      
      Manually via `./R/check-cran.sh`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17349 from HyukjinKwon/minor-cran.
  18. Mar 18, 2017
  19. Mar 14, 2017
    • [SPARK-19828][R] Support array type in from_json in R · d1f6c64c
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
Since we cannot directly define array types in R, this PR proposes to support them via the string notation used in `structField`, as below:
      
      ```R
      jsonArr <- "[{\"name\":\"Bob\"}, {\"name\":\"Alice\"}]"
      df <- as.DataFrame(list(list("people" = jsonArr)))
      collect(select(df, alias(from_json(df$people, "array<struct<name:string>>"), "arrcol")))
      ```
      
      prints
      
      ```R
            arrcol
      1 Bob, Alice
      ```
      
      ## How was this patch tested?
      
      Unit tests in `test_sparkSQL.R`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17178 from HyukjinKwon/SPARK-19828.
    • [SPARK-19391][SPARKR][ML] Tweedie GLM API for SparkR · f6314eab
      actuaryzhang authored
      ## What changes were proposed in this pull request?
Port Tweedie GLM #16344 to SparkR
      
      felixcheung yanboliang
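
      A usage sketch (parameter names as added by this port; data illustrative):

      ```r
      df <- createDataFrame(mtcars)
      # Tweedie family with variance power 1.2 and log link (link.power = 0).
      model <- spark.glm(df, mpg ~ wt + hp, family = "tweedie",
                         var.power = 1.2, link.power = 0)
      summary(model)
      ```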
      
      ## How was this patch tested?
      new test in SparkR
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #16729 from actuaryzhang/sparkRTweedie.
  20. Mar 12, 2017
    • [SPARK-19282][ML][SPARKR] RandomForest Wrapper and GBT Wrapper return param "maxDepth" to R models · 9f8ce482
      Xin Ren authored
      ## What changes were proposed in this pull request?
      
The RandomForest and GBT R wrappers now return the param `maxDepth` to R models.

The following four R wrappers are changed (see the sketch below):
      * `RandomForestClassificationWrapper`
      * `RandomForestRegressionWrapper`
      * `GBTClassificationWrapper`
      * `GBTRegressionWrapper`
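
      As a sketch of the visible effect (dataset illustrative):

      ```r
      df <- createDataFrame(longley)
      model <- spark.randomForest(df, Employed ~ ., "regression", maxDepth = 5)
      summary(model)$maxDepth   # now returned to the R model: 5
      ```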
      
      ## How was this patch tested?
      
Tested manually on my local machine.
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #17207 from keypointt/SPARK-19282.
  21. Mar 08, 2017
    • [SPARK-19601][SQL] Fix CollapseRepartition rule to preserve shuffle-enabled Repartition · 9a6ac722
      Xiao Li authored
      ### What changes were proposed in this pull request?
      
As observed by felixcheung in https://github.com/apache/spark/pull/16739, when users use the shuffle-enabled `repartition` API, they expect the number of partitions to be exactly the number they provided, even if they call the shuffle-disabled `coalesce` later.

Currently, the `CollapseRepartition` rule does not consider whether shuffle is enabled. Thus, we get the following unexpected results.
      
      ```Scala
          val df = spark.range(0, 10000, 1, 5)
          val df2 = df.repartition(10)
          assert(df2.coalesce(13).rdd.getNumPartitions == 5)
          assert(df2.coalesce(7).rdd.getNumPartitions == 5)
          assert(df2.coalesce(3).rdd.getNumPartitions == 3)
      ```
      
      This PR is to fix the issue. We preserve shuffle-enabled Repartition.
      
      ### How was this patch tested?
      Added a test case
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #16933 from gatorsmile/CollapseRepartition.
  22. Mar 06, 2017
  23. Mar 05, 2017
    • [SPARK-19795][SPARKR] add column functions to_json, from_json · 80d5338b
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Add column functions: to_json, from_json, and tests covering error cases.
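
      A round-trip sketch (data illustrative):

      ```r
      df <- sql("SELECT named_struct('name', 'Bob') AS s")
      j <- select(df, alias(to_json(df$s), "json"))      # struct -> JSON string
      head(select(j, from_json(j$json, structType(structField("name", "string")))))
      ```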
      
      ## How was this patch tested?
      
      unit tests, manual
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17134 from felixcheung/rtojson.
  24. Mar 04, 2017
  25. Mar 02, 2017
  26. Mar 01, 2017
    • [DOC][MINOR][SPARKR] Update SparkR doc for names, columns and colnames · 2ff1467d
      actuaryzhang authored
Update R doc:
1. `columns`, `names` and `colnames` return a vector of strings, not a **list** as stated in the current doc.
2. `colnames<-` does allow subset assignment, so the length of `value` can be less than the number of columns, e.g., `colnames(df)[1] <- "a"` (see the sketch below).
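
      For example (sketch):

      ```r
      df <- createDataFrame(mtcars)
      colnames(df)[1] <- "a"   # subset assignment: renames only the first column
      columns(df)              # a character vector, not a list
      ```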
      
      felixcheung
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #17115 from actuaryzhang/sparkRMinorDoc.
    • [SPARK-19460][SPARKR] Update dataset used in R documentation, examples to reduce warning noise and confusions · 89cd3845
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
Replace the `iris` dataset with `Titanic` or other datasets in examples and documentation.
      
      ## How was this patch tested?
      
Manual and existing tests
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #17032 from wangmiao1981/example.
  27. Feb 28, 2017
  28. Feb 23, 2017
    • [SPARK-19682][SPARKR] Issue warning (or error) when subset method "[[" takes vector index · 7bf09433
      actuaryzhang authored
      ## What changes were proposed in this pull request?
The `[[` method is supposed to take a single index and return a column. This differs from base R, which takes a vector index. We should check for this and issue a warning or error when a vector index is supplied (which is quite likely given the behavior in base R).

Currently I'm issuing a warning message and taking only the first element of the vector index. We could change this to an error if that's better.
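
      A sketch of the behavior (warning text approximate):

      ```r
      df <- createDataFrame(mtcars)
      df[[c(1, 2)]]   # warns that only a single index is supported and uses df[[1]]
      ```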
      
      ## How was this patch tested?
      new tests
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #17017 from actuaryzhang/sparkRSubsetter.