  1. Sep 14, 2017
    • [SPARK-21513][SQL][FOLLOWUP] Allow UDF to_json support converting MapType to json for PySpark and SparkR · a28728a9
      goldmedal authored
      
      ## What changes were proposed in this pull request?
      In the previous work, SPARK-21513, we allowed `MapType` and `ArrayType` of `MapType`s to be converted to a JSON string, but only for the Scala API. In this follow-up PR, we make Spark SQL support it for PySpark and SparkR, too. We also fix some small bugs and comments from the previous work.
      
      ### For PySpark
      ```
      >>> data = [(1, {"name": "Alice"})]
      >>> df = spark.createDataFrame(data, ("key", "value"))
      >>> df.select(to_json(df.value).alias("json")).collect()
      [Row(json=u'{"name":"Alice"}')]
      >>> data = [(1, [{"name": "Alice"}, {"name": "Bob"}])]
      >>> df = spark.createDataFrame(data, ("key", "value"))
      >>> df.select(to_json(df.value).alias("json")).collect()
      [Row(json=u'[{"name":"Alice"},{"name":"Bob"}]')]
      ```
      ### For SparkR
      ```
      # Converts a map into a JSON object
      df2 <- sql("SELECT map('name', 'Bob')) as people")
      df2 <- mutate(df2, people_json = to_json(df2$people))
      # Converts an array of maps into a JSON array
      df2 <- sql("SELECT array(map('name', 'Bob'), map('name', 'Alice')) as people")
      df2 <- mutate(df2, people_json = to_json(df2$people))
      ```
      ## How was this patch tested?
      Add unit test cases.
      
      cc viirya HyukjinKwon
      
      Author: goldmedal <liugs963@gmail.com>
      
      Closes #19223 from goldmedal/SPARK-21513-fp-PySaprkAndSparkR.
  2. Sep 03, 2017
    • [SPARK-21897][PYTHON][R] Add unionByName API to DataFrame in Python and R · 07fd68a2
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to add a wrapper for the `unionByName` API to R and Python as well.
      
      **Python**
      
      ```python
      df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
      df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])
      df1.unionByName(df2).show()
      ```
      
      ```
      +----+----+----+
      |col0|col1|col2|
      +----+----+----+
      |   1|   2|   3|
      |   6|   4|   5|
      +----+----+----+
      ```
      
      **R**
      
      ```R
      df1 <- select(createDataFrame(mtcars), "carb", "am", "gear")
      df2 <- select(createDataFrame(mtcars), "am", "gear", "carb")
      head(unionByName(limit(df1, 2), limit(df2, 2)))
      ```
      
      ```
        carb am gear
      1    4  1    4
      2    4  1    4
      3    4  1    4
      4    4  1    4
      ```
      
      ## How was this patch tested?
      
      Doctests for Python and a unit test added in `test_sparkSQL.R` for R.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19105 from HyukjinKwon/unionByName-r-python.
  3. Aug 22, 2017
    • [SPARK-21584][SQL][SPARKR] Update R method for summary to call new implementation · 5c9b3017
      Andrew Ray authored
      ## What changes were proposed in this pull request?
      
      SPARK-21100 introduced a new `summary` method to the Scala/Java Dataset API that included expanded statistics (vs `describe`) and control over which statistics to compute. Currently, in the R API, `summary` acts as an alias for `describe`. This patch updates the R API to call the new `summary` method in the JVM, which includes the additional statistics and the ability to select which ones to compute.
      
      This does not break the current interface, as the present `summary` method does not take additional arguments the way `describe` does, and its output was never meant to be used programmatically.
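      
      As a hedged illustration (not taken from the PR), selecting statistics could look like the following; the dataset is just a convenient built-in, and the call assumes the R wrapper forwards the requested statistic names to the JVM `summary` implementation:
      
      ```r
      # A minimal sketch: summary() with no extra arguments computes the
      # expanded default statistics; passing names computes only those.
      df <- createDataFrame(faithful)
      head(summary(df))                               # expanded defaults
      head(summary(df, "count", "min", "25%", "max")) # selected statistics
      ```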
      
      ## How was this patch tested?
      
      Modified and additional unit tests.
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #18786 from aray/summary-r.
  4. Aug 03, 2017
    • [SPARK-21602][R] Add map_keys and map_values functions to R · 97ba4918
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR adds `map_values` and `map_keys` to the R API.
      
      ```r
      > df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))
      > tmp <- mutate(df, v = create_map(df$model, df$cyl))
      > head(select(tmp, map_keys(tmp$v)))
      ```
      ```
              map_keys(v)
      1         Mazda RX4
      2     Mazda RX4 Wag
      3        Datsun 710
      4    Hornet 4 Drive
      5 Hornet Sportabout
      6           Valiant
      ```
      ```r
      > head(select(tmp, map_values(tmp$v)))
      ```
      ```
        map_values(v)
      1             6
      2             6
      3             4
      4             6
      5             8
      6             6
      ```
      
      ## How was this patch tested?
      
      Manual tests and unit tests in `R/pkg/tests/fulltests/test_sparkSQL.R`
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18809 from HyukjinKwon/map-keys-values-r.
  5. Jul 31, 2017
    • [SPARK-21381][SPARKR] SparkR: pass on setHandleInvalid for classification algorithms · 9570e81a
      wangmiao1981 authored
      ## What changes were proposed in this pull request?
      
      SPARK-20307 added the `handleInvalid` option to `RFormula` for tree-based classification algorithms. We should add this parameter to the other classification algorithms in SparkR as well.
      
      This is a follow-up PR for SPARK-20307.
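      
      As a hedged sketch (not from the PR), usage could look like this; the dataset and formula are illustrative, and it assumes the classifier wrappers accept the same values ("error", "skip", "keep") as `RFormula`:
      
      ```r
      # A minimal sketch: pass handleInvalid through a SparkR classification
      # wrapper so rows with invalid/unseen categorical values are skipped
      # instead of raising an error at fit time.
      df <- createDataFrame(iris)
      model <- spark.logit(df, Species ~ ., handleInvalid = "skip")
      summary(model)
      ```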
      
      ## How was this patch tested?
      
      New unit tests are added.
      
      Author: wangmiao1981 <wm624@hotmail.com>
      
      Closes #18605 from wangmiao1981/class.
  6. Jul 13, 2017
    • [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10 · 425c4ada
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      - Remove Scala 2.10 build profiles and support
      - Replace some 2.10 support in scripts with commented placeholders for 2.12 later
      - Remove deprecated API calls from 2.10 support
      - Remove usages of deprecated context bounds where possible
      - Remove Scala 2.10 workarounds like ScalaReflectionLock
      - Other minor Scala warning fixes
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17150 from srowen/SPARK-19810.
  7. Jul 10, 2017
    • [SPARK-21266][R][PYTHON] Support schema a DDL-formatted string in dapply/gapply/from_json · 2bfd5acc
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR supports a schema in a DDL-formatted string for `from_json` in R/Python and for `dapply` and `gapply` in R, which is commonly used and/or consistent with the Scala APIs.
      
      Additionally, this PR exposes `structType` in R to allow working around other possible corner cases.
      
      **Python**
      
      `from_json`
      
      ```python
      from pyspark.sql.functions import from_json
      
      data = [(1, '''{"a": 1}''')]
      df = spark.createDataFrame(data, ("key", "value"))
      df.select(from_json(df.value, "a INT").alias("json")).show()
      ```
      
      **R**
      
      `from_json`
      
      ```R
      df <- sql("SELECT named_struct('name', 'Bob') as people")
      df <- mutate(df, people_json = to_json(df$people))
      head(select(df, from_json(df$people_json, "name STRING")))
      ```
      
      `structType.character`
      
      ```R
      structType("a STRING, b INT")
      ```
      
      `dapply`
      
      ```R
      dapply(createDataFrame(list(list(1.0)), "a"), function(x) {x}, "a DOUBLE")
      ```
      
      `gapply`
      
      ```R
      gapply(createDataFrame(list(list(1.0)), "a"), "a", function(key, x) { x }, "a DOUBLE")
      ```
      
      ## How was this patch tested?
      
      Doc tests for `from_json` in Python and unit tests in `test_sparkSQL.R` for R.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18498 from HyukjinKwon/SPARK-21266.
  8. Jun 25, 2017
    • [SPARK-21093][R] Terminate R's worker processes in the parent of R's daemon to prevent a leak · 6b3d0228
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      `mcfork` in R appears to open a pipe ahead of forking, but the existing logic does not properly close it when executed hot (repeatedly in quick succession). This eventually makes further forking fail because the limit on the number of open files is reached.
      
      This hot execution path shows up particularly with `gapply`/`gapplyCollect`. For unknown reasons, this happens more easily on CentOS, and it could be reproduced on Mac too.
      
      All the details are described in https://issues.apache.org/jira/browse/SPARK-21093
      
      This PR proposes simply to terminate R's worker processes in the parent of R's daemon to prevent a leak.
      
      ## How was this patch tested?
      
      I ran the code below on both CentOS and Mac with that configuration disabled/enabled.
      
      ```r
      df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
      collect(gapply(df, "a", function(key, x) { x }, schema(df)))
      collect(gapply(df, "a", function(key, x) { x }, schema(df)))
      ...  # 30 times
      ```
      
      Also, now it passes R tests on CentOS as below:
      
      ```
      SparkSQL functions: Spark package found in SPARK_HOME: .../spark
      ..............................................................................................................................................................
      ..............................................................................................................................................................
      ..............................................................................................................................................................
      ..............................................................................................................................................................
      ..............................................................................................................................................................
      ....................................................................................................................................
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18320 from HyukjinKwon/SPARK-21093.
  9. Jun 20, 2017
    • [SPARK-20929][ML] LinearSVC should use its own threshold param · cc67bd57
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      LinearSVC should use its own threshold param, rather than the shared one, since it applies to rawPrediction instead of probability.  This PR changes the param in the Scala, Python and R APIs.
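      
      As a hedged sketch (not from the PR), in the SparkR wrapper any real value becomes a valid threshold, since it cuts on the raw prediction rather than a probability bounded in [0, 1]; the dataset and formula are illustrative:
      
      ```r
      # A minimal sketch: threshold applies to rawPrediction, so a negative
      # value is legal and simply shifts the decision boundary.
      df <- createDataFrame(iris[iris$Species != "setosa", ])
      model <- spark.svmLinear(df, Species ~ ., regParam = 0.01, threshold = -0.5)
      ```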
      
      ## How was this patch tested?
      
      New unit test to make sure the threshold can be set to any Double value.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #18151 from jkbradley/ml-2.2-linearsvc-cleanup.
  10. Jun 18, 2017
    • [SPARK-20892][SPARKR] Add SQL trunc function to SparkR · 110ce1f2
      actuaryzhang authored
      ## What changes were proposed in this pull request?
      
      Add the SQL `trunc` function to SparkR.
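      
      A hedged example of the new function (not from the PR); the input date is illustrative:
      
      ```r
      # A minimal sketch: trunc() truncates a date column to the unit named
      # by the format string, e.g. "year" or "month".
      df <- createDataFrame(data.frame(d = as.Date("2017-06-18")))
      head(select(df, trunc(df$d, "year"), trunc(df$d, "month")))
      ```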
      
      ## How was this patch tested?
      standard test
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #18291 from actuaryzhang/sparkRTrunc2.
    • [SPARK-21128][R] Remove both "spark-warehouse" and "metastore_db" before listing files in R tests · 05f83c53
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to list the files in the test _after_ removing both "spark-warehouse" and "metastore_db", so that the next run of the R tests passes. The leftover directories are sometimes a bit annoying.
      
      ## How was this patch tested?
      
      Manually running multiple times R tests via `./R/run-tests.sh`.
      
      **Before**
      
      Second run:
      
      ```
      SparkSQL functions: Spark package found in SPARK_HOME: .../spark
      ...............................................................................................................................................................
      ...............................................................................................................................................................
      ...............................................................................................................................................................
      ...............................................................................................................................................................
      ...............................................................................................................................................................
      ....................................................................................................1234.......................
      
      Failed -------------------------------------------------------------------------
      1. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384)
      length(list1) not equal to length(list2).
      1/1 mismatches
      [1] 25 - 23 == 2
      
      2. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384)
      sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
      10/25 mismatches
      x[16]: "metastore_db"
      y[16]: "pkg"
      
      x[17]: "pkg"
      y[17]: "R"
      
      x[18]: "R"
      y[18]: "README.md"
      
      x[19]: "README.md"
      y[19]: "run-tests.sh"
      
      x[20]: "run-tests.sh"
      y[20]: "SparkR_2.2.0.tar.gz"
      
      x[21]: "metastore_db"
      y[21]: "pkg"
      
      x[22]: "pkg"
      y[22]: "R"
      
      x[23]: "R"
      y[23]: "README.md"
      
      x[24]: "README.md"
      y[24]: "run-tests.sh"
      
      x[25]: "run-tests.sh"
      y[25]: "SparkR_2.2.0.tar.gz"
      
      3. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388)
      length(list1) not equal to length(list2).
      1/1 mismatches
      [1] 25 - 23 == 2
      
      4. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388)
      sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
      10/25 mismatches
      x[16]: "metastore_db"
      y[16]: "pkg"
      
      x[17]: "pkg"
      y[17]: "R"
      
      x[18]: "R"
      y[18]: "README.md"
      
      x[19]: "README.md"
      y[19]: "run-tests.sh"
      
      x[20]: "run-tests.sh"
      y[20]: "SparkR_2.2.0.tar.gz"
      
      x[21]: "metastore_db"
      y[21]: "pkg"
      
      x[22]: "pkg"
      y[22]: "R"
      
      x[23]: "R"
      y[23]: "README.md"
      
      x[24]: "README.md"
      y[24]: "run-tests.sh"
      
      x[25]: "run-tests.sh"
      y[25]: "SparkR_2.2.0.tar.gz"
      
      DONE ===========================================================================
      ```
      
      **After**
      
      Second run:
      
      ```
      SparkSQL functions: Spark package found in SPARK_HOME: .../spark
      ...............................................................................................................................................................
      ...............................................................................................................................................................
      ...............................................................................................................................................................
      ...............................................................................................................................................................
      ...............................................................................................................................................................
      ...............................................................................................................................
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18335 from HyukjinKwon/SPARK-21128.
  11. Jun 16, 2017
    • [MINOR][DOCS] Improve Running R Tests docs · 45824fb6
      Yuming Wang authored
      ## What changes were proposed in this pull request?
      
      Update the Running R Tests dependency packages to:
      ```bash
      R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 'survival'), repos='http://cran.us.r-project.org')"
      ```
      
      ## How was this patch tested?
      manual tests
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #18271 from wangyum/building-spark.
  12. Jun 15, 2017
    • [SPARK-20980][SQL] Rename `wholeFile` to `multiLine` for both CSV and JSON · 20514281
      Xiao Li authored
      ### What changes were proposed in this pull request?
      The current option name `wholeFile` is misleading for CSV users: it does not mean one record per whole file, since a single file can contain multiple records. Thus, we should rename it; the proposal is `multiLine`.
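      
      A hedged sketch of the renamed option in use from R (not from the PR); the path is illustrative:
      
      ```r
      # A minimal sketch: read JSON where a single record spans multiple
      # lines, using the new option name.
      df <- read.json("examples/src/main/resources/people.json", multiLine = TRUE)
      ```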
      
      ### How was this patch tested?
      N/A
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #18202 from gatorsmile/renameCVSOption.
  13. Jun 11, 2017
    • [SPARK-20877][SPARKR][FOLLOWUP] clean up after test move · 9f4ff955
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Clean up after the big test move.
      
      ## How was this patch tested?
      
      unit tests, jenkins
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #18267 from felixcheung/rtestset2.
    • [SPARK-20877][SPARKR] refactor tests to basic tests only for CRAN · dc4c3518
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Move all existing tests to a non-installed directory so that they are never run by installing the SparkR package.
      
      For a follow-up PR:
      - remove all skip_on_cran() calls in tests
      - clean up test timer
      - improve or change basic tests that do run on CRAN (if anyone has suggestion)
      
      It looks like `R CMD build pkg` will still put `pkg/tests` (i.e. the full tests) into the source package, but `R CMD INSTALL` on such a source package does not install these tests (and so `R CMD check` does not run them).
      
      ## How was this patch tested?
      
      - [x] unit tests, Jenkins
      - [x] AppVeyor
      - [x] make a source package, install it, `R CMD check` it - verify the full tests are not installed or run
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #18264 from felixcheung/rtestset.
  14. Jun 09, 2017
    • [SPARK-21042][SQL] Document Dataset.union is resolution by position · b78e3849
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      Document that `Dataset.union` resolves columns by position, not by name, since this has been a confusing point for many users.
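      
      A hedged R illustration of the documented behavior (not from the PR), contrasting it with `unionByName` from SPARK-21897 above:
      
      ```r
      # A minimal sketch: union() matches columns by position, so df2's
      # reordered schema silently mixes data; unionByName matches by name.
      df1 <- createDataFrame(data.frame(a = 1, b = 2))
      df2 <- createDataFrame(data.frame(b = 3, a = 4))
      head(union(df1, df2))        # positional: second row is a=3, b=4
      head(unionByName(df1, df2))  # by name:    second row is a=4, b=3
      ```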
      
      ## How was this patch tested?
      N/A - doc only change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #18256 from rxin/SPARK-21042.