1. Jul 13, 2017
• [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10 · 425c4ada
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      - Remove Scala 2.10 build profiles and support
      - Replace some 2.10 support in scripts with commented placeholders for 2.12 later
      - Remove deprecated API calls from 2.10 support
      - Remove usages of deprecated context bounds where possible
      - Remove Scala 2.10 workarounds like ScalaReflectionLock
      - Other minor Scala warning fixes
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17150 from srowen/SPARK-19810.
2. Jul 10, 2017
• [SPARK-21266][R][PYTHON] Support schema in a DDL-formatted string in dapply/gapply/from_json · 2bfd5acc
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
This PR adds support for specifying the schema as a DDL-formatted string in `from_json` in R/Python and in `dapply`/`gapply` in R, which is a commonly used form and consistent with the Scala APIs.
      
Additionally, this PR exposes `structType` in R to allow workarounds in other possible corner cases.
      
      **Python**
      
      `from_json`
      
      ```python
      from pyspark.sql.functions import from_json
      
      data = [(1, '''{"a": 1}''')]
      df = spark.createDataFrame(data, ("key", "value"))
      df.select(from_json(df.value, "a INT").alias("json")).show()
      ```
      
      **R**
      
      `from_json`
      
      ```R
      df <- sql("SELECT named_struct('name', 'Bob') as people")
      df <- mutate(df, people_json = to_json(df$people))
      head(select(df, from_json(df$people_json, "name STRING")))
      ```
      
      `structType.character`
      
      ```R
      structType("a STRING, b INT")
      ```
      
      `dapply`
      
      ```R
      dapply(createDataFrame(list(list(1.0)), "a"), function(x) {x}, "a DOUBLE")
      ```
      
      `gapply`
      
      ```R
      gapply(createDataFrame(list(list(1.0)), "a"), "a", function(key, x) { x }, "a DOUBLE")
      ```
      
      ## How was this patch tested?
      
      Doc tests for `from_json` in Python and unit tests `test_sparkSQL.R` in R.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18498 from HyukjinKwon/SPARK-21266.
3. Jun 25, 2017
• [SPARK-21093][R] Terminate R's worker processes in the parent of R's daemon to prevent a leak · 6b3d0228
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
`mcfork` in R appears to open a pipe ahead of forking, but the existing logic does not properly close it when executed repeatedly ("hot"). This leads to failures on further forking once the limit on the number of open files is reached.

This hot execution path is hit particularly by `gapply`/`gapplyCollect`. For unknown reasons, this happens more easily on CentOS, but it can be reproduced on Mac too.
      
      All the details are described in https://issues.apache.org/jira/browse/SPARK-21093
      
      This PR proposes simply to terminate R's worker processes in the parent of R's daemon to prevent a leak.
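A minimal sketch of the idea using base R's `parallel` internals; this is illustrative, not the actual `daemon.R` code:

```r
# Illustrative only: fork a worker, then have the parent terminate and reap it,
# so the pipe opened by mcfork is closed and file descriptors do not leak.
p <- parallel:::mcfork()
if (inherits(p, "masterProcess")) {
  # child process: run the worker logic, then exit without returning here
  parallel:::mcexit(0L)
}
# parent process: once the worker is done, kill and reap it explicitly
tools::pskill(p$pid, tools::SIGUSR1)
parallel:::mccollect(p, wait = FALSE)
```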
      
      ## How was this patch tested?
      
I ran the code below on both CentOS and Mac with the relevant configuration disabled and enabled.
      
      ```r
      df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
      collect(gapply(df, "a", function(key, x) { x }, schema(df)))
      collect(gapply(df, "a", function(key, x) { x }, schema(df)))
      ...  # 30 times
      ```
      
      Also, now it passes R tests on CentOS as below:
      
      ```
      SparkSQL functions: Spark package found in SPARK_HOME: .../spark
      ..............................................................................................................................................................
      ..............................................................................................................................................................
      ..............................................................................................................................................................
      ..............................................................................................................................................................
      ..............................................................................................................................................................
      ....................................................................................................................................
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18320 from HyukjinKwon/SPARK-21093.
4. Jun 20, 2017
• [SPARK-20929][ML] LinearSVC should use its own threshold param · cc67bd57
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      LinearSVC should use its own threshold param, rather than the shared one, since it applies to rawPrediction instead of probability.  This PR changes the param in the Scala, Python and R APIs.
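For illustration, a minimal SparkR sketch of setting the param (the data and threshold value are arbitrary); since the threshold applies to rawPrediction, any real value is accepted:

```R
# Fit a binary linear SVM and set a rawPrediction-based threshold.
training <- createDataFrame(as.data.frame(Titanic))
model <- spark.svmLinear(training, Survived ~ ., regParam = 0.01, threshold = 0.25)
summary(model)
```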
      
      ## How was this patch tested?
      
      New unit test to make sure the threshold can be set to any Double value.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #18151 from jkbradley/ml-2.2-linearsvc-cleanup.
5. Jun 18, 2017
• [SPARK-20892][SPARKR] Add SQL trunc function to SparkR · 110ce1f2
      actuaryzhang authored
      ## What changes were proposed in this pull request?
      
      Add SQL trunc function
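A minimal sketch of the new function, assuming the usual `trunc(column, format)` signature (the data is arbitrary):

```R
df <- createDataFrame(list(list(as.Date("2017-06-18"))), "d")
# Truncate the date to the first day of its month, i.e. 2017-06-01.
head(select(df, trunc(df$d, "month")))
```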
      
      ## How was this patch tested?
      standard test
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #18291 from actuaryzhang/sparkRTrunc2.
• [SPARK-21128][R] Remove both "spark-warehouse" and "metastore_db" before listing files in R tests · 05f83c53
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
This PR proposes to list the files in the test _after_ removing both "spark-warehouse" and "metastore_db", so that the next run of the R tests passes cleanly. Currently the leftover files make a second run fail, which is a bit annoying.
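A sketch of the idea (the paths and variable names here are illustrative, not the actual test code):

```R
# Remove leftover artifacts before snapshotting SPARK_HOME, so files created
# by a previous run cannot skew the listing that the test compares against.
sparkHome <- Sys.getenv("SPARK_HOME")
unlink(file.path(sparkHome, "spark-warehouse"), recursive = TRUE)
unlink(file.path(sparkHome, "metastore_db"), recursive = TRUE)
list1 <- list.files(sparkHome)
```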
      
      ## How was this patch tested?
      
      Manually running multiple times R tests via `./R/run-tests.sh`.
      
      **Before**
      
      Second run:
      
      ```
      SparkSQL functions: Spark package found in SPARK_HOME: .../spark
      ...............................................................................................................................................................
      ...............................................................................................................................................................
      ...............................................................................................................................................................
      ...............................................................................................................................................................
      ...............................................................................................................................................................
      ....................................................................................................1234.......................
      
      Failed -------------------------------------------------------------------------
      1. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384)
      length(list1) not equal to length(list2).
      1/1 mismatches
      [1] 25 - 23 == 2
      
      2. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384)
      sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
      10/25 mismatches
      x[16]: "metastore_db"
      y[16]: "pkg"
      
      x[17]: "pkg"
      y[17]: "R"
      
      x[18]: "R"
      y[18]: "README.md"
      
      x[19]: "README.md"
      y[19]: "run-tests.sh"
      
      x[20]: "run-tests.sh"
      y[20]: "SparkR_2.2.0.tar.gz"
      
      x[21]: "metastore_db"
      y[21]: "pkg"
      
      x[22]: "pkg"
      y[22]: "R"
      
      x[23]: "R"
      y[23]: "README.md"
      
      x[24]: "README.md"
      y[24]: "run-tests.sh"
      
      x[25]: "run-tests.sh"
      y[25]: "SparkR_2.2.0.tar.gz"
      
      3. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388)
      length(list1) not equal to length(list2).
      1/1 mismatches
      [1] 25 - 23 == 2
      
      4. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388)
      sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
      10/25 mismatches
      x[16]: "metastore_db"
      y[16]: "pkg"
      
      x[17]: "pkg"
      y[17]: "R"
      
      x[18]: "R"
      y[18]: "README.md"
      
      x[19]: "README.md"
      y[19]: "run-tests.sh"
      
      x[20]: "run-tests.sh"
      y[20]: "SparkR_2.2.0.tar.gz"
      
      x[21]: "metastore_db"
      y[21]: "pkg"
      
      x[22]: "pkg"
      y[22]: "R"
      
      x[23]: "R"
      y[23]: "README.md"
      
      x[24]: "README.md"
      y[24]: "run-tests.sh"
      
      x[25]: "run-tests.sh"
      y[25]: "SparkR_2.2.0.tar.gz"
      
      DONE ===========================================================================
      ```
      
      **After**
      
      Second run:
      
      ```
      SparkSQL functions: Spark package found in SPARK_HOME: .../spark
      ...............................................................................................................................................................
      ...............................................................................................................................................................
      ...............................................................................................................................................................
      ...............................................................................................................................................................
      ...............................................................................................................................................................
      ...............................................................................................................................
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18335 from HyukjinKwon/SPARK-21128.
6. Jun 16, 2017
• [MINOR][DOCS] Improve Running R Tests docs · 45824fb6
      Yuming Wang authored
      ## What changes were proposed in this pull request?
      
Update the dependency packages in the Running R Tests docs to:
      ```bash
      R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 'survival'), repos='http://cran.us.r-project.org')"
      ```
      
      ## How was this patch tested?
      manual tests
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #18271 from wangyum/building-spark.
7. Jun 15, 2017
• [SPARK-20980][SQL] Rename `wholeFile` to `multiLine` for both CSV and JSON · 20514281
      Xiao Li authored
      ### What changes were proposed in this pull request?
The current option name `wholeFile` is misleading for CSV users: it does not mean one record per file, since a single file can contain multiple records. Thus, we should rename it; the proposal is `multiLine`.
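For example, in SparkR the renamed option would be passed like this (the file path is hypothetical):

```R
# Read a JSON file whose records span multiple lines; this was wholeFile before.
df <- read.df("path/to/people.json", source = "json", multiLine = "true")
```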
      
      ### How was this patch tested?
      N/A
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #18202 from gatorsmile/renameCVSOption.
8. Jun 11, 2017
• [SPARK-20877][SPARKR][FOLLOWUP] clean up after test move · 9f4ff955
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      clean up after big test move
      
      ## How was this patch tested?
      
      unit tests, jenkins
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #18267 from felixcheung/rtestset2.
• [SPARK-20877][SPARKR] refactor tests to basic tests only for CRAN · dc4c3518
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
Move all existing tests to a non-installed directory so that they never run when the SparkR package is installed.
      
      For a follow-up PR:
      - remove all skip_on_cran() calls in tests
      - clean up test timer
- improve or change the basic tests that do run on CRAN (if anyone has suggestions)
      
It looks like `R CMD build pkg` will still put `pkg/tests` (i.e., the full tests) into the source package, but `R CMD INSTALL` on such a source package does not install these tests (and so `R CMD check` does not run them).
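For reference, `skip_on_cran()` is testthat's standard guard; a minimal sketch of how such a test is written (the test name and body are illustrative):

```R
library(testthat)

test_that("basic SparkDataFrame operations", {
  skip_on_cran()  # skipped when NOT_CRAN is unset, i.e. under R CMD check on CRAN
  expect_true(TRUE)
})
```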
      
      ## How was this patch tested?
      
      - [x] unit tests, Jenkins
      - [x] AppVeyor
      - [x] make a source package, install it, `R CMD check` it - verify the full tests are not installed or run
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #18264 from felixcheung/rtestset.
9. Jun 09, 2017
• [SPARK-21042][SQL] Document Dataset.union is resolution by position · b78e3849
      Reynold Xin authored
      ## What changes were proposed in this pull request?
Document that Dataset.union resolves columns by position, not by name, since this has been a confusing point for a lot of users.
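A small SparkR sketch illustrating the by-position semantics (the column names are chosen to show the pitfall):

```R
df1 <- createDataFrame(list(list("1", "a")), c("id", "value"))
df2 <- createDataFrame(list(list("b", "2")), c("value", "id"))
# Columns are matched by position, not by name: "b" lands under "id"
# and "2" under "value", despite df2's column names.
head(union(df1, df2))
```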
      
      ## How was this patch tested?
      N/A - doc only change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #18256 from rxin/SPARK-21042.
10. May 31, 2017
• [SPARK-20877][SPARKR][WIP] add timestamps to test runs · 382fefd1
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      to investigate how long they run
      
      ## How was this patch tested?
      
      Jenkins, AppVeyor
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #18104 from felixcheung/rtimetest.
11. May 26, 2017
• [SPARK-20849][DOC][SPARKR] Document R DecisionTree · a97c4970
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
1. Add an example for the SparkR `decisionTree` (see the sketch below).
2. Document it in the user guide.
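A minimal sketch of such an example, assuming the `spark.decisionTree(data, formula, type, ...)` API (the dataset is arbitrary):

```R
df <- createDataFrame(longley)
# Fit a regression decision tree on the built-in longley data set.
model <- spark.decisionTree(df, Employed ~ ., type = "regression", maxDepth = 5)
summary(model)
```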
      
      ## How was this patch tested?
      local submit
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #18067 from zhengruifeng/dt_example.
12. May 23, 2017
• [MINOR][SPARKR][ML] Joint coefficients with intercept for SparkR linear SVM summary. · ad09e4ca
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
Report the coefficients jointly with the intercept in the SparkR linear SVM summary output.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #18035 from yanboliang/svm-r.
• [SPARK-20727] Skip tests that use Hadoop utils on CRAN Windows · d06610f9
      Shivaram Venkataraman authored
      ## What changes were proposed in this pull request?
      
This change skips tests that use the Hadoop libraries when running
the CRAN check on Windows. This handles cases where the Hadoop
winutils binaries are missing on the target system. The skipped tests consist of:
1. Tests that save and load a model in MLlib
2. Tests that save and load CSV, JSON, and Parquet files in SQL
3. Hive tests
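A sketch of what such a guard can look like; the helper name and logic here are illustrative, not the PR's exact code:

```R
# Hypothetical guard: skip a test when checking on CRAN under Windows,
# where the Hadoop winutils binaries may be missing.
skip_on_windows_cran <- function() {
  if (.Platform$OS.type == "windows" && !identical(Sys.getenv("NOT_CRAN"), "true")) {
    testthat::skip("Hadoop winutils may be missing on CRAN Windows")
  }
}
```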
      
      ## How was this patch tested?
      
Tested by running on a local Windows VM with HADOOP_HOME unset. Also tested with https://win-builder.r-project.org
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #17966 from shivaram/sparkr-windows-cran.