  1. Dec 02, 2016
  2. Nov 30, 2016
  3. Nov 28, 2016
  4. Nov 23, 2016
• [SPARK-18510] Fix data corruption from inferred partition column dataTypes · 15d2cf26
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      ### The Issue
      
      If I specify my schema when doing
      ```scala
      spark.read
        .schema(someSchemaWherePartitionColumnsAreStrings)
      ```
but partition inference can infer it as IntegerType (or, I assume, LongType or DoubleType; basically fixed-size types), then once UnsafeRows are generated, your data will be corrupted.
      
      ### Proposed solution
      
      The partition handling code path is kind of a mess. In my fix I'm probably adding to the mess, but at least trying to standardize the code path.
      
The real issue is that a user who uses the `spark.read` code path can never clearly specify what the partition columns are. If they try to specify the fields in `schema`, we practically ignore what they provide and fall back to our inferred data types. What happens in the end is data corruption.
      
      My solution tries to fix this by always trying to infer partition columns the first time you specify the table. Once we find what the partition columns are, we try to find them in the user specified schema and use the dataType provided there, or fall back to the smallest common data type.
      
      We will ALWAYS append partition columns to the user's schema, even if they didn't ask for it. We will only use the data type they provided if they specified it. While this is confusing, this has been the behavior since Spark 1.6, and I didn't want to change this behavior in the QA period of Spark 2.1. We may revisit this decision later.
      
A side effect of this PR is that we won't need https://github.com/apache/spark/pull/15942 if this PR goes in.
      
      ## How was this patch tested?
      
      Regression tests
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #15951 from brkyvz/partition-corruption.
      
      (cherry picked from commit 0d1bf2b6)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
• [SPARK-18073][DOCS][WIP] Migrate wiki to spark.apache.org web site · 5f198d20
      Sean Owen authored
      
      ## What changes were proposed in this pull request?
      
      Updates links to the wiki to links to the new location of content on spark.apache.org.
      
      ## How was this patch tested?
      
      Doc builds
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15967 from srowen/SPARK-18073.1.
      
      (cherry picked from commit 7e0cd1d9)
Signed-off-by: Sean Owen <sowen@cloudera.com>
  5. Nov 22, 2016
• [SPARK-18501][ML][SPARKR] Fix spark.glm errors when fitting on collinear data · fc5fee83
      Yanbo Liang authored
      
      ## What changes were proposed in this pull request?
* Fix SparkR ```spark.glm``` errors when fitting on collinear data (see the sketch below), since ```standard error of coefficients, t value and p value``` are not available in this condition.
* Scala/Python GLM summary should throw an exception if users request ```standard error of coefficients, t value and p value``` but the underlying WLS was solved by local "l-bfgs".
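
For illustration (not part of this patch), a minimal sketch of the collinear-data scenario, assuming a local SparkR session; the toy column names and values are made up:

```r
library(SparkR)
sparkR.session()

# x2 is an exact multiple of x1, so the design matrix is rank-deficient.
df <- createDataFrame(data.frame(y = c(1, 2, 3, 4),
                                 x1 = c(1, 2, 3, 4),
                                 x2 = c(2, 4, 6, 8)))
model <- spark.glm(df, y ~ x1 + x2, family = "gaussian")

# With this fix, summary() no longer errors; the coefficient standard errors,
# t values and p values are simply omitted when WLS falls back to l-bfgs.
summary(model)
```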
      
      ## How was this patch tested?
      Add unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15930 from yanboliang/spark-18501.
      
      (cherry picked from commit 982b82e3)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
• [SPARK-18514][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that` across R API documentation · 63aa01ff
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      It seems in R, there are
      
      - `Note:`
      - `NOTE:`
      - `Note that`
      
      This PR proposes to fix those to `Note:` to be consistent.
      
      **Before**
      
      ![2016-11-21 11 30 07](https://cloud.githubusercontent.com/assets/6477701/20468848/2f27b0fa-afde-11e6-89e3-993701269dbe.png)
      
      **After**
      
![2016-11-21 11 29 44](https://cloud.githubusercontent.com/assets/6477701/20468851/39469664-afde-11e6-9929-ad80be7fc405.png)
      
      ## How was this patch tested?
      
      The notes were found via
      
      ```bash
      grep -r "NOTE: " .
      grep -r "Note that " .
      ```
      
      And then fixed one by one comparing with API documentation.
      
      After that, manually tested via `sh create-docs.sh` under `./R`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15952 from HyukjinKwon/SPARK-18514.
      
      (cherry picked from commit 4922f9cd)
Signed-off-by: Sean Owen <sowen@cloudera.com>
• [SPARK-18444][SPARKR] SparkR running in yarn-cluster mode should not download Spark package. · c7021407
      Yanbo Liang authored
      
      ## What changes were proposed in this pull request?
When running a SparkR job in yarn-cluster mode, it downloads the Spark package from the Apache website, which is not necessary.
      ```
      ./bin/spark-submit --master yarn-cluster ./examples/src/main/r/dataframe.R
      ```
      The following is output:
      ```
      Attaching package: ‘SparkR’
      
      The following objects are masked from ‘package:stats’:
      
          cov, filter, lag, na.omit, predict, sd, var, window
      
      The following objects are masked from ‘package:base’:
      
          as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
          rank, rbind, sample, startsWith, subset, summary, transform, union
      
      Spark not found in SPARK_HOME:
      Spark not found in the cache directory. Installation will start.
      MirrorUrl not provided.
      Looking for preferred site from apache website...
      ......
      ```
There's no ```SPARK_HOME``` in yarn-cluster mode, since the R process runs on a remote host of the YARN cluster rather than on the client host. The JVM comes up first and the R process then connects to it. So in such cases we should never have to download Spark, as Spark is already running.
      
      ## How was this patch tested?
      Offline test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15888 from yanboliang/spark-18444.
      
      (cherry picked from commit acb97157)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
  6. Nov 17, 2016
  7. Nov 16, 2016
  8. Nov 13, 2016
• [SPARK-18412][SPARKR][ML] Fix exception for some SparkR ML algorithms training on libsvm data · 8fc6455c
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
* Fix the following exception, which is thrown when ```spark.randomForest``` (classification), ```spark.gbt``` (classification), ```spark.naiveBayes``` and ```spark.glm``` (binomial family) are fitted on libsvm data (a minimal fitting sketch follows this list):
      ```
      java.lang.IllegalArgumentException: requirement failed: If label column already exists, forceIndexLabel can not be set with true.
      ```
See [SPARK-18412](https://issues.apache.org/jira/browse/SPARK-18412) for more detail about how to reproduce this bug.
      * Refactor out ```getFeaturesAndLabels``` to RWrapperUtils, since lots of ML algorithm wrappers use this function.
      * Drop some unwanted columns when making prediction.
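
For illustration (not part of this patch), a minimal fitting sketch on libsvm data, assuming a local SparkR session and the sample data file shipped with a Spark distribution:

```r
library(SparkR)
sparkR.session()

# Path relative to a Spark checkout/distribution.
df <- read.df("data/mllib/sample_libsvm_data.txt", source = "libsvm")

# Before this fix, the call below failed with
# "If label column already exists, forceIndexLabel can not be set with true."
model <- spark.naiveBayes(df, label ~ features)
head(predict(model, df))
```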
      
      ## How was this patch tested?
      Add unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15851 from yanboliang/spark-18412.
      
      (cherry picked from commit 07be232e)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
  9. Nov 11, 2016
  10. Nov 10, 2016
  11. Nov 08, 2016
• [SPARK-18239][SPARKR] Gradient Boosted Tree for R · 98dd7ac7
      Felix Cheung authored
      
      ## What changes were proposed in this pull request?
      
      Gradient Boosted Tree in R.
      With a few minor improvements to RandomForest in R.
      
      Since this is relatively isolated I'd like to target this for branch-2.1
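
For illustration (not part of this patch), a minimal sketch of the new API, assuming a local SparkR session; the formula and parameter values are made up:

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(mtcars)  # small built-in R data set

# Gradient boosted regression trees; "classification" is the other supported type.
model <- spark.gbt(df, mpg ~ hp + wt, type = "regression", maxIter = 5)
summary(model)
head(predict(model, df))
```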
      
      ## How was this patch tested?
      
      manual tests, unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #15746 from felixcheung/rgbt.
      
      (cherry picked from commit 55964c15)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
  12. Nov 07, 2016
• [SPARK-18291][SPARKR][ML] SparkR glm predict should output original label when family = binomial. · 6b332909
      Yanbo Liang authored
      
      ## What changes were proposed in this pull request?
      SparkR ```spark.glm``` predict should output original label when family = "binomial".
      
      ## How was this patch tested?
      Add unit test.
      You can also run the following code to test:
      ```R
      training <- suppressWarnings(createDataFrame(iris))
      training <- training[training$Species %in% c("versicolor", "virginica"), ]
model <- spark.glm(training, Species ~ Sepal_Length + Sepal_Width, family = binomial(link = "logit"))
      showDF(predict(model, training))
      ```
      Before this change:
      ```
      +------------+-----------+------------+-----------+----------+-----+-------------------+
      |Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|   Species|label|         prediction|
      +------------+-----------+------------+-----------+----------+-----+-------------------+
      |         7.0|        3.2|         4.7|        1.4|versicolor|  0.0| 0.8271421517601544|
      |         6.4|        3.2|         4.5|        1.5|versicolor|  0.0| 0.6044595910413112|
      |         6.9|        3.1|         4.9|        1.5|versicolor|  0.0| 0.7916340858281998|
      |         5.5|        2.3|         4.0|        1.3|versicolor|  0.0|0.16080518180591158|
      |         6.5|        2.8|         4.6|        1.5|versicolor|  0.0| 0.6112229217050189|
      |         5.7|        2.8|         4.5|        1.3|versicolor|  0.0| 0.2555087295500885|
      |         6.3|        3.3|         4.7|        1.6|versicolor|  0.0| 0.5681507664364834|
      |         4.9|        2.4|         3.3|        1.0|versicolor|  0.0|0.05990570219972002|
      |         6.6|        2.9|         4.6|        1.3|versicolor|  0.0| 0.6644434078306246|
      |         5.2|        2.7|         3.9|        1.4|versicolor|  0.0|0.11293577405862379|
      |         5.0|        2.0|         3.5|        1.0|versicolor|  0.0|0.06152372321585971|
      |         5.9|        3.0|         4.2|        1.5|versicolor|  0.0|0.35250697207602555|
      |         6.0|        2.2|         4.0|        1.0|versicolor|  0.0|0.32267018290814303|
      |         6.1|        2.9|         4.7|        1.4|versicolor|  0.0|  0.433391153814592|
      |         5.6|        2.9|         3.6|        1.3|versicolor|  0.0| 0.2280744262436993|
      |         6.7|        3.1|         4.4|        1.4|versicolor|  0.0| 0.7219848389339459|
      |         5.6|        3.0|         4.5|        1.5|versicolor|  0.0|0.23527698971404695|
      |         5.8|        2.7|         4.1|        1.0|versicolor|  0.0|  0.285024533520016|
      |         6.2|        2.2|         4.5|        1.5|versicolor|  0.0| 0.4107047877447493|
      |         5.6|        2.5|         3.9|        1.1|versicolor|  0.0|0.20083561961645083|
      +------------+-----------+------------+-----------+----------+-----+-------------------+
      ```
      After this change:
      ```
      +------------+-----------+------------+-----------+----------+-----+----------+
      |Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|   Species|label|prediction|
      +------------+-----------+------------+-----------+----------+-----+----------+
      |         7.0|        3.2|         4.7|        1.4|versicolor|  0.0| virginica|
      |         6.4|        3.2|         4.5|        1.5|versicolor|  0.0| virginica|
      |         6.9|        3.1|         4.9|        1.5|versicolor|  0.0| virginica|
      |         5.5|        2.3|         4.0|        1.3|versicolor|  0.0|versicolor|
      |         6.5|        2.8|         4.6|        1.5|versicolor|  0.0| virginica|
      |         5.7|        2.8|         4.5|        1.3|versicolor|  0.0|versicolor|
      |         6.3|        3.3|         4.7|        1.6|versicolor|  0.0| virginica|
      |         4.9|        2.4|         3.3|        1.0|versicolor|  0.0|versicolor|
      |         6.6|        2.9|         4.6|        1.3|versicolor|  0.0| virginica|
      |         5.2|        2.7|         3.9|        1.4|versicolor|  0.0|versicolor|
      |         5.0|        2.0|         3.5|        1.0|versicolor|  0.0|versicolor|
      |         5.9|        3.0|         4.2|        1.5|versicolor|  0.0|versicolor|
      |         6.0|        2.2|         4.0|        1.0|versicolor|  0.0|versicolor|
      |         6.1|        2.9|         4.7|        1.4|versicolor|  0.0|versicolor|
      |         5.6|        2.9|         3.6|        1.3|versicolor|  0.0|versicolor|
      |         6.7|        3.1|         4.4|        1.4|versicolor|  0.0| virginica|
      |         5.6|        3.0|         4.5|        1.5|versicolor|  0.0|versicolor|
      |         5.8|        2.7|         4.1|        1.0|versicolor|  0.0|versicolor|
      |         6.2|        2.2|         4.5|        1.5|versicolor|  0.0|versicolor|
      |         5.6|        2.5|         3.9|        1.1|versicolor|  0.0|versicolor|
      +------------+-----------+------------+-----------+----------+-----+----------+
      ```
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15788 from yanboliang/spark-18291.
      
      (cherry picked from commit daa975f4)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
  13. Nov 05, 2016
  14. Nov 04, 2016
  15. Nov 02, 2016
• [SPARK-17470][SQL] unify path for data source table and locationUri for hive serde table · 5ea2f9e5
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
Due to a limitation of the Hive metastore (a table location must be a directory path, not a file path), we always store `path` for data source tables in storage properties instead of in the `locationUri` field. However, we should not expose this difference at the `CatalogTable` level; we should just treat it as a hack in `HiveExternalCatalog`, the same way we store the table schema of data source tables in table properties.

This PR unifies `path` and `locationUri`: outside of `HiveExternalCatalog`, both data source tables and hive serde tables use the `locationUri` field.

This PR also unifies the way we handle the default table location for managed tables. Previously, the default table location of a hive serde managed table was set by the external catalog, while that of a data source table was set by the command. After this PR, we follow the hive way: the default table location is always set by the external catalog.

For managed non-file-based tables, we will assign a default table location and create an empty directory for it; the table location will be removed when the table is dropped. This is reasonable, as the metastore doesn't care whether a table is file-based or not, and an empty table directory does no harm.
For external non-file-based tables, ideally we could omit the table location, but due to a hive metastore issue we will assign a random location to it and remove it right after the table is created. See SPARK-15269 for more details. This is fine as it's well isolated in `HiveExternalCatalog`.

To keep the existing behaviour of the `path` option, in this PR we always add the `locationUri` to storage properties using the key `path` before passing storage properties to `DataSource` as data source options.

## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15024 from cloud-fan/path.
      
      (cherry picked from commit 3a1bc6f4)
Signed-off-by: Yin Huai <yhuai@databricks.com>
• [SPARK-16839][SQL] Simplify Struct creation code path · 41491e54
      eyal farago authored
      
      ## What changes were proposed in this pull request?
      
      Simplify struct creation, especially the aspect of `CleanupAliases` which missed some aliases when handling trees created by `CreateStruct`.
      
      This PR includes:
      
      1. A failing test (create struct with nested aliases, some of the aliases survive `CleanupAliases`).
      2. A fix that transforms `CreateStruct` into a `CreateNamedStruct` constructor, effectively eliminating `CreateStruct` from all expression trees.
      3. A `NamePlaceHolder` used by `CreateStruct` when column names cannot be extracted from unresolved `NamedExpression`.
      4. A new Analyzer rule that resolves `NamePlaceHolder` into a string literal once the `NamedExpression` is resolved.
      5. `CleanupAliases` code was simplified as it no longer has to deal with `CreateStruct`'s top level columns.
      
      ## How was this patch tested?
Ran all test suites in the org.apache.spark.sql package, especially the analysis suite, making sure the added test initially fails; after applying the suggested fix, the entire analysis package runs successfully.

Modified a few tests that expected `CreateStruct`, which is now transformed into `CreateNamedStruct`.
      
      Author: eyal farago <eyal farago>
      Author: Herman van Hovell <hvanhovell@databricks.com>
      Author: eyal farago <eyal.farago@gmail.com>
      Author: Eyal Farago <eyal.farago@actimize.com>
      Author: Hyukjin Kwon <gurwls223@gmail.com>
      Author: eyalfa <eyal.farago@gmail.com>
      
      Closes #15718 from hvanhovell/SPARK-16839-2.
      
      (cherry picked from commit f151bd1a)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
• [SPARK-17838][SPARKR] Check named arguments for options and use formatted R friendly message from JVM exception message · 1ecfafa0
hyukjinkwon authored
      
      ## What changes were proposed in this pull request?
      
This PR proposes to

- improve error messages to be R-friendly rather than exposing the raw JVM exception.

  As `read.json`, `read.text`, `read.orc`, `read.parquet` and `read.jdbc` are executed in the same path as `read.df`, and `write.json`, `write.text`, `write.orc`, `write.parquet` and `write.jdbc` share the same path as `write.df`, it seems safe to call `handledCallJMethod` to handle JVM messages.
- prevent the `zero-length variable name` error and print the ignored options as a warning message.
      
      **Before**
      
      ``` r
      > read.json("path", a = 1, 2, 3, "a")
      Error in env[[name]] <- value :
        zero-length variable name
      ```
      
      ``` r
      > read.json("arbitrary_path")
      Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
        org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
        ...
      
      > read.orc("arbitrary_path")
      Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
        org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
        ...
      
      > read.text("arbitrary_path")
      Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
        org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
        ...
      
      > read.parquet("arbitrary_path")
      Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
        org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
        ...
      ```
      
      ``` r
      > write.json(df, "existing_path")
      Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
        org.apache.spark.sql.AnalysisException: path file:/... already exists.;
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)
      
      > write.orc(df, "existing_path")
      Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
        org.apache.spark.sql.AnalysisException: path file:/... already exists.;
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)
      
      > write.text(df, "existing_path")
      Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
        org.apache.spark.sql.AnalysisException: path file:/... already exists.;
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)
      
      > write.parquet(df, "existing_path")
      Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
        org.apache.spark.sql.AnalysisException: path file:/... already exists.;
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)
      ```
      
      **After**
      
      ``` r
      read.json("arbitrary_path", a = 1, 2, 3, "a")
      Unnamed arguments ignored: 2, 3, a.
      ```
      
      ``` r
      > read.json("arbitrary_path")
      Error in json : analysis error - Path does not exist: file:/...
      
      > read.orc("arbitrary_path")
      Error in orc : analysis error - Path does not exist: file:/...
      
      > read.text("arbitrary_path")
      Error in text : analysis error - Path does not exist: file:/...
      
      > read.parquet("arbitrary_path")
      Error in parquet : analysis error - Path does not exist: file:/...
      ```
      
      ``` r
      > write.json(df, "existing_path")
      Error in json : analysis error - path file:/... already exists.;
      
      > write.orc(df, "existing_path")
      Error in orc : analysis error - path file:/... already exists.;
      
      > write.text(df, "existing_path")
      Error in text : analysis error - path file:/... already exists.;
      
      > write.parquet(df, "existing_path")
      Error in parquet : analysis error - path file:/... already exists.;
      ```
      ## How was this patch tested?
      
      Unit tests in `test_utils.R` and `test_sparkSQL.R`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15608 from HyukjinKwon/SPARK-17838.
  16. Nov 01, 2016
• Herman van Hovell · 0cba535a
• [SPARK-16839][SQL] redundant aliases after cleanupAliases · 5441a626
      eyal farago authored
      ## What changes were proposed in this pull request?
      
      Simplify struct creation, especially the aspect of `CleanupAliases` which missed some aliases when handling trees created by `CreateStruct`.
      
      This PR includes:
      
      1. A failing test (create struct with nested aliases, some of the aliases survive `CleanupAliases`).
      2. A fix that transforms `CreateStruct` into a `CreateNamedStruct` constructor, effectively eliminating `CreateStruct` from all expression trees.
      3. A `NamePlaceHolder` used by `CreateStruct` when column names cannot be extracted from unresolved `NamedExpression`.
      4. A new Analyzer rule that resolves `NamePlaceHolder` into a string literal once the `NamedExpression` is resolved.
      5. `CleanupAliases` code was simplified as it no longer has to deal with `CreateStruct`'s top level columns.
      
      ## How was this patch tested?
      
Ran all test suites in the org.apache.spark.sql package, especially the analysis suite, making sure the added test initially fails; after applying the suggested fix, the entire analysis package runs successfully.

Modified a few tests that expected `CreateStruct`, which is now transformed into `CreateNamedStruct`.
      
      Credit goes to hvanhovell for assisting with this PR.
      
      Author: eyal farago <eyal farago>
      Author: eyal farago <eyal.farago@gmail.com>
      Author: Herman van Hovell <hvanhovell@databricks.com>
      Author: Eyal Farago <eyal.farago@actimize.com>
      Author: Hyukjin Kwon <gurwls223@gmail.com>
      Author: eyalfa <eyal.farago@gmail.com>
      
      Closes #14444 from eyalfa/SPARK-16839_redundant_aliases_after_cleanupAliases.
  17. Oct 30, 2016
• [SPARK-16137][SPARKR] randomForest for R · b6879b8b
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Random Forest Regression and Classification for R
      Clean-up/reordering generics.R
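
For illustration (not part of this patch), a minimal sketch of the new API, assuming a local SparkR session; the formula and parameter values are made up:

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(mtcars)  # small built-in R data set

# Regression forest; "classification" is the other supported type.
model <- spark.randomForest(df, mpg ~ hp + wt, type = "regression", numTrees = 10)
summary(model)
head(predict(model, df))
```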
      
      ## How was this patch tested?
      
      manual tests, unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #15607 from felixcheung/rrandomforest.
• [SPARK-17919] Make timeout to RBackend configurable in SparkR · 2881a2d1
      Hossein authored
      ## What changes were proposed in this pull request?
      
This patch makes the RBackend connection timeout configurable by the user.
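
For illustration (not part of this patch), a sketch of how the timeout might be set; the property name `spark.r.backendConnectionTimeout` is an assumption on my part and should be checked against the configuration docs for the release:

```r
library(SparkR)

# Property name assumed (see note above); the value is in seconds.
sparkR.session(sparkConfig = list(spark.r.backendConnectionTimeout = "6000"))
```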
      
      ## How was this patch tested?
      N/A
      
      Author: Hossein <hossein@databricks.com>
      
      Closes #15471 from falaki/SPARK-17919.
  18. Oct 27, 2016
• [SQL][DOC] updating doc for JSON source to link to jsonlines.org · 44c8bfda
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      API and programming guide doc changes for Scala, Python and R.
      
      ## How was this patch tested?
      
      manual test
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #15629 from felixcheung/jsondoc.
• [SPARK-17157][SPARKR][FOLLOW-UP] doc fixes · 1dbe9896
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
a couple of small, late-found doc fixes
      
      ## How was this patch tested?
      
      manually
      wangmiao1981
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #15650 from felixcheung/logitfix.
  19. Oct 26, 2016
• [SPARK-17157][SPARKR] Add multiclass logistic regression SparkR Wrapper · 29cea8f3
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      As we discussed in #14818, I added a separate R wrapper spark.logit for logistic regression.
      
      This single interface supports both binary and multinomial logistic regression. It also has "predict" and "summary" for binary logistic regression.
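
For illustration (not part of this patch), a minimal binary-classification sketch of the new wrapper, assuming a local SparkR session:

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(iris)
training <- df[df$Species %in% c("versicolor", "virginica"), ]

model <- spark.logit(training, Species ~ Sepal_Length + Sepal_Width)
summary(model)
head(predict(model, training))
```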
      
      ## How was this patch tested?
      
      New unit tests are added.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #15365 from wangmiao1981/glm.
• [SPARK-17961][SPARKR][SQL] Add storageLevel to DataFrame for SparkR · fb0a8a8d
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
Add storageLevel to DataFrame for SparkR.
This is similar to this PR: https://github.com/apache/spark/pull/13780,
but in R I do not create a class for `StorageLevel`;
instead I add a method `storageToString`.
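
For illustration (not part of this patch), a minimal sketch of the new method, assuming a local SparkR session; the exact string returned depends on `storageToString`:

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(faithful)
persist(df, "MEMORY_AND_DISK")

# Returns the data frame's storage level as a human-readable string.
storageLevel(df)
```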
      
      ## How was this patch tested?
      
      test added.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #15516 from WeichenXu123/storageLevel_df_r.
  20. Oct 25, 2016
  21. Oct 23, 2016
  22. Oct 21, 2016
• [SPARK-17811] SparkR cannot parallelize data.frame with NA or NULL in Date columns · e371040a
      Hossein authored
      ## What changes were proposed in this pull request?
NA date values are serialized as "NA" and NA time values are serialized as NaN from R. In the backend we did not have proper logic to deal with them; as a result we got an IllegalArgumentException for Date and a wrong value for time. This PR adds support for deserializing NA for Date and Time values.
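
For illustration (not part of this patch), a minimal sketch of the previously failing case, assuming a local SparkR session:

```r
library(SparkR)
sparkR.session()

local_df <- data.frame(d = as.Date(c("2016-10-21", NA)),
                       t = as.POSIXct(c("2016-10-21 12:00:00", NA)))

# Before this fix, serializing the NA Date/time values to the JVM failed;
# with it, the NA values round-trip cleanly.
df <- createDataFrame(local_df)
collect(df)
```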
      
      ## How was this patch tested?
      * [x] TODO
      
      Author: Hossein <hossein@databricks.com>
      
      Closes #15421 from falaki/SPARK-17811.
• [SPARK-18013][SPARKR] add crossJoin API · e21e1c94
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Add crossJoin and do not default to cross join if joinExpr is left out
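
For illustration (not part of this patch), a minimal sketch of the new API, assuming a local SparkR session:

```r
library(SparkR)
sparkR.session()

df1 <- createDataFrame(data.frame(id = 1:3))
df2 <- createDataFrame(data.frame(name = c("a", "b")))

# A Cartesian product now has to be requested explicitly; join() without a
# joinExpr no longer silently falls back to a cross join.
head(crossJoin(df1, df2))
```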
      
      ## How was this patch tested?
      
      unit test
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #15559 from felixcheung/rcrossjoin.
• [SPARK-17674][SPARKR] check for warning in test output · 4efdc764
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
The testthat library we use for testing R redirects warnings (and disables `options("warn" = 2)`), so we need a way to detect any new warning and fail.
      
      ## How was this patch tested?
      
      manual testing, Jenkins
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #15576 from felixcheung/rtestwarning.
  23. Oct 20, 2016
• [SPARKR] fix warnings · 3180272d
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Fix for a bunch of test warnings that were added recently.
      We need to investigate why warnings are not turning into errors.
      
      ```
      Warnings -----------------------------------------------------------------------
      1. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Sepal_Length instead of Sepal.Length  as column name
      
      2. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Sepal_Width instead of Sepal.Width  as column name
      
      3. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Petal_Length instead of Petal.Length  as column name
      
      4. createDataFrame uses files for large objects (test_sparkSQL.R#215) - Use Petal_Width instead of Petal.Width  as column name
      
      Consider adding
        importFrom("utils", "object.size")
      to your NAMESPACE file.
      ```
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #15560 from felixcheung/rwarnings.
  24. Oct 12, 2016
• [SPARK-17790][SPARKR] Support for parallelizing R data.frame larger than 2GB · 5cc503f4
      Hossein authored
      ## What changes were proposed in this pull request?
      If the R data structure that is being parallelized is larger than `INT_MAX` we use files to transfer data to JVM. The serialization protocol mimics Python pickling. This allows us to simply call `PythonRDD.readRDDFromFile` to create the RDD.
      
I tested this on my MacBook. The following code works with this patch:
      ```R
      intMax <- .Machine$integer.max
      largeVec <- 1:intMax
      rdd <- SparkR:::parallelize(sc, largeVec, 2)
      ```
      
      ## How was this patch tested?
      * [x] Unit tests
      
      Author: Hossein <hossein@databricks.com>
      
      Closes #15375 from falaki/SPARK-17790.
  25. Oct 11, 2016
• [SPARK-17720][SQL] introduce static SQL conf · b9a14718
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      SQLConf is session-scoped and mutable. However, we do have the requirement for a static SQL conf, which is global and immutable, e.g. the `schemaStringThreshold` in `HiveExternalCatalog`, the flag to enable/disable hive support, the global temp view database in https://github.com/apache/spark/pull/14897.
      
Actually we've already implemented static SQL conf implicitly via `SparkConf`; this PR just makes it explicit and exposes it to users, so that they can see the config value via a SQL command or `SparkSession.conf`, and forbids users from setting/unsetting static SQL confs.
      
      ## How was this patch tested?
      
      new tests in SQLConfSuite
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15295 from cloud-fan/global-conf.
• [SPARK-15153][ML][SPARKR] Fix SparkR spark.naiveBayes error when label is numeric type · 23405f32
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
Fix the SparkR ```spark.naiveBayes``` error when the response variable of the dataset is of numeric type.
      See details and how to reproduce this bug at [SPARK-15153](https://issues.apache.org/jira/browse/SPARK-15153).
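
For illustration (not part of this patch), a minimal sketch of the previously failing case, assuming a local SparkR session; the toy columns are made up:

```r
library(SparkR)
sparkR.session()

# Numeric (0/1) response column rather than a string label.
df <- createDataFrame(data.frame(label = c(0, 0, 1, 1),
                                 f1 = c(1, 1, 0, 0),
                                 f2 = c(0, 1, 0, 1)))
model <- spark.naiveBayes(df, label ~ f1 + f2)
head(predict(model, df))
```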
      
      ## How was this patch tested?
      Add unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15431 from yanboliang/spark-15153-2.
  26. Oct 07, 2016
• [SPARK-17665][SPARKR] Support options/mode all for read/write APIs and options in other types · 9d8ae853
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR includes the changes below:
      
        - Support `mode`/`options` in `read.parquet`, `write.parquet`, `read.orc`, `write.orc`, `read.text`, `write.text`, `read.json` and `write.json` APIs
      
        - Support other types (logical, numeric and string) as options for `write.df`, `read.df`, `read.parquet`, `write.parquet`, `read.orc`, `write.orc`, `read.text`, `write.text`, `read.json` and `write.json`
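
For illustration (not part of this patch), a minimal sketch of the extended APIs, assuming a local SparkR session; the paths and option values are made up:

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(faithful)

# mode (and other data source options) can now be passed directly.
write.json(df, "/tmp/faithful_json", mode = "overwrite")

# Logical and numeric option values are accepted as well, e.g. for CSV.
write.df(df, "/tmp/faithful_csv", source = "csv", header = "true", mode = "overwrite")
csv_df <- read.df("/tmp/faithful_csv", source = "csv", header = TRUE, inferSchema = TRUE)
```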
      
      ## How was this patch tested?
      
      Unit tests in `test_sparkSQL.R`/ `utils.R`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15239 from HyukjinKwon/SPARK-17665.
  27. Oct 05, 2016
• [SPARK-17658][SPARKR] read.df/write.df API taking path optionally in SparkR · c9fe10d4
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
The `write.df`/`read.df` APIs require a path, which is not actually always necessary in Spark. Currently, this only affects data sources implementing `CreatableRelationProvider`. Spark does not currently have internal data sources implementing this, but it would affect other external data sources.
      
      In addition we'd be able to use this way in Spark's JDBC datasource after https://github.com/apache/spark/pull/12601 is merged.
      
      **Before**
      
       - `read.df`
      
        ```r
      > read.df(source = "json")
      Error in dispatchFunc("read.df(path = NULL, source = NULL, schema = NULL, ...)",  :
        argument "x" is missing with no default
      ```
      
        ```r
      > read.df(path = c(1, 2))
      Error in dispatchFunc("read.df(path = NULL, source = NULL, schema = NULL, ...)",  :
        argument "x" is missing with no default
      ```
      
        ```r
      > read.df(c(1, 2))
      Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
        java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.String
      	at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:300)
      	at
      ...
      In if (is.na(object)) { :
      ...
      ```
      
       - `write.df`
      
        ```r
      > write.df(df, source = "json")
      Error in (function (classes, fdef, mtable)  :
        unable to find an inherited method for function ‘write.df’ for signature ‘"function", "missing"’
      ```
      
        ```r
      > write.df(df, source = c(1, 2))
      Error in (function (classes, fdef, mtable)  :
        unable to find an inherited method for function ‘write.df’ for signature ‘"SparkDataFrame", "missing"’
      ```
      
        ```r
      > write.df(df, mode = TRUE)
      Error in (function (classes, fdef, mtable)  :
        unable to find an inherited method for function ‘write.df’ for signature ‘"SparkDataFrame", "missing"’
      ```
      
      **After**
      
      - `read.df`
      
        ```r
      > read.df(source = "json")
      Error in loadDF : analysis error - Unable to infer schema for JSON at . It must be specified manually;
      ```
      
        ```r
      > read.df(path = c(1, 2))
      Error in f(x, ...) : path should be charactor, null or omitted.
      ```
      
        ```r
      > read.df(c(1, 2))
      Error in f(x, ...) : path should be charactor, null or omitted.
      ```
      
      - `write.df`
      
        ```r
      > write.df(df, source = "json")
      Error in save : illegal argument - 'path' is not specified
      ```
      
        ```r
      > write.df(df, source = c(1, 2))
      Error in .local(df, path, ...) :
        source should be charactor, null or omitted. It is 'parquet' by default.
      ```
      
        ```r
      > write.df(df, mode = TRUE)
      Error in .local(df, path, ...) :
        mode should be charactor or omitted. It is 'error' by default.
      ```
      
      ## How was this patch tested?
      
      Unit tests in `test_sparkSQL.R`
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15231 from HyukjinKwon/write-default-r.
  28. Oct 04, 2016