  1. Apr 01, 2016
    • Yanbo Liang's avatar
      [SPARK-14303][ML][SPARKR] Define and use KMeansWrapper for SparkR::kmeans · 22249afb
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
Define and use ```KMeansWrapper``` for ```SparkR::kmeans```. This is purely a code refactor of the original ```KMeans``` wrapper.
      
      ## How was this patch tested?
      Existing tests.
      
      cc mengxr
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #12039 from yanboliang/spark-14059.
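      A minimal usage sketch of the SparkR ```kmeans``` API this wrapper backs; the signature (```centers```, ```iter.max```) and the ```summary```/```fitted``` accessors are assumed from the SparkR interface of this era:
      ```r
      # Assumes library(SparkR) is attached and
      # sqlContext <- sparkRSQL.init(sparkR.init()) has been run.
      df <- createDataFrame(sqlContext, iris[, 1:4])   # numeric columns only
      model <- kmeans(df, centers = 3, iter.max = 10)
      summary(model)        # cluster centers and sizes
      head(fitted(model))   # per-row cluster assignments
      ```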
      22249afb
  2. Mar 28, 2016
    • Sun Rui's avatar
      [SPARK-12792] [SPARKR] Refactor RRDD to support R UDF. · d3638d7b
      Sun Rui authored
      ## What changes were proposed in this pull request?
      
Refactor RRDD by separating the common logic for interacting with the R worker into a new class, RRunner, which can be used to evaluate R UDFs.

Now RRDD relies on RRunner for RDD computation, and RRDD could be removed if we want to remove the RDD API in SparkR later.
      
      ## How was this patch tested?
      dev/lint-r
      SparkR unit tests
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #12024 from sun-rui/SPARK-12792_new.
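      As a rough illustration of the R-UDF path RRunner opens up, here is a sketch using the ```dapply``` API that later builds on it; the Spark 2.0 signature is assumed, and the column names are illustrative:
      ```r
      # Spark 2.0-style session assumed: library(SparkR); sparkR.session()
      # dapply applies an R function per partition, with RRunner ferrying
      # data between the JVM and the R worker.
      df <- createDataFrame(iris[, 1:2])   # dots in names become underscores
      schema <- structType(structField("Sepal_Length", "double"),
                           structField("doubled", "double"))
      out <- dapply(df, function(pdf) {
        data.frame(pdf$Sepal_Length, pdf$Sepal_Length * 2)
      }, schema)
      head(out)
      ```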
      d3638d7b
    • Davies Liu's avatar
      Revert "[SPARK-12792] [SPARKR] Refactor RRDD to support R UDF." · e5a1b301
      Davies Liu authored
      This reverts commit 40984f67.
      e5a1b301
    • Sun Rui's avatar
      [SPARK-12792] [SPARKR] Refactor RRDD to support R UDF. · 40984f67
      Sun Rui authored
Refactor RRDD by separating the common logic for interacting with the R worker into a new class, RRunner, which can be used to evaluate R UDFs.

Now RRDD relies on RRunner for RDD computation, and RRDD could be removed if we want to remove the RDD API in SparkR later.
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #10947 from sun-rui/SPARK-12792.
      40984f67
  3. Mar 25, 2016
    • Andrew Or's avatar
      [SPARK-14014][SQL] Integrate session catalog (attempt #2) · 20ddf5fd
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      This reopens #11836, which was merged but promptly reverted because it introduced flaky Hive tests.
      
      ## How was this patch tested?
      
      See `CatalogTestCases`, `SessionCatalogSuite` and `HiveContextSuite`.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #11938 from andrewor14/session-catalog-again.
      20ddf5fd
    • Yanbo Liang's avatar
      [SPARK-13010][ML][SPARKR] Implement a simple wrapper of AFTSurvivalRegression in SparkR · 13cbb2de
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
This PR continues the work in #11447: we implemented the wrapper of ```AFTSurvivalRegression```, named ```survreg```, in SparkR.
      
      ## How was this patch tested?
Tested against output from R package survival's survreg.
      
      cc mengxr felixcheung
      
      Close #11447
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11932 from yanboliang/spark-13010-new.
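      A hedged sketch of the resulting API, assuming ```survreg``` mirrors survival::survreg's formula/data interface (dots in data.frame column names become underscores in SparkR):
      ```r
      # Assumes library(SparkR) is attached and sqlContext is initialized.
      library(survival)   # for the ovarian data and Surv()
      df <- createDataFrame(sqlContext, ovarian)
      model <- survreg(Surv(futime, fustat) ~ ecog_ps + rx, data = df)
      summary(model)              # coefficients, survival-package style
      head(predict(model, df))    # predicted survival times
      ```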
      13cbb2de
  4. Mar 23, 2016
    • Andrew Or's avatar
      [SPARK-14014][SQL] Replace existing catalog with SessionCatalog · 5dfc0197
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      `SessionCatalog`, introduced in #11750, is a catalog that keeps track of temporary functions and tables, and delegates metastore operations to `ExternalCatalog`. This functionality overlaps a lot with the existing `analysis.Catalog`.
      
      As of this commit, `SessionCatalog` and `ExternalCatalog` will no longer be dead code. There are still things that need to be done after this patch, namely:
      - SPARK-14013: Properly implement temporary functions in `SessionCatalog`
      - SPARK-13879: Decide which DDL/DML commands to support natively in Spark
      - SPARK-?????: Implement the ones we do want to support through `SessionCatalog`.
      - SPARK-?????: Merge SQL/HiveContext
      
      ## How was this patch tested?
      
      This is largely a refactoring task so there are no new tests introduced. The particularly relevant tests are `SessionCatalogSuite` and `ExternalCatalogSuite`.
      
      Author: Andrew Or <andrew@databricks.com>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #11836 from andrewor14/use-session-catalog.
      5dfc0197
  5. Mar 22, 2016
    • Xusen Yin's avatar
      [SPARK-13449] Naive Bayes wrapper in SparkR · d6dc12ef
      Xusen Yin authored
      ## What changes were proposed in this pull request?
      
      This PR continues the work in #11486 from yinxusen with some code refactoring. In R package e1071, `naiveBayes` supports both categorical (Bernoulli) and continuous features (Gaussian), while in MLlib we support Bernoulli and multinomial. This PR implements the common subset: Bernoulli.
      
I moved the implementation out of SparkRWrappers into NaiveBayesWrapper to make it easier to read. Argument names, default values, and summary now match e1071's naiveBayes.

I removed the preprocessing step that omits NA values, because we don't know which columns to process.
      
      ## How was this patch tested?
      
Tested against output from R package e1071's naiveBayes.
      
      cc: yanboliang yinxusen
      
      Closes #11486
      
      Author: Xusen Yin <yinxusen@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #11890 from mengxr/SPARK-13449.
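      A usage sketch under the stated e1071-style interface; the dataset and column choices are illustrative assumptions:
      ```r
      # Assumes library(SparkR) is attached and sqlContext is initialized.
      # Bernoulli naive Bayes: features are expected to be categorical.
      df <- createDataFrame(sqlContext, infert)
      model <- naiveBayes(education ~ spontaneous + induced, data = df, laplace = 0)
      summary(model)             # a-priori and conditional probabilities, e1071-style
      head(predict(model, df))   # predicted labels
      ```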
      d6dc12ef
  6. Mar 19, 2016
    • Dongjoon Hyun's avatar
      [MINOR][DOCS] Use `spark-submit` instead of `sparkR` to submit R script. · 2082a495
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
Since `sparkR` is no longer used for submitting R scripts as of Spark 2.0, users who follow the instructions in `R/README.md` hit the following error message. This PR updates `R/README.md`.
      ```bash
      $ ./bin/sparkR examples/src/main/r/dataframe.R
      Running R applications through 'sparkR' is not supported as of Spark 2.0.
      Use ./bin/spark-submit <R file>
      ```
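      The corrected invocation, per the error message itself, is `./bin/spark-submit examples/src/main/r/dataframe.R`.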
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11842 from dongjoon-hyun/update_r_readme.
      2082a495
  7. Mar 13, 2016
    • Sun Rui's avatar
      [SPARK-13812][SPARKR] Fix SparkR lint-r test errors. · c7e68c39
      Sun Rui authored
      ## What changes were proposed in this pull request?
      
This PR fixes all SparkR lint-r errors newly captured after the lintr package was updated from GitHub.
      
      ## How was this patch tested?
      
      dev/lint-r
      SparkR unit tests
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #11652 from sun-rui/SPARK-13812.
      c7e68c39
  8. Feb 25, 2016
    • Yanbo Liang's avatar
      [SPARK-13504] [SPARKR] Add approxQuantile for SparkR · 50e60e36
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Add ```approxQuantile``` for SparkR.
      ## How was this patch tested?
      unit tests
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11383 from yanboliang/spark-13504 and squashes the following commits:
      
      4f17adb [Yanbo Liang] Add approxQuantile for SparkR
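      A quick sketch of the added API; the argument order (column name, probabilities, relative error) is assumed:
      ```r
      # Assumes library(SparkR) is attached and sqlContext is initialized.
      df <- createDataFrame(sqlContext, data.frame(x = as.numeric(1:100)))
      # Approximate quartiles of column x within 1% relative error.
      approxQuantile(df, "x", probabilities = c(0.25, 0.5, 0.75), relativeError = 0.01)
      ```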
      50e60e36
  9. Feb 21, 2016
    • Cheng Lian's avatar
      [SPARK-12799] Simplify various string output for expressions · d9efe63e
      Cheng Lian authored
      This PR introduces several major changes:
      
      1. Replacing `Expression.prettyString` with `Expression.sql`
      
The `prettyString` method is mostly an internal, developer-facing facility for debugging purposes, and shouldn't be exposed to users.
      
1. Using SQL-like representations as column names for selected fields that are not named expressions (back-ticks and double quotes should be removed)

   Before, we were using `prettyString` as column names when possible, and sometimes the resulting column names can be weird. Here are several examples:
      
         Expression         | `prettyString` | `sql`      | Note
         ------------------ | -------------- | ---------- | ---------------
         `a && b`           | `a && b`       | `a AND b`  |
         `a.getField("f")`  | `a[f]`         | `a.f`      | `a` is a struct
      
      1. Adding trait `NonSQLExpression` extending from `Expression` for expressions that don't have a SQL representation (e.g. Scala UDF/UDAF and Java/Scala object expressions used for encoders)
      
`NonSQLExpression.sql` may return an arbitrary user-facing string representation of the expression.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #10757 from liancheng/spark-12799.simplify-expression-string-methods.
      d9efe63e
  10. Jan 26, 2016
    • Yanbo Liang's avatar
      [SPARK-12903][SPARKR] Add covar_samp and covar_pop for SparkR · e7f9199e
      Yanbo Liang authored
      Add ```covar_samp``` and ```covar_pop``` for SparkR.
Should we also provide a ```cov``` alias for ```covar_samp```? There is already a ```cov``` implementation in stats.R which masks ```stats::cov```, but an alias could introduce a breaking API change.
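      A hedged usage sketch of the two new aggregate functions:
      ```r
      # Assumes library(SparkR) is attached and sqlContext is initialized.
      df <- createDataFrame(sqlContext, mtcars)
      # Sample vs. population covariance of mpg and hp, in one aggregation.
      collect(agg(df, covar_samp(df$mpg, df$hp), covar_pop(df$mpg, df$hp)))
      ```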
      
      cc sun-rui felixcheung shivaram
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #10829 from yanboliang/spark-12903.
      e7f9199e
  11. Jan 20, 2016
    • Sun Rui's avatar
      [SPARK-12204][SPARKR] Implement drop method for DataFrame in SparkR. · 1b2a918e
      Sun Rui authored
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #10201 from sun-rui/SPARK-12204.
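      A minimal sketch, assuming ```drop``` takes a column name (a Column object may work as well):
      ```r
      # Assumes library(SparkR) is attached and sqlContext is initialized.
      df <- createDataFrame(sqlContext, iris)
      df2 <- drop(df, "Species")
      columns(df2)   # the remaining four numeric columns
      ```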
      1b2a918e
    • smishra8's avatar
      [SPARK-12910] Fixes : R version for installing sparkR · d7415991
      smishra8 authored
      Testing code:
      ```
      $ ./install-dev.sh
      USING R_HOME = /usr/bin
      ERROR: this R is version 2.15.1, package 'SparkR' requires R >= 3.0
      ```
      
      Using the new argument:
      ```
      $ ./install-dev.sh /content/username/SOFTWARE/R-3.2.3
      USING R_HOME = /content/username/SOFTWARE/R-3.2.3/bin
      * installing *source* package ‘SparkR’ ...
      ** R
      ** inst
      ** preparing package for lazy loading
      Creating a new generic function for ‘colnames’ in package ‘SparkR’
      Creating a new generic function for ‘colnames<-’ in package ‘SparkR’
      Creating a new generic function for ‘cov’ in package ‘SparkR’
      Creating a new generic function for ‘na.omit’ in package ‘SparkR’
      Creating a new generic function for ‘filter’ in package ‘SparkR’
      Creating a new generic function for ‘intersect’ in package ‘SparkR’
      Creating a new generic function for ‘sample’ in package ‘SparkR’
      Creating a new generic function for ‘transform’ in package ‘SparkR’
      Creating a new generic function for ‘subset’ in package ‘SparkR’
      Creating a new generic function for ‘summary’ in package ‘SparkR’
      Creating a new generic function for ‘lag’ in package ‘SparkR’
      Creating a new generic function for ‘rank’ in package ‘SparkR’
      Creating a new generic function for ‘sd’ in package ‘SparkR’
      Creating a new generic function for ‘var’ in package ‘SparkR’
      Creating a new generic function for ‘predict’ in package ‘SparkR’
      Creating a new generic function for ‘rbind’ in package ‘SparkR’
      Creating a generic function for ‘lapply’ from package ‘base’ in package ‘SparkR’
      Creating a generic function for ‘Filter’ from package ‘base’ in package ‘SparkR’
      Creating a generic function for ‘alias’ from package ‘stats’ in package ‘SparkR’
      Creating a generic function for ‘substr’ from package ‘base’ in package ‘SparkR’
      Creating a generic function for ‘%in%’ from package ‘base’ in package ‘SparkR’
      Creating a generic function for ‘mean’ from package ‘base’ in package ‘SparkR’
      Creating a generic function for ‘unique’ from package ‘base’ in package ‘SparkR’
      Creating a generic function for ‘nrow’ from package ‘base’ in package ‘SparkR’
      Creating a generic function for ‘ncol’ from package ‘base’ in package ‘SparkR’
      Creating a generic function for ‘head’ from package ‘utils’ in package ‘SparkR’
      Creating a generic function for ‘factorial’ from package ‘base’ in package ‘SparkR’
      Creating a generic function for ‘atan2’ from package ‘base’ in package ‘SparkR’
      Creating a generic function for ‘ifelse’ from package ‘base’ in package ‘SparkR’
      ** help
      No man pages found in package  ‘SparkR’
      *** installing help indices
      ** building package indices
      ** testing if installed package can be loaded
      * DONE (SparkR)
      
      ```
      
      Author: Shubhanshu Mishra <smishra8@illinois.edu>
      
      Closes #10836 from napsternxg/master.
      d7415991
    • Herman van Hovell's avatar
      [SPARK-12848][SQL] Change parsed decimal literal datatype from Double to Decimal · 10173279
      Herman van Hovell authored
The current parser turns a decimal literal, for example ```12.1```, into a Double. The problem with this approach is that we convert an exact literal into a non-exact ```Double```. This PR changes that behavior: a decimal literal is now converted into an exact ```BigDecimal```.

The behavior for scientific decimals, for example ```12.1e01```, is unchanged; these are still converted into a Double.

Since ```BigDecimal``` is now the default, this PR replaces the explicit ```BigDecimal``` literal with a ```Double``` literal: you can get a double by appending a 'D' to the value, for instance ```3.141527D```.
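      The new behavior can be checked from SparkR, for instance (a sketch, assuming the pre-2.0 ```sql(sqlContext, ...)``` signature):
      ```r
      df <- sql(sqlContext, "SELECT 12.1 AS dec_col, 12.1D AS dbl_col")
      printSchema(df)
      # dec_col is expected to be decimal(3,1); dbl_col a double
      ```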
      
      cc davies rxin
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #10796 from hvanhovell/SPARK-12848.
      10173279
  12. Jan 19, 2016
    • felixcheung's avatar
      [SPARK-12232][SPARKR] New R API for read.table to avoid name conflict · 488bbb21
      felixcheung authored
shivaram, sorry it took longer to fix some conflicts; this is the change to add an alias for `table`.
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #10406 from felixcheung/readtable.
      488bbb21
    • Sun Rui's avatar
      [SPARK-12337][SPARKR] Implement dropDuplicates() method of DataFrame in SparkR. · 3ac64828
      Sun Rui authored
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #10309 from sun-rui/SPARK-12337.
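      A minimal sketch, assuming the column-subset variant takes a character vector of column names:
      ```r
      # Assumes library(SparkR) is attached and sqlContext is initialized.
      df <- createDataFrame(sqlContext, mtcars)
      # Keep one row per (cyl, gear) combination.
      dedup <- dropDuplicates(df, c("cyl", "gear"))
      ```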
      3ac64828
    • felixcheung's avatar
      [SPARK-12168][SPARKR] Add automated tests for conflicted function in R · 37fefa66
      felixcheung authored
Currently this is what is reported when loading the SparkR package in R (is.nan would probably be added):
      ```
      Loading required package: methods
      
      Attaching package: ‘SparkR’
      
      The following objects are masked from ‘package:stats’:
      
          cov, filter, lag, na.omit, predict, sd, var
      
      The following objects are masked from ‘package:base’:
      
          colnames, colnames<-, intersect, rank, rbind, sample, subset,
          summary, table, transform
      ```
      
Adding this test gives us an automated way to track changes to the masked methods.
Also, the second part of this test checks for those functions that would not be accessible without a namespace/package prefix.
      
      Incidentally, this might point to how we would fix those inaccessible functions in base or stats.
Looking for feedback on adding this test.
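      For reference, the masked originals stay reachable with an explicit namespace prefix (plain R, no SparkR involved):
      ```r
      stats::cov(mtcars$mpg, mtcars$hp)   # the stats version, despite SparkR's cov
      base::rank(c(3, 1, 2))              # the base version, despite SparkR's rank
      ```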
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #10171 from felixcheung/rmaskedtest.
      37fefa66
  13. Jan 15, 2016
    • Oscar D. Lara Yejas's avatar
      [SPARK-11031][SPARKR] Method str() on a DataFrame · ba4a6419
      Oscar D. Lara Yejas authored
      Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
      Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu>
      Author: Oscar D. Lara Yejas <oscar.lara.yejas@us.ibm.com>
      Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
      
      Closes #9613 from olarayej/SPARK-11031.
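      A minimal sketch of the new method; the output shape is assumed to follow utils::str:
      ```r
      # Assumes library(SparkR) is attached and sqlContext is initialized.
      df <- createDataFrame(sqlContext, iris)
      str(df)   # compact one-line-per-column view of names, types, and sample values
      ```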
      ba4a6419
  14. Jan 14, 2016
    • Wenchen Fan's avatar
      [SPARK-12756][SQL] use hash expression in Exchange · 962e9bcf
      Wenchen Fan authored
This PR makes bucketing and exchange share one common hash algorithm, so that we can guarantee the data distribution is the same between shuffle and the bucketed data source, which enables us to shuffle only one side when joining a bucketed table with a normal one.
      
      This PR also fixes the tests that are broken by the new hash behaviour in shuffle.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10703 from cloud-fan/use-hash-expr-in-shuffle.
      962e9bcf
  15. Jan 05, 2016
    • felixcheung's avatar
      [SPARK-12625][SPARKR][SQL] replace R usage of Spark SQL deprecated API · cc4d5229
      felixcheung authored
      rxin davies shivaram
Took save mode from my PR #10480, and moved everything to writer methods. This is related to PR #10559
      
      - [x] it seems jsonRDD() is broken, need to investigate - this is not a public API though; will look into some more tonight. (fixed)
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #10584 from felixcheung/rremovedeprecated.
      cc4d5229
  16. Dec 29, 2015
    • Hossein's avatar
      [SPARK-11199][SPARKR] Improve R context management story and add getOrCreate · f6ecf143
      Hossein authored
      * Changes api.r.SQLUtils to use ```SQLContext.getOrCreate``` instead of creating a new context.
      * Adds a simple test
      
      [SPARK-11199] #comment link with JIRA
      
      Author: Hossein <hossein@databricks.com>
      
      Closes #9185 from falaki/SPARK-11199.
      f6ecf143
    • Forest Fang's avatar
[SPARK-12526][SPARKR] `ifelse`, `when`, `otherwise` unable to take Column as value · d80cc90b
      Forest Fang authored
`ifelse`, `when`, and `otherwise` are unable to take `Column`-typed S4 objects as values.
      
      For example:
      ```r
      ifelse(lit(1) == lit(1), lit(2), lit(3))
      ifelse(df$mpg > 0, df$mpg, 0)
      ```
      will both fail with
      ```r
      attempt to replicate an object of type 'environment'
      ```
      
The PR replaces `ifelse` calls with `if ... else ...` inside the function implementations to avoid the attempt to vectorize (i.e. `rep()`). It remains to be discussed whether we should instead support vectorization in these functions for consistency, because `ifelse` in base R is vectorized, but I cannot foresee any scenario where these functions would need to be vectorized in SparkR.
      
      For reference, added test cases which trigger failures:
      ```r
      . Error: when(), otherwise() and ifelse() with column on a DataFrame ----------
      error in evaluating the argument 'x' in selecting a method for function 'collect':
        error in evaluating the argument 'col' in selecting a method for function 'select':
        attempt to replicate an object of type 'environment'
      Calls: when -> when -> ifelse -> ifelse
      
      1: withCallingHandlers(eval(code, new_test_environment), error = capture_calls, message = function(c) invokeRestart("muffleMessage"))
      2: eval(code, new_test_environment)
      3: eval(expr, envir, enclos)
      4: expect_equal(collect(select(df, when(df$a > 1 & df$b > 2, lit(1))))[, 1], c(NA, 1)) at test_sparkSQL.R:1126
      5: expect_that(object, equals(expected, label = expected.label, ...), info = info, label = label)
      6: condition(object)
      7: compare(actual, expected, ...)
      8: collect(select(df, when(df$a > 1 & df$b > 2, lit(1))))
      Error: Test failures
      Execution halted
      ```
      
      Author: Forest Fang <forest.fang@outlook.com>
      
      Closes #10481 from saurfang/spark-12526.
      d80cc90b