  1. Nov 27, 2015
    • [SPARK-12025][SPARKR] Rename some window rank function names for SparkR · ba02f6cb
      Yanbo Liang authored
      Change ```cumeDist -> cume_dist, denseRank -> dense_rank, percentRank -> percent_rank, rowNumber -> row_number``` on the SparkR side.
      There are two reasons we should make this change:
      * We should follow the [naming convention rule of R](http://www.inside-r.org/node/230645)
      * Spark DataFrame has deprecated the old names (such as ```cumeDist```) and will remove them in Spark 2.0.
      
      It's better to fix this issue before the 1.6 release; otherwise we will make a breaking API change.
      cc shivaram sun-rui
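      
      As a quick illustration of the new names (a minimal sketch assuming an active SparkR session; each function returns a Column to be used over a window):
      ```r
      library(SparkR)
      # Renamed window rank functions (old camelCase names are deprecated):
      # cumeDist() -> cume_dist(), denseRank() -> dense_rank(),
      # percentRank() -> percent_rank(), rowNumber() -> row_number()
      rn <- row_number()     # sequential number of a row within its window
      dr <- dense_rank()     # rank without gaps between groups of ties
      pr <- percent_rank()   # relative rank in [0, 1]
      cd <- cume_dist()      # cumulative distribution of values in the window
      ```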
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #10016 from yanboliang/SPARK-12025.
  2. Nov 19, 2015
    • [SPARK-11339][SPARKR] Document the list of functions in R base package that are masked by functions with same name in SparkR · 1a93323c
      felixcheung authored
      
      Added tests for functions that are reported as masked, to make sure the base:: or stats:: function can be called.
      
      For those we can't call, added them to the SparkR programming guide.
      
      It would seem to me that `table, sample, subset, filter, cov` not working is not actually expected - I investigated/experimented with them but couldn't get them to work. It looks like, as they are defined in base or stats, they are missing the S3 generic, e.g.:
      ```
      > methods("transform")
      [1] transform,ANY-method       transform.data.frame
      [3] transform,DataFrame-method transform.default
      see '?methods' for accessing help and source code
      > methods("subset")
      [1] subset.data.frame       subset,DataFrame-method subset.default
      [4] subset.matrix
      see '?methods' for accessing help and source code
      Warning message:
      In .S3methods(generic.function, class, parent.frame()) :
        function 'subset' appears not to be S3 generic; found functions that look like S3 methods
      ```
      Any idea?
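      
      For the functions that do work when namespace-qualified, the tests boil down to calls like these (a minimal sketch using functions from the masked list above):
      ```r
      library(SparkR)              # masks several base/stats functions, e.g. sample, filter, cov
      base::sample(1:10, 3)        # explicitly call the base version
      stats::cov(cbind(1:5, 5:1))  # explicitly call the stats version
      ```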
      
      More information on masking:
      http://www.ats.ucla.edu/stat/r/faq/referencing_objects.htm
      http://www.sfu.ca/~sweldon/howTo/guide4.pdf
      
      This is what the output doc looks like (minus css):
      ![image](https://cloud.githubusercontent.com/assets/8969467/11229714/2946e5de-8d4d-11e5-94b0-dda9696b6fdd.png)
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #9785 from felixcheung/rmasked.
  3. Nov 15, 2015
    • [SPARK-10500][SPARKR] sparkr.zip cannot be created if /R/lib is unwritable · 835a79d7
      Sun Rui authored
      The basic idea is as follows:
      The archive of the SparkR package itself, sparkr.zip, is created during the build process and is contained in the Spark binary distribution. It is not changed after the distribution is installed, as the directory it resides in ($SPARK_HOME/R/lib) may not be writable.
      
      When there is R source code contained in jars or in Spark packages specified with the "--jars" or "--packages" command line options, a temporary directory is created by calling Utils.createTempDir(), and the R packages built from that R source code are installed into it. The temporary directory is writable, does not interfere with those of other concurrent SparkR sessions, and is deleted when the SparkR session ends. The R binary packages installed in the temporary directory are then packed into an archive named rpkg.zip.
      
      sparkr.zip and rpkg.zip are distributed to the cluster in YARN modes.
      
      The distribution of rpkg.zip in Standalone modes is not supported in this PR and will be addressed in another PR.
      
      Various R files are updated to accept multiple lib paths (one for the SparkR package, the other for other R packages) so that these packages can be accessed in R.
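      
      The R-side effect of the multiple lib paths is roughly the following (a hedged sketch; the paths are illustrative, not the ones Spark actually computes):
      ```r
      # sparkr.zip is extracted to a fixed, possibly unwritable install location;
      # rpkg.zip (packages built from --jars/--packages) goes to a writable,
      # per-session temporary directory. Both are prepended to the library search path:
      sparkRLibDir <- "/opt/spark/R/lib"                  # illustrative fixed path
      rpkgLibDir   <- file.path(tempdir(), "spark-rpkg")  # illustrative temp path
      .libPaths(c(rpkgLibDir, sparkRLibDir, .libPaths()))
      library(SparkR)  # now resolved via the combined paths
      ```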
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #9390 from sun-rui/SPARK-10500.
    • [SPARK-11086][SPARKR] Use dropFactors column-wise instead of nested loop when createDataFrame · d7d9fa0b
      zero323 authored
      Use `dropFactors` column-wise instead of nested loop when `createDataFrame` from a `data.frame`
      
      At the moment SparkR createDataFrame uses a nested loop to convert factors to character when called on a local data.frame. It works but is incredibly slow, especially with data.table (~2 orders of magnitude slower than the PySpark / Pandas version on a data.frame of 1M rows x 2 columns).
      
      A simple improvement is to apply `dropFactors` column-wise and then reshape the output list.
      
      It should at least partially address [SPARK-8277](https://issues.apache.org/jira/browse/SPARK-8277).
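      
      The column-wise conversion amounts to something like this (an illustrative sketch, not the internal `dropFactors` code itself):
      ```r
      ldf <- data.frame(x = factor(c("a", "b", "a")), y = 1:3)
      # One vectorized as.character() per factor column, instead of a per-cell loop:
      ldf[] <- lapply(ldf, function(col) if (is.factor(col)) as.character(col) else col)
      str(ldf)  # x is now character; y is untouched
      ```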
      
      Author: zero323 <matthew.szymkiewicz@gmail.com>
      
      Closes #9099 from zero323/SPARK-11086.
  4. Nov 12, 2015
    • [SPARK-11263][SPARKR] lintr Throws Warnings on Commented Code in Documentation · ed04846e
      felixcheung authored
      Clean out hundreds of `style: Commented code should be removed.` from lintr
      
      Like these:
      ```
      /opt/spark-1.6.0-bin-hadoop2.6/R/pkg/R/DataFrame.R:513:3: style: Commented code should be removed.
      # sc <- sparkR.init()
        ^~~~~~~~~~~~~~~~~~~
      /opt/spark-1.6.0-bin-hadoop2.6/R/pkg/R/DataFrame.R:514:3: style: Commented code should be removed.
      # sqlContext <- sparkRSQL.init(sc)
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      /opt/spark-1.6.0-bin-hadoop2.6/R/pkg/R/DataFrame.R:515:3: style: Commented code should be removed.
      # path <- "path/to/file.json"
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~
      ```
      
      Tried without export or rdname; neither works.
      Instead, added `#' noRd` to suppress .Rd file generation.
      
      Also updated `family` for DataFrame functions to use longer descriptive text instead of `dataframe_funcs`:
      ![image](https://cloud.githubusercontent.com/assets/8969467/10933937/17bf5b1e-8291-11e5-9777-40fc632105dc.png)
      
      This covers *most* of the 'Commented code' warnings, but I left out a few that look legitimate.
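      
      The shape of the fix, roughly (a hedged sketch of a roxygen-style block; the exact tags used in the PR may differ):
      ```r
      # lintr does not flag roxygen (#') blocks as commented-out code, and @noRd
      # keeps roxygen2 from generating an .Rd file for the block:
      #' @examples
      #' \dontrun{
      #' sc <- sparkR.init()
      #' sqlContext <- sparkRSQL.init(sc)
      #' }
      #' @noRd
      NULL
      ```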
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #9463 from felixcheung/rlintr.
    • [SPARK-11420] Updating Stddev support via Imperative Aggregate · d292f748
      JihongMa authored
      Switched stddev support from DeclarativeAggregate to ImperativeAggregate.
      
      Author: JihongMa <linlin200605@gmail.com>
      
      Closes #9380 from JihongMA/SPARK-11420.
  5. Nov 10, 2015
    • [ML][R] SparkR::glm summary result to compare with native R · f14e9511
      Yanbo Liang authored
      Follow-up to #9561. Since [SPARK-11587](https://issues.apache.org/jira/browse/SPARK-11587) has been fixed, we should compare the SparkR::glm summary result with native R output rather than a hard-coded one. mengxr
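      
      The comparison now looks roughly like this (a minimal sketch assuming an initialized sqlContext; the dataset is illustrative):
      ```r
      # Native R reference values:
      rModel <- glm(Sepal.Length ~ Sepal.Width, data = iris, family = gaussian())
      rCoefs <- coef(summary(rModel))
      # SparkR model over the same data (dots in column names become underscores):
      df <- createDataFrame(sqlContext, iris)
      sModel <- glm(Sepal_Length ~ Sepal_Width, data = df, family = "gaussian")
      sCoefs <- summary(sModel)$coefficients
      # The test asserts sCoefs matches rCoefs within a tolerance instead of
      # comparing against hard-coded numbers.
      ```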
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9590 from yanboliang/glm-r-test.
    • [SPARK-10863][SPARKR] Method coltypes() (New version) · 47735cdc
      Oscar D. Lara Yejas authored
      This is a follow-up on PR #8984, as the corresponding branch for that PR was damaged.
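      
      Usage, as a quick sketch (assumes an initialized sqlContext):
      ```r
      df <- createDataFrame(sqlContext, data.frame(a = 1L, b = "x", c = TRUE))
      coltypes(df)  # e.g. "integer" "character" "logical"
      ```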
      
      Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu>
      
      Closes #9579 from olarayej/SPARK-10863_NEW14.
    • [SPARK-9830][SQL] Remove AggregateExpression1 and Aggregate Operator used to evaluate AggregateExpression1s · e0701c75
      Yin Huai authored
      
      https://issues.apache.org/jira/browse/SPARK-9830
      
      This PR contains the following main changes.
      * Removing `AggregateExpression1`.
      * Removing `Aggregate` operator, which is used to evaluate `AggregateExpression1`.
      * Removing planner rule used to plan `Aggregate`.
      * Linking `MultipleDistinctRewriter` to analyzer.
      * Renaming `AggregateExpression2` to `AggregateExpression` and `AggregateFunction2` to `AggregateFunction`.
      * Updating places where we create aggregate expression. The way to create aggregate expressions is `AggregateExpression(aggregateFunction, mode, isDistinct)`.
      * Changing `val`s in `DeclarativeAggregate`s that touch children of this function to `lazy val`s (when we create aggregate expression in DataFrame API, children of an aggregate function can be unresolved).
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #9556 from yhuai/removeAgg1.
  6. Oct 22, 2015
    • [SPARK-11244][SPARKR] sparkR.stop() should remove SQLContext · 94e2064f
      Forest Fang authored
      SparkR should remove `.sparkRSQLsc` and `.sparkRHivesc` when `sparkR.stop()` is called. Otherwise, even when the SparkContext is reinitialized, `sparkRSQL.init` returns a stale copy of the object and complains:
      
      ```r
      sc <- sparkR.init("local")
      sqlContext <- sparkRSQL.init(sc)
      sparkR.stop()
      sc <- sparkR.init("local")
      sqlContext <- sparkRSQL.init(sc)
      sqlContext
      ```
      producing
      ```r
      Error in callJMethod(x, "getClass") :
        Invalid jobj 1. If SparkR was restarted, Spark operations need to be re-executed.
      ```
      
      I have added the check and removal only when SparkContext itself is initialized. I have also added a corresponding test for this fix. Let me know if you want me to move the test to the SQL test suite instead.
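      
      The removal itself is essentially the following (a hedged sketch; `.sparkREnv` is assumed to be SparkR's cache environment, and the variable names come from the description above):
      ```r
      # Inside sparkR.stop(): drop the cached contexts so the next init recreates them.
      if (exists(".sparkRSQLsc", envir = .sparkREnv)) {
        rm(".sparkRSQLsc", envir = .sparkREnv)
      }
      if (exists(".sparkRHivesc", envir = .sparkREnv)) {
        rm(".sparkRHivesc", envir = .sparkREnv)
      }
      ```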
      
      P.S. I tried lint-r but ended up with a lot of errors on existing code.
      
      Author: Forest Fang <forest.fang@outlook.com>
      
      Closes #9205 from saurfang/sparkR.stop.
  7. Oct 21, 2015
    • [SPARK-11197][SQL] run SQL on files directly · f8c6bec6
      Davies Liu authored
      This PR introduces a new feature to run SQL directly on files without creating a table, for example:
      
      ```
      select id from json.`path/to/json/files` as j
      ```
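      
      From SparkR, the same query would look like this (a minimal sketch; the path is illustrative):
      ```r
      df <- sql(sqlContext, "SELECT id FROM json.`path/to/json/files`")
      ```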
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9173 from davies/source.