  1. Dec 13, 2016
  2. Dec 12, 2016
    • Felix Cheung's avatar
      [SPARK-18810][SPARKR] SparkR install.spark does not work for RCs, snapshots · 1aeb7f42
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Support overriding the download url (include version directory) in an environment variable, `SPARKR_RELEASE_DOWNLOAD_URL`
      
      ## How was this patch tested?
      
      unit test, manually testing
      - snapshot build url
        - download when spark jar not cached
        - when spark jar is cached
      - RC build url
        - download when spark jar not cached
        - when spark jar is cached
      - multiple cached spark versions
      - starting with sparkR shell
      
      To use this,
      ```
      SPARKR_RELEASE_DOWNLOAD_URL=http://this_is_the_url_to_spark_release_tgz R
      ```
      then in R,
      ```
      library(SparkR) # or specify lib.loc
      sparkR.session()
      ```
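A rough Python analogue of the override described above, only to make the lookup order concrete. The real change lives in SparkR's R code; the function name `resolve_download_url`, the default mirror, and the path layout are illustrative assumptions. Only the `SPARKR_RELEASE_DOWNLOAD_URL` variable name comes from the commit message.

```python
import os

# Hypothetical default mirror; the real SparkR code chooses a mirror differently.
DEFAULT_MIRROR = "https://archive.apache.org/dist/spark"


def resolve_download_url(version, hadoop_version, env=None):
    """Return the tarball URL, letting the environment variable take priority."""
    env = os.environ if env is None else env
    override = env.get("SPARKR_RELEASE_DOWNLOAD_URL")
    if override:
        # The override is expected to include the version directory already,
        # which is what makes RC and snapshot builds reachable.
        return override
    package = f"spark-{version}-bin-hadoop{hadoop_version}"
    return f"{DEFAULT_MIRROR}/spark-{version}/{package}.tgz"
```

Passing `env` explicitly keeps the sketch easy to exercise without touching the real process environment.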
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16248 from felixcheung/rinstallurl.
      
      (cherry picked from commit 8a51cfdc)
      Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      1aeb7f42
    • Yuming Wang's avatar
      [SPARK-18681][SQL] Fix filtering to compatible with partition keys of type int · 523071f3
      Yuming Wang authored
      
      ## What changes were proposed in this pull request?
      
      Cloudera puts `/var/run/cloudera-scm-agent/process/15000-hive-HIVEMETASTORE/hive-site.xml` in place as the configuration file for the Hive Metastore Server, where `hive.metastore.try.direct.sql=false`. But Spark does not read this configuration file and gets the default value `hive.metastore.try.direct.sql=true`. As mallman said, we should use the `getMetaConf` method to obtain the original configuration from the Hive Metastore Server. I have tested this method a few times and the return value is always consistent with the Hive Metastore Server.
      
      ## How was this patch tested?
      
      The existing tests.
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #16122 from wangyum/SPARK-18681.
      
      (cherry picked from commit 90abfd15)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      523071f3
    • Bill Chambers's avatar
      [DOCS][MINOR] Clarify Where AccumulatorV2s are Displayed · 35011608
      Bill Chambers authored
      ## What changes were proposed in this pull request?
      
      This PR clarifies where accumulators will be displayed.
      
      ## How was this patch tested?
      
      No testing.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Bill Chambers <bill@databricks.com>
      Author: anabranch <wac.chambers@gmail.com>
      Author: Bill Chambers <wchambers@ischool.berkeley.edu>
      
      Closes #16180 from anabranch/improve-acc-docs.
      
      (cherry picked from commit 70ffff21)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      35011608
    • Tyson Condie's avatar
      [SPARK-18790][SS] Keep a general offset history of stream batches · 63693c17
      Tyson Condie authored
      ## What changes were proposed in this pull request?
      
      Instead of only keeping the minimum number of offsets around, we should keep enough information to allow us to roll back n batches and reexecute the stream starting from a given point. In particular, we should create a config in SQLConf, spark.sql.streaming.retainedBatches, that defaults to 100 and ensure that we keep enough log files in the following places to roll back the specified number of batches:
      - the offsets that are present in each batch
      - versions of the state store
      - the file lists stored for the FileStreamSource
      - the metadata log stored by the FileStreamSink
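The retention rule described above can be sketched in a few lines of Python. This is not Spark's implementation: the helper name `purge_old_batches` and the dict-based log are assumptions made for illustration, and only the idea of keeping the most recent N batches (default 100) comes from the commit message.

```python
def purge_old_batches(log, current_batch_id, min_batches_to_retain=100):
    """Drop entries outside the retention window and return the removed ids.

    `log` maps batch id -> metadata. Batches with an id greater than
    current_batch_id - min_batches_to_retain are kept, so exactly
    `min_batches_to_retain` batches survive once enough history exists.
    """
    threshold = current_batch_id - min_batches_to_retain
    removed = sorted(batch_id for batch_id in log if batch_id <= threshold)
    for batch_id in removed:
        del log[batch_id]
    return removed


# Batches 0..105 exist; with a window of 100, batches 0..5 are purged.
offset_log = {batch_id: f"offsets-{batch_id}" for batch_id in range(106)}
purged = purge_old_batches(offset_log, current_batch_id=105)
```

Applying the same threshold to every log (offsets, state store versions, source file lists, sink metadata) is what would make a rollback of N batches possible in all of them at once.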
      
      marmbrus zsxwing
      
      ## How was this patch tested?
      
      The following tests were added.
      
      ### StreamExecution offset metadata
      Test added to StreamingQuerySuite that ensures offset metadata is garbage collected according to minBatchesToRetain
      
      ### CompactibleFileStreamLog
      Tests added in CompactibleFileStreamLogSuite to ensure that logs are purged starting before the first compaction file that precedes the current batch id - minBatchesToRetain.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Tyson Condie <tcondie@gmail.com>
      
      Closes #16219 from tcondie/offset_hist.
      
      (cherry picked from commit 83a42897)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      63693c17
  3. Dec 11, 2016
  4. Dec 10, 2016
  5. Dec 09, 2016
    • Felix Cheung's avatar
      [SPARK-18807][SPARKR] Should suppress output print for calls to JVM methods with void return values · 8bf56cc4
      Felix Cheung authored
      
      ## What changes were proposed in this pull request?
      
      Several SparkR APIs that call into JVM methods with void return values print their result, especially when running in a REPL or IDE.
      Example:
      ```
      > setLogLevel("WARN")
      NULL
      ```
      We should fix this to make the output clearer.
      
      Also found a small change to return value of dropTempView in 2.1 - adding doc and test for it.
      
      ## How was this patch tested?
      
      manually - I didn't find an expect_*() method in testthat for this
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16237 from felixcheung/rinvis.
      
      (cherry picked from commit 3e11d5bf)
      Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      8bf56cc4
    • Xiangrui Meng's avatar
      [SPARK-18812][MLLIB] explain "Spark ML" · e45345d9
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      There has been some confusion around "Spark ML" vs. "MLlib". This PR adds some FAQ-like entries to the MLlib user guide to explain "Spark ML" and reduce the confusion.
      
      I checked the [Spark FAQ page](http://spark.apache.org/faq.html), which seems too high-level for the content here. So I added it to the MLlib user guide instead.
      
      cc: mateiz
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #16241 from mengxr/SPARK-18812.
      
      (cherry picked from commit d2493a20)
      Signed-off-by: Xiangrui Meng <meng@databricks.com>
      e45345d9
    • Kazuaki Ishizaki's avatar
      [SPARK-18745][SQL] Fix signed integer overflow due to toInt cast · 562507ef
      Kazuaki Ishizaki authored
      
      ## What changes were proposed in this pull request?
      
      This PR prevents the result of a `toInt` cast from becoming negative due to signed integer overflow (e.g. 0x0000_0000_1???????L.toInt < 0 ). The casts are now performed only after we can ensure the value is within the range of a signed integer (the result of `max(array.length, ???)` is always an integer).
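To make the failure mode concrete, here is a small Python sketch that mimics a JVM long-to-int narrowing cast and shows why clamping has to happen before the cast. The helpers `to_int32` and `safe_size` are hypothetical names for this illustration and do not correspond to the code touched by the PR.

```python
INT_MAX = 2**31 - 1


def to_int32(value):
    """Mimic a JVM long-to-int cast: keep the low 32 bits as two's complement."""
    low = value & 0xFFFFFFFF
    return low - 2**32 if low >= 2**31 else low


def safe_size(requested, array_length):
    """Clamp while the value is still wide, then narrow (illustrative only)."""
    wide = max(requested, array_length)
    return to_int32(min(wide, INT_MAX))


# Narrowing first turns a large positive long into a negative int.
overflowed = to_int32(0x80000000)
# Clamping first keeps the result inside the signed 32-bit range.
clamped = safe_size(0x80000000, 16)
```

The general pattern is to do the comparison in the wider type and cast only once the value is known to fit.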
      
      ## How was this patch tested?
      
      Manually executed query68 of TPC-DS with 100TB
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #16235 from kiszk/SPARK-18745.
      
      (cherry picked from commit d60ab5fd)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      562507ef
    • Shivaram Venkataraman's avatar
      [MINOR][SPARKR] Fix SparkR regex in copy command · eb2d9bfd
      Shivaram Venkataraman authored
      
      Fix SparkR package copy regex. The existing code leads to
      ```
      Copying release tarballs to /home/****/public_html/spark-nightly/spark-branch-2.1-bin/spark-2.1.1-SNAPSHOT-2016_12_08_22_38-e8f351f9-bin
      mput: SparkR-*: no files found
      ```
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #16231 from shivaram/typo-sparkr-build.
      
      (cherry picked from commit be5fc6ef)
      Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      eb2d9bfd
    • Xiangrui Meng's avatar
      [SPARK-17822][R] Make JVMObjectTracker a member variable of RBackend · 0c6415ae
      Xiangrui Meng authored
      
      ## What changes were proposed in this pull request?
      
      * This PR changes `JVMObjectTracker` from `object` to `class` and associates an instance with each `RBackend`, so we can manage the lifecycle of JVM objects when there are multiple `RBackend` sessions. `RBackend.close` clears the object tracker explicitly.
      * I assume that `SQLUtils` and `RRunner` do not need to track JVM instances, which could be wrong.
      * Small refactor of `SerDe.sqlSerDe` to increase readability.
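The design change in the first bullet, moving from a process-wide singleton to one tracker per backend, can be sketched in Python as follows. The class and method names (`ObjectTracker`, `Backend`, `add`, `get`, `clear`) are assumptions for illustration and are not the Scala API.

```python
import itertools


class ObjectTracker:
    """Maps string ids to live objects for a single backend instance."""

    def __init__(self):
        self._objects = {}
        self._counter = itertools.count()

    def add(self, obj):
        obj_id = str(next(self._counter))
        self._objects[obj_id] = obj
        return obj_id

    def get(self, obj_id):
        return self._objects.get(obj_id)

    def size(self):
        return len(self._objects)

    def clear(self):
        self._objects.clear()


class Backend:
    """Each backend owns its tracker, so closing one does not affect another."""

    def __init__(self):
        self.tracker = ObjectTracker()

    def close(self):
        # Release every object this backend was keeping alive.
        self.tracker.clear()


first, second = Backend(), Backend()
first_id = first.tracker.add("object held for the first session")
second.tracker.add("object held for the second session")
first.close()
```

With a singleton tracker, the `close` call above would either leak the first session's objects or wipe out the second session's as well.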
      
      ## How was this patch tested?
      
      * Added unit tests for `JVMObjectTracker`.
      * Wait for Jenkins to run full tests.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #16154 from mengxr/SPARK-17822.
      
      (cherry picked from commit fd48d80a)
      Signed-off-by: Xiangrui Meng <meng@databricks.com>
      0c6415ae
    • Jacek Laskowski's avatar
      [MINOR][CORE][SQL][DOCS] Typo fixes · b226f10e
      Jacek Laskowski authored
      
      ## What changes were proposed in this pull request?
      
      Typo fixes
      
      ## How was this patch tested?
      
      Local build. Awaiting the official build.
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #16144 from jaceklaskowski/typo-fixes.
      
      (cherry picked from commit b162cc0c)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      b226f10e
    • Zhan Zhang's avatar
      [SPARK-18637][SQL] Stateful UDF should be considered as nondeterministic · 72bf5199
      Zhan Zhang authored
      
      Mark stateful UDFs as nondeterministic.

      Add new test cases with both stateful and stateless UDFs.
      Without the patch, the test cases throw an exception:
      
      1 did not equal 10
      ScalaTestFailureLocation: org.apache.spark.sql.hive.execution.HiveUDFSuite$$anonfun$21 at (HiveUDFSuite.scala:501)
      org.scalatest.exceptions.TestFailedException: 1 did not equal 10
              at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
              at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
              ...
      
      Author: Zhan Zhang <zhanzhang@fb.com>
      
      Closes #16068 from zhzhan/state.
      
      (cherry picked from commit 67587d96)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      72bf5199
    • Felix Cheung's avatar
      Copy pyspark and SparkR packages to latest release dir too · 2c88e1dc
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Copy pyspark and SparkR packages to latest release dir, as per comment [here](https://github.com/apache/spark/pull/16226#discussion_r91664822)
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16227 from felixcheung/pyrftp.
      
      (cherry picked from commit c074c96d)
      Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      2c88e1dc
    • Shivaram Venkataraman's avatar
      Copy the SparkR source package with LFTP · e8f351f9
      Shivaram Venkataraman authored
      
      This PR adds a line in release-build.sh to copy the SparkR source archive using LFTP
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #16226 from shivaram/fix-sparkr-copy-build.
      
      (cherry picked from commit 934035ae)
      Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      e8f351f9
    • wm624@hotmail.com's avatar
      [SPARK-18349][SPARKR] Update R API documentation on ml model summary · 4ceed95b
      wm624@hotmail.com authored
      
      ## What changes were proposed in this pull request?
      In this PR, the documentation of the `summary` method is improved to follow the format:

      returns summary information of the fitted model, which is a list. The list includes .......

      Since `summary` in R is mainly about the model, which is not the same as the `summary` object on the Scala side (if there is one), the Scala API doc is not referenced here.

      In the current documentation, some `return` entries end with `.` and some do not. `.` is added to the ones that were missing it.

      Since the `summary` of `spark.logit` is undergoing a big refactoring, this PR does not include it. It will be changed when the `spark.logit` PR is merged.
      
      ## How was this patch tested?
      
      Manual build.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #16150 from wangmiao1981/audit2.
      
      (cherry picked from commit 86a96034)
      Signed-off-by: Felix Cheung <felixcheung@apache.org>
      4ceed95b
  6. Dec 08, 2016
    • Shivaram Venkataraman's avatar
      [SPARKR][PYSPARK] Fix R source package name to match Spark version. Remove pip... · ef5646b4
      Shivaram Venkataraman authored
      [SPARKR][PYSPARK] Fix R source package name to match Spark version. Remove pip tar.gz from distribution
      
      ## What changes were proposed in this pull request?
      
      Fixes name of R source package so that the `cp` in release-build.sh works correctly.
      
      Issue discussed in https://github.com/apache/spark/pull/16014#issuecomment-265867125
      
      
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #16221 from shivaram/fix-sparkr-release-build-name.
      
      (cherry picked from commit 4ac8b20b)
      Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      ef5646b4
    • Shixiong Zhu's avatar
      [SPARK-18774][CORE][SQL] Ignore non-existing files when ignoreCorruptFiles is enabled (branch 2.1) · 1cafc76e
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Backport #16203 to branch 2.1.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16216 from zsxwing/SPARK-18774-2.1.
      1cafc76e
    • Tathagata Das's avatar
      [SPARK-18776][SS] Make Offset for FileStreamSource corrected formatted in json · fcd22e53
      Tathagata Das authored
      
      ## What changes were proposed in this pull request?
      
      - Changed FileStreamSource to use the new FileStreamSourceOffset rather than LongOffset. The field is named `logOffset` to make it clearer that this is an offset in the file stream log.
      - Fixed a bug in FileStreamSourceLog: the field endId in FileStreamSourceLog.get(startId, endId) was not being used at all. No test caught it earlier; only my updated tests caught it.

      Other minor changes
      - Don't use batchId in the FileStreamSource, as calling it batch id is extremely misleading. With multiple sources, it may happen that a new batch has no new data from a file source, so the offset of FileStreamSource != batchId after that batch.
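As an illustration of an offset that serializes to a JSON object with a named field rather than a bare number, here is a minimal Python sketch. It borrows the `FileStreamSourceOffset` and `logOffset` names from the commit message, but the class itself and its `to_json`/`from_json` methods are assumptions, not the Scala implementation.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStreamSourceOffset:
    """Position in the file stream log, not a batch id."""

    log_offset: int

    def to_json(self):
        # Compact separators keep the output free of extra whitespace.
        return json.dumps({"logOffset": self.log_offset}, separators=(",", ":"))

    @classmethod
    def from_json(cls, text):
        return cls(log_offset=int(json.loads(text)["logOffset"]))


serialized = FileStreamSourceOffset(345).to_json()
restored = FileStreamSourceOffset.from_json(serialized)
```

Naming the field in the serialized form makes it self-describing when several sources write their offsets side by side.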
      
      ## How was this patch tested?
      
      Updated unit test.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #16205 from tdas/SPARK-18776.
      
      (cherry picked from commit 458fa332)
      Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
      fcd22e53
    • Shivaram Venkataraman's avatar
      [SPARK-18590][SPARKR] Change the R source build to Hadoop 2.6 · e43209fe
      Shivaram Venkataraman authored
      This PR changes the SparkR source release tarball to be built using the Hadoop 2.6 profile. Previously it was using the without hadoop profile which leads to an error as discussed in https://github.com/apache/spark/pull/16014#issuecomment-265843991
      
      
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #16218 from shivaram/fix-sparkr-release-build.
      
      (cherry picked from commit 202fcd21)
      Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      e43209fe
    • Reynold Xin's avatar
      [SPARK-18760][SQL] Consistent format specification for FileFormats · 9483242f
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      This patch fixes the format specification in explain for file sources (Parquet and Text formats are the only two that are different from the rest):
      
      Before:
      ```
      scala> spark.read.text("test.text").explain()
      == Physical Plan ==
      *FileScan text [value#15] Batched: false, Format: org.apache.spark.sql.execution.datasources.text.TextFileFormat@xyz, Location: InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>
      ```
      
      After:
      ```
      scala> spark.read.text("test.text").explain()
      == Physical Plan ==
      *FileScan text [value#15] Batched: false, Format: Text, Location: InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>
      ```
      
      Also closes #14680.
      
      ## How was this patch tested?
      Verified in spark-shell.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #16187 from rxin/SPARK-18760.
      
      (cherry picked from commit 5f894d23)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      9483242f
    • Shixiong Zhu's avatar
      [SPARK-18751][CORE] Fix deadlock when SparkContext.stop is called in Utils.tryOrStopSparkContext · a0356441
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      When `SparkContext.stop` is called in `Utils.tryOrStopSparkContext` (the following three places), it will cause deadlock because the `stop` method needs to wait for the thread running `stop` to exit.
      
      - ContextCleaner.keepCleaning
      - LiveListenerBus.listenerThread.run
      - TaskSchedulerImpl.start
      
      This PR adds `SparkContext.stopInNewThread` and uses it to eliminate the potential deadlock. I also removed my changes in #15775 since they are not necessary now.
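The shape of the deadlock and of the fix can be sketched with plain Python threads. This is a toy model, not Spark code: `Context`, `stop`, and `stop_in_new_thread` are stand-in names, and the worker simulates the failure path that triggers the stop.

```python
import threading


class Context:
    """Toy stand-in for a context whose stop() waits for its worker thread."""

    def __init__(self):
        self.stopped = threading.Event()
        self._worker = threading.Thread(target=self._run, name="worker")

    def start(self):
        self._worker.start()

    def _run(self):
        try:
            raise RuntimeError("simulated uncaught failure in the worker")
        except RuntimeError:
            # Calling self.stop() here would make the worker wait on itself.
            # Python raises RuntimeError for a self-join; a JVM thread would hang.
            self.stop_in_new_thread()

    def stop(self):
        self._worker.join()  # wait for the worker thread to exit
        self.stopped.set()

    def stop_in_new_thread(self):
        # Hand the stop off to a separate thread so the worker can return.
        threading.Thread(target=self.stop, name="stop-context", daemon=True).start()


ctx = Context()
ctx.start()
finished = ctx.stopped.wait(timeout=5)
```

The worker returns immediately after scheduling the stop, which is what lets the `join` inside `stop` complete.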
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16178 from zsxwing/fix-stop-deadlock.
      
      (cherry picked from commit 26432df9)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      a0356441
    • Felix Cheung's avatar
      [SPARK-18590][SPARKR] build R source package when making distribution · d69df907
      Felix Cheung authored
      This PR has 2 key changes. One, we are building a source package (aka bundle package) for SparkR which could be released on CRAN. Two, we should include in the official Spark binary distributions SparkR installed from this source package instead (which would have the help/vignettes rds needed for those to work when the SparkR package is loaded in R, whereas the earlier approach with devtools does not).
      
      But, because of various differences in how R performs different tasks, this PR is a fair bit more complicated. More details below.
      
      This PR also includes a few minor fixes.
      
      These are the additional steps in make-distribution; please see [here](https://github.com/apache/spark/blob/master/R/CRAN_RELEASE.md) for what goes into a CRAN release, which is now run during make-distribution.sh.
      1. The package needs to be installed because the first code block in the vignettes is `library(SparkR)` without a lib path.
      2. `R CMD build` builds the vignettes (this process runs Spark/SparkR code and captures outputs into PDF documentation).
      3. `R CMD check` on the source package installs the package and builds the vignettes again (this time from the packaged source). This is a key step required to release an R package on CRAN. (Tests are skipped here, but they will need to pass for the CRAN release process to succeed. Ideally, during release signoff we should install from the R source package and run tests.)
      4. `R CMD INSTALL` on the source package (this is the only way to generate the doc/vignettes rds files correctly; step 1 does not). The output of this step is what we package into the Spark dist and sparkr.zip.

      Alternatively, `R CMD build` should already be installing the package in a temp directory, though it might just be a matter of finding this location and setting it as the lib.loc parameter; another approach is perhaps to try calling `R CMD INSTALL --build pkg` instead.
      But in any case, despite installing the package multiple times, this is relatively fast.
      Building vignettes takes a while though.
      
      Manually, CI.
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16014 from felixcheung/rdist.
      
      (cherry picked from commit c3d3a9d0)
      Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      d69df907
    • Andrew Ray's avatar
      [SPARK-16589] [PYTHON] Chained cartesian produces incorrect number of records · e0173f14
      Andrew Ray authored
      
      ## What changes were proposed in this pull request?
      
      Fixes a bug in the Python implementation of the RDD cartesian product related to batching, which showed up in repeated cartesian products as seemingly random results. The root cause was multiple iterators pulling from the same stream in the wrong order because of logic that ignored batching.
      
      `CartesianDeserializer` and `PairDeserializer` were changed to implement `_load_stream_without_unbatching` and borrow the one line implementation of `load_stream` from `BatchedSerializer`. The default implementation of `_load_stream_without_unbatching` was changed to give consistent results (always an iterable) so that it could be used without additional checks.
      
      `PairDeserializer` no longer extends `CartesianDeserializer`, as that was not really proper. If desired, a new common superclass could be added.
      
      Both `CartesianDeserializer` and `PairDeserializer` now only extend `Serializer` (which has no `dump_stream` implementation) since they are only meant for *de*serialization.
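A simplified sketch of the batch-aware approach described above: read one key batch and its matching value batch from the same stream, combine them, and only then flatten. The stream here is an in-memory list of alternating batches and the function names are assumptions; PySpark's real deserializers read from byte streams and delegate to nested serializers.

```python
from itertools import chain, product


def cartesian_without_unbatching(stream):
    """Yield one product iterator per (key batch, value batch) pair."""
    batches = iter(stream)
    for key_batch in batches:
        # The matching value batch immediately follows its key batch.
        val_batch = next(batches, None)
        if val_batch is None:
            raise ValueError("stream ended with a key batch but no value batch")
        yield product(key_batch, val_batch)


def cartesian_load_stream(stream):
    """Flatten the per-batch products into a single iterator of pairs."""
    return chain.from_iterable(cartesian_without_unbatching(stream))


# Two records, each made of a key batch followed by a value batch.
records = [[1, 2], ["a", "b"], [3], ["c"]]
pairs = list(cartesian_load_stream(records))
```

Keeping the batch boundary intact until the product is formed is what prevents two consumers from pulling items out of the shared stream in the wrong order.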
      
      ## How was this patch tested?
      
      Additional unit tests (sourced from #14248) plus one for testing a cartesian with zip.
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #16121 from aray/fix-cartesian.
      
      (cherry picked from commit 3c68944b)
      Signed-off-by: Davies Liu <davies.liu@gmail.com>
      e0173f14
    • Liang-Chi Hsieh's avatar
      [SPARK-18667][PYSPARK][SQL] Change the way to group row in BatchEvalPythonExec... · 726217eb
      Liang-Chi Hsieh authored
      [SPARK-18667][PYSPARK][SQL] Change the way to group row in BatchEvalPythonExec so input_file_name function can work with UDF in pyspark
      
      ## What changes were proposed in this pull request?
      
      `input_file_name` doesn't return filename when working with UDF in PySpark. An example shows the problem:
      
          from pyspark.sql.functions import *
          from pyspark.sql.types import *
      
          def filename(path):
              return path
      
          sourceFile = udf(filename, StringType())
          spark.read.json("tmp.json").select(sourceFile(input_file_name())).show()
      
          +---------------------------+
          |filename(input_file_name())|
          +---------------------------+
          |                           |
          +---------------------------+
      
      The cause of this issue is that we group rows in `BatchEvalPythonExec` for batch processing of PythonUDF. Currently we group rows first and then evaluate expressions on the rows. If the data has fewer rows than a group requires, the iterator is consumed to the end before the evaluation. However, once the iterator reaches the end, we unset the input filename, so the `input_file_name` expression can't return the correct filename.
      
      This patch fixes the approach to group the batch of rows. We evaluate the expression first and then group evaluated results to batch.
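The ordering problem can be reproduced with plain Python generators. In this sketch, `InputFileHolder` stands in for the per-task "current input file" state, and all names are illustrative assumptions; it shows only why evaluating before grouping sees the right file name.

```python
from itertools import islice


class InputFileHolder:
    """Stand-in for the state that input_file_name reads."""

    def __init__(self):
        self.name = ""


def scan(files, holder):
    for name, rows in files:
        holder.name = name
        for row in rows:
            yield row
    holder.name = ""  # unset once the iterator is exhausted


def batched(iterable, size):
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def group_then_evaluate(rows, holder, size):
    # Old order: the whole group is pulled before any expression runs.
    for chunk in batched(rows, size):
        yield [holder.name for _ in chunk]


def evaluate_then_group(rows, holder, size):
    # New order: evaluate lazily per row, then group the results.
    evaluated = (holder.name for _ in rows)
    return batched(evaluated, size)


def run(strategy, size=100):
    holder = InputFileHolder()
    files = [("tmp.json", [{"a": 1}, {"a": 2}])]
    return list(strategy(scan(files, holder), holder, size))


before = run(group_then_evaluate)
after = run(evaluate_then_group)
```

With two rows and a group size of 100, the old order drains the scan before evaluating and sees an empty name, while the new order captures `tmp.json` for each row.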
      
      ## How was this patch tested?
      
      Added unit test to PySpark.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16115 from viirya/fix-py-udf-input-filename.
      
      (cherry picked from commit 6a5a7254)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      726217eb
    • Yanbo Liang's avatar
      [SPARK-18325][SPARKR][ML] SparkR ML wrappers example code and user guide · 9095c152
      Yanbo Liang authored
      
      ## What changes were proposed in this pull request?
      * Add all R examples for ML wrappers which were added during 2.1 release cycle.
      * Split the whole ```ml.R``` example file into individual examples for each algorithm, which makes it convenient for users to rerun them.
      * Add corresponding examples to ML user guide.
      * Update ML section of SparkR user guide.
      
      Note: MLlib Scala/Java/Python examples will be consistent; however, SparkR examples may differ from them, since R users may use the algorithms in a different way, for example, using R ```formula``` to specify ```featuresCol``` and ```labelCol```.
      
      ## How was this patch tested?
      Run all examples manually.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16148 from yanboliang/spark-18325.
      
      (cherry picked from commit 9bf8f3cd)
      Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
      9095c152
    • Patrick Wendell's avatar
      48aa6775
    • Patrick Wendell's avatar
      Preparing Spark release v2.1.0-rc2 · 08071749
      Patrick Wendell authored
      08071749
  7. Dec 07, 2016
    • Yanbo Liang's avatar
      [SPARK-18326][SPARKR][ML] Review SparkR ML wrappers API for 2.1 · 1c3f1da8
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Reviewing SparkR ML wrappers API for 2.1 release, mainly two issues:
      * Remove ```probabilityCol``` from the argument list of ```spark.logit``` and ```spark.randomForest```, since it is used when making predictions and should be an argument of ```predict```; we will work on this in [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) in the next release cycle.
      * Fix ```spark.als``` params to make it consistent with MLlib.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16169 from yanboliang/spark-18326.
      
      (cherry picked from commit 97255497)
      Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
      1c3f1da8
    • sethah's avatar
      [SPARK-18705][ML][DOC] Update user guide to reflect one pass solver for L1 and elastic-net · ab865cfd
      sethah authored
      
      ## What changes were proposed in this pull request?
      
      WeightedLeastSquares now supports L1 and elastic net penalties and has an additional solver option: QuasiNewton. The docs are updated to reflect this change.
      
      ## How was this patch tested?
      
      Docs only. Generated documentation to make sure Latex looks ok.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #16139 from sethah/SPARK-18705.
      
      (cherry picked from commit 82253617)
      Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
      ab865cfd
    • Tathagata Das's avatar
      [SPARK-18758][SS] StreamingQueryListener events from a StreamingQuery should... · 617ce3ba
      Tathagata Das authored
      [SPARK-18758][SS] StreamingQueryListener events from a StreamingQuery should be sent only to the listeners in the same session as the query
      
      ## What changes were proposed in this pull request?
      
      Listeners added with `sparkSession.streams.addListener(l)` are added to a SparkSession. So only events from queries in the same session as a listener should be posted to that listener. Currently, all the events get rerouted through Spark's main listener bus, that is,
      - StreamingQuery posts an event to StreamingQueryListenerBus. Only the queries associated with the same session as the bus post events to it.
      - StreamingQueryListenerBus posts the event to Spark's main LiveListenerBus as a SparkEvent.
      - StreamingQueryListenerBus also subscribes to LiveListenerBus events, thus getting back the posted event in a different thread.
      - The received event is posted to the registered listeners.
      
      The problem is that *all StreamingQueryListenerBuses in all sessions* get the events and post them to their listeners. This is wrong.
      
      In this PR, I solve it by making StreamingQueryListenerBus track active queries (by their runIds) when a query posts the QueryStarted event to the bus. This allows the rerouted events to be filtered using the tracked queries.
      
      Note that this list needs to be maintained separately from the `StreamingQueryManager.activeQueries` because a terminated query is cleared from `StreamingQueryManager.activeQueries` as soon as it is stopped, but this ListenerBus must clear a query only after the termination event of that query has been posted lazily, long after the query has been terminated.
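The filtering idea can be sketched with a synchronous toy model in Python. The real buses are asynchronous and written in Scala; `MainBus`, `SessionListenerBus`, and the `(kind, run_id)` event tuples are assumptions for illustration.

```python
class MainBus:
    """Stand-in for the shared bus that echoes every event to all subscribers."""

    def __init__(self):
        self.subscribers = []

    def post(self, event):
        for subscriber in self.subscribers:
            subscriber.on_main_bus_event(event)


class SessionListenerBus:
    """Per-session bus that only forwards events of queries it has seen start."""

    def __init__(self, main_bus):
        self.main_bus = main_bus
        self.listeners = []
        self.active_run_ids = set()
        main_bus.subscribers.append(self)

    def post(self, event):
        kind, run_id = event
        if kind == "started":
            # Remember the query before its events come back via the main bus.
            self.active_run_ids.add(run_id)
        self.main_bus.post(event)

    def on_main_bus_event(self, event):
        kind, run_id = event
        if run_id not in self.active_run_ids:
            return  # the event belongs to a query from another session
        for listener in self.listeners:
            listener(event)
        if kind == "terminated":
            # Forget the query only after its termination event was delivered.
            self.active_run_ids.discard(run_id)


main = MainBus()
bus_a, bus_b = SessionListenerBus(main), SessionListenerBus(main)
seen_a, seen_b = [], []
bus_a.listeners.append(seen_a.append)
bus_b.listeners.append(seen_b.append)
for kind in ("started", "progress", "terminated"):
    bus_a.post((kind, "run-1"))
```

Removing the run id inside the delivery path, rather than when the query stops, mirrors the note above about clearing a query only after its termination event has been posted.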
      
      Credit goes to zsxwing for coming up with the initial idea.
      
      ## How was this patch tested?
      Updated test harness code to use the correct session, and added new unit test.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #16186 from tdas/SPARK-18758.
      
      (cherry picked from commit 9ab725ea)
      Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
      617ce3ba
    • wm624@hotmail.com's avatar
      [SPARK-18633][ML][EXAMPLE] Add multiclass logistic regression summary python example and document · 839c2eb9
      wm624@hotmail.com authored
      
      ## What changes were proposed in this pull request?
      Logistic Regression summary was added to the Python API. We need to add an example and documentation for the summary.
      
      The newly added example is consistent with Scala and Java examples.
      
      ## How was this patch tested?
      
      Manual tests: ran the example with spark-submit; copied and pasted the code into pyspark; built the documentation and checked it.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #16064 from wangmiao1981/py.
      
      (cherry picked from commit aad11209)
      Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
      839c2eb9