  1. Apr 09, 2017
    • [SPARK-20260][MLLIB] String interpolation required for error message · 43a7fcad
      Vijay Ramesh authored
      ## What changes were proposed in this pull request?
      This error message doesn't get properly formatted because of a missing `s` (the Scala string-interpolator prefix). Currently the error looks like:
      
      ```
      Caused by: java.lang.IllegalArgumentException: requirement failed: indices should be one-based and in ascending order; found current=$current, previous=$previous; line="$line"
      ```
      (note the literal `$current` instead of the interpolated value)
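      As a minimal, self-contained sketch (not the actual MLlib code; the values are made up), the whole fix is the `s` prefix on the string literal:
      
      ```
      object InterpolationDemo extends App {
        val current = 3
        val previous = 7
        // Without the `s` prefix, the placeholders are printed literally:
        println("found current=$current, previous=$previous")
        // With the `s` prefix, the runtime values are interpolated:
        println(s"found current=$current, previous=$previous")
      }
      ```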
      
      
      Author: Vijay Ramesh <vramesh@demandbase.com>
      
      Closes #17572 from vijaykramesh/master.
      
      (cherry picked from commit 261eaf51)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
  2. Jan 24, 2017
    • [SPARK-16473][MLLIB] Fix BisectingKMeans Algorithm failing in edge case where no children exist in updateAssignments · d128b6a3
      Ilya Matiach authored
      
      ## What changes were proposed in this pull request?
      
      Fix a bug in which BisectingKMeans fails with the error:
      ```
      java.util.NoSuchElementException: key not found: 166
              at scala.collection.MapLike$class.default(MapLike.scala:228)
              at scala.collection.AbstractMap.default(Map.scala:58)
              at scala.collection.MapLike$class.apply(MapLike.scala:141)
              at scala.collection.AbstractMap.apply(Map.scala:58)
              at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply$mcDJ$sp(BisectingKMeans.scala:338)
              at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
              at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
              at scala.collection.TraversableOnce$$anonfun$minBy$1.apply(TraversableOnce.scala:231)
              at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
              at scala.collection.immutable.List.foldLeft(List.scala:84)
              at scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:125)
              at scala.collection.immutable.List.reduceLeft(List.scala:84)
              at scala.collection.TraversableOnce$class.minBy(TraversableOnce.scala:231)
              at scala.collection.AbstractTraversable.minBy(Traversable.scala:105)
              at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:337)
              at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:334)
              at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
              at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
      ```
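      As a toy sketch of the failure mode (not the actual patch; `costs` and the indices below are made up), a bare `Map#apply` throws for a missing key, while guarding the lookup — the spirit of the fix for child clusters missing in `updateAssignments` — avoids the crash:
      
      ```
      object MissingKeyDemo extends App {
        val costs = Map(1L -> 0.5, 2L -> 0.7)  // per-cluster assignment costs
        val children = Seq(166L, 1L, 2L)       // candidate child cluster indices
        // costs(166L) would throw java.util.NoSuchElementException: key not found: 166
        // Filtering to children that actually exist in the map avoids that:
        val valid = children.filter(costs.contains)
        if (valid.nonEmpty) println(valid.minBy(costs))  // prints 1, the cheapest existing child
      }
      ```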
      
      ## How was this patch tested?
      
      The failing dataset was run against the changed code to verify that the fix works. I will try to add unit tests in a follow-up.
      
      
      
      Author: Ilya Matiach <ilmat@microsoft.com>
      
      Closes #16355 from imatiach-msft/ilmat/fix-kmeans.
  3. Jan 23, 2017
    • [SPARK-19155][ML] Make family case insensitive in GLM · 1e07a719
      actuaryzhang authored
      
      ## What changes were proposed in this pull request?
      This is a supplement to PR #16516, which did not make the value returned by `getFamily` case insensitive. Current tests of Poisson/binomial GLM with weights fail when specifying 'Poisson' or 'Binomial', because the calculation of `dispersion` and `pValue` compares the family retrieved from `getFamily` directly:
      ```
      model.getFamily == Binomial.name || model.getFamily == Poisson.name
      ```
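      For illustration, a tiny sketch (hypothetical values, not the GLM code itself) of why the check fails for capitalized input and how a case-insensitive comparison behaves:
      
      ```
      object FamilyCheckDemo extends App {
        val family = "Poisson"                        // as a user might spell it
        println(family == "poisson")                  // false: the case-sensitive check fails
        println(family.equalsIgnoreCase("poisson"))   // true: a case-insensitive check passes
      }
      ```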
      
      ## How was this patch tested?
      Update existing tests for 'Poisson' and 'Binomial'.
      
      yanboliang felixcheung imatiach-msft
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #16675 from actuaryzhang/family.
      
      (cherry picked from commit f067acef)
      Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
  4. Jan 07, 2017
    • [SPARK-19110][ML][MLLIB] DistributedLDAModel returns different logPrior for original and loaded model · 86b66216
      wm624@hotmail.com authored
      
      ## What changes were proposed in this pull request?
      
      While adding the DistributedLDAModel training summary for SparkR, I found that the logPrior of the original and the loaded model differ.
      For example, in test("read/write DistributedLDAModel") I added:
      ```
      val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior
      val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior
      assert(logPrior === logPrior2)
      ```
      The test fails:
      ```
      -4.394180878889078 did not equal -4.294290536919573
      ```
      
      The reason is that `graph.vertices.aggregate(0.0)(seqOp, _ + _)` only returns the value of a single vertex instead of the aggregation of all vertices. Therefore, when the loaded model does the aggregation in a different order, it returns different `logPrior`.
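      A small sketch (assuming an active `SparkContext` named `sc`; not the LDA code itself) of how a `seqOp` that drops its accumulator makes `aggregate` depend on partitioning, while a correct `seqOp` does not:
      
      ```
      val rdd = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0), numSlices = 2)
      // Buggy: the seqOp ignores the running sum, so each partition contributes
      // only its last element and the result depends on the data layout.
      val buggy = rdd.aggregate(0.0)((acc, v) => v, _ + _)
      // Correct: the seqOp folds every element into the accumulator,
      // giving 10.0 regardless of partitioning or ordering.
      val fixed = rdd.aggregate(0.0)((acc, v) => acc + v, _ + _)
      ```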
      
      Please refer to #16464 for details.
      ## How was this patch tested?
      Add a new unit test for testing logPrior.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #16491 from wangmiao1981/ldabug.
      
      (cherry picked from commit 036b5034)
      Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
  5. Dec 07, 2016
    • [SPARK-18326][SPARKR][ML] Review SparkR ML wrappers API for 2.1 · 1c3f1da8
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Reviewing SparkR ML wrappers API for 2.1 release, mainly two issues:
      * Remove ```probabilityCol``` from the argument list of ```spark.logit``` and ```spark.randomForest```, since it is used when making predictions and should instead be an argument of ```predict```; we will work on this at [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) in the next release cycle.
      * Fix ```spark.als``` params to make them consistent with MLlib.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16169 from yanboliang/spark-18326.
      
      (cherry picked from commit 97255497)
      Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
    • [SPARK-18701][ML] Fix Poisson GLM failure due to wrong initialization · 99c293ee
      actuaryzhang authored
      
      Poisson GLM fails for many standard data sets (see example in test or JIRA). The issue is incorrect initialization leading to almost zero probability and weights. Specifically, the mean is initialized as the response, which could be zero. Applying the log link results in very negative numbers (protected against -Inf), which again leads to close to zero probability and weights in the weighted least squares. Fix and test are included in the commits.
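      A back-of-the-envelope sketch of the failure mode (the 0.1 shift below is illustrative, echoing R's `glm.fit` starting values, and not necessarily the exact constant used in the patch):
      
      ```
      import scala.math.log
      
      object PoissonInitDemo extends App {
        val y = 0.0            // a perfectly ordinary Poisson response
        println(log(y))        // -Infinity: initializing the mean as the raw response breaks the log link
        println(log(y + 0.1))  // finite: shifting the starting mean away from zero avoids it
      }
      ```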
      
      ## What changes were proposed in this pull request?
      Update initialization in Poisson GLM
      
      ## How was this patch tested?
      Add test in GeneralizedLinearRegressionSuite
      
      srowen sethah yanboliang HyukjinKwon mengxr
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #16131 from actuaryzhang/master.
      
      (cherry picked from commit b8280271)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
    • [SPARK-18686][SPARKR][ML] Several cleanup and improvements for spark.logit. · 340e9aea
      Yanbo Liang authored
      
      ## What changes were proposed in this pull request?
      Several cleanups and improvements for ```spark.logit```:
      * ```summary``` should return the coefficients matrix, and should output labels for each class if the model is a multinomial logistic regression model.
      * ```summary``` should not return ```areaUnderROC, roc, pr, ...```, since most of them are DataFrames and less important to R users. Moreover, these metrics currently ignore instance weights (setting all to 1.0), which will change in a later Spark version; to avoid introducing breaking changes then, we do not expose them for now.
      * SparkR test improvement: compare the training result with native R glmnet.
      * Remove argument ```aggregationDepth``` from ```spark.logit```, since it is an expert Param (related to Spark architecture and job execution) that R users would rarely need.
      
      ## How was this patch tested?
      Unit tests.
      
      The ```summary``` output after this change:
      multinomial logistic regression:
      ```
      > df <- suppressWarnings(createDataFrame(iris))
      > model <- spark.logit(df, Species ~ ., regParam = 0.5)
      > summary(model)
      $coefficients
                   versicolor  virginica   setosa
      (Intercept)  1.514031    -2.609108   1.095077
      Sepal_Length 0.02511006  0.2649821   -0.2900921
      Sepal_Width  -0.5291215  -0.02016446 0.549286
      Petal_Length 0.03647411  0.1544119   -0.190886
      Petal_Width  0.000236092 0.4195804   -0.4198165
      ```
      binomial logistic regression:
      ```
      > df <- suppressWarnings(createDataFrame(iris))
      > training <- df[df$Species %in% c("versicolor", "virginica"), ]
      > model <- spark.logit(training, Species ~ ., regParam = 0.5)
      > summary(model)
      $coefficients
                   Estimate
      (Intercept)  -6.053815
      Sepal_Length 0.2449379
      Sepal_Width  0.1648321
      Petal_Length 0.4730718
      Petal_Width  1.031947
      ```
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16117 from yanboliang/spark-18686.
      
      (cherry picked from commit 90b59d1b)
      Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
  6. Nov 28, 2016
    • [SPARK-18408][ML] API Improvements for LSH · cdf315ba
      Yun Ni authored
      
      ## What changes were proposed in this pull request?
      
      (1) Change output schema to `Array of Vector` instead of `Vectors`
      (2) Use `numHashTables` as the dimension of Array
      (3) Rename `RandomProjection` to `BucketedRandomProjectionLSH`, `MinHash` to `MinHashLSH` (see the usage sketch after these lists)
      (4) Make `randUnitVectors/randCoefficients` private
      (5) Make Multi-Probe NN Search and `hashDistance` private for future discussion
      
      Saved for future PRs:
      (1) AND-amplification and `numHashFunctions` as the dimension of Vector are saved for a future PR.
      (2) `hashDistance` and Multi-Probe NN Search need more discussion; the current implementation is just a backward-compatible one.
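      To make changes (2) and (3) in the first list concrete, a hedged usage sketch of the renamed estimator (assumes a `SparkSession` named `spark`; the data and parameter values are made up):
      
      ```
      import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
      import org.apache.spark.ml.linalg.Vectors
      
      val df = spark.createDataFrame(Seq(
        (0, Vectors.dense(1.0, 1.0)),
        (1, Vectors.dense(1.0, -1.0)),
        (2, Vectors.dense(-1.0, -1.0))
      )).toDF("id", "features")
      
      val lsh = new BucketedRandomProjectionLSH()  // formerly RandomProjection
        .setBucketLength(2.0)
        .setNumHashTables(3)         // the length of the output Array of Vector
        .setInputCol("features")
        .setOutputCol("hashes")
      
      // The "hashes" column holds an Array of Vector, one hash vector per table.
      lsh.fit(df).transform(df).show(truncate = false)
      ```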
      
      ## How was this patch tested?
      Related unit tests were modified to make sure that the performance of LSH is preserved and that the outputs of the APIs meet expectations.
      
      Author: Yun Ni <yunn@uber.com>
      Author: Yunni <Euler57721@gmail.com>
      
      Closes #15874 from Yunni/SPARK-18408-yunn-api-improvements.
      
      (cherry picked from commit 05f7c6ff)
      Signed-off-by: Joseph K. Bradley <joseph@databricks.com>