  1. Jun 21, 2017
  2. Jun 20, 2017
    • Joseph K. Bradley's avatar
      [SPARK-20929][ML] LinearSVC should use its own threshold param · cc67bd57
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      LinearSVC should use its own threshold param, rather than the shared one, since it applies to rawPrediction instead of probability.  This PR changes the param in the Scala, Python and R APIs.
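
      As a rough illustration of why the threshold needs its own param: a raw prediction is an unbounded margin, not a probability in [0, 1], so any real value is a meaningful cut-off. The sketch below is plain Python, not Spark code; `predict_from_margin` is a hypothetical helper, and the strict `>` comparison is an assumption.

```python
def predict_from_margin(raw_margin: float, threshold: float = 0.0) -> float:
    """Return 1.0 when the raw margin exceeds the threshold, else 0.0.

    Hypothetical helper: the threshold is compared against an unbounded
    margin, so negative or large positive thresholds are all valid.
    """
    return 1.0 if raw_margin > threshold else 0.0


margins = [-2.5, -0.3, 0.0, 0.7, 4.2]
for threshold in (0.0, -1.0, 3.0):
    labels = [predict_from_margin(m, threshold) for m in margins]
    print(f"threshold={threshold}: {labels}")
```

      A threshold such as -1.0 or 3.0 would be rejected by a param validated as a probability, which is presumably why the shared param did not fit here.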
      
      ## How was this patch tested?
      
      New unit test to make sure the threshold can be set to any Double value.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #18151 from jkbradley/ml-2.2-linearsvc-cleanup.
      cc67bd57
  3. Jun 12, 2017
    • Joseph K. Bradley's avatar
      [SPARK-21050][ML] Word2vec persistence overflow bug fix · ff318c0d
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      The method calculateNumberOfPartitions() uses Int, not Long (unlike the MLlib version), so it is very easy to overflow when calculating the number of partitions for ML persistence.
      
      This modifies the calculations to use Long.
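
      The failure mode can be reproduced outside the JVM. The sketch below simulates 32-bit wraparound in Python; the size formula (words × vector size × 4 bytes) and the example numbers are assumptions for illustration, not the actual body of calculateNumberOfPartitions().

```python
def to_int32(x: int) -> int:
    """Wrap an arbitrary Python int to a signed 32-bit value, as JVM Int arithmetic does."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def approx_size_int32(num_words: int, vector_size: int, bytes_per_float: int = 4) -> int:
    """Hypothetical size estimate in which every intermediate product is a 32-bit Int."""
    return to_int32(to_int32(num_words * vector_size) * bytes_per_float)


def approx_size_int64(num_words: int, vector_size: int, bytes_per_float: int = 4) -> int:
    """The same estimate with wide arithmetic (Python ints do not overflow)."""
    return num_words * vector_size * bytes_per_float


# Illustrative numbers only: 10 million words with 100-dimensional vectors.
print(approx_size_int32(10_000_000, 100))
print(approx_size_int64(10_000_000, 100))
```

      With these inputs the 32-bit estimate wraps to a negative number, so any partition count derived from it would be meaningless.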
      
      ## How was this patch tested?
      
      New unit test.  I verified that the test fails before this patch.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #18265 from jkbradley/word2vec-save-fix.
      ff318c0d
  4. Jun 05, 2017
    • sethah's avatar
      [SPARK-19762][ML] Hierarchy for consolidating ML aggregator/loss code · 1665b5f7
      sethah authored
      ## What changes were proposed in this pull request?
      
      JIRA: [SPARK-19762](https://issues.apache.org/jira/browse/SPARK-19762)
      
      The larger changes in this patch are:
      
      * Adds a `DifferentiableLossAggregator` trait which is intended to be used as a common parent trait to all Spark ML aggregator classes. It factors out the common methods: `merge, gradient, loss, weight` from the aggregator subclasses.
      * Adds a `RDDLossFunction` which is intended to be the only implementation of Breeze's `DiffFunction` necessary in Spark ML, and can be used by all other algorithms. It takes the aggregator type as a type parameter, and maps the aggregator over an RDD. It additionally takes in an optional regularization loss function for applying the differentiable part of regularization.
      * Factors out the regularization from the data part of the cost function, and treats regularization as a separate independent cost function which can be evaluated and added to the data cost function.
      * Changes `LinearRegression` to use this new hierarchy as a proof of concept.
      * Adds the following new namespaces `o.a.s.ml.optim.loss` and `o.a.s.ml.optim.aggregator`
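
      The shape of this hierarchy can be sketched in a few lines. The Python below is an analogy only, not the Scala classes from the patch: `LeastSquaresAggregator`, `l2_regularization`, and `make_loss_function` are hypothetical names, partitions are plain lists standing in for an RDD, and the loss is assumed to be weighted squared error.

```python
class LeastSquaresAggregator:
    """Accumulates squared-error loss and gradient for fixed coefficients."""

    def __init__(self, coefficients):
        self.coefficients = list(coefficients)
        self.weight_sum = 0.0
        self.loss_sum = 0.0
        self.gradient_sum = [0.0] * len(self.coefficients)

    def add(self, features, label, weight=1.0):
        error = sum(c * x for c, x in zip(self.coefficients, features)) - label
        self.loss_sum += weight * 0.5 * error * error
        for i, x in enumerate(features):
            self.gradient_sum[i] += weight * error * x
        self.weight_sum += weight
        return self

    def merge(self, other):
        """Combine partial sums from another partition's aggregator."""
        self.weight_sum += other.weight_sum
        self.loss_sum += other.loss_sum
        self.gradient_sum = [a + b for a, b in zip(self.gradient_sum, other.gradient_sum)]
        return self

    @property
    def loss(self):
        return self.loss_sum / self.weight_sum

    @property
    def gradient(self):
        return [g / self.weight_sum for g in self.gradient_sum]


def l2_regularization(reg_param):
    """Regularization as an independent cost function: returns (loss, gradient)."""
    def evaluate(coefficients):
        loss = 0.5 * reg_param * sum(c * c for c in coefficients)
        return loss, [reg_param * c for c in coefficients]
    return evaluate


def make_loss_function(partitions, regularization=None):
    """Map an aggregator over each partition, merge, then add the regularization cost."""
    def calculate(coefficients):
        merged = None
        for partition in partitions:
            agg = LeastSquaresAggregator(coefficients)
            for features, label in partition:
                agg.add(features, label)
            merged = agg if merged is None else merged.merge(agg)
        loss, gradient = merged.loss, merged.gradient  # assumes non-empty data
        if regularization is not None:
            reg_loss, reg_gradient = regularization(coefficients)
            loss += reg_loss
            gradient = [g + r for g, r in zip(gradient, reg_gradient)]
        return loss, gradient
    return calculate


partitions = [[([1.0, 0.0], 1.0)], [([0.0, 1.0], 2.0)]]
print(make_loss_function(partitions)([1.0, 1.0]))
print(make_loss_function(partitions, l2_regularization(0.1))([1.0, 1.0]))
```

      Keeping the regularization term outside the aggregator means the data pass is identical with or without it, which matches the separation described in the third bullet.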
      
      Also note that none of these are public-facing changes. All of these classes are internal to Spark ML and remain that way.
      
      **NOTE: The large majority of the "lines added" and "lines deleted" are simply code moving around or unit tests.**
      
      BTW, I also converted LinearSVC to this framework as a way to prove that this new hierarchy is flexible enough for the other algorithms, but I backed those changes out because the PR is large enough as is.
      
      ## How was this patch tested?
      Test suites are added for the new components, and some test suites are also added to provide coverage where there wasn't any before.
      
      * DifferentiableLossAggregatorSuite
      * LeastSquaresAggregatorSuite
      * RDDLossFunctionSuite
      * DifferentiableRegularizationSuite
      
      Below are some performance testing numbers. Run on a 6 node virtual cluster with 44 cores and ~110G RAM, the dataset size is about 37G. These are not "large-scale" tests, but we really want to just make sure the iteration times don't increase with this patch. Notably we are doing the regularization a bit differently than before, but that should cost very little. I think there's very little risk otherwise, and these numbers don't show a difference. Of course I'm happy to add more tests as we think it's necessary, but I think the patch is ready for review now.
      
      **Note:** timings are best of 3 runs.
      
      |    |   numFeatures |   numPoints |   maxIter |   regParam |   elasticNetParam |   SPARK-19762 (sec) |   master (sec) |
      |----|---------------|-------------|-----------|------------|-------------------|---------------------|----------------|
      |  0 |          5000 |       1e+06 |        30 |       0    |               0   |             129.594 |        131.153 |
      |  1 |          5000 |       1e+06 |        30 |       0.1  |               0   |             135.54  |        136.327 |
      |  2 |          5000 |       1e+06 |        30 |       0.01 |               0.5 |             135.148 |        129.771 |
      |  3 |         50000 |  100000     |        30 |       0    |               0   |             145.764 |        144.096 |
      
      ## Follow ups
      
      If this design is accepted, we will convert the other ML algorithms that use this aggregator pattern to this new hierarchy in follow up PRs.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      Author: sethah <shendrickson@cloudera.com>
      
      Closes #17094 from sethah/ml_aggregators.
      1665b5f7
    • Zheng RuiFeng's avatar
      [SPARK-20930][ML] Destroy broadcasted centers after computing cost in KMeans · 98b5ccd3
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Destroy broadcasted centers after computing cost.
      ## How was this patch tested?
      existing tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #18152 from zhengruifeng/destroy_kmeans_model.
      98b5ccd3
  5. Jun 03, 2017
  6. Jun 02, 2017
  7. Jun 01, 2017
    • John Compitello's avatar
      [SPARK-20109][MLLIB] Rewrote toBlockMatrix method on IndexedRowMatrix · 0975019c
      John Compitello authored
      ## What changes were proposed in this pull request?
      
      - ~~I added the method `toBlockMatrixDense` to the IndexedRowMatrix class. The current implementation of `toBlockMatrix` is insufficient for users with relatively dense IndexedRowMatrix objects, since it assumes sparsity.~~
      
      EDIT: Ended up deciding that there should be just a single `toBlockMatrix` method, which creates a BlockMatrix whose blocks may be dense or sparse depending on the sparsity of the rows. This method will work better on any current use case of `toBlockMatrix` and doesn't go through `CoordinateMatrix` like the old method.
      
      ## How was this patch tested?
      
      ~~I used the same tests already written for `toBlockMatrix()` to test this method. I also added a new additional unit test for an edge case that was not adequately tested by current test suite.~~
      
      I ran the original `IndexedRowMatrix` tests and wrote additional tests to cover edge cases that the original tests ignored.
      
      Author: John Compitello <johnc@broadinstitute.org>
      
      Closes #17459 from johnc1231/johnc-fix-ir-to-block.
      0975019c
  8. May 31, 2017
    • David Eis's avatar
      [SPARK-20790][MLLIB] Correctly handle negative values for implicit feedback in ALS · d52f6362
      David Eis authored
      ## What changes were proposed in this pull request?
      
      Revert the handling of negative values in ALS with implicit feedback, so that the confidence is the absolute value of the rating and the preference is 0 for negative ratings. This was the original behavior.
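
      A minimal sketch of the restored behavior, in plain Python rather than the ALS solver itself. The `1 + alpha * |rating|` form of the confidence is an assumption taken from the usual implicit-feedback formulation; the commit message only says the confidence is based on the absolute value of the rating. `implicit_confidence_and_preference` is a hypothetical helper.

```python
def implicit_confidence_and_preference(rating: float, alpha: float = 1.0):
    """Return (confidence, preference) for one implicit-feedback rating.

    Assumed formulation: confidence grows with |rating|, while the
    preference is 1 only for positive ratings and 0 otherwise.
    """
    confidence = 1.0 + alpha * abs(rating)
    preference = 1.0 if rating > 0 else 0.0
    return confidence, preference


for rating in (3.0, 0.0, -3.0):
    print(rating, implicit_confidence_and_preference(rating, alpha=2.0))
```

      Under this reading a negative rating is not ignored: it contributes a high-confidence observation that the preference is 0.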
      
      ## How was this patch tested?
      
      This patch was tested with the existing unit tests and an added unit test to ensure that negative ratings are not ignored.
      
      mengxr
      
      Author: David Eis <deis@bloomberg.net>
      
      Closes #18022 from davideis/bugfix/negative-rating.
      d52f6362
  9. May 25, 2017
    • Wayne Zhang's avatar
      [SPARK-14659][ML] RFormula consistent with R when handling strings · f47700c9
      Wayne Zhang authored
      ## What changes were proposed in this pull request?
      When handling strings, RFormula and R drop different categories:
      - RFormula drops the least frequent level
      - R drops the first level after ascending alphabetical ordering
      
      This PR supports different string ordering types in StringIndexer #17879 so that RFormula can drop the same level as R when handling strings using `stringOrderType = "alphabetDesc"`.
      
      ## How was this patch tested?
      new tests
      
      Author: Wayne Zhang <actuaryzhang@uber.com>
      
      Closes #17967 from actuaryzhang/RFormula.
      f47700c9
  10. May 23, 2017
  11. May 22, 2017
    • Zheng RuiFeng's avatar
      [SPARK-15767][ML][SPARKR] Decision Tree wrapper in SparkR · 4be33758
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      support decision tree in R
      
      ## How was this patch tested?
      added tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #17981 from zhengruifeng/dt_r.
      4be33758
    • Ignacio Bermudez's avatar
      [SPARK-20687][MLLIB] mllib.Matrices.fromBreeze may crash when converting from Breeze sparse matrix · 06dda1d5
      Ignacio Bermudez authored
      ## What changes were proposed in this pull request?
      
      When an operation is applied to two Breeze SparseMatrices, the result matrix may contain extra provisional 0 values in its rowIndices and data arrays. This makes those arrays inconsistent with the colPtrs data, but Breeze gets away with the inconsistency by keeping a counter of the valid data.
      
      In Spark, when these matrices are converted to SparseMatrices, Spark relies solely on rowIndices, data, and colPtrs, but these might be incorrect because of Breeze's internal hacks. Therefore, we need to slice both rowIndices and data using Breeze's counter of active data.
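
      The slicing step can be illustrated on plain compressed-sparse-column arrays. This is a Python sketch, not the `fromBreeze` code; `compact_csc` and its `active_size` argument are hypothetical stand-ins for Breeze's counter of valid entries.

```python
def compact_csc(col_ptrs, row_indices, data, active_size):
    """Drop provisional trailing entries so the arrays agree with col_ptrs.

    In a well-formed CSC matrix the last column pointer equals the number of
    stored entries, so anything past active_size is padding.
    """
    if col_ptrs[-1] != active_size:
        raise ValueError("col_ptrs is inconsistent with the active entry count")
    return list(col_ptrs), list(row_indices[:active_size]), list(data[:active_size])


# A 2x2 matrix with two real entries followed by two provisional zeros.
col_ptrs = [0, 1, 2]
row_indices = [0, 1, 0, 0]
data = [1.0, 2.0, 0.0, 0.0]
print(compact_csc(col_ptrs, row_indices, data, active_size=2))
```

      Without the slice, a consumer that trusts `len(data)` would see four stored entries where `col_ptrs` accounts for only two.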
      
      This method is at least called by BlockMatrix when performing distributed block operations, causing exceptions on valid operations.
      
      See http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add
      
      ## How was this patch tested?
      
      Added a test to MatricesSuite that verifies that the conversions are valid and that code doesn't crash. Originally the same code would crash on Spark.
      
      Bugfix for https://issues.apache.org/jira/browse/SPARK-20687
      
      Author: Ignacio Bermudez <ignaciobermudez@gmail.com>
      Author: Ignacio Bermudez Corrales <icorrales@splunk.com>
      
      Closes #17940 from ghoto/bug-fix/SPARK-20687.
      06dda1d5
  12. May 16, 2017
  13. May 15, 2017
  14. May 12, 2017
    • Wayne Zhang's avatar
      [SPARK-20619][ML] StringIndexer supports multiple ways to order label · af40bb11
      Wayne Zhang authored
      ## What changes were proposed in this pull request?
      
      StringIndexer maps labels to numbers according to the descending order of label frequency. Other types of ordering (e.g., alphabetical) may be needed in feature ETL.  For example, the ordering will affect the result in one-hot encoding and RFormula.
      
      This PR proposes to support other ordering methods and we add a parameter `stringOrderType` that supports the following four options:
      - 'frequencyDesc': descending order by label frequency (most frequent label assigned 0)
      - 'frequencyAsc': ascending order by label frequency (least frequent label assigned 0)
      - 'alphabetDesc': descending alphabetical order
      - 'alphabetAsc': ascending alphabetical order
      
      The default is still descending order of label frequency, so there should be no impact to existing programs.
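
      The four options can be sketched with the standard library. This is not the StringIndexer implementation; `index_labels` is a hypothetical helper, and breaking frequency ties alphabetically is an assumption that the description above does not specify.

```python
from collections import Counter


def index_labels(labels, string_order_type="frequencyDesc"):
    """Map each distinct label to an index according to the chosen ordering."""
    counts = Counter(labels)
    if string_order_type == "frequencyDesc":
        ordered = sorted(counts, key=lambda s: (-counts[s], s))
    elif string_order_type == "frequencyAsc":
        ordered = sorted(counts, key=lambda s: (counts[s], s))
    elif string_order_type == "alphabetDesc":
        ordered = sorted(counts, reverse=True)
    elif string_order_type == "alphabetAsc":
        ordered = sorted(counts)
    else:
        raise ValueError(f"unsupported stringOrderType: {string_order_type}")
    return {label: float(i) for i, label in enumerate(ordered)}


labels = ["b", "a", "b", "c", "b", "a"]
for order in ("frequencyDesc", "frequencyAsc", "alphabetDesc", "alphabetAsc"):
    print(order, index_labels(labels, order))
```

      Note that under `alphabetDesc` the alphabetically first label receives the highest index, which is what matters for any downstream step that treats the last index specially.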
      
      ## How was this patch tested?
      new test
      
      Author: Wayne Zhang <actuaryzhang@uber.com>
      
      Closes #17879 from actuaryzhang/stringIndexer.
      af40bb11
  15. May 11, 2017
  16. May 10, 2017
  17. May 09, 2017
    • Yanbo Liang's avatar
      [SPARK-20606][ML] ML 2.2 QA: Remove deprecated methods for ML · b8733e0a
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Remove ML methods we deprecated in 2.1.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #17867 from yanboliang/spark-20606.
      b8733e0a
    • Jon McLean's avatar
      [SPARK-20615][ML][TEST] SparseVector.argmax throws IndexOutOfBoundsException · be53a783
      Jon McLean authored
      ## What changes were proposed in this pull request?
      
      Added a check for the number of defined values.  Previously the argmax function assumed that at least one value was defined if the vector size was greater than zero.
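
      The edge case is easy to see in a standalone sketch. The Python below is not the SparseVector code; `sparse_argmax` is a hypothetical function, and returning the lowest index among ties is a choice made for this example.

```python
def sparse_argmax(size, indices, values):
    """Index of the maximum of a sparse vector; -1 for an empty vector.

    indices/values hold only the explicitly stored entries; every other
    position is an implicit 0.0.
    """
    if size == 0:
        return -1
    if not values:
        # No defined values at all: every entry is an implicit zero.
        return 0
    best_pos = max(range(len(values)), key=values.__getitem__)
    best_val, best_idx = values[best_pos], indices[best_pos]
    if best_val <= 0.0 and len(values) < size:
        # An implicit zero is at least as large as every stored value.
        stored = set(indices)
        first_implicit = next(i for i in range(size) if i not in stored)
        return first_implicit if best_val < 0.0 else min(first_implicit, best_idx)
    return best_idx


print(sparse_argmax(5, [], []))
print(sparse_argmax(3, [0, 2], [-1.0, -2.0]))
print(sparse_argmax(4, [1, 3], [0.5, 2.0]))
```

      The second branch is the one the fix concerns: a vector with positive size but zero stored values has no `values[0]` to read.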
      
      ## How was this patch tested?
      
      Tests were added to the existing VectorsSuite to cover this case.
      
      Author: Jon McLean <jon.mclean@atsid.com>
      
      Closes #17877 from jonmclean/vectorArgmaxIndexBug.
      be53a783
    • Nick Pentreath's avatar
      [SPARK-20587][ML] Improve performance of ML ALS recommendForAll · 10b00aba
      Nick Pentreath authored
      This PR is a `DataFrame` version of #17742 for [SPARK-11968](https://issues.apache.org/jira/browse/SPARK-11968), for improving the performance of `recommendAll` methods.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #17845 from MLnick/ml-als-perf.
      10b00aba
    • Peng's avatar
      [SPARK-11968][MLLIB] Optimize MLLIB ALS recommendForAll · 80794247
      Peng authored
      The recommendForAll of MLLIB ALS is very slow.
      GC is a key problem of the current method.
      Each task uses the following code to hold the temporary result:
      val output = new Array[(Int, (Int, Double))](m*n)
      Here m = n = 4096 (the default value; there is no method to set it),
      so output takes about 4k * 4k * (4 + 4 + 8) = 256M. This is a large amount of memory; it causes serious GC problems and frequently leads to OOM.
      
      Actually, we don't need to save all the temporary results. Suppose we recommend the topK (topK is about 10 or 20) products for each user; then we only need 4k * topK * (4 + 4 + 8) memory to save the temporary result.
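
      The idea can be sketched with a bounded heap. This is plain Python, not the Spark implementation; `top_k_per_user` and the dict-of-lists factor layout are assumptions made for the example. `heapq.nlargest` consumes the scores lazily and keeps only k candidates at a time, so the full m × n score array is never materialized.

```python
import heapq


def top_k_per_user(user_factors, item_factors, k):
    """Return the k highest-scoring (item_id, score) pairs for each user."""
    result = {}
    for user_id, u in user_factors.items():
        # Generator: scores are produced one at a time and never stored in bulk.
        scored = (
            (sum(a * b for a, b in zip(u, v)), item_id)
            for item_id, v in item_factors.items()
        )
        best = heapq.nlargest(k, scored)
        result[user_id] = [(item_id, score) for score, item_id in best]
    return result


users = {1: [1.0, 0.0]}
items = {10: [1.0, 0.0], 11: [2.0, 0.0], 12: [0.5, 5.0]}
print(top_k_per_user(users, items, k=2))

# Payload sizes from the description above (tuple contents only).
block, bytes_per_entry = 4096, 4 + 4 + 8
print(block * block * bytes_per_entry)  # all pairs
print(block * 10 * bytes_per_entry)     # topK = 10
```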
      
      The Test Environment:
      3 workers: each worker has 10 cores, 30G memory, and 1 executor.
      The Data: 480,000 users and 17,000 items
      
      | BlockSize     | 1024 | 2048 | 4096 | 8192 |
      |---------------|------|------|------|------|
      | Old method    | 245s | 332s | 488s | OOM  |
      | This solution | 121s | 118s | 117s | 120s |
      
      Tested with the existing unit tests.
      
      Author: Peng <peng.meng@intel.com>
      Author: Peng Meng <peng.meng@intel.com>
      
      Closes #17742 from mpjlu/OptimizeAls.
      80794247
  18. May 08, 2017
  19. May 07, 2017
    • Daniel Li's avatar
      [SPARK-20484][MLLIB] Add documentation to ALS code · 88e6d750
      Daniel Li authored
      ## What changes were proposed in this pull request?
      
      This PR adds documentation to the ALS code.
      
      ## How was this patch tested?
      
      Existing tests were used.
      
      mengxr srowen
      
      This contribution is my original work.  I have the license to work on this project under the Spark project’s open source license.
      
      Author: Daniel Li <dan@danielyli.com>
      
      Closes #17793 from danielyli/spark-20484.
      88e6d750
  20. May 04, 2017
    • Wayne Zhang's avatar
      [SPARK-20574][ML] Allow Bucketizer to handle non-Double numeric column · 0d16faab
      Wayne Zhang authored
      ## What changes were proposed in this pull request?
      Bucketizer currently requires input column to be Double, but the logic should work on any numeric data types. Many practical problems have integer/float data types, and it could get very tedious to manually cast them into Double before calling bucketizer. This PR extends bucketizer to handle all numeric types.
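
      A rough sketch of why the bucketing logic is type-agnostic: once a value is widened to a float, the bucket lookup is a binary search over the split points. The Python below is not the Bucketizer code; `bucketize` is a hypothetical helper, and treating buckets as half-open `[lo, hi)` with the last bucket closed on the right is an assumption.

```python
from bisect import bisect_right


def bucketize(value, splits):
    """Return the bucket index for any numeric value (int or float)."""
    x = float(value)  # widen integer and float inputs alike
    if x < splits[0] or x > splits[-1]:
        raise ValueError(f"value {x} is outside the split range")
    if x == splits[-1]:
        return float(len(splits) - 2)  # the last bucket includes its upper bound
    return float(bisect_right(splits, x) - 1)


splits = [float("-inf"), 0.0, 10.0, float("inf")]
for value in (-3, 0, 5, 5.0, 10, 12.5):
    print(repr(value), bucketize(value, splits))
```

      Integer and floating-point inputs land in the same bucket here, which is the behavior a manual cast to Double would have produced.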
      
      ## How was this patch tested?
      New test.
      
      Author: Wayne Zhang <actuaryzhang@uber.com>
      
      Closes #17840 from actuaryzhang/bucketizer.
      0d16faab
    • Yanbo Liang's avatar
      [SPARK-20047][FOLLOWUP][ML] Constrained Logistic Regression follow up · c5dceb8c
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Address some minor comments for #17715:
      * Put bound-constrained optimization params under expertParams.
      * Update some docs.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #17829 from yanboliang/spark-20047-followup.
      c5dceb8c
  21. May 03, 2017
    • Yan Facai (颜发才)'s avatar
      [SPARK-16957][MLLIB] Use midpoints for split values. · 7f96f2d7
      Yan Facai (颜发才) authored
      ## What changes were proposed in this pull request?
      
      Use midpoints for split values for now; they may later be made weighted.
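
      As a minimal sketch of the idea, the candidate thresholds below sit halfway between consecutive distinct feature values rather than on the values themselves. `midpoint_splits` is a hypothetical helper; it ignores sampling and any limit on the number of bins that the tree code may apply.

```python
def midpoint_splits(values):
    """Return thresholds halfway between consecutive distinct values."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2.0 for lo, hi in zip(distinct, distinct[1:])]


print(midpoint_splits([1, 3, 3, 7]))
```

      A weighted variant would presumably shift each threshold toward the side with more samples, but that is not part of this change.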
      
      ## How was this patch tested?
      
      + [x] add unit test.
      + [x] revise Split's unit test.
      
      Author: Yan Facai (颜发才) <facai.yan@gmail.com>
      Author: 颜发才(Yan Facai) <facai.yan@gmail.com>
      
      Closes #17556 from facaiy/ENH/decision_tree_overflow_and_precision_in_aggregation.
      7f96f2d7
    • Sean Owen's avatar
      [SPARK-20523][BUILD] Clean up build warnings for 2.2.0 release · 16fab6b0
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Fix build warnings, primarily those related to Breeze 0.13 operator changes and Java style problems.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17803 from srowen/SPARK-20523.
      16fab6b0
  22. Apr 29, 2017
    • wangmiao1981's avatar
      [SPARK-20533][SPARKR] SparkR Wrappers Model should be private and value should be lazy · ee694cdf
      wangmiao1981 authored
      ## What changes were proposed in this pull request?
      
      MultilayerPerceptronClassifierWrapper model should be private.
      LogisticRegressionWrapper.scala rFeatures and rCoefficients should be lazy.
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: wangmiao1981 <wm624@hotmail.com>
      
      Closes #17808 from wangmiao1981/lazy.
      ee694cdf
    • Yuhao Yang's avatar
      [SPARK-19791][ML] Add doc and example for fpgrowth · add9d1bb
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      
      Add a new section for fpm
      Add Example for FPGrowth in scala and Java
      
      updated: Rewrite transform to be more compact.
      
      ## How was this patch tested?
      
      local doc generation.
      
      Author: Yuhao Yang <yuhao.yang@intel.com>
      
      Closes #17130 from hhbyyh/fpmdoc.
      add9d1bb
  23. Apr 27, 2017
  24. Apr 25, 2017
    • ding's avatar
      [SPARK-5484][GRAPHX] Periodically do checkpoint in Pregel · 0a7f5f27
      ding authored
      ## What changes were proposed in this pull request?
      
      Pregel-based iterative algorithms with more than ~50 iterations begin to slow down and eventually fail with a StackOverflowError due to Spark's lack of support for long lineage chains.
      
      This PR causes Pregel to checkpoint the graph periodically if the checkpoint directory is set.
      This PR also moves PeriodicGraphCheckpointer.scala from mllib to graphx, and moves PeriodicRDDCheckpointer.scala and PeriodicCheckpointer.scala from mllib to core.
      ## How was this patch tested?
      
      unit tests, manual tests
      
      Author: ding <ding@localhost.localdomain>
      Author: dding3 <ding.ding@intel.com>
      Author: Michael Allman <michael@videoamp.com>
      
      Closes #15125 from dding3/cp2_pregel.
      0a7f5f27
    • Yanbo Liang's avatar
      [SPARK-20449][ML] Upgrade breeze version to 0.13.1 · 67eef47a
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Upgrade breeze version to 0.13.1, which fixed some critical bugs of L-BFGS-B.
      
      ## How was this patch tested?
      Existing unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #17746 from yanboliang/spark-20449.
      67eef47a
    • wangmiao1981's avatar
      [SPARK-18901][FOLLOWUP][ML] Require in LR LogisticAggregator is redundant · 387565cf
      wangmiao1981 authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up PR of #17478.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: wangmiao1981 <wm624@hotmail.com>
      
      Closes #17754 from wangmiao1981/followup.
      387565cf
  25. Apr 24, 2017