  1. May 22, 2017
    • [SPARK-20687][MLLIB] mllib.Matrices.fromBreeze may crash when converting from Breeze sparse matrix · 06dda1d5
      Ignacio Bermudez authored
      ## What changes were proposed in this pull request?
      
      When an operation is applied to two Breeze SparseMatrices, the result may contain extra provisional 0 values in its rowIndices and data arrays. This makes those arrays inconsistent with the colPtrs data, but Breeze gets away with the inconsistency by keeping a counter of the valid data.
      
      When such a matrix is converted to a Spark SparseMatrix, Spark relies solely on rowIndices, data, and colPtrs, which may be incorrect because of Breeze's internal bookkeeping. Therefore, we need to slice both rowIndices and data using Breeze's counter of active data.
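
As a rough illustration of the slicing idea, here is a minimal Python sketch. It is not the Spark or Breeze code: `trim_csc` is a hypothetical helper, and the sketch assumes a CSC layout in which the last column pointer equals the number of active entries.

```python
def trim_csc(col_ptrs, row_indices, data):
    """Drop padding entries beyond the number of active values.

    Assumption: in CSC layout the last column pointer equals the count of
    stored entries, so anything past it in row_indices/data is padding.
    """
    active = col_ptrs[-1]
    if len(row_indices) < active or len(data) < active:
        raise ValueError("arrays are shorter than the active entry count")
    return col_ptrs, row_indices[:active], data[:active]


# 3x2 matrix with two real entries and two padding slots left by an operation.
col_ptrs = [0, 1, 2]
row_indices = [0, 2, 0, 0]
data = [1.5, 2.5, 0.0, 0.0]
print(trim_csc(col_ptrs, row_indices, data))
# prints ([0, 1, 2], [0, 2], [1.5, 2.5])
```

After trimming, the three arrays agree with each other, which is what a consumer that ignores the active-data counter needs.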
      
      Among other callers, BlockMatrix calls this method when performing distributed block operations, so valid operations could throw exceptions.
      
      See http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add
      
      ## How was this patch tested?
      
      Added a test to MatricesSuite that verifies that the conversions are valid and that code doesn't crash. Originally the same code would crash on Spark.
      
      Bugfix for https://issues.apache.org/jira/browse/SPARK-20687
      
      Author: Ignacio Bermudez <ignaciobermudez@gmail.com>
      Author: Ignacio Bermudez Corrales <icorrales@splunk.com>
      
      Closes #17940 from ghoto/bug-fix/SPARK-20687.
      06dda1d5
  2. May 16, 2017
  3. May 15, 2017
  4. May 12, 2017
    • [SPARK-20619][ML] StringIndexer supports multiple ways to order label · af40bb11
      Wayne Zhang authored
      ## What changes were proposed in this pull request?
      
      StringIndexer maps labels to numbers according to the descending order of label frequency. Other types of ordering (e.g., alphabetical) may be needed in feature ETL.  For example, the ordering will affect the result in one-hot encoding and RFormula.
      
      This PR adds support for other ordering methods through a new parameter `stringOrderType`, which supports the following four options:
      - 'frequencyDesc': descending order by label frequency (most frequent label assigned 0)
      - 'frequencyAsc': ascending order by label frequency (least frequent label assigned 0)
      - 'alphabetDesc': descending alphabetical order
      - 'alphabetAsc': ascending alphabetical order
      
      The default is still descending order of label frequency, so there should be no impact to existing programs.
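
To make the four options concrete, the following Python sketch mimics the orderings on a small list of labels. It is only an illustration, not the StringIndexer implementation; `index_labels` is a hypothetical helper, and the alphabetical tie-breaking for equal frequencies is an assumption of this sketch.

```python
from collections import Counter


def index_labels(labels, string_order_type="frequencyDesc"):
    """Return a dict mapping each distinct label to its index."""
    counts = Counter(labels)
    if string_order_type == "frequencyDesc":
        # Ties are broken alphabetically: an assumption of this sketch.
        ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    elif string_order_type == "frequencyAsc":
        ordered = sorted(counts, key=lambda lab: (counts[lab], lab))
    elif string_order_type == "alphabetDesc":
        ordered = sorted(counts, reverse=True)
    elif string_order_type == "alphabetAsc":
        ordered = sorted(counts)
    else:
        raise ValueError(f"unsupported stringOrderType: {string_order_type}")
    return {label: i for i, label in enumerate(ordered)}


labels = ["b", "a", "b", "c", "b", "a"]
for order in ("frequencyDesc", "frequencyAsc", "alphabetDesc", "alphabetAsc"):
    print(order, index_labels(labels, order))
```

With these labels, `b` is the most frequent and gets index 0 under `frequencyDesc`, while `a` gets index 0 under `alphabetAsc`.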
      
      ## How was this patch tested?
      new test
      
      Author: Wayne Zhang <actuaryzhang@uber.com>
      
      Closes #17879 from actuaryzhang/stringIndexer.
      af40bb11
  5. May 11, 2017
  6. May 10, 2017
  7. May 09, 2017
    • [SPARK-20606][ML] ML 2.2 QA: Remove deprecated methods for ML · b8733e0a
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Remove ML methods we deprecated in 2.1.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #17867 from yanboliang/spark-20606.
      b8733e0a
    • [SPARK-20615][ML][TEST] SparseVector.argmax throws IndexOutOfBoundsException · be53a783
      Jon McLean authored
      ## What changes were proposed in this pull request?
      
      Added a check for the number of defined values.  Previously the argmax function assumed that at least one value was defined if the vector size was greater than zero.
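
The edge case can be sketched in Python as follows. This is a simplified illustration rather than the Spark code: `sparse_argmax` is a hypothetical function, it assumes `indices` is sorted in ascending order, and it returns the first index holding the maximum value.

```python
def sparse_argmax(size, indices, values):
    """Index of the first maximum of a sparse vector, or -1 if it is empty."""
    if size == 0:
        return -1
    if not indices:
        # No defined values: every entry is an implicit zero.
        return 0
    pos = max(range(len(values)), key=values.__getitem__)
    best_idx, best_val = indices[pos], values[pos]
    if best_val > 0 or len(indices) == size:
        return best_idx
    # Some entries are implicit zeros, which may tie with or beat best_val.
    stored = set(indices)
    first_implicit = next(i for i in range(size) if i not in stored)
    if best_val < 0:
        return first_implicit
    return min(best_idx, first_implicit)


print(sparse_argmax(3, [], []))  # prints 0
```

The second branch is the one the check protects: a vector with a positive size but no defined values.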
      
      ## How was this patch tested?
      
      Tests were added to the existing VectorsSuite to cover this case.
      
      Author: Jon McLean <jon.mclean@atsid.com>
      
      Closes #17877 from jonmclean/vectorArgmaxIndexBug.
      be53a783
    • [SPARK-20587][ML] Improve performance of ML ALS recommendForAll · 10b00aba
      Nick Pentreath authored
      This PR is a `DataFrame` version of #17742 for [SPARK-11968](https://issues.apache.org/jira/browse/SPARK-11968), for improving the performance of `recommendAll` methods.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #17845 from MLnick/ml-als-perf.
      10b00aba
    • [SPARK-11968][MLLIB] Optimize MLLIB ALS recommendForAll · 80794247
      Peng authored
      The recommendForAll of MLLIB ALS is very slow.
      GC is a key problem of the current method.
      Each task uses the following code to hold temporary results:
      val output = new Array[(Int, (Int, Double))](m*n)
      m = n = 4096 (default value, no method to set)
      so output is about 4k * 4k * (4 + 4 + 8) bytes = 256 MB. This is a large allocation that causes serious GC pressure and frequent OOM errors.
      
      Actually, we don't need to save all the temporary results. Suppose we recommend the topK (topK is about 10 or 20) products for each user; then we only need 4k * topK * (4 + 4 + 8) bytes of memory to hold the temporary results.
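
One common way to keep only topK candidates per user is a bounded min-heap. The Python sketch below shows that idea under stated assumptions; it is not the code from this commit, and `top_k_for_user` and the sample factors are made up for illustration.

```python
import heapq


def top_k_for_user(user_factors, item_factors, k):
    """Return the k best (item_id, score) pairs for one user, best first."""
    heap = []  # min-heap of (score, item_id), never larger than k
    for item_id, factors in item_factors:
        score = sum(u * v for u, v in zip(user_factors, factors))
        if len(heap) < k:
            heapq.heappush(heap, (score, item_id))
        elif score > heap[0][0]:
            # The new score beats the current worst kept score: swap it in.
            heapq.heapreplace(heap, (score, item_id))
    return [(item_id, score) for score, item_id in sorted(heap, reverse=True)]


items = [(10, [1.0, 0.0]), (11, [0.0, 1.0]), (12, [1.0, 1.0]), (13, [0.5, 0.5])]
print(top_k_for_user([2.0, 1.0], items, k=2))
# prints [(12, 3.0), (10, 2.0)]
```

Memory per user is proportional to k rather than to the number of items in the block, which matches the 4k * topK estimate above.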
      
      The Test Environment:
      3 workers, each with 10 cores, 30G memory, and 1 executor.
      The Data: 480,000 users and 17,000 items
      
      BlockSize:      1024  2048  4096  8192
      Old method:     245s  332s  488s  OOM
      This solution:  121s  118s  117s  120s
      
      Tested with the existing unit tests.
      
      Author: Peng <peng.meng@intel.com>
      Author: Peng Meng <peng.meng@intel.com>
      
      Closes #17742 from mpjlu/OptimizeAls.
      80794247
  8. May 08, 2017
  9. May 07, 2017
    • [SPARK-20484][MLLIB] Add documentation to ALS code · 88e6d750
      Daniel Li authored
      ## What changes were proposed in this pull request?
      
      This PR adds documentation to the ALS code.
      
      ## How was this patch tested?
      
      Existing tests were used.
      
      mengxr srowen
      
      This contribution is my original work.  I have the license to work on this project under the Spark project’s open source license.
      
      Author: Daniel Li <dan@danielyli.com>
      
      Closes #17793 from danielyli/spark-20484.
      88e6d750
  10. May 04, 2017
    • [SPARK-20574][ML] Allow Bucketizer to handle non-Double numeric column · 0d16faab
      Wayne Zhang authored
      ## What changes were proposed in this pull request?
      Bucketizer currently requires the input column to be Double, but the logic should work on any numeric data type. Many practical problems have integer/float data types, and it can get very tedious to manually cast them to Double before calling Bucketizer. This PR extends Bucketizer to handle all numeric types.
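
The idea of widening any numeric input before bucketing can be sketched in Python. This is an illustration only: `bucketize` is a hypothetical function, and the bucket convention used here (each bucket is `[x, y)`, with the last bucket also including its upper bound) is an assumption of the sketch.

```python
from bisect import bisect_right
from decimal import Decimal


def bucketize(value, splits):
    """Map a numeric value to a bucket index given sorted split points."""
    x = float(value)  # widen int, float, Decimal, ... to a double
    if x < splits[0] or x > splits[-1]:
        raise ValueError(f"{value} is outside the splits range")
    if x == splits[-1]:
        # Assumed convention: the last bucket includes its upper bound.
        return float(len(splits) - 2)
    return float(bisect_right(splits, x) - 1)


splits = [0.0, 10.0, 20.0, 30.0]
print([bucketize(v, splits) for v in (5, 10, Decimal("12.5"), 30)])
# prints [0.0, 1.0, 1.0, 2.0]
```

The caller no longer needs to cast integers or decimals by hand, because the conversion happens in one place.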
      
      ## How was this patch tested?
      New test.
      
      Author: Wayne Zhang <actuaryzhang@uber.com>
      
      Closes #17840 from actuaryzhang/bucketizer.
      0d16faab
    • [SPARK-20047][FOLLOWUP][ML] Constrained Logistic Regression follow up · c5dceb8c
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Address some minor comments for #17715:
      * Put bound-constrained optimization params under expertParams.
      * Update some docs.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #17829 from yanboliang/spark-20047-followup.
      c5dceb8c
  11. May 03, 2017
    • [SPARK-16957][MLLIB] Use midpoints for split values. · 7f96f2d7
      Yan Facai (颜发才) authored
      ## What changes were proposed in this pull request?
      
      Use midpoints for split values now; weighting them may be added later.
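
A simplified Python sketch of midpoint thresholds is shown below. It covers only the case where every gap between adjacent distinct values becomes a candidate split; the real split-finding logic also limits the number of splits, which this sketch ignores. `midpoint_splits` is a hypothetical name.

```python
def midpoint_splits(values):
    """Candidate thresholds halfway between adjacent distinct values."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2.0 for lo, hi in zip(distinct, distinct[1:])]


print(midpoint_splits([1, 3, 3, 8]))
# prints [2.0, 5.5]
```

A threshold placed between two observed values leaves a margin on both sides, which tends to generalize better to unseen values than a threshold sitting exactly on an observed one.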
      
      ## How was this patch tested?
      
      + [x] add unit test.
      + [x] revise Split's unit test.
      
      Author: Yan Facai (颜发才) <facai.yan@gmail.com>
      Author: 颜发才(Yan Facai) <facai.yan@gmail.com>
      
      Closes #17556 from facaiy/ENH/decision_tree_overflow_and_precision_in_aggregation.
      7f96f2d7
    • [SPARK-20523][BUILD] Clean up build warnings for 2.2.0 release · 16fab6b0
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Fix build warnings, primarily those related to Breeze 0.13 operator changes and Java style problems.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17803 from srowen/SPARK-20523.
      16fab6b0
  12. Apr 29, 2017
    • [SPARK-20533][SPARKR] SparkR Wrappers Model should be private and value should be lazy · ee694cdf
      wangmiao1981 authored
      ## What changes were proposed in this pull request?
      
      MultilayerPerceptronClassifierWrapper model should be private.
      LogisticRegressionWrapper.scala rFeatures and rCoefficients should be lazy.
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: wangmiao1981 <wm624@hotmail.com>
      
      Closes #17808 from wangmiao1981/lazy.
      ee694cdf
    • [SPARK-19791][ML] Add doc and example for fpgrowth · add9d1bb
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      
      Add a new section for fpm
      Add examples for FPGrowth in Scala and Java
      
      updated: Rewrite transform to be more compact.
      
      ## How was this patch tested?
      
      local doc generation.
      
      Author: Yuhao Yang <yuhao.yang@intel.com>
      
      Closes #17130 from hhbyyh/fpmdoc.
      add9d1bb
  13. Apr 27, 2017
  14. Apr 25, 2017
    • [SPARK-5484][GRAPHX] Periodically do checkpoint in Pregel · 0a7f5f27
      ding authored
      ## What changes were proposed in this pull request?
      
      Pregel-based iterative algorithms with more than ~50 iterations begin to slow down and eventually fail with a StackOverflowError due to Spark's lack of support for long lineage chains.
      
      This PR causes Pregel to checkpoint the graph periodically if the checkpoint directory is set.
      It also moves PeriodicGraphCheckpointer.scala from mllib to graphx, and moves PeriodicRDDCheckpointer.scala and PeriodicCheckpointer.scala from mllib to core.
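
The effect of periodic checkpointing on lineage length can be shown with a toy Python model. Nothing here is Spark API: `Dataset`, `iterate`, and `checkpoint_interval` are hypothetical names, and the lineage is reduced to a simple depth counter.

```python
class Dataset:
    """Toy stand-in for an RDD or graph that remembers its lineage depth."""

    def __init__(self, value, lineage_depth=0):
        self.value = value
        self.lineage_depth = lineage_depth

    def transform(self, fn):
        # Each transformation adds one more step to the lineage chain.
        return Dataset(fn(self.value), self.lineage_depth + 1)

    def checkpoint(self):
        # Materialize the data and forget the chain of parents.
        return Dataset(self.value, 0)


def iterate(start, fn, num_iterations, checkpoint_interval=None):
    """Run fn repeatedly; return the final value and the deepest lineage seen."""
    current = Dataset(start)
    max_depth = 0
    for i in range(1, num_iterations + 1):
        current = current.transform(fn)
        max_depth = max(max_depth, current.lineage_depth)
        if checkpoint_interval and i % checkpoint_interval == 0:
            current = current.checkpoint()
    return current.value, max_depth


print(iterate(0, lambda x: x + 1, 100))                          # prints (100, 100)
print(iterate(0, lambda x: x + 1, 100, checkpoint_interval=25))  # prints (100, 25)
```

The result is the same in both runs, but the lineage never grows beyond the checkpoint interval in the second one.
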
      ## How was this patch tested?
      
      unit tests, manual tests
      
      Author: ding <ding@localhost.localdomain>
      Author: dding3 <ding.ding@intel.com>
      Author: Michael Allman <michael@videoamp.com>
      
      Closes #15125 from dding3/cp2_pregel.
      0a7f5f27
    • [SPARK-20449][ML] Upgrade breeze version to 0.13.1 · 67eef47a
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Upgrade breeze version to 0.13.1, which fixed some critical bugs of L-BFGS-B.
      
      ## How was this patch tested?
      Existing unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #17746 from yanboliang/spark-20449.
      67eef47a
    • [SPARK-18901][FOLLOWUP][ML] Require in LR LogisticAggregator is redundant · 387565cf
      wangmiao1981 authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up PR of #17478.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: wangmiao1981 <wm624@hotmail.com>
      
      Closes #17754 from wangmiao1981/followup.
      387565cf
  15. Apr 24, 2017
  16. Apr 21, 2017
    • [SPARK-20423][ML] fix MLOR coeffs centering when reg == 0 · eb00378f
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      When reg == 0, MLOR has multiple solutions, and we need to center the coeffs to get an identical result.
      However, the current implementation centers the `coefficientMatrix` by the global mean of the coeffs.
      
      In fact, the `coefficientMatrix` should be centered on each feature index separately.
      According to the MLOR probability distribution function, it is easy to prove the following:
      suppose `{ w0, w1, .. w(K-1) }` make up the `coefficientMatrix`;
      then `{ w0 + c, w1 + c, ... w(K - 1) + c}` is an equivalent solution,
      where `c` is an arbitrary vector of `numFeatures` dimensions.
      Reference:
      https://core.ac.uk/download/pdf/6287975.pdf
      
      So we need to center the `coefficientMatrix` on each feature dimension separately.
      
      **We can also confirm this with the R library `glmnet`: when reg == 0, MLOR in `glmnet` always generates coefficients whose sum over each feature dimension is zero.**
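
The invariance argument can be checked numerically with a small Python sketch. It is only an illustration, with made-up coefficients and hypothetical helper names (`center_per_feature`, `predict`); `coefficients[k][j]` is taken to be the weight of feature `j` for class `k`.

```python
import math


def softmax(margins):
    shift = max(margins)  # subtract the max for numerical stability
    exps = [math.exp(m - shift) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]


def center_per_feature(coefficients):
    """Subtract, for each feature, the mean of its weights across classes."""
    num_classes = len(coefficients)
    means = [sum(col) / num_classes for col in zip(*coefficients)]
    return [[w - m for w, m in zip(row, means)] for row in coefficients]


def predict(coefficients, features):
    margins = [sum(w * x for w, x in zip(row, features)) for row in coefficients]
    return softmax(margins)


coefficients = [[1.0, 4.0], [2.0, 0.0], [3.0, -1.0]]
centered = center_per_feature(coefficients)
print(centered)
# prints [[-1.0, 3.0], [0.0, -1.0], [1.0, -2.0]]
print(predict(coefficients, [0.5, 2.0]))
print(predict(centered, [0.5, 2.0]))
```

Each feature column of `centered` sums to zero, and both `predict` calls give the same class probabilities, because subtracting the same vector from every class row shifts all margins by the same amount.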
      
      ## How was this patch tested?
      
      Tests added.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #17706 from WeichenXu123/mlor_center.
      eb00378f
  17. Apr 13, 2017
    • [SPARK-20265][MLLIB] Improve Prefix'span pre-processing efficiency · 095d1cb3
      Syrux authored
      ## What changes were proposed in this pull request?
      
      Improve PrefixSpan pre-processing efficiency by preventing sequences of zeros in the cleaned database.
      The efficiency gain is shown in the following graph: https://postimg.org/image/9x6ireuvn/
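
A minimal Python sketch of the cleaning step follows. It assumes a flat internal encoding in which itemsets are delimited by 0, so that an itemset emptied by filtering would otherwise leave two adjacent zeros; both that encoding and the helper name `clean_sequence` are assumptions of this sketch, not taken from the patch.

```python
def clean_sequence(sequence, frequent_items):
    """Encode a sequence of itemsets as a 0-delimited flat list.

    Infrequent items are dropped, and itemsets left empty are skipped so that
    no two delimiters end up next to each other. Returns [] if nothing remains.
    """
    encoded = [0]
    for itemset in sequence:
        kept = sorted(item for item in set(itemset) if item in frequent_items)
        if kept:
            encoded.extend(kept)
            encoded.append(0)
    return encoded if len(encoded) > 1 else []


print(clean_sequence([[1, 2], [9], [3]], frequent_items={1, 2, 3}))
# prints [0, 1, 2, 0, 3, 0]
```

Without the `if kept` guard, the middle itemset would contribute only a delimiter and the result would contain `0, 0`.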
      
      ## How was this patch tested?
      
      Using MLlib's existing PrefixSpan tests and tests of my own on the 8 datasets shown in the graph. All
      results obtained were strictly the same as with the original implementation (without this change).
      dev/run-tests was also run; no errors were found.
      
      Author : Cyril de Vogelaere <cyril.devogelaeregmail.com>
      
      Author: Syrux <pokcyril@hotmail.com>
      
      Closes #17575 from Syrux/SPARK-20265.
      095d1cb3
  18. Apr 12, 2017
    • [SPARK-18692][BUILD][DOCS] Test Java 8 unidoc build on Jenkins · ceaf77ae
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to run Spark unidoc to test the Javadoc 8 build, as the Javadoc 8 build is easily broken again.
      
      There are several problems with it:
      
      - It adds a little extra time to the test run. In my case, it took 1.5 mins more (`Elapsed :[94.8746569157]`). How it was tested is described in "How was this patch tested?".
      
      - > One problem that I noticed was that Unidoc appeared to be processing test sources: if we can find a way to exclude those from being processed in the first place then that might significantly speed things up.
      
        (see  joshrosen's [comment](https://issues.apache.org/jira/browse/SPARK-18692?focusedCommentId=15947627&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15947627))
      
      To complete this automated build, this PR also fixes existing Javadoc breaks, including ones introduced by test code as described above.
      
      These fixes are similar to instances fixed previously. Please refer to https://github.com/apache/spark/pull/15999 and https://github.com/apache/spark/pull/16013
      
      Note that this only fixes **errors** not **warnings**. Please see my observation https://github.com/apache/spark/pull/17389#issuecomment-288438704 for spurious errors by warnings.
      
      ## How was this patch tested?
      
      Manually via `jekyll build` for building tests. Also, tested via running `./dev/run-tests`.
      
      This was tested via manually adding `time.time()` as below:
      
      ```diff
           profiles_and_goals = build_profiles + sbt_goals
      
           print("[info] Building Spark unidoc (w/Hive 1.2.1) using SBT with these arguments: ",
                 " ".join(profiles_and_goals))
      
      +    import time
      +    st = time.time()
           exec_sbt(profiles_and_goals)
      +    print("Elapsed :[%s]" % str(time.time() - st))
      ```
      
      produces
      
      ```
      ...
      ========================================================================
      Building Unidoc API Documentation
      ========================================================================
      ...
      [info] Main Java API documentation successful.
      ...
      Elapsed :[94.8746569157]
      ...
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17477 from HyukjinKwon/SPARK-18692.
      ceaf77ae
  19. Apr 11, 2017
  20. Apr 10, 2017
    • [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String toLowerCase "Turkish... · a26e3ed5
      Sean Owen authored
      [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String toLowerCase "Turkish locale bug" causes Spark problems
      
      ## What changes were proposed in this pull request?
      
      Add Locale.ROOT to internal calls to String `toLowerCase`, `toUpperCase`, to avoid inadvertent locale-sensitive variation in behavior (aka the "Turkish locale problem").
      
      The change looks large but it is just adding `Locale.ROOT` (the locale with no country or language specified) to every call to these methods.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17527 from srowen/SPARK-20156.
      a26e3ed5
  21. Apr 09, 2017
    • [SPARK-20260][MLLIB] String interpolation required for error message · 261eaf51
      Vijay Ramesh authored
      ## What changes were proposed in this pull request?
      This error message doesn't get properly formatted because of a missing `s`.  Currently the error looks like:
      
      ```
      Caused by: java.lang.IllegalArgumentException: requirement failed: indices should be one-based and in ascending order; found current=$current, previous=$previous; line="$line"
      ```
      (note the literal `$current` instead of the interpolated value)
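
In Scala, a string literal without the `s` prefix is left as written, so `$current` is never substituted. The same class of bug can be reproduced in Python with `string.Template`, which also uses `$name` placeholders; the snippet below is an analogue for illustration, not the Scala code from this patch, and the sample values are made up.

```python
from string import Template

current, previous, line = 3, 5, "1 5:1.0 3:2.0"
raw = ('indices should be one-based and in ascending order; '
       'found current=$current, previous=$previous; line="$line"')

# Bug: the template text is used directly, so placeholders stay literal.
broken = raw
# Fix: actually perform the substitution.
fixed = Template(raw).substitute(current=current, previous=previous, line=line)

print(broken)
print(fixed)
```

The first line printed still contains `$current`, while the second shows the real values.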
      
      Author: Vijay Ramesh <vramesh@demandbase.com>
      
      Closes #17572 from vijaykramesh/master.
      261eaf51
  22. Apr 07, 2017
  23. Apr 06, 2017
    • [SPARK-19953][ML] Random Forest Models use parent UID when being fit · e156b5dd
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      The ML `RandomForestClassificationModel` and `RandomForestRegressionModel` were not using the estimator parent UID when being fit.  This change fixes that so the models can be properly identified with their parents.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Added check to verify that model uid matches that of the parent, then renamed `checkCopy` to `checkCopyAndUids` and verified that it was called by one test for each ML algorithm.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #17296 from BryanCutler/rfmodels-use-parent-uid-SPARK-19953.
      e156b5dd
  24. Apr 04, 2017
    • [SPARK-20003][ML] FPGrowthModel setMinConfidence should affect rules generation and transform · b28bbffb
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      
      jira: https://issues.apache.org/jira/browse/SPARK-20003
      I was doing some testing and found the issue. ml.fpm.FPGrowthModel `setMinConfidence` should always affect rules generation and transform.
      Currently, associationRules in FPGrowthModel is a lazy val, and `setMinConfidence` in FPGrowthModel has no impact once associationRules has been computed.
      
      I cache the associationRules to avoid re-computation if `minConfidence` is not changed, but this makes FPGrowthModel somewhat stateful. Let me know if there's any concern.
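
The caching behavior described here can be sketched in Python as a cache that remembers which `minConfidence` it was computed for. This is not the FPGrowthModel code: `RuleModel`, its fields, and the `computations` counter are hypothetical and exist only to make the behavior observable.

```python
class RuleModel:
    """Toy model that recomputes rules only when min_confidence changes."""

    def __init__(self, candidate_rules, min_confidence=0.8):
        # candidate_rules: list of (antecedent, consequent, confidence)
        self._candidates = candidate_rules
        self.min_confidence = min_confidence
        self._cached_rules = None
        self._cached_for = None
        self.computations = 0  # instrumentation for this example only

    def set_min_confidence(self, value):
        self.min_confidence = value
        return self

    @property
    def association_rules(self):
        stale = self._cached_for != self.min_confidence
        if self._cached_rules is None or stale:
            self.computations += 1
            self._cached_rules = [
                rule for rule in self._candidates if rule[2] >= self.min_confidence
            ]
            self._cached_for = self.min_confidence
        return self._cached_rules


model = RuleModel([("a", "b", 0.9), ("b", "c", 0.5)])
print(len(model.association_rules), model.computations)  # prints 1 1
print(len(model.association_rules), model.computations)  # prints 1 1
model.set_min_confidence(0.4)
print(len(model.association_rules), model.computations)  # prints 2 2
```

A plain lazy value would have kept returning the single rule after the threshold changed; tracking the threshold alongside the cache avoids that, at the cost of making the model stateful.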
      
      ## How was this patch tested?
      
      New unit test; I also strengthened the unit test for model save/load to cover the cache mechanism.
      
      Author: Yuhao Yang <yuhao.yang@intel.com>
      
      Closes #17336 from hhbyyh/fpmodelminconf.
      b28bbffb
    • [SPARK-20183][ML] Added outlierRatio arg to MLTestingUtils.testOutliersWithSmallWeights · a59759e6
      Seth Hendrickson authored
      ## What changes were proposed in this pull request?
      
      This is a small piece from https://github.com/apache/spark/pull/16722 which ultimately will add sample weights to decision trees.  This is to allow more flexibility in testing outliers since linear models and trees behave differently.
      
      Note: The primary author when this is committed should be sethah since this is taken from his code.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #17501 from jkbradley/SPARK-20183.
      a59759e6
    • [SPARK-19825][R][ML] spark.ml R API for FPGrowth · b34f7665
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Adds SparkR API for FPGrowth: [SPARK-19825](https://issues.apache.org/jira/browse/SPARK-19825):
      
      - `spark.fpGrowth`: model training.
      - `freqItemsets` and `associationRules` methods with new corresponding generics.
      - Scala helper: `org.apache.spark.ml.r.FPGrowthWrapper`
      - unit tests.
      
      ## How was this patch tested?
      
      Feature specific unit tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17170 from zero323/SPARK-19825.
      b34f7665
  25. Apr 03, 2017
    • [SPARK-19969][ML] Imputer doc and example · 4d28e843
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      
      Add docs and examples for spark.ml.feature.Imputer. Currently Scala and Java examples are included. A Python example will be added after https://github.com/apache/spark/pull/17316
      
      ## How was this patch tested?
      
      local doc generation and example execution
      
      Author: Yuhao Yang <yuhao.yang@intel.com>
      
      Closes #17324 from hhbyyh/imputerdoc.
      4d28e843
    • [SPARK-19985][ML] Fixed copy method for some ML Models · 2a903a1e
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      Some ML Models were using `defaultCopy` which expects a default constructor, and others were not setting the parent estimator.  This change fixes these by creating a new instance of the model and explicitly setting values and parent.
      
      ## How was this patch tested?
      Added `MLTestingUtils.checkCopy` to the tests of the offending models to verify that the copy is made and the parent is set.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #17326 from BryanCutler/ml-model-copy-error-SPARK-19985.
      2a903a1e