  1. Oct 25, 2016
    • sethah's avatar
      [SPARK-17748][ML] One pass solver for Weighted Least Squares with ElasticNet · 78d740a0
      sethah authored
      ## What changes were proposed in this pull request?
      
      1. Make a pluggable solver interface for `WeightedLeastSquares`
      2. Add a `QuasiNewton` solver to handle elastic net regularization for `WeightedLeastSquares`
      3. Add method `BLAS.dspmv` used by QN solver
      4. Add mechanism for WLS to handle singular covariance matrices by falling back to QN solver when Cholesky fails.
      
      ## How was this patch tested?
      Unit tests - see below.
      
      ## Design choices
      
      **Pluggable Normal Solver**
      
      Before, `WeightedLeastSquares` always used the Cholesky decomposition solver to compute the solution to the normal equations. Now, the solver is specified as a constructor argument to `WeightedLeastSquares`. We introduce a new trait:
      
      ````scala
      private[ml] sealed trait NormalEquationSolver {
      
        def solve(
            bBar: Double,
            bbBar: Double,
            abBar: DenseVector,
            aaBar: DenseVector,
            aBar: DenseVector): NormalEquationSolution
      }
      ````
      
      We extend this trait for different variants of normal equation solvers. In the future, we can easily add others (like QR) using this interface.
      
      **Always train in the standardized space**
      
      The normal solver did not previously standardize the data, but this patch introduces a change such that we always solve the normal equations in the standardized space. We convert back to the original space in the same way that is done for distributed L-BFGS/OWL-QN. We add test cases for zero variance features/labels.
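
      As a rough illustration of the conversion back to the original space, the sketch below rescales coefficients that were fitted on standardized features and labels. The names (`Unstandardize`, `toOriginalSpace`, `yStd`, `featuresStd`) and the zero-variance handling are assumptions made for this example, not code taken from the patch.

      ```scala
      object Unstandardize {
        /** Map coefficients fitted on standardized data back to the original scale. */
        def toOriginalSpace(
            stdCoefficients: Array[Double],
            featuresMean: Array[Double],
            featuresStd: Array[Double],
            yMean: Double,
            yStd: Double): (Array[Double], Double) = {
          val coefficients = stdCoefficients.indices.map { j =>
            // Assumption: a zero-variance feature carries no signal, so its coefficient is 0.
            if (featuresStd(j) != 0.0) stdCoefficients(j) * yStd / featuresStd(j) else 0.0
          }.toArray
          // Intercept chosen so that the fitted model passes through the point of means.
          val intercept =
            yMean - coefficients.zip(featuresMean).map { case (c, m) => c * m }.sum
          (coefficients, intercept)
        }
      }
      ```

      The rescaling follows from writing the standardized model `(y - yMean) / yStd = sum_j b_j * (x_j - mean_j) / std_j` and solving for `y`.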
      
      **Use L-BFGS locally to solve normal equations for singular matrix**
      
      When linear regression with the normal solver is called for a singular matrix, we initially try to solve with Cholesky. We use the output of `lapack.dppsv` to determine if the matrix is singular. If it is, we fall back to using L-BFGS locally to solve the normal equations. We add test cases for this as well.
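
      The fallback idea can be sketched without Spark or LAPACK. The example below is a simplified stand-in under stated assumptions: it works on a dense symmetric matrix rather than packed storage, detects singularity with a fixed pivot tolerance instead of the return code of `lapack.dppsv`, and uses plain gradient descent in place of L-BFGS. All names (`NormalSolvers`, `Solver`, `Auto`, `GradientDescent`) are hypothetical.

      ```scala
      object NormalSolvers {
        type Matrix = Array[Array[Double]]

        /** Common interface so callers can swap solvers. */
        sealed trait Solver {
          def solve(a: Matrix, b: Array[Double]): Option[Array[Double]]
        }

        /** Cholesky (A = L * L^T); returns None when a pivot is not positive. */
        object Cholesky extends Solver {
          def solve(a: Matrix, b: Array[Double]): Option[Array[Double]] = {
            val n = b.length
            val l = Array.ofDim[Double](n, n)
            var j = 0
            while (j < n) {
              val d = a(j)(j) - (0 until j).map(k => l(j)(k) * l(j)(k)).sum
              if (d <= 1e-12) return None // assumed tolerance: treat as singular
              l(j)(j) = math.sqrt(d)
              for (i <- j + 1 until n) {
                l(i)(j) = (a(i)(j) - (0 until j).map(k => l(i)(k) * l(j)(k)).sum) / l(j)(j)
              }
              j += 1
            }
            // Forward substitution: L * z = b.
            val z = new Array[Double](n)
            for (i <- 0 until n) {
              z(i) = (b(i) - (0 until i).map(k => l(i)(k) * z(k)).sum) / l(i)(i)
            }
            // Back substitution: L^T * x = z.
            val x = new Array[Double](n)
            for (i <- n - 1 to 0 by -1) {
              x(i) = (z(i) - (i + 1 until n).map(k => l(k)(i) * x(k)).sum) / l(i)(i)
            }
            Some(x)
          }
        }

        /** Iterative stand-in for L-BFGS: minimizes 0.5 * x^T A x - b^T x. */
        object GradientDescent extends Solver {
          def solve(a: Matrix, b: Array[Double]): Option[Array[Double]] = {
            val n = b.length
            val trace = (0 until n).map(i => a(i)(i)).sum
            val x = new Array[Double](n)
            if (trace > 0.0) {
              val step = 1.0 / trace // trace bounds the largest eigenvalue of a PSD matrix
              var iter = 0
              while (iter < 10000) {
                val grad = Array.tabulate(n)(i => (0 until n).map(k => a(i)(k) * x(k)).sum - b(i))
                for (i <- 0 until n) x(i) -= step * grad(i)
                iter += 1
              }
            }
            Some(x)
          }
        }

        /** Try Cholesky first; fall back to the iterative solver if it fails. */
        object Auto extends Solver {
          def solve(a: Matrix, b: Array[Double]): Option[Array[Double]] =
            Cholesky.solve(a, b).orElse(GradientDescent.solve(a, b))
        }
      }
      ```

      For a singular matrix the solution is not unique, so a check on the fallback result should compare residuals rather than individual coefficients.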
      
      ## Test cases
      I found it helpful to enumerate some of the test cases and hopefully it makes review easier.
      
      **WeightedLeastSquares**
      
      1. Constant columns - Cholesky solver fails with no regularization, Auto solver falls back to QN, and QN trains successfully.
      2. Collinear features - Cholesky solver fails with no regularization, Auto solver falls back to QN, and QN trains successfully.
      3. Label is constant zero - no training is performed regardless of intercept. Coefficients are zero and intercept is zero.
      4. Label is constant - if fitIntercept, then no training is performed and intercept equals label mean. If not fitIntercept, then we train and return an answer that matches R's `lm`.
      5. Test with L1 - go through various combinations of L1/L2, standardization, fitIntercept and verify that output matches glmnet.
      6. Initial intercept - verify that setting the initial intercept to label mean is correct by training model with strong L1 regularization so that all coefficients are zero and intercept converges to label mean.
      7. Test diagInvAtWA - since we are standardizing features now during training, we should test that the inverse is computed to match R.
      
      **LinearRegression**
      1. For all existing L1 test cases, test the "normal" solver too.
      2. Check that using the normal solver now handles singular matrices.
      3. Check that using the normal solver with L1 produces an objective history in the model summary, but does not produce the inverse of AtA.
      
      **BLAS**
      1. Test new method `dspmv`.
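
      For readers unfamiliar with the BLAS routine, `dspmv` computes `y := alpha * A * x + beta * y` for a symmetric matrix `A` held in packed storage. The sketch below is a plain-Scala illustration assuming upper-triangular, column-major packing; it is not the signature or implementation added to `BLAS` in this patch, and the object name `PackedBlas` is hypothetical.

      ```scala
      object PackedBlas {
        /**
         * y := alpha * A * x + beta * y, where A is an n-by-n symmetric matrix whose
         * upper triangle is packed column by column: A(i, j) with i <= j is stored
         * at index i + j * (j + 1) / 2.
         */
        def dspmv(
            n: Int,
            alpha: Double,
            ap: Array[Double],
            x: Array[Double],
            beta: Double,
            y: Array[Double]): Unit = {
          require(ap.length == n * (n + 1) / 2, "packed array has the wrong length")
          require(x.length == n && y.length == n, "vector lengths must equal n")
          val result = new Array[Double](n)
          for (j <- 0 until n; i <- 0 to j) {
            val v = ap(i + j * (j + 1) / 2)
            result(i) += v * x(j)
            if (i != j) result(j) += v * x(i) // mirror the entry into the lower triangle
          }
          for (i <- 0 until n) y(i) = alpha * result(i) + beta * y(i)
        }
      }
      ```

      Packed storage keeps only `n * (n + 1) / 2` values, which is why each off-diagonal entry is applied twice in the loop.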
      
      ## Performance Testing
      This patch will speed up linear regression with L1/elasticnet penalties when the feature size is < 4096. I have not conducted performance tests at scale; I have only observed a speed improvement when testing locally.
      
      We should decide if this PR needs to be blocked before performance testing is conducted.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #15394 from sethah/SPARK-17748.
      78d740a0
  2. Oct 21, 2016
    • Zheng RuiFeng's avatar
      [SPARK-17331][FOLLOWUP][ML][CORE] Avoid allocating 0-length arrays · a8ea4da8
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
      `Array[T]()` -> `Array.empty[T]` to avoid allocating 0-length arrays.
      Use the command `find . -name '*.scala' | xargs -i bash -c 'egrep "Array\[[A-Za-z]+\]\(\)" -n {} && echo {}'` to find candidates for modification.
      
      cc srowen
      
      ## How was this patch tested?
      existing tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #15564 from zhengruifeng/avoid_0_length_array.
      a8ea4da8
  3. Sep 29, 2016
    • Bjarne Fruergaard's avatar
      [SPARK-17721][MLLIB][ML] Fix for multiplying transposed SparseMatrix with SparseVector · 29396e7d
      Bjarne Fruergaard authored
      ## What changes were proposed in this pull request?
      
      * changes the implementation of gemv with transposed SparseMatrix and SparseVector both in mllib-local and mllib (identical)
      * adds a test that was failing before this change, but succeeds with these changes.
      
      The problem in the previous implementation was that it only increments `i`, that is enumerating the columns of a row in the SparseMatrix, when the row-index of the vector matches the column-index of the SparseMatrix. In cases where a particular row of the SparseMatrix has non-zero values at column-indices lower than corresponding non-zero row-indices of the SparseVector, the non-zero values of the SparseVector are enumerated without ever matching the column-index at index `i` and the remaining column-indices i+1,...,indEnd-1 are never attempted. The test cases in this PR illustrate this issue.
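
      The fix amounts to a two-pointer merge over two sorted index lists, where whichever side holds the smaller index advances. The sketch below shows that pattern for one row of the transposed matrix against a sparse vector. It is an illustration with hypothetical names (`SparseDot`, `dot`), not the `gemv` code from this PR.

      ```scala
      object SparseDot {
        /** Dot product of two sparse vectors given as sorted index/value arrays. */
        def dot(
            aIdx: Array[Int],
            aVal: Array[Double],
            bIdx: Array[Int],
            bVal: Array[Double]): Double = {
          var i = 0 // position in the matrix row
          var k = 0 // position in the sparse vector
          var sum = 0.0
          while (i < aIdx.length && k < bIdx.length) {
            if (aIdx(i) == bIdx(k)) {
              sum += aVal(i) * bVal(k)
              i += 1
              k += 1
            } else if (aIdx(i) < bIdx(k)) {
              i += 1 // row entry has no partner in the vector; skip it
            } else {
              k += 1 // vector entry has no partner in the row; skip it
            }
          }
          sum
        }
      }
      ```

      With a row holding entries at columns 0 and 2 and a vector holding entries at rows 1 and 2, a loop that advances `i` only on a match never gets past column 0 and misses the product at index 2.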
      
      ## How was this patch tested?
      
      I have run the specific `gemv` tests in both mllib-local and mllib. I am currently still running `./dev/run-tests`.
      
      ## ___
      As per instructions, I hereby state that this is my original work and that I license the work to the project (Apache Spark) under the project's open source license.
      
      Mentioning dbtsai, viirya and brkyvz whom I can see have worked/authored on these parts before.
      
      Author: Bjarne Fruergaard <bwahlgreen@gmail.com>
      
      Closes #15296 from bwahlgreen/bugfix-spark-17721.
      29396e7d
  4. Sep 07, 2016
    • Liwei Lin's avatar
      [SPARK-17359][SQL][MLLIB] Use ArrayBuffer.+=(A) instead of... · 3ce3a282
      Liwei Lin authored
      [SPARK-17359][SQL][MLLIB] Use ArrayBuffer.+=(A) instead of ArrayBuffer.append(A) in performance critical paths
      
      ## What changes were proposed in this pull request?
      
      We should generally use `ArrayBuffer.+=(A)` rather than `ArrayBuffer.append(A)`, because `append(A)` would involve extra boxing / unboxing.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #14914 from lw-lin/append_to_plus_eq_v2.
      3ce3a282
  5. Sep 04, 2016
    • Yanbo Liang's avatar
      [MINOR][ML][MLLIB] Remove work around for breeze sparse matrix. · 1b001b52
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Since we have updated breeze to version 0.12, we should remove the workaround for the breeze sparse matrix bug in v0.11.
      I checked all mllib code and found that this is the only workaround for breeze 0.11.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #14953 from yanboliang/matrices.
      1b001b52
  6. Sep 01, 2016
    • Sean Owen's avatar
      [SPARK-17331][CORE][MLLIB] Avoid allocating 0-length arrays · 3893e8c5
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Avoid allocating some 0-length arrays, esp. in UTF8String, and by using Array.empty in Scala over Array[T]()
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14895 from srowen/SPARK-17331.
      3893e8c5
  7. Aug 27, 2016
    • Peng, Meng's avatar
      [ML][MLLIB] The require condition and message doesn't match in SparseMatrix. · 40168dbe
      Peng, Meng authored
      ## What changes were proposed in this pull request?
      The require condition and its message don't match, and the condition should also be optimized.
      Small change. Please kindly let me know if a JIRA is required.
      
      ## How was this patch tested?
      No additional test required.
      
      Author: Peng, Meng <peng.meng@intel.com>
      
      Closes #14824 from mpjlu/smallChangeForMatrixRequire.
      40168dbe
  8. Aug 26, 2016
    • Peng, Meng's avatar
      [SPARK-17207][MLLIB] fix comparing Vector bug in TestingUtils · c0949dc9
      Peng, Meng authored
      ## What changes were proposed in this pull request?
      
      Fix a bug in comparing Vectors in TestingUtils.
      The same bug exists for Matrix comparison. How to check the length of a Matrix should be discussed first.
      
      ## How was this patch tested?
      
      (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
      
      (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
      
      Author: Peng, Meng <peng.meng@intel.com>
      
      Closes #14785 from mpjlu/testUtils.
      c0949dc9
  9. Aug 19, 2016
    • Jeff Zhang's avatar
      [SPARK-16965][MLLIB][PYSPARK] Fix bound checking for SparseVector. · 072acf5e
      Jeff Zhang authored
      ## What changes were proposed in this pull request?
      
      1. In Scala, add a check for negative lower bounds and put all lower/upper bound checks in one place
      2. In Python, add lower/upper bound checks for indices.
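
      A minimal sketch of what consolidated bound checking can look like, written with `require` so that violations raise `IllegalArgumentException`. The object and method names are hypothetical, and the strictly-increasing check is an extra assumption beyond what this description states.

      ```scala
      object SparseIndexCheck {
        /** Validate the indices of a sparse vector of the given size in one place. */
        def validate(size: Int, indices: Array[Int]): Unit = {
          require(size >= 0, s"size must be nonnegative, got $size")
          var prev = -1
          indices.foreach { i =>
            require(i >= 0, s"index $i is negative")
            require(i < size, s"index $i is out of range for size $size")
            // Assumption: indices are also expected to be strictly increasing.
            require(i > prev, s"index $i follows $prev; indices must be strictly increasing")
            prev = i
          }
        }
      }
      ```

      Keeping every check in a single method means a caller sees the same error message regardless of which constructor path created the vector.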
      
      ## How was this patch tested?
      
      unit test added
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #14555 from zjffdu/SPARK-16965.
      072acf5e
  10. Jul 19, 2016
  11. Jul 16, 2016
    • Sean Owen's avatar
      [SPARK-3359][DOCS] More changes to resolve javadoc 8 errors that will help... · 5ec0d692
      Sean Owen authored
      [SPARK-3359][DOCS] More changes to resolve javadoc 8 errors that will help unidoc/genjavadoc compatibility
      
      ## What changes were proposed in this pull request?
      
      These are yet more changes that resolve problems with unidoc/genjavadoc and Java 8. They do not fully resolve the problem, but get rid of as many errors as we can from this end.
      
      ## How was this patch tested?
      
      Jenkins build of docs
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14221 from srowen/SPARK-3359.3.
      5ec0d692
  12. Jul 11, 2016
    • Reynold Xin's avatar
      [SPARK-16477] Bump master version to 2.1.0-SNAPSHOT · ffcb6e05
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      After SPARK-16476 (committed earlier today as #14128), we can finally bump the version number.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #14130 from rxin/SPARK-16477.
      ffcb6e05
  13. Jun 06, 2016
    • Zheng RuiFeng's avatar
      [MINOR] Fix Typos 'an -> a' · fd8af397
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
      `an -> a`
      
      Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one.
      
      ## How was this patch tested?
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #13515 from zhengruifeng/an_a.
      fd8af397
  14. May 27, 2016
    • DB Tsai's avatar
      [SPARK-15413][ML][MLLIB] Change `toBreeze` to `asBreeze` in Vector and Matrix · 21b2605d
      DB Tsai authored
      ## What changes were proposed in this pull request?
      
      We're using `asML` to convert the mllib vector/matrix to ml vector/matrix now. Using `as` is more correct given that this conversion actually shares the same underlying data structure. As a result, in this PR, `toBreeze` will be changed to `asBreeze`. Because this is a private API, the change will not affect any user's application.
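
      The naming convention can be illustrated with a toy class: an `as` method wraps the same backing array, while a `to` method copies it. This sketch uses hypothetical names (`Vec`, `asView`, `toCopy`) and does not involve Breeze or the Spark vector classes.

      ```scala
      object AsVersusTo {
        final class Vec(val values: Array[Double]) {
          /** "as": wraps the same backing array, so writes are visible on both sides. */
          def asView: Vec = new Vec(values)

          /** "to": copies the data, so the result is independent of the source. */
          def toCopy: Vec = new Vec(values.clone())
        }
      }
      ```

      Mutating the original after calling `asView` changes what the view reads, whereas the result of `toCopy` keeps the old values.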
      
      ## How was this patch tested?
      
      unit tests
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #13198 from dbtsai/minor.
      21b2605d
  15. May 19, 2016
  16. May 17, 2016
  17. Apr 30, 2016
    • Xiangrui Meng's avatar
      [SPARK-14653][ML] Remove json4s from mllib-local · 0847fe4e
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      This PR moves Vector.toJson/fromJson to ml.linalg.VectorEncoder under mllib/ to keep mllib-local's dependency minimal. The json encoding is used by Params. So we still need this feature in SPARK-14615, where we will switch to ml.linalg in spark.ml APIs.
      
      ## How was this patch tested?
      
      Copied existing unit tests over.
      
      cc dbtsai
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #12802 from mengxr/SPARK-14653.
      0847fe4e
  18. Apr 28, 2016
  19. Apr 26, 2016
    • Joseph K. Bradley's avatar
      [SPARK-14732][ML] spark.ml GaussianMixture should use MultivariateGaussian in mllib-local · bd2c9a6d
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Before, spark.ml GaussianMixtureModel used the spark.mllib MultivariateGaussian in its public API.  This was added after 1.6, so we can modify this API without breaking APIs.
      
      This PR copies MultivariateGaussian to mllib-local in spark.ml, with a few changes:
      * Renamed fields to match numpy, scipy: mu => mean, sigma => cov
      
      This PR then uses the spark.ml MultivariateGaussian in the spark.ml GaussianMixtureModel, which involves:
      * Modifying the constructor
      * Adding a computeProbabilities method
      
      Also:
      * Added EPSILON to mllib-local for use in MultivariateGaussian
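
      The description does not say how `EPSILON` is defined. Assuming it denotes machine epsilon for `Double`, a common way to compute it is to halve a value until adding half of it to 1.0 no longer changes the result. The object name below is hypothetical.

      ```scala
      object MachineEpsilon {
        /** Smallest power of two eps such that 1.0 + eps / 2 rounds back to 1.0. */
        lazy val EPSILON: Double = {
          var eps = 1.0
          while ((1.0 + (eps / 2.0)) != 1.0) {
            eps /= 2.0
          }
          eps
        }
      }
      ```

      A value like this is typically used as a relative tolerance, for example when deciding which eigenvalues of a covariance matrix to treat as zero.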
      
      ## How was this patch tested?
      
      Existing unit tests
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12593 from jkbradley/sparkml-gmm-fix.
      bd2c9a6d
  20. Apr 22, 2016
    • Joan's avatar
      [SPARK-6429] Implement hashCode and equals together · bf95b8da
      Joan authored
      ## What changes were proposed in this pull request?
      
      Implement some `hashCode` and `equals` methods together in order to enable the scalastyle check.
      This is a first batch. I will continue to implement them, but I wanted to know your thoughts.
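
      As a reminder of why the two methods belong together, hash-based collections locate an element by `hashCode` before confirming it with `equals`, so the two must agree. The class below is a hypothetical example, not one of the classes changed in this PR.

      ```scala
      object EqualsHashCode {
        final class Point(val x: Int, val y: Int) {
          override def equals(other: Any): Boolean = other match {
            case p: Point => x == p.x && y == p.y
            case _ => false
          }

          // Must agree with equals: equal objects produce equal hash codes.
          override def hashCode(): Int = 31 * x + y
        }
      }
      ```

      Overriding only `equals` would leave the identity-based `hashCode` in place, and a `Set` lookup for an equal but distinct instance could then fail.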
      
      Author: Joan <joan@goyeau.com>
      
      Closes #12157 from joan38/SPARK-6429-HashCode-Equals.
      bf95b8da
  21. Apr 15, 2016
    • DB Tsai's avatar
      [SPARK-14549][ML] Copy the Vector and Matrix classes from mllib to ml in mllib-local · 96534aa4
      DB Tsai authored
      ## What changes were proposed in this pull request?
      
      This task will copy the Vector and Matrix classes from mllib to ml package in mllib-local jar. The UDTs and `since` annotation in ml vector and matrix will be removed for now. UDTs will be achieved by #SPARK-14487, and `since` will be replaced by /*  since 1.2.0 */

      The BLAS implementation will be copied, and some of the test utilities will be copied as well.
      
      Summary of changes:
      
      1. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/BLAS.scala
        - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/BLAS.scala
        - logDebug("gemm: alpha is equal to 0 and beta is equal to 1. Returning C.") is removed in ml version.
      2. In  mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/Matrices.scala
        - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Matrices.scala
        - `Since` was removed, and we'll use standard `/* Since */` Java doc. Will be in another PR.
        - `UDT` related code was removed, and will use `SPARK-13944` https://github.com/apache/spark/pull/12259  to replace the annotation.
      3. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/Vectors.scala
        - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Vectors.scala
        - `Since` was removed.
        - `UDT` related code was removed.
        - In `def parseNumeric`, it was throwing `throw new SparkException(s"Cannot parse $other.")`, and now it's throwing `throw new IllegalArgumentException(s"Cannot parse $other.")`
      4. In mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Vectors.scala
        - For consistency with ML version of vector, `def parseNumeric` is now throwing `throw new IllegalArgumentException(s"Cannot parse $other.")`
      5. mllib/src/main/scala/org/apache/spark/**mllib**/util/NumericParser.scala is moved to mllib-local/src/main/scala/org/apache/spark/**ml**/util/NumericParser.scala
        - All the `throw new SparkException` were replaced by `throw new IllegalArgumentException`
      
      ## How was this patch tested?
      
      unit tests
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #12317 from dbtsai/dbtsai-ml-vector.
      96534aa4
  22. Apr 14, 2016
  23. Apr 11, 2016
    • DB Tsai's avatar
      [SPARK-14462][ML][MLLIB] Add the mllib-local build to maven pom · efaf7d18
      DB Tsai authored
      ## What changes were proposed in this pull request?
      
      In order to separate the linear algebra and vector/matrix classes into a standalone jar, we need to set up the build first. This PR will create a new jar called mllib-local with minimal dependencies.
      
      The previous PR was failing the build because of `spark-core:test` dependency, and that was reverted. In this PR, `FunSuite` with `// scalastyle:ignore funsuite` in mllib-local test was used, similar to sketch.
      
      Thanks.
      
      ## How was this patch tested?
      
      Unit tests
      
      mengxr tedyu holdenk
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #12298 from dbtsai/dbtsai-mllib-local-build-fix.
      efaf7d18
  24. Apr 09, 2016
    • Xiangrui Meng's avatar
      415446cc
    • DB Tsai's avatar
      [SPARK-14462][ML][MLLIB] add the mllib-local build to maven pom · 1598d11b
      DB Tsai authored
      ## What changes were proposed in this pull request?
      
      In order to separate the linear algebra and vector/matrix classes into a standalone jar, we need to set up the build first. This PR will create a new jar called mllib-local with minimal dependencies. The test scope will still depend on spark-core and spark-core-test in order to use the common utilities, but the runtime will avoid any platform dependency. A couple of platform-independent classes will be moved to this package to demonstrate how this works.
      
      ## How was this patch tested?
      
      Unit tests
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #12241 from dbtsai/dbtsai-mllib-local-build.
      1598d11b