  1. Aug 05, 2014
  2. Jul 30, 2014
    • Avoid numerical instability · e3d85b7e
      Naftali Harris authored
      This avoids a computation that is effectively 1 - 1, which cancels to zero. For example:
      
      ```python
      >>> from math import exp
      >>> margin = -40
      >>> 1 - 1 / (1 + exp(margin))
      0.0
      >>> exp(margin) / (1 + exp(margin))
      4.248354255291589e-18
      ```
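
The same idea can be sketched as a pair of small helpers. This is only an illustration, not the code changed by the commit; the function names `naive_form` and `stable_form` are hypothetical.

```python
from math import exp


def naive_form(margin):
    # For very negative margins, 1 / (1 + exp(margin)) rounds to exactly 1.0,
    # so the subtraction cancels to 0.0.
    return 1 - 1 / (1 + exp(margin))


def stable_form(margin):
    # Algebraically the same value, written without the subtraction.
    # Note: exp(margin) can still overflow for very large positive margins
    # (roughly above 709); this sketch does not handle that case.
    e = exp(margin)
    return e / (1 + e)


print(naive_form(-40))
print(stable_form(-40))
```

With `margin = -40`, the first call reproduces the cancellation shown in the session above, while the second keeps a small positive value.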
      
      Author: Naftali Harris <naftaliharris@gmail.com>
      
      Closes #1652 from naftaliharris/patch-2 and squashes the following commits:
      
      0d55a9f [Naftali Harris] Avoid numerical instability
  3. Jul 20, 2014
  4. May 25, 2014
    • Fix PEP8 violations in Python mllib. · d33d3c61
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #871 from rxin/mllib-pep8 and squashes the following commits:
      
      848416f [Reynold Xin] Fixed a typo in the previous cleanup (c -> sc).
      a8db4cd [Reynold Xin] Fix PEP8 violations in Python mllib.
  5. May 05, 2014
    • [SPARK-1594][MLLIB] Cleaning up MLlib APIs and guide · 98750a74
      Xiangrui Meng authored
      Final pass before the v1.0 release.
      
      * Remove `VectorRDDs`
      * Move `BinaryClassificationMetrics` from `evaluation.binary` to `evaluation`
      * Change the default value of `addIntercept` to false and allow adding an intercept in Ridge and Lasso.
      * Clean `DecisionTree` package doc and test suite.
      * Mark model constructors `private[spark]`
      * Rename `loadLibSVMData` to `loadLibSVMFile` and hide `LabelParser` from users.
      * Add `saveAsLibSVMFile`.
      * Add `appendBias` to `MLUtils`.
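
To make the LIBSVM-related items concrete, here is a minimal pure-Python sketch of the text format (`label index:value ...`, with one-based indices) and of appending a bias term. It is not MLlib's implementation; `parse_libsvm_line`, `format_libsvm_line`, and `append_bias` are hypothetical names used only for illustration.

```python
def parse_libsvm_line(line):
    """Parse 'label index:value ...' into (label, {zero_based_index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        # LIBSVM indices are one-based; store them zero-based.
        features[int(index) - 1] = float(value)
    return label, features


def format_libsvm_line(label, features):
    """Inverse of parse_libsvm_line: one-based indices in ascending order."""
    pairs = " ".join("%d:%r" % (i + 1, v) for i, v in sorted(features.items()))
    return ("%r %s" % (label, pairs)).strip()


def append_bias(dense):
    """Return a copy of a dense feature list with a constant 1.0 appended."""
    return list(dense) + [1.0]


label, features = parse_libsvm_line("1.0 1:0.5 3:2.0")
line = format_libsvm_line(label, features)
```

Keeping the one-based to zero-based shift in a single place avoids off-by-one mistakes when the parsed features are later used as vector indices.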
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #524 from mengxr/mllib-cleaning and squashes the following commits:
      
      295dc8b [Xiangrui Meng] update loadLibSVMFile doc
      1977ac1 [Xiangrui Meng] fix doc of appendBias
      649fcf0 [Xiangrui Meng] rename loadLibSVMData to loadLibSVMFile; hide LabelParser from user APIs
      54b812c [Xiangrui Meng] add appendBias
      a71e7d0 [Xiangrui Meng] add saveAsLibSVMFile
      d976295 [Xiangrui Meng] Merge branch 'master' into mllib-cleaning
      b7e5cec [Xiangrui Meng] remove some experimental annotations and make model constructors private[mllib]
      9b02b93 [Xiangrui Meng] minor code style update
      a593ddc [Xiangrui Meng] fix python tests
      fc28c18 [Xiangrui Meng] mark more classes experimental
      f6cbbff [Xiangrui Meng] fix Java tests
      0af70b0 [Xiangrui Meng] minor
      6e139ef [Xiangrui Meng] Merge branch 'master' into mllib-cleaning
      94e6dce [Xiangrui Meng] move BinaryLabelCounter and BinaryConfusionMatrixImpl to evaluation.binary
      df34907 [Xiangrui Meng] clean DecisionTreeSuite to use LocalSparkContext
      c81807f [Xiangrui Meng] set the default value of AddIntercept to false
      03389c0 [Xiangrui Meng] allow to add intercept in Ridge and Lasso
      c66c56f [Xiangrui Meng] move tree md to package object doc
      a2695df [Xiangrui Meng] update guide for BinaryClassificationMetrics
      9194f4c [Xiangrui Meng] move BinaryClassificationMetrics one level up
      1c1a0e3 [Xiangrui Meng] remove VectorRDDs because it only contains one function that is not necessary for us to maintain
  6. Apr 22, 2014
    • fix bugs of dot in python · c919798f
      Xusen Yin authored
      If `self.theta` is not transposed with `transpose()`, a
      
      *ValueError: matrices are not aligned*
      
      is raised. The previous test case did not cover this situation.
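
A minimal pure-Python sketch of this kind of shape mismatch follows. It assumes, purely for illustration, that `theta` stores one row per class and one column per feature; the helpers `dot` and `transpose` below are hypothetical stand-ins, not NumPy's functions or the code in the commit.

```python
def dot(x, matrix):
    """Vector-matrix product with a shape check on the shared dimension."""
    if len(x) != len(matrix):
        raise ValueError("matrices are not aligned")
    cols = len(matrix[0])
    return [sum(x[i] * matrix[i][j] for i in range(len(x))) for j in range(cols)]


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(row) for row in zip(*matrix)]


# Hypothetical model: 2 classes, 3 features, one row of theta per class.
theta = [[0.1, 0.2, 0.3],
         [0.4, 0.5, 0.6]]
x = [1.0, 0.0, 2.0]

try:
    dot(x, theta)  # len(x) is 3 but theta has only 2 rows
    aligned = True
except ValueError:
    aligned = False

scores = dot(x, transpose(theta))  # one score per class
```

Under this assumed layout, a test that happens to use as many classes as features would not trigger the error, which may be why an earlier test missed it.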
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #463 from yinxusen/python-naive-bayes and squashes the following commits:
      
      fcbe3bc [Xusen Yin] fix bugs of dot in python
  7. Apr 15, 2014
    • [WIP] SPARK-1430: Support sparse data in Python MLlib · 63ca581d
      Matei Zaharia authored
      This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type.
      
      On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models.
      
      Some to-do items left:
      - [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. We can easily add a function to convert these to our own SparseVector.
      - [x] MLlib currently uses a vector with one extra column on the left to represent what we call LabeledPoint in Scala. Do we really want this? It may get annoying once you deal with sparse data since you must add/subtract 1 to each feature index when training. We can remove this API in 1.0 and use tuples for labeling.
      - [x] Explain how to use these in the Python MLlib docs.
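
As a rough illustration of what a sparse vector buys, the sketch below stores only the nonzero entries and computes a dot product against dense weights without visiting the zeros. `TinySparseVector` is a hypothetical stand-in written for this example; it is not the `SparseVector` class added by the PR, and its constructor and methods are assumptions.

```python
class TinySparseVector(object):
    """Hypothetical sparse vector; not PySpark's SparseVector."""

    def __init__(self, size, entries):
        # entries: dict mapping zero-based index -> nonzero value
        self.size = size
        self.indices = sorted(entries)
        self.values = [float(entries[i]) for i in self.indices]

    def dot(self, dense):
        """Dot product with a dense list, touching only stored entries."""
        if len(dense) != self.size:
            raise ValueError("dimension mismatch")
        return sum(v * dense[i] for i, v in zip(self.indices, self.values))

    def to_dense(self):
        """Expand to a plain list with explicit zeros."""
        out = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            out[i] = v
        return out


v = TinySparseVector(4, {1: 1.0, 3: 5.5})
weights = [0.5, 2.0, 0.0, 2.0]
result = v.dot(weights)
```

The cost of `dot` here scales with the number of stored entries rather than with `size`, which is the property linear models need when features are mostly zero.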
      
      CC @mengxr, @joshrosen
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #341 from mateiz/py-ml-update and squashes the following commits:
      
      d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle review comments
      ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge
      b9f97a3 [Matei Zaharia] Fix test
      1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python
      88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parameters
      37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib API
      da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to discuss sparse data
      c48e85a [Matei Zaharia] Added some tests for passing lists as input, and added mllib/tests.py to run-tests script.
      a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights
      74eefe7 [Matei Zaharia] Added LabeledPoint class in Python
      889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms and models
      ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict
      a5d6426 [Matei Zaharia] Add linalg.py to run-tests script
      0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading LabeledPoints
      eaee759 [Matei Zaharia] Update regression, classification and clustering models for sparse data
      2abbb44 [Matei Zaharia] Further work to get linear models working with sparse data
      154f45d [Matei Zaharia] Update docs, name some magic values
      881fef7 [Matei Zaharia] Added a sparse vector in Python and made Java-Python format more compact
  8. Apr 02, 2014
    • [SPARK-1212, Part II] Support sparse data in MLlib · 9c65fa76
      Xiangrui Meng authored
      In PR https://github.com/apache/spark/pull/117, we added a dense/sparse vector data model and updated KMeans to support sparse input. This PR replaces all remaining `Array[Double]` usage with `Vector` in generalized linear models (GLMs) and Naive Bayes. Major changes:
      
      1. `LabeledPoint` becomes `LabeledPoint(Double, Vector)`.
      2. Methods that accept `RDD[Array[Double]]` now accept `RDD[Vector]`. We cannot support both in an elegant way because of type erasure.
      3. Mark `createModel` and `predictPoint` protected because they are not for end users.
      4. Add `libSVMFile` to `MLContext`.
      5. NaiveBayes can accept arbitrary labels (introducing a breaking change to Python's `NaiveBayesModel`).
      6. Gradient computation no longer creates temp vectors.
      7. Column normalization and centering are removed from Lasso and Ridge because the operation will densify the data. Simple feature transformation can be done before training.
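
The in-place gradient computation in item 6 relies on an axpy-style update (`y += a * x`). A minimal Python sketch of that operation is shown here; it is an illustration only, not the Scala code in this PR, and the names `axpy`, `weights`, `gradient`, and `step_size` are hypothetical.

```python
def axpy(a, x, y):
    """In-place y += a * x for two equal-length lists; returns None."""
    if len(x) != len(y):
        raise ValueError("dimension mismatch")
    for i, xi in enumerate(x):
        y[i] += a * xi


weights = [1.0, 2.0, 3.0]
gradient = [0.5, 0.0, -1.0]
step_size = 2.0  # hypothetical value

# Gradient step w <- w - step_size * gradient, without allocating a new list.
axpy(-step_size, gradient, weights)
```

Because `weights` is mutated directly, no temporary vector is created per update, which matters when the update runs once per example.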
      
      TODO:
      1. ~~Use axpy when possible.~~
      2. ~~Optimize Naive Bayes.~~
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #245 from mengxr/vector and squashes the following commits:
      
      eb6e793 [Xiangrui Meng] move libSVMFile to MLUtils and rename to loadLibSVMData
      c26c4fc [Xiangrui Meng] update DecisionTree to use RDD[Vector]
      11999c7 [Xiangrui Meng] Merge branch 'master' into vector
      f7da54b [Xiangrui Meng] add minSplits to libSVMFile
      da25e24 [Xiangrui Meng] revert the change to default addIntercept because it might change the behavior of existing code without warning
      493f26f [Xiangrui Meng] Merge branch 'master' into vector
      7c1bc01 [Xiangrui Meng] add a TODO to NB
      b9b7ef7 [Xiangrui Meng] change default value of addIntercept to false
      b01df54 [Xiangrui Meng] allow to change or clear threshold in LR and SVM
      4addc50 [Xiangrui Meng] merge master
      4ca5b1b [Xiangrui Meng] remove normalization from Lasso and update tests
      f04fe8a [Xiangrui Meng] remove normalization from RidgeRegression and update tests
      d088552 [Xiangrui Meng] use static constructor for MLContext
      6f59eed [Xiangrui Meng] update libSVMFile to determine number of features automatically
      3432e84 [Xiangrui Meng] update NaiveBayes to support sparse data
      0f8759b [Xiangrui Meng] minor updates to NB
      b11659c [Xiangrui Meng] style update
      78c4671 [Xiangrui Meng] add libSVMFile to MLContext
      f0fe616 [Xiangrui Meng] add a test for sparse linear regression
      44733e1 [Xiangrui Meng] use in-place gradient computation
      e981396 [Xiangrui Meng] use axpy in Updater
      db808a1 [Xiangrui Meng] update JavaLR example
      befa592 [Xiangrui Meng] passed scala/java tests
      75c83a4 [Xiangrui Meng] passed test compile
      1859701 [Xiangrui Meng] passed compile
      834ada2 [Xiangrui Meng] optimized MLUtils.computeStats; update some ml algorithms to use Vector (cont.)
      135ab72 [Xiangrui Meng] merge glm
      0e57aa4 [Xiangrui Meng] update Lasso and RidgeRegression to parse the weights correctly from GLM; mark createModel protected; mark predictPoint protected
      d7f629f [Xiangrui Meng] fix a bug in GLM when intercept is not used
      3f346ba [Xiangrui Meng] update some ml algorithms to use Vector
  9. Jan 12, 2014
    • Update some Python MLlib parameters to use camelCase, and tweak docs · 4c28a2ba
      Matei Zaharia authored
      We've used camel case in other Spark methods, so it felt reasonable to
      keep using it here and make the code match Scala/Java as much as
      possible. Note that parameter names matter in Python because the
      language allows passing optional parameters by name.
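
A short sketch of why the names matter: callers that pass optional parameters by keyword are tied to the exact spelling. The `train` function and its parameters below are hypothetical and exist only to illustrate the point.

```python
def train(data, iterations=100, stepSize=1.0, miniBatchFraction=1.0):
    """Hypothetical trainer that only echoes its settings."""
    return {"n": len(data), "iterations": iterations,
            "stepSize": stepSize, "miniBatchFraction": miniBatchFraction}


# Passing an optional parameter by its camelCase name works.
settings = train([1, 2, 3], stepSize=0.1)

# The snake_case spelling is a different name and is rejected.
try:
    train([1, 2, 3], step_size=0.1)
    rejected = False
except TypeError:
    rejected = True
```

Renaming a parameter is therefore an API change for any caller that uses keywords, even if the positional order stays the same.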
    • Add Naive Bayes to Python MLlib, and some API fixes · 9a0dfdf8
      Matei Zaharia authored
      - Added a Python wrapper for Naive Bayes
      - Updated the Scala Naive Bayes to match the style of our other
        algorithms better and in particular make it easier to call from Java
        (added builder pattern, removed default value in train method)
      - Updated Python MLlib functions to not require a SparkContext; we can
        get that from the RDD the user gives
      - Added a toString method in LabeledPoint
      - Made the Python MLlib tests run as part of run-tests as well (previously
        they could only be run individually, one file at a time)
  10. Dec 24, 2013