  1. Jun 30, 2015
    • [SPARK-4127] [MLLIB] [PYSPARK] Python bindings for StreamingLinearRegressionWithSGD · 45281664
      MechCoder authored
      Python bindings for StreamingLinearRegressionWithSGD
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #6744 from MechCoder/spark-4127 and squashes the following commits:
      
      d8f6457 [MechCoder] Moved StreamingLinearAlgorithm to pyspark.mllib.regression
      d47cc24 [MechCoder] Inherit from StreamingLinearAlgorithm
      1b4ddd6 [MechCoder] minor
      4de6c68 [MechCoder] Minor refactor
      5e85a3b [MechCoder] Add tests for simultaneous training and prediction
      fb27889 [MechCoder] Add example and docs
      505380b [MechCoder] Add tests
      d42bdae [MechCoder] [SPARK-4127] Python bindings for StreamingLinearRegressionWithSGD
      45281664
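The streaming regression work above rests on one idea: keep a single model and refine it with an SGD pass over each arriving mini-batch, so training and prediction can interleave. A minimal pure-Python sketch of that update loop (illustrative only, not the PySpark implementation):

```python
# Conceptual sketch: one weight vector, refreshed by an SGD step per
# example in each incoming mini-batch (squared loss, fixed step size).
def sgd_step(weights, batch, step=0.1):
    w = list(weights)
    for features, label in batch:
        pred = sum(wi * xi for wi, xi in zip(w, features))
        err = pred - label
        w = [wi - step * err * xi for wi, xi in zip(w, features)]
    return w

# Two mini-batches drawn from y = 2*x arrive one after the other.
model = [0.0]
for batch in [[([1.0], 2.0), ([2.0], 4.0)], [([3.0], 6.0), ([1.0], 2.0)]]:
    model = sgd_step(model, batch)

print(model[0])  # approaches the true slope 2.0
```

In the actual PySpark class, `trainOn` consumes a DStream of LabeledPoints and `predictOn` can run concurrently against the same continuously updated model.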
  2. Jun 25, 2015
    • [MINOR] [MLLIB] rename some functions of PythonMLLibAPI · 2519dcc3
      Yanbo Liang authored
      Keep the same naming conventions for PythonMLLibAPI.
      Only the following three functions are different from the others:
      ```scala
      trainNaiveBayes
      trainGaussianMixture
      trainWord2Vec
      ```
      So change them to
      ```scala
      trainNaiveBayesModel
      trainGaussianMixtureModel
      trainWord2VecModel
      ```
      This does not affect users or public APIs; it only makes the code easier to understand for developers.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #7011 from yanboliang/py-mllib-api-rename and squashes the following commits:
      
      771ffec [Yanbo Liang] rename some functions of PythonMLLibAPI
      2519dcc3
  3. Jun 24, 2015
    • [SPARK-7633] [MLLIB] [PYSPARK] Python bindings for StreamingLogisticRegressionwithSGD · fb32c388
      MechCoder authored
      Add Python bindings to StreamingLogisticRegressionwithSGD.
      
      No Java wrappers are needed as models are updated directly using train.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #6849 from MechCoder/spark-3258 and squashes the following commits:
      
      b4376a5 [MechCoder] minor
      d7e5fc1 [MechCoder] Refactor into StreamingLinearAlgorithm Better docs
      9c09d4e [MechCoder] [SPARK-7633] Python bindings for StreamingLogisticRegressionwithSGD
      fb32c388
  4. Jun 22, 2015
  5. Jun 16, 2015
    • [SPARK-7916] [MLLIB] MLlib Python doc parity check for classification and regression · ca998757
      Yanbo Liang authored
      Check and update the MLlib Python classification and regression docs to be as complete as the Scala docs.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #6460 from yanboliang/spark-7916 and squashes the following commits:
      
      f8deda4 [Yanbo Liang] trigger jenkins
      6dc4d99 [Yanbo Liang] address comments
      ce2a43e [Yanbo Liang] truncate too long line and remove extra sparse
      3eaf6ad [Yanbo Liang] MLlib Python doc parity check for classification and regression
      ca998757
  6. Apr 21, 2015
    • [SPARK-6953] [PySpark] speed up python tests · 3134c3fe
      Reynold Xin authored
      This PR tries to speed up some Python tests:
      
      ```
      tests.py                       144s -> 103s      -41s
      mllib/classification.py         24s -> 17s        -7s
      mllib/regression.py             27s -> 15s       -12s
      mllib/tree.py                   27s -> 13s       -14s
      mllib/tests.py                  64s -> 31s       -33s
      streaming/tests.py             185s -> 84s      -101s
      ```
      Counting Python 3 runs, the total saving will be about 558s (almost 10 minutes), since core and streaming run three times and mllib runs twice.
      
      During testing, the time spent on each test file is shown:
      ```
      Run core tests ...
      Running test: pyspark/rdd.py ... ok (22s)
      Running test: pyspark/context.py ... ok (16s)
      Running test: pyspark/conf.py ... ok (4s)
      Running test: pyspark/broadcast.py ... ok (4s)
      Running test: pyspark/accumulators.py ... ok (4s)
      Running test: pyspark/serializers.py ... ok (6s)
      Running test: pyspark/profiler.py ... ok (5s)
      Running test: pyspark/shuffle.py ... ok (1s)
      Running test: pyspark/tests.py ... ok (103s)   144s
      ```
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5605 from rxin/python-tests-speed and squashes the following commits:
      
      d08542d [Reynold Xin] Merge pull request #14 from mengxr/SPARK-6953
      89321ee [Xiangrui Meng] fix seed in tests
      3ad2387 [Reynold Xin] Merge pull request #5427 from davies/python_tests
      3134c3fe
  7. Apr 16, 2015
    • [SPARK-4897] [PySpark] Python 3 support · 04e44b37
      Davies Liu authored
      This PR updates PySpark to support Python 3 (tested with 3.4).
      
      Known issue: unpickling arrays from Pyrolite is broken in Python 3, so those tests are skipped.
      
      TODO: ec2/spark-ec2.py is not fully tested with Python 3.
      
      Author: Davies Liu <davies@databricks.com>
      Author: twneale <twneale@gmail.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5173 from davies/python3 and squashes the following commits:
      
      d7d6323 [Davies Liu] fix tests
      6c52a98 [Davies Liu] fix mllib test
      99e334f [Davies Liu] update timeout
      b716610 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      cafd5ec [Davies Liu] adddress comments from @mengxr
      bf225d7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      179fc8d [Davies Liu] tuning flaky tests
      8c8b957 [Davies Liu] fix ResourceWarning in Python 3
      5c57c95 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      4006829 [Davies Liu] fix test
      2fc0066 [Davies Liu] add python3 path
      71535e9 [Davies Liu] fix xrange and divide
      5a55ab4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      125f12c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ed498c8 [Davies Liu] fix compatibility with python 3
      820e649 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      e8ce8c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ad7c374 [Davies Liu] fix mllib test and warning
      ef1fc2f [Davies Liu] fix tests
      4eee14a [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      20112ff [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      59bb492 [Davies Liu] fix tests
      1da268c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ca0fdd3 [Davies Liu] fix code style
      9563a15 [Davies Liu] add imap back for python 2
      0b1ec04 [Davies Liu] make python examples work with Python 3
      d2fd566 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      a716d34 [Davies Liu] test with python 3.4
      f1700e8 [Davies Liu] fix test in python3
      671b1db [Davies Liu] fix test in python3
      692ff47 [Davies Liu] fix flaky test
      7b9699f [Davies Liu] invalidate import cache for Python 3.3+
      9c58497 [Davies Liu] fix kill worker
      309bfbf [Davies Liu] keep compatibility
      5707476 [Davies Liu] cleanup, fix hash of string in 3.3+
      8662d5b [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      f53e1f0 [Davies Liu] fix tests
      70b6b73 [Davies Liu] compile ec2/spark_ec2.py in python 3
      a39167e [Davies Liu] support customize class in __main__
      814c77b [Davies Liu] run unittests with python 3
      7f4476e [Davies Liu] mllib tests passed
      d737924 [Davies Liu] pass ml tests
      375ea17 [Davies Liu] SQL tests pass
      6cc42a9 [Davies Liu] rename
      431a8de [Davies Liu] streaming tests pass
      78901a7 [Davies Liu] fix hash of serializer in Python 3
      24b2f2e [Davies Liu] pass all RDD tests
      35f48fe [Davies Liu] run future again
      1eebac2 [Davies Liu] fix conflict in ec2/spark_ec2.py
      6e3c21d [Davies Liu] make cloudpickle work with Python3
      2fb2db3 [Josh Rosen] Guard more changes behind sys.version; still doesn't run
      1aa5e8f [twneale] Turned out `pickle.DictionaryType is dict` == True, so swapped it out
      7354371 [twneale] buffer --> memoryview  I'm not super sure if this a valid change, but the 2.7 docs recommend using memoryview over buffer where possible, so hoping it'll work.
      b69ccdf [twneale] Uses the pure python pickle._Pickler instead of c-extension _pickle.Pickler. It appears pyspark 2.7 uses the pure python pickler as well, so this shouldn't degrade pickling performance (?).
      f40d925 [twneale] xrange --> range
      e104215 [twneale] Replaces 2.7 types.InstsanceType with 3.4 `object`....could be horribly wrong depending on how types.InstanceType is used elsewhere in the package--see http://bugs.python.org/issue8206
      79de9d0 [twneale] Replaces python2.7 `file` with 3.4 _io.TextIOWrapper
      2adb42d [Josh Rosen] Fix up some import differences between Python 2 and 3
      854be27 [Josh Rosen] Run `futurize` on Python code:
      7c5b4ce [Josh Rosen] Remove Python 3 check in shell.py.
      04e44b37
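Much of the diff above is mechanical renaming between Python 2 and 3 (`xrange` to `range`, moved imports, and so on). A sketch of the usual compatibility-shim pattern, with illustrative names: alias the changed names once, near the imports, so the rest of the module targets a single API.

```python
import sys

# Alias version-specific names once; everything below is version-neutral.
if sys.version_info[0] >= 3:
    xrange = range  # Python 3 removed xrange; range is now lazy
else:  # pragma: no cover -- Python 2 only
    from itertools import imap as map  # noqa: F401

total = sum(xrange(5))  # runs unchanged under both interpreters
print(total)
```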
  8. Mar 31, 2015
    • [SPARK-6255] [MLLIB] Support multiclass classification in Python API · b5bd75d9
      Yanbo Liang authored
      Python API parity check for classification and multiclass classification support; the major disparities that need to be added for Python:
      ```scala
      LogisticRegressionWithLBFGS
          setNumClasses
          setValidateData
      LogisticRegressionModel
          getThreshold
          numClasses
          numFeatures
      SVMWithSGD
          setValidateData
      SVMModel
          getThreshold
      ```
      For users, the greatest benefit of this PR is that multiclass classification is now supported by the Python API:
      users can train a multiclass classification model and use it for prediction in PySpark.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #5137 from yanboliang/spark-6255 and squashes the following commits:
      
      0bd531e [Yanbo Liang] address comments
      444d5e2 [Yanbo Liang] LogisticRegressionModel.predict() optimization
      fc7990b [Yanbo Liang] address comments
      b0d9c63 [Yanbo Liang] Support Mulinomial LR model predict in Python API
      ded847c [Yanbo Liang] Python API parity check for classification (support multiclass classification)
      b5bd75d9
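At prediction time, a multinomial logistic model reduces to "one weight vector per non-base class; pick the class with the largest margin." A hypothetical sketch of that rule (names are illustrative, not the MLlib internals):

```python
def predict_multiclass(weights, features):
    """weights: one weight vector per class 1..k-1; class 0 is the
    implicit base class with margin 0. Returns the argmax class."""
    margins = [0.0] + [
        sum(w * x for w, x in zip(wvec, features)) for wvec in weights
    ]
    return max(range(len(margins)), key=margins.__getitem__)

# Three classes, two features.
weights = [[2.0, -1.0],   # class 1
           [-1.0, 2.0]]   # class 2
print(predict_multiclass(weights, [1.0, 0.0]))  # -> 1
print(predict_multiclass(weights, [0.0, 1.0]))  # -> 2
```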
  9. Mar 20, 2015
    • [Spark 6096][MLlib] Add Naive Bayes load save methods in Python · 25636d98
      Xusen Yin authored
      See [SPARK-6096](https://issues.apache.org/jira/browse/SPARK-6096).
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #5090 from yinxusen/SPARK-6096 and squashes the following commits:
      
      bd0fea5 [Xusen Yin] fix style problem, etc.
      3fd41f2 [Xusen Yin] use hanging indent in Python style
      e83803d [Xusen Yin] fix Python style
      d6dbde5 [Xusen Yin] fix python call java error
      a054bb3 [Xusen Yin] add save load for NaiveBayes python
      25636d98
    • [SPARK-6095] [MLLIB] Support model save/load in Python's linear models · 48866f78
      Yanbo Liang authored
      For Python's linear models, the weights and intercept are stored in Python.
      This PR implements save/load functions for Python's linear models that do the same thing as the Scala ones.
      It also enables model import/export across languages.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #5016 from yanboliang/spark-6095 and squashes the following commits:
      
      d9bb824 [Yanbo Liang] fix python style
      b3813ca [Yanbo Liang] linear model save/load for Python reuse the Scala implementation
      48866f78
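A toy illustration of why linear-model save/load is cheap to make cross-language: the entire model state is just a weight vector plus an intercept, so persisting those two values is enough to rebuild the model anywhere. (JSON and the class name here are only for the sketch; Spark uses its own on-disk format.)

```python
import json
import os
import tempfile

class LinearModelSketch:
    """Illustrative stand-in for a linear model: state = weights + intercept."""
    def __init__(self, weights, intercept):
        self.weights, self.intercept = weights, intercept

    def save(self, path):
        with open(path, "w") as f:
            json.dump({"weights": self.weights, "intercept": self.intercept}, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            d = json.load(f)
        return cls(d["weights"], d["intercept"])

path = os.path.join(tempfile.mkdtemp(), "model.json")
LinearModelSketch([0.5, -1.25], 0.1).save(path)
restored = LinearModelSketch.load(path)
print(restored.weights, restored.intercept)
```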
  10. Mar 02, 2015
    • [SPARK-6080] [PySpark] correct LogisticRegressionWithLBFGS regType parameter for pyspark · af2effdd
      Yanbo Liang authored
      Currently, LogisticRegressionWithLBFGS in python/pyspark/mllib/classification.py invokes callMLlibFunc with a wrong "regType" parameter:
      it is assigned "str(regType)", which translates None (Python) to "None" (Java/Scala). The right way is to translate None (Python) to null (Java/Scala), just as LogisticRegressionWithSGD does.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #4831 from yanboliang/pyspark_classification and squashes the following commits:
      
      12db65a [Yanbo Liang] correct LogisticRegressionWithLBFGS regType parameter for pyspark
      af2effdd
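The pitfall is easy to reproduce in plain Python: wrapping an optional argument in `str()` turns `None` into the literal string `"None"`, which the JVM side no longer recognizes as "no regularization", whereas passing `None` through unchanged maps to `null` across the Py4J boundary.

```python
regType = None  # caller asked for no regularization

broken = str(regType)  # "None" -- looks like a (nonexistent) regularizer name
fixed = regType        # stays None, becomes null on the Java/Scala side

print(repr(broken), repr(fixed))
```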
  11. Dec 17, 2014
  12. Nov 18, 2014
    • [SPARK-4306] [MLlib] Python API for LogisticRegressionWithLBFGS · d2e29516
      Davies Liu authored
      ```
      class LogisticRegressionWithLBFGS
       |  train(cls, data, iterations=100, initialWeights=None, corrections=10, tolerance=0.0001, regParam=0.01, intercept=False)
       |      Train a logistic regression model on the given data.
       |
       |      :param data:           The training data, an RDD of LabeledPoint.
       |      :param iterations:     The number of iterations (default: 100).
       |      :param initialWeights: The initial weights (default: None).
       |      :param regParam:       The regularizer parameter (default: 0.01).
       |      :param regType:        The type of regularizer used for training
       |                             our model.
       |                             :Allowed values:
       |                               - "l1" for using L1 regularization
       |                               - "l2" for using L2 regularization
       |                               - None for no regularization
       |                               (default: "l2")
       |      :param intercept:      Boolean parameter which indicates the use
       |                             or not of the augmented representation for
       |                             training data (i.e. whether bias features
       |                             are activated or not).
       |      :param corrections:    The number of corrections used in the LBFGS update (default: 10).
       |      :param tolerance:      The convergence tolerance of iterations for L-BFGS (default: 1e-4).
       |
       |      >>> data = [
       |      ...     LabeledPoint(0.0, [0.0, 1.0]),
       |      ...     LabeledPoint(1.0, [1.0, 0.0]),
       |      ... ]
       |      >>> lrm = LogisticRegressionWithLBFGS.train(sc.parallelize(data))
       |      >>> lrm.predict([1.0, 0.0])
       |      1
       |      >>> lrm.predict([0.0, 1.0])
       |      0
       |      >>> lrm.predict(sc.parallelize([[1.0, 0.0], [0.0, 1.0]])).collect()
       |      [1, 0]
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3307 from davies/lbfgs and squashes the following commits:
      
      34bd986 [Davies Liu] Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/spark into lbfgs
      5a945a6 [Davies Liu] address comments
      941061b [Davies Liu] Merge branch 'master' of github.com:apache/spark into lbfgs
      03e5543 [Davies Liu] add it to docs
      ed2f9a8 [Davies Liu] add regType
      76cd1b6 [Davies Liu] reorder arguments
      4429a74 [Davies Liu] Update classification.py
      9252783 [Davies Liu] python api for LogisticRegressionWithLBFGS
      d2e29516
    • [SPARK-4435] [MLlib] [PySpark] improve classification · 8fbf72b7
      Davies Liu authored
      This PR adds setThreshold() and clearThreshold() to LogisticRegressionModel and SVMModel, and also supports RDDs of vectors in LogisticRegressionModel.predict(), SVMModel.predict(), and NaiveBayes.predict().
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3305 from davies/setThreshold and squashes the following commits:
      
      d0b835f [Davies Liu] Merge branch 'master' of github.com:apache/spark into setThreshold
      e4acd76 [Davies Liu] address comments
      2231a5f [Davies Liu] bugfix
      7bd9009 [Davies Liu] address comments
      0b0a8a7 [Davies Liu] address comments
      c1e5573 [Davies Liu] improve classification
      8fbf72b7
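A sketch of the threshold behavior being added (illustrative class, not the MLlib code): with a threshold set, `predict()` returns a hard 0/1 label; after `clearThreshold()`, it returns the raw probability instead.

```python
import math

class LogisticModelSketch:
    """Toy logistic model demonstrating setThreshold/clearThreshold."""
    def __init__(self, weights, threshold=0.5):
        self.weights, self.threshold = weights, threshold

    def setThreshold(self, value):
        self.threshold = value

    def clearThreshold(self):
        self.threshold = None  # predict() now returns raw probabilities

    def predict(self, x):
        margin = sum(w * xi for w, xi in zip(self.weights, x))
        prob = 1.0 / (1.0 + math.exp(-margin))
        if self.threshold is None:
            return prob
        return 1 if prob > self.threshold else 0

m = LogisticModelSketch([2.0])
print(m.predict([1.0]))   # hard 0/1 label
m.clearThreshold()
print(m.predict([1.0]))   # raw probability, roughly 0.88
```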
  13. Nov 13, 2014
    • [SPARK-4372][MLLIB] Make LR and SVM's default parameters consistent in Scala and Python · 32218307
      Xiangrui Meng authored
      The current default regParam is 1.0, and regType is claimed to be none in Python (but it is actually L2), while regParam = 0.0 and regType is L2 in Scala. We should make the default values consistent. This PR sets the default regType to L2 and regParam to 0.01. Note that the default regParam value in LIBLINEAR (and hence scikit-learn) is 1.0; however, we use average loss instead of total loss in our formulation, so regParam = 1.0 is definitely too heavy.
      
      In LinearRegression, we set regParam=0.0 and regType=None, because we have separate classes for Lasso and Ridge, both of which use regParam=0.01 as the default.
      
      davies atalwalkar
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3232 from mengxr/SPARK-4372 and squashes the following commits:
      
      9979837 [Xiangrui Meng] update Ridge/Lasso to use default regParam 0.01 cast input arguments
      d3ba096 [Xiangrui Meng] change 'none' back to None
      1909a6e [Xiangrui Meng] change default regParam to 0.01 and regType to L2 in LR and SVM
      32218307
  14. Nov 11, 2014
    • [SPARK-4324] [PySpark] [MLlib] support numpy.array for all MLlib API · 65083e93
      Davies Liu authored
      This PR checks all of the existing Python MLlib APIs to make sure that numpy.array is supported as a Vector (and also RDDs of numpy.array).
      
      It also improves some docstrings and doctests.
      
      cc mateiz mengxr
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3189 from davies/numpy and squashes the following commits:
      
      d5057c4 [Davies Liu] fix tests
      6987611 [Davies Liu] support numpy.array for all MLlib API
      65083e93
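The convention being enforced can be sketched as follows: normalize any "vector-like" input (list, tuple, numpy.array, array.array, ...) to one canonical representation at the API boundary, so downstream code only ever sees one type. (The helper below is illustrative; MLlib's real converter produces its own Vector types.)

```python
import array

def as_vector(v):
    """Accept any iterable of numbers; numpy.array and array.array
    expose tolist(), plain sequences are converted element-wise."""
    if hasattr(v, "tolist"):
        return [float(x) for x in v.tolist()]
    return [float(x) for x in v]

print(as_vector([1, 2]), as_vector((3.0,)), as_vector(array.array("d", [4.0])))
```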
  15. Oct 31, 2014
    • [SPARK-4124] [MLlib] [PySpark] simplify serialization in MLlib Python API · 872fc669
      Davies Liu authored
      Create several helper functions for calling the MLlib Java API that convert the arguments to Java types and convert return values back to Python objects automatically; this greatly simplifies serialization in the MLlib Python API.
      
      After this, the MLlib Python API does not need to deal with serialization details anymore, which makes it easier to add new APIs.
      
      cc mengxr
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2995 from davies/cleanup and squashes the following commits:
      
      8fa6ec6 [Davies Liu] address comments
      16b85a0 [Davies Liu] Merge branch 'master' of github.com:apache/spark into cleanup
      43743e5 [Davies Liu] bugfix
      731331f [Davies Liu] simplify serialization in MLlib Python API
      872fc669
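A pure-Python analogue of the helper pattern described above: one generic dispatcher applies argument conversion on the way in and result conversion on the way out, so each public API function reduces to a single call. Here `pickle` and the `BACKEND` table are toy stand-ins for the real Python-to-JVM serializers and Java methods.

```python
import pickle

def _to_backend(obj):
    return pickle.dumps(obj)       # stand-in for Python -> JVM conversion

def _from_backend(blob):
    return pickle.loads(blob)      # stand-in for JVM -> Python conversion

# Toy "JVM": receives serialized args, returns a serialized result.
BACKEND = {
    "trainLinearModel": lambda blob: _to_backend(
        [sum(_from_backend(blob))]  # pretend "training": sum the features
    ),
}

def call_backend(name, *args):
    """Generic caller: every API function is one call_backend() line."""
    result = BACKEND[name](*[_to_backend(a) for a in args])
    return _from_backend(result)

print(call_backend("trainLinearModel", [1.0, 2.0, 3.0]))
```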
  16. Oct 16, 2014
    • [SPARK-3971] [MLLib] [PySpark] hotfix: Customized pickler should work in cluster mode · 091d32c5
      Davies Liu authored
      A customized pickler should be registered before unpickling, but in the executor there is no way to register the picklers before running the tasks.
      
      So we need to register the picklers in the tasks themselves: duplicate javaToPython() and pythonToJava() in MLlib, and call SerDe.initialize() before pickling or unpickling.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2830 from davies/fix_pickle and squashes the following commits:
      
      0c85fb9 [Davies Liu] revert the privacy change
      6b94e15 [Davies Liu] use JavaConverters instead of JavaConversions
      0f02050 [Davies Liu] hotfix: Customized pickler does not work in cluster
      091d32c5
  17. Oct 07, 2014
  18. Oct 06, 2014
    • [SPARK-3773][PySpark][Doc] Sphinx build warning · 2300eb58
      cocoatomo authored
      When building the Sphinx documents for PySpark, we get 12 warnings.
      Almost all of them are caused by docstrings in broken ReST format.
      
      To reproduce this issue, run the following commands at commit 6e27cb63:
      
      ```bash
      $ cd ./python/docs
      $ make clean html
      ...
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/__init__.py:docstring of pyspark.SparkContext.sequenceFile:4: ERROR: Unexpected indentation.
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/__init__.py:docstring of pyspark.RDD.saveAsSequenceFile:4: ERROR: Unexpected indentation.
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.LogisticRegressionWithSGD.train:14: ERROR: Unexpected indentation.
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.LogisticRegressionWithSGD.train:16: WARNING: Definition list ends without a blank line; unexpected unindent.
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.LogisticRegressionWithSGD.train:17: WARNING: Block quote ends without a blank line; unexpected unindent.
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.SVMWithSGD.train:14: ERROR: Unexpected indentation.
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.SVMWithSGD.train:16: WARNING: Definition list ends without a blank line; unexpected unindent.
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.SVMWithSGD.train:17: WARNING: Block quote ends without a blank line; unexpected unindent.
      /Users/<user>/MyRepos/Scala/spark/python/docs/pyspark.mllib.rst:50: WARNING: missing attribute mentioned in :members: or __all__: module pyspark.mllib.regression, attribute RidgeRegressionModelLinearRegressionWithSGD
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.DecisionTreeModel.predict:3: ERROR: Unexpected indentation.
      ...
      checking consistency... /Users/<user>/MyRepos/Scala/spark/python/docs/modules.rst:: WARNING: document isn't included in any toctree
      ...
      copying static files... WARNING: html_static_path entry u'/Users/<user>/MyRepos/Scala/spark/python/docs/_static' does not exist
      ...
      build succeeded, 12 warnings.
      ```
      
      Author: cocoatomo <cocoatomo77@gmail.com>
      
      Closes #2653 from cocoatomo/issues/3773-sphinx-build-warnings and squashes the following commits:
      
      6f65661 [cocoatomo] [SPARK-3773][PySpark][Doc] Sphinx build warning
      2300eb58
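Most of the warnings above ("Unexpected indentation", "Definition list ends without a blank line") come from missing blank lines around ReST blocks inside docstrings. A well-formed docstring separates a field list and a bullet list from surrounding text with blank lines; the illustrative function below is modeled on the `train()` docstrings flagged in the log:

```python
def train(data, iterations=100):
    """Train a model on the given data.

    :param data: The training data, an RDD of LabeledPoint.
    :param iterations: The number of iterations
                       (default: 100).

    Allowed values for the regularizer:

    - "l1" for using L1 regularization
    - "l2" for using L2 regularization
    """
    return None

# The blank lines are what keep Sphinx from emitting the warnings above.
print("\n\n" in train.__doc__)
```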
  19. Sep 19, 2014
    • [SPARK-3491] [MLlib] [PySpark] use pickle to serialize data in MLlib · fce5e251
      Davies Liu authored
      Currently, we serialize the data between the JVM and Python case by case, manually; this cannot scale to support the many APIs in MLlib.
      
      This patch addresses the problem by serializing the data with the pickle protocol, using the Pyrolite library to serialize/deserialize on the JVM side. The pickle protocol can be easily extended to support customized classes.
      
      All the modules are refactored to use this protocol.
      
      Known issue: there will be some performance regression (both CPU and memory; the serialized data is larger).
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2378 from davies/pickle_mllib and squashes the following commits:
      
      dffbba2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into pickle_mllib
      810f97f [Davies Liu] fix equal of matrix
      032cd62 [Davies Liu] add more type check and conversion for user_product
      bd738ab [Davies Liu] address comments
      e431377 [Davies Liu] fix cache of rdd, refactor
      19d0967 [Davies Liu] refactor Picklers
      2511e76 [Davies Liu] cleanup
      1fccf1a [Davies Liu] address comments
      a2cc855 [Davies Liu] fix tests
      9ceff73 [Davies Liu] test size of serialized Rating
      44e0551 [Davies Liu] fix cache
      a379a81 [Davies Liu] fix pickle array in python2.7
      df625c7 [Davies Liu] Merge commit '154d141' into pickle_mllib
      154d141 [Davies Liu] fix autobatchedpickler
      44736d7 [Davies Liu] speed up pickling array in Python 2.7
      e1d1bfc [Davies Liu] refactor
      708dc02 [Davies Liu] fix tests
      9dcfb63 [Davies Liu] fix style
      88034f0 [Davies Liu] rafactor, address comments
      46a501e [Davies Liu] choose batch size automatically
      df19464 [Davies Liu] memorize the module and class name during pickleing
      f3506c5 [Davies Liu] Merge branch 'master' into pickle_mllib
      722dd96 [Davies Liu] cleanup _common.py
      0ee1525 [Davies Liu] remove outdated tests
      b02e34f [Davies Liu] remove _common.py
      84c721d [Davies Liu] Merge branch 'master' into pickle_mllib
      4d7963e [Davies Liu] remove muanlly serialization
      6d26b03 [Davies Liu] fix tests
      c383544 [Davies Liu] classification
      f2a0856 [Davies Liu] mllib/regression
      d9f691f [Davies Liu] mllib/util
      cccb8b1 [Davies Liu] mllib/tree
      8fe166a [Davies Liu] Merge branch 'pickle' into pickle_mllib
      aa2287e [Davies Liu] random
      f1544c4 [Davies Liu] refactor clustering
      52d1350 [Davies Liu] use new protocol in mllib/stat
      b30ef35 [Davies Liu] use pickle to serialize data for mllib/recommendation
      f44f771 [Davies Liu] enable tests about array
      3908f5c [Davies Liu] Merge branch 'master' into pickle
      c77c87b [Davies Liu] cleanup debugging code
      60e4e2f [Davies Liu] support unpickle array.array for Python 2.6
      fce5e251
  20. Sep 03, 2014
    • [SPARK-3309] [PySpark] Put all public API in __all__ · 6481d274
      Davies Liu authored
      Put all public APIs in __all__, and also put them all in pyspark/__init__.py; then we can get all the documentation for the public API via `pydoc pyspark`. It can also be used by other programs (such as Sphinx or Epydoc) to generate documents only for the public APIs.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2205 from davies/public and squashes the following commits:
      
      c6c5567 [Davies Liu] fix message
      f7b35be [Davies Liu] put SchemeRDD, Row in pyspark.sql module
      7e3016a [Davies Liu] add __all__ in mllib
      6281b48 [Davies Liu] fix doc for SchemaRDD
      6caab21 [Davies Liu] add public interfaces into pyspark.__init__.py
      6481d274
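What `__all__` buys you, in miniature: tools like pydoc and Sphinx (and `from module import *`) treat `__all__` as the module's public surface, hiding helpers that are not listed. A small sketch using a synthetic module:

```python
import types

# Mimic a module that declares its public surface explicitly.
mod = types.ModuleType("sketch")
mod.LogisticRegressionModel = type("LogisticRegressionModel", (), {})
mod._internal_helper = lambda: None
mod.__all__ = ["LogisticRegressionModel"]

# Doc tools and star-imports consult __all__, not every module attribute:
public = list(mod.__all__)
print(public)
```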
  21. Aug 06, 2014
    • [SPARK-2627] [PySpark] have the build enforce PEP 8 automatically · d614967b
      Nicholas Chammas authored
      As described in [SPARK-2627](https://issues.apache.org/jira/browse/SPARK-2627), we'd like Python code to automatically be checked for PEP 8 compliance by Jenkins. This pull request aims to do that.
      
      Notes:
      * We may need to install [`pep8`](https://pypi.python.org/pypi/pep8) on the build server.
      * I'm expecting tests to fail now that PEP 8 compliance is being checked as part of the build. I'm fine with cleaning up any remaining PEP 8 violations as part of this pull request.
      * I did not understand why the RAT and scalastyle reports are saved to text files. I did the same for the PEP 8 check, but only so that the console output style can match those for the RAT and scalastyle checks. The PEP 8 report is removed right after the check is complete.
      * Updates to the ["Contributing to Spark"](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark) guide will be submitted elsewhere, as I don't believe that text is part of the Spark repo.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      Author: nchammas <nicholas.chammas@gmail.com>
      
      Closes #1744 from nchammas/master and squashes the following commits:
      
      274b238 [Nicholas Chammas] [SPARK-2627] [PySpark] minor indentation changes
      983d963 [nchammas] Merge pull request #5 from apache/master
      1db5314 [nchammas] Merge pull request #4 from apache/master
      0e0245f [Nicholas Chammas] [SPARK-2627] undo erroneous whitespace fixes
      bf30942 [Nicholas Chammas] [SPARK-2627] PEP8: comment spacing
      6db9a44 [nchammas] Merge pull request #3 from apache/master
      7b4750e [Nicholas Chammas] merge upstream changes
      91b7584 [Nicholas Chammas] [SPARK-2627] undo unnecessary line breaks
      44e3e56 [Nicholas Chammas] [SPARK-2627] use tox.ini to exclude files
      b09fae2 [Nicholas Chammas] don't wrap comments unnecessarily
      bfb9f9f [Nicholas Chammas] [SPARK-2627] keep up with the PEP 8 fixes
      9da347f [nchammas] Merge pull request #2 from apache/master
      aa5b4b5 [Nicholas Chammas] [SPARK-2627] follow Spark bash style for if blocks
      d0a83b9 [Nicholas Chammas] [SPARK-2627] check that pep8 downloaded fine
      dffb5dd [Nicholas Chammas] [SPARK-2627] download pep8 at runtime
      a1ce7ae [Nicholas Chammas] [SPARK-2627] space out test report sections
      21da538 [Nicholas Chammas] [SPARK-2627] it's PEP 8, not PEP8
      6f4900b [Nicholas Chammas] [SPARK-2627] more misc PEP 8 fixes
      fe57ed0 [Nicholas Chammas] removing merge conflict backups
      9c01d4c [nchammas] Merge pull request #1 from apache/master
      9a66cb0 [Nicholas Chammas] resolving merge conflicts
      a31ccc4 [Nicholas Chammas] [SPARK-2627] miscellaneous PEP 8 fixes
      beaa9ac [Nicholas Chammas] [SPARK-2627] fail check on non-zero status
      723ed39 [Nicholas Chammas] always delete the report file
      0541ebb [Nicholas Chammas] [SPARK-2627] call Python linter from run-tests
      12440fa [Nicholas Chammas] [SPARK-2627] add Scala linter
      61c07b9 [Nicholas Chammas] [SPARK-2627] add Python linter
      75ad552 [Nicholas Chammas] make check output style consistent
      d614967b
  22. Aug 05, 2014
  23. Jul 30, 2014
    • Avoid numerical instability · e3d85b7e
      Naftali Harris authored
      This avoids essentially computing 1 - 1. For example:
      
      ```python
      >>> from math import exp
      >>> margin = -40
      >>> 1 - 1 / (1 + exp(margin))
      0.0
      >>> exp(margin) / (1 + exp(margin))
      4.248354255291589e-18
      >>>
      ```
      
      Author: Naftali Harris <naftaliharris@gmail.com>
      
      Closes #1652 from naftaliharris/patch-2 and squashes the following commits:
      
      0d55a9f [Naftali Harris] Avoid numerical instability
      e3d85b7e
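The same trick, packaged as a numerically stable sigmoid: branch on the sign of the margin so the exponential is always of a non-positive number, and never compute 1 minus something that rounds to 1.

```python
from math import exp

def sigmoid(margin):
    if margin >= 0:
        return 1.0 / (1.0 + exp(-margin))
    # For very negative margins the form above underflows through 1 - 1;
    # this algebraically equal form preserves the tiny value.
    return exp(margin) / (1.0 + exp(margin))

print(sigmoid(-40))             # ~4.25e-18 instead of 0.0
print(1 - 1 / (1 + exp(-40)))   # the naive form collapses to 0.0
```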
  24. Jul 20, 2014
  25. May 25, 2014
    • Fix PEP8 violations in Python mllib. · d33d3c61
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #871 from rxin/mllib-pep8 and squashes the following commits:
      
      848416f [Reynold Xin] Fixed a typo in the previous cleanup (c -> sc).
      a8db4cd [Reynold Xin] Fix PEP8 violations in Python mllib.
      d33d3c61
  26. May 05, 2014
    • [SPARK-1594][MLLIB] Cleaning up MLlib APIs and guide · 98750a74
      Xiangrui Meng authored
      Final pass before the v1.0 release.
      
      * Remove `VectorRDDs`
      * Move `BinaryClassificationMetrics` from `evaluation.binary` to `evaluation`
      * Change default value of `addIntercept` to false and allow to add intercept in Ridge and Lasso.
      * Clean `DecisionTree` package doc and test suite.
      * Mark model constructors `private[spark]`
      * Rename `loadLibSVMData` to `loadLibSVMFile` and hide `LabelParser` from users.
      * Add `saveAsLibSVMFile`.
      * Add `appendBias` to `MLUtils`.
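      The LIBSVM text format that `loadLibSVMFile`/`saveAsLibSVMFile` handle is one labeled sparse vector per line, `<label> <index>:<value> ...` with 1-based indices. A toy Python parser sketching the format (not MLlib's implementation):

      ```python
      def parse_libsvm_line(line):
          # Format: "<label> <index1>:<value1> <index2>:<value2> ..."
          parts = line.strip().split()
          label = float(parts[0])
          indices, values = [], []
          for item in parts[1:]:
              idx, val = item.split(":")
              indices.append(int(idx) - 1)  # LIBSVM indices are 1-based
              values.append(float(val))
          return label, indices, values
      ```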
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #524 from mengxr/mllib-cleaning and squashes the following commits:
      
      295dc8b [Xiangrui Meng] update loadLibSVMFile doc
      1977ac1 [Xiangrui Meng] fix doc of appendBias
      649fcf0 [Xiangrui Meng] rename loadLibSVMData to loadLibSVMFile; hide LabelParser from user APIs
      54b812c [Xiangrui Meng] add appendBias
      a71e7d0 [Xiangrui Meng] add saveAsLibSVMFile
      d976295 [Xiangrui Meng] Merge branch 'master' into mllib-cleaning
      b7e5cec [Xiangrui Meng] remove some experimental annotations and make model constructors private[mllib]
      9b02b93 [Xiangrui Meng] minor code style update
      a593ddc [Xiangrui Meng] fix python tests
      fc28c18 [Xiangrui Meng] mark more classes experimental
      f6cbbff [Xiangrui Meng] fix Java tests
      0af70b0 [Xiangrui Meng] minor
      6e139ef [Xiangrui Meng] Merge branch 'master' into mllib-cleaning
      94e6dce [Xiangrui Meng] move BinaryLabelCounter and BinaryConfusionMatrixImpl to evaluation.binary
      df34907 [Xiangrui Meng] clean DecisionTreeSuite to use LocalSparkContext
      c81807f [Xiangrui Meng] set the default value of AddIntercept to false
      03389c0 [Xiangrui Meng] allow to add intercept in Ridge and Lasso
      c66c56f [Xiangrui Meng] move tree md to package object doc
      a2695df [Xiangrui Meng] update guide for BinaryClassificationMetrics
      9194f4c [Xiangrui Meng] move BinaryClassificationMetrics one level up
      1c1a0e3 [Xiangrui Meng] remove VectorRDDs because it only contains one function that is not necessary for us to maintain
      98750a74
  27. Apr 22, 2014
    • Xusen Yin's avatar
      fix bugs of dot in python · c919798f
      Xusen Yin authored
      If there is no `transpose()` on `self.theta`, a
      
      *ValueError: matrices are not aligned*
      
      occurs. The former test case simply ignored this situation.
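      The alignment issue: with `theta` of shape (classes, features) and `x` of length features, `x · theta` is misaligned and only `x · theta.transpose()` works. A pure-Python sketch of the shape check (toy helpers, not NumPy or the PR's code):

      ```python
      def dot(x, M):
          # Row-vector times matrix: requires len(x) rows in M.
          if len(M) != len(x):
              raise ValueError("matrices are not aligned")
          k = len(M[0])
          return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(k)]

      def transpose(M):
          return [list(row) for row in zip(*M)]
      ```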
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #463 from yinxusen/python-naive-bayes and squashes the following commits:
      
      fcbe3bc [Xusen Yin] fix bugs of dot in python
      c919798f
  28. Apr 15, 2014
    • Matei Zaharia's avatar
      [WIP] SPARK-1430: Support sparse data in Python MLlib · 63ca581d
      Matei Zaharia authored
      This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type.
      
      On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models.
      
      Some to-do items left:
      - [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. We can easily add a function to convert these to our own SparseVector.
      - [x] MLlib currently uses a vector with one extra column on the left to represent what we call LabeledPoint in Scala. Do we really want this? It may get annoying once you deal with sparse data since you must add/subtract 1 to each feature index when training. We can remove this API in 1.0 and use tuples for labeling.
      - [x] Explain how to use these in the Python MLlib docs.
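      The idea of the sparse representation (including dict initialization, per commit ab244d1) can be sketched with a toy class; this is a stand-in for illustration, not PySpark's actual `SparseVector`:

      ```python
      class SparseVector:
          def __init__(self, size, entries):
              # entries: dict mapping feature index -> value.
              self.size = size
              self.indices = sorted(entries)
              self.values = [entries[i] for i in self.indices]

          def dot(self, dense):
              # Dot product against a dense list of length `size`,
              # touching only the stored nonzero entries.
              return sum(v * dense[i] for i, v in zip(self.indices, self.values))
      ```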
      
      CC @mengxr, @joshrosen
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #341 from mateiz/py-ml-update and squashes the following commits:
      
      d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle review comments
      ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge
      b9f97a3 [Matei Zaharia] Fix test
      1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python
      88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parameters
      37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib API
      da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to discuss sparse data
      c48e85a [Matei Zaharia] Added some tests for passing lists as input, and added mllib/tests.py to run-tests script.
      a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights
      74eefe7 [Matei Zaharia] Added LabeledPoint class in Python
      889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms and models
      ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict
      a5d6426 [Matei Zaharia] Add linalg.py to run-tests script
      0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading LabeledPoints
      eaee759 [Matei Zaharia] Update regression, classification and clustering models for sparse data
      2abbb44 [Matei Zaharia] Further work to get linear models working with sparse data
      154f45d [Matei Zaharia] Update docs, name some magic values
      881fef7 [Matei Zaharia] Added a sparse vector in Python and made Java-Python format more compact
      63ca581d
  29. Apr 02, 2014
    • Xiangrui Meng's avatar
      [SPARK-1212, Part II] Support sparse data in MLlib · 9c65fa76
      Xiangrui Meng authored
      In PR https://github.com/apache/spark/pull/117, we added dense/sparse vector data model and updated KMeans to support sparse input. This PR is to replace all other `Array[Double]` usage by `Vector` in generalized linear models (GLMs) and Naive Bayes. Major changes:
      
      1. `LabeledPoint` becomes `LabeledPoint(Double, Vector)`.
      2. Methods that accept `RDD[Array[Double]]` now accept `RDD[Vector]`. We cannot support both in an elegant way because of type erasure.
      3. Mark 'createModel' and 'predictPoint' protected because they are not for end users.
      4. Add libSVMFile to MLContext.
      5. NaiveBayes can accept arbitrary labels (introducing a breaking change to Python's `NaiveBayesModel`).
      6. Gradient computation no longer creates temp vectors.
      7. Column normalization and centering are removed from Lasso and Ridge because the operation will densify the data. Simple feature transformation can be done before training.
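      The in-place gradient computation of point 6 is an axpy-style update, `y += a * x`, which mutates the accumulator instead of allocating temporary vectors. A minimal sketch (toy code, not MLlib's `Updater`):

      ```python
      def axpy(a, x, y):
          # y += a * x, updating y in place so no temp vector is created.
          for i in range(len(y)):
              y[i] += a * x[i]
          return y
      ```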
      
      TODO:
      1. ~~Use axpy when possible.~~
      2. ~~Optimize Naive Bayes.~~
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #245 from mengxr/vector and squashes the following commits:
      
      eb6e793 [Xiangrui Meng] move libSVMFile to MLUtils and rename to loadLibSVMData
      c26c4fc [Xiangrui Meng] update DecisionTree to use RDD[Vector]
      11999c7 [Xiangrui Meng] Merge branch 'master' into vector
      f7da54b [Xiangrui Meng] add minSplits to libSVMFile
      da25e24 [Xiangrui Meng] revert the change to default addIntercept because it might change the behavior of existing code without warning
      493f26f [Xiangrui Meng] Merge branch 'master' into vector
      7c1bc01 [Xiangrui Meng] add a TODO to NB
      b9b7ef7 [Xiangrui Meng] change default value of addIntercept to false
      b01df54 [Xiangrui Meng] allow to change or clear threshold in LR and SVM
      4addc50 [Xiangrui Meng] merge master
      4ca5b1b [Xiangrui Meng] remove normalization from Lasso and update tests
      f04fe8a [Xiangrui Meng] remove normalization from RidgeRegression and update tests
      d088552 [Xiangrui Meng] use static constructor for MLContext
      6f59eed [Xiangrui Meng] update libSVMFile to determine number of features automatically
      3432e84 [Xiangrui Meng] update NaiveBayes to support sparse data
      0f8759b [Xiangrui Meng] minor updates to NB
      b11659c [Xiangrui Meng] style update
      78c4671 [Xiangrui Meng] add libSVMFile to MLContext
      f0fe616 [Xiangrui Meng] add a test for sparse linear regression
      44733e1 [Xiangrui Meng] use in-place gradient computation
      e981396 [Xiangrui Meng] use axpy in Updater
      db808a1 [Xiangrui Meng] update JavaLR example
      befa592 [Xiangrui Meng] passed scala/java tests
      75c83a4 [Xiangrui Meng] passed test compile
      1859701 [Xiangrui Meng] passed compile
      834ada2 [Xiangrui Meng] optimized MLUtils.computeStats update some ml algorithms to use Vector (cont.)
      135ab72 [Xiangrui Meng] merge glm
      0e57aa4 [Xiangrui Meng] update Lasso and RidgeRegression to parse the weights correctly from GLM mark createModel protected mark predictPoint protected
      d7f629f [Xiangrui Meng] fix a bug in GLM when intercept is not used
      3f346ba [Xiangrui Meng] update some ml algorithms to use Vector
      9c65fa76
  30. Jan 12, 2014
    • Matei Zaharia's avatar
      Update some Python MLlib parameters to use camelCase, and tweak docs · 4c28a2ba
      Matei Zaharia authored
      We've used camel case in other Spark methods so it felt reasonable to
      keep using it here and make the code match Scala/Java as much as
      possible. Note that parameter names matter in Python because callers can
      pass optional parameters by name.
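      Why the names are part of the API: Python callers can pass optional parameters by keyword, so renaming one breaks existing call sites. An illustrative signature (hypothetical, not the actual MLlib one):

      ```python
      def train(data, iterations=100, stepSize=1.0, miniBatchFraction=1.0):
          # A caller writing train(data, stepSize=0.5) depends on the exact
          # name "stepSize"; renaming it to step_size would break that call.
          return {"iterations": iterations, "stepSize": stepSize,
                  "miniBatchFraction": miniBatchFraction}
      ```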
      4c28a2ba
    • Matei Zaharia's avatar
      Add Naive Bayes to Python MLlib, and some API fixes · 9a0dfdf8
      Matei Zaharia authored
      - Added a Python wrapper for Naive Bayes
      - Updated the Scala Naive Bayes to match the style of our other
        algorithms better and in particular make it easier to call from Java
        (added builder pattern, removed default value in train method)
      - Updated Python MLlib functions to not require a SparkContext; we can
        get that from the RDD the user gives
      - Added a toString method in LabeledPoint
      - Made the Python MLlib tests run as part of run-tests as well (before
        they could only be run individually through each file)
      9a0dfdf8
  31. Dec 24, 2013