Skip to content
Snippets Groups Projects
  1. Feb 03, 2015
    • FlytxtRnD's avatar
      [SPARK-5012][MLLib][PySpark]Python API for Gaussian Mixture Model · 50a1a874
      FlytxtRnD authored
      Python API for the Gaussian Mixture Model clustering algorithm in MLLib.
      
      Author: FlytxtRnD <meethu.mathew@flytxt.com>
      
      Closes #4059 from FlytxtRnD/PythonGmmWrapper and squashes the following commits:
      
      c973ab3 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
      339b09c [FlytxtRnD] Added MultivariateGaussian namedtuple  and Arraybuffer in trainGaussianMixture
      fa0a142 [FlytxtRnD] New line added
      d5b36ab [FlytxtRnD] Changed argument names to lowercase
      ac134f1 [FlytxtRnD] Merge branch 'PythonGmmWrapper' of https://github.com/FlytxtRnD/spark into PythonGmmWrapper
      6671ea1 [FlytxtRnD] Added mllib/stat/distribution.py
      3aee84b [FlytxtRnD] Fixed style issues
      2e9f12a [FlytxtRnD] Added mllib/stat/distribution.py and fixed style issues
      b22532c [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
      2e14d82 [FlytxtRnD] Incorporate MultivariateGaussian instances in GaussianMixtureModel
      05767c7 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
      3464d19 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
      c1d4c71 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'origin/PythonGmmWrapper' into PythonGmmWrapper
      426d130 [FlytxtRnD] Added random seed parameter
      332bad1 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
      f82750b [FlytxtRnD] Fixed style issues
      5c83825 [FlytxtRnD] Split input file with space delimiter
      fda60f3 [FlytxtRnD] Python API for Gaussian Mixture Model
      50a1a874
  2. Jan 21, 2015
    • nate.crosswhite's avatar
      [SPARK-4749] [mllib]: Allow initializing KMeans clusters using a seed · 7450a992
      nate.crosswhite authored
      This implements the functionality for SPARK-4749 and provides units tests in Scala and PySpark
      
      Author: nate.crosswhite <nate.crosswhite@stresearch.com>
      Author: nxwhite-str <nxwhite-str@users.noreply.github.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3610 from nxwhite-str/master and squashes the following commits:
      
      a2ebbd3 [nxwhite-str] Merge pull request #1 from mengxr/SPARK-4749-kmeans-seed
      7668124 [Xiangrui Meng] minor updates
      f8d5928 [nate.crosswhite] Addressing PR issues
      277d367 [nate.crosswhite] Merge remote-tracking branch 'upstream/master'
      9156a57 [nate.crosswhite] Merge remote-tracking branch 'upstream/master'
      5d087b4 [nate.crosswhite] Adding KMeans train with seed and Scala unit test
      616d111 [nate.crosswhite] Merge remote-tracking branch 'upstream/master'
      35c1884 [nate.crosswhite] Add kmeans initial seed to pyspark API
      7450a992
  3. Nov 21, 2014
    • Davies Liu's avatar
      [SPARK-4531] [MLlib] cache serialized java object · ce95bd8e
      Davies Liu authored
      The Pyrolite is pretty slow (comparing to the adhoc serializer in 1.1), it cause much performance regression in 1.2, because we cache the serialized Python object in JVM, deserialize them into Java object in each step.
      
      This PR change to cache the deserialized JavaRDD instead of PythonRDD to avoid the deserialization of Pyrolite. It should have similar memory usage as before, but much faster.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3397 from davies/cache and squashes the following commits:
      
      7f6e6ce [Davies Liu] Update -> Updater
      4b52edd [Davies Liu] using named argument
      63b984e [Davies Liu] fix
      7da0332 [Davies Liu] add unpersist()
      dff33e1 [Davies Liu] address comments
      c2bdfc2 [Davies Liu] refactor
      d572f00 [Davies Liu] Merge branch 'master' into cache
      f1063e1 [Davies Liu] cache serialized java object
      ce95bd8e
  4. Oct 31, 2014
    • Davies Liu's avatar
      [SPARK-4124] [MLlib] [PySpark] simplify serialization in MLlib Python API · 872fc669
      Davies Liu authored
      Create several helper functions to call MLlib Java API, convert the arguments to Java type and convert return value to Python object automatically, this simplify serialization in MLlib Python API very much.
      
      After this, the MLlib Python API does not need to deal with serialization details anymore, it's easier to add new API.
      
      cc mengxr
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2995 from davies/cleanup and squashes the following commits:
      
      8fa6ec6 [Davies Liu] address comments
      16b85a0 [Davies Liu] Merge branch 'master' of github.com:apache/spark into cleanup
      43743e5 [Davies Liu] bugfix
      731331f [Davies Liu] simplify serialization in MLlib Python API
      872fc669
  5. Oct 16, 2014
    • Davies Liu's avatar
      [SPARK-3971] [MLLib] [PySpark] hotfix: Customized pickler should work in cluster mode · 091d32c5
      Davies Liu authored
      Customized pickler should be registered before unpickling, but in executor, there is no way to register the picklers before run the tasks.
      
      So, we need to register the picklers in the tasks itself, duplicate the javaToPython() and pythonToJava() in MLlib, call SerDe.initialize() before pickling or unpickling.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2830 from davies/fix_pickle and squashes the following commits:
      
      0c85fb9 [Davies Liu] revert the privacy change
      6b94e15 [Davies Liu] use JavaConverters instead of JavaConversions
      0f02050 [Davies Liu] hotfix: Customized pickler does not work in cluster
      091d32c5
  6. Sep 19, 2014
    • Davies Liu's avatar
      [SPARK-3491] [MLlib] [PySpark] use pickle to serialize data in MLlib · fce5e251
      Davies Liu authored
      Currently, we serialize the data between JVM and Python case by case manually, this cannot scale to support so many APIs in MLlib.
      
      This patch will try to address this problem by serialize the data using pickle protocol, using Pyrolite library to serialize/deserialize in JVM. Pickle protocol can be easily extended to support customized class.
      
      All the modules are refactored to use this protocol.
      
      Known issues: There will be some performance regression (both CPU and memory, the serialized data increased)
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2378 from davies/pickle_mllib and squashes the following commits:
      
      dffbba2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into pickle_mllib
      810f97f [Davies Liu] fix equal of matrix
      032cd62 [Davies Liu] add more type check and conversion for user_product
      bd738ab [Davies Liu] address comments
      e431377 [Davies Liu] fix cache of rdd, refactor
      19d0967 [Davies Liu] refactor Picklers
      2511e76 [Davies Liu] cleanup
      1fccf1a [Davies Liu] address comments
      a2cc855 [Davies Liu] fix tests
      9ceff73 [Davies Liu] test size of serialized Rating
      44e0551 [Davies Liu] fix cache
      a379a81 [Davies Liu] fix pickle array in python2.7
      df625c7 [Davies Liu] Merge commit '154d141' into pickle_mllib
      154d141 [Davies Liu] fix autobatchedpickler
      44736d7 [Davies Liu] speed up pickling array in Python 2.7
      e1d1bfc [Davies Liu] refactor
      708dc02 [Davies Liu] fix tests
      9dcfb63 [Davies Liu] fix style
      88034f0 [Davies Liu] rafactor, address comments
      46a501e [Davies Liu] choose batch size automatically
      df19464 [Davies Liu] memorize the module and class name during pickleing
      f3506c5 [Davies Liu] Merge branch 'master' into pickle_mllib
      722dd96 [Davies Liu] cleanup _common.py
      0ee1525 [Davies Liu] remove outdated tests
      b02e34f [Davies Liu] remove _common.py
      84c721d [Davies Liu] Merge branch 'master' into pickle_mllib
      4d7963e [Davies Liu] remove muanlly serialization
      6d26b03 [Davies Liu] fix tests
      c383544 [Davies Liu] classification
      f2a0856 [Davies Liu] mllib/regression
      d9f691f [Davies Liu] mllib/util
      cccb8b1 [Davies Liu] mllib/tree
      8fe166a [Davies Liu] Merge branch 'pickle' into pickle_mllib
      aa2287e [Davies Liu] random
      f1544c4 [Davies Liu] refactor clustering
      52d1350 [Davies Liu] use new protocol in mllib/stat
      b30ef35 [Davies Liu] use pickle to serialize data for mllib/recommendation
      f44f771 [Davies Liu] enable tests about array
      3908f5c [Davies Liu] Merge branch 'master' into pickle
      c77c87b [Davies Liu] cleanup debugging code
      60e4e2f [Davies Liu] support unpickle array.array for Python 2.6
      fce5e251
  7. Sep 03, 2014
    • Davies Liu's avatar
      [SPARK-3309] [PySpark] Put all public API in __all__ · 6481d274
      Davies Liu authored
      Put all public API in __all__, also put them all in pyspark.__init__.py, then we can got all the documents for public API by `pydoc pyspark`. It also can be used by other programs (such as Sphinx or Epydoc) to generate only documents for public APIs.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2205 from davies/public and squashes the following commits:
      
      c6c5567 [Davies Liu] fix message
      f7b35be [Davies Liu] put SchemeRDD, Row in pyspark.sql module
      7e3016a [Davies Liu] add __all__ in mllib
      6281b48 [Davies Liu] fix doc for SchemaRDD
      6caab21 [Davies Liu] add public interfaces into pyspark.__init__.py
      6481d274
  8. Aug 06, 2014
    • Nicholas Chammas's avatar
      [SPARK-2627] [PySpark] have the build enforce PEP 8 automatically · d614967b
      Nicholas Chammas authored
      As described in [SPARK-2627](https://issues.apache.org/jira/browse/SPARK-2627), we'd like Python code to automatically be checked for PEP 8 compliance by Jenkins. This pull request aims to do that.
      
      Notes:
      * We may need to install [`pep8`](https://pypi.python.org/pypi/pep8) on the build server.
      * I'm expecting tests to fail now that PEP 8 compliance is being checked as part of the build. I'm fine with cleaning up any remaining PEP 8 violations as part of this pull request.
      * I did not understand why the RAT and scalastyle reports are saved to text files. I did the same for the PEP 8 check, but only so that the console output style can match those for the RAT and scalastyle checks. The PEP 8 report is removed right after the check is complete.
      * Updates to the ["Contributing to Spark"](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark) guide will be submitted elsewhere, as I don't believe that text is part of the Spark repo.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      Author: nchammas <nicholas.chammas@gmail.com>
      
      Closes #1744 from nchammas/master and squashes the following commits:
      
      274b238 [Nicholas Chammas] [SPARK-2627] [PySpark] minor indentation changes
      983d963 [nchammas] Merge pull request #5 from apache/master
      1db5314 [nchammas] Merge pull request #4 from apache/master
      0e0245f [Nicholas Chammas] [SPARK-2627] undo erroneous whitespace fixes
      bf30942 [Nicholas Chammas] [SPARK-2627] PEP8: comment spacing
      6db9a44 [nchammas] Merge pull request #3 from apache/master
      7b4750e [Nicholas Chammas] merge upstream changes
      91b7584 [Nicholas Chammas] [SPARK-2627] undo unnecessary line breaks
      44e3e56 [Nicholas Chammas] [SPARK-2627] use tox.ini to exclude files
      b09fae2 [Nicholas Chammas] don't wrap comments unnecessarily
      bfb9f9f [Nicholas Chammas] [SPARK-2627] keep up with the PEP 8 fixes
      9da347f [nchammas] Merge pull request #2 from apache/master
      aa5b4b5 [Nicholas Chammas] [SPARK-2627] follow Spark bash style for if blocks
      d0a83b9 [Nicholas Chammas] [SPARK-2627] check that pep8 downloaded fine
      dffb5dd [Nicholas Chammas] [SPARK-2627] download pep8 at runtime
      a1ce7ae [Nicholas Chammas] [SPARK-2627] space out test report sections
      21da538 [Nicholas Chammas] [SPARK-2627] it's PEP 8, not PEP8
      6f4900b [Nicholas Chammas] [SPARK-2627] more misc PEP 8 fixes
      fe57ed0 [Nicholas Chammas] removing merge conflict backups
      9c01d4c [nchammas] Merge pull request #1 from apache/master
      9a66cb0 [Nicholas Chammas] resolving merge conflicts
      a31ccc4 [Nicholas Chammas] [SPARK-2627] miscellaneous PEP 8 fixes
      beaa9ac [Nicholas Chammas] [SPARK-2627] fail check on non-zero status
      723ed39 [Nicholas Chammas] always delete the report file
      0541ebb [Nicholas Chammas] [SPARK-2627] call Python linter from run-tests
      12440fa [Nicholas Chammas] [SPARK-2627] add Scala linter
      61c07b9 [Nicholas Chammas] [SPARK-2627] add Python linter
      75ad552 [Nicholas Chammas] make check output style consistent
      d614967b
  9. May 25, 2014
    • Reynold Xin's avatar
      Fix PEP8 violations in Python mllib. · d33d3c61
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #871 from rxin/mllib-pep8 and squashes the following commits:
      
      848416f [Reynold Xin] Fixed a typo in the previous cleanup (c -> sc).
      a8db4cd [Reynold Xin] Fix PEP8 violations in Python mllib.
      d33d3c61
  10. Apr 15, 2014
    • Matei Zaharia's avatar
      [WIP] SPARK-1430: Support sparse data in Python MLlib · 63ca581d
      Matei Zaharia authored
      This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type.
      
      On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models.
      
      Some to-do items left:
      - [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. We can easily add a function to convert these to our own SparseVector.
      - [x] MLlib currently uses a vector with one extra column on the left to represent what we call LabeledPoint in Scala. Do we really want this? It may get annoying once you deal with sparse data since you must add/subtract 1 to each feature index when training. We can remove this API in 1.0 and use tuples for labeling.
      - [x] Explain how to use these in the Python MLlib docs.
      
      CC @mengxr, @joshrosen
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #341 from mateiz/py-ml-update and squashes the following commits:
      
      d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle review comments
      ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge
      b9f97a3 [Matei Zaharia] Fix test
      1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python
      88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parametrs
      37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib API
      da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to discuss sparse data
      c48e85a [Matei Zaharia] Added some tests for passing lists as input, and added mllib/tests.py to run-tests script.
      a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights
      74eefe7 [Matei Zaharia] Added LabeledPoint class in Python
      889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms and models
      ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict
      a5d6426 [Matei Zaharia] Add linalg.py to run-tests script
      0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading LabeledPoints
      eaee759 [Matei Zaharia] Update regression, classification and clustering models for sparse data
      2abbb44 [Matei Zaharia] Further work to get linear models working with sparse data
      154f45d [Matei Zaharia] Update docs, name some magic values
      881fef7 [Matei Zaharia] Added a sparse vector in Python and made Java-Python format more compact
      63ca581d
  11. Jan 12, 2014
    • Matei Zaharia's avatar
      Add Naive Bayes to Python MLlib, and some API fixes · 9a0dfdf8
      Matei Zaharia authored
      - Added a Python wrapper for Naive Bayes
      - Updated the Scala Naive Bayes to match the style of our other
        algorithms better and in particular make it easier to call from Java
        (added builder pattern, removed default value in train method)
      - Updated Python MLlib functions to not require a SparkContext; we can
        get that from the RDD the user gives
      - Added a toString method in LabeledPoint
      - Made the Python MLlib tests run as part of run-tests as well (before
        they could only be run individually through each file)
      9a0dfdf8
  12. Dec 24, 2013
Loading