  1. Aug 18, 2014
    • [SPARK-2850] [SPARK-2626] [mllib] MLlib stats examples + small fixes · c8b16ca0
      Joseph K. Bradley authored
      Added examples for statistical summarization:
      * Scala: StatisticalSummary.scala
      ** Tests: correlation, MultivariateOnlineSummarizer
      * python: statistical_summary.py
      ** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
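      A minimal PySpark sketch of the correlation call the Python example exercises; the data values are illustrative, and `Statistics.corr` is assumed to be the `pyspark.mllib.stat` entry point as in released 1.x versions:

      ```python
      from pyspark import SparkContext
      from pyspark.mllib.stat import Statistics

      sc = SparkContext(appName="CorrelationsSketch")

      seriesX = sc.parallelize([1.0, 2.0, 3.0, 4.0])
      seriesY = sc.parallelize([2.0, 4.1, 6.2, 7.9])

      # Pearson correlation between two RDDs of doubles (the default method).
      print(Statistics.corr(seriesX, seriesY))

      sc.stop()  # each example in this commit now stops its SparkContext
      ```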
      
      Added examples for random and sampled RDDs:
      * Scala: RandomAndSampledRDDs.scala
      * python: random_and_sampled_rdds.py
      * Both test:
      ** RandomRDDGenerators.normalRDD, normalVectorRDD
      ** RDD.sample, takeSample, sampleByKey
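      A rough PySpark sketch of the sampling methods these examples cover; the data, fractions, and seeds are illustrative, and `RandomRDDs` is the class name used in released Spark versions (the Scala side was still called `RandomRDDGenerators` when this commit landed):

      ```python
      from pyspark import SparkContext
      from pyspark.mllib.random import RandomRDDs  # RandomRDDGenerators in pre-release code

      sc = SparkContext(appName="RandomAndSampledRDDsSketch")

      # 1000 i.i.d. draws from N(0, 1).
      normals = RandomRDDs.normalRDD(sc, 1000, seed=42)

      # Approximate 10% sample without replacement, and an exact 5-element sample.
      subset = normals.sample(False, 0.1)
      five = normals.takeSample(False, 5)

      # Stratified sampling on a pair RDD: one fraction per key.
      pairs = normals.map(lambda x: (x > 0, x))
      strata = pairs.sampleByKey(False, {True: 0.2, False: 0.05})

      print(subset.count(), len(five), strata.countByKey())
      sc.stop()
      ```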
      
      Added sc.stop() to all examples.
      
      CorrelationSuite.scala
      * Added 1 test for RDDs with only 1 value
      
      RowMatrix.scala
      * numCols(): Added check for numRows = 0, with error message.
      * computeCovariance(): Added check for numRows <= 1, with error message.
      
      Python SparseVector (pyspark/mllib/linalg.py)
      * Added toDense() function
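      A tiny, hypothetical usage sketch of the new `toDense()` helper; the vector values are made up, and the exact return type (a NumPy array versus a dense vector class) may differ across Spark versions:

      ```python
      from pyspark.mllib.linalg import SparseVector

      sv = SparseVector(4, {1: 1.0, 3: 5.5})  # size 4, nonzeros at indices 1 and 3
      dense = sv.toDense()                    # materializes all four entries, zeros included
      print(dense)
      ```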
      
      python/run-tests script
      * Added stat.py (doc test)
      
CC: mengxr dorx. The main changes were examples showing usage across the APIs.
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #1878 from jkbradley/mllib-stats-api-check and squashes the following commits:
      
      ea5c047 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
      dafebe2 [Joseph K. Bradley] Bug fixes for examples SampledRDDs.scala and sampled_rdds.py: Check for division by 0 and for missing key in maps.
      8d1e555 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
      60c72d9 [Joseph K. Bradley] Fixed stat.py doc test to work for Python versions printing nan or NaN.
      b20d90a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
      4e5d15e [Joseph K. Bradley] Changed pyspark/mllib/stat.py doc tests to use NaN instead of nan.
      32173b7 [Joseph K. Bradley] Stats examples update.
      c8c20dc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
      cf70b07 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
      0b7cec3 [Joseph K. Bradley] Small updates based on code review.  Renamed statistical_summary.py to correlations.py
      ab48f6e [Joseph K. Bradley] RowMatrix.scala * numCols(): Added check for numRows = 0, with error message. * computeCovariance(): Added check for numRows <= 1, with error message.
      65e4ebc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
      8195c78 [Joseph K. Bradley] Added examples for random and sampled RDDs: * Scala: RandomAndSampledRDDs.scala * python: random_and_sampled_rdds.py * Both test: ** RandomRDDGenerators.normalRDD, normalVectorRDD ** RDD.sample, takeSample, sampleByKey
      064985b [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
      ee918e9 [Joseph K. Bradley] Added examples for statistical summarization: * Scala: StatisticalSummary.scala ** Tests: correlation, MultivariateOnlineSummarizer * python: statistical_summary.py ** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
  2. Aug 06, 2014
    • [SPARK-2627] [PySpark] have the build enforce PEP 8 automatically · d614967b
      Nicholas Chammas authored
      As described in [SPARK-2627](https://issues.apache.org/jira/browse/SPARK-2627), we'd like Python code to automatically be checked for PEP 8 compliance by Jenkins. This pull request aims to do that.
      
      Notes:
      * We may need to install [`pep8`](https://pypi.python.org/pypi/pep8) on the build server.
      * I'm expecting tests to fail now that PEP 8 compliance is being checked as part of the build. I'm fine with cleaning up any remaining PEP 8 violations as part of this pull request.
      * I did not understand why the RAT and scalastyle reports are saved to text files. I did the same for the PEP 8 check, but only so that the console output style can match those for the RAT and scalastyle checks. The PEP 8 report is removed right after the check is complete.
      * Updates to the ["Contributing to Spark"](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark) guide will be submitted elsewhere, as I don't believe that text is part of the Spark repo.
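      A rough sketch, in Python, of the kind of check this wires into the build: run the `pep8` tool over the PySpark sources, print the report, delete it, and fail on any violation. The paths and invocation are illustrative and are not the actual lint script added by this PR:

      ```python
      import os
      import subprocess
      import sys

      REPORT = "pep8-report.txt"

      # Run the pep8 checker over the PySpark sources; per-file exclusions would
      # come from tox.ini in the working directory, as in this PR.
      result = subprocess.run(
          ["pep8", "python/pyspark"],
          stdout=subprocess.PIPE,
          stderr=subprocess.STDOUT,
          universal_newlines=True,
      )

      # Save the report only so the console output matches the RAT/scalastyle
      # checks, then delete it and fail the build on any violation.
      with open(REPORT, "w") as f:
          f.write(result.stdout)
      print(result.stdout)
      os.remove(REPORT)
      sys.exit(result.returncode)
      ```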
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      Author: nchammas <nicholas.chammas@gmail.com>
      
      Closes #1744 from nchammas/master and squashes the following commits:
      
      274b238 [Nicholas Chammas] [SPARK-2627] [PySpark] minor indentation changes
      983d963 [nchammas] Merge pull request #5 from apache/master
      1db5314 [nchammas] Merge pull request #4 from apache/master
      0e0245f [Nicholas Chammas] [SPARK-2627] undo erroneous whitespace fixes
      bf30942 [Nicholas Chammas] [SPARK-2627] PEP8: comment spacing
      6db9a44 [nchammas] Merge pull request #3 from apache/master
      7b4750e [Nicholas Chammas] merge upstream changes
      91b7584 [Nicholas Chammas] [SPARK-2627] undo unnecessary line breaks
      44e3e56 [Nicholas Chammas] [SPARK-2627] use tox.ini to exclude files
      b09fae2 [Nicholas Chammas] don't wrap comments unnecessarily
      bfb9f9f [Nicholas Chammas] [SPARK-2627] keep up with the PEP 8 fixes
      9da347f [nchammas] Merge pull request #2 from apache/master
      aa5b4b5 [Nicholas Chammas] [SPARK-2627] follow Spark bash style for if blocks
      d0a83b9 [Nicholas Chammas] [SPARK-2627] check that pep8 downloaded fine
      dffb5dd [Nicholas Chammas] [SPARK-2627] download pep8 at runtime
      a1ce7ae [Nicholas Chammas] [SPARK-2627] space out test report sections
      21da538 [Nicholas Chammas] [SPARK-2627] it's PEP 8, not PEP8
      6f4900b [Nicholas Chammas] [SPARK-2627] more misc PEP 8 fixes
      fe57ed0 [Nicholas Chammas] removing merge conflict backups
      9c01d4c [nchammas] Merge pull request #1 from apache/master
      9a66cb0 [Nicholas Chammas] resolving merge conflicts
      a31ccc4 [Nicholas Chammas] [SPARK-2627] miscellaneous PEP 8 fixes
      beaa9ac [Nicholas Chammas] [SPARK-2627] fail check on non-zero status
      723ed39 [Nicholas Chammas] always delete the report file
      0541ebb [Nicholas Chammas] [SPARK-2627] call Python linter from run-tests
      12440fa [Nicholas Chammas] [SPARK-2627] add Scala linter
      61c07b9 [Nicholas Chammas] [SPARK-2627] add Python linter
      75ad552 [Nicholas Chammas] make check output style consistent
  3. Jul 31, 2014
    • [SPARK-2724] Python version of RandomRDDGenerators · d8430148
      Doris Xin authored
A Python version of RandomRDDGenerators, but without support for randomRDD and randomVectorRDD, which take an arbitrary DistributionGenerator.
      
`randomRDD.py` is named that way to avoid a collision with Python's built-in `random` module.
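      A small usage sketch of the new generators; the sizes, means, and seeds are illustrative, and the `RandomRDDs` class name follows released Spark versions (this PR's code initially used `RandomRDDGenerators`):

      ```python
      from pyspark import SparkContext
      # The generators live in pyspark.mllib.random so that importing them never
      # shadows Python's built-in random module.
      from pyspark.mllib.random import RandomRDDs

      sc = SparkContext(appName="RandomRDDsSketch")

      uniform = RandomRDDs.uniformRDD(sc, 100, seed=1)          # 100 draws from U(0, 1)
      poisson = RandomRDDs.poissonRDD(sc, 2.0, 100, seed=1)     # 100 draws from Poisson(2.0)
      vectors = RandomRDDs.normalVectorRDD(sc, 100, 3, seed=1)  # 100 standard-normal 3-vectors

      print(uniform.mean(), poisson.mean(), vectors.count())
      sc.stop()
      ```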
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      
      Closes #1628 from dorx/pythonRDD and squashes the following commits:
      
      55c6de8 [Doris Xin] review comments. all python units passed.
      f831d9b [Doris Xin] moved default args logic into PythonMLLibAPI
      2d73917 [Doris Xin] fix for linalg.py
      8663e6a [Doris Xin] reverting back to a single python file for random
      f47c481 [Doris Xin] docs update
      687aac0 [Doris Xin] add RandomRDDGenerators.py to run-tests
      4338f40 [Doris Xin] renamed randomRDD to rand and import as random
      29d205e [Doris Xin] created mllib.random package
      bd2df13 [Doris Xin] typos
      07ddff2 [Doris Xin] units passed.
      23b2ecd [Doris Xin] WIP
  4. Jul 22, 2014
    • [SPARK-2470] PEP8 fixes to PySpark · 5d16d5bb
      Nicholas Chammas authored
      This pull request aims to resolve all outstanding PEP8 violations in PySpark.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      Author: nchammas <nicholas.chammas@gmail.com>
      
      Closes #1505 from nchammas/master and squashes the following commits:
      
      98171af [Nicholas Chammas] [SPARK-2470] revert PEP 8 fixes to cloudpickle
      cba7768 [Nicholas Chammas] [SPARK-2470] wrap expression list in parentheses
      e178dbe [Nicholas Chammas] [SPARK-2470] style - change position of line break
      9127d2b [Nicholas Chammas] [SPARK-2470] wrap expression lists in parentheses
      22132a4 [Nicholas Chammas] [SPARK-2470] wrap conditionals in parentheses
      24639bc [Nicholas Chammas] [SPARK-2470] fix whitespace for doctest
      7d557b7 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to tests.py
      8f8e4c0 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to storagelevel.py
      b3b96cf [Nicholas Chammas] [SPARK-2470] PEP8 fixes to statcounter.py
      d644477 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to worker.py
      aa3a7b6 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to sql.py
      1916859 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to shell.py
      95d1d95 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to serializers.py
      a0fec2e [Nicholas Chammas] [SPARK-2470] PEP8 fixes to mllib
      c85e1e5 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to join.py
      d14f2f1 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to __init__.py
      81fcb20 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to resultiterable.py
      1bde265 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to java_gateway.py
      7fc849c [Nicholas Chammas] [SPARK-2470] PEP8 fixes to daemon.py
      ca2d28b [Nicholas Chammas] [SPARK-2470] PEP8 fixes to context.py
      f4e0039 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to conf.py
      a6d5e4b [Nicholas Chammas] [SPARK-2470] PEP8 fixes to cloudpickle.py
      f0a7ebf [Nicholas Chammas] [SPARK-2470] PEP8 fixes to rddsampler.py
      4dd148f [nchammas] Merge pull request #5 from apache/master
      f7e4581 [Nicholas Chammas] unrelated pep8 fix
      a36eed0 [Nicholas Chammas] name ec2 instances and security groups consistently
      de7292a [nchammas] Merge pull request #4 from apache/master
      2e4fe00 [nchammas] Merge pull request #3 from apache/master
      89fde08 [nchammas] Merge pull request #2 from apache/master
      69f6e22 [Nicholas Chammas] PEP8 fixes
      2627247 [Nicholas Chammas] broke up lines before they hit 100 chars
      6544b7e [Nicholas Chammas] [SPARK-2065] give launched instances names
      69da6cf [nchammas] Merge pull request #1 from apache/master
  5. Jun 04, 2014
    • [SPARK-1752][MLLIB] Standardize text format for vectors and labeled points · 189df165
      Xiangrui Meng authored
      We should standardize the text format used to represent vectors and labeled points. The proposed formats are the following:
      
      1. dense vector: `[v0,v1,..]`
      2. sparse vector: `(size,[i0,i1],[v0,v1])`
      3. labeled point: `(label,vector)`
      
      where "(..)" indicates a tuple and "[...]" indicate an array. `loadLabeledPoints` is added to pyspark's `MLUtils`. I didn't add `loadVectors` to pyspark because `RDD.saveAsTextFile` cannot stringify dense vectors in the proposed format automatically.
      
      `MLUtils#saveLabeledData` and `MLUtils#loadLabeledData` are deprecated. Users should use `RDD#saveAsTextFile` and `MLUtils#loadLabeledPoints` instead. In Scala, `MLUtils#loadLabeledPoints` is compatible with the format used by `MLUtils#loadLabeledData`.
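      A short PySpark sketch of the standardized string form and the new loader; the input path is illustrative:

      ```python
      from pyspark import SparkContext
      from pyspark.mllib.linalg import Vectors
      from pyspark.mllib.util import MLUtils

      sc = SparkContext(appName="LabeledPointFormatSketch")

      # Sparse vectors stringify in the standardized "(size,[indices],[values])" form.
      sv = Vectors.sparse(3, [0, 2], [1.0, 3.0])
      print(str(sv))  # expected: (3,[0,2],[1.0,3.0])

      # Text files of "(label,vector)" lines are read back with the new loader;
      # the path below is illustrative.
      points = MLUtils.loadLabeledPoints(sc, "data/mllib/labeled_points.txt")
      print(points.first())

      sc.stop()
      ```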
      
      CC: @mateiz, @srowen
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #685 from mengxr/labeled-io and squashes the following commits:
      
      2d1116a [Xiangrui Meng] make loadLabeledData/saveLabeledData deprecated since 1.0.1
      297be75 [Xiangrui Meng] change LabeledPoint.parse to LabeledPointParser.parse to maintain binary compatibility
      d6b1473 [Xiangrui Meng] Merge branch 'master' into labeled-io
      56746ea [Xiangrui Meng] replace # by .
      623a5f0 [Xiangrui Meng] merge master
      f06d5ba [Xiangrui Meng] add docs and minor updates
      640fe0c [Xiangrui Meng] throw SparkException
      5bcfbc4 [Xiangrui Meng] update test to add scientific notations
      e86bf38 [Xiangrui Meng] remove NumericTokenizer
      050fca4 [Xiangrui Meng] use StringTokenizer
      6155b75 [Xiangrui Meng] merge master
      f644438 [Xiangrui Meng] remove parse methods based on eval from pyspark
      a41675a [Xiangrui Meng] python loadLabeledPoint uses Scala's implementation
      ce9a475 [Xiangrui Meng] add deserialize_labeled_point to pyspark with tests
      e9fcd49 [Xiangrui Meng] add serializeLabeledPoint and tests
      aea4ae3 [Xiangrui Meng] minor updates
      810d6df [Xiangrui Meng] update tokenizer/parser implementation
      7aac03a [Xiangrui Meng] remove Scala parsers
      c1885c1 [Xiangrui Meng] add headers and minor changes
      b0c50cb [Xiangrui Meng] add customized parser
      d731817 [Xiangrui Meng] style update
      63dc396 [Xiangrui Meng] add loadLabeledPoints to pyspark
      ea122b5 [Xiangrui Meng] Merge branch 'master' into labeled-io
      cd6c78f [Xiangrui Meng] add __str__ and parse to LabeledPoint
      a7a178e [Xiangrui Meng] add stringify to pyspark's Vectors
      5c2dbfa [Xiangrui Meng] add parse to pyspark's Vectors
      7853f88 [Xiangrui Meng] update pyspark's SparseVector.__str__
      e761d32 [Xiangrui Meng] make LabelPoint.parse compatible with the dense format used before v1.0 and deprecate loadLabeledData and saveLabeledData
      9e63a02 [Xiangrui Meng] add loadVectors and loadLabeledPoints
      19aa523 [Xiangrui Meng] update toString and add parsers for Vectors and LabeledPoint
  6. May 25, 2014
    • Fix PEP8 violations in Python mllib. · d33d3c61
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #871 from rxin/mllib-pep8 and squashes the following commits:
      
      848416f [Reynold Xin] Fixed a typo in the previous cleanup (c -> sc).
      a8db4cd [Reynold Xin] Fix PEP8 violations in Python mllib.
  7. May 07, 2014
    • [SPARK-1743][MLLIB] add loadLibSVMFile and saveAsLibSVMFile to pyspark · 3188553f
      Xiangrui Meng authored
      Make loading/saving labeled data easier for pyspark users.
      
Also changed the type check in `SparseVector` to allow NumPy integers.
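      A minimal PySpark sketch of the two new `MLUtils` methods; the paths are illustrative:

      ```python
      from pyspark import SparkContext
      from pyspark.mllib.util import MLUtils

      sc = SparkContext(appName="LibSVMSketch")

      # Load LIBSVM-formatted labeled points (features come back as sparse vectors).
      data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
      print(data.first().label, data.first().features.size)

      # Write an RDD of LabeledPoint back out in the same format.
      MLUtils.saveAsLibSVMFile(data, "/tmp/libsvm_out")

      sc.stop()
      ```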
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #672 from mengxr/pyspark-mllib-util and squashes the following commits:
      
      2943fa7 [Xiangrui Meng] format docs
      d61668d [Xiangrui Meng] add loadLibSVMFile and saveAsLibSVMFile to pyspark
  8. Apr 15, 2014
    • [WIP] SPARK-1430: Support sparse data in Python MLlib · 63ca581d
      Matei Zaharia authored
      This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type.
      
      On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models.
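      A small PySpark sketch of training on sparse features with the classes this PR adds; the data and iteration count are illustrative:

      ```python
      from pyspark import SparkContext
      from pyspark.mllib.classification import LogisticRegressionWithSGD
      from pyspark.mllib.linalg import SparseVector
      from pyspark.mllib.regression import LabeledPoint

      sc = SparkContext(appName="SparseDataSketch")

      # Two toy labeled points with sparse features; a dict of index -> value
      # is one of the accepted SparseVector constructors.
      data = sc.parallelize([
          LabeledPoint(0.0, SparseVector(4, {0: 1.0})),
          LabeledPoint(1.0, SparseVector(4, {2: 2.0, 3: 1.0})),
      ])

      # The existing trainers accept sparse features; the returned model's
      # weights come back dense, as noted above.
      model = LogisticRegressionWithSGD.train(data, iterations=10)
      print(model.weights)

      sc.stop()
      ```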
      
      Some to-do items left:
      - [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. We can easily add a function to convert these to our own SparseVector.
      - [x] MLlib currently uses a vector with one extra column on the left to represent what we call LabeledPoint in Scala. Do we really want this? It may get annoying once you deal with sparse data since you must add/subtract 1 to each feature index when training. We can remove this API in 1.0 and use tuples for labeling.
      - [x] Explain how to use these in the Python MLlib docs.
      
      CC @mengxr, @joshrosen
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #341 from mateiz/py-ml-update and squashes the following commits:
      
      d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle review comments
      ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge
      b9f97a3 [Matei Zaharia] Fix test
      1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python
88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parameters
      37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib API
      da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to discuss sparse data
      c48e85a [Matei Zaharia] Added some tests for passing lists as input, and added mllib/tests.py to run-tests script.
      a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights
      74eefe7 [Matei Zaharia] Added LabeledPoint class in Python
      889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms and models
      ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict
      a5d6426 [Matei Zaharia] Add linalg.py to run-tests script
      0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading LabeledPoints
      eaee759 [Matei Zaharia] Update regression, classification and clustering models for sparse data
      2abbb44 [Matei Zaharia] Further work to get linear models working with sparse data
      154f45d [Matei Zaharia] Update docs, name some magic values
      881fef7 [Matei Zaharia] Added a sparse vector in Python and made Java-Python format more compact