  1. Jul 08, 2015
    • [SPARK-7785] [MLLIB] [PYSPARK] Add __str__ and __repr__ to Matrices · 2b40365d
      MechCoder authored
      Adding __str__ and __repr__ to DenseMatrix and SparseMatrix
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #6342 from MechCoder/spark-7785 and squashes the following commits:
      
      7b9a82c [MechCoder] Add tests for greater than 16 elements
      b88e9dd [MechCoder] Increment limit to 16
      1425a01 [MechCoder] Change tests
      36bd166 [MechCoder] Change str and repr representation
      97f0da9 [MechCoder] zip is same as izip in python3
      94ca4b2 [MechCoder] Added doctests and iterate over values instead of colPtrs
      b26fa89 [MechCoder] minor
      394dde9 [MechCoder] [SPARK-7785] Add __str__ and __repr__ to Matrices
  2. Jul 03, 2015
    • [SPARK-7401] [MLLIB] [PYSPARK] Vectorize dot product and sq_dist between... · f0fac2aa
      MechCoder authored
      [SPARK-7401] [MLLIB] [PYSPARK] Vectorize dot product and sq_dist between SparseVector and DenseVector
      
      Currently we iterate over the indices one at a time; this loop can be vectorized.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #5946 from MechCoder/spark-7203 and squashes the following commits:
      
      034d086 [MechCoder] Vectorize dot calculation for numpy arrays for ndim=2
      bce2b07 [MechCoder] fix doctest
      fcad0a3 [MechCoder] Remove type checks for list, pyarray etc
      0ee5dd4 [MechCoder] Add tests and other isinstance changes
      e5f1de0 [MechCoder] [SPARK-7401] Vectorize dot product and sq_dist
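The vectorization described in this commit can be illustrated with a minimal NumPy sketch. The helper names `sparse_dense_dot` and `sparse_dense_sq_dist` are hypothetical and are not PySpark's API; the sparse vector is assumed to be given as parallel `indices` and `values` arrays.

```python
import numpy as np


def sparse_dense_dot(indices, values, dense):
    """Dot product of a sparse vector (indices, values) with a dense array.

    Hypothetical helper: fancy indexing gathers the dense entries in one
    step instead of looping over the indices in Python.
    """
    indices = np.asarray(indices, dtype=np.int64)
    values = np.asarray(values, dtype=np.float64)
    dense = np.asarray(dense, dtype=np.float64)
    return float(np.dot(values, dense[indices]))


def sparse_dense_sq_dist(indices, values, dense):
    """Squared Euclidean distance, using ||s - d||^2 = s.s - 2 s.d + d.d."""
    indices = np.asarray(indices, dtype=np.int64)
    values = np.asarray(values, dtype=np.float64)
    dense = np.asarray(dense, dtype=np.float64)
    cross = np.dot(values, dense[indices])
    return float(np.dot(values, values) - 2.0 * cross + np.dot(dense, dense))
```

For example, the sparse vector with `indices=[0, 2]` and `values=[1.0, 3.0]` against the dense vector `[1, 2, 3, 4]` gives a dot product of 10.0 and a squared distance of 20.0.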
  3. Jul 01, 2015
    • [SPARK-6263] [MLLIB] Python MLlib API missing items: Utils · 184de91d
      lewuathe authored
      Implement missing API in pyspark.
      
      MLUtils
      * appendBias
      * loadVectors
      
      `kFold` is also missing; however, I am not sure whether a `ClassTag` can be passed or restored through Python.
      
      Author: lewuathe <lewuathe@me.com>
      
      Closes #5707 from Lewuathe/SPARK-6263 and squashes the following commits:
      
      16863ea [lewuathe] Merge master
      3fc27e7 [lewuathe] Merge branch 'master' into SPARK-6263
      6084e9c [lewuathe] Resolv conflict
      d2aa2a0 [lewuathe] Resolv conflict
      9c329d8 [lewuathe] Fix efficiency
      3a12a2d [lewuathe] Merge branch 'master' into SPARK-6263
      1d4714b [lewuathe] Fix style
      b29e2bc [lewuathe] Remove scipy dependencies
      e32eb40 [lewuathe] Merge branch 'master' into SPARK-6263
      25d3c9d [lewuathe] Remove unnecessary imports
      7ec04db [lewuathe] Resolv conflict
      1502d13 [lewuathe] Resolv conflict
      d6bd416 [lewuathe] Check existence of scipy.sparse
      5d555b1 [lewuathe] Construct scipy.sparse matrix
      c345a44 [lewuathe] Merge branch 'master' into SPARK-6263
      b8b5ef7 [lewuathe] Fix unnecessary sort method
      d254be7 [lewuathe] Merge branch 'master' into SPARK-6263
      62a9c7e [lewuathe] Fix appendBias return type
      454c73d [lewuathe] Merge branch 'master' into SPARK-6263
      a353354 [lewuathe] Remove unnecessary appendBias implementation
      44295c2 [lewuathe] Merge branch 'master' into SPARK-6263
      64f72ad [lewuathe] Merge branch 'master' into SPARK-6263
      c728046 [lewuathe] Fix style
      2980569 [lewuathe] [SPARK-6263] Python MLlib API missing items: Utils
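As a rough illustration of what `appendBias` does for a dense vector, the sketch below appends a constant bias term of 1.0 as the last element. The function name `append_bias` is hypothetical, and sparse input is not handled here.

```python
import numpy as np


def append_bias(vector):
    """Return a new array with a bias term of 1.0 appended (dense input only)."""
    vec = np.asarray(vector, dtype=np.float64)
    # np.append copies, so the caller's data is left untouched.
    return np.append(vec, 1.0)
```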
  4. Jun 30, 2015
    • [SPARK-4127] [MLLIB] [PYSPARK] Python bindings for StreamingLinearRegressionWithSGD · 45281664
      MechCoder authored
      Python bindings for StreamingLinearRegressionWithSGD
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #6744 from MechCoder/spark-4127 and squashes the following commits:
      
      d8f6457 [MechCoder] Moved StreamingLinearAlgorithm to pyspark.mllib.regression
      d47cc24 [MechCoder] Inherit from StreamingLinearAlgorithm
      1b4ddd6 [MechCoder] minor
      4de6c68 [MechCoder] Minor refactor
      5e85a3b [MechCoder] Add tests for simultaneous training and prediction
      fb27889 [MechCoder] Add example and docs
      505380b [MechCoder] Add tests
      d42bdae [MechCoder] [SPARK-4127] Python bindings for StreamingLinearRegressionWithSGD
  5. Jun 24, 2015
    • [SPARK-7633] [MLLIB] [PYSPARK] Python bindings for StreamingLogisticRegressionwithSGD · fb32c388
      MechCoder authored
      Add Python bindings to StreamingLogisticRegressionWithSGD.
      
      No Java wrappers are needed as models are updated directly using train.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #6849 from MechCoder/spark-3258 and squashes the following commits:
      
      b4376a5 [MechCoder] minor
      d7e5fc1 [MechCoder] Refactor into StreamingLinearAlgorithm Better docs
      9c09d4e [MechCoder] [SPARK-7633] Python bindings for StreamingLogisticRegressionwithSGD
  6. Jun 23, 2015
    • [SPARK-8265] [MLLIB] [PYSPARK] Add LinearDataGenerator to pyspark.mllib.utils · f2022fa0
      MechCoder authored
      It is useful to generate linear data for easy testing of linear models and in general. Scala already has it. This is just a wrapper around the Scala code.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #6715 from MechCoder/generate_linear_input and squashes the following commits:
      
      6182884 [MechCoder] Minor changes
      8bda047 [MechCoder] Minor style fixes
      0f1053c [MechCoder] [SPARK-8265] Add LinearDataGenerator to pyspark.mllib.utils
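To make "linear data" concrete, here is a minimal NumPy sketch of such a generator. It is not the Scala implementation that this commit wraps: the function name `generate_linear_data`, its parameters, and the uniform feature distribution are all assumptions chosen for illustration.

```python
import numpy as np


def generate_linear_data(intercept, weights, n_points, seed, eps=0.1):
    """Generate (features, labels) with labels = features . weights + intercept + noise.

    Hypothetical helper: features are drawn uniformly from [-1, 1) and the
    noise is Gaussian scaled by eps.
    """
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=np.float64)
    features = rng.uniform(-1.0, 1.0, size=(n_points, weights.size))
    noise = eps * rng.standard_normal(n_points)
    labels = features @ weights + intercept + noise
    return features, labels
```

With `eps=0.0` the labels lie exactly on the hyperplane, which makes it easy to check that a linear model recovers the given weights and intercept.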
    • [SPARK-7781] [MLLIB] gradient boosted trees.train regressor missing max bins · 164fe2aa
      Holden Karau authored
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #6331 from holdenk/SPARK-7781-GradientBoostedTrees.trainRegressor-missing-max-bins and squashes the following commits:
      
      2894695 [Holden Karau] remove extra blank line
      2573e8d [Holden Karau] Update the scala side of the pythonmllibapi and make the test a bit nicer too
      3a09170 [Holden Karau] add maxBins to to the train method as well
      af7f274 [Holden Karau] Add maxBins to GradientBoostedTrees.trainRegressor and correctly mention the default of 32 in other places where it mentioned 100
  7. Jun 22, 2015
  8. Jun 19, 2015
    • [SPARK-4118] [MLLIB] [PYSPARK] Python bindings for StreamingKMeans · 54976e55
      MechCoder authored
      Python bindings for StreamingKMeans
      
      Will change status to MRG once docs, tests and examples are updated.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #6499 from MechCoder/spark-4118 and squashes the following commits:
      
      7722d16 [MechCoder] minor style fixes
      51052d3 [MechCoder] Doc fixes
      2061a76 [MechCoder] Add tests for simultaneous training and prediction Minor style fixes
      81482fd [MechCoder] minor
      5d9fe61 [MechCoder] predictOn should take into account the latest model
      8ab9e89 [MechCoder] Fix Python3 error
      a9817df [MechCoder] Better tests and minor fixes
      c80e451 [MechCoder] Add ignore_unicode_prefix
      ee8ce16 [MechCoder] Update tests, doc and examples
      4b1481f [MechCoder] Some changes and tests
      d8b066a [MechCoder] [SPARK-4118] [MLlib] [PySpark] Python bindings for StreamingKMeans
  9. Jun 18, 2015
    • [SPARK-7605] [MLLIB] [PYSPARK] Python API for ElementwiseProduct · 22732e1e
      MechCoder authored
      Python API for org.apache.spark.mllib.feature.ElementwiseProduct
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #6346 from MechCoder/spark-7605 and squashes the following commits:
      
      79d1ef5 [MechCoder] Consistent and support list / array types
      5f81d81 [MechCoder] [SPARK-7605] [MLlib] Python API for ElementwiseProduct
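The operation behind ElementwiseProduct is the Hadamard (component-wise) product of an input vector with a fixed scaling vector. A minimal NumPy sketch follows; the function name `elementwise_product` is hypothetical and this is not the PySpark wrapper itself.

```python
import numpy as np


def elementwise_product(scaling_vec, vector):
    """Multiply each component of `vector` by the matching component of `scaling_vec`."""
    scaling = np.asarray(scaling_vec, dtype=np.float64)
    vec = np.asarray(vector, dtype=np.float64)
    if scaling.shape != vec.shape:
        raise ValueError("scaling vector and input vector must have the same shape")
    return scaling * vec
```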
  10. Jun 17, 2015
    • [SPARK-6390] [SQL] [MLlib] Port MatrixUDT to PySpark · 6765ef98
      MechCoder authored
      MatrixUDT was recently implemented in Scala. This PR ports it to PySpark.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #6354 from MechCoder/spark-6390 and squashes the following commits:
      
      fc4dc1e [MechCoder] Better error message
      c940a44 [MechCoder] Added test
      aa9c391 [MechCoder] Add pyUDT to MatrixUDT
      62a2a7d [MechCoder] [SPARK-6390] Port MatrixUDT to PySpark
  11. May 07, 2015
    • [SPARK-7328] [MLLIB] [PYSPARK] Pyspark.mllib.linalg.Vectors: Missing items · 347a329a
      MechCoder authored
      Add
      1. Class methods squared_dist
      3. parse
      4. norm
      5. numNonzeros
      6. copy
      
      I made a few vectorizations wrt squared_dist and dot as well. I have added support for SparseMatrix serialization in a separate PR (https://github.com/apache/spark/pull/5775) and plan to complete support for Matrices in another PR.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #5872 from MechCoder/local_linalg_api and squashes the following commits:
      
      a8ff1e0 [MechCoder] minor
      ce3e53e [MechCoder] Add error message for parser
      1bd3c04 [MechCoder] Robust parser and removed unnecessary methods
      f779561 [MechCoder] [SPARK-7328] Pyspark.mllib.linalg.Vectors: Missing items
  12. May 05, 2015
    • [SPARK-6612] [MLLIB] [PYSPARK] Python KMeans parity · 5995ada9
      Hrishikesh Subramonian authored
      The following items are added to Python kmeans:
      
      kmeans - setEpsilon, setInitializationSteps
      KMeansModel - computeCost, k
      
      Author: Hrishikesh Subramonian <hrishikesh.subramonian@flytxt.com>
      
      Closes #5647 from FlytxtRnD/newPyKmeansAPI and squashes the following commits:
      
      b9e451b [Hrishikesh Subramonian] set seed to fixed value in doc test
      5fd3ced [Hrishikesh Subramonian] doc test corrections
      20b3c68 [Hrishikesh Subramonian] python 3 fixes
      4d4e695 [Hrishikesh Subramonian] added arguments in python tests
      21eb84c [Hrishikesh Subramonian] Python Kmeans - setEpsilon, setInitializationSteps, k and computeCost added.
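For reference, the k-means cost reported by `computeCost` is the sum of squared distances from each point to its nearest center. The NumPy sketch below computes that quantity locally; the function name `compute_cost` is hypothetical and it works on in-memory arrays rather than an RDD.

```python
import numpy as np


def compute_cost(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    points = np.asarray(points, dtype=np.float64)
    centers = np.asarray(centers, dtype=np.float64)
    # Shape (n_points, n_centers, n_dims) via broadcasting.
    diffs = points[:, None, :] - centers[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    return float(np.sum(np.min(sq_dists, axis=1)))
```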
    • [SPARK-7202] [MLLIB] [PYSPARK] Add SparseMatrixPickler to SerDe · 5ab652cd
      MechCoder authored
      Utilities for pickling and unpickling SparseMatrices using SerDe
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #5775 from MechCoder/spark-7202 and squashes the following commits:
      
      7e689dc [MechCoder] [SPARK-7202] Add SparseMatrixPickler to SerDe
  13. Apr 21, 2015
    • [SPARK-6953] [PySpark] speed up python tests · 3134c3fe
      Reynold Xin authored
      This PR tries to speed up some Python tests:
      
      ```
      tests.py                       144s -> 103s      -41s
      mllib/classification.py         24s -> 17s        -7s
      mllib/regression.py             27s -> 15s       -12s
      mllib/tree.py                   27s -> 13s       -14s
      mllib/tests.py                  64s -> 31s       -33s
      streaming/tests.py             185s -> 84s      -101s
      ```
      Taking Python 3 into account, the total saving will be 558s (almost 10 minutes), because the core and streaming tests run three times and the mllib tests run twice.
      
      During testing, the elapsed time is shown for each test file:
      ```
      Run core tests ...
      Running test: pyspark/rdd.py ... ok (22s)
      Running test: pyspark/context.py ... ok (16s)
      Running test: pyspark/conf.py ... ok (4s)
      Running test: pyspark/broadcast.py ... ok (4s)
      Running test: pyspark/accumulators.py ... ok (4s)
      Running test: pyspark/serializers.py ... ok (6s)
      Running test: pyspark/profiler.py ... ok (5s)
      Running test: pyspark/shuffle.py ... ok (1s)
      Running test: pyspark/tests.py ... ok (103s)   144s
      ```
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5605 from rxin/python-tests-speed and squashes the following commits:
      
      d08542d [Reynold Xin] Merge pull request #14 from mengxr/SPARK-6953
      89321ee [Xiangrui Meng] fix seed in tests
      3ad2387 [Reynold Xin] Merge pull request #5427 from davies/python_tests
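The 558s figure quoted in this commit message can be reproduced from the per-file savings in its table. The grouping below (tests.py counted as core, the four mllib files summed) is one reading of the message that matches the stated total.

```python
# Per-run savings in seconds, taken from the timing table in the commit message.
core_saving = 41                    # tests.py
streaming_saving = 101              # streaming/tests.py
mllib_saving = 7 + 12 + 14 + 33     # classification, regression, tree, mllib/tests.py

# Assumption from the message: core and streaming run three times, mllib twice.
total_saving = 3 * (core_saving + streaming_saving) + 2 * mllib_saving
print(total_saving)  # 558
```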
    • [SPARK-6845] [MLlib] [PySpark] Add isTranposed flag to DenseMatrix · 45c47fa4
      MechCoder authored
      Since sparse matrices now support an isTransposed flag for row-major data, DenseMatrix should do the same.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #5455 from MechCoder/spark-6845 and squashes the following commits:
      
      525c370 [MechCoder] minor
      004a37f [MechCoder] Cast boolean to int
      151f3b6 [MechCoder] [WIP] Add isTransposed to pickle DenseMatrix
      cc0b90a [MechCoder] [SPARK-6845] Add isTranposed flag to DenseMatrix
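The effect of such a flag can be sketched with NumPy: the same flat value array is read as column-major by default and as row-major when the flag is set. The function name `dense_matrix_to_array` and its signature are hypothetical, not the PySpark class itself.

```python
import numpy as np


def dense_matrix_to_array(num_rows, num_cols, values, is_transposed=False):
    """Interpret a flat value array as a num_rows x num_cols matrix.

    Column-major (Fortran order) by default; row-major when is_transposed is True.
    """
    flat = np.asarray(values, dtype=np.float64)
    if flat.size != num_rows * num_cols:
        raise ValueError("values must contain num_rows * num_cols elements")
    order = "C" if is_transposed else "F"
    return flat.reshape((num_rows, num_cols), order=order)
```

For a 2 x 3 matrix with values `[1, 2, 3, 4, 5, 6]`, the default layout gives rows `[1, 3, 5]` and `[2, 4, 6]`, while `is_transposed=True` gives rows `[1, 2, 3]` and `[4, 5, 6]`.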
  14. Apr 20, 2015
  15. Apr 16, 2015
    • [SPARK-4897] [PySpark] Python 3 support · 04e44b37
      Davies Liu authored
      This PR updates PySpark to support Python 3 (tested with 3.4).
      
      Known issue: unpickling arrays from Pyrolite is broken in Python 3, so those tests are skipped.
      
      TODO: ec2/spark_ec2.py is not fully tested with Python 3.
      
      Author: Davies Liu <davies@databricks.com>
      Author: twneale <twneale@gmail.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5173 from davies/python3 and squashes the following commits:
      
      d7d6323 [Davies Liu] fix tests
      6c52a98 [Davies Liu] fix mllib test
      99e334f [Davies Liu] update timeout
      b716610 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      cafd5ec [Davies Liu] adddress comments from @mengxr
      bf225d7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      179fc8d [Davies Liu] tuning flaky tests
      8c8b957 [Davies Liu] fix ResourceWarning in Python 3
      5c57c95 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      4006829 [Davies Liu] fix test
      2fc0066 [Davies Liu] add python3 path
      71535e9 [Davies Liu] fix xrange and divide
      5a55ab4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      125f12c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ed498c8 [Davies Liu] fix compatibility with python 3
      820e649 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      e8ce8c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ad7c374 [Davies Liu] fix mllib test and warning
      ef1fc2f [Davies Liu] fix tests
      4eee14a [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      20112ff [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      59bb492 [Davies Liu] fix tests
      1da268c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ca0fdd3 [Davies Liu] fix code style
      9563a15 [Davies Liu] add imap back for python 2
      0b1ec04 [Davies Liu] make python examples work with Python 3
      d2fd566 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      a716d34 [Davies Liu] test with python 3.4
      f1700e8 [Davies Liu] fix test in python3
      671b1db [Davies Liu] fix test in python3
      692ff47 [Davies Liu] fix flaky test
      7b9699f [Davies Liu] invalidate import cache for Python 3.3+
      9c58497 [Davies Liu] fix kill worker
      309bfbf [Davies Liu] keep compatibility
      5707476 [Davies Liu] cleanup, fix hash of string in 3.3+
      8662d5b [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      f53e1f0 [Davies Liu] fix tests
      70b6b73 [Davies Liu] compile ec2/spark_ec2.py in python 3
      a39167e [Davies Liu] support customize class in __main__
      814c77b [Davies Liu] run unittests with python 3
      7f4476e [Davies Liu] mllib tests passed
      d737924 [Davies Liu] pass ml tests
      375ea17 [Davies Liu] SQL tests pass
      6cc42a9 [Davies Liu] rename
      431a8de [Davies Liu] streaming tests pass
      78901a7 [Davies Liu] fix hash of serializer in Python 3
      24b2f2e [Davies Liu] pass all RDD tests
      35f48fe [Davies Liu] run future again
      1eebac2 [Davies Liu] fix conflict in ec2/spark_ec2.py
      6e3c21d [Davies Liu] make cloudpickle work with Python3
      2fb2db3 [Josh Rosen] Guard more changes behind sys.version; still doesn't run
      1aa5e8f [twneale] Turned out `pickle.DictionaryType is dict` == True, so swapped it out
      7354371 [twneale] buffer --> memoryview  I'm not super sure if this a valid change, but the 2.7 docs recommend using memoryview over buffer where possible, so hoping it'll work.
      b69ccdf [twneale] Uses the pure python pickle._Pickler instead of c-extension _pickle.Pickler. It appears pyspark 2.7 uses the pure python pickler as well, so this shouldn't degrade pickling performance (?).
      f40d925 [twneale] xrange --> range
      e104215 [twneale] Replaces 2.7 types.InstsanceType with 3.4 `object`....could be horribly wrong depending on how types.InstanceType is used elsewhere in the package--see http://bugs.python.org/issue8206
      79de9d0 [twneale] Replaces python2.7 `file` with 3.4 _io.TextIOWrapper
      2adb42d [Josh Rosen] Fix up some import differences between Python 2 and 3
      854be27 [Josh Rosen] Run `futurize` on Python code:
      7c5b4ce [Josh Rosen] Remove Python 3 check in shell.py.
  16. Apr 13, 2015
    • [SPARK-6643][MLLIB] Implement StandardScalerModel missing methods · fc176614
      lewuathe authored
      This is a sub-task of SPARK-6254.
      Wrap the missing methods of `StandardScalerModel`.
      
      Author: lewuathe <lewuathe@me.com>
      
      Closes #5310 from Lewuathe/SPARK-6643 and squashes the following commits:
      
      fafd690 [lewuathe] Fix for lint-python
      bd31a64 [lewuathe] Merge branch 'master' into SPARK-6643
      578f5ee [lewuathe] Remove unnecessary class
      a38f155 [lewuathe] Merge master
      66bb2ab [lewuathe] Fix typos
      82683a0 [lewuathe] [SPARK-6643] Implement StandardScalerModel missing methods
  17. Apr 10, 2015
    • [SPARK-6577] [MLlib] [PySpark] SparseMatrix should be supported in PySpark · e2360810
      MechCoder authored
      Add support for SparseMatrix in PySpark.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #5355 from MechCoder/spark-6577 and squashes the following commits:
      
      7492190 [MechCoder] More readable code for densifying
      ea2c54b [MechCoder] Check bounds for indexing
      454ef2c [MechCoder] Made the following changes 1. Used convert_to_array for array conversion. 2. Used F order for toArray 3. Minor improvements in speed.
      db76caf [MechCoder] Add support for CSR matrix
      29653e7 [MechCoder] Renamed indices to rowIndices and indptr to colPtrs
      b6384fe [MechCoder] [SPARK-6577] SparseMatrix should be supported in PySpark
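The squashed commits above mention `colPtrs`, `rowIndices` and densifying. A minimal sketch of densifying a matrix stored in compressed sparse column (CSC) form is shown below; the function name `csc_to_dense` is hypothetical, and the F-ordered output mirrors the "F order for toArray" note only as an assumption.

```python
import numpy as np


def csc_to_dense(num_rows, num_cols, col_ptrs, row_indices, values):
    """Expand a CSC matrix into a dense, column-major NumPy array."""
    row_indices = np.asarray(row_indices, dtype=np.int64)
    values = np.asarray(values, dtype=np.float64)
    dense = np.zeros((num_rows, num_cols), dtype=np.float64, order="F")
    for j in range(num_cols):
        # Entries of column j live in the half-open slice [col_ptrs[j], col_ptrs[j + 1]).
        start, end = col_ptrs[j], col_ptrs[j + 1]
        dense[row_indices[start:end], j] = values[start:end]
    return dense
```

For example, a 2 x 2 matrix with `col_ptrs=[0, 2, 3]`, `row_indices=[0, 1, 1]` and `values=[2, 3, 4]` densifies to rows `[2, 0]` and `[3, 4]`.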
  18. Apr 07, 2015
  19. Apr 05, 2015
  20. Apr 03, 2015
    • [SPARK-6615][MLLIB] Python API for Word2Vec · 512a2f19
      lewuathe authored
      This is a sub-task of SPARK-6254.
      Wrap the missing methods of `Word2Vec` and `Word2VecModel`.
      
      Author: lewuathe <lewuathe@me.com>
      
      Closes #5296 from Lewuathe/SPARK-6615 and squashes the following commits:
      
      f14c304 [lewuathe] Reorder tests
      1d326b9 [lewuathe] Merge master
      e2bedfb [lewuathe] Modify test cases
      afb866d [lewuathe] [SPARK-6615] Python API for Word2Vec
  21. Apr 01, 2015
  22. Mar 31, 2015
    • [SPARK-6598][MLLIB] Python API for IDFModel · 46de6c05
      lewuathe authored
      This is a sub-task of SPARK-6254.
      Wrap the `idf` member function of IDFModel for PySpark.
      
      Author: lewuathe <lewuathe@me.com>
      
      Closes #5264 from Lewuathe/SPARK-6598 and squashes the following commits:
      
      1dc522c [lewuathe] [SPARK-6598] Python API for IDFModel
  23. Mar 20, 2015
  24. Mar 03, 2015
    • [SPARK-6097][MLLIB] Support tree model save/load in PySpark/MLlib · 7e53a79c
      Xiangrui Meng authored
      Similar to `MatrixFactorizationModel`, we only need wrappers to support save/load for tree models in Python.
      
      jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4854 from mengxr/SPARK-6097 and squashes the following commits:
      
      4586a4d [Xiangrui Meng] fix more typos
      8ebcac2 [Xiangrui Meng] fix python style
      91172d8 [Xiangrui Meng] fix typos
      201b3b9 [Xiangrui Meng] update user guide
      b5158e2 [Xiangrui Meng] support tree model save/load in PySpark/MLlib
  25. Feb 14, 2015
    • [SPARK-5752][SQL] Don't implicitly convert RDDs directly to DataFrames · e98dfe62
      Reynold Xin authored
      - The old implicit would convert RDDs directly to DataFrames, and that added too many methods.
      - toDataFrame -> toDF
      - Dsl -> functions
      - implicits moved into SQLContext.implicits
      - addColumn -> withColumn
      - renameColumn -> withColumnRenamed
      
      Python changes:
      - toDataFrame -> toDF
      - Dsl -> functions package
      - addColumn -> withColumn
      - renameColumn -> withColumnRenamed
      - add toDF functions to RDD on SQLContext init
      - add flatMap to DataFrame
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4556 from rxin/SPARK-5752 and squashes the following commits:
      
      5ef9910 [Reynold Xin] More fix
      61d3fca [Reynold Xin] Merge branch 'df5' of github.com:davies/spark into SPARK-5752
      ff5832c [Reynold Xin] Fix python
      749c675 [Reynold Xin] count(*) fixes.
      5806df0 [Reynold Xin] Fix build break again.
      d941f3d [Reynold Xin] Fixed explode compilation break.
      fe1267a [Davies Liu] flatMap
      c4afb8e [Reynold Xin] style
      d9de47f [Davies Liu] add comment
      b783994 [Davies Liu] add comment for toDF
      e2154e5 [Davies Liu] schema() -> schema
      3a1004f [Davies Liu] Dsl -> functions, toDF()
      fb256af [Reynold Xin] - toDataFrame -> toDF - Dsl -> functions - implicits moved into SQLContext.implicits - addColumn -> withColumn - renameColumn -> withColumnRenamed
      0dd74eb [Reynold Xin] [SPARK-5752][SQL] Don't implicitly convert RDDs directly to DataFrames
      97dd47c [Davies Liu] fix mistake
      6168f74 [Davies Liu] fix test
      1fc0199 [Davies Liu] fix test
      a075cd5 [Davies Liu] clean up, toPandas
      663d314 [Davies Liu] add test for agg('*')
      9e214d5 [Reynold Xin] count(*) fixes.
      1ed7136 [Reynold Xin] Fix build break again.
      921b2e3 [Reynold Xin] Fixed explode compilation break.
      14698d4 [Davies Liu] flatMap
      ba3e12d [Reynold Xin] style
      d08c92d [Davies Liu] add comment
      5c8b524 [Davies Liu] add comment for toDF
      a4e5e66 [Davies Liu] schema() -> schema
      d377fc9 [Davies Liu] Dsl -> functions, toDF()
      6b3086c [Reynold Xin] - toDataFrame -> toDF - Dsl -> functions - implicits moved into SQLContext.implicits - addColumn -> withColumn - renameColumn -> withColumnRenamed
      807e8b1 [Reynold Xin] [SPARK-5752][SQL] Don't implicitly convert RDDs directly to DataFrames
  26. Feb 04, 2015
    • [SPARK-5585] Flaky test in MLlib python · 38a416f0
      Davies Liu authored
      Add a seed for tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4358 from davies/flaky_test and squashes the following commits:
      
      02371c3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into flaky_test
      ced499b [Davies Liu] add seed for test
  27. Feb 03, 2015
    • [SPARK-5012][MLLib][PySpark]Python API for Gaussian Mixture Model · 50a1a874
      FlytxtRnD authored
      Python API for the Gaussian Mixture Model clustering algorithm in MLLib.
      
      Author: FlytxtRnD <meethu.mathew@flytxt.com>
      
      Closes #4059 from FlytxtRnD/PythonGmmWrapper and squashes the following commits:
      
      c973ab3 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
      339b09c [FlytxtRnD] Added MultivariateGaussian namedtuple  and Arraybuffer in trainGaussianMixture
      fa0a142 [FlytxtRnD] New line added
      d5b36ab [FlytxtRnD] Changed argument names to lowercase
      ac134f1 [FlytxtRnD] Merge branch 'PythonGmmWrapper' of https://github.com/FlytxtRnD/spark into PythonGmmWrapper
      6671ea1 [FlytxtRnD] Added mllib/stat/distribution.py
      3aee84b [FlytxtRnD] Fixed style issues
      2e9f12a [FlytxtRnD] Added mllib/stat/distribution.py and fixed style issues
      b22532c [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
      2e14d82 [FlytxtRnD] Incorporate MultivariateGaussian instances in GaussianMixtureModel
      05767c7 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
      3464d19 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
      c1d4c71 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'origin/PythonGmmWrapper' into PythonGmmWrapper
      426d130 [FlytxtRnD] Added random seed parameter
      332bad1 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
      f82750b [FlytxtRnD] Fixed style issues
      5c83825 [FlytxtRnD] Split input file with space delimiter
      fda60f3 [FlytxtRnD] Python API for Gaussian Mixture Model
  28. Jan 30, 2015
    • [SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees · bc1fc9b6
      Kazuki Taniguchi authored
      This PR implements the Python API for Gradient Boosted Trees.
      
      Author: Kazuki Taniguchi <kazuki.t.1018@gmail.com>
      
      Closes #3951 from kazk1018/gbt_for_py and squashes the following commits:
      
      620d247 [Kazuki Taniguchi] [SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees
  29. Jan 21, 2015
    • [SPARK-4749] [mllib]: Allow initializing KMeans clusters using a seed · 7450a992
      nate.crosswhite authored
      This implements the functionality for SPARK-4749 and provides unit tests in Scala and PySpark.
      
      Author: nate.crosswhite <nate.crosswhite@stresearch.com>
      Author: nxwhite-str <nxwhite-str@users.noreply.github.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3610 from nxwhite-str/master and squashes the following commits:
      
      a2ebbd3 [nxwhite-str] Merge pull request #1 from mengxr/SPARK-4749-kmeans-seed
      7668124 [Xiangrui Meng] minor updates
      f8d5928 [nate.crosswhite] Addressing PR issues
      277d367 [nate.crosswhite] Merge remote-tracking branch 'upstream/master'
      9156a57 [nate.crosswhite] Merge remote-tracking branch 'upstream/master'
      5d087b4 [nate.crosswhite] Adding KMeans train with seed and Scala unit test
      616d111 [nate.crosswhite] Merge remote-tracking branch 'upstream/master'
      35c1884 [nate.crosswhite] Add kmeans initial seed to pyspark API
  30. Jan 14, 2015
    • [SPARK-2909] [MLlib] [PySpark] SparseVector in pyspark now supports indexing · 5840f546
      MechCoder authored
      This differs slightly from the Scala code, which converts the SparseVector into a DenseVector and then checks the index.
      
      I also hope I've added tests in the right place.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #4025 from MechCoder/spark-2909 and squashes the following commits:
      
      07d0f26 [MechCoder] STY: Rename item to index
      f02148b [MechCoder] [SPARK-2909] [Mlib] SparseVector in pyspark now supports indexing
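One way to index a sparse vector without first densifying it is a binary search over its sorted index array. The sketch below is an illustration under that assumption, not the code from this commit; the function name `sparse_get` and the support for negative indices are hypothetical.

```python
import numpy as np


def sparse_get(size, indices, values, index):
    """Return element `index` of a sparse vector with sorted `indices`.

    Entries that are not stored are implicitly 0.0.
    """
    if not isinstance(index, (int, np.integer)):
        raise TypeError("index must be an integer")
    if index < 0:
        index += size  # assumption: allow Python-style negative indexing
    if index < 0 or index >= size:
        raise IndexError("index out of bounds")
    indices = np.asarray(indices)
    # Binary search for the position where `index` would be inserted.
    pos = int(np.searchsorted(indices, index))
    if pos < indices.size and indices[pos] == index:
        return float(values[pos])
    return 0.0
```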
  31. Jan 05, 2015
    • [SPARK-5089][PYSPARK][MLLIB] Fix vector convert · 6c6f3257
      freeman authored
      This is a small change addressing a potentially significant bug in how PySpark + MLlib handles non-float64 numpy arrays. The automatic conversion to `DenseVector` that occurs when passing RDDs to MLlib algorithms in PySpark should upcast to float64, but this was not actually happening. As a result, non-float64 arrays would be silently misparsed during SerDe, yielding erroneous results when running, for example, KMeans.
      
      The PR includes the fix, as well as a new test for the correct conversion behavior.
      
      davies
      
      Author: freeman <the.freeman.lab@gmail.com>
      
      Closes #3902 from freeman-lab/fix-vector-convert and squashes the following commits:
      
      764db47 [freeman] Add a test for proper conversion behavior
      704f97e [freeman] Return array after changing type
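The kind of failure described here can be illustrated with plain NumPy. The sketch below is not the PySpark SerDe path: `to_float64_array` is a hypothetical helper, and reinterpreting raw bytes with `np.frombuffer` only stands in for "reading the data as float64 without converting it first".

```python
import numpy as np


def to_float64_array(data):
    """Return `data` as a float64 ndarray, upcasting when necessary."""
    arr = np.asarray(data)
    if arr.dtype != np.float64:
        # astype returns a new array; the converted result must be the one returned.
        arr = arr.astype(np.float64)
    return arr


ints = np.array([1, 2, 3, 4], dtype=np.int32)

# Reading the int32 bytes as if they were float64 yields two meaningless values.
misread = np.frombuffer(ints.tobytes(), dtype=np.float64)

# Converting first preserves the four original values.
converted = to_float64_array(ints)
```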
  32. Dec 16, 2014
    • [SPARK-4855][mllib] testing the Chi-squared hypothesis test · cb484474
      jbencook authored
      This PR tests the PySpark Chi-squared hypothesis test from this commit: c8abddc5 and moves some of the error messaging into Python.
      
      It is a port of the Scala tests here: [HypothesisTestSuite.scala](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/stat/HypothesisTestSuite.scala)
      
      Hopefully, SPARK-2980 can be closed.
      
      Author: jbencook <jbenjamincook@gmail.com>
      
      Closes #3679 from jbencook/master and squashes the following commits:
      
      44078e0 [jbencook] checking that bad input throws the correct exceptions
      f12ee10 [jbencook] removing checks for ValueError since input tests are on the Scala side
      7536cf1 [jbencook] removing python checks for invalid input
      a17ee84 [jbencook] [SPARK-2980][mllib] adding unit tests for the pyspark chi-squared test
      3aeb0d9 [jbencook] [SPARK-2980][mllib] bringing Chi-squared error messages to the python side
  33. Nov 24, 2014
    • [SPARK-4562] [MLlib] speedup vector · b660de7a
      Davies Liu authored
      This PR changes the underlying array of DenseVector to numpy.ndarray to avoid the conversion, because most users will be using numpy.array.
      
      It also improves the serialization of DenseVector.
      
      Before this change:
      
      | trial | trainingTime | testTime |
      |-------|--------------|----------|
      | 0     | 5.126        | 1.786    |
      | 1     | 2.698        | 1.693    |
      
      After the change:
      
      | trial | trainingTime | testTime |
      |-------|--------------|----------|
      | 0     | 4.692        | 0.554    |
      | 1     | 2.307        | 0.525    |
      
      This could partially fix the performance regression during test.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3420 from davies/ser2 and squashes the following commits:
      
      0e1e6f3 [Davies Liu] fix tests
      426f5db [Davies Liu] impove toArray()
      44707ec [Davies Liu] add name for ISO-8859-1
      fa7d791 [Davies Liu] address comments
      1cfb137 [Davies Liu] handle zero sparse vector
      2548ee2 [Davies Liu] fix tests
      9e6389d [Davies Liu] bugfix
      470f702 [Davies Liu] speed up DenseMatrix
      f0d3c40 [Davies Liu] speedup SparseVector
      ef6ce70 [Davies Liu] speed up dense vector
  34. Nov 04, 2014
    • [SPARK-3573][MLLIB] Make MLlib's Vector compatible with SQL's SchemaRDD · 1a9c6cdd
      Xiangrui Meng authored
      Register MLlib's Vector as a SQL user-defined type (UDT) in both Scala and Python. With this PR, we can easily map an RDD[LabeledPoint] to a SchemaRDD, and then select columns or save to a Parquet file. Examples in Scala/Python are attached. The Scala code was copied from jkbradley.
      
      ~~This PR contains the changes from #3068 . I will rebase after #3068 is merged.~~
      
      marmbrus jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3070 from mengxr/SPARK-3573 and squashes the following commits:
      
      3a0b6e5 [Xiangrui Meng] organize imports
      236f0a0 [Xiangrui Meng] register vector as UDT and provide dataset examples
  35. Oct 21, 2014
    • [SPARK-4023] [MLlib] [PySpark] convert rdd into RDD of Vector · 85708168
      Davies Liu authored
      Convert the input rdd to RDD of Vector.
      
      cc mengxr
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2870 from davies/fix4023 and squashes the following commits:
      
      1eac767 [Davies Liu] address comments
      0871576 [Davies Liu] convert rdd into RDD of Vector
  36. Oct 11, 2014
    • [SPARK-3867][PySpark] ./python/run-tests failed when it run with Python 2.6... · 81015a2b
      cocoatomo authored
      [SPARK-3867][PySpark] ./python/run-tests failed when it run with Python 2.6 and unittest2 is not installed
      
      ./python/run-tests searches for a Python 2.6 executable on PATH and uses it if available.
      When run with Python 2.6, the tests try to import the unittest2 module, which is not part of the Python 2.6 standard library, so they fail with ImportError if it is not installed.
      
      Author: cocoatomo <cocoatomo77@gmail.com>
      
      Closes #2759 from cocoatomo/issues/3867-unittest2-import-error and squashes the following commits:
      
      f068eb5 [cocoatomo] [SPARK-3867] ./python/run-tests failed when it run with Python 2.6 and unittest2 is not installed
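A common way to handle this situation is a guarded import that falls back to a clear error message instead of a bare ImportError. The sketch below shows that pattern as an illustration; whether it matches the exact change in this commit is an assumption.

```python
import sys

if sys.version_info[:2] <= (2, 6):
    # Python 2.6's unittest lacks features that the tests rely on,
    # so the unittest2 backport is required there.
    try:
        import unittest2 as unittest
    except ImportError:
        sys.stderr.write("Please install unittest2 to test with Python 2.6 or earlier\n")
        sys.exit(1)
else:
    import unittest
```

On any newer interpreter the `else` branch runs and the standard library module is used under the same name, so the rest of the test file does not need to know which one was imported.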