  1. Jan 25, 2016
  2. Jan 15, 2016
  3. Nov 06, 2015
  4. Nov 02, 2015
  5. Oct 20, 2015
    • [MINOR][ML] fix doc warnings · 135ade90
      Xiangrui Meng authored
      Without an empty line, Sphinx treats the doctest as part of the docstring text. holdenk
      
      ~~~
      /Users/meng/src/spark/python/pyspark/ml/feature.py:docstring of pyspark.ml.feature.CountVectorizer:3: ERROR: Undefined substitution referenced: "label|raw |vectors | +-----+---------------+-------------------------+ |0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])".
      /Users/meng/src/spark/python/pyspark/ml/feature.py:docstring of pyspark.ml.feature.CountVectorizer:3: ERROR: Undefined substitution referenced: "1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])".
      ~~~
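The blank-line rule can be illustrated with a minimal sketch. `count_tokens` and `run_examples` are hypothetical names, not part of PySpark. Python's own `doctest` module runs the example either way; the empty line between the description and the `>>>` prompt matters to Sphinx, which otherwise parses the example as a continuation of the paragraph.

```python
import doctest


def count_tokens(tokens):
    """Count how often each token occurs.

    >>> sorted(count_tokens(["a", "b", "b"]).items())
    [('a', 1), ('b', 2)]
    """
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts


def run_examples():
    """Run the doctest embedded in count_tokens; return (failed, attempted)."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(
        count_tokens.__doc__,            # docstring containing the example
        {"count_tokens": count_tokens},  # globals visible to the example
        "count_tokens",                  # test name
        None,                            # no source file
        0,                               # line offset
    )
    runner = doctest.DocTestRunner(verbose=False)
    result = runner.run(test)
    return result.failed, result.attempted
```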
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #9188 from mengxr/py-count-vec-doc-fix.
  6. Sep 25, 2015
    • [SPARK-9681] [ML] Support R feature interactions in RFormula · 92233881
      Eric Liang authored
      This integrates the Interaction feature transformer with SparkR R formula support (i.e. support `:`).
      
      To generate reasonable ML attribute names for feature interactions, it was necessary to add the ability to read the original attribute names back from `StructField`, and also to specify custom group prefixes in `VectorAssembler`. This also has the side benefit of cleaning up the double underscores in the attributes generated for non-interaction terms.
      
      mengxr
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #8830 from ericl/interaction-2.
  7. Sep 21, 2015
  8. Sep 11, 2015
  9. Sep 10, 2015
    • [SPARK-10027] [ML] [PySpark] Add Python API missing methods for ml.feature · a140dd77
      Yanbo Liang authored
      The missing methods of ml.feature are listed here:
      ```StringIndexer``` lacks the parameter ```handleInvalid```.
      ```StringIndexerModel``` lacks the method ```labels```.
      ```VectorIndexerModel``` lacks the methods ```numFeatures``` and ```categoryMaps```.
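To make the roles of `labels` and `handleInvalid` concrete, here is a pure-Python sketch of string indexing. It is an illustration only, not the PySpark implementation: `fit_labels` and `index_values` are hypothetical helpers, the alphabetical tie-breaking is an assumption, and only the `"error"` and `"skip"` behaviours are modelled.

```python
from collections import Counter


def fit_labels(values):
    """Return distinct labels ordered by descending frequency.

    Ties are broken alphabetically (an assumption made for this sketch).
    """
    counts = Counter(values)
    return sorted(counts, key=lambda label: (-counts[label], label))


def index_values(values, labels, handle_invalid="error"):
    """Map each value to the position of its label.

    handle_invalid="error" raises on unseen values; "skip" drops them.
    """
    lookup = {label: float(i) for i, label in enumerate(labels)}
    indexed = []
    for value in values:
        if value in lookup:
            indexed.append(lookup[value])
        elif handle_invalid == "skip":
            continue  # silently drop values that were not seen when fitting
        else:
            raise ValueError("Unseen label: %r" % (value,))
    return indexed
```

With this sketch, the most frequent label gets index 0 and the least frequent label gets the last index.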
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #8313 from yanboliang/spark-10027.
  10. Sep 09, 2015
  11. Sep 08, 2015
  12. Sep 01, 2015
  13. Aug 31, 2015
  14. Aug 17, 2015
  15. Aug 12, 2015
  16. Aug 06, 2015
  17. Aug 03, 2015
    • [SPARK-9544] [MLLIB] add Python API for RFormula · e4765a46
      Xiangrui Meng authored
      Add a Python API for RFormula. As with other feature transformers in Python, this is just a thin wrapper over the Scala implementation. ericl MechCoder
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #7879 from mengxr/SPARK-9544 and squashes the following commits:
      
      3d5ff03 [Xiangrui Meng] add a doctest for . and -
      5e969a5 [Xiangrui Meng] fix pydoc
      1cd41f8 [Xiangrui Meng] organize imports
      3c18b10 [Xiangrui Meng] add Python API for RFormula
  18. Jul 30, 2015
    • [MINOR] [MLLIB] fix doc for RegexTokenizer · 81464f2a
      Xiangrui Meng authored
      This is #7791 for Python. hhbyyh
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #7798 from mengxr/regex-tok-py and squashes the following commits:
      
      baa2dcd [Xiangrui Meng] fix doc for RegexTokenizer
  19. Jul 17, 2015
    • [SPARK-8792] [ML] Add Python API for PCA transformer · 830666f6
      Yanbo Liang authored
      Add Python API for PCA transformer
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #7190 from yanboliang/spark-8792 and squashes the following commits:
      
      8f4ac31 [Yanbo Liang] address comments
      8a79cc0 [Yanbo Liang] Add Python API for PCA transformer
  20. Jul 07, 2015
    • [SPARK-8704] [ML] [PySpark] Add missing methods in StandardScaler · 35d781e7
      MechCoder authored
      Add `std` and `mean` to `StandardScalerModel`,
      `getVectors` and `findSynonyms` to `Word2VecModel`,
      and `setFeatures` and `getFeatures` to `HashingTF`.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #7086 from MechCoder/missing_model_methods and squashes the following commits:
      
      9fbae90 [MechCoder] Add type
      6e3d6b2 [MechCoder] [SPARK-8704] Add missing methods in StandardScaler (ML and PySpark)
  21. Jun 29, 2015
    • [SPARK-8456] [ML] Ngram featurizer python · 620605a4
      Feynman Liang authored
      Python API for N-gram feature transformer
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #6960 from feynmanliang/ngram-featurizer-python and squashes the following commits:
      
      f9e37c9 [Feynman Liang] Remove debugging code
      4dd81f4 [Feynman Liang] Fix typo and doctest
      06c79ac [Feynman Liang] Style guide
      26c1175 [Feynman Liang] Add python NGram API
  22. May 29, 2015
    • [SPARK-7912] [SPARK-7921] [MLLIB] Update OneHotEncoder to handle ML attributes and change includeFirst to dropLast · 23452be9
      Xiangrui Meng authored
      
      This PR contains two major changes to `OneHotEncoder`:
      
      1. More robust handling of ML attributes: if the input attribute is unknown, we look at the values to get the max category index.
      2. Change `includeFirst` to `dropLast` and keep the default as `true`. There are a couple of benefits:
      
          a. consistent with other tutorials of one-hot encoding (or dummy coding) (e.g., http://www.ats.ucla.edu/stat/mult_pkg/faq/general/dummy.htm)
          b. keep the indices unmodified in the output vector. If we drop the first, all indices will be shifted by 1.
          c. If users use `StringIndexer`, the last element is the least frequent one.
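The effect of dropping the last category rather than the first can be sketched in plain Python. The function `one_hot` below is hypothetical and is not the Spark implementation; it returns a `(size, indices, values)` triple, loosely mirroring how sparse vectors are printed.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a sparse (size, indices, values) triple.

    With drop_last=True the last category maps to the all-zeros vector,
    and every other category keeps its original index.
    """
    if index < 0 or index >= num_categories:
        raise ValueError("index %d out of range" % index)
    size = num_categories - 1 if drop_last else num_categories
    if index < size:
        return (size, [index], [1.0])
    return (size, [], [])  # the dropped (last) category
```

For three categories, index 0 stays at position 0 and index 2 (the last one) becomes an empty vector of size 2. Dropping the first category instead would require shifting every remaining index down by one.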
      
      Sorry for including two changes in one PR! I'll update the user guide in another PR.
      
      jkbradley sryza
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6466 from mengxr/SPARK-7912 and squashes the following commits:
      
      a280dca [Xiangrui Meng] fix tests
      d8f234d [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7912
      171b276 [Xiangrui Meng] mention the difference between our impl vs sklearn's
      00dfd96 [Xiangrui Meng] update OneHotEncoder in Python
      208ddad [Xiangrui Meng] update OneHotEncoder to handle ML attributes and change includeFirst to dropLast
  23. May 21, 2015
    • [SPARK-7794] [MLLIB] update RegexTokenizer default settings · f5db4b41
      Xiangrui Meng authored
      The previous default is `{gaps: false, pattern: "\\p{L}+|[^\\p{L}\\s]+"}`. The default pattern is hard to understand. This PR changes the default to `{gaps: true, pattern: "\\s+"}`. jkbradley
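The difference between the two settings can be sketched with Python's `re` module. This is an illustration under assumptions, not the Spark implementation: `regex_tokenize` is a hypothetical function, and because Python's `re` does not support `\p{L}`, the pattern `\w+|[^\w\s]+` is used as an approximate stand-in for the old default.

```python
import re


def regex_tokenize(text, pattern=r"\s+", gaps=True):
    """Tokenize text with a regular expression.

    gaps=True:  the pattern describes the separators, so split on it.
    gaps=False: the pattern describes the tokens, so collect its matches.
    Patterns are assumed to contain no capturing groups.
    """
    if gaps:
        tokens = re.split(pattern, text)
    else:
        tokens = re.findall(pattern, text)
    # Drop empty strings produced by leading or trailing separators.
    return [token for token in tokens if token]
```

With the new default, `regex_tokenize("a  b c")` splits on whitespace, which is easier to reason about than a pattern that must enumerate every kind of token.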
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6330 from mengxr/SPARK-7794 and squashes the following commits:
      
      5ee7cde [Xiangrui Meng] update RegexTokenizer default settings
  24. May 20, 2015
    • [SPARK-7511] [MLLIB] pyspark ml seed param should be random by default or 42 is quite funny but not very random · 191ee474
      Holden Karau authored
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #6139 from holdenk/SPARK-7511-pyspark-ml-seed-param-should-be-random-by-default-or-42-is-quite-funny-but-not-very-random and squashes the following commits:
      
      591f8e5 [Holden Karau] specify old seed for doc tests
      2470004 [Holden Karau] Fix a bunch of seeds with default values to have None as the default which will then result in using the hash of the class name
      cbad96d [Holden Karau] Add the setParams function that is used in the real code
      423b8d7 [Holden Karau] Switch the test code to behave slightly more like production code. also don't check the param map value only check for key existence
      140d25d [Holden Karau] remove extra space
      926165a [Holden Karau] Add some missing newlines for pep8 style
      8616751 [Holden Karau] merge in master
      58532e6 [Holden Karau] its the __name__ method, also treat None values as not set
      56ef24a [Holden Karau] fix test and regenerate base
      afdaa5c [Holden Karau] make sure different classes have different results
      68eb528 [Holden Karau] switch default seed to hash of type of self
      89c4611 [Holden Karau] Merge branch 'master' into SPARK-7511-pyspark-ml-seed-param-should-be-random-by-default-or-42-is-quite-funny-but-not-very-random
      31cd96f [Holden Karau] specify the seed to randomforestregressor test
      e1b947f [Holden Karau] Style fixes
      ce90ec8 [Holden Karau] merge in master
      bcdf3c9 [Holden Karau] update docstring seeds to none and some other default seeds from 42
      65eba21 [Holden Karau] pep8 fixes
      0e3797e [Holden Karau] Make seed default to random in more places
      213a543 [Holden Karau] Simplify the generated code to only include set default if there is a default rather than having None is note None in the generated code
      1ff17c2 [Holden Karau] Make the seed random for HasSeed in python
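The idea described in the commits above (a `None` default that falls back to a hash of the class name) can be sketched as follows. This is a simplified illustration, not the PySpark code: `ExampleForest` and `ExampleSampler` are hypothetical classes, and the use of the built-in `hash` is an assumption. Note that in Python 3.3+ string hashes vary between processes unless `PYTHONHASHSEED` is fixed.

```python
class HasSeed(object):
    """Mixin-style base class holding a `seed` value."""

    def __init__(self, seed=None):
        # None means "not set": fall back to a value derived from the class
        # name, so different classes get different default seeds.
        if seed is None:
            seed = hash(type(self).__name__)
        self.seed = seed


class ExampleForest(HasSeed):
    """Hypothetical estimator used only for illustration."""


class ExampleSampler(HasSeed):
    """Second hypothetical class, to show that defaults differ per class."""
```

Checking `seed is None` rather than truthiness keeps an explicit `seed=0` intact.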
  25. May 18, 2015
    • [SPARK-7380] [MLLIB] pipeline stages should be copyable in Python · 9c7e802a
      Xiangrui Meng authored
      This PR makes pipeline stages in Python copyable and hence simplifies some implementations. It also includes the following changes:
      
      1. Rename `paramMap` and `defaultParamMap` to `_paramMap` and `_defaultParamMap`, respectively.
      2. Accept a list of param maps in `fit`.
      3. Use parent uid and name to identify param.
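Why copyable stages simplify fitting with several param maps can be sketched in plain Python. `Scaler` is a hypothetical stage written for this illustration and does not reproduce the PySpark classes. Only the attribute names `_paramMap` and `_defaultParamMap` come from the description above; the method names are assumptions.

```python
import copy


class Scaler(object):
    """Hypothetical stage that multiplies each input value by a `factor` param."""

    def __init__(self):
        self._defaultParamMap = {"factor": 1}  # default values
        self._paramMap = {}                    # values set explicitly by the user

    def getOrDefault(self, name):
        """Return the user-set value if present, otherwise the default."""
        if name in self._paramMap:
            return self._paramMap[name]
        return self._defaultParamMap[name]

    def copy(self, extra=None):
        """Return a copy whose param maps are independent of the original's."""
        that = copy.copy(self)
        that._defaultParamMap = dict(self._defaultParamMap)
        that._paramMap = dict(self._paramMap)
        that._paramMap.update(extra or {})
        return that

    def _fit(self, dataset):
        factor = self.getOrDefault("factor")
        return [value * factor for value in dataset]

    def fit(self, dataset, params=None):
        """Fit once for a dict (or None), or once per entry for a list of param maps."""
        if isinstance(params, (list, tuple)):
            return [self.fit(dataset, param_map) for param_map in params]
        if params:
            # Apply the overrides to a copy so the original stage is untouched.
            return self.copy(params)._fit(dataset)
        return self._fit(dataset)
```

Because overrides are applied to a copy, calling `fit` with a list of param maps returns one result per map and leaves the original stage's params unchanged.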
      
      jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6088 from mengxr/SPARK-7380 and squashes the following commits:
      
      413c463 [Xiangrui Meng] remove unnecessary doc
      4159f35 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
      611c719 [Xiangrui Meng] fix python style
      68862b8 [Xiangrui Meng] update _java_obj initialization
      927ad19 [Xiangrui Meng] fix ml/tests.py
      0138fc3 [Xiangrui Meng] update feature transformers and fix a bug in RegexTokenizer
      9ca44fb [Xiangrui Meng] simplify Java wrappers and add tests
      c7d84ef [Xiangrui Meng] update ml/tests.py to test copy params
      7e0d27f [Xiangrui Meng] merge master
      46840fb [Xiangrui Meng] update wrappers
      b6db1ed [Xiangrui Meng] update all self.paramMap to self._paramMap
      46cb6ed [Xiangrui Meng] merge master
      a163413 [Xiangrui Meng] fix style
      1042e80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
      9630eae [Xiangrui Meng] fix Identifiable._randomUID
      13bd70a [Xiangrui Meng] update ml/tests.py
      64a536c [Xiangrui Meng] use _fit/_transform/_evaluate to simplify the impl
      02abf13 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into copyable-python
      66ce18c [Joseph K. Bradley] some cleanups before sending to Xiangrui
      7431272 [Joseph K. Bradley] Rebased with master
  26. May 14, 2015
    • [SPARK-7619] [PYTHON] fix docstring signature · 48fc38f5
      Xiangrui Meng authored
      Just realized that we need `\` at the end of the docstring. brkyvz
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6161 from mengxr/SPARK-7619 and squashes the following commits:
      
      e44495f [Xiangrui Meng] fix docstring signature
  27. May 13, 2015
    • [SPARK-7593] [ML] Python Api for ml.feature.Bucketizer · 5db18ba6
      Burak Yavuz authored
      Added `ml.feature.Bucketizer` to PySpark.
      
      cc mengxr
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #6124 from brkyvz/ml-bucket and squashes the following commits:
      
      05285be [Burak Yavuz] added sphinx doc
      6abb6ed [Burak Yavuz] added support for Bucketizer
  28. May 08, 2015
    • [SPARK-7383] [ML] Feature Parity in PySpark for ml.features · f5ff4a84
      Burak Yavuz authored
      Implemented Python wrappers for Scala functions that did not yet exist in PySpark's `ml.feature`.
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #5991 from brkyvz/ml-feat-PR and squashes the following commits:
      
      adcca55 [Burak Yavuz] add regex tokenizer to __all__
      b91cb44 [Burak Yavuz] addressed comments
      bd39fd2 [Burak Yavuz] remove addition
      b82bd7c [Burak Yavuz] Parity in PySpark for ml.features
  29. May 07, 2015
    • [SPARK-6948] [MLLIB] compress vectors in VectorAssembler · e43803b8
      Xiangrui Meng authored
      The compression is based on storage. brkyvz
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5985 from mengxr/SPARK-6948 and squashes the following commits:
      
      df56a00 [Xiangrui Meng] update python tests
      6d90d45 [Xiangrui Meng] compress vectors in VectorAssembler
    • [SPARK-7388] [SPARK-7383] wrapper for VectorAssembler in Python · 9e2ffb13
      Burak Yavuz authored
      The wrapper required implementing `ArrayParam`, because `Array[T]` is hard to obtain from Python. `ArrayParam` has an extra internal function, `wCast`, that obtains `Array[T]` from `Seq[T]`.
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5930 from brkyvz/ml-feat and squashes the following commits:
      
      73e745f [Burak Yavuz] Merge pull request #3 from mengxr/SPARK-7388
      c221db9 [Xiangrui Meng] overload StringArrayParam.w
      c81072d [Burak Yavuz] addressed comments
      99c2ebf [Burak Yavuz] add to python_shared_params
      39ecb07 [Burak Yavuz] fix scalastyle
      7f7ea2a [Burak Yavuz] [SPARK-7388][SPARK-7383] wrapper for VectorAssembler in Python
  30. Apr 16, 2015
    • [SPARK-4897] [PySpark] Python 3 support · 04e44b37
      Davies Liu authored
      This PR updates PySpark to support Python 3 (tested with 3.4).
      
      Known issue: unpickling arrays from Pyrolite is broken in Python 3, so those tests are skipped.
      
      TODO: ec2/spark-ec2.py is not fully tested with python3.
      
      Author: Davies Liu <davies@databricks.com>
      Author: twneale <twneale@gmail.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5173 from davies/python3 and squashes the following commits:
      
      d7d6323 [Davies Liu] fix tests
      6c52a98 [Davies Liu] fix mllib test
      99e334f [Davies Liu] update timeout
      b716610 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      cafd5ec [Davies Liu] adddress comments from @mengxr
      bf225d7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      179fc8d [Davies Liu] tuning flaky tests
      8c8b957 [Davies Liu] fix ResourceWarning in Python 3
      5c57c95 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      4006829 [Davies Liu] fix test
      2fc0066 [Davies Liu] add python3 path
      71535e9 [Davies Liu] fix xrange and divide
      5a55ab4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      125f12c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ed498c8 [Davies Liu] fix compatibility with python 3
      820e649 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      e8ce8c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ad7c374 [Davies Liu] fix mllib test and warning
      ef1fc2f [Davies Liu] fix tests
      4eee14a [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      20112ff [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      59bb492 [Davies Liu] fix tests
      1da268c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ca0fdd3 [Davies Liu] fix code style
      9563a15 [Davies Liu] add imap back for python 2
      0b1ec04 [Davies Liu] make python examples work with Python 3
      d2fd566 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      a716d34 [Davies Liu] test with python 3.4
      f1700e8 [Davies Liu] fix test in python3
      671b1db [Davies Liu] fix test in python3
      692ff47 [Davies Liu] fix flaky test
      7b9699f [Davies Liu] invalidate import cache for Python 3.3+
      9c58497 [Davies Liu] fix kill worker
      309bfbf [Davies Liu] keep compatibility
      5707476 [Davies Liu] cleanup, fix hash of string in 3.3+
      8662d5b [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      f53e1f0 [Davies Liu] fix tests
      70b6b73 [Davies Liu] compile ec2/spark_ec2.py in python 3
      a39167e [Davies Liu] support customize class in __main__
      814c77b [Davies Liu] run unittests with python 3
      7f4476e [Davies Liu] mllib tests passed
      d737924 [Davies Liu] pass ml tests
      375ea17 [Davies Liu] SQL tests pass
      6cc42a9 [Davies Liu] rename
      431a8de [Davies Liu] streaming tests pass
      78901a7 [Davies Liu] fix hash of serializer in Python 3
      24b2f2e [Davies Liu] pass all RDD tests
      35f48fe [Davies Liu] run future again
      1eebac2 [Davies Liu] fix conflict in ec2/spark_ec2.py
      6e3c21d [Davies Liu] make cloudpickle work with Python3
      2fb2db3 [Josh Rosen] Guard more changes behind sys.version; still doesn't run
      1aa5e8f [twneale] Turned out `pickle.DictionaryType is dict` == True, so swapped it out
      7354371 [twneale] buffer --> memoryview  I'm not super sure if this a valid change, but the 2.7 docs recommend using memoryview over buffer where possible, so hoping it'll work.
      b69ccdf [twneale] Uses the pure python pickle._Pickler instead of c-extension _pickle.Pickler. It appears pyspark 2.7 uses the pure python pickler as well, so this shouldn't degrade pickling performance (?).
      f40d925 [twneale] xrange --> range
      e104215 [twneale] Replaces 2.7 types.InstsanceType with 3.4 `object`....could be horribly wrong depending on how types.InstanceType is used elsewhere in the package--see http://bugs.python.org/issue8206
      79de9d0 [twneale] Replaces python2.7 `file` with 3.4 _io.TextIOWrapper
      2adb42d [Josh Rosen] Fix up some import differences between Python 2 and 3
      854be27 [Josh Rosen] Run `futurize` on Python code:
      7c5b4ce [Josh Rosen] Remove Python 3 check in shell.py.
    • [SPARK-6893][ML] default pipeline parameter handling in python · 57cd1e86
      Xiangrui Meng authored
      Same as #5431 but for Python. jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5534 from mengxr/SPARK-6893 and squashes the following commits:
      
      d3b519b [Xiangrui Meng] address comments
      ebaccc6 [Xiangrui Meng] style update
      fce244e [Xiangrui Meng] update explainParams with test
      4d6b07a [Xiangrui Meng] add tests
      5294500 [Xiangrui Meng] update default param handling in python
  31. Apr 08, 2015
    • [SPARK-6781] [SQL] use sqlContext in python shell · 6ada4f6f
      Davies Liu authored
      Use `sqlContext` in the PySpark shell to make it consistent with the SQL programming guide. `sqlCtx` is also kept for compatibility.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5425 from davies/sqlCtx and squashes the following commits:
      
      af67340 [Davies Liu] sqlCtx -> sqlContext
      15a278f [Davies Liu] use sqlContext in python shell
  32. Feb 20, 2015
    • [SPARK-5867] [SPARK-5892] [doc] [ml] [mllib] Doc cleanups for 1.3 release · 4a17eedb
      Joseph K. Bradley authored
      For SPARK-5867:
      * The spark.ml programming guide needs to be updated to use the new SQL DataFrame API instead of the old SchemaRDD API.
      * It should also include Python examples now.
      
      For SPARK-5892:
      * Fix Python docs
      * Various other cleanups
      
      BTW, I accidentally merged this with master.  If you want to compile it on your own, use this branch which is based on spark/branch-1.3 and cherry-picks the commits from this PR: [https://github.com/jkbradley/spark/tree/doc-review-1.3-check]
      
      CC: mengxr  (ML),  davies  (Python docs)
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #4675 from jkbradley/doc-review-1.3 and squashes the following commits:
      
      f191bb0 [Joseph K. Bradley] small cleanups
      e786efa [Joseph K. Bradley] small doc corrections
      6b1ab4a [Joseph K. Bradley] fixed python lint test
      946affa [Joseph K. Bradley] Added sample data for ml.MovieLensALS example.  Changed spark.ml Java examples to use DataFrames API instead of sql()
      da81558 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into doc-review-1.3
      629dbf5 [Joseph K. Bradley] Updated based on code review: * made new page for old migration guides * small fixes * moved inherit_doc in python
      b9df7c4 [Joseph K. Bradley] Small cleanups: toDF to toDF(), adding s for string interpolation
      34b067f [Joseph K. Bradley] small doc correction
      da16aef [Joseph K. Bradley] Fixed python mllib docs
      8cce91c [Joseph K. Bradley] GMM: removed old imports, added some doc
      695f3f6 [Joseph K. Bradley] partly done trying to fix inherit_doc for class hierarchies in python docs
      a72c018 [Joseph K. Bradley] made ChiSqTestResult appear in python docs
      b05a80d [Joseph K. Bradley] organize imports. doc cleanups
      e572827 [Joseph K. Bradley] updated programming guide for ml and mllib
  33. Feb 15, 2015
    • [SPARK-5769] Set params in constructors and in setParams in Python ML pipelines · cd4a1536
      Xiangrui Meng authored
      This PR allows Python users to set params in constructors and in `setParams`, where we use the `keyword_only` decorator to force keyword arguments. The trade-off is discussed in the design doc of SPARK-4586.
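A decorator that forces keyword arguments might look roughly like the sketch below. It is a simplified illustration and may differ from the decorator that ships with PySpark. `ExampleTokenizer` is a hypothetical class, and the choice to ignore `None` values in `setParams` is an assumption.

```python
from functools import wraps


def keyword_only(func):
    """Reject positional arguments other than `self`."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        if len(args) > 1:  # args[0] is self; anything more is positional
            raise TypeError("Method %s forces keyword arguments." % func.__name__)
        return func(*args, **kwargs)
    return wrapper


class ExampleTokenizer(object):
    """Hypothetical stage whose params can be set at construction or later."""

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        self.inputCol = inputCol
        self.outputCol = outputCol

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None):
        # Only overwrite params that were actually passed.
        if inputCol is not None:
            self.inputCol = inputCol
        if outputCol is not None:
            self.outputCol = outputCol
        return self
```

Calling `ExampleTokenizer(inputCol="text")` works, while `ExampleTokenizer("text")` raises `TypeError`. Adding or reordering params in a later version then cannot silently change the meaning of existing calls.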
      
      Generated doc:
      ![screen shot 2015-02-12 at 3 06 58 am](https://cloud.githubusercontent.com/assets/829644/6166491/9cfcd06a-b265-11e4-99ea-473d866634fc.png)
      
      CC: davies rxin
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4564 from mengxr/py-pipeline-kw and squashes the following commits:
      
      fedf720 [Xiangrui Meng] use toDF
      d565f2c [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into py-pipeline-kw
      cbc15d3 [Xiangrui Meng] fix style
      5032097 [Xiangrui Meng] update pipeline signature
      950774e [Xiangrui Meng] simplify keyword_only and update constructor/setParams signatures
      fdde5fc [Xiangrui Meng] fix style
      c9384b8 [Xiangrui Meng] fix sphinx doc
      8e59180 [Xiangrui Meng] add setParams and make constructors take params, where we force keyword args
  34. Jan 28, 2015
    • [SPARK-4586][MLLIB] Python API for ML pipeline and parameters · e80dc1c5
      Xiangrui Meng authored
      This PR adds a Python API for ML pipelines and parameters. The design doc can be found on the JIRA page. It includes transformers and an estimator to demo the simple text classification example code.
      
      TODO:
      - [x] handle parameters in LRModel
      - [x] unit tests
      - [x] missing some docs
      
      CC: davies jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4151 from mengxr/SPARK-4586 and squashes the following commits:
      
      415268e [Xiangrui Meng] remove inherit_doc from __init__
      edbd6fe [Xiangrui Meng] move Identifiable to ml.util
      44c2405 [Xiangrui Meng] Merge pull request #2 from davies/ml
      dd1256b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4586
      14ae7e2 [Davies Liu] fix docs
      54ca7df [Davies Liu] fix tests
      78638df [Davies Liu] Merge branch 'SPARK-4586' of github.com:mengxr/spark into ml
      fc59a02 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4586
      1dca16a [Davies Liu] refactor
      090b3a3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into ml
      0882513 [Xiangrui Meng] update doc style
      a4f4dbf [Xiangrui Meng] add unit test for LR
      7521d1c [Xiangrui Meng] add unit tests to HashingTF and Tokenizer
      ba0ba1e [Xiangrui Meng] add unit tests for pipeline
      0586c7b [Xiangrui Meng] add more comments to the example
      5153cff [Xiangrui Meng] simplify java models
      036ca04 [Xiangrui Meng] gen numFeatures
      46fa147 [Xiangrui Meng] update mllib/pom.xml to include python files in the assembly
      1dcc17e [Xiangrui Meng] update code gen and make param appear in the doc
      f66ba0c [Xiangrui Meng] make params a property
      d5efd34 [Xiangrui Meng] update doc conf and move embedded param map to instance attribute
      f4d0fe6 [Xiangrui Meng] use LabeledDocument and Document in example
      05e3e40 [Xiangrui Meng] update example
      d3e8dbe [Xiangrui Meng] more docs optimize pipeline.fit impl
      56de571 [Xiangrui Meng] fix style
      d0c5bb8 [Xiangrui Meng] a working copy
      bce72f4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4586
      17ecfb9 [Xiangrui Meng] code gen for shared params
      d9ea77c [Xiangrui Meng] update doc
      c18dca1 [Xiangrui Meng] make the example working
      dadd84e [Xiangrui Meng] add base classes and docs
      a3015cf [Xiangrui Meng] add Estimator and Transformer
      46eea43 [Xiangrui Meng] a pipeline in python
      33b68e0 [Xiangrui Meng] a working LR