  1. Apr 03, 2017
    • [SPARK-19969][ML] Imputer doc and example · 4d28e843
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      
      Add docs and examples for spark.ml.feature.Imputer. Currently, Scala and Java examples are included. A Python example will be added after https://github.com/apache/spark/pull/17316
      
      ## How was this patch tested?
      
      local doc generation and example execution
      
      Author: Yuhao Yang <yuhao.yang@intel.com>
      
      Closes #17324 from hhbyyh/imputerdoc.
      4d28e843
  2. Mar 07, 2017
    • [SPARK-17498][ML] StringIndexer enhancement for handling unseen labels · 4a9034b1
      VinceShieh authored
      ## What changes were proposed in this pull request?
      This PR enhances the ML StringIndexer.
      Before this PR, StringIndexer supported only the "skip" and "error" options for handling unseen records.
      However, unseen records may still be useful, and in certain use cases users would like to keep
      the unseen labels. This PR enables StringIndexer to keep unseen labels by assigning them the
      index numLabels.
      
      Before:
      StringIndexer().setHandleInvalid("skip")
      StringIndexer().setHandleInvalid("error")
      After (adds a third option, "keep"):
      StringIndexer().setHandleInvalid("keep")
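
      As a rough illustration of the three options, here is a minimal plain-Scala sketch. It is not Spark's implementation: `UnseenLabelSketch`, `fit`, and `transform` are hypothetical names, and the alphabetical tie-break is an assumption made only to keep the ordering deterministic.

      ```scala
      // Plain-Scala sketch of the "skip" / "error" / "keep" semantics; not Spark code.
      object UnseenLabelSketch {
        // Index labels by descending frequency. Ties are broken alphabetically
        // (an assumption of this sketch) so the result is deterministic.
        def fit(labels: Seq[String]): Map[String, Int] =
          labels.groupBy(identity).toSeq
            .sortBy { case (label, occurrences) => (-occurrences.size, label) }
            .map(_._1)
            .zipWithIndex
            .toMap

        // Returns None when the row is dropped ("skip"), throws on "error",
        // and maps every unseen label to numLabels on "keep".
        def transform(index: Map[String, Int], label: String, handleInvalid: String): Option[Int] =
          index.get(label) match {
            case Some(i) => Some(i)
            case None => handleInvalid match {
              case "keep"  => Some(index.size)
              case "skip"  => None
              case "error" => throw new IllegalArgumentException(s"Unseen label: $label")
              case other   => throw new IllegalArgumentException(s"Unknown option: $other")
            }
          }
      }
      ```

      With "keep", all unseen labels share one extra index equal to the number of fitted labels, so downstream stages still see a fixed index range.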
      
      ## How was this patch tested?
      Test added in StringIndexerSuite
      
      Signed-off-by: VinceShieh <vincent.xie@intel.com>
      
      Author: VinceShieh <vincent.xie@intel.com>
      
      Closes #16883 from VinceShieh/spark-17498.
      4a9034b1
  3. Feb 15, 2017
    • [SPARK-18080][ML][PYTHON] Python API & Examples for Locality Sensitive Hashing · 08c1972a
      Yun Ni authored
      ## What changes were proposed in this pull request?
      This pull request adds the Python API and examples for LSH. The API changes are based on yanboliang's PR #15768, with conflicts resolved and updates made to match API changes in the Scala API. The examples are consistent with the Scala examples of MinHashLSH and BucketedRandomProjectionLSH.
      
      ## How was this patch tested?
      API and examples are tested using spark-submit:
      `bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py`
      `bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh.py`
      
      User guide changes are generated and manually inspected:
      `SKIP_API=1 jekyll build`
      
      Author: Yun Ni <yunn@uber.com>
      Author: Yanbo Liang <ybliang8@gmail.com>
      Author: Yunni <Euler57721@gmail.com>
      
      Closes #16715 from Yunni/spark-18080.
      08c1972a
  4. Jan 10, 2017
    • [SPARK-17645][MLLIB][ML][FOLLOW-UP] document minor change · 32286ba6
      Peng, Meng authored
      ## What changes were proposed in this pull request?
      Add FDR test case in ml/feature/ChiSqSelectorSuite.
      Improve some comments in the code.
      This is a follow-up PR for #15212.
      
      ## How was this patch tested?
      Unit tests.
      
      Author: Peng, Meng <peng.meng@intel.com>
      
      Closes #16434 from mpjlu/fdr_fwe_update.
      32286ba6
  5. Jan 04, 2017
    • [MINOR][DOCS] Remove consecutive duplicated words/typo in Spark Repo · a1e40b1f
      Niranjan Padmanabhan authored
      ## What changes were proposed in this pull request?
      There are many locations in the Spark repo where the same word occurs consecutively. Sometimes they are appropriately placed, but many times they are not. This PR removes the inappropriately duplicated words.
      
      ## How was this patch tested?
      N/A since only docs or comments were updated.
      
      Author: Niranjan Padmanabhan <niranjan.padmanabhan@gmail.com>
      
      Closes #16455 from neurons/np.structure_streaming_doc.
      a1e40b1f
  6. Dec 28, 2016
    • [SPARK-17645][MLLIB][ML] add feature selector method based on: False Discovery... · 79ff8536
      Peng authored
      [SPARK-17645][MLLIB][ML] add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE)
      
      ## What changes were proposed in this pull request?
      
      Univariate feature selection works by selecting the best features based on univariate statistical tests.
      FDR and FWE are popular univariate statistical tests for feature selection.
      In 2005, the Benjamini and Hochberg paper on FDR was identified as one of the 25 most-cited statistical papers. In this PR, FDR selection uses the Benjamini-Hochberg procedure. https://en.wikipedia.org/wiki/False_discovery_rate
      In statistics, FWE is the probability of making one or more false discoveries, or type I errors, among all the hypotheses when performing multiple hypothesis tests.
      https://en.wikipedia.org/wiki/Family-wise_error_rate
      
      This PR adds FDR and FWE methods to ChiSqSelector, similar to the implementation in scikit-learn.
      http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
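
      The two selection rules can be sketched in plain Scala as follows. This is an illustration under stated assumptions, not the code from this PR: `FdrFweSketch`, `selectFwe`, and `selectFdr` are hypothetical names, the FWE rule is assumed to be a Bonferroni-style cutoff of alpha / m, and the inputs are per-feature p-values that have already been computed.

      ```scala
      // Plain-Scala sketch of FWE and FDR feature selection over p-values; not Spark code.
      object FdrFweSketch {
        // FWE control (assumed Bonferroni-style): keep features whose p-value
        // is below alpha / m, where m is the number of features tested.
        def selectFwe(pValues: Seq[Double], alpha: Double): Seq[Int] = {
          val m = pValues.size
          pValues.zipWithIndex.collect { case (p, i) if p < alpha / m => i }
        }

        // FDR control via Benjamini-Hochberg: sort p-values ascending, find the
        // largest 1-based rank k with p(k) <= alpha * k / m, and keep the k
        // features with the smallest p-values.
        def selectFdr(pValues: Seq[Double], alpha: Double): Seq[Int] = {
          val m = pValues.size
          val sorted = pValues.zipWithIndex.sortBy(_._1)
          val lastPassing = sorted.zipWithIndex.lastIndexWhere { case ((p, _), rank) =>
            p <= alpha * (rank + 1) / m
          }
          // lastPassing is -1 when no p-value passes, so take(0) selects nothing.
          sorted.take(lastPassing + 1).map(_._2).sorted
        }
      }
      ```

      Note the step-up behavior of Benjamini-Hochberg: a feature whose p-value misses the threshold at its own rank is still selected when a feature at a later rank passes.
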
      ## How was this patch tested?
      
      Unit tests will be added soon.
      
      Author: Peng <peng.meng@intel.com>
      Author: Peng, Meng <peng.meng@intel.com>
      
      Closes #15212 from mpjlu/fdr_fwe.
      79ff8536
  7. Dec 03, 2016
    • [SPARK-18081][ML][DOCS] Add user guide for Locality Sensitive Hashing(LSH) · 34777184
      Yunni authored
      ## What changes were proposed in this pull request?
      The user guide for LSH is added to ml-features.md, with several Scala/Java examples in spark-examples.
      
      ## How was this patch tested?
      Doc has been generated through Jekyll, and checked through manual inspection.
      
      Author: Yunni <Euler57721@gmail.com>
      Author: Yun Ni <yunn@uber.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      Author: Yun Ni <Euler57721@gmail.com>
      
      Closes #15795 from Yunni/SPARK-18081-lsh-guide.
      34777184
  8. Nov 30, 2016
    • [SPARK-18318][ML] ML, Graph 2.1 QA: API: New Scala APIs, docs · 60022bfd
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      API review for 2.1, except `LSH`-related classes, which are still under development.
      
      ## How was this patch tested?
      Only doc changes, no new tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16009 from yanboliang/spark-18318.
      60022bfd
  9. Nov 17, 2016
    • [SPARK-18480][DOCS] Fix wrong links for ML guide docs · cdaf4ce9
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      1. There are two `[Graph.partitionBy]` links in `graphx-programming-guide.md`; the first one had no effect.
      2. `DataFrame`, `Transformer`, `Pipeline` and `Parameter` in `ml-pipeline.md` were linked to `ml-guide.html` by mistake.
      3. `PythonMLLibAPI` in `mllib-linear-methods.md` was not accessible, because class `PythonMLLibAPI` is private.
      4. Other link updates.
      ## How was this patch tested?
      Manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #15912 from zhengruifeng/md_fix.
      cdaf4ce9
  10. Nov 08, 2016
  11. Nov 01, 2016
    • [SPARK-18088][ML] Various ChiSqSelector cleanups · 91c33a0c
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      - Renamed kbest to numTopFeatures
      - Renamed alpha to fpr
      - Added missing Since annotations
      - Doc cleanups
      ## How was this patch tested?
      
      Added new standardized unit tests for spark.ml.
      Improved existing unit test coverage a bit.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #15647 from jkbradley/chisqselector-follow-ups.
      91c33a0c
  12. Oct 27, 2016
    • [SPARK-17219][ML] enhanced NaN value handling in Bucketizer · 0b076d4c
      VinceShieh authored
      ## What changes were proposed in this pull request?
      
      This PR enhances the earlier PR with commit ID 57dc326b.
      NaN is a special value that is commonly treated as invalid. However, there are cases where NaN values are also valuable and therefore need special handling. This PR gives users three options for handling NaN values: reserve an extra bucket for them, remove them, or report an error, by setting handleNaN to "keep", "skip", or "error" (the default), respectively.
      
      Before:
      val bucketizer: Bucketizer = new Bucketizer()
        .setInputCol("feature")
        .setOutputCol("result")
        .setSplits(splits)
      After:
      val bucketizer: Bucketizer = new Bucketizer()
        .setInputCol("feature")
        .setOutputCol("result")
        .setSplits(splits)
        .setHandleNaN("keep")
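
      To make the three handleNaN options concrete, here is a minimal plain-Scala sketch of the bucket assignment. It is an illustration only, not the Bucketizer source: `NaNBucketSketch` and `bucket` are hypothetical names, and splits are assumed to be strictly increasing.

      ```scala
      // Plain-Scala sketch of NaN handling during bucket assignment; not Spark code.
      object NaNBucketSketch {
        // splits define splits.length - 1 regular buckets of the form [lower, upper).
        // Returns None when the value is dropped under "skip".
        def bucket(value: Double, splits: Array[Double], handleNaN: String): Option[Int] = {
          val numBuckets = splits.length - 1
          if (value.isNaN) handleNaN match {
            case "keep" => Some(numBuckets) // extra bucket after the regular ones
            case "skip" => None
            case _      => throw new IllegalArgumentException("NaN value found")
          } else if (value == splits.last) {
            Some(numBuckets - 1) // the last bucket also includes its upper bound
          } else {
            val i = splits.indexWhere(_ > value) // first split strictly above value
            if (i <= 0) throw new IllegalArgumentException(s"$value is outside the splits")
            Some(i - 1)
          }
        }
      }
      ```

      With "keep", NaN values land in the bucket with index splits.length - 1, one past the last regular bucket.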
      
      ## How was this patch tested?
      Tests added in QuantileDiscretizerSuite, BucketizerSuite and DataFrameStatSuite
      
      Signed-off-by: VinceShieh <vincent.xie@intel.com>
      
      Author: VinceShieh <vincent.xie@intel.com>
      Author: Vincent Xie <vincent.xie@intel.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #15428 from VinceShieh/spark-17219_followup.
      0b076d4c
  13. Sep 28, 2016
  14. Sep 21, 2016
    • [SPARK-17219][ML] Add NaN value handling in Bucketizer · 57dc326b
      VinceShieh authored
      ## What changes were proposed in this pull request?
      This PR fixes an issue that occurs when Bucketizer handles a dataset containing NaN values.
      NaN values can sometimes be useful to users, so in these cases Bucketizer should
      reserve one extra bucket for NaN values instead of throwing an exception.
      Before:
      ```
      Bucketizer.transform on NaN value threw an illegal exception.
      ```
      After:
      ```
      NaN values will be grouped in an extra bucket.
      ```
      ## How was this patch tested?
      New test cases added in `BucketizerSuite`.
      Signed-off-by: VinceShieh <vincent.xie@intel.com>
      
      Author: VinceShieh <vincent.xie@intel.com>
      
      Closes #14858 from VinceShieh/spark-17219.
      57dc326b
  15. Aug 27, 2016
    • [SPARK-17001][ML] Enable standardScaler to standardize sparse vectors when withMean=True · e07baf14
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Allow centering / mean scaling of sparse vectors in StandardScaler, if requested. This is for compatibility with `VectorAssembler` in common usages.
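
      A minimal plain-Scala sketch of why centering matters for sparse input: subtracting the mean turns implicit zeros into non-zero values, so the result is generally dense. This is not the StandardScaler source; `SparseCenteringSketch`, `Sparse`, and `standardize` are hypothetical names, and mapping zero-variance features to 0.0 is an assumption of the sketch.

      ```scala
      // Plain-Scala sketch of standardizing a sparse vector with centering; not Spark code.
      object SparseCenteringSketch {
        // A sparse vector here is just its size plus the non-zero entries (index -> value).
        final case class Sparse(size: Int, entries: Map[Int, Double])

        // Subtract the per-feature mean, then divide by the per-feature standard deviation.
        // Implicit zeros become -mean / std, so the result is returned as a dense array.
        def standardize(v: Sparse, mean: Array[Double], std: Array[Double]): Array[Double] =
          Array.tabulate(v.size) { i =>
            val centered = v.entries.getOrElse(i, 0.0) - mean(i)
            // Assumption: features with zero standard deviation map to 0.0.
            if (std(i) == 0.0) 0.0 else centered / std(i)
          }
      }
      ```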
      
      ## How was this patch tested?
      
      Jenkins tests, including new cases to reflect the new behavior.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14663 from srowen/SPARK-17001.
      e07baf14
  16. Aug 24, 2016
  17. Jul 25, 2016
  18. Jul 15, 2016
    • [SPARK-14817][ML][MLLIB][DOC] Made DataFrame-based API primary in MLlib guide · 5ffd5d38
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Made DataFrame-based API primary
      * Spark doc menu bar and other places now link to ml-guide.html, not mllib-guide.html
      * mllib-guide.html keeps RDD-specific list of features, with a link at the top redirecting people to ml-guide.html
      * ml-guide.html includes a "maintenance mode" announcement about the RDD-based API
        * **Reviewers: please check this carefully**
      * (minor) Titles for DF API no longer include "- spark.ml" suffix.  Titles for RDD API have "- RDD-based API" suffix
      * Moved migration guide to ml-guide from mllib-guide
        * Also moved past guides from mllib-migration-guides to ml-migration-guides, with a redirect link on mllib-migration-guides
        * **Reviewers**: I did not change any of the content of the migration guides.
      
      Reorganized DataFrame-based guide:
      * ml-guide.html mimics the old mllib-guide.html page in terms of content: overview, migration guide, etc.
      * Moved Pipeline description into ml-pipeline.html and moved tuning into ml-tuning.html
        * **Reviewers**: I did not change the content of these guides, except some intro text.
      * Sidebar remains the same, but with pipeline and tuning sections added
      
      Other:
      * ml-classification-regression.html: Moved text about linear methods to new section in page
      
      ## How was this patch tested?
      
      Generated docs locally
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #14213 from jkbradley/ml-guide-2.0.
      5ffd5d38
  19. Jun 24, 2016
  20. Jun 21, 2016
  21. May 20, 2016
    • [SPARK-15394][ML][DOCS] User guide typos and grammar audit · 5e203505
      sethah authored
      ## What changes were proposed in this pull request?
      
      Correct some typos and incorrectly worded sentences.
      
      ## How was this patch tested?
      
      Doc changes only.
      
      Note that many of these changes were identified by whomfire01
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #13180 from sethah/ml_guide_audit.
      5e203505
  22. May 17, 2016
    • [SPARK-15182][ML] Copy MLlib doc to ML: ml.feature.tf, idf · 3308a862
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      
      We should now begin copying algorithm details from the spark.mllib guide to spark.ml as needed, rather than just linking back to the corresponding algorithms in the spark.mllib user guide.
      
      ## How was this patch tested?
      
      manual review for doc.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      Author: Yuhao Yang <yuhao.yang@intel.com>
      
      Closes #12957 from hhbyyh/tfidfdoc.
      3308a862
  23. May 07, 2016
    • [DOC][MINOR] Fixed minor errors in feature.ml user guide doc · 5d188a69
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      Fixed some minor errors found when reviewing feature.ml user guide
      
      ## How was this patch tested?
      built docs locally
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #12940 from BryanCutler/feature.ml-doc_fixes-DOCS-MINOR.
      5d188a69
  24. May 06, 2016
  25. Apr 26, 2016
    • [SPARK-14514][DOC] Add python example for VectorSlicer · e88476c8
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Add the missing Python example for VectorSlicer.
      
      ## How was this patch tested?
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #12282 from zhengruifeng/vecslicer_pe.
      e88476c8
  26. Apr 20, 2016
    • [SPARK-14635][ML] Documentation and Examples for TF-IDF only refer to HashingTF · ed9d8038
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      
      Currently, the docs for TF-IDF only refer to using HashingTF with IDF. However, CountVectorizer can also be used. We should probably amend the user guide and examples to show this.
      
      ## How was this patch tested?
      
      unit tests and doc generation
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #12454 from hhbyyh/tfdoc.
      ed9d8038
  27. Apr 18, 2016
  28. Apr 13, 2016
  29. Apr 09, 2016
  30. Mar 11, 2016
  31. Feb 22, 2016
  32. Jan 25, 2016
  33. Dec 14, 2015
  34. Dec 12, 2015
  35. Dec 11, 2015
    • [SPARK-12217][ML] Document invalid handling for StringIndexer · aea676ca
      BenFradet authored
      Added a paragraph regarding StringIndexer#setHandleInvalid to the ml-features documentation.
      
      I wonder if I should also add a snippet to the code example, input welcome.
      
      Author: BenFradet <benjamin.fradet@gmail.com>
      
      Closes #10257 from BenFradet/SPARK-12217.
      aea676ca
  36. Dec 10, 2015
    • [SPARK-12212][ML][DOC] Clarifies the difference between spark.ml, spark.mllib... · 2ecbe02d
      Timothy Hunter authored
      [SPARK-12212][ML][DOC] Clarifies the difference between spark.ml, spark.mllib and mllib in the documentation.
      
      Replaces a number of occurrences of `MLlib` in the documentation that were meant to refer to the `spark.mllib` package instead. It should clarify for new users the difference between `spark.mllib` (the package) and MLlib (the umbrella project for ML in Spark).
      
      It also removes some files that I forgot to delete with #10207.
      
      Author: Timothy Hunter <timhunter@databricks.com>
      
      Closes #10234 from thunterdb/12212.
      2ecbe02d
  37. Dec 09, 2015
  38. Dec 08, 2015