  1. Jul 15, 2016
    • [SPARK-14817][ML][MLLIB][DOC] Made DataFrame-based API primary in MLlib guide · 5ffd5d38
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Made DataFrame-based API primary
      * Spark doc menu bar and other places now link to ml-guide.html, not mllib-guide.html
      * mllib-guide.html keeps RDD-specific list of features, with a link at the top redirecting people to ml-guide.html
      * ml-guide.html includes a "maintenance mode" announcement about the RDD-based API
        * **Reviewers: please check this carefully**
      * (minor) Titles for DF API no longer include "- spark.ml" suffix.  Titles for RDD API have "- RDD-based API" suffix
      * Moved migration guide to ml-guide from mllib-guide
        * Also moved past guides from mllib-migration-guides to ml-migration-guides, with a redirect link on mllib-migration-guides
        * **Reviewers**: I did not change any of the content of the migration guides.
      
      Reorganized DataFrame-based guide:
      * ml-guide.html mimics the old mllib-guide.html page in terms of content: overview, migration guide, etc.
      * Moved Pipeline description into ml-pipeline.html and moved tuning into ml-tuning.html
        * **Reviewers**: I did not change the content of these guides, except some intro text.
      * Sidebar remains the same, but with pipeline and tuning sections added
      
      Other:
      * ml-classification-regression.html: Moved text about linear methods to new section in page
      
      ## How was this patch tested?
      
      Generated docs locally
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #14213 from jkbradley/ml-guide-2.0.
  2. Dec 10, 2015
    • [SPARK-12212][ML][DOC] Clarifies the difference between spark.ml, spark.mllib... · 2ecbe02d
      Timothy Hunter authored
      [SPARK-12212][ML][DOC] Clarifies the difference between spark.ml, spark.mllib and mllib in the documentation.
      
      Replaces a number of occurrences of `MLlib` in the documentation that were meant to refer to the `spark.mllib` package instead. It should clarify for new users the difference between `spark.mllib` (the package) and MLlib (the umbrella project for ML in Spark).

      It also removes some files that I forgot to delete with #10207.
      
      Author: Timothy Hunter <timhunter@databricks.com>
      
      Closes #10234 from thunterdb/12212.
  3. May 22, 2015
    • [SPARK-7830] [DOCS] [MLLIB] Adding logistic regression to the list of... · 63a5ce75
      Mike Dusenberry authored
      [SPARK-7830] [DOCS] [MLLIB] Adding logistic regression to the list of Multiclass Classification Supported Methods documentation
      
      Added logistic regression to the list of Multiclass Classification Supported Methods in the MLlib Classification and Regression documentation, as it was missing.
      
      Author: Mike Dusenberry <dusenberrymw@gmail.com>
      
      Closes #6357 from dusenberrymw/Add_LR_To_List_Of_Multiclass_Classification_Methods and squashes the following commits:
      
      7918650 [Mike Dusenberry] Updating broken link due to the "Binary Classification" section on the Linear Methods page being renamed to "Classification".
      3005dc2 [Mike Dusenberry] Adding logistic regression to the list of Multiclass Classification Supported Methods in the MLlib Classification and Regression documentation, as it was missing.
  4. Feb 25, 2015
    • [SPARK-5974] [SPARK-5980] [mllib] [python] [docs] Update ML guide with save/load, Python GBT · d20559b1
      Joseph K. Bradley authored
      * Add GradientBoostedTrees Python examples to ML guide
        * I ran these in the pyspark shell, and they worked.
      * Add save/load to examples in ML guide
      * Added note to Python docs about predict and transform not working within RDD actions and transformations in some cases (see SPARK-5981)
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #4750 from jkbradley/SPARK-5974 and squashes the following commits:
      
      c410e38 [Joseph K. Bradley] Added note to LabeledPoint about attributes
      bcae18b [Joseph K. Bradley] Added import of models for save/load examples in ml guide.  Fixed line length for tree.py, feature.py (but not other ML Pyspark files yet).
      6d81c3e [Joseph K. Bradley] completed python GBT examples
      9903309 [Joseph K. Bradley] Added note to python docs about predict,transform not working within RDD actions,transformations in some cases
      c7dfad8 [Joseph K. Bradley] Added model save/load to ML guide.  Added GBT examples to ML guide
  5. Feb 15, 2015
    • [MLLIB][SPARK-5502] User guide for isotonic regression · 61eb1267
      martinzapletal authored
      User guide for isotonic regression added to docs/mllib-regression.md including code examples for Scala and Java.
      
      Author: martinzapletal <zapletal-martin@email.cz>
      
      Closes #4536 from zapletal-martin/SPARK-5502 and squashes the following commits:
      
      67fe773 [martinzapletal] SPARK-5502 reworded model prediction rules to use more general language rather than the code/implementation specific terms
      80bd4c3 [martinzapletal] SPARK-5502 created docs page for isotonic regression, added links to the page, updated data and examples
      7d8136e [martinzapletal] SPARK-5502 Added documentation for Isotonic regression including examples for Scala and Java
      504b5c3 [martinzapletal] SPARK-5502 Added documentation for Isotonic regression including examples for Scala and Java
  6. Aug 12, 2014
    • SPARK-2830 [MLlib]: re-organize mllib documentation · c235b83e
      Ameet Talwalkar authored
      As per discussions with Xiangrui, I've reorganized and edited the mllib documentation.
      
      Author: Ameet Talwalkar <atalwalkar@gmail.com>
      
      Closes #1908 from atalwalkar/master and squashes the following commits:
      
      fe6938a [Ameet Talwalkar] made xiangruis suggested changes
      840028b [Ameet Talwalkar] made xiangruis suggested changes
      7ec366a [Ameet Talwalkar] reorganize and edit mllib documentation
  7. Apr 22, 2014
    • [SPARK-1506][MLLIB] Documentation improvements for MLlib 1.0 · 26d35f3f
      Xiangrui Meng authored
      Preview: http://54.82.240.23:4000/mllib-guide.html
      
      Table of contents:
      
      * Basics
        * Data types
        * Summary statistics
      * Classification and regression
        * linear support vector machine (SVM)
        * logistic regression
        * linear least squares, Lasso, and ridge regression
        * decision tree
        * naive Bayes
      * Collaborative Filtering
        * alternating least squares (ALS)
      * Clustering
        * k-means
      * Dimensionality reduction
        * singular value decomposition (SVD)
        * principal component analysis (PCA)
      * Optimization
        * stochastic gradient descent
        * limited-memory BFGS (L-BFGS)
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #422 from mengxr/mllib-doc and squashes the following commits:
      
      944e3a9 [Xiangrui Meng] merge master
      f9fda28 [Xiangrui Meng] minor
      9474065 [Xiangrui Meng] add alpha to ALS examples
      928e630 [Xiangrui Meng] initialization_mode -> initializationMode
      5bbff49 [Xiangrui Meng] add imports to labeled point examples
      c17440d [Xiangrui Meng] fix python nb example
      28f40dc [Xiangrui Meng] remove localhost:4000
      369a4d3 [Xiangrui Meng] Merge branch 'master' into mllib-doc
      7dc95cc [Xiangrui Meng] update linear methods
      053ad8a [Xiangrui Meng] add links to go back to the main page
      abbbf7e [Xiangrui Meng] update ALS argument names
      648283e [Xiangrui Meng] level down statistics
      14e2287 [Xiangrui Meng] add sample libsvm data and use it in guide
      8cd2441 [Xiangrui Meng] minor updates
      186ab07 [Xiangrui Meng] update section names
      6568d65 [Xiangrui Meng] update toc, level up lr and svm
      162ee12 [Xiangrui Meng] rename section names
      5c1e1b1 [Xiangrui Meng] minor
      8aeaba1 [Xiangrui Meng] wrap long lines
      6ce6a6f [Xiangrui Meng] add summary statistics to toc
      5760045 [Xiangrui Meng] claim beta
      cc604bf [Xiangrui Meng] remove classification and regression
      92747b3 [Xiangrui Meng] make section titles consistent
      e605dd6 [Xiangrui Meng] add LIBSVM loader
      f639674 [Xiangrui Meng] add python section to migration guide
      c82ffb4 [Xiangrui Meng] clean optimization
      31660eb [Xiangrui Meng] update linear algebra and stat
      0a40837 [Xiangrui Meng] first pass over linear methods
      1fc8271 [Xiangrui Meng] update toc
      906ed0a [Xiangrui Meng] add a python example to naive bayes
      5f0a700 [Xiangrui Meng] update collaborative filtering
      656d416 [Xiangrui Meng] update mllib-clustering
      86e143a [Xiangrui Meng] remove data types section from main page
      8d1a128 [Xiangrui Meng] move part of linear algebra to data types and add Java/Python examples
      d1b5cbf [Xiangrui Meng] merge master
      72e4804 [Xiangrui Meng] one pass over tree guide
      64f8995 [Xiangrui Meng] move decision tree guide to a separate file
      9fca001 [Xiangrui Meng] add first version of linear algebra guide
      53c9552 [Xiangrui Meng] update dependencies
      f316ec2 [Xiangrui Meng] add migration guide
      f399f6c [Xiangrui Meng] move linear-algebra to dimensionality-reduction
      182460f [Xiangrui Meng] add guide for naive Bayes
      137fd1d [Xiangrui Meng] re-organize toc
      a61e434 [Xiangrui Meng] update mllib's toc
  8. Apr 21, 2014
    • [SPARK-1439, SPARK-1440] Generate unified Scaladoc across projects and Javadocs · fc783847
      Matei Zaharia authored
      I used the sbt-unidoc plugin (https://github.com/sbt/sbt-unidoc) to create a unified Scaladoc of our public packages, and generate Javadocs as well. One limitation is that I haven't found an easy way to exclude packages in the Javadoc; there is an SBT task that identifies Java sources to run javadoc on, but it's been very difficult to modify it from outside to change what is set in the unidoc package. Some SBT-savvy people should help with this. The Javadoc site also lacks package-level descriptions and similar details, so we may want to look into that. We may decide not to post these right now if it's too limited compared to the Scala one.
      
      Example of the built doc site: http://people.csail.mit.edu/matei/spark-unified-docs/
      
      Author: Matei Zaharia <matei@databricks.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Patrick Wendell <pwendell@gmail.com>
      
      Closes #457 from mateiz/better-docs and squashes the following commits:
      
      a63d4a3 [Matei Zaharia] Skip Java/Scala API docs for Python package
      5ea1f43 [Matei Zaharia] Fix links to Java classes in Java guide, fix some JS for scrolling to anchors on page load
      f05abc0 [Matei Zaharia] Don't include java.lang package names
      995e992 [Matei Zaharia] Skip internal packages and class names with $ in JavaDoc
      a14a93c [Matei Zaharia] typo
      76ce64d [Matei Zaharia] Add groups to Javadoc index page, and a first package-info.java
      ed6f994 [Matei Zaharia] Generate JavaDoc as well, add titles, update doc site to use unified docs
      acb993d [Matei Zaharia] Add Unidoc plugin for the projects we want Unidoced
  9. Apr 15, 2014
    • [WIP] SPARK-1430: Support sparse data in Python MLlib · 63ca581d
      Matei Zaharia authored
      This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type.
      
      On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models.
      
      Some to-do items left:
      - [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. We can easily add a function to convert these to our own SparseVector.
      - [x] MLlib currently uses a vector with one extra column on the left to represent what we call LabeledPoint in Scala. Do we really want this? It may get annoying once you deal with sparse data since you must add/subtract 1 to each feature index when training. We can remove this API in 1.0 and use tuples for labeling.
      - [x] Explain how to use these in the Python MLlib docs.
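
To make the two ideas above concrete, here is a minimal pure-Python sketch. It is not the PySpark `SparseVector` class added by this PR: `TinySparseVector`, `with_label_column`, and `as_labeled_pair` are hypothetical names invented for illustration. It shows (a) a sparse vector built from a dict of index/value pairs and (b) why packing the label into an extra leading column forces a +1 shift on every feature index, whereas a (label, features) pair does not.

```python
# Illustrative sketch only; these names are hypothetical, not PySpark APIs.
from typing import Dict, List, Tuple


class TinySparseVector:
    """Stores only the non-zero entries of a fixed-size vector."""

    def __init__(self, size: int, entries: Dict[int, float]) -> None:
        for index in entries:
            if not 0 <= index < size:
                raise ValueError(f"index {index} out of range for size {size}")
        self.size = size
        # Keep indices sorted so conversions and comparisons are deterministic.
        self.indices: List[int] = sorted(entries)
        self.values: List[float] = [float(entries[i]) for i in self.indices]

    def dot(self, dense: List[float]) -> float:
        """Dot product with a dense list, touching only the stored entries."""
        if len(dense) != self.size:
            raise ValueError("dimension mismatch")
        return sum(v * dense[i] for i, v in zip(self.indices, self.values))

    def to_dense(self) -> List[float]:
        """Expand to a plain list with explicit zeros."""
        out = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            out[i] = v
        return out


def with_label_column(label: float, features: TinySparseVector) -> TinySparseVector:
    """Pack the label into column 0, shifting every feature index by +1."""
    shifted = {i + 1: v for i, v in zip(features.indices, features.values)}
    shifted[0] = label
    return TinySparseVector(features.size + 1, shifted)


def as_labeled_pair(label: float, features: TinySparseVector) -> Tuple[float, TinySparseVector]:
    """Keep the label separate, so feature indices stay unchanged."""
    return (label, features)
```

With the extra-column layout, a feature at index 3 ends up at index 4 and the caller has to undo that shift when reading weights back; the pair layout avoids the bookkeeping, which is the trade-off the to-do item raises.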
      
      CC @mengxr, @joshrosen
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #341 from mateiz/py-ml-update and squashes the following commits:
      
      d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle review comments
      ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge
      b9f97a3 [Matei Zaharia] Fix test
      1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python
      88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parameters
      37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib API
      da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to discuss sparse data
      c48e85a [Matei Zaharia] Added some tests for passing lists as input, and added mllib/tests.py to run-tests script.
      a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights
      74eefe7 [Matei Zaharia] Added LabeledPoint class in Python
      889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms and models
      ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict
      a5d6426 [Matei Zaharia] Add linalg.py to run-tests script
      0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading LabeledPoints
      eaee759 [Matei Zaharia] Update regression, classification and clustering models for sparse data
      2abbb44 [Matei Zaharia] Further work to get linear models working with sparse data
      154f45d [Matei Zaharia] Update docs, name some magic values
      881fef7 [Matei Zaharia] Added a sparse vector in Python and made Java-Python format more compact
    • Decision Tree documentation for MLlib programming guide · 07d72fe6
      Manish Amde authored
      Added user documentation on using the decision tree algorithms for classification and regression in the Spark 1.0 release.
      
      Apart from a general review, I need specific input on the following:
      * I had to move a lot of the existing documentation under the *linear methods* umbrella to accommodate decision trees. I wonder if there is a better way to organize the programming guide given we are so close to the release.
      * I have not looked closely at pyspark, but I am wondering whether new mllib algorithms are automatically plugged in or whether we need to do some extra work to call mllib functions from pyspark. I will add to the pyspark examples based upon the advice I get.
      
      cc: @mengxr, @hirakendu, @etrain, @atalwalkar
      
      Author: Manish Amde <manish9ue@gmail.com>
      
      Closes #402 from manishamde/tree_doc and squashes the following commits:
      
      022485a [Manish Amde] more documentation
      865826e [Manish Amde] minor: grammar
      dbb0e5e [Manish Amde] minor improvements to text
      b9ef6c4 [Manish Amde] basic decision tree code examples
      6e297d7 [Manish Amde] added subsections
      f427e84 [Manish Amde] renaming sections
      9c0c4be [Manish Amde] split candidate
      6925275 [Manish Amde] impurity and information gain
      94fd2f9 [Manish Amde] more reorg
      b93125c [Manish Amde] more subsection reorg
      3ecb2ad [Manish Amde] minor text addition
      1537dd3 [Manish Amde] added placeholders and some doc
      d06511d [Manish Amde] basic skeleton
  10. Mar 13, 2014
    • SPARK-1183. Don't use "worker" to mean executor · 69837321
      Sandy Ryza authored
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #120 from sryza/sandy-spark-1183 and squashes the following commits:
      
      5066a4a [Sandy Ryza] Remove "worker" in a couple comments
      0bd1e46 [Sandy Ryza] Remove --am-class from usage
      bfc8fe0 [Sandy Ryza] Remove am-class from doc and fix yarn-alpha
      607539f [Sandy Ryza] Address review comments
      74d087a [Sandy Ryza] SPARK-1183. Don't use "worker" to mean executor
  11. Feb 09, 2014
    • Merge pull request #566 from martinjaggi/copy-MLlib-d. · 2182aa3c
      Martin Jaggi authored
      new MLlib documentation for optimization, regression and classification
      
      new documentation with tex formulas, hopefully improving usability and reproducibility of the offered MLlib methods.
      also did some minor changes in the code for consistency. scala tests pass.
      
      this is the rebased branch, i deleted the old PR
      
      jira:
      https://spark-project.atlassian.net/browse/MLLIB-19
      
      Author: Martin Jaggi <m.jaggi@gmail.com>
      
      Closes #566 and squashes the following commits:
      
      5f0f31e [Martin Jaggi] line wrap at 100 chars
      4e094fb [Martin Jaggi] better description of GradientDescent
      1d6965d [Martin Jaggi] remove broken url
      ea569c3 [Martin Jaggi] telling what updater actually does
      964732b [Martin Jaggi] lambda R() in documentation
      a6c6228 [Martin Jaggi] better comments in SGD code for regression
      b32224a [Martin Jaggi] new optimization documentation
      d5dfef7 [Martin Jaggi] new classification and regression documentation
      b07ead6 [Martin Jaggi] correct scaling for MSE loss
      ba6158c [Martin Jaggi] use d for the number of features
      bab2ed2 [Martin Jaggi] renaming LeastSquaresGradient
  12. Feb 08, 2014
    • Merge pull request #552 from martinjaggi/master. Closes #552. · fabf1749
      Martin Jaggi authored
      tex formulas in the documentation
      
      using mathjax.
      and splitting the MLlib documentation by techniques
      
      see jira
      https://spark-project.atlassian.net/browse/MLLIB-19
      and
      https://github.com/shivaram/spark/compare/mathjax
      
      Author: Martin Jaggi <m.jaggi@gmail.com>
      
      == Merge branch commits ==
      
      commit 0364bfabbfc347f917216057a20c39b631842481
      Author: Martin Jaggi <m.jaggi@gmail.com>
      Date:   Fri Feb 7 03:19:38 2014 +0100
      
          minor polishing, as suggested by @pwendell
      
      commit dcd2142c164b2f602bf472bb152ad55bae82d31a
      Author: Martin Jaggi <m.jaggi@gmail.com>
      Date:   Thu Feb 6 18:04:26 2014 +0100
      
          enabling inline latex formulas with $.$
      
          same mathjax configuration as used in math.stackexchange.com
      
          sample usage in the linear algebra (SVD) documentation
      
      commit bbafafd2b497a5acaa03a140bb9de1fbb7d67ffa
      Author: Martin Jaggi <m.jaggi@gmail.com>
      Date:   Thu Feb 6 17:31:29 2014 +0100
      
          split MLlib documentation by techniques
      
          and linked from the main mllib-guide.md site
      
      commit d1c5212b93c67436543c2d8ddbbf610fdf0a26eb
      Author: Martin Jaggi <m.jaggi@gmail.com>
      Date:   Thu Feb 6 16:59:43 2014 +0100
      
          enable mathjax formula in the .md documentation files
      
          code by @shivaram
      
      commit d73948db0d9bc36296054e79fec5b1a657b4eab4
      Author: Martin Jaggi <m.jaggi@gmail.com>
      Date:   Thu Feb 6 16:57:23 2014 +0100
      
          minor update on how to compile the documentation