-
Davies Liu authored
Use `sqlContext` in PySpark shell, make it consistent with SQL programming guide. `sqlCtx` is also kept for compatibility. Author: Davies Liu <davies@databricks.com> Closes #5425 from davies/sqlCtx and squashes the following commits: af67340 [Davies Liu] sqlCtx -> sqlContext 15a278f [Davies Liu] use sqlContext in python shell
Davies Liu authoredUse `sqlContext` in PySpark shell, make it consistent with SQL programming guide. `sqlCtx` is also kept for compatibility. Author: Davies Liu <davies@databricks.com> Closes #5425 from davies/sqlCtx and squashes the following commits: af67340 [Davies Liu] sqlCtx -> sqlContext 15a278f [Davies Liu] use sqlContext in python shell
layout: global
title: Spark ML Programming Guide
spark.ml
is a new package introduced in Spark 1.2, which aims to provide a uniform set of
high-level APIs that help users create and tune practical machine learning pipelines.
It is currently an alpha component, and we would like to hear back from the community about
how it fits real-world use cases and how it could be improved.
Note that we will keep supporting and adding features to spark.mllib
along with the
development of spark.ml
.
Users should be comfortable using spark.mllib
features and expect more features coming.
Developers should contribute new algorithms to spark.mllib
and can optionally contribute
to spark.ml
.
Table of Contents
- This will become a table of contents (this text will be scraped). {:toc}
Main Concepts
Spark ML standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow. This section covers the key concepts introduced by the Spark ML API.
-
ML Dataset: Spark ML uses the
DataFrame
from Spark SQL as a dataset which can hold a variety of data types. E.g., a dataset could have different columns storing text, feature vectors, true labels, and predictions. -
Transformer
: ATransformer
is an algorithm which can transform oneDataFrame
into anotherDataFrame
. E.g., an ML model is aTransformer
which transforms an RDD with features into an RDD with predictions. -
Estimator
: AnEstimator
is an algorithm which can be fit on aDataFrame
to produce aTransformer
. E.g., a learning algorithm is anEstimator
which trains on a dataset and produces a model. -
Pipeline
: APipeline
chains multipleTransformer
s andEstimator
s together to specify an ML workflow. -
Param
: AllTransformer
s andEstimator
s now share a common API for specifying parameters.
ML Dataset
Machine learning can be applied to a wide variety of data types, such as vectors, text, images, and structured data.
Spark ML adopts the DataFrame
from Spark SQL in order to support a variety of data types under a unified Dataset concept.
DataFrame
supports many basic and structured types; see the Spark SQL datatype reference for a list of supported types.
In addition to the types listed in the Spark SQL guide, DataFrame
can use ML Vector
types.
A DataFrame
can be created either implicitly or explicitly from a regular RDD
. See the code examples below and the Spark SQL programming guide for examples.
Columns in a DataFrame
are named. The code examples below use names such as "text," "features," and "label."
ML Algorithms
Transformers
A Transformer
is an abstraction which includes feature transformers and learned models. Technically, a Transformer
implements a method transform()
which converts one DataFrame
into another, generally by appending one or more columns.
For example:
- A feature transformer might take a dataset, read a column (e.g., text), convert it into a new column (e.g., feature vectors), append the new column to the dataset, and output the updated dataset.
- A learning model might take a dataset, read the column containing feature vectors, predict the label for each feature vector, append the labels as a new column, and output the updated dataset.
Estimators
An Estimator
abstracts the concept of a learning algorithm or any algorithm which fits or trains on data. Technically, an Estimator
implements a method fit()
which accepts a DataFrame
and produces a Transformer
.
For example, a learning algorithm such as LogisticRegression
is an Estimator
, and calling fit()
trains a LogisticRegressionModel
, which is a Transformer
.