    [SPARK-6781] [SQL] use sqlContext in python shell · 6ada4f6f
    Davies Liu authored
    Use `sqlContext` in PySpark shell, make it consistent with SQL programming guide. `sqlCtx` is also kept for compatibility.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #5425 from davies/sqlCtx and squashes the following commits:
    
    af67340 [Davies Liu] sqlCtx -> sqlContext
    15a278f [Davies Liu] use sqlContext in python shell
ml-guide.md
---
layout: global
title: Spark ML Programming Guide
---

spark.ml is a new package introduced in Spark 1.2, which aims to provide a uniform set of high-level APIs that help users create and tune practical machine learning pipelines. It is currently an alpha component, and we would like to hear back from the community about how it fits real-world use cases and how it could be improved.

Note that we will keep supporting and adding features to spark.mllib along with the development of spark.ml. Users should be comfortable using spark.mllib features and can expect more features to come. Developers should contribute new algorithms to spark.mllib and can optionally contribute to spark.ml.

Table of Contents

  • This will become a table of contents (this text will be scraped). {:toc}

Main Concepts

Spark ML standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow. This section covers the key concepts introduced by the Spark ML API.

  • ML Dataset: Spark ML uses the DataFrame from Spark SQL as a dataset which can hold a variety of data types. E.g., a dataset could have different columns storing text, feature vectors, true labels, and predictions.

  • Transformer: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.

  • Estimator: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a dataset and produces a model.

  • Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.

  • Param: All Transformers and Estimators share a common API for specifying parameters.

ML Dataset

Machine learning can be applied to a wide variety of data types, such as vectors, text, images, and structured data. Spark ML adopts the DataFrame from Spark SQL in order to support a variety of data types under a unified Dataset concept.

DataFrame supports many basic and structured types; see the Spark SQL datatype reference for a list of supported types. In addition to the types listed in the Spark SQL guide, DataFrame can use ML Vector types.

A DataFrame can be created either implicitly or explicitly from a regular RDD. See the code examples below and the Spark SQL programming guide for examples.

Columns in a DataFrame are named. The code examples below use names such as "text", "features", and "label".
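The named-column idea can be mimicked without Spark. This is an illustrative sketch only (plain Python dicts, not the DataFrame API); the column names "text" and "label" mirror those used in the examples below, and the `select` helper is a hypothetical stand-in for column selection.

```python
# Illustrative only: a tiny stand-in for a dataset with named columns,
# using plain Python dicts instead of a Spark DataFrame.
dataset = [
    {"text": "spark is great", "label": 1.0},
    {"text": "hello world",    "label": 0.0},
]

def select(rows, column):
    """Return the values of one named column (hypothetical helper)."""
    return [row[column] for row in rows]

print(select(dataset, "label"))  # the "label" column: [1.0, 0.0]
```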

ML Algorithms

Transformers

A Transformer is an abstraction which includes feature transformers and learned models. Technically, a Transformer implements a method transform() which converts one DataFrame into another, generally by appending one or more columns. For example:

  • A feature transformer might take a dataset, read a column (e.g., text), convert it into a new column (e.g., feature vectors), append the new column to the dataset, and output the updated dataset.
  • A learning model might take a dataset, read the column containing feature vectors, predict the label for each feature vector, append the labels as a new column, and output the updated dataset.
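The first behavior above, a feature transformer that appends a column, can be sketched in plain Python. This is illustrative only, not Spark code: `hashing_features` is a hypothetical helper (a crude hashed bag-of-words), and rows are dicts standing in for DataFrame rows.

```python
# Illustrative sketch (not the Spark API): a feature transformer that
# reads a "text" column, appends a "features" column, and returns the
# updated dataset, leaving the input rows untouched.
def hashing_features(text, num_buckets=8):
    # Hypothetical helper: a crude hashed bag-of-words vector.
    vec = [0] * num_buckets
    for word in text.split():
        vec[hash(word) % num_buckets] += 1
    return vec

def transform(dataset):
    # dict(row, features=...) builds a new row, so the input is unchanged.
    return [dict(row, features=hashing_features(row["text"])) for row in dataset]

rows = [{"text": "a b a"}]
out = transform(rows)
```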

Estimators

An Estimator abstracts the concept of a learning algorithm, or of any algorithm that fits or trains on data. Technically, an Estimator implements a method fit() which accepts a DataFrame and produces a Transformer. For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Transformer.
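The fit()-returns-a-Transformer contract can be sketched in plain Python. This is illustrative only, not the Spark API: the estimator, the "feature" column, and the mean-threshold rule are all hypothetical, chosen just to show an Estimator producing a model that then acts as a Transformer.

```python
# Illustrative sketch (not the Spark API): an Estimator whose fit()
# learns a threshold from training data and returns a model, which is
# itself a Transformer that appends a "prediction" column.
class MeanThresholdEstimator:
    def fit(self, dataset):
        # "Train": learn the mean of the "feature" column.
        values = [row["feature"] for row in dataset]
        return MeanThresholdModel(sum(values) / len(values))

class MeanThresholdModel:
    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, dataset):
        # Append predictions as a new column, like a learned Transformer.
        return [dict(row, prediction=1.0 if row["feature"] > self.threshold else 0.0)
                for row in dataset]

train = [{"feature": 1.0}, {"feature": 3.0}]
model = MeanThresholdEstimator().fit(train)       # Estimator -> Transformer
predicted = model.transform([{"feature": 2.5}])
```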

Properties of ML Algorithms