Commit c2b105af authored by Josh Rosen

Add documentation for Python API.

parent 7ec3595d
@@ -47,6 +47,7 @@
<li><a href="quick-start.html">Quick Start</a></li>
<li><a href="scala-programming-guide.html">Scala</a></li>
<li><a href="java-programming-guide.html">Java</a></li>
<li><a href="python-programming-guide.html">Python</a></li>
</ul>
</li>
@@ -8,3 +8,4 @@ Here you can find links to the Scaladoc generated for the Spark sbt subprojects.
- [Core](api/core/index.html)
- [Examples](api/examples/index.html)
- [Bagel](api/bagel/index.html)
- [PySpark](api/pyspark/index.html)
@@ -7,11 +7,11 @@ title: Spark Overview
TODO(andyk): Rewrite to make the Java API a first class part of the story.
{% endcomment %}

Spark is a MapReduce-like cluster computing framework designed for low-latency iterative jobs and interactive use from an interpreter.
It provides clean, language-integrated APIs in Scala, Java, and Python, with a rich array of parallel operators.
Spark can run on top of the [Apache Mesos](http://incubator.apache.org/mesos/) cluster manager,
[Hadoop YARN](http://hadoop.apache.org/docs/r2.0.1-alpha/hadoop-yarn/hadoop-yarn-site/YARN.html),
Amazon EC2, or without an independent resource manager ("standalone mode").

# Downloading
@@ -59,6 +59,7 @@ of `project/SparkBuild.scala`, then rebuilding Spark (`sbt/sbt clean compile`).
* [Quick Start](quick-start.html): a quick introduction to the Spark API; start here!
* [Spark Programming Guide](scala-programming-guide.html): an overview of Spark concepts, and details on the Scala API
* [Java Programming Guide](java-programming-guide.html): using Spark from Java
* [Python Programming Guide](python-programming-guide.html): using Spark from Python

**Deployment guides:**

@@ -72,7 +73,7 @@ of `project/SparkBuild.scala`, then rebuilding Spark (`sbt/sbt clean compile`).
* [Configuration](configuration.html): customize Spark via its configuration system
* [Tuning Guide](tuning.html): best practices to optimize performance and memory use
* API Docs: [Java/Scala (Scaladoc)](api/core/index.html) and [Python (Epydoc)](api/pyspark/index.html)
* [Bagel](bagel-programming-guide.html): an implementation of Google's Pregel on Spark
* [Contributing to Spark](contributing-to-spark.html)
---
layout: global
title: Python Programming Guide
---
The Spark Python API (PySpark) exposes most of the Spark features available in the Scala version to Python.
To learn the basics of Spark, we recommend reading through the
[Scala programming guide](scala-programming-guide.html) first; it should be
easy to follow even if you don't know Scala.
This guide will show how to use the Spark features described there in Python.
# Key Differences in the Python API
There are a few key differences between the Python and Scala APIs:
* Python is dynamically typed, so RDDs can hold objects of different types (see the short example after this list).
* PySpark does not currently support the following Spark features:
- Accumulators
- Special functions on RDDs of doubles, such as `mean` and `stdev`
- Approximate jobs / functions, such as `countApprox` and `sumApprox`
- `lookup`
- `mapPartitionsWithSplit`
- `persist` at storage levels other than `MEMORY_ONLY`
- `sample`
- `sort`
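For example, here is a minimal sketch of the dynamic typing mentioned above (assuming an interactive session where `sc` is already defined, as in `pyspark-shell`, and using illustrative values): a single RDD can hold objects of several types.

{% highlight python %}
>>> # One RDD mixing an int, a string, and a tuple.
>>> rdd = sc.parallelize([1, "two", (3, "three")])
>>> rdd.map(lambda x: type(x).__name__).collect()
['int', 'str', 'tuple']
{% endhighlight %}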
# Installing and Configuring PySpark
PySpark requires Python 2.6 or higher.
PySpark jobs are executed using a standard CPython interpreter in order to support Python modules that use C extensions.
We have not tested PySpark with Python 3 or with alternative Python interpreters, such as [PyPy](http://pypy.org/) or [Jython](http://www.jython.org/).
By default, PySpark's scripts will run programs using `python`; an alternate Python executable may be specified by setting the `PYSPARK_PYTHON` environment variable in `conf/spark-env.sh`.
All of PySpark's library dependencies, including [Py4J](http://py4j.sourceforge.net/), are bundled with PySpark and automatically imported.
Standalone PySpark jobs should be run using the `run-pyspark` script, which automatically configures the Java and Python environment using the settings in `conf/spark-env.sh`.
The script automatically adds the `pyspark` package to the `PYTHONPATH`.
# Interactive Use
PySpark's `pyspark-shell` script provides a simple way to learn the API:
{% highlight python %}
>>> words = sc.textFile("/usr/share/dict/words")
>>> words.filter(lambda w: w.startswith("spar")).take(5)
[u'spar', u'sparable', u'sparada', u'sparadrap', u'sparagrass']
{% endhighlight %}
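Transformations and actions can be chained in the same session; as a small follow-up sketch (reusing the `words` RDD from above), this counts the dictionary words longer than ten characters:

{% highlight python %}
>>> long_words = words.filter(lambda w: len(w) > 10)  # a transformation (evaluated lazily)
>>> long_words.count()                                # an action that returns a number
{% endhighlight %}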
# Standalone Use
PySpark can also be used from standalone Python scripts by creating a SparkContext in your script and running it with the `run-pyspark` script in the `pyspark` directory.
The Quick Start guide includes a [complete example](quick-start.html#a-standalone-job-in-python) of a standalone Python job.
Code dependencies can be deployed by listing them in the `pyFiles` option in the SparkContext constructor:
{% highlight python %}
from pyspark import SparkContext
sc = SparkContext("local", "Job Name", pyFiles=['MyFile.py', 'lib.zip', 'app.egg'])
{% endhighlight %}
Files listed here will be added to the `PYTHONPATH` and shipped to remote worker machines.
Code dependencies can be added to an existing SparkContext using its `addPyFile()` method.
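For example (a minimal sketch reusing the placeholder file names from above), a dependency can be shipped after the context has been created:

{% highlight python %}
from pyspark import SparkContext

sc = SparkContext("local", "Job Name")
# Ship an additional dependency to the workers; 'lib.zip' is a
# placeholder for your own .py, .zip, or .egg file.
sc.addPyFile("lib.zip")
{% endhighlight %}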
# Where to Go from Here
PySpark includes several sample programs using the Python API in `pyspark/examples`.
You can run them by passing the files to the `run-pyspark` script included in PySpark -- for example `./run-pyspark examples/wordcount.py`.
Each example program prints usage help when run without any arguments.
We currently provide [API documentation](api/pyspark/index.html) for the Python API, generated using Epydoc.
Many of the RDD method descriptions contain [doctests](http://docs.python.org/2/library/doctest.html) that provide additional usage examples.
@@ -6,7 +6,8 @@ title: Quick Start
* This will become a table of contents (this text will be scraped).
{:toc}

This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark's interactive Scala shell (don't worry if you don't know Scala -- you will not need much for this), then show how to write standalone jobs in Scala, Java, and Python.
See the [programming guide](scala-programming-guide.html) for a more complete reference.

To follow along with this guide, you only need to have successfully built Spark on one machine. Simply go into your Spark directory and run:
@@ -230,3 +231,40 @@ Lines with a: 8422, Lines with b: 1836
{% endhighlight %}

This example only runs the job locally; for a tutorial on running jobs across several machines, see the [Standalone Mode](spark-standalone.html) documentation, and consider using a distributed input source, such as HDFS.
# A Standalone Job In Python
Now we will show how to write a standalone job using the Python API (PySpark).
As an example, we'll create a simple Spark job, `SimpleJob.py`:
{% highlight python %}
"""SimpleJob.py"""
from pyspark import SparkContext
logFile = "/var/log/syslog" # Should be some file on your system
sc = SparkContext("local", "Simple job")
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
{% endhighlight %}
This job simply counts the number of lines containing 'a' and the number containing 'b' in a system log file.
Like in the Scala and Java examples, we use a SparkContext to create RDDs.
We can pass Python functions to Spark, which are automatically serialized along with any variables that they reference.
For jobs that use custom classes or third-party libraries, we can add those code dependencies to SparkContext to ensure that they will be available on remote machines; this is described in more detail in the [Python programming guide](python-programming-guide.html).
`SimpleJob` is simple enough that we do not need to specify any code dependencies.
We can run this job using the `run-pyspark` script in `$SPARK_HOME/pyspark`:
{% highlight bash %}
$ cd $SPARK_HOME
$ ./pyspark/run-pyspark SimpleJob.py
...
Lines with a: 8422, Lines with b: 1836
{% endhighlight %}
This example only runs the job locally; for a tutorial on running jobs across several machines, see the [Standalone Mode](spark-standalone.html) documentation, and consider using a distributed input source, such as HDFS.
# PySpark
PySpark is a Python API for Spark.
PySpark jobs are written in Python and executed using a standard Python
interpreter; this supports modules that use Python C extensions. The
API is based on the Spark Scala API and uses regular Python functions
and lambdas to support user-defined functions. PySpark supports
interactive use through a standard Python interpreter; it can
automatically serialize closures and ship them to worker processes.
PySpark is built on top of the Spark Java API. Data is uniformly
represented as serialized Python objects and stored in Spark Java
processes, which communicate with PySpark worker processes over pipes.
## Features
PySpark supports most of the Spark API, including broadcast variables.
RDDs are dynamically typed and can hold any Python object.
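For example, a minimal interactive sketch of a broadcast variable (assuming a
`SparkContext` named `sc`; the dictionary contents are illustrative placeholders):

    >>> lookup = sc.broadcast({"a": 1, "b": 2})
    >>> sc.parallelize(["a", "b", "a"]).map(lambda k: lookup.value[k]).collect()
    [1, 2, 1]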
PySpark does not support:
- Special functions on RDDs of doubles
- Accumulators
## Examples and Documentation
The PySpark source contains docstrings and doctests that document its
API. The public classes are in `context.py` and `rdd.py`.
The `pyspark/pyspark/examples` directory contains a few complete
examples.
## Installing PySpark
To use PySpark, `SPARK_HOME` should be set to the location of the Spark
package.
## Running PySpark
The easiest way to run PySpark is to use the `run-pyspark` and
`pyspark-shell` scripts, which are included in the `pyspark` directory.
import sys
import os
sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], "pyspark/lib/py4j0.7.egg"))
from pyspark.context import SparkContext
__all__ = ["SparkContext"]