-
Sandy Ryza authored
Author: Sandy Ryza <sandy@cloudera.com> Closes #120 from sryza/sandy-spark-1183 and squashes the following commits: 5066a4a [Sandy Ryza] Remove "worker" in a couple comments 0bd1e46 [Sandy Ryza] Remove --am-class from usage bfc8fe0 [Sandy Ryza] Remove am-class from doc and fix yarn-alpha 607539f [Sandy Ryza] Address review comments 74d087a [Sandy Ryza] SPARK-1183. Don't use "worker" to mean executor
Sandy Ryza authoredAuthor: Sandy Ryza <sandy@cloudera.com> Closes #120 from sryza/sandy-spark-1183 and squashes the following commits: 5066a4a [Sandy Ryza] Remove "worker" in a couple comments 0bd1e46 [Sandy Ryza] Remove --am-class from usage bfc8fe0 [Sandy Ryza] Remove am-class from doc and fix yarn-alpha 607539f [Sandy Ryza] Address review comments 74d087a [Sandy Ryza] SPARK-1183. Don't use "worker" to mean executor
layout: global
title: Python Programming Guide
The Spark Python API (PySpark) exposes the Spark programming model to Python. To learn the basics of Spark, we recommend reading through the Scala programming guide first; it should be easy to follow even if you don't know Scala. This guide will show how to use the Spark features described there in Python.
Key Differences in the Python API
There are a few key differences between the Python and Scala APIs:
- Python is dynamically typed, so RDDs can hold objects of multiple types.
- PySpark does not yet support a few API calls, such as
lookup
and non-text input files, though these will be added in future releases.
In PySpark, RDDs support the same methods as their Scala counterparts but take Python functions and return Python collection types.
Short functions can be passed to RDD methods using Python's lambda
syntax:
{% highlight python %} logData = sc.textFile(logFile).cache() errors = logData.filter(lambda line: "ERROR" in line) {% endhighlight %}
You can also pass functions that are defined with the def
keyword; this is useful for longer functions that can't be expressed using lambda
:
{% highlight python %} def is_error(line): return "ERROR" in line errors = logData.filter(is_error) {% endhighlight %}
Functions can access objects in enclosing scopes, although modifications to those objects within RDD methods will not be propagated back: