Commit 69837321
SPARK-1183. Don't use "worker" to mean executor
Author: Sandy Ryza <sandy@cloudera.com>
    
    Closes #120 from sryza/sandy-spark-1183 and squashes the following commits:
    
    5066a4a [Sandy Ryza] Remove "worker" in a couple comments
    0bd1e46 [Sandy Ryza] Remove --am-class from usage
    bfc8fe0 [Sandy Ryza] Remove am-class from doc and fix yarn-alpha
    607539f [Sandy Ryza] Address review comments
    74d087a [Sandy Ryza] SPARK-1183. Don't use "worker" to mean executor
python-programming-guide.md
---
layout: global
title: Python Programming Guide
---

The Spark Python API (PySpark) exposes the Spark programming model to Python. To learn the basics of Spark, we recommend reading through the Scala programming guide first; it should be easy to follow even if you don't know Scala. This guide will show how to use the Spark features described there in Python.

Key Differences in the Python API

There are a few key differences between the Python and Scala APIs:

  • Python is dynamically typed, so RDDs can hold objects of multiple types.
  • PySpark does not yet support a few API calls, such as `lookup` and non-text input files, though these will be added in future releases.
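Because Python is dynamically typed, the elements of a single collection can mix types, and the same holds for the RDD that `sc.parallelize()` would build from it. A minimal plain-Python sketch of the idea (no SparkContext required, so it runs anywhere):

```python
# A heterogeneous list: in PySpark, sc.parallelize(data) would produce
# an RDD holding these mixed-type elements.
data = [1, "two", 3.0, (4, "four")]

# Transformations apply an ordinary Python function to each element,
# regardless of its type; here, everything is converted to a string.
as_strings = [str(x) for x in data]
print(as_strings)
```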

In PySpark, RDDs support the same methods as their Scala counterparts but take Python functions and return Python collection types. Short functions can be passed to RDD methods using Python's lambda syntax:

{% highlight python %}
logData = sc.textFile(logFile).cache()
errors = logData.filter(lambda line: "ERROR" in line)
{% endhighlight %}

You can also pass functions that are defined with the `def` keyword; this is useful for longer functions that can't be expressed using `lambda`:

{% highlight python %}
def is_error(line):
    return "ERROR" in line
errors = logData.filter(is_error)
{% endhighlight %}

Functions can access objects in enclosing scopes, although modifications to those objects within RDD methods will not be propagated back:
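This behavior can be simulated in plain Python: Spark serializes a task's closure and ships it to executors, so each task mutates its own copy of any captured object. In the sketch below, `copy.deepcopy` stands in for that serialization step, and the function name `simulate_executor` is hypothetical, not a PySpark API:

```python
import copy

# A driver-side object captured by the task's closure.
counter = {"errors": 0}

def simulate_executor(lines):
    # Spark would serialize the closure and send it to an executor;
    # deepcopy plays that role here, giving the task its own copy.
    captured = copy.deepcopy(counter)
    for line in lines:
        if "ERROR" in line:
            captured["errors"] += 1
    return captured

result = simulate_executor(["ERROR one", "ok", "ERROR two"])
print(result["errors"])   # the executor's copy counted both errors
print(counter["errors"])  # the driver's original remains 0
```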