---
layout: global
title: Spark Streaming Programming Guide
---

* This will become a table of contents (this text will be scraped).
{:toc}
# Overview
A Spark Streaming application is very similar to a Spark application; it consists of a driver program that runs the user's `main` function and continuously executes various parallel operations on input streams of data. The main abstraction Spark Streaming provides is a *discretized stream* (DStream), which is a continuous sequence of RDDs (distributed collections of elements) representing a continuous stream of data. DStreams can be created from live incoming data (such as data from a socket, Kafka, etc.) or by transforming existing DStreams using parallel operators like `map`, `reduce`, and `window`. The basic processing model is as follows:
(i) While a Spark Streaming driver program is running, the system receives data from various sources and divides the data into batches. Each batch of data is treated as an RDD, that is, an immutable, parallel collection of data. These input RDDs are automatically persisted in memory (serialized by default) and replicated to two nodes for fault-tolerance. This sequence of RDDs is collectively referred to as an InputDStream.
(ii) Data received by InputDStreams is processed using DStream operations. Since all data is represented as RDDs and all DStream operations as RDD operations, data is automatically recovered in the event of node failures.
This guide shows how to start programming with DStreams.
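As a flavor of what a DStream program looks like, the sketch below counts the words in each batch of text received over a socket. It is only an illustration: the import paths, the `socketTextStream` source, and the host/port are assumptions that may vary with your Spark version, and creating the StreamingContext is explained in the next section.

{% highlight scala %}
// Sketch: count words in each batch of text received over a socket.
// Import paths and the socketTextStream source are assumptions for illustration.
import spark.streaming.{Seconds, StreamingContext}
import spark.streaming.StreamingContext._   // implicit conversions for pair DStream operations

val ssc = new StreamingContext("local", "WordCount", Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)     // an InputDStream of lines
val words = lines.flatMap(_.split(" "))                 // transform each batch in parallel
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)  // per-batch word counts
counts.print()                                          // print a few counts from each batch
ssc.start()                                             // start receiving and processing data
{% endhighlight %}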
# Initializing Spark Streaming
The first thing a Spark Streaming program must do is create a `StreamingContext` object, which tells Spark how to access a cluster. A `StreamingContext` can be created by using

{% highlight scala %}
new StreamingContext(master, jobName, batchDuration)
{% endhighlight %}
The `master` parameter is either the Mesos master URL (for running on a cluster) or the special "local" string (for local mode), the same as is used to create a SparkContext. For more information, please refer to the Spark programming guide. The `jobName` is the name of the streaming job, which is the same as the jobName used in SparkContext; it is used to identify this job in the Mesos web UI. The `batchDuration` is the size of the batches (as explained earlier). This must be set carefully so that the cluster can keep up with the processing of the data streams. Starting with something conservative, like 5 seconds, may be a good starting point. See the Performance Tuning section for a detailed discussion.
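For example, a context for local mode with a 5-second batch size might be created as follows (the `Seconds` helper and the import path shown here are assumptions that may differ across Spark versions):

{% highlight scala %}
// Sketch: a StreamingContext running in local mode with 5-second batches.
import spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext("local", "MyStreamingJob", Seconds(5))
{% endhighlight %}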
This constructor creates a SparkContext object using the given `master` and `jobName` parameters. However, if you already have a SparkContext, or you need to create a custom SparkContext by specifying a list of JARs, then a StreamingContext can be created from the existing SparkContext by using
{% highlight scala %}
new StreamingContext(sparkContext, batchDuration)
{% endhighlight %}
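For example, a StreamingContext could be built on top of a SparkContext that was created with a custom list of JARs. This is only a sketch: the master URL, Spark home path, and JAR name below are purely hypothetical, and the SparkContext constructor shown may differ in your Spark version.

{% highlight scala %}
// Sketch: reuse an existing SparkContext (e.g. one created with extra JARs).
import spark.SparkContext
import spark.streaming.{Seconds, StreamingContext}

val sc = new SparkContext("mesos://master-host:5050", "MyStreamingJob",
                          "/path/to/spark", List("myStreamingJob.jar"))
val ssc = new StreamingContext(sc, Seconds(5))
{% endhighlight %}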