---
layout: global
title: Tuning Spark
---

* This will become a table of contents (this text will be scraped).
{:toc}

Because of the in-memory nature of most Spark computations, Spark programs can be bottlenecked by any resource in the cluster: CPU, network bandwidth, or memory. Most often, if the data fits in memory, the bottleneck is network bandwidth, but sometimes, you also need to do some tuning, such as storing RDDs in serialized form, to decrease memory usage. This guide will cover two main topics: data serialization, which is crucial for good network performance and can also reduce memory use, and memory tuning. We also sketch several smaller topics.

# Data Serialization

Serialization plays an important role in the performance of any distributed application. Formats that are slow to serialize objects into, or consume a large number of bytes, will greatly slow down the computation. Often, this will be the first thing you should tune to optimize a Spark application. Spark aims to strike a balance between convenience (allowing you to work with any Java type in your operations) and performance. It provides two serialization libraries:

* **Java serialization**: By default, Spark serializes objects using Java's `ObjectOutputStream` framework, and can work with any class you create that implements `java.io.Serializable`. You can also control the performance of your serialization more closely by extending `java.io.Externalizable`. Java serialization is flexible but often quite slow, and leads to large serialized formats for many classes (see the sketch after this list).
* **Kryo serialization**: Spark can also use the Kryo library (version 2) to serialize objects more quickly. Kryo is significantly faster and more compact than Java serialization (often as much as 10x), but does not support all `Serializable` types and requires you to register the classes you'll use in the program in advance for best performance.
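
For instance, here is a minimal sketch of a class that works with the default Java serializer; the class name and fields are hypothetical.

{% highlight scala %}
// Hypothetical example: any class that implements java.io.Serializable
// can be serialized by Spark's default Java serializer.
class Point(val x: Double, val y: Double) extends java.io.Serializable

// For finer control, a class can instead extend java.io.Externalizable
// and implement writeExternal/readExternal by hand.
{% endhighlight %}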

You can switch to using Kryo by initializing your job with a `SparkConf` and calling `conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")`. This setting configures the serializer used not only for shuffling data between worker nodes but also for serializing RDDs to disk. The only reason Kryo is not the default is the custom registration requirement, but we recommend trying it in any network-intensive application.
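
As a concrete illustration, the following sketch enables Kryo before the `SparkContext` is created; the master URL and application name are placeholders.

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder master URL and application name; substitute your own.
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("KryoExample")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)
{% endhighlight %}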

Spark automatically includes Kryo serializers for the many commonly-used core Scala classes covered in the `AllScalaRegistrar` from the [Twitter chill](https://github.com/twitter/chill) library.

To register your own custom classes with Kryo, use the `registerKryoClasses` method:

{% highlight scala %}
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)
{% endhighlight %}