---
layout: global
displayTitle: Spark Configuration
title: Configuration
---

* This will become a table of contents (this text will be scraped).
{:toc}

Spark provides three locations to configure the system:

* Spark properties control most application parameters and can be set by using a `SparkConf` object, or through Java system properties.
* Environment variables can be used to set per-machine settings, such as the IP address, through the `conf/spark-env.sh` script on each node.
* Logging can be configured through `log4j.properties`.

# Spark Properties

Spark properties control most application settings and are configured separately for each application. These properties can be set directly on a `SparkConf` passed to your `SparkContext`. `SparkConf` allows you to configure some of the common properties (e.g. master URL and application name), as well as arbitrary key-value pairs through the `set()` method. For example, we could initialize an application with two threads as follows:

Note that we run with `local[2]`, meaning two threads, which represents "minimal" parallelism and can help detect bugs that only exist when we run in a distributed context.

{% highlight scala %}
val conf = new SparkConf()
             .setMaster("local[2]")
             .setAppName("CountingSheep")
val sc = new SparkContext(conf)
{% endhighlight %}

Note that we can have more than one thread in local mode, and in cases like Spark Streaming, we may actually require more than one thread to prevent starvation issues, as in the sketch below.
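As a concrete illustration (a minimal sketch, assuming a receiver-based Spark Streaming job; the host and port are placeholders), the socket receiver below occupies one thread, so a master of `local[1]` would leave no thread free to process the received data:

{% highlight scala %}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// The receiver created by socketTextStream runs on its own thread,
// so at least one more thread is needed to process the data.
// With setMaster("local[1]") this application would starve.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
{% endhighlight %}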

Properties that specify a time duration should be configured with a unit of time. The following formats are accepted:

    25ms (milliseconds)
    5s (seconds)
    10m or 10min (minutes)
    3h (hours)
    5d (days)
    1y (years)
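For example (a minimal sketch; `spark.network.timeout` is one duration-typed property), any of these suffixes can be used when setting such a property:

{% highlight scala %}
// spark.network.timeout is a duration-typed property;
// "120s" and "2m" both configure a two-minute timeout.
val conf = new SparkConf()
  .setAppName("DurationExample")
  .set("spark.network.timeout", "120s")
{% endhighlight %}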

Properties that specify a byte size should be configured with a unit of size. The following format is accepted:

    1b (bytes)
    1k or 1kb (kibibytes = 1024 bytes)
    1m or 1mb (mebibytes = 1024 kibibytes)
    1g or 1gb (gibibytes = 1024 mebibytes)
    1t or 1tb (tebibytes = 1024 gibibytes)
    1p or 1pb (pebibytes = 1024 tebibytes)
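Similarly (a minimal sketch; `spark.executor.memory` is one size-typed property), a byte-size property can be set with any of these suffixes:

{% highlight scala %}
// spark.executor.memory is a size-typed property;
// "4g" requests 4 gibibytes of memory per executor.
val conf = new SparkConf()
  .setAppName("SizeExample")
  .set("spark.executor.memory", "4g")
{% endhighlight %}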

## Dynamically Loading Spark Properties

In some cases, you may want to avoid hard-coding certain configurations in a `SparkConf`; for instance, you might want to run the same application with different masters or different amounts of memory. Spark allows you to simply create an empty conf:

{% highlight scala %}
val sc = new SparkContext(new SparkConf())
{% endhighlight %}

Then, you can supply configuration values at runtime:

{% highlight bash %}
./bin/spark-submit --name "My app" --master local[4] --conf spark.eventLog.enabled=false
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
{% endhighlight %}

The Spark shell and `spark-submit` tool support two ways to load configurations dynamically. The first is command line options, such as `--master`, as shown above. `spark-submit` can accept any Spark property using the `--conf` flag, but uses special flags for properties that play a part in launching the Spark application. Running `./bin/spark-submit --help` will show the entire list of these options.

`bin/spark-submit` will also read configuration options from `conf/spark-defaults.conf`, in which each line consists of a key and a value separated by whitespace. For example:

    spark.master            spark://5.6.7.8:7077
    spark.executor.memory   4g
    spark.eventLog.enabled  true
    spark.serializer        org.apache.spark.serializer.KryoSerializer