---
layout: global
displayTitle: Spark Configuration
title: Configuration
---
- This will become a table of contents (this text will be scraped).
{:toc}
Spark provides three locations to configure the system:

- Spark properties control most application parameters and can be set by using a `SparkConf`
  object, or through Java system properties.
- Environment variables can be used to set per-machine settings, such as the IP address,
  through the `conf/spark-env.sh` script on each node.
- Logging can be configured through `log4j.properties`.
# Spark Properties
Spark properties control most application settings and are configured separately for each
application. These properties can be set directly on a `SparkConf` passed to your
`SparkContext`. `SparkConf` allows you to configure some of the common properties
(e.g. master URL and application name), as well as arbitrary key-value pairs through the
`set()` method. For example, we could initialize an application with two threads as follows:
Note that we run with `local[2]`, meaning two threads - which represents "minimal" parallelism,
and can help detect bugs that only exist when we run in a distributed context.
{% highlight scala %}
val conf = new SparkConf()
             .setMaster("local[2]")
             .setAppName("CountingSheep")
val sc = new SparkContext(conf)
{% endhighlight %}
Note that we can have more than 1 thread in local mode, and in cases like Spark Streaming, we may
actually require more than one to prevent any sort of starvation issues.
Properties that specify a time duration should be configured with a unit of time.
The following format is accepted:

    25ms (milliseconds)
    5s (seconds)
    10m or 10min (minutes)
    3h (hours)
    5d (days)
    1y (years)
Properties that specify a byte size should be configured with a unit of size.
The following format is accepted:

    1b (bytes)
    1k or 1kb (kibibytes = 1024 bytes)
    1m or 1mb (mebibytes = 1024 kibibytes)
    1g or 1gb (gibibytes = 1024 mebibytes)
    1t or 1tb (tebibytes = 1024 gibibytes)
    1p or 1pb (pebibytes = 1024 tebibytes)
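
For example, here is a minimal sketch of setting such properties on a `SparkConf`. The values are
illustrative, and `spark.network.timeout` is just one example of a duration-typed property:

{% highlight scala %}
import org.apache.spark.SparkConf

// Illustrative values: duration and size properties take a unit suffix
// from the lists above.
val conf = new SparkConf()
  .set("spark.network.timeout", "120s")      // a duration: 120 seconds
  .set("spark.driver.maxResultSize", "2g")   // a size: 2 gibibytes
{% endhighlight %}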
## Dynamically Loading Spark Properties
In some cases, you may want to avoid hard-coding certain configurations in a `SparkConf`. For
instance, if you'd like to run the same application with different masters or different
amounts of memory, Spark allows you to simply create an empty conf:
{% highlight scala %}
val sc = new SparkContext(new SparkConf())
{% endhighlight %}
Then, you can supply configuration values at runtime:

{% highlight bash %}
./bin/spark-submit --name "My app" --master local[4] --conf spark.shuffle.spill=false
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
{% endhighlight %}
The Spark shell and `spark-submit` tool support two ways to load configurations dynamically.
The first is command line options, such as `--master`, as shown above. `spark-submit` can
accept any Spark property using the `--conf` flag, but uses special flags for properties that
play a part in launching the Spark application. Running `./bin/spark-submit --help` will show
the entire list of these options.
`bin/spark-submit` will also read configuration options from `conf/spark-defaults.conf`, in
which each line consists of a key and a value separated by whitespace. For example:

    spark.master            spark://5.6.7.8:7077
    spark.executor.memory   4g
    spark.eventLog.enabled  true
    spark.serializer        org.apache.spark.serializer.KryoSerializer
Any values specified as flags or in the properties file will be passed on to the application
and merged with those specified through SparkConf. Properties set directly on the SparkConf
take highest precedence, then flags passed to `spark-submit` or `spark-shell`, then options
in the `spark-defaults.conf` file. A few configuration keys have been renamed since earlier
versions of Spark; in such cases, the older key names are still accepted, but take lower
precedence than any instance of the newer key.
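
As a concrete (hypothetical) illustration of this precedence: if `spark-defaults.conf` contains
`spark.executor.memory 2g`, the application is launched with `--conf spark.executor.memory=4g`,
and the code calls `conf.set("spark.executor.memory", "8g")`, the executors will run with `8g`.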
## Viewing Spark Properties
The application web UI at `http://<driver>:4040` lists Spark properties in the "Environment"
tab. This is a useful place to check to make sure that your properties have been set correctly.
Note that only values explicitly specified through `spark-defaults.conf`, `SparkConf`, or the
command line will appear. For all other configuration properties, you can assume the default
value is used.
## Available Properties
Most of the properties that control internal settings have reasonable default values. Some of the most common options to set are:
#### Application Properties
| Property Name | Default | Meaning |
|---|---|---|
| `spark.app.name` | (none) | The name of your application. This will appear in the UI and in log data. |
| `spark.driver.cores` | 1 | Number of cores to use for the driver process, only in cluster mode. |
| `spark.driver.maxResultSize` | 1g | Limit of total size of serialized results of all partitions for each Spark action (e.g. `collect`). Should be at least 1M, or 0 for unlimited. Jobs will be aborted if the total size is above this limit. Having a high limit may cause out-of-memory errors in the driver (depends on `spark.driver.memory` and the memory overhead of objects in the JVM). Setting a proper limit can protect the driver from out-of-memory errors. |
| `spark.driver.memory` | 1g | Amount of memory to use for the driver process, i.e. where SparkContext is initialized (e.g. `1g`, `2g`). |
| `spark.executor.memory` | 1g | Amount of memory to use per executor process (e.g. `2g`, `8g`). |
| `spark.extraListeners` | (none) | A comma-separated list of classes that implement `SparkListener`; when initializing SparkContext, instances of these classes will be created and registered with Spark's listener bus. If a class has a single-argument constructor that accepts a SparkConf, that constructor will be called; otherwise, a zero-argument constructor will be called. If no valid constructor can be found, the SparkContext creation will fail with an exception. (See the listener sketch after this table.) |
| `spark.local.dir` | /tmp | Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks. |
| `spark.logConf` | false | Logs the effective SparkConf as INFO when a SparkContext is started. |
| `spark.master` | (none) | The cluster manager to connect to. See the list of allowed master URLs. |
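
As referenced above, here is a minimal sketch of a listener that could be registered through
`spark.extraListeners`. The class name and the log message are hypothetical; only the
`SparkListener` type and the SparkConf-argument constructor convention come from the table above:

{% highlight scala %}
import org.apache.spark.SparkConf
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

// Hypothetical listener. Because it has a single-argument constructor
// accepting a SparkConf, Spark will call that constructor when registering it.
class MyAppListener(conf: SparkConf) extends SparkListener {
  override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit = {
    println(s"Application '${conf.get("spark.app.name", "unknown")}' ended at ${end.time}")
  }
}
{% endhighlight %}

It would then be registered by fully-qualified name, e.g.
`--conf spark.extraListeners=com.example.MyAppListener` (package name hypothetical).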
Apart from these, the following properties are also available, and may be useful in some situations:
#### Runtime Environment
| Property Name | Default | Meaning |
|---|---|---|
| `spark.driver.extraClassPath` | (none) | Extra classpath entries to prepend to the classpath of the driver. |
| `spark.driver.extraJavaOptions` | (none) | A string of extra JVM options to pass to the driver. For instance, GC settings or other logging. |
| `spark.driver.extraLibraryPath` | (none) | Set a special library path to use when launching the driver JVM. |
| `spark.driver.userClassPathFirst` | false | (Experimental) Whether to give user-added jars precedence over Spark's own jars when loading classes in the driver. This feature can be used to mitigate conflicts between Spark's dependencies and user dependencies. It is currently an experimental feature. |
| `spark.executor.extraClassPath` | (none) | Extra classpath entries to prepend to the classpath of executors. This exists primarily for backwards-compatibility with older versions of Spark. Users typically should not need to set this option. |
| `spark.executor.extraJavaOptions` | (none) | A string of extra JVM options to pass to executors. For instance, GC settings or other logging. Note that it is illegal to set Spark properties or heap size settings with this option. Spark properties should be set using a SparkConf object or the `spark-defaults.conf` file used with the `spark-submit` script. Heap size settings can be set with `spark.executor.memory`. |
| `spark.executor.extraLibraryPath` | (none) | Set a special library path to use when launching executor JVMs. |
| `spark.executor.logs.rolling.maxRetainedFiles` | (none) | Sets the number of latest rolling log files that are going to be retained by the system. Older log files will be deleted. Disabled by default. |
| `spark.executor.logs.rolling.maxSize` | (none) | Set the max size of the file by which the executor logs will be rolled over. Rolling is disabled by default. See `spark.executor.logs.rolling.maxRetainedFiles` for automatic cleaning of old logs. |
| `spark.executor.logs.rolling.strategy` | (none) | Set the strategy of rolling of executor logs. By default it is disabled. It can be set to "time" (time-based rolling) or "size" (size-based rolling). For "time", use `spark.executor.logs.rolling.time.interval` to set the rolling interval. For "size", use `spark.executor.logs.rolling.maxSize` to set the maximum file size for rolling. |
| `spark.executor.logs.rolling.time.interval` | daily | Set the time interval by which the executor logs will be rolled over. Rolling is disabled by default. Valid values are `daily`, `hourly`, `minutely` or any interval in seconds. See `spark.executor.logs.rolling.maxRetainedFiles` for automatic cleaning of old logs. |
| `spark.executor.userClassPathFirst` | false | (Experimental) Same functionality as `spark.driver.userClassPathFirst`, but applied to executor instances. |
| `spark.executorEnv.[EnvironmentVariableName]` | (none) | Add the environment variable specified by `EnvironmentVariableName` to the Executor process. The user can specify multiple of these to set multiple environment variables. |
| `spark.python.profile` | false | Enable profiling in Python workers. The profile result will show up via `sc.show_profiles()`, or it will be displayed before the driver exits. It can also be dumped to disk via `sc.dump_profiles(path)`. If some of the profile results have been displayed manually, they will not be displayed automatically before the driver exits. |
| `spark.python.profile.dump` | (none) | The directory used to dump the profile result before the driver exits. The results will be dumped as a separate file for each RDD. They can be loaded via `pstats.Stats()`. If this is specified, the profile result will not be displayed automatically. |
| `spark.python.worker.memory` | 512m | Amount of memory to use per Python worker process during aggregation, in the same format as JVM memory strings (e.g. `512m`, `2g`). If the memory used during aggregation goes above this amount, it will spill the data to disk. |
| `spark.python.worker.reuse` | true | Whether to reuse Python workers. If yes, it will use a fixed number of Python workers and does not need to fork() a Python process for every task. This is very useful if there is a large broadcast, since the broadcast will then not need to be transferred from the JVM to a Python worker for every task. |
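
For instance, here is a minimal sketch (values illustrative) combining a few of the
runtime-environment properties above on a `SparkConf`:

{% highlight scala %}
import org.apache.spark.SparkConf

// Illustrative values: set an executor environment variable and enable
// time-based rolling of executor logs, keeping one day of hourly files.
val conf = new SparkConf()
  .set("spark.executorEnv.TZ", "UTC")
  .set("spark.executor.logs.rolling.strategy", "time")
  .set("spark.executor.logs.rolling.time.interval", "hourly")
  .set("spark.executor.logs.rolling.maxRetainedFiles", "24")
{% endhighlight %}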