Commit 4a3e9cf6 authored by Matei Zaharia

Document how to configure SPARK_MEM & co on a per-job basis

parent ce6b5a3e
conf/spark-env.sh.template

 #!/usr/bin/env bash

-# Set Spark environment variables for your site in this file. Some useful
-# variables to set are:
+# This file contains environment variables required to run Spark. Copy it as
+# spark-env.sh and edit that to configure Spark for your site. At a minimum,
+# the following two variables should be set:
 # - MESOS_NATIVE_LIBRARY, to point to your Mesos native library (libmesos.so)
 # - SCALA_HOME, to point to your Scala installation
+#
+# If using the standalone deploy mode, you can also set variables for it:
+# - SPARK_MASTER_IP, to bind the master to a different IP address
+# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports
+# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
+# - SPARK_WORKER_MEMORY, to set how much memory to use (e.g. 1000m, 2g)
+# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT
+#
+# Finally, Spark also relies on the following variables, but these can be set
+# on just the *master* (i.e. in your driver program), and will automatically
+# be propagated to workers:
+# - SPARK_MEM, to change the amount of memory used per node (this should
+#   be in the same format as the JVM's -Xmx option, e.g. 300m or 1g)
 # - SPARK_CLASSPATH, to add elements to Spark's classpath
 # - SPARK_JAVA_OPTS, to add JVM options
-# - SPARK_MEM, to change the amount of memory used per node (this should
-#   be in the same format as the JVM's -Xmx option, e.g. 300m or 1g).
 # - SPARK_LIBRARY_PATH, to add extra search paths for native libraries.
-# Settings used by the scripts in the bin/ directory, apply to standalone mode only.
-# Note that the same worker settings apply to all of the workers.
-# - SPARK_MASTER_IP, to bind the master to a different ip address, for example a public one (Default: local ip address)
-# - SPARK_MASTER_PORT, to start the spark master on a different port (Default: 7077)
-# - SPARK_MASTER_WEBUI_PORT, to specify a different port for the Master WebUI (Default: 8080)
-# - SPARK_WORKER_PORT, to start the spark worker on a specific port (Default: random)
-# - SPARK_WORKER_CORES, to specify the number of cores to use (Default: all available cores)
-# - SPARK_WORKER_MEMORY, to specify how much memory to use, e.g. 1000M, 2G (Default: MAX(Available - 1024MB, 512MB))
-# - SPARK_WORKER_WEBUI_PORT, to specify a different port for the Worker WebUI (Default: 8081)
\ No newline at end of file
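To make the template's guidance concrete, a minimal site-specific `spark-env.sh` might look like the sketch below; the paths and values are placeholders, not Spark defaults.

```bash
#!/usr/bin/env bash
# Minimal sketch of a site-specific spark-env.sh (paths and values are placeholders)
export SCALA_HOME=/usr/local/scala                        # your Scala installation
export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so    # needed when running on Mesos

# Optional standalone-mode settings for this machine's worker
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=4g
```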
docs/configuration.md

@@ -5,33 +5,45 @@ title: Spark Configuration

 Spark provides three main locations to configure the system:

-* The [`conf/spark-env.sh` script](#environment-variables-in-spark-envsh), in which you can set environment variables
-  that affect how the JVM is launched, such as, most notably, the amount of memory per JVM.
+* [Environment variables](#environment-variables) for launching Spark workers, which can
+  be set either in your driver program or in the `conf/spark-env.sh` script.
 * [Java system properties](#system-properties), which control internal configuration parameters and can be set either
   programmatically (by calling `System.setProperty` *before* creating a `SparkContext`) or through the
   `SPARK_JAVA_OPTS` environment variable in `spark-env.sh`.
 * [Logging configuration](#configuring-logging), which is done through `log4j.properties`.

-# Environment Variables in spark-env.sh
+# Environment Variables

 Spark determines how to initialize the JVM on worker nodes, or even on the local node when you run `spark-shell`,
 by running the `conf/spark-env.sh` script in the directory where it is installed. This script does not exist by default
 in the Git repository, but you can create it by copying `conf/spark-env.sh.template`. Make sure that you make
 the copy executable.

-Inside `spark-env.sh`, you can set the following environment variables:
+Inside `spark-env.sh`, you *must* set at least the following two environment variables:

 * `SCALA_HOME` to point to your Scala installation.
 * `MESOS_NATIVE_LIBRARY` if you are [running on a Mesos cluster](running-on-mesos.html).
-* `SPARK_MEM` to set the amount of memory used per node (this should be in the same format as the JVM's -Xmx option, e.g. `300m` or `1g`)
+
+In addition, there are four other variables that control execution. These can be set *either in `spark-env.sh`
+or in each job's driver program*, because they will automatically be propagated to workers from the driver.
+For a multi-user environment, we recommend setting them in the driver program instead of `spark-env.sh`, so
+that different user jobs can use different amounts of memory, JVM options, etc.
+
+* `SPARK_MEM` to set the amount of memory used per node (this should be in the same format as the
+  JVM's -Xmx option, e.g. `300m` or `1g`)
 * `SPARK_JAVA_OPTS` to add JVM options. This includes any system properties that you'd like to pass with `-D`.
 * `SPARK_CLASSPATH` to add elements to Spark's classpath.
 * `SPARK_LIBRARY_PATH` to add search directories for native libraries.

-The most important things to set first will be `SCALA_HOME`, without which `spark-shell` cannot run, and `MESOS_NATIVE_LIBRARY`
-if running on Mesos. The next setting will probably be the memory (`SPARK_MEM`). Make sure you set it high enough to be able to run your job but lower than the total memory on the machines (leave at least 1 GB for the operating system).
+Note that if you do set these in `spark-env.sh`, they will override the values set by user programs, which
+is undesirable; you can choose to have `spark-env.sh` set them only if the user program hasn't, as follows:
+
+{% highlight bash %}
+if [ -z "$SPARK_MEM" ] ; then
+  SPARK_MEM="1g"
+fi
+{% endhighlight %}

 # System Properties
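Following the per-job recommendation above, a job's own launch script can export these variables before starting the driver, and Spark will propagate them to the workers. The job class, master URL, and property name below are placeholders, not values taken from this commit.

```bash
# One job's launch script: per-job settings exported before starting the driver
export SPARK_MEM=2g                                     # memory per node for this job only
export SPARK_JAVA_OPTS="-Dspark.some.property=value"    # hypothetical -D system property
./run my.package.MyJob mesos://master:5050              # placeholder job class and master URL
```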
run

@@ -19,12 +19,6 @@ if [ -z "$SCALA_HOME" ]; then
   exit 1
 fi

-# If the user specifies a Mesos JAR, put it before our included one on the classpath
-MESOS_CLASSPATH=""
-if [ -n "$MESOS_JAR" ] ; then
-  MESOS_CLASSPATH="$MESOS_JAR"
-fi
-
 # Figure out how much memory to use per executor and set it as an environment
 # variable so that our process sees it and can report it to Mesos
 if [ -z "$SPARK_MEM" ] ; then

@@ -49,7 +43,6 @@ BAGEL_DIR="$FWDIR/bagel"

 # Build up classpath
 CLASSPATH="$SPARK_CLASSPATH"
-CLASSPATH+=":$MESOS_CLASSPATH"
 CLASSPATH+=":$FWDIR/conf"
 CLASSPATH+=":$CORE_DIR/target/scala-$SCALA_VERSION/classes"
 if [ -n "$SPARK_TESTING" ] ; then
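For context, the memory logic referenced in the first hunk follows a guard-then-export pattern: if the job did not set `SPARK_MEM`, `run` falls back to a default and exports the result so the executor process can see it and report it to Mesos. A simplified sketch, with the default value assumed rather than copied from the script:

```bash
# Simplified sketch of run's SPARK_MEM handling (the 512m default is an assumption)
if [ -z "$SPARK_MEM" ] ; then
  SPARK_MEM="512m"
fi
export SPARK_MEM   # make the value visible to the launched executor process
```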