Commit 342278c0 authored by Sean Owen's avatar Sean Owen Committed by Yin Huai

[SPARK-16320][DOC] Document G1 heap region's effect on spark 2.0 vs 1.6

## What changes were proposed in this pull request?

Collect the GC discussion in one section, and document findings about the G1 GC heap region size.

## How was this patch tested?

Jekyll doc build

Author: Sean Owen <sowen@cloudera.com>

Closes #14732 from srowen/SPARK-16320.
parent 209e1b3c
@@ -122,21 +122,8 @@ large records.
 `R` is the storage space within `M` where cached blocks are immune to being evicted by execution.
 The value of `spark.memory.fraction` should be set in order to fit this amount of heap space
-comfortably within the JVM's old or "tenured" generation. Otherwise, when much of this space is
-used for caching and execution, the tenured generation will be full, which causes the JVM to
-significantly increase the time spent in garbage collection. See the
-<a href="https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/sizing.html">Java GC sizing documentation</a>
-for more information.
-The tenured generation's size is controlled by the JVM's `NewRatio` parameter, which defaults to 2,
-meaning that the tenured generation is 2 times the size of the new generation (the rest of the heap).
-So, by default, the tenured generation occupies 2/3, or about 0.66, of the heap. A value of
-0.6 for `spark.memory.fraction` keeps storage and execution memory within the old generation with
-room to spare. If `spark.memory.fraction` is increased to, say, 0.8, then `NewRatio` may have to
-increase to 6 or more.
-`NewRatio` is set as a JVM flag for executors, which means adding
-`spark.executor.extraJavaOptions=-XX:NewRatio=x` to a Spark job's configuration.
+comfortably within the JVM's old or "tenured" generation. See the discussion of advanced GC
+tuning below for details.
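
As a concrete illustration of the relationship described above, here is a minimal sketch (not part of the patch itself; the 0.6 fraction and `NewRatio=2` values are assumptions, chosen so that the tenured generation, 2/3 of the heap, comfortably contains Spark's managed memory):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: keep Spark's managed memory below the share of the
// heap that the tenured generation occupies. With NewRatio=2 (a common
// JVM default), OldGen is 2/3 of the heap, which comfortably holds
// spark.memory.fraction = 0.6.
val conf = new SparkConf()
  .setAppName("gc-tuning-sketch")
  .set("spark.memory.fraction", "0.6")
  // Executor JVM flags must be set before the executors launch:
  .set("spark.executor.extraJavaOptions", "-XX:NewRatio=2")
val sc = new SparkContext(conf)
```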
 
 ## Determining Memory Consumption
@@ -217,14 +204,22 @@ temporary objects created during task execution. Some steps which may be useful
 * Check if there are too many garbage collections by collecting GC stats. If a full GC is invoked multiple times
   before a task completes, it means that there isn't enough memory available for executing tasks.
-* In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of
-  memory used for caching by lowering `spark.memory.storageFraction`; it is better to cache fewer
-  objects than to slow down task execution!
 * If there are too many minor collections but not many major GCs, allocating more memory for Eden would help. You
   can set the size of the Eden to be an over-estimate of how much memory each task will need. If the size of Eden
   is determined to be `E`, then you can set the size of the Young generation using the option `-Xmn=4/3*E`. (The
   scaling up by 4/3 is to account for space used by survivor regions as well; see the worked sketch after this list.)
+* In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of
+  memory used for caching by lowering `spark.memory.fraction`; it is better to cache fewer
+  objects than to slow down task execution. Alternatively, consider decreasing the size of
+  the Young generation. This means lowering `-Xmn` if you've set it as above. If not, try changing the
+  value of the JVM's `NewRatio` parameter. Many JVMs default this to 2, meaning that the Old generation
+  occupies 2/3 of the heap. It should be large enough that this fraction exceeds `spark.memory.fraction`.
+* Try the G1GC garbage collector with `-XX:+UseG1GC`. It can improve performance in some situations where
+  garbage collection is a bottleneck. Note that with large executor heap sizes, it may be important to
+  increase the [G1 region size](https://blogs.oracle.com/g1gc/entry/g1_gc_tuning_a_case)
+  with `-XX:G1HeapRegionSize` (see the example after this list).
 * As an example, if your task is reading data from HDFS, the amount of memory used by the task can be estimated using
   the size of the data block read from HDFS. Note that the size of a decompressed block is often 2 or 3 times the
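
As a worked version of the Eden-sizing rule in the list above, here is a small sketch. The inputs are assumptions for illustration (a 128 MB HDFS block decompressing to roughly 3x its size, four concurrent tasks per executor), not values from the patch:

```scala
// Hypothetical -Xmn sizing, following the 4/3 rule above.
val blockMb         = 128L                          // HDFS block size (assumed)
val perTaskMb       = blockMb * 3                   // ~3x after decompression
val concurrentTasks = 4                             // cores per executor (assumed)
val edenMb          = perTaskMb * concurrentTasks   // over-estimate of Eden, E
val youngGenMb      = edenMb * 4 / 3                // 4/3 * E covers survivor space
println(s"-Xmn${youngGenMb}m")                      // prints: -Xmn2048m
```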
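
Similarly, a sketch of enabling G1 with a larger region size for executors; the 16m value is illustrative (the JVM accepts power-of-two region sizes between 1m and 32m), not a recommendation from this patch:

```scala
import org.apache.spark.SparkConf

// Sketch: enable G1GC and enlarge its region size for a large executor heap.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-XX:+UseG1GC -XX:G1HeapRegionSize=16m")
```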
@@ -237,6 +232,9 @@ Our experience suggests that the effect of GC tuning depends on your application
 There are [many more tuning options](http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html) described online,
 but at a high level, managing how frequently full GC takes place can help in reducing the overhead.
+GC tuning flags for executors can be specified by setting `spark.executor.extraJavaOptions` in
+a job's configuration.
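
For example, a sketch of doing so; the GC-logging flags shown are a common pre-Java 9 way to collect the GC stats mentioned earlier, and are an illustrative choice rather than part of this patch:

```scala
import org.apache.spark.SparkConf

// Sketch: pass GC flags to executors through the job's configuration.
// -verbose:gc and the Print* flags (valid on pre-Java 9 JVMs) emit the
// GC stats discussed above.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
```

The same value can equivalently be passed on the command line via `spark-submit --conf`.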
 # Other Considerations
 
 ## Level of Parallelism