[SPARK-18236] Reduce duplicate objects in Spark UI and HistoryServer
## What changes were proposed in this pull request?

When profiling heap dumps from the HistoryServer and live Spark web UIs, I found a large amount of memory being wasted on duplicated objects and strings. This patch's changes remove most of this duplication, resulting in over 40% memory savings for some benchmarks.

- **Task metrics** (6441f0624dfcda9c7193a64bfb416a145b5aabdf): previously, every `TaskUIData` object would have its own instances of `InputMetricsUIData`, `OutputMetricsUIData`, `ShuffleReadMetrics`, and `ShuffleWriteMetrics`, but for many tasks these metrics are irrelevant because they're all zero. This patch changes how we construct these metrics in order to re-use a single immutable "empty" value for the cases where these metrics are empty.
- **TaskInfo.accumulables** (ade86db901127bf13c0e0bdc3f09c933a093bb76): previously, every `TaskInfo` object had its own empty `ListBuffer` for holding updates from named accumulators. Tasks which didn't use named accumulators still paid the cost of allocating and storing this empty buffer. To avoid this overhead, I changed the `val` holding a mutable buffer into a `var` which holds an immutable Scala list, allowing tasks which do not have named accumulator updates to share the same singleton `Nil` object.
- **String.intern() in JsonProtocol** (7e05630e9a78c455db8c8c499f0590c864624e05): in the HistoryServer, executor hostnames and ids are deserialized from JSON, leading to massive duplication of these string objects. By calling `String.intern()` on the deserialized values we can remove all of this duplication. Since Spark now requires Java 7+ we don't have to worry about string interning exhausting the permgen (see http://java-performance.info/string-intern-in-java-6-7-8/).

## How was this patch tested?

I ran

```
sc.parallelize(1 to 100000, 100000).count()
```

in `spark-shell` with event logging enabled, then loaded that event log in the HistoryServer, performed a full GC, and took a heap dump.
According to YourKit, the changes in this patch reduced memory consumption by roughly 28 megabytes (or 770k Java objects):

[image not preserved]

Here's a table illustrating the drop in objects due to deduplication (the drop is <100k for some objects because some events were dropped from the listener bus; this is a separate, existing bug that I'll address separately after CPU-profiling):

[image not preserved]

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15743 from JoshRosen/spark-ui-memory-usage.
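
The three deduplication techniques described in this pull request can be sketched in a few lines of Scala. This is only an illustration under stated assumptions, not the code from the patch: `DedupSketch`, `InputMetricsSketch`, `TaskInfoSketch`, and `internField` are hypothetical names invented here, and the accumulator updates are modelled as plain strings for brevity.

```scala
// Illustrative sketch only: every name below is a hypothetical stand-in,
// not one of Spark's actual classes or methods.
object DedupSketch {

  // 1. Shared immutable "empty" value: tasks whose metrics are all zero
  //    point at one instance instead of each allocating their own.
  final case class InputMetricsSketch(bytesRead: Long, recordsRead: Long)

  object InputMetricsSketch {
    val Empty: InputMetricsSketch = new InputMetricsSketch(0L, 0L)

    def of(bytesRead: Long, recordsRead: Long): InputMetricsSketch =
      if (bytesRead == 0L && recordsRead == 0L) Empty
      else new InputMetricsSketch(bytesRead, recordsRead)
  }

  // 2. A `var` holding an immutable List instead of a `val` holding a
  //    mutable ListBuffer: tasks with no updates all share the `Nil` singleton.
  final class TaskInfoSketch(val taskId: Long) {
    private var _accumulables: List[String] = Nil

    def accumulables: List[String] = _accumulables

    def addAccumulable(update: String): Unit = {
      _accumulables = update :: _accumulables
    }
  }

  // 3. Interning deserialized strings: equal values collapse to one object.
  def internField(raw: String): String = raw.intern()

  def main(args: Array[String]): Unit = {
    // Two all-zero metrics objects are the same instance.
    println(InputMetricsSketch.of(0L, 0L) eq InputMetricsSketch.of(0L, 0L))

    // A task without accumulator updates holds no buffer of its own.
    println(new TaskInfoSketch(1L).accumulables eq Nil)

    // `new String` forces distinct objects, as a JSON parser typically would.
    val h1 = new String("host-1")
    val h2 = new String("host-1")
    println(h1 eq h2)                           // distinct before interning
    println(internField(h1) eq internField(h2)) // same object afterwards
  }
}
```

In each case the saving comes from reference sharing: the per-task cost of the common "nothing to report" case drops from one or more allocated objects to a pointer at an existing one. Sharing is safe here only because the shared values are immutable.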
Changed files:

- core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala: 2 additions, 1 deletion
- core/src/main/scala/org/apache/spark/scheduler/TaskInfo.scala: 7 additions, 3 deletions
- core/src/main/scala/org/apache/spark/ui/jobs/UIData.scala: 62 additions, 21 deletions
- core/src/main/scala/org/apache/spark/util/JsonProtocol.scala: 5 additions, 5 deletions
- core/src/test/scala/org/apache/spark/ui/jobs/JobProgressListenerSuite.scala: 1 addition, 1 deletion
- core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala: 2 additions, 5 deletions
- project/MimaExcludes.scala: 4 additions, 1 deletion
- sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLListenerSuite.scala: 1 addition, 1 deletion