# [SPARK-21743][SQL] top-most limit should not cause memory leak
## What changes were proposed in this pull request?

For a top-most limit, we use a special operator to execute it: `CollectLimitExec`. `CollectLimitExec` retrieves `n` (the limit) rows from each partition of the child plan's output, see https://github.com/apache/spark/blob/v2.2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L311. It is therefore very likely that the child plan's output is not exhausted. This is fine when whole-stage codegen is off, as the child plan releases its resources via a task completion listener. However, when whole-stage codegen is on, the resources can only be released once all output has been consumed.

To fix this memory leak, one simple approach is to ensure that when `CollectLimitExec` retrieves `n` rows from the child plan's output, that output contains exactly `n` rows, so the output is exhausted and the resources are released. This can be done by wrapping the child plan with `LocalLimit`.

## How was this patch tested?

A regression test.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #18955 from cloud-fan/leak.
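The mechanism behind the leak can be illustrated outside of Spark. Below is a minimal, self-contained Scala sketch (not Spark code; `PartitionIterator` and its release hook are hypothetical stand-ins for a whole-stage-codegen produce loop): a producer that frees its resources only once its input is exhausted never gets there if the consumer takes just `n` rows, but does if the producer itself is bounded by a local limit.

```scala
// Illustrative sketch only -- not Spark code. PartitionIterator stands in for a
// whole-stage-codegen produce loop, which frees its resources only when the
// generated loop runs to completion (i.e. its input is exhausted).
object LimitLeakSketch {
  final class PartitionIterator(rows: Iterator[Int]) extends Iterator[Int] {
    private var released = false
    // Codegen analogy: cleanup runs when the produce loop sees no more input.
    private def maybeRelease(): Unit =
      if (!released && !rows.hasNext) { released = true; println("resource released") }
    def hasNext: Boolean = rows.hasNext
    def next(): Int = { val row = rows.next(); maybeRelease(); row }
  }

  def main(args: Array[String]): Unit = {
    val n = 3

    // Top-most limit alone: the consumer stops after n rows, the producer is
    // never exhausted, and "resource released" is never printed -- the leak.
    val leaky = new PartitionIterator(Iterator.range(0, 1000))
    println(leaky.take(n).toList)

    // The fix from this PR, in miniature: a local limit below the collect makes
    // the producer emit only n rows, so consuming them exhausts the producer
    // and triggers the release.
    val fixed = new PartitionIterator(Iterator.range(0, 1000).take(n))
    println(fixed.take(n).toList) // also prints "resource released"
  }
}
```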
Showing 3 changed files with 19 additions and 1 deletion:
- sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala (6 additions, 1 deletion)
- sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala (8 additions, 0 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala (5 additions, 0 deletions)