-
- Downloads
[SPARK-8964] [SQL] Use Exchange to perform shuffle in Limit
This patch changes the implementation of the physical `Limit` operator so that it relies on the `Exchange` operator to perform data movement rather than directly using `ShuffledRDD`. In addition to improving efficiency, this lays the necessary groundwork for further optimization of limit, such as limit pushdown or whole-stage codegen. At a high-level, this replaces the old physical `Limit` operator with two new operators, `LocalLimit` and `GlobalLimit`. `LocalLimit` performs per-partition limits, while `GlobalLimit` applies the final limit to a single partition; `GlobalLimit`'s declares that its `requiredInputDistribution` is `SinglePartition`, which will cause the planner to use an `Exchange` to perform the appropriate shuffles. Thus, a logical `Limit` appearing in the middle of a query plan will be expanded into `LocalLimit -> Exchange to one partition -> GlobalLimit`. In the old code, calling `someDataFrame.limit(100).collect()` or `someDataFrame.take(100)` would actually skip the shuffle and use a fast-path which used `executeTake()` in order to avoid computing all partitions in case only a small number of rows were requested. This patch preserves this optimization by treating logical `Limit` operators specially when they appear as the terminal operator in a query plan: if a `Limit` is the final operator, then we will plan a special `CollectLimit` physical operator which implements the old `take()`-based logic. In order to be able to match on operators only at the root of the query plan, this patch introduces a special `ReturnAnswer` logical operator which functions similar to `BroadcastHint`: this dummy operator is inserted at the root of the optimized logical plan before invoking the physical planner, allowing the planner to pattern-match on it. Author: Josh Rosen <joshrosen@databricks.com> Closes #7334 from JoshRosen/remove-copy-in-limit.
Showing
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala 11 additions, 0 deletions...che/spark/sql/catalyst/plans/logical/basicOperators.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala 71 additions, 59 deletions.../main/scala/org/apache/spark/sql/execution/Exchange.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala 2 additions, 2 deletions...scala/org/apache/spark/sql/execution/QueryExecution.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala 6 additions, 1 deletion...cala/org/apache/spark/sql/execution/SparkStrategies.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala 1 addition, 94 deletions...scala/org/apache/spark/sql/execution/basicOperators.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala 122 additions, 0 deletions...src/main/scala/org/apache/spark/sql/execution/limit.scala
- sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala 8 additions, 2 deletions...t/scala/org/apache/spark/sql/execution/PlannerSuite.scala
- sql/core/src/test/scala/org/apache/spark/sql/execution/SortSuite.scala 2 additions, 2 deletions...test/scala/org/apache/spark/sql/execution/SortSuite.scala
Loading
Please register or sign in to comment