[SPARK-14255][SQL] Streaming Aggregation
This PR adds the ability to perform aggregations inside of a `ContinuousQuery`. In order to implement this feature, the planning of aggregation has been augmented with a new `StatefulAggregationStrategy`. Unlike batch aggregation, stateful aggregation uses the `StateStore` (introduced in #11645) to persist the results of partial aggregation across different invocations. The resulting physical plan performs the aggregation using the following progression:
 - Partial Aggregation
 - Shuffle
 - Partial Merge (now there is at most 1 tuple per group)
 - StateStoreRestore (now there is 1 tuple from this batch + optionally one from the previous)
 - Partial Merge (now there is at most 1 tuple per group)
 - StateStoreSave (saves the tuple for the next batch)
 - Complete (output the current result of the aggregation)

The following refactoring was also performed to allow us to plug into existing code:
 - The get/put implementation is taken from #12013.
 - The logic for breaking down and de-duping the physical execution of aggregation has been moved into a new pattern, `PhysicalAggregation`.
 - The `AttributeReference` used to identify the result of an `AggregateFunction` has been moved into the `AggregateExpression` container. This change moves the reference into the same object as the other intermediate references used in aggregation and eliminates the need to pass around a `Map[(AggregateFunction, Boolean), Attribute]`. Further cleanup (using a different aggregation container for logical/physical plans) is deferred to a follow-up.
 - Some planning logic has been moved from the `SessionState` into the `QueryExecution` to make it easier to override in the streaming case.
 - The ability to write a `StreamTest` that checks only the output of the last batch has been added to simulate the future addition of output modes.

Author: Michael Armbrust <michael@databricks.com>

Closes #12048 from marmbrus/statefulAgg.
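To make the progression above concrete, here is a minimal, Spark-free sketch of one batch of a streaming count. It is an illustration only and not the code in this PR: `StatefulAggSketch`, `partialMerge`, and `runBatch` are hypothetical names, an immutable `Map` stands in for the `StateStore`, and the choice to output only the groups touched by the batch is an assumption.

```scala
// Hypothetical model of the stateful aggregation progression (not Spark code).
object StatefulAggSketch {
  type Key = String
  type Buffer = Long // partial aggregation buffer for a count

  // Merge buffers that share a key, leaving at most one tuple per group.
  def partialMerge(rows: Seq[(Key, Buffer)]): Map[Key, Buffer] =
    rows.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  // Run one batch. `state` plays the role of the StateStore contents left by
  // the previous batch. Returns (new state, output of this batch).
  def runBatch(
      partitions: Seq[Seq[Key]],
      state: Map[Key, Buffer]): (Map[Key, Buffer], Map[Key, Buffer]) = {
    // 1. Partial aggregation inside each input partition.
    val partial: Seq[Seq[(Key, Buffer)]] = partitions.map { p =>
      p.groupBy(identity).map { case (k, vs) => k -> vs.size.toLong }.toSeq
    }
    // 2. Shuffle: co-locate tuples by key (modelled here by flattening).
    val shuffled = partial.flatten
    // 3. Partial merge: at most one tuple per group from this batch.
    val merged = partialMerge(shuffled)
    // 4. StateStoreRestore: add the previous tuple for each group, if any.
    val restored = merged.toSeq ++
      merged.keys.toSeq.flatMap(k => state.get(k).map(prev => k -> prev))
    // 5. Partial merge again: back to at most one tuple per group.
    val updated = partialMerge(restored)
    // 6. StateStoreSave: persist the merged tuples for the next batch.
    val newState = state ++ updated
    // 7. Complete: output the current result (assumed: touched groups only).
    (newState, updated)
  }
}
```

Feeding the returned state of one call into the next call models successive invocations of the query, which is the part batch aggregation does not need.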
Changed files:
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (7 additions, 2 deletions)
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala (1 addition, 1 deletion)
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/errors/package.scala (5 additions, 2 deletions)
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala (36 additions, 1 deletion)
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala (1 addition, 1 deletion)
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (7 additions, 7 deletions)
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala (73 additions, 0 deletions)
- sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/PlanTest.scala (3 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala (22 additions, 2 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala (7 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala (2 additions, 2 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala (29 additions, 63 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/Window.scala (1 addition, 1 deletion)
- sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala (2 additions, 2 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/utils.scala (96 additions, 25 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala (72 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StatefulAggregate.scala (119 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala (8 additions, 4 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/memory.scala (3 additions, 1 deletion)
- sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala (17 additions, 19 deletions)