[SPARK-19497][SS] Implement streaming deduplication
## What changes were proposed in this pull request?

This PR adds a special streaming deduplication operator to support `dropDuplicates` with `aggregation` and watermark. It reuses the `dropDuplicates` API but creates new logical plan `Deduplication` and new physical plan `DeduplicationExec`.

The following cases are supported:

- one or multiple `dropDuplicates()` without aggregation (with or without watermark)
- `dropDuplicates` before aggregation

Not supported cases:

- `dropDuplicates` after aggregation

Breaking changes:

- `dropDuplicates` without aggregation doesn't work with `complete` or `update` mode.

## How was this patch tested?

The new unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16970 from zsxwing/dedup.
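For readers unfamiliar with why a watermark matters here, the following is a toy, pure-Python model of a stateful deduplication operator. It is a sketch under stated assumptions, not the code in this patch: the class `StreamingDedup`, the `(key, event_time)` row shape, and the eviction rule are all hypothetical simplifications. It only illustrates the general idea that a dedup operator must remember keys it has seen, and that a watermark lets it discard old state and late rows.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, Iterable, List, Optional, Tuple

# Hypothetical row shape for illustration: (dedup key, event time in seconds).
Row = Tuple[Hashable, int]


@dataclass
class StreamingDedup:
    """Toy model of a stateful dedup operator (not Spark's implementation)."""

    # Watermark delay in seconds; None models dropDuplicates without a watermark.
    delay: Optional[int] = None
    # State: every key seen so far, mapped to the event time of its first row.
    seen: Dict[Hashable, int] = field(default_factory=dict)
    # Largest event time observed in earlier batches.
    max_event_time: Optional[int] = None

    def watermark(self) -> Optional[int]:
        """Return the current watermark, or None if it is not defined yet."""
        if self.delay is None or self.max_event_time is None:
            return None
        return self.max_event_time - self.delay

    def process_batch(self, rows: Iterable[Row]) -> List[Row]:
        """Emit only first occurrences; drop late rows and evict old state."""
        rows = list(rows)
        wm = self.watermark()  # derived from previous batches only
        out: List[Row] = []
        for key, ts in rows:
            if wm is not None and ts < wm:
                continue  # row is older than the watermark: treated as too late
            if key in self.seen:
                continue  # duplicate of a key still held in state
            self.seen[key] = ts
            out.append((key, ts))

        # Advance the maximum event time, then evict state below the new watermark.
        batch_max = max((ts for _, ts in rows), default=None)
        if batch_max is not None:
            if self.max_event_time is None or batch_max > self.max_event_time:
                self.max_event_time = batch_max
        wm = self.watermark()
        if wm is not None:
            self.seen = {k: t for k, t in self.seen.items() if t >= wm}
        return out
```

Without a delay (`StreamingDedup()`), the `seen` dictionary in this model grows without bound; with a delay, entries older than the watermark are dropped, at the cost that a key can be emitted again once its state has been evicted. In PySpark, the user-facing calls would look roughly like `df.withWatermark("eventTime", "10 seconds").dropDuplicates(["id", "eventTime"])`, where the column names are placeholders.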
Changed files:

- python/pyspark/sql/dataframe.py (6 additions, 0 deletions)
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala (5 additions, 1 deletion)
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (20 additions, 1 deletion)
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (9 additions, 0 deletions)
- sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationsSuite.scala (55 additions, 1 deletion)
- sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceOperatorSuite.scala (32 additions, 1 deletion)
- sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala (28 additions, 11 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala (14 additions, 1 deletion)
- sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala (10 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala (108 additions, 32 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/streaming/DeduplicateSuite.scala (252 additions, 0 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/streaming/MapGroupsWithStateSuite.scala (1 addition, 8 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/streaming/StateStoreMetricsTest.scala (36 additions, 0 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala (1 addition, 1 deletion)
- sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingAggregationSuite.scala (1 addition, 1 deletion)