    [SPARK-19497][SS] Implement streaming deduplication · 9bf4e2ba
    Shixiong Zhu authored
    ## What changes were proposed in this pull request?
    
    This PR adds a dedicated streaming deduplication operator so that `dropDuplicates` can be combined with aggregation and watermarks. It reuses the existing `dropDuplicates` API but creates a new logical plan, `Deduplication`, and a new physical plan, `DeduplicationExec`.
    
    The following cases are supported:
    
    - one or more `dropDuplicates()` calls without aggregation (with or without a watermark)
    - `dropDuplicates` before aggregation
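
To make the "with or without watermark" distinction concrete, here is a minimal pure-Python sketch of keyed deduplication over micro-batches. It is a toy model, not the `DeduplicationExec` implementation: the `StreamingDedup` class, the `(key, event_time)` row shape, and the `delay` parameter are all hypothetical names introduced for illustration, and the model assumes the watermark advances only between batches. It keys on `key` alone and stores the first event time seen for it.

```python
from typing import Dict, Hashable, Iterable, List, Optional, Tuple

# Hypothetical row shape for this sketch: (key, event_time_in_seconds).
Row = Tuple[Hashable, int]


class StreamingDedup:
    """Toy model of keyed streaming deduplication (not Spark's actual code)."""

    def __init__(self, delay: Optional[int] = None) -> None:
        self.delay = delay            # None models "no watermark"
        self.watermark = 0            # watermark applied to the current batch
        self.max_event_time = 0       # largest event time observed so far
        self.seen: Dict[Hashable, int] = {}  # key -> first event time seen

    def process_batch(self, rows: Iterable[Row]) -> List[Row]:
        out: List[Row] = []
        for key, ts in rows:
            self.max_event_time = max(self.max_event_time, ts)
            # With a watermark, rows older than the watermark are dropped as late.
            if self.delay is not None and ts < self.watermark:
                continue
            # Emit only the first row seen for each key.
            if key in self.seen:
                continue
            self.seen[key] = ts
            out.append((key, ts))
        if self.delay is not None:
            # Advance the watermark after the batch, then evict expired state.
            self.watermark = max(self.watermark, self.max_event_time - self.delay)
            self.seen = {k: t for k, t in self.seen.items() if t >= self.watermark}
        return out


batch1 = [("a", 100), ("a", 101), ("b", 105)]
batch2 = [("a", 102), ("c", 90), ("d", 120)]

with_wm = StreamingDedup(delay=10)
print(with_wm.process_batch(batch1))  # [('a', 100), ('b', 105)]
print(with_wm.process_batch(batch2))  # [('d', 120)]
print(len(with_wm.seen))              # 1

no_wm = StreamingDedup()
no_wm.process_batch(batch1)
print(no_wm.process_batch(batch2))    # [('c', 90), ('d', 120)]
print(len(no_wm.seen))                # 4
```

In this model the watermark is what keeps state bounded: without it, every key ever seen stays in `seen`, while with it, old keys are evicted and sufficiently late rows are dropped instead of being compared against state.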
    
    Unsupported cases:
    
    - `dropDuplicates` after aggregation
    
    Breaking changes:

    - `dropDuplicates` without aggregation does not work in `complete` or `update` output mode.
    
    ## How was this patch tested?
    
    New unit tests were added.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #16970 from zsxwing/dedup.