Skip to content
Snippets Groups Projects
Commit 96287884 authored by Yin Huai's avatar Yin Huai
Browse files

[SPARK-11840][SQL] Restore the 1.5's behavior of planning a single distinct aggregation.

The impact of this change is for a query that has a single distinct column and does not have any grouping expression like
`SELECT COUNT(DISTINCT a) FROM table`
The plan will be changed from
```
AGG-2 (count distinct)
  Shuffle to a single reducer
    Partial-AGG-2 (count distinct)
      AGG-1 (grouping on a)
        Shuffle by a
          Partial-AGG-1 (grouping on 1)
```
to the following one (1.5 uses this)
```
AGG-2
  AGG-1 (grouping on a)
    Shuffle to a single reducer
      Partial-AGG-1(grouping on a)
```
The first plan is more robust. However, to better benchmark the impact of this change, we should use 1.5's plan and use the conf of `spark.sql.specializeSingleDistinctAggPlanning` to control the plan.

Author: Yin Huai <yhuai@databricks.com>

Closes #9828 from yhuai/distinctRewriter.
parent f4499920
No related branches found
No related tags found
No related merge requests found
......@@ -126,8 +126,8 @@ case class DistinctAggregationRewriter(conf: CatalystConf) extends Rule[LogicalP
val shouldRewrite = if (conf.specializeSingleDistinctAggPlanning) {
// When the flag is set to specialize single distinct agg planning,
// we will rely on our Aggregation strategy to handle queries with a single
// distinct column and this aggregate operator does have grouping expressions.
distinctAggGroups.size > 1 || (distinctAggGroups.size == 1 && a.groupingExpressions.isEmpty)
// distinct column.
distinctAggGroups.size > 1
} else {
distinctAggGroups.size >= 1
}
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment