[SPARK-17073][SQL] generate column-level statistics
## What changes were proposed in this pull request?

Generate basic column statistics for all the atomic types:

- numeric types: max, min, number of nulls, ndv (number of distinct values)
- date/timestamp types: these are also represented as numbers internally, so they have the same stats as numeric types
- string: avg length, max length, number of nulls, ndv
- binary: avg length, max length, number of nulls
- boolean: number of nulls, number of trues, number of falses

Also support storing and loading these statistics.

One thing to notice: we support analyzing columns independently, e.g.:

- sql1: `ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS key;`
- sql2: `ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS value;`

When running sql2 to collect column stats for `value`, we don’t remove the stats of column `key`, which was analyzed in sql1 but not in sql2. As a result, **users need to guarantee consistency** between sql1 and sql2. If the table has changed before sql2 is run, users should re-analyze column `key` together with column `value`:

`ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS key, value;`

## How was this patch tested?

Added unit tests.

Author: Zhenhua Wang <wzh_zju@163.com>

Closes #15090 from wzhfy/colStats.
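The per-type statistics and the "analyze columns independently" behaviour described above can be illustrated with a small standalone sketch. It is not the code in this patch: it uses plain Python lists instead of Spark tables, the names `column_stats` and `analyze_columns` are hypothetical, and ndv is computed as an exact distinct count purely for simplicity (the description above does not say how the patch computes it).

```python
from typing import Any, Dict, List, Optional


def column_stats(values: List[Optional[Any]]) -> Dict[str, Any]:
    """Basic statistics for one column, chosen by the type of its non-null values."""
    non_null = [v for v in values if v is not None]
    stats: Dict[str, Any] = {"num_nulls": len(values) - len(non_null)}
    if not non_null:
        return stats
    first = non_null[0]
    if isinstance(first, bool):  # check bool before int: bool is a subclass of int
        stats["num_trues"] = sum(1 for v in non_null if v)
        stats["num_falses"] = sum(1 for v in non_null if not v)
    elif isinstance(first, (int, float)):
        stats.update(max=max(non_null), min=min(non_null), ndv=len(set(non_null)))
    elif isinstance(first, (str, bytes)):
        lengths = [len(v) for v in non_null]
        stats.update(avg_len=sum(lengths) / len(lengths), max_len=max(lengths))
        if isinstance(first, str):  # binary columns get no ndv in the list above
            stats["ndv"] = len(set(non_null))
    return stats


def analyze_columns(
    table: Dict[str, List[Optional[Any]]],
    stored: Dict[str, Dict[str, Any]],
    columns: List[str],
) -> Dict[str, Dict[str, Any]]:
    """Recompute stats for the named columns only; other stored stats are kept."""
    updated = dict(stored)
    for name in columns:
        updated[name] = column_stats(table[name])
    return updated


# Hypothetical table standing in for `src`.
table = {"key": [1, 2, 2, None], "value": ["a", "bcd", None, None]}

after_sql1 = analyze_columns(table, {}, ["key"])              # like sql1
table["key"].append(10)                                       # the table changes
table["value"].append("zz")
after_sql2 = analyze_columns(table, after_sql1, ["value"])    # like sql2
reanalyzed = analyze_columns(table, after_sql2, ["key", "value"])

print(after_sql2["key"])   # still describes the table as it was before the change
print(reanalyzed["key"])   # consistent with the current table
```

In this sketch the stats for `key` after the second call still describe the old table, which is the inconsistency the commit message asks users to avoid by re-analyzing both columns together.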
Showing changed files:

- sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4: 1 addition, 1 deletion
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala: 68 additions, 1 deletion
- sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala: 13 additions, 5 deletions
- sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala: 175 additions, 0 deletions
- sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala: 59 additions, 53 deletions
- sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: 9 additions, 0 deletions
- sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala: 3 additions, 5 deletions
- sql/core/src/test/scala/org/apache/spark/sql/StatisticsColumnSuite.scala: 334 additions, 0 deletions
- sql/core/src/test/scala/org/apache/spark/sql/StatisticsSuite.scala: 2 additions, 14 deletions
- sql/core/src/test/scala/org/apache/spark/sql/StatisticsTest.scala: 129 additions, 0 deletions
- sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala: 19 additions, 9 deletions
- sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala: 93 additions, 26 deletions
- sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLViewSuite.scala: 1 addition, 0 deletions