[SPARK-21213][SQL] Support collecting partition-level statistics: rowCount and sizeInBytes
## What changes were proposed in this pull request?

Added support for the `ANALYZE TABLE [db_name.]tablename PARTITION (partcol1[=val1], partcol2[=val2], ...) COMPUTE STATISTICS [NOSCAN]` SQL command, which calculates the total number of rows and the size in bytes for a subset of partitions. The calculated statistics are stored in the Hive Metastore as user-defined properties attached to partition objects. Property names are the same as the ones used to store table-level statistics: `spark.sql.statistics.totalSize` and `spark.sql.statistics.numRows`.

When the partition specification contains all partition columns with values, the command collects statistics for the single partition that matches the specification. When some partition columns are missing or listed without values, the command collects statistics for all partitions that match the partition column values that were specified.

For example, table t has 4 partitions with the following specs:

* Partition1: (ds='2008-04-08', hr=11)
* Partition2: (ds='2008-04-08', hr=12)
* Partition3: (ds='2008-04-09', hr=11)
* Partition4: (ds='2008-04-09', hr=12)

`ANALYZE TABLE t PARTITION (ds='2008-04-09', hr=11)` collects statistics only for partition 3.

`ANALYZE TABLE t PARTITION (ds='2008-04-09')` collects statistics for partitions 3 and 4.

`ANALYZE TABLE t PARTITION (ds, hr)` collects statistics for all four partitions.

When the optional parameter NOSCAN is specified, the command does not count the number of rows and only gathers the size in bytes.

The statistics gathered by the ANALYZE TABLE command can be fetched using the `DESC EXTENDED [db_name.]tablename PARTITION` command.

## How was this patch tested?

Added tests.

Author: Masha Basmanova <mbasmanova@fb.com>

Closes #18421 from mbasmanova/mbasmanova-analyze-partition.
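The partition-matching rule described above can be illustrated with a small, self-contained Python sketch. It is only a model of the semantics stated in the description, not the code in this patch: the `PARTITIONS` list, `matching_partitions`, and `partition_numbers` are hypothetical names, partition values are modeled as plain strings, and a column listed without a value is represented as `None`.

```python
from typing import Dict, List, Optional

# Hypothetical model of the four partitions of table t from the example.
PARTITIONS: List[Dict[str, str]] = [
    {"ds": "2008-04-08", "hr": "11"},  # Partition1
    {"ds": "2008-04-08", "hr": "12"},  # Partition2
    {"ds": "2008-04-09", "hr": "11"},  # Partition3
    {"ds": "2008-04-09", "hr": "12"},  # Partition4
]


def matching_partitions(
    spec: Dict[str, Optional[str]],
    partitions: List[Dict[str, str]],
) -> List[Dict[str, str]]:
    """Return the partitions selected by a possibly partial spec.

    A column mapped to None models a column listed without a value,
    as in PARTITION (ds, hr); it places no constraint on the match.
    Columns absent from the spec are likewise unconstrained.
    """
    constraints = {col: val for col, val in spec.items() if val is not None}
    return [
        part
        for part in partitions
        if all(part.get(col) == val for col, val in constraints.items())
    ]


def partition_numbers(spec: Dict[str, Optional[str]]) -> List[int]:
    """Return 1-based partition numbers, matching the numbering in the example."""
    selected = matching_partitions(spec, PARTITIONS)
    return [i + 1 for i, part in enumerate(PARTITIONS) if part in selected]


if __name__ == "__main__":
    # PARTITION (ds='2008-04-09', hr=11)
    print(partition_numbers({"ds": "2008-04-09", "hr": "11"}))  # [3]
    # PARTITION (ds='2008-04-09')
    print(partition_numbers({"ds": "2008-04-09"}))  # [3, 4]
    # PARTITION (ds, hr)
    print(partition_numbers({"ds": None, "hr": None}))  # [1, 2, 3, 4]
```

The three calls mirror the three ANALYZE TABLE examples in the description: a full spec selects one partition, a partial spec selects every partition that agrees on the given values, and a spec with no values selects all partitions.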
Showing
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala (5 additions, 2 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala (23 additions, 13 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala (149 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala (6 additions, 22 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala (26 additions, 1 deletion)
- sql/core/src/test/resources/sql-tests/inputs/describe-part-after-analyze.sql (34 additions, 0 deletions)
- sql/core/src/test/resources/sql-tests/results/describe-part-after-analyze.sql.out (244 additions, 0 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala (27 additions, 6 deletions)
- sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala (118 additions, 51 deletions)
- sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala (2 additions, 0 deletions)
- sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala (254 additions, 0 deletions)