-
- Downloads
[SPARK-18572][SQL] Add a method `listPartitionNames` to `ExternalCatalog`
(Link to Jira issue: https://issues.apache.org/jira/browse/SPARK-18572) ## What changes were proposed in this pull request? Currently Spark answers the `SHOW PARTITIONS` command by fetching all of the table's partition metadata from the external catalog and constructing partition names therefrom. The Hive client has a `getPartitionNames` method which is many times faster for this purpose, with the performance improvement scaling with the number of partitions in a table. To test the performance impact of this PR, I ran the `SHOW PARTITIONS` command on two Hive tables with large numbers of partitions. One table has ~17,800 partitions, and the other has ~95,000 partitions. For the purposes of this PR, I'll call the former table `table1` and the latter table `table2`. I ran 5 trials for each table with before-and-after versions of this PR. The results are as follows: Spark at bdc8153e, `SHOW PARTITIONS table1`, times in seconds: 7.901 3.983 4.018 4.331 4.261 Spark at bdc8153e, `SHOW PARTITIONS table2` (Timed out after 10 minutes with a `SocketTimeoutException`.) Spark at this PR, `SHOW PARTITIONS table1`, times in seconds: 3.801 0.449 0.395 0.348 0.336 Spark at this PR, `SHOW PARTITIONS table2`, times in seconds: 5.184 1.63 1.474 1.519 1.41 Taking the best times from each trial, we get a 12x performance improvement for a table with ~17,800 partitions and at least a 426x improvement for a table with ~95,000 partitions. More significantly, the latter command doesn't even complete with the current code in master. This is actually a patch we've been using in-house at VideoAmp since Spark 1.1. It's made all the difference in the practical usability of our largest tables. Even with tables with about 1,000 partitions there's a performance improvement of about 2-3x. ## How was this patch tested? I added a unit test to `VersionsSuite` which tests that the Hive client's `getPartitionNames` method returns the correct number of partitions. Author: Michael Allman <michael@videoamp.com> Closes #15998 from mallman/spark-18572-list_partition_names.
Showing
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala 24 additions, 2 deletions...g/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala 14 additions, 0 deletions...g/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala 23 additions, 0 deletions...rg/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
- sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogSuite.scala 25 additions, 0 deletions...che/spark/sql/catalyst/catalog/ExternalCatalogSuite.scala
- sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala 39 additions, 0 deletions...ache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala 1 addition, 11 deletions...scala/org/apache/spark/sql/execution/command/tables.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala 11 additions, 11 deletions.../spark/sql/execution/datasources/DataSourceStrategy.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala 11 additions, 2 deletions...e/spark/sql/execution/datasources/PartitioningUtils.scala
- sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala 38 additions, 4 deletions...scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
- sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClient.scala 10 additions, 0 deletions...n/scala/org/apache/spark/sql/hive/client/HiveClient.scala
- sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala 20 additions, 0 deletions...ala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
- sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala 5 additions, 0 deletions...cala/org/apache/spark/sql/hive/client/VersionsSuite.scala
Loading
Please register or sign in to comment