-
- Downloads
[SPARK-18112][SQL] Support reading data from Hive 2.1 metastore
### What changes were proposed in this pull request? This PR is to support reading data from Hive 2.1 metastore. Need to update shim class because of the Hive API changes caused by the following three Hive JIRAs: - [HIVE-12730 MetadataUpdater: provide a mechanism to edit the basic statistics of a table (or a partition)](https://issues.apache.org/jira/browse/HIVE-12730) - [Hive-13341 Stats state is not captured correctly: differentiate load table and create table](https://issues.apache.org/jira/browse/HIVE-13341) - [HIVE-13622 WriteSet tracking optimizations](https://issues.apache.org/jira/browse/HIVE-13622) There are three new fields added to Hive APIs. - `boolean hasFollowingStatsTask`. We always set it to `false`. This is to keep the existing behavior unchanged (starting from 0.13), no matter which Hive metastore client version users choose. If we set it to `true`, the basic table statistics is not collected by Hive. For example, ```SQL CREATE TABLE tbl AS SELECT 1 AS a ``` When setting `hasFollowingStatsTask ` to `false`, the table properties is like ``` Properties: [numFiles=1, transient_lastDdlTime=1489513927, totalSize=2] ``` When setting `hasFollowingStatsTask ` to `true`, the table properties is like ``` Properties: [transient_lastDdlTime=1489513563] ``` - `AcidUtils.Operation operation`. Obviously, we do not support ACID. Thus, we set it to `AcidUtils.Operation.NOT_ACID`. - `EnvironmentContext environmentContext`. So far, this is always set to `null`. This was introduced for supporting DDL `alter table s update statistics set ('numRows'='NaN')`. Using this DDL, users can specify the statistics. So far, our Spark SQL does not need it, because we use different table properties to store our generated statistics values. However, when Spark SQL issues ALTER TABLE DDL statements, Hive metastore always automatically invalidate the Hive-generated statistics. In the follow-up PR, we can fix it by explicitly adding a property to `environmentContext`. ```JAVA putToProperties(StatsSetupConst.STATS_GENERATED, StatsSetupConst.USER) ``` Another alternative is to set `DO_NOT_UPDATE_STATS`to `TRUE`. See the Hive JIRA: https://issues.apache.org/jira/browse/HIVE-15653. We will not address it in this PR. ### How was this patch tested? Added test cases to VersionsSuite.scala Author: Xiao Li <gatorsmile@gmail.com> Closes #17232 from gatorsmile/Hive21.
Showing
- sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala 0 additions, 1 deletion...scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
- sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala 3 additions, 2 deletions...ala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
- sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala 165 additions, 16 deletions...ain/scala/org/apache/spark/sql/hive/client/HiveShim.scala
- sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala 1 addition, 0 deletions...g/apache/spark/sql/hive/client/IsolatedClientLoader.scala
- sql/hive/src/main/scala/org/apache/spark/sql/hive/client/package.scala 5 additions, 1 deletion...main/scala/org/apache/spark/sql/hive/client/package.scala
- sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala 1 addition, 1 deletion...apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
- sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala 15 additions, 4 deletions...cala/org/apache/spark/sql/hive/client/VersionsSuite.scala
Loading
Please register or sign in to comment