  1. Apr 14, 2017
  2. Apr 13, 2017
    • [SPARK-19924][SQL][BACKPORT-2.1] Handle InvocationTargetException for all Hive Shim · 98ae5481
      Xiao Li authored
      ### What changes were proposed in this pull request?
      
      This is to backport the PR https://github.com/apache/spark/pull/17265 to Spark 2.1 branch.
      
      ---
      Since we use a shim for most Hive metastore APIs, the exceptions thrown by the underlying method of `Method.invoke()` are wrapped in `InvocationTargetException`. Instead of handling them one by one, we should handle all of them in `withClient`. If any of them is missed, the error message can look unfriendly. For example, below is the error produced when dropping a table.
      
      ```
      Expected exception org.apache.spark.sql.AnalysisException to be thrown, but java.lang.reflect.InvocationTargetException was thrown.
      ScalaTestFailureLocation: org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14 at (ExternalCatalogSuite.scala:193)
      org.scalatest.exceptions.TestFailedException: Expected exception org.apache.spark.sql.AnalysisException to be thrown, but java.lang.reflect.InvocationTargetException was thrown.
      	at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496)
      	at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
      	at org.scalatest.Assertions$class.intercept(Assertions.scala:1004)
      	at org.scalatest.FunSuite.intercept(FunSuite.scala:1555)
      	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14.apply$mcV$sp(ExternalCatalogSuite.scala:193)
      	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14.apply(ExternalCatalogSuite.scala:183)
      	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14.apply(ExternalCatalogSuite.scala:183)
      	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
      	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
      	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
      	at org.scalatest.Transformer.apply(Transformer.scala:22)
      	at org.scalatest.Transformer.apply(Transformer.scala:20)
      	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
      	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
      	at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
      	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
      	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
      	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
      	at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
      	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(ExternalCatalogSuite.scala:40)
      	at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
      	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite.runTest(ExternalCatalogSuite.scala:40)
      	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
      	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
      	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
      	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
      	at scala.collection.immutable.List.foreach(List.scala:381)
      	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
      	at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
      	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
      	at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
      	at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
      	at org.scalatest.Suite$class.run(Suite.scala:1424)
      	at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
      	at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
      	at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
      	at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
      	at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
      	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
      	at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
      	at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
      	at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:31)
      	at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:55)
      	at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2563)
      	at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2557)
      	at scala.collection.immutable.List.foreach(List.scala:381)
      	at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:2557)
      	at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1044)
      	at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1043)
      	at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:2722)
      	at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1043)
      	at org.scalatest.tools.Runner$.run(Runner.scala:883)
      	at org.scalatest.tools.Runner.run(Runner.scala)
      	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:138)
      	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:28)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
      Caused by: java.lang.reflect.InvocationTargetException
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at org.apache.spark.sql.hive.client.Shim_v0_14.dropTable(HiveShim.scala:736)
      	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropTable$1.apply$mcV$sp(HiveClientImpl.scala:451)
      	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropTable$1.apply(HiveClientImpl.scala:451)
      	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropTable$1.apply(HiveClientImpl.scala:451)
      	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:287)
      	at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:228)
      	at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:227)
      	at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:270)
      	at org.apache.spark.sql.hive.client.HiveClientImpl.dropTable(HiveClientImpl.scala:450)
      	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$dropTable$1.apply$mcV$sp(HiveExternalCatalog.scala:456)
      	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$dropTable$1.apply(HiveExternalCatalog.scala:454)
      	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$dropTable$1.apply(HiveExternalCatalog.scala:454)
      	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:94)
      	at org.apache.spark.sql.hive.HiveExternalCatalog.dropTable(HiveExternalCatalog.scala:454)
      	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14$$anonfun$apply$mcV$sp$8.apply$mcV$sp(ExternalCatalogSuite.scala:194)
      	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14$$anonfun$apply$mcV$sp$8.apply(ExternalCatalogSuite.scala:194)
      	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14$$anonfun$apply$mcV$sp$8.apply(ExternalCatalogSuite.scala:194)
      	at org.scalatest.Assertions$class.intercept(Assertions.scala:997)
      	... 57 more
      Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: NoSuchObjectException(message:db2.unknown_table table not found)
      	at org.apache.hadoop.hive.ql.metadata.Hive.dropTable(Hive.java:1038)
      	... 79 more
      Caused by: NoSuchObjectException(message:db2.unknown_table table not found)
      	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table_core(HiveMetaStore.java:1808)
      	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1778)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
      	at com.sun.proxy.$Proxy10.get_table(Unknown Source)
      	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1208)
      	at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.getTable(SessionHiveMetaStoreClient.java:131)
      	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.dropTable(HiveMetaStoreClient.java:952)
      	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.dropTable(HiveMetaStoreClient.java:904)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
      	at com.sun.proxy.$Proxy11.dropTable(Unknown Source)
      	at org.apache.hadoop.hive.ql.metadata.Hive.dropTable(Hive.java:1035)
      	... 79 more
      ```
      
      After unwrapping the exception, the message is like
      ```
      org.apache.hadoop.hive.ql.metadata.HiveException: NoSuchObjectException(message:db2.unknown_table table not found);
      org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: NoSuchObjectException(message:db2.unknown_table table not found);
      	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:100)
      	at org.apache.spark.sql.hive.HiveExternalCatalog.dropTable(HiveExternalCatalog.scala:460)
      	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14.apply$mcV$sp(ExternalCatalogSuite.scala:193)
      	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14.apply(ExternalCatalogSuite.scala:183)
      	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogSuite$$anonfun$14.apply(ExternalCatalogSuite.scala:183)
      	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
      ...
      ```
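      The core of the change is small: catch `InvocationTargetException` centrally in `withClient` and rethrow the wrapped cause as an `AnalysisException`. A minimal sketch of the idea (not the exact Spark code; note that `AnalysisException`'s constructor is only visible inside Spark's own `sql` packages):
      ```scala
      import java.lang.reflect.InvocationTargetException
      import org.apache.spark.sql.AnalysisException

      // Hypothetical helper mirroring HiveExternalCatalog.withClient: every metastore call
      // goes through here, so unwrapping happens in exactly one place instead of per shim method.
      def withClient[T](body: => T): T = {
        try {
          body
        } catch {
          case e: InvocationTargetException if e.getCause != null =>
            // Surface the real Hive error instead of the reflective wrapper.
            throw new AnalysisException(
              e.getCause.getClass.getCanonicalName + ": " + e.getCause.getMessage)
        }
      }
      ```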
      ### How was this patch tested?
      N/A
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17627 from gatorsmile/backport-17265.
      98ae5481
  3. Apr 09, 2017
    • [SPARK-20260][MLLIB] String interpolation required for error message · 43a7fcad
      Vijay Ramesh authored
      ## What changes were proposed in this pull request?
      This error message doesn't get properly formatted because of a missing `s`.  Currently the error looks like:
      
      ```
      Caused by: java.lang.IllegalArgumentException: requirement failed: indices should be one-based and in ascending order; found current=$current, previous=$previous; line="$line"
      ```
      (note the literal `$current` instead of the interpolated value)
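      The fix is simply prefixing the message with the `s` interpolator so the variables are substituted; a minimal illustration (values are made up):
      ```scala
      val current = 3
      val previous = 5
      val line = "1 5:1.0 3:2.0"   // illustrative values, not from the real data set

      // Missing interpolator: the variable names are printed literally.
      val broken = "found current=$current, previous=$previous; line=\"$line\""

      // With the `s` prefix the values are interpolated.
      val fixed = s"found current=$current, previous=$previous; line=\"$line\""

      println(broken) // found current=$current, previous=$previous; line="$line"
      println(fixed)  // found current=3, previous=5; line="1 5:1.0 3:2.0"
      ```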
      
      Author: Vijay Ramesh <vramesh@demandbase.com>
      
      Closes #17572 from vijaykramesh/master.
      
      (cherry picked from commit 261eaf51)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      43a7fcad
  4. Mar 28, 2017
  5. Mar 23, 2017
    • [SPARK-19970][SQL][BRANCH-2.1] Table owner should be USER instead of PRINCIPAL... · af960e86
      Dongjoon Hyun authored
      [SPARK-19970][SQL][BRANCH-2.1] Table owner should be USER instead of PRINCIPAL in kerberized clusters
      
      ## What changes were proposed in this pull request?
      
      In a kerberized Hadoop cluster, when Spark creates tables, the table owner is filled with the PRINCIPAL string instead of the USER name. This is inconsistent with Hive and causes problems when using [ROLE](https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorization) in Hive. We had better fix this.
      
      **BEFORE**
      ```scala
      scala> sql("create table t(a int)").show
      scala> sql("desc formatted t").show(false)
      ...
      |Owner:                      |spark@EXAMPLE.COM                                        |       |
      ```
      
      **AFTER**
      ```scala
      scala> sql("create table t(a int)").show
      scala> sql("desc formatted t").show(false)
      ...
      |Owner:                      |spark                                         |       |
      ```
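      The difference comes down to which name is read from Hadoop's `UserGroupInformation`; a hedged sketch (the exact call site in Spark's Hive client may differ):
      ```scala
      import org.apache.hadoop.security.UserGroupInformation

      val ugi = UserGroupInformation.getCurrentUser

      // In a kerberized cluster the full user name is the principal, e.g. "spark@EXAMPLE.COM".
      val principal = ugi.getUserName

      // The short user name strips the realm, e.g. "spark", which is what Hive expects as the table owner.
      val owner = ugi.getShortUserName
      ```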
      
      ## How was this patch tested?
      
      Manually do `create table` and `desc formatted` because this happens in Kerberized clusters.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #17363 from dongjoon-hyun/SPARK-19970-2.
      af960e86
  6. Mar 21, 2017
  7. Mar 20, 2017
    • [SPARK-19912][SQL] String literals should be escaped for Hive metastore partition pruning · c4c7b185
      Dongjoon Hyun authored
      
      ## What changes were proposed in this pull request?
      
      Since the current `HiveShim`'s `convertFilters` does not escape string literals, the following correctness issues exist. This PR aims to return the correct result and to show a clearer exception message.
      
      **BEFORE**
      
      ```scala
      scala> Seq((1, "p1", "q1"), (2, "p1\" and q=\"q1", "q2")).toDF("a", "p", "q").write.partitionBy("p", "q").saveAsTable("t1")
      
      scala> spark.table("t1").filter($"p" === "p1\" and q=\"q1").select($"a").show
      +---+
      |  a|
      +---+
      +---+
      
      scala> spark.table("t1").filter($"p" === "'\"").select($"a").show
      java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from ...
      ```
      
      **AFTER**
      
      ```scala
      scala> spark.table("t1").filter($"p" === "p1\" and q=\"q1").select($"a").show
      +---+
      |  a|
      +---+
      |  2|
      +---+
      
      scala> spark.table("t1").filter($"p" === "'\"").select($"a").show
      java.lang.UnsupportedOperationException: Partition filter cannot have both `"` and `'` characters
      ```
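      Conceptually, each string literal pushed into the metastore filter is now quoted with whichever quote character it does not contain, and values containing both quote characters are rejected; a hedged sketch of such a helper (not necessarily the exact `HiveShim` code):
      ```scala
      // Quote a partition value so it can be embedded safely in a metastore filter string.
      def quoteStringLiteral(str: String): String = {
        if (!str.contains("\"")) {
          s""""$str""""                     // wrap in double quotes
        } else if (!str.contains("'")) {
          s"'$str'"                         // wrap in single quotes
        } else {
          throw new UnsupportedOperationException(
            """Partition filter cannot have both `"` and `'` characters""")
        }
      }

      quoteStringLiteral("p1\" and q=\"q1")   // returns 'p1" and q="q1' (single-quoted, safe to push down)
      ```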
      
      ## How was this patch tested?
      
      Pass the Jenkins test with new test cases.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #17266 from dongjoon-hyun/SPARK-19912.
      
      (cherry picked from commit 21e366ae)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      c4c7b185
  8. Mar 16, 2017
    • [SPARK-19765][SPARK-18549][SPARK-19093][SPARK-19736][BACKPORT-2.1][SQL]... · 4b977ff0
      Xiao Li authored
      [SPARK-19765][SPARK-18549][SPARK-19093][SPARK-19736][BACKPORT-2.1][SQL] Backport Three Cache-related PRs to Spark 2.1
      
      ### What changes were proposed in this pull request?
      
      Backport a few cache related PRs:
      
      ---
      [[SPARK-19093][SQL] Cached tables are not used in SubqueryExpression](https://github.com/apache/spark/pull/16493)
      
      Consider the plans inside subquery expressions when looking up the cache manager to make
      use of cached data. Currently, CacheManager.useCachedData does not consider the
      subquery expressions in the plan.
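      A sketch of the idea: when replacing plan fragments with their cached versions, also recurse into subquery expressions (simplified; `useCachedData` stands in for the existing CacheManager lookup):
      ```scala
      import org.apache.spark.sql.catalyst.expressions.SubqueryExpression
      import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

      def useCachedDataIncludingSubqueries(
          plan: LogicalPlan,
          useCachedData: LogicalPlan => LogicalPlan): LogicalPlan = {
        val newPlan = useCachedData(plan)
        // Also consider the plans nested inside subquery expressions.
        newPlan transformAllExpressions {
          case subquery: SubqueryExpression =>
            subquery.withNewPlan(useCachedData(subquery.plan))
        }
      }
      ```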
      
      ---
      [[SPARK-19736][SQL] refreshByPath should clear all cached plans with the specified path](https://github.com/apache/spark/pull/17064)
      
      Catalog.refreshByPath can refresh the cache entry and the associated metadata for all DataFrames (if any) that contain the given data source path.
      
      However, CacheManager.invalidateCachedPath doesn't clear all cached plans with the specified path. It causes some strange behaviors reported in SPARK-15678.
      
      ---
      [[SPARK-19765][SPARK-18549][SQL] UNCACHE TABLE should un-cache all cached plans that refer to this table](https://github.com/apache/spark/pull/17097)
      
      When un-caching a table, we should not only remove the cache entry for this table but also un-cache any other cached plans that refer to it. The following commands trigger the table uncache: `DropTableCommand`, `TruncateTableCommand`, `AlterTableRenameCommand`, `UncacheTableCommand`, `RefreshTable` and `InsertIntoHiveTable`
      
      This PR also includes some refactors:
      - use `java.util.LinkedList` to store the cache entries, so that it is safer to remove elements while iterating
      - rename `invalidateCache` to `recacheByPlan`, which makes it more obvious what it does
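      For example, the un-cache step can be sketched as scanning all cache entries and dropping any whose plan contains a fragment producing the same result as the table being un-cached (simplified; the real `CacheManager` differs in details such as locking):
      ```scala
      import java.util.LinkedList
      import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
      import org.apache.spark.sql.execution.columnar.InMemoryRelation

      // Hypothetical cache entry mirroring CacheManager's: a logical plan plus its cached form.
      case class CachedData(plan: LogicalPlan, cachedRepresentation: InMemoryRelation)

      def uncacheDependents(cache: LinkedList[CachedData], tablePlan: LogicalPlan): Unit = {
        val it = cache.iterator()
        while (it.hasNext) {
          val entry = it.next()
          // Drop any cached plan that refers to the table being un-cached.
          if (entry.plan.find(_.sameResult(tablePlan)).isDefined) {
            entry.cachedRepresentation.cachedColumnBuffers.unpersist()
            it.remove()   // removing via the iterator is safe on java.util.LinkedList
          }
        }
      }
      ```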
      
      ### How was this patch tested?
      N/A
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17319 from gatorsmile/backport-17097.
      4b977ff0
  9. Mar 14, 2017
  10. Mar 10, 2017
    • [SPARK-19611][SQL] Introduce configurable table schema inference · e481a738
      Budde authored
      Add a new configuration option that allows Spark SQL to infer a case-sensitive schema from a Hive Metastore table's data files when a case-sensitive schema can't be read from the table properties.
      
      - Add spark.sql.hive.caseSensitiveInferenceMode param to SQLConf
      - Add schemaPreservesCase field to CatalogTable (set to false when schema can't
        successfully be read from Hive table props)
      - Perform schema inference in HiveMetastoreCatalog if schemaPreservesCase is
        false, depending on spark.sql.hive.caseSensitiveInferenceMode
      - Add alterTableSchema() method to the ExternalCatalog interface
      - Add HiveSchemaInferenceSuite tests
      - Refactor and move ParquetFileFormat.mergeMetastoreParquetSchema() to
        HiveMetastoreCatalog.mergeWithMetastoreSchema
      - Move schema merging tests from ParquetSchemaSuite to HiveSchemaInferenceSuite
      
      [JIRA for this change](https://issues.apache.org/jira/browse/SPARK-19611)
      
      The tests in ```HiveSchemaInferenceSuite``` should verify that schema inference is working as expected. ```ExternalCatalogSuite``` has also been extended to cover the new ```alterTableSchema()``` API.
      
      Author: Budde <budde@amazon.com>
      
      Closes #17229 from budde/SPARK-19611-2.1.
      e481a738
  11. Feb 24, 2017
    • [SPARK-19038][YARN] Avoid overwriting keytab configuration in yarn-client · ed9aaa31
      jerryshao authored
      
      ## What changes were proposed in this pull request?
      
      Because yarn#client resets the `spark.yarn.keytab` configuration to point to the location in the distributed cache, if the user still uses the old `SparkConf` to create a `SparkSession` with Hive enabled, it will read the keytab from the distributed-cache path. This is fine for yarn cluster mode, but in yarn client mode, where the driver runs outside the container, fetching the keytab will fail.
      
      So here we should avoid resetting this configuration in `yarn#client` and only overwrite it for the AM, so that `spark.yarn.keytab` resolves to the correct keytab path whether running in client mode (keytab in the local FS) or cluster mode (keytab in the distributed cache).
      
      ## How was this patch tested?
      
      Verified in security cluster.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #16923 from jerryshao/SPARK-19038.
      
      (cherry picked from commit a920a436)
      Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      ed9aaa31
  12. Feb 23, 2017
  13. Jan 19, 2017
    • [SPARK-18899][SPARK-18912][SPARK-18913][SQL] refactor the error checking when... · 7bc3e9ba
      Wenchen Fan authored
      [SPARK-18899][SPARK-18912][SPARK-18913][SQL] refactor the error checking when append data to an existing table
      
      ## What changes were proposed in this pull request?
      
      When we append data to an existing table with `DataFrameWriter.saveAsTable`, we will do various checks to make sure the appended data is consistent with the existing data.
      
      However, we get the information about the existing table by matching the table relation instead of looking at the table metadata. This is error-prone; e.g. we only check the number of columns for `HadoopFsRelation` and forget to check bucketing, etc.
      
      This PR refactors the error checking by looking at the metadata of the existing table, and fix several bugs:
      * SPARK-18899: We forget to check whether the specified bucketing matches the existing table, which may lead to a problematic table that has different bucketing in different data files (see the sketch after this list).
      * SPARK-18912: We forget to check the number of columns for non-file-based data source tables.
      * SPARK-18913: We don't support appending data to a table with special column names.
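      As a concrete illustration of the bucketing check (table and column names are made up; assumes a Spark shell with `spark.implicits._` in scope):
      ```scala
      // Create a bucketed table.
      Seq((1, "a")).toDF("i", "j").write
        .bucketBy(4, "i")
        .saveAsTable("bucketed_tbl")

      // Appending with a different bucketing spec is now detected from the table metadata
      // and rejected, instead of silently writing files with inconsistent bucketing.
      Seq((2, "b")).toDF("i", "j").write
        .mode("append")
        .bucketBy(8, "i")
        .saveAsTable("bucketed_tbl")
      ```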
      
      ## How was this patch tested?
      new regression test.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16313 from cloud-fan/bug1.
      
      (cherry picked from commit f923c849)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      7bc3e9ba
  14. Jan 17, 2017
    • [SPARK-19129][SQL] SessionCatalog: Disallow empty part col values in partition spec · 3ec3e3f2
      gatorsmile authored
      
      Empty partition column values are not valid in a partition specification. Before this PR, we accepted them; the Hive metastore does not detect and disallow them either. Thus, users hit the following strange behavior.
      
      ```Scala
      val df = spark.createDataFrame(Seq((0, "a"), (1, "b"))).toDF("partCol1", "name")
      df.write.mode("overwrite").partitionBy("partCol1").saveAsTable("partitionedTable")
      spark.sql("alter table partitionedTable drop partition(partCol1='')")
      spark.table("partitionedTable").show()
      ```
      
      In the above example, the WHOLE table is DROPPED when users specify a partition spec containing only one partition column with an empty value.
      
      When there is more than one partition column, the Hive metastore APIs simply ignore the columns with empty values and treat the spec as a partial spec. This is also not expected and does not follow the actual Hive behavior. This PR disallows users from specifying such an invalid partition spec in the `SessionCatalog` APIs.
      
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16583 from gatorsmile/disallowEmptyPartColValue.
      
      (cherry picked from commit a23debd7)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      3ec3e3f2
  15. Jan 15, 2017
    • [SPARK-19092][SQL][BACKPORT-2.1] Save() API of DataFrameWriter should not scan... · bf2f233e
      gatorsmile authored
      [SPARK-19092][SQL][BACKPORT-2.1] Save() API of DataFrameWriter should not scan all the saved files #16481
      
      ### What changes were proposed in this pull request?
      
      #### This PR is to backport https://github.com/apache/spark/pull/16481 to Spark 2.1
      ---
      `DataFrameWriter`'s [save() API](https://github.com/gatorsmile/spark/blob/5d38f09f47a767a342a0a8219c63efa2943b5d1f/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L207) performs an unnecessary full filesystem scan of the saved files. The save() API is the most basic/core API in `DataFrameWriter`; we should avoid such a scan there.
      
      ### How was this patch tested?
      Added and modified the test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16588 from gatorsmile/backport-19092.
      bf2f233e
    • [SPARK-19120] Refresh Metadata Cache After Loading Hive Tables · db37049d
      gatorsmile authored
      
      ```Scala
              sql("CREATE TABLE tab (a STRING) STORED AS PARQUET")
      
              // This table fetch is to fill the cache with zero leaf files
              spark.table("tab").show()
      
              sql(
                s"""
                   |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
                   |INTO TABLE tab
                 """.stripMargin)
      
              spark.table("tab").show()
      ```
      
      In the above example, the returned result is empty after table loading. The metadata cache could be out of date after loading new data into the table, because loading/inserting does not update the cache. So far, the metadata cache is only used for data source tables. Thus, among Hive serde tables, only the `parquet` and `orc` formats face this issue, because Hive serde tables in parquet/orc format can be converted to data source tables when `spark.sql.hive.convertMetastoreParquet`/`spark.sql.hive.convertMetastoreOrc` is on.
      
      This PR is to refresh the metadata cache after processing the `LOAD DATA` command.
      
      In addition, Spark SQL does not convert **partitioned** Hive tables (orc/parquet) to data source tables in the write path, but the read path uses the metadata cache for both **partitioned** and non-partitioned Hive tables (orc/parquet). That means writing partitioned parquet/orc tables still uses `InsertIntoHiveTable` instead of `InsertIntoHadoopFsRelationCommand`. To avoid reading the out-of-date cache, `InsertIntoHiveTable` needs to refresh the metadata cache for partitioned tables. Note that non-partitioned parquet/orc tables do not need this, because writing them does not go through `InsertIntoHiveTable` at all. Based on the review comments, this PR keeps the existing logic unchanged; that is, we always refresh the table no matter whether it is partitioned or not.
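      In effect, the command now does the equivalent of a manual refresh after the write; before this PR, the workaround for the snippet above would have been (sketch):
      ```scala
      // Invalidate the cached file listing for the table so the next read sees the new files.
      spark.catalog.refreshTable("tab")
      spark.table("tab").show()
      ```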
      
      Added test cases in parquetSuites.scala
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16500 from gatorsmile/refreshInsertIntoHiveTable.
      
      (cherry picked from commit de62ddf7)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      db37049d
  16. Jan 03, 2017
    • [SPARK-19048][SQL] Delete Partition Location when Dropping Managed Partitioned... · 77625506
      gatorsmile authored
      [SPARK-19048][SQL] Delete Partition Location when Dropping Managed Partitioned Tables in InMemoryCatalog
      
      ### What changes were proposed in this pull request?
      The data in a managed table should be deleted after the table is dropped. However, if a partition location is not under the location of the partitioned table, it is not deleted as expected. Users can specify any location for a partition when adding it.
      
      This PR is to delete partition location when dropping managed partitioned tables stored in `InMemoryCatalog`.
      
      ### How was this patch tested?
      Added test cases for both HiveExternalCatalog and InMemoryCatalog
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16448 from gatorsmile/unsetSerdeProp.
      
      (cherry picked from commit b67b35f7)
      Signed-off-by: gatorsmile <gatorsmile@gmail.com>
      77625506
  17. Jan 02, 2017
  18. Dec 22, 2016
  19. Dec 21, 2016
    • [SPARK-18949][SQL][BACKPORT-2.1] Add recoverPartitions API to Catalog · 0e51bb08
      gatorsmile authored
      ### What changes were proposed in this pull request?
      
      This PR is to backport https://github.com/apache/spark/pull/16356 to Spark 2.1.1 branch.
      
      ----
      
      Currently, we only have a SQL interface for recovering all the partitions in the directory of a table and updating the catalog: `MSCK REPAIR TABLE` or `ALTER TABLE table RECOVER PARTITIONS`. (Actually, it is very hard for me to remember `MSCK`, and I have no clue what it means.)
      
      After the new "Scalable Partition Handling", table repair becomes much more important for making the data in a newly created data source partitioned table visible.
      
      Thus, this PR is to add it into the Catalog interface. After this PR, users can repair the table by
      ```Scala
      spark.catalog.recoverPartitions("testTable")
      ```
      
      ### How was this patch tested?
      Modified the existing test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16372 from gatorsmile/repairTable2.1.1.
      0e51bb08
  20. Dec 19, 2016
    • [SPARK-18921][SQL] check database existence with Hive.databaseExists instead of getDatabase · c1a26b45
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
      It's weird that we use `Hive.getDatabase` to check the existence of a database, while Hive has a `databaseExists` interface.
      
      What's worse, `Hive.getDatabase` will produce an error message if the database doesn't exist, which is annoying when we only want to check the database existence.
      
      This PR fixes this and uses `Hive.databaseExists` to check database existence.
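      A sketch of the difference against the Hive client API (simplified; not the exact Spark shim code):
      ```scala
      import org.apache.hadoop.hive.ql.metadata.Hive

      def databaseExists(hive: Hive, dbName: String): Boolean = {
        // Old approach: getDatabase logs an error when the database is missing.
        // Option(hive.getDatabase(dbName)).isDefined

        // New approach: ask the question directly, with no error output on a miss.
        hive.databaseExists(dbName)
      }
      ```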
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16332 from cloud-fan/minor.
      
      (cherry picked from commit 7a75ee1c)
      Signed-off-by: Yin Huai <yhuai@databricks.com>
      c1a26b45
    • [SPARK-18700][SQL] Add StripedLock for each table's relation in cache · fc1b2566
      xuanyuanking authored
      ## What changes were proposed in this pull request?
      
      As described in the scenario in [SPARK-18700](https://issues.apache.org/jira/browse/SPARK-18700), when `cachedDataSourceTables` is invalidated, the next few queries will fetch all `FileStatus` objects in the `listLeafFiles` function. When a table has many partitions, these jobs can occupy a lot of driver memory and may eventually cause a driver OOM.
      
      In this patch, add a StripedLock for each table's relation in the cache, rather than for the whole `cachedDataSourceTables`, so that each table's cache-loading operation is protected by its own lock.
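      A hedged sketch of the per-table locking with Guava's `Striped` locks (names are illustrative):
      ```scala
      import java.util.concurrent.locks.Lock
      import com.google.common.util.concurrent.Striped

      // One lock stripe per table name: concurrent loads of *different* tables do not block
      // each other, while concurrent loads of the *same* table are serialized.
      val tableRelationLocks: Striped[Lock] = Striped.lazyWeakLock(100)

      def withTableLoadingLock[A](tableName: String)(body: => A): A = {
        val lock = tableRelationLocks.get(tableName)
        lock.lock()
        try body finally lock.unlock()
      }
      ```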
      
      ## How was this patch tested?
      
      Added a multi-threaded table-access test in `PartitionedTablePerfStatsSuite` and checked that the relation is loaded only once, using the metrics in `HiveCatalogMetrics`.
      
      Author: xuanyuanking <xyliyuanjian@gmail.com>
      
      Closes #16135 from xuanyuanking/SPARK-18700.
      
      (cherry picked from commit 24482858)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      fc1b2566
  21. Dec 18, 2016
    • [SPARK-18703][SPARK-18675][SQL][BACKPORT-2.1] CTAS for hive serde table should... · 3080f995
      gatorsmile authored
      [SPARK-18703][SPARK-18675][SQL][BACKPORT-2.1] CTAS for hive serde table should work for all hive versions AND Drop Staging Directories and Data Files
      
      ### What changes were proposed in this pull request?
      
      This PR is to backport https://github.com/apache/spark/pull/16104 and https://github.com/apache/spark/pull/16134.
      
      ----------
      [[SPARK-18675][SQL] CTAS for hive serde table should work for all hive versions](https://github.com/apache/spark/pull/16104)
      
      Before hive 1.1, when inserting into a table, hive will create the staging directory under a common scratch directory. After the writing is finished, hive will simply empty the table directory and move the staging directory to it.
      
      After hive 1.1, hive will create the staging directory under the table directory, and when moving staging directory to table directory, hive will still empty the table directory, but will exclude the staging directory there.
      
      In `InsertIntoHiveTable`, we simply copy the code from hive 1.2, which means we will always create the staging directory under the table directory, no matter what the hive version is. This causes problems if the hive version is prior to 1.1, because the staging directory will be removed by hive when hive is trying to empty the table directory.
      
      This PR copies the code from Hive 0.13, so that we have two branches for creating the staging directory. If the Hive version is prior to 1.1, we go to the old-style branch (i.e. create the staging directory under a common scratch directory); otherwise, we go to the new-style branch (i.e. create the staging directory under the table directory).
      
      ----------
      [[SPARK-18703] [SQL] Drop Staging Directories and Data Files After each Insertion/CTAS of Hive serde Tables](https://github.com/apache/spark/pull/16134)
      
      Below are the files/directories generated for three inserts against a Hive table:
      ```
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000/._SUCCESS.crc
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000/.part-00000.crc
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000/_SUCCESS
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000/part-00000
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000/._SUCCESS.crc
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000/.part-00000.crc
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000/_SUCCESS
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000/part-00000
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000/._SUCCESS.crc
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000/.part-00000.crc
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000/_SUCCESS
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000/part-00000
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-00000.crc
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-00000
      ```
      
      The first 18 files are temporary. We do not drop them until JVM termination. If the JVM does not terminate appropriately, these temporary files/directories will not be dropped.
      
      Only the last two files are needed, as shown below.
      ```
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-00000.crc
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-00000
      ```
      The temporary files/directories can accumulate quickly when we issue many inserts, since each insert generates at least six files. This can consume a lot of space and slow down JVM termination. When the JVM does not terminate appropriately, the files might not be dropped.
      
      This PR is to drop the created staging files and temporary data files after each insert/CTAS.
      
      ### How was this patch tested?
      Added test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16325 from gatorsmile/backport-18703&18675.
      3080f995
  22. Dec 15, 2016
  23. Dec 12, 2016
    • [SPARK-18681][SQL] Fix filtering to compatible with partition keys of type int · 523071f3
      Yuming Wang authored
      
      ## What changes were proposed in this pull request?
      
      Cloudera puts `/var/run/cloudera-scm-agent/process/15000-hive-HIVEMETASTORE/hive-site.xml` as the configuration file for the Hive Metastore Server, where `hive.metastore.try.direct.sql=false`. But Spark isn't reading this configuration file and gets the default value `hive.metastore.try.direct.sql=true`. As mallman said, we should use the `getMetaConf` method to obtain the original configuration from the Hive Metastore Server. I have tested this method a few times and the return value is always consistent with the Hive Metastore Server.
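      A sketch of the idea using the Hive metastore client API directly (simplified):
      ```scala
      import org.apache.hadoop.hive.metastore.IMetaStoreClient

      // Ask the Hive Metastore Server for *its* effective setting instead of relying on the
      // client-side default, so the pushed-down partition filter matches what the server supports.
      def directSqlEnabled(client: IMetaStoreClient): Boolean =
        client.getMetaConf("hive.metastore.try.direct.sql").toBoolean
      ```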
      
      ## How was this patch tested?
      
      The existing tests.
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #16122 from wangyum/SPARK-18681.
      
      (cherry picked from commit 90abfd15)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      523071f3
  24. Dec 09, 2016
    • [SPARK-18637][SQL] Stateful UDF should be considered as nondeterministic · 72bf5199
      Zhan Zhang authored
      
      Make stateful UDFs nondeterministic.
      
      Add new test cases with both stateful and stateless UDFs.
      Without the patch, the test cases throw the following exception:
      
      ```
      1 did not equal 10
      ScalaTestFailureLocation: org.apache.spark.sql.hive.execution.HiveUDFSuite$$anonfun$21 at (HiveUDFSuite.scala:501)
      org.scalatest.exceptions.TestFailedException: 1 did not equal 10
              at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
              at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
              ...
      ```
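      Determinism can be derived from Hive's `UDFType` annotation, roughly like this (a sketch; the actual wiring into Spark's Hive UDF wrappers differs):
      ```scala
      import org.apache.hadoop.hive.ql.udf.UDFType

      // A UDF is only safe to treat as deterministic if it is annotated deterministic AND not
      // stateful; a stateful UDF (e.g. a row-sequence counter) must be evaluated once per row
      // and must not be constant-folded or collapsed across projections.
      def isUDFDeterministic(udfClass: Class[_]): Boolean = {
        val udfType = udfClass.getAnnotation(classOf[UDFType])
        udfType != null && udfType.deterministic() && !udfType.stateful()
      }
      ```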
      
      Author: Zhan Zhang <zhanzhang@fb.com>
      
      Closes #16068 from zhzhan/state.
      
      (cherry picked from commit 67587d96)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      72bf5199
  25. Dec 08, 2016
  26. Dec 05, 2016
    • [SPARK-18572][SQL] Add a method `listPartitionNames` to `ExternalCatalog` · 8ca6a82c
      Michael Allman authored
      (Link to Jira issue: https://issues.apache.org/jira/browse/SPARK-18572)
      
      ## What changes were proposed in this pull request?
      
      Currently Spark answers the `SHOW PARTITIONS` command by fetching all of the table's partition metadata from the external catalog and constructing partition names therefrom. The Hive client has a `getPartitionNames` method which is many times faster for this purpose, with the performance improvement scaling with the number of partitions in a table.
      
      To test the performance impact of this PR, I ran the `SHOW PARTITIONS` command on two Hive tables with large numbers of partitions. One table has ~17,800 partitions, and the other has ~95,000 partitions. For the purposes of this PR, I'll call the former table `table1` and the latter table `table2`. I ran 5 trials for each table with before-and-after versions of this PR. The results are as follows:
      
      Spark at bdc8153e, `SHOW PARTITIONS table1`, times in seconds:
      7.901
      3.983
      4.018
      4.331
      4.261
      
      Spark at bdc8153e, `SHOW PARTITIONS table2`
      (Timed out after 10 minutes with a `SocketTimeoutException`.)
      
      Spark at this PR, `SHOW PARTITIONS table1`, times in seconds:
      3.801
      0.449
      0.395
      0.348
      0.336
      
      Spark at this PR, `SHOW PARTITIONS table2`, times in seconds:
      5.184
      1.63
      1.474
      1.519
      1.41
      
      Taking the best times from each trial, we get a 12x performance improvement for a table with ~17,800 partitions and at least a 426x improvement for a table with ~95,000 partitions. More significantly, the latter command doesn't even complete with the current code in master.
      
      This is actually a patch we've been using in-house at VideoAmp since Spark 1.1. It's made all the difference in the practical usability of our largest tables. Even with tables with about 1,000 partitions there's a performance improvement of about 2-3x.
      
      ## How was this patch tested?
      
      I added a unit test to `VersionsSuite` which tests that the Hive client's `getPartitionNames` method returns the correct number of partitions.
      
      Author: Michael Allman <michael@videoamp.com>
      
      Closes #15998 from mallman/spark-18572-list_partition_names.
      
      (cherry picked from commit 772ddbea)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      8ca6a82c
  27. Dec 04, 2016
  28. Dec 02, 2016
    • [SPARK-18659][SQL] Incorrect behaviors in overwrite table for datasource tables · e374b242
      Eric Liang authored
      
      ## What changes were proposed in this pull request?
      
      Two bugs are addressed here:
      1. INSERT OVERWRITE TABLE sometimes crashed when catalog partition management was enabled. This was because, when dropping partitions after an overwrite operation, the Hive client would attempt to delete the partition files; if the entire partition directory had already been dropped, this failed. The PR fixes this by adding a flag to control whether the Hive client should attempt to delete files.
      2. The static partition spec for INSERT OVERWRITE TABLE was not correctly resolved to the case-sensitive original partition names. This could result in the entire table being overwritten if you did not correctly capitalize your partition names (a sketch of this scenario follows below).
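      For example (hypothetical table, sketching bug 2):
      ```scala
      sql("CREATE TABLE part_tbl (id INT, p STRING) USING parquet PARTITIONED BY (p)")
      sql("INSERT INTO TABLE part_tbl PARTITION (p='a') SELECT 1")
      sql("INSERT INTO TABLE part_tbl PARTITION (p='b') SELECT 2")

      // The static spec below uses `P` instead of `p`. Before this fix the spec was not resolved
      // to the case-sensitive partition column, and the whole table could be overwritten; after
      // the fix only the partition p='a' is replaced.
      sql("INSERT OVERWRITE TABLE part_tbl PARTITION (P='a') SELECT 3")
      ```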
      
      cc yhuai cloud-fan
      
      ## How was this patch tested?
      
      Unit tests. Surprisingly, the existing overwrite table tests did not catch these edge cases.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #16088 from ericl/spark-18659.
      
      (cherry picked from commit 7935c847)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      e374b242
  29. Dec 01, 2016
    • [SPARK-18647][SQL] do not put provider in table properties for Hive serde table · 0f0903d1
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
      In Spark 2.1, we make Hive serde tables case-preserving by putting the table metadata in table properties, like what we did for data source tables. However, we should not put the table provider there, as it breaks forward compatibility. For example, if we create a Hive serde table with Spark 2.1 using `sql("create table test stored as parquet as select 1")`, we will fail to read it with Spark 2.0, as Spark 2.0 mistakenly treats it as a data source table because there is a `provider` entry in the table properties.
      
      Logically, a Hive serde table's provider is always `hive`, so we don't need to store it in table properties; this PR removes it.
      
      ## How was this patch tested?
      
      manually test the forward compatibility issue.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16080 from cloud-fan/hive.
      
      (cherry picked from commit a5f02b00)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      0f0903d1
    • [SPARK-18635][SQL] Partition name/values not escaped correctly in some cases · 9dc3ef6e
      Eric Liang authored
      
      ## What changes were proposed in this pull request?
      
      Due to confusion between URIs and paths, in certain cases we escape partition values too many times, which causes some Hive client operations to fail or write data to the wrong location. This PR fixes at least some of these cases.
      
      To my understanding this is how values, filesystem paths, and URIs interact.
      - Hive stores raw (unescaped) partition values that are returned to you directly when you call listPartitions.
      - Internally, we convert these raw values to filesystem paths via `ExternalCatalogUtils.[un]escapePathName`.
      - In some circumstances we store URIs instead of filesystem paths. When a path is converted to a URI via `path.toURI`, the escaped partition values are further URI-encoded. This means that to get a path back from a URI, you must call `new Path(new URI(uriTxt))` in order to decode the URI-encoded string.
      - In `CatalogStorageFormat` we store URIs as strings. This makes it easy to forget to URI-decode the value before converting it into a path.
      - Finally, the Hive client itself uses mostly Paths for representing locations, and only URIs occasionally.
      
      In the future we should probably clean this up, perhaps by dropping use of URIs when unnecessary. We should also try fixing escaping for partition names as well as values, though names are unlikely to contain special characters.
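      A small illustration of the URI round trip described above (values are made up):
      ```scala
      import java.net.URI
      import org.apache.hadoop.fs.Path

      // A partition value containing a space is stored as an escaped path segment, e.g. "p=a%20b".
      val location = new Path("/warehouse/tbl/p=a%20b")

      // Turning the path into a URI string escapes it again: "%20" becomes "%2520".
      val uriText: String = location.toUri.toString

      // Wrong: treating the URI string as a plain path keeps the double escaping.
      val wrong = new Path(uriText)

      // Right: decode the URI first, recovering the original escaped path.
      val right = new Path(new URI(uriText))
      ```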
      
      cc mallman cloud-fan yhuai
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #16071 from ericl/spark-18635.
      
      (cherry picked from commit 88f559f2)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      9dc3ef6e