  1. Dec 21, 2016
• [SPARK-18949][SQL][BACKPORT-2.1] Add recoverPartitions API to Catalog · 0e51bb08
      gatorsmile authored
      ### What changes were proposed in this pull request?
      
      This PR is to backport https://github.com/apache/spark/pull/16356 to Spark 2.1.1 branch.
      
      ----
      
Currently, we only have a SQL interface for recovering all the partitions in the directory of a table and updating the catalog: `MSCK REPAIR TABLE` or `ALTER TABLE table RECOVER PARTITIONS`. (Actually, `MSCK` is very hard to remember, and it is not obvious what it means.)
      
After the new "Scalable Partition Handling" work, table repair becomes much more important for making the data in a newly created, partitioned data source table visible.
      
Thus, this PR adds it to the Catalog interface. After this PR, users can repair a table with:
      ```Scala
      spark.catalog.recoverPartitions("testTable")
      ```
      
      ### How was this patch tested?
      Modified the existing test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16372 from gatorsmile/repairTable2.1.1.
  2. Dec 19, 2016
• [SPARK-18921][SQL] check database existence with Hive.databaseExists instead of getDatabase · c1a26b45
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
      It's weird that we use `Hive.getDatabase` to check the existence of a database, while Hive has a `databaseExists` interface.
      
      What's worse, `Hive.getDatabase` will produce an error message if the database doesn't exist, which is annoying when we only want to check the database existence.
      
This PR fixes this by using `Hive.databaseExists` to check database existence.
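A minimal sketch of the difference, assuming a `Hive` client handle (the helper names are illustrative, not the actual Spark code):

```scala
import org.apache.hadoop.hive.ql.metadata.Hive

// Old approach: getDatabase logs/raises an error when the database is missing,
// so a plain existence check is noisy.
def existsViaGetDatabase(client: Hive, db: String): Boolean =
  try { client.getDatabase(db) != null } catch { case _: Exception => false }

// Approach used by this PR: ask Hive directly.
def existsViaDatabaseExists(client: Hive, db: String): Boolean =
  client.databaseExists(db)
```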
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16332 from cloud-fan/minor.
      
      (cherry picked from commit 7a75ee1c)
Signed-off-by: Yin Huai <yhuai@databricks.com>
• [SPARK-18700][SQL] Add StripedLock for each table's relation in cache · fc1b2566
      xuanyuanking authored
      ## What changes were proposed in this pull request?
      
As described in the scenario in [SPARK-18700](https://issues.apache.org/jira/browse/SPARK-18700), when `cachedDataSourceTables` is invalidated, the next few queries will each fetch all `FileStatus` objects in the `listLeafFiles` function. When the table has many partitions, these jobs occupy a lot of driver memory and may eventually cause a driver OOM.
      
In this patch, we add a striped lock keyed by each table's relation in the cache, rather than a single lock for the whole `cachedDataSourceTables`, so that each table's cache-load operation is protected independently.
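A rough sketch of the idea using Guava's `Striped` locks; the field name and the wrapping helper are assumptions for illustration, not the exact code added to the catalog:

```scala
import java.util.concurrent.locks.Lock
import com.google.common.util.concurrent.Striped

// One striped lock keyed by table name, instead of a single lock over the whole cache.
val tableRelationLocks: Striped[Lock] = Striped.lock(100)

// Wrap a table's cache-load so concurrent queries on the same table load its relation
// only once, while queries on different tables are not serialized against each other.
def withTableLock[A](tableName: String)(loadRelation: => A): A = {
  val lock = tableRelationLocks.get(tableName)
  lock.lock()
  try loadRelation finally lock.unlock()
}
```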
      
      ## How was this patch tested?
      
Added a multi-threaded table-access test in `PartitionedTablePerfStatsSuite` and verified, using the metrics in `HiveCatalogMetrics`, that the relation is loaded only once.
      
      Author: xuanyuanking <xyliyuanjian@gmail.com>
      
      Closes #16135 from xuanyuanking/SPARK-18700.
      
      (cherry picked from commit 24482858)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
  3. Dec 18, 2016
• [SPARK-18703][SPARK-18675][SQL][BACKPORT-2.1] CTAS for hive serde table should... · 3080f995
      gatorsmile authored
      [SPARK-18703][SPARK-18675][SQL][BACKPORT-2.1] CTAS for hive serde table should work for all hive versions AND Drop Staging Directories and Data Files
      
      ### What changes were proposed in this pull request?
      
      This PR is to backport https://github.com/apache/spark/pull/16104 and https://github.com/apache/spark/pull/16134.
      
      ----------
      [[SPARK-18675][SQL] CTAS for hive serde table should work for all hive versions](https://github.com/apache/spark/pull/16104)
      
Before Hive 1.1, when inserting into a table, Hive creates the staging directory under a common scratch directory. After the write finishes, Hive simply empties the table directory and moves the staging directory into it.

Since Hive 1.1, Hive creates the staging directory under the table directory, and when moving the staging directory into the table directory it still empties the table directory, but excludes the staging directory.

In `InsertIntoHiveTable`, we simply copied the code from Hive 1.2, which means we always create the staging directory under the table directory, no matter what the Hive version is. This causes problems if the Hive version is prior to 1.1, because the staging directory will be removed by Hive when it tries to empty the table directory.
      
This PR also copies the code from Hive 0.13, so that we have two branches for creating the staging directory. If the Hive version is prior to 1.1, we take the old-style branch (i.e. create the staging directory under a common scratch directory); otherwise, we take the new-style branch (i.e. create the staging directory under the table directory).
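An illustrative sketch of the version branch only; the real `InsertIntoHiveTable` logic builds the full staging path and handles more cases:

```scala
// Hypothetical helper: pick the parent of the ".hive-staging" directory based on the
// Hive version, as described above. Version parsing is simplified for illustration.
def stagingDirParent(hiveVersion: String, tableDir: String, scratchDir: String): String = {
  val Array(major, minor) = hiveVersion.split("\\.").take(2).map(_.toInt)
  val preHive11 = major == 0 || (major == 1 && minor < 1)
  if (preHive11) scratchDir // old style: under a common scratch directory
  else tableDir             // new style: under the table directory
}
```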
      
      ----------
      [[SPARK-18703] [SQL] Drop Staging Directories and Data Files After each Insertion/CTAS of Hive serde Tables](https://github.com/apache/spark/pull/16134)
      
Below are the files/directories generated for three inserts against a Hive table:
      ```
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000/._SUCCESS.crc
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000/.part-00000.crc
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000/_SUCCESS
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000/part-00000
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000/._SUCCESS.crc
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000/.part-00000.crc
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000/_SUCCESS
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000/part-00000
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000/._SUCCESS.crc
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000/.part-00000.crc
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000/_SUCCESS
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000/part-00000
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-00000.crc
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-00000
      ```
      
The first 18 files/directories are temporary. We do not drop them until JVM termination. If the JVM does not terminate properly, these temporary files/directories are never dropped.
      
      Only the last two files are needed, as shown below.
      ```
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-00000.crc
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-00000
      ```
The temporary files/directories can accumulate quickly when we issue many inserts, since each insert generates at least six files. This can consume a lot of space and slow down JVM termination, and when the JVM does not terminate properly, the files might never be dropped.
      
      This PR is to drop the created staging files and temporary data files after each insert/CTAS.
      
      ### How was this patch tested?
      Added test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16325 from gatorsmile/backport-18703&18675.
  4. Dec 15, 2016
  5. Dec 12, 2016
• [SPARK-18681][SQL] Fix filtering to compatible with partition keys of type int · 523071f3
      Yuming Wang authored
      
      ## What changes were proposed in this pull request?
      
Cloudera uses `/var/run/cloudera-scm-agent/process/15000-hive-HIVEMETASTORE/hive-site.xml` as the configuration file for the Hive Metastore Server, where `hive.metastore.try.direct.sql=false`. But Spark does not read this configuration file and gets the default value `hive.metastore.try.direct.sql=true`. As mallman suggested, we should use the `getMetaConf` method to obtain the actual configuration from the Hive Metastore Server. I have tested this method a few times and the return value is always consistent with the Hive Metastore Server.
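A hedged sketch of the idea; `getMetaConf` is the metastore-client call mentioned above, while the surrounding wiring is only illustrative:

```scala
import org.apache.hadoop.hive.metastore.IMetaStoreClient

// Ask the Hive Metastore Server for its own value of the setting, instead of falling
// back to the client-side default when the server's hive-site.xml is not on our path.
def metastoreDirectSqlEnabled(msClient: IMetaStoreClient): Boolean =
  msClient.getMetaConf("hive.metastore.try.direct.sql").toBoolean
```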
      
      ## How was this patch tested?
      
      The existing tests.
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #16122 from wangyum/SPARK-18681.
      
      (cherry picked from commit 90abfd15)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
  6. Dec 09, 2016
• [SPARK-18637][SQL] Stateful UDF should be considered as nondeterministic · 72bf5199
      Zhan Zhang authored
      
Make stateful UDFs nondeterministic.
      
Added new test cases with both stateful and stateless UDFs.
Without the patch, the test cases throw the following exception:
      
      1 did not equal 10
      ScalaTestFailureLocation: org.apache.spark.sql.hive.execution.HiveUDFSuite$$anonfun$21 at (HiveUDFSuite.scala:501)
      org.scalatest.exceptions.TestFailedException: 1 did not equal 10
              at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
              at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
              ...
      
      Author: Zhan Zhang <zhanzhang@fb.com>
      
      Closes #16068 from zhzhan/state.
      
      (cherry picked from commit 67587d96)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  7. Dec 08, 2016
  8. Dec 05, 2016
• [SPARK-18572][SQL] Add a method `listPartitionNames` to `ExternalCatalog` · 8ca6a82c
      Michael Allman authored
(Link to Jira issue: https://issues.apache.org/jira/browse/SPARK-18572)
      
      ## What changes were proposed in this pull request?
      
      Currently Spark answers the `SHOW PARTITIONS` command by fetching all of the table's partition metadata from the external catalog and constructing partition names therefrom. The Hive client has a `getPartitionNames` method which is many times faster for this purpose, with the performance improvement scaling with the number of partitions in a table.
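A minimal sketch of the fast path through the Hive client; the `ExternalCatalog` plumbing added by this PR is not shown, and the exact call signature may differ across Hive versions:

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.hive.ql.metadata.Hive

// Fetch only partition names, e.g. "ds=2016-12-01/hr=10", instead of full partition
// metadata objects; a negative max means "no limit".
def listPartitionNames(client: Hive, db: String, table: String): Seq[String] =
  client.getPartitionNames(db, table, (-1).toShort).asScala
```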
      
      To test the performance impact of this PR, I ran the `SHOW PARTITIONS` command on two Hive tables with large numbers of partitions. One table has ~17,800 partitions, and the other has ~95,000 partitions. For the purposes of this PR, I'll call the former table `table1` and the latter table `table2`. I ran 5 trials for each table with before-and-after versions of this PR. The results are as follows:
      
      Spark at bdc8153e, `SHOW PARTITIONS table1`, times in seconds:
      7.901
      3.983
      4.018
      4.331
      4.261
      
      Spark at bdc8153e, `SHOW PARTITIONS table2`
      (Timed out after 10 minutes with a `SocketTimeoutException`.)
      
      Spark at this PR, `SHOW PARTITIONS table1`, times in seconds:
      3.801
      0.449
      0.395
      0.348
      0.336
      
      Spark at this PR, `SHOW PARTITIONS table2`, times in seconds:
      5.184
      1.63
      1.474
      1.519
      1.41
      
      Taking the best times from each trial, we get a 12x performance improvement for a table with ~17,800 partitions and at least a 426x improvement for a table with ~95,000 partitions. More significantly, the latter command doesn't even complete with the current code in master.
      
      This is actually a patch we've been using in-house at VideoAmp since Spark 1.1. It's made all the difference in the practical usability of our largest tables. Even with tables with about 1,000 partitions there's a performance improvement of about 2-3x.
      
      ## How was this patch tested?
      
      I added a unit test to `VersionsSuite` which tests that the Hive client's `getPartitionNames` method returns the correct number of partitions.
      
      Author: Michael Allman <michael@videoamp.com>
      
      Closes #15998 from mallman/spark-18572-list_partition_names.
      
      (cherry picked from commit 772ddbea)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  9. Dec 04, 2016
  10. Dec 02, 2016
• [SPARK-18659][SQL] Incorrect behaviors in overwrite table for datasource tables · e374b242
      Eric Liang authored
      
      ## What changes were proposed in this pull request?
      
      Two bugs are addressed here
1. INSERT OVERWRITE TABLE sometimes crashed when catalog partition management was enabled. This was because when dropping partitions after an overwrite operation, the Hive client will attempt to delete the partition files. If the entire partition directory was dropped, this would fail. The PR fixes this by adding a flag to control whether the Hive client should attempt to delete files.
      2. The static partition spec for OVERWRITE TABLE was not correctly resolved to the case-sensitive original partition names. This resulted in the entire table being overwritten if you did not correctly capitalize your partition names.
      
      cc yhuai cloud-fan
      
      ## How was this patch tested?
      
      Unit tests. Surprisingly, the existing overwrite table tests did not catch these edge cases.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #16088 from ericl/spark-18659.
      
      (cherry picked from commit 7935c847)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  11. Dec 01, 2016
• [SPARK-18647][SQL] do not put provider in table properties for Hive serde table · 0f0903d1
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
In Spark 2.1, we make Hive serde tables case-preserving by putting the table metadata in table properties, as we did for data source tables. However, we should not put the table provider there, as it breaks forward compatibility. E.g. if we create a Hive serde table with Spark 2.1 using `sql("create table test stored as parquet as select 1")`, we will fail to read it with Spark 2.0, as Spark 2.0 mistakenly treats it as a data source table because there is a `provider` entry in the table properties.
      
Logically, a Hive serde table's provider is always `hive`, so we don't need to store it in table properties; this PR removes it.
      
      ## How was this patch tested?
      
Manually tested the forward compatibility issue.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16080 from cloud-fan/hive.
      
      (cherry picked from commit a5f02b00)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
• [SPARK-18635][SQL] Partition name/values not escaped correctly in some cases · 9dc3ef6e
      Eric Liang authored
      
      ## What changes were proposed in this pull request?
      
Due to confusion between URIs and paths, in certain cases we escape partition values too many times, which causes some Hive client operations to fail or write data to the wrong location. This PR fixes at least some of these cases.
      
      To my understanding this is how values, filesystem paths, and URIs interact.
      - Hive stores raw (unescaped) partition values that are returned to you directly when you call listPartitions.
      - Internally, we convert these raw values to filesystem paths via `ExternalCatalogUtils.[un]escapePathName`.
- In some circumstances we store URIs instead of filesystem paths. When a path is converted to a URI via `path.toURI`, the escaped partition values are further URI-encoded. This means that to get a path back from a URI, you must call `new Path(new URI(uriTxt))` in order to decode the URI-encoded string (see the sketch after this list).
      - In `CatalogStorageFormat` we store URIs as strings. This makes it easy to forget to URI-decode the value before converting it into a path.
      - Finally, the Hive client itself uses mostly Paths for representing locations, and only URIs occasionally.
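A small sketch of the round trip described above; the stored location value is made up:

```scala
import java.net.URI
import org.apache.hadoop.fs.Path

// Suppose a partition location was stored as a URI string in CatalogStorageFormat
// (hypothetical value, with URI-encoded characters in the partition directory name).
val storedLocation = "file:/warehouse/tbl/p=2016%2F12%2F01"

// Recover a filesystem Path by decoding through URI first, as described above;
// building the Path straight from the raw string skips the URI-decoding step and
// can lead to the value being escaped one more time downstream.
val path = new Path(new URI(storedLocation))
```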
      
      In the future we should probably clean this up, perhaps by dropping use of URIs when unnecessary. We should also try fixing escaping for partition names as well as values, though names are unlikely to contain special characters.
      
      cc mallman cloud-fan yhuai
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #16071 from ericl/spark-18635.
      
      (cherry picked from commit 88f559f2)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  12. Nov 30, 2016
• [SPARK-18220][SQL] read Hive orc table with varchar column should not fail · 3de93fb4
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
Spark SQL only has `StringType`; when reading a Hive table with a varchar column, we read that column as `StringType`. However, we still need to use a varchar `ObjectInspector` to read a varchar column in a Hive table, which means we need to know the actual column type on the Hive side.
      
In Spark 2.1, after https://github.com/apache/spark/pull/14363, we parse the Hive type string into a Catalyst type, which means the actual column type on the Hive side is erased. We may then use a string `ObjectInspector` to read a varchar column and fail.
      
This PR keeps the original Hive column type string in the metadata of `StructField`, and uses it when we convert the field back to a Hive column.
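A hedged sketch of the idea; the metadata key name used here is an assumption for illustration:

```scala
import org.apache.spark.sql.types.{MetadataBuilder, StringType, StructField}

// Hypothetical metadata key; the actual patch may use a different name.
val hiveTypeKey = "HIVE_TYPE_STRING"

// The column is still exposed to Spark SQL as StringType, but the original Hive type
// string is preserved in the field metadata.
val field = StructField(
  name = "city",
  dataType = StringType,
  nullable = true,
  metadata = new MetadataBuilder().putString(hiveTypeKey, "varchar(100)").build())

// When converting the field back to a Hive column, prefer the preserved type string.
val hiveColumnType =
  if (field.metadata.contains(hiveTypeKey)) field.metadata.getString(hiveTypeKey)
  else "string"
```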
      
      ## How was this patch tested?
      
      newly added regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16060 from cloud-fan/varchar.
      
      (cherry picked from commit 3f03c90a)
Signed-off-by: Reynold Xin <rxin@databricks.com>
• [SPARK-17680][SQL][TEST] Added a Testcase for Verifying Unicode Character... · a5ec2a7b
      gatorsmile authored
      [SPARK-17680][SQL][TEST] Added a Testcase for Verifying Unicode Character Support for Column Names and Comments
      
      ### What changes were proposed in this pull request?
      
Spark SQL supports Unicode characters for column names when specified within backticks (`). When Hive support is enabled, the version of the Hive metastore must be higher than 0.12, because the Hive metastore has supported Unicode characters in column names since 0.13 (see the JIRA: https://issues.apache.org/jira/browse/HIVE-6013).
      
In Spark SQL, table comments and view comments always allow Unicode characters without backticks.
      
      BTW, a separate PR has been submitted for database and table name validation because we do not support Unicode characters in these two cases.
      ### How was this patch tested?
      
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15255 from gatorsmile/unicodeSupport.
      
      (cherry picked from commit a1d9138a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  13. Nov 29, 2016
  14. Nov 28, 2016
  15. Nov 27, 2016
• [SPARK-18482][SQL] make sure Spark can access the table metadata created by older version of spark · 6b77889e
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
In Spark 2.1, we did a lot of refactoring of `HiveExternalCatalog` and related code paths. This refactoring may introduce external behavior changes and break backward compatibility, e.g. http://issues.apache.org/jira/browse/SPARK-18464.

To avoid future compatibility problems with `HiveExternalCatalog`, this PR dumps some typical table metadata from tables created by Spark 2.0 and tests whether it can be recognized by the current version of Spark.
      
      ## How was this patch tested?
      
Test-only change.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16003 from cloud-fan/test.
      
      (cherry picked from commit fc2c13bd)
Signed-off-by: Reynold Xin <rxin@databricks.com>
• [SPARK-18594][SQL] Name Validation of Databases/Tables · 1e8fbefa
      gatorsmile authored
      
      ### What changes were proposed in this pull request?
Currently, the name validation checks are limited to table creation. They are enforced by the Analyzer rule `PreWriteCheck`.

However, table renaming and database creation have the same issues. It makes more sense to do the checks in `SessionCatalog`, so this PR adds them there.
      
      ### How was this patch tested?
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16018 from gatorsmile/nameValidate.
      
      (cherry picked from commit 07f32c22)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
  16. Nov 25, 2016
• [SPARK-3359][BUILD][DOCS] More changes to resolve javadoc 8 errors that will... · 69856f28
      hyukjinkwon authored
      [SPARK-3359][BUILD][DOCS] More changes to resolve javadoc 8 errors that will help unidoc/genjavadoc compatibility
      
      ## What changes were proposed in this pull request?
      
This PR only tries to fix things that look pretty straightforward and were already fixed in previous PRs.
      
      This PR roughly fixes several things as below:
      
      - Fix unrecognisable class and method links in javadoc by changing it from `[[..]]` to `` `...` ``
      
        ```
        [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/DataStreamReader.java:226: error: reference not found
        [error]    * Loads text files and returns a {link DataFrame} whose schema starts with a string column named
        ```
      
      - Fix an exception annotation and remove code backticks in `throws` annotation
      
        Currently, sbt unidoc with Java 8 complains as below:
      
        ```
        [error] .../java/org/apache/spark/sql/streaming/StreamingQuery.java:72: error: unexpected text
        [error]    * throws StreamingQueryException, if <code>this</code> query has terminated with an exception.
        ```
      
        `throws` should specify the correct class name from `StreamingQueryException,` to `StreamingQueryException` without backticks. (see [JDK-8007644](https://bugs.openjdk.java.net/browse/JDK-8007644)).
      
      - Fix `[[http..]]` to `<a href="http..."></a>`.
      
        ```diff
        -   * [[https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https Oracle
        -   * blog page]].
  +   * <a href="https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https">
  +   * Oracle blog page</a>.
        ```
      
         `[[http...]]` link markdown in scaladoc is unrecognisable in javadoc.
      
- It seems a class can't have a `return` annotation, so two cases of this were removed.
      
        ```
        [error] .../java/org/apache/spark/mllib/regression/IsotonicRegression.java:27: error: invalid use of return
        [error]    * return New instance of IsotonicRegression.
        ```
      
      - Fix < to `&lt;` and > to `&gt;` according to HTML rules.
      
      - Fix `</p>` complaint
      
- Exclude tags unrecognisable in javadoc: `constructor`, `todo` and `groupname`.
      
      ## How was this patch tested?
      
      Manually tested by `jekyll build` with Java 7 and 8
      
      ```
      java version "1.7.0_80"
      Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
      Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
      ```
      
      ```
      java version "1.8.0_45"
      Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
      Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
      ```
      
Note: this does not make sbt unidoc succeed with Java 8 yet, but it reduces the number of errors with Java 8.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15999 from HyukjinKwon/SPARK-3359-errors.
      
      (cherry picked from commit 51b1c155)
Signed-off-by: Sean Owen <sowen@cloudera.com>
  17. Nov 23, 2016
• [SPARK-18522][SQL] Explicit contract for column stats serialization · 599dac15
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
The current implementation of column stats uses the base64 encoding of the internal UnsafeRow format to persist statistics (in table properties in the Hive metastore). This is an internal format that is not stable across different versions of Spark and should NOT be used for persistence. In addition, it would be better if the statistics stored in the catalog were human readable.
      
      This pull request introduces the following changes:
      
1. Created a single ColumnStat class for all data types. All data types track the same set of statistics.
      2. Updated the implementation for stats collection to get rid of the dependency on internal data structures (e.g. InternalRow, or storing DateType as an int32). For example, previously dates were stored as a single integer, but are now stored as java.sql.Date. When we implement the next steps of CBO, we can add code to convert those back into internal types again.
      3. Documented clearly what JVM data types are being used to store what data.
4. Defined a simple Map[String, String] interface for serializing and deserializing column stats into/from the catalog (a rough sketch follows this list).
      5. Rearranged the method/function structure so it is more clear what the supported data types are, and also moved how stats are generated into ColumnStat class so they are easy to find.
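A rough sketch of what such a human-readable `Map[String, String]` round trip could look like; the field set and key names below are assumptions, not the exact `ColumnStat` contract:

```scala
// Simplified stand-in for ColumnStat, for illustration only.
case class SimpleColumnStat(distinctCount: Long, nullCount: Long, maxLen: Long, avgLen: Double)

def toMap(s: SimpleColumnStat): Map[String, String] = Map(
  "distinctCount" -> s.distinctCount.toString,
  "nullCount"     -> s.nullCount.toString,
  "maxLen"        -> s.maxLen.toString,
  "avgLen"        -> s.avgLen.toString)

def fromMap(m: Map[String, String]): SimpleColumnStat = SimpleColumnStat(
  distinctCount = m("distinctCount").toLong,
  nullCount     = m("nullCount").toLong,
  maxLen        = m("maxLen").toLong,
  avgLen        = m("avgLen").toDouble)
```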
      
      ## How was this patch tested?
      Removed most of the original test cases created for column statistics, and added three very simple ones to cover all the cases. The three test cases validate:
      1. Roundtrip serialization works.
      2. Behavior when analyzing non-existent column or unsupported data type column.
      3. Result for stats collection for all valid data types.
      
      Also moved parser related tests into a parser test suite and added an explicit serialization test for the Hive external catalog.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15959 from rxin/SPARK-18522.
      
      (cherry picked from commit 70ad07a9)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
• [SPARK-18545][SQL] Verify number of hive client RPCs in PartitionedTablePerfStatsSuite · 539c193a
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      This would help catch accidental O(n) calls to the hive client as in https://issues.apache.org/jira/browse/SPARK-18507
      
      ## How was this patch tested?
      
Checked that the test fails before https://issues.apache.org/jira/browse/SPARK-18507 was patched. cc cloud-fan
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #15985 from ericl/spark-18545.
      
      (cherry picked from commit 85235ed6)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  18. Nov 22, 2016
• [SPARK-16803][SQL] SaveAsTable does not work when target table is a Hive serde table · 64b9de9c
      gatorsmile authored
      
      ### What changes were proposed in this pull request?
      
In Spark 2.0, `SaveAsTable` does not work when the target table is a Hive serde table, but it works in Spark 1.6.
      
      **Spark 1.6**
      
      ``` Scala
      scala> sql("create table sample.sample stored as SEQUENCEFILE as select 1 as key, 'abc' as value")
      res2: org.apache.spark.sql.DataFrame = []
      
      scala> val df = sql("select key, value as value from sample.sample")
      df: org.apache.spark.sql.DataFrame = [key: int, value: string]
      
      scala> df.write.mode("append").saveAsTable("sample.sample")
      
      scala> sql("select * from sample.sample").show()
      +---+-----+
      |key|value|
      +---+-----+
      |  1|  abc|
      |  1|  abc|
      +---+-----+
      ```
      
      **Spark 2.0**
      
      ``` Scala
      scala> df.write.mode("append").saveAsTable("sample.sample")
      org.apache.spark.sql.AnalysisException: Saving data in MetastoreRelation sample, sample
       is not supported.;
      ```
      
So far, we do not plan to support this in Spark 2.1 due to the risk. Spark 1.6 works because it internally uses insertInto. But if we changed it back, it would break the semantics of saveAsTable (which uses by-name resolution instead of the by-position resolution used by insertInto). More changes are needed to support `hive` as a `format` in DataFrameWriter.
      
Instead, users should use the insertInto API. This PR corrects the error messages so that users understand how to work around the limitation until we support it in a separate PR.
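For the example above, the suggested workaround looks like this (reusing `df` from the Spark 1.6 snippet; sketch only):

```scala
// Append to the existing Hive serde table by position, instead of saveAsTable.
df.write.mode("append").insertInto("sample.sample")
```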
      ### How was this patch tested?
      
      Test cases are added
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15926 from gatorsmile/saveAsTableFix5.
      
      (cherry picked from commit 9c42d4a7)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
• [SPARK-18465] Add 'IF EXISTS' clause to 'UNCACHE' to not throw exceptions when table doesn't exist · fb2ea54a
      Burak Yavuz authored
      
      ## What changes were proposed in this pull request?
      
      While this behavior is debatable, consider the following use case:
      ```sql
      UNCACHE TABLE foo;
      CACHE TABLE foo AS
      SELECT * FROM bar
      ```
      The command above fails the first time you run it. But I want to run the command above over and over again, and I don't want to change my code just for the first run of it.
      The issue is that subsequent `CACHE TABLE` commands do not overwrite the existing table.
      
      Now we can do:
      ```sql
      UNCACHE TABLE IF EXISTS foo;
      CACHE TABLE foo AS
      SELECT * FROM bar
      ```
      
      ## How was this patch tested?
      
      Unit tests
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #15896 from brkyvz/uncache.
      
      (cherry picked from commit bdc8153e)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
• [SPARK-18507][SQL] HiveExternalCatalog.listPartitions should only call getTable once · fa360134
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
HiveExternalCatalog.listPartitions should only call `getTable` once, instead of calling it for every partition.
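A generic sketch of the shape of the change; the names are illustrative and the real method lives in `HiveExternalCatalog`:

```scala
// Hoist the single getTable lookup out of the per-partition loop, so the metastore is
// asked for the table once rather than once per partition.
def restoreAllPartitions[T, P](
    getTable: () => T,
    rawPartitions: Seq[P],
    restore: (T, P) => P): Seq[P] = {
  val table = getTable() // one metastore call instead of rawPartitions.size calls
  rawPartitions.map(p => restore(table, p))
}
```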
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15978 from cloud-fan/perf.
      
      (cherry picked from commit 702cd403)
Signed-off-by: Andrew Or <andrewor14@gmail.com>
  19. Nov 21, 2016
  20. Nov 20, 2016
  21. Nov 18, 2016
• [SPARK-18505][SQL] Simplify AnalyzeColumnCommand · 4b1df0e8
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
I'm spending more time at the design & code level for the cost-based optimizer now, and have found a number of issues related to maintainability and compatibility that I would like to address.
      
      This is a small pull request to clean up AnalyzeColumnCommand:
      
      1. Removed warning on duplicated columns. Warnings in log messages are useless since most users that run SQL don't see them.
      2. Removed the nested updateStats function, by just inlining the function.
      3. Renamed a few functions to better reflect what they do.
4. Removed the factory apply method for ColumnStatStruct. It is a bad pattern to use an apply method that returns an instantiation of a class that is not of the same type (ColumnStatStruct.apply used to return CreateNamedStruct).
      5. Renamed ColumnStatStruct to just AnalyzeColumnCommand.
      6. Added more documentation explaining some of the non-obvious return types and code blocks.
      
      In follow-up pull requests, I'd like to address the following:
      
      1. Get rid of the Map[String, ColumnStat] map, since internally we should be using Attribute to reference columns, rather than strings.
      2. Decouple the fields exposed by ColumnStat and internals of Spark SQL's execution path. Currently the two are coupled because ColumnStat takes in an InternalRow.
      3. Correctness: Remove code path that stores statistics in the catalog using the base64 encoding of the UnsafeRow format, which is not stable across Spark versions.
      4. Clearly document the data representation stored in the catalog for statistics.
      
      ## How was this patch tested?
      Affected test cases have been updated.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15933 from rxin/SPARK-18505.
      
      (cherry picked from commit 6f7ff750)
Signed-off-by: Reynold Xin <rxin@databricks.com>
• [SPARK-18457][SQL] ORC and other columnar formats using HiveShim read all... · ec622eb7
      Andrew Ray authored
      [SPARK-18457][SQL] ORC and other columnar formats using HiveShim read all columns when doing a simple count
      
      ## What changes were proposed in this pull request?
      
      When reading zero columns (e.g., count(*)) from ORC or any other format that uses HiveShim, actually set the read column list to empty for Hive to use.
      
      ## How was this patch tested?
      
      Query correctness is handled by existing unit tests. I'm happy to add more if anyone can point out some case that is not covered.
      
      Reduction in data read can be verified in the UI when built with a recent version of Hadoop say:
      ```
      build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.0 -Phive -DskipTests clean package
      ```
      However the default Hadoop 2.2 that is used for unit tests does not report actual bytes read and instead just full file sizes (see FileScanRDD.scala line 80). Therefore I don't think there is a good way to add a unit test for this.
      
      I tested with the following setup using above build options
      ```
      case class OrcData(intField: Long, stringField: String)
      spark.range(1,1000000).map(i => OrcData(i, s"part-$i")).toDF().write.format("orc").save("orc_test")
      
      sql(
            s"""CREATE EXTERNAL TABLE orc_test(
               |  intField LONG,
               |  stringField STRING
               |)
               |STORED AS ORC
               |LOCATION '${System.getProperty("user.dir") + "/orc_test"}'
             """.stripMargin)
      ```
      
      ## Results
      
      query | Spark 2.0.2 | this PR
      ---|---|---
      `sql("select count(*) from orc_test").collect`|4.4 MB|199.4 KB
      `sql("select intField from orc_test").collect`|743.4 KB|743.4 KB
      `sql("select * from orc_test").collect`|4.4 MB|4.4 MB
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #15898 from aray/sql-orc-no-col.
      
      (cherry picked from commit 795e9fc9)
Signed-off-by: Reynold Xin <rxin@databricks.com>
  22. Nov 17, 2016
• [SPARK-18360][SQL] default table path of tables in default database should... · fc466be4
      Wenchen Fan authored
      [SPARK-18360][SQL] default table path of tables in default database should depend on the location of default database
      
      ## What changes were proposed in this pull request?
      
      The current semantic of the warehouse config:
      
      1. it's a static config, which means you can't change it once your spark application is launched.
      2. Once a database is created, its location won't change even the warehouse path config is changed.
3. The default database is a special case: although its location is fixed, the locations of tables created in it are not. If a Spark app starts with warehouse path B (while the location of the default database is A), and a user creates a table `tbl` in the default database, its location will be `B/tbl` instead of `A/tbl`. If users change the warehouse path config to C and create another table `tbl2`, its location will still be `B/tbl2` instead of `C/tbl2`.
      
Rule 3 doesn't make sense; I think we introduced it by mistake, not intentionally. Data source tables don't follow rule 3 and treat the default database like a normal one.
      
This PR fixes Hive serde tables to make them consistent with data source tables.
      
      ## How was this patch tested?
      
      HiveSparkSubmitSuite
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15812 from cloud-fan/default-db.
      
      (cherry picked from commit ce13c267)
Signed-off-by: Yin Huai <yhuai@databricks.com>
• [SPARK-18464][SQL] support old table which doesn't store schema in metastore · 014fceee
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
Before Spark 2.1, users could create an external data source table without a schema, and we would infer the table schema at runtime. In Spark 2.1, we decided to infer the schema when the table is created, so that we don't need to infer it again and again at runtime.
      
This is a good improvement, but we should still respect and support old tables which don't store the table schema in the metastore.
      
      ## How was this patch tested?
      
      regression test.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15900 from cloud-fan/hive-catalog.
      
      (cherry picked from commit 07b3f045)
Signed-off-by: Reynold Xin <rxin@databricks.com>