  1. Dec 15, 2016
  2. Dec 12, 2016
    • [SPARK-18681][SQL] Fix filtering to be compatible with partition keys of type int · 523071f3
      Yuming Wang authored
      
      ## What changes were proposed in this pull request?
      
      Cloudera puts `/var/run/cloudera-scm-agent/process/15000-hive-HIVEMETASTORE/hive-site.xml` as the configuration file for the Hive Metastore Server, where `hive.metastore.try.direct.sql=false`. But Spark doesn't read this configuration file and gets the default value `hive.metastore.try.direct.sql=true`. As mallman suggested, we should use the `getMetaConf` method to obtain the original configuration from the Hive Metastore Server. I have tested this method a few times and the return value is always consistent with the Hive Metastore Server.
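      The idea can be sketched without a real metastore (names here are illustrative; the actual fix goes through the Hive client's `getMetaConf`): ask the server for its value of a key instead of trusting the locally loaded default.

      ```scala
      // Illustrative sketch: prefer the value the metastore server reports for
      // a key over the locally configured default. The server is modeled as a
      // plain Map here; the real code calls the Hive client's getMetaConf.
      def metastoreConf(serverConf: Map[String, String], key: String, localDefault: String): String =
        serverConf.getOrElse(key, localDefault)

      val server = Map("hive.metastore.try.direct.sql" -> "false")
      metastoreConf(server, "hive.metastore.try.direct.sql", "true")  // "false", not the local default
      ```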
      
      ## How was this patch tested?
      
      The existing tests.
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #16122 from wangyum/SPARK-18681.
      
      (cherry picked from commit 90abfd15)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
  3. Dec 09, 2016
    • [SPARK-18637][SQL] Stateful UDF should be considered as nondeterministic · 72bf5199
      Zhan Zhang authored
      
      Make stateful UDFs nondeterministic.
      
      Added new test cases with both stateful and stateless UDFs.
      Without the patch, the test cases throw an exception:
      
      1 did not equal 10
      ScalaTestFailureLocation: org.apache.spark.sql.hive.execution.HiveUDFSuite$$anonfun$21 at (HiveUDFSuite.scala:501)
      org.scalatest.exceptions.TestFailedException: 1 did not equal 10
              at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
              at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
              ...
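      The underlying hazard can be shown with a plain stateful class (names hypothetical, not Spark's): each call returns a different value, so the optimizer must not treat repeated calls as interchangeable the way it can for deterministic expressions.

      ```scala
      // A stateful "UDF" in miniature: eval() mutates internal state, so
      // collapsing repeated calls into one -- legal for deterministic
      // expressions -- would change the result.
      class RowSequence {
        private var current = 0L
        def eval(): Long = { current += 1; current }
      }

      val seq = new RowSequence
      val results = (1 to 3).map(_ => seq.eval())  // Vector(1, 2, 3): every call differs
      ```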
      
      Author: Zhan Zhang <zhanzhang@fb.com>
      
      Closes #16068 from zhzhan/state.
      
      (cherry picked from commit 67587d96)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  4. Dec 08, 2016
  5. Dec 05, 2016
    • [SPARK-18572][SQL] Add a method `listPartitionNames` to `ExternalCatalog` · 8ca6a82c
      Michael Allman authored
      (Link to Jira issue: https://issues.apache.org/jira/browse/SPARK-18572)
      
      ## What changes were proposed in this pull request?
      
      Currently Spark answers the `SHOW PARTITIONS` command by fetching all of the table's partition metadata from the external catalog and constructing partition names therefrom. The Hive client has a `getPartitionNames` method which is many times faster for this purpose, with the performance improvement scaling with the number of partitions in a table.
      
      To test the performance impact of this PR, I ran the `SHOW PARTITIONS` command on two Hive tables with large numbers of partitions. One table has ~17,800 partitions, and the other has ~95,000 partitions. For the purposes of this PR, I'll call the former table `table1` and the latter table `table2`. I ran 5 trials for each table with before-and-after versions of this PR. The results are as follows:
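      Why names are so much cheaper can be sketched in a few lines (the helper is illustrative, not Spark's actual API): a partition name is just the `col=value` pairs joined with `/`, so it can be built from the partition spec alone, without fetching each partition's full storage metadata.

      ```scala
      // Illustrative sketch: SHOW PARTITIONS only needs names of the form
      // "k1=v1/k2=v2", which require no file-level or storage metadata.
      def partitionName(spec: Seq[(String, String)]): String =
        spec.map { case (k, v) => s"$k=$v" }.mkString("/")

      val specs = Seq(
        Seq("ds" -> "2016-12-01", "hr" -> "10"),
        Seq("ds" -> "2016-12-01", "hr" -> "11"))
      specs.map(partitionName)  // ds=2016-12-01/hr=10, ds=2016-12-01/hr=11
      ```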
      
      Spark at bdc8153e, `SHOW PARTITIONS table1`, times in seconds:
      7.901
      3.983
      4.018
      4.331
      4.261
      
      Spark at bdc8153e, `SHOW PARTITIONS table2`
      (Timed out after 10 minutes with a `SocketTimeoutException`.)
      
      Spark at this PR, `SHOW PARTITIONS table1`, times in seconds:
      3.801
      0.449
      0.395
      0.348
      0.336
      
      Spark at this PR, `SHOW PARTITIONS table2`, times in seconds:
      5.184
      1.63
      1.474
      1.519
      1.41
      
      Taking the best times from each trial, we get a 12x performance improvement for a table with ~17,800 partitions and at least a 426x improvement for a table with ~95,000 partitions. More significantly, the latter command doesn't even complete with the current code in master.
      
      This is actually a patch we've been using in-house at VideoAmp since Spark 1.1. It's made all the difference in the practical usability of our largest tables. Even with tables with about 1,000 partitions there's a performance improvement of about 2-3x.
      
      ## How was this patch tested?
      
      I added a unit test to `VersionsSuite` which tests that the Hive client's `getPartitionNames` method returns the correct number of partitions.
      
      Author: Michael Allman <michael@videoamp.com>
      
      Closes #15998 from mallman/spark-18572-list_partition_names.
      
      (cherry picked from commit 772ddbea)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  6. Dec 04, 2016
  7. Dec 02, 2016
    • [SPARK-18659][SQL] Incorrect behaviors in overwrite table for datasource tables · e374b242
      Eric Liang authored
      
      ## What changes were proposed in this pull request?
      
      Two bugs are addressed here
      1. INSERT OVERWRITE TABLE sometimes crashed when catalog partition management was enabled. This was because when dropping partitions after an overwrite operation, the Hive client would attempt to delete the partition files. If the entire partition directory had already been dropped, this would fail. The PR fixes this by adding a flag to control whether the Hive client should attempt to delete files.
      2. The static partition spec for OVERWRITE TABLE was not correctly resolved to the case-sensitive original partition names. This resulted in the entire table being overwritten if you did not correctly capitalize your partition names.
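      Bug 2 amounts to a missing case-insensitive resolution step, sketched here with hypothetical helper names: match each user-supplied key against the table's actual partition columns instead of using it verbatim.

      ```scala
      // Illustrative sketch: resolve a user-written static partition spec
      // against the table's case-sensitive partition column names, so the
      // OVERWRITE only touches the intended partition.
      def resolvePartitionSpec(
          userSpec: Map[String, String],
          partitionColumns: Seq[String]): Map[String, String] =
        userSpec.map { case (key, value) =>
          val resolved = partitionColumns
            .find(_.equalsIgnoreCase(key))
            .getOrElse(throw new IllegalArgumentException(s"$key is not a partition column"))
          resolved -> value
        }

      resolvePartitionSpec(Map("DS" -> "2016-12-01"), Seq("ds", "hr"))  // Map("ds" -> "2016-12-01")
      ```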
      
      cc yhuai cloud-fan
      
      ## How was this patch tested?
      
      Unit tests. Surprisingly, the existing overwrite table tests did not catch these edge cases.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #16088 from ericl/spark-18659.
      
      (cherry picked from commit 7935c847)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  8. Dec 01, 2016
    • [SPARK-18647][SQL] do not put provider in table properties for Hive serde table · 0f0903d1
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
      In Spark 2.1, we make Hive serde tables case-preserving by putting the table metadata in table properties, like what we did for data source tables. However, we should not put the table provider there, as it breaks forward compatibility. E.g. if we create a Hive serde table with Spark 2.1, using `sql("create table test stored as parquet as select 1")`, we will fail to read it with Spark 2.0, as Spark 2.0 mistakenly treats it as a data source table because there is a `provider` entry in the table properties.
      
      Logically, a Hive serde table's provider is always `hive`, so we don't need to store it in table properties; this PR removes it.
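      A sketch of why the entry is harmful (the property key shown is an assumption about the internal table-property format, used here on a plain map): older Spark decides "data source table vs. Hive serde table" purely by the presence of a provider entry.

      ```scala
      // Illustrative sketch (property key assumed): Spark 2.0 treats any table
      // whose properties contain a provider entry as a data source table, so a
      // Hive serde table that stores one becomes unreadable there.
      val ProviderKey = "spark.sql.sources.provider"

      def looksLikeDataSourceTable(tableProps: Map[String, String]): Boolean =
        tableProps.contains(ProviderKey)

      looksLikeDataSourceTable(Map(ProviderKey -> "parquet"))  // true
      looksLikeDataSourceTable(Map.empty)                      // false: read as Hive serde table
      ```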
      
      ## How was this patch tested?
      
      Manually tested the forward-compatibility issue.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16080 from cloud-fan/hive.
      
      (cherry picked from commit a5f02b00)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    • [SPARK-18635][SQL] Partition name/values not escaped correctly in some cases · 9dc3ef6e
      Eric Liang authored
      
      ## What changes were proposed in this pull request?
      
      Due to confusion between URI vs paths, in certain cases we escape partition values too many times, which causes some Hive client operations to fail or write data to the wrong location. This PR fixes at least some of these cases.
      
      To my understanding this is how values, filesystem paths, and URIs interact.
      - Hive stores raw (unescaped) partition values that are returned to you directly when you call listPartitions.
      - Internally, we convert these raw values to filesystem paths via `ExternalCatalogUtils.[un]escapePathName`.
      - In some circumstances we store URIs instead of filesystem paths. When a path is converted to a URI via `path.toURI`, the escaped partition values are further URI-encoded. This means that to get a path back from a URI, you must call `new Path(new URI(uriTxt))` in order to decode the URI-encoded string.
      - In `CatalogStorageFormat` we store URIs as strings. This makes it easy to forget to URI-decode the value before converting it into a path.
      - Finally, the Hive client itself uses mostly Paths for representing locations, and only URIs occasionally.
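      The double-encoding trap described above can be reproduced with plain `java.net.URI` (no Spark needed): the multi-argument URI constructors always quote `%`, so an already-escaped partition value gets encoded a second time, and one level of URI decoding is needed to recover the path form.

      ```scala
      // An already Hive-escaped partition value for "a b":
      val escaped = "a%20b"

      // Converting the path into a URI quotes the '%' again...
      val uri = new java.net.URI("file", null, s"/warehouse/t/p=$escaped", null)
      uri.toString  // "file:/warehouse/t/p=a%2520b" -- double-encoded

      // ...so the URI must be decoded once to get the escaped path back:
      uri.getPath   // "/warehouse/t/p=a%20b"
      ```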
      
      In the future we should probably clean this up, perhaps by dropping use of URIs when unnecessary. We should also try fixing escaping for partition names as well as values, though names are unlikely to contain special characters.
      
      cc mallman cloud-fan yhuai
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #16071 from ericl/spark-18635.
      
      (cherry picked from commit 88f559f2)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  9. Nov 30, 2016
    • [SPARK-18220][SQL] read Hive orc table with varchar column should not fail · 3de93fb4
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Spark SQL only has `StringType`, so when reading a Hive table with a varchar column, we read that column as `StringType`. However, we still need to use the varchar `ObjectInspector` to read varchar columns in Hive tables, which means we need to know the actual column type on the Hive side.
      
      In Spark 2.1, after https://github.com/apache/spark/pull/14363, we parse the Hive type string to a Catalyst type, which means the actual column type on the Hive side is erased. Then we may use the string `ObjectInspector` to read a varchar column and fail.
      
      This PR keeps the original hive column type string in the metadata of `StructField`, and use it when we convert it to a hive column.
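      The mechanism can be sketched with a plain metadata map (the field shape and metadata key are assumptions for illustration, not Spark's exact internals): keep the raw Hive type string next to the erased Catalyst type and consult it when converting back to a Hive column.

      ```scala
      // Illustrative sketch: the Catalyst-side field is just a string type, but
      // the original Hive type survives in metadata so the right varchar
      // ObjectInspector can be chosen later. Key name is an assumption.
      case class Field(name: String, catalystType: String, metadata: Map[String, String])

      val hiveTypeKey = "HIVE_TYPE_STRING"
      val field = Field("name", "string", Map(hiveTypeKey -> "varchar(10)"))

      def hiveColumnType(f: Field): String =
        f.metadata.getOrElse(hiveTypeKey, f.catalystType)

      hiveColumnType(field)  // "varchar(10)", not the erased "string"
      ```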
      
      ## How was this patch tested?
      
      newly added regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16060 from cloud-fan/varchar.
      
      (cherry picked from commit 3f03c90a)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
    • [SPARK-17680][SQL][TEST] Added a Testcase for Verifying Unicode Character Support for Column Names and Comments · a5ec2a7b
      gatorsmile authored
      
      ### What changes were proposed in this pull request?
      
      Spark SQL supports Unicode characters for column names when specified within backticks (`). When Hive support is enabled, the version of the Hive metastore must be higher than 0.12, since the Hive metastore supports Unicode characters for column names only as of 0.13. See the JIRA: https://issues.apache.org/jira/browse/HIVE-6013
      
      In Spark SQL, table comments and view comments always allow Unicode characters without backticks.
      
      BTW, a separate PR has been submitted for database and table name validation because we do not support Unicode characters in these two cases.
      ### How was this patch tested?
      
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15255 from gatorsmile/unicodeSupport.
      
      (cherry picked from commit a1d9138a)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  10. Nov 29, 2016
  11. Nov 28, 2016
  12. Nov 27, 2016
    • [SPARK-18482][SQL] make sure Spark can access the table metadata created by older version of spark · 6b77889e
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      In Spark 2.1, we did a lot of refactoring of `HiveExternalCatalog` and related code paths. These refactorings may introduce external behavior changes and break backward compatibility, e.g. http://issues.apache.org/jira/browse/SPARK-18464
      
      To avoid future compatibility problems with `HiveExternalCatalog`, this PR dumps some typical table metadata from tables created by 2.0 and tests whether they can be recognized by the current version of Spark.
      
      ## How was this patch tested?
      
      test only change
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16003 from cloud-fan/test.
      
      (cherry picked from commit fc2c13bd)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
    • [SPARK-18594][SQL] Name Validation of Databases/Tables · 1e8fbefa
      gatorsmile authored
      
      ### What changes were proposed in this pull request?
      Currently, the name validation checks are limited to table creation. It is enforced by the Analyzer rule `PreWriteCheck`.
      
      However, table renaming and database creation have the same issues. It makes more sense to do the checks in `SessionCatalog`. This PR is to add it into `SessionCatalog`.
      
      ### How was this patch tested?
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16018 from gatorsmile/nameValidate.
      
      (cherry picked from commit 07f32c22)
      Signed-off-by: gatorsmile <gatorsmile@gmail.com>
  13. Nov 25, 2016
    • [SPARK-3359][BUILD][DOCS] More changes to resolve javadoc 8 errors that will help unidoc/genjavadoc compatibility · 69856f28
      hyukjinkwon authored
      
      ## What changes were proposed in this pull request?
      
      This PR only tries to fix things that look pretty straightforward and were fixed in other previous PRs.
      
      This PR roughly fixes several things as below:
      
      - Fix unrecognisable class and method links in javadoc by changing it from `[[..]]` to `` `...` ``
      
        ```
        [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/DataStreamReader.java:226: error: reference not found
        [error]    * Loads text files and returns a {link DataFrame} whose schema starts with a string column named
        ```
      
      - Fix an exception annotation and remove code backticks in `throws` annotation
      
        Currently, sbt unidoc with Java 8 complains as below:
      
        ```
        [error] .../java/org/apache/spark/sql/streaming/StreamingQuery.java:72: error: unexpected text
        [error]    * throws StreamingQueryException, if <code>this</code> query has terminated with an exception.
        ```
      
        `throws` should specify the correct class name from `StreamingQueryException,` to `StreamingQueryException` without backticks. (see [JDK-8007644](https://bugs.openjdk.java.net/browse/JDK-8007644)).
      
      - Fix `[[http..]]` to `<a href="http..."></a>`.
      
        ```diff
        -   * [[https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https Oracle
        -   * blog page]].
        +   * <a href="https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https">
        +   * Oracle blog page</a>.
        ```
      
         `[[http...]]` link markdown in scaladoc is unrecognisable in javadoc.
      
      - It seems a class can't have a `return` annotation, so two cases of this were removed.
      
        ```
        [error] .../java/org/apache/spark/mllib/regression/IsotonicRegression.java:27: error: invalid use of return
        [error]    * return New instance of IsotonicRegression.
        ```
      
      - Fix < to `&lt;` and > to `&gt;` according to HTML rules.
      
      - Fix `</p>` complaint
      
      - Exclude unrecognisable in javadoc, `constructor`, `todo` and `groupname`.
      
      ## How was this patch tested?
      
      Manually tested by `jekyll build` with Java 7 and 8
      
      ```
      java version "1.7.0_80"
      Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
      Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
      ```
      
      ```
      java version "1.8.0_45"
      Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
      Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
      ```
      
      Note: this does not yet make sbt unidoc succeed with Java 8, but it reduces the number of errors with Java 8.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15999 from HyukjinKwon/SPARK-3359-errors.
      
      (cherry picked from commit 51b1c155)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
  14. Nov 23, 2016
    • [SPARK-18522][SQL] Explicit contract for column stats serialization · 599dac15
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      The current implementation of column stats uses the base64 encoding of the internal UnsafeRow format to persist statistics (in table properties in the Hive metastore). This is an internal format that is not stable across different versions of Spark and should NOT be used for persistence. In addition, it would be better if statistics stored in the catalog were human readable.
      
      This pull request introduces the following changes:
      
      1. Created a single ColumnStat class for all data types. All data types track the same set of statistics.
      2. Updated the implementation for stats collection to get rid of the dependency on internal data structures (e.g. InternalRow, or storing DateType as an int32). For example, previously dates were stored as a single integer, but are now stored as java.sql.Date. When we implement the next steps of CBO, we can add code to convert those back into internal types again.
      3. Documented clearly what JVM data types are being used to store what data.
      4. Defined a simple Map[String, String] interface for serializing and deserializing column stats into/from the catalog.
      5. Rearranged the method/function structure so it is more clear what the supported data types are, and also moved how stats are generated into ColumnStat class so they are easy to find.
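      Point 4 can be sketched as a simple round trip (the field and key names below are illustrative, not Spark's actual ColumnStat schema):

      ```scala
      // Illustrative sketch of a human-readable Map[String, String]
      // serialization for column stats: plain strings survive across Spark
      // versions in a way base64-encoded UnsafeRow bytes could not.
      case class ColumnStat(distinctCount: Long, nullCount: Long, maxLen: Long)

      def toCatalog(s: ColumnStat): Map[String, String] = Map(
        "distinctCount" -> s.distinctCount.toString,
        "nullCount"     -> s.nullCount.toString,
        "maxLen"        -> s.maxLen.toString)

      def fromCatalog(m: Map[String, String]): ColumnStat =
        ColumnStat(m("distinctCount").toLong, m("nullCount").toLong, m("maxLen").toLong)

      val stat = ColumnStat(distinctCount = 100, nullCount = 3, maxLen = 8)
      fromCatalog(toCatalog(stat)) == stat  // round-trips
      ```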
      
      ## How was this patch tested?
      Removed most of the original test cases created for column statistics, and added three very simple ones to cover all the cases. The three test cases validate:
      1. Roundtrip serialization works.
      2. Behavior when analyzing non-existent column or unsupported data type column.
      3. Result for stats collection for all valid data types.
      
      Also moved parser related tests into a parser test suite and added an explicit serialization test for the Hive external catalog.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15959 from rxin/SPARK-18522.
      
      (cherry picked from commit 70ad07a9)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    • [SPARK-18545][SQL] Verify number of hive client RPCs in PartitionedTablePerfStatsSuite · 539c193a
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      This would help catch accidental O(n) calls to the hive client as in https://issues.apache.org/jira/browse/SPARK-18507
      
      ## How was this patch tested?
      
      Checked that the test fails before https://issues.apache.org/jira/browse/SPARK-18507 was patched. cc cloud-fan
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #15985 from ericl/spark-18545.
      
      (cherry picked from commit 85235ed6)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  15. Nov 22, 2016
    • [SPARK-16803][SQL] SaveAsTable does not work when target table is a Hive serde table · 64b9de9c
      gatorsmile authored
      
      ### What changes were proposed in this pull request?
      
      In Spark 2.0, `SaveAsTable` does not work when the target table is a Hive serde table, but Spark 1.6 works.
      
      **Spark 1.6**
      
      ``` Scala
      scala> sql("create table sample.sample stored as SEQUENCEFILE as select 1 as key, 'abc' as value")
      res2: org.apache.spark.sql.DataFrame = []
      
      scala> val df = sql("select key, value as value from sample.sample")
      df: org.apache.spark.sql.DataFrame = [key: int, value: string]
      
      scala> df.write.mode("append").saveAsTable("sample.sample")
      
      scala> sql("select * from sample.sample").show()
      +---+-----+
      |key|value|
      +---+-----+
      |  1|  abc|
      |  1|  abc|
      +---+-----+
      ```
      
      **Spark 2.0**
      
      ``` Scala
      scala> df.write.mode("append").saveAsTable("sample.sample")
      org.apache.spark.sql.AnalysisException: Saving data in MetastoreRelation sample, sample
       is not supported.;
      ```
      
      So far, we do not plan to support it in Spark 2.1 due to the risk. Spark 1.6 works because it internally uses insertInto. But if we change it back, it will break the semantics of saveAsTable (this method uses by-name resolution instead of the by-position resolution used by insertInto). More changes are needed to support `hive` as a `format` in DataFrameWriter.
      
      Instead, users should use the insertInto API. This PR corrects the error messages so that users can understand how to bypass the limitation before we support it in a separate PR.
      ### How was this patch tested?
      
      Test cases are added
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15926 from gatorsmile/saveAsTableFix5.
      
      (cherry picked from commit 9c42d4a7)
      Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    • [SPARK-18465] Add 'IF EXISTS' clause to 'UNCACHE' to not throw exceptions when table doesn't exist · fb2ea54a
      Burak Yavuz authored
      
      ## What changes were proposed in this pull request?
      
      While this behavior is debatable, consider the following use case:
      ```sql
      UNCACHE TABLE foo;
      CACHE TABLE foo AS
      SELECT * FROM bar
      ```
      The command above fails the first time you run it. But I want to run the command above over and over again, and I don't want to change my code just for the first run of it.
      The issue is that subsequent `CACHE TABLE` commands do not overwrite the existing table.
      
      Now we can do:
      ```sql
      UNCACHE TABLE IF EXISTS foo;
      CACHE TABLE foo AS
      SELECT * FROM bar
      ```
      
      ## How was this patch tested?
      
      Unit tests
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #15896 from brkyvz/uncache.
      
      (cherry picked from commit bdc8153e)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
    • [SPARK-18507][SQL] HiveExternalCatalog.listPartitions should only call getTable once · fa360134
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
      HiveExternalCatalog.listPartitions should only call `getTable` once, instead of calling it for every partitions.
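      The shape of the fix, in miniature (names hypothetical): hoist the table lookup out of the per-partition loop so it runs once rather than once per partition.

      ```scala
      // Illustrative sketch: count how many times the (expensive) table lookup
      // runs when it is hoisted out of the per-partition loop.
      var getTableCalls = 0
      def getTable(): String = { getTableCalls += 1; "db.tbl" }

      val partitions = Seq("p=1", "p=2", "p=3")

      val table = getTable()                        // one RPC, not one per partition
      val qualified = partitions.map(p => s"$table/$p")

      getTableCalls  // 1, instead of partitions.size
      ```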
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15978 from cloud-fan/perf.
      
      (cherry picked from commit 702cd403)
      Signed-off-by: Andrew Or <andrewor14@gmail.com>
  16. Nov 21, 2016
  17. Nov 20, 2016
  18. Nov 18, 2016
    • [SPARK-18505][SQL] Simplify AnalyzeColumnCommand · 4b1df0e8
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      I'm spending more time at the design & code level for the cost-based optimizer now, and have found a number of issues related to maintainability and compatibility that I would like to address.
      
      This is a small pull request to clean up AnalyzeColumnCommand:
      
      1. Removed warning on duplicated columns. Warnings in log messages are useless since most users that run SQL don't see them.
      2. Removed the nested updateStats function, by just inlining the function.
      3. Renamed a few functions to better reflect what they do.
      4. Removed the factory apply method for ColumnStatStruct. It is a bad pattern to use an apply method that returns an instantiation of a class that is not of the same type (ColumnStatStruct.apply used to return CreateNamedStruct).
      5. Renamed ColumnStatStruct to just AnalyzeColumnCommand.
      6. Added more documentation explaining some of the non-obvious return types and code blocks.
      
      In follow-up pull requests, I'd like to address the following:
      
      1. Get rid of the Map[String, ColumnStat] map, since internally we should be using Attribute to reference columns, rather than strings.
      2. Decouple the fields exposed by ColumnStat and internals of Spark SQL's execution path. Currently the two are coupled because ColumnStat takes in an InternalRow.
      3. Correctness: Remove code path that stores statistics in the catalog using the base64 encoding of the UnsafeRow format, which is not stable across Spark versions.
      4. Clearly document the data representation stored in the catalog for statistics.
      
      ## How was this patch tested?
      Affected test cases have been updated.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15933 from rxin/SPARK-18505.
      
      (cherry picked from commit 6f7ff750)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
    • [SPARK-18457][SQL] ORC and other columnar formats using HiveShim read all columns when doing a simple count · ec622eb7
      Andrew Ray authored
      
      ## What changes were proposed in this pull request?
      
      When reading zero columns (e.g., count(*)) from ORC or any other format that uses HiveShim, actually set the read column list to empty for Hive to use.
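      The change amounts to setting Hive's read-column properties explicitly even for the empty case; sketched here against a plain mutable map (the two keys are Hive's read-column conf names, used here outside a real HiveConf).

      ```scala
      // Illustrative sketch: when zero columns are requested (e.g. count(*)),
      // record an explicitly empty column-id list instead of leaving Hive's
      // default "read all columns" behavior in place.
      import scala.collection.mutable

      def setReadColumns(conf: mutable.Map[String, String], columnIds: Seq[Int]): Unit = {
        conf("hive.io.file.readcolumn.ids") = columnIds.mkString(",")
        conf("hive.io.file.read.all.columns") = "false"
      }

      val conf = mutable.Map.empty[String, String]
      setReadColumns(conf, Nil)
      conf("hive.io.file.readcolumn.ids")  // "" -- read no columns at all
      ```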
      
      ## How was this patch tested?
      
      Query correctness is handled by existing unit tests. I'm happy to add more if anyone can point out some case that is not covered.
      
      Reduction in data read can be verified in the UI when built with a recent version of Hadoop say:
      ```
      build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.0 -Phive -DskipTests clean package
      ```
      However the default Hadoop 2.2 that is used for unit tests does not report actual bytes read and instead just full file sizes (see FileScanRDD.scala line 80). Therefore I don't think there is a good way to add a unit test for this.
      
      I tested with the following setup using above build options
      ```
      case class OrcData(intField: Long, stringField: String)
      spark.range(1,1000000).map(i => OrcData(i, s"part-$i")).toDF().write.format("orc").save("orc_test")
      
      sql(
            s"""CREATE EXTERNAL TABLE orc_test(
               |  intField LONG,
               |  stringField STRING
               |)
               |STORED AS ORC
               |LOCATION '${System.getProperty("user.dir") + "/orc_test"}'
             """.stripMargin)
      ```
      
      ## Results
      
      query | Spark 2.0.2 | this PR
      ---|---|---
      `sql("select count(*) from orc_test").collect`|4.4 MB|199.4 KB
      `sql("select intField from orc_test").collect`|743.4 KB|743.4 KB
      `sql("select * from orc_test").collect`|4.4 MB|4.4 MB
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #15898 from aray/sql-orc-no-col.
      
      (cherry picked from commit 795e9fc9)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
  19. Nov 17, 2016
    • [SPARK-18360][SQL] default table path of tables in default database should depend on the location of default database · fc466be4
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
      The current semantic of the warehouse config:
      
      1. it's a static config, which means you can't change it once your spark application is launched.
      2. Once a database is created, its location won't change even the warehouse path config is changed.
      3. The default database is a special case: although its location is fixed, the locations of tables created in it are not. If a Spark app starts with warehouse path B (while the location of the default database is A) and a user creates a table `tbl` in the default database, its location will be `B/tbl` instead of `A/tbl`. If users change the warehouse path config to C and create another table `tbl2`, its location will still be `B/tbl2` instead of `C/tbl2`.
      
      rule 3 doesn't make sense and I think we made it by mistake, not intentionally. Data source tables don't follow rule 3 and treat default database like normal ones.
      
      This PR fixes hive serde tables to make it consistent with data source tables.
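      Reduced to its essence (the helper is hypothetical), the fix derives a default table location from the database's stored location rather than from the session's current warehouse path:

      ```scala
      // Illustrative sketch: the table path follows the database location,
      // which is fixed at database creation, not the warehouse conf in effect
      // when the table happens to be created.
      def defaultTablePath(dbLocation: String, tableName: String): String =
        s"$dbLocation/$tableName"

      // default database lives at A; warehouse conf was later changed to C
      defaultTablePath("/warehouse/A/default.db", "tbl2")  // "/warehouse/A/default.db/tbl2"
      ```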
      
      ## How was this patch tested?
      
      HiveSparkSubmitSuite
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15812 from cloud-fan/default-db.
      
      (cherry picked from commit ce13c267)
      Signed-off-by: Yin Huai <yhuai@databricks.com>
    • [SPARK-18464][SQL] support old table which doesn't store schema in metastore · 014fceee
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
      Before Spark 2.1, users can create an external data source table without schema, and we will infer the table schema at runtime. In Spark 2.1, we decided to infer the schema when the table was created, so that we don't need to infer it again and again at runtime.
      
      This is a good improvement, but we should still respect and support old tables which don't store the table schema in the metastore.
      
      ## How was this patch tested?
      
      regression test.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15900 from cloud-fan/hive-catalog.
      
      (cherry picked from commit 07b3f045)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
  20. Nov 16, 2016
    • [SPARK-18433][SQL] Improve DataSource option keys to be more case-insensitive · b18c5a9b
      Dongjoon Hyun authored
      
      ## What changes were proposed in this pull request?
      
      This PR aims to improve DataSource option keys to be more case-insensitive
      
      DataSource partially uses CaseInsensitiveMap in its code path. For example, the following fails to find `url`.
      
      ```scala
      val df = spark.createDataFrame(sparkContext.parallelize(arr2x2), schema2)
      df.write.format("jdbc")
          .option("UrL", url1)
          .option("dbtable", "TEST.SAVETEST")
          .options(properties.asScala)
          .save()
      ```
      
      This PR makes DataSource options use CaseInsensitiveMap internally, and also makes DataSource use CaseInsensitiveMap generally, except for `InMemoryFileIndex` and `InsertIntoHadoopFsRelationCommand`. We cannot pass them a CaseInsensitiveMap because they create new case-sensitive HadoopConfs by calling `newHadoopConfWithOptions(options)` internally.
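      A minimal sketch of such a case-insensitive option map (Spark's real CaseInsensitiveMap differs in detail): keys are normalized to lower case on both insert and lookup, so `UrL` and `url` resolve to the same entry.

      ```scala
      // Illustrative sketch: lookups succeed regardless of the key's case.
      class CaseInsensitiveOptions(options: Map[String, String]) {
        private val normalized = options.map { case (k, v) => k.toLowerCase -> v }
        def get(key: String): Option[String] = normalized.get(key.toLowerCase)
      }

      val opts = new CaseInsensitiveOptions(Map("UrL" -> "jdbc:h2:mem:test", "dbtable" -> "T"))
      opts.get("url")      // Some("jdbc:h2:mem:test")
      opts.get("DBTABLE")  // Some("T")
      ```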
      
      ## How was this patch tested?
      
      Pass the Jenkins test with newly added test cases.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15884 from dongjoon-hyun/SPARK-18433.
      
      (cherry picked from commit 74f5c217)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      b18c5a9b
  21. Nov 15, 2016
    • Wenchen Fan's avatar
      [SPARK-18377][SQL] warehouse path should be a static conf · 436ae201
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
      It is odd that every session can set its own warehouse path at runtime; we should forbid this and make the warehouse path a static conf.
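      
      The static-conf idea can be sketched as follows. The names are hypothetical (`SessionConf` is not Spark's actual class; the real mechanism lives in Spark's conf machinery): static keys are fixed when the session starts, and any later attempt to set them is rejected.
      
      ```scala
      // Hypothetical sketch of a static-conf guard; `SessionConf` and its fields
      // are illustrative names, not Spark's actual classes.
      class SessionConf(staticKeys: Set[String], initial: Map[String, String]) {
        private var settings: Map[String, String] = initial
      
        def set(key: String, value: String): Unit = {
          // Static confs such as the warehouse path may not change at runtime.
          require(!staticKeys.contains(key), s"Cannot modify static config: $key")
          settings += (key -> value)
        }
      
        def get(key: String): Option[String] = settings.get(key)
      }
      
      val conf = new SessionConf(
        staticKeys = Set("spark.sql.warehouse.dir"),
        initial = Map("spark.sql.warehouse.dir" -> "/data/warehouse"))
      
      conf.set("spark.sql.shuffle.partitions", "10")        // fine: a runtime conf
      // conf.set("spark.sql.warehouse.dir", "/tmp/other")  // would throw: static conf
      ```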
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15825 from cloud-fan/warehouse.
      
      (cherry picked from commit 4ac9759f)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      436ae201
    • Dongjoon Hyun's avatar
      [SPARK-17732][SQL] ALTER TABLE DROP PARTITION should support comparators · 1126c319
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR aims to support `comparators`, e.g. '<', '<=', '>', '>=', again in Apache Spark 2.0 for backward compatibility.
      
      **Spark 1.6**
      
      ``` scala
      scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, quarter STRING)")
      res0: org.apache.spark.sql.DataFrame = [result: string]
      
      scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
      res1: org.apache.spark.sql.DataFrame = [result: string]
      ```
      
      **Spark 2.0**
      
      ``` scala
      scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, quarter STRING)")
      res0: org.apache.spark.sql.DataFrame = []
      
      scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
      org.apache.spark.sql.catalyst.parser.ParseException:
      mismatched input '<' expecting {')', ','}(line 1, pos 42)
      ```
      
      After this PR, it's supported.
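      
      The comparator matching can be sketched as follows. This is an illustration only, not Spark's parser/analyzer code, and `matchingPartitions` is a hypothetical name: each partition spec's value for the column is compared against the predicate's literal, here using lexicographic string ordering.
      
      ```scala
      // Hypothetical sketch of comparator-based partition selection; names are
      // illustrative, not Spark's actual code. Values compare lexicographically.
      def matchingPartitions(partitions: Seq[Map[String, String]],
                             column: String, op: String,
                             literal: String): Seq[Map[String, String]] =
        partitions.filter { spec =>
          val v = spec(column)
          op match {
            case "<"  => v < literal
            case "<=" => v <= literal
            case ">"  => v > literal
            case ">=" => v >= literal
            case "="  => v == literal
          }
        }
      
      val parts = Seq(Map("country" -> "JP"), Map("country" -> "KR"), Map("country" -> "US"))
      val dropped = matchingPartitions(parts, "country", "<", "KR")  // only the JP partition
      ```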
      
      ## How was this patch tested?
      
      Pass the Jenkins test with a newly added testcase.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15704 from dongjoon-hyun/SPARK-17732-2.
      1126c319
  22. Nov 11, 2016
    • Dongjoon Hyun's avatar
      [SPARK-17982][SQL] SQLBuilder should wrap the generated SQL with parenthesis for LIMIT · 465e4b40
      Dongjoon Hyun authored
      
      ## What changes were proposed in this pull request?
      
      Currently, `SQLBuilder` handles `LIMIT` by always appending it to the end of the generated sub-SQL. This causes `RuntimeException`s like the following. This PR always wraps the generated SQL in parentheses, except when `SubqueryAlias` is used together with `LIMIT`.
      
      **Before**
      
      ``` scala
      scala> sql("CREATE TABLE tbl(id INT)")
      scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl LIMIT 2")
      java.lang.RuntimeException: Failed to analyze the canonicalized SQL: ...
      ```
      
      **After**
      
      ``` scala
      scala> sql("CREATE TABLE tbl(id INT)")
      scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl LIMIT 2")
      scala> sql("SELECT id2 FROM v1")
      res4: org.apache.spark.sql.DataFrame = [id2: int]
      ```
      
      **Fixed cases in this PR**
      
      The following two cases are the detail query plans having problematic SQL generations.
      
      1. `SELECT * FROM (SELECT id FROM tbl LIMIT 2)`
      
          Please note the **FROM SELECT** part of the generated SQL below. When the `LIMIT` subquery is not wrapped in '()', this fails.
      
      ```
      # Original logical plan:
      Project [id#1]
      +- GlobalLimit 2
         +- LocalLimit 2
            +- Project [id#1]
               +- MetastoreRelation default, tbl
      
      # Canonicalized logical plan:
      Project [gen_attr_0#1 AS id#4]
      +- SubqueryAlias tbl
         +- Project [gen_attr_0#1]
            +- GlobalLimit 2
               +- LocalLimit 2
                  +- Project [gen_attr_0#1]
                     +- SubqueryAlias gen_subquery_0
                        +- Project [id#1 AS gen_attr_0#1]
                           +- SQLTable default, tbl, [id#1]
      
      # Generated SQL:
      SELECT `gen_attr_0` AS `id` FROM (SELECT `gen_attr_0` FROM SELECT `gen_attr_0` FROM (SELECT `id` AS `gen_attr_0` FROM `default`.`tbl`) AS gen_subquery_0 LIMIT 2) AS tbl
      ```
      
      2. `SELECT * FROM (SELECT id FROM tbl TABLESAMPLE (2 ROWS))`
      
          Please note the **((~~~) AS gen_subquery_0 LIMIT 2)** fragment in the generated SQL below. When we use '()' for a `LIMIT` on a `SubqueryAlias`, this fails.
      
      ```
      # Original logical plan:
      Project [id#1]
      +- Project [id#1]
         +- GlobalLimit 2
            +- LocalLimit 2
               +- MetastoreRelation default, tbl
      
      # Canonicalized logical plan:
      Project [gen_attr_0#1 AS id#4]
      +- SubqueryAlias tbl
         +- Project [gen_attr_0#1]
            +- GlobalLimit 2
               +- LocalLimit 2
                  +- SubqueryAlias gen_subquery_0
                     +- Project [id#1 AS gen_attr_0#1]
                        +- SQLTable default, tbl, [id#1]
      
      # Generated SQL:
      SELECT `gen_attr_0` AS `id` FROM (SELECT `gen_attr_0` FROM ((SELECT `id` AS `gen_attr_0` FROM `default`.`tbl`) AS gen_subquery_0 LIMIT 2)) AS tbl
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins test with a newly added test case.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15546 from dongjoon-hyun/SPARK-17982.
      
      (cherry picked from commit d42bb7cc)
      Signed-off-by: gatorsmile <gatorsmile@gmail.com>
      465e4b40