  1. Aug 27, 2016
    • Reynold Xin's avatar
      [SPARK-17269][SQL] Move finish analysis optimization stage into its own file · dcefac43
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      As part of breaking Optimizer.scala apart, this patch moves various finish analysis optimization stage rules into a single file. I'm submitting separate pull requests so we can more easily merge this in branch-2.0 to simplify optimizer backports.
      
      ## How was this patch tested?
      This should be covered by existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #14838 from rxin/SPARK-17269.
      dcefac43
  2. Aug 26, 2016
    • Reynold Xin's avatar
      [SPARK-17270][SQL] Move object optimization rules into its own file · cc0caa69
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      As part of breaking Optimizer.scala apart, this patch moves various Dataset object optimization rules into a single file. I'm submitting separate pull requests so we can more easily merge this in branch-2.0 to simplify optimizer backports.
      
      ## How was this patch tested?
      This should be covered by existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #14839 from rxin/SPARK-17270.
      cc0caa69
    • Sameer Agarwal's avatar
      [SPARK-17244] Catalyst should not pushdown non-deterministic join conditions · 540e9128
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      Given that non-deterministic expressions can be stateful, pushing them down the query plan during the optimization phase can cause incorrect behavior. This patch fixes that issue by explicitly disabling that.
      
      ## How was this patch tested?
      
      A new test in `FilterPushdownSuite` that checks catalyst behavior for both deterministic and non-deterministic join conditions.
      
      Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
      
      Closes #14815 from sameeragarwal/constraint-inputfile.
      540e9128
    • petermaxlee's avatar
      [SPARK-17235][SQL] Support purging of old logs in MetadataLog · f64a1ddd
      petermaxlee authored
      ## What changes were proposed in this pull request?
      This patch adds a purge interface to MetadataLog, and an implementation in HDFSMetadataLog. The purge function is currently unused, but I will use it to purge old execution and file source logs in follow-up patches. These changes are required in a production structured streaming job that runs for a long period of time.
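
      As a rough sketch of what such a purge hook can look like (illustrative only; the names and signatures below are assumptions, not necessarily the actual Spark interface):

      ```scala
      // Illustrative metadata log with a purge hook; not the actual Spark API.
      trait SimpleMetadataLog[T] {
        /** Record metadata for a batch; returns false if the batch was already logged. */
        def add(batchId: Long, metadata: T): Boolean

        /** Retrieve the metadata for a batch, if present. */
        def get(batchId: Long): Option[T]

        /** Drop entries for batches older than `thresholdBatchId`, so the log
          * does not grow without bound in long-running streaming jobs. */
        def purge(thresholdBatchId: Long): Unit
      }
      ```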
      
      ## How was this patch tested?
      Added a unit test case in HDFSMetadataLogSuite.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #14802 from petermaxlee/SPARK-17235.
      f64a1ddd
    • Herman van Hovell's avatar
      [SPARK-17246][SQL] Add BigDecimal literal · a11d10f1
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      This PR adds parser support for `BigDecimal` literals. If you append the suffix `BD` to a valid number, it will be interpreted as a `BigDecimal`; for example, `12.0E10BD` will be interpreted as a BigDecimal with scale -9 and precision 3. This is useful in situations where you need exact values.
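
      As a quick usage sketch (assuming a `SparkSession` named `spark`; the printed schema depends on the literal's digits and exponent):

      ```scala
      // The BD suffix marks the literal as a BigDecimal (DecimalType) rather than a double.
      spark.sql("SELECT 12.0E10BD AS x").printSchema()
      spark.sql("SELECT 12.0E10BD AS x").show()
      ```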
      
      ## How was this patch tested?
      Added tests to `ExpressionParserSuite`, `ExpressionSQLBuilderSuite` and `SQLQueryTestSuite`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #14819 from hvanhovell/SPARK-17246.
      a11d10f1
    • petermaxlee's avatar
      [SPARK-17165][SQL] FileStreamSource should not track the list of seen files indefinitely · 9812f7d5
      petermaxlee authored
      ## What changes were proposed in this pull request?
      Before this change, FileStreamSource uses an in-memory hash set to track the list of files processed by the engine. The list can grow indefinitely, leading to OOM or overflow of the hash set.
      
      This patch introduces a new user-defined option called "maxFileAge", defaulting to 24 hours. If a file is older than this age, FileStreamSource will purge it from the in-memory map that was used to track the list of files that have been processed.
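
      A hedged usage sketch (the option name comes from this patch; the source format, path, and the "24h" value format are placeholders):

      ```scala
      // Structured streaming file source that stops tracking files older than the given age.
      val stream = spark.readStream
        .format("text")
        .option("maxFileAge", "24h")   // files older than this are dropped from the seen-files map
        .load("/path/to/input")
      ```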
      
      ## How was this patch tested?
      Added unit tests for the underlying utility, and also added an end-to-end test to validate the purge in FileStreamSourceSuite. Also verified the new test cases would fail when the timeout was set to a very large number.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #14728 from petermaxlee/SPARK-17165.
      9812f7d5
    • gatorsmile's avatar
      [SPARK-17250][SQL] Remove HiveClient and setCurrentDatabase from HiveSessionCatalog · 261c55dd
      gatorsmile authored
      ### What changes were proposed in this pull request?
      This is the first step to remove `HiveClient` from `HiveSessionState`. In the metastore interaction, we always use the fully qualified table name when accessing/operating a table. That means, we always specify the database. Thus, it is not necessary to use `HiveClient` to change the active database in Hive metastore.
      
      In `HiveSessionCatalog`, `setCurrentDatabase` is the only function that uses `HiveClient`. Thus, we can remove `HiveClient` after removing `setCurrentDatabase`.
      
      ### How was this patch tested?
      The existing test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14821 from gatorsmile/setCurrentDB.
      261c55dd
    • gatorsmile's avatar
      [SPARK-17192][SQL] Issue Exception when Users Specify the Partitioning Columns... · fd4ba3f6
      gatorsmile authored
      [SPARK-17192][SQL] Issue Exception when Users Specify the Partitioning Columns without a Given Schema
      
      ### What changes were proposed in this pull request?
      Address the comments by yhuai in the original PR: https://github.com/apache/spark/pull/14207
      
      First, issue an exception instead of logging a warning when users specify the partitioning columns without a given schema.
      
      Second, refactor the code a little.
      
      ### How was this patch tested?
      Fixed the test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14572 from gatorsmile/followup16552.
      fd4ba3f6
    • Wenchen Fan's avatar
      [SPARK-17187][SQL][FOLLOW-UP] improve document of TypedImperativeAggregate · 970ab8f6
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Improve the documentation to make it easier to understand, and also mention the window operator.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #14822 from cloud-fan/object-agg.
      970ab8f6
    • Wenchen Fan's avatar
      [SPARK-17260][MINOR] move CreateTables to HiveStrategies · 28ab1792
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      The `CreateTables` rule turns a general `CreateTable` plan into `CreateHiveTableAsSelectCommand` for Hive serde tables. However, this rule is logically a planner strategy, so we should move it to `HiveStrategies` to be consistent with other DDL commands.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #14825 from cloud-fan/ctas.
      28ab1792
    • hyukjinkwon's avatar
      [SPARK-16216][SQL][FOLLOWUP] Enable timestamp type tests for JSON and verify... · 6063d596
      hyukjinkwon authored
      [SPARK-16216][SQL][FOLLOWUP] Enable timestamp type tests for JSON and verify all unsupported types in CSV
      
      ## What changes were proposed in this pull request?
      
      This PR enables the tests for `TimestampType` for JSON and unifies the logic for verifying the schema when writing in CSV.
      
      In more details, this PR,
      
      - Enables the tests for `TimestampType` for JSON and
      
        This was disabled due to an issue in `DatatypeConverter.parseDateTime` which parses dates incorrectly, for example as below:
      
        ```scala
         val d = javax.xml.bind.DatatypeConverter.parseDateTime("0900-01-01T00:00:00.000").getTime
        println(d.toString)
        ```
        ```
        Fri Dec 28 00:00:00 KST 899
        ```
      
        However, since we use `FastDateFormat`, it seems we are safe now.
      
        ```scala
        val d = FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSS").parse("0900-01-01T00:00:00.000")
        println(d)
        ```
        ```
        Tue Jan 01 00:00:00 PST 900
        ```
      
      - Verifies all unsupported types in CSV
      
        There is separate logic to verify the schema in `CSVFileFormat`. It is not quite correct because, in addition to `StructType`, `ArrayType`, and `MapType`, we also don't support `NullType` and `CalendarIntervalType`. So, this PR adds both types to the verification.
      
      ## How was this patch tested?
      
      Tests in `JsonHadoopFsRelation` and `CSVSuite`
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14829 from HyukjinKwon/SPARK-16216-followup.
      6063d596
  3. Aug 25, 2016
    • hyukjinkwon's avatar
      [SPARK-17212][SQL] TypeCoercion supports widening conversion between DateType and TimestampType · b964a172
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently, type-widening does not work between `TimestampType` and `DateType`.
      
      This applies to `SetOperation`, `Union`, `In`, `CaseWhen`, `Greatest`, `Least`, `CreateArray`, `CreateMap`, `Coalesce`, `NullIf`, `IfNull`, `Nvl` and `Nvl2`.
      
      This PR adds the support for widening `DateType` to `TimestampType` for them.
      
      For a simple example,
      
      **Before**
      
      ```scala
      Seq(Tuple2(new Timestamp(0), new Date(0))).toDF("a", "b").selectExpr("greatest(a, b)").show()
      ```
      
      shows below:
      
      ```
      cannot resolve 'greatest(`a`, `b`)' due to data type mismatch: The expressions should all have the same type, got GREATEST(timestamp, date)
      ```
      
      or union as below:
      
      ```scala
      val a = Seq(Tuple1(new Timestamp(0))).toDF()
      val b = Seq(Tuple1(new Date(0))).toDF()
      a.union(b).show()
      ```
      
      shows below:
      
      ```
      Union can only be performed on tables with the compatible column types. DateType <> TimestampType at the first column of the second table;
      ```
      
      **After**
      
      ```scala
      Seq(Tuple2(new Timestamp(0), new Date(0))).toDF("a", "b").selectExpr("greatest(a, b)").show()
      ```
      
      shows below:
      
      ```
      +----------------------------------------------------+
      |greatest(CAST(a AS TIMESTAMP), CAST(b AS TIMESTAMP))|
      +----------------------------------------------------+
      |                                1969-12-31 16:00:...|
      +----------------------------------------------------+
      ```
      
      or union as below:
      
      ```scala
      val a = Seq(Tuple1(new Timestamp(0))).toDF()
      val b = Seq(Tuple1(new Date(0))).toDF()
      a.union(b).show()
      ```
      
      shows below:
      
      ```
      +--------------------+
      |                  _1|
      +--------------------+
      |1969-12-31 16:00:...|
      |1969-12-31 00:00:...|
      +--------------------+
      ```
      
      ## How was this patch tested?
      
      Unit tests in `TypeCoercionSuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: HyukjinKwon <gurwls223@gmail.com>
      
      Closes #14786 from HyukjinKwon/SPARK-17212.
      b964a172
    • Sean Zhong's avatar
      [SPARK-17187][SQL] Supports using arbitrary Java object as internal aggregation buffer object · d96d1515
      Sean Zhong authored
      ## What changes were proposed in this pull request?
      
      This PR introduces an abstract class `TypedImperativeAggregate` so that an aggregation function can use an **arbitrary** user-defined Java object as its intermediate aggregation buffer.
      
      **This has advantages like:**
      1. It can now support a larger class of aggregation functions. For example, it will be much easier to implement the aggregation function `percentile_approx`, which has a complex aggregation buffer definition.
      2. It avoids serialization/de-serialization on every call of `update` or `merge` when converting the domain-specific aggregation object to the internal Spark SQL storage format.
      3. It is easier to integrate with existing monoid libraries such as Algebird, and supports more aggregation functions with high performance.
      
      Please see `org.apache.spark.sql.TypedImperativeAggregateSuite.TypedMaxAggregate` for an example of how to define a `TypedImperativeAggregate` aggregation function.
      Please see Java doc of `TypedImperativeAggregate` and Jira ticket SPARK-17187 for more information.
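
      Conceptually, the contract looks roughly like the following (an illustrative sketch, not the actual Spark abstract class; see the Java doc of `TypedImperativeAggregate` for the real signatures):

      ```scala
      // Illustrative contract: the aggregation buffer is an arbitrary JVM object T,
      // serialized only when it has to cross the internal row-based storage boundary.
      trait TypedAggregateContract[T] {
        def createAggregationBuffer(): T          // fresh buffer per group
        def update(buffer: T, inputValue: Any): T // fold one input value into the buffer
        def merge(buffer: T, other: T): T         // combine two partial buffers
        def eval(buffer: T): Any                  // produce the final result
        def serialize(buffer: T): Array[Byte]     // used only at shuffle/storage boundaries
        def deserialize(bytes: Array[Byte]): T
      }
      ```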
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: Sean Zhong <seanzhong@databricks.com>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #14753 from clockfly/object_aggregation_buffer_try_2.
      d96d1515
    • Josh Rosen's avatar
      [SPARK-17205] Literal.sql should handle Infinity and NaN · 3e4c7db4
      Josh Rosen authored
      This patch updates `Literal.sql` to properly generate SQL for `NaN` and `Infinity` float and double literals: these special values need to be handled differently from regular values, since simply appending a suffix to the value's `toString()` representation will not work for these values.
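
      The idea, roughly, is to route the special values through a cast of a string literal instead of a plain numeric suffix (a hypothetical helper, not the exact strings Spark emits):

      ```scala
      // NaN and the infinities cannot be written as plain numeric literals,
      // so render them via a CAST from a string; normal values keep the suffix form.
      def doubleToSql(v: Double): String = v match {
        case d if d.isNaN            => "CAST('NaN' AS DOUBLE)"
        case Double.PositiveInfinity => "CAST('Infinity' AS DOUBLE)"
        case Double.NegativeInfinity => "CAST('-Infinity' AS DOUBLE)"
        case d                       => s"${d}D"
      }
      ```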
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #14777 from JoshRosen/SPARK-17205.
      3e4c7db4
    • Josh Rosen's avatar
      [SPARK-17229][SQL] PostgresDialect shouldn't widen float and short types during reads · a133057c
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      When reading float4 and smallint columns from PostgreSQL, Spark's `PostgresDialect` widens these types to Decimal and Integer rather than using the narrower Float and Short types. According to https://www.postgresql.org/docs/7.1/static/datatype.html#DATATYPE-TABLE, Postgres maps the `smallint` type to a signed two-byte integer and the `real` / `float4` types to single precision floating point numbers.
      
      This patch fixes this by adding more special-cases to `getCatalystType`, similar to what was done for the Derby JDBC dialect. I also fixed a similar problem in the write path which causes Spark to create integer columns in Postgres for what should have been ShortType columns.
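
      A rough sketch of the intended read-path mapping (illustrative only; the real fix lives in `PostgresDialect.getCatalystType` and the write path):

      ```scala
      import org.apache.spark.sql.types._

      // Hypothetical helper mirroring the mapping: Postgres type name -> narrow
      // Catalyst type instead of a widened one; None falls back to the default JDBC mapping.
      def narrowPostgresType(pgTypeName: String): Option[DataType] = pgTypeName match {
        case "float4"            => Some(FloatType) // single-precision float, not Decimal
        case "int2" | "smallint" => Some(ShortType) // two-byte integer, not Integer
        case _                   => None
      }
      ```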
      
      ## How was this patch tested?
      
      New test cases in `PostgresIntegrationSuite` (which I ran manually because Jenkins can't run it right now).
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #14796 from JoshRosen/postgres-jdbc-type-fixes.
      a133057c
    • gatorsmile's avatar
      [SPARK-16991][SPARK-17099][SPARK-17120][SQL] Fix Outer Join Elimination when... · d2ae6399
      gatorsmile authored
      [SPARK-16991][SPARK-17099][SPARK-17120][SQL] Fix Outer Join Elimination when Filter's isNotNull Constraints Unable to Filter Out All Null-supplying Rows
      
      ### What changes were proposed in this pull request?
      This PR is to fix an incorrect outer join elimination when a filter's `isNotNull` constraint is unable to filter out all null-supplying rows. For example, `isnotnull(coalesce(b#227, c#238))`.
      
      Users can hit this error when they try to use `using/natural outer join`, which is converted to a normal outer join with a `coalesce` expression on the `using columns`. For example,
      ```Scala
          val a = Seq((1, 2), (2, 3)).toDF("a", "b")
          val b = Seq((2, 5), (3, 4)).toDF("a", "c")
          val c = Seq((3, 1)).toDF("a", "d")
          val ab = a.join(b, Seq("a"), "fullouter")
          ab.join(c, "a").explain(true)
      ```
      The dataframe `ab` performs a `USING` full-outer join, which is converted to a normal outer join with a `coalesce` expression. Constraint inference generates a `Filter` with the constraint `isnotnull(coalesce(b#227, c#238))`. This then triggers a wrong outer join elimination and produces a wrong result.
      ```
      Project [a#251, b#227, c#237, d#247]
      +- Join Inner, (a#251 = a#246)
         :- Project [coalesce(a#226, a#236) AS a#251, b#227, c#237]
         :  +- Join FullOuter, (a#226 = a#236)
         :     :- Project [_1#223 AS a#226, _2#224 AS b#227]
         :     :  +- LocalRelation [_1#223, _2#224]
         :     +- Project [_1#233 AS a#236, _2#234 AS c#237]
         :        +- LocalRelation [_1#233, _2#234]
         +- Project [_1#243 AS a#246, _2#244 AS d#247]
            +- LocalRelation [_1#243, _2#244]
      
      == Optimized Logical Plan ==
      Project [a#251, b#227, c#237, d#247]
      +- Join Inner, (a#251 = a#246)
         :- Project [coalesce(a#226, a#236) AS a#251, b#227, c#237]
         :  +- Filter isnotnull(coalesce(a#226, a#236))
         :     +- Join FullOuter, (a#226 = a#236)
         :        :- LocalRelation [a#226, b#227]
         :        +- LocalRelation [a#236, c#237]
         +- LocalRelation [a#246, d#247]
      ```
      
      **A note to the `Committer`**, please also give the credit to dongjoon-hyun who submitted another PR for fixing this issue. https://github.com/apache/spark/pull/14580
      
      ### How was this patch tested?
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14661 from gatorsmile/fixOuterJoinElimination.
      d2ae6399
    • Takeshi YAMAMURO's avatar
      [SPARK-12978][SQL] Skip unnecessary final group-by when input data already... · 2b0cc4e0
      Takeshi YAMAMURO authored
      [SPARK-12978][SQL] Skip unnecessary final group-by when input data already clustered with group-by keys
      
      This ticket targets the optimization to skip an unnecessary group-by operation, as shown below:
      
      Without opt.:
      ```
      == Physical Plan ==
      TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)], output=[col0#159,sum(col1)#177,avg(col2)#178])
      +- TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)], output=[col0#159,sum#200,sum#201,count#202L])
         +- TungstenExchange hashpartitioning(col0#159,200), None
            +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation [col0#159,col1#160,col2#161], true, 10000, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
      ```
      
      With opt.:
      ```
      == Physical Plan ==
      TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)], output=[col0#159,sum(col1)#177,avg(col2)#178])
      +- TungstenExchange hashpartitioning(col0#159,200), None
        +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation [col0#159,col1#160,col2#161], true, 10000, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
      ```
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #10896 from maropu/SkipGroupbySpike.
      2b0cc4e0
    • Liwei Lin's avatar
      [SPARK-17061][SPARK-17093][SQL] `MapObjects` should make copies of unsafe-backed data · e0b20f9f
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      
      Currently `MapObjects` does not make copies of unsafe-backed data, leading to problems like [SPARK-17061](https://issues.apache.org/jira/browse/SPARK-17061) [SPARK-17093](https://issues.apache.org/jira/browse/SPARK-17093).
      
      This patch makes `MapObjects` make copies of unsafe-backed data.
      
      Generated code - prior to this patch:
      ```java
      ...
      /* 295 */ if (isNull12) {
      /* 296 */   convertedArray1[loopIndex1] = null;
      /* 297 */ } else {
      /* 298 */   convertedArray1[loopIndex1] = value12;
      /* 299 */ }
      ...
      ```
      
      Generated code - after this patch:
      ```java
      ...
      /* 295 */ if (isNull12) {
      /* 296 */   convertedArray1[loopIndex1] = null;
      /* 297 */ } else {
      /* 298 */   convertedArray1[loopIndex1] = value12 instanceof UnsafeRow? value12.copy() : value12;
      /* 299 */ }
      ...
      ```
      
      ## How was this patch tested?
      
      Add a new test case which would fail without this patch.
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #14698 from lw-lin/mapobjects-copy.
      e0b20f9f
    • jiangxingbo's avatar
      [SPARK-17215][SQL] Method `SQLContext.parseDataType(dataTypeString: String)` could be removed. · 5f02d2e5
      jiangxingbo authored
      ## What changes were proposed in this pull request?
      
      Method `SQLContext.parseDataType(dataTypeString: String)` could be removed; we should use `SparkSession.parseDataType(dataTypeString: String)` instead.
      This requires updating PySpark.
      
      ## How was this patch tested?
      
      Existing test cases.
      
      Author: jiangxingbo <jiangxb1987@gmail.com>
      
      Closes #14790 from jiangxb1987/parseDataType.
      5f02d2e5
  4. Aug 24, 2016
    • gatorsmile's avatar
      [SPARK-17190][SQL] Removal of HiveSharedState · 4d0706d6
      gatorsmile authored
      ### What changes were proposed in this pull request?
      Since `HiveClient` is used to interact with the Hive metastore, it should be hidden in `HiveExternalCatalog`. After moving `HiveClient` into `HiveExternalCatalog`, `HiveSharedState` becomes a wrapper of `HiveExternalCatalog`. Thus, removal of `HiveSharedState` becomes straightforward. After removal of `HiveSharedState`, the reflection logic is directly applied on the choice of `ExternalCatalog` types, based on the configuration of `CATALOG_IMPLEMENTATION`.
      
      ~~`HiveClient` is also used/invoked by the other entities besides HiveExternalCatalog, we defines the following two APIs: getClient and getNewClient~~
      
      ### How was this patch tested?
      The existing test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14757 from gatorsmile/removeHiveClient.
      4d0706d6
    • Sameer Agarwal's avatar
      [SPARK-17228][SQL] Not infer/propagate non-deterministic constraints · ac27557e
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      Given that filters based on non-deterministic constraints shouldn't be pushed down in the query plan, unnecessarily inferring them is confusing and a source of potential bugs. This patch simplifies the inferring logic by simply ignoring them.
      
      ## How was this patch tested?
      
      Added a new test in `ConstraintPropagationSuite`.
      
      Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
      
      Closes #14795 from sameeragarwal/deterministic-constraints.
      ac27557e
    • hyukjinkwon's avatar
      [SPARK-16216][SQL] Read/write timestamps and dates in ISO 8601 and... · 29952ed0
      hyukjinkwon authored
      [SPARK-16216][SQL] Read/write timestamps and dates in ISO 8601 and dateFormat/timestampFormat option for CSV and JSON
      
      ## What changes were proposed in this pull request?
      
      ### Default - ISO 8601
      
      Currently, the CSV datasource writes `Timestamp` and `Date` in numeric form, and the JSON datasource writes both as below:
      
      - CSV
        ```
        // TimestampType
        1414459800000000
        // DateType
        16673
        ```
      
      - Json
      
        ```
        // TimestampType
        1970-01-01 11:46:40.0
        // DateType
        1970-01-01
        ```
      
      So, for CSV we can't read back what we write, and for JSON it becomes ambiguous because the timezone information is missing.
      
      So, this PR makes both **write** `Timestamp` and `Date` as ISO 8601 formatted strings (please refer to the [ISO 8601 specification](https://www.w3.org/TR/NOTE-datetime)).
      
      - For `Timestamp` it becomes as below: (`yyyy-MM-dd'T'HH:mm:ss.SSSZZ`)
      
        ```
        1970-01-01T02:00:01.000-01:00
        ```
      
      - For `Date` it becomes as below (`yyyy-MM-dd`)
      
        ```
        1970-01-01
        ```
      
      ### Custom date format option - `dateFormat`
      
      This PR also adds support for writing and reading dates and timestamps as formatted strings, as below (a combined usage sketch follows these examples):
      
      - **DateType**
      
        - With `dateFormat` option (e.g. `yyyy/MM/dd`)
      
          ```
          +----------+
          |      date|
          +----------+
          |2015/08/26|
          |2014/10/27|
          |2016/01/28|
          +----------+
          ```
      
      ### Custom timestamp format option - `timestampFormat`
      
      - **TimestampType**
      
        - With `timestampFormat` option (e.g. `dd/MM/yyyy HH:mm`)
      
          ```
          +----------------+
          |            date|
          +----------------+
          |2015/08/26 18:00|
          |2014/10/27 18:30|
          |2016/01/28 20:00|
          +----------------+
          ```
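
      A hedged sketch of using the two options together (assuming an existing DataFrame `df` and SparkSession `spark`; the patterns and paths are placeholders):

      ```scala
      // Write dates/timestamps with custom patterns; the option names come from this PR.
      df.write
        .option("dateFormat", "yyyy/MM/dd")
        .option("timestampFormat", "dd/MM/yyyy HH:mm")
        .csv("/tmp/out")

      // Read them back with the same patterns so the values round-trip.
      val readBack = spark.read
        .schema(df.schema)
        .option("dateFormat", "yyyy/MM/dd")
        .option("timestampFormat", "dd/MM/yyyy HH:mm")
        .csv("/tmp/out")
      ```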
      
      ## How was this patch tested?
      
      Unit tests were added in `CSVSuite` and `JsonSuite`. For JSON, existing tests cover the default cases.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14279 from HyukjinKwon/SPARK-16216-json-csv.
      29952ed0
    • Dongjoon Hyun's avatar
      [SPARK-16983][SQL] Add `prettyName` for row_number, dense_rank, percent_rank, cume_dist · 40b30fcf
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, two-word window functions like `row_number`, `dense_rank`, `percent_rank`, and `cume_dist` are shown without `_` in error messages. We should show the correct names.
      
      **Before**
      ```scala
      scala> sql("select row_number()").show
      java.lang.UnsupportedOperationException: Cannot evaluate expression: rownumber()
      ```
      
      **After**
      ```scala
      scala> sql("select row_number()").show
      java.lang.UnsupportedOperationException: Cannot evaluate expression: row_number()
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins and manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14571 from dongjoon-hyun/SPARK-16983.
      40b30fcf
    • Wenchen Fan's avatar
      [SPARK-17186][SQL] remove catalog table type INDEX · 52fa45d6
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Spark SQL doesn't actually support indexes; the catalog table type `INDEX` comes from Hive. However, most operations in Spark SQL can't handle index tables, e.g. create table, alter table, etc.
      
      Logically, index tables should be invisible to end users, and Hive generates special table names for index tables to keep users from accessing them directly. Hive has dedicated SQL syntax to create/show/drop index tables.
      
      On the Spark SQL side, although we can describe an index table directly, the result is unreadable; we should use the dedicated SQL syntax instead (e.g. `SHOW INDEX ON tbl`). Spark SQL can also read an index table directly, but the result is always empty. (Can Hive read an index table directly?)
      
      This PR removes the table type `INDEX` to make it clear that Spark SQL doesn't currently support indexes.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #14752 from cloud-fan/minor2.
      52fa45d6
    • Weiqing Yang's avatar
      [MINOR][SQL] Remove implemented functions from comments of 'HiveSessionCatalog.scala' · b9994ad0
      Weiqing Yang authored
      ## What changes were proposed in this pull request?
      This PR removes implemented functions from comments of `HiveSessionCatalog.scala`: `java_method`, `posexplode`, `str_to_map`.
      
      ## How was this patch tested?
      Manual.
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #14769 from Sherry302/cleanComment.
      b9994ad0
  5. Aug 23, 2016
    • Josh Rosen's avatar
      [SPARK-17194] Use single quotes when generating SQL for string literals · bf8ff833
      Josh Rosen authored
      When Spark emits SQL for a string literal, it should wrap the string in single quotes, not double quotes. Databases which adhere more strictly to the ANSI SQL standards, such as Postgres, allow only single-quotes to be used for denoting string literals (see http://stackoverflow.com/a/1992331/590203).
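
      As a minimal sketch of the convention (illustrative; Spark's actual escaping logic may differ):

      ```scala
      // Wrap string literals in single quotes, doubling any embedded single quote,
      // which is the ANSI SQL escaping convention.
      def stringToSqlLiteral(s: String): String =
        "'" + s.replace("'", "''") + "'"

      // stringToSqlLiteral("it's") == "'it''s'"
      ```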
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #14763 from JoshRosen/SPARK-17194.
      bf8ff833
    • Davies Liu's avatar
      [SPARK-13286] [SQL] add the next expression of SQLException as cause · 9afdfc94
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Some JDBC drivers (for example PostgreSQL) do not use the underlying exception as the cause, but expose it through a separate API (getNextException), so it is not included in the error logging, making it hard to find the root cause, especially in batch mode.
      
      This PR will pull out the next exception and add it as the cause (if it's different) or as suppressed (if there is another, different cause).
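
      A sketch of the approach using the standard JDBC API (illustrative; Spark's actual helper may differ):

      ```scala
      import java.sql.SQLException

      // Pull the driver-specific "next" exception into the regular Throwable chain
      // so that it shows up in stack traces and logs.
      def attachNextException(e: SQLException): SQLException = {
        val next = e.getNextException
        if (next != null && next != e) {
          if (e.getCause == null) e.initCause(next)          // use it as the cause if none is set
          else if (e.getCause != next) e.addSuppressed(next) // otherwise record it as suppressed
        }
        e
      }
      ```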
      
      ## How was this patch tested?
      
      Can't reproduce this on the default JDBC driver, so did not add a regression test.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #14722 from davies/keep_cause.
      9afdfc94
    • Jacek Laskowski's avatar
      [SPARK-17199] Use CatalystConf.resolver for case-sensitivity comparison · 9d376ad7
      Jacek Laskowski authored
      ## What changes were proposed in this pull request?
      
      Use `CatalystConf.resolver` consistently for case-sensitivity comparison (removed dups).
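
      Conceptually, a resolver is just a name-comparison function chosen once from the conf (a sketch of the idea, not the exact Spark definitions):

      ```scala
      object ResolverSketch {
        // A resolver answers "are these two identifiers the same name?"
        type Resolver = (String, String) => Boolean

        val caseSensitive: Resolver   = (a, b) => a == b
        val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

        // Pick the resolver once from the conf instead of re-implementing
        // the case-sensitivity check at every call site.
        def resolverFor(caseSensitiveAnalysis: Boolean): Resolver =
          if (caseSensitiveAnalysis) caseSensitive else caseInsensitive
      }
      ```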
      
      ## How was this patch tested?
      
      Local build. Waiting for Jenkins to ensure clean build and test.
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #14771 from jaceklaskowski/17199-catalystconf-resolver.
      9d376ad7
    • Sean Zhong's avatar
      [SPARK-17188][SQL] Moves class QuantileSummaries to project catalyst for... · cc33460a
      Sean Zhong authored
      [SPARK-17188][SQL] Moves class QuantileSummaries to project catalyst for implementing percentile_approx
      
      ## What changes were proposed in this pull request?
      
      This is a sub-task of [SPARK-16283](https://issues.apache.org/jira/browse/SPARK-16283) (Implement percentile_approx SQL function), which moves class QuantileSummaries to project catalyst so that it can be reused when implementing aggregation function `percentile_approx`.
      
      ## How was this patch tested?
      
      This PR only does class relocation, class implementation is not changed.
      
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #14754 from clockfly/move_QuantileSummaries_to_catalyst.
      cc33460a
  6. Aug 22, 2016
    • Cheng Lian's avatar
      [SPARK-17182][SQL] Mark Collect as non-deterministic · 2cdd92a7
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR marks the abstract class `Collect` as non-deterministic since the results of `CollectList` and `CollectSet` depend on the actual order of input rows.
      
      ## How was this patch tested?
      
      Existing test cases should be enough.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #14749 from liancheng/spark-17182-non-deterministic-collect.
      2cdd92a7
    • gatorsmile's avatar
      [SPARK-17144][SQL] Removal of useless CreateHiveTableAsSelectLogicalPlan · 6d93f9e0
      gatorsmile authored
      ## What changes were proposed in this pull request?
      `CreateHiveTableAsSelectLogicalPlan` is dead code after refactoring.
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14707 from gatorsmile/removeCreateHiveTable.
      6d93f9e0
    • Eric Liang's avatar
      [SPARK-17162] Range does not support SQL generation · 84770b59
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      The range operator previously didn't support SQL generation, which made it impossible to use in views.
      
      ## How was this patch tested?
      
      Unit tests.
      
      cc hvanhovell
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #14724 from ericl/spark-17162.
      84770b59
    • Sean Zhong's avatar
      [MINOR][SQL] Fix some typos in comments and test hints · 929cb8be
      Sean Zhong authored
      ## What changes were proposed in this pull request?
      
      Fix some typos in comments and test hints
      
      ## How was this patch tested?
      
      N/A.
      
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #14755 from clockfly/fix_minor_typo.
      929cb8be
    • Davies Liu's avatar
      [SPARK-17115][SQL] decrease the threshold when split expressions · 8d35a6f6
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      In 2.0, we changed the threshold for splitting expressions from 16K to 64K, which causes very bad performance on wide tables because the generated methods can't be JIT-compiled by default (above the 8K bytecode limit).
      
      This PR decreases it to 1K, based on benchmark results for a wide table with 400 columns of LongType.
      
      It also fixes a bug around splitting expressions in whole-stage codegen (it should not split them).
      
      ## How was this patch tested?
      
      Added benchmark suite.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #14692 from davies/split_exprs.
      8d35a6f6
    • Wenchen Fan's avatar
      [SPARK-16498][SQL] move hive hack for data source table into HiveExternalCatalog · b2074b66
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Spark SQL doesn't have its own metastore yet and currently uses Hive's. However, Hive's metastore has some limitations (e.g. the number of columns is limited, it is not case-preserving, decimal type support is poor, etc.), so we have some hacks to store data source table metadata in the Hive metastore, i.e. putting all the information in table properties.
      
      This PR moves these hacks into `HiveExternalCatalog`, to isolate Hive-specific logic in one place.
      
      changes overview:
      
      1. **Before this PR**: we need to put the metadata (schema, partition columns, etc.) of data source tables into table properties before saving it to the external catalog, even if the external catalog doesn't use the Hive metastore (e.g. `InMemoryCatalog`).
      **After this PR**: the table properties tricks live only in `HiveExternalCatalog`; the caller side doesn't need to take care of them anymore.
      
      2. **Before this PR**: because the table properties tricks are done outside of the external catalog, we also need to revert them when we read the table metadata from the external catalog and use it, e.g. in `DescribeTableCommand` we read the schema and partition columns from table properties.
      **After this PR**: the table metadata read from the external catalog is exactly the same as what we saved to it.
      
      Bonus: we can now create a data source table using `SessionCatalog` if a schema is specified.
      Breaks: `schemaStringLengthThreshold` is no longer configurable. `hive.default.rcfile.serde` is no longer configurable.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #14155 from cloud-fan/catalog-table.
      b2074b66
  7. Aug 21, 2016
    • Dongjoon Hyun's avatar
      [SPARK-17098][SQL] Fix `NullPropagation` optimizer to handle `COUNT(NULL) OVER` correctly · 91c23976
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, `NullPropagation` optimizer replaces `COUNT` on null literals in a bottom-up fashion. During that, `WindowExpression` is not covered properly. This PR adds the missing propagation logic.
      
      **Before**
      ```scala
      scala> sql("SELECT COUNT(1 + NULL) OVER ()").show
      java.lang.UnsupportedOperationException: Cannot evaluate expression: cast(0 as bigint) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
      ```
      
      **After**
      ```scala
      scala> sql("SELECT COUNT(1 + NULL) OVER ()").show
      +----------------------------------------------------------------------------------------------+
      |count((1 + CAST(NULL AS INT))) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)|
      +----------------------------------------------------------------------------------------------+
      |                                                                                             0|
      +----------------------------------------------------------------------------------------------+
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins test with a new test case.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14689 from dongjoon-hyun/SPARK-17098.
      91c23976
  8. Aug 20, 2016
    • petermaxlee's avatar
      [SPARK-17124][SQL] RelationalGroupedDataset.agg should preserve order and... · 9560c8d2
      petermaxlee authored
      [SPARK-17124][SQL] RelationalGroupedDataset.agg should preserve order and allow multiple aggregates per column
      
      ## What changes were proposed in this pull request?
      This patch fixes a longstanding issue with one of the `RelationalGroupedDataset.agg` functions. Even though the signature accepts a vararg of pairs, the underlying implementation turns the seq into a map, and is thus neither order-preserving nor able to support multiple aggregates per column.
      
      This change also allows users to use this function to run multiple different aggregations for a single column, e.g.
      ```
      agg("age" -> "max", "age" -> "count")
      ```
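
      A slightly fuller usage sketch (assuming a DataFrame `df`; the column names here are placeholders):

      ```scala
      // After this fix, both aggregates on "age" are kept, in the order given.
      val result = df.groupBy("department")
        .agg("age" -> "max", "age" -> "count", "salary" -> "avg")
      // Resulting columns (roughly): department, max(age), count(age), avg(salary)
      ```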
      
      ## How was this patch tested?
      Added a test case in DataFrameAggregateSuite.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #14697 from petermaxlee/SPARK-17124.
      9560c8d2
    • Liang-Chi Hsieh's avatar
      [SPARK-17104][SQL] LogicalRelation.newInstance should follow the semantics of MultiInstanceRelation · 31a01557
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      Currently, `LogicalRelation.newInstance()` simply creates another `LogicalRelation` object with the same parameters. However, the `newInstance()` method inherited from `MultiInstanceRelation` should return a copy of the object with unique expression ids. The current `LogicalRelation.newInstance()` can cause failures when doing a self-join.
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #14682 from viirya/fix-localrelation.
      31a01557
    • petermaxlee's avatar
      [SPARK-17150][SQL] Support SQL generation for inline tables · 45d40d9f
      petermaxlee authored
      ## What changes were proposed in this pull request?
      This patch adds support for SQL generation for inline tables. With this, it would be possible to create a view that depends on inline tables.
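
      For reference, an inline table in Spark SQL looks like the following (a minimal sketch, assuming a `SparkSession` named `spark`; generating SQL for the `VALUES` relation is what lets a view be defined over it):

      ```scala
      // An inline table expressed with VALUES ... AS alias(columns).
      spark.sql("SELECT * FROM VALUES (1, 'alice'), (2, 'bob') AS t(id, name)").show()

      // With SQL generation support, a view can now be created on top of it.
      spark.sql("CREATE VIEW people AS SELECT * FROM VALUES (1, 'alice'), (2, 'bob') AS t(id, name)")
      ```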
      
      ## How was this patch tested?
      Added a test case in LogicalPlanToSQLSuite.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #14709 from petermaxlee/SPARK-17150.
      45d40d9f
  9. Aug 19, 2016
    • Srinath Shankar's avatar
      [SPARK-17158][SQL] Change error message for out of range numeric literals · ba1737c2
      Srinath Shankar authored
      ## What changes were proposed in this pull request?
      
      Modifies the error message for numeric literals to:
      `Numeric literal <literal> does not fit in range [min, max] for type <T>`
      
      ## How was this patch tested?
      
      Fixed up the error messages for literals.sql in `SQLQueryTestSuite` and re-ran via sbt. Also fixed up the error messages in `ExpressionParserSuite`.
      
      Author: Srinath Shankar <srinath@databricks.com>
      
      Closes #14721 from srinathshankar/sc4296.
      ba1737c2