  1. Feb 14, 2017
    • [SPARK-19318][SQL] Fix to treat JDBC connection properties specified by the user in case-sensitive manner · f48c5a57
      sureshthalamati authored
      
      ## What changes were proposed in this pull request?
      The reason for the test failure is that the property `oracle.jdbc.mapDateToTimestamp` set by the test was being converted to all lower case. The Oracle database expects this property name in a case-sensitive manner.
      
      This test passed in previous releases because connection properties were sent exactly as the user specified them for this test scenario. A fix that handled all options uniformly in a case-insensitive manner also converted the JDBC connection properties to lower case.
      
      This PR enhances CaseInsensitiveMap to keep track of the original case-sensitive input keys, and uses those keys when creating the connection properties that are passed to the JDBC connection.
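
      A minimal sketch of the idea, assuming this simplified shape (not the actual Spark implementation):

      ```scala
      // Look-ups are case-insensitive, but originalMap keeps the user's exact
      // key casing so it can be handed to the JDBC driver unchanged.
      // (Sketch for Scala 2.11/2.12; `+` returns a plain Map for brevity.)
      class CaseInsensitiveMap(val originalMap: Map[String, String])
        extends Map[String, String] {

        private val lowerCased = originalMap.map { case (k, v) => (k.toLowerCase, v) }

        override def get(k: String): Option[String] = lowerCased.get(k.toLowerCase)
        override def iterator: Iterator[(String, String)] = lowerCased.iterator
        override def +[B1 >: String](kv: (String, B1)): Map[String, B1] = originalMap + kv
        override def -(key: String): Map[String, String] =
          new CaseInsensitiveMap(originalMap.filter { case (k, _) => !k.equalsIgnoreCase(key) })
      }
      ```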
      
      An alternative approach, PR https://github.com/apache/spark/pull/16847, is to pass the original input keys to the JDBC data source by adding a check in the data source class and handling case-insensitivity in the JDBC source code.
      
      ## How was this patch tested?
      Added new test cases to JdbcSuite and OracleIntegrationSuite. Ran the Docker integration tests on my laptop; all tests passed successfully.
      
      Author: sureshthalamati <suresh.thalamati@gmail.com>
      
      Closes #16891 from sureshthalamati/jdbc_case_senstivity_props_fix-SPARK-19318.
    • [SPARK-16475][SQL] Broadcast hint for SQL Queries · da7aef7a
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This pull request introduces a simple hint infrastructure to SQL and implements broadcast join hint using the infrastructure.
      
      The hint syntax looks like the following:
      ```
      SELECT /*+ BROADCAST(t) */ * FROM t
      ```
      
      For the broadcast hint, we accept "BROADCAST", "BROADCASTJOIN", and "MAPJOIN", and a sequence of relation aliases can be specified in the hint. A broadcast hint plan node will be inserted on top of any relation (that is not aliased differently), subquery, or common table expression that matches the specified name.
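
      For example, a sketch of hinting multiple relations at once (table names are illustrative):

      ```scala
      // "MAPJOIN" and "BROADCASTJOIN" are accepted as synonyms for "BROADCAST".
      spark.sql(
        """SELECT /*+ MAPJOIN(small1, small2) */ *
          |FROM large
          |JOIN small1 ON large.k = small1.k
          |JOIN small2 ON large.k = small2.k""".stripMargin)
      ```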
      
      The hint resolution works by recursively traversing down the query plan to find a relation or subquery that matches one of the specified broadcast aliases. The traversal does not go beyond any existing broadcast hints or subquery aliases. This rule is applied before common table expressions are substituted.
      
      Note that there was an earlier patch in https://github.com/apache/spark/pull/14426. This is a rewrite of that patch, with different semantics and simpler test cases.
      
      ## How was this patch tested?
      Added a new unit test suite for the broadcast hint rule (SubstituteHintsSuite) and new test cases for parser change (in PlanParserSuite). Also added end-to-end test case in BroadcastSuite.
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #16925 from rxin/SPARK-16475-broadcast-hint.
    • [SPARK-19589][SQL] Removal of SQLGEN files · 457850e6
      Xiao Li authored
      ### What changes were proposed in this pull request?
      SQLGen is removed. Thus, the generated files should be removed too.
      
      ### How was this patch tested?
      N/A
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #16921 from gatorsmile/removeSQLGenFiles.
  2. Feb 13, 2017
    • [SPARK-19539][SQL] Block duplicate temp table during creation · 1ab97310
      Xin Wu authored
      ## What changes were proposed in this pull request?
      Currently `CREATE TEMPORARY TABLE ...` is deprecated, and users are recommended to use `CREATE TEMPORARY VIEW ...` instead; it also does not support the `IF NOT EXISTS` clause. As a result, if a temporary view is already defined, it is possible to unintentionally replace that existing view by issuing `CREATE TEMPORARY TABLE ...` with the same table/view name.
      
      This PR disallows `CREATE TEMPORARY TABLE ...` with an existing view name.
      Under the covers, `CREATE TEMPORARY TABLE ...` is still translated into creating a temporary view, but it now passes the flag `replace = false` instead of the current `true`. So when the temporary view is created under the covers, the operation is blocked if a view with the same name already exists.
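
      A minimal illustration of the new behavior (a spark-shell sketch; names and path are hypothetical):

      ```scala
      spark.sql("CREATE TEMPORARY VIEW t1 AS SELECT 1 AS id")
      // Previously this would silently replace t1; with replace = false it now
      // fails because a temporary view named t1 already exists.
      spark.sql("CREATE TEMPORARY TABLE t1 USING parquet OPTIONS (path '/tmp/t1')")
      ```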
      
      ## How was this patch tested?
      A new unit test case is added, and some existing test cases are updated to adapt to the new behavior.
      
      Author: Xin Wu <xinwu@us.ibm.com>
      
      Closes #16878 from xwu0226/block_duplicate_temp_table.
    • [SPARK-19115][SQL] Supporting Create Table Like Location · 6e45b547
      ouyangxiaochen authored
      ## What changes were proposed in this pull request?
      
      Support `CREATE [EXTERNAL] TABLE LIKE ... LOCATION ...` syntax for Hive serde and datasource tables.
      In this PR, we follow these Spark SQL design rules:

          - Creating a table like a view, a physical table, or a temporary view, with a location, is supported.
          - A table created with a location will be an external table rather than a managed table.
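
      An illustrative sketch of the newly supported syntax (names and path are hypothetical):

      ```scala
      // The created table points at the given location and is therefore
      // registered as an external table.
      spark.sql("CREATE TABLE t2 LIKE t1 LOCATION '/tmp/warehouse/t2'")
      ```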
      
      ## How was this patch tested?
      
      Added new test cases and updated existing test cases.
      
      Author: ouyangxiaochen <ou.yangxiaochen@zte.com.cn>
      
      Closes #16868 from ouyangxiaochen/spark19115.
    • [SPARK-19435][SQL] Type coercion between ArrayTypes · 9af8f743
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to support type coercion between `ArrayType`s where the element types are compatible.
      
      **Before**
      
      ```
      Seq(Array(1)).toDF("a").selectExpr("greatest(a, array(1D))")
      org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(`a`, array(1.0D))' due to data type mismatch: The expressions should all have the same type, got GREATEST(array<int>, array<double>).; line 1 pos 0;
      
      Seq(Array(1)).toDF("a").selectExpr("least(a, array(1D))")
      org.apache.spark.sql.AnalysisException: cannot resolve 'least(`a`, array(1.0D))' due to data type mismatch: The expressions should all have the same type, got LEAST(array<int>, array<double>).; line 1 pos 0;
      
      sql("SELECT * FROM values (array(0)), (array(1D)) as data(a)")
      org.apache.spark.sql.AnalysisException: incompatible types found in column a for inline table; line 1 pos 14
      
      Seq(Array(1)).toDF("a").union(Seq(Array(1D)).toDF("b"))
      org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. ArrayType(DoubleType,false) <> ArrayType(IntegerType,false) at the first column of the second table;;
      
      sql("SELECT IF(1=1, array(1), array(1D))")
      org.apache.spark.sql.AnalysisException: cannot resolve '(IF((1 = 1), array(1), array(1.0D)))' due to data type mismatch: differing types in '(IF((1 = 1), array(1), array(1.0D)))' (array<int> and array<double>).; line 1 pos 7;
      ```
      
      **After**
      
      ```scala
      Seq(Array(1)).toDF("a").selectExpr("greatest(a, array(1D))")
      res5: org.apache.spark.sql.DataFrame = [greatest(a, array(1.0)): array<double>]
      
      Seq(Array(1)).toDF("a").selectExpr("least(a, array(1D))")
      res6: org.apache.spark.sql.DataFrame = [least(a, array(1.0)): array<double>]
      
      sql("SELECT * FROM values (array(0)), (array(1D)) as data(a)")
      res8: org.apache.spark.sql.DataFrame = [a: array<double>]
      
      Seq(Array(1)).toDF("a").union(Seq(Array(1D)).toDF("b"))
      res10: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: array<double>]
      
      sql("SELECT IF(1=1, array(1), array(1D))")
      res15: org.apache.spark.sql.DataFrame = [(IF((1 = 1), array(1), array(1.0))): array<double>]
      ```
      
      ## How was this patch tested?
      
      Unit tests in `TypeCoercion`, Jenkins tests, and building with Scala 2.10:
      
      ```bash
      ./dev/change-scala-version.sh 2.10
      ./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16777 from HyukjinKwon/SPARK-19435.
    • [SPARK-19542][SS] Delete the temp checkpoint if a query is stopped without errors · 3dbff9be
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      When a query uses a temporary checkpoint directory, it is better to delete the directory if the query is stopped without errors.
      
      ## How was this patch tested?
      
      New unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16880 from zsxwing/delete-temp-checkpoint.
    • [SPARK-19514] Enhancing the test for Range interruption. · 0417ce87
      Ala Luszczak authored
      Improve the test for SPARK-19514, so that it's clear which stage is being cancelled.
      
      Author: Ala Luszczak <ala@databricks.com>
      
      Closes #16914 from ala/fix-range-test.
    • [SPARK-19544][SQL] Improve error message when some column types are compatible and others are not in set operations · 4321ff9e
      hyukjinkwon authored
      
      ## What changes were proposed in this pull request?
      
      This PR proposes to fix the error message shown when some data types are compatible and others are not in a set/union operation.
      
      Currently, the code below:
      
      ```scala
      Seq((1,("a", 1))).toDF.union(Seq((1L,("a", "b"))).toDF)
      ```
      
      throws an exception saying `LongType` and `IntegerType` are incompatible types. It should instead say something about the `StructType`s, in a more readable format, as below:
      
      **Before**
      
      ```
      Union can only be performed on tables with the compatible column types.
      LongType <> IntegerType at the first column of the second table;;
      ```
      
      **After**
      
      ```
      Union can only be performed on tables with the compatible column types.
      struct<_1:string,_2:string> <> struct<_1:string,_2:int> at the second column of the second table;;
      ```
      
      *I manually inserted a newline into the messages above for readability in this PR description only.
      
      ## How was this patch tested?
      
      Unit tests in `AnalysisErrorSuite`, manual tests, and a build with Scala 2.10.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16882 from HyukjinKwon/SPARK-19544.
    • [SPARK-19496][SQL] to_date udf to return null when input date is invalid · 04ad8225
      windpiger authored
      ## What changes were proposed in this pull request?
      
      Currently the UDF `to_date` returns different values for invalid date inputs.
      
      ```
      SELECT to_date('2015-07-22', 'yyyy-dd-MM')  -> returns `2016-10-07`
      SELECT to_date('2014-31-12')                -> returns null
      ```
      
      As discussed in JIRA [SPARK-19496](https://issues.apache.org/jira/browse/SPARK-19496), we should return null in both situations when the input date is invalid.
      
      ## How was this patch tested?
      Unit test added.
      
      Author: windpiger <songjun@outlook.com>
      
      Closes #16870 from windpiger/to_date.
  3. Feb 10, 2017
    • [SPARK-19548][SQL] Support Hive UDFs which return typed Lists/Maps · 226d3884
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      This PR adds support for Hive UDFs that return fully typed Java `List`s or `Map`s, for example `List<String>` or `Map<String, Integer>`. It is also allowed to nest these structures, for example `Map<String, List<Integer>>`. Raw collections and collections using wildcards are still not supported, and cannot be supported due to the lack of type information.
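
      A sketch of the kind of UDF this enables (class name and logic are illustrative):

      ```scala
      import java.util.{ArrayList => JArrayList, List => JList}
      import org.apache.hadoop.hive.ql.exec.UDF

      // evaluate() returns a fully typed java.util.List<String>, from which the
      // Catalyst return type array<string> can now be derived.
      class SplitToList extends UDF {
        def evaluate(s: String): JList[String] = {
          val out = new JArrayList[String]()
          s.split(",").foreach(w => out.add(w))
          out
        }
      }
      ```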
      
      ## How was this patch tested?
      Modified existing tests in `HiveUDFSuite`, and added test cases for raw collections and collections using wildcards.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #16886 from hvanhovell/SPARK-19548.
    • [SPARK-19549] Allow providing reason for stage/job cancelling · d785217b
      Ala Luszczak authored
      ## What changes were proposed in this pull request?
      
      This change adds an optional argument to the `SparkContext.cancelStage()` and `SparkContext.cancelJob()` functions, which allows the caller to provide the exact reason for the cancellation.
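
      A usage sketch (the ids below are hypothetical):

      ```scala
      val jobId = 17
      val stageId = 3
      sc.cancelJob(jobId, "result no longer needed by the caller")
      sc.cancelStage(stageId, "cancelled while debugging a slow stage")
      ```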
      
      ## How was this patch tested?
      
      Adds a unit test.
      
      Author: Ala Luszczak <ala@databricks.com>
      
      Closes #16887 from ala/cancel.
    • [SPARK-19459][SQL] Add Hive datatype (char/varchar) to StructField metadata · de8a03e6
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      Reading from an existing ORC table which contains `char` or `varchar` columns can fail with a `ClassCastException` if the table metadata has been created using Spark. This is caused by the fact that Spark internally replaces `char` and `varchar` columns with a `string` column.
      
      This PR fixes this by adding the Hive type to the `StructField`'s metadata under the `HIVE_TYPE_STRING` key. This is picked up by the `HiveClient` and the ORC reader; see https://github.com/apache/spark/pull/16060 for more details on how the metadata is used.
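
      A sketch of the idea (the metadata key literal is assumed to match the `HIVE_TYPE_STRING` constant; the column name and type are illustrative):

      ```scala
      import org.apache.spark.sql.types.{MetadataBuilder, StringType, StructField}

      // The original char/varchar type travels in the field's metadata while
      // the column itself stays a plain string.
      val metadata = new MetadataBuilder()
        .putString("HIVE_TYPE_STRING", "varchar(32)")
        .build()
      val field = StructField("name", StringType, nullable = true, metadata = metadata)
      ```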
      
      ## How was this patch tested?
      Added a regression test to `OrcSourceSuite`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #16804 from hvanhovell/SPARK-19459.
    • [SPARK-19543] from_json fails when the input row is empty · d5593f7f
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      Using `from_json` on a column containing an empty string results in `java.util.NoSuchElementException: head of empty list`.
      
      This is because `parser.parse(input)` may return `Nil` when `input.trim.isEmpty`.
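
      A sketch of the affected case (a spark-shell session is assumed):

      ```scala
      import org.apache.spark.sql.functions.from_json
      import org.apache.spark.sql.types.{StringType, StructField, StructType}
      import spark.implicits._

      val schema = StructType(Seq(StructField("a", StringType)))
      // The empty-string row should yield null rather than throw.
      Seq("""{"a": "x"}""", "").toDF("json")
        .select(from_json($"json", schema))
        .show()
      ```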
      
      ## How was this patch tested?
      
      Regression test in `JsonExpressionsSuite`
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #16881 from brkyvz/json-fix.
  4. Feb 09, 2017
    • [SPARK-19025][SQL] Remove SQL builder for operators · af63c52f
      jiangxingbo authored
      ## What changes were proposed in this pull request?
      
      With the new approach of view resolution, we can get rid of SQL generation on view creation, so let's remove SQL builder for operators.
      
      Note that, since all SQL generation for operators is defined in one file (org.apache.spark.sql.catalyst.SQLBuilder), it'd be trivial to recover it in the future.
      
      ## How was this patch tested?
      
      N/A
      
      Author: jiangxingbo <jiangxb1987@gmail.com>
      
      Closes #16869 from jiangxb1987/SQLBuilder.
    • [SPARK-19512][SQL] codegen for compare structs fails · 1af0dee4
      Bogdan Raducanu authored
      ## What changes were proposed in this pull request?
      
      Set `currentVars` to null in `GenerateOrdering.genComparisons` before `genCode` is called. `genCode` ignores `INPUT_ROW` if `currentVars` is not null, and in `genComparisons` we want it to use `INPUT_ROW`.
      
      ## How was this patch tested?
      
      Added a test with two queries in `WholeStageCodegenSuite`.
      
      Author: Bogdan Raducanu <bogdan.rdc@gmail.com>
      
      Closes #16852 from bogdanrdc/SPARK-19512.
    • [SPARK-19514] Making range interruptible. · 4064574d
      Ala Luszczak authored
      ## What changes were proposed in this pull request?
      
      Previously the range operator could not be interrupted. For example, using `DAGScheduler.cancelStage(...)` on a query with a range might have been ineffective.
      
      This change adds periodic checks of `TaskContext.isInterrupted` to the codegen version, and an `InterruptibleOperator` to the non-codegen version.
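
      A sketch of the behavior this enables (a spark-shell session is assumed; the job-group name is illustrative):

      ```scala
      import scala.concurrent.Future
      import scala.concurrent.ExecutionContext.Implicits.global

      // Cancelling the job group now actually interrupts the running range.
      sc.setJobGroup("range-demo", "interruptible range", interruptOnCancel = true)
      Future { spark.range(1000L * 1000 * 1000 * 10).count() }
      Thread.sleep(1000)
      sc.cancelJobGroup("range-demo")
      ```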
      
      I benchmarked the performance of codegen version on a sample query `spark.range(1000L * 1000 * 1000 * 10).count()` and there is no measurable difference.
      
      ## How was this patch tested?
      
      Adds a unit test.
      
      Author: Ala Luszczak <ala@databricks.com>
      
      Closes #16872 from ala/SPARK-19514b.
  5. Feb 08, 2017
    • [SPARK-19265][SQL][FOLLOW-UP] Configurable `tableRelationCache` maximum size · 9d9d67c7
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      
      SPARK-19265 had made table relation cache general; this follow-up aims to make `tableRelationCache`'s maximum size configurable.
      
      In order to sanity-check the configured value, this patch also adds a `checkValue()` method to `TypedConfigBuilder`.
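
      A sketch of how such a validated entry might look (the config name and builder entry point are assumptions, not the exact code):

      ```scala
      val cacheSize = buildConf("spark.sql.filesourceTableRelationCacheSize")
        .intConf
        .checkValue(size => size >= 0, "The maximum size of the cache must not be negative")
        .createWithDefault(1000)
      ```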
      
      ## How was this patch tested?
      
      new test case: `test("conf entry: checkValue()")`
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #16736 from lw-lin/conf.
    • [SPARK-19359][SQL] renaming partition should not leave useless directories · 50a99126
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      The Hive metastore is not case-preserving and keeps partition columns with lower-case names. If Spark SQL creates a table with upper-case partition column names using `HiveExternalCatalog`, then when we rename a partition, it first calls the `HiveClient`'s `renamePartition`, which creates a new lower-case partition path, and Spark SQL then renames the lower-case path to the upper-case one.
      
      However, when we rename a nested path, different file systems behave differently. For example, on Jenkins, renaming `a=1/b=2` to `A=2/B=2` succeeds but leaves an empty directory `a=1`; on macOS, the renaming doesn't work as expected and results in `a=1/B=2`.
      
      This PR renames the partition directory recursively from the first partition column in `HiveExternalCatalog`, to be most compatible with different file systems.
      
      ## How was this patch tested?
      
      new regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16837 from cloud-fan/partition.
    • [SPARK-18872][SQL][TESTS] New test cases for EXISTS subquery (Aggregate, Having, Orderby, Limit) · 64cae22f
      Dilip Biswal authored
      ## What changes were proposed in this pull request?
      This PR adds the second set of tests for EXISTS subquery.
      
      File name                | Brief description
      ------------------------ | -----------------
      exists-aggregate.sql     | Tests aggregate expressions in the outer query and in the EXISTS subquery.
      exists-having.sql        | Tests the HAVING clause in the subquery.
      exists-orderby-limit.sql | Tests EXISTS subquery support with ORDER BY and LIMIT clauses.
      
      DB2 results are attached here as reference :
      
      [exists-aggregate-db2.txt](https://github.com/apache/spark/files/743287/exists-aggregate-db2.txt)
      [exists-having-db2.txt](https://github.com/apache/spark/files/743286/exists-having-db2.txt)
      [exists-orderby-limit-db2.txt](https://github.com/apache/spark/files/743288/exists-orderby-limit-db2.txt)
      
      ## How was this patch tested?
      The test results are compared with the results from another SQL engine (in this case, IBM DB2). If the results are equivalent, we assume the results are correct.
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #16760 from dilipbiswal/exists-pr2.
    • [SPARK-19279][SQL][FOLLOW-UP] Infer Schema for Hive Serde Tables · 4d4d0de7
      gatorsmile authored
      ### What changes were proposed in this pull request?
      `table.schema` is never empty for partitioned tables, because `table.schema` also contains the partition columns, even if the original table does not have any columns of its own. This PR fixes the issue.
      
      ### How was this patch tested?
      Added a test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16848 from gatorsmile/inferHiveSerdeSchema.
    • [SPARK-19409][BUILD][TEST-MAVEN] Fix ParquetAvroCompatibilitySuite failure due to test dependency on avro · 0077bfcb
      Dongjoon Hyun authored
      
      ## What changes were proposed in this pull request?
      
      After upgrading to Apache Parquet 1.8.2, `ParquetAvroCompatibilitySuite` fails in the **Maven** test. This is because `org.apache.parquet.avro.AvroParquetWriter` in the test code uses the new Avro 1.8.0-specific class `LogicalType`. This PR aims to fix the test dependency of the `sql/core` module to use Avro 1.8.0.
      
      https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/2530/consoleFull
      
      ```
      ParquetAvroCompatibilitySuite:
      *** RUN ABORTED ***
        java.lang.NoClassDefFoundError: org/apache/avro/LogicalType
        at org.apache.parquet.avro.AvroParquetWriter.writeSupport(AvroParquetWriter.java:144)
      ```
      
      ## How was this patch tested?
      
      Pass the existing test with **Maven**.
      
      ```
      $ build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver test
      ...
      [INFO] ------------------------------------------------------------------------
      [INFO] BUILD SUCCESS
      [INFO] ------------------------------------------------------------------------
      [INFO] Total time: 02:07 h
      [INFO] Finished at: 2017-02-04T05:41:43+00:00
      [INFO] Final Memory: 77M/987M
      [INFO] ------------------------------------------------------------------------
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #16795 from dongjoon-hyun/SPARK-19409-2.
    • [SPARK-19464][CORE][YARN][TEST-HADOOP2.6] Remove support for Hadoop 2.5 and earlier · e8d3fca4
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      - Remove support for Hadoop 2.5 and earlier
      - Remove reflection and code constructs only needed to support multiple versions at once
      - Update docs to reflect newer versions
      - Remove older versions' builds and profiles.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16810 from srowen/SPARK-19464.
    • [SPARK-19488][SQL] fix csv infer schema when the field is Nan/Inf etc · d60dde26
      windpiger authored
      ## What changes were proposed in this pull request?
      
      When inferring a CSV schema, Spark does not use the user-defined CSV options to parse fields such as `Inf` and `-Inf`, which should be parsed as `DoubleType`.
      
      This PR adds checks against `options.nanValue`, `options.negativeInf`, and `options.positiveInf` to determine whether a field is a `DoubleType`.
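
      A sketch of reading with these options during inference (the path is hypothetical; the option names are the standard CSV options):

      ```scala
      spark.read
        .option("inferSchema", "true")
        .option("nanValue", "NaN")
        .option("positiveInf", "Inf")
        .option("negativeInf", "-Inf")
        .csv("/tmp/numbers.csv")
      ```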
      
      ## How was this patch tested?
      Unit test added.
      
      Author: windpiger <songjun@outlook.com>
      
      Closes #16834 from windpiger/fixinferInfSchemaCsv.
  6. Feb 07, 2017
    • [SPARK-18873][SQL][TEST] New test cases for scalar subquery (part 1 of 2) - scalar subquery in SELECT clause · 266c1e73
      Nattavut Sutyanyong authored
      
      ## What changes were proposed in this pull request?
      This PR adds new test cases for scalar subquery in SELECT clause.
      
      ## How was this patch tested?
      The test results are compared with the results from another SQL engine (in this case, IBM DB2). If the results are equivalent, we assume the results are correct.
      
      Author: Nattavut Sutyanyong <nsy.can@gmail.com>
      
      Closes #16712 from nsyca/18873.
    • [SPARK-19499][SS] Add more notes in the comments of Sink.addBatch() · d4cd9757
      CodingCat authored
      ## What changes were proposed in this pull request?
      
      The `addBatch` method in the `Sink` trait is supposed to be a synchronous method, to coordinate with the fault-tolerance design in `StreamingExecution` (unlike the `compute()` method in `DStream`).
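
      For reference, a sketch of the contract the added comments emphasize (trait shape as in `org.apache.spark.sql.execution.streaming`):

      ```scala
      import org.apache.spark.sql.DataFrame

      trait Sink {
        // Must complete all work for the batch before returning, so the engine
        // can safely record the batch as committed.
        def addBatch(batchId: Long, data: DataFrame): Unit
      }
      ```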
      
      We need to add more notes in the comments of this method to remind developers of this.
      
      ## How was this patch tested?
      
      existing tests
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #16840 from CodingCat/SPARK-19499.
    • [SPARK-19413][SS] MapGroupsWithState for arbitrary stateful operations · aeb80348
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      `mapGroupsWithState` is a new API for arbitrary stateful operations in Structured Streaming, similar to `DStream.mapWithState`
      
      *Requirements*
      - Users should be able to specify a function that can do the following:
        - Access the input row corresponding to a key
        - Access the previous state corresponding to a key
        - Optionally, update or remove the state
        - Output any number of new rows (or none at all)
      
      *Proposed API*
      ```
      // ------------ New methods on KeyValueGroupedDataset ------------
      class KeyValueGroupedDataset[K, V] {
        // Scala friendly
        def mapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], KeyedState[S]) => U)
        def flatMapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], KeyedState[S]) => Iterator[U])
        // Java friendly
        def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
        def flatMapGroupsWithState[S, U](func: FlatMapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
      }

      // ------------------- New Java-friendly function classes -------------------
      public interface MapGroupsWithStateFunction<K, V, S, R> extends Serializable {
        R call(K key, Iterator<V> values, KeyedState<S> state) throws Exception;
      }
      public interface FlatMapGroupsWithStateFunction<K, V, S, R> extends Serializable {
        Iterator<R> call(K key, Iterator<V> values, KeyedState<S> state) throws Exception;
      }

      // ---------------------- Wrapper class for state data ----------------------
      trait State[S] {
        def exists(): Boolean
        def get(): S              // throws an exception if the state does not exist
        def getOption(): Option[S]
        def update(newState: S): Unit
        def remove(): Unit        // exists() will be false after this
      }
      ```
      
      Key semantics of the State class:
      - The state can be null.
      - If state.remove() is called, then state.exists() will return false, and getOption will return None.
      - After state.update(newState) is called, state.exists() will return true, and getOption will return Some(...).
      - None of the operations are thread-safe. This is to avoid memory barriers.
      
      *Usage*
      ```
      val stateFunc = (word: String, words: Iterator[String], runningCount: KeyedState[Long]) => {
        val newCount = words.size + runningCount.getOption.getOrElse(0L)
        runningCount.update(newCount)
        (word, newCount)
      }

      dataset                                                 // type is Dataset[String]
        .groupByKey[String](w => w)                           // generates KeyValueGroupedDataset[String, String]
        .mapGroupsWithState[Long, (String, Long)](stateFunc)  // returns Dataset[(String, Long)]
      ```
      
      ## How was this patch tested?
      New unit tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #16758 from tdas/mapWithState.
    • [SPARK-19397][SQL] Make option names of LIBSVM and TEXT case insensitive · e33aaa2a
      gatorsmile authored
      ### What changes were proposed in this pull request?
      Prior to Spark 2.1, the option names were case-sensitive for all formats. Since Spark 2.1, the option key names have become case-insensitive, except for the formats `Text` and `LibSVM`. This PR fixes these issues.
      
      Also, add a check on whether the input vector-type option value is legal for `LibSVM`.
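
      A usage sketch (the path is the Spark sample data file; the option value is illustrative):

      ```scala
      // Option keys for these formats now match case-insensitively, so
      // "VECTORTYPE" works the same as "vectorType".
      spark.read.format("libsvm")
        .option("vectorType", "dense")
        .load("data/mllib/sample_libsvm_data.txt")
      ```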
      
      ### How was this patch tested?
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16737 from gatorsmile/libSVMTextOptions.
    • [SPARK-18609][SPARK-18841][SQL] Fix redundant Alias removal in the optimizer · 73ee7394
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      The optimizer tries to remove redundant alias-only projections from the query plan using the `RemoveAliasOnlyProject` rule. The current rule identifies and removes such a project, and rewrites the project's attributes in the **entire** tree. This causes problems when parts of the tree are duplicated (for instance, a self join on a temporary view/CTE) and the duplicated part contains the alias-only project; in this case the rewrite will break the tree.
      
      This PR fixes these problems by using a blacklist for attributes that are not to be moved, and by making sure that attribute remapping is only done for the parent tree, and not for unrelated parts of the query plan.
      
      The current tree transformation infrastructure works very well if the transformation at hand requires little or no global contextual information. In this case we needed to know both which attributes were not to be moved and which child attributes were modified. This cannot be done easily using the current infrastructure, and solutions typically involve traversing the query plan multiple times (which is super slow). I have moved around some code in `TreeNode`, `QueryPlan` and `LogicalPlan` to make this much more straightforward; this basically allows you to manually traverse the tree.
      
      This PR subsumes the following PRs by windpiger:
      Closes https://github.com/apache/spark/pull/16267
      Closes https://github.com/apache/spark/pull/16255
      
      ## How was this patch tested?
      I have added unit tests to `RemoveRedundantAliasAndProjectSuite` and I have added integration tests to the `SQLQueryTestSuite.union` and `SQLQueryTestSuite.cte` test cases.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #16757 from hvanhovell/SPARK-18609.
    • [SPARK-19495][SQL] Make SQLConf slightly more extensible · b7277e03
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This pull request makes SQLConf slightly more extensible by removing the visibility limitations on the `build*` functions.
      
      ## How was this patch tested?
      N/A - there are no logic changes and everything should be covered by existing unit tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #16835 from rxin/SPARK-19495.