  1. Sep 15, 2017
    • Jose Torres's avatar
      [SPARK-22017] Take minimum of all watermark execs in StreamExecution. · 0bad10d3
      Jose Torres authored
      ## What changes were proposed in this pull request?
      
      Take the minimum of all watermark exec nodes as the "real" watermark in StreamExecution, rather than picking one arbitrarily.
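
      For illustration, a minimal Scala sketch of the new semantics (names hypothetical; the real logic lives in `StreamExecution`):

      ```scala
      // Hypothetical sketch: the global watermark is the minimum across all
      // event-time watermark operators, so no operator's late data is dropped early.
      case class WatermarkExecInfo(operatorId: Int, watermarkMs: Long)

      def globalWatermark(execs: Seq[WatermarkExecInfo]): Long =
        if (execs.isEmpty) 0L else execs.map(_.watermarkMs).min

      // globalWatermark(Seq(WatermarkExecInfo(0, 5000L), WatermarkExecInfo(1, 3000L))) == 3000L
      ```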
      
      ## How was this patch tested?
      
      new unit test
      
      Author: Jose Torres <jose@databricks.com>
      
      Closes #19239 from joseph-torres/SPARK-22017.
      0bad10d3
    • Wenchen Fan's avatar
      [SPARK-15689][SQL] data source v2 read path · c7307acd
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This PR adds the infrastructure for data source v2 and implements the features Spark already has in data source v1: column pruning, filter push-down, catalyst expression filter push-down, InternalRow scan, schema inference, and data size reporting. The write path is excluded to keep this PR from growing too big; it will be added in a follow-up PR.
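
      As a rough illustration of this mix-in style (trait names hypothetical, not the actual v2 interfaces): a base reader exposes a schema, and optional traits advertise capabilities such as column pruning and filter push-down.

      ```scala
      import org.apache.spark.sql.sources.Filter
      import org.apache.spark.sql.types.StructType

      // Base reader: only knows its schema.
      trait SimpleReader {
        def readSchema(): StructType
      }

      // Optional capability: Spark narrows the schema to the required columns.
      trait SupportsColumnPruning extends SimpleReader {
        def pruneColumns(requiredSchema: StructType): Unit
      }

      // Optional capability: returns the filters the source could NOT handle,
      // which Spark then evaluates itself.
      trait SupportsFilterPushDown extends SimpleReader {
        def pushFilters(filters: Array[Filter]): Array[Filter]
      }
      ```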
      
      ## How was this patch tested?
      
      new tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #19136 from cloud-fan/data-source-v2.
      c7307acd
    • Wenchen Fan's avatar
      [SPARK-21987][SQL] fix a compatibility issue of sql event logs · 3c6198c8
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      In https://github.com/apache/spark/pull/18600 we removed the `metadata` field from `SparkPlanInfo`. This causes a problem when we replay event logs that are generated by older Spark versions.
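
      A minimal sketch of the compatibility concern, assuming a Jackson-style parser for these events (class and field names hypothetical): replaying an old log must tolerate fields the current classes no longer declare.

      ```scala
      import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
      import com.fasterxml.jackson.module.scala.DefaultScalaModule

      // Hypothetical simplified plan info without the removed `metadata` field.
      case class PlanInfo(nodeName: String, simpleString: String)

      val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
        .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)

      // An event written by an older Spark still carries "metadata"; it is now skipped.
      val oldJson = """{"nodeName":"Scan","simpleString":"Scan parquet","metadata":{}}"""
      val info = mapper.readValue(oldJson, classOf[PlanInfo])
      ```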
      
      ## How was this patch tested?
      
      a regression test.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #19237 from cloud-fan/event.
      3c6198c8
    • Yuming Wang's avatar
      [SPARK-22002][SQL] Read JDBC table with custom schema: support specifying partial fields. · 4decedfd
      Yuming Wang authored
      ## What changes were proposed in this pull request?
      
      https://github.com/apache/spark/pull/18266 added support for reading a JDBC table with a custom schema, but all of the fields had to be specified. For convenience, this PR supports specifying only some of the fields.
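
      For illustration, a hedged spark-shell sketch (URL and table name hypothetical): only the listed column is overridden, while the remaining columns keep their auto-mapped types.

      ```scala
      import java.util.Properties

      val props = new Properties()
      // Override only ID; other columns keep the types inferred from JDBC metadata.
      props.put("customSchema", "ID decimal(38, 0)")
      val df = spark.read.jdbc("jdbc:oracle:thin:@//host:1521/service", "tableWithCustomSchema", props)
      ```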
      
      ## How was this patch tested?
      unit tests
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #19231 from wangyum/SPARK-22002.
      4decedfd
    • Tathagata Das's avatar
      [SPARK-22018][SQL] Preserve top-level alias metadata when collapsing projects · 88661747
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      Consider two Projects like the following.
      ```
      Project [a_with_metadata#27 AS b#26]
      +- Project [a#0 AS a_with_metadata#27]
         +- LocalRelation <empty>, [a#0, b#1]
      ```
      The child Project has an output column with metadata, and the parent Project has an alias that implicitly forwards that metadata, so the metadata is visible to higher operators. After applying the CollapseProject optimizer rule, however, the metadata is not preserved:
      ```
      Project [a#0 AS b#26]
      +- LocalRelation <empty>, [a#0, b#1]
      ```
      This is incorrect: downstream operators that rely on metadata to identify certain fields (e.g. the watermark in Structured Streaming) will fail to find them. This PR fixes the issue by preserving the metadata of top-level aliases.
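
      For illustration, a hedged spark-shell sketch (metadata key hypothetical) of an alias carrying metadata that downstream operators look up:

      ```scala
      import org.apache.spark.sql.types.MetadataBuilder
      import spark.implicits._

      val md = new MetadataBuilder().putBoolean("isEventTime", true).build() // hypothetical key
      val df = spark.range(3).select($"id".as("ts", md))
      // df.schema("ts").metadata carries the marker; CollapseProject must not drop it.
      ```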
      
      ## How was this patch tested?
      New unit test
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #19240 from tdas/SPARK-22018.
      88661747
  2. Sep 14, 2017
    • goldmedal's avatar
      [SPARK-21513][SQL][FOLLOWUP] Allow UDF to_json support converting MapType to... · a28728a9
      goldmedal authored
      [SPARK-21513][SQL][FOLLOWUP] Allow UDF to_json support converting MapType to json for PySpark and SparkR
      
      ## What changes were proposed in this pull request?
      In the previous work for SPARK-21513, we allowed `MapType` and `ArrayType` of `MapType`s to be converted to a JSON string, but only for the Scala API. This follow-up PR makes Spark SQL support it for PySpark and SparkR too. We also fix some small bugs and comments from the previous work.
      
      ### For PySpark
      ```
      >>> data = [(1, {"name": "Alice"})]
      >>> df = spark.createDataFrame(data, ("key", "value"))
      >>> df.select(to_json(df.value).alias("json")).collect()
      [Row(json=u'{"name":"Alice"}')]
      >>> data = [(1, [{"name": "Alice"}, {"name": "Bob"}])]
      >>> df = spark.createDataFrame(data, ("key", "value"))
      >>> df.select(to_json(df.value).alias("json")).collect()
      [Row(json=u'[{"name":"Alice"},{"name":"Bob"}]')]
      ```
      ### For SparkR
      ```
      # Converts a map into a JSON object
      df2 <- sql("SELECT map('name', 'Bob')) as people")
      df2 <- mutate(df2, people_json = to_json(df2$people))
      # Converts an array of maps into a JSON array
      df2 <- sql("SELECT array(map('name', 'Bob'), map('name', 'Alice')) as people")
      df2 <- mutate(df2, people_json = to_json(df2$people))
      ```
      ## How was this patch tested?
      Add unit test cases.
      
      cc viirya HyukjinKwon
      
      Author: goldmedal <liugs963@gmail.com>
      
      Closes #19223 from goldmedal/SPARK-21513-fp-PySaprkAndSparkR.
      a28728a9
    • Jose Torres's avatar
      [SPARK-21988] Add default stats to StreamingExecutionRelation. · 054ddb2f
      Jose Torres authored
      ## What changes were proposed in this pull request?
      
      Add default stats to StreamingExecutionRelation.
      
      ## How was this patch tested?
      
      existing unit tests and an explain() test to be sure
      
      Author: Jose Torres <jose@databricks.com>
      
      Closes #19212 from joseph-torres/SPARK-21988.
      054ddb2f
    • Zhenhua Wang's avatar
      [SPARK-17642][SQL][FOLLOWUP] drop test tables and improve comments · ddd7f5e1
      Zhenhua Wang authored
      ## What changes were proposed in this pull request?
      
      Drop test tables and improve comments.
      
      ## How was this patch tested?
      
      Modified existing test.
      
      Author: Zhenhua Wang <wangzhenhua@huawei.com>
      
      Closes #19213 from wzhfy/useless_comment.
      ddd7f5e1
    • gatorsmile's avatar
      [SPARK-4131][FOLLOW-UP] Support "Writing data into the filesystem from queries" · 4e6fc690
      gatorsmile authored
      ## What changes were proposed in this pull request?
      This PR cleans up the code from https://github.com/apache/spark/pull/18975.
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19225 from gatorsmile/refactorSPARK-4131.
      4e6fc690
    • Dilip Biswal's avatar
      [MINOR][SQL] Only populate type metadata for required types such as CHAR/VARCHAR. · dcbb2294
      Dilip Biswal authored
      ## What changes were proposed in this pull request?
      When reading column descriptions from the Hive catalog, we currently populate the metadata for all types in order to record the raw Hive type string. For processing, we only need this additional metadata for CHAR/VARCHAR types, or for complex types containing CHAR/VARCHAR types.

      It's a minor cleanup; I haven't created a JIRA for it.
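
      A hedged sketch of the intent (helper name hypothetical): record the raw Hive type string only when CHAR/VARCHAR is involved, since that is where the length information matters.

      ```scala
      // Returns true if the raw Hive type string needs to be preserved in metadata.
      def needsRawTypeMetadata(rawHiveType: String): Boolean = {
        val t = rawHiveType.toLowerCase
        t.contains("char(") || t.contains("varchar(")
      }

      // needsRawTypeMetadata("varchar(10)")    == true
      // needsRawTypeMetadata("array<char(5)>") == true
      // needsRawTypeMetadata("int")            == false
      ```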
      
      ## How was this patch tested?
      Test added in HiveMetastoreCatalogSuite
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #19215 from dilipbiswal/column_metadata.
      dcbb2294
  3. Sep 13, 2017
    • Takeshi Yamamuro's avatar
      [SPARK-21973][SQL] Add an new option to filter queries in TPC-DS · 8be7e6bb
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR adds a new option for selecting which TPC-DS queries `TPCDSQueryBenchmark` runs. By default, `TPCDSQueryBenchmark` runs all the TPC-DS queries; with this option, developers can run only a subset of them, e.g. q2, q4, and q6:
      ```
      spark-submit --class <this class> --conf spark.sql.tpcds.queryFilter="q2,q4,q6" --jars <spark sql test jar>
      ```
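
      A minimal sketch of the selection logic (helper name hypothetical), independent of how the conf value is read:

      ```scala
      // Keep only the queries named in the comma-separated filter; empty means "all".
      def selectQueries(all: Seq[String], filter: String): Seq[String] =
        if (filter.trim.isEmpty) all
        else {
          val wanted = filter.split(",").map(_.trim).toSet
          all.filter(wanted)
        }

      // selectQueries((1 to 99).map(i => s"q$i"), "q2,q4,q6") == Seq("q2", "q4", "q6")
      ```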
      
      ## How was this patch tested?
      Manually checked.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #19188 from maropu/RunPartialQueriesInTPCDS.
      8be7e6bb
    • Yuming Wang's avatar
      [SPARK-20427][SQL] Read JDBC table using custom schema · 17edfec5
      Yuming Wang authored
      ## What changes were proposed in this pull request?
      
      The schema auto-generated for Oracle tables is sometimes not what we expect:

      - `number(1)` is auto-mapped to BooleanType, which is sometimes not what we want, per [SPARK-20921](https://issues.apache.org/jira/browse/SPARK-20921).
      - `number` is auto-mapped to Decimal(38,10), which cannot read large values, per [SPARK-20427](https://issues.apache.org/jira/browse/SPARK-20427).

      This PR fixes the issue by supporting a custom schema, as follows:
      ```scala
      val props = new Properties()
      props.put("customSchema", "ID decimal(38, 0), N1 int, N2 boolean")
      val dfRead = spark.read.jdbc(jdbcUrl, "tableWithCustomSchema", props)
      dfRead.show()
      ```
      or
      ```sql
      CREATE TEMPORARY VIEW tableWithCustomSchema
      USING org.apache.spark.sql.jdbc
      OPTIONS (url '$jdbcUrl', dbTable 'tableWithCustomSchema', customSchema 'ID decimal(38, 0), N1 int, N2 boolean')
      ```
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #18266 from wangyum/SPARK-20427.
      17edfec5
    • Jane Wang's avatar
      [SPARK-4131] Merge HiveTmpFile.scala to SaveAsHiveFile.scala · 8c7e19a3
      Jane Wang authored
      ## What changes were proposed in this pull request?
      
      The code from https://github.com/apache/spark/pull/18975 is already merged to master.

      This is a follow-up PR that merges HiveTmpFile.scala into SaveAsHiveFile.scala.
      
      ## How was this patch tested?
      
      Builds successfully.
      
      Author: Jane Wang <janewang@fb.com>
      
      Closes #19221 from janewangfb/merge_savehivefile_hivetmpfile.
      8c7e19a3
    • donnyzone's avatar
      [SPARK-21980][SQL] References in grouping functions should be indexed with semanticEquals · 21c4450f
      donnyzone authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-21980
      
      This PR fixes the issue in ResolveGroupingAnalytics rule, which indexes the column references in grouping functions without considering case sensitive configurations.
      
      The problem can be reproduced by:
      
      `val df = spark.createDataFrame(Seq((1, 1), (2, 1), (2, 2))).toDF("a", "b")`
      `df.cube("a").agg(grouping("A")).show()`
      
      ## How was this patch tested?
      unit tests
      
      Author: donnyzone <wellfengzhu@gmail.com>
      
      Closes #19202 from DonnyZone/ResolveGroupingAnalytics.
      21c4450f
    • Armin's avatar
      [SPARK-21970][CORE] Fix Redundant Throws Declarations in Java Codebase · b6ef1f57
      Armin authored
      ## What changes were proposed in this pull request?
      
      1. Removing all redundant throws declarations from Java codebase.
      2. Removing dead code made visible by this from `ShuffleExternalSorter#closeAndGetSpills`
      
      ## How was this patch tested?
      
      Build still passes.
      
      Author: Armin <me@obrown.io>
      
      Closes #19182 from original-brownbear/SPARK-21970.
      b6ef1f57
  4. Sep 12, 2017
    • goldmedal's avatar
      [SPARK-21513][SQL] Allow UDF to_json support converting MapType to json · 371e4e20
      goldmedal authored
      ## What changes were proposed in this pull request?
      UDF `to_json` currently supports converting only `StructType` or `ArrayType` of `StructType`s to a JSON output string. Per the discussion in JIRA SPARK-21513, this PR allows `to_json` to also support converting `MapType` and `ArrayType` of `MapType`s to a JSON output string.
      This PR is for the SQL and Scala APIs only.
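
      For reference, a hedged spark-shell example of the new behavior (output shown as comments):

      ```scala
      import org.apache.spark.sql.functions.to_json
      import spark.implicits._

      val df = Seq((1, Map("name" -> "Alice"))).toDF("key", "value")
      df.select(to_json($"value").alias("json")).show(false)
      // +----------------+
      // |json            |
      // +----------------+
      // |{"name":"Alice"}|
      // +----------------+
      ```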
      
      ## How was this patch tested?
      Adding unit test case.
      
      cc viirya HyukjinKwon
      
      Author: goldmedal <liugs963@gmail.com>
      Author: Jia-Xuan Liu <liugs963@gmail.com>
      
      Closes #18875 from goldmedal/SPARK-21513.
      371e4e20
    • Wang Gengliang's avatar
      [SPARK-21979][SQL] Improve QueryPlanConstraints framework · 1a985747
      Wang Gengliang authored
      ## What changes were proposed in this pull request?
      
      Improve QueryPlanConstraints framework, make it robust and simple.
      In https://github.com/apache/spark/pull/15319, constraints for expressions like `a = f(b, c)` were resolved.
      However, for expressions like
      ```scala
      a = f(b, c) && c = g(a, b)
      ```
      The current QueryPlanConstraints framework will produce non-converging constraints.
      Essentially, the problem is caused by having both the name and the child of an alias in the same constraint set: we infer constraints, push them down as predicates in filters, and later those predicates are propagated as constraints again, and so on.
      Using only the alias names resolves the problem. The size of the constraint set is reduced without losing any information, and we can always recover the inferred constraints on the child of an alias when pushing down filters.
      
      Also, the EqualNullSafe constraint between an alias's name and its child, added when propagating aliases, is meaningless:
      ```scala
      allConstraints += EqualNullSafe(e, a.toAttribute)
      ```
      It just produces redundant constraints.
      
      ## How was this patch tested?
      
      Unit test
      
      Author: Wang Gengliang <ltnwgl@gmail.com>
      
      Closes #19201 from gengliangwang/QueryPlanConstraints.
      1a985747
    • sarutak's avatar
      [SPARK-21368][SQL] TPCDSQueryBenchmark can't refer query files. · b9b54b1c
      sarutak authored
      ## What changes were proposed in this pull request?
      
      TPCDSQueryBenchmark packaged into a jar doesn't work with spark-submit, because the query files cannot be referenced once they are packaged inside the jar.
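
      A hedged sketch of the usual remedy (resource path hypothetical): load the query files from the classpath rather than from a filesystem path, so they resolve inside a jar as well.

      ```scala
      import scala.io.Source

      def loadQuery(name: String): String = {
        // Resolves both on a plain classpath and inside a packaged jar.
        val in = getClass.getResourceAsStream(s"/tpcds/$name.sql")
        try Source.fromInputStream(in, "UTF-8").mkString
        finally in.close()
      }
      ```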
      
      ## How was this patch tested?
      
      Ran the benchmark.
      
      Author: sarutak <sarutak@oss.nttdata.co.jp>
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #18592 from sarutak/fix-tpcds-benchmark.
      b9b54b1c
    • Zhenhua Wang's avatar
      [SPARK-17642][SQL] support DESC EXTENDED/FORMATTED table column commands · 515910e9
      Zhenhua Wang authored
      ## What changes were proposed in this pull request?
      
      Support the `DESC [EXTENDED | FORMATTED] <table> <column>` command, where `EXTENDED` and `FORMATTED` additionally show column-level statistics.
      Describing nested columns is NOT supported.
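
      For illustration, a hedged spark-shell example (table and column names hypothetical):

      ```scala
      sql("DESC EXTENDED customers name").show()   // includes column-level statistics
      sql("DESC FORMATTED customers name").show()
      ```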
      
      ## How was this patch tested?
      
      Added test cases.
      
      Author: Zhenhua Wang <wzh_zju@163.com>
      Author: Zhenhua Wang <wangzhenhua@huawei.com>
      Author: wangzhenhua <wangzhenhua@huawei.com>
      
      Closes #16422 from wzhfy/descColumn.
      515910e9
    • Jen-Ming Chung's avatar
      [SPARK-21610][SQL][FOLLOWUP] Corrupt records are not handled properly when... · 7d0a3ef4
      Jen-Ming Chung authored
      [SPARK-21610][SQL][FOLLOWUP] Corrupt records are not handled properly when creating a dataframe from a file
      
      ## What changes were proposed in this pull request?
      
      When the `requiredSchema` only contains `_corrupt_record`, the derived `actualSchema` is empty and `_corrupt_record` is null for all rows. This PR detects this situation and raises an exception with a reasonable workaround message, so that users know what happened and how to fix the query.
      
      ## How was this patch tested?
      
      Added unit test in `CSVSuite`.
      
      Author: Jen-Ming Chung <jenmingisme@gmail.com>
      
      Closes #19199 from jmchung/SPARK-21610-FOLLOWUP.
      7d0a3ef4
  5. Sep 11, 2017
    • caoxuewen's avatar
      [MINOR][SQL] remove unused imports · dc74c0e6
      caoxuewen authored
      ## What changes were proposed in this pull request?
      
      This PR removes imports that are unused.
      
      ## How was this patch tested?
      
      N/A
      
      Author: caoxuewen <cao.xuewen@zte.com.cn>
      
      Closes #19131 from heary-cao/unuse_import.
      dc74c0e6
  6. Sep 10, 2017
    • Jen-Ming Chung's avatar
      [SPARK-21610][SQL] Corrupt records are not handled properly when creating a dataframe from a file · 6273a711
      Jen-Ming Chung authored
      ## What changes were proposed in this pull request?
      ```
      echo '{"field": 1}
      {"field": 2}
      {"field": "3"}' >/tmp/sample.json
      ```
      
      ```scala
      import org.apache.spark.sql.types._
      
      val schema = new StructType()
        .add("field", ByteType)
        .add("_corrupt_record", StringType)
      
      val file = "/tmp/sample.json"
      
      val dfFromFile = spark.read.schema(schema).json(file)
      
      scala> dfFromFile.show(false)
      +-----+---------------+
      |field|_corrupt_record|
      +-----+---------------+
      |1    |null           |
      |2    |null           |
      |null |{"field": "3"} |
      +-----+---------------+
      
      scala> dfFromFile.filter($"_corrupt_record".isNotNull).count()
      res1: Long = 0
      
      scala> dfFromFile.filter($"_corrupt_record".isNull).count()
      res2: Long = 3
      ```
      When the `requiredSchema` only contains `_corrupt_record`, the derived `actualSchema` is empty and `_corrupt_record` is null for all rows. This PR detects this situation and raises an exception with a reasonable workaround message, so that users know what happened and how to fix the query.
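
      A hedged sketch of the kind of workaround such a message can suggest: materialize the parsed result first, then query the corrupt-record column.

      ```scala
      // Cache (or save) the fully parsed DataFrame; afterwards the internal
      // corrupt-record column can be queried on its own.
      val cached = dfFromFile.cache()
      cached.filter($"_corrupt_record".isNotNull).count() // counts the corrupt rows
      ```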
      
      ## How was this patch tested?
      
      Added test case.
      
      Author: Jen-Ming Chung <jenmingisme@gmail.com>
      
      Closes #18865 from jmchung/SPARK-21610.
      6273a711
  7. Sep 09, 2017
    • Jane Wang's avatar
      [SPARK-4131] Support "Writing data into the filesystem from queries" · f7679055
      Jane Wang authored
      ## What changes were proposed in this pull request?
      
      This PR implements the SQL feature:
      INSERT OVERWRITE [LOCAL] DIRECTORY directory1
        [ROW FORMAT row_format] [STORED AS file_format]
        SELECT ... FROM ...
      
      ## How was this patch tested?
      Added new unit tests, and also pulled the code into fb-spark so that we could test writing to an HDFS directory.
      
      Author: Jane Wang <janewang@fb.com>
      
      Closes #18975 from janewangfb/port_local_directory.
      f7679055
    • Yanbo Liang's avatar
      [MINOR][SQL] Correct DataFrame doc. · e4d8f9a3
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Correct DataFrame doc.
      
      ## How was this patch tested?
      Only doc change, no tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #19173 from yanboliang/df-doc.
      e4d8f9a3
    • Liang-Chi Hsieh's avatar
      [SPARK-21954][SQL] JacksonUtils should verify MapType's value type instead of key type · 6b45d7e9
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      `JacksonUtils.verifySchema` verifies if a data type can be converted to JSON. For `MapType`, it now verifies the key type. However, in `JacksonGenerator`, when converting a map to JSON, we only care about its values and create a writer for the values. The keys in a map are treated as strings by calling `toString` on the keys.
      
      Thus, we should change `JacksonUtils.verifySchema` to verify the value type of `MapType`.
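
      As an illustration of the described behavior, a hedged spark-shell sketch (relies on the MapType support in `to_json`): keys are rendered via `toString`, so only the value type needs verification.

      ```scala
      import org.apache.spark.sql.functions.to_json
      import spark.implicits._

      val df = Seq(Map(1 -> "a", 2 -> "b")).toDF("m")
      df.select(to_json($"m")).show(false) // integer keys 1 and 2 become "1" and "2"
      ```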
      
      ## How was this patch tested?
      
      Added tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #19167 from viirya/test-jacksonutils.
      6b45d7e9
    • Andrew Ash's avatar
      [SPARK-21941] Stop storing unused attemptId in SQLTaskMetrics · 8a5eb506
      Andrew Ash authored
      ## What changes were proposed in this pull request?
      
      In a driver heap dump containing 390,105 instances of SQLTaskMetrics this
      would have saved me approximately 3.2MB of memory.
      
      Since we're not getting any benefit from storing this unused value, let's
      eliminate it until a future PR makes use of it.
      
      ## How was this patch tested?
      
      Existing unit tests
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #19153 from ash211/aash/trim-sql-listener.
      8a5eb506
  8. Sep 08, 2017
    • Kazuaki Ishizaki's avatar
      [SPARK-21946][TEST] fix flaky test: "alter table: rename cached table" in InMemoryCatalogedDDLSuite · 8a4f228d
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR fixes the flaky test `InMemoryCatalogedDDLSuite "alter table: rename cached table"`.
      Since this test validates a distributed DataFrame, the result should be checked with `checkAnswer`. The original version compared the result of `df.collect()`, which does not guarantee the order of the elements.
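
      A hedged toy illustration of why the comparison must be order-insensitive:

      ```scala
      import org.apache.spark.sql.Row
      import spark.implicits._

      val df = Seq(2, 1).toDF("v")
      // Comparing Seqs positionally can flake when row order is not deterministic;
      // comparing as sets (or using checkAnswer in tests) is stable.
      assert(df.collect().toSet == Set(Row(1), Row(2)))
      ```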
      
      ## How was this patch tested?
      
      Use existing test case
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #19159 from kiszk/SPARK-21946.
      8a4f228d
    • Liang-Chi Hsieh's avatar
      [SPARK-21726][SQL][FOLLOW-UP] Check for structural integrity of the plan in Optimizer in test mode · 0dfc1ec5
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      The condition in `Optimizer.isPlanIntegral` is wrong. We should always return `true` if not in test mode.
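
      A minimal sketch of the corrected condition (names hypothetical): outside test mode the check must pass trivially.

      ```scala
      // Returns true when the plan is acceptable: always outside test mode,
      // and only for structurally sound (e.g. resolved) plans in test mode.
      def isPlanIntegral(inTestMode: Boolean, planIsResolved: Boolean): Boolean =
        !inTestMode || planIsResolved
      ```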
      
      ## How was this patch tested?
      
      Manually test.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #19161 from viirya/SPARK-21726-followup.
      0dfc1ec5
    • Wenchen Fan's avatar
      [SPARK-21936][SQL] backward compatibility test framework for HiveExternalCatalog · dbb82412
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      `HiveExternalCatalog` is a semi-public interface. When creating tables, `HiveExternalCatalog` converts the table metadata to the Hive table format and saves it into the Hive metastore. It's very important to guarantee backward compatibility here, i.e., tables created by previous Spark versions should still be readable in newer Spark versions.

      Previously we found backward compatibility issues manually, which made it easy to miss bugs. This PR introduces a test framework to automatically test `HiveExternalCatalog` backward compatibility: it downloads Spark binaries of different versions, creates tables with those Spark versions, and reads the tables back with the current Spark version.
      
      ## How was this patch tested?
      
      test-only change
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #19148 from cloud-fan/test.
      dbb82412
    • Liang-Chi Hsieh's avatar
      [SPARK-21726][SQL] Check for structural integrity of the plan in Optimizer in test mode. · 6e37524a
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      We now have many optimization rules in `Optimizer`, but no checks in the optimizer for the structural integrity of the plan (e.g. that it stays resolved). When debugging, it is difficult to identify which rules return invalid plans.
      
      It would be great if in test mode, we can check whether a plan is still resolved after the execution of each rule, so we can catch rules that return invalid plans.
      
      ## How was this patch tested?
      
      Added tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18956 from viirya/SPARK-21726.
      6e37524a
    • liuxian's avatar
      [SPARK-21949][TEST] Tables created in unit tests should be dropped after use · f62b20f3
      liuxian authored
      ## What changes were proposed in this pull request?
       Tables should be dropped after use in unit tests.
      ## How was this patch tested?
      N/A
      
      Author: liuxian <liu.xian3@zte.com.cn>
      
      Closes #19155 from 10110346/droptable.
      f62b20f3
  9. Sep 07, 2017
    • Dongjoon Hyun's avatar
      [SPARK-21939][TEST] Use TimeLimits instead of Timeouts · c26976fe
      Dongjoon Hyun authored
      Since ScalaTest 3.0.0, `org.scalatest.concurrent.Timeouts` is deprecated.
      This PR replaces the deprecated one with `org.scalatest.concurrent.TimeLimits`.
      
      ```scala
      -import org.scalatest.concurrent.Timeouts._
      +import org.scalatest.concurrent.TimeLimits._
      ```
      
      Pass the existing test suites.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #19150 from dongjoon-hyun/SPARK-21939.
      
      Change-Id: I1a1b07f1b97e51e2263dfb34b7eaaa099b2ded5e
      c26976fe
    • Dongjoon Hyun's avatar
      [SPARK-13656][SQL] Delete spark.sql.parquet.cacheMetadata from SQLConf and docs · e00f1a1d
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Since [SPARK-15639](https://github.com/apache/spark/pull/13701), `spark.sql.parquet.cacheMetadata` and `PARQUET_CACHE_METADATA` are not used. This PR removes them from SQLConf and the docs.
      
      ## How was this patch tested?
      
      Pass the existing Jenkins.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #19129 from dongjoon-hyun/SPARK-13656.
      e00f1a1d
    • Dongjoon Hyun's avatar
      [SPARK-21912][SQL] ORC/Parquet table should not create invalid column names · eea2b877
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, users hit job abortions while creating or altering ORC/Parquet tables with invalid column names. We had better prevent this by raising an **AnalysisException**, with guidance to use aliases instead, as Parquet data source tables already do.
      
      **BEFORE**
      ```scala
      scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`")
      17/09/04 13:28:21 ERROR Utils: Aborting task
      java.lang.IllegalArgumentException: Error: : expected at the position 8 of 'struct<a b:int>' but ' ' is found.
      17/09/04 13:28:21 ERROR FileFormatWriter: Job job_20170904132821_0001 aborted.
      17/09/04 13:28:21 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
      org.apache.spark.SparkException: Task failed while writing rows.
      ```
      
      **AFTER**
      ```scala
      scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`")
      17/09/04 13:27:40 ERROR CreateDataSourceTableAsSelectCommand: Failed to write to table orc1
      org.apache.spark.sql.AnalysisException: Attribute name "a b" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
      ```
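
      A hedged sketch of the validation idea (exception type simplified; the character set comes from the error message above):

      ```scala
      val invalidChars = " ,;{}()\n\t="

      // Fail fast at analysis time instead of aborting the write job later.
      def checkColumnName(name: String): Unit =
        require(!name.exists(c => invalidChars.contains(c)),
          s"""Attribute name "$name" contains invalid character(s). Please use alias to rename it.""")

      // checkColumnName("a b") throws; checkColumnName("a_b") passes.
      ```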
      
      ## How was this patch tested?
      
      Pass the Jenkins with a new test case.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #19124 from dongjoon-hyun/SPARK-21912.
      eea2b877
    • Liang-Chi Hsieh's avatar
      [SPARK-21835][SQL][FOLLOW-UP] RewritePredicateSubquery should not produce unresolved query plans · ce7293c1
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up of #19050 to deal with the `ExistenceJoin` case.
      
      ## How was this patch tested?
      
      Added test.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #19151 from viirya/SPARK-21835-followup.
      ce7293c1
  10. Sep 06, 2017
    • Jacek Laskowski's avatar
      [SPARK-21901][SS] Define toString for StateOperatorProgress · fa0092bd
      Jacek Laskowski authored
      ## What changes were proposed in this pull request?
      
      Just `StateOperatorProgress.toString` plus a few formatting fixes.
      
      ## How was this patch tested?
      
      Local build. Waiting for OK from Jenkins.
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #19112 from jaceklaskowski/SPARK-21901-StateOperatorProgress-toString.
      fa0092bd
    • Jose Torres's avatar
      [SPARK-21765] Check that optimization doesn't affect isStreaming bit. · acdf45fb
      Jose Torres authored
      ## What changes were proposed in this pull request?
      
      Add an assert in logical plan optimization that the isStreaming bit stays the same, and fix empty relation rules where that wasn't happening.
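
      A hedged sketch of the asserted invariant (plan variables hypothetical):

      ```scala
      // After optimization, a streaming plan must still be streaming (and vice versa).
      assert(optimizedPlan.isStreaming == analyzedPlan.isStreaming,
        "optimization should not change the isStreaming flag of a logical plan")
      ```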
      
      ## How was this patch tested?
      
      new and existing unit tests
      
      Author: Jose Torres <joseph.torres@databricks.com>
      Author: Jose Torres <joseph-torres@databricks.com>
      
      Closes #19056 from joseph-torres/SPARK-21765-followup.
      acdf45fb
    • Liang-Chi Hsieh's avatar
      [SPARK-21835][SQL] RewritePredicateSubquery should not produce unresolved query plans · f2e22aeb
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      Correlated predicate subqueries are rewritten into `Join` by the rule `RewritePredicateSubquery`  during optimization.
      
      It is possible that the two sides of the `Join` have conflicting attributes, in which case the query plans produced by `RewritePredicateSubquery` become unresolved and break structural integrity.
      
      We should check if there are conflicting attributes in the `Join` and de-duplicate them by adding a `Project`.
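
      A conceptual spark-shell sketch of the de-duplication (the real rule works on catalyst plans, not the DataFrame API): re-alias one side behind a projection so the join's attributes stay unique.

      ```scala
      import spark.implicits._

      val left   = spark.range(3).toDF("id")
      val right  = spark.range(3).toDF("id").select($"id".as("id_dedup")) // de-duplicate
      val joined = left.join(right, $"id" === $"id_dedup")
      ```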
      
      ## How was this patch tested?
      
      Added tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #19050 from viirya/SPARK-21835.
      f2e22aeb
  11. Sep 05, 2017
    • jerryshao's avatar
      [SPARK-18061][THRIFTSERVER] Add spnego auth support for ThriftServer thrift/http protocol · 6a232544
      jerryshao authored
      Spark ThriftServer doesn't support SPNEGO auth for the thrift/http protocol, which is mainly needed for the knox + thriftserver scenario. HiveServer2's CLIService already contains code to support it, so this PR copies that code into the Spark ThriftServer.
      
      Related Hive JIRA: HIVE-6697.
      
      Manual verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #18628 from jerryshao/SPARK-21407.
      
      Change-Id: I61ef0c09f6972bba982475084a6b0ae3a74e385e
      6a232544
    • Xingbo Jiang's avatar
      [SPARK-21652][SQL] Fix rule conflict between InferFiltersFromConstraints and ConstantPropagation · fd60d4fa
      Xingbo Jiang authored
      ## What changes were proposed in this pull request?
      
      For the example below, the predicate added by `InferFiltersFromConstraints` is later folded by `ConstantPropagation`, which leads to a non-converging optimizer iteration:
      ```scala
      Seq((1, 1)).toDF("col1", "col2").createOrReplaceTempView("t1")
      Seq(1, 2).toDF("col").createOrReplaceTempView("t2")
      sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 = t2.col AND t1.col2 = t2.col")
      ```
      
      We can fix this by adjusting how these optimizer rules interact.
      
      ## How was this patch tested?
      
      Added a test case to `SQLQuerySuite` that would previously have failed.
      
      Author: Xingbo Jiang <xingbo.jiang@databricks.com>
      
      Closes #19099 from jiangxb1987/unconverge-optimization.
      fd60d4fa