  1. Jul 29, 2017
    • Takeshi Yamamuro's avatar
      [SPARK-20962][SQL] Support subquery column aliases in FROM clause · 6550086b
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This pr added parsing rules to support subquery column aliases in the FROM clause.
      This pr is a sub-task of #18079.
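
      For illustration, a minimal usage sketch of the syntax this enables (the subquery and alias names are illustrative):
      ```
      // Column aliases `a` and `b` are assigned to the subquery output in the FROM clause.
      spark.sql("SELECT t.a, t.b FROM (SELECT 1, 2) AS t(a, b)").show()
      ```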
      
      ## How was this patch tested?
      Added tests in `PlanParserSuite` and `SQLQueryTestSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #18185 from maropu/SPARK-20962.
      6550086b
    • Xingbo Jiang's avatar
      [SPARK-19451][SQL] rangeBetween method should accept Long value as boundary · 92d85637
      Xingbo Jiang authored
      ## What changes were proposed in this pull request?
      
      Long values can be passed to `rangeBetween` as range frame boundaries, but we silently convert them to Int values, which can cause wrong results. We should fix this.
      
      Furthermore, we should accept any legal literal value as a range frame boundary. In this PR, we make this possible for Long values and make it easy to add support for other DataTypes.
      
      This PR is mostly based on Herman's previous amazing work: https://github.com/hvanhovell/spark/commit/596f53c339b1b4629f5651070e56a8836a397768
      
      After this has been merged, we can close #16818 .
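
      A hedged usage sketch of what this enables (`df`, its columns, and the data are assumed; `ts` is assumed numeric so a range frame applies):
      ```
      import org.apache.spark.sql.expressions.Window
      import org.apache.spark.sql.functions.sum
      import spark.implicits._
      
      // A boundary larger than Int.MaxValue; before this change it was silently narrowed to Int.
      val w = Window.partitionBy($"key").orderBy($"ts")
        .rangeBetween(-3000000000L, Window.currentRow)
      val result = df.withColumn("running_sum", sum($"value").over(w))
      ```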
      
      ## How was this patch tested?
      
      Add new tests in `DataFrameWindowFunctionsSuite` and `TypeCoercionSuite`.
      
      Author: Xingbo Jiang <xingbo.jiang@databricks.com>
      
      Closes #18540 from jiangxb1987/rangeFrame.
      92d85637
    • Liang-Chi Hsieh's avatar
      [SPARK-21555][SQL] RuntimeReplaceable should be compared semantically by its canonicalized child · 9c8109ef
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      When aliases (added for nested fields) appear as parameters of a `RuntimeReplaceable`, they are not part of its children, so the analyzer rule `CleanupAliases` cannot clean them up.
      
      As a result, an expression such as `nvl(foo.foo1, "value")` can be resolved to two semantically different expressions in a group-by query, because the two occurrences contain different aliases.
      
      Because those aliases are not children of `RuntimeReplaceable`, which is a `UnaryExpression`, we cannot trim them out by simply transforming the expressions in `CleanupAliases`.
      
      Replacing the non-children aliases in `RuntimeReplaceable` would require adding more code to `RuntimeReplaceable` and modifying all `RuntimeReplaceable` expressions, which makes the interface ugly IMO.
      
      Considering that those aliases will be replaced later during optimization and are therefore harmless, this patch simply overrides `canonicalized` of `RuntimeReplaceable`.
      
      One remaining concern is `CleanupAliases`: it actually cannot clean up ALL aliases inside a plan. To make callers of this rule aware of that, this patch adds a comment to `CleanupAliases`.
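
      A hypothetical reproduction of the affected query shape (table and column names are illustrative): the same `nvl` over a nested field appears in both the grouping expressions and the select list, and before this patch the two occurrences could be treated as semantically different.
      ```
      // `foo` is assumed to be a struct column with a nested field `foo1` in table `t`.
      spark.sql("""
        SELECT nvl(foo.foo1, 'value'), count(*)
        FROM t
        GROUP BY nvl(foo.foo1, 'value')
      """)
      ```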
      
      ## How was this patch tested?
      
      Added test.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18761 from viirya/SPARK-21555.
      9c8109ef
  2. Jul 27, 2017
    • Wenchen Fan's avatar
      [SPARK-21319][SQL] Fix memory leak in sorter · 9f5647d6
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      `UnsafeExternalSorter.recordComparator` can be either `KVComparator` or `RowComparator`, and both of them keep references to the input rows they compared last time.
      
      After sorting, we return the sorted iterator to upstream operators. However, the upstream operators may take a while to fully consume the sorted iterator, and `UnsafeExternalSorter` is registered to `TaskContext` [here](https://github.com/apache/spark/blob/v2.2.0/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L159-L161), which means we keep the `UnsafeExternalSorter` instance, and with it the last compared input rows, in memory until the sorted iterator is fully consumed.
      
      Things get worse if we sort within partitions of a dataset and then coalesce all partitions into one, as we keep a lot of input rows in memory and it takes a long time to fully consume all the sorted iterators.
      
      This PR takes over https://github.com/apache/spark/pull/18543 . The idea is that we do not keep the record comparator instance in `UnsafeExternalSorter`, but a generator of record comparators.
      
      close #18543
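
      A minimal, self-contained sketch of that idea (toy classes only, not Spark's actual sorter): the sorter holds a comparator factory and builds a short-lived comparator per sort, so no comparator instance outlives the sort and pins the last-compared rows.
      ```
      class SimpleSorter[T](makeComparator: () => Ordering[T]) {
        def sort(data: Seq[T]): Seq[T] = {
          val cmp = makeComparator()   // fresh comparator per sort, discarded afterwards
          data.sorted(cmp)
        }
      }
      
      // usage
      val sorter = new SimpleSorter[Int](() => Ordering.Int)
      val sorted = sorter.sort(Seq(3, 1, 2))
      ```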
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18679 from cloud-fan/memory-leak.
      9f5647d6
    • Kazuaki Ishizaki's avatar
      [SPARK-21271][SQL] Ensure Unsafe.sizeInBytes is a multiple of 8 · ebbe589d
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR ensures that `Unsafe.sizeInBytes` is a multiple of 8. If this is not satisfied, `Unsafe.hashCode` causes an assertion violation.
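
      A small sketch of the invariant (the helper below is illustrative, not the patch itself): sizes are padded up to the next multiple of 8 so that word-aligned access and hashing hold.
      ```
      // Round a byte size up to the next multiple of 8 (word alignment).
      def roundUpTo8(sizeInBytes: Int): Int = (sizeInBytes + 7) / 8 * 8
      
      assert(roundUpTo8(13) == 16)
      assert(roundUpTo8(16) == 16)
      ```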
      
      ## How was this patch tested?
      
      Will add test cases
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #18503 from kiszk/SPARK-21271.
      ebbe589d
  3. Jul 25, 2017
    • gatorsmile's avatar
      [SPARK-20586][SQL] Add deterministic to ScalaUDF · ebc24a9b
      gatorsmile authored
      ### What changes were proposed in this pull request?
      Like [Hive UDFType](https://hive.apache.org/javadocs/r2.0.1/api/org/apache/hadoop/hive/ql/udf/UDFType.html), we should allow users to add the extra flags for ScalaUDF and JavaUDF too. _stateful_/_impliesOrder_ are not applicable to our Scala UDF. Thus, we only add the following two flags.
      
      - deterministic: Certain optimizations should not be applied if the UDF is not deterministic. A deterministic UDF returns the same result each time it is invoked with a particular input. This determinism only needs to hold within the context of a query.
      
      When the deterministic flag is not correctly set, the results could be wrong.
      
      For ScalaUDF in Dataset APIs, users can call the following extra APIs for `UserDefinedFunction` to make the corresponding changes.
      - `nonDeterministic`: Updates UserDefinedFunction to non-deterministic.
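
      A hedged usage sketch based on the API named above (the exact method name and call syntax follow this description and are assumed; the UDF body is illustrative):
      ```
      import org.apache.spark.sql.functions.udf
      
      // Mark a UDF as non-deterministic so optimizations that assume determinism are skipped.
      val randomish = udf(() => scala.util.Random.nextDouble()).nonDeterministic()
      ```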
      
      Also fixed the Java UDF name loss issue.
      
      Will submit a separate PR for `distinctLike` for UDAFs.
      
      ### How was this patch tested?
      Added test cases for both ScalaUDF and Java UDF.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: Wenchen Fan <cloud0fan@gmail.com>
      
      Closes #17848 from gatorsmile/udfRegister.
      ebc24a9b
  4. Jul 23, 2017
    • pj.fanning's avatar
      [SPARK-20871][SQL] limit logging of Janino code · 2a53fbfc
      pj.fanning authored
      ## What changes were proposed in this pull request?
      
      When the generated code is greater than 64k, the Janino compile fails and CodeGenerator.scala logs the entire code at Error level.
      SPARK-20871 suggests only logging the code at Debug level.
      Since the code is already logged at debug level, this pull request proposes not including the formatted code in the Error logging and exception message at all.
      When an exception occurs, the code will be logged at Info level, but truncated if it is more than 1000 lines long.
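
      A minimal sketch of that truncation behaviour (the helper name and message format are illustrative; the real change lives in CodeFormatter/CodeGenerator):
      ```
      // Keep at most `maxLines` lines of generated code when logging on error.
      def truncateForLogging(code: String, maxLines: Int = 1000): String = {
        val lines = code.split("\n", -1)
        if (lines.length <= maxLines) code
        else (lines.take(maxLines) :+ s"... [truncated, ${lines.length - maxLines} more lines]").mkString("\n")
      }
      ```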
      
      ## How was this patch tested?
      
      Existing tests were run.
      An extra test case was added to CodeFormatterSuite to test the new maxLines parameter.
      
      Author: pj.fanning <pj.fanning@workday.com>
      
      Closes #18658 from pjfanning/SPARK-20871.
      2a53fbfc
  5. Jul 20, 2017
  6. Jul 18, 2017
  7. Jul 17, 2017
    • aokolnychyi's avatar
      [SPARK-21332][SQL] Incorrect result type inferred for some decimal expressions · 0be5fb41
      aokolnychyi authored
      ## What changes were proposed in this pull request?
      
      This PR changes the direction of expression transformation in the DecimalPrecision rule. Previously, the expressions were transformed down, which led to incorrect result types when decimal expressions had other decimal expressions as their operands. The root cause of this issue was in visiting outer nodes before their children. Consider the example below:
      
      ```
          val inputSchema = StructType(StructField("col", DecimalType(26, 6)) :: Nil)
          val sc = spark.sparkContext
          val rdd = sc.parallelize(1 to 2).map(_ => Row(BigDecimal(12)))
          val df = spark.createDataFrame(rdd, inputSchema)
      
          // Works correctly since no nested decimal expression is involved
          // Expected result type: (26, 6) * (26, 6) = (38, 12)
          df.select($"col" * $"col").explain(true)
          df.select($"col" * $"col").printSchema()
      
          // Gives a wrong result since there is a nested decimal expression that should be visited first
          // Expected result type: ((26, 6) * (26, 6)) * (26, 6) = (38, 12) * (26, 6) = (38, 18)
          df.select($"col" * $"col" * $"col").explain(true)
          df.select($"col" * $"col" * $"col").printSchema()
      ```
      
      The example above gives the following output:
      
      ```
      // Correct result without sub-expressions
      == Parsed Logical Plan ==
      'Project [('col * 'col) AS (col * col)#4]
      +- LogicalRDD [col#1]
      
      == Analyzed Logical Plan ==
      (col * col): decimal(38,12)
      Project [CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS (col * col)#4]
      +- LogicalRDD [col#1]
      
      == Optimized Logical Plan ==
      Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4]
      +- LogicalRDD [col#1]
      
      == Physical Plan ==
      *Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4]
      +- Scan ExistingRDD[col#1]
      
      // Schema
      root
       |-- (col * col): decimal(38,12) (nullable = true)
      
      // Incorrect result with sub-expressions
      == Parsed Logical Plan ==
      'Project [(('col * 'col) * 'col) AS ((col * col) * col)#11]
      +- LogicalRDD [col#1]
      
      == Analyzed Logical Plan ==
      ((col * col) * col): decimal(38,12)
      Project [CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS ((col * col) * col)#11]
      +- LogicalRDD [col#1]
      
      == Optimized Logical Plan ==
      Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11]
      +- LogicalRDD [col#1]
      
      == Physical Plan ==
      *Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11]
      +- Scan ExistingRDD[col#1]
      
      // Schema
      root
       |-- ((col * col) * col): decimal(38,12) (nullable = true)
      ```
      
      ## How was this patch tested?
      
      This PR was tested with available unit tests. Moreover, there are tests to cover previously failing scenarios.
      
      Author: aokolnychyi <anton.okolnychyi@sap.com>
      
      Closes #18583 from aokolnychyi/spark-21332.
      0be5fb41
  8. Jul 16, 2017
  9. Jul 14, 2017
    • Kazuaki Ishizaki's avatar
      [SPARK-21344][SQL] BinaryType comparison does signed byte array comparison · ac5d5d79
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR fixes an incorrect comparison for `BinaryType`. It enables unsigned comparison and unsigned prefix generation for byte arrays in `BinaryType`; the previous implementation used signed operations.
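
      A minimal sketch of unsigned lexicographic byte-array comparison, i.e. the semantics this change switches `BinaryType` to (illustrative, not the actual Spark code):
      ```
      def compareUnsigned(a: Array[Byte], b: Array[Byte]): Int = {
        val n = math.min(a.length, b.length)
        var i = 0
        while (i < n) {
          // Mask with 0xff to compare bytes as unsigned values (0..255) instead of signed (-128..127).
          val cmp = (a(i) & 0xff) - (b(i) & 0xff)
          if (cmp != 0) return cmp
          i += 1
        }
        a.length - b.length
      }
      ```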
      
      ## How was this patch tested?
      
      Added a test suite in `OrderingSuite`.
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #18571 from kiszk/SPARK-21344.
      ac5d5d79
  10. Jul 13, 2017
    • Sean Owen's avatar
      [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10 · 425c4ada
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      - Remove Scala 2.10 build profiles and support
      - Replace some 2.10 support in scripts with commented placeholders for 2.12 later
      - Remove deprecated API calls from 2.10 support
      - Remove usages of deprecated context bounds where possible
      - Remove Scala 2.10 workarounds like ScalaReflectionLock
      - Other minor Scala warning fixes
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17150 from srowen/SPARK-19810.
      425c4ada
  11. Jul 12, 2017
  12. Jul 10, 2017
    • Bryan Cutler's avatar
      [SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas · d03aebbe
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      Integrate Apache Arrow with Spark to increase the performance of `DataFrame.toPandas`. This is done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays, which are then served to the Python process. The Python side collects the Arrow payloads, where they are combined and converted to a Pandas DataFrame. All data types except complex, date, timestamp, and decimal are currently supported; otherwise an `UnsupportedOperation` exception is thrown.
      
      Additions to Spark include a Scala package-private method `Dataset.toArrowPayload` that converts data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served, and a package-private class/object `ArrowConverters` that provides data type mappings and conversion routines. In Python, a private method `DataFrame._collectAsArrow` is added to collect Arrow payloads, and the SQLConf "spark.sql.execution.arrow.enable" can be used in `toPandas()` to enable Arrow (the old conversion is used by default).
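
      A hedged usage sketch (the config key is quoted from this description; `toPandas()` itself is then called on the PySpark side):
      ```
      // Enable the Arrow-based conversion path for DataFrame.toPandas().
      spark.conf.set("spark.sql.execution.arrow.enable", "true")
      ```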
      
      ## How was this patch tested?
      Added a new test suite `ArrowConvertersSuite` that tests conversion of Datasets to Arrow payloads for supported types. The suite generates a Dataset and matching Arrow JSON data, converts the dataset to an Arrow payload, and finally validates it against the JSON data. This ensures that the schema and data have been converted correctly.
      
      Added PySpark tests to verify the `toPandas` method produces equal DataFrames with and without pyarrow, and a roundtrip test to ensure the pandas DataFrame produced by PySpark is equal to one made directly with pandas.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      Author: Li Jin <ice.xelloss@gmail.com>
      Author: Li Jin <li.jin@twosigma.com>
      Author: Wes McKinney <wes.mckinney@twosigma.com>
      
      Closes #18459 from BryanCutler/toPandas_with_arrow-SPARK-13534.
      d03aebbe
    • Takeshi Yamamuro's avatar
      [SPARK-20460][SQL] Make it more consistent to handle column name duplication · 647963a2
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This pr makes the handling of column name duplication more consistent. In the current master, error handling differs when hitting column name duplication:
      ```
      // json
      scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
      scala> Seq("""{"a":1, "a":1}""").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
      scala> spark.read.format("json").schema(schema).load("/tmp/data").show
      org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#12, a#13.;
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
      
      scala> spark.read.format("json").load("/tmp/data").show
      org.apache.spark.sql.AnalysisException: Duplicate column(s) : "a" found, cannot save to JSON format;
        at org.apache.spark.sql.execution.datasources.json.JsonDataSource.checkConstraints(JsonDataSource.scala:81)
        at org.apache.spark.sql.execution.datasources.json.JsonDataSource.inferSchema(JsonDataSource.scala:63)
        at org.apache.spark.sql.execution.datasources.json.JsonFileFormat.inferSchema(JsonFileFormat.scala:57)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:176)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:176)
      
      // csv
      scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
      scala> Seq("a,a", "1,1").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
      scala> spark.read.format("csv").schema(schema).option("header", false).load("/tmp/data").show
      org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#41, a#42.;
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:152)
      
      // If `inferSchema` is true, a CSV format is duplicate-safe (See SPARK-16896)
      scala> spark.read.format("csv").option("header", true).load("/tmp/data").show
      +---+---+
      | a0| a1|
      +---+---+
      |  1|  1|
      +---+---+
      
      // parquet
      scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
      scala> Seq((1, 1)).toDF("a", "b").coalesce(1).write.mode("overwrite").parquet("/tmp/data")
      scala> spark.read.format("parquet").schema(schema).option("header", false).load("/tmp/data").show
      org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#110, a#111.;
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:152)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
      ```
      When this patch is applied, the results change to:
      ```
      
      // json
      scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
      scala> Seq("""{"a":1, "a":1}""").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
      scala> spark.read.format("json").schema(schema).load("/tmp/data").show
      org.apache.spark.sql.AnalysisException: Found duplicate column(s) in datasource: "a";
        at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtil.scala:47)
        at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtil.scala:33)
        at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:186)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:368)
      
      scala> spark.read.format("json").load("/tmp/data").show
      org.apache.spark.sql.AnalysisException: Found duplicate column(s) in datasource: "a";
        at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtil.scala:47)
        at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtil.scala:33)
        at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:186)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:368)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
      
      // csv
      scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
      scala> Seq("a,a", "1,1").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
      scala> spark.read.format("csv").schema(schema).option("header", false).load("/tmp/data").show
      org.apache.spark.sql.AnalysisException: Found duplicate column(s) in datasource: "a";
        at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtil.scala:47)
        at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtil.scala:33)
        at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:186)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:368)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
      
      scala> spark.read.format("csv").option("header", true).load("/tmp/data").show
      +---+---+
      | a0| a1|
      +---+---+
      |  1|  1|
      +---+---+
      
      // parquet
      scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
      scala> Seq((1, 1)).toDF("a", "b").coalesce(1).write.mode("overwrite").parquet("/tmp/data")
      scala> spark.read.format("parquet").schema(schema).option("header", false).load("/tmp/data").show
      org.apache.spark.sql.AnalysisException: Found duplicate column(s) in datasource: "a";
        at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtil.scala:47)
        at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtil.scala:33)
        at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:186)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:368)
      ```
      
      ## How was this patch tested?
      Added tests in `DataFrameReaderWriterSuite` and `SQLQueryTestSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #17758 from maropu/SPARK-20460.
      647963a2
  13. Jul 09, 2017
    • Wenchen Fan's avatar
      [SPARK-18016][SQL][FOLLOWUP] merge declareAddedFunctions, initNestedClasses... · 680b33f1
      Wenchen Fan authored
      [SPARK-18016][SQL][FOLLOWUP] merge declareAddedFunctions, initNestedClasses and declareNestedClasses
      
      ## What changes were proposed in this pull request?
      
      These 3 methods have to be used together, so it makes more sense to merge them into one method; then the caller side only needs to call one method.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18579 from cloud-fan/minor.
      680b33f1
  14. Jul 08, 2017
    • Xiao Li's avatar
      [SPARK-21307][REVERT][SQL] Remove SQLConf parameters from the parser-related classes · c3712b77
      Xiao Li authored
      ## What changes were proposed in this pull request?
      Since we do not set active sessions when parsing the plan, we are unable to use SQLConf.get to find the correct active session. Because https://github.com/apache/spark/pull/18531 breaks the build, I plan to revert it first.
      
      ## How was this patch tested?
      The existing test cases
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #18568 from gatorsmile/revert18531.
      c3712b77
    • Takeshi Yamamuro's avatar
      [SPARK-21281][SQL] Use string types by default if array and map have no argument · 7896e7b9
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This pr modified the code to use string types by default if the `array` and `map` functions have no arguments. This behaviour is the same as Hive's:
      ```
      hive> CREATE TEMPORARY TABLE t1 AS SELECT map();
      hive> DESCRIBE t1;
      _c0   map<string,string>
      
      hive> CREATE TEMPORARY TABLE t2 AS SELECT array();
      hive> DESCRIBE t2;
      _c0   array<string>
      ```
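
      A hedged check of the corresponding Spark behaviour after this change (illustrative; element/key/value types are expected to default to string):
      ```
      spark.sql("SELECT array() AS a, map() AS m").schema.foreach { f =>
        println(s"${f.name}: ${f.dataType}")  // expected: array<string> and map<string,string>
      }
      ```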
      
      ## How was this patch tested?
      Added tests in `DataFrameFunctionsSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #18516 from maropu/SPARK-21281.
      7896e7b9
  15. Jul 07, 2017
    • Wenchen Fan's avatar
      [SPARK-21335][SQL] support un-aliased subquery · fef08130
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Un-aliased subqueries have been supported by Spark SQL for a long time. Their semantics were not well defined and had confusing behaviors, and they are not standard SQL syntax, so we disallowed them in https://issues.apache.org/jira/browse/SPARK-20690 .
      
      However, this is a breaking change, and we do have existing queries using un-aliased subqueries. We should add the support back and fix its semantics.
      
      This PR fixes the un-aliased subquery by assigning it a default alias name.
      
      After this PR, there is no syntax change from branch 2.2 to master, but we invalidate a weird use case:
      `SELECT v.i from (SELECT i FROM v)`. Now this query will throw an analysis exception because users should not be able to use the qualifier inside a subquery.
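
      Illustrative queries (assuming `v` is an existing view): the first un-aliased subquery is supported again after this PR, while the second shows the qualifier use that now fails analysis.
      ```
      spark.sql("SELECT i FROM (SELECT i FROM v)")     // allowed again; the subquery gets a default alias
      spark.sql("SELECT v.i FROM (SELECT i FROM v)")   // now throws an AnalysisException
      ```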
      
      ## How was this patch tested?
      
      new regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18559 from cloud-fan/sub-query.
      fef08130
  16. Jul 06, 2017
    • Wang Gengliang's avatar
      [SPARK-21323][SQL] Rename plans.logical.statsEstimation.Range to ValueInterval · bf66335a
      Wang Gengliang authored
      ## What changes were proposed in this pull request?
      
      Rename org.apache.spark.sql.catalyst.plans.logical.statsEstimation.Range to ValueInterval.
      The current name is identical to that of the logical operator "range";
      renaming it to ValueInterval is more accurate.
      
      ## How was this patch tested?
      
      unit test
      
      Author: Wang Gengliang <ltnwgl@gmail.com>
      
      Closes #18549 from gengliangwang/ValueInterval.
      bf66335a
    • Liang-Chi Hsieh's avatar
      [SPARK-21204][SQL] Add support for Scala Set collection types in serialization · 48e44b24
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      Currently we can't produce a `Dataset` containing `Set` in SparkSQL. This PR tries to support serialization/deserialization of `Set`.
      
      Because there's no corresponding internal data type in SparkSQL for a `Set`, the most appropriate choice for serializing a set is an array.
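
      A usage sketch of what this enables (a minimal example, assuming `spark.implicits._` is in scope): a Dataset whose element type contains a Scala `Set`, encoded as an array column.
      ```
      case class Record(id: Int, tags: Set[String])
      
      val ds = Seq(Record(1, Set("a", "b"))).toDS()
      ds.printSchema()  // `tags` is expected to show up as an array type
      ```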
      
      ## How was this patch tested?
      
      Added unit tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18416 from viirya/SPARK-21204.
      48e44b24
    • Bogdan Raducanu's avatar
      [SPARK-21228][SQL] InSet incorrect handling of structs · 26ac085d
      Bogdan Raducanu authored
      ## What changes were proposed in this pull request?
      When the data type is a struct, InSet now uses TypeUtils.getInterpretedOrdering (similar to EqualTo) to build a TreeSet. In other cases it uses a HashSet as before (which should be faster). Similarly, In.eval uses Ordering.equiv instead of equals.
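
      An illustrative query shape affected by this fix (table and column names are assumed): an IN predicate whose left-hand side is a struct value.
      ```
      spark.sql("SELECT * FROM t WHERE struct(a, b) IN (struct(1, 1), struct(2, 2))")
      ```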
      
      ## How was this patch tested?
      New test in SQLQuerySuite.
      
      Author: Bogdan Raducanu <bogdan@databricks.com>
      
      Closes #18455 from bogdanrdc/SPARK-21228.
      26ac085d
    • Wang Gengliang's avatar
      [SPARK-21273][SQL][FOLLOW-UP] Add missing test cases back and revise code style · d540dfbf
      Wang Gengliang authored
      ## What changes were proposed in this pull request?
      
      Add missing test cases back and revise code style
      
      Follow up the previous PR: https://github.com/apache/spark/pull/18479
      
      ## How was this patch tested?
      
      Unit test
      
      Author: Wang Gengliang <ltnwgl@gmail.com>
      
      Closes #18548 from gengliangwang/stat_propagation_revise.
      d540dfbf
    • Sumedh Wale's avatar
      [SPARK-21312][SQL] correct offsetInBytes in UnsafeRow.writeToStream · 14a3bb3a
      Sumedh Wale authored
      ## What changes were proposed in this pull request?
      
      Corrects the offsetInBytes calculation in UnsafeRow.writeToStream. Known failures include writes to some DataSources that have their own SparkPlan implementations and cause an EXCHANGE in writes.
      
      ## How was this patch tested?
      
      Extended UnsafeRowSuite.writeToStream to include an UnsafeRow over a byte array with a non-zero offset.
      
      Author: Sumedh Wale <swale@snappydata.io>
      
      Closes #18535 from sumwale/SPARK-21312.
      14a3bb3a
    • gatorsmile's avatar
      [SPARK-21308][SQL] Remove SQLConf parameters from the optimizer · 75b168fd
      gatorsmile authored
      ### What changes were proposed in this pull request?
      This PR removes SQLConf parameters from the optimizer rules
      
      ### How was this patch tested?
      The existing test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18533 from gatorsmile/rmSQLConfOptimizer.
      75b168fd
  17. Jul 05, 2017
    • gatorsmile's avatar
      [SPARK-21307][SQL] Remove SQLConf parameters from the parser-related classes. · c8e7f445
      gatorsmile authored
      ### What changes were proposed in this pull request?
      This PR is to remove SQLConf parameters from the parser-related classes.
      
      ### How was this patch tested?
      The existing test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18531 from gatorsmile/rmSQLConfParser.
      c8e7f445
    • ouyangxiaochen's avatar
      [SPARK-20383][SQL] Supporting Create [temporary] Function with the keyword 'OR... · 5787ace4
      ouyangxiaochen authored
      [SPARK-20383][SQL] Supporting Create [temporary] Function with the keyword 'OR REPLACE' and 'IF NOT EXISTS'
      
      ## What changes were proposed in this pull request?
      
      Support creating a [temporary] function with the keywords 'OR REPLACE' and 'IF NOT EXISTS'.
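
      Illustrative statements enabled by this change (function and class names are hypothetical):
      ```
      spark.sql("CREATE OR REPLACE TEMPORARY FUNCTION my_upper AS 'com.example.MyUpperUDF'")
      spark.sql("CREATE FUNCTION IF NOT EXISTS mydb.my_upper AS 'com.example.MyUpperUDF'")
      ```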
      
      ## How was this patch tested?
      manual test and added test cases
      
      Author: ouyangxiaochen <ou.yangxiaochen@zte.com.cn>
      
      Closes #17681 from ouyangxiaochen/spark-419.
      5787ace4
    • Takuya UESHIN's avatar
      [SPARK-16167][SQL] RowEncoder should preserve array/map type nullability. · 873f3ad2
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      Currently `RowEncoder` doesn't preserve the nullability of `ArrayType` or `MapType`.
      It always returns `containsNull = true` for `ArrayType` and `valueContainsNull = true` for `MapType`, and the nullability of the type itself is always `true`.
      
      This pr fixes their nullability.
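
      A hedged check of the fixed behaviour (a minimal sketch; `RowEncoder` is a catalyst-internal API):
      ```
      import org.apache.spark.sql.catalyst.encoders.RowEncoder
      import org.apache.spark.sql.types._
      
      val schema = StructType(StructField("xs", ArrayType(IntegerType, containsNull = false)) :: Nil)
      // After this fix, the encoder's schema is expected to keep containsNull = false.
      println(RowEncoder(schema).schema("xs").dataType)
      ```
      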
      ## How was this patch tested?
      
      Add tests to check if `RowEncoder` preserves array/map nullability.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #13873 from ueshin/issues/SPARK-16167.
      873f3ad2
    • Takuya UESHIN's avatar
      [SPARK-18623][SQL] Add `returnNullable` to `StaticInvoke` and modify it to handle properly. · a3864325
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      Add `returnNullable` to `StaticInvoke`, the same as #15780 is trying to add to `Invoke`, and modify `StaticInvoke` to handle it properly.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #16056 from ueshin/issues/SPARK-18623.
      a3864325
    • Wenchen Fan's avatar
      [SPARK-21304][SQL] remove unnecessary isNull variable for collection related encoder expressions · f2c3b1dd
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      For these collection-related encoder expressions, we don't need to create `isNull` variable if the loop element is not nullable.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18529 from cloud-fan/minor.
      f2c3b1dd
  18. Jul 04, 2017
    • Takuya UESHIN's avatar
      [SPARK-21300][SQL] ExternalMapToCatalyst should null-check map key prior to... · ce10545d
      Takuya UESHIN authored
      [SPARK-21300][SQL] ExternalMapToCatalyst should null-check map key prior to converting to internal value.
      
      ## What changes were proposed in this pull request?
      
      `ExternalMapToCatalyst` should null-check the map key prior to converting it to an internal value, so that an appropriate exception is thrown instead of something like an NPE.
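
      An illustrative trigger for that check (assuming `spark.implicits._` is in scope): encoding a map with a null key is expected to fail with a descriptive error rather than an NPE.
      ```
      val data = Seq(Map(null.asInstanceOf[String] -> 1))
      val ds = data.toDS()   // encoding should now report the null map key clearly
      ```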
      
      ## How was this patch tested?
      
      Added a test and existing tests.
      
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #18524 from ueshin/issues/SPARK-21300.
      ce10545d
    • gatorsmile's avatar
      [SPARK-21295][SQL] Use qualified names in error message for missing references · de14086e
      gatorsmile authored
      ### What changes were proposed in this pull request?
      It is strange to see the following error message. Actually, the column is from another table.
      ```
      cannot resolve '`right.a`' given input columns: [a, c, d];
      ```
      
      After the PR, the error message looks like
      ```
      cannot resolve '`right.a`' given input columns: [left.a, right.c, right.d];
      ```
      
      ### How was this patch tested?
      Added a test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18520 from gatorsmile/removeSQLConf.
      de14086e
    • gatorsmile's avatar
      [SPARK-21256][SQL] Add withSQLConf to Catalyst Test · 29b1f6b0
      gatorsmile authored
      ### What changes were proposed in this pull request?
      SQLConf has been moved to Catalyst. We are adding more and more test cases to verify conf-specific behaviors, so it is nice to add a helper function to simplify the test cases.
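
      A hedged usage sketch of the helper (the conf key is hypothetical; the signature is assumed from how such test helpers are typically written):
      ```
      withSQLConf("spark.sql.someFlag.enabled" -> "true") {
        // test body that exercises the conf-specific behavior
      }
      ```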
      
      ### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18469 from gatorsmile/withSQLConf.
      29b1f6b0
  19. Jul 03, 2017
    • Wenchen Fan's avatar
      [SPARK-21284][SQL] rename SessionCatalog.registerFunction parameter name · f953ca56
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Looking at the code in `SessionCatalog.registerFunction`, the parameter `ignoreIfExists` is misnamed: when `ignoreIfExists` is true, we override the function if it already exists. So `overrideIfExists` would be the correct name.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18510 from cloud-fan/minor.
      f953ca56
    • aokolnychyi's avatar
      [SPARK-21102][SQL] Refresh command is too aggressive in parsing · 17bdc36e
      aokolnychyi authored
      ### Idea
      
      This PR adds validation to REFRESH SQL statements. Currently, users can specify whatever they want as the resource path. For example, spark.sql("REFRESH ! $ !") will be executed without any exception.
      
      ### Implementation
      
      I am not sure that my current implementation is optimal, so any feedback is appreciated. My first idea was to make the grammar as strict as possible, but unfortunately there were some problems. I tried the approach below:
      
      SqlBase.g4
      ```
      ...
          | REFRESH TABLE tableIdentifier                                    #refreshTable
          | REFRESH resourcePath                                             #refreshResource
      ...
      
      resourcePath
          : STRING
          | (IDENTIFIER | number | nonReserved | '/' | '-')+ // other symbols can be added if needed
          ;
      ```
      It is not flexible enough and requires explicitly mentioning all possible symbols. Therefore, I came up with the current approach, which is implemented in the code.
      
      Let me know your opinion on which one is better.
      
      Author: aokolnychyi <anton.okolnychyi@sap.com>
      
      Closes #18368 from aokolnychyi/spark-21102.
      17bdc36e
  20. Jun 30, 2017
    • Reynold Xin's avatar
      [SPARK-21273][SQL] Propagate logical plan stats using visitor pattern and mixin · b1d719e7
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      We currently implement statistics propagation directly in the logical plan. Given that we already have two different implementations, it'd make sense to decouple the two and add stats propagation using a mixin. This would reduce the coupling between the logical plan and statistics handling.
      
      This can also be a powerful pattern in the future to add additional properties (e.g. constraints).
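
      A toy, self-contained sketch of the pattern (not Spark's classes): a visitor computes a derived property over a small plan hierarchy, keeping the plan nodes themselves free of that logic.
      ```
      sealed trait Plan
      case class Scan(rowCount: Long) extends Plan
      case class Filter(child: Plan) extends Plan
      
      // The visitor owns the propagation logic; new properties can be added as new visitors.
      trait PlanVisitor[T] {
        def visit(p: Plan): T = p match {
          case s: Scan   => visitScan(s)
          case f: Filter => visitFilter(f)
        }
        def visitScan(s: Scan): T
        def visitFilter(f: Filter): T
      }
      
      object RowCountEstimation extends PlanVisitor[Long] {
        def visitScan(s: Scan): Long = s.rowCount
        def visitFilter(f: Filter): Long = visit(f.child) / 2  // assume half the rows pass the filter
      }
      
      val estimate = RowCountEstimation.visit(Filter(Scan(100)))  // 50
      ```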
      
      ## How was this patch tested?
      Should be covered by existing test cases.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #18479 from rxin/stats-trait.
      b1d719e7