  1. Apr 10, 2017
    • DB Tsai's avatar
      [SPARK-18555][MINOR][SQL] Fix the @since tag when backporting from 2.2 branch into 2.1 branch · 03a42c01
      DB Tsai authored
      ## What changes were proposed in this pull request?
      
      Fix the since tag when backporting critical bugs (SPARK-18555) from 2.2 branch into 2.1 branch.
      
      ## How was this patch tested?
      
      N/A
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: DB Tsai <dbtsai@dbtsai.com>
      
      Closes #17600 from dbtsai/branch-2.1.
      03a42c01
    • DB Tsai's avatar
      [SPARK-20270][SQL] na.fill should not change the values in long or integer... · f40e44de
      DB Tsai authored
      [SPARK-20270][SQL] na.fill should not change the values in long or integer when the default value is in double
      
      ## What changes were proposed in this pull request?
      
      This bug was partially addressed in SPARK-18555 (https://github.com/apache/spark/pull/15994), but the root cause was not completely fixed. The bug is critical for us: in our application it changes member ids stored as Long whenever the id is too large to be represented losslessly as a Double.
      
      Here is an example of how this happens. With
      ```
            Seq[(java.lang.Long, java.lang.Double)]((null, 3.14), (9123146099426677101L, null),
              (9123146560113991650L, 1.6), (null, null)).toDF("a", "b").na.fill(0.2),
      ```
      the logical plan will be
      ```
      == Analyzed Logical Plan ==
      a: bigint, b: double
      Project [cast(coalesce(cast(a#232L as double), cast(0.2 as double)) as bigint) AS a#240L, cast(coalesce(nanvl(b#233, cast(null as double)), 0.2) as double) AS b#241]
      +- Project [_1#229L AS a#232L, _2#230 AS b#233]
         +- LocalRelation [_1#229L, _2#230]
      ```
      
      Note that even if the value is not null, Spark casts the Long to Double first and then casts it back to Long, which loses precision.

      The expected behavior is that a non-null original value is left unchanged, but Spark changes it, which is wrong.
      
      With the PR, the logical plan will be
      ```
      == Analyzed Logical Plan ==
      a: bigint, b: double
      Project [coalesce(a#232L, cast(0.2 as bigint)) AS a#240L, coalesce(nanvl(b#233, cast(null as double)), cast(0.2 as double)) AS b#241]
      +- Project [_1#229L AS a#232L, _2#230 AS b#233]
         +- LocalRelation [_1#229L, _2#230]
      ```
      This plan behaves correctly: it does not change the original Long values, and it also avoids the cost of unnecessary casts.
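
      The difference between the two plans can be illustrated outside Spark. The following is a minimal Java sketch, not Spark's implementation: `fillViaDouble` and `fillInPlace` are hypothetical helper names that mimic the old and the new projection for a single nullable long value.

      ```java
      final class NaFillSketch {
          // Old plan: cast(coalesce(cast(a as double), 0.2) as bigint).
          // Every value, null or not, takes a detour through double.
          static long fillViaDouble(Long a, double fill) {
              double coalesced = (a != null) ? (double) a.longValue() : fill;
              return (long) coalesced;
          }

          // New plan: coalesce(a, cast(0.2 as bigint)).
          // Only the fill value is cast; non-null inputs are returned untouched.
          static long fillInPlace(Long a, double fill) {
              return (a != null) ? a.longValue() : (long) fill;
          }

          public static void main(String[] args) {
              Long id = 9123146099426677101L;
              System.out.println(fillViaDouble(id, 0.2));  // low digits are lost
              System.out.println(fillInPlace(id, 0.2));    // original value preserved
              System.out.println(fillInPlace(null, 0.2));  // null replaced by the truncated fill value
          }
      }
      ```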
      
      ## How was this patch tested?
      
      unit test added.
      
      +cc srowen rxin cloud-fan gatorsmile
      
      Thanks.
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #17577 from dbtsai/fixnafill.
      
      (cherry picked from commit 1a0bc416)
      Signed-off-by: DB Tsai <dbtsai@dbtsai.com>
      f40e44de
    • root's avatar
      [SPARK-18555][SQL] DataFrameNaFunctions.fill miss up original values in long integers · b26f2c2c
      root authored
      
      ## What changes were proposed in this pull request?
      
         When `Dataset.na.fill(0)` is used on a Dataset that has a long column, it changes the original long values.

         The reason is that the `value` parameter of `fill` is a Double, so numeric columns are always cast to double (`fillCol[Double](f, value)`).
      ```
        def fill(value: Double, cols: Seq[String]): DataFrame = {
          val columnEquals = df.sparkSession.sessionState.analyzer.resolver
          val projections = df.schema.fields.map { f =>
            // Only fill if the column is part of the cols list.
            if (f.dataType.isInstanceOf[NumericType] && cols.exists(col => columnEquals(f.name, col))) {
              fillCol[Double](f, value)
            } else {
              df.col(f.name)
            }
          }
          df.select(projections : _*)
        }
      ```
      
       For example:
      ```
      scala> val df = Seq[(Long, Long)]((1, 2), (-1, -2), (9123146099426677101L, 9123146560113991650L)).toDF("a", "b")
      df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint]
      
      scala> df.show
      +-------------------+-------------------+
      |                  a|                  b|
      +-------------------+-------------------+
      |                  1|                  2|
      |                 -1|                 -2|
      |9123146099426677101|9123146560113991650|
      +-------------------+-------------------+
      
      scala> df.na.fill(0).show
      +-------------------+-------------------+
      |                  a|                  b|
      +-------------------+-------------------+
      |                  1|                  2|
      |                 -1|                 -2|
      |9123146099426676736|9123146560113991680|
      +-------------------+-------------------+
       ```
      
      The original values are changed, which is not the expected result:
      ```
       9123146099426677101 -> 9123146099426676736
       9123146560113991650 -> 9123146560113991680
      ```
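
      The same loss can be reproduced without Spark. This is a small Java sketch under the assumption that the change is caused purely by the long to double to long round trip; `roundTrip` is a hypothetical helper. At this magnitude adjacent doubles are 1024 apart, so most long values cannot survive the trip.

      ```java
      final class LongDoubleRoundTrip {
          // Cast to double and back, as the projection did for every numeric column.
          static long roundTrip(long value) {
              return (long) (double) value;
          }

          public static void main(String[] args) {
              long[] values = {1L, -1L, 9123146099426677101L, 9123146560113991650L};
              for (long v : values) {
                  // Math.ulp gives the gap between adjacent doubles near v.
                  System.out.println(v + " -> " + roundTrip(v) + " (ulp " + Math.ulp((double) v) + ")");
              }
          }
      }
      ```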
      
      ## How was this patch tested?
      
      unit test added.
      
      Author: root <root@iZbp1gsnrlfzjxh82cz80vZ.(none)>
      
      Closes #15994 from windpiger/nafillMissupOriginalValue.
      
      (cherry picked from commit 508de38c)
      Signed-off-by: DB Tsai <dbtsai@dbtsai.com>
      b26f2c2c
    • Bogdan Raducanu's avatar
      [SPARK-20280][CORE] FileStatusCache Weigher integer overflow · bc7304e1
      Bogdan Raducanu authored
      
      ## What changes were proposed in this pull request?
      
      `Weigher.weigh` needs to return an Int, but it is possible for an Array[FileStatus] to have size > Int.MaxValue. To avoid overflow, the size is scaled down by a factor of 32. The maximumWeight of the cache is scaled down by the same factor.
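
      The scaling idea can be sketched in plain Java. This is an illustration only, not the code in this patch: the class and method names are hypothetical, and clamping to `Integer.MAX_VALUE` when even the scaled value does not fit is an assumption of this sketch.

      ```java
      final class ScaledWeigherSketch {
          // Factor named in the commit message.
          static final int WEIGHT_SCALE = 32;

          // Scale first, in long arithmetic, and only then narrow to int.
          static int weigh(long sizeInBytes) {
              long scaled = sizeInBytes / WEIGHT_SCALE;
              return scaled > Integer.MAX_VALUE ? Integer.MAX_VALUE : (int) scaled;
          }

          // The cache limit must be expressed in the same scaled unit.
          static long scaledMaximumWeight(long maxSizeInBytes) {
              return maxSizeInBytes / WEIGHT_SCALE;
          }

          public static void main(String[] args) {
              long big = 4_000_000_000L;
              System.out.println((int) big);   // naive narrowing wraps to a negative number
              System.out.println(weigh(big));  // scaled weight stays positive
          }
      }
      ```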
      
      ## How was this patch tested?
      New test in FileIndexSuite
      
      Author: Bogdan Raducanu <bogdan@databricks.com>
      
      Closes #17591 from bogdanrdc/SPARK-20280.
      
      (cherry picked from commit f6dd8e0e)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      bc7304e1
  2. Apr 09, 2017
    • Reynold Xin's avatar
      [SPARK-20264][SQL] asm should be non-test dependency in sql/core · 1a73046b
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      The sql/core module currently declares asm as a test scope dependency. Transitively it should actually be a normal dependency, since the core module defines it. This occasionally confuses IntelliJ.
      
      ## How was this patch tested?
      N/A - This is a build change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #17574 from rxin/SPARK-20264.
      
      (cherry picked from commit 7bfa05e0)
      Signed-off-by: Xiao Li <gatorsmile@gmail.com>
      1a73046b
  3. Apr 05, 2017
  4. Mar 31, 2017
    • Kunal Khamar's avatar
      [SPARK-20164][SQL] AnalysisException not tolerant of null query plan. · 6a1b2eb4
      Kunal Khamar authored
      
      The query plan in an `AnalysisException` may be `null` in two cases: when an `AnalysisException` object is serialized and then deserialized, since `plan` is marked `transient`, or when someone throws an `AnalysisException` with a null query plan (which should not happen).
      `def getMessage` is not tolerant of this and throws a `NullPointerException`, leading to loss of information about the original exception.
      The fix is to add a `null` check in `getMessage`.
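
      As a rough illustration of the pattern (not the actual `AnalysisException` code), the Java sketch below uses a hypothetical `PlanAwareException` with a `transient` plan field. The message format is an assumption; the point is only that `getMessage` falls back to the plain message when the plan is missing.

      ```java
      final class PlanAwareException extends RuntimeException {
          // transient: not written during serialization, so it is null after deserialization.
          private final transient Object plan;

          PlanAwareException(String message, Object plan) {
              super(message);
              this.plan = plan;
          }

          @Override
          public String getMessage() {
              String base = super.getMessage();
              // Null check: never dereference or append a missing plan.
              return (plan == null) ? base : base + ";\n" + plan;
          }

          public static void main(String[] args) {
              System.out.println(new PlanAwareException("cannot resolve 'x'", null).getMessage());
          }
      }
      ```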
      
      - Unit test
      
      Author: Kunal Khamar <kkhamar@outlook.com>
      
      Closes #17486 from kunalkhamar/spark-20164.
      
      (cherry picked from commit 254877c2)
      Signed-off-by: Xiao Li <gatorsmile@gmail.com>
      6a1b2eb4
  5. Mar 29, 2017
  6. Mar 28, 2017
    • Patrick Wendell's avatar
      4964dbed
    • Patrick Wendell's avatar
      Preparing Spark release v2.1.1-rc2 · 02b165dc
      Patrick Wendell authored
      02b165dc
    • sureshthalamati's avatar
      [SPARK-14536][SQL][BACKPORT-2.1] fix to handle null value in array type column for postgres. · e669dd7e
      sureshthalamati authored
      ## What changes were proposed in this pull request?
      JDBC read fails with an NPE when the source table has null values in an array type column, because the null check for the array data type is missing. For null values, `ResultSet.getArray()` returns null.
      This PR adds a null-safe check on the value returned by `ResultSet.getArray()` before invoking methods on the Array object.
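
      A minimal Java sketch of such a null-safe read, using only the standard `java.sql` API. The class and helper names are hypothetical, and wrapping `SQLException` in an unchecked exception is a choice made for this sketch, not something the patch is known to do.

      ```java
      import java.sql.Array;
      import java.sql.ResultSet;
      import java.sql.SQLException;

      final class NullSafeArrayRead {
          // ResultSet.getArray() returns null for SQL NULL, so check before calling getArray().
          static Object unwrap(Array sqlArray) {
              try {
                  return (sqlArray == null) ? null : sqlArray.getArray();
              } catch (SQLException e) {
                  throw new IllegalStateException(e);
              }
          }

          // Reads one array column; yields null instead of throwing when the cell is NULL.
          static Object readArrayColumn(ResultSet rs, int columnIndex) throws SQLException {
              return unwrap(rs.getArray(columnIndex));
          }

          public static void main(String[] args) {
              System.out.println(unwrap(null)); // a NULL cell maps to null, no exception
          }
      }
      ```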
      
      ## How was this patch tested?
      Updated the PostgresIntegration test suite to test null values. Ran docker integration tests on my laptop.
      
      Author: sureshthalamati <suresh.thalamati@gmail.com>
      
      Closes #17460 from sureshthalamati/jdbc_array_null_fix_spark_2.1-SPARK-14536.
      e669dd7e
    • Wenchen Fan's avatar
      [SPARK-20125][SQL] Dataset of type option of map does not work · fd2e4061
      Wenchen Fan authored
      
      When we build the deserializer expression for map type, we will use `StaticInvoke` to call `ArrayBasedMapData.toScalaMap`, and declare the return type as `scala.collection.immutable.Map`. If the map is inside an Option, we will wrap this `StaticInvoke` with `WrapOption`, which requires the input to be `scala.collection.Map`. Ideally this should be fine, as `scala.collection.immutable.Map` extends `scala.collection.Map`, but our `ObjectType` is too strict about this. This PR fixes it.
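
      The kind of strictness described here can be sketched with plain Java classes. This is an assumption-laden illustration rather than the `ObjectType` code: `java.util.Map` and `java.util.HashMap` stand in for `scala.collection.Map` and `scala.collection.immutable.Map`, and the two method names are hypothetical.

      ```java
      import java.util.HashMap;
      import java.util.Map;

      final class ObjectTypeCheckSketch {
          // Too strict: only the exact same class is accepted.
          static boolean acceptsStrict(Class<?> expected, Class<?> actual) {
              return expected.equals(actual);
          }

          // Relaxed: any subtype of the expected class is accepted.
          static boolean acceptsSubtype(Class<?> expected, Class<?> actual) {
              return expected.isAssignableFrom(actual);
          }

          public static void main(String[] args) {
              System.out.println(acceptsStrict(Map.class, HashMap.class));  // rejects the subtype
              System.out.println(acceptsSubtype(Map.class, HashMap.class)); // accepts the subtype
          }
      }
      ```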
      
      new regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17454 from cloud-fan/map.
      
      (cherry picked from commit d4fac410)
      Signed-off-by: Cheng Lian <lian@databricks.com>
      fd2e4061
  7. Mar 25, 2017
    • Carson Wang's avatar
      [SPARK-19674][SQL] Ignore driver accumulator updates don't belong to … · d989434e
      Carson Wang authored
      [SPARK-19674][SQL] Ignore driver accumulator updates don't belong to the execution when merging all accumulator updates
      
      N.B. This is a backport to branch-2.1 of #17009.
      
      ## What changes were proposed in this pull request?
      In SQLListener.getExecutionMetrics, driver accumulator updates that don't belong to the execution should be ignored when merging all accumulator updates, to prevent a NoSuchElementException.
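
      The filtering step might look roughly like the Java sketch below. It is not the `SQLListener` code: updates are modelled as hypothetical `{accumulatorId, value}` pairs, and merging by summation is an assumption of this sketch.

      ```java
      import java.util.Arrays;
      import java.util.HashMap;
      import java.util.HashSet;
      import java.util.List;
      import java.util.Map;
      import java.util.Set;

      final class MetricMergeSketch {
          // Merge updates, skipping any whose id is not a metric of this execution.
          static Map<Long, Long> merge(Set<Long> executionMetricIds, List<long[]> updates) {
              Map<Long, Long> merged = new HashMap<>();
              for (long[] update : updates) {
                  long id = update[0];
                  if (!executionMetricIds.contains(id)) {
                      continue; // foreign update: a later lookup by id would otherwise fail
                  }
                  merged.merge(id, update[1], Long::sum);
              }
              return merged;
          }

          public static void main(String[] args) {
              Set<Long> ids = new HashSet<>(Arrays.asList(1L, 2L));
              List<long[]> updates = Arrays.asList(new long[]{1L, 10L}, new long[]{3L, 99L}, new long[]{1L, 5L});
              System.out.println(merge(ids, updates)); // the update for id 3 is ignored
          }
      }
      ```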
      
      ## How was this patch tested?
      Updated unit test.
      
      Author: Carson Wang <carson.wangintel.com>
      
      Author: Carson Wang <carson.wang@intel.com>
      
      Closes #17418 from mallman/spark-19674-backport_2.1.
      d989434e
  8. Mar 23, 2017
    • Kazuaki Ishizaki's avatar
      [SPARK-19959][SQL] Fix to throw NullPointerException in df[java.lang.Long].collect · 92f0b012
      Kazuaki Ishizaki authored
      
      ## What changes were proposed in this pull request?
      
      This PR fixes a `NullPointerException` in the code generated by Catalyst. When we run the following code, we get the following `NullPointerException`. This is because there are no null checks for `inputadapter_value`, while `java.lang.Long inputadapter_value` at Line 30 may be `null`.

      This happens when the type of a DataFrame is a nullable primitive type such as `java.lang.Long` and whole-stage codegen is used. While the physical plan keeps `nullable=true` in `input[0, java.lang.Long, true].longValue`, `BoundReference.doGenCode` ignores `nullable=true`. Thus, the null check code is not generated and a `NullPointerException` occurs.

      This PR checks the nullability and correctly generates the null check if needed.
      ```scala
      sparkContext.parallelize(Seq[java.lang.Long](0L, null, 2L), 1).toDF.collect
      ```
      
      ```java
      Caused by: java.lang.NullPointerException
      	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:37)
      	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
      	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:393)
      ...
      ```
      
      Generated code without this PR
      ```java
      /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
      /* 006 */   private Object[] references;
      /* 007 */   private scala.collection.Iterator[] inputs;
      /* 008 */   private scala.collection.Iterator inputadapter_input;
      /* 009 */   private UnsafeRow serializefromobject_result;
      /* 010 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder;
      /* 011 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter;
      /* 012 */
      /* 013 */   public GeneratedIterator(Object[] references) {
      /* 014 */     this.references = references;
      /* 015 */   }
      /* 016 */
      /* 017 */   public void init(int index, scala.collection.Iterator[] inputs) {
      /* 018 */     partitionIndex = index;
      /* 019 */     this.inputs = inputs;
      /* 020 */     inputadapter_input = inputs[0];
      /* 021 */     serializefromobject_result = new UnsafeRow(1);
      /* 022 */     this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 0);
      /* 023 */     this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1);
      /* 024 */
      /* 025 */   }
      /* 026 */
      /* 027 */   protected void processNext() throws java.io.IOException {
      /* 028 */     while (inputadapter_input.hasNext() && !stopEarly()) {
      /* 029 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
      /* 030 */       java.lang.Long inputadapter_value = (java.lang.Long)inputadapter_row.get(0, null);
      /* 031 */
      /* 032 */       boolean serializefromobject_isNull = true;
      /* 033 */       long serializefromobject_value = -1L;
      /* 034 */       if (!false) {
      /* 035 */         serializefromobject_isNull = false;
      /* 036 */         if (!serializefromobject_isNull) {
      /* 037 */           serializefromobject_value = inputadapter_value.longValue();
      /* 038 */         }
      /* 039 */
      /* 040 */       }
      /* 041 */       serializefromobject_rowWriter.zeroOutNullBytes();
      /* 042 */
      /* 043 */       if (serializefromobject_isNull) {
      /* 044 */         serializefromobject_rowWriter.setNullAt(0);
      /* 045 */       } else {
      /* 046 */         serializefromobject_rowWriter.write(0, serializefromobject_value);
      /* 047 */       }
      /* 048 */       append(serializefromobject_result);
      /* 049 */       if (shouldStop()) return;
      /* 050 */     }
      /* 051 */   }
      /* 052 */ }
      ```
      
      Generated code with this PR
      
      ```java
      /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
      /* 006 */   private Object[] references;
      /* 007 */   private scala.collection.Iterator[] inputs;
      /* 008 */   private scala.collection.Iterator inputadapter_input;
      /* 009 */   private UnsafeRow serializefromobject_result;
      /* 010 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder;
      /* 011 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter;
      /* 012 */
      /* 013 */   public GeneratedIterator(Object[] references) {
      /* 014 */     this.references = references;
      /* 015 */   }
      /* 016 */
      /* 017 */   public void init(int index, scala.collection.Iterator[] inputs) {
      /* 018 */     partitionIndex = index;
      /* 019 */     this.inputs = inputs;
      /* 020 */     inputadapter_input = inputs[0];
      /* 021 */     serializefromobject_result = new UnsafeRow(1);
      /* 022 */     this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 0);
      /* 023 */     this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1);
      /* 024 */
      /* 025 */   }
      /* 026 */
      /* 027 */   protected void processNext() throws java.io.IOException {
      /* 028 */     while (inputadapter_input.hasNext() && !stopEarly()) {
      /* 029 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
      /* 030 */       boolean inputadapter_isNull = inputadapter_row.isNullAt(0);
      /* 031 */       java.lang.Long inputadapter_value = inputadapter_isNull ? null : ((java.lang.Long)inputadapter_row.get(0, null));
      /* 032 */
      /* 033 */       boolean serializefromobject_isNull = true;
      /* 034 */       long serializefromobject_value = -1L;
      /* 035 */       if (!inputadapter_isNull) {
      /* 036 */         serializefromobject_isNull = false;
      /* 037 */         if (!serializefromobject_isNull) {
      /* 038 */           serializefromobject_value = inputadapter_value.longValue();
      /* 039 */         }
      /* 040 */
      /* 041 */       }
      /* 042 */       serializefromobject_rowWriter.zeroOutNullBytes();
      /* 043 */
      /* 044 */       if (serializefromobject_isNull) {
      /* 045 */         serializefromobject_rowWriter.setNullAt(0);
      /* 046 */       } else {
      /* 047 */         serializefromobject_rowWriter.write(0, serializefromobject_value);
      /* 048 */       }
      /* 049 */       append(serializefromobject_result);
      /* 050 */       if (shouldStop()) return;
      /* 051 */     }
      /* 052 */   }
      /* 053 */ }
      ```
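
      Stripped of the codegen scaffolding, the difference between the two listings comes down to guarding the unboxing call. The Java sketch below is a stand-alone illustration with hypothetical names, not generated Catalyst code.

      ```java
      final class NullableUnboxSketch {
          // Fixed pattern: compute an isNull flag first and unbox only when the value is present.
          static long unbox(Long boxed, long placeholder) {
              boolean isNull = (boxed == null);
              return isNull ? placeholder : boxed.longValue();
          }

          // Unguarded pattern: calling longValue() on a null reference throws.
          static boolean unguardedThrows(Long boxed) {
              try {
                  boxed.longValue();
                  return false;
              } catch (NullPointerException e) {
                  return true;
              }
          }

          public static void main(String[] args) {
              System.out.println(unguardedThrows(null)); // the failure mode before the fix
              System.out.println(unbox(null, -1L));      // placeholder is used, no exception
              System.out.println(unbox(2L, -1L));
          }
      }
      ```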
      
      ## How was this patch tested?
      
      Added new tests in `DataFrameSuite`.
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #17302 from kiszk/SPARK-19959.
      
      (cherry picked from commit bb823ca4)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      92f0b012
  9. Mar 21, 2017
    • Patrick Wendell's avatar
      c4d2b833
    • Patrick Wendell's avatar
      Preparing Spark release v2.1.1-rc1 · 30abb95c
      Patrick Wendell authored
      30abb95c
    • Takeshi Yamamuro's avatar
      [SPARK-19980][SQL][BACKPORT-2.1] Add NULL checks in Bean serializer · a04428fe
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      A Bean serializer in `ExpressionEncoder` could change values when Beans contain NULL. A concrete example is as follows:
      ```
      scala> :paste
      class Outer extends Serializable {
        private var cls: Inner = _
        def setCls(c: Inner): Unit = cls = c
        def getCls(): Inner = cls
      }
      
      class Inner extends Serializable {
        private var str: String = _
        def setStr(s: String): Unit = str = s
        def getStr(): String = str
      }
      
      scala> Seq("""{"cls":null}""", """{"cls": {"str":null}}""").toDF().write.text("data")
      scala> val encoder = Encoders.bean(classOf[Outer])
      scala> val schema = encoder.schema
      scala> val df = spark.read.schema(schema).json("data").as[Outer](encoder)
      scala> df.show
      +------+
      |   cls|
      +------+
      |[null]|
      |  null|
      +------+
      
      scala> df.map(x => x)(encoder).show()
      +------+
      |   cls|
      +------+
      |[null]|
      |[null]|     // <-- Value changed
      +------+
      ```
      
      This is because the Bean serializer does not have the NULL-check expressions that the serializer for Scala's product types has. This value change does not happen with Scala's product types:
      
      ```
      scala> :paste
      case class Outer(cls: Inner)
      case class Inner(str: String)
      
      scala> val encoder = Encoders.product[Outer]
      scala> val schema = encoder.schema
      scala> val df = spark.read.schema(schema).json("data").as[Outer](encoder)
      scala> df.show
      +------+
      |   cls|
      +------+
      |[null]|
      |  null|
      +------+
      
      scala> df.map(x => x)(encoder).show()
      +------+
      |   cls|
      +------+
      |[null]|
      |  null|
      +------+
      ```
      
      This PR adds NULL-check expressions to the Bean serializer, in line with the serializer for Scala's product types.
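
      The effect of the missing check can be modelled in a few lines of Java. This is a simplified sketch, not the encoder's expression tree: a struct column is modelled as a list of its field values, and `serializeUnchecked` and `serializeChecked` are hypothetical names.

      ```java
      import java.util.Collections;
      import java.util.List;

      final class BeanNullCheckSketch {
          static final class Inner { String str; }
          static final class Outer { Inner cls; }

          // Without the check: a null `cls` is still turned into a one-field struct, shown as [null].
          static List<String> serializeUnchecked(Outer o) {
              return Collections.singletonList(o.cls == null ? null : o.cls.str);
          }

          // With the check: a null `cls` stays null.
          static List<String> serializeChecked(Outer o) {
              return (o.cls == null) ? null : Collections.singletonList(o.cls.str);
          }

          public static void main(String[] args) {
              Outer o = new Outer();                     // cls is null
              System.out.println(serializeUnchecked(o)); // value changed to a struct
              System.out.println(serializeChecked(o));   // value preserved
          }
      }
      ```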
      
      ## How was this patch tested?
      Added tests in `JavaDatasetSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #17372 from maropu/SPARK-19980-BACKPORT2.1.
      a04428fe
    • Will Manning's avatar
      clarify array_contains function description · 9dfdd2ad
      Will Manning authored
      ## What changes were proposed in this pull request?
      
      The description in the comment for array_contains is vague/incomplete (i.e., doesn't mention that it returns `null` if the array is `null`); this PR fixes that.
      
      ## How was this patch tested?
      
      No testing, since it merely changes a comment.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Will Manning <lwwmanning@gmail.com>
      
      Closes #17380 from lwwmanning/patch-1.
      
      (cherry picked from commit a04dcde8)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      9dfdd2ad
  10. Mar 20, 2017
    • wangzhenhua's avatar
      [SPARK-19994][SQL] Wrong outputOrdering for right/full outer smj · af8bf218
      wangzhenhua authored
      
      ## What changes were proposed in this pull request?
      
      For right outer join, values of the left key will be filled with nulls if it can't match the value of the right key, so `nullOrdering` of the left key can't be guaranteed. We should output right key order instead of left key order.
      
      For full outer join, neither left key nor right key guarantees `nullOrdering`. We should not output any ordering.
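
      The rule for the two cases above can be written down as a small decision table. The Java sketch below is illustrative only: the enum and method are hypothetical, orderings are modelled as lists of key names, and the inner and left outer cases keeping the left key order is an assumption based on the wording above.

      ```java
      import java.util.Arrays;
      import java.util.Collections;
      import java.util.List;

      final class SmjOrderingSketch {
          enum JoinType { INNER, LEFT_OUTER, RIGHT_OUTER, FULL_OUTER }

          static List<String> outputOrdering(JoinType type, List<String> leftKeys, List<String> rightKeys) {
              switch (type) {
                  case RIGHT_OUTER:
                      return rightKeys;               // left keys may be null-filled
                  case FULL_OUTER:
                      return Collections.emptyList(); // neither side can be guaranteed
                  default:
                      return leftKeys;                // assumed unchanged for the other join types
              }
          }

          public static void main(String[] args) {
              List<String> left = Arrays.asList("l.k");
              List<String> right = Arrays.asList("r.k");
              for (JoinType t : JoinType.values()) {
                  System.out.println(t + " -> " + outputOrdering(t, left, right));
              }
          }
      }
      ```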
      
      In tests, besides adding three test cases for left/right/full outer sort merge join, this patch also reorganizes code in `PlannerSuite` by putting together tests for `Sort`, and also extracts common logic in Sort tests into a method.
      
      ## How was this patch tested?
      
      Corresponding test cases are added.
      
      Author: wangzhenhua <wangzhenhua@huawei.com>
      Author: Zhenhua Wang <wzh_zju@163.com>
      
      Closes #17331 from wzhfy/wrongOrdering.
      
      (cherry picked from commit 965a5abc)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      af8bf218
  11. Mar 17, 2017
    • Jacek Laskowski's avatar
      [SQL][MINOR] Fix scaladoc for UDFRegistration · 780f6060
      Jacek Laskowski authored
      
      ## What changes were proposed in this pull request?
      
      Fix scaladoc for UDFRegistration
      
      ## How was this patch tested?
      
      local build
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #17337 from jaceklaskowski/udfregistration-scaladoc.
      
      (cherry picked from commit 6326d406)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      780f6060
    • Liwei Lin's avatar
      [SPARK-19721][SS][BRANCH-2.1] Good error message for version mismatch in log files · 710b5554
      Liwei Lin authored
      ## Problem
      
      There are several places where we write out version identifiers in various logs for structured streaming (usually `v1`). However, in the places where we check for this, we throw a confusing error message.
      
      ## What changes were proposed in this pull request?
      
      This patch made two major changes:
      1. added a `parseVersion(...)` method and, based on it, fixed version checking in the following places (no other place needed this check):
      ```
      HDFSMetadataLog
        - CompactibleFileStreamLog  ------------> fixed with this patch
          - FileStreamSourceLog  ---------------> inherited the fix of `CompactibleFileStreamLog`
          - FileStreamSinkLog  -----------------> inherited the fix of `CompactibleFileStreamLog`
        - OffsetSeqLog  ------------------------> fixed with this patch
        - anonymous subclass in KafkaSource  ---> fixed with this patch
      ```
      
      2. changed the type of `FileStreamSinkLog.VERSION`, `FileStreamSourceLog.VERSION` etc. from `String` to `Int`, so that we can identify newer versions via `version > 1` instead of `version != "v1"`
          - note this didn't break any backwards compatibility -- we are still writing out `"v1"` and reading back `"v1"`
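
      A version check along these lines could be sketched as follows. This is a Java approximation, not the method added by the patch: the signature and the wording of the "malformed" message are assumptions, while the unsupported-version message follows the exception text quoted in this commit.

      ```java
      final class LogVersionSketch {
          // Parses text such as "v1" into 1 and rejects versions newer than maxSupportedVersion.
          static int parseVersion(String text, int maxSupportedVersion) {
              if (text != null && text.length() > 1 && text.charAt(0) == 'v') {
                  int version;
                  try {
                      version = Integer.parseInt(text.substring(1));
                  } catch (NumberFormatException e) {
                      throw malformed(text);
                  }
                  if (version > maxSupportedVersion) {
                      throw new IllegalStateException("UnsupportedLogVersion: maximum supported log version is v"
                          + maxSupportedVersion + ", but encountered v" + version
                          + ". The log file was produced by a newer version of Spark and cannot be read by this version. Please upgrade.");
                  }
                  if (version > 0) {
                      return version;
                  }
              }
              throw malformed(text);
          }

          // Message wording here is hypothetical.
          private static IllegalStateException malformed(String text) {
              return new IllegalStateException("Malformed log version: " + text);
          }

          public static void main(String[] args) {
              System.out.println(parseVersion("v1", 1));
          }
      }
      ```

      Comparing integers is what makes `version > 1` possible, while the on-disk text stays `"v1"`.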
      
      ## Exception message with this patch
      ```
      java.lang.IllegalStateException: Failed to read log file /private/var/folders/nn/82rmvkk568sd8p3p8tb33trw0000gn/T/spark-86867b65-0069-4ef1-b0eb-d8bd258ff5b8/0. UnsupportedLogVersion: maximum supported log version is v1, but encountered v99. The log file was produced by a newer version of Spark and cannot be read by this version. Please upgrade.
      	at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:202)
      	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3$$anonfun$apply$mcV$sp$2.apply(OffsetSeqLogSuite.scala:78)
      	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3$$anonfun$apply$mcV$sp$2.apply(OffsetSeqLogSuite.scala:75)
      	at org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:133)
      	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite.withTempDir(OffsetSeqLogSuite.scala:26)
      	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3.apply$mcV$sp(OffsetSeqLogSuite.scala:75)
      	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3.apply(OffsetSeqLogSuite.scala:75)
      	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3.apply(OffsetSeqLogSuite.scala:75)
      	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
      	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
      ```
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #17327 from lw-lin/good-msg-2.1.
      710b5554
  12. Mar 16, 2017
    • Xiao Li's avatar
      [SPARK-19765][SPARK-18549][SPARK-19093][SPARK-19736][BACKPORT-2.1][SQL]... · 4b977ff0
      Xiao Li authored
      [SPARK-19765][SPARK-18549][SPARK-19093][SPARK-19736][BACKPORT-2.1][SQL] Backport Three Cache-related PRs to Spark 2.1
      
      ### What changes were proposed in this pull request?
      
      Backport a few cache related PRs:
      
      ---
      [[SPARK-19093][SQL] Cached tables are not used in SubqueryExpression](https://github.com/apache/spark/pull/16493)
      
      Consider the plans inside subquery expressions while looking up cache manager to make
      use of cached data. Currently CacheManager.useCachedData does not consider the
      subquery expressions in the plan.
      
      ---
      [[SPARK-19736][SQL] refreshByPath should clear all cached plans with the specified path](https://github.com/apache/spark/pull/17064)
      
      Catalog.refreshByPath can refresh the cache entry and the associated metadata for all dataframes (if any) that contain the given data source path.
      
      However, CacheManager.invalidateCachedPath doesn't clear all cached plans with the specified path. It causes some strange behaviors reported in SPARK-15678.
      
      ---
      [[SPARK-19765][SPARK-18549][SQL] UNCACHE TABLE should un-cache all cached plans that refer to this table](https://github.com/apache/spark/pull/17097)
      
      When un-caching a table, we should not only remove the cache entry for this table, but also un-cache any other cached plans that refer to this table. The following commands trigger the table uncache: `DropTableCommand`, `TruncateTableCommand`, `AlterTableRenameCommand`, `UncacheTableCommand`, `RefreshTable` and `InsertIntoHiveTable`.
      
      This PR also includes some refactors:
      - use java.util.LinkedList to store the cache entries, so that it's safer to remove elements while iterating
      - rename invalidateCache to recacheByPlan, which is more obvious about what it does.
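
      The first refactor relies on removing entries through the list's iterator. A minimal Java sketch of that pattern, assuming cached plans are represented by plain strings and using a hypothetical `uncacheReferring` helper:

      ```java
      import java.util.Arrays;
      import java.util.Iterator;
      import java.util.LinkedList;

      final class CacheEntriesSketch {
          // Removes every cached plan that mentions the table; returns how many were removed.
          static int uncacheReferring(LinkedList<String> cachedPlans, String tableName) {
              int removed = 0;
              Iterator<String> it = cachedPlans.iterator();
              while (it.hasNext()) {
                  if (it.next().contains(tableName)) {
                      it.remove(); // removal through the iterator is safe during iteration
                      removed++;
                  }
              }
              return removed;
          }

          public static void main(String[] args) {
              LinkedList<String> plans = new LinkedList<>(Arrays.asList("scan t1", "join t1 t2", "scan t3"));
              System.out.println(uncacheReferring(plans, "t1") + " removed, remaining " + plans);
          }
      }
      ```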
      
      ### How was this patch tested?
      N/A
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17319 from gatorsmile/backport-17097.
      4b977ff0
    • windpiger's avatar
      [SPARK-19329][SQL][BRANCH-2.1] Reading from or writing to a datasource table... · 9d032d02
      windpiger authored
      [SPARK-19329][SQL][BRANCH-2.1] Reading from or writing to a datasource table with a non pre-existing location should succeed
      
      ## What changes were proposed in this pull request?
      
      This is a backport pr of https://github.com/apache/spark/pull/16672 into branch-2.1.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: windpiger <songjun@outlook.com>
      
      Closes #17317 from windpiger/backport-insertnotexists.
      9d032d02
  13. Mar 15, 2017
    • Reynold Xin's avatar
      [SPARK-19944][SQL] Move SQLConf from sql/core to sql/catalyst (branch-2.1) · 80ebca62
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch moves SQLConf from sql/core to sql/catalyst. To minimize the changes, the patch used type alias to still keep CatalystConf (as a type alias) and SimpleCatalystConf (as a concrete class that extends SQLConf).
      
      Motivation for the change is that it is pretty weird to have SQLConf only in sql/core and then we have to duplicate config options that impact optimizer/analyzer in sql/catalyst using CatalystConf.
      
      This is a backport into branch-2.1 to minimize merge conflicts.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #17301 from rxin/branch-2.1-conf.
      80ebca62
  14. Mar 14, 2017
    • Wenchen Fan's avatar
      [SPARK-19887][SQL] dynamic partition keys can be null or empty string · a0ce845d
      Wenchen Fan authored
      When a dynamic partition value is null or an empty string, we should write the data to a directory like `a=__HIVE_DEFAULT_PARTITION__`. When we read the data back, we should respect this special directory name and treat it as null.

      This is the same behavior as Impala, see https://issues.apache.org/jira/browse/IMPALA-252
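
      The mapping in both directions can be sketched in Java as below. The helper names are hypothetical and path escaping is deliberately left out; only the `__HIVE_DEFAULT_PARTITION__` directory name comes from the description above.

      ```java
      final class DefaultPartitionSketch {
          static final String DEFAULT_PARTITION_NAME = "__HIVE_DEFAULT_PARTITION__";

          // Write side: null and empty string both map to the special directory name.
          static String toDirName(String column, String value) {
              boolean missing = (value == null || value.isEmpty());
              return column + "=" + (missing ? DEFAULT_PARTITION_NAME : value);
          }

          // Read side: the special directory name is treated as null.
          static String fromDirName(String dirName) {
              String raw = dirName.substring(dirName.indexOf('=') + 1);
              return DEFAULT_PARTITION_NAME.equals(raw) ? null : raw;
          }

          public static void main(String[] args) {
              System.out.println(toDirName("a", ""));
              System.out.println(fromDirName(toDirName("a", null)));
          }
      }
      ```

      Note that under this scheme an empty string written to a partition column reads back as null.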
      
      
      
      new regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17277 from cloud-fan/partition.
      
      (cherry picked from commit dacc382f)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      a0ce845d
    • Herman van Hovell's avatar
      [SPARK-19933][SQL] Do not change output of a subquery · 45457825
      Herman van Hovell authored
      
      ## What changes were proposed in this pull request?
      The `RemoveRedundantAlias` rule can change the output attributes (the expression id's to be precise) of a query by eliminating the redundant alias producing them. This is no problem for a regular query, but can cause problems for correlated subqueries: The attributes produced by the subquery are used in the parent plan; changing them will break the parent plan.
      
      This PR fixes this by wrapping a subquery in a `Subquery` top level node when it gets optimized. The `RemoveRedundantAlias` rule now recognizes `Subquery` and makes sure that the output attributes of the `Subquery` node are retained.
      
      ## How was this patch tested?
      Added a test case to `RemoveRedundantAliasAndProjectSuite` and added a regression test to `SubquerySuite`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #17278 from hvanhovell/SPARK-19933.
      
      (cherry picked from commit e04c05cf)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      45457825
  15. Mar 12, 2017
  16. Mar 10, 2017
    • Budde's avatar
      [SPARK-19611][SQL] Introduce configurable table schema inference · e481a738
      Budde authored
      Add a new configuration option that allows Spark SQL to infer a case-sensitive schema from a Hive Metastore table's data files when a case-sensitive schema can't be read from the table properties.
      
      - Add spark.sql.hive.caseSensitiveInferenceMode param to SQLConf
      - Add schemaPreservesCase field to CatalogTable (set to false when schema can't
        successfully be read from Hive table props)
      - Perform schema inference in HiveMetastoreCatalog if schemaPreservesCase is
        false, depending on spark.sql.hive.caseSensitiveInferenceMode
      - Add alterTableSchema() method to the ExternalCatalog interface
      - Add HiveSchemaInferenceSuite tests
      - Refactor and move ParquetFileFormat.mergeMetastoreParquetSchema() as
        HiveMetastoreCatalog.mergeWithMetastoreSchema
      - Move schema merging tests from ParquetSchemaSuite to HiveSchemaInferenceSuite
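
      As a rough illustration of the inference step, the sketch below (plain Python, not Spark code) reconciles lower-cased metastore field names with the case-sensitive names found in the data files. The function name `merge_with_inferred_schema` and the tuple-based schema representation are hypothetical and chosen only for this example.

```python
def merge_with_inferred_schema(metastore_fields, inferred_fields):
    """Rename metastore fields to the case-sensitive names found in the data files.

    Both arguments are lists of (name, type) tuples. Fields that the data files
    do not contain keep their metastore name. Hypothetical helper for illustration.
    """
    # Index the inferred names by their lower-cased form for case-insensitive lookup.
    by_lower = {name.lower(): name for name, _ in inferred_fields}
    return [(by_lower.get(name.lower(), name), dtype) for name, dtype in metastore_fields]
```

      The metastore types and field order are kept as they are; only the spelling of the names is taken from the inferred schema.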
      
      [JIRA for this change](https://issues.apache.org/jira/browse/SPARK-19611)
      
      The tests in `HiveSchemaInferenceSuite` should verify that schema inference is working as expected. `ExternalCatalogSuite` has also been extended to cover the new `alterTableSchema()` API.
      
      Author: Budde <budde@amazon.com>
      
      Closes #17229 from budde/SPARK-19611-2.1.
      e481a738
    • Wenchen Fan's avatar
      [SPARK-19893][SQL] should not run DataFrame set operations with map type · 5a2ad431
      Wenchen Fan authored
      
      In Spark SQL, map type can't be used in equality tests or comparisons, but `Intersect`/`Except`/`Distinct` need an equality test on all columns. We should therefore not allow map type in `Intersect`/`Except`/`Distinct`.
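
      The check amounts to rejecting any of these operations when an input column contains a map type. A minimal sketch in plain Python, using a hypothetical `check_set_operation` helper and a schema modeled as (name, type string) pairs rather than Spark's own type classes:

```python
# Operations that need an equality test on every column (from the text above).
SET_OPERATIONS = {"intersect", "except", "distinct"}


def check_set_operation(operation, schema):
    """Raise ValueError if the operation would have to compare map values."""
    if operation not in SET_OPERATIONS:
        return
    # "map<" also matches maps nested inside arrays or structs in this toy notation.
    map_columns = [name for name, dtype in schema if "map<" in dtype]
    if map_columns:
        raise ValueError(
            f"Cannot use map type column(s) {map_columns} in {operation}"
        )
```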
      
      Added a new regression test.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17236 from cloud-fan/map.
      
      (cherry picked from commit fb9beda5)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      5a2ad431
    • Tyson Condie's avatar
      [SPARK-19891][SS] Await Batch Lock notified on stream execution exit · f0d50fd5
      Tyson Condie authored
      ## What changes were proposed in this pull request?
      
      We need to notify the await batch lock when the stream exits early e.g., when an exception has been thrown.
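
      The pattern can be sketched with a condition variable in plain Python. This is a toy model, not the Spark code: the class `StreamExecutionSketch` and all of its members are hypothetical names. The point it shows is the notification in the `finally` block, which wakes waiters even when a batch throws.

```python
import threading


class StreamExecutionSketch:
    """Toy model of a stream runner with an await-batch lock (hypothetical names)."""

    def __init__(self):
        self._await_batch_lock = threading.Condition()
        self._committed_batch = -1   # id of the last batch that finished
        self._active = True          # becomes False once run() exits
        self.error = None            # exception that stopped the stream, if any

    def run(self, batches):
        """Run each batch callable in order; always wake waiters on exit."""
        try:
            for batch_id, run_batch in enumerate(batches):
                run_batch()
                with self._await_batch_lock:
                    self._committed_batch = batch_id
                    self._await_batch_lock.notify_all()
        except Exception as exc:
            self.error = exc
        finally:
            # Without this notification, a waiter would only wake up
            # when its timeout expires after an early exit.
            with self._await_batch_lock:
                self._active = False
                self._await_batch_lock.notify_all()

    def await_batch(self, batch_id, timeout=10.0):
        """Block until batch_id is committed or the stream has exited."""
        with self._await_batch_lock:
            self._await_batch_lock.wait_for(
                lambda: self._committed_batch >= batch_id or not self._active,
                timeout=timeout,
            )
            return self._committed_batch >= batch_id
```

      A caller waiting on a batch that never commits returns as soon as the stream exits instead of waiting out the full timeout, which is why tests that throw at runtime finish faster.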
      
      ## How was this patch tested?
      
      Current tests that throw exceptions at runtime will finish faster as a result of this update.
      
      zsxwing
      
      Author: Tyson Condie <tcondie@gmail.com>
      
      Closes #17231 from tcondie/kafka-writer.
      
      (cherry picked from commit 501b7111)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      f0d50fd5
  17. Mar 09, 2017
    • uncleGen's avatar
      [SPARK-19861][SS] watermark should not be a negative time. · ffe65b06
      uncleGen authored
      
      ## What changes were proposed in this pull request?
      
      `watermark` should not be negative. A negative value is invalid, so check for it before the query actually runs.
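
      A minimal sketch of such an up-front check, in plain Python. The helper name `validate_watermark_delay` and the use of milliseconds as the unit are assumptions made for this example, not details taken from the patch.

```python
def validate_watermark_delay(delay_ms):
    """Reject a negative watermark delay before any query work starts (hypothetical helper)."""
    if delay_ms < 0:
        raise ValueError(f"delay threshold ({delay_ms} ms) should not be negative")
    return delay_ms
```

      Failing at definition time gives the user an immediate error instead of a query that starts with an invalid watermark.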
      
      ## How was this patch tested?
      
      Added a new unit test.
      
      Author: uncleGen <hustyugm@gmail.com>
      Author: dylon <hustyugm@gmail.com>
      
      Closes #17202 from uncleGen/SPARK-19861.
      
      (cherry picked from commit 30b18e69)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      ffe65b06
    • Jason White's avatar
      [SPARK-19561][SQL] add int case handling for TimestampType · 2a76e242
      Jason White authored
      ## What changes were proposed in this pull request?
      
      Add handling of input of type `Int` for dataType `TimestampType` to `EvaluatePython.scala`. Py4J serializes ints smaller than MIN_INT or larger than MAX_INT to Long, which are handled correctly already, but values between MIN_INT and MAX_INT are serialized to Int.
      
      These range limits correspond to roughly half an hour on either side of the epoch. As a result, PySpark doesn't allow TimestampType values to be created in this range.
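
      The "half an hour" figure follows from the 32-bit limits once timestamps are expressed in microseconds since the epoch, which is how Spark SQL represents them internally. The plain-Python sketch below works through the arithmetic; `to_micros` and `jvm_integer_type` are hypothetical helpers that only model which JVM type a value would be serialized to.

```python
from datetime import datetime, timedelta, timezone

MIN_INT = -2**31        # smallest value that fits a 32-bit JVM Int
MAX_INT = 2**31 - 1     # largest value that fits a 32-bit JVM Int
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_micros(dt):
    """Convert a timezone-aware datetime to microseconds since the epoch."""
    delta = dt - EPOCH
    return (delta.days * 86400 + delta.seconds) * 1_000_000 + delta.microseconds


def jvm_integer_type(value):
    """Model the serialization rule: values inside the 32-bit range become Int."""
    return "Int" if MIN_INT <= value <= MAX_INT else "Long"


# Width of the affected window on each side of the epoch, in minutes.
RANGE_MINUTES = MAX_INT / 1_000_000 / 60
```

      `RANGE_MINUTES` comes out a little under 36, so a timestamp ten minutes after the epoch is sent as an Int while one an hour after the epoch is sent as a Long.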
      
      Alternatives attempted: patching the `TimestampType.toInternal` function to cast return values to `long`, so Py4J would always serialize them to Scala Long. Python3 does not have a `long` type, so this approach failed on Python3.
      
      ## How was this patch tested?
      
      Added a new PySpark-side test that fails without the change.
      
      The contribution is my original work and I license the work to the project under the project’s open source license.
      
      Resubmission of https://github.com/apache/spark/pull/16896. The original PR didn't go through Jenkins and broke the build. davies dongjoon-hyun
      
      cloud-fan Could you kick off a Jenkins run for me? It passed everything for me locally, but it's possible something has changed in the last few weeks.
      
      Author: Jason White <jason.white@shopify.com>
      
      Closes #17200 from JasonMWhite/SPARK-19561.
      
      (cherry picked from commit 206030bd)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      2a76e242
    • uncleGen's avatar
      [SPARK-19859][SS][FOLLOW-UP] The new watermark should override the old one. · 0c140c16
      uncleGen authored
      
      ## What changes were proposed in this pull request?
      
      A follow up to SPARK-19859:
      
      - extract the calculation of `delayMs` and reuse it.
      - update EventTimeWatermarkExec
      - use the correct `delayMs` in EventTimeWatermark
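
      The first bullet, computing the delay once and reusing it, can be illustrated as follows in plain Python. Everything here is an assumption for the sake of the example: the helper name `delay_ms`, the interval modeled as months plus microseconds, and the choice to count a month as 31 days.

```python
MICROS_PER_MILLI = 1000
MILLIS_PER_DAY = 24 * 60 * 60 * 1000
ASSUMED_DAYS_PER_MONTH = 31  # assumption: a fixed month length keeps the math simple


def delay_ms(months, microseconds):
    """Convert an interval (months + microseconds) to a delay in milliseconds."""
    return microseconds // MICROS_PER_MILLI + months * ASSUMED_DAYS_PER_MONTH * MILLIS_PER_DAY
```

      Keeping the conversion in one helper means every caller derives the same delay from the same interval.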
      
      ## How was this patch tested?
      
      Jenkins.
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #17221 from uncleGen/SPARK-19859.
      
      (cherry picked from commit eeb1d6db)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      0c140c16
  18. Mar 08, 2017
    • Dilip Biswal's avatar
      [MINOR][SQL] The analyzer rules are fired twice for cases when... · 78cc5721
      Dilip Biswal authored
      [MINOR][SQL] The analyzer rules are fired twice for cases when AnalysisException is raised from analyzer.
      
      ## What changes were proposed in this pull request?
      In general we have a checkAnalysis phase which validates the logical plan and throws AnalysisException on semantic errors. However we also can throw AnalysisException from a few analyzer rules like ResolveSubquery.
      
      I found that we fire the analyzer rules twice for queries that throw AnalysisException from one of the analyzer rules. This is a very minor fix, and we don't strictly have to fix it. I was just confused to see the rules fired twice when I was not expecting it.
      
      ## How was this patch tested?
      
      Tested manually.
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #17214 from dilipbiswal/analyis_twice.
      
      (cherry picked from commit d809ceed)
      Signed-off-by: Xiao Li <gatorsmile@gmail.com>
      78cc5721
    • Burak Yavuz's avatar
      [SPARK-19813] maxFilesPerTrigger combo latestFirst may miss old files in... · f6c1ad2e
      Burak Yavuz authored
      [SPARK-19813] maxFilesPerTrigger combo latestFirst may miss old files in combination with maxFileAge in FileStreamSource
      
      ## What changes were proposed in this pull request?
      
      **The Problem**
      There is a file stream source option called maxFileAge which limits how old the files can be, relative to the latest file that has been seen. This is used to limit the files that need to be remembered as "processed". Files older than the latest processed file by more than maxFileAge are ignored. This value is 7 days by default.
      This causes a problem when both of the following hold:
      - latestFirst = true
      - maxFilesPerTrigger < total files to be processed
      Here is what happens in each combination:
      1) latestFirst = false - Since files are processed in order, there won't be any unprocessed file older than the latest processed file. All files will be processed.
      2) latestFirst = true AND maxFilesPerTrigger is not set - The maxFileAge thresholding mechanism takes one batch to initialize. If maxFilesPerTrigger is not set, then all old files get processed in the first batch, so no file is left behind.
      3) latestFirst = true AND maxFilesPerTrigger is set to X - The first batch processes the latest X files. That sets the threshold to (latest file - maxFileAge), so files older than this threshold will never be considered for processing.
      The bug is with case 3.
      
      **The Solution**
      
      Ignore `maxFileAge` when both `maxFilesPerTrigger` and `latestFirst` are set.
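
      The three cases and the fix can be reproduced with a small simulation in plain Python. This is a simplified model rather than the `FileStreamSource` code: the function `process_all`, its parameters, and the use of bare numbers as file timestamps are all hypothetical.

```python
def process_all(files, max_file_age, latest_first,
                max_files_per_trigger=None, apply_fix=False):
    """Simulate batch selection; files maps a file name to its timestamp.

    Returns the list of batches (each a list of file names) until nothing
    more is eligible. max_files_per_trigger must be None or at least 1.
    """
    # The fix: ignore maxFileAge when both options are set.
    ignore_age = apply_fix and latest_first and max_files_per_trigger is not None
    processed = set()
    latest_seen = None  # newest timestamp processed so far
    batches = []
    while True:
        # The age threshold only exists after the first batch has run.
        if ignore_age or latest_seen is None:
            threshold = float("-inf")
        else:
            threshold = latest_seen - max_file_age
        candidates = [name for name, ts in files.items()
                      if name not in processed and ts >= threshold]
        if not candidates:
            return batches
        candidates.sort(key=lambda name: files[name], reverse=latest_first)
        batch = candidates[:max_files_per_trigger]  # a None limit keeps everything
        processed.update(batch)
        newest = max(files[name] for name in batch)
        latest_seen = newest if latest_seen is None else max(latest_seen, newest)
        batches.append(batch)
```

      With three files whose timestamps are far apart compared with `max_file_age`, cases 1 and 2 process every file, case 3 stops after the newest file, and case 3 with `apply_fix=True` processes all three again.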
      
      ## How was this patch tested?
      
      Regression test in `FileStreamSourceSuite`
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #17153 from brkyvz/maxFileAge.
      
      (cherry picked from commit a3648b5d)
      Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
      f6c1ad2e
  19. Mar 07, 2017
  20. Mar 03, 2017
  21. Mar 02, 2017