Skip to content
Snippets Groups Projects
  1. May 24, 2017
  2. May 19, 2017
    • Ala Luszczak's avatar
      [SPARK-20798] GenerateUnsafeProjection should check if a value is null before calling the getter · e326de48
      Ala Luszczak authored
      
      ## What changes were proposed in this pull request?
      
      GenerateUnsafeProjection.writeStructToBuffer() did not honor the assumption that the caller must make sure that a value is not null before using the getter. This could lead to various errors. This change fixes that behavior.
      
      Example of code generated before:
      ```scala
      /* 059 */         final UTF8String fieldName = value.getUTF8String(0);
      /* 060 */         if (value.isNullAt(0)) {
      /* 061 */           rowWriter1.setNullAt(0);
      /* 062 */         } else {
      /* 063 */           rowWriter1.write(0, fieldName);
      /* 064 */         }
      ```
      
      Example of code generated now:
      ```scala
      /* 060 */         boolean isNull1 = value.isNullAt(0);
      /* 061 */         UTF8String value1 = isNull1 ? null : value.getUTF8String(0);
      /* 062 */         if (isNull1) {
      /* 063 */           rowWriter1.setNullAt(0);
      /* 064 */         } else {
      /* 065 */           rowWriter1.write(0, value1);
      /* 066 */         }
      ```
      
      ## How was this patch tested?
      
      Adds GenerateUnsafeProjectionSuite.
      
      Author: Ala Luszczak <ala@databricks.com>
      
      Closes #18030 from ala/fix-generate-unsafe-projection.
      
      (cherry picked from commit ce8edb8b)
      Signed-off-by: default avatarHerman van Hovell <hvanhovell@databricks.com>
      e326de48
  3. May 15, 2017
  4. May 11, 2017
    • liuxian's avatar
      [SPARK-20665][SQL] Bround" and "Round" function return NULL · 6e89d574
      liuxian authored
      
         spark-sql>select bround(12.3, 2);
         spark-sql>NULL
      For this case,  the expected result is 12.3, but it is null.
      So ,when the second parameter is bigger than "decimal.scala", the result is not we expected.
      "round" function  has the same problem. This PR can solve the problem for both of them.
      
      unit test cases in MathExpressionsSuite and MathFunctionsSuite
      
      Author: liuxian <liu.xian3@zte.com.cn>
      
      Closes #17906 from 10110346/wip_lx_0509.
      
      (cherry picked from commit 2b36eb69)
      Signed-off-by: default avatarWenchen Fan <wenchen@databricks.com>
      6e89d574
  5. May 10, 2017
    • Wenchen Fan's avatar
      [SPARK-20688][SQL] correctly check analysis for scalar sub-queries · bdc08ab6
      Wenchen Fan authored
      
      In `CheckAnalysis`, we should call `checkAnalysis` for `ScalarSubquery` at the beginning, as later we will call `plan.output` which is invalid if `plan` is not resolved.
      
      new regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17930 from cloud-fan/tmp.
      
      (cherry picked from commit 789bdbe3)
      Signed-off-by: default avatarWenchen Fan <wenchen@databricks.com>
      bdc08ab6
    • Josh Rosen's avatar
      [SPARK-20686][SQL] PropagateEmptyRelation incorrectly handles aggregate without grouping · 8e097890
      Josh Rosen authored
      
      The query
      
      ```
      SELECT 1 FROM (SELECT COUNT(*) WHERE FALSE) t1
      ```
      
      should return a single row of output because the subquery is an aggregate without a group-by and thus should return a single row. However, Spark incorrectly returns zero rows.
      
      This is caused by SPARK-16208 / #13906, a patch which added an optimizer rule to propagate EmptyRelation through operators. The logic for handling aggregates is wrong: it checks whether aggregate expressions are non-empty for deciding whether the output should be empty, whereas it should be checking grouping expressions instead:
      
      An aggregate with non-empty grouping expression will return one output row per group. If the input to the grouped aggregate is empty then all groups will be empty and thus the output will be empty. It doesn't matter whether the aggregation output columns include aggregate expressions since that won't affect the number of output rows.
      
      If the grouping expressions are empty, however, then the aggregate will always produce a single output row and thus we cannot propagate the EmptyRelation.
      
      The current implementation is incorrect and also misses an optimization opportunity by not propagating EmptyRelation in the case where a grouped aggregate has aggregate expressions (in other words, `SELECT COUNT(*) from emptyRelation GROUP BY x` would _not_ be optimized to `EmptyRelation` in the old code, even though it safely could be).
      
      This patch resolves this issue by modifying `PropagateEmptyRelation` to consider only the presence/absence of grouping expressions, not the aggregate functions themselves, when deciding whether to propagate EmptyRelation.
      
      - Added end-to-end regression tests in `SQLQueryTest`'s `group-by.sql` file.
      - Updated unit tests in `PropagateEmptyRelationSuite`.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #17929 from JoshRosen/fix-PropagateEmptyRelation.
      
      (cherry picked from commit a90c5cd8)
      Signed-off-by: default avatarWenchen Fan <wenchen@databricks.com>
      8e097890
  6. May 09, 2017
    • Yuming Wang's avatar
      [SPARK-17685][SQL] Make SortMergeJoinExec's currentVars is null when calling createJoinKey · 50f28dfe
      Yuming Wang authored
      
      ## What changes were proposed in this pull request?
      
      The following SQL query cause `IndexOutOfBoundsException` issue when `LIMIT > 1310720`:
      ```sql
      CREATE TABLE tab1(int int, int2 int, str string);
      CREATE TABLE tab2(int int, int2 int, str string);
      INSERT INTO tab1 values(1,1,'str');
      INSERT INTO tab1 values(2,2,'str');
      INSERT INTO tab2 values(1,1,'str');
      INSERT INTO tab2 values(2,3,'str');
      
      SELECT
        count(*)
      FROM
        (
          SELECT t1.int, t2.int2
          FROM (SELECT * FROM tab1 LIMIT 1310721) t1
          INNER JOIN (SELECT * FROM tab2 LIMIT 1310721) t2
          ON (t1.int = t2.int AND t1.int2 = t2.int2)
        ) t;
      ```
      
      This pull request fix this issue.
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #17920 from wangyum/SPARK-17685.
      
      (cherry picked from commit 771abeb4)
      Signed-off-by: default avatarHerman van Hovell <hvanhovell@databricks.com>
      50f28dfe
  7. Apr 25, 2017
    • Xiao Li's avatar
      [SPARK-20439][SQL][BACKPORT-2.1] Fix Catalog API listTables and getTable when... · 6696ad0e
      Xiao Li authored
      [SPARK-20439][SQL][BACKPORT-2.1] Fix Catalog API listTables and getTable when failed to fetch table metadata
      
      ### What changes were proposed in this pull request?
      
      This PR is to backport https://github.com/apache/spark/pull/17730 to Spark 2.1
      --- --
      `spark.catalog.listTables` and `spark.catalog.getTable` does not work if we are unable to retrieve table metadata due to any reason (e.g., table serde class is not accessible or the table type is not accepted by Spark SQL). After this PR, the APIs still return the corresponding Table without the description and tableType)
      
      ### How was this patch tested?
      Added a test case
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17760 from gatorsmile/backport-17730.
      6696ad0e
    • Patrick Wendell's avatar
      8460b090
    • Patrick Wendell's avatar
    • Sameer Agarwal's avatar
      [SPARK-20451] Filter out nested mapType datatypes from sort order in randomSplit · 42796659
      Sameer Agarwal authored
      
      ## What changes were proposed in this pull request?
      
      In `randomSplit`, It is possible that the underlying dataset doesn't guarantee the ordering of rows in its constituent partitions each time a split is materialized which could result in overlapping
      splits.
      
      To prevent this, as part of SPARK-12662, we explicitly sort each input partition to make the ordering deterministic. Given that `MapTypes` cannot be sorted this patch explicitly prunes them out from the sort order. Additionally, if the resulting sort order is empty, this patch then materializes the dataset to guarantee determinism.
      
      ## How was this patch tested?
      
      Extended `randomSplit on reordered partitions` in `DataFrameStatSuite` to also test for dataframes with mapTypes nested mapTypes.
      
      Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
      
      Closes #17751 from sameeragarwal/randomsplit2.
      
      (cherry picked from commit 31345fde)
      Signed-off-by: default avatarWenchen Fan <wenchen@databricks.com>
      42796659
  8. Apr 22, 2017
    • Bogdan Raducanu's avatar
      [SPARK-20407][TESTS][BACKPORT-2.1] ParquetQuerySuite 'Enabling/disabling... · ba505805
      Bogdan Raducanu authored
      [SPARK-20407][TESTS][BACKPORT-2.1] ParquetQuerySuite 'Enabling/disabling ignoreCorruptFiles' flaky test
      
      ## What changes were proposed in this pull request?
      
      SharedSQLContext.afterEach now calls DebugFilesystem.assertNoOpenStreams inside eventually.
      SQLTestUtils withTempDir calls waitForTasksToFinish before deleting the directory.
      
      ## How was this patch tested?
      New test but marked as ignored because it takes 30s. Can be unignored for review.
      
      Author: Bogdan Raducanu <bogdan@databricks.com>
      
      Closes #17720 from bogdanrdc/SPARK-20407-BACKPORT2.1.
      ba505805
  9. Apr 20, 2017
    • Wenchen Fan's avatar
      [SPARK-20409][SQL] fail early if aggregate function in GROUP BY · 66e7a8f1
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      It's illegal to have aggregate function in GROUP BY, and we should fail at analysis phase, if this happens.
      
      ## How was this patch tested?
      
      new regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17704 from cloud-fan/minor.
      66e7a8f1
  10. Apr 19, 2017
    • Koert Kuipers's avatar
      [SPARK-20359][SQL] Avoid unnecessary execution in EliminateOuterJoin... · 171bf656
      Koert Kuipers authored
      [SPARK-20359][SQL] Avoid unnecessary execution in EliminateOuterJoin optimization that can lead to NPE
      
      Avoid necessary execution that can lead to NPE in EliminateOuterJoin and add test in DataFrameSuite to confirm NPE is no longer thrown
      
      ## What changes were proposed in this pull request?
      Change leftHasNonNullPredicate and rightHasNonNullPredicate to lazy so they are only executed when needed.
      
      ## How was this patch tested?
      
      Added test in DataFrameSuite that failed before this fix and now succeeds. Note that a test in catalyst project would be better but i am unsure how to do this.
      
      Please review http://spark.apache.org/contributing.html
      
       before opening a pull request.
      
      Author: Koert Kuipers <koert@tresata.com>
      
      Closes #17660 from koertkuipers/feat-catch-npe-in-eliminate-outer-join.
      
      (cherry picked from commit 608bf30f)
      Signed-off-by: default avatarWenchen Fan <wenchen@databricks.com>
      171bf656
  11. Apr 17, 2017
    • Xiao Li's avatar
      [SPARK-20349][SQL][REVERT-BRANCH2.1] ListFunctions returns duplicate functions... · 3808b472
      Xiao Li authored
      [SPARK-20349][SQL][REVERT-BRANCH2.1] ListFunctions returns duplicate functions after using persistent functions
      
      Revert the changes of https://github.com/apache/spark/pull/17646 made in Branch 2.1, because it breaks the build. It needs the parser interface, but SessionCatalog in branch 2.1 does not have it.
      
      ### What changes were proposed in this pull request?
      
      The session catalog caches some persistent functions in the `FunctionRegistry`, so there can be duplicates. Our Catalog API `listFunctions` does not handle it.
      
      It would be better if `SessionCatalog` API can de-duplciate the records, instead of doing it by each API caller. In `FunctionRegistry`, our functions are identified by the unquoted string. Thus, this PR is try to parse it using our parser interface and then de-duplicate the names.
      
      ### How was this patch tested?
      Added test cases.
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17661 from gatorsmile/compilationFix17646.
      3808b472
    • Xiao Li's avatar
      [SPARK-20349][SQL] ListFunctions returns duplicate functions after using persistent functions · 7aad057b
      Xiao Li authored
      
      ### What changes were proposed in this pull request?
      The session catalog caches some persistent functions in the `FunctionRegistry`, so there can be duplicates. Our Catalog API `listFunctions` does not handle it.
      
      It would be better if `SessionCatalog` API can de-duplciate the records, instead of doing it by each API caller. In `FunctionRegistry`, our functions are identified by the unquoted string. Thus, this PR is try to parse it using our parser interface and then de-duplicate the names.
      
      ### How was this patch tested?
      Added test cases.
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17646 from gatorsmile/showFunctions.
      
      (cherry picked from commit 01ff0350)
      Signed-off-by: default avatarXiao Li <gatorsmile@gmail.com>
      7aad057b
  12. Apr 14, 2017
  13. Apr 12, 2017
  14. Apr 10, 2017
    • DB Tsai's avatar
      [SPARK-18555][MINOR][SQL] Fix the @since tag when backporting from 2.2 branch into 2.1 branch · 03a42c01
      DB Tsai authored
      ## What changes were proposed in this pull request?
      
      Fix the since tag when backporting critical bugs (SPARK-18555) from 2.2 branch into 2.1 branch.
      
      ## How was this patch tested?
      
      N/A
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: DB Tsai <dbtsai@dbtsai.com>
      
      Closes #17600 from dbtsai/branch-2.1.
      Unverified
      03a42c01
    • DB Tsai's avatar
      [SPARK-20270][SQL] na.fill should not change the values in long or integer... · f40e44de
      DB Tsai authored
      [SPARK-20270][SQL] na.fill should not change the values in long or integer when the default value is in double
      
      ## What changes were proposed in this pull request?
      
      This bug was partially addressed in SPARK-18555 https://github.com/apache/spark/pull/15994
      
      , but the root cause isn't completely solved. This bug is pretty critical since it changes the member id in Long in our application if the member id can not be represented by Double losslessly when the member id is very big.
      
      Here is an example how this happens, with
      ```
            Seq[(java.lang.Long, java.lang.Double)]((null, 3.14), (9123146099426677101L, null),
              (9123146560113991650L, 1.6), (null, null)).toDF("a", "b").na.fill(0.2),
      ```
      the logical plan will be
      ```
      == Analyzed Logical Plan ==
      a: bigint, b: double
      Project [cast(coalesce(cast(a#232L as double), cast(0.2 as double)) as bigint) AS a#240L, cast(coalesce(nanvl(b#233, cast(null as double)), 0.2) as double) AS b#241]
      +- Project [_1#229L AS a#232L, _2#230 AS b#233]
         +- LocalRelation [_1#229L, _2#230]
      ```
      
      Note that even the value is not null, Spark will cast the Long into Double first. Then if it's not null, Spark will cast it back to Long which results in losing precision.
      
      The behavior should be that the original value should not be changed if it's not null, but Spark will change the value which is wrong.
      
      With the PR, the logical plan will be
      ```
      == Analyzed Logical Plan ==
      a: bigint, b: double
      Project [coalesce(a#232L, cast(0.2 as bigint)) AS a#240L, coalesce(nanvl(b#233, cast(null as double)), cast(0.2 as double)) AS b#241]
      +- Project [_1#229L AS a#232L, _2#230 AS b#233]
         +- LocalRelation [_1#229L, _2#230]
      ```
      which behaves correctly without changing the original Long values and also avoids extra cost of unnecessary casting.
      
      ## How was this patch tested?
      
      unit test added.
      
      +cc srowen rxin cloud-fan gatorsmile
      
      Thanks.
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #17577 from dbtsai/fixnafill.
      
      (cherry picked from commit 1a0bc416)
      Signed-off-by: default avatarDB Tsai <dbtsai@dbtsai.com>
      Unverified
      f40e44de
    • root's avatar
      [SPARK-18555][SQL] DataFrameNaFunctions.fill miss up original values in long integers · b26f2c2c
      root authored
      
      ## What changes were proposed in this pull request?
      
         DataSet.na.fill(0) used on a DataSet which has a long value column, it will change the original long value.
      
         The reason is that the type of the function fill's param is Double, and the numeric columns are always cast to double(`fillCol[Double](f, value)`) .
      ```
        def fill(value: Double, cols: Seq[String]): DataFrame = {
          val columnEquals = df.sparkSession.sessionState.analyzer.resolver
          val projections = df.schema.fields.map { f =>
            // Only fill if the column is part of the cols list.
            if (f.dataType.isInstanceOf[NumericType] && cols.exists(col => columnEquals(f.name, col))) {
              fillCol[Double](f, value)
            } else {
              df.col(f.name)
            }
          }
          df.select(projections : _*)
        }
      ```
      
       For example:
      ```
      scala> val df = Seq[(Long, Long)]((1, 2), (-1, -2), (9123146099426677101L, 9123146560113991650L)).toDF("a", "b")
      df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint]
      
      scala> df.show
      +-------------------+-------------------+
      |                  a|                  b|
      +-------------------+-------------------+
      |                  1|                  2|
      |                 -1|                 -2|
      |9123146099426677101|9123146560113991650|
      +-------------------+-------------------+
      
      scala> df.na.fill(0).show
      +-------------------+-------------------+
      |                  a|                  b|
      +-------------------+-------------------+
      |                  1|                  2|
      |                 -1|                 -2|
      |9123146099426676736|9123146560113991680|
      +-------------------+-------------------+
       ```
      
      the original values changed [which is not we expected result]:
      ```
       9123146099426677101 -> 9123146099426676736
       9123146560113991650 -> 9123146560113991680
      ```
      
      ## How was this patch tested?
      
      unit test added.
      
      Author: root <root@iZbp1gsnrlfzjxh82cz80vZ.(none)>
      
      Closes #15994 from windpiger/nafillMissupOriginalValue.
      
      (cherry picked from commit 508de38c)
      Signed-off-by: default avatarDB Tsai <dbtsai@dbtsai.com>
      Unverified
      b26f2c2c
    • Bogdan Raducanu's avatar
      [SPARK-20280][CORE] FileStatusCache Weigher integer overflow · bc7304e1
      Bogdan Raducanu authored
      
      ## What changes were proposed in this pull request?
      
      Weigher.weigh needs to return Int but it is possible for an Array[FileStatus] to have size > Int.maxValue. To avoid this, the size is scaled down by a factor of 32. The maximumWeight of the cache is also scaled down by the same factor.
      
      ## How was this patch tested?
      New test in FileIndexSuite
      
      Author: Bogdan Raducanu <bogdan@databricks.com>
      
      Closes #17591 from bogdanrdc/SPARK-20280.
      
      (cherry picked from commit f6dd8e0e)
      Signed-off-by: default avatarHerman van Hovell <hvanhovell@databricks.com>
      bc7304e1
  15. Apr 09, 2017
    • Reynold Xin's avatar
      [SPARK-20264][SQL] asm should be non-test dependency in sql/core · 1a73046b
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      sq/core module currently declares asm as a test scope dependency. Transitively it should actually be a normal dependency since the actual core module defines it. This occasionally confuses IntelliJ.
      
      ## How was this patch tested?
      N/A - This is a build change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #17574 from rxin/SPARK-20264.
      
      (cherry picked from commit 7bfa05e0)
      Signed-off-by: default avatarXiao Li <gatorsmile@gmail.com>
      1a73046b
  16. Apr 05, 2017
  17. Mar 31, 2017
    • Kunal Khamar's avatar
      [SPARK-20164][SQL] AnalysisException not tolerant of null query plan. · 6a1b2eb4
      Kunal Khamar authored
      
      The query plan in an `AnalysisException` may be `null` when an `AnalysisException` object is serialized and then deserialized, since `plan` is marked `transient`. Or when someone throws an `AnalysisException` with a null query plan (which should not happen).
      `def getMessage` is not tolerant of this and throws a `NullPointerException`, leading to loss of information about the original exception.
      The fix is to add a `null` check in `getMessage`.
      
      - Unit test
      
      Author: Kunal Khamar <kkhamar@outlook.com>
      
      Closes #17486 from kunalkhamar/spark-20164.
      
      (cherry picked from commit 254877c2)
      Signed-off-by: default avatarXiao Li <gatorsmile@gmail.com>
      6a1b2eb4
  18. Mar 29, 2017
  19. Mar 28, 2017
    • Patrick Wendell's avatar
      4964dbed
    • Patrick Wendell's avatar
      Preparing Spark release v2.1.1-rc2 · 02b165dc
      Patrick Wendell authored
      02b165dc
    • sureshthalamati's avatar
      [SPARK-14536][SQL][BACKPORT-2.1] fix to handle null value in array type column for postgres. · e669dd7e
      sureshthalamati authored
      ## What changes were proposed in this pull request?
      JDBC read is failing with NPE due to missing null value check for array data type if the source table has null values in the array type column. For null values Resultset.getArray() returns null.
      This PR adds null safe check to the Resultset.getArray() value before invoking method on the Array object
      
      ## How was this patch tested?
      Updated the PostgresIntegration test suite to test null values. Ran docker integration tests on my laptop.
      
      Author: sureshthalamati <suresh.thalamati@gmail.com>
      
      Closes #17460 from sureshthalamati/jdbc_array_null_fix_spark_2.1-SPARK-14536.
      e669dd7e
    • Wenchen Fan's avatar
      [SPARK-20125][SQL] Dataset of type option of map does not work · fd2e4061
      Wenchen Fan authored
      
      When we build the deserializer expression for map type, we will use `StaticInvoke` to call `ArrayBasedMapData.toScalaMap`, and declare the return type as `scala.collection.immutable.Map`. If the map is inside an Option, we will wrap this `StaticInvoke` with `WrapOption`, which requires the input to be `scala.collect.Map`. Ideally this should be fine, as `scala.collection.immutable.Map` extends `scala.collect.Map`, but our `ObjectType` is too strict about this, this PR fixes it.
      
      new regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17454 from cloud-fan/map.
      
      (cherry picked from commit d4fac410)
      Signed-off-by: default avatarCheng Lian <lian@databricks.com>
      fd2e4061
  20. Mar 25, 2017
    • Carson Wang's avatar
      [SPARK-19674][SQL] Ignore driver accumulator updates don't belong to … · d989434e
      Carson Wang authored
      [SPARK-19674][SQL] Ignore driver accumulator updates don't belong to the execution when merging all accumulator updates
      
      N.B. This is a backport to branch-2.1 of #17009.
      
      ## What changes were proposed in this pull request?
      In SQLListener.getExecutionMetrics, driver accumulator updates don't belong to the execution should be ignored when merging all accumulator updates to prevent NoSuchElementException.
      
      ## How was this patch tested?
      Updated unit test.
      
      Author: Carson Wang <carson.wangintel.com>
      
      Author: Carson Wang <carson.wang@intel.com>
      
      Closes #17418 from mallman/spark-19674-backport_2.1.
      d989434e
  21. Mar 23, 2017
    • Kazuaki Ishizaki's avatar
      [SPARK-19959][SQL] Fix to throw NullPointerException in df[java.lang.Long].collect · 92f0b012
      Kazuaki Ishizaki authored
      
      ## What changes were proposed in this pull request?
      
      This PR fixes `NullPointerException` in the generated code by Catalyst. When we run the following code, we get the following `NullPointerException`. This is because there is no null checks for `inputadapter_value`  while `java.lang.Long inputadapter_value` at Line 30 may have `null`.
      
      This happen when a type of DataFrame is nullable primitive type such as `java.lang.Long` and the wholestage codegen is used. While the physical plan keeps `nullable=true` in `input[0, java.lang.Long, true].longValue`, `BoundReference.doGenCode` ignores `nullable=true`. Thus, nullcheck code will not be generated and `NullPointerException` will occur.
      
      This PR checks the nullability and correctly generates nullcheck if needed.
      ```java
      sparkContext.parallelize(Seq[java.lang.Long](0L, null, 2L), 1).toDF.collect
      ```
      
      ```java
      Caused by: java.lang.NullPointerException
      	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:37)
      	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
      	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:393)
      ...
      ```
      
      Generated code without this PR
      ```java
      /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
      /* 006 */   private Object[] references;
      /* 007 */   private scala.collection.Iterator[] inputs;
      /* 008 */   private scala.collection.Iterator inputadapter_input;
      /* 009 */   private UnsafeRow serializefromobject_result;
      /* 010 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder;
      /* 011 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter;
      /* 012 */
      /* 013 */   public GeneratedIterator(Object[] references) {
      /* 014 */     this.references = references;
      /* 015 */   }
      /* 016 */
      /* 017 */   public void init(int index, scala.collection.Iterator[] inputs) {
      /* 018 */     partitionIndex = index;
      /* 019 */     this.inputs = inputs;
      /* 020 */     inputadapter_input = inputs[0];
      /* 021 */     serializefromobject_result = new UnsafeRow(1);
      /* 022 */     this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 0);
      /* 023 */     this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1);
      /* 024 */
      /* 025 */   }
      /* 026 */
      /* 027 */   protected void processNext() throws java.io.IOException {
      /* 028 */     while (inputadapter_input.hasNext() && !stopEarly()) {
      /* 029 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
      /* 030 */       java.lang.Long inputadapter_value = (java.lang.Long)inputadapter_row.get(0, null);
      /* 031 */
      /* 032 */       boolean serializefromobject_isNull = true;
      /* 033 */       long serializefromobject_value = -1L;
      /* 034 */       if (!false) {
      /* 035 */         serializefromobject_isNull = false;
      /* 036 */         if (!serializefromobject_isNull) {
      /* 037 */           serializefromobject_value = inputadapter_value.longValue();
      /* 038 */         }
      /* 039 */
      /* 040 */       }
      /* 041 */       serializefromobject_rowWriter.zeroOutNullBytes();
      /* 042 */
      /* 043 */       if (serializefromobject_isNull) {
      /* 044 */         serializefromobject_rowWriter.setNullAt(0);
      /* 045 */       } else {
      /* 046 */         serializefromobject_rowWriter.write(0, serializefromobject_value);
      /* 047 */       }
      /* 048 */       append(serializefromobject_result);
      /* 049 */       if (shouldStop()) return;
      /* 050 */     }
      /* 051 */   }
      /* 052 */ }
      ```
      
      Generated code with this PR
      
      ```java
      /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
      /* 006 */   private Object[] references;
      /* 007 */   private scala.collection.Iterator[] inputs;
      /* 008 */   private scala.collection.Iterator inputadapter_input;
      /* 009 */   private UnsafeRow serializefromobject_result;
      /* 010 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder;
      /* 011 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter;
      /* 012 */
      /* 013 */   public GeneratedIterator(Object[] references) {
      /* 014 */     this.references = references;
      /* 015 */   }
      /* 016 */
      /* 017 */   public void init(int index, scala.collection.Iterator[] inputs) {
      /* 018 */     partitionIndex = index;
      /* 019 */     this.inputs = inputs;
      /* 020 */     inputadapter_input = inputs[0];
      /* 021 */     serializefromobject_result = new UnsafeRow(1);
      /* 022 */     this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 0);
      /* 023 */     this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1);
      /* 024 */
      /* 025 */   }
      /* 026 */
      /* 027 */   protected void processNext() throws java.io.IOException {
      /* 028 */     while (inputadapter_input.hasNext() && !stopEarly()) {
      /* 029 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
      /* 030 */       boolean inputadapter_isNull = inputadapter_row.isNullAt(0);
      /* 031 */       java.lang.Long inputadapter_value = inputadapter_isNull ? null : ((java.lang.Long)inputadapter_row.get(0, null));
      /* 032 */
      /* 033 */       boolean serializefromobject_isNull = true;
      /* 034 */       long serializefromobject_value = -1L;
      /* 035 */       if (!inputadapter_isNull) {
      /* 036 */         serializefromobject_isNull = false;
      /* 037 */         if (!serializefromobject_isNull) {
      /* 038 */           serializefromobject_value = inputadapter_value.longValue();
      /* 039 */         }
      /* 040 */
      /* 041 */       }
      /* 042 */       serializefromobject_rowWriter.zeroOutNullBytes();
      /* 043 */
      /* 044 */       if (serializefromobject_isNull) {
      /* 045 */         serializefromobject_rowWriter.setNullAt(0);
      /* 046 */       } else {
      /* 047 */         serializefromobject_rowWriter.write(0, serializefromobject_value);
      /* 048 */       }
      /* 049 */       append(serializefromobject_result);
      /* 050 */       if (shouldStop()) return;
      /* 051 */     }
      /* 052 */   }
      /* 053 */ }
      ```
      
      ## How was this patch tested?
      
      Added new test suites in `DataFrameSuites`
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #17302 from kiszk/SPARK-19959.
      
      (cherry picked from commit bb823ca4)
      Signed-off-by: default avatarWenchen Fan <wenchen@databricks.com>
      92f0b012
  22. Mar 21, 2017
    • Patrick Wendell's avatar
      c4d2b833
    • Patrick Wendell's avatar
      Preparing Spark release v2.1.1-rc1 · 30abb95c
      Patrick Wendell authored
      30abb95c
    • Takeshi Yamamuro's avatar
      [SPARK-19980][SQL][BACKPORT-2.1] Add NULL checks in Bean serializer · a04428fe
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      A Bean serializer in `ExpressionEncoder`  could change values when Beans having NULL. A concrete example is as follows;
      ```
      scala> :paste
      class Outer extends Serializable {
        private var cls: Inner = _
        def setCls(c: Inner): Unit = cls = c
        def getCls(): Inner = cls
      }
      
      class Inner extends Serializable {
        private var str: String = _
        def setStr(s: String): Unit = str = str
        def getStr(): String = str
      }
      
      scala> Seq("""{"cls":null}""", """{"cls": {"str":null}}""").toDF().write.text("data")
      scala> val encoder = Encoders.bean(classOf[Outer])
      scala> val schema = encoder.schema
      scala> val df = spark.read.schema(schema).json("data").as[Outer](encoder)
      scala> df.show
      +------+
      |   cls|
      +------+
      |[null]|
      |  null|
      +------+
      
      scala> df.map(x => x)(encoder).show()
      +------+
      |   cls|
      +------+
      |[null]|
      |[null]|     // <-- Value changed
      +------+
      ```
      
      This is because the Bean serializer does not have the NULL-check expressions that the serializer of Scala's product types has. Actually, this value change does not happen in Scala's product types;
      
      ```
      scala> :paste
      case class Outer(cls: Inner)
      case class Inner(str: String)
      
      scala> val encoder = Encoders.product[Outer]
      scala> val schema = encoder.schema
      scala> val df = spark.read.schema(schema).json("data").as[Outer](encoder)
      scala> df.show
      +------+
      |   cls|
      +------+
      |[null]|
      |  null|
      +------+
      
      scala> df.map(x => x)(encoder).show()
      +------+
      |   cls|
      +------+
      |[null]|
      |  null|
      +------+
      ```
      
      This pr added the NULL-check expressions in Bean serializer along with the serializer of Scala's product types.
      
      ## How was this patch tested?
      Added tests in `JavaDatasetSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #17372 from maropu/SPARK-19980-BACKPORT2.1.
      a04428fe
    • Will Manning's avatar
      clarify array_contains function description · 9dfdd2ad
      Will Manning authored
      ## What changes were proposed in this pull request?
      
      The description in the comment for array_contains is vague/incomplete (i.e., doesn't mention that it returns `null` if the array is `null`); this PR fixes that.
      
      ## How was this patch tested?
      
      No testing, since it merely changes a comment.
      
      Please review http://spark.apache.org/contributing.html
      
       before opening a pull request.
      
      Author: Will Manning <lwwmanning@gmail.com>
      
      Closes #17380 from lwwmanning/patch-1.
      
      (cherry picked from commit a04dcde8)
      Signed-off-by: default avatarReynold Xin <rxin@databricks.com>
      9dfdd2ad
  23. Mar 20, 2017
    • wangzhenhua's avatar
      [SPARK-19994][SQL] Wrong outputOrdering for right/full outer smj · af8bf218
      wangzhenhua authored
      
      ## What changes were proposed in this pull request?
      
      For right outer join, values of the left key will be filled with nulls if it can't match the value of the right key, so `nullOrdering` of the left key can't be guaranteed. We should output right key order instead of left key order.
      
      For full outer join, neither left key nor right key guarantees `nullOrdering`. We should not output any ordering.
      
      In tests, besides adding three test cases for left/right/full outer sort merge join, this patch also reorganizes code in `PlannerSuite` by putting together tests for `Sort`, and also extracts common logic in Sort tests into a method.
      
      ## How was this patch tested?
      
      Corresponding test cases are added.
      
      Author: wangzhenhua <wangzhenhua@huawei.com>
      Author: Zhenhua Wang <wzh_zju@163.com>
      
      Closes #17331 from wzhfy/wrongOrdering.
      
      (cherry picked from commit 965a5abc)
      Signed-off-by: default avatarWenchen Fan <wenchen@databricks.com>
      af8bf218
  24. Mar 17, 2017
Loading