  1. Jan 19, 2017
    • [SPARK-18899][SPARK-18912][SPARK-18913][SQL] refactor the error checking when... · 7bc3e9ba
      Wenchen Fan authored
      [SPARK-18899][SPARK-18912][SPARK-18913][SQL] refactor the error checking when append data to an existing table
      
      ## What changes were proposed in this pull request?
      
      When we append data to an existing table with `DataFrameWriter.saveAsTable`, we will do various checks to make sure the appended data is consistent with the existing data.
      
      However, we get the information about the existing table by matching the table relation, instead of looking at the table metadata. This is error-prone: for example, we only check the number of columns for `HadoopFsRelation`, and we forget to check bucketing altogether.
      
      This PR refactors the error checking to look at the metadata of the existing table, and fixes several bugs (an illustrative sketch follows the list):
      * SPARK-18899: We forgot to check whether the specified bucketing matches the existing table, which may lead to a problematic table with different bucketing in different data files.
      * SPARK-18912: We forgot to check the number of columns for non-file-based data source tables.
      * SPARK-18913: We don't support appending data to a table with special column names.
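      
      A sketch of the kind of mismatch SPARK-18899 now rejects; the table and column names are made up for this example, and this is not code from the patch itself:
      
      ```scala
      val df = spark.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "name")
      
      // Create a table bucketed by "id" into 4 buckets.
      df.write.bucketBy(4, "id").sortBy("id").saveAsTable("bucketed_tab")
      
      // Appending with a different bucketing spec should now fail analysis
      // instead of silently producing files with inconsistent bucketing.
      df.write.mode("append").bucketBy(8, "name").saveAsTable("bucketed_tab")
      ```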
      
      ## How was this patch tested?
      new regression test.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16313 from cloud-fan/bug1.
      
      (cherry picked from commit f923c849)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      7bc3e9ba
  2. Jan 18, 2017
    • [SPARK-19168][STRUCTURED STREAMING] StateStore should be aborted upon error · 4cff0b50
      Liwei Lin authored
      
      ## What changes were proposed in this pull request?
      
      We should call `StateStore.abort()` if any error occurs before the store is committed.
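      
      A minimal sketch of the intended pattern, written against a hypothetical store interface rather than Spark's internal `StateStore` trait:
      
      ```scala
      // Hypothetical interface for illustration only; the real StateStore lives in
      // org.apache.spark.sql.execution.streaming.state and has a richer API.
      trait SimpleStore {
        def put(key: String, value: String): Unit
        def commit(): Unit
        def abort(): Unit
      }
      
      def writeBatch(store: SimpleStore, rows: Seq[(String, String)]): Unit = {
        var committed = false
        try {
          rows.foreach { case (k, v) => store.put(k, v) }
          store.commit()
          committed = true
        } finally {
          // Any failure before the commit must release the store's resources.
          if (!committed) store.abort()
        }
      }
      ```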
      
      ## How was this patch tested?
      
      Manually.
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #16547 from lw-lin/append-filter.
      
      (cherry picked from commit 569e5068)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      4cff0b50
    • [SPARK-19113][SS][TESTS] Ignore StreamingQueryException thrown from... · 047506ba
      Shixiong Zhu authored
      [SPARK-19113][SS][TESTS] Ignore StreamingQueryException thrown from awaitInitialization to avoid breaking tests
      
      ## What changes were proposed in this pull request?
      
      #16492 missed one race condition: `StreamExecution.awaitInitialization` may throw fatal errors and fail the test. This PR just ignores `StreamingQueryException` thrown from `awaitInitialization` so that we can verify the exception in the `ExpectFailure` action later. It's fine since `StopStream` or `ExpectFailure` will catch `StreamingQueryException` as well.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16567 from zsxwing/SPARK-19113-2.
      
      (cherry picked from commit c050c122)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      047506ba
  3. Jan 17, 2017
    • [SPARK-19129][SQL] SessionCatalog: Disallow empty part col values in partition spec · 3ec3e3f2
      gatorsmile authored
      
      Empty partition column values are not valid in a partition specification. Before this PR, we accepted them, and the Hive metastore does not detect or disallow them either. Thus, users hit the following surprising behavior.
      
      ```Scala
      val df = spark.createDataFrame(Seq((0, "a"), (1, "b"))).toDF("partCol1", "name")
      df.write.mode("overwrite").partitionBy("partCol1").saveAsTable("partitionedTable")
      spark.sql("alter table partitionedTable drop partition(partCol1='')")
      spark.table("partitionedTable").show()
      ```
      
      In the above example, the WHOLE table is DROPPED when users specify a partition spec containing only one partition column with empty values.
      
      When the spec contains more than one partition column, the Hive metastore APIs simply ignore the columns with empty values and treat the spec as a partial spec. This is also unexpected and does not follow actual Hive behavior. This PR disallows such invalid partition specs in the `SessionCatalog` APIs.
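      
      A rough sketch of the kind of check this adds; the helper name here is made up, and the real change lives in the `SessionCatalog` APIs:
      
      ```scala
      // Hypothetical helper illustrating the rule: every partition column in the
      // spec must carry a non-empty value.
      def requireNonEmptyValues(spec: Map[String, String], table: String): Unit = {
        spec.foreach { case (col, value) =>
          require(value != null && value.nonEmpty,
            s"Partition spec is invalid: the value of partition column '$col' " +
              s"must not be empty for table '$table'")
        }
      }
      ```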
      
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16583 from gatorsmile/disallowEmptyPartColValue.
      
      (cherry picked from commit a23debd7)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      3ec3e3f2
    • [SPARK-19065][SQL] Don't inherit expression id in dropDuplicates · 13986a72
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      `dropDuplicates` will create an Alias using the same exprId, so `StreamExecution` should also replace Alias if necessary.
      
      ## How was this patch tested?
      
      test("SPARK-19065: dropDuplicates should not create expressions using the same id")
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16564 from zsxwing/SPARK-19065.
      
      (cherry picked from commit a83accfc)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      13986a72
  4. Jan 16, 2017
    • [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parquet · 4f3ce062
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      We have a config `spark.sql.files.ignoreCorruptFiles` which can be used to ignore corrupt files when reading files in SQL. Currently the `ignoreCorruptFiles` config has two issues and can't work for Parquet:
      
      1. We only ignore corrupt files in `FileScanRDD`. However, we start reading those files as early as inferring the data schema from them; for corrupt files we can't read the schema, and the job fails. A related issue was reported at http://apache-spark-developers-list.1001551.n3.nabble.com/Skip-Corrupted-Parquet-blocks-footer-tc20418.html
      2. In `FileScanRDD`, we assume that we only begin to read the files when starting to consume the iterator. However, the files may be read before that, and in that case the `ignoreCorruptFiles` config doesn't work either.
      
      This patch targets the Parquet data source. If this direction is OK, we can address the same issue for other data sources such as ORC.
      
      Two main changes in this patch:
      
      1. Replace `ParquetFileReader.readAllFootersInParallel` with our own logic for reading footers in a multi-threaded manner
      
          We can't ignore corrupt files if we use `ParquetFileReader.readAllFootersInParallel`, so this patch implements similar logic in `readParquetFootersInParallel`.
      
      2. In `FileScanRDD`, also ignore corrupt files when calling `readFunction` to return the iterator.
      
      One thing to notice:
      
      We read the schema from a Parquet file's footer, and the footer-reading method `ParquetFileReader.readFooter` throws `RuntimeException`, instead of `IOException`, if it can't successfully read the footer (see https://github.com/apache/parquet-mr/blob/df9d8e415436292ae33e1ca0b8da256640de9710/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L470). So this patch catches `RuntimeException`. One concern is that it might also shadow runtime exceptions unrelated to corrupt files.
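      
      For reference, this is how a user enables the config mentioned above; the path below is made up for the example:
      
      ```scala
      // Skip unreadable Parquet files instead of failing the whole job.
      spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
      
      // With this patch, corrupt footers hit during schema inference and corrupt
      // files hit while reading splits are both skipped.
      val df = spark.read.parquet("/data/events")
      df.count()
      ```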
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16474 from viirya/fix-ignorecorrupted-parquet-files.
      
      (cherry picked from commit 61e48f52)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      4f3ce062
  5. Jan 15, 2017
    • [SPARK-19092][SQL][BACKPORT-2.1] Save() API of DataFrameWriter should not scan... · bf2f233e
      gatorsmile authored
      [SPARK-19092][SQL][BACKPORT-2.1] Save() API of DataFrameWriter should not scan all the saved files #16481
      
      ### What changes were proposed in this pull request?
      
      #### This PR is to backport https://github.com/apache/spark/pull/16481 to Spark 2.1
      ---
      `DataFrameWriter`'s [save() API](https://github.com/gatorsmile/spark/blob/5d38f09f47a767a342a0a8219c63efa2943b5d1f/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L207) performs an unnecessary full filesystem scan over the saved files. The save() API is the most basic/core API in `DataFrameWriter`, so we should avoid this scan there.
      
      ### How was this patch tested?
      Added and modified the test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16588 from gatorsmile/backport-19092.
      bf2f233e
    • [SPARK-19120] Refresh Metadata Cache After Loading Hive Tables · db37049d
      gatorsmile authored
      
      ```Scala
              sql("CREATE TABLE tab (a STRING) STORED AS PARQUET")
      
              // This table fetch is to fill the cache with zero leaf files
              spark.table("tab").show()
      
              sql(
                s"""
                   |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
                   |INTO TABLE tab
                 """.stripMargin)
      
              spark.table("tab").show()
      ```
      
      In the above example, the result returned after loading the table is empty. The metadata cache can be out of date after new data is loaded into the table, because loading/inserting does not update the cache. So far, the metadata cache is only used for data source tables. Thus, among Hive serde tables, only the `parquet` and `orc` formats face this issue, because Hive serde tables in parquet/orc format can be converted to data source tables when `spark.sql.hive.convertMetastoreParquet`/`spark.sql.hive.convertMetastoreOrc` is on.
      
      This PR refreshes the metadata cache after processing the `LOAD DATA` command.
      
      In addition, Spark SQL does not convert **partitioned** Hive tables (orc/parquet) to data source tables in the write path, but the read path uses the metadata cache for both **partitioned** and non-partitioned Hive tables (orc/parquet). That means writing partitioned parquet/orc tables still uses `InsertIntoHiveTable` instead of `InsertIntoHadoopFsRelationCommand`. To avoid reading an out-of-date cache, `InsertIntoHiveTable` needs to refresh the metadata cache for partitioned tables. Note that it does not need to refresh the cache for non-partitioned parquet/orc tables, because they do not go through `InsertIntoHiveTable` at all. Based on the review comments, this PR keeps the existing logic unchanged: we always refresh the table whether or not it is partitioned.
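      
      On builds without this fix, the stale cache can also be cleared manually; the table name matches the example above:
      
      ```scala
      // Explicitly refresh the cached metadata/file listing for the table.
      spark.catalog.refreshTable("tab")
      spark.table("tab").show()   // now reflects the newly loaded files
      ```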
      
      Added test cases in parquetSuites.scala
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16500 from gatorsmile/refreshInsertIntoHiveTable.
      
      (cherry picked from commit de62ddf7)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      db37049d
  6. Jan 13, 2017
    • [SPARK-19180] [SQL] the offset of short should be 2 in OffHeapColumn · 5e9be1e1
      Yucai Yu authored
      
      ## What changes were proposed in this pull request?
      
      The offset of a short is 4 in `OffHeapColumnVector`'s `putShorts`, but it should actually be 2, since a short occupies 2 bytes.
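      
      The arithmetic behind the fix, sketched independently of the actual `OffHeapColumnVector` code:
      
      ```scala
      // Illustrative only: compute the byte offset of a short at a given row index.
      val bytesPerShort = java.lang.Short.BYTES  // 2
      def shortOffset(baseAddress: Long, rowId: Int): Long =
        baseAddress + bytesPerShort.toLong * rowId  // the bug effectively used 4 * rowId
      ```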
      
      ## How was this patch tested?
      
      unit test
      
      Author: Yucai Yu <yucai.yu@intel.com>
      
      Closes #16555 from yucai/offheap_short.
      
      (cherry picked from commit ad0dadaa)
      Signed-off-by: Davies Liu <davies.liu@gmail.com>
      5e9be1e1
    • [SPARK-19178][SQL] convert string of large numbers to int should return null · 2c2ca894
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
      When we convert a string to an integral type, we first convert the string to `decimal(20, 0)`, so that a string in decimal format can be turned into a truncated integral, e.g. `CAST('1.2' AS int)` returns `1`.
      
      However, this causes problems when we convert a string holding a large number to an integral type, e.g. `CAST('1234567890123' AS int)` returns `1912276171`, while Hive returns null as expected.
      
      This is a long-standing bug (it seems to have been there since the first day of Spark SQL); this PR fixes it by adding native support for converting `UTF8String` to integral types.
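      
      Based on the examples above, the behavior after the fix looks like this:
      
      ```scala
      spark.sql("SELECT CAST('1.2' AS int)").show()            // 1, unchanged
      spark.sql("SELECT CAST('1234567890123' AS int)").show()  // NULL instead of 1912276171
      ```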
      
      ## How was this patch tested?
      
      new regression tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16550 from cloud-fan/string-to-int.
      
      (cherry picked from commit 6b34e745)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      2c2ca894
    • Fix missing close-parens for In filter's toString · 0668e061
      Andrew Ash authored
      
      Otherwise the open parenthesis isn't closed in query plan descriptions of batch scans:
      
          PushedFilters: [In(COL_A, [1,2,4,6,10,16,219,815], IsNotNull(COL_B), ...
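      
      A sketch of the intended formatting (not the exact source of the `In` filter), closing both the bracket and the parenthesis:
      
      ```scala
      // Illustrative: render an In filter's value list with balanced delimiters.
      def formatInFilter(attribute: String, values: Seq[Any]): String =
        s"In($attribute, [${values.mkString(",")}])"
      
      formatInFilter("COL_A", Seq(1, 2, 4, 6, 10, 16, 219, 815))
      // In(COL_A, [1,2,4,6,10,16,219,815])
      ```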
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #16558 from ash211/patch-9.
      
      (cherry picked from commit b040cef2)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      0668e061
  7. Jan 10, 2017
    • [SPARK-19140][SS] Allow update mode for non-aggregation streaming queries · 230607d6
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      This PR allows update mode for non-aggregation streaming queries. It behaves the same as append mode when a query has no aggregations.
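      
      After this change, a map-only streaming query can use the update output mode; the source path and console sink are illustrative:
      
      ```scala
      import spark.implicits._
      
      val lines = spark.readStream.text("/tmp/stream-in")    // illustrative source path
      val errors = lines.filter($"value".contains("ERROR"))  // no aggregation involved
      
      val query = errors.writeStream
        .outputMode("update")  // behaves like append when the query has no aggregation
        .format("console")
        .start()
      ```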
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16520 from zsxwing/update-without-agg.
      
      (cherry picked from commit bc6c56e9)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      230607d6
    • [SPARK-19113][SS][TESTS] Set UncaughtExceptionHandler in onQueryStarted to... · e0af4b72
      Shixiong Zhu authored
      [SPARK-19113][SS][TESTS] Set UncaughtExceptionHandler in onQueryStarted to ensure catching fatal errors during query initialization
      
      ## What changes were proposed in this pull request?
      
      StreamTest currently sets the `UncaughtExceptionHandler` after starting the query, so it may not catch fatal errors thrown during query initialization. This PR uses the `onQueryStarted` callback to fix that.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16492 from zsxwing/SPARK-19113.
      e0af4b72
    • [SPARK-19137][SQL] Fix `withSQLConf` to reset `OptionalConfigEntry` correctly · 69d1c4c5
      Dongjoon Hyun authored
      
      ## What changes were proposed in this pull request?
      
      `DataStreamReaderWriterSuite` creates test files in the source folder like the following. Interestingly, the root cause is that `withSQLConf` fails to reset an `OptionalConfigEntry` correctly; in other words, it resets the config to `Some(undefined)`.
      
      ```bash
      $ git status
      Untracked files:
        (use "git add <file>..." to include in what will be committed)
      
              sql/core/%253Cundefined%253E/
              sql/core/%3Cundefined%3E/
      ```
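      
      The essence of the fix, sketched with the public conf API rather than the real `withSQLConf` helper: when an optional config had no previous value, restoring it should unset the key instead of writing a placeholder string back. The key below is just an example of an optional conf.
      
      ```scala
      val key = "spark.sql.streaming.checkpointLocation"
      val previous: Option[String] = spark.conf.getOption(key)
      
      spark.conf.set(key, "/tmp/my-checkpoints")
      try {
        // ... run the test body ...
      } finally {
        previous match {
          case Some(value) => spark.conf.set(key, value)  // restore the old value
          case None        => spark.conf.unset(key)       // do NOT set it to "<undefined>"
        }
      }
      ```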
      
      ## How was this patch tested?
      
      Manual.
      ```
      build/sbt "project sql" test
      git status
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #16522 from dongjoon-hyun/SPARK-19137.
      
      (cherry picked from commit d5b1dc93)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      69d1c4c5
    • [SPARK-16845][SQL] `GeneratedClass$SpecificOrdering` grows beyond 64 KB · 65c866ef
      Liwei Lin authored
      
      ## What changes were proposed in this pull request?
      
      Prior to this patch, we generate `compare(...)` for `GeneratedClass$SpecificOrdering` as shown below, leading to Janino exceptions saying the generated code grows beyond 64 KB.
      
      ``` scala
      /* 005 */ class SpecificOrdering extends o.a.s.sql.catalyst.expressions.codegen.BaseOrdering {
      /* ..... */   ...
      /* 10969 */   private int compare(InternalRow a, InternalRow b) {
      /* 10970 */     InternalRow i = null;  // Holds current row being evaluated.
      /* 10971 */
      /* 1.... */     code for comparing field0
      /* 1.... */     code for comparing field1
      /* 1.... */     ...
      /* 1.... */     code for comparing field449
      /* 15012 */
      /* 15013 */     return 0;
      /* 15014 */   }
      /* 15015 */ }
      ```
      
      This patch breaks `compare(...)` into smaller `compare_xxx(...)` methods when necessary; the generated `compare(...)` then looks like:
      
      ``` scala
      /* 001 */ public SpecificOrdering generate(Object[] references) {
      /* 002 */   return new SpecificOrdering(references);
      /* 003 */ }
      /* 004 */
      /* 005 */ class SpecificOrdering extends o.a.s.sql.catalyst.expressions.codegen.BaseOrdering {
      /* 006 */
      /* 007 */     ...
      /* 1.... */
      /* 11290 */   private int compare_0(InternalRow a, InternalRow b) {
      /* 11291 */     InternalRow i = null;  // Holds current row being evaluated.
      /* 11292 */
      /* 11293 */     i = a;
      /* 11294 */     boolean isNullA;
      /* 11295 */     UTF8String primitiveA;
      /* 11296 */     {
      /* 11297 */
      /* 11298 */       Object obj = ((Expression) references[0]).eval(null);
      /* 11299 */       UTF8String value = (UTF8String) obj;
      /* 11300 */       isNullA = false;
      /* 11301 */       primitiveA = value;
      /* 11302 */     }
      /* 11303 */     i = b;
      /* 11304 */     boolean isNullB;
      /* 11305 */     UTF8String primitiveB;
      /* 11306 */     {
      /* 11307 */
      /* 11308 */       Object obj = ((Expression) references[0]).eval(null);
      /* 11309 */       UTF8String value = (UTF8String) obj;
      /* 11310 */       isNullB = false;
      /* 11311 */       primitiveB = value;
      /* 11312 */     }
      /* 11313 */     if (isNullA && isNullB) {
      /* 11314 */       // Nothing
      /* 11315 */     } else if (isNullA) {
      /* 11316 */       return -1;
      /* 11317 */     } else if (isNullB) {
      /* 11318 */       return 1;
      /* 11319 */     } else {
      /* 11320 */       int comp = primitiveA.compare(primitiveB);
      /* 11321 */       if (comp != 0) {
      /* 11322 */         return comp;
      /* 11323 */       }
      /* 11324 */     }
      /* 11325 */
      /* 11326 */
      /* 11327 */     i = a;
      /* 11328 */     boolean isNullA1;
      /* 11329 */     UTF8String primitiveA1;
      /* 11330 */     {
      /* 11331 */
      /* 11332 */       Object obj1 = ((Expression) references[1]).eval(null);
      /* 11333 */       UTF8String value1 = (UTF8String) obj1;
      /* 11334 */       isNullA1 = false;
      /* 11335 */       primitiveA1 = value1;
      /* 11336 */     }
      /* 11337 */     i = b;
      /* 11338 */     boolean isNullB1;
      /* 11339 */     UTF8String primitiveB1;
      /* 11340 */     {
      /* 11341 */
      /* 11342 */       Object obj1 = ((Expression) references[1]).eval(null);
      /* 11343 */       UTF8String value1 = (UTF8String) obj1;
      /* 11344 */       isNullB1 = false;
      /* 11345 */       primitiveB1 = value1;
      /* 11346 */     }
      /* 11347 */     if (isNullA1 && isNullB1) {
      /* 11348 */       // Nothing
      /* 11349 */     } else if (isNullA1) {
      /* 11350 */       return -1;
      /* 11351 */     } else if (isNullB1) {
      /* 11352 */       return 1;
      /* 11353 */     } else {
      /* 11354 */       int comp = primitiveA1.compare(primitiveB1);
      /* 11355 */       if (comp != 0) {
      /* 11356 */         return comp;
      /* 11357 */       }
      /* 11358 */     }
      /* 1.... */
      /* 1.... */   ...
      /* 1.... */
      /* 12652 */     return 0;
      /* 12653 */   }
      /* 1.... */
      /* 1.... */   ...
      /* 15387 */
      /* 15388 */   public int compare(InternalRow a, InternalRow b) {
      /* 15389 */
      /* 15390 */     int comp_0 = compare_0(a, b);
      /* 15391 */     if (comp_0 != 0) {
      /* 15392 */       return comp_0;
      /* 15393 */     }
      /* 15394 */
      /* 15395 */     int comp_1 = compare_1(a, b);
      /* 15396 */     if (comp_1 != 0) {
      /* 15397 */       return comp_1;
      /* 15398 */     }
      /* 1.... */
      /* 1.... */     ...
      /* 1.... */
      /* 15450 */     return 0;
      /* 15451 */   }
      /* 15452 */ }
      ```
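      
      Stripped of the codegen details, the splitting strategy amounts to chaining many small comparison methods and stopping at the first non-zero result; a rough Scala sketch of that idea (chunk size and names are arbitrary):
      
      ```scala
      // Illustrative only: group per-field comparators so no single method grows
      // too large, then chain the groups, mirroring compare_0, compare_1, ...
      def chunkedCompare[T](comparators: Seq[(T, T) => Int], chunkSize: Int)(a: T, b: T): Int = {
        val groups: Seq[(T, T) => Int] = comparators.grouped(chunkSize).toSeq.map { group =>
          (x: T, y: T) => group.view.map(cmp => cmp(x, y)).find(_ != 0).getOrElse(0)
        }
        groups.view.map(g => g(a, b)).find(_ != 0).getOrElse(0)
      }
      ```
      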
      ## How was this patch tested?
      - a newly added test case which
        - would fail prior to this patch
        - would pass with this patch
      - ordering correctness should already be covered by existing tests like those in `OrderingSuite`
      
      ## Acknowledgement
      
      A major part of this PR - the refactoring work of `splitExpression()` - has been done by ueshin.
      
      Author: Liwei Lin <lwlin7@gmail.com>
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      Author: Takuya Ueshin <ueshin@happy-camper.st>
      
      Closes #15480 from lw-lin/spec-ordering-64k-.
      
      (cherry picked from commit acfc5f35)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      65c866ef
  8. Jan 04, 2017
    • [SPARK-18877][SQL][BACKPORT-2.1] CSVInferSchema.inferField` on DecimalType... · 1ecf1a95
      Dongjoon Hyun authored
      [SPARK-18877][SQL][BACKPORT-2.1] `CSVInferSchema.inferField` on DecimalType should find a common type with `typeSoFar`
      
      ## What changes were proposed in this pull request?
      
      CSV type inference causes `IllegalArgumentException` on decimal numbers with heterogeneous precisions and scales because the current logic uses the last decimal type in a **partition**. Specifically, `inferRowType`, the **seqOp** of **aggregate**, returns the last decimal type. This PR fixes it to use `findTightestCommonType`.
      
      **decimal.csv**
      ```
      9.03E+12
      1.19E+11
      ```
      
      **BEFORE**
      ```scala
      scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema
      root
       |-- _c0: decimal(3,-9) (nullable = true)
      
      scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show
      16/12/16 14:32:49 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4)
      java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 exceeds max precision 3
      ```
      
      **AFTER**
      ```scala
      scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema
      root
       |-- _c0: decimal(4,-9) (nullable = true)
      
      scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show
      +---------+
      |      _c0|
      +---------+
      |9.030E+12|
      | 1.19E+11|
      +---------+
      ```
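      
      Conceptually, the common type of two decimals must keep enough digits on both sides of the decimal point for either value; a rough sketch of that rule, not the actual `findTightestCommonType` code (real inference also has to respect `DecimalType.MAX_PRECISION` and non-decimal candidates):
      
      ```scala
      import org.apache.spark.sql.types.DecimalType
      
      // Illustrative: widen two decimal types so that values of either type fit.
      def widerDecimal(a: DecimalType, b: DecimalType): DecimalType = {
        val scale = math.max(a.scale, b.scale)
        val integralDigits = math.max(a.precision - a.scale, b.precision - b.scale)
        DecimalType(integralDigits + scale, scale)
      }
      ```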
      
      ## How was this patch tested?
      
      Pass the newly added test case.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #16463 from dongjoon-hyun/SPARK-18877-BACKPORT-21.
      1ecf1a95
  9. Jan 03, 2017
    • [SPARK-19048][SQL] Delete Partition Location when Dropping Managed Partitioned... · 77625506
      gatorsmile authored
      [SPARK-19048][SQL] Delete Partition Location when Dropping Managed Partitioned Tables in InMemoryCatalog
      
      ### What changes were proposed in this pull request?
      The data in a managed table should be deleted after the table is dropped. However, if the partition location is not under the location of the partitioned table, it is not deleted as expected. Users can specify any location for a partition when adding it.
      
      This PR deletes the partition location when dropping managed partitioned tables stored in `InMemoryCatalog`.
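      
      The scenario in question looks roughly like this; the table name and the external path are illustrative:
      
      ```scala
      spark.sql("CREATE TABLE managed_tab (value STRING, p INT) USING parquet PARTITIONED BY (p)")
      spark.sql("ALTER TABLE managed_tab ADD PARTITION (p = 1) LOCATION '/tmp/elsewhere/p1'")
      
      // With this fix, dropping the managed table in InMemoryCatalog also removes
      // /tmp/elsewhere/p1, even though it is outside the table's own location.
      spark.sql("DROP TABLE managed_tab")
      ```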
      
      ### How was this patch tested?
      Added test cases for both HiveExternalCatalog and InMemoryCatalog
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16448 from gatorsmile/unsetSerdeProp.
      
      (cherry picked from commit b67b35f7)
      Signed-off-by: gatorsmile <gatorsmile@gmail.com>
      77625506
  10. Dec 22, 2016
    • [SPARK-18985][SS] Add missing @InterfaceStability.Evolving for Structured Streaming APIs · 5e801034
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      Add missing InterfaceStability.Evolving for Structured Streaming APIs
      
      ## How was this patch tested?
      
      Compiling the code.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16385 from zsxwing/SPARK-18985.
      
      (cherry picked from commit 2246ce88)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      5e801034
    • [SPARK-17807][CORE] split test-tags into test-JAR · 132f2297
      Ryan Williams authored
      
      Remove spark-tags' compile-scope dependency (and, indirectly, spark-core's compile-scope transitive dependency) on scalatest by splitting the test-oriented tags into spark-tags' test JAR.
      
      Alternative to #16303.
      
      Author: Ryan Williams <ryan.blake.williams@gmail.com>
      
      Closes #16311 from ryan-williams/tt.
      
      (cherry picked from commit afd9bc1d)
      Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      132f2297
    • [SPARK-18973][SQL] Remove SortPartitions and RedistributeData · f6853b3e
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      The SortPartitions and RedistributeData logical operators are not actually used and can be removed. Note that we do have a Sort operator (with the global flag set to false) that subsumes SortPartitions.
      
      ## How was this patch tested?
      Also updated test cases to reflect the removal.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #16381 from rxin/SPARK-18973.
      
      (cherry picked from commit 26151000)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      f6853b3e
    • [DOC] bucketing is applicable to all file-based data sources · ec0d6e21
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      Starting with Spark 2.1.0, the bucketing feature is available for all file-based data sources. This patch fixes some function docs that haven't yet been updated to reflect that.
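      
      For example, bucketing now applies to any file-based data source, not only Parquet; the table and column names here are illustrative:
      
      ```scala
      val df = spark.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "name")
      
      // bucketBy/sortBy work with json, csv, and other file-based formats as well.
      df.write
        .format("json")
        .bucketBy(4, "id")
        .sortBy("id")
        .saveAsTable("bucketed_json_tab")
      ```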
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #16349 from rxin/ds-doc.
      
      (cherry picked from commit 2e861df9)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      ec0d6e21
    • [SQL] Minor readability improvement for partition handling code · def3690f
      Reynold Xin authored
      
      This patch includes minor changes to improve the readability of the partition handling code. I'm in the middle of implementing a new feature and found some of the naming and implicit type inference unintuitive.
      
      This patch should have no semantic change and the changes should be covered by existing test cases.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #16378 from rxin/minor-fix.
      
      (cherry picked from commit 7c5b7b3a)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      def3690f
    • [SPARK-18908][SS] Creating StreamingQueryException should check if logicalPlan is created · 07e2a17d
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      This PR audits places using `logicalPlan` in StreamExecution and ensures they all handles the case that `logicalPlan` cannot be created.
      
      In addition, this PR also fixes the following issues in `StreamingQueryException`:
      - `StreamingQueryException` and `StreamExecution` are cyclically dependent: the `StreamingQueryException` constructor calls `StreamExecution`'s `toDebugString`, which in turn uses the `StreamingQueryException`, so the error message ends up containing a `null` value.
      - The stack trace is duplicated when calling `Throwable.printStackTrace`, because `StreamingQueryException.toString` contains the stack trace.
      
      ## How was this patch tested?
      
      The updated `test("max files per trigger - incorrect values")`. I found this issue when I switched from `testStream` to the real codes to verify the failure in this test.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16322 from zsxwing/SPARK-18907.
      
      (cherry picked from commit ff7d82a2)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      07e2a17d
  11. Dec 21, 2016
    • [SPARK-18528][SQL] Fix a bug to initialise an iterator of aggregation buffer · 021952d5
      Takeshi YAMAMURO authored
      ## What changes were proposed in this pull request?
      This PR fixes a `NullPointerException` caused by the following `limit + aggregate` query:
      ```
      scala> val df = Seq(("a", 1), ("b", 2), ("c", 1), ("d", 5)).toDF("id", "value")
      scala> df.limit(2).groupBy("id").count().show
      WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 8204, lvsp20hdn012.stubprod.com): java.lang.NullPointerException
      at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
      at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
      ```
      The root cause is that [`$doAgg()`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L596) skips the initialization of [the buffer iterator](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L603): `BaseLimitExec` sets `stopEarly=true`, and `$doAgg()` exits in the middle without doing the initialization.
      
      ## How was this patch tested?
      Added a test to check if no exception happens for limit + aggregates in `DataFrameAggregateSuite.scala`.
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #15980 from maropu/SPARK-18528.
      
      (cherry picked from commit b41ec997)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      021952d5
    • [SPARK-18234][SS] Made update mode public · 60e02a17
      Tathagata Das authored
      
      ## What changes were proposed in this pull request?
      
      Made update mode public. As part of that, here are the changes (a usage sketch follows the list):
      - Updated DataStreamWriter to accept "update"
      - Changed the package of InternalOutputModes from o.a.s.sql to o.a.s.sql.catalyst
      - Added watermark-based state removal in update mode to StateStoreSaveExec
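      
      Putting the pieces together, an aggregation in update mode with watermark-based state cleanup looks roughly like this; the schema, source path, and console sink are illustrative:
      
      ```scala
      import org.apache.spark.sql.functions.{col, window}
      import org.apache.spark.sql.types.{StringType, StructType, TimestampType}
      
      val schema = new StructType()
        .add("eventTime", TimestampType)
        .add("word", StringType)
      
      val events = spark.readStream.schema(schema).json("/tmp/events")
      
      val counts = events
        .withWatermark("eventTime", "10 minutes")
        .groupBy(window(col("eventTime"), "5 minutes"), col("word"))
        .count()
      
      val query = counts.writeStream
        .outputMode("update")  // emit only updated counts; drop state older than the watermark
        .format("console")
        .start()
      ```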
      
      ## How was this patch tested?
      
      Added new tests in changed modules
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #16360 from tdas/SPARK-18234.
      
      (cherry picked from commit 83a6ace0)
      Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
      60e02a17
    • [SPARK-18949][SQL][BACKPORT-2.1] Add recoverPartitions API to Catalog · 0e51bb08
      gatorsmile authored
      ### What changes were proposed in this pull request?
      
      This PR is to backport https://github.com/apache/spark/pull/16356 to Spark 2.1.1 branch.
      
      ----
      
      Currently, we only have a SQL interface for recovering all the partitions in a table's directory and updating the catalog: `MSCK REPAIR TABLE` or `ALTER TABLE table RECOVER PARTITIONS`. (Actually, `MSCK` is very hard to remember, and I have no clue what it means.)
      
      After the new "Scalable Partition Handling", table repair becomes much more important for making the data in a newly created partitioned data source table visible.
      
      Thus, this PR adds it to the Catalog interface. After this PR, users can repair a table by
      ```Scala
      spark.catalog.recoverPartitions("testTable")
      ```
      
      ### How was this patch tested?
      Modified the existing test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16372 from gatorsmile/repairTable2.1.1.
      0e51bb08