  1. Jan 15, 2017
    • [SPARK-19120] Refresh Metadata Cache After Loading Hive Tables · db37049d
      gatorsmile authored
      
      ```Scala
              sql("CREATE TABLE tab (a STRING) STORED AS PARQUET")
      
              // This table fetch is to fill the cache with zero leaf files
              spark.table("tab").show()
      
              sql(
                s"""
                   |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
                   |INTO TABLE tab
                 """.stripMargin)
      
              spark.table("tab").show()
      ```
      
      In the above example, the query returns an empty result after the table is loaded. The metadata cache can be out of date after loading new data into the table, because loading/inserting does not update the cache. So far, the metadata cache is only used for data source tables. Thus, among Hive serde tables, only the `parquet` and `orc` formats face this issue, because Hive serde tables in parquet/orc format can be converted to data source tables when `spark.sql.hive.convertMetastoreParquet`/`spark.sql.hive.convertMetastoreOrc` is on.
      
      This PR refreshes the metadata cache after processing the `LOAD DATA` command.
      
      In addition, Spark SQL does not convert **partitioned** Hive tables (orc/parquet) to data source tables on the write path, but the read path uses the metadata cache for both **partitioned** and non-partitioned Hive tables (orc/parquet). That means writing to partitioned parquet/orc tables still uses `InsertIntoHiveTable` instead of `InsertIntoHadoopFsRelationCommand`. To avoid reading an out-of-date cache, `InsertIntoHiveTable` needs to refresh the metadata cache for partitioned tables. Note that the cache does not need to be refreshed for non-partitioned parquet/orc tables, because `InsertIntoHiveTable` is not called for them at all. Based on the review comments, this PR keeps the existing logic unchanged: the table is always refreshed, whether or not it is partitioned.
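
      As a rough sketch of the idea (not this PR's exact code; the path is hypothetical), the fix amounts to invalidating the cached relation once the load completes, so the next read re-lists the table's files:

      ```scala
      // Sketch only: refresh the cached metadata after LOAD DATA so the next
      // read re-lists the table's files instead of reusing the stale snapshot.
      sql("LOAD DATA LOCAL INPATH '/path/to/new/data' OVERWRITE INTO TABLE tab")
      spark.catalog.refreshTable("tab")  // roughly what LOAD DATA now does internally
      spark.table("tab").show()          // sees the newly loaded rows
      ```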
      
      Added test cases in `parquetSuites.scala`.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16500 from gatorsmile/refreshInsertIntoHiveTable.
      
      (cherry picked from commit de62ddf7)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  2. Jan 13, 2017
    • [SPARK-19180] [SQL] the offset of short should be 2 in OffHeapColumn · 5e9be1e1
      Yucai Yu authored
      
      ## What changes were proposed in this pull request?
      
      The offset multiplier for a short is 4 in `OffHeapColumnVector`'s `putShorts`, but it should be 2, since a short occupies 2 bytes.
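
      For illustration (this is not Spark's code), element `i` of a short array lives at byte offset `i * 2`, so a stride of 4 would address the wrong bytes:

      ```scala
      import java.nio.ByteBuffer

      // Illustration only: shorts are 16-bit, so element i sits at byte offset
      // i * 2. A stride of i * 4 (the bug) would address the wrong bytes for
      // every element after the first when bulk-copying into off-heap memory.
      val buf = ByteBuffer.allocateDirect(3 * 2)
      Seq[Short](10, 20, 30).zipWithIndex.foreach { case (v, i) =>
        buf.putShort(i * 2, v)  // correct 2-byte stride
      }
      assert(buf.getShort(1 * 2) == 20)
      ```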
      
      ## How was this patch tested?
      
      unit test
      
      Author: Yucai Yu <yucai.yu@intel.com>
      
      Closes #16555 from yucai/offheap_short.
      
      (cherry picked from commit ad0dadaa)
      Signed-off-by: Davies Liu <davies.liu@gmail.com>
    • [SPARK-18335][SPARKR] createDataFrame to support numPartitions parameter · ee3642f5
      Felix Cheung authored
      
      ## What changes were proposed in this pull request?
      
      To allow specifying the number of partitions when a DataFrame is created.
      
      ## How was this patch tested?
      
      manual, unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16512 from felixcheung/rnumpart.
      
      (cherry picked from commit b0e8eb6d)
      Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    • [SPARK-19178][SQL] convert string of large numbers to int should return null · 2c2ca894
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
      When we convert a string to an integral type, we first convert the string to `decimal(20, 0)`, so that a string in decimal format can be turned into a truncated integral value, e.g. `CAST('1.2' AS int)` returns `1`.

      However, this causes problems when converting a string containing a large number, e.g. `CAST('1234567890123' AS int)` returns `1912276171`, while Hive returns null as expected.

      This is a long-standing bug (it seems to have been there since the first day Spark SQL was created). This PR fixes it by adding native support for converting a `UTF8String` to an integral type.
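
      The behavior described above, as a runnable spark-shell sketch of the examples in this description:

      ```scala
      spark.sql("SELECT CAST('1.2' AS int)").show()            // 1: decimal strings truncate
      spark.sql("SELECT CAST('1234567890123' AS int)").show()  // null after this fix (overflows int)
      ```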
      
      ## How was this patch tested?
      
      new regression tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16550 from cloud-fan/string-to-int.
      
      (cherry picked from commit 6b34e745)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    • [SPARK-18687][PYSPARK][SQL] Backward compatibility - creating a Dataframe on a new SQLContext object fails with a Derby error · b2c9a2c8
      Vinayak authored
      
      This change makes SQLContext reuse the active SparkSession during construction if the SparkContext supplied is the same as the currently active one. Without this change, a new SparkSession is instantiated, which results in a Derby error when attempting to create a DataFrame using a new SQLContext object, even though the SparkContext supplied to the new SQLContext is the same as the currently active one. See https://issues.apache.org/jira/browse/SPARK-18687 for details on the error and a repro.
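
      A Scala sketch of the intended behavior (the actual change is in PySpark's `SQLContext` constructor; `sessionFor` is a hypothetical helper):

      ```scala
      import org.apache.spark.SparkContext
      import org.apache.spark.sql.SparkSession

      // Hypothetical helper illustrating the rule: if the supplied SparkContext
      // is the active one, reuse the active SparkSession rather than creating a
      // second session (which is what triggered the Derby metastore error).
      def sessionFor(sc: SparkContext): SparkSession =
        SparkSession.getActiveSession
          .filter(_.sparkContext eq sc)  // same context => reuse
          .getOrElse(SparkSession.builder().config(sc.getConf).getOrCreate())
      ```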
      
      Existing unit tests and a new unit test added to pyspark-sql:
      
      /python/run-tests --python-executables=python --modules=pyspark-sql
      
      Author: Vinayak <vijoshi5@in.ibm.com>
      Author: Vinayak Joshi <vijoshi@users.noreply.github.com>
      
      Closes #16119 from vijoshi/SPARK-18687_master.
      
      (cherry picked from commit 285a7798)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    • Fix missing close-parens for In filter's toString · 0668e061
      Andrew Ash authored
      
      Otherwise the open parenthesis isn't closed in query plan descriptions of batch scans:
      
          PushedFilters: [In(COL_A, [1,2,4,6,10,16,219,815], IsNotNull(COL_B), ...
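
      A sketch of the fix (simplified from the data source `In` filter; the real class extends `Filter`):

      ```scala
      // Simplified sketch: add the missing ')' at the end of In.toString.
      case class In(attribute: String, values: Array[Any]) {
        override def toString: String =
          s"In($attribute, [${values.mkString(",")}])"  // closing paren was missing
      }
      ```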
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #16558 from ash211/patch-9.
      
      (cherry picked from commit b040cef2)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
  3. Jan 10, 2017
    • [SPARK-19133][SPARKR][ML][BACKPORT-2.1] fix glm for Gamma, clarify glm family supported · 1022049c
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      backporting to 2.1, 2.0 and 1.6
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16532 from felixcheung/rgammabackport.
    • [SPARK-19140][SS] Allow update mode for non-aggregation streaming queries · 230607d6
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      This PR allows update mode for non-aggregation streaming queries. Update mode behaves the same as append mode when a query has no aggregations.
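
      A minimal sketch of a query this enables, assuming a local socket source:

      ```scala
      import org.apache.spark.sql.streaming.OutputMode

      // Sketch: a map-only (no-aggregation) query may now use Update mode;
      // it emits the same rows Append mode would.
      val lines = spark.readStream
        .format("socket").option("host", "localhost").option("port", 9999)
        .load()

      val query = lines.writeStream
        .outputMode(OutputMode.Update())  // previously rejected without aggregation
        .format("console")
        .start()
      ```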
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16520 from zsxwing/update-without-agg.
      
      (cherry picked from commit bc6c56e9)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    • [SPARK-18997][CORE] Recommended upgrade libthrift to 0.9.3 · 81c94309
      Sean Owen authored
      
      ## What changes were proposed in this pull request?
      
      Updates libthrift to 0.9.3 to address a CVE.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16530 from srowen/SPARK-18997.
      
      (cherry picked from commit 856bae6a)
      Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    • [SPARK-19113][SS][TESTS] Set UncaughtExceptionHandler in onQueryStarted to ensure catching fatal errors during query initialization · e0af4b72
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      StreamTest currently sets the `UncaughtExceptionHandler` after starting the query, so it may not catch fatal errors thrown during query initialization. This PR uses the `onQueryStarted` callback to fix it.
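
      A sketch of the approach (not StreamTest's exact code): `onQueryStarted` is delivered before the stream starts processing, so a handler installed there can still observe fatal initialization errors:

      ```scala
      import org.apache.spark.sql.streaming.StreamingQueryListener
      import org.apache.spark.sql.streaming.StreamingQueryListener._

      // Sketch only: install the handler from onQueryStarted so fatal errors
      // thrown during query initialization are still caught.
      spark.streams.addListener(new StreamingQueryListener {
        override def onQueryStarted(event: QueryStartedEvent): Unit =
          Thread.currentThread().setUncaughtExceptionHandler(
            new Thread.UncaughtExceptionHandler {
              def uncaughtException(t: Thread, e: Throwable): Unit =
                System.err.println(s"fatal error in ${t.getName}: $e")
            })
        override def onQueryProgress(event: QueryProgressEvent): Unit = ()
        override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
      })
      ```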
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16492 from zsxwing/SPARK-19113.
    • [SPARK-19137][SQL] Fix `withSQLConf` to reset `OptionalConfigEntry` correctly · 69d1c4c5
      Dongjoon Hyun authored
      
      ## What changes were proposed in this pull request?
      
      `DataStreamReaderWriterSuite` creates test files in the source folder, as shown below. Interestingly, the root cause is that `withSQLConf` fails to reset an `OptionalConfigEntry` correctly. In other words, it resets the config to `Some(<undefined>)`.
      
      ```bash
      $ git status
      Untracked files:
        (use "git add <file>..." to include in what will be committed)
      
              sql/core/%253Cundefined%253E/
              sql/core/%3Cundefined%3E/
      ```
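
      A minimal sketch of the restore semantics, assuming a simplified `withSQLConf` (not the actual `SQLTestUtils` code): a key that was unset before the block must be unset again afterwards, not set to a stringified `<undefined>` default:

      ```scala
      // Simplified sketch of withSQLConf's save/restore logic. The bug: an
      // OptionalConfigEntry with no value was "restored" by writing back its
      // stringified default ("<undefined>"); the fix is to unset the key.
      def withSQLConf[T](pairs: (String, String)*)(f: => T): T = {
        val conf = spark.conf
        val saved = pairs.map { case (k, _) => k -> conf.getOption(k) }
        pairs.foreach { case (k, v) => conf.set(k, v) }
        try f finally saved.foreach {
          case (k, Some(old)) => conf.set(k, old)
          case (k, None)      => conf.unset(k)  // not conf.set(k, "<undefined>")
        }
      }
      ```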
      
      ## How was this patch tested?
      
      Manual.
      ```
      build/sbt "project sql" test
      git status
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #16522 from dongjoon-hyun/SPARK-19137.
      
      (cherry picked from commit d5b1dc93)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
    • [SPARK-16845][SQL] `GeneratedClass$SpecificOrdering` grows beyond 64 KB · 65c866ef
      Liwei Lin authored
      
      ## What changes were proposed in this pull request?
      
      Prior to this patch, we generate `compare(...)` for `GeneratedClass$SpecificOrdering` as shown below, leading to Janino exceptions saying the code grows beyond 64 KB.
      
      ``` scala
      /* 005 */ class SpecificOrdering extends o.a.s.sql.catalyst.expressions.codegen.BaseOrdering {
      /* ..... */   ...
      /* 10969 */   private int compare(InternalRow a, InternalRow b) {
      /* 10970 */     InternalRow i = null;  // Holds current row being evaluated.
      /* 10971 */
      /* 1.... */     code for comparing field0
      /* 1.... */     code for comparing field1
      /* 1.... */     ...
      /* 1.... */     code for comparing field449
      /* 15012 */
      /* 15013 */     return 0;
      /* 15014 */   }
      /* 15015 */ }
      ```
      
      This patch breaks `compare(...)` into smaller `compare_xxx(...)` methods when necessary; the generated `compare(...)` then looks like:
      
      ``` scala
      /* 001 */ public SpecificOrdering generate(Object[] references) {
      /* 002 */   return new SpecificOrdering(references);
      /* 003 */ }
      /* 004 */
      /* 005 */ class SpecificOrdering extends o.a.s.sql.catalyst.expressions.codegen.BaseOrdering {
      /* 006 */
      /* 007 */     ...
      /* 1.... */
      /* 11290 */   private int compare_0(InternalRow a, InternalRow b) {
      /* 11291 */     InternalRow i = null;  // Holds current row being evaluated.
      /* 11292 */
      /* 11293 */     i = a;
      /* 11294 */     boolean isNullA;
      /* 11295 */     UTF8String primitiveA;
      /* 11296 */     {
      /* 11297 */
      /* 11298 */       Object obj = ((Expression) references[0]).eval(null);
      /* 11299 */       UTF8String value = (UTF8String) obj;
      /* 11300 */       isNullA = false;
      /* 11301 */       primitiveA = value;
      /* 11302 */     }
      /* 11303 */     i = b;
      /* 11304 */     boolean isNullB;
      /* 11305 */     UTF8String primitiveB;
      /* 11306 */     {
      /* 11307 */
      /* 11308 */       Object obj = ((Expression) references[0]).eval(null);
      /* 11309 */       UTF8String value = (UTF8String) obj;
      /* 11310 */       isNullB = false;
      /* 11311 */       primitiveB = value;
      /* 11312 */     }
      /* 11313 */     if (isNullA && isNullB) {
      /* 11314 */       // Nothing
      /* 11315 */     } else if (isNullA) {
      /* 11316 */       return -1;
      /* 11317 */     } else if (isNullB) {
      /* 11318 */       return 1;
      /* 11319 */     } else {
      /* 11320 */       int comp = primitiveA.compare(primitiveB);
      /* 11321 */       if (comp != 0) {
      /* 11322 */         return comp;
      /* 11323 */       }
      /* 11324 */     }
      /* 11325 */
      /* 11326 */
      /* 11327 */     i = a;
      /* 11328 */     boolean isNullA1;
      /* 11329 */     UTF8String primitiveA1;
      /* 11330 */     {
      /* 11331 */
      /* 11332 */       Object obj1 = ((Expression) references[1]).eval(null);
      /* 11333 */       UTF8String value1 = (UTF8String) obj1;
      /* 11334 */       isNullA1 = false;
      /* 11335 */       primitiveA1 = value1;
      /* 11336 */     }
      /* 11337 */     i = b;
      /* 11338 */     boolean isNullB1;
      /* 11339 */     UTF8String primitiveB1;
      /* 11340 */     {
      /* 11341 */
      /* 11342 */       Object obj1 = ((Expression) references[1]).eval(null);
      /* 11343 */       UTF8String value1 = (UTF8String) obj1;
      /* 11344 */       isNullB1 = false;
      /* 11345 */       primitiveB1 = value1;
      /* 11346 */     }
      /* 11347 */     if (isNullA1 && isNullB1) {
      /* 11348 */       // Nothing
      /* 11349 */     } else if (isNullA1) {
      /* 11350 */       return -1;
      /* 11351 */     } else if (isNullB1) {
      /* 11352 */       return 1;
      /* 11353 */     } else {
      /* 11354 */       int comp = primitiveA1.compare(primitiveB1);
      /* 11355 */       if (comp != 0) {
      /* 11356 */         return comp;
      /* 11357 */       }
      /* 11358 */     }
      /* 1.... */
      /* 1.... */   ...
      /* 1.... */
      /* 12652 */     return 0;
      /* 12653 */   }
      /* 1.... */
      /* 1.... */   ...
      /* 15387 */
      /* 15388 */   public int compare(InternalRow a, InternalRow b) {
      /* 15389 */
      /* 15390 */     int comp_0 = compare_0(a, b);
      /* 15391 */     if (comp_0 != 0) {
      /* 15392 */       return comp_0;
      /* 15393 */     }
      /* 15394 */
      /* 15395 */     int comp_1 = compare_1(a, b);
      /* 15396 */     if (comp_1 != 0) {
      /* 15397 */       return comp_1;
      /* 15398 */     }
      /* 1.... */
      /* 1.... */     ...
      /* 1.... */
      /* 15450 */     return 0;
      /* 15451 */   }
      /* 15452 */ }
      ```
      ## How was this patch tested?
      - a new added test case which
        - would fail prior to this patch
        - would pass with this patch
      - ordering correctness should already be covered by existing tests like those in `OrderingSuite`
      
      ## Acknowledgement
      
      A major part of this PR, the refactoring of `splitExpression()`, was done by ueshin.
      
      Author: Liwei Lin <lwlin7@gmail.com>
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      Author: Takuya Ueshin <ueshin@happy-camper.st>
      
      Closes #15480 from lw-lin/spec-ordering-64k-.
      
      (cherry picked from commit acfc5f35)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  4. Jan 04, 2017
    • [SPARK-18877][SQL][BACKPORT-2.1] `CSVInferSchema.inferField` on DecimalType should find a common type with `typeSoFar` · 1ecf1a95
      Dongjoon Hyun authored
      
      ## What changes were proposed in this pull request?
      
      CSV type inference causes an `IllegalArgumentException` on decimal numbers with heterogeneous precisions and scales, because the current logic uses the last decimal type seen in a **partition**. Specifically, `inferRowType`, the **seqOp** of **aggregate**, returns the last decimal type. This PR fixes it to use `findTightestCommonType`.
      
      **decimal.csv**
      ```
      9.03E+12
      1.19E+11
      ```
      
      **BEFORE**
      ```scala
      scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema
      root
       |-- _c0: decimal(3,-9) (nullable = true)
      
      scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show
      16/12/16 14:32:49 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4)
      java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 exceeds max precision 3
      ```
      
      **AFTER**
      ```scala
      scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema
      root
       |-- _c0: decimal(4,-9) (nullable = true)
      
      scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show
      +---------+
      |      _c0|
      +---------+
      |9.030E+12|
      | 1.19E+11|
      +---------+
      ```
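
      A hedged sketch of the widening rule (my reading of what `findTightestCommonType` must do for decimals, not Spark's exact code): keep enough integer digits and enough scale to hold both inputs. The two values above infer as decimal(3,-10) and decimal(3,-9), which widen to decimal(4,-9), matching the AFTER schema:

      ```scala
      import org.apache.spark.sql.types.DecimalType

      // Sketch: widen two decimal types so that either value fits.
      def widerDecimal(a: DecimalType, b: DecimalType): DecimalType = {
        val scale = math.max(a.scale, b.scale)                                 // finer resolution
        val intDigits = math.max(a.precision - a.scale, b.precision - b.scale) // digit range
        DecimalType(intDigits + scale, scale)
      }

      // widerDecimal(DecimalType(3, -10), DecimalType(3, -9)) == DecimalType(4, -9)
      ```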
      
      ## How was this patch tested?
      
      Passes the newly added test case.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #16463 from dongjoon-hyun/SPARK-18877-BACKPORT-21.
  5. Jan 03, 2017
    • [SPARK-19048][SQL] Delete Partition Location when Dropping Managed Partitioned Tables in InMemoryCatalog · 77625506
      gatorsmile authored
      
      ### What changes were proposed in this pull request?
      The data in a managed table should be deleted after the table is dropped. However, if a partition's location is not under the location of the partitioned table, it is not deleted as expected. Users can specify any location for a partition when adding it.
      
      This PR deletes the partition locations when dropping managed partitioned tables stored in `InMemoryCatalog`.
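
      The scenario, as a sketch (hypothetical table and path):

      ```scala
      // Managed partitioned table with a partition at a custom location.
      sql("CREATE TABLE t (a INT, p INT) USING parquet PARTITIONED BY (p)")
      sql("ALTER TABLE t ADD PARTITION (p = 1) LOCATION '/outside/table/dir'")
      sql("DROP TABLE t")  // with this fix, '/outside/table/dir' is deleted too
      ```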
      
      ### How was this patch tested?
      Added test cases for both HiveExternalCatalog and InMemoryCatalog
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16448 from gatorsmile/unsetSerdeProp.
      
      (cherry picked from commit b67b35f7)
      Signed-off-by: gatorsmile <gatorsmile@gmail.com>
  6. Dec 30, 2016
    • [SPARK-19016][SQL][DOC] Document scalable partition handling · 20ae1172
      Cheng Lian authored
      
      This PR documents the scalable partition handling feature in the body of the programming guide.
      
      Before this PR, we only mentioned it in the migration guide. It was not clear that, since 2.1, external datasource tables require an extra `MSCK REPAIR TABLE` command to have per-partition information persisted.
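
      For example (hypothetical table and path), after pointing an external datasource table at existing partitioned data, the partition metadata must be synced once:

      ```scala
      sql("""CREATE TABLE logs (msg STRING, dt STRING)
             USING parquet OPTIONS (path '/data/logs') PARTITIONED BY (dt)""")
      sql("MSCK REPAIR TABLE logs")  // persist per-partition info in the metastore
      ```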
      
      N/A. (documentation-only change)
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #16424 from liancheng/scalable-partition-handling-doc.
      
      (cherry picked from commit 871f6114)
      Signed-off-by: Cheng Lian <lian@databricks.com>
  7. Dec 29, 2016
    • [SPARK-19003][DOCS] Add Java example in Spark Streaming Guide, section Design Patterns for using foreachRDD · 47ab4afe
      adesharatushar authored
      
      ## What changes were proposed in this pull request?
      
      Added the missing Java example under the section "Design Patterns for using foreachRDD". Now this section has examples in all 3 languages, improving the consistency of the documentation.
      
      ## How was this patch tested?
      
      Manual.
      Generated docs using the command `SKIP_API=1 jekyll build` and verified the generated HTML page manually.

      The syntax of the example has been tested for correctness using sample code on Java 1.7 and Spark 2.2.0-SNAPSHOT.
      
      Author: adesharatushar <tushar_adeshara@persistent.com>
      
      Closes #16408 from adesharatushar/streaming-doc-fix.
      
      (cherry picked from commit dba81e1d)
      Signed-off-by: Sean Owen <sowen@cloudera.com>