  1. Nov 28, 2016
    • 75d73d13
    • Preparing Spark release v2.1.0-rc1 · 80aabc0b
      Patrick Wendell authored
      80aabc0b
    • [SPARK-17680][SQL][TEST] Added test cases for InMemoryRelation · b386943b
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This pull request adds test cases for the following cases:
      - keep all data types with null or without null
      - access `CachedBatch` disabling whole stage codegen
      - access only some columns in `CachedBatch`
      
      This PR is a part of https://github.com/apache/spark/pull/15219. The motivation for adding these tests: when https://github.com/apache/spark/pull/15219 is enabled, the first two cases are handled by specialized (generated) code, while the third one is a pitfall (see the sketch below).
      
      In general, even for now, it would be helpful to increase test coverage.
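
      For illustration, a minimal sketch of the third case, reading only some columns from a cached relation (the DataFrame and column names here are my own, not from the test suite):

      ```scala
      // Caching materializes the DataFrame as CachedBatch column batches; selecting a
      // subset of columns afterwards exercises the partial-column access path tested here.
      val df = spark.range(10).selectExpr("id", "id * 2 AS doubled", "CAST(id AS STRING) AS s")
      df.cache()
      df.count()                     // force the cache to be built
      df.select("doubled").collect() // reads only one column from each CachedBatch
      ```
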
      ## How was this patch tested?
      
      Added the test suites themselves.
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #15462 from kiszk/columnartestsuites.
      b386943b
    • [SPARK-16282][SQL] Implement percentile SQL function. · 81e3f971
      jiangxingbo authored
      
      ## What changes were proposed in this pull request?
      
      Implement the percentile SQL function. It computes the exact percentile(s) of `expr` at the given percentage(s) `pc`, where each percentage must be in the range [0, 1].
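
      For illustration, a hedged sketch of how the function can be invoked (table and column names are made up):

      ```scala
      // percentile(col, pc) returns the exact value of col at percentage pc; an array of
      // percentages returns an array of results.
      spark.sql("SELECT percentile(age, 0.5) FROM people")                     // median
      spark.sql("SELECT percentile(age, array(0.25, 0.5, 0.75)) FROM people")  // quartiles
      ```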
      
      ## How was this patch tested?
      
      Added a new test suite, `PercentileSuite`, to test percentile directly.
      Updated related test cases in `ExpressionToSQLSuite`.
      
      Author: jiangxingbo <jiangxb1987@gmail.com>
      Author: 蒋星博 <jiangxingbo@meituan.com>
      Author: jiangxingbo <jiangxingbo@meituan.com>
      
      Closes #14136 from jiangxb1987/percentile.
      
      (cherry picked from commit 0f5f52a3)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      81e3f971
    • [SQL][MINOR] DESC should use 'Catalog' as partition provider · 4d794785
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
      `CatalogTable` has a parameter named `tracksPartitionsInCatalog`, and in `CatalogTable.toString` we use `"Partition Provider: Catalog"` to represent it. This PR fixes `DESC TABLE` to make it consistent with `CatalogTable.toString`.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16035 from cloud-fan/minor.
      
      (cherry picked from commit 18564284)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      4d794785
    • [SPARK-18602] Set the version of org.codehaus.janino:commons-compiler to 3.0.0... · 34ad4d52
      Yin Huai authored
      [SPARK-18602] Set the version of org.codehaus.janino:commons-compiler to 3.0.0 to match the version of org.codehaus.janino:janino
      
      ## What changes were proposed in this pull request?
      org.codehaus.janino:janino depends on org.codehaus.janino:commons-compiler, and we have upgraded to org.codehaus.janino:janino 3.0.0.

      However, it seems we are still pulling in org.codehaus.janino:commons-compiler 2.7.6 because of Calcite. It looks like an accident, since we exclude janino from Calcite (see https://github.com/apache/spark/blob/branch-2.1/pom.xml#L1759). So, this PR upgrades org.codehaus.janino:commons-compiler to 3.0.0.
      
      ## How was this patch tested?
      jenkins
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #16025 from yhuai/janino-commons-compile.
      
      (cherry picked from commit eba72775)
      Signed-off-by: Yin Huai <yhuai@databricks.com>
      34ad4d52
    • [SPARK-18597][SQL] Do not push-down join conditions to the right side of a LEFT ANTI join · 32b259fa
      Herman van Hovell authored
      
      ## What changes were proposed in this pull request?
      We currently push down join conditions of a Left Anti join to both sides of the join. This is similar to Inner, Left Semi and Existence (a specialized left semi) joins. The problem is that this changes the semantics of the join; a left anti join filters out rows that match the join condition.
      
      This PR fixes this by only pushing down conditions to the left hand side of the join. This is similar to the behavior of left outer join.
      
      ## How was this patch tested?
      Added tests to `FilterPushdownSuite.scala` and created a SQLQueryTestSuite file for left anti joins with a regression test.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #16026 from hvanhovell/SPARK-18597.
      
      (cherry picked from commit 38e29824)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      32b259fa
    • [SPARK-17783][SQL] Hide Credentials in CREATE and DESC FORMATTED/EXTENDED a... · a9d4febe
      gatorsmile authored
      [SPARK-17783][SQL] Hide Credentials in CREATE and DESC FORMATTED/EXTENDED a PERSISTENT/TEMP Table for JDBC
      
      ### What changes were proposed in this pull request?
      
      We should never expose credentials in the output of the EXPLAIN and DESC FORMATTED/EXTENDED commands. However, the commands below exposed the credentials.
      
      In the related PR: https://github.com/apache/spark/pull/10452

      > URL patterns used to specify credentials seem to vary between different databases.

      Thus, we hide the whole `url` value if it contains the keyword `password`. We also hide the `password` property.
      
      Before the fix, the command outputs look like:
      
      ``` SQL
      CREATE TABLE tab1
      USING org.apache.spark.sql.jdbc
      OPTIONS (
       url 'jdbc:h2:mem:testdb0;user=testUser;password=testPass',
       dbtable 'TEST.PEOPLE',
       user 'testUser',
       password '$password')
      
      DESC FORMATTED tab1
      DESC EXTENDED tab1
      ```
      
      Before the fix,
      - The output of SQL statement EXPLAIN
      ```
      == Physical Plan ==
      ExecutedCommand
         +- CreateDataSourceTableCommand CatalogTable(
      	Table: `tab1`
      	Created: Wed Nov 16 23:00:10 PST 2016
      	Last Access: Wed Dec 31 15:59:59 PST 1969
      	Type: MANAGED
      	Provider: org.apache.spark.sql.jdbc
      	Storage(Properties: [url=jdbc:h2:mem:testdb0;user=testUser;password=testPass, dbtable=TEST.PEOPLE, user=testUser, password=testPass])), false
      ```
      
      - The output of `DESC FORMATTED`
      ```
      ...
      |Storage Desc Parameters:    |                                                                  |       |
      |  url                       |jdbc:h2:mem:testdb0;user=testUser;password=testPass               |       |
      |  dbtable                   |TEST.PEOPLE                                                       |       |
      |  user                      |testUser                                                          |       |
      |  password                  |testPass                                                          |       |
      +----------------------------+------------------------------------------------------------------+-------+
      ```
      
      - The output of `DESC EXTENDED`
      ```
      |# Detailed Table Information|CatalogTable(
      	Table: `default`.`tab1`
      	Created: Wed Nov 16 23:00:10 PST 2016
      	Last Access: Wed Dec 31 15:59:59 PST 1969
      	Type: MANAGED
      	Schema: [StructField(NAME,StringType,false), StructField(THEID,IntegerType,false)]
      	Provider: org.apache.spark.sql.jdbc
      	Storage(Location: file:/Users/xiaoli/IdeaProjects/sparkDelivery/spark-warehouse/tab1, Properties: [url=jdbc:h2:mem:testdb0;user=testUser;password=testPass, dbtable=TEST.PEOPLE, user=testUser, password=testPass]))|       |
      ```
      
      After the fix,
      - The output of SQL statement EXPLAIN
      ```
      == Physical Plan ==
      ExecutedCommand
         +- CreateDataSourceTableCommand CatalogTable(
      	Table: `tab1`
      	Created: Wed Nov 16 22:43:49 PST 2016
      	Last Access: Wed Dec 31 15:59:59 PST 1969
      	Type: MANAGED
      	Provider: org.apache.spark.sql.jdbc
      	Storage(Properties: [url=###, dbtable=TEST.PEOPLE, user=testUser, password=###])), false
      ```
      - The output of `DESC FORMATTED`
      ```
      ...
      |Storage Desc Parameters:    |                                                                  |       |
      |  url                       |###                                                               |       |
      |  dbtable                   |TEST.PEOPLE                                                       |       |
      |  user                      |testUser                                                          |       |
      |  password                  |###                                                               |       |
      +----------------------------+------------------------------------------------------------------+-------+
      ```
      
      - The output of `DESC EXTENDED`
      ```
      |# Detailed Table Information|CatalogTable(
      	Table: `default`.`tab1`
      	Created: Wed Nov 16 22:43:49 PST 2016
      	Last Access: Wed Dec 31 15:59:59 PST 1969
      	Type: MANAGED
      	Schema: [StructField(NAME,StringType,false), StructField(THEID,IntegerType,false)]
      	Provider: org.apache.spark.sql.jdbc
      	Storage(Location: file:/Users/xiaoli/IdeaProjects/sparkDelivery/spark-warehouse/tab1, Properties: [url=###, dbtable=TEST.PEOPLE, user=testUser, password=###]))|       |
      ```
      
      ### How was this patch tested?
      
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15358 from gatorsmile/maskCredentials.
      
      (cherry picked from commit 9f273c51)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      a9d4febe
    • [SPARK-18118][SQL] fix a compilation error due to nested JavaBeans · e449f754
      Herman van Hovell authored
      
      Remove this reference.
      
      (cherry picked from commit 70dfdcbb)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      e449f754
    • [SPARK-18118][SQL] fix a compilation error due to nested JavaBeans · 712bd5ab
      Kazuaki Ishizaki authored
      
      ## What changes were proposed in this pull request?
      
      This PR avoids a compilation error caused by exceeding the 64KB Java bytecode limit. The error occurs because the generated Java code for `SpecificSafeProjection.apply()` with nested JavaBeans is too big. This PR avoids the compilation error by splitting the big code chunk into multiple methods, calling `CodegenContext.splitExpression` in `InitializeJavaBean.doGenCode`.
      An object reference for the JavaBean is stored in an instance variable `javaBean...`, and the split methods then reference that instance variable.
      
      Generated code with this PR
      ````
      /* 22098 */   private void apply130_0(InternalRow i) {
      ...
      /* 22125 */     boolean isNull238 = i.isNullAt(2);
      /* 22126 */     InternalRow value238 = isNull238 ? null : (i.getStruct(2, 3));
      /* 22127 */     boolean isNull236 = false;
      /* 22128 */     test.org.apache.spark.sql.JavaDatasetSuite$Nesting1 value236 = null;
      /* 22129 */     if (!false && isNull238) {
      /* 22130 */
      /* 22131 */       final test.org.apache.spark.sql.JavaDatasetSuite$Nesting1 value239 = null;
      /* 22132 */       isNull236 = true;
      /* 22133 */       value236 = value239;
      /* 22134 */     } else {
      /* 22135 */
      /* 22136 */       final test.org.apache.spark.sql.JavaDatasetSuite$Nesting1 value241 = false ? null : new test.org.apache.spark.sql.JavaDatasetSuite$Nesting1();
      /* 22137 */       this.javaBean14 = value241;
      /* 22138 */       if (!false) {
      /* 22139 */         apply25_0(i);
      /* 22140 */         apply25_1(i);
      /* 22141 */         apply25_2(i);
      /* 22142 */       }
      /* 22143 */       isNull236 = false;
      /* 22144 */       value236 = value241;
      /* 22145 */     }
      /* 22146 */     this.javaBean.setField2(value236);
      /* 22147 */
      /* 22148 */   }
      ...
      /* 22928 */   public java.lang.Object apply(java.lang.Object _i) {
      /* 22929 */     InternalRow i = (InternalRow) _i;
      /* 22930 */
      /* 22931 */     final test.org.apache.spark.sql.JavaDatasetSuite$NestedComplicatedJavaBean value1 = false ? null : new test.org.apache.spark.sql.JavaDatasetSuite$NestedComplicatedJavaBean();
      /* 22932 */     this.javaBean = value1;
      /* 22933 */     if (!false) {
      /* 22934 */       apply130_0(i);
      /* 22935 */       apply130_1(i);
      /* 22936 */       apply130_2(i);
      /* 22937 */       apply130_3(i);
      /* 22938 */       apply130_4(i);
      /* 22939 */     }
      /* 22940 */     if (false) {
      /* 22941 */       mutableRow.setNullAt(0);
      /* 22942 */     } else {
      /* 22943 */
      /* 22944 */       mutableRow.update(0, value1);
      /* 22945 */     }
      /* 22946 */
      /* 22947 */     return mutableRow;
      /* 22948 */   }
      ````
      
      ## How was this patch tested?
      
      added a test suite into `JavaDatasetSuite.java`
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #16032 from kiszk/SPARK-18118.
      
      (cherry picked from commit f075cd9c)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      712bd5ab
    • [SPARK-18604][SQL] Make sure CollapseWindow returns the attributes in the same order. · d6e027e6
      Herman van Hovell authored
      
      ## What changes were proposed in this pull request?
      The `CollapseWindow` optimizer rule changes the order of the output attributes. This modifies the output of the plan, which the optimizer is not allowed to do. It also breaks things like `collect()`, for which we use a `RowEncoder` that assumes the output attributes of the executed plan are equal to those produced by the logical plan.
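
      For illustration, a sketch of a query shape that triggers the rule (`df` and the column names are placeholders of my own):

      ```scala
      import org.apache.spark.sql.expressions.Window
      import org.apache.spark.sql.functions.{avg, col, sum}

      // Two withColumn steps over the same window spec yield two adjacent Window operators,
      // which CollapseWindow merges into one; the merged operator must keep the attribute
      // order declared by the logical plan, otherwise collect() decodes the wrong columns.
      val w = Window.partitionBy(col("k")).orderBy(col("ts"))
      val out = df
        .withColumn("sum_v", sum(col("v")).over(w))
        .withColumn("avg_v", avg(col("v")).over(w))
      out.collect()
      ```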
      
      ## How was this patch tested?
      I have updated an incorrect test in `CollapseWindowSuite`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #16027 from hvanhovell/SPARK-18604.
      
      (cherry picked from commit 454b8049)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      d6e027e6
    • [SPARK-18585][SQL] Use `ev.isNull = "false"` if possible for Janino to have a chance to optimize. · 886f880d
      Takuya UESHIN authored
      
      ## What changes were proposed in this pull request?
      
      Janino can optimize `true ? a : b` into `a` and `false ? a : b` into `b`, as well as if/else blocks with a literal condition, so we should use a literal for `ev.isNull` whenever possible.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #16008 from ueshin/issues/SPARK-18585.
      
      (cherry picked from commit 87141622)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      886f880d
  2. Nov 27, 2016
    • [SPARK-18482][SQL] make sure Spark can access the table metadata created by older version of spark · 6b77889e
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      In Spark 2.1, we did a lot of refactoring of `HiveExternalCatalog` and related code paths. That refactoring may introduce external behavior changes and break backward compatibility, e.g. http://issues.apache.org/jira/browse/SPARK-18464
      
      
      
      To avoid future compatibility problems with `HiveExternalCatalog`, this PR dumps some typical table metadata from tables created by Spark 2.0 and tests whether it can be recognized by the current version of Spark.
      
      ## How was this patch tested?
      
      test only change
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16003 from cloud-fan/test.
      
      (cherry picked from commit fc2c13bd)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      6b77889e
    • [SPARK-18594][SQL] Name Validation of Databases/Tables · 1e8fbefa
      gatorsmile authored
      
      ### What changes were proposed in this pull request?
      Currently, the name validation checks are limited to table creation. They are enforced by the analyzer rule `PreWriteCheck`.
      
      However, table renaming and database creation have the same issues. It makes more sense to do the checks in `SessionCatalog`. This PR is to add it into `SessionCatalog`.
      
      ### How was this patch tested?
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16018 from gatorsmile/nameValidate.
      
      (cherry picked from commit 07f32c22)
      Signed-off-by: gatorsmile <gatorsmile@gmail.com>
      1e8fbefa
  3. Nov 26, 2016
    • [SPARK-17251][SQL] Improve `OuterReference` to be `NamedExpression` · 9c549572
      Dongjoon Hyun authored
      
      ## What changes were proposed in this pull request?
      
      Currently, `OuterReference` is not a `NamedExpression`, so it raises a `ClassCastException` when used in the projection list of an IN correlated subquery. This PR aims to support that case by making `OuterReference` a `NamedExpression`, so that correct error messages are shown.
      
      ```scala
      scala> sql("CREATE TEMPORARY VIEW t1 AS SELECT * FROM VALUES 1, 2 AS t1(a)")
      scala> sql("CREATE TEMPORARY VIEW t2 AS SELECT * FROM VALUES 1 AS t2(b)")
      scala> sql("SELECT a FROM t1 WHERE a IN (SELECT a FROM t2)").show
      java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.OuterReference cannot be cast to org.apache.spark.sql.catalyst.expressions.NamedExpression
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins test with new test cases.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #16015 from dongjoon-hyun/SPARK-17251-2.
      
      (cherry picked from commit 9c03c564)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      9c549572
    • [WIP][SQL][DOC] Fix incorrect `code` tag · ff699332
      Weiqing Yang authored
      
      ## What changes were proposed in this pull request?
      This PR is to fix incorrect `code` tag in `sql-programming-guide.md`
      
      ## How was this patch tested?
      Manually.
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #15941 from weiqingy/fixtag.
      
      (cherry picked from commit f4a98e42)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      ff699332
    • [SPARK-18481][ML] ML 2.1 QA: Remove deprecated methods for ML · 830ee134
      Yanbo Liang authored
      
      ## What changes were proposed in this pull request?
      Remove deprecated methods for ML.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15913 from yanboliang/spark-18481.
      
      (cherry picked from commit c4a7eef0)
      Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
      830ee134
  4. Nov 25, 2016
    • [SPARK-18583][SQL] Fix nullability of InputFileName. · da66b974
      Takuya UESHIN authored
      
      ## What changes were proposed in this pull request?
      
      The nullability of `InputFileName` should be `false`.
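
      For illustration, a small check one could run (assuming the expression falls back to an empty string rather than null when no input file is associated with the current task):

      ```scala
      import org.apache.spark.sql.functions.input_file_name

      // With the nullability fixed to false, downstream operators can rely on the column
      // never containing null.
      val df = spark.range(1).select(input_file_name().alias("file"))
      df.printSchema()   // file: string (nullable = false)
      ```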
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #16007 from ueshin/issues/SPARK-18583.
      
      (cherry picked from commit a88329d4)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      da66b974
    • [SPARK-18436][SQL] isin causing SQL syntax error with JDBC · 906d82c4
      jiangxingbo authored
      
      ## What changes were proposed in this pull request?
      
      The expression `in(empty seq)` is invalid in some data sources. Since `in(empty seq)` is always false, we should rewrite it to a false literal in the optimizer.
      The SQL statement `SELECT * FROM t WHERE a IN ()` throws a `ParseException`, which is consistent with Hive, so that behavior does not need to change.
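
      A rough illustration of the DataFrame-side case (`jdbcDF` is assumed to be a DataFrame read through the JDBC source):

      ```scala
      import org.apache.spark.sql.functions.col

      // An empty IN list is always false; after this change the optimizer replaces it with a
      // false literal instead of pushing an invalid "a IN ()" predicate down to the database.
      jdbcDF.filter(col("a").isin()).count()   // 0 rows, and no SQL syntax error from the source
      ```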
      
      ## How was this patch tested?
      Add new test case in `OptimizeInSuite`.
      
      Author: jiangxingbo <jiangxb1987@gmail.com>
      
      Closes #15977 from jiangxb1987/isin-empty.
      
      (cherry picked from commit e2fb9fd3)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      906d82c4
    • [SPARK-18559][SQL] Fix HLL++ with small relative error · b5afdaca
      Zhenhua Wang authored
      
      ## What changes were proposed in this pull request?
      
      In `HyperLogLogPlusPlus`, if the relative error is so small that p >= 19, an `ArrayIndexOutOfBoundsException` is thrown in `THRESHOLDS(p-4)`. We should check `p` and, when p >= 19, fall back to the original HLL result and use its small-range correction.

      The PR also fixes the upper bound in the log message of the `require()` call.
      The upper bound is computed by:
      ```
      val relativeSD = 1.106d / Math.pow(Math.E, p * Math.log(2.0d) / 2.0d)
      ```
      which is derived from the equation for computing `p`:
      ```
      val p = 2.0d * Math.log(1.106d / relativeSD) / Math.log(2.0d)
      ```
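
      A small sketch of how `p` follows from `relativeSD`, using the equation quoted above (the `precision` function name is mine):

      ```scala
      import scala.math.{ceil, log}

      // Precision p derived from the requested relative standard deviation.
      def precision(relativeSD: Double): Int =
        ceil(2.0d * log(1.106d / relativeSD) / log(2.0d)).toInt

      precision(0.01)    // 14 -- a typical setting
      precision(0.001)   // 21 -- small enough error that p >= 19, the case fixed here
      ```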
      
      ## How was this patch tested?
      
      Added test cases for:
      1. checking the validity of the parameter `relativeSD`
      2. estimation with a relative error small enough that p >= 19
      
      Author: Zhenhua Wang <wzh_zju@163.com>
      Author: wangzhenhua <wangzhenhua@huawei.com>
      
      Closes #15990 from wzhfy/hllppRsd.
      
      (cherry picked from commit 5ecdc7c5)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      b5afdaca
    • [SPARK-3359][BUILD][DOCS] More changes to resolve javadoc 8 errors that will... · 69856f28
      hyukjinkwon authored
      [SPARK-3359][BUILD][DOCS] More changes to resolve javadoc 8 errors that will help unidoc/genjavadoc compatibility
      
      ## What changes were proposed in this pull request?
      
      This PR only tries to fix things that looks pretty straightforward and were fixed in other previous PRs before.
      
      This PR roughly fixes several things as below:
      
      - Fix unrecognisable class and method links in javadoc by changing them from `[[..]]` to `` `...` ``
      
        ```
        [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/DataStreamReader.java:226: error: reference not found
        [error]    * Loads text files and returns a {link DataFrame} whose schema starts with a string column named
        ```
      
      - Fix an exception annotation and remove code backticks in `throws` annotation
      
        Currently, sbt unidoc with Java 8 complains as below:
      
        ```
        [error] .../java/org/apache/spark/sql/streaming/StreamingQuery.java:72: error: unexpected text
        [error]    * throws StreamingQueryException, if <code>this</code> query has terminated with an exception.
        ```
      
        `throws` should specify the correct class name from `StreamingQueryException,` to `StreamingQueryException` without backticks. (see [JDK-8007644](https://bugs.openjdk.java.net/browse/JDK-8007644)).
      
      - Fix `[[http..]]` to `<a href="http..."></a>`.
      
        ```diff
        -   * [[https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https Oracle
        -   * blog page]].
        +   * <a href="https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https">
        +   * Oracle blog page</a>.
        ```
      
         `[[http...]]` link markdown in scaladoc is unrecognisable in javadoc.
      
      - It seems a class can't have a `return` annotation, so two such cases were removed.
      
        ```
        [error] .../java/org/apache/spark/mllib/regression/IsotonicRegression.java:27: error: invalid use of return
        [error]    * return New instance of IsotonicRegression.
        ```
      
      - Fix < to `&lt;` and > to `&gt;` according to HTML rules.
      
      - Fix `</p>` complaint
      
      - Exclude tags unrecognisable in javadoc: `constructor`, `todo` and `groupname`.
      
      ## How was this patch tested?
      
      Manually tested by `jekyll build` with Java 7 and 8
      
      ```
      java version "1.7.0_80"
      Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
      Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
      ```
      
      ```
      java version "1.8.0_45"
      Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
      Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
      ```
      
      Note: this does not yet make sbt unidoc succeed with Java 8, but it reduces the number of errors with Java 8.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15999 from HyukjinKwon/SPARK-3359-errors.
      
      (cherry picked from commit 51b1c155)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      69856f28
    • [SPARK-18119][SPARK-CORE] Namenode safemode check is only performed on one... · a49dfa93
      n.fraison authored
      [SPARK-18119][SPARK-CORE] Namenode safemode check is only performed on one namenode, which can stall the startup of the SparkHistory server
      
      ## What changes were proposed in this pull request?
      
      Instead of using the setSafeMode method that checks only the first namenode, use the one that permits checking only the active namenodes.
      ## How was this patch tested?
      
      manual tests
      
      Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.
      
      This commit is contributed by Criteo SA under the Apache v2 licence.
      
      Author: n.fraison <n.fraison@criteo.com>
      
      Closes #15648 from ashangit/SPARK-18119.
      
      (cherry picked from commit f42db0c0)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      a49dfa93
    • [SPARK-18575][WEB] Keep same style: adjust the position of driver log links · 57dbc682
      uncleGen authored
      ## What changes were proposed in this pull request?
      
      Not a bug; this just adjusts the position of the driver log link to keep the same style as the other executor log links.

      ![image](https://cloud.githubusercontent.com/assets/7402327/20590092/f8bddbb8-b25b-11e6-9aaf-3b5b3073df10.png)
      
      ## How was this patch tested?
       no
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #16001 from uncleGen/SPARK-18575.
      
      (cherry picked from commit f58a8aa2)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      57dbc682
  5. Nov 24, 2016
  6. Nov 23, 2016
    • [SPARK-18510][SQL] Follow up to address comments in #15951 · 27d81d00
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      This PR addresses the remaining review comments on #15951.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15997 from zsxwing/SPARK-18510-follow-up.
      
      (cherry picked from commit 223fa218)
      Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
      27d81d00
    • [SPARK-18510] Fix data corruption from inferred partition column dataTypes · 15d2cf26
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      ### The Issue
      
      If I specify my schema when doing
      ```scala
      spark.read
        .schema(someSchemaWherePartitionColumnsAreStrings)
      ```
      but partition inference infers it as IntegerType (or, I assume, LongType or DoubleType, basically fixed-size types), then once UnsafeRows are generated, your data will be corrupted.
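
      For illustration, a hedged sketch of the failure mode (path, schema and column names are made up):

      ```scala
      import org.apache.spark.sql.types.{StringType, StructType}

      // Data laid out as /data/events/part=1/..., with the user declaring `part` as a string.
      val schema = new StructType().add("value", StringType).add("part", StringType)
      val df = spark.read.schema(schema).parquet("/data/events")
      // Before this fix, partition inference could still treat `part` as an IntegerType, and
      // once UnsafeRows were produced with the wrong fixed-size type the data came out corrupted.
      // After the fix, the user-provided StringType wins for the partition column.
      ```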
      
      ### Proposed solution
      
      The partition handling code path is kind of a mess. In my fix I'm probably adding to the mess, but at least trying to standardize the code path.
      
      The real issue is that a user that uses the `spark.read` code path can never clearly specify what the partition columns are. If you try to specify the fields in `schema`, we practically ignore what the user provides, and fall back to our inferred data types. What happens in the end is data corruption.
      
      My solution tries to fix this by always trying to infer partition columns the first time you specify the table. Once we find what the partition columns are, we try to find them in the user specified schema and use the dataType provided there, or fall back to the smallest common data type.
      
      We will ALWAYS append partition columns to the user's schema, even if they didn't ask for it. We will only use the data type they provided if they specified it. While this is confusing, this has been the behavior since Spark 1.6, and I didn't want to change this behavior in the QA period of Spark 2.1. We may revisit this decision later.
      
      A side effect of this PR is that we won't need https://github.com/apache/spark/pull/15942 if this PR goes in.
      
      ## How was this patch tested?
      
      Regression tests
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #15951 from brkyvz/partition-corruption.
      
      (cherry picked from commit 0d1bf2b6)
      Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
      15d2cf26
    • [SPARK-18050][SQL] do not create default database if it already exists · 835f03f3
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
      When we try to create the default database, we ask hive to do nothing if it already exists. However, Hive will log an error message instead of doing nothing, and the error message is quite annoying and confusing.
      
      In this PR, we only create default database if it doesn't exist.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15993 from cloud-fan/default-db.
      
      (cherry picked from commit f129ebcd)
      Signed-off-by: Andrew Or <andrewor14@gmail.com>
      835f03f3
    • [SPARK-18522][SQL] Explicit contract for column stats serialization · 599dac15
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      The current implementation of column stats uses the base64 encoding of the internal UnsafeRow format to persist statistics (in table properties in the Hive metastore). This is an internal format that is not stable across different versions of Spark and should NOT be used for persistence. In addition, it would be better if the statistics stored in the catalog were human readable.
      
      This pull request introduces the following changes:
      
      1. Created a single ColumnStat class for all data types. All data types track the same set of statistics.
      2. Updated the implementation for stats collection to get rid of the dependency on internal data structures (e.g. InternalRow, or storing DateType as an int32). For example, previously dates were stored as a single integer, but are now stored as java.sql.Date. When we implement the next steps of CBO, we can add code to convert those back into internal types again.
      3. Documented clearly what JVM data types are being used to store what data.
      4. Defined a simple Map[String, String] interface for serializing and deserializing column stats into/from the catalog (see the sketch after this list).
      5. Rearranged the method/function structure so it is more clear what the supported data types are, and also moved how stats are generated into the ColumnStat class so they are easy to find.
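
      A rough sketch of what such a string-map round trip could look like (the class and field names here are illustrative, not the exact ones used by Spark):

      ```scala
      // Column statistics serialized to and from a plain Map[String, String], suitable for
      // storage as human-readable table properties in the metastore.
      case class ColumnStatSketch(distinctCount: Long, nullCount: Long, max: Option[String], min: Option[String]) {
        def toMap: Map[String, String] =
          Map("distinctCount" -> distinctCount.toString, "nullCount" -> nullCount.toString) ++
            max.map("max" -> _) ++ min.map("min" -> _)
      }

      object ColumnStatSketch {
        def fromMap(m: Map[String, String]): ColumnStatSketch =
          ColumnStatSketch(m("distinctCount").toLong, m("nullCount").toLong, m.get("max"), m.get("min"))
      }
      ```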
      
      ## How was this patch tested?
      Removed most of the original test cases created for column statistics, and added three very simple ones to cover all the cases. The three test cases validate:
      1. Roundtrip serialization works.
      2. Behavior when analyzing non-existent column or unsupported data type column.
      3. Result for stats collection for all valid data types.
      
      Also moved parser related tests into a parser test suite and added an explicit serialization test for the Hive external catalog.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15959 from rxin/SPARK-18522.
      
      (cherry picked from commit 70ad07a9)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      599dac15
    • [SPARK-18557] Downgrade confusing memory leak warning message · e11d7c68
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      TaskMemoryManager has a memory leak detector that gets called at task completion callback and checks whether any memory has not been released. If they are not released by the time the callback is invoked, TaskMemoryManager releases them.
      
      The current error message says something like the following:
      ```
      WARN  [Executor task launch worker-0]
      org.apache.spark.memory.TaskMemoryManager - leak 16.3 MB memory from
      org.apache.spark.unsafe.map.BytesToBytesMap33fb6a15
      ```
      In practice, there are multiple reasons why these can be triggered in the normal code path (e.g. limit, or task failures), and the fact that these messages are logged means the "leak" has already been fixed by TaskMemoryManager.

      To not confuse users, this patch downgrades the message from warning to debug level, and avoids using the word "leak" since it is not actually a leak.
      
      ## How was this patch tested?
      N/A - this is a simple logging improvement.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15989 from rxin/SPARK-18557.
      
      (cherry picked from commit 9785ed40)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      e11d7c68
    • [SPARK-18545][SQL] Verify number of hive client RPCs in PartitionedTablePerfStatsSuite · 539c193a
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      This would help catch accidental O(n) calls to the hive client as in https://issues.apache.org/jira/browse/SPARK-18507
      
      ## How was this patch tested?
      
      Checked that the test fails before https://issues.apache.org/jira/browse/SPARK-18507 was patched. cc cloud-fan
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #15985 from ericl/spark-18545.
      
      (cherry picked from commit 85235ed6)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      539c193a
    • [SPARK-18053][SQL] compare unsafe and safe complex-type values correctly · ebeb0514
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
      In Spark SQL, some expressions may output safe-format values, e.g. `CreateArray`, `CreateStruct`, `Cast`, etc. When we compare two values, we should be able to compare the safe and unsafe formats.

      `GreaterThan`, `LessThan`, etc. in Spark SQL already handle this, but `EqualTo` doesn't. This PR fixes it.
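
      A rough illustration of the kind of case this affects (not necessarily the exact regression test; the table name is mine):

      ```scala
      import org.apache.spark.sql.functions.{array, col}

      // Values read back from a table arrive in the unsafe format, while array(1L) in the
      // predicate is built in the safe format; EqualTo must compare the two by value.
      spark.range(10).select(array(col("id")).as("arr")).write.saveAsTable("array_tbl")
      spark.sql("SELECT * FROM array_tbl WHERE arr = array(1L)").count()   // expected: 1
      ```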
      
      ## How was this patch tested?
      
      new unit test and regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15929 from cloud-fan/type-aware.
      
      (cherry picked from commit 84284e8c)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      ebeb0514
    • [SPARK-18073][DOCS][WIP] Migrate wiki to spark.apache.org web site · 5f198d20
      Sean Owen authored
      
      ## What changes were proposed in this pull request?
      
      Updates links to the wiki to links to the new location of content on spark.apache.org.
      
      ## How was this patch tested?
      
      Doc builds
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15967 from srowen/SPARK-18073.1.
      
      (cherry picked from commit 7e0cd1d9)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      5f198d20
    • [SPARK-18179][SQL] Throws analysis exception with a proper message for... · fabb5aea
      hyukjinkwon authored
      [SPARK-18179][SQL] Throws analysis exception with a proper message for unsupported argument types in reflect/java_method function
      
      ## What changes were proposed in this pull request?
      
      This PR proposes throwing an `AnalysisException` with a proper message, rather than a `NoSuchElementException` with the message `key not found: TimestampType`, when unsupported types are given to the `reflect` and `java_method` functions.
      
      ```scala
      spark.range(1).selectExpr("reflect('java.lang.String', 'valueOf', cast('1990-01-01' as timestamp))")
      ```
      
      produces
      
      **Before**
      
      ```
      java.util.NoSuchElementException: key not found: TimestampType
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:59)
        at scala.collection.MapLike$class.apply(MapLike.scala:141)
        at scala.collection.AbstractMap.apply(Map.scala:59)
        at org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection$$anonfun$findMethod$1$$anonfun$apply$1.apply(CallMethodViaReflection.scala:159)
      ...
      ```
      
      **After**
      
      ```
      cannot resolve 'reflect('java.lang.String', 'valueOf', CAST('1990-01-01' AS TIMESTAMP))' due to data type mismatch: arguments from the third require boolean, byte, short, integer, long, float, double or string expressions; line 1 pos 0;
      'Project [unresolvedalias(reflect(java.lang.String, valueOf, cast(1990-01-01 as timestamp)), Some(<function1>))]
      +- Range (0, 1, step=1, splits=Some(2))
      ...
      ```
      
      Added message is,
      
      ```
      arguments from the third require boolean, byte, short, integer, long, float, double or string expressions
      ```
      
      ## How was this patch tested?
      
      Tests added in `CallMethodViaReflection`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15694 from HyukjinKwon/SPARK-18179.
      
      (cherry picked from commit 2559fb4b)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      fabb5aea
  7. Nov 22, 2016
    • [SPARK-18501][ML][SPARKR] Fix spark.glm errors when fitting on collinear data · fc5fee83
      Yanbo Liang authored
      
      ## What changes were proposed in this pull request?
      * Fix SparkR ```spark.glm``` errors when fitting on collinear data, since the ```standard error of coefficients, t value and p value``` are not available in this condition.
      * The Scala/Python GLM summary should throw an exception if users request the ```standard error of coefficients, t value and p value``` but the underlying WLS was solved by local "l-bfgs".
      
      ## How was this patch tested?
      Add unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15930 from yanboliang/spark-18501.
      
      (cherry picked from commit 982b82e3)
      Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
      fc5fee83
    • [SPARK-18530][SS][KAFKA] Change Kafka timestamp column type to TimestampType · 3be2d1e0
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      Changed Kafka timestamp column type to TimestampType.
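
      A small sketch of the effect (broker address and topic are placeholders):

      ```scala
      // The Kafka source's `timestamp` column now comes through as TimestampType.
      val kafkaDF = spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:9092")
        .option("subscribe", "topic1")
        .load()
      kafkaDF.printSchema()   // ..., timestamp: timestamp, timestampType: integer
      ```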
      
      ## How was this patch tested?
      
      `test("Kafka column types")`.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15969 from zsxwing/SPARK-18530.
      
      (cherry picked from commit d0212eb0)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      3be2d1e0
    • [SPARK-18533] Raise correct error upon specification of schema for datasource... · 4b96ffb1
      Dilip Biswal authored
      [SPARK-18533] Raise correct error upon specification of schema for datasource tables created using CTAS
      
      ## What changes were proposed in this pull request?
      Fixes the inconsistency in the error raised between data source and Hive serde tables when a schema is specified in a CTAS scenario. In the process, the grammar for CREATE TABLE (data source) is simplified.
      
      **before:**
      ``` SQL
      spark-sql> create table t2 (c1 int, c2 int) using parquet as select * from t1;
      Error in query:
      mismatched input 'as' expecting {<EOF>, '.', 'OPTIONS', 'CLUSTERED', 'PARTITIONED'}(line 1, pos 64)
      
      == SQL ==
      create table t2 (c1 int, c2 int) using parquet as select * from t1
      ----------------------------------------------------------------^^^
      ```
      
      **After:**
      ```SQL
      spark-sql> create table t2 (c1 int, c2 int) using parquet as select * from t1
               > ;
      Error in query:
      Operation not allowed: Schema may not be specified in a Create Table As Select (CTAS) statement(line 1, pos 0)
      
      == SQL ==
      create table t2 (c1 int, c2 int) using parquet as select * from t1
      ^^^
      ```
      ## How was this patch tested?
      Added a new test in CreateTableAsSelectSuite
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #15968 from dilipbiswal/ctas.
      
      (cherry picked from commit 39a1d306)
      Signed-off-by: gatorsmile <gatorsmile@gmail.com>
      4b96ffb1
    • [SPARK-16803][SQL] SaveAsTable does not work when target table is a Hive serde table · 64b9de9c
      gatorsmile authored
      
      ### What changes were proposed in this pull request?
      
      In Spark 2.0, `SaveAsTable` does not work when the target table is a Hive serde table, but Spark 1.6 works.
      
      **Spark 1.6**
      
      ``` Scala
      scala> sql("create table sample.sample stored as SEQUENCEFILE as select 1 as key, 'abc' as value")
      res2: org.apache.spark.sql.DataFrame = []
      
      scala> val df = sql("select key, value as value from sample.sample")
      df: org.apache.spark.sql.DataFrame = [key: int, value: string]
      
      scala> df.write.mode("append").saveAsTable("sample.sample")
      
      scala> sql("select * from sample.sample").show()
      +---+-----+
      |key|value|
      +---+-----+
      |  1|  abc|
      |  1|  abc|
      +---+-----+
      ```
      
      **Spark 2.0**
      
      ``` Scala
      scala> df.write.mode("append").saveAsTable("sample.sample")
      org.apache.spark.sql.AnalysisException: Saving data in MetastoreRelation sample, sample
       is not supported.;
      ```
      
      So far, we do not plan to support it in Spark 2.1 due to the risk. Spark 1.6 works because it internally uses insertInto. But if we changed it back, it would break the semantics of saveAsTable (this method uses by-name resolution instead of the by-position resolution used by insertInto). More changes are needed to support `hive` as a `format` in DataFrameWriter.

      Instead, users should use the insertInto API. This PR corrects the error messages so that users understand how to bypass the limitation before we support it in a separate PR.
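
      For illustration, a hedged sketch of the suggested bypass for the Spark 2.0 example above:

      ```scala
      // insertInto uses by-position resolution and works against the existing Hive serde table,
      // unlike saveAsTable here.
      df.write.mode("append").insertInto("sample.sample")
      ```
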
      ### How was this patch tested?
      
      Test cases are added
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15926 from gatorsmile/saveAsTableFix5.
      
      (cherry picked from commit 9c42d4a7)
      Signed-off-by: gatorsmile <gatorsmile@gmail.com>
      64b9de9c
    • [SPARK-18373][SPARK-18529][SS][KAFKA] Make failOnDataLoss=false work with Spark jobs · bd338f60
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR adds `CachedKafkaConsumer.getAndIgnoreLostData` to handle corner cases of `failOnDataLoss=false`.
      
      It also resolves [SPARK-18529](https://issues.apache.org/jira/browse/SPARK-18529) after refactoring the code: a timeout will now throw a TimeoutException.
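
      A short sketch of the option this PR hardens (broker address and topic are placeholders):

      ```scala
      // With failOnDataLoss=false, the query skips data lost to Kafka retention or topic
      // deletion instead of failing the job.
      val stream = spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:9092")
        .option("subscribe", "topic1")
        .option("failOnDataLoss", "false")
        .load()
      ```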
      
      ## How was this patch tested?
      
      Because I cannot find any way to manually control the Kafka server to clean up logs, it's impossible to write unit tests for each corner case. Therefore, I just created `test("stress test for failOnDataLoss=false")` which should cover most of corner cases.
      
      I also modified some existing tests to test for both `failOnDataLoss=false` and `failOnDataLoss=true` to make sure it doesn't break existing logic.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15820 from zsxwing/failOnDataLoss.
      
      (cherry picked from commit 2fd101b2)
      Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
      bd338f60
    • [SPARK-18465] Add 'IF EXISTS' clause to 'UNCACHE' to not throw exceptions when table doesn't exist · fb2ea54a
      Burak Yavuz authored
      
      ## What changes were proposed in this pull request?
      
      While this behavior is debatable, consider the following use case:
      ```sql
      UNCACHE TABLE foo;
      CACHE TABLE foo AS
      SELECT * FROM bar
      ```
      The command above fails the first time you run it. But I want to run the command above over and over again, and I don't want to change my code just for the first run of it.
      The issue is that subsequent `CACHE TABLE` commands do not overwrite the existing table.
      
      Now we can do:
      ```sql
      UNCACHE TABLE IF EXISTS foo;
      CACHE TABLE foo AS
      SELECT * FROM bar
      ```
      
      ## How was this patch tested?
      
      Unit tests
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #15896 from brkyvz/uncache.
      
      (cherry picked from commit bdc8153e)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      fb2ea54a