  1. May 13, 2017
    • [SPARK-20725][SQL] partial aggregate should behave correctly for sameResult · 1283c3d1
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      For an aggregate function in `PartialMerge` or `Final` mode, the input is aggregate buffers instead of the actual children expressions. Since the actual children expressions won't affect the result, we should normalize the expression IDs for them.
      
      ## How was this patch tested?
      
      a new regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17964 from cloud-fan/tmp.
    • [SPARK-18772][SQL] Avoid unnecessary conversion try for special floats in JSON · 3f98375d
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR is based on https://github.com/apache/spark/pull/16199 and extracts the valid change from https://github.com/apache/spark/pull/9759 to resolve SPARK-18772.

      This avoids the additional conversion attempts with `toFloat` and `toDouble`.

      To see the conversions avoided, compare the behavior below:
      
      **Before**
      
      ```scala
      scala> import org.apache.spark.sql.types._
      import org.apache.spark.sql.types._
      
      scala> spark.read.schema(StructType(Seq(StructField("a", DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": "nan"}""").toDS).show()
      17/05/12 11:30:41 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
      java.lang.NumberFormatException: For input string: "nan"
      ...
      ```
      
      **After**
      
      ```scala
      scala> import org.apache.spark.sql.types._
      import org.apache.spark.sql.types._
      
      scala> spark.read.schema(StructType(Seq(StructField("a", DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": "nan"}""").toDS).show()
      17/05/12 11:44:30 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
      java.lang.RuntimeException: Cannot parse nan as DoubleType.
      ...
      ```
      
      ## How was this patch tested?
      
      Unit tests added in `JsonSuite`.
      
      Closes #16199
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: Nathan Howell <nhowell@godaddy.com>
      
      Closes #17956 from HyukjinKwon/SPARK-18772.
  2. May 12, 2017
    • [SPARK-20719][SQL] Support LIMIT ALL · b84ff7eb
      Xiao Li authored
      ### What changes were proposed in this pull request?
      `LIMIT ALL` is the same as omitting the `LIMIT` clause. It is supported by both PostgreSQL and Presto. This PR adds support for it in the parser.
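
      A minimal sketch of the new syntax (hypothetical table name, run in the Scala shell):

      ```scala
      val df = spark.range(5).toDF("id")
      df.createOrReplaceTempView("t")
      // LIMIT ALL behaves exactly like omitting the LIMIT clause
      spark.sql("SELECT id FROM t LIMIT ALL").count()  // 5, same as without LIMIT
      ```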
      
      ### How was this patch tested?
      Added a test case
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17960 from gatorsmile/LimitAll.
    • [SPARK-20594][SQL] The staging directory should be a child directory that starts with "." to avoid being deleted if we set hive.exec.stagingdir under the table directory · e3d2022e
      zuotingbing authored
      
      JIRA Issue: https://issues.apache.org/jira/browse/SPARK-20594
      
      ## What changes were proposed in this pull request?
      
      The staging directory should be a child directory whose name starts with "." so that it is not deleted before the staging directory is moved into the table directory when we set hive.exec.stagingdir under the table directory.
      
      ## How was this patch tested?
      
      Added unit tests
      
      Author: zuotingbing <zuo.tingbing9@zte.com.cn>
      
      Closes #17858 from zuotingbing/spark-stagingdir.
    • [SPARK-20714][SS] Fix match error when watermark is set with timeout = no timeout / processing timeout · 0d3a6319
      Tathagata Das authored
      
      ## What changes were proposed in this pull request?
      
      When watermark is set, and timeout conf is NoTimeout or ProcessingTimeTimeout (both do not need the watermark), the query fails at runtime with the following exception.
      ```
      MatchException: Some(org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate1a9b798e) (of class scala.Some)
          org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec$$anonfun$doExecute$1.apply(FlatMapGroupsWithStateExec.scala:120)
          org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec$$anonfun$doExecute$1.apply(FlatMapGroupsWithStateExec.scala:116)
          org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:70)
          org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:65)
          org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
      ```
      
      The match did not correctly handle cases where the watermark was defined but the timeout was different from EventTimeTimeout.
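
      A hedged sketch of the failing combination (run in the Scala shell; `events` is an assumed streaming Dataset[(String, java.sql.Timestamp)]): the watermark is set, but the timeout is ProcessingTimeTimeout rather than EventTimeTimeout.

      ```scala
      import java.sql.Timestamp
      import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

      events
        .withWatermark("_2", "10 seconds")  // watermark is set ...
        .groupByKey(_._1)
        // ... but the timeout does not need it, which used to trip the match
        .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.ProcessingTimeTimeout) {
          (key: String, values: Iterator[(String, Timestamp)], state: GroupState[Int]) =>
            Iterator.single((key, values.size))
        }
      ```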
      
      ## How was this patch tested?
      New unit tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #17954 from tdas/SPARK-20714.
    • [SPARK-20702][CORE] TaskContextImpl.markTaskCompleted should not hide the original error · 7d6ff391
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR adds an `error` parameter to `TaskContextImpl.markTaskCompleted` to propagate the original error.
      
      It also fixes an issue that `TaskCompletionListenerException.getMessage` doesn't include `previousError`.
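
      For illustration, a sketch of the scenario (an assumed snippet for the Scala shell) where a completion listener itself throws; before this fix the listener's exception could mask the task's original error:

      ```scala
      import org.apache.spark.TaskContext

      sc.parallelize(1 to 10, 1).mapPartitions { iter =>
        TaskContext.get().addTaskCompletionListener { _: TaskContext =>
          throw new RuntimeException("listener failed")  // thrown during task completion
        }
        iter.map { i =>
          if (i == 5) throw new IllegalStateException("original error")
          i
        }
      }.count()
      // TaskCompletionListenerException should now report "original error" as well
      ```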
      
      ## How was this patch tested?
      
      New unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17942 from zsxwing/SPARK-20702.
    • [SPARK-19951][SQL] Add string concatenate operator || to Spark SQL · b526f70c
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR added code to support `||` for string concatenation. This string operation is supported in PostgreSQL and MySQL.
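
      A quick sketch of the operator (Scala shell):

      ```scala
      spark.sql("SELECT 'Spark' || ' ' || 'SQL' AS s").show()
      // +---------+
      // |        s|
      // +---------+
      // |Spark SQL|
      // +---------+
      ```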
      
      ## How was this patch tested?
      Added tests in `SparkSqlParserSuite`
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #17711 from maropu/SPARK-19951.
    • [SPARK-20710][SQL] Support aliases in CUBE/ROLLUP/GROUPING SETS · 92ea7fd7
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR added `Analyzer` code for supporting aliases in CUBE/ROLLUP/GROUPING SETS (this is a follow-up of #17191).
      
      ## How was this patch tested?
      Added tests in `SQLQueryTestSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #17948 from maropu/SPARK-20710.
    • [SPARK-20718][SQL][FOLLOWUP] Fix canonicalization for HiveTableScanExec · 54b4f2ad
      wangzhenhua authored
      ## What changes were proposed in this pull request?
      
      Fix canonicalization for different filter orders in `HiveTableScanExec`.
      
      ## How was this patch tested?
      
      Added a new test case.
      
      Author: wangzhenhua <wangzhenhua@huawei.com>
      
      Closes #17962 from wzhfy/canonicalizeHiveTableScanExec.
    • [SPARK-17424] Fix unsound substitution bug in ScalaReflection. · b2369339
      Ryan Blue authored
      ## What changes were proposed in this pull request?
      
      This method gets a type's primary constructor and fills in type parameters with concrete types. For example, `MapPartitions[T, U] -> MapPartitions[Int, String]`. This substitution fails when the actual type args are empty because they are still unknown. Instead, when there are no resolved types to substitute, this returns the original args with unresolved type parameters.
      ## How was this patch tested?
      
      This doesn't affect substitutions where the type args are determined. With this fix, our case where the actual type args are empty now runs successfully.
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #15062 from rdblue/SPARK-17424-fix-unsound-reflect-substitution.
    • [SPARK-20554][BUILD] Remove usage of scala.language.reflectiveCalls · fc8a2b6e
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Remove uses of scala.language.reflectiveCalls that are either unnecessary or probably result in more complex code. This turned out to be less significant than I thought, but it is still worth a touch-up.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17949 from srowen/SPARK-20554.
    • [SPARK-20639][SQL] Add single argument support for to_timestamp in SQL with documentation improvement · 720708cc
      hyukjinkwon authored
      
      ## What changes were proposed in this pull request?
      
      This PR proposes three things as below:
      
      - Use casting rules to a timestamp in `to_timestamp` by default (it was `yyyy-MM-dd HH:mm:ss`).
      
      - Support single argument for `to_timestamp` similarly with APIs in other languages.
      
        For example, the one below works
      
        ```
        import org.apache.spark.sql.functions._
        Seq("2016-12-31 00:12:00.00").toDF("a").select(to_timestamp(col("a"))).show()
        ```
      
        prints
      
        ```
        +----------------------------------------+
        |to_timestamp(`a`, 'yyyy-MM-dd HH:mm:ss')|
        +----------------------------------------+
        |                     2016-12-31 00:12:00|
        +----------------------------------------+
        ```
      
        whereas this does not work in SQL.
      
        **Before**
      
        ```
        spark-sql> SELECT to_timestamp('2016-12-31 00:12:00');
        Error in query: Invalid number of arguments for function to_timestamp; line 1 pos 7
        ```
      
        **After**
      
        ```
        spark-sql> SELECT to_timestamp('2016-12-31 00:12:00');
        2016-12-31 00:12:00
        ```
      
      - Related document improvement for SQL function descriptions and other API descriptions accordingly.
      
        **Before**
      
        ```
        spark-sql> DESCRIBE FUNCTION extended to_date;
        ...
        Usage: to_date(date_str, fmt) - Parses the `left` expression with the `fmt` expression. Returns null with invalid input.
        Extended Usage:
            Examples:
              > SELECT to_date('2016-12-31', 'yyyy-MM-dd');
               2016-12-31
        ```
      
        ```
        spark-sql> DESCRIBE FUNCTION extended to_timestamp;
        ...
        Usage: to_timestamp(timestamp, fmt) - Parses the `left` expression with the `format` expression to a timestamp. Returns null with invalid input.
        Extended Usage:
            Examples:
              > SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd');
               2016-12-31 00:00:00.0
        ```
      
        **After**
      
        ```
        spark-sql> DESCRIBE FUNCTION extended to_date;
        ...
        Usage:
            to_date(date_str[, fmt]) - Parses the `date_str` expression with the `fmt` expression to
              a date. Returns null with invalid input. By default, it follows casting rules to a date if
              the `fmt` is omitted.
      
        Extended Usage:
            Examples:
              > SELECT to_date('2009-07-30 04:17:52');
               2009-07-30
              > SELECT to_date('2016-12-31', 'yyyy-MM-dd');
               2016-12-31
        ```
      
        ```
        spark-sql> DESCRIBE FUNCTION extended to_timestamp;
        ...
         Usage:
            to_timestamp(timestamp[, fmt]) - Parses the `timestamp` expression with the `fmt` expression to
              a timestamp. Returns null with invalid input. By default, it follows casting rules to
              a timestamp if the `fmt` is omitted.
      
        Extended Usage:
            Examples:
              > SELECT to_timestamp('2016-12-31 00:12:00');
               2016-12-31 00:12:00
              > SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd');
               2016-12-31 00:00:00
        ```
      
      ## How was this patch tested?
      
      Added tests in `datetime.sql`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17901 from HyukjinKwon/to_timestamp_arg.
    • [SPARK-20619][ML] StringIndexer supports multiple ways to order label · af40bb11
      Wayne Zhang authored
      ## What changes were proposed in this pull request?
      
      StringIndexer maps labels to numbers according to the descending order of label frequency. Other types of ordering (e.g., alphabetical) may be needed in feature ETL.  For example, the ordering will affect the result in one-hot encoding and RFormula.
      
      This PR proposes to support other ordering methods by adding a parameter `stringOrderType` that supports the following four options:
      - 'frequencyDesc': descending order by label frequency (most frequent label assigned 0)
      - 'frequencyAsc': ascending order by label frequency (least frequent label assigned 0)
      - 'alphabetDesc': descending alphabetical order
      - 'alphabetAsc': ascending alphabetical order
      
      The default is still descending order of label frequency, so there should be no impact to existing programs.
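
      A minimal sketch of the new parameter (toy data, assumed column names):

      ```scala
      import org.apache.spark.ml.feature.StringIndexer

      val df = spark.createDataFrame(Seq((0, "b"), (1, "a"), (2, "b"))).toDF("id", "label")
      val indexer = new StringIndexer()
        .setInputCol("label")
        .setOutputCol("labelIndex")
        .setStringOrderType("alphabetAsc")  // "a" -> 0.0, "b" -> 1.0
      indexer.fit(df).transform(df).show()
      ```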
      
      ## How was this patch tested?
      new test
      
      Author: Wayne Zhang <actuaryzhang@uber.com>
      
      Closes #17879 from actuaryzhang/stringIndexer.
    • [SPARK-20704][SPARKR] change CRAN test to run single thread · 888b84ab
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      - [x] need to test by running R CMD check --as-cran
      - [x] sanity check vignettes
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17945 from felixcheung/rchangesforpackage.
    • [SPARK-20718][SQL] FileSourceScanExec with different filter orders should be the same after canonicalization · c8da5356
      wangzhenhua authored
      
      ## What changes were proposed in this pull request?
      
      Since `constraints` in `QueryPlan` is a set, the order of filters can differ. Usually this is ok because of canonicalization. However, in `FileSourceScanExec`, its data filters and partition filters are sequences, and their orders are not canonicalized. So `def sameResult` returns different results for different orders of data/partition filters. This leads to, e.g., different decisions for `ReuseExchange`, and thus results in unstable performance.
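
      A hedged sketch of the symptom (hypothetical parquet path); after this fix both plans should canonicalize to the same result:

      ```scala
      // two logically identical scans whose filters are written in different orders
      val df1 = spark.read.parquet("/tmp/t").filter("a > 1 and b < 10")
      val df2 = spark.read.parquet("/tmp/t").filter("b < 10 and a > 1")
      val plan1 = df1.queryExecution.executedPlan
      val plan2 = df2.queryExecution.executedPlan
      plan1.sameResult(plan2)  // should be true after this fix
      ```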
      
      ## How was this patch tested?
      
      Added a new test for `FileSourceScanExec.sameResult`.
      
      Author: wangzhenhua <wangzhenhua@huawei.com>
      
      Closes #17959 from wzhfy/canonicalizeFileSourceScanExec.
  3. May 11, 2017
    • [SPARK-20665][SQL] "Bround" and "Round" functions return NULL · 2b36eb69
      liuxian authored
      ## What changes were proposed in this pull request?
      ```
      spark-sql> select bround(12.3, 2);
      NULL
      ```
      For this case, the expected result is 12.3, but it is null. When the second parameter is bigger than the scale of the decimal value, the result is not what we expect. The "round" function has the same problem. This PR solves the problem for both of them.
      
      ## How was this patch tested?
      unit test cases in MathExpressionsSuite and MathFunctionsSuite
      
      Author: liuxian <liu.xian3@zte.com.cn>
      
      Closes #17906 from 10110346/wip_lx_0509.
    • [SPARK-20399][SQL] Add a config to fallback string literal parsing consistent with old sql parser behavior · 609ba5f2
      Liang-Chi Hsieh authored
      
      ## What changes were proposed in this pull request?
      
      The new SQL parser was introduced in Spark 2.0. All string literals are unescaped in the parser. This seems to bring an issue regarding regex pattern strings.
      
      The following codes can reproduce it:
      
          val data = Seq("\u0020\u0021\u0023", "abc")
          val df = data.toDF()
      
          // 1st usage: works in 1.6
          // Let parser parse pattern string
          val rlike1 = df.filter("value rlike '^\\x20[\\x20-\\x23]+$'")
          // 2nd usage: works in 1.6, 2.x
          // Call Column.rlike so the pattern string is a literal which doesn't go through parser
          val rlike2 = df.filter($"value".rlike("^\\x20[\\x20-\\x23]+$"))
      
          // In 2.x, we need add backslashes to make regex pattern parsed correctly
          val rlike3 = df.filter("value rlike '^\\\\x20[\\\\x20-\\\\x23]+$'")
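
      With the fallback enabled (the config is named `spark.sql.parser.escapedStringLiterals` in the merged change), the 1.6-style pattern parses as before. A sketch:

          spark.conf.set("spark.sql.parser.escapedStringLiterals", "true")
          // the 1st usage above now works again without doubled backslashes
          val rlike4 = df.filter("value rlike '^\\x20[\\x20-\\x23]+$'")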
      
      Following the discussion in #17736, this patch adds a config to fall back to 1.6 string literal parsing and mitigate the migration issue.
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #17887 from viirya/add-config-fallback-string-parsing.
    • [SPARK-20431][SQL] Specify a schema by using a DDL-formatted string · 04901dd0
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR supports a DDL-formatted string in `DataFrameReader.schema`.
      This fix lets users easily define a schema without importing `o.a.spark.sql.types._`.
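
      A minimal sketch (hypothetical input path):

      ```scala
      // DDL-formatted schema string instead of a programmatic StructType
      val df = spark.read
        .schema("a INT, b STRING, c DOUBLE")
        .json("/tmp/data.json")
      df.printSchema()
      ```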
      
      ## How was this patch tested?
      Added tests in `DataFrameReaderWriterSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #17719 from maropu/SPARK-20431.
    • [SPARK-20600][SS] KafkaRelation should be pretty printed in web UI · 7144b518
      Jacek Laskowski authored
      ## What changes were proposed in this pull request?
      
      User-friendly name of `KafkaRelation` in web UI (under Details for Query).
      
      ### Before
      
      <img width="516" alt="spark-20600-before" src="https://cloud.githubusercontent.com/assets/62313/25841955/74479ac6-34a2-11e7-87fb-d9f62a1356a7.png">
      
      ### After
      
      <img width="439" alt="spark-20600-after" src="https://cloud.githubusercontent.com/assets/62313/25841829/f5335630-34a1-11e7-85a4-afe9b66d73c8.png">
      
      ## How was this patch tested?
      
      Local build
      
      ```
      ./bin/spark-shell --jars ~/.m2/repository/org/apache/spark/spark-sql-kafka-0-10_2.11/2.3.0-SNAPSHOT/spark-sql-kafka-0-10_2.11-2.3.0-SNAPSHOT.jar --packages org.apache.kafka:kafka-clients:0.10.0.1
      ```
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #17917 from jaceklaskowski/SPARK-20600-KafkaRelation-webUI.
    • [SPARK-20416][SQL] Print UDF names in EXPLAIN · 3aa4e464
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR added `withName` to `UserDefinedFunction` for printing UDF names in EXPLAIN.
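
      For example (a sketch; the exact EXPLAIN text may differ):

      ```scala
      spark.udf.register("plusOne", (x: Long) => x + 1)
      spark.sql("SELECT plusOne(id) FROM range(3)").explain()
      // the plan should now show the UDF by name instead of anonymously
      ```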
      
      ## How was this patch tested?
      Added tests in `UDFSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #17712 from maropu/SPARK-20416.
    • [SPARK-20311][SQL] Support aliases for table value functions · 8c67aa7f
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR added parsing rules to support aliases in table value functions.
      The previous PR (#17666) was reverted because of a regression. This new PR fixes the regression and adds tests in `SQLQueryTestSuite`.
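
      A sketch of the now-supported syntax (Scala shell):

      ```scala
      // alias the table value function and name its output column
      spark.sql("SELECT t.v FROM range(3) AS t(v)").show()
      ```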
      
      ## How was this patch tested?
      Added tests in `PlanParserSuite` and `SQLQueryTestSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #17928 from maropu/SPARK-20311-3.
    • [SPARK-20569][SQL] RuntimeReplaceable functions should not take extra parameters · b4c99f43
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      `RuntimeReplaceable` always has a constructor with the expression to replace with, and this constructor should not be the function builder.
      
      ## How was this patch tested?
      
      new regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17876 from cloud-fan/minor.
    • [SPARK-17029] make toJSON not go through rdd form but operate on dataset always · 65accb81
      Robert Kruszewski authored
      ## What changes were proposed in this pull request?
      
      Don't convert to the RDD form (via `toRdd`) when doing `toJSON`; operate on the Dataset directly.
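
      The user-facing behavior is unchanged: `toJSON` still returns a `Dataset[String]`, it just no longer detours through the RDD form. A sketch:

      ```scala
      val ds = spark.range(2).toJSON  // Dataset[String], stays in the Dataset world
      ds.show(false)
      // {"id":0}
      // {"id":1}
      ```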
      ## How was this patch tested?
      
      Existing unit tests
      
      Author: Robert Kruszewski <robertk@palantir.com>
      
      Closes #14615 from robert3005/robertk/correct-tojson.
    • [SPARK-20606][ML] Revert "[] ML 2.2 QA: Remove deprecated methods for ML" · 0698e6c8
      Yanbo Liang authored
      This reverts commit b8733e0a.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #17944 from yanboliang/spark-20606-revert.
  4. May 10, 2017
    • [SPARK-20685] Fix BatchPythonEvaluation bug in case of single UDF w/ repeated arg. · 8ddbc431
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      There's a latent corner-case bug in PySpark UDF evaluation where executing a `BatchPythonEvaluation` with a single multi-argument UDF where _at least one argument value is repeated_ will crash at execution with a confusing error.
      
      This problem was introduced in #12057: the code there has a fast path for handling a "batch UDF evaluation consisting of a single Python UDF", but that branch incorrectly assumes that a single UDF won't have repeated arguments and therefore skips the code for unpacking arguments from the input row (whose schema may not necessarily match the UDF inputs due to de-duplication of repeated arguments which occurred in the JVM before sending UDF inputs to Python).
      
      This fix here is simply to remove this special-casing: it turns out that the code in the "multiple UDFs" branch just so happens to work for the single-UDF case because Python treats `(x)` as equivalent to `x`, not as a single-argument tuple.
      
      ## How was this patch tested?
      
      New regression test in `pyspark.python.sql.tests` module (tested and confirmed that it fails before my fix).
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #17927 from JoshRosen/SPARK-20685.
    • [SPARK-20689][PYSPARK] python doctest leaking bucketed table · af8b6cc8
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      It turns out the pyspark doctests call saveAsTable without ever dropping the tables. Since we have separate python tests for bucketed tables, and there is no checking of results, there is really no need to run the doctest, other than leaving it as an example in the generated doc.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17932 from felixcheung/pytablecleanup.
    • [SPARK-19447] Remove remaining references to generated rows metric · 5c2c4dcc
      Ala Luszczak authored
      ## What changes were proposed in this pull request?
      
      https://github.com/apache/spark/commit/b486ffc86d8ad6c303321dcf8514afee723f61f8 left behind references to the "number of generated rows" metric that should have been removed.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: Ala Luszczak <ala@databricks.com>
      
      Closes #17939 from ala/SPARK-19447-fix.
    • [MINOR][BUILD] Fix lint-java breaks. · fcb88f92
      Xianyang Liu authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to fix the lint-java breaks below:
      ```
      [ERROR] src/main/java/org/apache/spark/unsafe/Platform.java:[51] (regexp) RegexpSingleline: No trailing whitespace allowed.
      [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[45,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
      [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[62,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
      [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[78,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
      [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[92,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
      [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[102,25] (naming) MethodName: Method name 'Once' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
      [ERROR] src/test/java/org/apache/spark/streaming/kinesis/JavaKinesisInputDStreamBuilderSuite.java:[28,8] (imports) UnusedImports: Unused import - org.apache.spark.streaming.api.java.JavaDStream.
      ```
      
      after:
      ```
      dev/lint-java
      Checkstyle checks passed.
      ```
      [Test Result](https://travis-ci.org/ConeyLiu/spark/jobs/229666169)
      
      ## How was this patch tested?
      
      Travis CI
      
      Author: Xianyang Liu <xianyang.liu@intel.com>
      
      Closes #17890 from ConeyLiu/codestyle.
    • [SPARK-20678][SQL] Ndv for columns not in filter condition should also be updated · 76e4a556
      wangzhenhua authored
      ## What changes were proposed in this pull request?
      
      In filter estimation, we update column stats for those columns in the filter condition. However, if the number of rows decreases after the filter (i.e. the overall selectivity is less than 1), we need to update (scale down) the number of distinct values (NDV) for all columns, no matter whether they are in the filter condition or not.

      This PR also fixes the inconsistent rounding mode for ndv and rowCount.
      
      ## How was this patch tested?
      
      Added new tests.
      
      Author: wangzhenhua <wangzhenhua@huawei.com>
      
      Closes #17918 from wzhfy/scaleDownNdvAfterFilter.
    • [SPARK-20688][SQL] correctly check analysis for scalar sub-queries · 789bdbe3
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      In `CheckAnalysis`, we should call `checkAnalysis` for `ScalarSubquery` at the beginning, as later we will call `plan.output` which is invalid if `plan` is not resolved.
      
      ## How was this patch tested?
      
      new regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17930 from cloud-fan/tmp.
    • [SPARK-20393][WEB UI] Strengthen Spark to prevent XSS vulnerabilities · b512233a
      NICHOLAS T. MARION authored
      ## What changes were proposed in this pull request?
      
      Add stripXSS and stripXSSMap to Spark Core's UIUtils, and call these functions at any point where getParameter is called against an HttpServletRequest.
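
      A hypothetical sketch (not the actual Spark code) of what a stripXSS-style helper might look like, stripping characters commonly used in XSS payloads before request parameters are echoed back into HTML:

      ```scala
      // illustration only; the real implementation lives in Spark Core's UIUtils
      def stripXSS(requestParameter: String): String =
        if (requestParameter == null) null
        else requestParameter.replaceAll("[<>\"'%;()&+]", "")

      stripXSS("""<script>alert("pwned")</script>""")  // "scriptalertpwned/script"
      ```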
      
      ## How was this patch tested?
      
      Unit tests, IBM Security AppScan Standard no longer showing vulnerabilities, manual verification of WebUI pages.
      
      Author: NICHOLAS T. MARION <nmarion@us.ibm.com>
      
      Closes #17686 from n-marion/xss-fix.
    • [SPARK-20637][CORE] Remove mention of old RDD classes from comments · a4cbf26b
      Michael Mior authored
      ## What changes were proposed in this pull request?
      
      A few comments around the code mention RDD classes that do not exist anymore. I'm not sure of the best way to replace these, so I've just removed them here.
      
      ## How was this patch tested?
      
      Only changes code comments, no testing required
      
      Author: Michael Mior <mmior@uwaterloo.ca>
      
      Closes #17900 from michaelmior/remove-old-rdds.
    • [SPARK-20630][WEB UI] Fixed column visibility in Executor Tab · ca4625e0
      Alex Bozarth authored
      ## What changes were proposed in this pull request?
      
      #14617 added new columns to the executor table, causing the visibility checks for the logs and threadDump columns to toggle the wrong columns, since they used hard-coded column numbers.
      
      I've updated the checks to use column names instead of numbers so future updates don't accidentally break this again.
      
      Note: this will also need to be backported to 2.2 since #14617 was merged there.
      
      ## How was this patch tested?
      
      Manually tested
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      
      Closes #17904 from ajbozarth/spark20630.
    • [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params · 804949c6
      zero323 authored
      
      ## What changes were proposed in this pull request?
      
      - Replace `getParam` calls with `getOrDefault` calls.
      - Fix exception message to avoid unintended `TypeError`.
      - Add unit tests
      
      ## How was this patch tested?
      
      New unit tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17891 from zero323/SPARK-20631.
    • [SPARK-20668][SQL] Modify ScalaUDF to handle nullability. · 0ef16bd4
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      When registering a Scala UDF, we can know whether the UDF will return a nullable value or not. `ScalaUDF` and related classes should handle that nullability.
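
      For instance (a sketch; the improved nullability reporting is the point of this change):

      ```scala
      import org.apache.spark.sql.functions.udf
      import spark.implicits._

      val longUdf = udf((x: Long) => x + 1)       // primitive result type: never null
      val strUdf  = udf((x: Long) => x.toString)  // reference result type: may be null
      spark.range(3).select(longUdf($"id")).printSchema()
      // with this change, the UDF column's nullability can reflect the return type
      ```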
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #17911 from ueshin/issues/SPARK-20668.
    • [SPARK-20670][ML] Simplify FPGrowth transform · a819dab6
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      
      jira: https://issues.apache.org/jira/browse/SPARK-20670
      As suggested by Sean Owen in https://github.com/apache/spark/pull/17130, the transform code in FPGrowthModel can be simplified.
      
      As I tested on some public datasets (http://fimi.ua.ac.be/data/), the performance of the new transform code is on par with or better than the old implementation.
      
      ## How was this patch tested?
      
      Existing unit test.
      
      Author: Yuhao Yang <yuhao.yang@intel.com>
      
      Closes #17912 from hhbyyh/fpgrowthTransform.
    • [SPARK-20686][SQL] PropagateEmptyRelation incorrectly handles aggregate without grouping · a90c5cd8
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      The query
      
      ```
      SELECT 1 FROM (SELECT COUNT(*) WHERE FALSE) t1
      ```
      
      should return a single row of output, because the subquery is an aggregate without a group-by and such an aggregate always returns exactly one row. However, Spark incorrectly returns zero rows.
      
      This is caused by SPARK-16208 / #13906, a patch which added an optimizer rule to propagate EmptyRelation through operators. The logic for handling aggregates is wrong: it checks whether aggregate expressions are non-empty for deciding whether the output should be empty, whereas it should be checking grouping expressions instead:
      
      An aggregate with non-empty grouping expression will return one output row per group. If the input to the grouped aggregate is empty then all groups will be empty and thus the output will be empty. It doesn't matter whether the aggregation output columns include aggregate expressions since that won't affect the number of output rows.
      
      If the grouping expressions are empty, however, then the aggregate will always produce a single output row and thus we cannot propagate the EmptyRelation.
      
      The current implementation is incorrect and also misses an optimization opportunity by not propagating EmptyRelation in the case where a grouped aggregate has aggregate expressions (in other words, `SELECT COUNT(*) from emptyRelation GROUP BY x` would _not_ be optimized to `EmptyRelation` in the old code, even though it safely could be).
      
      This patch resolves this issue by modifying `PropagateEmptyRelation` to consider only the presence/absence of grouping expressions, not the aggregate functions themselves, when deciding whether to propagate EmptyRelation.
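
      A sketch contrasting the two cases (Scala shell):

      ```scala
      // ungrouped aggregate: always one row, so EmptyRelation must NOT be propagated
      spark.sql("SELECT 1 FROM (SELECT COUNT(*) AS c WHERE FALSE) t1").count()  // 1

      // grouped aggregate over empty input: zero rows, safe to collapse to EmptyRelation
      spark.sql("SELECT COUNT(*) FROM (SELECT 1 AS x WHERE FALSE) t GROUP BY x").count()  // 0
      ```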
      
      ## How was this patch tested?
      
      - Added end-to-end regression tests in `SQLQueryTest`'s `group-by.sql` file.
      - Updated unit tests in `PropagateEmptyRelationSuite`.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #17929 from JoshRosen/fix-PropagateEmptyRelation.