Skip to content
Snippets Groups Projects
  1. Oct 04, 2016
    • Ergin Seyfe's avatar
      [SPARK-17773] Input/Output] Add VoidObjectInspector · d2dc8c4a
      Ergin Seyfe authored
      ## What changes were proposed in this pull request?
      Added VoidObjectInspector to the list of PrimitiveObjectInspectors
      
      ## How was this patch tested?
      
      (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
      Executing following query was failing.
      select SOME_UDAF*(a.arr)
      from (
      select Array(null) as arr from dim_one_row
      ) a
      
      After the fix, I am getting the correct output:
      res0: Array[org.apache.spark.sql.Row] = Array([null])
      
      Author: Ergin Seyfe <eseyfe@fb.com>
      
      Closes #15337 from seyfe/add_void_object_inspector.
      d2dc8c4a
  2. Oct 03, 2016
    • Takuya UESHIN's avatar
      [SPARK-17702][SQL] Code generation including too many mutable states exceeds JVM size limit. · b1b47274
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      Code generation including too many mutable states exceeds JVM size limit to extract values from `references` into fields in the constructor.
      We should split the generated extractions in the constructor into smaller functions.
      
      ## How was this patch tested?
      
      I added some tests to check if the generated codes for the expressions exceed or not.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #15275 from ueshin/issues/SPARK-17702.
      b1b47274
    • Dongjoon Hyun's avatar
      [SPARK-17112][SQL] "select null" via JDBC triggers IllegalArgumentException in Thriftserver · c571cfb2
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, Spark Thrift Server raises `IllegalArgumentException` for queries whose column types are `NullType`, e.g., `SELECT null` or `SELECT if(true,null,null)`. This PR fixes that by returning `void` like Hive 1.2.
      
      **Before**
      ```sql
      $ bin/beeline -u jdbc:hive2://localhost:10000 -e "select null"
      Connecting to jdbc:hive2://localhost:10000
      Connected to: Spark SQL (version 2.1.0-SNAPSHOT)
      Driver: Hive JDBC (version 1.2.1.spark2)
      Transaction isolation: TRANSACTION_REPEATABLE_READ
      Error: java.lang.IllegalArgumentException: Unrecognized type name: null (state=,code=0)
      Closing: 0: jdbc:hive2://localhost:10000
      
      $ bin/beeline -u jdbc:hive2://localhost:10000 -e "select if(true,null,null)"
      Connecting to jdbc:hive2://localhost:10000
      Connected to: Spark SQL (version 2.1.0-SNAPSHOT)
      Driver: Hive JDBC (version 1.2.1.spark2)
      Transaction isolation: TRANSACTION_REPEATABLE_READ
      Error: java.lang.IllegalArgumentException: Unrecognized type name: null (state=,code=0)
      Closing: 0: jdbc:hive2://localhost:10000
      ```
      
      **After**
      ```sql
      $ bin/beeline -u jdbc:hive2://localhost:10000 -e "select null"
      Connecting to jdbc:hive2://localhost:10000
      Connected to: Spark SQL (version 2.1.0-SNAPSHOT)
      Driver: Hive JDBC (version 1.2.1.spark2)
      Transaction isolation: TRANSACTION_REPEATABLE_READ
      +-------+--+
      | NULL  |
      +-------+--+
      | NULL  |
      +-------+--+
      1 row selected (3.242 seconds)
      Beeline version 1.2.1.spark2 by Apache Hive
      Closing: 0: jdbc:hive2://localhost:10000
      
      $ bin/beeline -u jdbc:hive2://localhost:10000 -e "select if(true,null,null)"
      Connecting to jdbc:hive2://localhost:10000
      Connected to: Spark SQL (version 2.1.0-SNAPSHOT)
      Driver: Hive JDBC (version 1.2.1.spark2)
      Transaction isolation: TRANSACTION_REPEATABLE_READ
      +-------------------------+--+
      | (IF(true, NULL, NULL))  |
      +-------------------------+--+
      | NULL                    |
      +-------------------------+--+
      1 row selected (0.201 seconds)
      Beeline version 1.2.1.spark2 by Apache Hive
      Closing: 0: jdbc:hive2://localhost:10000
      ```
      
      ## How was this patch tested?
      
      * Pass the Jenkins test with a new testsuite.
      * Also, Manually, after starting Spark Thrift Server, run the following command.
      ```sql
      $ bin/beeline -u jdbc:hive2://localhost:10000 -e "select null"
      $ bin/beeline -u jdbc:hive2://localhost:10000 -e "select if(true,null,null)"
      ```
      
      **Hive 1.2**
      ```sql
      hive> create table null_table as select null;
      hive> desc null_table;
      OK
      _c0                     void
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15325 from dongjoon-hyun/SPARK-17112.
      c571cfb2
    • Herman van Hovell's avatar
      [SPARK-17753][SQL] Allow a complex expression as the input a value based case statement · 2bbecdec
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      We currently only allow relatively simple expressions as the input for a value based case statement. Expressions like `case (a > 1) or (b = 2) when true then 1 when false then 0 end` currently fail. This PR adds support for such expressions.
      
      ## How was this patch tested?
      Added a test to the ExpressionParserSuite.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #15322 from hvanhovell/SPARK-17753.
      2bbecdec
    • zero323's avatar
      [SPARK-17587][PYTHON][MLLIB] SparseVector __getitem__ should follow __getitem__ contract · d8399b60
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Replaces` ValueError` with `IndexError` when index passed to `ml` / `mllib` `SparseVector.__getitem__` is out of range. This ensures correct iteration behavior.
      
      Replaces `ValueError` with `IndexError` for `DenseMatrix` and `SparkMatrix` in `ml` / `mllib`.
      
      ## How was this patch tested?
      
      PySpark `ml` / `mllib` unit tests. Additional unit tests to prove that the problem has been resolved.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #15144 from zero323/SPARK-17587.
      d8399b60
    • Jason White's avatar
      [SPARK-17679] [PYSPARK] remove unnecessary Py4J ListConverter patch · 1f31bdae
      Jason White authored
      ## What changes were proposed in this pull request?
      
      This PR removes a patch on ListConverter from https://github.com/apache/spark/pull/5570, as it is no longer necessary. The underlying issue in Py4J https://github.com/bartdag/py4j/issues/160 was patched in https://github.com/bartdag/py4j/commit/224b94b6665e56a93a064073886e1d803a4969d2 and is present in 0.10.3, the version currently in use in Spark.
      
      ## How was this patch tested?
      
      The original test added in https://github.com/apache/spark/pull/5570 remains.
      
      Author: Jason White <jason.white@shopify.com>
      
      Closes #15254 from JasonMWhite/remove_listconverter_patch.
      1f31bdae
    • Sean Owen's avatar
      [SPARK-17718][DOCS][MLLIB] Make loss function formulation label note clearer in MLlib docs · 1dd68d38
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Move note about labels being +1/-1 in formulation only to be just under the table of formulations.
      
      ## How was this patch tested?
      
      Doc build
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15330 from srowen/SPARK-17718.
      Unverified
      1dd68d38
    • Zhenhua Wang's avatar
      [SPARK-17073][SQL] generate column-level statistics · 7bf92127
      Zhenhua Wang authored
      ## What changes were proposed in this pull request?
      
      Generate basic column statistics for all the atomic types:
      - numeric types: max, min, num of nulls, ndv (number of distinct values)
      - date/timestamp types: they are also represented as numbers internally, so they have the same stats as above.
      - string: avg length, max length, num of nulls, ndv
      - binary: avg length, max length, num of nulls
      - boolean: num of nulls, num of trues, num of falsies
      
      Also support storing and loading these statistics.
      
      One thing to notice:
      We support analyzing columns independently, e.g.:
      sql1: `ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS key;`
      sql2: `ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS value;`
      when running sql2 to collect column stats for `value`, we don’t remove stats of columns `key` which are analyzed in sql1 and not in sql2. As a result, **users need to guarantee consistency** between sql1 and sql2. If the table has been changed before sql2, users should re-analyze column `key` when they want to analyze column `value`:
      `ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS key, value;`
      
      ## How was this patch tested?
      
      add unit tests
      
      Author: Zhenhua Wang <wzh_zju@163.com>
      
      Closes #15090 from wzhfy/colStats.
      7bf92127
    • Jagadeesan's avatar
      [SPARK-17736][DOCUMENTATION][SPARKR] Update R README for rmarkdown,… · a27033c0
      Jagadeesan authored
      ## What changes were proposed in this pull request?
      
      To build R docs (which are built when R tests are run), users need to install pandoc and rmarkdown. This was done for Jenkins in ~~[SPARK-17420](https://issues.apache.org/jira/browse/SPARK-17420)~~
      
      … pandoc]
      
      Author: Jagadeesan <as2@us.ibm.com>
      
      Closes #15309 from jagadeesanas2/SPARK-17736.
      Unverified
      a27033c0
    • Alex Bozarth's avatar
      [SPARK-17598][SQL][WEB UI] User-friendly name for Spark Thrift Server in web UI · de3f71ed
      Alex Bozarth authored
      ## What changes were proposed in this pull request?
      
      The name of Spark Thrift JDBC/ODBC Server in web UI reflects the name of the class, i.e. org.apache.spark.sql.hive.thrift.HiveThriftServer2. I changed it to Thrift JDBC/ODBC Server (like Spark shell for spark-shell) as recommended by jaceklaskowski. Note the user can still change the name adding `--name "App Name"` parameter to the start script as before
      
      ## How was this patch tested?
      
      By running the script with various parameters and checking the web ui
      
      ![screen shot 2016-09-27 at 12 19 12 pm](https://cloud.githubusercontent.com/assets/13952758/18888329/aebca47c-84ac-11e6-93d0-6e98684977c5.png)
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      
      Closes #15268 from ajbozarth/spark17598.
      Unverified
      de3f71ed
  3. Oct 02, 2016
    • Tao LI's avatar
      [SPARK-14914][CORE][SQL] Skip/fix some test cases on Windows due to limitation of Windows · 76dc2d90
      Tao LI authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to fix/skip some tests failed on Windows. This PR takes over https://github.com/apache/spark/pull/12696.
      
      **Before**
      
      - **SparkSubmitSuite**
      
        ```
      [info] - launch simple application with spark-submit *** FAILED *** (202 milliseconds)
      [info]   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specifie
      
      [info] - includes jars passed in through --jars *** FAILED *** (1 second, 625 milliseconds)
      [info]   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      ```
      
      - **DiskStoreSuite**
      
        ```
      [info] - reads of memory-mapped and non memory-mapped files are equivalent *** FAILED *** (1 second, 78 milliseconds)
      [info]   diskStoreMapped.remove(blockId) was false (DiskStoreSuite.scala:41)
      ```
      
      **After**
      
      - **SparkSubmitSuite**
      
        ```
      [info] - launch simple application with spark-submit (578 milliseconds)
      [info] - includes jars passed in through --jars (1 second, 875 milliseconds)
      ```
      
      - **DiskStoreSuite**
      
        ```
      [info] DiskStoreSuite:
      [info] - reads of memory-mapped and non memory-mapped files are equivalent !!! CANCELED !!! (766 milliseconds
      ```
      
      For `CreateTableAsSelectSuite` and `FsHistoryProviderSuite`, I could not reproduce as the Java version seems higher than the one that has the bugs about `setReadable(..)` and `setWritable(...)` but as they are bugs reported clearly, it'd be sensible to skip those. We should revert the changes for both back as soon as we drop the support of Java 7.
      
      ## How was this patch tested?
      
      Manually tested via AppVeyor.
      
      Closes #12696
      
      Author: Tao LI <tl@microsoft.com>
      Author: U-FAREAST\tl <tl@microsoft.com>
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15320 from HyukjinKwon/SPARK-14914.
      76dc2d90
    • Sital Kedia's avatar
      [SPARK-17509][SQL] When wrapping catalyst datatype to Hive data type avoid… · f8d7fade
      Sital Kedia authored
      ## What changes were proposed in this pull request?
      
      When wrapping catalyst datatypes to Hive data type, wrap function was doing an expensive pattern matching which was consuming around 11% of cpu time. Avoid the pattern matching by returning the wrapper only once and reuse it.
      
      ## How was this patch tested?
      
      Tested by running the job on cluster and saw around 8% cpu improvements.
      
      Author: Sital Kedia <skedia@fb.com>
      
      Closes #15064 from sitalkedia/skedia/hive_wrapper.
      f8d7fade
  4. Oct 01, 2016
    • Sean Owen's avatar
      [SPARK-17704][ML][MLLIB] ChiSqSelector performance improvement. · b88cb63d
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Partial revert of #15277 to instead sort and store input to model rather than require sorted input
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15299 from srowen/SPARK-17704.2.
      Unverified
      b88cb63d
    • Herman van Hovell's avatar
      [SPARK-17717][SQL] Add Exist/find methods to Catalog [FOLLOW-UP] · af6ece33
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      We added find and exists methods for Databases, Tables and Functions to the user facing Catalog in PR https://github.com/apache/spark/pull/15301. However, it was brought up that the semantics of the  `find` methods are more in line a `get` method (get an object or else fail). So we rename these in this PR.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #15308 from hvanhovell/SPARK-17717-2.
      af6ece33
    • Eric Liang's avatar
      [SPARK-17740] Spark tests should mock / interpose HDFS to ensure that streams are closed · 4bcd9b72
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      As a followup to SPARK-17666, ensure filesystem connections are not leaked at least in unit tests. This is done here by intercepting filesystem calls as suggested by JoshRosen . At the end of each test, we assert no filesystem streams are left open.
      
      This applies to all tests using SharedSQLContext or SharedSparkContext.
      
      ## How was this patch tested?
      
      I verified that tests in sql and core are indeed using the filesystem backend, and fixed the detected leaks. I also checked that reverting https://github.com/apache/spark/pull/15245 causes many actual test failures due to connection leaks.
      
      Author: Eric Liang <ekl@databricks.com>
      Author: Eric Liang <ekhliang@gmail.com>
      
      Closes #15306 from ericl/sc-4672.
      4bcd9b72
    • Dongjoon Hyun's avatar
      [MINOR][DOC] Add an up-to-date description for default serialization during shuffling · 15e9bbb4
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR aims to make the doc up-to-date. The documentation is generally correct, but after https://issues.apache.org/jira/browse/SPARK-13926, Spark starts to choose Kyro as a default serialization library during shuffling of simple types, arrays of simple types, or string type.
      
      ## How was this patch tested?
      
      This is a documentation update.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15315 from dongjoon-hyun/SPARK-DOC-SERIALIZER.
      15e9bbb4
  5. Sep 30, 2016
    • Dongjoon Hyun's avatar
      [SPARK-17739][SQL] Collapse adjacent similar Window operators · aef506e3
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, Spark does not collapse adjacent windows with the same partitioning and sorting. This PR implements `CollapseWindow` optimizer to do the followings.
      
      1. If the partition specs and order specs are the same, collapse into the parent.
      2. If the partition specs are the same and one order spec is a prefix of the other, collapse to the more specific one.
      
      For example:
      ```scala
      val df = spark.range(1000).select($"id" % 100 as "grp", $"id", rand() as "col1", rand() as "col2")
      
      // Add summary statistics for all columns
      import org.apache.spark.sql.expressions.Window
      val cols = Seq("id", "col1", "col2")
      val window = Window.partitionBy($"grp").orderBy($"id")
      val result = cols.foldLeft(df) { (base, name) =>
        base.withColumn(s"${name}_avg", avg(col(name)).over(window))
            .withColumn(s"${name}_stddev", stddev(col(name)).over(window))
            .withColumn(s"${name}_min", min(col(name)).over(window))
            .withColumn(s"${name}_max", max(col(name)).over(window))
      }
      ```
      
      **Before**
      ```scala
      scala> result.explain
      == Physical Plan ==
      Window [max(col2#19) windowspecdefinition(grp#17L, id#14L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_max#234], [grp#17L], [id#14L ASC NULLS FIRST]
      +- Window [min(col2#19) windowspecdefinition(grp#17L, id#14L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_min#216], [grp#17L], [id#14L ASC NULLS FIRST]
         +- Window [stddev_samp(col2#19) windowspecdefinition(grp#17L, id#14L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_stddev#191], [grp#17L], [id#14L ASC NULLS FIRST]
            +- Window [avg(col2#19) windowspecdefinition(grp#17L, id#14L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_avg#167], [grp#17L], [id#14L ASC NULLS FIRST]
               +- Window [max(col1#18) windowspecdefinition(grp#17L, id#14L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col1_max#152], [grp#17L], [id#14L ASC NULLS FIRST]
                  +- Window [min(col1#18) windowspecdefinition(grp#17L, id#14L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col1_min#138], [grp#17L], [id#14L ASC NULLS FIRST]
                     +- Window [stddev_samp(col1#18) windowspecdefinition(grp#17L, id#14L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col1_stddev#117], [grp#17L], [id#14L ASC NULLS FIRST]
                        +- Window [avg(col1#18) windowspecdefinition(grp#17L, id#14L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col1_avg#97], [grp#17L], [id#14L ASC NULLS FIRST]
                           +- Window [max(id#14L) windowspecdefinition(grp#17L, id#14L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS id_max#86L], [grp#17L], [id#14L ASC NULLS FIRST]
                              +- Window [min(id#14L) windowspecdefinition(grp#17L, id#14L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS id_min#76L], [grp#17L], [id#14L ASC NULLS FIRST]
                                 +- *Project [grp#17L, id#14L, col1#18, col2#19, id_avg#26, id_stddev#42]
                                    +- Window [stddev_samp(_w0#59) windowspecdefinition(grp#17L, id#14L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS id_stddev#42], [grp#17L], [id#14L ASC NULLS FIRST]
                                       +- *Project [grp#17L, id#14L, col1#18, col2#19, id_avg#26, cast(id#14L as double) AS _w0#59]
                                          +- Window [avg(id#14L) windowspecdefinition(grp#17L, id#14L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS id_avg#26], [grp#17L], [id#14L ASC NULLS FIRST]
                                             +- *Sort [grp#17L ASC NULLS FIRST, id#14L ASC NULLS FIRST], false, 0
                                                +- Exchange hashpartitioning(grp#17L, 200)
                                                   +- *Project [(id#14L % 100) AS grp#17L, id#14L, rand(-6329949029880411066) AS col1#18, rand(-7251358484380073081) AS col2#19]
                                                      +- *Range (0, 1000, step=1, splits=Some(8))
      ```
      
      **After**
      ```scala
      scala> result.explain
      == Physical Plan ==
      Window [max(col2#5) windowspecdefinition(grp#3L, id#0L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_max#220, min(col2#5) windowspecdefinition(grp#3L, id#0L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_min#202, stddev_samp(col2#5) windowspecdefinition(grp#3L, id#0L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_stddev#177, avg(col2#5) windowspecdefinition(grp#3L, id#0L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_avg#153, max(col1#4) windowspecdefinition(grp#3L, id#0L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col1_max#138, min(col1#4) windowspecdefinition(grp#3L, id#0L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col1_min#124, stddev_samp(col1#4) windowspecdefinition(grp#3L, id#0L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col1_stddev#103, avg(col1#4) windowspecdefinition(grp#3L, id#0L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col1_avg#83, max(id#0L) windowspecdefinition(grp#3L, id#0L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS id_max#72L, min(id#0L) windowspecdefinition(grp#3L, id#0L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS id_min#62L], [grp#3L], [id#0L ASC NULLS FIRST]
      +- *Project [grp#3L, id#0L, col1#4, col2#5, id_avg#12, id_stddev#28]
         +- Window [stddev_samp(_w0#45) windowspecdefinition(grp#3L, id#0L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS id_stddev#28], [grp#3L], [id#0L ASC NULLS FIRST]
            +- *Project [grp#3L, id#0L, col1#4, col2#5, id_avg#12, cast(id#0L as double) AS _w0#45]
               +- Window [avg(id#0L) windowspecdefinition(grp#3L, id#0L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS id_avg#12], [grp#3L], [id#0L ASC NULLS FIRST]
                  +- *Sort [grp#3L ASC NULLS FIRST, id#0L ASC NULLS FIRST], false, 0
                     +- Exchange hashpartitioning(grp#3L, 200)
                        +- *Project [(id#0L % 100) AS grp#3L, id#0L, rand(6537478539664068821) AS col1#4, rand(-8961093871295252795) AS col2#5]
                           +- *Range (0, 1000, step=1, splits=Some(8))
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests with a newly added testsuite.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15317 from dongjoon-hyun/SPARK-17739.
      aef506e3
    • Shubham Chopra's avatar
      [SPARK-15353][CORE] Making peer selection for block replication pluggable · a26afd52
      Shubham Chopra authored
      ## What changes were proposed in this pull request?
      
      This PR makes block replication strategies pluggable. It provides two trait that can be implemented, one that maps a host to its topology and is used in the master, and the second that helps prioritize a list of peers for block replication and would run in the executors.
      
      This patch contains default implementations of these traits that make sure current Spark behavior is unchanged.
      
      ## How was this patch tested?
      
      This patch should not change Spark behavior in any way, and was tested with unit tests for storage.
      
      Author: Shubham Chopra <schopra31@bloomberg.net>
      
      Closes #13152 from shubhamchopra/RackAwareBlockReplication.
      a26afd52
    • Takuya UESHIN's avatar
      [SPARK-17703][SQL] Add unnamed version of addReferenceObj for minor objects. · 81455a9c
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      There are many minor objects in references, which are extracted to the generated class field, e.g. `errMsg` in `GetExternalRowField` or `ValidateExternalType`, but number of fields in class is limited so we should reduce the number.
      This pr adds unnamed version of `addReferenceObj` for these minor objects not to store the object into field but refer it from the `references` field at the time of use.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #15276 from ueshin/issues/SPARK-17703.
      81455a9c
    • Davies Liu's avatar
      [SPARK-17738] [SQL] fix ARRAY/MAP in columnar cache · f327e168
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      The actualSize() of array and map is different from the actual size, the header is Int, rather than Long.
      
      ## How was this patch tested?
      
      The flaky test should be fixed.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #15305 from davies/fix_MAP.
      f327e168
    • Zheng RuiFeng's avatar
      [SPARK-14077][ML][FOLLOW-UP] Revert change for NB Model's Load to maintain... · 8e491af5
      Zheng RuiFeng authored
      [SPARK-14077][ML][FOLLOW-UP] Revert change for NB Model's Load to maintain compatibility with the model stored before 2.0
      
      ## What changes were proposed in this pull request?
      Revert change for NB Model's Load to maintain compatibility with the model stored before 2.0
      
      ## How was this patch tested?
      local build
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #15313 from zhengruifeng/revert_save_load.
      8e491af5
    • Zheng RuiFeng's avatar
      [SPARK-14077][ML] Refactor NaiveBayes to support weighted instances · 1fad5596
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      1,support weighted data
      2,use dataset/dataframe instead of rdd
      3,make mllib as a wrapper to call ml
      
      ## How was this patch tested?
      local manual tests in spark-shell
      unit tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #12819 from zhengruifeng/weighted_nb.
      1fad5596
  6. Sep 29, 2016
    • Herman van Hovell's avatar
      [SPARK-17717][SQL] Add exist/find methods to Catalog. · 74ac1c43
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      The current user facing catalog does not implement methods for checking object existence or finding objects. You could theoretically do this using the `list*` commands, but this is rather cumbersome and can actually be costly when there are many objects. This PR adds `exists*` and `find*` methods for Databases, Table and Functions.
      
      ## How was this patch tested?
      Added tests to `org.apache.spark.sql.internal.CatalogSuite`
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #15301 from hvanhovell/SPARK-17717.
      74ac1c43
    • Bryan Cutler's avatar
      [SPARK-17697][ML] Fixed bug in summary calculations that pattern match against... · 2f739567
      Bryan Cutler authored
      [SPARK-17697][ML] Fixed bug in summary calculations that pattern match against label without casting
      
      ## What changes were proposed in this pull request?
      In calling LogisticRegression.evaluate and GeneralizedLinearRegression.evaluate using a Dataset where the Label is not of a double type, calculations pattern match against a double and throw a MatchError.  This fix casts the Label column to a DoubleType to ensure there is no MatchError.
      
      ## How was this patch tested?
      Added unit tests to call evaluate with a dataset that has Label as other numeric types.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #15288 from BryanCutler/binaryLOR-numericCheck-SPARK-17697.
      2f739567
    • Dongjoon Hyun's avatar
      [SPARK-17412][DOC] All test should not be run by `root` or any admin user · 39eb3bb1
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      `FsHistoryProviderSuite` fails if `root` user runs it. The test case **SPARK-3697: ignore directories that cannot be read** depends on `setReadable(false, false)` to make test data files and expects the number of accessible files is 1. But, `root` can access all files, so it returns 2.
      
      This PR adds the assumption explicitly on doc. `building-spark.md`.
      
      ## How was this patch tested?
      
      This is a documentation change.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15291 from dongjoon-hyun/SPARK-17412.
      39eb3bb1
    • Imran Rashid's avatar
      [SPARK-17676][CORE] FsHistoryProvider should ignore hidden files · 3993ebca
      Imran Rashid authored
      ## What changes were proposed in this pull request?
      
      FsHistoryProvider was writing a hidden file (to check the fs's clock).
      Even though it deleted the file immediately, sometimes another thread
      would try to scan the files on the fs in-between, and then there would
      be an error msg logged which was very misleading for the end-user.
      (The logged error was harmless, though.)
      
      ## How was this patch tested?
      
      I added one unit test, but to be clear, that test was passing before.  The actual change in behavior in that test is just logging (after the change, there is no more logged error), which I just manually verified.
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #15250 from squito/SPARK-17676.
      3993ebca
    • Bjarne Fruergaard's avatar
      [SPARK-17721][MLLIB][ML] Fix for multiplying transposed SparseMatrix with SparseVector · 29396e7d
      Bjarne Fruergaard authored
      ## What changes were proposed in this pull request?
      
      * changes the implementation of gemv with transposed SparseMatrix and SparseVector both in mllib-local and mllib (identical)
      * adds a test that was failing before this change, but succeeds with these changes.
      
      The problem in the previous implementation was that it only increments `i`, that is enumerating the columns of a row in the SparseMatrix, when the row-index of the vector matches the column-index of the SparseMatrix. In cases where a particular row of the SparseMatrix has non-zero values at column-indices lower than corresponding non-zero row-indices of the SparseVector, the non-zero values of the SparseVector are enumerated without ever matching the column-index at index `i` and the remaining column-indices i+1,...,indEnd-1 are never attempted. The test cases in this PR illustrate this issue.
      
      ## How was this patch tested?
      
      I have run the specific `gemv` tests in both mllib-local and mllib. I am currently still running `./dev/run-tests`.
      
      ## ___
      As per instructions, I hereby state that this is my original work and that I license the work to the project (Apache Spark) under the project's open source license.
      
      Mentioning dbtsai, viirya and brkyvz whom I can see have worked/authored on these parts before.
      
      Author: Bjarne Fruergaard <bwahlgreen@gmail.com>
      
      Closes #15296 from bwahlgreen/bugfix-spark-17721.
      29396e7d
    • Dongjoon Hyun's avatar
      [SPARK-17612][SQL] Support `DESCRIBE table PARTITION` SQL syntax · 4ecc648a
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR implements `DESCRIBE table PARTITION` SQL Syntax again. It was supported until Spark 1.6.2, but was dropped since 2.0.0.
      
      **Spark 1.6.2**
      ```scala
      scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY (c STRING, d STRING)")
      res1: org.apache.spark.sql.DataFrame = [result: string]
      
      scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
      res2: org.apache.spark.sql.DataFrame = [result: string]
      
      scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
      +----------------------------------------------------------------+
      |result                                                          |
      +----------------------------------------------------------------+
      |a                      string                                   |
      |b                      int                                      |
      |c                      string                                   |
      |d                      string                                   |
      |                                                                |
      |# Partition Information                                         |
      |# col_name             data_type               comment          |
      |                                                                |
      |c                      string                                   |
      |d                      string                                   |
      +----------------------------------------------------------------+
      ```
      
      **Spark 2.0**
      - **Before**
      ```scala
      scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY (c STRING, d STRING)")
      res0: org.apache.spark.sql.DataFrame = []
      
      scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
      res1: org.apache.spark.sql.DataFrame = []
      
      scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
      org.apache.spark.sql.catalyst.parser.ParseException:
      Unsupported SQL statement
      ```
      
      - **After**
      ```scala
      scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY (c STRING, d STRING)")
      res0: org.apache.spark.sql.DataFrame = []
      
      scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
      res1: org.apache.spark.sql.DataFrame = []
      
      scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
      +-----------------------+---------+-------+
      |col_name               |data_type|comment|
      +-----------------------+---------+-------+
      |a                      |string   |null   |
      |b                      |int      |null   |
      |c                      |string   |null   |
      |d                      |string   |null   |
      |# Partition Information|         |       |
      |# col_name             |data_type|comment|
      |c                      |string   |null   |
      |d                      |string   |null   |
      +-----------------------+---------+-------+
      
      scala> sql("DESC EXTENDED partitioned_table PARTITION (c='Us', d=1)").show(100,false)
      +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+-------+
      |col_name                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |data_type|comment|
      +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+-------+
      |a                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |string   |null   |
      |b                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |int      |null   |
      |c                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |string   |null   |
      |d                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |string   |null   |
      |# Partition Information                                                                                                                                                                                                                                                                                                                                                                                                                                                            |         |       |
      |# col_name                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |data_type|comment|
      |c                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |string   |null   |
      |d                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |string   |null   |
      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |         |       |
      |Detailed Partition Information CatalogPartition(
              Partition Values: [Us, 1]
              Storage(Location: file:/Users/dhyun/SPARK-17612-DESC-PARTITION/spark-warehouse/partitioned_table/c=Us/d=1, InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, Serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Properties: [serialization.format=1])
              Partition Parameters:{transient_lastDdlTime=1475001066})|         |       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+-------+
      
      scala> sql("DESC FORMATTED partitioned_table PARTITION (c='Us', d=1)").show(100,false)
      +--------------------------------+---------------------------------------------------------------------------------------+-------+
      |col_name                        |data_type                                                                              |comment|
      +--------------------------------+---------------------------------------------------------------------------------------+-------+
      |a                               |string                                                                                 |null   |
      |b                               |int                                                                                    |null   |
      |c                               |string                                                                                 |null   |
      |d                               |string                                                                                 |null   |
      |# Partition Information         |                                                                                       |       |
      |# col_name                      |data_type                                                                              |comment|
      |c                               |string                                                                                 |null   |
      |d                               |string                                                                                 |null   |
      |                                |                                                                                       |       |
      |# Detailed Partition Information|                                                                                       |       |
      |Partition Value:                |[Us, 1]                                                                                |       |
      |Database:                       |default                                                                                |       |
      |Table:                          |partitioned_table                                                                      |       |
      |Location:                       |file:/Users/dhyun/SPARK-17612-DESC-PARTITION/spark-warehouse/partitioned_table/c=Us/d=1|       |
      |Partition Parameters:           |                                                                                       |       |
      |  transient_lastDdlTime         |1475001066                                                                             |       |
      |                                |                                                                                       |       |
      |# Storage Information           |                                                                                       |       |
      |SerDe Library:                  |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                                     |       |
      |InputFormat:                    |org.apache.hadoop.mapred.TextInputFormat                                               |       |
      |OutputFormat:                   |org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                             |       |
      |Compressed:                     |No                                                                                     |       |
      |Storage Desc Parameters:        |                                                                                       |       |
      |  serialization.format          |1                                                                                      |       |
      +--------------------------------+---------------------------------------------------------------------------------------+-------+
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests with a new testcase.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15168 from dongjoon-hyun/SPARK-17612.
      4ecc648a
    • Liang-Chi Hsieh's avatar
      [SPARK-17653][SQL] Remove unnecessary distincts in multiple unions · 566d7f28
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      Currently for `Union [Distinct]`, a `Distinct` operator is necessary to be on the top of `Union`. Once there are adjacent `Union [Distinct]`,  there will be multiple `Distinct` in the query plan.
      
      E.g.,
      
      For a query like: select 1 a union select 2 b union select 3 c
      
      Before this patch, its physical plan looks like:
      
          *HashAggregate(keys=[a#13], functions=[])
          +- Exchange hashpartitioning(a#13, 200)
             +- *HashAggregate(keys=[a#13], functions=[])
                +- Union
                   :- *HashAggregate(keys=[a#13], functions=[])
                   :  +- Exchange hashpartitioning(a#13, 200)
                   :     +- *HashAggregate(keys=[a#13], functions=[])
                   :        +- Union
                   :           :- *Project [1 AS a#13]
                   :           :  +- Scan OneRowRelation[]
                   :           +- *Project [2 AS b#14]
                   :              +- Scan OneRowRelation[]
                   +- *Project [3 AS c#15]
                      +- Scan OneRowRelation[]
      
      Only the top distinct should be necessary.
      
      After this patch, the physical plan looks like:
      
          *HashAggregate(keys=[a#221], functions=[], output=[a#221])
          +- Exchange hashpartitioning(a#221, 5)
             +- *HashAggregate(keys=[a#221], functions=[], output=[a#221])
                +- Union
                   :- *Project [1 AS a#221]
                   :  +- Scan OneRowRelation[]
                   :- *Project [2 AS b#222]
                   :  +- Scan OneRowRelation[]
                   +- *Project [3 AS c#223]
                      +- Scan OneRowRelation[]
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #15238 from viirya/remove-extra-distinct-union.
      566d7f28
    • Michael Armbrust's avatar
      [SPARK-17699] Support for parsing JSON string columns · fe33121a
      Michael Armbrust authored
      Spark SQL has great support for reading text files that contain JSON data.  However, in many cases the JSON data is just one column amongst others.  This is particularly true when reading from sources such as Kafka.  This PR adds a new functions `from_json` that converts a string column into a nested `StructType` with a user specified schema.
      
      Example usage:
      ```scala
      val df = Seq("""{"a": 1}""").toDS()
      val schema = new StructType().add("a", IntegerType)
      
      df.select(from_json($"value", schema) as 'json) // => [json: <a: int>]
      ```
      
      This PR adds support for java, scala and python.  I leveraged our existing JSON parsing support by moving it into catalyst (so that we could define expressions using it).  I left SQL out for now, because I'm not sure how users would specify a schema.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #15274 from marmbrus/jsonParser.
      fe33121a
    • Brian Cho's avatar
      [SPARK-17715][SCHEDULER] Make task launch logs DEBUG · 027dea8f
      Brian Cho authored
      ## What changes were proposed in this pull request?
      
      Ramp down the task launch logs from INFO to DEBUG. Task launches can happen orders of magnitude more than executor registration so it makes the logs easier to handle if they are different log levels. For larger jobs, there can be 100,000s of task launches which makes the driver log huge.
      
      ## How was this patch tested?
      
      No tests, as this is a trivial change.
      
      Author: Brian Cho <bcho@fb.com>
      
      Closes #15290 from dafrista/ramp-down-task-logging.
      027dea8f
    • Gang Wu's avatar
      [SPARK-17672] Spark 2.0 history server web Ui takes too long for a single application · cb87b3ce
      Gang Wu authored
      Added a new API getApplicationInfo(appId: String) in class ApplicationHistoryProvider and class SparkUI to get app info. In this change, FsHistoryProvider can directly fetch one app info in O(1) time complexity compared to O(n) before the change which used an Iterator.find() interface.
      
      Both ApplicationCache and OneApplicationResource classes adopt this new api.
      
       manual tests
      
      Author: Gang Wu <wgtmac@uber.com>
      
      Closes #15247 from wgtmac/SPARK-17671.
      cb87b3ce
    • Imran Rashid's avatar
      [SPARK-17648][CORE] TaskScheduler really needs offers to be an IndexedSeq · 7f779e74
      Imran Rashid authored
      ## What changes were proposed in this pull request?
      
      The Seq[WorkerOffer] is accessed by index, so it really should be an
      IndexedSeq, otherwise an O(n) operation becomes O(n^2).  In practice
      this hasn't been an issue b/c where these offers are generated, the call
      to `.toSeq` just happens to create an IndexedSeq anyway.I got bitten by
      this in performance tests I was doing, and its better for the types to be
      more precise so eg. a change in Scala doesn't destroy performance.
      
      ## How was this patch tested?
      
      Unit tests via jenkins.
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #15221 from squito/SPARK-17648.
      7f779e74
    • José Hiram Soltren's avatar
      [DOCS] Reorganize explanation of Accumulators and Broadcast Variables · 95820049
      José Hiram Soltren authored
      ## What changes were proposed in this pull request?
      
      The discussion of the interaction of Accumulators and Broadcast Variables should logically follow the discussion on Checkpointing. As currently written, this section discusses Checkpointing before it is formally introduced. To remedy this:
      
       - Rename this section to "Accumulators, Broadcast Variables, and Checkpoints", and
       - Move this section after "Checkpointing".
      
      ## How was this patch tested?
      
      Testing: ran
      
      $ SKIP_API=1 jekyll build
      
      , and verified changes in a Web browser pointed at docs/_site/index.html.
      
      Author: José Hiram Soltren <jose@cloudera.com>
      
      Closes #15281 from jsoltren/doc-changes.
      95820049
    • Takeshi YAMAMURO's avatar
      [MINOR][DOCS] Fix th doc. of spark-streaming with kinesis · b2e9731c
      Takeshi YAMAMURO authored
      ## What changes were proposed in this pull request?
      This pr is just to fix the document of `spark-kinesis-integration`.
      Since `SPARK-17418` prevented all the kinesis stuffs (including kinesis example code)
      from publishing,  `bin/run-example streaming.KinesisWordCountASL` and `bin/run-example streaming.JavaKinesisWordCountASL` does not work.
      Instead, it fetches the kinesis jar from the Spark Package.
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #15260 from maropu/DocFixKinesis.
      Unverified
      b2e9731c
    • Sean Owen's avatar
      [SPARK-17614][SQL] sparkSession.read() .jdbc(***) use the sql syntax "where... · b35b0dbb
      Sean Owen authored
      [SPARK-17614][SQL] sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support
      
      ## What changes were proposed in this pull request?
      
      Use dialect's table-exists query rather than hard-coded WHERE 1=0 query
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15196 from srowen/SPARK-17614.
      Unverified
      b35b0dbb
    • Yanbo Liang's avatar
      [SPARK-17704][ML][MLLIB] ChiSqSelector performance improvement. · f7082ac1
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Several performance improvement for ```ChiSqSelector```:
      1, Keep ```selectedFeatures``` ordered ascendent.
      ```ChiSqSelectorModel.transform``` need ```selectedFeatures``` ordered to make prediction. We should sort it when training model rather than making prediction, since users usually train model once and use the model to do prediction multiple times.
      2, When training ```fpr``` type ```ChiSqSelectorModel```, it's not necessary to sort the ChiSq test result by statistic.
      
      ## How was this patch tested?
      Existing unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15277 from yanboliang/spark-17704.
      f7082ac1
    • Yanbo Liang's avatar
      [SPARK-16356][FOLLOW-UP][ML] Enforce ML test of exception for local/distributed Dataset. · a19a1bb5
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      #14035 added ```testImplicits``` to ML unit tests and promoted ```toDF()```, but left one minor issue at ```VectorIndexerSuite```. If we create the DataFrame by ```Seq(...).toDF()```, it will throw different error/exception compared with ```sc.parallelize(Seq(...)).toDF()``` for one of the test cases.
      After in-depth study, I found it was caused by different behavior of local and distributed Dataset if the UDF failed at ```assert```. If the data is local Dataset, it throws ```AssertionError``` directly; If the data is distributed Dataset, it throws ```SparkException``` which is the wrapper of ```AssertionError```. I think we should enforce this test to cover both case.
      
      ## How was this patch tested?
      Unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15261 from yanboliang/spark-16356.
      a19a1bb5
  7. Sep 28, 2016
    • Josh Rosen's avatar
      [SPARK-17712][SQL] Fix invalid pushdown of data-independent filters beneath aggregates · 37eb9184
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      This patch fixes a minor correctness issue impacting the pushdown of filters beneath aggregates. Specifically, if a filter condition references no grouping or aggregate columns (e.g. `WHERE false`) then it would be incorrectly pushed beneath an aggregate.
      
      Intuitively, the only case where you can push a filter beneath an aggregate is when that filter is deterministic and is defined over the grouping columns / expressions, since in that case the filter is acting to exclude entire groups from the query (like a `HAVING` clause). The existing code would only push deterministic filters beneath aggregates when all of the filter's references were grouping columns, but this logic missed the case where a filter has no references. For example, `WHERE false` is deterministic but is independent of the actual data.
      
      This patch fixes this minor bug by adding a new check to ensure that we don't push filters beneath aggregates when those filters don't reference any columns.
      
      ## How was this patch tested?
      
      New regression test in FilterPushdownSuite.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #15289 from JoshRosen/SPARK-17712.
      37eb9184
    • Weiqing Yang's avatar
      [SPARK-17710][HOTFIX] Fix ClassCircularityError in ReplSuite tests in Maven... · 7dfad4b1
      Weiqing Yang authored
      [SPARK-17710][HOTFIX] Fix ClassCircularityError in ReplSuite tests in Maven build: use 'Class.forName' instead of 'Utils.classForName'
      
      ## What changes were proposed in this pull request?
      Fix ClassCircularityError in ReplSuite tests when Spark is built by Maven build.
      
      ## How was this patch tested?
      (1)
      ```
      build/mvn -DskipTests -Phadoop-2.3 -Pyarn -Phive -Phive-thriftserver -Pkinesis-asl -Pmesos clean package
      ```
      Then test:
      ```
      build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.repl.ReplSuite test
      ```
      ReplSuite tests passed
      
      (2)
      Manual Tests against some Spark applications in Yarn client mode and Yarn cluster mode. Need to check if spark caller contexts are written into HDFS hdfs-audit.log and Yarn RM audit log successfully.
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #15286 from Sherry302/SPARK-16757.
      7dfad4b1
Loading