  1. Jun 16, 2017
  2. Jun 15, 2017
    • Xianyang Liu's avatar
      [SPARK-21072][SQL] TreeNode.mapChildren should only apply to the children node. · 87ab0cec
      Xianyang Liu authored
      ## What changes were proposed in this pull request?
      
      As the name and comments of `TreeNode.mapChildren` indicate, the function should apply only to the children of the current node. The following code should therefore check whether each argument is actually a child node.
      
      https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342
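      
      A minimal sketch of the intended behavior, using a hypothetical `Node` class rather than Catalyst's actual `TreeNode`: the function is applied only to arguments that really are children of the current node, and any other node-typed argument passes through unchanged.
      
      ```scala
      object MapChildrenSketch {
        // Hypothetical, simplified node: `children` are its real children, while
        // `extra` is another node-typed argument that is not a child.
        case class Node(name: String, children: Seq[Node] = Nil, extra: Option[Node] = None) {
          def mapChildren(f: Node => Node): Node = {
            val childSet = children.toSet
            // Apply f only when the argument is one of this node's children.
            def transform(arg: Node): Node = if (childSet.contains(arg)) f(arg) else arg
            copy(children = children.map(transform), extra = extra.map(transform))
          }
        }
      }
      ```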
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Xianyang Liu <xianyang.liu@intel.com>
      
      Closes #18284 from ConeyLiu/treenode.
      87ab0cec
    • Xiao Li's avatar
      [SPARK-21112][SQL] ALTER TABLE SET TBLPROPERTIES should not overwrite COMMENT · 5d35d5c1
      Xiao Li authored
      ### What changes were proposed in this pull request?
      `ALTER TABLE SET TBLPROPERTIES` should not overwrite the table `COMMENT` when the input properties do not include `COMMENT`. This PR fixes the issue.
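      
      The expected semantics can be sketched with a hypothetical `Table` case class (an assumption for illustration, not Spark's catalog API): new properties are merged in, and the existing comment survives unless the input supplies a new one.
      
      ```scala
      object TblPropertiesSketch {
        // Hypothetical stand-in for catalog table metadata.
        case class Table(properties: Map[String, String], comment: Option[String])
      
        // Merge the new properties; keep the existing comment unless a new one is given.
        def setProperties(table: Table, newProps: Map[String, String]): Table =
          table.copy(
            properties = table.properties ++ newProps,
            comment = newProps.get("comment").orElse(table.comment))
      }
      ```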
      
      ### How was this patch tested?
      Covered by the existing tests.
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #18318 from gatorsmile/fixTableComment.
      5d35d5c1
    • Michael Gummelt's avatar
      [SPARK-20434][YARN][CORE] Move Hadoop delegation token code from yarn to core · a18d6371
      Michael Gummelt authored
      ## What changes were proposed in this pull request?
      
      Move Hadoop delegation token code from `spark-yarn` to `spark-core`, so that other schedulers (such as Mesos) may use it.  To avoid exposing Hadoop interfaces in spark-core, the new Hadoop delegation token classes are kept private.  To provide backward compatibility, and to allow YARN users to continue to load their own delegation token providers via Java service loading, the old YARN interfaces, as well as the client code that uses them, have been retained.
      
      Summary:
      - Move registered `yarn.security.ServiceCredentialProvider` classes from `spark-yarn` to `spark-core`, into a new, private hierarchy under `HadoopDelegationTokenProvider`.  Client code in `HadoopDelegationTokenManager` now loads credentials from a whitelist of three providers (`HadoopFSDelegationTokenProvider`, `HiveDelegationTokenProvider`, `HBaseDelegationTokenProvider`) instead of service loading, so users cannot plug in their own delegation token providers there, as they can in the `spark-yarn` module.
      
      - The `yarn.security.ServiceCredentialProvider` interface has been kept for backwards compatibility, and to continue to allow YARN users to implement their own delegation token providers.  Client code in YARN now fetches tokens via the new `YARNHadoopDelegationTokenManager` class, which fetches tokens from the core providers through `HadoopDelegationTokenManager` and also service-loads providers from `yarn.security.ServiceCredentialProvider`.
      
      Old Hierarchy:
      
      ```
      yarn.security.ServiceCredentialProvider (service loaded)
        HadoopFSCredentialProvider
        HiveCredentialProvider
        HBaseCredentialProvider
      yarn.security.ConfigurableCredentialManager
      ```
      
      New Hierarchy:
      
      ```
      HadoopDelegationTokenManager
      HadoopDelegationTokenProvider (not service loaded)
        HadoopFSDelegationTokenProvider
        HiveDelegationTokenProvider
        HBaseDelegationTokenProvider
      
      yarn.security.ServiceCredentialProvider (service loaded)
      yarn.security.YARNHadoopDelegationTokenManager
      ```
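      
      The split between the two managers can be sketched roughly as follows. All class names and the string "tokens" here are hypothetical simplifications; the real classes work with Hadoop credentials, and the YARN side discovers user providers through Java service loading rather than taking them as a constructor argument.
      
      ```scala
      object TokenManagerSketch {
        // Hypothetical, simplified provider interface.
        trait TokenProvider {
          def serviceName: String
          def obtainToken(): String = s"token-for-$serviceName"
        }
        case class FixedProvider(serviceName: String) extends TokenProvider
      
        // Core manager: a fixed whitelist of providers, no service loading.
        class CoreTokenManager {
          private val providers: Seq[TokenProvider] =
            Seq(FixedProvider("hadoopfs"), FixedProvider("hive"), FixedProvider("hbase"))
          def obtainTokens(): Map[String, String] =
            providers.map(p => p.serviceName -> p.obtainToken()).toMap
        }
      
        // YARN manager: tokens from the core providers plus user-supplied providers.
        // Passing the user providers in explicitly is an assumption of this sketch.
        class YarnTokenManager(userProviders: Seq[TokenProvider]) {
          private val core = new CoreTokenManager
          def obtainTokens(): Map[String, String] =
            core.obtainTokens() ++ userProviders.map(p => p.serviceName -> p.obtainToken())
        }
      }
      ```
      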
      ## How was this patch tested?
      
      unit tests
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      Author: Dr. Stefan Schimanski <sttts@mesosphere.io>
      
      Closes #17723 from mgummelt/SPARK-20434-refactor-kerberos.
      a18d6371
    • Xingbo Jiang's avatar
      [SPARK-16251][SPARK-20200][CORE][TEST] Flaky test:... · 7dc3e697
      Xingbo Jiang authored
      [SPARK-16251][SPARK-20200][CORE][TEST] Flaky test: org.apache.spark.rdd.LocalCheckpointSuite.missing checkpoint block fails with informative message
      
      ## What changes were proposed in this pull request?
      
      Currently we don't wait to confirm the removal of the block from the slave's BlockManager. If the removal takes too long, the assertion in this test case fails.
      The failure can be easily reproduced by sleeping for a while before removing the block in `BlockManagerSlaveEndpoint.receiveAndReply()`.
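      
      One generic way to make such a test robust is to poll until the asynchronous removal is visible before asserting. The helper below is a hypothetical sketch of that pattern, not the code of this patch.
      
      ```scala
      object WaitSketch {
        // Poll `condition` until it holds or the timeout elapses.
        // Returns whether the condition became true in time.
        def waitUntil(timeoutMs: Long, intervalMs: Long = 10)(condition: => Boolean): Boolean = {
          val deadline = System.nanoTime() + timeoutMs * 1000000L
          var ok = condition
          while (!ok && System.nanoTime() < deadline) {
            Thread.sleep(intervalMs)
            ok = condition
          }
          ok
        }
      }
      ```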
      
      ## How was this patch tested?
      N/A
      
      Author: Xingbo Jiang <xingbo.jiang@databricks.com>
      
      Closes #18314 from jiangxb1987/LocalCheckpointSuite.
      7dc3e697
    • Felix Cheung's avatar
      [SPARK-20980][DOCS] update doc to reflect multiLine change · 1bf55e39
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      doc only change
      
      ## How was this patch tested?
      
      manually
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #18312 from felixcheung/sqljsonwholefiledoc.
      1bf55e39
    • ALeksander Eskilson's avatar
      [SPARK-18016][SQL][CATALYST] Code Generation: Constant Pool Limit - Class Splitting · b32b2123
      ALeksander Eskilson authored
      ## What changes were proposed in this pull request?
      
      This pull-request exclusively includes the class splitting feature described in #16648. When code for a given class would grow beyond 1600k bytes, a private, nested sub-class is generated into which subsequent functions are inlined. Additional sub-classes are generated as the code threshold is met subsequent times. This code includes 3 changes:
      
      1. Includes helper maps, lists, and functions for keeping track of sub-classes during code generation (included in the `CodeGenerator` class). These helper functions allow nested classes and split functions to be initialized/declared/inlined to the appropriate locations in the various projection classes.
      2. Changes `addNewFunction` to return a string to support instances where a split function is inlined to a nested class and not the outer class (and so must be invoked using the class-qualified name). Uses of `addNewFunction` throughout the codebase are modified so that the returned name is properly used.
      3. Removes instances of the `this` keyword when used on data inside generated classes. All state declared in the outer class is by default global and accessible to the nested classes. However, if a reference to global state in a nested class is prepended with the `this` keyword, it would attempt to reference state belonging to the nested class (which would not exist), rather than the correct variable belonging to the outer class.
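      
      The effect of change 2 can be sketched with a hypothetical, much simplified context class. The names, the character-based size estimate, and the plain class-qualified return value are assumptions of this sketch, not the actual `CodeGenerator` implementation.
      
      ```scala
      import scala.collection.mutable
      
      object ClassSplitSketch {
        // Hypothetical stand-in for a code-generation context.
        class Context(thresholdChars: Int) {
          private class ClassInfo(val name: String) {
            var codeSize: Int = 0
            val functions = mutable.Buffer.empty[String]
          }
          private val classes = mutable.Buffer(new ClassInfo("OuterClass"))
      
          // Adds a function and returns the name callers must use to invoke it.
          def addNewFunction(funcName: String, funcCode: String): String = {
            // Once the current class has grown past the threshold, start a nested class.
            if (classes.last.codeSize > thresholdChars) {
              classes += new ClassInfo(s"NestedClass${classes.size - 1}")
            }
            val current = classes.last
            current.codeSize += funcCode.length
            current.functions += funcCode
            // Functions in the outer class keep their plain name; others are qualified.
            if (current eq classes.head) funcName else s"${current.name}.$funcName"
          }
        }
      }
      ```
      
      Because the returned name can differ from the requested one, callers have to use the return value instead of assuming the function lives in the outer class.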
      
      ## How was this patch tested?
      
      Added a test case to the `GeneratedProjectionSuite` that increases the number of columns tested in various projections to a threshold that would previously have triggered a `JaninoRuntimeException` for the Constant Pool.
      
      Note: This PR does not address the second Constant Pool issue with code generation (also mentioned in #16648): excess global mutable state. A second PR may be opened to resolve that issue.
      
      Author: ALeksander Eskilson <alek.eskilson@cerner.com>
      
      Closes #18075 from bdrillard/class_splitting_only.
      b32b2123
    • Xiao Li's avatar
      [SPARK-20980][SQL] Rename `wholeFile` to `multiLine` for both CSV and JSON · 20514281
      Xiao Li authored
      ### What changes were proposed in this pull request?
      The current option name `wholeFile` is misleading for CSV users: it does not mean one record per file, since a single file can contain multiple records. We should therefore rename it. The proposed name is `multiLine`.
      
      ### How was this patch tested?
      N/A
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #18202 from gatorsmile/renameCVSOption.
      20514281
    • Reynold Xin's avatar
      [SPARK-21092][SQL] Wire SQLConf in logical plan and expressions · fffeb6d7
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      It is really painful to not have configs in logical plan and expressions. We had to add all sorts of hacks (e.g. pass SQLConf explicitly in functions). This patch exposes SQLConf in logical plan, using a thread local variable and a getter closure that's set once there is an active SparkSession.
      
      The implementation is a bit of a hack, since we didn't anticipate this need in the beginning (config was only exposed in physical plan). The implementation is described in `SQLConf.get`.
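      
      The mechanism described above (a thread-local fallback plus a getter closure that is replaced once a session is active) can be sketched as follows. `Conf`, `setConfGetter`, and the other names are hypothetical simplifications, not the actual `SQLConf` code.
      
      ```scala
      object ConfSketch {
        // Hypothetical minimal config holder.
        class Conf(val settings: Map[String, String])
      
        // Fallback used while no session has registered a getter.
        private val fallbackConf = new ThreadLocal[Conf] {
          override def initialValue(): Conf = new Conf(Map.empty)
        }
      
        // Getter closure; by default it reads the thread-local fallback.
        @volatile private var confGetter: () => Conf = () => fallbackConf.get()
      
        // Called once an active session exists, so that `get` returns its config.
        def setConfGetter(getter: () => Conf): Unit = confGetter = getter
      
        def get: Conf = confGetter()
      }
      ```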
      
      In terms of future work, we should follow up to clean up CBO (remove the need for passing in config).
      
      ## How was this patch tested?
      Updated relevant tests for constraint propagation.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #18299 from rxin/SPARK-21092.
      fffeb6d7
  3. Jun 14, 2017
  4. Jun 13, 2017
  5. Jun 12, 2017
    • Dongjoon Hyun's avatar
      [SPARK-19910][SQL] `stack` should not reject NULL values due to type mismatch · 2639c3ed
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Since the `stack` function generates a table with nullable columns, it should allow mixed null values.
      
      ```scala
      scala> sql("select stack(3, 1, 2, 3)").printSchema
      root
       |-- col0: integer (nullable = true)
      
      scala> sql("select stack(3, 1, 2, null)").printSchema
      org.apache.spark.sql.AnalysisException: cannot resolve 'stack(3, 1, 2, NULL)' due to data type mismatch: Argument 1 (IntegerType) != Argument 3 (NullType); line 1 pos 7;
      ```
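      
      A sketch of the relaxed check, using hypothetical stand-in types rather than Catalyst's `DataType` hierarchy: within each output column, all non-null argument types must agree, and `NullType` arguments are ignored.
      
      ```scala
      object StackTypeCheckSketch {
        // Hypothetical stand-ins for the relevant data types.
        sealed trait DataType
        case object IntegerType extends DataType
        case object StringType extends DataType
        case object NullType extends DataType
      
        // `argTypes` holds the types of the value arguments, laid out row by row
        // with `numColumns` values per row.
        def columnsAreConsistent(argTypes: Seq[DataType], numColumns: Int): Boolean =
          (0 until numColumns).forall { col =>
            val colTypes = argTypes.indices.filter(_ % numColumns == col).map(argTypes)
            // Nulls are compatible with any column type, so skip them.
            colTypes.filter(_ != NullType).distinct.size <= 1
          }
      }
      ```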
      
      ## How was this patch tested?
      
      Passes Jenkins with a new test case.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #17251 from dongjoon-hyun/SPARK-19910.
      2639c3ed
    • Shixiong Zhu's avatar
      [SPARK-20979][SS] Add RateSource to generate values for tests and benchmark · 74a432d3
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR adds RateSource for Structured Streaming so that users can easily generate data for tests and benchmarks.
      
      This source generates incrementing long values with timestamps. Each generated row has two columns: a timestamp column for the generated time and an auto-incrementing long column starting at 0L.
      
      It supports the following options:
      - `rowsPerSecond` (e.g. 100, default: 1): How many rows should be generated per second.
      - `rampUpTime` (e.g. 5s, default: 0s): How long to ramp up before the generating speed becomes `rowsPerSecond`. Granularities finer than seconds are truncated to integer seconds.
      - `numPartitions` (e.g. 10, default: Spark's default parallelism): The partition number for the generated rows. The source will try its best to reach `rowsPerSecond`, but the query may be resource constrained, and `numPartitions` can be tweaked to help reach the desired speed.
      
      Here is a simple example that prints 10 rows per second:
      ```
          spark.readStream
            .format("rate")
            .option("rowsPerSecond", "10")
            .load()
            .writeStream
            .format("console")
            .start()
      ```
      
      The idea came from marmbrus and he did the initial work.
      
      ## How was this patch tested?
      
      The added tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #18199 from zsxwing/rate.
      74a432d3
    • Joseph K. Bradley's avatar
      [SPARK-21050][ML] Word2vec persistence overflow bug fix · ff318c0d
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      The method `calculateNumberOfPartitions()` uses Int, not Long (unlike the MLlib version), so the calculation of the number of partitions for ML persistence can easily overflow.
      
      This modifies the calculations to use Long.
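      
      The overflow can be illustrated with a small sketch. The formula and the constants below (4 bytes per float, an average word size of 15 bytes) are assumptions for illustration and may not match the actual method.
      
      ```scala
      object PartitionCountSketch {
        // Illustrative constants, assumed for this sketch.
        private val floatSize = 4
        private val averageWordSize = 15
      
        // Int arithmetic: the approximate size silently wraps around for large models.
        def numPartitionsInt(bufferSizeInBytes: Int, numWords: Int, vectorSize: Int): Int = {
          val approximateSize = (floatSize * vectorSize + averageWordSize) * numWords
          approximateSize / bufferSizeInBytes + 1
        }
      
        // Long arithmetic: the intermediate size no longer overflows.
        def numPartitionsLong(bufferSizeInBytes: Long, numWords: Int, vectorSize: Int): Int = {
          val approximateSize = (floatSize.toLong * vectorSize + averageWordSize) * numWords
          val numPartitions = approximateSize / bufferSizeInBytes + 1
          require(numPartitions < Int.MaxValue, "too many partitions")
          numPartitions.toInt
        }
      }
      ```
      
      With a 64 MB buffer, one million words, and vectors of size 5000, the Int version yields a negative partition count, while the Long version yields 299.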
      
      ## How was this patch tested?
      
      New unit test.  I verified that the test fails before this patch.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #18265 from jkbradley/word2vec-save-fix.
      ff318c0d
    • Reynold Xin's avatar
      [SPARK-21059][SQL] LikeSimplification can NPE on null pattern · b1436c74
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch fixes a bug that can cause NullPointerException in LikeSimplification, when the pattern for like is null.
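      
      A minimal sketch of the guard, with hypothetical result types instead of Catalyst expressions: the null pattern is handled before any pattern matching, so the simplification never dereferences it.
      
      ```scala
      object LikeSketch {
        // Hypothetical stand-ins for the simplified expressions.
        sealed trait Simplified
        case class StartsWith(prefix: String) extends Simplified
        case class EndsWith(suffix: String) extends Simplified
        case class Contains(infix: String) extends Simplified
        case class KeepLike(pattern: String) extends Simplified
        case object NullResult extends Simplified
      
        private val startsWith = "([^_%]+)%".r
        private val endsWith = "%([^_%]+)".r
        private val contains = "%([^_%]+)%".r
      
        def simplify(pattern: String): Simplified =
          if (pattern == null) {
            // LIKE with a null pattern evaluates to null; nothing to simplify.
            NullResult
          } else pattern match {
            case startsWith(prefix) => StartsWith(prefix)
            case endsWith(suffix) => EndsWith(suffix)
            case contains(infix) => Contains(infix)
            case other => KeepLike(other)
          }
      }
      ```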
      
      ## How was this patch tested?
      Added a new unit test case in LikeSimplificationSuite.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #18273 from rxin/SPARK-21059.
      b1436c74
    • Dongjoon Hyun's avatar
      [SPARK-20345][SQL] Fix STS error handling logic on HiveSQLException · 32818d9b
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      [SPARK-5100](https://github.com/apache/spark/commit/343d3bfafd449a0371feb6a88f78e07302fa7143) added the Spark Thrift Server (STS) UI and the following logic to handle exceptions in the `case Throwable` branch.
      
      ```scala
      HiveThriftServer2.listener.onStatementError(
        statementId, e.getMessage, SparkUtils.exceptionString(e))
      ```
      
      However, a case was missed when [SPARK-6964](https://github.com/apache/spark/commit/eb19d3f75cbd002f7e72ce02017a8de67f562792)'s `Support Cancellation in the Thrift Server` added `case HiveSQLException` before `case Throwable`.
      
      ```scala
      case e: HiveSQLException =>
        if (getStatus().getState() == OperationState.CANCELED) {
          return
        } else {
          setState(OperationState.ERROR)
          throw e
        }
        // Actually do need to catch Throwable as some failures don't inherit from Exception and
        // HiveServer will silently swallow them.
      case e: Throwable =>
        val currentState = getStatus().getState()
        logError(s"Error executing query, currentState $currentState, ", e)
        setState(OperationState.ERROR)
        HiveThriftServer2.listener.onStatementError(
          statementId, e.getMessage, SparkUtils.exceptionString(e))
        throw new HiveSQLException(e.toString)
      ```
      
      Logically, we should also call `HiveThriftServer2.listener.onStatementError` in the `HiveSQLException` case.
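      
      The intended control flow can be sketched with hypothetical stand-ins for the exception and listener types (`SqlException`, `Listener`, and `execute` are not the Thrift Server API): the listener is notified in both error branches, and a canceled statement still returns quietly.
      
      ```scala
      object ErrorHandlingSketch {
        // Hypothetical stand-ins for HiveSQLException and the UI listener.
        class SqlException(msg: String) extends Exception(msg)
        class Listener {
          val errors = scala.collection.mutable.Buffer.empty[String]
          def onStatementError(id: String, msg: String): Unit = errors += s"$id: $msg"
        }
      
        def execute(id: String, listener: Listener, canceled: => Boolean)(body: => Unit): Unit =
          try body catch {
            case e: SqlException =>
              // A canceled statement is not an error; otherwise report it as well.
              if (!canceled) {
                listener.onStatementError(id, e.getMessage)
                throw e
              }
            case e: Throwable =>
              listener.onStatementError(id, e.getMessage)
              throw new SqlException(e.toString)
          }
      }
      ```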
      
      ## How was this patch tested?
      
      N/A
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #17643 from dongjoon-hyun/SPARK-20345.
      32818d9b
    • aokolnychyi's avatar
      [SPARK-17914][SQL] Fix parsing of timestamp strings with nanoseconds · ca4e960a
      aokolnychyi authored
      The PR contains a tiny change to fix the way Spark parses string literals into timestamps. Currently, some timestamps that contain nanoseconds are corrupted during the conversion from internal UTF8Strings into the internal representation of timestamps.
      
      Consider the following example:
      ```
      spark.sql("SELECT cast('2015-01-02 00:00:00.000000001' as TIMESTAMP)").show(false)
      +------------------------------------------------+
      |CAST(2015-01-02 00:00:00.000000001 AS TIMESTAMP)|
      +------------------------------------------------+
      |2015-01-02 00:00:00.000001                      |
      +------------------------------------------------+
      ```
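      
      Since Spark timestamps have microsecond precision, one way to handle the fractional part is to keep at most six digits and drop the rest. The helper below is a hypothetical sketch of that idea, not the actual parsing code, and truncating rather than rounding is an assumption.
      
      ```scala
      object FractionSketch {
        // Convert the digits after the decimal point into microseconds,
        // keeping at most six digits and dropping anything finer.
        def fractionToMicros(fraction: String): Int = {
          require(fraction.nonEmpty && fraction.forall(_.isDigit), "digits only")
          (fraction.take(6) + "000000").take(6).toInt
        }
      }
      ```
      
      Under this sketch, the fraction `000000001` from the example maps to 0 microseconds instead of being misread as 1 microsecond.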
      
      The fix was tested with existing tests. Also, there is a new test to cover cases that did not work previously.
      
      Author: aokolnychyi <anton.okolnychyi@sap.com>
      
      Closes #18252 from aokolnychyi/spark-17914.
      ca4e960a