  1. Mar 06, 2017
    • Gaurav's avatar
      [SPARK-19304][STREAMING][KINESIS] fix kinesis slow checkpoint recovery · 46a64d1e
      Gaurav authored
      ## What changes were proposed in this pull request?
      Added a limit to the getRecords API call in KinesisBackedBlockRdd. This reduces the amount of data returned by the Kinesis API call, making recovery considerably faster.
      
      As we already store `fromSeqNum` and `toSeqNum` in the checkpoint metadata, we can also store the number of records, which can later be used as the limit for the API call.
      
      ## How was this patch tested?
      The patch was manually tested
      
      Apologies for any silly mistakes, opening first pull request
      
      Author: Gaurav <gaurav@techtinium.com>
      
      Closes #16842 from Gauravshah/kinesis_checkpoint_recovery_fix_2_1_0.
      46a64d1e
    • Cheng Lian's avatar
      [SPARK-19737][SQL] New analysis rule for reporting unregistered functions... · 339b53a1
      Cheng Lian authored
      [SPARK-19737][SQL] New analysis rule for reporting unregistered functions without relying on relation resolution
      
      ## What changes were proposed in this pull request?
      
      This PR adds a new `Once` analysis rule batch consisting of a single analysis rule, `LookupFunctions`, that performs a simple existence check over `UnresolvedFunctions` without actually resolving them.
      
      The benefit of this rule is that it doesn't require function arguments to be resolved first and therefore doesn't rely on relation resolution, which may incur potentially expensive partition/schema discovery cost.
      
      Please refer to [SPARK-19737][1] for more details about the motivation.
      
      ## How was this patch tested?
      
      New test case added in `AnalysisErrorSuite`.
      
      [1]: https://issues.apache.org/jira/browse/SPARK-19737
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #17168 from liancheng/spark-19737-lookup-functions.
      339b53a1
    • Tejas Patil's avatar
      [SPARK-17495][SQL] Support Decimal type in Hive-hash · 2a0bc867
      Tejas Patil authored
      ## What changes were proposed in this pull request?
      
      Adds support for the Decimal datatype to Hive hash. [Hive internally normalises decimals](https://github.com/apache/hive/blob/4ba713ccd85c3706d195aeef9476e6e6363f1c21/storage-api/src/java/org/apache/hadoop/hive/common/type/HiveDecimalV1.java#L307) and I have ported that logic as-is to HiveHash.
      
      ## How was this patch tested?
      
      Added unit tests
      
      Author: Tejas Patil <tejasp@fb.com>
      
      Closes #17056 from tejasapatil/SPARK-17495_decimal.
      2a0bc867
  2. Mar 05, 2017
    • uncleGen's avatar
      [SPARK-19822][TEST] CheckpointSuite.testCheckpointedOperation: should not... · 207067ea
      uncleGen authored
      [SPARK-19822][TEST] CheckpointSuite.testCheckpointedOperation: should not filter checkpointFilesOfLatestTime with the PATH string.
      
      ## What changes were proposed in this pull request?
      
      https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73800/testReport/
      
      ```
      sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedDueToTimeoutException: The code
      passed to eventually never returned normally. Attempted 617 times over 10.003740484 seconds.
      Last failure message: 8 did not equal 2.
      	at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
      	at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
      	at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
      	at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:336)
      	at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
      	at org.apache.spark.streaming.DStreamCheckpointTester$class.generateOutput(CheckpointSuite
      .scala:172)
      	at org.apache.spark.streaming.CheckpointSuite.generateOutput(CheckpointSuite.scala:211)
      ```
      
      the check condition is:
      
      ```
      val checkpointFilesOfLatestTime = Checkpoint.getCheckpointFiles(checkpointDir).filter {
           _.toString.contains(clock.getTimeMillis.toString)
      }
      // Checkpoint files are written twice for every batch interval. So assert that both
      // are written to make sure that both of them have been written.
      assert(checkpointFilesOfLatestTime.size === 2)
      ```
      
      The path string itself may contain the value of `clock.getTimeMillis.toString`, for example `3500`:
      
      ```
      file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-500
      file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-1000
      file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-1500
      file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-2000
      file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-2500
      file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-3000
      file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-3500.bk
      file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-3500
                                                             ▲▲▲▲
      ```
      
      So we should check only the file name, not the whole path.
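The failure mode is easy to reproduce outside Spark. The following is a minimal Python sketch (not the Scala test code) that uses the paths from the listing above. Matching on the whole path string counts every file, because the temp directory name `spark-20035007-...` happens to contain `3500`; matching only on the final path component finds the two expected files.

```python
# Checkpoint paths copied from the listing above.
base = ("file:/root/dev/spark/assembly/CheckpointSuite/"
        "spark-20035007-9891-4fb6-91c1-cc15b7ccaf15")
names = [
    "checkpoint-500", "checkpoint-1000", "checkpoint-1500", "checkpoint-2000",
    "checkpoint-2500", "checkpoint-3000", "checkpoint-3500.bk", "checkpoint-3500",
]
paths = [f"{base}/{name}" for name in names]

latest_time = "3500"  # stands in for clock.getTimeMillis.toString

# Substring match on the full path: the directory name also contains "3500".
matched_by_path = [p for p in paths if latest_time in p]

# Substring match on the file name only (the last path component).
matched_by_name = [p for p in paths if latest_time in p.rsplit("/", 1)[-1]]

print(len(matched_by_path), len(matched_by_name))
```

The first count matches the `8 did not equal 2` message in the stack trace; the second is the value the assertion expects.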
      
      ## How was this patch tested?
      
      Jenkins.
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #17167 from uncleGen/flaky-CheckpointSuite.
      207067ea
    • hyukjinkwon's avatar
      [SPARK-19701][SQL][PYTHON] Throws a correct exception for 'in' operator against column · 224e0e78
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to remove an incorrect implementation of the `in` operator that has never been executed (at least since Spark 1.5.2), and to throw a correct exception rather than one saying the column cannot be converted to a bool. I tested the code below in 1.5.2, 1.6.3, 2.1.0 and in the master branch:
      
      **1.5.2**
      
      ```python
      >>> df = sqlContext.createDataFrame([[1]])
      >>> 1 in df._1
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark-1.5.2-bin-hadoop2.6/python/pyspark/sql/column.py", line 418, in __nonzero__
          raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
      ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
      ```
      
      **1.6.3**
      
      ```python
      >>> 1 in sqlContext.range(1).id
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark-1.6.3-bin-hadoop2.6/python/pyspark/sql/column.py", line 447, in __nonzero__
          raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
      ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
      ```
      
      **2.1.0**
      
      ```python
      >>> 1 in spark.range(1).id
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/column.py", line 426, in __nonzero__
          raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
      ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
      ```
      
      **Current Master**
      
      ```python
      >>> 1 in spark.range(1).id
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark/python/pyspark/sql/column.py", line 452, in __nonzero__
          raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
      ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
      ```
      
      **After**
      
      ```python
      >>> 1 in spark.range(1).id
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark/python/pyspark/sql/column.py", line 184, in __contains__
          raise ValueError("Cannot apply 'in' operator against a column: please use 'contains' "
      ValueError: Cannot apply 'in' operator against a column: please use 'contains' in a string column or 'array_contains' function for an array column.
      ```
      
      In more detail:
      
      It seems the implementation intended to support this
      
      ```python
      1 in df.column
      ```
      
      However, currently, it throws an exception as below:
      
      ```python
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark/python/pyspark/sql/column.py", line 426, in __nonzero__
          raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
      ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
      ```
      
      What happens here is as below:
      
      ```python
      class Column(object):
          def __contains__(self, item):
              print "I am contains"
              return Column()
          def __nonzero__(self):
              raise Exception("I am nonzero.")
      
      >>> 1 in Column()
      I am contains
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "<stdin>", line 6, in __nonzero__
      Exception: I am nonzero.
      ```
      
      It seems `__contains__` is called first, and then `__nonzero__` or `__bool__` is called on the returned `Column()` to turn it into a bool (or an int, to be specific).
      
      It seems that, unlike other operators, `in` forces the value returned by `__contains__` into a bool via `__nonzero__` (Python 2) or `__bool__` (Python 3). A few references about this:
      
      https://bugs.python.org/issue16011
      http://stackoverflow.com/questions/12244074/python-source-code-for-built-in-in-operator/12244378#12244378
      http://stackoverflow.com/questions/38542543/functionality-of-python-in-vs-contains/38542777
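For readers on Python 3, the same behaviour can be seen with `__bool__`. This is a minimal sketch with a stand-in `Column` class (not PySpark's): `==` hands back whatever `__eq__` returns, while `in` always passes the result of `__contains__` through a truth test.

```python
class Column:
    """Stand-in class for illustration only; not pyspark.sql.Column."""

    def __eq__(self, other):
        # Comparison operators return the object unchanged to the caller.
        return Column()

    def __contains__(self, item):
        # The 'in' operator does NOT return this object as-is ...
        return Column()

    def __bool__(self):
        # ... it truth-tests it, which ends up here.
        raise ValueError("Cannot convert column into bool")


col = Column()

eq_result = col == 1  # stays a Column; no truth test happens

try:
    1 in col
    in_raised = False
except ValueError:
    in_raised = True

print(type(eq_result).__name__, in_raised)
```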
      
      It seems we can't override `__nonzero__` or `__bool__` as a workaround, because these are required to return a bool, as shown below:
      
      ```python
      class Column(object):
          def __contains__(self, item):
              print "I am contains"
              return Column()
          def __nonzero__(self):
              return "a"
      
      >>> 1 in Column()
      I am contains
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      TypeError: __nonzero__ should return bool or int, returned str
      ```
      
      ## How was this patch tested?
      
      Added unit tests in `tests.py`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17160 from HyukjinKwon/SPARK-19701.
      224e0e78
    • Sue Ann Hong's avatar
      [SPARK-19535][ML] RecommendForAllUsers RecommendForAllItems for ALS on Dataframe · 70f9d7f7
      Sue Ann Hong authored
      ## What changes were proposed in this pull request?
      
      This is a simple implementation of RecommendForAllUsers & RecommendForAllItems for the Dataframe version of ALS. It uses Dataframe operations (not a wrapper on the RDD implementation). Haven't benchmarked against a wrapper, but unit test examples do work.
      
      ## How was this patch tested?
      
      Unit tests
      ```
      $ build/sbt
      > mllib/testOnly *ALSSuite -- -z "recommendFor"
      > mllib/testOnly
      ```
      
      Author: Your Name <you@example.com>
      Author: sueann <sueann@databricks.com>
      
      Closes #17090 from sueann/SPARK-19535.
      70f9d7f7
    • hyukjinkwon's avatar
      [SPARK-19595][SQL] Support json array in from_json · 369a148e
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes both of the following:
      
      **Do not allow json arrays with multiple elements and return null in `from_json` with `StructType` as the schema.**
      
      Currently, it reads only a single row when the input is a json array. So, the code below:
      
      ```scala
      import org.apache.spark.sql.functions._
      import org.apache.spark.sql.types._
      val schema = StructType(StructField("a", IntegerType) :: Nil)
      Seq(("""[{"a": 1}, {"a": 2}]""")).toDF("struct").select(from_json(col("struct"), schema)).show()
      ```
      prints
      
      ```
      +--------------------+
      |jsontostruct(struct)|
      +--------------------+
      |                 [1]|
      +--------------------+
      ```
      
      This PR simply suggests returning `null` if the schema is `StructType` and the input is a json array with multiple elements:
      
      ```
      +--------------------+
      |jsontostruct(struct)|
      +--------------------+
      |                null|
      +--------------------+
      ```
      
      **Support json arrays in `from_json` with `ArrayType` as the schema.**
      
      ```scala
      import org.apache.spark.sql.functions._
      import org.apache.spark.sql.types._
      val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
      Seq(("""[{"a": 1}, {"a": 2}]""")).toDF("array").select(from_json(col("array"), schema)).show()
      ```
      
      prints
      
      ```
      +-------------------+
      |jsontostruct(array)|
      +-------------------+
      |         [[1], [2]]|
      +-------------------+
      ```
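The two behaviours described above can be modelled in plain Python. This is only a sketch of the described semantics, not Spark's implementation: the helper names `from_json_struct` and `from_json_array` are hypothetical, a struct is represented as a list of field values, and the handling of empty arrays is an assumption the description does not cover.

```python
import json


def from_json_struct(text, field):
    """Hypothetical model of from_json with a one-field StructType schema."""
    value = json.loads(text)
    if isinstance(value, list):
        # Arrays with multiple elements yield null (None), per the description.
        # Assumption: empty arrays are treated the same way.
        if len(value) != 1:
            return None
        value = value[0]
    return [value.get(field)]


def from_json_array(text, field):
    """Hypothetical model of from_json with ArrayType(StructType(...)).

    Only array input is modelled here.
    """
    rows = json.loads(text)
    return [[row.get(field)] for row in rows]


data = '[{"a": 1}, {"a": 2}]'
struct_result = from_json_struct(data, "a")
array_result = from_json_array(data, "a")
print(struct_result, array_result)
```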
      
      ## How was this patch tested?
      
      Unit test in `JsonExpressionsSuite`, `JsonFunctionsSuite`, Python doctests and manual test.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16929 from HyukjinKwon/disallow-array.
      369a148e
    • Felix Cheung's avatar
      [SPARK-19795][SPARKR] add column functions to_json, from_json · 80d5338b
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Add column functions: to_json, from_json, and tests covering error cases.
      
      ## How was this patch tested?
      
      unit tests, manual
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17134 from felixcheung/rtojson.
      80d5338b
    • Takeshi Yamamuro's avatar
      [SPARK-19254][SQL] Support Seq, Map, and Struct in functions.lit · 14bb398f
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR supports Seq, Map, and Struct in functions.lit; it adds a new interface named `lit2` with a `TypeTag` to avoid type erasure.
      
      ## How was this patch tested?
      Added tests in `LiteralExpressionSuite`
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #16610 from maropu/SPARK-19254.
      14bb398f
    • uncleGen's avatar
      [SPARK-19805][TEST] Log the row type when query result does not match · f48461ab
      uncleGen authored
      ## What changes were proposed in this pull request?
      
      Improve the log message shown when a query result does not match.
      
      before pr:
      
      ```
      == Results ==
      !== Correct Answer - 3 ==   == Spark Answer - 3 ==
       [1]                        [1]
       [2]                        [2]
       [3]                        [3]
      
      ```
      
      after pr:
      
      ~~== Results ==
      !== Correct Answer - 3 ==   == Spark Answer - 3 ==
      !RowType[string]            RowType[integer]
       [1]                        [1]
       [2]                        [2]
       [3]                        [3]~~
      
      ```
      == Results ==
      !== Correct Answer - 3 ==   == Spark Answer - 3 ==
      !struct<value:string>       struct<value:int>
       [1]                        [1]
       [2]                        [2]
       [3]                        [3]
      ```
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #17145 from uncleGen/improve-test-result.
      f48461ab
    • liuxian's avatar
      [SPARK-19792][WEBUI] In the Master Page,the column named “Memory per Node” ,I... · 42c4cd9e
      liuxian authored
      [SPARK-19792][WEBUI] In the Master Page,the column named “Memory per Node” ,I think it is not all right
      
      Signed-off-by: liuxian <liu.xian3@zte.com.cn>
      
      ## What changes were proposed in this pull request?
      
      On the Master page of the Spark web UI there are two tables: Running Applications and Completed Applications. I think the column named “Memory per Node” is not right, because a node may have more than one executor. It should be named “Memory per Executor”; otherwise it is easy for users to misunderstand.
      
      ## How was this patch tested?
      
      N/A
      
      Author: liuxian <liu.xian3@zte.com.cn>
      
      Closes #17132 from 10110346/wid-lx-0302.
      42c4cd9e
  3. Mar 04, 2017
  4. Mar 03, 2017
    • Shixiong Zhu's avatar
      [SPARK-19816][SQL][TESTS] Fix an issue that DataFrameCallbackSuite doesn't recover the log level · fbc40580
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      "DataFrameCallbackSuite.execute callback functions when a DataFrame action failed" sets the log level to "fatal" but doesn't recover it. Hence, tests running after it won't output any logs except fatal logs.
      
      This PR uses `testQuietly` instead to avoid changing the log level.
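The underlying problem is a global setting that is changed and never restored. How `testQuietly` avoids it is not shown here; the following is an analogous pattern in Python's `logging` module, where a hypothetical `quiet_logging` context manager restores the previous level even when the body fails.

```python
import logging
from contextlib import contextmanager


@contextmanager
def quiet_logging(logger, level=logging.CRITICAL):
    """Temporarily raise the log level and always restore the old one."""
    previous = logger.level
    logger.setLevel(level)
    try:
        yield
    finally:
        # Runs on success and on failure, so later tests keep their logs.
        logger.setLevel(previous)


log = logging.getLogger("example")
log.setLevel(logging.INFO)

try:
    with quiet_logging(log):
        level_inside = log.level
        raise RuntimeError("simulated failing action")
except RuntimeError:
    pass

level_after = log.level
print(level_inside, level_after)
```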
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17156 from zsxwing/SPARK-19816.
      fbc40580
    • Marcelo Vanzin's avatar
      [SPARK-19084][SQL] Ensure context class loader is set when initializing Hive. · 9e5b4ce7
      Marcelo Vanzin authored
      A change in Hive 2.2 (most probably HIVE-13149) causes this code path to fail,
      since the call to "state.getConf.setClassLoader" does not actually change the
      context's class loader. Spark doesn't yet officially support Hive 2.2, but some
      distribution-specific metastore client libraries may have that change (as certain
      versions of CDH already do), and this also makes it easier to support 2.2 when it
      comes out.
      
      Tested with existing unit tests; we've also used this patch extensively with Hive
      metastore client jars containing the offending patch.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #17154 from vanzin/SPARK-19804.
      9e5b4ce7
    • Shixiong Zhu's avatar
      [SPARK-19718][SS] Handle more interrupt cases properly for Hadoop · a6a7a95e
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      [SPARK-19617](https://issues.apache.org/jira/browse/SPARK-19617) changed `HDFSMetadataLog` to enable interrupts when using the local file system. However, now we hit [HADOOP-12074](https://issues.apache.org/jira/browse/HADOOP-12074): `Shell.runCommand` converts `InterruptedException` to `new IOException(ie.toString())` before Hadoop 2.8. This is the Hadoop patch to fix HADOOP-12074: https://github.com/apache/hadoop/commit/95c73d49b1bb459b626a9ac52acadb8f5fa724de
      
      This PR adds new logic to handle the following cases related to `InterruptedException`.
      - Check if the message of IOException starts with `java.lang.InterruptedException`. If so, treat it as `InterruptedException`. This is for pre-Hadoop 2.8.
      - Treat `InterruptedIOException` as `InterruptedException`. This is for Hadoop 2.8+ and other places that may throw `InterruptedIOException` when the thread is interrupted.
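The two cases above amount to a small classification function. The sketch below models it in Python with stand-in exception classes named after their Java counterparts; the function name `is_interruption` is hypothetical and this is not the code added by the PR.

```python
class IOException(Exception):
    """Stand-in for java.io.IOException."""


class InterruptedIOException(IOException):
    """Stand-in for java.io.InterruptedIOException."""


def is_interruption(exc):
    """Return True if the exception should be treated as an interrupt."""
    # Hadoop 2.8+ and other places: a dedicated exception type.
    if isinstance(exc, InterruptedIOException):
        return True
    # Pre-Hadoop 2.8: the interrupt is wrapped as IOException(ie.toString()),
    # so the message starts with the original exception's class name.
    if isinstance(exc, IOException):
        return str(exc).startswith("java.lang.InterruptedException")
    return False


wrapped = IOException("java.lang.InterruptedException: sleep interrupted")
typed = InterruptedIOException("thread interrupted")
unrelated = IOException("No such file or directory")

print(is_interruption(wrapped), is_interruption(typed), is_interruption(unrelated))
```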
      
      ## How was this patch tested?
      
      The new unit test.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17044 from zsxwing/SPARK-19718.
      a6a7a95e
    • Xiao Li's avatar
      [SPARK-13446][SQL] Support reading data from Hive 2.0.1 metastore · f5fdbe04
      Xiao Li authored
      ### What changes were proposed in this pull request?
      This PR is to make Spark work with Hive 2.0's metastores. Compared with Hive 1.2, Hive 2.0's metastore has an API update due to removal of `HOLD_DDLTIME` in https://issues.apache.org/jira/browse/HIVE-12224. Based on the following Hive JIRA description, `HOLD_DDLTIME` should be removed from our internal API too. (https://github.com/apache/spark/pull/17063 was submitted for it):
      > This arcane feature was introduced long ago via HIVE-1394 It was broken as soon as it landed, HIVE-1442 and is thus useless. Fact that no one has fixed it since informs that its not really used by anyone. Better is to remove it so no one hits the bug of HIVE-1442
      
      In the next PR, we will support 2.1.0 metastore, whose APIs were changed due to https://issues.apache.org/jira/browse/HIVE-12730. However, before that, we need a code cleanup for stats collection and setting.
      
      ### How was this patch tested?
      Added test cases to VersionsSuite.scala
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17061 from gatorsmile/Hive2.
      f5fdbe04
    • Bryan Cutler's avatar
      [SPARK-19348][PYTHON] PySpark keyword_only decorator is not thread-safe · 44281ca8
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      The `keyword_only` decorator in PySpark is not thread-safe.  It writes kwargs to a static class variable in the decorator, which is then retrieved later in the class method as `_input_kwargs`.  If multiple threads are constructing the same class with different kwargs, it becomes a race condition to read from the static class variable before it's overwritten.  See [SPARK-19348](https://issues.apache.org/jira/browse/SPARK-19348) for reproduction code.
      
      This change will write the kwargs to a member variable so that multiple threads can operate on separate instances without the race condition.  It does not protect against multiple threads operating on a single instance, but that is better left to the user to synchronize.
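A minimal sketch of the instance-based approach described above, assuming a decorator shaped like the one in the description; the `Params` class is hypothetical. Because the kwargs are stored on `self`, two instances built with different arguments do not overwrite each other's `_input_kwargs`.

```python
from functools import wraps


def keyword_only(func):
    """Force keyword arguments and record them on the instance."""
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        if args:
            raise TypeError("Method %s forces keyword arguments." % func.__name__)
        # Stored per instance rather than on a shared function/class attribute.
        self._input_kwargs = kwargs
        return func(self, **kwargs)
    return wrapper


class Params:
    """Hypothetical class used only to exercise the decorator."""

    @keyword_only
    def __init__(self, x=None):
        self.x = self._input_kwargs.get("x")


a = Params(x=1)
b = Params(x=2)
print(a._input_kwargs, b._input_kwargs)
```

As the description notes, this does not protect a single instance that is shared between threads.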
      
      ## How was this patch tested?
      Added new unit tests for using the keyword_only decorator and a regression test that verifies `_input_kwargs` can be overwritten from different class instances.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #16782 from BryanCutler/pyspark-keyword_only-threadsafe-SPARK-19348.
      44281ca8
    • Takuya UESHIN's avatar
      [SPARK-18939][SQL] Timezone support in partition values. · 2a7921a8
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up pr of #16308 and #16750.
      
      This pr enables timezone support in partition values.
      
      We should use the `timeZone` option introduced in #16750 to parse/format partition values of the `TimestampType`.
      
      For example, suppose the timestamp `"2016-01-01 00:00:00"` in `GMT` is used as a partition value. With the default timezone option, which is `"GMT"` here because the session local timezone is `"GMT"`, the values written are:
      
      ```scala
      scala> spark.conf.set("spark.sql.session.timeZone", "GMT")
      
      scala> val df = Seq((1, new java.sql.Timestamp(1451606400000L))).toDF("i", "ts")
      df: org.apache.spark.sql.DataFrame = [i: int, ts: timestamp]
      
      scala> df.show()
      +---+-------------------+
      |  i|                 ts|
      +---+-------------------+
      |  1|2016-01-01 00:00:00|
      +---+-------------------+
      
      scala> df.write.partitionBy("ts").save("/path/to/gmtpartition")
      ```
      
      ```sh
      $ ls /path/to/gmtpartition/
      _SUCCESS			ts=2016-01-01 00%3A00%3A00
      ```
      
      whereas with the option set to `"PST"`, they are:
      
      ```scala
      scala> df.write.option("timeZone", "PST").partitionBy("ts").save("/path/to/pstpartition")
      ```
      
      ```sh
      $ ls /path/to/pstpartition/
      _SUCCESS			ts=2015-12-31 16%3A00%3A00
      ```
      
      We can properly read the partition values if the session local timezone and the timezone of the partition values are the same:
      
      ```scala
      scala> spark.read.load("/path/to/gmtpartition").show()
      +---+-------------------+
      |  i|                 ts|
      +---+-------------------+
      |  1|2016-01-01 00:00:00|
      +---+-------------------+
      ```
      
      And even if the timezones are different, we can properly read the values by setting the correct timezone option:
      
      ```scala
      // wrong result
      scala> spark.read.load("/path/to/pstpartition").show()
      +---+-------------------+
      |  i|                 ts|
      +---+-------------------+
      |  1|2015-12-31 16:00:00|
      +---+-------------------+
      
      // correct result
      scala> spark.read.option("timeZone", "PST").load("/path/to/pstpartition").show()
      +---+-------------------+
      |  i|                 ts|
      +---+-------------------+
      |  1|2016-01-01 00:00:00|
      +---+-------------------+
      ```
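The directory names in the listings above follow directly from formatting the same instant in two timezones. The sketch below reproduces them in plain Python. It assumes `"PST"` can be modelled as a fixed UTC-8 offset, and it mimics the `%3A` escaping seen in the listings with a simple string replacement; the helper `partition_dir` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

GMT = timezone.utc
PST = timezone(timedelta(hours=-8))  # assumption: fixed offset, no DST handling

# Same instant as new java.sql.Timestamp(1451606400000L), in seconds.
instant = datetime.fromtimestamp(1451606400, tz=GMT)


def partition_dir(ts, tz):
    """Format the instant in the given timezone as a partition directory name."""
    local = ts.astimezone(tz).strftime("%Y-%m-%d %H:%M:%S")
    return "ts=" + local.replace(":", "%3A")


gmt_dir = partition_dir(instant, GMT)
pst_dir = partition_dir(instant, PST)
print(gmt_dir)
print(pst_dir)
```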
      
      ## How was this patch tested?
      
      Existing tests and added some tests.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #17053 from ueshin/issues/SPARK-18939.
      2a7921a8
    • jerryshao's avatar
      [MINOR][DOC] Fix doc for web UI https configuration · ba186a84
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      The doc about enabling HTTPS for the web UI is not correct: "spark.ui.https.enabled" does not exist, and enabling SSL is actually enough for HTTPS.
      
      ## How was this patch tested?
      
      N/A
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #17147 from jerryshao/fix-doc-ssl.
      ba186a84
    • Burak Yavuz's avatar
      [SPARK-19774] StreamExecution should call stop() on sources when a stream fails · 9314c083
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      We call stop() on a Structured Streaming Source only when the stream is shut down because a user calls streamingQuery.stop(). We should also stop all sources when the stream fails; otherwise we may leak resources, e.g. connections to Kafka.
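The general pattern is to release sources on every exit path, not only on an explicit stop. A minimal Python sketch of that idea follows; the `Source` class and `run_stream` function are hypothetical and do not mirror `StreamExecution`.

```python
class Source:
    """Hypothetical source that holds a resource until stop() is called."""

    def __init__(self, name):
        self.name = name
        self.stopped = False

    def stop(self):
        self.stopped = True


def run_stream(sources, run_batches):
    """Run the stream and stop every source, whether it ends or fails."""
    try:
        run_batches()
    finally:
        for source in sources:
            source.stop()


def failing_batches():
    raise RuntimeError("simulated stream failure")


sources = [Source("kafka"), Source("file")]
try:
    run_stream(sources, failing_batches)
except RuntimeError:
    pass

print([s.stopped for s in sources])
```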
      
      ## How was this patch tested?
      
      Unit tests in `StreamingQuerySuite`.
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #17107 from brkyvz/close-source.
      9314c083
    • Pete Robbins's avatar
      [SPARK-19710][SQL][TESTS] Fix ordering of rows in query results · 37a1c0e4
      Pete Robbins authored
      ## What changes were proposed in this pull request?
      Changes to SQLQueryTests to make the order of the results constant.
      Where possible ORDER BY has been added to match the existing expected output
      
      ## How was this patch tested?
      Test runs on x86, zLinux (big endian), ppc (big endian)
      
      Author: Pete Robbins <robbinspg@gmail.com>
      
      Closes #17039 from robbinspg/SPARK-19710.
      37a1c0e4
    • Liang-Chi Hsieh's avatar
      [SPARK-19758][SQL] Resolving timezone aware expressions with time zone when resolving inline table · 98bcc188
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      When we resolve inline tables in analyzer, we will evaluate the expressions of inline tables.
      
      When it evaluates a `TimeZoneAwareExpression` expression, an error will happen because the `TimeZoneAwareExpression` is not associated with timezone yet.
      
      So we need to resolve these `TimeZoneAwareExpression`s with time zone when resolving inline tables.
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #17114 from viirya/resolve-timeawareexpr-inline-table.
      98bcc188
    • Dongjoon Hyun's avatar
      [SPARK-19801][BUILD] Remove JDK7 from Travis CI · 776fac39
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Since Spark 2.1.0, Travis CI was supported by SPARK-15207 for automated PR verification (JDK7/JDK8 maven compilation and Java Linter) and contributors can see the additional result via their Travis CI dashboard (or PC).
      
      This PR aims to make `.travis.yml` up-to-date by removing JDK7 which was removed via SPARK-19550.
      
      ## How was this patch tested?
      
      See the result via Travis CI.
      
      - https://travis-ci.org/dongjoon-hyun/spark/builds/207111713
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #17143 from dongjoon-hyun/SPARK-19801.
      776fac39
    • Zhe Sun's avatar
      [SPARK-19797][DOC] ML pipeline document correction · 0bac3e4c
      Zhe Sun authored
      ## What changes were proposed in this pull request?
      Description about pipeline in this paragraph is incorrect https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works
      
      > If the Pipeline had more **stages**, it would call the LogisticRegressionModel’s transform() method on the DataFrame before passing the DataFrame to the next stage.
      
      Reason: a Transformer could also be a stage, but only another Estimator will invoke a transform call and pass the data to the next stage. The description in the document misleads ML pipeline users.
      
      ## How was this patch tested?
      This is a tiny modification of **docs/ml-pipelines.md**. I built the modified docs with Jekyll and checked the compiled document.
      
      Author: Zhe Sun <ymwdalex@gmail.com>
      
      Closes #17137 from ymwdalex/SPARK-19797-ML-pipeline-document-correction.
      0bac3e4c
    • uncleGen's avatar
      [SPARK-19739][CORE] propagate S3 session token to cluster · fa50143c
      uncleGen authored
      ## What changes were proposed in this pull request?
      
      Propagate the S3 session token to the cluster.
      
      ## How was this patch tested?
      
      existing ut
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #17080 from uncleGen/SPARK-19739.
      fa50143c
    • hyukjinkwon's avatar
      [SPARK-18699][SQL][FOLLOWUP] Add explanation in CSV parser and minor cleanup · d556b317
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR suggests adding some comments to the `UnivocityParser` logic to explain what happens. It also proposes what is, IMHO, a slightly cleaner version (at least one that is easier for me to explain).
      
      ## How was this patch tested?
      
      Unit tests in `CSVSuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17142 from HyukjinKwon/SPARK-18699.
      d556b317
    • windpiger's avatar
      [SPARK-18726][SQL] resolveRelation for FileFormat DataSource don't need to listFiles twice · 982f3223
      windpiger authored
      ## What changes were proposed in this pull request?
      
      Currently, when we call `resolveRelation` for a `FileFormat DataSource` without providing a user schema, it executes `listFiles` twice in `InMemoryFileIndex`.
      
      This PR adds a `FileStatusCache` for DataSource, which avoids calling `listFiles` twice.
      
      However, there is a bug in `InMemoryFileIndex`; see:
       [SPARK-19748](https://github.com/apache/spark/pull/17079)
       [SPARK-19761](https://github.com/apache/spark/pull/17093),
      so this PR should be merged after SPARK-19748/SPARK-19761.
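
As a rough illustration of the caching idea, here is a sketch in plain Python. It is not Spark's `FileStatusCache`: the class and method names below (`CountingFileSystem`, `get_or_list`) are hypothetical, and the two calls merely stand in for the two listing passes described above.

```python
class CountingFileSystem:
    """Stand-in file system that counts how often it is listed."""
    def __init__(self, files):
        self.files = files
        self.list_calls = 0

    def list_files(self, path):
        self.list_calls += 1
        return [f for f in self.files if f.startswith(path)]


class FileStatusCache:
    """Remembers listings per path so a repeated request skips the listing."""
    def __init__(self):
        self._entries = {}

    def get_or_list(self, fs, path):
        if path not in self._entries:
            self._entries[path] = fs.list_files(path)
        return self._entries[path]


fs = CountingFileSystem(["/data/a.parquet", "/data/b.parquet"])
cache = FileStatusCache()
for_schema = cache.get_or_list(fs, "/data")    # first pass, e.g. schema inference
for_relation = cache.get_or_list(fs, "/data")  # second pass reuses the cached listing
print(fs.list_calls)
```

Sharing one cache between both passes is what turns the second listing into a lookup.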
      
      ## How was this patch tested?
      unit test added
      
      Author: windpiger <songjun@outlook.com>
      
      Closes #17081 from windpiger/resolveDataSourceScanFilesTwice.
      982f3223
  5. Mar 02, 2017
    • guifeng's avatar
      [SPARK-19779][SS] Delete needless tmp file after restart structured streaming job · e24f21b5
      guifeng authored
      ## What changes were proposed in this pull request?
      
      [SPARK-19779](https://issues.apache.org/jira/browse/SPARK-19779)
      
      The PR (https://github.com/apache/spark/pull/17012) fixes restarting a Structured Streaming application that uses HDFS as its file system, but a problem remains: a tmp file of the delta file is still left in HDFS, and Structured Streaming never deletes the tmp file generated when the streaming job is restarted.
      
      ## How was this patch tested?
       unit tests
      
      Author: guifeng <guifengleaf@gmail.com>
      
      Closes #17124 from gf53520/SPARK-19779.
      e24f21b5
    • Sunitha Kambhampati's avatar
      [SPARK-19602][SQL][TESTS] Add tests for qualified column names · f37bb143
      Sunitha Kambhampati authored
      ## What changes were proposed in this pull request?
      - Add tests covering different scenarios with qualified column names
      - Please see Section 2 in the design doc for the various test scenarios [here](https://issues.apache.org/jira/secure/attachment/12854681/Design_ColResolution_JIRA19602.pdf)
      - As part of SPARK-19602, changes are being made to support three-part column names. To aid the review and reduce the diff, the test scenarios are separated out into this PR.
      
      ## How was this patch tested?
      - This is a **test only** change. The individual test suites were run successfully.
      
      Author: Sunitha Kambhampati <skambha@us.ibm.com>
      
      Closes #17067 from skambha/colResolutionTests.
      f37bb143
    • sethah's avatar
      [SPARK-19745][ML] SVCAggregator captures coefficients in its closure · 93ae176e
      sethah authored
      ## What changes were proposed in this pull request?
      
      JIRA: [SPARK-19745](https://issues.apache.org/jira/browse/SPARK-19745)
      
      Reorganize SVCAggregator to avoid serializing coefficients. This patch also makes the gradient array a `lazy val` which will avoid materializing a large array on the driver before shipping the class to the executors. This improvement stems from https://github.com/apache/spark/pull/16037. Actually, probably all ML aggregators can benefit from this.
      
      We can either: a.) separate the gradient improvement into another patch b.) keep what's here _plus_ add the lazy evaluation to all other aggregators in this patch or c.) keep it as is.
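
The lazy-allocation idea can be sketched in plain Python, with `functools.cached_property` playing the role that `lazy val` plays in Scala. The `Aggregator` class and its fields are hypothetical and are not the actual aggregator code; the sketch only shows that the gradient array is not created until the first update touches it.

```python
from functools import cached_property


class Aggregator:
    """Toy aggregator whose gradient buffer is allocated on first use."""
    def __init__(self, num_features):
        self.num_features = num_features
        self.loss_sum = 0.0

    @cached_property
    def gradient(self):
        # Runs once, on first access; nothing is allocated before that.
        return [0.0] * self.num_features

    def add(self, features, error):
        self.loss_sum += error * error
        for i, x in enumerate(features):
            self.gradient[i] += error * x
        return self


agg = Aggregator(3)
allocated_before = "gradient" in vars(agg)  # still unallocated at this point
agg.add([1.0, 2.0, 3.0], 0.5)
allocated_after = "gradient" in vars(agg)
print(allocated_before, allocated_after, agg.gradient)
```

An instance created on the driver and shipped before any `add` call therefore carries no large array with it.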
      
      ## How was this patch tested?
      
      This is an interesting question! I don't know of a reasonable way to test this right now. Ideally, we could perform an optimization and look at the shuffle write data for each task, and we could compare the size to what we know it should be: `numCoefficients * 8 bytes`. Not sure if there is a good way to do that right now? We could discuss this here or in another JIRA, but I suspect it would be a significant undertaking.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #17076 from sethah/svc_agg.
      93ae176e
    • Imran Rashid's avatar
      [SPARK-19276][CORE] Fetch Failure handling robust to user error handling · 8417a7ae
      Imran Rashid authored
      ## What changes were proposed in this pull request?
      
      Fault-tolerance in spark requires special handling of shuffle fetch
      failures.  The Executor would catch FetchFailedException and send a
      special msg back to the driver.
      
      However, intervening user code could intercept that exception, and wrap
      it with something else.  This even happens in SparkSQL.  So rather than
      checking the thrown exception only, we'll store the fetch failure directly
      in the TaskContext, where users can't touch it.
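
A minimal sketch of the idea, in plain Python rather than Spark's Scala: the failure is recorded on the task context before it is raised, so the runner can still recognise it after user code has wrapped the exception. All names here (`FetchFailedError`, `set_fetch_failed`, `run_task`) are hypothetical stand-ins, not Spark's API.

```python
class FetchFailedError(Exception):
    """Stand-in for a shuffle fetch failure."""


class TaskContext:
    def __init__(self):
        self.fetch_failure = None

    def set_fetch_failed(self, err):
        self.fetch_failure = err


def shuffle_read(ctx):
    err = FetchFailedError("shuffle block missing")
    ctx.set_fetch_failed(err)  # record the failure before raising it
    raise err


def user_code(ctx):
    try:
        shuffle_read(ctx)
    except Exception as e:
        # User code hides the original exception type behind its own.
        raise RuntimeError("wrapped by user code") from e


def run_task(ctx):
    try:
        user_code(ctx)
        return "success"
    except Exception:
        # Consult the context instead of the type of the thrown exception.
        if ctx.fetch_failure is not None:
            return "fetch_failed"
        return "task_failed"


print(run_task(TaskContext()))
```

Checking only the exception type in `run_task` would have classified this as an ordinary task failure.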
      
      ## How was this patch tested?
      
      Added a test case which failed before the fix.  Full test suite via jenkins.
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #16639 from squito/SPARK-19276.
      8417a7ae
    • Patrick Woody's avatar
      [SPARK-19631][CORE] OutputCommitCoordinator should not allow commits for already failed tasks · 433d9eb6
      Patrick Woody authored
      ## What changes were proposed in this pull request?
      
      Previously it was possible for there to be a race between a task failure and committing the output of a task. For example, the driver may mark a task attempt as failed due to an executor heartbeat timeout (possibly due to GC), but the task attempt actually ends up coordinating with the OutputCommitCoordinator once the executor recovers and committing its result. This will lead to any retry attempt failing because the task result has already been committed despite the original attempt failing.
      
      This ensures that any previously failed task attempts cannot enter the commit protocol.
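
The rule can be sketched as a small coordinator in plain Python. This is an illustration under assumptions, not the real `OutputCommitCoordinator`: the method names and the bookkeeping (a set of failed attempts plus one authorized committer per partition) are hypothetical.

```python
class OutputCommitCoordinator:
    """Toy coordinator: at most one attempt per partition may commit."""
    def __init__(self):
        self.failed = set()   # (partition, attempt) pairs marked as failed
        self.committer = {}   # partition -> attempt authorized to commit

    def attempt_failed(self, partition, attempt):
        self.failed.add((partition, attempt))
        # Assumption: a failed attempt gives up any authorization it held.
        if self.committer.get(partition) == attempt:
            del self.committer[partition]

    def can_commit(self, partition, attempt):
        # An attempt already marked as failed never enters the commit protocol.
        if (partition, attempt) in self.failed:
            return False
        holder = self.committer.setdefault(partition, attempt)
        return holder == attempt


coord = OutputCommitCoordinator()
coord.attempt_failed(0, 0)        # e.g. driver saw a heartbeat timeout
late = coord.can_commit(0, 0)     # the recovered executor asks too late
retry = coord.can_commit(0, 1)    # the retry is allowed to commit
other = coord.can_commit(0, 2)    # any further attempt is refused
print(late, retry, other)
```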
      
      ## How was this patch tested?
      
      Added a unit test
      
      Author: Patrick Woody <pwoody@palantir.com>
      
      Closes #16959 from pwoody/pw/recordFailuresForCommitter.
      433d9eb6
    • Mark Grover's avatar
      [SPARK-19720][CORE] Redact sensitive information from SparkSubmit console · 5ae3516b
      Mark Grover authored
      ## What changes were proposed in this pull request?
      This change redacts sensitive information (based on `spark.redaction.regex` property)
      from the Spark Submit console logs. Such sensitive information is already being
      redacted from event logs and yarn logs, etc.
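
For illustration, regex-based redaction of a property map might look like the sketch below. It is written in plain Python, not taken from the patch: the `redact` helper is hypothetical, matching on the property name is an assumption, and the pattern is only an example value for `spark.redaction.regex`.

```python
import re

REDACTED = "*********(redacted)"


def redact(props, pattern):
    """Return a copy of props with values hidden for keys matching pattern."""
    regex = re.compile(pattern)
    return {k: (REDACTED if regex.search(k) else v) for k, v in props.items()}


props = {
    "spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD": "hunter2",
    "spark.authenticate": "false",
    "spark.executorEnv.HADOOP_CREDSTORE_PASSWORD": "hunter2",
}
# Example pattern only; the real value comes from the Spark configuration.
redacted = redact(props, r"(?i)secret|password")
for key, value in redacted.items():
    print(f"({key},{value})")
```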
      
      ## How was this patch tested?
      Testing was done manually to make sure that the console logs were not printing any
      sensitive information.
      
      Here's some output from the console:
      
      ```
      Spark properties used, including those specified through
       --conf and those from the properties file /etc/spark2/conf/spark-defaults.conf:
        (spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD,*********(redacted))
        (spark.authenticate,false)
        (spark.executorEnv.HADOOP_CREDSTORE_PASSWORD,*********(redacted))
      ```
      
      ```
      System properties:
      (spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD,*********(redacted))
      (spark.authenticate,false)
      (spark.executorEnv.HADOOP_CREDSTORE_PASSWORD,*********(redacted))
      ```
      There is a risk that if new print statements are added to the console down the road, sensitive information may still get leaked, since there is no test that asserts on the console log output. I considered it out of the scope of this JIRA to write an integration test to make sure new leaks don't happen in the future.
      
      Running unit tests to make sure nothing else is broken by this change.
      
      Author: Mark Grover <mark@apache.org>
      
      Closes #17047 from markgrover/master_redaction.
      5ae3516b
    • Nick Pentreath's avatar
      [SPARK-19345][ML][DOC] Add doc for "coldStartStrategy" usage in ALS · 9cca3dbf
      Nick Pentreath authored
      [SPARK-14489](https://issues.apache.org/jira/browse/SPARK-14489) added the ability to skip `NaN` predictions during `ALSModel.transform`. This PR adds documentation for the `coldStartStrategy` param to the ALS user guide, and adds code to the examples to illustrate usage.
      
      ## How was this patch tested?
      
      Doc and example change only. Built the HTML docs locally and verified that the example code builds and runs in the shell for Scala/Python.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #17102 from MLnick/SPARK-19345-coldstart-doc.
      9cca3dbf
    • Zheng RuiFeng's avatar
      [SPARK-19704][ML] AFTSurvivalRegression should support numeric censorCol · 50c08e82
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      make `AFTSurvivalRegression` support numeric censorCol
      ## How was this patch tested?
      existing tests and added tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #17034 from zhengruifeng/aft_numeric_censor.
      50c08e82
    • Vasilis Vryniotis's avatar
      [SPARK-19733][ML] Removed unnecessary castings and refactored checked casts in ALS. · 625cfe09
      Vasilis Vryniotis authored
      ## What changes were proposed in this pull request?
      
      The original ALS was performing unnecessary casting to the user and item ids because the protected checkedCast() method required a double. I removed the castings and refactored the method to receive Any and efficiently handle all permitted numeric values.
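
A checked cast of this kind can be sketched in plain Python as follows. This is not the ALS code: `checked_cast` and its error messages are hypothetical, and the accepted inputs (integers and integral floats within the 32-bit signed range) are an assumption about what "permitted numeric values" means.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1


def checked_cast(value):
    """Convert a numeric ID to int, rejecting anything that would lose data."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError(f"ID must be numeric, got {type(value).__name__}")
    if isinstance(value, float) and not value.is_integer():
        # Also rejects NaN and infinities, for which is_integer() is False.
        raise ValueError(f"ID {value!r} is not an integer value")
    if not INT_MIN <= value <= INT_MAX:
        raise ValueError(f"ID {value!r} is out of Integer range")
    return int(value)


for bad in (2**31, 1.5, "42"):
    try:
        checked_cast(bad)
    except (TypeError, ValueError) as err:
        print(f"rejected {bad!r}: {err}")
```

Integers pass through without a round trip via a floating-point type, which is the unnecessary cast the change removes.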
      
      ## How was this patch tested?
      
      I tested it by running the unit-tests and by manually validating the result of checkedCast for various legal and illegal values.
      
      Author: Vasilis Vryniotis <bbriniotis@datumbox.com>
      
      Closes #17059 from datumbox/als_casting_fix.
      625cfe09
    • Felix Cheung's avatar
      [SPARK-18352][DOCS] wholeFile JSON update doc and programming guide · 8d6ef895
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Update doc for R, programming guide. Clarify default behavior for all languages.
      
      ## How was this patch tested?
      
      manually
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17128 from felixcheung/jsonwholefiledoc.
      8d6ef895
    • Mark Grover's avatar
      [SPARK-19734][PYTHON][ML] Correct OneHotEncoder doc string to say dropLast · d2a87976
      Mark Grover authored
      ## What changes were proposed in this pull request?
      Updates the doc string to match up with the code
      i.e. say dropLast instead of includeFirst
      
      ## How was this patch tested?
      Not much, since it's a doc-like change. Will run unit tests via Jenkins job.
      
      Author: Mark Grover <mark@apache.org>
      
      Closes #17127 from markgrover/spark_19734.
      d2a87976
    • Yun Ni's avatar
      [MINOR][ML] Fix comments in LSH Examples and Python API · 3bd8ddf7
      Yun Ni authored
      ## What changes were proposed in this pull request?
      Remove `org.apache.spark.examples.` in
      Add a slash in one of the Python docs.
      
      ## How was this patch tested?
      Run examples using the commands in the comments.
      
      Author: Yun Ni <yunn@uber.com>
      
      Closes #17104 from Yunni/yunn_minor.
      3bd8ddf7
    • windpiger's avatar
      [SPARK-19583][SQL] CTAS for data source table with a created location should succeed · de2b53df
      windpiger authored
      ## What changes were proposed in this pull request?
      
      ```
        spark.sql(
                s"""
                   |CREATE TABLE t
                   |USING parquet
                   |PARTITIONED BY(a, b)
                   |LOCATION '$dir'
                   |AS SELECT 3 as a, 4 as b, 1 as c, 2 as d
                 """.stripMargin)
      ```
      
      Failed with the error message:
      ```
      path file:/private/var/folders/6r/15tqm8hn3ldb3rmbfqm1gf4c0000gn/T/spark-195cd513-428a-4df9-b196-87db0c73e772 already exists.;
      org.apache.spark.sql.AnalysisException: path file:/private/var/folders/6r/15tqm8hn3ldb3rmbfqm1gf4c0000gn/T/spark-195cd513-428a-4df9-b196-87db0c73e772 already exists.;
      	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:102)
      ```
      whereas the same statement works for a Hive table, so we should fix it for data source tables.
      
      The reason is that the SaveMode check is done in `InsertIntoHadoopFsRelationCommand`, and that check actually uses `path`. This is fine when we use `DataFrameWriter.save()`, because in that case the SaveMode applies to `path`.
      
      However, when we use `CreateDataSourceAsSelectCommand`, the SaveMode applies to the table, and we have already done the SaveMode check for the table in `CreateDataSourceAsSelectCommand`. We therefore should not repeat the SaveMode check for the path in `InsertIntoHadoopFsRelationCommand`; it is redundant and incorrect for `CreateDataSourceAsSelectCommand`.
      
      After this PR, the following DDL succeeds; when the location already exists, we append to it or overwrite it.
      ```
      CREATE TABLE ... (PARTITIONED BY ...) LOCATION path AS SELECT ...
      ```
      
      ## How was this patch tested?
      unit test added
      
      Author: windpiger <songjun@outlook.com>
      
      Closes #16938 from windpiger/CTASDataSourceWitLocation.
      de2b53df