  1. Mar 23, 2017
    • [SPARK-19876][SS][WIP] OneTime Trigger Executor · 746a558d
      Tyson Condie authored
      ## What changes were proposed in this pull request?
      
      An additional trigger and trigger executor that will execute a single trigger only. One can use this OneTime trigger to have more control over the scheduling of triggers.
      
      In addition, this patch requires an addition to StreamExecution that logs a commit record at the end of successfully processing a batch. This new commit log is used to determine the next batch (offsets) to process after a restart, instead of the offset log itself; relying on the offset log alone would always re-process the previously logged batch, which would not permit a OneTime trigger.
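      
      For illustration, a minimal PySpark sketch of the intended usage (the `trigger(once=True)` writer option and the `rate` test source are assumptions for this sketch, not part of this patch's diff):
      
      ```python
      from pyspark.sql import SparkSession
      
      spark = SparkSession.builder.getOrCreate()
      
      # Process whatever data is available as a single batch, commit it, then stop.
      query = (spark.readStream
               .format("rate")                      # assumed built-in test source
               .load()
               .writeStream
               .format("console")
               .option("checkpointLocation", "/tmp/once-checkpoint")
               .trigger(once=True)                  # the OneTime trigger described above
               .start())
      query.awaitTermination()                      # returns once the single batch is committed
      ```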
      
      ## How was this patch tested?
      
      A number of existing tests have been revised. These tests all assumed that when restarting a stream, the last batch in the offset log is to be re-processed. Given that we now have a commit log that will tell us if that last batch was processed successfully, the results/assumptions of those tests needed to be revised accordingly.
      
      In addition, a OneTime trigger test was added to StreamingQuerySuite, which tests:
      - The semantics of OneTime trigger (i.e., on start, execute a single batch, then stop).
      - The case when the commit log was not able to successfully log the completion of a batch before restart, which would mean that we should fall back to what's in the offset log.
      - A OneTime trigger execution that results in an exception being thrown.
      
      marmbrus tdas zsxwing
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Tyson Condie <tcondie@gmail.com>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #17219 from tcondie/stream-commit.
      746a558d
    • [SPARK-18579][SQL] Use ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace options in CSV writing · 07c12c09
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to support _not_ trimming the white spaces when writing out. These options are `false` by default in the CSV reading path but `true` by default in the univocity parser's CSV writing path.
      
      Both `ignoreLeadingWhiteSpace` and `ignoreTrailingWhiteSpace` options are not being used for writing and therefore, we are always trimming the white spaces.
      
      It seems we should provide an easy way to keep these white spaces.
      
      With the data below:
      
      ```scala
      val df = spark.read.csv(Seq("a , b  , c").toDS)
      df.show()
      ```
      
      ```
      +---+----+---+
      |_c0| _c1|_c2|
      +---+----+---+
      | a | b  |  c|
      +---+----+---+
      ```
      
      **Before**
      
      ```scala
      df.write.csv("/tmp/text.csv")
      spark.read.text("/tmp/text.csv").show()
      ```
      
      ```
      +-----+
      |value|
      +-----+
      |a,b,c|
      +-----+
      ```
      
      It seems this can't be worked around via `quoteAll` either.
      
      ```scala
      df.write.option("quoteAll", true).csv("/tmp/text.csv")
      spark.read.text("/tmp/text.csv").show()
      ```
      ```
      +-----------+
      |      value|
      +-----------+
      |"a","b","c"|
      +-----------+
      ```
      
      **After**
      
      ```scala
      df.write.option("ignoreLeadingWhiteSpace", false).option("ignoreTrailingWhiteSpace", false).csv("/tmp/text.csv")
      spark.read.text("/tmp/text.csv").show()
      ```
      
      ```
      +----------+
      |     value|
      +----------+
      |a , b  , c|
      +----------+
      ```
      
      Note that this case is possible in R
      
      ```r
      > system("cat text.csv")
      f1,f2,f3
      a , b  , c
      > df <- read.csv(file="text.csv")
      > df
        f1   f2 f3
      1 a   b    c
      > write.csv(df, file="text1.csv", quote=F, row.names=F)
      > system("cat text1.csv")
      f1,f2,f3
      a , b  , c
      ```
      
      ## How was this patch tested?
      
      Unit tests in `CSVSuite` and manual tests for Python.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17310 from HyukjinKwon/SPARK-18579.
      07c12c09
  2. Mar 22, 2017
    • [SPARK-19949][SQL][FOLLOW-UP] Clean up parse modes and update related comments · 46581838
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to make the `mode` options in both CSV and JSON use `case object`s, and fixes some comments related to the previous fix.
      
      Also, this PR modifies some tests related to parse modes.
      
      ## How was this patch tested?
      
      Modified unit tests in both `CSVSuite.scala` and `JsonSuite.scala`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17377 from HyukjinKwon/SPARK-19949.
      46581838
  3. Mar 21, 2017
    • [SPARK-20041][DOC] Update docs for NaN handling in approxQuantile · 63f077fb
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Update docs for NaN handling in approxQuantile.
      
      ## How was this patch tested?
      existing tests.
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #17369 from zhengruifeng/doc_quantiles_nan.
      63f077fb
    • [SPARK-20011][ML][DOCS] Clarify documentation for ALS 'rank' parameter · 7620aed8
      christopher snow authored
      ## What changes were proposed in this pull request?
      
      API documentation and collaborative filtering documentation page changes to clarify inconsistent description of ALS rank parameter.
      
       - [DOCS] was previously: "rank is the number of latent factors in the model."
       - [API] was previously:  "rank - number of features to use"
      
      This change describes rank in both places consistently as:
      
       - "Number of features to use (also referred to as the number of latent factors)"
      
      Author: Chris Snow <chris.snowuk.ibm.com>
      
      Author: christopher snow <chsnow123@gmail.com>
      
      Closes #17345 from snowch/SPARK-20011.
      7620aed8
  4. Mar 20, 2017
    • [SPARK-19849][SQL] Support ArrayType in to_json to produce JSON array · 0cdcf911
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to support an array of struct type in `to_json` as below:
      
      ```scala
      import org.apache.spark.sql.functions._
      
      val df = Seq(Tuple1(Tuple1(1) :: Nil)).toDF("a")
      df.select(to_json($"a").as("json")).show()
      ```
      
      ```
      +----------+
      |      json|
      +----------+
      |[{"_1":1}]|
      +----------+
      ```
      
      Currently, it throws an exception as below (a newline manually inserted for readability):
      
      ```
      org.apache.spark.sql.AnalysisException: cannot resolve 'structtojson(`array`)' due to data type
      mismatch: structtojson requires that the expression is a struct expression.;;
      ```
      
      This allows the roundtrip with `from_json` as below:
      
      ```scala
      import org.apache.spark.sql.functions._
      import org.apache.spark.sql.types._
      
      val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
      val df = Seq("""[{"a":1}, {"a":2}]""").toDF("json").select(from_json($"json", schema).as("array"))
      df.show()
      
      // Read back.
      df.select(to_json($"array").as("json")).show()
      ```
      
      ```
      +----------+
      |     array|
      +----------+
      |[[1], [2]]|
      +----------+
      
      +-----------------+
      |             json|
      +-----------------+
      |[{"a":1},{"a":2}]|
      +-----------------+
      ```
      
      Also, this PR proposes to rename `StructToJson` to `StructsToJson` and `JsonToStruct` to `JsonToStructs`.
      
      ## How was this patch tested?
      
      Unit tests in `JsonFunctionsSuite` and `JsonExpressionsSuite` for Scala, doctest for Python and test in `test_sparkSQL.R` for R.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17192 from HyukjinKwon/SPARK-19849.
      0cdcf911
  5. Mar 17, 2017
    • [SPARK-19986][TESTS] Make pyspark.streaming.tests.CheckpointTests more stable · 376d7821
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Sometimes, CheckpointTests will hang on a busy machine because the streaming jobs are too slow and cannot catch up. I observed locally that the scheduling delay kept increasing to dozens of seconds.
      
      This PR increases the batch interval from 0.5 seconds to 2 seconds to generate fewer Spark jobs, which should make `pyspark.streaming.tests.CheckpointTests` more stable. I also replaced `sleep` with `awaitTerminationOrTimeout` so that if the streaming job fails, it also fails the test.
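      
      A rough sketch of replacing `sleep` with `awaitTerminationOrTimeout` (illustrative only; the queue-stream setup is a stand-in for the real checkpoint test):
      
      ```python
      from pyspark import SparkContext
      from pyspark.streaming import StreamingContext
      
      sc = SparkContext("local[2]", "checkpoint-sketch")
      ssc = StreamingContext(sc, 2)                        # 2-second batches, as in this PR
      ssc.queueStream([sc.parallelize(range(10))]).count().pprint()
      ssc.start()
      
      # Unlike time.sleep(), awaitTerminationOrTimeout surfaces a failed streaming
      # job as an exception, so the test fails instead of sleeping through the error.
      if not ssc.awaitTerminationOrTimeout(10):
          ssc.stop(stopSparkContext=True, stopGraceFully=True)
      ```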
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17323 from zsxwing/SPARK-19986.
      376d7821
  6. Mar 15, 2017
    • [SPARK-19872] [PYTHON] Use the correct deserializer for RDD construction for coalesce/repartition · 7387126f
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to use the correct deserializer, `BatchedSerializer`, for RDD construction in coalesce/repartition when shuffle is enabled. Currently, `UTF8Deserializer` is passed as-is instead of the `BatchedSerializer` from the copied RDD.
      
      With the file `text.txt` below:
      
      ```
      a
      b
      
      d
      e
      f
      g
      h
      i
      j
      k
      l
      
      ```
      
      - Before
      
      ```python
      >>> sc.textFile('text.txt').repartition(1).collect()
      ```
      
      ```
      UTF8Deserializer(True)
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark/python/pyspark/rdd.py", line 811, in collect
          return list(_load_from_socket(port, self._jrdd_deserializer))
        File ".../spark/python/pyspark/serializers.py", line 549, in load_stream
          yield self.loads(stream)
        File ".../spark/python/pyspark/serializers.py", line 544, in loads
          return s.decode("utf-8") if self.use_unicode else s
        File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
          return codecs.utf_8_decode(input, errors, True)
      UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
      ```
      
      - After
      
      ```python
      >>> sc.textFile('text.txt').repartition(1).collect()
      ```
      
      ```
      [u'a', u'b', u'', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l', u'']
      ```
      
      ## How was this patch tested?
      
      Unit test in `python/pyspark/tests.py`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17282 from HyukjinKwon/SPARK-19872.
      7387126f
    • [SPARK-19817][SS] Make it clear that `timeZone` is a general option in DataStreamReader/Writer · e1ac5534
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      
      Since the timezone setting can also affect partition values and applies to all formats, we should make this clear.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #17299 from lw-lin/timezone.
      e1ac5534
  7. Mar 14, 2017
    • [SPARK-19817][SQL] Make it clear that `timeZone` option is a general option in... · 7ded39c2
      Takuya UESHIN authored
      [SPARK-19817][SQL] Make it clear that `timeZone` option is a general option in DataFrameReader/Writer.
      
      ## What changes were proposed in this pull request?
      
      Since the timezone setting can also affect partition values and applies to all formats, we should make this clear.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #17281 from ueshin/issues/SPARK-19817.
      7ded39c2
  8. Mar 09, 2017
    • [SPARK-12334][SQL][PYSPARK] Support read from multiple input paths for orc... · cabe1df8
      Jeff Zhang authored
      [SPARK-12334][SQL][PYSPARK] Support read from multiple input paths for orc file in DataFrameReader.orc
      
      Besides the issue in the Spark API, this also fixes two minor issues in PySpark:
      - support reading from multiple input paths for ORC
      - support reading from multiple input paths for text
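      
      For example, after this change something like the following sketch should work in PySpark (the paths are placeholders, and the readers are assumed to accept a list of paths):
      
      ```python
      from pyspark.sql import SparkSession
      
      spark = SparkSession.builder.getOrCreate()
      
      # Read several ORC directories / text files in a single call instead of
      # unioning one DataFrame per path by hand.
      orc_df = spark.read.orc(["/data/orc/day=01", "/data/orc/day=02"])
      text_df = spark.read.text(["/data/logs/a.txt", "/data/logs/b.txt"])
      ```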
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #10307 from zjffdu/SPARK-12334.
      cabe1df8
    • [SPARK-19561][SQL] add int case handling for TimestampType · 206030bd
      Jason White authored
      ## What changes were proposed in this pull request?
      
      Add handling of input of type `Int` for dataType `TimestampType` to `EvaluatePython.scala`. Py4J serializes ints smaller than MIN_INT or larger than MAX_INT to Long, which are handled correctly already, but values between MIN_INT and MAX_INT are serialized to Int.
      
      These range limits correspond to roughly half an hour on either side of the epoch. As a result, PySpark doesn't allow TimestampType values to be created in this range.
      
      Alternatives attempted: patching the `TimestampType.toInternal` function to cast return values to `long`, so Py4J would always serialize them to Scala Long. Python3 does not have a `long` type, so this approach failed on Python3.
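      
      As a rough illustration of the affected range (the session setup and values below are illustrative only):
      
      ```python
      import datetime
      from pyspark.sql import SparkSession
      from pyspark.sql.types import StructType, StructField, TimestampType
      
      spark = SparkSession.builder.getOrCreate()
      
      # 2**31 microseconds is roughly 35 minutes, so a timestamp this close to the
      # epoch has an internal value that fits in 32 bits and is serialized by Py4J
      # as an Int rather than a Long (exact value depends on the session timezone).
      near_epoch = datetime.datetime(1970, 1, 1, 0, 1, 0)
      schema = StructType([StructField("ts", TimestampType())])
      spark.createDataFrame([(near_epoch,)], schema).show()
      ```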
      
      ## How was this patch tested?
      
      Added a new PySpark-side test that fails without the change.
      
      The contribution is my original work and I license the work to the project under the project’s open source license.
      
      Resubmission of https://github.com/apache/spark/pull/16896. The original PR didn't go through Jenkins and broke the build. davies dongjoon-hyun
      
      cloud-fan Could you kick off a Jenkins run for me? It passed everything for me locally, but it's possible something has changed in the last few weeks.
      
      Author: Jason White <jason.white@shopify.com>
      
      Closes #17200 from JasonMWhite/SPARK-19561.
      206030bd
  9. Mar 08, 2017
  10. Mar 07, 2017
  11. Mar 05, 2017
    • [SPARK-19701][SQL][PYTHON] Throws a correct exception for 'in' operator against column · 224e0e78
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to remove an incorrect implementation that has not been executed so far (at least since Spark 1.5.2) for the `in` operator, and to throw a correct exception rather than saying it is a bool. I tested the code below in 1.5.2, 1.6.3, 2.1.0 and the master branch:
      
      **1.5.2**
      
      ```python
      >>> df = sqlContext.createDataFrame([[1]])
      >>> 1 in df._1
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark-1.5.2-bin-hadoop2.6/python/pyspark/sql/column.py", line 418, in __nonzero__
          raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
      ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
      ```
      
      **1.6.3**
      
      ```python
      >>> 1 in sqlContext.range(1).id
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark-1.6.3-bin-hadoop2.6/python/pyspark/sql/column.py", line 447, in __nonzero__
          raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
      ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
      ```
      
      **2.1.0**
      
      ```python
      >>> 1 in spark.range(1).id
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/column.py", line 426, in __nonzero__
          raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
      ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
      ```
      
      **Current Master**
      
      ```python
      >>> 1 in spark.range(1).id
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark/python/pyspark/sql/column.py", line 452, in __nonzero__
          raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
      ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
      ```
      
      **After**
      
      ```python
      >>> 1 in spark.range(1).id
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark/python/pyspark/sql/column.py", line 184, in __contains__
          raise ValueError("Cannot apply 'in' operator against a column: please use 'contains' "
      ValueError: Cannot apply 'in' operator against a column: please use 'contains' in a string column or 'array_contains' function for an array column.
      ```
      
      In more details,
      
      It seems the implementation intended to support this
      
      ```python
      1 in df.column
      ```
      
      However, currently, it throws an exception as below:
      
      ```python
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark/python/pyspark/sql/column.py", line 426, in __nonzero__
          raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
      ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
      ```
      
      What happens here is as below:
      
      ```python
      class Column(object):
          def __contains__(self, item):
              print "I am contains"
              return Column()
          def __nonzero__(self):
              raise Exception("I am nonzero.")
      
      >>> 1 in Column()
      I am contains
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "<stdin>", line 6, in __nonzero__
      Exception: I am nonzero.
      ```
      
      It seems it calls `__contains__` first and then `__nonzero__` or `__bool__` is being called against `Column()` to make this a bool (or int to be specific).
      
      It seems `__nonzero__` (for Python 2), `__bool__` (for Python 3) and `__contains__` force the return into a bool, unlike other operators. There are a few references about this below:
      
      https://bugs.python.org/issue16011
      http://stackoverflow.com/questions/12244074/python-source-code-for-built-in-in-operator/12244378#12244378
      http://stackoverflow.com/questions/38542543/functionality-of-python-in-vs-contains/38542777
      
      It seems we can't override `__nonzero__` or `__bool__` as a workaround to make this work, because these force the return type to be a bool, as below:
      
      ```python
      class Column(object):
          def __contains__(self, item):
              print "I am contains"
              return Column()
          def __nonzero__(self):
              return "a"
      
      >>> 1 in Column()
      I am contains
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      TypeError: __nonzero__ should return bool or int, returned str
      ```
      
      ## How was this patch tested?
      
      Added unit tests in `tests.py`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17160 from HyukjinKwon/SPARK-19701.
      224e0e78
    • [SPARK-19595][SQL] Support json array in from_json · 369a148e
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to both,
      
      **Do not allow json arrays with multiple elements and return null in `from_json` with `StructType` as the schema.**
      
      Currently, it only reads a single row when the input is a JSON array. So, the code below:
      
      ```scala
      import org.apache.spark.sql.functions._
      import org.apache.spark.sql.types._
      val schema = StructType(StructField("a", IntegerType) :: Nil)
      Seq(("""[{"a": 1}, {"a": 2}]""")).toDF("struct").select(from_json(col("struct"), schema)).show()
      ```
      prints
      
      ```
      +--------------------+
      |jsontostruct(struct)|
      +--------------------+
      |                 [1]|
      +--------------------+
      ```
      
      This PR simply suggests returning `null` if the schema is `StructType` and the input is a JSON array with multiple elements:
      
      ```
      +--------------------+
      |jsontostruct(struct)|
      +--------------------+
      |                null|
      +--------------------+
      ```
      
      **Support json arrays in `from_json` with `ArrayType` as the schema.**
      
      ```scala
      import org.apache.spark.sql.functions._
      import org.apache.spark.sql.types._
      val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
      Seq(("""[{"a": 1}, {"a": 2}]""")).toDF("array").select(from_json(col("array"), schema)).show()
      ```
      
      prints
      
      ```
      +-------------------+
      |jsontostruct(array)|
      +-------------------+
      |         [[1], [2]]|
      +-------------------+
      ```
      
      ## How was this patch tested?
      
      Unit test in `JsonExpressionsSuite`, `JsonFunctionsSuite`, Python doctests and manual test.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16929 from HyukjinKwon/disallow-array.
      369a148e
  12. Mar 03, 2017
    • [SPARK-19348][PYTHON] PySpark keyword_only decorator is not thread-safe · 44281ca8
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      The `keyword_only` decorator in PySpark is not thread-safe.  It writes kwargs to a static class variable in the decorator, which is then retrieved later in the class method as `_input_kwargs`.  If multiple threads are constructing the same class with different kwargs, it becomes a race condition to read from the static class variable before it's overwritten.  See [SPARK-19348](https://issues.apache.org/jira/browse/SPARK-19348) for reproduction code.
      
      This change will write the kwargs to a member variable so that multiple threads can operate on separate instances without the race condition.  It does not protect against multiple threads operating on a single instance, but that is better left to the user to synchronize.
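      
      A minimal sketch of the per-instance pattern (illustrative, not a verbatim copy of the patch):
      
      ```python
      import functools
      
      def keyword_only(func):
          """Force keyword arguments and stash them on the instance, not the class."""
          @functools.wraps(func)
          def wrapper(self, *args, **kwargs):
              if len(args) > 0:
                  raise TypeError("Method %s forces keyword arguments." % func.__name__)
              self._input_kwargs = kwargs   # per-instance state: no race between threads
              return func(self, **kwargs)
          return wrapper
      ```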
      
      ## How was this patch tested?
      Added new unit tests for using the keyword_only decorator and a regression test that verifies `_input_kwargs` can be overwritten from different class instances.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #16782 from BryanCutler/pyspark-keyword_only-threadsafe-SPARK-19348.
      44281ca8
  13. Mar 02, 2017
    • [SPARK-18352][DOCS] wholeFile JSON update doc and programming guide · 8d6ef895
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Update doc for R, programming guide. Clarify default behavior for all languages.
      
      ## How was this patch tested?
      
      manually
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17128 from felixcheung/jsonwholefiledoc.
      8d6ef895
    • [SPARK-19734][PYTHON][ML] Correct OneHotEncoder doc string to say dropLast · d2a87976
      Mark Grover authored
      ## What changes were proposed in this pull request?
      Updates the doc string to match the code, i.e. say `dropLast` instead of `includeFirst`.
      
      ## How was this patch tested?
      Not much, since it's a doc-like change. Will run unit tests via Jenkins job.
      
      Author: Mark Grover <mark@apache.org>
      
      Closes #17127 from markgrover/spark_19734.
      d2a87976
    • [MINOR][ML] Fix comments in LSH Examples and Python API · 3bd8ddf7
      Yun Ni authored
      ## What changes were proposed in this pull request?
      Remove `org.apache.spark.examples.` in the LSH example comments, and add a slash in one of the Python docs.
      
      ## How was this patch tested?
      Run examples using the commands in the comments.
      
      Author: Yun Ni <yunn@uber.com>
      
      Closes #17104 from Yunni/yunn_minor.
      3bd8ddf7
  14. Feb 28, 2017
  15. Feb 24, 2017
    • [SPARK-13330][PYSPARK] PYTHONHASHSEED is not propgated to python worker · 330c3e33
      Jeff Zhang authored
      ## What changes were proposed in this pull request?
      `self.environment` will be propagated to the executors. `PYTHONHASHSEED` should be set as long as the Python version is greater than 3.3.
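      
      A hedged user-side workaround sketch (not the patch itself) that forwards a fixed seed to executors through the standard `spark.executorEnv.*` mechanism:
      
      ```python
      import os
      from pyspark import SparkConf, SparkContext
      
      # Pin the driver's PYTHONHASHSEED and forward it to the Python workers so that
      # hash-partitioned operations on strings behave consistently on Python 3.3+.
      os.environ.setdefault("PYTHONHASHSEED", "0")
      conf = SparkConf().set("spark.executorEnv.PYTHONHASHSEED", os.environ["PYTHONHASHSEED"])
      sc = SparkContext(conf=conf)
      
      counts = sc.parallelize(["a", "b", "a"]).map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)
      print(counts.collect())
      ```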
      
      ## How was this patch tested?
      Manually tested it.
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #11211 from zjffdu/SPARK-13330.
      330c3e33
    • [SPARK-19161][PYTHON][SQL] Improving UDF Docstrings · 4a5e38f5
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Replaces `UserDefinedFunction` object returned from `udf` with a function wrapper providing docstring and arguments information as proposed in [SPARK-19161](https://issues.apache.org/jira/browse/SPARK-19161).
      
      ### Backward incompatible changes:
      
      - `pyspark.sql.functions.udf` will return a `function` instead of `UserDefinedFunction`. To ensure backward compatible public API we use function attributes to mimic  `UserDefinedFunction` API (`func` and `returnType` attributes).  This should have a minimal impact on the user code.
      
        An alternative implementation could use dynamical sub-classing. This would ensure full backward compatibility but is more fragile in practice.
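      
      For instance, a sketch of the user-visible effect described above (assuming the wrapper exposes `func` and `returnType` as stated):
      
      ```python
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import udf
      from pyspark.sql.types import IntegerType
      
      spark = SparkSession.builder.getOrCreate()
      
      def add_one(x):
          """Adds one to the input if it is not None."""
          return x + 1 if x is not None else None
      
      add_one_udf = udf(add_one, IntegerType())
      print(add_one_udf.__doc__)              # the user's docstring, not UserDefinedFunction's
      print(add_one_udf.func is add_one)      # True: compatibility attribute
      print(add_one_udf.returnType)           # IntegerType
      ```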
      
      ### Limitations:
      
      Full functionality (retained docstring and argument list) is achieved only on recent Python versions. Legacy Python versions will preserve only docstrings, not the argument list. This should be an acceptable trade-off between the achieved improvements and overall complexity.
      
      ### Possible impact on other tickets:
      
      This can affect [SPARK-18777](https://issues.apache.org/jira/browse/SPARK-18777).
      
      ## How was this patch tested?
      
      Existing unit tests to ensure backward compatibility, additional tests targeting proposed changes.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16534 from zero323/SPARK-19161.
      4a5e38f5
  16. Feb 23, 2017
    • [SPARK-14772][PYTHON][ML] Fixed Params.copy method to match Scala implementation · 2f69e3f6
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      Fixed the PySpark Params.copy method to behave like the Scala implementation.  The main issue was that it did not account for the _defaultParamMap and merged it into the explicitly created param map.
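      
      A small illustrative check of the intended behaviour (the estimator and values are arbitrary):
      
      ```python
      from pyspark.sql import SparkSession
      from pyspark.ml.feature import Binarizer
      
      spark = SparkSession.builder.getOrCreate()
      
      b = Binarizer(threshold=0.5)
      b2 = b.copy({b.outputCol: "features_bin"})
      
      assert b2.uid == b.uid                       # the copy keeps the original uid
      assert b2.getThreshold() == 0.5              # explicitly set param is preserved
      assert b2.getOutputCol() == "features_bin"   # extra param is applied to the copy
      ```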
      
      ## How was this patch tested?
      Added new unit test to verify the copy method behaves correctly for copying uid, explicitly created params, and default params.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #16772 from BryanCutler/pyspark-ml-param_copy-Scala_sync-SPARK-14772.
      2f69e3f6
    • [SPARK-19706][PYSPARK] add Column.contains in pyspark · 4fa4cf1d
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      to be consistent with the scala API, we should also add `contains` to `Column` in pyspark.
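      
      For example, a minimal sketch:
      
      ```python
      from pyspark.sql import SparkSession
      
      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([("Alice",), ("Bob",)], ["name"])
      
      # Keep rows whose name contains the substring "li", mirroring Scala's Column.contains.
      df.filter(df.name.contains("li")).show()
      ```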
      
      ## How was this patch tested?
      
      updated unit test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17036 from cloud-fan/pyspark.
      4fa4cf1d
    • [SPARK-18699][SQL] Put malformed tokens into a new field when parsing CSV data · 09ed6e77
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR adds logic to put malformed tokens into a new field when parsing CSV data in permissive mode. In the current master, if the CSV parser hits such malformed tokens, it throws the exception below (and then the job fails):
      ```
      Caused by: java.lang.IllegalArgumentException
      	at java.sql.Date.valueOf(Date.java:143)
      	at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
      	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272)
      	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
      	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
      	at scala.util.Try.getOrElse(Try.scala:79)
      	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269)
      	at
      ```
      When users load large CSV-formatted data, this job failure can be confusing. So, this fix sets NULL for the original columns and puts the malformed tokens into a new field.
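      
      A hedged PySpark sketch of the intended permissive behaviour (the corrupt-record option and field names below are assumed to mirror the JSON reader's, and the path is a placeholder):
      
      ```python
      from pyspark.sql import SparkSession
      from pyspark.sql.types import StructType, StructField, IntegerType, DateType, StringType
      
      spark = SparkSession.builder.getOrCreate()
      
      schema = StructType([
          StructField("a", IntegerType()),
          StructField("b", DateType()),
          StructField("_corrupt_record", StringType()),   # receives the malformed line
      ])
      
      df = (spark.read
            .option("mode", "PERMISSIVE")
            .option("columnNameOfCorruptRecord", "_corrupt_record")
            .schema(schema)
            .csv("/path/to/possibly-malformed.csv"))
      df.show()
      ```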
      
      ## How was this patch tested?
      Added tests in `CSVSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #16928 from maropu/SPARK-18699-2.
      09ed6e77
    • [SPARK-19497][SS] Implement streaming deduplication · 9bf4e2ba
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR adds a special streaming deduplication operator to support `dropDuplicates` with `aggregation` and watermark. It reuses the `dropDuplicates` API but creates new logical plan `Deduplication` and new physical plan `DeduplicationExec`.
      
      The following cases are supported:
      
      - one or multiple `dropDuplicates()` without aggregation (with or without watermark)
      - `dropDuplicates` before aggregation
      
      Not supported cases:
      
      - `dropDuplicates` after aggregation
      
      Breaking changes:
      - `dropDuplicates` without aggregation doesn't work with `complete` or `update` mode.
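      
      For example, a supported pattern looks roughly like this in PySpark (a minimal sketch; the `rate` source and checkpoint path are placeholders):
      
      ```python
      from pyspark.sql import SparkSession
      
      spark = SparkSession.builder.getOrCreate()
      events = spark.readStream.format("rate").load()      # columns: timestamp, value
      
      # Deduplicate on the chosen columns; the watermark bounds how much state is kept.
      deduped = (events
                 .withWatermark("timestamp", "10 minutes")
                 .dropDuplicates(["value", "timestamp"]))
      
      query = (deduped.writeStream
               .format("console")
               .outputMode("append")
               .option("checkpointLocation", "/tmp/dedup-checkpoint")
               .start())
      ```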
      
      ## How was this patch tested?
      
      The new unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16970 from zsxwing/dedup.
      9bf4e2ba
  17. Feb 22, 2017
    • [SPARK-19405][STREAMING] Support for cross-account Kinesis reads via STS · e4065376
      Adam Budde authored
      - Add dependency on aws-java-sdk-sts
      - Replace SerializableAWSCredentials with new SerializableCredentialsProvider interface
      - Make KinesisReceiver take SerializableCredentialsProvider as argument and
        pass credential provider to KCL
      - Add new implementations of KinesisUtils.createStream() that take STS
        arguments
      - Make JavaKinesisStreamSuite test the entire KinesisUtils Java API
      - Update KCL/AWS SDK dependencies to 1.7.x/1.11.x
      
      ## What changes were proposed in this pull request?
      
      [JIRA link with detailed description.](https://issues.apache.org/jira/browse/SPARK-19405)
      
      * Replace SerializableAWSCredentials with new SerializableKCLAuthProvider class that takes 5 optional config params for configuring AWS auth and returns the appropriate credential provider object
      * Add new public createStream() APIs for specifying these parameters in KinesisUtils
      
      ## How was this patch tested?
      
      * Manually tested using explicit keypair and instance profile to read data from Kinesis stream in separate account (difficult to write a test orchestrating creation and assumption of IAM roles across separate accounts)
      * Expanded JavaKinesisStreamSuite to test the entire Java API in KinesisUtils
      
      ## License acknowledgement
      This contribution is my original work and I license the work to the project under the project’s open source license.
      
      Author: Budde <budde@amazon.com>
      
      Closes #16744 from budde/master.
      e4065376
  18. Feb 17, 2017
  19. Feb 16, 2017
    • [SPARK-18352][SQL] Support parsing multiline json files · 21fde57f
      Nathan Howell authored
      ## What changes were proposed in this pull request?
      
      If a new option `wholeFile` is set to `true` the JSON reader will parse each file (instead of a single line) as a value. This is done with Jackson streaming and it should be capable of parsing very large documents, assuming the row will fit in memory.
      
      Because the file is not buffered in memory the corrupt record handling is also slightly different when `wholeFile` is enabled: the corrupt column will contain the filename instead of the literal JSON if there is a parsing failure. It would be easy to extend this to add the parser location (line, column and byte offsets) to the output if desired.
      
      These changes have allowed types other than `String` to be parsed. Support for `UTF8String` and `Text` has been added (alongside `String` and `InputFormat`), and these no longer require a conversion to `String` just for parsing.
      
      I've also included a few other changes that generate slightly better bytecode and (imo) make it more obvious when and where boxing is occurring in the parser. These are included as separate commits; let me know if they should be flattened into this PR or moved to a new one.
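      
      A reader-side sketch of the new option (the path is a placeholder):
      
      ```python
      from pyspark.sql import SparkSession
      
      spark = SparkSession.builder.getOrCreate()
      
      # Parse each file as one JSON value (e.g. pretty-printed, multi-line records)
      # instead of requiring one JSON object per line.
      people = spark.read.option("wholeFile", True).json("/path/to/multiline-json/")
      people.show()
      ```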
      
      ## How was this patch tested?
      
      New and existing unit tests. No performance or load tests have been run.
      
      Author: Nathan Howell <nhowell@godaddy.com>
      
      Closes #16386 from NathanHowell/SPARK-18352.
      21fde57f
  20. Feb 15, 2017
    • [SPARK-18080][ML][PYTHON] Python API & Examples for Locality Sensitive Hashing · 08c1972a
      Yun Ni authored
      ## What changes were proposed in this pull request?
      This pull request includes the Python API and examples for LSH. The API changes are based on yanboliang's PR #15768, with conflicts resolved and the API aligned with the Scala changes. The examples are consistent with the Scala examples of MinHashLSH and BucketedRandomProjectionLSH.
      
      ## How was this patch tested?
      API and examples are tested using spark-submit:
      `bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py`
      `bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh.py`
      
      User guide changes are generated and manually inspected:
      `SKIP_API=1 jekyll build`
      
      Author: Yun Ni <yunn@uber.com>
      Author: Yanbo Liang <ybliang8@gmail.com>
      Author: Yunni <Euler57721@gmail.com>
      
      Closes #16715 from Yunni/spark-18080.
      08c1972a
    • [SPARK-19604][TESTS] Log the start of every Python test · f6c3bba2
      Yin Huai authored
      ## What changes were proposed in this pull request?
      Right now, we only log at info level after we finish the tests of a Python test file. We should also log the start of a test, so that if a test is hanging we can tell which test file is running.
      
      ## How was this patch tested?
      This is a change for python tests.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #16935 from yhuai/SPARK-19604.
      f6c3bba2
    • [SPARK-18937][SQL] Timezone support in CSV/JSON parsing · 865b2fd8
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up pr of #16308.
      
      This pr enables timezone support in CSV/JSON parsing.
      
      We should introduce `timeZone` option for CSV/JSON datasources (the default value of the option is session local timezone).
      
      The datasources should use the `timeZone` option to format/parse to write/read timestamp values.
      Notice that while reading, if the timestampFormat has the timezone info, the timezone will not be used because we should respect the timezone in the values.
      
      For example, if you have timestamp `"2016-01-01 00:00:00"` in `GMT`, the values written with the default timezone option, which is `"GMT"` because session local timezone is `"GMT"` here, are:
      
      ```scala
      scala> spark.conf.set("spark.sql.session.timeZone", "GMT")
      
      scala> val df = Seq(new java.sql.Timestamp(1451606400000L)).toDF("ts")
      df: org.apache.spark.sql.DataFrame = [ts: timestamp]
      
      scala> df.show()
      +-------------------+
      |ts                 |
      +-------------------+
      |2016-01-01 00:00:00|
      +-------------------+
      
      scala> df.write.json("/path/to/gmtjson")
      ```
      
      ```sh
      $ cat /path/to/gmtjson/part-*
      {"ts":"2016-01-01T00:00:00.000Z"}
      ```
      
      whereas setting the option to `"PST"`, they are:
      
      ```scala
      scala> df.write.option("timeZone", "PST").json("/path/to/pstjson")
      ```
      
      ```sh
      $ cat /path/to/pstjson/part-*
      {"ts":"2015-12-31T16:00:00.000-08:00"}
      ```
      
      We can properly read these files even if the timezone option is wrong because the timestamp values have timezone info:
      
      ```scala
      scala> val schema = new StructType().add("ts", TimestampType)
      schema: org.apache.spark.sql.types.StructType = StructType(StructField(ts,TimestampType,true))
      
      scala> spark.read.schema(schema).json("/path/to/gmtjson").show()
      +-------------------+
      |ts                 |
      +-------------------+
      |2016-01-01 00:00:00|
      +-------------------+
      
      scala> spark.read.schema(schema).option("timeZone", "PST").json("/path/to/gmtjson").show()
      +-------------------+
      |ts                 |
      +-------------------+
      |2016-01-01 00:00:00|
      +-------------------+
      ```
      
      And even if `timestampFormat` doesn't contain timezone info, we can properly read the values by setting the correct timezone option:
      
      ```scala
      scala> df.write.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("timeZone", "JST").json("/path/to/jstjson")
      ```
      
      ```sh
      $ cat /path/to/jstjson/part-*
      {"ts":"2016-01-01T09:00:00"}
      ```
      
      ```scala
      // wrong result
      scala> spark.read.schema(schema).option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").json("/path/to/jstjson").show()
      +-------------------+
      |ts                 |
      +-------------------+
      |2016-01-01 09:00:00|
      +-------------------+
      
      // correct result
      scala> spark.read.schema(schema).option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("timeZone", "JST").json("/path/to/jstjson").show()
      +-------------------+
      |ts                 |
      +-------------------+
      |2016-01-01 00:00:00|
      +-------------------+
      ```
      
      This pr also makes `JsonToStruct` and `StructToJson` `TimeZoneAwareExpression` to be able to evaluate values with timezone option.
      
      ## How was this patch tested?
      
      Existing tests and added some tests.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #16750 from ueshin/issues/SPARK-18937.
      865b2fd8
    • [SPARK-19399][SPARKR] Add R coalesce API for DataFrame and Column · 671bc08e
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Add `coalesce` on DataFrame for reducing the number of partitions without a shuffle, and `coalesce` on Column.
      
      ## How was this patch tested?
      
      manual, unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16739 from felixcheung/rcoalesce.
      671bc08e
    • [SPARK-19160][PYTHON][SQL] Add udf decorator · c97f4e17
      zero323 authored
      ## What changes were proposed in this pull request?
      
      This PR adds `udf` decorator syntax as proposed in [SPARK-19160](https://issues.apache.org/jira/browse/SPARK-19160).
      
      This allows users to define UDF using simplified syntax:
      
      ```python
      from pyspark.sql.decorators import udf
      from pyspark.sql.types import IntegerType
      
      @udf(IntegerType())
      def add_one(x):
          """Adds one"""
          if x is not None:
              return x + 1
      ```
      
      without the need to define a separate function and then wrap it with `udf`.
      
      ## How was this patch tested?
      
      Existing unit tests to ensure backward compatibility and additional unit tests covering new functionality.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16533 from zero323/SPARK-19160.
      c97f4e17
    • [SPARK-19590][PYSPARK][ML] Update the document for QuantileDiscretizer in pyspark · 6eca21ba
      VinceShieh authored
      ## What changes were proposed in this pull request?
      This PR is to document the changes on QuantileDiscretizer in pyspark for PR:
      https://github.com/apache/spark/pull/15428
      
      ## How was this patch tested?
      No test needed
      
      Signed-off-by: VinceShieh <vincent.xieintel.com>
      
      Author: VinceShieh <vincent.xie@intel.com>
      
      Closes #16922 from VinceShieh/spark-19590.
      6eca21ba
  21. Feb 14, 2017
    • [SPARK-18541][PYTHON] Add metadata parameter to pyspark.sql.Column.alias() · 7b64f7aa
      Sheamus K. Parkes authored
      ## What changes were proposed in this pull request?
      
      Add a `metadata` keyword parameter to `pyspark.sql.Column.alias()` to allow users to mix-in metadata while manipulating `DataFrame`s in `pyspark`.  Without this, I believe it was necessary to pass back through `SparkSession.createDataFrame` each time a user wanted to manipulate `StructField.metadata` in `pyspark`.
      
      This pull request also improves consistency between the Scala and Python APIs (i.e. I did not add any functionality that was not already in the Scala API).
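      
      For example, a minimal sketch:
      
      ```python
      from pyspark.sql import SparkSession, functions as F
      
      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([(1, 2)], ["a", "b"])
      
      # Attach metadata while aliasing, without a round trip through createDataFrame.
      df2 = df.select(F.col("a").alias("a_described", metadata={"comment": "original column a"}))
      print(df2.schema.fields[0].metadata)   # {'comment': 'original column a'}
      ```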
      
      Discussed ahead of time on JIRA with marmbrus
      
      ## How was this patch tested?
      
      Added unit tests (and doc tests).  Ran the pertinent tests manually.
      
      Author: Sheamus K. Parkes <shea.parkes@milliman.com>
      
      Closes #16094 from shea-parkes/pyspark-column-alias-metadata.
      7b64f7aa