  1. Apr 22, 2017
    • [SPARK-20132][DOCS] Add documentation for column string functions · 8765bc17
      Michael Patterson authored
      ## What changes were proposed in this pull request?
      Add docstrings to column.py for the Column functions `rlike`, `like`, `startswith`, and `endswith`, and pass these docstrings through `_bin_op`.
      
      There may be a better place to put the docstrings. I put them immediately above the Column class.
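
      For illustration, here is a minimal sketch (not the actual Spark source; names simplified) of how a docstring can be attached to an operator method through a `_bin_op`-style factory:

      ```python
      # Sketch only: a factory that builds a binary-operator method and attaches a docstring.
      def _bin_op(name, doc="binary operator"):
          def _(self, other):
              # Delegate to the underlying JVM column method called `name`.
              return Column(getattr(self._jc, name)(other))
          _.__doc__ = doc
          return _

      _rlike_doc = "Return a Column of booleans by matching a SQL regex pattern."

      class Column(object):
          def __init__(self, jc=None):
              self._jc = jc

          rlike = _bin_op("rlike", _rlike_doc)
      ```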
      
      ## How was this patch tested?
      
      I ran `make html` on my local computer to rebuild the documentation and verified that the HTML pages displayed the docstrings correctly. I tried running `dev-tests`, and the formatting tests passed. However, my mvn build didn't work, I think due to issues on my computer.
      
      These docstrings are my original work and are freely licensed to the project.
      
      davies has done the most recent work reorganizing `_bin_op`
      
      Author: Michael Patterson <map222@gmail.com>
      
      Closes #17469 from map222/patterson-documentation.
  3. Apr 13, 2017
    • [SPARK-20232][PYTHON] Improve combineByKey docs · 8ddf0d2a
      David Gingrich authored
      ## What changes were proposed in this pull request?
      
      Improve combineByKey documentation:
      
      * Add note on memory allocation
      * Change example code to use different mergeValue and mergeCombiners (see the sketch below)
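
      A hypothetical standalone example in the spirit of the updated docs, where `mergeValue` and `mergeCombiners` do visibly different work:

      ```python
      from pyspark import SparkContext

      sc = SparkContext("local[2]", "combineByKey-sketch")
      rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])

      def to_list(v):            # createCombiner: start a new list for a key
          return [v]

      def append(acc, v):        # mergeValue: fold one more value into an existing combiner
          acc.append(v)
          return acc

      def extend(left, right):   # mergeCombiners: merge combiners built on different partitions
          left.extend(right)
          return left

      print(sorted(rdd.combineByKey(to_list, append, extend).collect()))
      # [('a', [1, 2]), ('b', [1])]
      sc.stop()
      ```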
      
      ## How was this patch tested?
      
      Doctest.
      
      ## Legal
      
      This is my original work and I license the work to the project under the project’s open source license.
      
      Author: David Gingrich <david@textio.com>
      
      Closes #17545 from dgingrich/topic-spark-20232-combinebykey-docs.
  4. Apr 12, 2017
    • [SPARK-19570][PYSPARK] Allow to disable hive in pyspark shell · 99a94731
      Jeff Zhang authored
      ## What changes were proposed in this pull request?
      
      SPARK-15236 did this for the Scala shell; this ticket is for the PySpark shell. This is not only for PySpark itself, but can also benefit downstream projects like Livy, which use shell.py for their interactive sessions. For now, Livy has no control over whether Hive is enabled or not.
      
      ## How was this patch tested?
      
      I didn't find a way to add a test for it, so I tested it manually:
      run `bin/pyspark --master local --conf spark.sql.catalogImplementation=in-memory` and verify that Hive is not enabled.
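
      A rough sketch (assumed shape, not the exact shell.py change) of how the shell can honor `spark.sql.catalogImplementation` when building its session:

      ```python
      from pyspark.conf import SparkConf
      from pyspark.sql import SparkSession

      conf = SparkConf()
      # Only enable Hive support when the catalog implementation is (or defaults to) "hive".
      if conf.get("spark.sql.catalogImplementation", "hive").lower() == "hive":
          spark = SparkSession.builder.enableHiveSupport().getOrCreate()
      else:
          spark = SparkSession.builder.getOrCreate()
      ```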
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #16906 from zjffdu/SPARK-19570.
    • [MINOR][DOCS] JSON APIs related documentation fixes · bca4259f
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes corrections related to JSON APIs as below:
      
      - Rendering links in Python documentation
      - Replacing `RDD` with `Dataset` in the programming guide
      - Adding a missing description about JSON Lines consistently to `DataFrameReader.json` in the Python API
      - De-duplicating a little bit of `DataFrameReader.json` in the Scala/Java API
      
      ## How was this patch tested?
      
      Manually built the documentation via `jekyll build`. Corresponding snapshots will be left on the changed code.
      
      Note that currently there are Javadoc8 breaks in several places. These are proposed to be handled in https://github.com/apache/spark/pull/17477. So, this PR does not fix those.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17602 from HyukjinKwon/minor-json-documentation.
  5. Apr 11, 2017
    • [SPARK-19505][PYTHON] AttributeError on Exception.message in Python3 · 6297697f
      David Gingrich authored
      ## What changes were proposed in this pull request?
      
      Added `util._message_exception` helper to use `str(e)` when `e.message` is unavailable (Python3).  Grepped for all occurrences of `.message` in `pyspark/` and these were the only occurrences.
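
      A sketch of the helper's idea (the name comes from this description; the exact implementation may differ):

      ```python
      def _message_exception(excp):
          # Python 2 exceptions expose .message; Python 3 exceptions do not, so fall back to str().
          if hasattr(excp, "message"):
              return excp.message
          return str(excp)
      ```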
      
      ## How was this patch tested?
      
      - Doctests for helper function
      
      ## Legal
      
      This is my original work and I license the work to the project under the project’s open source license.
      
      Author: David Gingrich <david@textio.com>
      
      Closes #16845 from dgingrich/topic-spark-19505-py3-exceptions.
  6. Apr 10, 2017
    • [SPARK-20285][TESTS] Increase the pyspark streaming test timeout to 30 seconds · f9a50ba2
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Saw the following failure locally:
      
      ```
      Traceback (most recent call last):
        File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, in test_cogroup
          self._test_func(input, func, expected, sort=True, input2=input2)
        File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, in _test_func
          self.assertEqual(expected, result)
      AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != []
      
      First list contains 3 additional elements.
      First extra element 0:
      [(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))]
      
      + []
      - [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))],
      -  [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))],
      -  [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]]
      ```
      
      It also happened on Jenkins: http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120
      
      This happens because, when the machine is overloaded, the timeout is not long enough. This PR just increases the timeout to 30 seconds.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17597 from zsxwing/SPARK-20285.
  9. Apr 05, 2017
    • [SPARK-20214][ML] Make sure converted csc matrix has sorted indices · 12206058
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      `_convert_to_vector` converts a scipy sparse matrix to a csc matrix for initializing `SparseVector`. However, it doesn't guarantee that the converted csc matrix has sorted indices, so a failure happens when you do something like this:
      
          from pyspark.mllib.linalg import _convert_to_vector
          from scipy.sparse import lil_matrix

          lil = lil_matrix((4, 1))
          lil[1, 0] = 1
          lil[3, 0] = 2
          _convert_to_vector(lil.todok())
      
          File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 78, in _convert_to_vector
            return SparseVector(l.shape[0], csc.indices, csc.data)
          File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 556, in __init__
            % (self.indices[i], self.indices[i + 1]))
          TypeError: Indices 3 and 1 are not strictly increasing
      
      A simple test can confirm that `dok_matrix.tocsc()` won't guarantee sorted indices:
      
          >>> from scipy.sparse import lil_matrix
          >>> lil = lil_matrix((4, 1))
          >>> lil[1, 0] = 1
          >>> lil[3, 0] = 2
          >>> dok = lil.todok()
          >>> csc = dok.tocsc()
          >>> csc.has_sorted_indices
          0
          >>> csc.indices
          array([3, 1], dtype=int32)
      
      I checked the scipy source code. The only conversions that guarantee sorted indices are `csc_matrix.tocsr()` and `csr_matrix.tocsc()`.
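
      A hedged sketch of working around this on the SciPy side before handing the matrix to `SparseVector` (either a round-trip through CSR, or `sort_indices()`, leaves the indices sorted):

      ```python
      from scipy.sparse import lil_matrix

      lil = lil_matrix((4, 1))
      lil[1, 0] = 1
      lil[3, 0] = 2

      csc = lil.todok().tocsc()
      if not csc.has_sorted_indices:
          csc.sort_indices()         # alternatively: csc = lil.todok().tocsr().tocsc()
      print(csc.has_sorted_indices)  # True
      print(csc.indices)             # [1 3]
      ```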
      
      ## How was this patch tested?
      
      Existing tests.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #17532 from viirya/make-sure-sorted-indices.
    • [SPARK-19454][PYTHON][SQL] DataFrame.replace improvements · e2773996
      zero323 authored
      ## What changes were proposed in this pull request?
      
      - Allows skipping `value` argument if `to_replace` is a `dict`:
      	```python
      	df = sc.parallelize([("Alice", 1, 3.0)]).toDF()
      	df.replace({"Alice": "Bob"}).show()
      ```
      - Adds validation step to ensure homogeneous values / replacements.
      - Simplifies internal control flow.
      - Improves unit tests coverage.
      
      ## How was this patch tested?
      
      Existing unit tests, additional unit tests, manual testing.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16793 from zero323/SPARK-19454.
  10. Apr 03, 2017
    • [SPARK-20166][SQL] Use XXX for ISO 8601 timezone instead of ZZ (FastDateFormat... · cff11fd2
      hyukjinkwon authored
      [SPARK-20166][SQL] Use XXX for ISO 8601 timezone instead of ZZ (FastDateFormat specific) in CSV/JSON timeformat options
      
      ## What changes were proposed in this pull request?
      
      This PR proposes to use the `XXX` format instead of `ZZ`. `ZZ` seems to be `FastDateFormat`-specific.
      
      `ZZ` supports "ISO 8601 extended format time zones", but it seems to be a `FastDateFormat`-specific option.
      I misunderstood it as a format compatible with `SimpleDateFormat` when this change was introduced.
      Please see [SimpleDateFormat documentation]( https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html#iso8601timezone) and [FastDateFormat documentation](https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/time/FastDateFormat.html).
      
      It seems we had better replace `ZZ` with `XXX` because they appear to use the same strategy - [FastDateParser.java#L930](https://github.com/apache/commons-lang/blob/8767cd4f1a6af07093c1e6c422dae8e574be7e5e/src/main/java/org/apache/commons/lang3/time/FastDateParser.java#L930), [FastDateParser.java#L932-L951](https://github.com/apache/commons-lang/blob/8767cd4f1a6af07093c1e6c422dae8e574be7e5e/src/main/java/org/apache/commons/lang3/time/FastDateParser.java#L932-L951) and [FastDateParser.java#L596-L601](https://github.com/apache/commons-lang/blob/8767cd4f1a6af07093c1e6c422dae8e574be7e5e/src/main/java/org/apache/commons/lang3/time/FastDateParser.java#L596-L601).
      
      I also checked the code and manually debugged it to be sure. It seems both cases use the same pattern `( Z|(?:[+-]\\d{2}(?::)\\d{2}))`.
      
      _Note that this should rather be a documentation fix and not a behaviour change, because `ZZ` seems to be an invalid date format in `SimpleDateFormat` as documented in `DataFrameReader` etc., and both `ZZ` and `XXX` appear to work identically with `FastDateFormat`._
      
      Current documentation is as below:
      
      ```
         * <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss.SSSZZ`): sets the string that
         * indicates a timestamp format. Custom date formats follow the formats at
         * `java.text.SimpleDateFormat`. This applies to timestamp type.</li>
      ```
      
      ## How was this patch tested?
      
      Existing tests should cover this. Also, manually tested as below (BTW, I don't think these are worth adding as tests within Spark):
      
      **Parse**
      
      ```scala
      scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00")
      res4: java.util.Date = Tue Mar 21 20:00:00 KST 2017
      
      scala>  new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000Z")
      res10: java.util.Date = Tue Mar 21 09:00:00 KST 2017
      
      scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11:00")
      java.text.ParseException: Unparseable date: "2017-03-21T00:00:00.000-11:00"
        at java.text.DateFormat.parse(DateFormat.java:366)
        ... 48 elided
      scala>  new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000Z")
      java.text.ParseException: Unparseable date: "2017-03-21T00:00:00.000Z"
        at java.text.DateFormat.parse(DateFormat.java:366)
        ... 48 elided
      ```
      
      ```scala
      scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00")
      res7: java.util.Date = Tue Mar 21 20:00:00 KST 2017
      
      scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000Z")
      res1: java.util.Date = Tue Mar 21 09:00:00 KST 2017
      
      scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11:00")
      res8: java.util.Date = Tue Mar 21 20:00:00 KST 2017
      
      scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000Z")
      res2: java.util.Date = Tue Mar 21 09:00:00 KST 2017
      ```
      
      **Format**
      
      ```scala
      scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").format(new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00"))
      res6: String = 2017-03-21T20:00:00.000+09:00
      ```
      
      ```scala
      scala> val fd = org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ")
      fd: org.apache.commons.lang3.time.FastDateFormat = FastDateFormat[yyyy-MM-dd'T'HH:mm:ss.SSSZZ,ko_KR,Asia/Seoul]
      
      scala> fd.format(fd.parse("2017-03-21T00:00:00.000-11:00"))
      res1: String = 2017-03-21T20:00:00.000+09:00
      
      scala> val fd = org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
      fd: org.apache.commons.lang3.time.FastDateFormat = FastDateFormat[yyyy-MM-dd'T'HH:mm:ss.SSSXXX,ko_KR,Asia/Seoul]
      
      scala> fd.format(fd.parse("2017-03-21T00:00:00.000-11:00"))
      res2: String = 2017-03-21T20:00:00.000+09:00
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17489 from HyukjinKwon/SPARK-20166.
  11. Mar 29, 2017
    • [SPARK-19955][PYSPARK] Jenkins Python Conda based test. · d6ddfdf6
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      Allow Jenkins Python tests to use the installed conda to test Python 2.7 support & test pip installability.
      
      ## How was this patch tested?
      
      Updated shell scripts, ran tests locally with installed conda, ran tests in Jenkins.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #17355 from holdenk/SPARK-19955-support-python-tests-with-conda.
  13. Mar 27, 2017
    • [SPARK-20102] Fix nightly packaging and RC packaging scripts w/ two minor build fixes · 314cf51d
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      The master snapshot publisher builds are currently broken due to two minor build issues:
      
      1. For unknown reasons, the LFTP `mkdir -p` command began throwing errors when the remote directory already exists. This change of behavior might have been caused by configuration changes in the ASF's SFTP server, but I'm not entirely sure of that. To work around this problem, this patch updates the script to ignore errors from the `lftp mkdir -p` commands.
      2. The PySpark `setup.py` file references a non-existent `pyspark.ml.stat` module, causing Python packaging to fail by complaining about a missing directory. The fix is to simply drop that line from the setup script.
      
      ## How was this patch tested?
      
      The LFTP fix was tested by manually running the failing commands on AMPLab Jenkins against the ASF SFTP server. The PySpark fix was tested locally.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #17437 from JoshRosen/spark-20102.
  15. Mar 24, 2017
    • [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark · d9f4ce69
      Nick Pentreath authored
      Add Python wrapper for `Imputer` feature transformer.
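
      A minimal usage sketch of the new wrapper (assumes a running SparkSession named `spark`):

      ```python
      from pyspark.ml.feature import Imputer

      df = spark.createDataFrame(
          [(1.0, float("nan")), (2.0, 3.0), (float("nan"), 4.0)], ["a", "b"])

      # Replace NaN in each column with that column's mean (the default strategy).
      imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"])
      model = imputer.fit(df)
      model.transform(df).show()
      ```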
      
      ## How was this patch tested?
      
      New doc tests and tweak to PySpark ML `tests.py`
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #17316 from MLnick/SPARK-15040-pyspark-imputer.
  16. Mar 23, 2017
    • [SPARK-19876][SS][WIP] OneTime Trigger Executor · 746a558d
      Tyson Condie authored
      ## What changes were proposed in this pull request?
      
      An additional trigger and trigger executor that will execute a single trigger only. One can use this OneTime trigger to have more control over the scheduling of triggers.
      
      In addition, this patch requires an optimization to StreamExecution that logs a commit record at the end of successfully processing a batch. This new commit log will be used to determine the next batch (offsets) to process after a restart, instead of using the offset log itself to determine what batch to process next after restart; using the offset log to determine this would process the previously logged batch, always, thus not permitting a OneTime trigger feature.
      
      ## How was this patch tested?
      
      A number of existing tests have been revised. These tests all assumed that when restarting a stream, the last batch in the offset log is to be re-processed. Given that we now have a commit log that will tell us if that last batch was processed successfully, the results/assumptions of those tests needed to be revised accordingly.
      
      In addition, a OneTime trigger test was added to StreamingQuerySuite, which tests:
      - The semantics of OneTime trigger (i.e., on start, execute a single batch, then stop).
      - The case when the commit log was not able to successfully log the completion of a batch before restart, which would mean that we should fall back to what's in the offset log.
      - A OneTime trigger execution that results in an exception being thrown.
      
      marmbrus tdas zsxwing
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Tyson Condie <tcondie@gmail.com>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #17219 from tcondie/stream-commit.
    • [SPARK-18579][SQL] Use ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace options in CSV writing · 07c12c09
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to support _not_ trimming the white spaces when writing out. These are `false` by default in the CSV reading path but `true` by default for CSV writing in the univocity parser.
      
      Both `ignoreLeadingWhiteSpace` and `ignoreTrailingWhiteSpace` options are not being used for writing and therefore, we are always trimming the white spaces.
      
      It seems we should provide a way to keep these white spaces easily.
      
      With the data below:
      
      ```scala
      val df = spark.read.csv(Seq("a , b  , c").toDS)
      df.show()
      ```
      
      ```
      +---+----+---+
      |_c0| _c1|_c2|
      +---+----+---+
      | a | b  |  c|
      +---+----+---+
      ```
      
      **Before**
      
      ```scala
      df.write.csv("/tmp/text.csv")
      spark.read.text("/tmp/text.csv").show()
      ```
      
      ```
      +-----+
      |value|
      +-----+
      |a,b,c|
      +-----+
      ```
      
      It seems this can't be worked around via `quoteAll` either.
      
      ```scala
      df.write.option("quoteAll", true).csv("/tmp/text.csv")
      spark.read.text("/tmp/text.csv").show()
      ```
      ```
      +-----------+
      |      value|
      +-----------+
      |"a","b","c"|
      +-----------+
      ```
      
      **After**
      
      ```scala
      df.write.option("ignoreLeadingWhiteSpace", false).option("ignoreTrailingWhiteSpace", false).csv("/tmp/text.csv")
      spark.read.text("/tmp/text.csv").show()
      ```
      
      ```
      +----------+
      |     value|
      +----------+
      |a , b  , c|
      +----------+
      ```
      
      Note that this case is possible in R
      
      ```r
      > system("cat text.csv")
      f1,f2,f3
      a , b  , c
      > df <- read.csv(file="text.csv")
      > df
        f1   f2 f3
      1 a   b    c
      > write.csv(df, file="text1.csv", quote=F, row.names=F)
      > system("cat text1.csv")
      f1,f2,f3
      a , b  , c
      ```
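
      A rough PySpark equivalent of the Scala example above (assumes this change is applied and a SparkSession named `spark`):

      ```python
      df = spark.createDataFrame([("a ", " b  ", " c")], ["_c0", "_c1", "_c2"])
      (df.write
         .option("ignoreLeadingWhiteSpace", False)
         .option("ignoreTrailingWhiteSpace", False)
         .mode("overwrite")
         .csv("/tmp/text_py.csv"))
      spark.read.text("/tmp/text_py.csv").show()  # expect: a , b  , c
      ```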
      
      ## How was this patch tested?
      
      Unit tests in `CSVSuite` and manual tests for Python.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17310 from HyukjinKwon/SPARK-18579.
  17. Mar 22, 2017
    • [SPARK-19949][SQL][FOLLOW-UP] Clean up parse modes and update related comments · 46581838
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to make the `mode` options in both CSV and JSON use case objects, and fixes some comments related to the previous fix.
      
      Also, this PR modifies some tests related to parse modes.
      
      ## How was this patch tested?
      
      Modified unit tests in both `CSVSuite.scala` and `JsonSuite.scala`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17377 from HyukjinKwon/SPARK-19949.
  18. Mar 21, 2017
    • [SPARK-20041][DOC] Update docs for NaN handling in approxQuantile · 63f077fb
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Update docs for NaN handling in approxQuantile.
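
      A small illustration of the documented behavior (assumes a SparkSession named `spark`): NaN values are removed before the quantiles are computed.

      ```python
      df = spark.createDataFrame([(float("nan"),), (1.0,), (2.0,), (3.0,)], ["x"])
      print(df.approxQuantile("x", [0.5], 0.0))  # [2.0] -- the NaN row is ignored
      ```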
      
      ## How was this patch tested?
      existing tests.
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #17369 from zhengruifeng/doc_quantiles_nan.
    • [SPARK-20011][ML][DOCS] Clarify documentation for ALS 'rank' parameter · 7620aed8
      christopher snow authored
      ## What changes were proposed in this pull request?
      
      API documentation and collaborative filtering documentation page changes to clarify inconsistent description of ALS rank parameter.
      
       - [DOCS] was previously: "rank is the number of latent factors in the model."
       - [API] was previously:  "rank - number of features to use"
      
      This change describes rank in both places consistently as:
      
       - "Number of features to use (also referred to as the number of latent factors)"
      
      Author: Chris Snow <chris.snowuk.ibm.com>
      
      Author: christopher snow <chsnow123@gmail.com>
      
      Closes #17345 from snowch/SPARK-20011.
  19. Mar 20, 2017
    • [SPARK-19849][SQL] Support ArrayType in to_json to produce JSON array · 0cdcf911
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to support an array of struct type in `to_json` as below:
      
      ```scala
      import org.apache.spark.sql.functions._
      
      val df = Seq(Tuple1(Tuple1(1) :: Nil)).toDF("a")
      df.select(to_json($"a").as("json")).show()
      ```
      
      ```
      +----------+
      |      json|
      +----------+
      |[{"_1":1}]|
      +----------+
      ```
      
      Currently, it throws an exception as below (a newline manually inserted for readability):
      
      ```
      org.apache.spark.sql.AnalysisException: cannot resolve 'structtojson(`array`)' due to data type
      mismatch: structtojson requires that the expression is a struct expression.;;
      ```
      
      This allows the roundtrip with `from_json` as below:
      
      ```scala
      import org.apache.spark.sql.functions._
      import org.apache.spark.sql.types._
      
      val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
      val df = Seq("""[{"a":1}, {"a":2}]""").toDF("json").select(from_json($"json", schema).as("array"))
      df.show()
      
      // Read back.
      df.select(to_json($"array").as("json")).show()
      ```
      
      ```
      +----------+
      |     array|
      +----------+
      |[[1], [2]]|
      +----------+
      
      +-----------------+
      |             json|
      +-----------------+
      |[{"a":1},{"a":2}]|
      +-----------------+
      ```
      
      Also, this PR proposes to rename `StructToJson` to `StructsToJson` and `JsonToStruct` to `JsonToStructs`.
      
      ## How was this patch tested?
      
      Unit tests in `JsonFunctionsSuite` and `JsonExpressionsSuite` for Scala, doctest for Python and test in `test_sparkSQL.R` for R.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17192 from HyukjinKwon/SPARK-19849.
  20. Mar 17, 2017
    • [SPARK-19986][TESTS] Make pyspark.streaming.tests.CheckpointTests more stable · 376d7821
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Sometimes, CheckpointTests will hang on a busy machine because the streaming jobs are too slow and cannot catch up. I observed locally that the scheduling delay kept increasing for dozens of seconds.
      
      This PR increases the batch interval from 0.5 seconds to 2 seconds to generate less Spark jobs. It should make `pyspark.streaming.tests.CheckpointTests` more stable. I also replaced `sleep` with `awaitTerminationOrTimeout` so that if the streaming job fails, it will also fail the test.
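
      A small sketch of the `awaitTerminationOrTimeout` pattern (simplified; not the actual test code): waiting on the context re-raises a failed job's error instead of sleeping past it.

      ```python
      from pyspark import SparkContext
      from pyspark.streaming import StreamingContext

      sc = SparkContext("local[2]", "checkpoint-test-sketch")
      ssc = StreamingContext(sc, batchDuration=2)   # 2-second batches, as in this change
      ssc.queueStream([sc.parallelize([1, 2, 3])]).count().pprint()
      ssc.start()
      # Instead of time.sleep(10): if the streaming job failed, its error is raised here.
      ssc.awaitTerminationOrTimeout(10)
      ssc.stop(stopSparkContext=True, stopGraceFully=True)
      ```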
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17323 from zsxwing/SPARK-19986.
  21. Mar 15, 2017
    • [SPARK-19872] [PYTHON] Use the correct deserializer for RDD construction for coalesce/repartition · 7387126f
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to use the correct deserializer, `BatchedSerializer`, for RDD construction for coalesce/repartition when the shuffle is enabled. Currently, it passes `UTF8Deserializer` as-is from the copied RDD instead of `BatchedSerializer`.
      
      With the file `text.txt` below:
      
      ```
      a
      b
      
      d
      e
      f
      g
      h
      i
      j
      k
      l
      
      ```
      
      - Before
      
      ```python
      >>> sc.textFile('text.txt').repartition(1).collect()
      ```
      
      ```
      UTF8Deserializer(True)
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark/python/pyspark/rdd.py", line 811, in collect
          return list(_load_from_socket(port, self._jrdd_deserializer))
        File ".../spark/python/pyspark/serializers.py", line 549, in load_stream
          yield self.loads(stream)
        File ".../spark/python/pyspark/serializers.py", line 544, in loads
          return s.decode("utf-8") if self.use_unicode else s
        File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
          return codecs.utf_8_decode(input, errors, True)
      UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
      ```
      
      - After
      
      ```python
      >>> sc.textFile('text.txt').repartition(1).collect()
      ```
      
      ```
      [u'a', u'b', u'', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l', u'']
      ```
      
      ## How was this patch tested?
      
      Unit test in `python/pyspark/tests.py`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17282 from HyukjinKwon/SPARK-19872.
    • [SPARK-19817][SS] Make it clear that `timeZone` is a general option in DataStreamReader/Writer · e1ac5534
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      
      As the timezone setting can also affect partition values and works for all formats, we should make this clear.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #17299 from lw-lin/timezone.
  22. Mar 14, 2017
    • [SPARK-19817][SQL] Make it clear that `timeZone` option is a general option in... · 7ded39c2
      Takuya UESHIN authored
      [SPARK-19817][SQL] Make it clear that `timeZone` option is a general option in DataFrameReader/Writer.
      
      ## What changes were proposed in this pull request?
      
      As the timezone setting can also affect partition values and works for all formats, we should make this clear.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #17281 from ueshin/issues/SPARK-19817.
  23. Mar 09, 2017
    • [SPARK-12334][SQL][PYSPARK] Support read from multiple input paths for orc... · cabe1df8
      Jeff Zhang authored
      [SPARK-12334][SQL][PYSPARK] Support read from multiple input paths for orc file in DataFrameReader.orc
      
      Besides the issue in the Spark API, this also fixes two minor issues in PySpark (a usage sketch follows the list):
      - support reading from multiple input paths for orc
      - support reading from multiple input paths for text
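
      Hypothetical usage after this change (assumes a SparkSession named `spark` and existing paths):

      ```python
      # Pass a list of paths to read several inputs in one call.
      orc_df = spark.read.orc(["/data/logs/part1", "/data/logs/part2"])
      text_df = spark.read.text(["/data/a.txt", "/data/b.txt"])
      ```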
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #10307 from zjffdu/SPARK-12334.
    • [SPARK-19561][SQL] add int case handling for TimestampType · 206030bd
      Jason White authored
      ## What changes were proposed in this pull request?
      
      Add handling of input of type `Int` for dataType `TimestampType` to `EvaluatePython.scala`. Py4J serializes ints smaller than MIN_INT or larger than MAX_INT to Long, which are handled correctly already, but values between MIN_INT and MAX_INT are serialized to Int.
      
      These range limits correspond to roughly half an hour on either side of the epoch. As a result, PySpark doesn't allow TimestampType values to be created in this range.
      
      Alternatives attempted: patching the `TimestampType.toInternal` function to cast return values to `long`, so Py4J would always serialize them to Scala Long. Python3 does not have a `long` type, so this approach failed on Python3.
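
      An illustration of the affected range (values assumed; requires a SparkSession named `spark`): a timestamp whose internal microsecond value fits in a 32-bit int, i.e. within roughly 35 minutes of the epoch, is serialized as an Int.

      ```python
      import datetime

      near_epoch = datetime.datetime(1970, 1, 1, 0, 10, 0)      # 600 s = 6e8 us, fits in an Int
      far_from_epoch = datetime.datetime(2017, 3, 9, 12, 0, 0)  # serialized as a Long
      df = spark.createDataFrame([(near_epoch,), (far_from_epoch,)], ["t"])
      df.collect()  # failed for the near-epoch value before this change
      ```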
      
      ## How was this patch tested?
      
      Added a new PySpark-side test that fails without the change.
      
      The contribution is my original work and I license the work to the project under the project’s open source license.
      
      Resubmission of https://github.com/apache/spark/pull/16896. The original PR didn't go through Jenkins and broke the build. davies dongjoon-hyun
      
      cloud-fan Could you kick off a Jenkins run for me? It passed everything for me locally, but it's possible something has changed in the last few weeks.
      
      Author: Jason White <jason.white@shopify.com>
      
      Closes #17200 from JasonMWhite/SPARK-19561.
  26. Mar 05, 2017
    • [SPARK-19701][SQL][PYTHON] Throws a correct exception for 'in' operator against column · 224e0e78
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to remove an incorrect implementation that has not been executed so far (at least since Spark 1.5.2) for the `in` operator, and to throw a correct exception rather than saying it is a bool. I tested the code below in 1.5.2, 1.6.3, 2.1.0 and in the master branch:
      
      **1.5.2**
      
      ```python
      >>> df = sqlContext.createDataFrame([[1]])
      >>> 1 in df._1
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark-1.5.2-bin-hadoop2.6/python/pyspark/sql/column.py", line 418, in __nonzero__
          raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
      ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
      ```
      
      **1.6.3**
      
      ```python
      >>> 1 in sqlContext.range(1).id
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark-1.6.3-bin-hadoop2.6/python/pyspark/sql/column.py", line 447, in __nonzero__
          raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
      ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
      ```
      
      **2.1.0**
      
      ```python
      >>> 1 in spark.range(1).id
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/column.py", line 426, in __nonzero__
          raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
      ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
      ```
      
      **Current Master**
      
      ```python
      >>> 1 in spark.range(1).id
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark/python/pyspark/sql/column.py", line 452, in __nonzero__
          raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
      ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
      ```
      
      **After**
      
      ```python
      >>> 1 in spark.range(1).id
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark/python/pyspark/sql/column.py", line 184, in __contains__
          raise ValueError("Cannot apply 'in' operator against a column: please use 'contains' "
      ValueError: Cannot apply 'in' operator against a column: please use 'contains' in a string column or 'array_contains' function for an array column.
      ```
      
      In more detail,
      
      It seems the implementation intended to support this
      
      ```python
      1 in df.column
      ```
      
      However, currently, it throws an exception as below:
      
      ```python
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark/python/pyspark/sql/column.py", line 426, in __nonzero__
          raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
      ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
      ```
      
      What happens here is as below:
      
      ```python
      class Column(object):
          def __contains__(self, item):
              print "I am contains"
              return Column()
          def __nonzero__(self):
              raise Exception("I am nonzero.")
      
      >>> 1 in Column()
      I am contains
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "<stdin>", line 6, in __nonzero__
      Exception: I am nonzero.
      ```
      
      It seems it calls `__contains__` first, and then `__nonzero__` or `__bool__` is called on the returned `Column()` to turn it into a bool (or an int, to be specific).
      
      It seems `__nonzero__` (for Python 2), `__bool__` (for Python 3) and `__contains__` force the return value into a bool, unlike other operators. There are a few references about this below:
      
      https://bugs.python.org/issue16011
      http://stackoverflow.com/questions/12244074/python-source-code-for-built-in-in-operator/12244378#12244378
      http://stackoverflow.com/questions/38542543/functionality-of-python-in-vs-contains/38542777
      
      It seems we can't override `__nonzero__` or `__bool__` as a workaround to make this work, because these force the return type to be a bool, as below:
      
      ```python
      class Column(object):
          def __contains__(self, item):
              print "I am contains"
              return Column()
          def __nonzero__(self):
              return "a"
      
      >>> 1 in Column()
      I am contains
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      TypeError: __nonzero__ should return bool or int, returned str
      ```
      
      ## How was this patch tested?
      
      Added unit tests in `tests.py`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17160 from HyukjinKwon/SPARK-19701.
    • [SPARK-19595][SQL] Support json array in from_json · 369a148e
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to do both of the following:
      
      **Do not allow json arrays with multiple elements and return null in `from_json` with `StructType` as the schema.**
      
      Currently, it only reads a single row when the input is a json array. So, the code below:
      
      ```scala
      import org.apache.spark.sql.functions._
      import org.apache.spark.sql.types._
      val schema = StructType(StructField("a", IntegerType) :: Nil)
      Seq(("""[{"a": 1}, {"a": 2}]""")).toDF("struct").select(from_json(col("struct"), schema)).show()
      ```
      prints
      
      ```
      +--------------------+
      |jsontostruct(struct)|
      +--------------------+
      |                 [1]|
      +--------------------+
      ```
      
      This PR simply suggests printing this as `null` if the schema is `StructType` and the input is a json array with multiple elements:
      
      ```
      +--------------------+
      |jsontostruct(struct)|
      +--------------------+
      |                null|
      +--------------------+
      ```
      
      **Support json arrays in `from_json` with `ArrayType` as the schema.**
      
      ```scala
      import org.apache.spark.sql.functions._
      import org.apache.spark.sql.types._
      val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
      Seq(("""[{"a": 1}, {"a": 2}]""")).toDF("array").select(from_json(col("array"), schema)).show()
      ```
      
      prints
      
      ```
      +-------------------+
      |jsontostruct(array)|
      +-------------------+
      |         [[1], [2]]|
      +-------------------+
      ```
      
      ## How was this patch tested?
      
      Unit test in `JsonExpressionsSuite`, `JsonFunctionsSuite`, Python doctests and manual test.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16929 from HyukjinKwon/disallow-array.
  27. Mar 03, 2017
    • [SPARK-19348][PYTHON] PySpark keyword_only decorator is not thread-safe · 44281ca8
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      The `keyword_only` decorator in PySpark is not thread-safe.  It writes kwargs to a static class variable in the decorator, which is then retrieved later in the class method as `_input_kwargs`.  If multiple threads are constructing the same class with different kwargs, it becomes a race condition to read from the static class variable before it's overwritten.  See [SPARK-19348](https://issues.apache.org/jira/browse/SPARK-19348) for reproduction code.
      
      This change will write the kwargs to a member variable so that multiple threads can operate on separate instances without the race condition.  It does not protect against multiple threads operating on a single instance, but that is better left to the user to synchronize.
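
      A hedged sketch of the thread-safe shape described above (not the verbatim Spark code):

      ```python
      import functools

      def keyword_only(func):
          @functools.wraps(func)
          def wrapper(self, *args, **kwargs):
              if len(args) > 0:
                  raise TypeError("Method %s forces keyword arguments." % func.__name__)
              # Store kwargs on the instance, not on a shared class attribute,
              # so concurrent constructions of the same class do not race.
              self._input_kwargs = kwargs
              return func(self, **kwargs)
          return wrapper
      ```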
      
      ## How was this patch tested?
      Added new unit tests for using the keyword_only decorator and a regression test that verifies `_input_kwargs` can be overwritten from different class instances.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #16782 from BryanCutler/pyspark-keyword_only-threadsafe-SPARK-19348.
  28. Mar 02, 2017
    • [SPARK-18352][DOCS] wholeFile JSON update doc and programming guide · 8d6ef895
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Update doc for R, programming guide. Clarify default behavior for all languages.
      
      ## How was this patch tested?
      
      manually
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17128 from felixcheung/jsonwholefiledoc.
    • [SPARK-19734][PYTHON][ML] Correct OneHotEncoder doc string to say dropLast · d2a87976
      Mark Grover authored
      ## What changes were proposed in this pull request?
      Updates the doc string to match up with the code,
      i.e. it now says dropLast instead of includeFirst.
      
      ## How was this patch tested?
      Not much, since it's a doc-like change. Will run unit tests via Jenkins job.
      
      Author: Mark Grover <mark@apache.org>
      
      Closes #17127 from markgrover/spark_19734.
    • [MINOR][ML] Fix comments in LSH Examples and Python API · 3bd8ddf7
      Yun Ni authored
      ## What changes were proposed in this pull request?
      Remove `org.apache.spark.examples.` in the LSH examples.
      Add a slash in one of the Python docs.
      
      ## How was this patch tested?
      Run examples using the commands in the comments.
      
      Author: Yun Ni <yunn@uber.com>
      
      Closes #17104 from Yunni/yunn_minor.