  1. Aug 17, 2016
  2. Jul 28, 2016
  3. Jul 08, 2016
  4. Jun 29, 2016
    • hyukjinkwon's avatar
      [TRIVIAL] [PYSPARK] Clean up orc compression option as well · d8a87a3e
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
This PR corrects the ORC compression option for PySpark as well. I think this was mistakenly missed in https://github.com/apache/spark/pull/13948.
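For illustration, a hedged sketch of the option from the PySpark side (the DataFrame `df`, the paths, and the `snappy` codec are assumed examples):

```python
# Write ORC with an explicit compression codec via the keyword argument,
# or with the equivalent generic option() form.
df.write.orc("/tmp/people_orc", compression="snappy")
df.write.option("compression", "snappy").format("orc").save("/tmp/people_orc2")
```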
      
      ## How was this patch tested?
      
      N/A
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #13963 from HyukjinKwon/minor-orc-compress.
      d8a87a3e
    • gatorsmile's avatar
      [SPARK-16236][SQL][FOLLOWUP] Add Path Option back to Load API in DataFrameReader · 39f2eb1d
      gatorsmile authored
      #### What changes were proposed in this pull request?
In the Python API, we have the same issue. Thanks for identifying this issue, zsxwing! Below is an example:
      ```Python
      spark.read.format('json').load('python/test_support/sql/people.json')
      ```
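For reference, a hedged sketch of the equivalent call with the path passed explicitly as an option (same sample file assumed):

```python
# Passing the path through option() is expected to behave the same as
# passing it to load() directly once the "path" option is wired back in.
spark.read.format('json') \
    .option('path', 'python/test_support/sql/people.json') \
    .load()
```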
      #### How was this patch tested?
Existing test cases cover the changes in this PR.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13965 from gatorsmile/optionPaths.
      39f2eb1d
    • Tathagata Das's avatar
[SPARK-16266][SQL][STREAMING] Moved DataStreamReader/Writer from pyspark.sql to... · f454a7f9
      Tathagata Das authored
[SPARK-16266][SQL][STREAMING] Moved DataStreamReader/Writer from pyspark.sql to pyspark.sql.streaming
      
      ## What changes were proposed in this pull request?
      
- Moved DataStreamReader/Writer from pyspark.sql to pyspark.sql.streaming to make them consistent with the Scala packaging (see the sketch after this list)
      - Exposed the necessary classes in sql.streaming package so that they appear in the docs
      - Added pyspark.sql.streaming module to the docs
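A hedged sketch of how the relocated classes are reached after this change (the `spark.readStream` entry point is per the earlier SPARK-15933 refactoring; the assertion is only illustrative):

```python
# After the move, the streaming reader/writer classes live in
# pyspark.sql.streaming rather than pyspark.sql.
from pyspark.sql.streaming import DataStreamReader

# They are still obtained through the session entry point.
reader = spark.readStream
assert isinstance(reader, DataStreamReader)
```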
      
      ## How was this patch tested?
      - updated unit tests.
      - generated docs for testing visibility of pyspark.sql.streaming classes.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #13955 from tdas/SPARK-16266.
      f454a7f9
  5. Jun 28, 2016
    • Davies Liu's avatar
      [SPARK-16259][PYSPARK] cleanup options in DataFrame read/write API · 1aad8c6e
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
There is some duplicated code for options in the DataFrame reader/writer API; this PR cleans it up and also fixes a bug with the `escapeQuotes` option of csv().
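A hedged example of the `escapeQuotes` option whose handling this PR fixes (the DataFrame and paths are assumed):

```python
# escapeQuotes controls whether values containing quotes are always
# enclosed in quotes on write; the equivalent generic form is shown too.
df.write.csv("/tmp/out_csv", escapeQuotes=True)
df.write.option("escapeQuotes", "true").format("csv").save("/tmp/out_csv2")
```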
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13948 from davies/csv_options.
      1aad8c6e
  6. Jun 21, 2016
    • Reynold Xin's avatar
      [SPARK-13792][SQL] Addendum: Fix Python API · 93338807
      Reynold Xin authored
      ## What changes were proposed in this pull request?
This is a follow-up to https://github.com/apache/spark/pull/13795 to properly set CSV options in the Python API. As part of this, I also make the Python option setting for both CSV and JSON more robust against positional errors.
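A hedged illustration of the keyword-style option setting this makes safer (file paths are assumed):

```python
# Options are passed by name, so adding or reordering parameters on the
# JVM side cannot silently shift their meaning.
df_csv = spark.read.csv("/tmp/in.csv", header=True, inferSchema=True, sep=";")
df_json = spark.read.json("/tmp/in.json", primitivesAsString=True)
```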
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13800 from rxin/SPARK-13792-2.
      93338807
  7. Jun 20, 2016
    • Reynold Xin's avatar
      [SPARK-13792][SQL] Limit logging of bad records in CSV data source · c775bf09
      Reynold Xin authored
      ## What changes were proposed in this pull request?
This pull request adds a new option (maxMalformedLogPerPartition) in the CSV reader to limit the maximum number of log messages Spark generates per partition for malformed records.
      
      The error log looks something like
      ```
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: More than 10 malformed records have been found on this partition. Malformed records from now on will not be logged.
      ```
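A hedged example of setting the new option from the DataFrame reader (the path is assumed; the limit of 10 matches the log above):

```python
# Cap malformed-record log messages at 10 per partition while dropping
# the malformed lines themselves.
df = (spark.read
      .option("mode", "DROPMALFORMED")
      .option("maxMalformedLogPerPartition", 10)
      .csv("/tmp/bad_records.csv"))
```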
      
      Closes #12173
      
      ## How was this patch tested?
      Manually tested.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13795 from rxin/SPARK-13792.
      c775bf09
  8. Jun 16, 2016
    • Tathagata Das's avatar
      [SPARK-15981][SQL][STREAMING] Fixed bug and added tests in DataStreamReader Python API · 084dca77
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
- Fixed a bug in the Python API of DataStreamReader. Because a single path was being converted to an array before calling the Java DataStreamReader method (which takes only a string), it gave the following error; a hedged sketch of the fix appears after this list.
      ```
      File "/Users/tdas/Projects/Spark/spark/python/pyspark/sql/readwriter.py", line 947, in pyspark.sql.readwriter.DataStreamReader.json
      Failed example:
          json_sdf = spark.readStream.json(os.path.join(tempfile.mkdtemp(), 'data'),                 schema = sdf_schema)
      Exception raised:
          Traceback (most recent call last):
            File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", line 1253, in __run
              compileflags, 1) in test.globs
            File "<doctest pyspark.sql.readwriter.DataStreamReader.json[0]>", line 1, in <module>
              json_sdf = spark.readStream.json(os.path.join(tempfile.mkdtemp(), 'data'),                 schema = sdf_schema)
            File "/Users/tdas/Projects/Spark/spark/python/pyspark/sql/readwriter.py", line 963, in json
              return self._df(self._jreader.json(path))
            File "/Users/tdas/Projects/Spark/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
              answer, self.gateway_client, self.target_id, self.name)
            File "/Users/tdas/Projects/Spark/spark/python/pyspark/sql/utils.py", line 63, in deco
              return f(*a, **kw)
            File "/Users/tdas/Projects/Spark/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 316, in get_return_value
              format(target_id, ".", name, value))
          Py4JError: An error occurred while calling o121.json. Trace:
          py4j.Py4JException: Method json([class java.util.ArrayList]) does not exist
          	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
          	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
          	at py4j.Gateway.invoke(Gateway.java:272)
          	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
          	at py4j.commands.CallCommand.execute(CallCommand.java:79)
          	at py4j.GatewayConnection.run(GatewayConnection.java:211)
          	at java.lang.Thread.run(Thread.java:744)
      ```
      
      - Reduced code duplication between DataStreamReader and DataFrameWriter
      - Added missing Python doctests
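A hedged sketch (not the literal patch) of the shape of the fix described in the first item above:

```python
# Hypothetical reconstruction of the fixed DataStreamReader.json(): the
# Java-side json(String) accepts only a single path, so anything other
# than a plain string is rejected up front instead of being wrapped in a list.
def json(self, path, schema=None):
    if schema is not None:
        self.schema(schema)
    if isinstance(path, basestring):  # Python 2 string check, as in 2.0-era PySpark
        return self._df(self._jreader.json(path))
    raise TypeError("path can be only a single string")
```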
      
      ## How was this patch tested?
      New tests
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #13703 from tdas/SPARK-15981.
      084dca77
  9. Jun 15, 2016
  10. Jun 14, 2016
    • Tathagata Das's avatar
      [SPARK-15933][SQL][STREAMING] Refactored DF reader-writer to use readStream... · 214adb14
      Tathagata Das authored
      [SPARK-15933][SQL][STREAMING] Refactored DF reader-writer to use readStream and writeStream for streaming DFs
      
      ## What changes were proposed in this pull request?
Currently, DataFrameReader/Writer has methods that are needed for both streaming and non-streaming DFs. This is quite awkward because each of those methods throws a runtime exception for one case or the other. So rather than having half the methods throw runtime exceptions, it's better to have a different reader/writer API for streams.
      
      - [x] Python API!!
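A hedged sketch of the resulting split between the batch and streaming entry points (formats, paths, and options are assumed):

```python
# Batch I/O keeps using read/write ...
static_df = spark.read.json("/tmp/batch_input")

# ... while streaming DataFrames go through the dedicated
# readStream/writeStream entry points introduced by this refactoring.
streaming_df = spark.readStream.format("text").load("/tmp/stream_input")
query = (streaming_df.writeStream
         .format("parquet")
         .option("checkpointLocation", "/tmp/ckpt")
         .start("/tmp/stream_output"))
```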
      
      ## How was this patch tested?
      Existing unit tests + two sets of unit tests for DataFrameReader/Writer and DataStreamReader/Writer.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #13653 from tdas/SPARK-15933.
      214adb14
  11. Jun 12, 2016
  12. Jun 11, 2016
    • Takeshi YAMAMURO's avatar
      [SPARK-15585][SQL] Add doc for turning off quotations · cb5d933d
      Takeshi YAMAMURO authored
      ## What changes were proposed in this pull request?
This PR adds documentation for turning off quotations, because this behavior is different from `com.databricks.spark.csv`.
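A hedged example of the documented behavior from PySpark (path assumed):

```python
# An empty string for `quote` turns quotation handling off, whereas the
# old com.databricks.spark.csv package expected null for the same effect.
df = spark.read.csv("/tmp/data.csv", quote="")
```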
      
      ## How was this patch tested?
Checked the behavior of putting an empty string in the CSV options.
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #13616 from maropu/SPARK-15585-2.
      cb5d933d
  13. Jun 06, 2016
  14. May 31, 2016
    • Tathagata Das's avatar
[SPARK-15517][SQL][STREAMING] Add support for complete output mode in Structured Streaming · 90b11439
      Tathagata Das authored
      ## What changes were proposed in this pull request?
Currently, Structured Streaming only supports the Append output mode. This PR adds the following.
      
      - Added support for Complete output mode in the internal state store, analyzer and planner.
      - Added public API in Scala and Python for users to specify output mode
      - Added checks for unsupported combinations of output mode and DF operations
        - Plans with no aggregation should support only Append mode
        - Plans with aggregation should support only Update and Complete modes
        - Default output mode is Append mode (**Question: should we change this to automatically set to Complete mode when there is aggregation?**)
- Added support for Complete output mode in the Memory Sink. The Memory Sink therefore internally supports Append, Complete, and Update, but through the public API only Complete and Append output modes are supported (a hedged usage sketch follows this list).
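A hedged Python sketch of the new output-mode API (source, paths, and column names are assumed; the entry points are the pre-refactor `read.stream()`/`write.startStream()` of the time):

```python
# A streaming aggregation written with the new Complete output mode;
# non-aggregating queries keep the default Append mode.
streaming_df = spark.read.format("text").stream("/tmp/stream_input")
counts = streaming_df.groupBy("value").count()
query = (counts.write
         .format("memory")
         .queryName("value_counts")
         .outputMode("complete")
         .startStream())
```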
      
      ## How was this patch tested?
      Unit tests in various test suites
      - StreamingAggregationSuite: tests for complete mode
      - MemorySinkSuite: tests for checking behavior in Append and Complete modes.
      - UnsupportedOperationSuite: tests for checking unsupported combinations of DF ops and output modes
      - DataFrameReaderWriterSuite: tests for checking that output mode cannot be called on static DFs
      - Python doc test and existing unit tests modified to call write.outputMode.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #13286 from tdas/complete-mode.
      90b11439
    • Shixiong Zhu's avatar
      Revert "[SPARK-11753][SQL][TEST-HADOOP2.2] Make allowNonNumericNumbers option work · 9a74de18
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This reverts commit c24b6b67. Sent a PR to run Jenkins tests due to the revert conflicts of `dev/deps/spark-deps-hadoop*`.
      
      ## How was this patch tested?
      
Jenkins unit tests, integration tests, manual tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13417 from zsxwing/revert-SPARK-11753.
      9a74de18
  15. May 27, 2016
    • Zheng RuiFeng's avatar
      [MINOR] Fix Typos 'a -> an' · 6b1a6180
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
      `a` -> `an`
      
      I use regex to generate potential error lines:
      `grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml/*/*scala`
      and review them line by line.
      
      ## How was this patch tested?
      
      local build
      `lint-java` checking
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #13317 from zhengruifeng/a_an.
      6b1a6180
  16. May 25, 2016
  17. May 24, 2016
    • Liang-Chi Hsieh's avatar
      [SPARK-11753][SQL][TEST-HADOOP2.2] Make allowNonNumericNumbers option work · c24b6b67
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
Jackson supports the `allowNonNumericNumbers` option to parse non-standard non-numeric numbers such as "NaN", "Infinity", and "INF". The currently used Jackson version (2.5.3) doesn't fully support it. This patch upgrades the library and makes the two ignored tests in `JsonParsingOptionsSuite` pass.
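A hedged PySpark example of the option these tests exercise (input path assumed):

```python
# With the option enabled, non-standard tokens such as NaN and Infinity
# are accepted by the JSON parser instead of being treated as corrupt.
df = (spark.read
      .option("allowNonNumericNumbers", "true")
      .json("/tmp/nonstandard_numbers.json"))
```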
      
      ## How was this patch tested?
      
      `JsonParsingOptionsSuite`.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #9759 from viirya/fix-json-nonnumric.
      c24b6b67
  18. May 18, 2016
  19. May 17, 2016
  20. May 11, 2016
  21. May 10, 2016
    • Reynold Xin's avatar
      [SPARK-15261][SQL] Remove experimental tag from DataFrameReader/Writer · 5a5b83c9
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch removes experimental tag from DataFrameReader and DataFrameWriter, and explicitly tags a few methods added for structured streaming as experimental.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13038 from rxin/SPARK-15261.
      5a5b83c9
  22. May 02, 2016
  23. May 01, 2016
    • hyukjinkwon's avatar
      [SPARK-13425][SQL] Documentation for CSV datasource options · a832cef1
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR adds the explanation and documentation for CSV options for reading and writing.
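For context, a hedged example touching a few of the documented options (file paths assumed):

```python
# Reading with some of the documented CSV options ...
df = spark.read.csv("/tmp/in.csv", header=True, inferSchema=True,
                    nullValue="NA", dateFormat="yyyy-MM-dd")
# ... and writing with the corresponding writer-side options.
df.write.csv("/tmp/out.csv", header=True, sep=",")
```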
      
      ## How was this patch tested?
      
      Style tests with `./dev/run_tests` for documentation style.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: Hyukjin Kwon <gurwls223@gmail.com>
      
      Closes #12817 from HyukjinKwon/SPARK-13425.
      a832cef1
  24. Apr 28, 2016
    • Burak Yavuz's avatar
      [SPARK-14555] Second cut of Python API for Structured Streaming · 78c8aaf8
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      This PR adds Python APIs for:
       - `ContinuousQueryManager`
       - `ContinuousQueryException`
      
`ContinuousQueryException` is a very basic wrapper; it doesn't provide the functionality that the Scala side provides, but it follows the same pattern as `AnalysisException`.
      
      For `ContinuousQueryManager`, all APIs are provided except for registering listeners.
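A hedged, partly hypothetical sketch of how the manager might be used from Python (the `streams` accessor on the SQL entry point and the exact attribute names are assumptions based on the Scala API of the time):

```python
# Hypothetical usage: list active continuous queries and wait for any of
# them to terminate; attribute names are assumed from the Scala API.
cqm = sqlContext.streams
for cq in cqm.active:
    print(cq.name)
cqm.awaitAnyTermination()
cqm.resetTerminated()
```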
      
      This PR also attempts to fix test flakiness by stopping all active streams just before tests.
      
      ## How was this patch tested?
      
      Python Doc tests and unit tests
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #12673 from brkyvz/pyspark-cqm.
      78c8aaf8
    • Andrew Or's avatar
      [SPARK-14945][PYTHON] SparkSession Python API · 89addd40
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      ```
      Welcome to
            ____              __
           / __/__  ___ _____/ /__
          _\ \/ _ \/ _ `/ __/  '_/
         /__ / .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
            /_/
      
      Using Python version 2.7.5 (default, Mar  9 2014 22:15:05)
      SparkSession available as 'spark'.
      >>> spark
      <pyspark.sql.session.SparkSession object at 0x101f3bfd0>
      >>> spark.sql("SHOW TABLES").show()
      ...
      +---------+-----------+
      |tableName|isTemporary|
      +---------+-----------+
      |      src|      false|
      +---------+-----------+
      
      >>> spark.range(1, 10, 2).show()
      +---+
      | id|
      +---+
      |  1|
      |  3|
      |  5|
      |  7|
      |  9|
      +---+
      ```
**Note**: This API is NOT complete in its current state. In particular, for now I left out the `conf` and `catalog` APIs, which were added later in Scala. These will be added before 2.0.
      
      ## How was this patch tested?
      
      Python tests.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12746 from andrewor14/python-spark-session.
      89addd40
  25. Apr 22, 2016
    • Liang-Chi Hsieh's avatar
[SPARK-13266] [SQL] None read/writer options were not translated to "null" · 056883e0
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      In Python, the `option` and `options` method of `DataFrameReader` and `DataFrameWriter` were sending the string "None" instead of `null` when passed `None`, therefore making it impossible to send an actual `null`. This fixes that problem.
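A hedged sketch of the kind of conversion the fix implies (the helper name is hypothetical):

```python
# Hypothetical helper illustrating the conversion: None must reach the JVM
# side as a real null rather than the string "None"; other values are
# stringified as before.
def to_option_value(value):
    if value is None:
        return None                # Py4J passes this through as null
    if isinstance(value, bool):
        return str(value).lower()  # "true" / "false" as the JVM expects
    return str(value)
```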
      
      This is based on #11305 from mathieulongtin.
      
      ## How was this patch tested?
      
      Added test to readwriter.py.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      Author: mathieu longtin <mathieu.longtin@nuance.com>
      
      Closes #12494 from viirya/py-df-none-option.
      056883e0
  26. Apr 20, 2016
    • Burak Yavuz's avatar
      [SPARK-14555] First cut of Python API for Structured Streaming · 80bf48f4
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      This patch provides a first cut of python APIs for structured streaming. This PR provides the new classes:
       - ContinuousQuery
       - Trigger
       - ProcessingTime
      in pyspark under `pyspark.sql.streaming`.
      
      In addition, it contains the new methods added under:
       -  `DataFrameWriter`
           a) `startStream`
           b) `trigger`
           c) `queryName`
      
       -  `DataFrameReader`
           a) `stream`
      
       - `DataFrame`
          a) `isStreaming`
      
      This PR doesn't contain all methods exposed for `ContinuousQuery`, for example:
       - `exception`
       - `sourceStatuses`
       - `sinkStatus`
      
      They may be added in a follow up.
      
      This PR also contains some very minor doc fixes in the Scala side.
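A hedged usage sketch of the new entry points listed above (the `stream`/`startStream` names follow this patch; the paths, format, options, and the `sqlContext` entry point are assumptions):

```python
# Create a streaming DataFrame with the new stream() reader method ...
streaming_df = sqlContext.read.format("text").stream("/tmp/stream_input")
assert streaming_df.isStreaming

# ... and start a ContinuousQuery with the new writer methods.
cq = (streaming_df.write
      .format("parquet")
      .option("checkpointLocation", "/tmp/ckpt")
      .queryName("file_sink_query")
      .startStream(path="/tmp/stream_output"))
```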
      
      ## How was this patch tested?
      
      Python doc tests
      
      TODO:
       - [ ] verify Python docs look good
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      Author: Burak Yavuz <burak@databricks.com>
      
      Closes #12320 from brkyvz/stream-python.
      80bf48f4
  27. Apr 03, 2016
    • hyukjinkwon's avatar
      [SPARK-14231] [SQL] JSON data source infers floating-point values as a double... · 2262a933
      hyukjinkwon authored
      [SPARK-14231] [SQL] JSON data source infers floating-point values as a double when they do not fit in a decimal
      
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-14231
      
Currently, the JSON data source supports inferring `DecimalType` for big numbers and the `floatAsBigDecimal` option, which reads floating-point values as `DecimalType`.
      
But there are a few restrictions on Spark's `DecimalType`:

1. The precision cannot be bigger than 38.
2. The scale cannot be bigger than the precision.

Currently, neither restriction is handled.
      
      This PR handles the cases by inferring them as `DoubleType`. Also, the option name was changed from `floatAsBigDecimal` to `prefersDecimal` as suggested [here](https://issues.apache.org/jira/browse/SPARK-14231?focusedCommentId=15215579&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15215579).
      
      So, the codes below:
      
      ```scala
      def doubleRecords: RDD[String] =
        sqlContext.sparkContext.parallelize(
          s"""{"a": 1${"0" * 38}, "b": 0.01}""" ::
          s"""{"a": 2${"0" * 38}, "b": 0.02}""" :: Nil)
      
      val jsonDF = sqlContext.read
        .option("prefersDecimal", "true")
        .json(doubleRecords)
      jsonDF.printSchema()
      ```
      
produces the following:
      
      - **Before**
      
      ```scala
      org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater than precision (1).;
      	at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:44)
      	at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:144)
      	at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:108)
      	at
      ...
      ```
      
      - **After**
      
      ```scala
      root
       |-- a: double (nullable = true)
       |-- b: double (nullable = true)
      ```
      
      ## How was this patch tested?
      
      Unit tests were used and `./dev/run_tests` for coding style tests.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #12030 from HyukjinKwon/SPARK-14231.
      2262a933
  28. Mar 22, 2016
    • hyukjinkwon's avatar
      [SPARK-13953][SQL] Specifying the field name for corrupted record via option at JSON datasource · 4e09a0d5
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-13953
      
Currently, the JSON data source creates a new field in `PERMISSIVE` mode for storing the malformed string.
This field can be renamed via the `spark.sql.columnNameOfCorruptRecord` option, but that is a global configuration.
      
This PR makes the option applicable per read, specified via `option()`. It overrides `spark.sql.columnNameOfCorruptRecord` if that is set.
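A hedged example of the per-read form (the path and column name are assumed):

```python
# The per-read option overrides the global spark.sql.columnNameOfCorruptRecord
# setting for this read only.
df = (sqlContext.read
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_bad_record")
      .json("/tmp/records.json"))
```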
      
      ## How was this patch tested?
      
      Unit tests were used and `./dev/run_tests` for coding style tests.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #11881 from HyukjinKwon/SPARK-13953.
      4e09a0d5
  29. Mar 21, 2016
    • hyukjinkwon's avatar
      [SPARK-13764][SQL] Parse modes in JSON data source · e4740881
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
Currently, there is no way to control the behaviour when the JSON data source fails to parse corrupt records.
      
This PR adds support for parse modes, just like the CSV data source. There are three modes:
      
- `PERMISSIVE`: When parsing fails, this sets the field to `null`. This is the default mode.
- `DROPMALFORMED`: When parsing fails, this drops the whole record.
- `FAILFAST`: When parsing fails, it just throws an exception.
      
This PR also makes the JSON data source share the `ParseModes` with the CSV data source.
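A hedged example of selecting a parse mode from the reader (path assumed):

```python
# DROPMALFORMED silently drops records that fail to parse; PERMISSIVE and
# FAILFAST are the other two modes described above.
df = (sqlContext.read
      .option("mode", "DROPMALFORMED")
      .json("/tmp/records.json"))
```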
      
      ## How was this patch tested?
      
      Unit tests were used and `./dev/run_tests` for code style tests.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #11756 from HyukjinKwon/SPARK-13764.
      e4740881
  30. Mar 03, 2016