  1. Aug 30, 2016
  2. Aug 25, 2016
  3. Aug 24, 2016
    • [SPARK-16216][SQL] Read/write timestamps and dates in ISO 8601 and dateFormat/timestampFormat option for CSV and JSON · 29952ed0
      hyukjinkwon authored
      
      ## What changes were proposed in this pull request?
      
      ### Default - ISO 8601
      
      Currently, the CSV datasource writes `Timestamp` and `Date` in numeric form, and the JSON datasource writes both as shown below:
      
      - CSV
        ```
        // TimestampType
        1414459800000000
        // DateType
        16673
        ```
      
      - JSON
      
        ```
        // TimestampType
        1970-01-01 11:46:40.0
        // DateType
        1970-01-01
        ```
      
      So for CSV we can't read back what we write, and for JSON the value is ambiguous because the time zone is missing.
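
To make the round-trip problem concrete, here is a minimal Python sketch that decodes the two CSV values shown above. It assumes they are microseconds and days since the Unix epoch, respectively; that unit is an assumption consistent with the sample values, not something this PR states. The helper names are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_TS = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_csv_timestamp(micros: int) -> datetime:
    """Interpret a value as microseconds since the Unix epoch (assumed unit)."""
    return EPOCH_TS + timedelta(microseconds=micros)


def decode_csv_date(days: int) -> date:
    """Interpret a value as days since the Unix epoch (assumed unit)."""
    return EPOCH_DATE + timedelta(days=days)


ts = decode_csv_timestamp(1414459800000000)
d = decode_csv_date(16673)
print(ts.isoformat())  # 2014-10-28T01:30:00+00:00
print(d.isoformat())   # 2015-08-26
```

Nothing in the raw numbers says which unit or type they carry, which is why a reader cannot recover the original column type from the written file alone.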
      
      So this PR makes both datasources **write** `Timestamp` and `Date` as ISO 8601 formatted strings (please refer to the [ISO 8601 specification](https://www.w3.org/TR/NOTE-datetime)).
      
      - For `Timestamp`, the output becomes as below (`yyyy-MM-dd'T'HH:mm:ss.SSSZZ`):
      
        ```
        1970-01-01T02:00:01.000-01:00
        ```
      
      - For `Date`, the output becomes as below (`yyyy-MM-dd`):
      
        ```
        1970-01-01
        ```
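
As an illustration of why the ISO 8601 form is unambiguous, the following sketch reproduces the two sample strings with Python's standard `datetime` module (not Spark's formatter) and parses the timestamp back. The UTC-01:00 offset is taken from the sample output above.

```python
from datetime import date, datetime, timedelta, timezone

# One second past 02:00 in a UTC-01:00 zone, as in the sample output.
ts = datetime(1970, 1, 1, 2, 0, 1, tzinfo=timezone(timedelta(hours=-1)))

ts_text = ts.isoformat(timespec="milliseconds")
date_text = date(1970, 1, 1).isoformat()

print(ts_text)    # 1970-01-01T02:00:01.000-01:00
print(date_text)  # 1970-01-01

# The offset is part of the string, so parsing recovers the same instant.
round_tripped = datetime.fromisoformat(ts_text)
print(round_tripped == ts)  # True
```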
      
      ### Custom date format option - `dateFormat`
      
      This PR also adds support for writing and reading dates and timestamps as formatted strings, as below:
      
      - **DateType**
      
        - With `dateFormat` option (e.g. `yyyy/MM/dd`)
      
          ```
          +----------+
          |      date|
          +----------+
          |2015/08/26|
          |2014/10/27|
          |2016/01/28|
          +----------+
          ```
      
      ### Custom timestamp format option - `timestampFormat`
      
      - **TimestampType**
      
        - With `timestampFormat` option (e.g. `yyyy/MM/dd HH:mm`)
      
          ```
          +----------------+
          |            date|
          +----------------+
          |2015/08/26 18:00|
          |2014/10/27 18:30|
          |2016/01/28 20:00|
          +----------------+
          ```
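
The round trip that a custom format enables can be sketched with Python's `strptime`/`strftime`. This is an analogy using the standard library, not Spark's parser; the format strings are hand-translated equivalents of `yyyy/MM/dd` and of the pattern matching the sample timestamp output (`yyyy/MM/dd HH:mm`), which is an assumption of this sketch.

```python
from datetime import datetime

# Hand-translated strptime equivalents of the patterns (an assumption).
DATE_FORMAT = "%Y/%m/%d"             # corresponds to yyyy/MM/dd
TIMESTAMP_FORMAT = "%Y/%m/%d %H:%M"  # corresponds to yyyy/MM/dd HH:mm

# Reading: parse the sample values from the tables above.
parsed_date = datetime.strptime("2015/08/26", DATE_FORMAT).date()
parsed_ts = datetime.strptime("2015/08/26 18:00", TIMESTAMP_FORMAT)

# Writing with the same pattern reproduces the original text.
date_text = parsed_date.strftime(DATE_FORMAT)
ts_text = parsed_ts.strftime(TIMESTAMP_FORMAT)
print(date_text)  # 2015/08/26
print(ts_text)    # 2015/08/26 18:00
```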
      
      ## How was this patch tested?
      
      Unit tests were added in `CSVSuite` and `JsonSuite`. For JSON, existing tests cover the default cases.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14279 from HyukjinKwon/SPARK-16216-json-csv.
      29952ed0
  4. Jul 28, 2016
  5. Jul 01, 2016
    • [SPARK-16335][SQL] Structured streaming should fail if source directory does not exist · d601894c
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      In structured streaming, Spark does not report an error when the specified directory does not exist, which differs from batch mode. This patch changes the behavior to fail if the directory does not exist (when the path is not a glob pattern).
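
The rule described above (fail fast on a missing path, unless the path is a glob pattern) can be sketched as follows. This is a plain-Python illustration, not Spark's implementation; `check_source_path` and the `GLOB_CHARS` set are hypothetical names and assumptions of this sketch.

```python
import os
import tempfile

GLOB_CHARS = set("*?[]{}")  # assumed set of glob metacharacters


def check_source_path(path: str) -> None:
    """Raise if a non-glob path does not exist; glob patterns are not checked."""
    if any(ch in GLOB_CHARS for ch in path):
        return
    if not os.path.exists(path):
        raise FileNotFoundError(f"Source path does not exist: {path}")


with tempfile.TemporaryDirectory() as existing:
    check_source_path(existing)                     # exists: no error
    check_source_path(os.path.join(existing, "*"))  # glob: not checked
    missing = os.path.join(existing, "missing")
    try:
        check_source_path(missing)
        failed = False
    except FileNotFoundError:
        failed = True

print(failed)  # True
```

Skipping the check for globs reflects that a pattern may legitimately match nothing yet, whereas a literal path that is absent is more likely a mistake.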
      
      ## How was this patch tested?
      Updated unit tests to reflect the new behavior.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #14002 from rxin/SPARK-16335.
      d601894c
  6. Jun 30, 2016
    • [SPARK-16313][SQL] Spark should not silently drop exceptions in file listing · 3d75a5b2
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      Spark silently drops exceptions during file listing. This is very bad behavior because it can mask legitimate errors, and the resulting plan will silently have 0 rows. This patch stops dropping those exceptions so that the errors surface.
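
The difference between the two behaviors can be illustrated with a small Python sketch. It uses `os.listdir` rather than Spark's file listing, and both function names are hypothetical.

```python
import os
import tempfile
from typing import List


def list_files_lenient(path: str) -> List[str]:
    """Swallow listing errors: a missing or unreadable directory looks empty."""
    try:
        return sorted(os.listdir(path))
    except OSError:
        return []


def list_files_strict(path: str) -> List[str]:
    """Let listing errors propagate so the caller sees the real failure."""
    return sorted(os.listdir(path))


with tempfile.TemporaryDirectory() as base:
    missing = os.path.join(base, "missing")
    lenient_result = list_files_lenient(missing)
    try:
        list_files_strict(missing)
        raised = False
    except FileNotFoundError:
        raised = True

print(lenient_result)  # []
print(raised)          # True
```

With the lenient version, a caller cannot tell a failed listing from an empty directory, which is how a query can end up with 0 rows and no error.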
      
      ## How was this patch tested?
      Manually verified.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13987 from rxin/SPARK-16313.
      3d75a5b2
  7. Jun 29, 2016
    • [SPARK-16266][SQL][STREAMING] Moved DataStreamReader/Writer from pyspark.sql to pyspark.sql.streaming · f454a7f9
      Tathagata Das authored
      
      ## What changes were proposed in this pull request?
      
      - Moved DataStreamReader/Writer from pyspark.sql to pyspark.sql.streaming to make them consistent with the Scala packaging
      - Exposed the necessary classes in sql.streaming package so that they appear in the docs
      - Added pyspark.sql.streaming module to the docs
      
      ## How was this patch tested?
      - Updated unit tests.
      - Generated docs to check the visibility of pyspark.sql.streaming classes.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #13955 from tdas/SPARK-16266.
      f454a7f9
  8. Jun 15, 2016
  9. Jun 14, 2016
    • [SPARK-15935][PYSPARK] Fix a wrong format tag in the error message · 0ee9fd9e
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      A follow-up PR for #13655 to fix a wrong format tag.
      
      ## How was this patch tested?
      
      Jenkins unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13665 from zsxwing/fix.
      0ee9fd9e
    • [SPARK-15933][SQL][STREAMING] Refactored DF reader-writer to use readStream and writeStream for streaming DFs · 214adb14
      Tathagata Das authored
      
      ## What changes were proposed in this pull request?
      Currently, the DataFrameReader/Writer has methods that are needed for both streaming and non-streaming DFs. This is quite awkward because each method throws a runtime exception in one case or the other. So rather than having half the methods throw runtime exceptions, it's better to have a separate reader/writer API for streams.
      
      - [x] Python API!!
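
The design argument above can be sketched with toy classes. These are hypothetical stand-ins written for illustration only, not Spark's DataFrameWriter or DataStreamWriter, and the method names `save` and `start` are assumptions of this sketch.

```python
class ToyUnifiedWriter:
    """One writer for both modes: each method fails in one mode or the other."""

    def __init__(self, is_streaming: bool) -> None:
        self.is_streaming = is_streaming

    def save(self) -> str:
        if self.is_streaming:
            raise RuntimeError("save() is not supported on streaming data")
        return "batch write"

    def start(self) -> str:
        if not self.is_streaming:
            raise RuntimeError("start() is not supported on batch data")
        return "streaming query started"


class ToyBatchWriter:
    """Batch-only writer: exposes only what batch writes support."""

    def save(self) -> str:
        return "batch write"


class ToyStreamWriter:
    """Stream-only writer: exposes only what streaming writes support."""

    def start(self) -> str:
        return "streaming query started"


try:
    ToyUnifiedWriter(is_streaming=True).save()
    unified_failed = False
except RuntimeError:
    unified_failed = True

print(unified_failed)                      # True
print(hasattr(ToyStreamWriter(), "save"))  # False
```

With separate classes, an unsupported call is simply absent from the API instead of being discovered through a runtime exception.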
      
      ## How was this patch tested?
      Existing unit tests + two sets of unit tests for DataFrameReader/Writer and DataStreamReader/Writer.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #13653 from tdas/SPARK-15933.
      214adb14
    • [SPARK-15935][PYSPARK] Enable test for sql/streaming.py and fix these tests · 96c3500c
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR just enables tests for sql/streaming.py and also fixes the failures.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13655 from zsxwing/python-streaming-test.
      96c3500c
  10. Jun 06, 2016
    • [MINOR] Fix Typos 'an -> a' · fd8af397
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
      `an -> a`
      
      Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one.
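
The same candidate search can be sketched in Python to show why manual review is needed. The regular expression is the one from the `grep` command above, with `re.IGNORECASE` standing in for `-i`; the sample lines are invented for illustration.

```python
import re

# Same pattern as the grep command; IGNORECASE mirrors grep's -i flag.
CANDIDATE = re.compile(r" an [^aeiou]", re.IGNORECASE)

lines = [
    "returns an vector of values",  # genuine typo: should be "a vector"
    "returns an array of values",   # correct, not matched
    "finishes within an hour",      # matched, but correct: needs manual review
]

candidates = [line for line in lines if CANDIDATE.search(line)]
print(candidates)
# ['returns an vector of values', 'finishes within an hour']
```

The pattern only looks at spelling, so words with a silent leading consonant such as "hour" show up as false positives.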
      
      ## How was this patch tested?
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #13515 from zhengruifeng/an_a.
      fd8af397
  11. Jun 01, 2016
    • [SPARK-15686][SQL] Move user-facing streaming classes into sql.streaming · a71d1364
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch moves all user-facing structured streaming classes into sql.streaming. As part of this, I also added `since` version annotations to methods and classes that lacked them.
      
      ## How was this patch tested?
      Updated tests to reflect the moves.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13429 from rxin/SPARK-15686.
      a71d1364
  12. May 04, 2016
    • [SPARK-14896][SQL] Deprecate HiveContext in python · fa79d346
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      See title.
      
      ## How was this patch tested?
      
      PySpark tests.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12917 from andrewor14/deprecate-hive-context-python.
      fa79d346
  13. Apr 28, 2016
    • [SPARK-14555] Second cut of Python API for Structured Streaming · 78c8aaf8
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      This PR adds Python APIs for:
       - `ContinuousQueryManager`
       - `ContinuousQueryException`
      
      The `ContinuousQueryException` is a very basic wrapper. It doesn't provide the functionality that the Scala side provides, but it follows the same pattern as `AnalysisException`.
      
      For `ContinuousQueryManager`, all APIs are provided except for registering listeners.
      
      This PR also attempts to fix test flakiness by stopping all active streams just before tests.
      
      ## How was this patch tested?
      
      Python Doc tests and unit tests
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #12673 from brkyvz/pyspark-cqm.
      78c8aaf8
  14. Apr 20, 2016
    • [SPARK-14555] First cut of Python API for Structured Streaming · 80bf48f4
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      This patch provides a first cut of Python APIs for structured streaming. This PR provides the new classes:
       - ContinuousQuery
       - Trigger
       - ProcessingTime
      in pyspark under `pyspark.sql.streaming`.
      
      In addition, it contains the new methods added under:
       -  `DataFrameWriter`
           a) `startStream`
           b) `trigger`
           c) `queryName`
      
       -  `DataFrameReader`
           a) `stream`
      
       - `DataFrame`
          a) `isStreaming`
      
      This PR doesn't contain all methods exposed for `ContinuousQuery`, for example:
       - `exception`
       - `sourceStatuses`
       - `sinkStatus`
      
      They may be added in a follow-up.
      
      This PR also contains some very minor doc fixes on the Scala side.
      
      ## How was this patch tested?
      
      Python doc tests
      
      TODO:
       - [ ] verify Python docs look good
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      Author: Burak Yavuz <burak@databricks.com>
      
      Closes #12320 from brkyvz/stream-python.
      80bf48f4