Skip to content
Snippets Groups Projects
  • hyukjinkwon's avatar
    29952ed0
    [SPARK-16216][SQL] Read/write timestamps and dates in ISO 8601 and... · 29952ed0
    hyukjinkwon authored
    [SPARK-16216][SQL] Read/write timestamps and dates in ISO 8601 and dateFormat/timestampFormat option for CSV and JSON
    
    ## What changes were proposed in this pull request?
    
    ### Default - ISO 8601
    
    Currently, CSV datasource is writing `Timestamp` and `Date` as numeric form and JSON datasource is writing both as below:
    
    - CSV
      ```
      // TimestampType
      1414459800000000
      // DateType
      16673
      ```
    
    - Json
    
      ```
      // TimestampType
      1970-01-01 11:46:40.0
      // DateType
      1970-01-01
      ```
    
    So, for CSV we can't read back what we write and for JSON it becomes ambiguous because the timezone is being missed.
    
    So, this PR make both **write** `Timestamp` and `Date` in ISO 8601 formatted string (please refer the [ISO 8601 specification](https://www.w3.org/TR/NOTE-datetime)).
    
    - For `Timestamp` it becomes as below: (`yyyy-MM-dd'T'HH:mm:ss.SSSZZ`)
    
      ```
      1970-01-01T02:00:01.000-01:00
      ```
    
    - For `Date` it becomes as below (`yyyy-MM-dd`)
    
      ```
      1970-01-01
      ```
    
    ### Custom date format option - `dateFormat`
    
    This PR also adds the support to write and read dates and timestamps in a formatted string as below:
    
    - **DateType**
    
      - With `dateFormat` option (e.g. `yyyy/MM/dd`)
    
        ```
        +----------+
        |      date|
        +----------+
        |2015/08/26|
        |2014/10/27|
        |2016/01/28|
        +----------+
        ```
    
    ### Custom date format option - `timestampFormat`
    
    - **TimestampType**
    
      - With `dateFormat` option (e.g. `dd/MM/yyyy HH:mm`)
    
        ```
        +----------------+
        |            date|
        +----------------+
        |2015/08/26 18:00|
        |2014/10/27 18:30|
        |2016/01/28 20:00|
        +----------------+
        ```
    
    ## How was this patch tested?
    
    Unit tests were added in `CSVSuite` and `JsonSuite`. For JSON, existing tests cover the default cases.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #14279 from HyukjinKwon/SPARK-16216-json-csv.
    29952ed0
    History
    [SPARK-16216][SQL] Read/write timestamps and dates in ISO 8601 and...
    hyukjinkwon authored
    [SPARK-16216][SQL] Read/write timestamps and dates in ISO 8601 and dateFormat/timestampFormat option for CSV and JSON
    
    ## What changes were proposed in this pull request?
    
    ### Default - ISO 8601
    
    Currently, CSV datasource is writing `Timestamp` and `Date` as numeric form and JSON datasource is writing both as below:
    
    - CSV
      ```
      // TimestampType
      1414459800000000
      // DateType
      16673
      ```
    
    - Json
    
      ```
      // TimestampType
      1970-01-01 11:46:40.0
      // DateType
      1970-01-01
      ```
    
    So, for CSV we can't read back what we write and for JSON it becomes ambiguous because the timezone is being missed.
    
    So, this PR make both **write** `Timestamp` and `Date` in ISO 8601 formatted string (please refer the [ISO 8601 specification](https://www.w3.org/TR/NOTE-datetime)).
    
    - For `Timestamp` it becomes as below: (`yyyy-MM-dd'T'HH:mm:ss.SSSZZ`)
    
      ```
      1970-01-01T02:00:01.000-01:00
      ```
    
    - For `Date` it becomes as below (`yyyy-MM-dd`)
    
      ```
      1970-01-01
      ```
    
    ### Custom date format option - `dateFormat`
    
    This PR also adds the support to write and read dates and timestamps in a formatted string as below:
    
    - **DateType**
    
      - With `dateFormat` option (e.g. `yyyy/MM/dd`)
    
        ```
        +----------+
        |      date|
        +----------+
        |2015/08/26|
        |2014/10/27|
        |2016/01/28|
        +----------+
        ```
    
    ### Custom date format option - `timestampFormat`
    
    - **TimestampType**
    
      - With `dateFormat` option (e.g. `dd/MM/yyyy HH:mm`)
    
        ```
        +----------------+
        |            date|
        +----------------+
        |2015/08/26 18:00|
        |2014/10/27 18:30|
        |2016/01/28 20:00|
        +----------------+
        ```
    
    ## How was this patch tested?
    
    Unit tests were added in `CSVSuite` and `JsonSuite`. For JSON, existing tests cover the default cases.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #14279 from HyukjinKwon/SPARK-16216-json-csv.