-
- Downloads
[SPARK-18937][SQL] Timezone support in CSV/JSON parsing
## What changes were proposed in this pull request? This is a follow-up pr of #16308. This pr enables timezone support in CSV/JSON parsing. We should introduce `timeZone` option for CSV/JSON datasources (the default value of the option is session local timezone). The datasources should use the `timeZone` option to format/parse to write/read timestamp values. Notice that while reading, if the timestampFormat has the timezone info, the timezone will not be used because we should respect the timezone in the values. For example, if you have timestamp `"2016-01-01 00:00:00"` in `GMT`, the values written with the default timezone option, which is `"GMT"` because session local timezone is `"GMT"` here, are: ```scala scala> spark.conf.set("spark.sql.session.timeZone", "GMT") scala> val df = Seq(new java.sql.Timestamp(1451606400000L)).toDF("ts") df: org.apache.spark.sql.DataFrame = [ts: timestamp] scala> df.show() +-------------------+ |ts | +-------------------+ |2016-01-01 00:00:00| +-------------------+ scala> df.write.json("/path/to/gmtjson") ``` ```sh $ cat /path/to/gmtjson/part-* {"ts":"2016-01-01T00:00:00.000Z"} ``` whereas setting the option to `"PST"`, they are: ```scala scala> df.write.option("timeZone", "PST").json("/path/to/pstjson") ``` ```sh $ cat /path/to/pstjson/part-* {"ts":"2015-12-31T16:00:00.000-08:00"} ``` We can properly read these files even if the timezone option is wrong because the timestamp values have timezone info: ```scala scala> val schema = new StructType().add("ts", TimestampType) schema: org.apache.spark.sql.types.StructType = StructType(StructField(ts,TimestampType,true)) scala> spark.read.schema(schema).json("/path/to/gmtjson").show() +-------------------+ |ts | +-------------------+ |2016-01-01 00:00:00| +-------------------+ scala> spark.read.schema(schema).option("timeZone", "PST").json("/path/to/gmtjson").show() +-------------------+ |ts | +-------------------+ |2016-01-01 00:00:00| +-------------------+ ``` And even if `timezoneFormat` doesn't contain timezone info, we can properly read the values with setting correct timezone option: ```scala scala> df.write.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("timeZone", "JST").json("/path/to/jstjson") ``` ```sh $ cat /path/to/jstjson/part-* {"ts":"2016-01-01T09:00:00"} ``` ```scala // wrong result scala> spark.read.schema(schema).option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").json("/path/to/jstjson").show() +-------------------+ |ts | +-------------------+ |2016-01-01 09:00:00| +-------------------+ // correct result scala> spark.read.schema(schema).option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("timeZone", "JST").json("/path/to/jstjson").show() +-------------------+ |ts | +-------------------+ |2016-01-01 00:00:00| +-------------------+ ``` This pr also makes `JsonToStruct` and `StructToJson` `TimeZoneAwareExpression` to be able to evaluate values with timezone option. ## How was this patch tested? Existing tests and added some tests. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #16750 from ueshin/issues/SPARK-18937.
Showing
- python/pyspark/sql/readwriter.py 27 additions, 16 deletionspython/pyspark/sql/readwriter.py
- python/pyspark/sql/streaming.py 12 additions, 8 deletionspython/pyspark/sql/streaming.py
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala 24 additions, 6 deletions...ache/spark/sql/catalyst/expressions/jsonExpressions.scala
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala 7 additions, 4 deletions...cala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala 1 addition, 1 deletion...org/apache/spark/sql/catalyst/json/JacksonGenerator.scala
- sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala 96 additions, 17 deletions...spark/sql/catalyst/expressions/JsonExpressionsSuite.scala
- sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala 7 additions, 1 deletion...src/main/scala/org/apache/spark/sql/DataFrameReader.scala
- sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala 4 additions, 0 deletions...src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
- sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala 4 additions, 2 deletionssql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala 4 additions, 4 deletions...e/spark/sql/execution/datasources/csv/CSVFileFormat.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala 8 additions, 13 deletions...ache/spark/sql/execution/datasources/csv/CSVOptions.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityGenerator.scala 1 addition, 1 deletion...rk/sql/execution/datasources/csv/UnivocityGenerator.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala 1 addition, 1 deletion...spark/sql/execution/datasources/csv/UnivocityParser.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala 6 additions, 3 deletions...spark/sql/execution/datasources/json/JsonFileFormat.scala
- sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala 4 additions, 0 deletions...ala/org/apache/spark/sql/streaming/DataStreamReader.scala
- sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchemaSuite.scala 11 additions, 11 deletions...k/sql/execution/datasources/csv/CSVInferSchemaSuite.scala
- sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala 43 additions, 1 deletion...apache/spark/sql/execution/datasources/csv/CSVSuite.scala
- sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParserSuite.scala 44 additions, 29 deletions.../sql/execution/datasources/csv/UnivocityParserSuite.scala
- sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala 42 additions, 4 deletions...ache/spark/sql/execution/datasources/json/JsonSuite.scala
- sql/core/src/test/scala/org/apache/spark/sql/sources/ResolvedDataSourceSuite.scala 5 additions, 1 deletion...rg/apache/spark/sql/sources/ResolvedDataSourceSuite.scala
Loading
Please register or sign in to comment