-
- Downloads
[SPARK-18699][SQL] Put malformed tokens into a new field when parsing CSV data
## What changes were proposed in this pull request? This pr added a logic to put malformed tokens into a new field when parsing CSV data in case of permissive modes. In the current master, if the CSV parser hits these malformed ones, it throws an exception below (and then a job fails); ``` Caused by: java.lang.IllegalArgumentException at java.sql.Date.valueOf(Date.java:143) at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137) at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272) at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272) at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272) at scala.util.Try.getOrElse(Try.scala:79) at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269) at ``` In case that users load large CSV-formatted data, the job failure makes users get some confused. So, this fix set NULL for original columns and put malformed tokens in a new field. ## How was this patch tested? Added tests in `CSVSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #16928 from maropu/SPARK-18699-2.
Showing
- python/pyspark/sql/readwriter.py 24 additions, 8 deletionspython/pyspark/sql/readwriter.py
- python/pyspark/sql/streaming.py 24 additions, 8 deletionspython/pyspark/sql/streaming.py
- sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala 14 additions, 4 deletions...src/main/scala/org/apache/spark/sql/DataFrameReader.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala 22 additions, 9 deletions...e/spark/sql/execution/datasources/csv/CSVFileFormat.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala 15 additions, 3 deletions...ache/spark/sql/execution/datasources/csv/CSVOptions.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala 49 additions, 13 deletions...spark/sql/execution/datasources/csv/UnivocityParser.scala
- sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala 14 additions, 4 deletions...ala/org/apache/spark/sql/streaming/DataStreamReader.scala
- sql/core/src/test/resources/test-data/value-malformed.csv 2 additions, 0 deletionssql/core/src/test/resources/test-data/value-malformed.csv
- sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala 59 additions, 4 deletions...apache/spark/sql/execution/datasources/csv/CSVSuite.scala
Please register or sign in to comment