[SPARK-18699][SQL] Put malformed tokens into a new field when parsing CSV data
    Takeshi Yamamuro authored
    ## What changes were proposed in this pull request?
This PR adds logic to put malformed tokens into a new field when parsing CSV data in PERMISSIVE mode. In the current master, if the CSV parser hits a malformed token, it throws the exception below (and the job fails):
    ```
    Caused by: java.lang.IllegalArgumentException
    	at java.sql.Date.valueOf(Date.java:143)
    	at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
    	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272)
    	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
    	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
    	at scala.util.Try.getOrElse(Try.scala:79)
    	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269)
    	at
    ```
When users load large CSV-formatted data, this kind of job failure is confusing. So this fix sets NULL for the original columns and puts the malformed tokens into a new field.
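The behavior described above can be illustrated with a minimal, Spark-free sketch. This is not Spark's actual implementation; the column names, the two-column schema, and the `parse_row` helper are hypothetical, chosen only to show the idea: when a token fails to cast in permissive mode, the typed columns become NULL and the raw record is kept in a corrupt-record column (Spark's default name for it is `_corrupt_record`).

```python
from datetime import date

# Assumed default name of the corrupt-record column, as in
# spark.sql.columnNameOfCorruptRecord. Illustrative only.
CORRUPT_COL = "_corrupt_record"

def cast_date(token):
    # Mimics java.sql.Date.valueOf: expects yyyy-MM-dd, raises otherwise.
    return date.fromisoformat(token)

def parse_row(line, casts, names):
    # Hypothetical sketch of permissive-mode parsing, not Spark's code.
    tokens = line.split(",")
    try:
        values = [cast(tok) for cast, tok in zip(casts, tokens)]
        return {**dict(zip(names, values)), CORRUPT_COL: None}
    except (ValueError, IndexError):
        # Permissive mode: NULL out the typed columns, keep the raw record
        # instead of failing the whole job.
        return {**{n: None for n in names}, CORRUPT_COL: line}

rows = [
    parse_row(line, (int, cast_date), ("id", "d"))
    for line in ["1,2017-02-01", "2,not-a-date"]
]
```

Here the first row parses normally, while the second (with an unparsable date) survives as NULLs plus its original text in `_corrupt_record`, mirroring the fix's intent.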
    
    ## How was this patch tested?
    Added tests in `CSVSuite`.
    
    Author: Takeshi Yamamuro <yamamuro@apache.org>
    
    Closes #16928 from maropu/SPARK-18699-2.