[SPARK-21610][SQL] Corrupt records are not handled properly when creating a dataframe from a file
    ## What changes were proposed in this pull request?
```bash
    echo '{"field": 1}
    {"field": 2}
    {"field": "3"}' >/tmp/sample.json
    ```
    
    ```scala
    import org.apache.spark.sql.types._
    
    val schema = new StructType()
      .add("field", ByteType)
      .add("_corrupt_record", StringType)
    
    val file = "/tmp/sample.json"
    
    val dfFromFile = spark.read.schema(schema).json(file)
    
    scala> dfFromFile.show(false)
    +-----+---------------+
    |field|_corrupt_record|
    +-----+---------------+
    |1    |null           |
    |2    |null           |
    |null |{"field": "3"} |
    +-----+---------------+
    
    scala> dfFromFile.filter($"_corrupt_record".isNotNull).count()
    res1: Long = 0
    
    scala> dfFromFile.filter($"_corrupt_record".isNull).count()
    res2: Long = 3
    ```
When `requiredSchema` contains only `_corrupt_record`, the derived `actualSchema` is empty, so no data fields are parsed and `_corrupt_record` is null for every row. This PR detects this situation and raises an exception with a reasonable workaround message, so that users know what happened and how to fix the query.
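For context, a minimal sketch of the kind of guard this PR introduces, written as a standalone helper. The names `requiredSchema` and `columnNameOfCorruptRecord` mirror the Spark internals described above; the merged code performs this check inside the JSON reader and raises an `AnalysisException`, so the shape and message here are illustrative only:

```scala
import org.apache.spark.sql.types.StructType

// Sketch of the guard: reject a scan whose required schema consists solely
// of the corrupt-record column, because in that case no data fields are
// parsed and the column can never be populated.
def checkCorruptRecordOnlySchema(
    requiredSchema: StructType,
    columnNameOfCorruptRecord: String): Unit = {
  if (requiredSchema.length == 1 &&
      requiredSchema.head.name == columnNameOfCorruptRecord) {
    // Spark itself throws an AnalysisException here; a plain exception
    // keeps this sketch runnable outside the spark-sql source tree.
    throw new IllegalArgumentException(
      "Referencing only the internal corrupt record column is disallowed; " +
        "cache or save the parsed results first, then query them.")
  }
}
```

The workaround suggested to users is to materialize the fully parsed result before querying the corrupt-record column on its own, e.g. (reusing `schema` and `file` from the snippet above):

```scala
val df = spark.read.schema(schema).json(file).cache()
df.filter($"_corrupt_record".isNotNull).count()  // 1 for the sample data above
```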
    
    ## How was this patch tested?
    
    Added test case.
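A rough sketch of what such a test might assert (hypothetical; the merged test in the Spark suite may differ in structure and in the exact message it checks):

```scala
import org.apache.spark.sql.types.{StringType, StructType}

// Expect the new guard to fail a query whose schema contains only the
// corrupt-record column (hypothetical shape; not the merged test).
val corruptOnly = new StructType().add("_corrupt_record", StringType)
val failedAsExpected = try {
  spark.read.schema(corruptOnly).json("/tmp/sample.json").collect()
  false
} catch {
  case e: Exception => e.getMessage.contains("corrupt record")
}
assert(failedAsExpected)
```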
    
    Author: Jen-Ming Chung <jenmingisme@gmail.com>
    
    Closes #18865 from jmchung/SPARK-21610.