[SPARK-3339][SQL] Support for skipping json lines that fail to parse
This PR provides a way to skip or query corrupt JSON records. To do so, we introduce an internal column that holds corrupt records. Its default name is `_corrupt_record`, and the name can be changed by setting `spark.sql.columnNameOfCorruptRecord`. When a record fails to parse, we put it into this internal column in its unparsed (raw string) form. Users can then skip or query corrupt records through SQL.

* To query the corrupt records
```
-- For Hive parser
SELECT `_corrupt_record`
FROM jsonTable
WHERE `_corrupt_record` IS NOT NULL
-- For our SQL parser
SELECT _corrupt_record
FROM jsonTable
WHERE _corrupt_record IS NOT NULL
```
* To skip corrupt records and query regular records
```
-- For Hive parser
SELECT field1, field2
FROM jsonTable
WHERE `_corrupt_record` IS NULL
-- For our SQL parser
SELECT field1, field2
FROM jsonTable
WHERE _corrupt_record IS NULL
```

Generally, changing the name of the internal column is not recommended. If the name has to be changed to avoid a conflict with an existing field name, use `sqlContext.setConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD, <new column name>)` or `sqlContext.sql("SET spark.sql.columnNameOfCorruptRecord=<new column name>")`.

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #2680 from yhuai/corruptJsonRecord and squashes the following commits:

4c9828e [Yin Huai] Merge remote-tracking branch 'upstream/master' into corruptJsonRecord
309616a [Yin Huai] Change the default name of corrupt record to "_corrupt_record".
b4a3632 [Yin Huai] Merge remote-tracking branch 'upstream/master' into corruptJsonRecord
9375ae9 [Yin Huai] Set the column name of corrupt json record back to the default one after the unit test.
ee584c0 [Yin Huai] Provide a way to query corrupt json records as unparsed strings.
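The workflow described in the commit message can be sketched end to end as follows. This is a minimal illustration, not code from the PR: it assumes a Spark build of that era on the classpath (the `SQLContext.jsonRDD` and `registerTempTable` APIs), a local master, and made-up sample records; the object name `CorruptRecordExample` and the replacement column name `_malformed` are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical driver program; names and sample data are illustrative only.
object CorruptRecordExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("corrupt-json").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // One well-formed record and two records that should fail to parse.
    val lines = sc.parallelize(Seq(
      """{"field1": 1, "field2": "a"}""",
      """{"field1": 2, "field2": """, // truncated object
      """not json at all"""
    ))

    // Optional: rename the internal column before loading, e.g. when the data
    // already has a field called _corrupt_record. Left commented out because
    // the commit message advises against renaming unless necessary.
    // sqlContext.setConf("spark.sql.columnNameOfCorruptRecord", "_malformed")

    // Schema inference runs here; unparsable lines are kept as raw strings
    // in the corrupt-record column instead of failing the job.
    val jsonTable = sqlContext.jsonRDD(lines)
    jsonTable.registerTempTable("jsonTable")

    // Regular records only.
    sqlContext
      .sql("SELECT field1, field2 FROM jsonTable WHERE _corrupt_record IS NULL")
      .collect()
      .foreach(println)

    // Corrupt records only, as unparsed strings.
    sqlContext
      .sql("SELECT _corrupt_record FROM jsonTable WHERE _corrupt_record IS NOT NULL")
      .collect()
      .foreach(println)

    sc.stop()
  }
}
```

If the column is renamed, the queries have to reference the new name, and the setting should be applied before `jsonRDD` is called so that schema inference and parsing use the same column name.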
Showing 6 changed files
- sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala: 4 additions, 0 deletions
- sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala: 10 additions, 4 deletions
- sql/core/src/main/scala/org/apache/spark/sql/api/java/JavaSQLContext.scala: 12 additions, 4 deletions
- sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala: 20 additions, 10 deletions
- sql/core/src/test/scala/org/apache/spark/sql/json/JsonSuite.scala: 61 additions, 1 deletion
- sql/core/src/test/scala/org/apache/spark/sql/json/TestJsonData.scala: 9 additions, 0 deletions