[SPARK-16101][SQL] Refactoring CSV schema inference path to be consistent with JSON
## What changes were proposed in this pull request?

This PR refactors the CSV schema inference path to be consistent with the JSON data source, and moves filtering code that shares similar or identical logic into `CSVUtils`.

It also gives the methods in these classes arguments consistent with their JSON counterparts. (This PR renames `.../json/InferSchema.scala` → `.../json/JsonInferSchema.scala`.)

`CSVInferSchema` and `JsonInferSchema`:

```scala
private[csv] object CSVInferSchema {
  ...
  def infer(
      csv: Dataset[String],
      caseSensitive: Boolean,
      options: CSVOptions): StructType = {
  ...
```

```scala
private[sql] object JsonInferSchema {
  ...
  def infer(
      json: RDD[String],
      columnNameOfCorruptRecord: String,
      configOptions: JSONOptions): StructType = {
  ...
```

These allow schema inference from `Dataset[String]` directly, so functionality that uses `JacksonParser`/`JsonInferSchema` for JSON can easily be implemented with `UnivocityParser`/`CSVInferSchema` for CSV.

This completes the refactoring of the CSV data source; the CSV and JSON data sources are now fairly consistent.

## How was this patch tested?

Existing tests should cover this, along with a Scala 2.10 build:

```
./dev/change-scala-version.sh 2.10
./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16680 from HyukjinKwon/SPARK-16101-schema-inference.
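For readers unfamiliar with what inferring a schema from raw CSV lines involves, here is a minimal, dependency-free sketch of the general idea: infer a type per field, then widen column types row by row. It is not Spark's implementation. `ToyCsvInference`, `ColType`, `inferField`, and `merge` are hypothetical names invented for illustration, and the sketch assumes header-less, comma-separated input with no quoting.

```scala
// Toy illustration only: every name here is hypothetical, not Spark's API.
object ToyCsvInference {
  sealed trait ColType
  case object NullT extends ColType
  case object IntT extends ColType
  case object DoubleT extends ColType
  case object StringT extends ColType

  // Infer the narrowest type that can hold a single field.
  def inferField(field: String): ColType =
    if (field.isEmpty) NullT
    else if (scala.util.Try(field.toInt).isSuccess) IntT
    else if (scala.util.Try(field.toDouble).isSuccess) DoubleT
    else StringT

  // Widen two candidate types to the narrowest common type.
  def merge(a: ColType, b: ColType): ColType = (a, b) match {
    case (x, y) if x == y => x
    case (NullT, other) => other
    case (other, NullT) => other
    case (IntT, DoubleT) | (DoubleT, IntT) => DoubleT
    case _ => StringT
  }

  // Infer one type per column; blank lines are skipped and
  // short rows are padded with empty (null) fields.
  def infer(lines: Seq[String]): Seq[ColType] = {
    val rows = lines.filter(_.trim.nonEmpty).map(_.split(",", -1).toSeq)
    if (rows.isEmpty) Seq.empty
    else {
      val width = rows.map(_.length).max
      rows.foldLeft(Seq.fill[ColType](width)(NullT)) { (acc, row) =>
        acc.zip(row.padTo(width, "")).map { case (t, f) => merge(t, inferField(f)) }
      }
    }
  }
}
```

With this sketch, `ToyCsvInference.infer(Seq("1,a,2.5", "2,b,3"))` yields the column types `IntT`, `StringT`, `DoubleT`: the third column widens to `DoubleT` because it mixes an integer with a decimal value.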
Changed files:
- sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala (2 additions, 2 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala (13 additions, 99 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala (69 additions, 46 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala (1 addition, 1 deletion)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala (134 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala (1 addition, 1 deletion)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonInferSchema.scala (1 addition, 1 deletion)
- sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtilsSuite.scala (47 additions, 0 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParserSuite.scala (0 additions, 24 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala (3 additions, 3 deletions)