[SPARK-19610][SQL] Support parsing multiline CSV files
## What changes were proposed in this pull request?

This PR adds support for multiline records in the CSV data source, modeled on the multiline support in the JSON data source (which, for JSON, works per file). It introduces a `wholeFile` option that makes the format non-splittable and reads each file as a whole.

Because the Univocity parser can produce rows one at a time from a stream, it should be able to parse very large documents as long as each individual row fits in memory.

## How was this patch tested?

Unit tests in `CSVSuite` and `tests.py`.

Manual tests with a single 9GB CSV file on the local file system, for example:

```scala
spark.read.option("wholeFile", true).option("inferSchema", true).csv("tmp.csv").count()
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16976 from HyukjinKwon/SPARK-19610.
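As a rough illustration of why `wholeFile` has to make the format non-splittable, the sketch below compares parsing a CSV text one physical line at a time with parsing it as a single stream. It uses Python's standard `csv` module, not Spark or the Univocity parser, and the sample data and the helper names `parse_line_by_line` and `parse_whole_stream` are hypothetical, chosen only for this example.

```python
import csv
import io

# Hypothetical sample: the second record has a newline inside a quoted field.
DATA = 'name,comment\nJoe,"likes\nlong walks"\nAmy,none\n'


def parse_line_by_line(text):
    """Parse every physical line on its own, as a line-splitting reader would."""
    rows = []
    for line in text.splitlines():
        # Each line is handed to a fresh parser, so quote state is lost
        # at every line boundary.
        rows.extend(csv.reader([line]))
    return rows


def parse_whole_stream(text):
    """Parse the text as one stream so a quoted newline stays in its field."""
    return list(csv.reader(io.StringIO(text, newline="")))


if __name__ == "__main__":
    print("line by line:", parse_line_by_line(DATA))
    print("whole stream:", parse_whole_stream(DATA))
```

The line-by-line version cuts the quoted record in two, while the stream version keeps it as one row with the newline preserved inside the field. A reader that splits input at line boundaries faces the same problem, which is presumably why the option described above reads each file whole.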
Changed files:
- python/pyspark/sql/readwriter.py (4 additions, 2 deletions)
- python/pyspark/sql/streaming.py (4 additions, 2 deletions)
- python/pyspark/sql/tests.py (8 additions, 1 deletion)
- python/test_support/sql/ages_newlines.csv (6 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala (1 addition, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CodecStreams.scala (12 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala (239 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala (21 additions, 56 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala (4 additions, 55 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala (2 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala (85 additions, 9 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala (6 additions, 12 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala (1 addition, 0 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala (132 additions, 60 deletions)