[SPARK-14480][SQL] Remove meaningless StringIteratorReader for CSV data source.
## What changes were proposed in this pull request?

This PR removes the meaningless `StringIteratorReader` from the CSV data source.

In `CSVParser.scala`, there is a `Reader` wrapping an `Iterator`, but this causes two problems.

Firstly, it was not actually faster than processing line by line with the `Iterator`, because of the additional logic needed to wrap the `Iterator` in a `Reader`.

Secondly, it added complexity, because extra logic is needed to allow every line to be read byte by byte. This made parsing issues quite difficult to track down (e.g. SPARK-14103).

A benchmark was performed manually, with the results below:

- Original code with `Reader` wrapping `Iterator`

| End-to-end (ns) | Parse Time (ns) |
|-----------------|-----------------|
| 14116265034     | 2008277960      |

- New code with `Iterator`

| End-to-end (ns) | Parse Time (ns) |
|-----------------|-----------------|
| 13451699644     | 1549050564      |

For details of the environment, dataset and methods, please refer to the JIRA ticket.

## How was this patch tested?

Existing tests should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #13808 from HyukjinKwon/SPARK-14480-small.
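The two approaches contrasted in the commit message can be sketched roughly as follows. This is an illustrative sketch only, not the code from the patch: `LineIteratorReader`, `CsvLineDemo`, `parseLine`, `viaReader` and `viaIterator` are hypothetical names, and a naive comma split (no quoting or escaping) stands in for the real CSV tokenizer.

```scala
import java.io.{BufferedReader, Reader}

// Hypothetical stand-in for the removed class: exposes an Iterator[String]
// as a java.io.Reader, re-appending the newline that line splitting dropped.
class LineIteratorReader(lines: Iterator[String]) extends Reader {
  private var current: String = ""
  private var pos: Int = 0

  override def read(cbuf: Array[Char], off: Int, len: Int): Int = {
    if (len == 0) return 0
    // Refill from the iterator once the current line is exhausted.
    while (pos >= current.length) {
      if (!lines.hasNext) return -1
      current = lines.next() + "\n"
      pos = 0
    }
    val n = math.min(len, current.length - pos)
    current.getChars(pos, pos + n, cbuf, off)
    pos += n
    n
  }

  override def close(): Unit = ()
}

object CsvLineDemo {
  // Naive per-line tokenizer (assumption: no quoted or escaped fields).
  def parseLine(line: String): Array[String] = line.split(",", -1)

  // Approach 1: wrap the iterator in a Reader, then recover the lines
  // from the character stream before tokenizing them.
  def viaReader(lines: Iterator[String]): List[Array[String]] = {
    val in = new BufferedReader(new LineIteratorReader(lines))
    Iterator.continually(in.readLine()).takeWhile(_ != null).map(parseLine).toList
  }

  // Approach 2: tokenize each line straight from the iterator.
  def viaIterator(lines: Iterator[String]): List[Array[String]] =
    lines.map(parseLine).toList

  def main(args: Array[String]): Unit = {
    val data = List("a,b,c", "1,2,3")
    val fromReader = viaReader(data.iterator).map(_.toList)
    val fromIterator = viaIterator(data.iterator).map(_.toList)
    // Both paths should yield the same rows; the second avoids the wrapper.
    assert(fromReader == fromIterator)
    fromIterator.foreach(row => println(row.mkString("|")))
  }
}
```

In this sketch the `Reader` path has to copy every line into a character buffer and then find the line boundaries again, which is the kind of extra work the commit message describes; the iterator path skips that step entirely.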
Changed files:
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala (9 additions, 3 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVParser.scala (10 additions, 158 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala (14 additions, 5 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVParserSuite.scala (0 additions, 125 deletions)