-
- Downloads
[SPARK-18579][SQL] Use ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace options in CSV writing
## What changes were proposed in this pull request? This PR proposes to support _not_ trimming the white spaces when writing out. These are `false` by default in CSV reading path but these are `true` by default in CSV writing in univocity parser. Both `ignoreLeadingWhiteSpace` and `ignoreTrailingWhiteSpace` options are not being used for writing and therefore, we are always trimming the white spaces. It seems we should provide a way to keep this white spaces easily. WIth the data below: ```scala val df = spark.read.csv(Seq("a , b , c").toDS) df.show() ``` ``` +---+----+---+ |_c0| _c1|_c2| +---+----+---+ | a | b | c| +---+----+---+ ``` **Before** ```scala df.write.csv("/tmp/text.csv") spark.read.text("/tmp/text.csv").show() ``` ``` +-----+ |value| +-----+ |a,b,c| +-----+ ``` It seems this can't be worked around via `quoteAll` too. ```scala df.write.option("quoteAll", true).csv("/tmp/text.csv") spark.read.text("/tmp/text.csv").show() ``` ``` +-----------+ | value| +-----------+ |"a","b","c"| +-----------+ ``` **After** ```scala df.write.option("ignoreLeadingWhiteSpace", false).option("ignoreTrailingWhiteSpace", false).csv("/tmp/text.csv") spark.read.text("/tmp/text.csv").show() ``` ``` +----------+ | value| +----------+ |a , b , c| +----------+ ``` Note that this case is possible in R ```r > system("cat text.csv") f1,f2,f3 a , b , c > df <- read.csv(file="text.csv") > df f1 f2 f3 1 a b c > write.csv(df, file="text1.csv", quote=F, row.names=F) > system("cat text1.csv") f1,f2,f3 a , b , c ``` ## How was this patch tested? Unit tests in `CSVSuite` and manual tests for Python. Author: hyukjinkwon <gurwls223@gmail.com> Closes #17310 from HyukjinKwon/SPARK-18579.
Showing
- python/pyspark/sql/readwriter.py 18 additions, 10 deletionspython/pyspark/sql/readwriter.py
- python/pyspark/sql/streaming.py 6 additions, 6 deletionspython/pyspark/sql/streaming.py
- python/pyspark/sql/tests.py 13 additions, 0 deletionspython/pyspark/sql/tests.py
- sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala 3 additions, 3 deletions...src/main/scala/org/apache/spark/sql/DataFrameReader.scala
- sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala 5 additions, 1 deletion...src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala 11 additions, 4 deletions...ache/spark/sql/execution/datasources/csv/CSVOptions.scala
- sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala 3 additions, 3 deletions...ala/org/apache/spark/sql/streaming/DataStreamReader.scala
- sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala 57 additions, 0 deletions...apache/spark/sql/execution/datasources/csv/CSVSuite.scala
Loading
Please register or sign in to comment