Commit a832cef1 authored by hyukjinkwon, committed by Reynold Xin

[SPARK-13425][SQL] Documentation for CSV datasource options

## What changes were proposed in this pull request?

This PR adds documentation for the CSV data source options for reading and writing.

## How was this patch tested?

Documentation style was checked with `./dev/run_tests`.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>

Closes #12817 from HyukjinKwon/SPARK-13425.
parent a6428292
@@ -282,6 +282,45 @@ class DataFrameReader(object):
:param paths: string, or list of strings, for input path(s).
You can set the following CSV-specific options to deal with CSV files:
* ``sep`` (default ``,``): sets the single character as a separator \
for each field and value.
* ``charset`` (default ``UTF-8``): decodes the CSV files by the given \
encoding type.
* ``quote`` (default ``"``): sets the single character used for escaping \
quoted values where the separator can be part of the value.
* ``escape`` (default ``\``): sets the single character used for escaping quotes \
inside an already quoted value.
* ``comment`` (default empty string): sets the single character used for skipping \
lines beginning with this character. By default, it is disabled.
* ``header`` (default ``false``): uses the first line as names of columns.
* ``ignoreLeadingWhiteSpace`` (default ``false``): defines whether or not leading \
whitespaces from values being read should be skipped.
* ``ignoreTrailingWhiteSpace`` (default ``false``): defines whether or not trailing \
whitespaces from values being read should be skipped.
* ``nullValue`` (default empty string): sets the string representation of a null value.
* ``nanValue`` (default ``NaN``): sets the string representation of a non-number \
value.
* ``positiveInf`` (default ``Inf``): sets the string representation of a positive \
infinity value.
* ``negativeInf`` (default ``-Inf``): sets the string representation of a negative \
infinity value.
* ``dateFormat`` (default ``None``): sets the string that indicates a date format. \
Custom date formats follow the formats at ``java.text.SimpleDateFormat``. This \
applies to both date type and timestamp type. By default, it is None, which means \
times and dates are parsed by ``java.sql.Timestamp.valueOf()`` and \
``java.sql.Date.valueOf()``.
* ``maxColumns`` (default ``20480``): defines a hard limit of how many columns \
a record can have.
* ``maxCharsPerColumn`` (default ``1000000``): defines the maximum number of \
characters allowed for any given value being read.
* ``mode`` (default ``PERMISSIVE``): allows a mode for dealing with corrupt records \
during parsing.
* ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted record. \
When a schema is set by the user, it sets ``null`` for extra fields.
* ``DROPMALFORMED`` : ignores the whole corrupted records.
* ``FAILFAST`` : throws an exception when it meets corrupted records.
>>> df = sqlContext.read.csv('python/test_support/sql/ages.csv')
>>> df.dtypes
[('C0', 'string'), ('C1', 'string')]
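
The reader options above compose through `option()` calls. A minimal sketch (the file contents and the `sqlContext` fixture are assumptions, mirroring the doctest above):

```python
import os
import tempfile

# Sketch only: assumes a SQLContext named `sqlContext`, as in the doctests.
path = os.path.join(tempfile.mkdtemp(), "people.csv")
with open(path, "w") as f:
    f.write("name;age\nTom; 25\n")   # semicolon-separated, header row, stray space

df = (sqlContext.read
      .option("sep", ";")                         # instead of the default ","
      .option("header", "true")                   # first line holds column names
      .option("ignoreLeadingWhiteSpace", "true")  # drops the space before 25
      .csv(path))
df.show()
```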
@@ -663,6 +702,19 @@ class DataFrameWriter(object):
known case-insensitive shortened names (none, bzip2, gzip, lz4,
snappy and deflate).
You can set the following CSV-specific options to deal with CSV files:
* ``sep`` (default ``,``): sets the single character as a separator \
for each field and value.
* ``quote`` (default ``"``): sets the single character used for escaping \
quoted values where the separator can be part of the value.
* ``escape`` (default ``\``): sets the single character used for escaping quotes \
inside an already quoted value.
* ``header`` (default ``false``): writes the names of columns as the first line.
* ``nullValue`` (default empty string): sets the string representation of a null value.
* ``compression``: compression codec to use when saving to file. This can be one of \
the known case-insensitive shortened names (none, bzip2, gzip, lz4, snappy and \
deflate).
>>> df.write.csv(os.path.join(tempfile.mkdtemp(), 'data'))
"""
self.mode(mode)
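
The writer options compose the same way. A hedged sketch (the data and all option values are illustrative):

```python
import os
import tempfile

# Sketch only: assumes a SQLContext named `sqlContext`, as in the doctests.
df = sqlContext.createDataFrame([("Tom", None), ("Amy", 30)], ["name", "age"])

out = os.path.join(tempfile.mkdtemp(), "data")
(df.write
   .option("header", "true")
   .option("nullValue", "NA")        # nulls are written as the literal "NA"
   .option("compression", "gzip")    # one of the shortened codec names above
   .csv(out))
```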
@@ -290,7 +290,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
* <li>`allowNumericLeadingZeros` (default `false`): allows leading zeros in numbers
* (e.g. 00012)</li>
* <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
- * during parsing.<li>
+ * during parsing.</li>
* <ul>
* <li>`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record, and puts the
* malformed string into a new field configured by `columnNameOfCorruptRecord`. When
@@ -300,7 +300,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
* </ul>
* <li>`columnNameOfCorruptRecord` (default `_corrupt_record`): allows renaming the new field
* having malformed string created by `PERMISSIVE` mode. This overrides
- * `spark.sql.columnNameOfCorruptRecord`.<li>
+ * `spark.sql.columnNameOfCorruptRecord`.</li>
*
* @since 1.4.0
*/
@@ -326,7 +326,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
* <li>`allowBackslashEscapingAnyCharacter` (default `false`): allows accepting quoting of all
* character using backslash quoting mechanism</li>
* <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
- * during parsing.<li>
+ * during parsing.</li>
* <ul>
* <li>`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record, and puts the
* malformed string into a new field configured by `columnNameOfCorruptRecord`. When
@@ -336,7 +336,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
* </ul>
* <li>`columnNameOfCorruptRecord` (default `_corrupt_record`): allows renaming the new field
* having malformed string created by `PERMISSIVE` mode. This overrides
- * `spark.sql.columnNameOfCorruptRecord`.<li>
+ * `spark.sql.columnNameOfCorruptRecord`.</li>
*
* @since 1.6.0
*/
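
A small self-contained sketch of the `PERMISSIVE` + `columnNameOfCorruptRecord` behaviour documented in these hunks (file contents and the `sqlContext` fixture are assumptions):

```python
import os
import tempfile

# Sketch only: one well-formed and one malformed JSON line.
path = os.path.join(tempfile.mkdtemp(), "data.json")
with open(path, "w") as f:
    f.write('{"a": 1}\n')
    f.write('{"a": broken\n')        # malformed on purpose

df = (sqlContext.read
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_bad")  # instead of _corrupt_record
      .json(path))
df.show()   # the malformed line survives as a raw string in the `_bad` column
```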
@@ -393,6 +393,45 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
* This function goes through the input once to determine the input schema. To avoid going
* through the entire data once, specify the schema explicitly using [[schema]].
*
* You can set the following CSV-specific options to deal with CSV files:
* <li>`sep` (default `,`): sets the single character as a separator for each
* field and value.</li>
* <li>`encoding` (default `UTF-8`): decodes the CSV files by the given encoding
* type.</li>
* <li>`quote` (default `"`): sets the single character used for escaping quoted values where
* the separator can be part of the value.</li>
* <li>`escape` (default `\`): sets the single character used for escaping quotes inside
* an already quoted value.</li>
* <li>`comment` (default empty string): sets the single character used for skipping lines
* beginning with this character. By default, it is disabled.</li>
* <li>`header` (default `false`): uses the first line as names of columns.</li>
* <li>`ignoreLeadingWhiteSpace` (default `false`): defines whether or not leading whitespaces
* from values being read should be skipped.</li>
* <li>`ignoreTrailingWhiteSpace` (default `false`): defines whether or not trailing
* whitespaces from values being read should be skipped.</li>
* <li>`nullValue` (default empty string): sets the string representation of a null value.</li>
* <li>`nanValue` (default `NaN`): sets the string representation of a non-number value.</li>
* <li>`positiveInf` (default `Inf`): sets the string representation of a positive infinity
* value.</li>
* <li>`negativeInf` (default `-Inf`): sets the string representation of a negative infinity
* value.</li>
* <li>`dateFormat` (default `null`): sets the string that indicates a date format. Custom date
* formats follow the formats at `java.text.SimpleDateFormat`. This applies to both date type
* and timestamp type. By default, it is `null`, which means times and dates are parsed by
* `java.sql.Timestamp.valueOf()` and `java.sql.Date.valueOf()`.</li>
* <li>`maxColumns` (default `20480`): defines a hard limit of how many columns
* a record can have.</li>
* <li>`maxCharsPerColumn` (default `1000000`): defines the maximum number of characters allowed
* for any given value being read.</li>
* <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
* during parsing.</li>
* <ul>
* <li>`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record. When
* a schema is set by the user, it sets `null` for extra fields.</li>
* <li>`DROPMALFORMED` : ignores the whole corrupted records.</li>
* <li>`FAILFAST` : throws an exception when it meets corrupted records.</li>
* </ul>
*
* @since 2.0.0
*/
@scala.annotation.varargs
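
As the scaladoc notes, supplying a schema avoids the extra pass over the input. A sketch combining an explicit schema with `dateFormat`, `nullValue` and `mode` (all values illustrative; fixtures as in the doctests):

```python
import os
import tempfile

from pyspark.sql.types import DateType, StringType, StructField, StructType

# Sketch only: an explicit schema skips the schema-inference pass.
path = os.path.join(tempfile.mkdtemp(), "people.csv")
with open(path, "w") as f:
    f.write("Tom,1990/05/01\nNA,NA\nbad-row\n")

schema = StructType([
    StructField("name", StringType(), True),
    StructField("born", DateType(), True),
])
df = (sqlContext.read
      .schema(schema)
      .option("dateFormat", "yyyy/MM/dd")   # a SimpleDateFormat pattern
      .option("nullValue", "NA")            # "NA" cells become null
      .option("mode", "DROPMALFORMED")      # drops the unparsable third line
      .csv(path))
```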
@@ -606,6 +606,14 @@ final class DataFrameWriter private[sql](df: DataFrame) {
* }}}
*
* You can set the following CSV-specific option(s) for writing CSV files:
* <li>`sep` (default `,`): sets the single character as a separator for each
* field and value.</li>
* <li>`quote` (default `"`): sets the single character used for escaping quoted values where
* the separator can be part of the value.</li>
* <li>`escape` (default `\`): sets the single character used for escaping quotes inside
* an already quoted value.</li>
* <li>`header` (default `false`): writes the names of columns as the first line.</li>
* <li>`nullValue` (default empty string): sets the string representation of a null value.</li>
* <li>`compression` (default `null`): compression codec to use when saving to file. This can be
* one of the known case-insensitive shortened names (`none`, `bzip2`, `gzip`, `lz4`,
* `snappy` and `deflate`).</li>
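
Since the writer and the reader both expose `nullValue`, nulls only round-trip when the two sides agree on the marker. A sketch under that assumption:

```python
import os
import tempfile

# Sketch only: nulls survive a write-then-read round trip when both sides
# agree on the nullValue marker. Assumes `sqlContext` as in the doctests.
df = sqlContext.createDataFrame([("Tom", None), ("Amy", "30")], ["name", "age"])

out = os.path.join(tempfile.mkdtemp(), "roundtrip")
df.write.option("header", "true").option("nullValue", "NA").csv(out)
df2 = (sqlContext.read
       .option("header", "true")
       .option("nullValue", "NA")
       .csv(out))
df2.show()   # the None in `age` comes back as null, not the string "NA"
```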