Commit 836c95b1 authored by Michal Senkyr, committed by Sean Owen

[SPARK-18723][DOC] Expanded programming guide information on wholeTex…

## What changes were proposed in this pull request?

Add additional information on `wholeTextFiles` to the Programming Guide. Also explain how its partitioning policy differs from `textFile` and the resulting impact on performance.

Also added a reference to the underlying CombineFileInputFormat.

## How was this patch tested?

Manual build of documentation and inspection in browser

```
cd docs
jekyll serve --watch
```

Author: Michal Senkyr <mike.senkyr@gmail.com>

Closes #16157 from michalsenkyr/wholeTextFilesExpandedDocs.
parent dc2a4d4a
@@ -851,6 +851,8 @@ class SparkContext(config: SparkConf) extends Logging {
    * @note Small files are preferred, large file is also allowable, but may cause bad performance.
    * @note On some filesystems, `.../path/&#42;` can be a more efficient way to read all files
    *       in a directory rather than `.../path/` or `.../path`
+   * @note Partitioning is determined by data locality. This may result in too few partitions
+   *       by default.
    *
    * @param path Directory to the input data files, the path can be comma separated paths as the
    *             list of inputs.
@@ -900,6 +902,8 @@ class SparkContext(config: SparkConf) extends Logging {
    * @note Small files are preferred; very large files may cause bad performance.
    * @note On some filesystems, `.../path/&#42;` can be a more efficient way to read all files
    *       in a directory rather than `.../path/` or `.../path`
+   * @note Partitioning is determined by data locality. This may result in too few partitions
+   *       by default.
    *
    * @param path Directory to the input data files, the path can be comma separated paths as the
    *             list of inputs.
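The new `@note` points to behavior that the existing `minPartitions` argument of `wholeTextFiles` can work around. A minimal sketch of that usage, assuming a local master; the input path and partition count are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WholeTextFilesPartitioning {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("wholeTextFiles-notes").setMaster("local[*]"))

    // By default, partitioning follows data locality, which may produce very
    // few partitions when many small files are co-located on the same nodes.
    val byLocality = sc.wholeTextFiles("/data/small-files") // hypothetical path
    println(s"default partitions: ${byLocality.getNumPartitions}")

    // The optional second argument requests a lower bound on the partition count.
    val spread = sc.wholeTextFiles("/data/small-files", minPartitions = 32)
    println(s"with minPartitions = 32: ${spread.getNumPartitions}")

    sc.stop()
  }
}
```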
@@ -347,7 +347,7 @@ Some notes on reading files with Spark:

 Apart from text files, Spark's Scala API also supports several other data formats:

-* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file.
+* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file. Partitioning is determined by data locality which, in some cases, may result in too few partitions. For those cases, `wholeTextFiles` provides an optional second argument for controlling the minimal number of partitions.

 * For [SequenceFiles](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html), use SparkContext's `sequenceFile[K, V]` method where `K` and `V` are the types of key and values in the file. These should be subclasses of Hadoop's [Writable](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Writable.html) interface, like [IntWritable](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/IntWritable.html) and [Text](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Text.html). In addition, Spark allows you to specify native types for a few common Writables; for example, `sequenceFile[Int, String]` will automatically read IntWritables and Texts.
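To make the documented contrast concrete, here is a hedged sketch of the two read paths, assuming an existing SparkContext `sc` as in the rest of the guide; the directory name is hypothetical:

```scala
import org.apache.spark.rdd.RDD

// textFile yields one String record per line, across every file under the path.
val lines: RDD[String] = sc.textFile("/data/reports")

// wholeTextFiles yields one (filename, content) pair per file, keeping each
// small file intact as a single record.
val docs: RDD[(String, String)] = sc.wholeTextFiles("/data/reports")

// Per-file content sizes, and a plain word count over all lines.
val sizes = docs.mapValues(_.length)
val wordCounts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
```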