Commit af2a2a26 authored by zsxwing, committed by Andrew Or

[SPARK-4361][Doc] Add more docs for Hadoop Configuration

I'm trying to point out that reusing a Configuration in these APIs is dangerous. Any better idea?

Author: zsxwing <zsxwing@gmail.com>

Closes #3225 from zsxwing/SPARK-4361 and squashes the following commits:

fe4e3d5 [zsxwing] Add more docs for Hadoop Configuration
parent fb6c0cba
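To make the warning concrete, here is a minimal Scala sketch (the object name, app name, and input paths are illustrative and not part of this patch): hadoopRDD puts the JobConf it receives into a Broadcast, so mutating one shared conf between two hadoopRDD calls can leave the resulting RDDs reading settings they were not built with. The safe pattern the new docs recommend is a fresh JobConf per RDD.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}
import org.apache.spark.{SparkConf, SparkContext}

object ConfReuseSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("conf-reuse-sketch"))

    // Risky: one JobConf shared across two hadoopRDD calls and mutated in between.
    // The conf is put into a Broadcast, so whether each RDD ends up reading the
    // paths it expects depends on when that broadcast is serialized and cached.
    val shared = new JobConf(sc.hadoopConfiguration)
    FileInputFormat.setInputPaths(shared, "/data/a")
    val rddA = sc.hadoopRDD(shared, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
    FileInputFormat.setInputPaths(shared, "/data/b")  // mutates the conf rddA was built from
    val rddB = sc.hadoopRDD(shared, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])

    // Safe: a new JobConf per RDD, as the added @param conf notes recommend.
    val confA = new JobConf(sc.hadoopConfiguration)
    FileInputFormat.setInputPaths(confA, "/data/a")
    val confB = new JobConf(sc.hadoopConfiguration)
    FileInputFormat.setInputPaths(confB, "/data/b")
    val safeA = sc.hadoopRDD(confA, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
    val safeB = sc.hadoopRDD(confB, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])

    println(safeA.count() + safeB.count())
    sc.stop()
  }
}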
@@ -288,7 +288,12 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
   // the bound port to the cluster manager properly
   ui.foreach(_.bind())

-  /** A default Hadoop Configuration for the Hadoop code (e.g. file systems) that we reuse. */
+  /**
+   * A default Hadoop Configuration for the Hadoop code (e.g. file systems) that we reuse.
+   *
+   * '''Note:''' As it will be reused in all Hadoop RDDs, it's better not to modify it unless you
+   * plan to set some global configurations for all Hadoop RDDs.
+   */
   val hadoopConfiguration = SparkHadoopUtil.get.newConfiguration(conf)

   // Add each JAR given through the constructor
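As a sketch of the hadoopConfiguration note above (assuming an existing SparkContext named sc; the configuration keys are only examples of global versus per-dataset settings, not prescribed by this patch): options meant for every Hadoop RDD go on sc.hadoopConfiguration once, before any Hadoop RDDs exist, while per-dataset options belong on a per-RDD conf.

// Global: affects every Hadoop RDD created from this SparkContext afterwards.
sc.hadoopConfiguration.set("fs.s3a.connection.maximum", "200")

// Per-dataset: keep it on its own conf so it does not leak into other Hadoop RDDs.
val perRddConf = new org.apache.hadoop.mapred.JobConf(sc.hadoopConfiguration)
perRddConf.set("mapreduce.input.fileinputformat.split.minsize", "134217728")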
@@ -694,7 +699,10 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
    * necessary info (e.g. file name for a filesystem-based dataset, table name for HyperTable),
    * using the older MapReduce API (`org.apache.hadoop.mapred`).
    *
-   * @param conf JobConf for setting up the dataset
+   * @param conf JobConf for setting up the dataset. Note: This will be put into a Broadcast.
+   *             Therefore if you plan to reuse this conf to create multiple RDDs, you need to make
+   *             sure you won't modify the conf. A safe approach is always creating a new conf for
+   *             a new RDD.
    * @param inputFormatClass Class of the InputFormat
    * @param keyClass Class of the keys
    * @param valueClass Class of the values
@@ -830,6 +838,14 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
    * Get an RDD for a given Hadoop file with an arbitrary new API InputFormat
    * and extra configuration options to pass to the input format.
    *
+   * @param conf Configuration for setting up the dataset. Note: This will be put into a Broadcast.
+   *             Therefore if you plan to reuse this conf to create multiple RDDs, you need to make
+   *             sure you won't modify the conf. A safe approach is always creating a new conf for
+   *             a new RDD.
+   * @param fClass Class of the InputFormat
+   * @param kClass Class of the keys
+   * @param vClass Class of the values
+   *
    * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
    * record, directly caching the returned RDD or directly passing it to an aggregation or shuffle
    * operation will create many references to the same object.
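The same caution applies to the new-API method documented above: newAPIHadoopFile also broadcasts its Configuration. A sketch of the recommended per-RDD Configuration pattern (assuming an existing SparkContext named sc; the paths and the record-delimiter value are illustrative):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// One Configuration per RDD: copy the SparkContext-wide defaults, then customize.
val confA = new Configuration(sc.hadoopConfiguration)
confA.set("textinputformat.record.delimiter", "\u0001")
val rddA = sc.newAPIHadoopFile("/data/a", classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text], confA)

val confB = new Configuration(sc.hadoopConfiguration)  // do not reuse or mutate confA
val rddB = sc.newAPIHadoopFile("/data/b", classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text], confB)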
@@ -373,6 +373,15 @@ class JavaSparkContext(val sc: SparkContext)
    * other necessary info (e.g. file name for a filesystem-based dataset, table name for HyperTable,
    * etc).
    *
+   * @param conf JobConf for setting up the dataset. Note: This will be put into a Broadcast.
+   *             Therefore if you plan to reuse this conf to create multiple RDDs, you need to make
+   *             sure you won't modify the conf. A safe approach is always creating a new conf for
+   *             a new RDD.
+   * @param inputFormatClass Class of the InputFormat
+   * @param keyClass Class of the keys
+   * @param valueClass Class of the values
+   * @param minPartitions Minimum number of Hadoop Splits to generate.
+   *
    * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
    * record, directly caching the returned RDD will create many references to the same object.
    * If you plan to directly cache Hadoop writable objects, you should first copy them using
@@ -395,6 +404,14 @@ class JavaSparkContext(val sc: SparkContext)
    * Get an RDD for a Hadoop-readable dataset from a Hadooop JobConf giving its InputFormat and any
    * other necessary info (e.g. file name for a filesystem-based dataset, table name for HyperTable,
    *
+   * @param conf JobConf for setting up the dataset. Note: This will be put into a Broadcast.
+   *             Therefore if you plan to reuse this conf to create multiple RDDs, you need to make
+   *             sure you won't modify the conf. A safe approach is always creating a new conf for
+   *             a new RDD.
+   * @param inputFormatClass Class of the InputFormat
+   * @param keyClass Class of the keys
+   * @param valueClass Class of the values
+   *
    * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
    * record, directly caching the returned RDD will create many references to the same object.
    * If you plan to directly cache Hadoop writable objects, you should first copy them using
@@ -476,6 +493,14 @@ class JavaSparkContext(val sc: SparkContext)
    * Get an RDD for a given Hadoop file with an arbitrary new API InputFormat
    * and extra configuration options to pass to the input format.
    *
+   * @param conf Configuration for setting up the dataset. Note: This will be put into a Broadcast.
+   *             Therefore if you plan to reuse this conf to create multiple RDDs, you need to make
+   *             sure you won't modify the conf. A safe approach is always creating a new conf for
+   *             a new RDD.
+   * @param fClass Class of the InputFormat
+   * @param kClass Class of the keys
+   * @param vClass Class of the values
+   *
    * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
    * record, directly caching the returned RDD will create many references to the same object.
    * If you plan to directly cache Hadoop writable objects, you should first copy them using
@@ -675,6 +700,9 @@ class JavaSparkContext(val sc: SparkContext)
   /**
    * Returns the Hadoop configuration used for the Hadoop code (e.g. file systems) we reuse.
+   *
+   * '''Note:''' As it will be reused in all Hadoop RDDs, it's better not to modify it unless you
+   * plan to set some global configurations for all Hadoop RDDs.
    */
   def hadoopConfiguration(): Configuration = {
     sc.hadoopConfiguration