Skip to content
Snippets Groups Projects
Commit 2f639957 authored by Chen Chao's avatar Chen Chao Committed by Reynold Xin
Browse files

default task number misleading in several places

  private[streaming] def defaultPartitioner(numPartitions: Int = self.ssc.sc.defaultParallelism){
    new HashPartitioner(numPartitions)
  }

it represents that the default task number in Spark Streaming relies on the variable defaultParallelism in SparkContext, which is decided by the config property spark.default.parallelism

the property "spark.default.parallelism" refers to https://github.com/apache/spark/pull/389

Author: Chen Chao <crazyjvm@gmail.com>

Closes #766 from CrazyJvm/patch-7 and squashes the following commits:

0b7efba [Chen Chao] Update streaming-programming-guide.md
cc5b66c [Chen Chao] default task number misleading in several places
parent 44165fc9
No related branches found
No related tags found
No related merge requests found
......@@ -522,9 +522,9 @@ common ones are as follows.
<td> <b>reduceByKey</b>(<i>func</i>, [<i>numTasks</i>]) </td>
<td> When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the
values for each key are aggregated using the given reduce function. <b>Note:</b> By default,
this uses Spark's default number of parallel tasks (2 for local machine, 8 for a cluster) to
do the grouping. You can pass an optional <code>numTasks</code> argument to set a different
number of tasks.</td>
this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number
is determined by the config property <code>spark.default.parallelism</code>) to do the grouping.
You can pass an optional <code>numTasks</code> argument to set a different number of tasks.</td>
</tr>
<tr>
<td> <b>join</b>(<i>otherStream</i>, [<i>numTasks</i>]) </td>
......@@ -743,8 +743,9 @@ said two parameters - <i>windowLength</i> and <i>slideInterval</i>.
<td> When called on a DStream of (K, V) pairs, returns a new DStream of (K, V)
pairs where the values for each key are aggregated using the given reduce function <i>func</i>
over batches in a sliding window. <b>Note:</b> By default, this uses Spark's default number of
parallel tasks (2 for local machine, 8 for a cluster) to do the grouping. You can pass an optional
<code>numTasks</code> argument to set a different number of tasks.
parallel tasks (2 for local mode, and in cluster mode the number is determined by the config
property <code>spark.default.parallelism</code>) to do the grouping. You can pass an optional
<code>numTasks</code> argument to set a different number of tasks.
</td>
</tr>
<tr>
......@@ -956,9 +957,10 @@ before further processing.
### Level of Parallelism in Data Processing
Cluster resources maybe under-utilized if the number of parallel tasks used in any stage of the
computation is not high enough. For example, for distributed reduce operations like `reduceByKey`
and `reduceByKeyAndWindow`, the default number of parallel tasks is 8. You can pass the level of
parallelism as an argument (see the
[`PairDStreamFunctions`](api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions)
and `reduceByKeyAndWindow`, the default number of parallel tasks is decided by the [config property]
(configuration.html#spark-properties) `spark.default.parallelism`. You can pass the level of
parallelism as an argument (see [`PairDStreamFunctions`]
(api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions)
documentation), or set the [config property](configuration.html#spark-properties)
`spark.default.parallelism` to change the default.
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment