diff --git a/docs/configuration.md b/docs/configuration.md
index a6b1f15fdabfc94b469f94b2c9e0497e66486b15..b7f10e69f38e419a6f0f4d1c854f90afb3706ed4 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -435,7 +435,7 @@ Apart from these, the following properties are also available, and may be useful
   <td><code>spark.jars.packages</code></td>
   <td></td>
   <td>
-    Comma-separated list of maven coordinates of jars to include on the driver and executor
+    Comma-separated list of Maven coordinates of jars to include on the driver and executor
     classpaths. The coordinates should be groupId:artifactId:version. If <code>spark.jars.ivySettings</code>
     is given artifacts will be resolved according to the configuration in the file, otherwise artifacts
     will be searched for in the local maven repo, then maven central and finally any additional remote
diff --git a/docs/index.md b/docs/index.md
index 57b9fa848f4a3da7a191b18f0d8abce0b899c95e..023e06ada3690bf9e9827d07e623b64f84ef2bfc 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -15,7 +15,7 @@ It also supports a rich set of higher-level tools including [Spark SQL](sql-prog
 Get Spark from the [downloads page](http://spark.apache.org/downloads.html) of the project website. This documentation is for Spark version {{site.SPARK_VERSION}}. Spark uses Hadoop's client libraries for HDFS and YARN. Downloads are pre-packaged for a handful of popular Hadoop versions.
 Users can also download a "Hadoop free" binary and run Spark with any Hadoop version
 [by augmenting Spark's classpath](hadoop-provided.html).
-Scala and Java users can include Spark in their projects using its maven cooridnates and in the future Python users can also install Spark from PyPI.
+Scala and Java users can include Spark in their projects using its Maven coordinates and in the future Python users can also install Spark from PyPI.
 
 If you'd like to build Spark from
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index a4017b5b9755061ecebd101e9f4f2d0646f23395..db8b048fcef946b9a6df40130b022793ac1789ef 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -185,7 +185,7 @@ In the Spark shell, a special interpreter-aware SparkContext is already created
 variable called `sc`. Making your own SparkContext will not work. You can set which master the
 context connects to using the `--master` argument, and you can add JARs to the classpath
 by passing a comma-separated list to the `--jars` argument. You can also add dependencies
-(e.g. Spark Packages) to your shell session by supplying a comma-separated list of maven coordinates
+(e.g. Spark Packages) to your shell session by supplying a comma-separated list of Maven coordinates
 to the `--packages` argument. Any additional repositories where dependencies might exist (e.g. Sonatype)
 can be passed to the `--repositories` argument. For example, to run `bin/spark-shell` on exactly
 four cores, use:
@@ -200,7 +200,7 @@ Or, to also add `code.jar` to its classpath, use:
 $ ./bin/spark-shell --master local[4] --jars code.jar
 {% endhighlight %}
 
-To include a dependency using maven coordinates:
+To include a dependency using Maven coordinates:
 {% highlight bash %}
 $ ./bin/spark-shell --master local[4] --packages "org.example:example:0.1"
 {% endhighlight %}
@@ -217,7 +217,7 @@ In the PySpark shell, a special interpreter-aware SparkContext is already create
 variable called `sc`. Making your own SparkContext will not work. You can set which master the
 context connects to using the `--master` argument, and you can add Python .zip, .egg or .py files
 to the runtime path by passing a comma-separated list to `--py-files`. You can also add dependencies
-(e.g. Spark Packages) to your shell session by supplying a comma-separated list of maven coordinates
+(e.g. Spark Packages) to your shell session by supplying a comma-separated list of Maven coordinates
 to the `--packages` argument. Any additional repositories where dependencies might exist (e.g. Sonatype)
 can be passed to the `--repositories` argument. Any Python dependencies a Spark package has (listed in
 the requirements.txt of that package) must be manually installed using `pip` when necessary.
diff --git a/docs/streaming-kafka-0-10-integration.md b/docs/streaming-kafka-0-10-integration.md
index b645d3c3a4b53ec1cd52780e4bebb1301ee02c84..6ef54ac210704a7803c785be359c458dbcbaf8ca 100644
--- a/docs/streaming-kafka-0-10-integration.md
+++ b/docs/streaming-kafka-0-10-integration.md
@@ -183,7 +183,7 @@ stream.foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<String, String>>>() {
 Note that the typecast to `HasOffsetRanges` will only succeed if it is done in the first method called on the result of `createDirectStream`, not later down a chain of methods. Be aware that the one-to-one mapping between RDD partition and Kafka partition does not remain after any methods that shuffle or repartition, e.g. reduceByKey() or window().
 
 ### Storing Offsets
-Kafka delivery semantics in the case of failure depend on how and when offsets are stored. Spark output operations are [at-least-once](streaming-programming-guide.html#semantics-of-output-operations). So if you want the equivalent of exactly-once semantics, you must either store offsets after an idempotent output, or store offsets in an atomic transaction alongside output. With this integration, you have 3 options, in order of increasing reliablity (and code complexity), for how to store offsets.
+Kafka delivery semantics in the case of failure depend on how and when offsets are stored. Spark output operations are [at-least-once](streaming-programming-guide.html#semantics-of-output-operations). So if you want the equivalent of exactly-once semantics, you must either store offsets after an idempotent output, or store offsets in an atomic transaction alongside output. With this integration, you have 3 options, in order of increasing reliability (and code complexity), for how to store offsets.
 
 #### Checkpoints
 If you enable Spark [checkpointing](streaming-programming-guide.html#checkpointing), offsets will be stored in the checkpoint. This is easy to enable, but there are drawbacks. Your output operation must be idempotent, since you will get repeated outputs; transactions are not an option. Furthermore, you cannot recover from a checkpoint if your application code has changed. For planned upgrades, you can mitigate this by running the new code at the same time as the old code (since outputs need to be idempotent anyway, they should not clash). But for unplanned failures that require code changes, you will lose data unless you have another way to identify known good starting offsets.
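As context for the offset-storage discussion above, here is a minimal Scala sketch of storing offsets by committing them back to Kafka after an idempotent output, using the `HasOffsetRanges` and `CanCommitOffsets` interfaces from the `spark-streaming-kafka-0-10` module; the broker address, group id, topic name, and batch interval below are placeholder assumptions, not values taken from this diff.

{% highlight scala %}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val conf = new SparkConf().setAppName("OffsetCommitSketch")
val ssc = new StreamingContext(conf, Seconds(5))

// Placeholder broker, group id and topic; auto-commit is disabled so the
// application decides when offsets are stored.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("topicA"), kafkaParams))

stream.foreachRDD { rdd =>
  // Cast before any other transformation, so the Kafka offset ranges are still visible.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... idempotent output of rdd goes here ...
  // Only after the output has completed, commit the offsets back to Kafka.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

ssc.start()
ssc.awaitTermination()
{% endhighlight %}

As the note above says, the `HasOffsetRanges` cast must be the first thing done to the stream's RDDs, and the commit should only happen once the (idempotent) output has finished.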
diff --git a/docs/submitting-applications.md b/docs/submitting-applications.md
index b8b4cc3a53046df2df01c6febd326943c5a40ce9..d23dbcf10d952ec875cf82ff8064941978f4e5fd 100644
--- a/docs/submitting-applications.md
+++ b/docs/submitting-applications.md
@@ -189,7 +189,7 @@ This can use up a significant amount of space over time and will need to be clea
 is handled automatically, and with Spark standalone, automatic cleanup can be configured with the
 `spark.worker.cleanup.appDataTtl` property.
 
-Users may also include any other dependencies by supplying a comma-delimited list of maven coordinates
+Users may also include any other dependencies by supplying a comma-delimited list of Maven coordinates
 with `--packages`. All transitive dependencies will be handled when using this command. Additional
 repositories (or resolvers in SBT) can be added in a comma-delimited fashion with the flag `--repositories`.
 (Note that credentials for password-protected repositories can be supplied in some cases in the repository URI,