Commit 9ddad0dc authored by Matei Zaharia

Fixes suggested by Patrick

parent 4819baa6
@@ -4,7 +4,7 @@
 # spark-env.sh and edit that to configure Spark for your site.
 #
 # The following variables can be set in this file:
-# - SPARK_LOCAL_IP, to override the IP address binds to
+# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
 # - MESOS_NATIVE_LIBRARY, to point to your libmesos.so if you use Mesos
 # - SPARK_JAVA_OPTS, to set node-specific JVM options for Spark. Note that
 #   we recommend setting app-wide options in the application's driver program.
...
@@ -21,7 +21,6 @@ Hadoop and Spark on a common cluster manager like [Mesos](running-on-mesos.html)
   [Hadoop YARN](running-on-yarn.html).
 * If this is not possible, run Spark on different nodes in the same local-area network as HDFS.
-  If your cluster spans multiple racks, include some Spark nodes on each rack.
 * For low-latency data stores like HBase, it may be preferable to run computing jobs on different
   nodes than the storage system to avoid interference.
...
@@ -40,12 +40,13 @@ Python interpreter (`./pyspark`). These are a great way to learn Spark.
 Spark uses the Hadoop-client library to talk to HDFS and other Hadoop-supported
 storage systems. Because the HDFS protocol has changed in different versions of
 Hadoop, you must build Spark against the same version that your cluster uses.
-You can do this by setting the `SPARK_HADOOP_VERSION` variable when compiling:
+By default, Spark links to Hadoop 1.0.4. You can change this by setting the
+`SPARK_HADOOP_VERSION` variable when compiling:

     SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly

-In addition, if you wish to run Spark on [YARN](running-on-yarn.md), you should also
-set `SPARK_YARN`:
+In addition, if you wish to run Spark on [YARN](running-on-yarn.md), set
+`SPARK_YARN` to `true`:

     SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly
@@ -94,7 +95,7 @@ set `SPARK_YARN`:
   exercises about Spark, Shark, Mesos, and more. [Videos](http://ampcamp.berkeley.edu/agenda-2012),
   [slides](http://ampcamp.berkeley.edu/agenda-2012) and [exercises](http://ampcamp.berkeley.edu/exercises-2012) are
   available online for free.
-* [Code Examples](http://spark.incubator.apache.org/examples.html): more are also available in the [examples subfolder](https://github.com/mesos/spark/tree/master/examples/src/main/scala/spark/examples) of Spark
+* [Code Examples](http://spark.incubator.apache.org/examples.html): more are also available in the [examples subfolder](https://github.com/mesos/spark/tree/master/examples/src/main/scala/) of Spark
 * [Paper Describing Spark](http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf)
 * [Paper Describing Spark Streaming](http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf)
...
@@ -126,7 +126,7 @@ object SimpleJob {
 This job simply counts the number of lines containing 'a' and the number containing 'b' in the Spark README. Note that you'll need to replace $YOUR_SPARK_HOME with the location where Spark is installed. Unlike the earlier examples with the Spark shell, which initializes its own SparkContext, we initialize a SparkContext as part of the job. We pass the SparkContext constructor four arguments: the type of scheduler we want to use (in this case, a local scheduler), a name for the job, the directory where Spark is installed, and a name for the jar file containing the job's sources. The final two arguments are needed in a distributed setting, where Spark is running across several nodes, so we include them for completeness. Spark will automatically ship the jar files you list to slave nodes.

-This file depends on the Spark API, so we'll also include an sbt configuration file, `simple.sbt` which explains that Spark is a dependency. This file also adds two repositories which host Spark dependencies:
+This file depends on the Spark API, so we'll also include an sbt configuration file, `simple.sbt` which explains that Spark is a dependency. This file also adds a repository that Spark depends on:

 {% highlight scala %}
 name := "Simple Project"
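
For readers of this commit page who don't have the surrounding quick-start doc open, the four-argument `SparkContext` constructor described in the hunk above looks roughly like the sketch below. This is not part of the commit: the `spark` package name is assumed to match the `org.spark-project` artifact used in this doc, and the jar path and `{{site.*}}` values are placeholders taken from the doc's own templates.

```scala
import spark.SparkContext
import SparkContext._

object SimpleJob {
  def main(args: Array[String]) {
    // File to search; replace $YOUR_SPARK_HOME with your Spark install directory.
    val logFile = "$YOUR_SPARK_HOME/README.md"
    // Constructor arguments: scheduler ("local" here), a job name, the Spark install
    // directory, and the jar(s) containing the job's code. The last two only matter
    // on a cluster, where Spark ships the listed jars to the slave nodes.
    val sc = new SparkContext("local", "Simple Job", "$YOUR_SPARK_HOME",
      List("target/scala-{{site.SCALA_VERSION}}/simple-project_{{site.SCALA_VERSION}}-1.0.jar"))
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
```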
@@ -137,9 +137,7 @@ scalaVersion := "{{site.SCALA_VERSION}}"
 libraryDependencies += "org.spark-project" %% "spark-core" % "{{site.SPARK_VERSION}}"

-resolvers ++= Seq(
-  "Akka Repository" at "http://repo.akka.io/releases/",
-  "Spray Repository" at "http://repo.spray.cc/")
+resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

 {% endhighlight %}

 If you also wish to read data from Hadoop's HDFS, you will also need to add a dependency on `hadoop-client` for your version of HDFS:
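
Putting the change above together, a complete `simple.sbt` after this commit would look roughly like the following sketch. It is not part of the commit: the `{{site.*}}` values are the doc's own templates for the Scala and Spark versions, and the optional `hadoop-client` line uses the 1.2.1 version mentioned earlier in this doc only as an example; it should match your cluster's HDFS version.

```scala
name := "Simple Project"

version := "1.0"

scalaVersion := "{{site.SCALA_VERSION}}"

libraryDependencies += "org.spark-project" %% "spark-core" % "{{site.SPARK_VERSION}}"

// Only needed if the job reads from HDFS; use the Hadoop version your cluster runs.
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "1.2.1"

// The Spray repository is no longer needed after this commit; Akka's is sufficient.
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
```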
@@ -210,10 +208,6 @@ To build the job, we also write a Maven `pom.xml` file that lists Spark as a dep
   <packaging>jar</packaging>
   <version>1.0</version>
   <repositories>
-    <repository>
-      <id>Spray.cc repository</id>
-      <url>http://repo.spray.cc</url>
-    </repository>
     <repository>
       <id>Akka repository</id>
       <url>http://repo.akka.io/releases</url>
...