Commit b7b627d5 authored by Patrick Wendell

Updating relevant documentation

parent 893aaff7
@@ -4,10 +4,11 @@ title: Running Spark on EC2
 ---
 The `spark-ec2` script, located in Spark's `ec2` directory, allows you
-to launch, manage and shut down Spark clusters on Amazon EC2. It automatically sets up Mesos, Spark and HDFS
-on the cluster for you.
-This guide describes how to use `spark-ec2` to launch clusters, how to run jobs on them, and how to shut them down.
-It assumes you've already signed up for an EC2 account on the [Amazon Web Services site](http://aws.amazon.com/).
+to launch, manage and shut down Spark clusters on Amazon EC2. It automatically
+sets up Spark, Shark and HDFS on the cluster for you. This guide describes
+how to use `spark-ec2` to launch clusters, how to run jobs on them, and how
+to shut them down. It assumes you've already signed up for an EC2 account
+on the [Amazon Web Services site](http://aws.amazon.com/).
 
 `spark-ec2` is designed to manage multiple named clusters. You can
 launch a new cluster (telling the script its size and giving it a name),
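As an illustration of the workflow the updated text above describes, a basic launch/login/teardown session might look like the following. The key pair, identity file, cluster size and cluster name are placeholders, and the exact invocation should be checked against `spark-ec2 --help`:

```bash
# Launch a cluster named "my-cluster" with 5 slaves (all names are placeholders)
./spark-ec2 -k my-keypair -i ~/my-keypair.pem -s 5 launch my-cluster

# Log into the master node of an existing cluster
./spark-ec2 -k my-keypair -i ~/my-keypair.pem login my-cluster

# Stop the cluster (it can be restarted later) or destroy it permanently
./spark-ec2 stop my-cluster
./spark-ec2 destroy my-cluster
```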
@@ -59,18 +60,22 @@ RAM). Refer to the Amazon pages about [EC2 instance
 types](http://aws.amazon.com/ec2/instance-types) and [EC2
 pricing](http://aws.amazon.com/ec2/#pricing) for information about other
 instance types.
+- `--region=<EC2_REGION>` specifies an EC2 region in which to launch
+instances. The default region is `us-east-1`.
 - `--zone=<EC2_ZONE>` can be used to specify an EC2 availability zone
 to launch instances in. Sometimes, you will get an error because there
 is not enough capacity in one zone, and you should try to launch in
-another. This happens mostly with the `m1.large` instance types;
-extra-large (both `m1.xlarge` and `c1.xlarge`) instances tend to be more
-available.
+another.
 - `--ebs-vol-size=GB` will attach an EBS volume with a given amount
 of space to each node so that you can have a persistent HDFS cluster
 on your nodes across cluster restarts (see below).
 - `--spot-price=PRICE` will launch the worker nodes as
 [Spot Instances](http://aws.amazon.com/ec2/spot-instances/),
 bidding for the given maximum price (in dollars).
+- `--spark-version=VERSION` will pre-load the cluster with the
+specified version of Spark. VERSION can be a version number
+(e.g. "0.7.2") or a specific git hash. By default, a recent
+version will be used.
 - If one of your launches fails due to e.g. not having the right
 permissions on your private key file, you can run `launch` with the
 `--resume` option to restart the setup process on an existing cluster.
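To sketch how the options documented above combine, a single launch command might look like this. The region, zone, volume size, spot price and Spark version are illustrative values, and the `--instance-type` flag is assumed from the surrounding guide rather than this hunk:

```bash
# Launch 10 slaves with the options discussed above (illustrative values only)
./spark-ec2 -k my-keypair -i ~/my-keypair.pem -s 10 \
  --region=us-west-1 \
  --zone=us-west-1a \
  --instance-type=m1.large \
  --ebs-vol-size=100 \
  --spot-price=0.05 \
  --spark-version=0.7.2 \
  launch my-cluster

# If the launch fails partway (e.g. wrong key file permissions), resume setup
# on the already-provisioned instances instead of starting over:
./spark-ec2 -k my-keypair -i ~/my-keypair.pem --resume launch my-cluster
```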
@@ -99,9 +104,8 @@ permissions on your private key file, you can run `launch` with the
 `spark-ec2` to attach a persistent EBS volume to each node for
 storing the persistent HDFS.
 - Finally, if you get errors while running your jobs, look at the slave's logs
-for that job inside of the Mesos work directory (/mnt/mesos-work). You can
-also view the status of the cluster using the Mesos web UI
-(`http://<master-hostname>:8080`).
+for that job inside of the scheduler work directory (/root/spark/work). You can
+also view the status of the cluster using the web UI: `http://<master-hostname>:8080`.
 
 # Configuration
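To make the troubleshooting tip above concrete, here is one way to inspect a slave's work directory and the master's web UI from your own machine. The hostnames are placeholders, and the exact layout under `/root/spark/work` may differ between releases:

```bash
# List a slave's scheduler work directory and show the tail of any stderr logs
ssh -i ~/my-keypair.pem root@<slave-hostname> \
  'ls -R /root/spark/work; tail -n 50 /root/spark/work/*/*/stderr'

# Confirm that the cluster web UI is reachable on the master
curl -I http://<master-hostname>:8080
```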
@@ -141,22 +145,14 @@ section.
 # Limitations
-- `spark-ec2` currently only launches machines in the US-East region of EC2.
-It should not be hard to make it launch VMs in other zones, but you will need
-to create your own AMIs in them.
 - Support for "cluster compute" nodes is limited -- there's no way to specify a
 locality group. However, you can launch slave nodes in your
 `<clusterName>-slaves` group manually and then use `spark-ec2 launch
 --resume` to start a cluster with them.
-- Support for spot instances is limited.
 
 If you have a patch or suggestion for one of these limitations, feel free to
 [contribute](contributing-to-spark.html) it!
 
-# Using a Newer Spark Version
-The Spark EC2 machine images may not come with the latest version of Spark. To use a newer version, you can run `git pull` to pull in `/root/spark` to pull in the latest version of Spark from `git`, and build it using `sbt/sbt compile`. You will also need to copy it to all the other nodes in the cluster using `~/spark-ec2/copy-dir /root/spark`.
 # Accessing Data in S3
 Spark's file interface allows it to process data in Amazon S3 using the same URI formats that are supported for Hadoop. You can specify a path in S3 as input through a URI of the form `s3n://<bucket>/path`. You will also need to set your Amazon security credentials, either by setting the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` before your program or through `SparkContext.hadoopConfiguration`. Full instructions on S3 access using the Hadoop input libraries can be found on the [Hadoop S3 page](http://wiki.apache.org/hadoop/AmazonS3).
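A minimal sketch of the environment-variable approach described above, run from wherever you start your Spark program; the credentials, bucket and path are placeholders:

```bash
# Set AWS credentials before starting your program (placeholder values)
export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>

# Any Hadoop-compatible path of the form s3n://<bucket>/path can now be passed
# as input to your job, e.g. to sc.textFile(...) inside the Spark shell.
./spark-shell
```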