Commit b2017126 authored by Dan Osipov, committed by Patrick Wendell

[SPARK-787] Add S3 configuration parameters to the EC2 deploy scripts

When deploying to AWS, additional configuration is required to read S3 files. EMR sets this up automatically, and there is no reason the Spark EC2 script shouldn't do the same.

This PR requires a corresponding PR to mesos/spark-ec2 to be merged, since that repository is cloned as part of setting up the machines: https://github.com/mesos/spark-ec2/pull/58

Author: Dan Osipov <daniil.osipov@shazam.com>

Closes #1120 from danosipov/s3_credentials and squashes the following commits:

758da8b [Dan Osipov] Modify documentation to include the new parameter
71fab14 [Dan Osipov] Use a parameter --copy-aws-credentials to enable S3 credential deployment
7e0da26 [Dan Osipov] Get AWS credentials out of boto connection instance
39bdf30 [Dan Osipov] Add S3 configuration parameters to the EC2 deploy scripts
parent ec1adecb
@@ -156,6 +156,6 @@ If you have a patch or suggestion for one of these limitations, feel free to
 # Accessing Data in S3
-Spark's file interface allows it to process data in Amazon S3 using the same URI formats that are supported for Hadoop. You can specify a path in S3 as input through a URI of the form `s3n://<bucket>/path`. You will also need to set your Amazon security credentials, either by setting the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` before your program or through `SparkContext.hadoopConfiguration`. Full instructions on S3 access using the Hadoop input libraries can be found on the [Hadoop S3 page](http://wiki.apache.org/hadoop/AmazonS3).
+Spark's file interface allows it to process data in Amazon S3 using the same URI formats that are supported for Hadoop. You can specify a path in S3 as input through a URI of the form `s3n://<bucket>/path`. To provide AWS credentials for S3 access, launch the Spark cluster with the option `--copy-aws-credentials`. Full instructions on S3 access using the Hadoop input libraries can be found on the [Hadoop S3 page](http://wiki.apache.org/hadoop/AmazonS3).
 In addition to using a single input file, you can also use a directory of files as input by simply giving the path to the directory.
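
For context, here is a minimal PySpark sketch of the credential route the old paragraph describes (setting the Hadoop configuration directly). The bucket, path, and key values are placeholders, and `sc._jsc` is PySpark's internal handle to the Java SparkContext:

```python
from pyspark import SparkContext

sc = SparkContext(appName="s3-read-example")

# If credentials were not deployed at launch time (e.g. via --copy-aws-credentials),
# they can be supplied through the underlying Hadoop configuration instead.
# fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey are the standard s3n properties;
# the values below are placeholders, not real keys.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID")
hadoop_conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")

# Read a file from S3 using the s3n:// URI scheme described above.
lines = sc.textFile("s3n://your-bucket/path/to/data.txt")
print(lines.count())
```

With a cluster launched using `--copy-aws-credentials`, the credentials are already placed in the Hadoop configuration (per the new option's help text), so the two `set` calls above should not be needed.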
@@ -30,3 +30,5 @@ export HADOOP_MAJOR_VERSION="{{hadoop_major_version}}"
 export SWAP_MB="{{swap}}"
 export SPARK_WORKER_INSTANCES="{{spark_worker_instances}}"
 export SPARK_MASTER_OPTS="{{spark_master_opts}}"
+export AWS_ACCESS_KEY_ID="{{aws_access_key_id}}"
+export AWS_SECRET_ACCESS_KEY="{{aws_secret_access_key}}"
\ No newline at end of file
@@ -158,6 +158,9 @@ def parse_args():
     parser.add_option(
         "--additional-security-group", type="string", default="",
         help="Additional security group to place the machines in")
+    parser.add_option(
+        "--copy-aws-credentials", action="store_true", default=False,
+        help="Add AWS credentials to hadoop configuration to allow Spark to access S3")
     (opts, args) = parser.parse_args()
     if len(args) != 2:
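
As a hedged illustration of the flag's parse-time behavior with `optparse` (the usage string and argument list below are made up for the example; the real `parse_args()` defines many more options):

```python
from optparse import OptionParser

# Illustrative parser only, reduced to the new option.
parser = OptionParser(usage="spark-ec2 [options] <action> <cluster_name>")
parser.add_option(
    "--copy-aws-credentials", action="store_true", default=False,
    help="Add AWS credentials to hadoop configuration to allow Spark to access S3")

# Simulate "spark-ec2 --copy-aws-credentials launch my-cluster".
(opts, args) = parser.parse_args(["--copy-aws-credentials", "launch", "my-cluster"])
print(opts.copy_aws_credentials)  # True; without the flag it stays False
```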
@@ -714,6 +717,13 @@ def deploy_files(conn, root_dir, opts, master_nodes, slave_nodes, modules):
         "spark_master_opts": opts.master_opts
     }
+    if opts.copy_aws_credentials:
+        template_vars["aws_access_key_id"] = conn.aws_access_key_id
+        template_vars["aws_secret_access_key"] = conn.aws_secret_access_key
+    else:
+        template_vars["aws_access_key_id"] = ""
+        template_vars["aws_secret_access_key"] = ""
     # Create a temp directory in which we will place all the files to be
     # deployed after we substitute template parameters in them
     tmp_dir = tempfile.mkdtemp()
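
These template variables are substituted into the deployed templates, such as the `ec2-variables.sh` exports above. A minimal sketch of that kind of placeholder replacement follows; it is illustrative only, not the actual `deploy_files` implementation, and the credential values are placeholders:

```python
# Hedged sketch: fill "{{name}}" placeholders from template_vars, the way the
# deployed templates (e.g. ec2-variables.sh) are filled in.
template_vars = {
    "aws_access_key_id": "EXAMPLE_KEY_ID",      # placeholder, not a real key
    "aws_secret_access_key": "EXAMPLE_SECRET",  # placeholder, not a real key
}

text = ('export AWS_ACCESS_KEY_ID="{{aws_access_key_id}}"\n'
        'export AWS_SECRET_ACCESS_KEY="{{aws_secret_access_key}}"\n')

for key, value in template_vars.items():
    text = text.replace("{{" + key + "}}", value)

print(text)
```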