[SPARK-4501][Core] - Create build/mvn to automatically download maven/zinc/scalac (a3e51cc9)
Brennon York authored
Creates a top-level script (`build/mvn`) that automatically downloads Zinc and the specific version of Scala needed to build Spark. It will also download and install Maven if the user doesn't already have it; all packages are kept under the `build/` directory. Tested on both Linux and OS X, and both work. All arguments pass through to the Maven binary, so it behaves exactly like a traditional Maven invocation.
    
    Author: Brennon York <brennon.york@capitalone.com>
    
    Closes #3707 from brennonyork/SPARK-4501 and squashes the following commits:
    
0e5a0e4 [Brennon York] minor incorrect doc verbiage (with -> this)
    9b79e38 [Brennon York] fixed merge conflicts with dev/run-tests, properly quoted args in sbt/sbt, fixed bug where relative paths would fail if passed in from build/mvn
d2d41b6 [Brennon York] added blurb about leveraging zinc with build/mvn
    b979c58 [Brennon York] updated the merge conflict
    c5634de [Brennon York] updated documentation to overview build/mvn, updated all points where sbt/sbt was referenced with build/sbt
    b8437ba [Brennon York] set progress bars for curl and wget when not run on jenkins, no progress bar when run on jenkins, moved sbt script to build/sbt, wrote stub and warning under sbt/sbt which calls build/sbt, modified build/sbt to use the correct directory, fixed bug in build/sbt-launch-lib.bash to correctly pull the sbt version
    be11317 [Brennon York] added switch to silence download progress only if AMPLAB_JENKINS is set
    28d0a99 [Brennon York] updated to remove the python dependency, uses grep instead
    7e785a6 [Brennon York] added silent and quiet flags to curl and wget respectively, added single echo output to denote start of a download if download is needed
    14a5da0 [Brennon York] removed unnecessary zinc output on startup
    1af4a94 [Brennon York] fixed bug with uppercase vs lowercase variable
    3e8b9b3 [Brennon York] updated to properly only restart zinc if it was freshly installed
    a680d12 [Brennon York] Added comments to functions and tested various mvn calls
    bb8cc9d [Brennon York] removed package files
    ef017e6 [Brennon York] removed OS complexities, setup generic install_app call, removed extra file complexities, removed help, removed forced install (defaults now), removed double-dash from cli
    07bf018 [Brennon York] Updated to specifically handle pulling down the correct scala version
    f914dea [Brennon York] Beginning final portions of localized scala home
    69c4e44 [Brennon York] working linux and osx installers for purely local mvn build
    4a1609c [Brennon York] finalizing working linux install for maven to local ./build/apache-maven folder
    cbfcc68 [Brennon York] Changed the default sbt/sbt to build/sbt and added a build/mvn which will automatically download, install, and execute maven with zinc for easier build capability
hadoop-third-party-distributions.md

---
layout: global
title: Third-Party Hadoop Distributions
---

Spark can run against all versions of Cloudera's Distribution Including Apache Hadoop (CDH) and the Hortonworks Data Platform (HDP). There are a few things to keep in mind when using Spark with these distributions:

# Compile-time Hadoop Version

When compiling Spark, you'll need to specify the Hadoop version by setting the `hadoop.version` property. For certain versions, you will also need to specify additional profiles. For more detail, see the guide on building with Maven:

    mvn -Dhadoop.version=1.0.4 -DskipTests clean package
    mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package

The table below lists the corresponding `hadoop.version` code for each CDH/HDP release. Note that some Hadoop releases are binary compatible across client versions, so the pre-built Spark distribution may "just work" without recompilation. That said, we recommend compiling against the exact Hadoop version you are running to avoid any compatibility errors.

## CDH Releases

| Release | Version code |
| ------- | ------------ |
| CDH 4.X.X (YARN mode) | 2.0.0-cdh4.X.X |
| CDH 4.X.X | 2.0.0-mr1-cdh4.X.X |
| CDH 3u6 | 0.20.2-cdh3u6 |
| CDH 3u5 | 0.20.2-cdh3u5 |
| CDH 3u4 | 0.20.2-cdh3u4 |

## HDP Releases

| Release | Version code |
| ------- | ------------ |
| HDP 1.3 | 1.2.0 |
| HDP 1.2 | 1.1.2 |
| HDP 1.1 | 1.0.3 |
| HDP 1.0 | 1.0.3 |
| HDP 2.0 | 2.2.0 |
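
For example, building against HDP 2.0 combines the `hadoop-2.2` profile shown earlier with the version code from the table above; the CDH line below uses an illustrative 4.2.0 release number following the `2.0.0-mr1-cdh4.X.X` pattern:

    # HDP 2.0 ships Hadoop 2.2.0, so the hadoop-2.2 profile applies
    mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package

    # CDH 4.2.0 (illustrative release number), MR1 mode
    mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DskipTests clean package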

In SBT, the equivalent can be achieved by setting the `hadoop.version` property:

    build/sbt -Dhadoop.version=1.0.4 assembly
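
The version codes from the tables above can be passed the same way; as a sketch, for the same illustrative CDH 4.2.0 release:

    # Version code taken from the CDH table above (4.2.0 is illustrative)
    build/sbt -Dhadoop.version=2.0.0-mr1-cdh4.2.0 assembly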

# Linking Applications to the Hadoop Version

In addition to compiling Spark itself against the right version, you need to add a Maven dependency on that version of `hadoop-client` to any Spark applications you run, so they can also talk to the HDFS version on the cluster. If you are using CDH, you also need to add the Cloudera Maven repository. This looks as follows in SBT:

{% highlight scala %}
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "<your-hdfs-version>"

// If using CDH, also add the Cloudera repository
resolvers += "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
{% endhighlight %}

Or in Maven:

{% highlight xml %}
<project>
  <dependencies>
    ...
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>[version]</version>
    </dependency>
  </dependencies>

  <!-- If using CDH, also add the Cloudera repository -->
  <repositories>
    ...
    <repository>
      <id>Cloudera repository</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
  </repositories>
</project>
{% endhighlight %}

# Where to Run Spark

As described in the Hardware Provisioning guide, Spark can run in a variety of deployment modes:

* Using a dedicated set of Spark nodes in your cluster. These nodes should be co-located with your Hadoop installation.
* Running on the same nodes as an existing Hadoop installation, with a fixed amount of memory and cores dedicated to Spark on each node.
* Running Spark alongside Hadoop using a cluster resource manager, such as YARN or Mesos.

These options are identical for those using CDH and HDP.

# Inheriting Cluster Configuration

If you plan to read and write from HDFS using Spark, there are two Hadoop configuration files that should be included on Spark's classpath:

* `hdfs-site.xml`, which provides default behaviors for the HDFS client.
* `core-site.xml`, which sets the default filesystem name.

The location of these configuration files varies across CDH and HDP versions, but a common location is inside of `/etc/hadoop/conf`. Some tools, such as Cloudera Manager, create configurations on the fly but offer a mechanism to download copies of them.

To make these files visible to Spark, set `HADOOP_CONF_DIR` in `$SPARK_HOME/conf/spark-env.sh` to a location containing the configuration files.
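
A minimal sketch, assuming the common `/etc/hadoop/conf` location mentioned above:

    # In $SPARK_HOME/conf/spark-env.sh
    export HADOOP_CONF_DIR=/etc/hadoop/conf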