diff --git a/README.md b/README.md
index 4116ef3563879e15d9768ff05f51e3593c637eca..c0d6a946035a97e31db26ef7a8deb7c92d46dbcb 100644
--- a/README.md
+++ b/README.md
@@ -87,10 +87,7 @@ Hadoop, you must build Spark against the same version that your cluster runs.
 Please refer to the build documentation at
 ["Specifying the Hadoop Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version)
 for detailed guidance on building for a particular distribution of Hadoop, including
-building for particular Hive and Hive Thriftserver distributions. See also
-["Third Party Hadoop Distributions"](http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html)
-for guidance on building a Spark application that works with a particular
-distribution.
+building for particular Hive and Hive Thriftserver distributions.
 
 ## Configuration
 
diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index b4952fe97ca0ee4f69d43d582036d2dfae92e3e1..467ff7a03fb7030b540d3af77affae7b11fcf241 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -112,7 +112,6 @@
 <li><a href="job-scheduling.html">Job Scheduling</a></li>
 <li><a href="security.html">Security</a></li>
 <li><a href="hardware-provisioning.html">Hardware Provisioning</a></li>
-<li><a href="hadoop-third-party-distributions.html">3<sup>rd</sup>-Party Hadoop Distros</a></li>
 <li class="divider"></li>
 <li><a href="building-spark.html">Building Spark</a></li>
 <li><a href="https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark">Contributing to Spark</a></li>
diff --git a/docs/configuration.md b/docs/configuration.md
index 682384d4249e029fbb005bf5caed16932cbbc22f..c276e8e90decfb163873fbfe97bf53109b7eea97 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1674,3 +1674,18 @@ Spark uses [log4j](http://logging.apache.org/log4j/) for logging. You can config
 To specify a different configuration directory other than the default "SPARK_HOME/conf",
 you can set SPARK_CONF_DIR. Spark will use the the configuration files (spark-defaults.conf,
 spark-env.sh, log4j.properties, etc) from this directory.
+
+# Inheriting Hadoop Cluster Configuration
+
+If you plan to read and write from HDFS using Spark, there are two Hadoop configuration files that
+should be included on Spark's classpath:
+
+* `hdfs-site.xml`, which provides default behaviors for the HDFS client.
+* `core-site.xml`, which sets the default filesystem name.
+
+The location of these configuration files varies across CDH and HDP versions, but
+a common location is inside of `/etc/hadoop/conf`. Some tools, such as Cloudera Manager, create
+configurations on-the-fly, but offer a mechanism to download copies of them.
+
+To make these files visible to Spark, set `HADOOP_CONF_DIR` in `$SPARK_HOME/spark-env.sh`
+to a location containing the configuration files.
diff --git a/docs/hadoop-third-party-distributions.md b/docs/hadoop-third-party-distributions.md
deleted file mode 100644
index 795dd82a6be06b8ec58e5286595af4c41f188375..0000000000000000000000000000000000000000
--- a/docs/hadoop-third-party-distributions.md
+++ /dev/null
@@ -1,117 +0,0 @@
----
-layout: global
-title: Third-Party Hadoop Distributions
----
-
-Spark can run against all versions of Cloudera's Distribution Including Apache Hadoop (CDH) and
-the Hortonworks Data Platform (HDP). There are a few things to keep in mind when using Spark
-with these distributions:
-
-# Compile-time Hadoop Version
-
-When compiling Spark, you'll need to specify the Hadoop version by defining the `hadoop.version`
-property. For certain versions, you will need to specify additional profiles. For more detail,
-see the guide on [building with maven](building-spark.html#specifying-the-hadoop-version):
-
-    mvn -Dhadoop.version=1.0.4 -DskipTests clean package
-    mvn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package
-
-The table below lists the corresponding `hadoop.version` code for each CDH/HDP release. Note that
-some Hadoop releases are binary compatible across client versions. This means the pre-built Spark
-distribution may "just work" without you needing to compile. That said, we recommend compiling with
-the _exact_ Hadoop version you are running to avoid any compatibility errors.
-
-<table>
-  <tr valign="top">
-    <td>
-      <h3>CDH Releases</h3>
-      <table class="table" style="width:350px; margin-right: 20px;">
-        <tr><th>Release</th><th>Version code</th></tr>
-        <tr><td>CDH 4.X.X (YARN mode)</td><td>2.0.0-cdh4.X.X</td></tr>
-        <tr><td>CDH 4.X.X</td><td>2.0.0-mr1-cdh4.X.X</td></tr>
-      </table>
-    </td>
-    <td>
-      <h3>HDP Releases</h3>
-      <table class="table" style="width:350px;">
-        <tr><th>Release</th><th>Version code</th></tr>
-        <tr><td>HDP 1.3</td><td>1.2.0</td></tr>
-        <tr><td>HDP 1.2</td><td>1.1.2</td></tr>
-        <tr><td>HDP 1.1</td><td>1.0.3</td></tr>
-        <tr><td>HDP 1.0</td><td>1.0.3</td></tr>
-        <tr><td>HDP 2.0</td><td>2.2.0</td></tr>
-      </table>
-    </td>
-  </tr>
-</table>
-
-In SBT, the equivalent can be achieved by setting the the `hadoop.version` property:
-
-    build/sbt -Dhadoop.version=1.0.4 assembly
-
-# Linking Applications to the Hadoop Version
-
-In addition to compiling Spark itself against the right version, you need to add a Maven dependency on that
-version of `hadoop-client` to any Spark applications you run, so they can also talk to the HDFS version
-on the cluster. If you are using CDH, you also need to add the Cloudera Maven repository.
-This looks as follows in SBT:
-
-{% highlight scala %}
-libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "<version>"
-
-// If using CDH, also add Cloudera repo
-resolvers += "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
-{% endhighlight %}
-
-Or in Maven:
-
-{% highlight xml %}
-<project>
-  <dependencies>
-    ...
-    <dependency>
-      <groupId>org.apache.hadoop</groupId>
-      <artifactId>hadoop-client</artifactId>
-      <version>[version]</version>
-    </dependency>
-  </dependencies>
-
-  <!-- If using CDH, also add Cloudera repo -->
-  <repositories>
-    ...
-    <repository>
-      <id>Cloudera repository</id>
-      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
-    </repository>
-  </repositories>
-</project>
-
-{% endhighlight %}
-
-# Where to Run Spark
-
-As described in the [Hardware Provisioning](hardware-provisioning.html#storage-systems) guide,
-Spark can run in a variety of deployment modes:
-
-* Using dedicated set of Spark nodes in your cluster. These nodes should be co-located with your
-  Hadoop installation.
-* Running on the same nodes as an existing Hadoop installation, with a fixed amount memory and
-  cores dedicated to Spark on each node.
-* Run Spark alongside Hadoop using a cluster resource manager, such as YARN or Mesos.
-
-These options are identical for those using CDH and HDP.
-
-# Inheriting Cluster Configuration
-
-If you plan to read and write from HDFS using Spark, there are two Hadoop configuration files that
-should be included on Spark's classpath:
-
-* `hdfs-site.xml`, which provides default behaviors for the HDFS client.
-* `core-site.xml`, which sets the default filesystem name.
-
-The location of these configuration files varies across CDH and HDP versions, but
-a common location is inside of `/etc/hadoop/conf`. Some tools, such as Cloudera Manager, create
-configurations on-the-fly, but offer a mechanisms to download copies of them.
-
-To make these files visible to Spark, set `HADOOP_CONF_DIR` in `$SPARK_HOME/spark-env.sh`
-to a location containing the configuration files.
diff --git a/docs/index.md b/docs/index.md
index c0dc2b8d7412a528a00732dd28f357beae5362c8..f1d9e012c6cf04e173cddb9677f053f9b24d4c0f 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -117,7 +117,6 @@ options for deployment:
 * [Job Scheduling](job-scheduling.html): scheduling resources across and within Spark applications
 * [Security](security.html): Spark security support
 * [Hardware Provisioning](hardware-provisioning.html): recommendations for cluster hardware
-* [3<sup>rd</sup> Party Hadoop Distributions](hadoop-third-party-distributions.html): using common Hadoop distributions
 * Integration with other storage systems:
   * [OpenStack Swift](storage-openstack-swift.html)
 * [Building Spark](building-spark.html): build Spark using the Maven system
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index 22656fd7910c0cb8a9102ed2518c38c371a67be2..f823b89a4b5e92b4d3b3402592f5db5013877d84 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -34,8 +34,7 @@ To write a Spark application, you need to add a Maven dependency on Spark. Spark
     version = {{site.SPARK_VERSION}}
 
 In addition, if you wish to access an HDFS cluster, you need to add a dependency on
-`hadoop-client` for your version of HDFS. Some common HDFS version tags are listed on the
-[third party distributions](hadoop-third-party-distributions.html) page.
+`hadoop-client` for your version of HDFS.
 
     groupId = org.apache.hadoop
     artifactId = hadoop-client
@@ -66,8 +65,7 @@ To write a Spark application in Java, you need to add a dependency on Spark. Spa
     version = {{site.SPARK_VERSION}}
 
 In addition, if you wish to access an HDFS cluster, you need to add a dependency on
-`hadoop-client` for your version of HDFS. Some common HDFS version tags are listed on the
-[third party distributions](hadoop-third-party-distributions.html) page.
+`hadoop-client` for your version of HDFS.
 
     groupId = org.apache.hadoop
     artifactId = hadoop-client
@@ -93,8 +91,7 @@ This script will load Spark's Java/Scala libraries and allow you to submit appli
 You can also use `bin/pyspark` to launch an interactive Python shell.
 
 If you wish to access HDFS data, you need to use a build of PySpark linking
-to your version of HDFS. Some common HDFS version tags are listed on the
-[third party distributions](hadoop-third-party-distributions.html) page.
+to your version of HDFS.
 
 [Prebuilt packages](http://spark.apache.org/downloads.html) are also available on the Spark homepage
 for common HDFS versions.
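
As a quick illustration of the "Inheriting Hadoop Cluster Configuration" section added to docs/configuration.md above, a minimal sketch of the resulting setup might look like the snippet below. The `/etc/hadoop/conf` path is only the common location the section mentions, and placing the line in `conf/spark-env.sh` assumes a standard Spark distribution layout; adjust both to match your cluster.

    # Hypothetical spark-env.sh entry: point Spark at the directory that holds
    # hdfs-site.xml and core-site.xml (commonly /etc/hadoop/conf on CDH/HDP).
    export HADOOP_CONF_DIR=/etc/hadoop/conf

With `HADOOP_CONF_DIR` set, both files end up on Spark's classpath at launch, so the HDFS client behavior and the default filesystem name are taken from the cluster's own configuration rather than Spark-side defaults.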