---
layout: global
title: Building Spark with Maven
---
* This will become a table of contents (this text will be scraped).
{:toc}
Building Spark using Maven requires Maven 3.0.4 or newer and Java 6+.
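If in doubt, you can check what is on your `PATH` before starting; both commands below are standard:

{% highlight bash %}
# Verify the build prerequisites: Maven 3.0.4+ and Java 6+
mvn -version
java -version
{% endhighlight %}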
# Setting up Maven's Memory Usage
You'll need to configure Maven to use more memory than usual by setting `MAVEN_OPTS`. We recommend the following settings:
{% highlight bash %}
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
{% endhighlight %}
If you don't run this, you may see errors like the following:
    [INFO] Compiling 203 Scala sources and 9 Java sources to /Users/me/Development/spark/core/target/scala-{{site.SCALA_BINARY_VERSION}}/classes...
    [ERROR] PermGen space -> [Help 1]

    [INFO] Compiling 203 Scala sources and 9 Java sources to /Users/me/Development/spark/core/target/scala-{{site.SCALA_BINARY_VERSION}}/classes...
    [ERROR] Java heap space -> [Help 1]
You can fix this by setting the `MAVEN_OPTS` variable as discussed before.
*Note: For Java 8 and above this step is not required.*
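The `export` above only lasts for the current shell session. To make the setting persistent, one option is to append it to your shell profile; a minimal sketch, assuming a bash shell using `~/.bashrc`:

{% highlight bash %}
# Persist the setting for future shells (assumes bash; adjust for your shell)
echo 'export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"' >> ~/.bashrc
source ~/.bashrc
{% endhighlight %}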
# Specifying the Hadoop Version
Because HDFS is not protocol-compatible across versions, if you want to read from HDFS, you'll need to build Spark against the specific HDFS version in your environment. You can do this through the `hadoop.version` property. If unset, Spark will build against Hadoop 1.0.4 by default. Note that certain build profiles are required for particular Hadoop versions:
Hadoop version | Profile required
---------------|-----------------
0.23.x | hadoop-0.23
1.x to 2.1.x | (none)
2.2.x | hadoop-2.2
2.3.x | hadoop-2.3
2.4.x | hadoop-2.4
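If you're unsure which version your cluster runs, the standard `hadoop version` command on a cluster node reports it:

{% highlight bash %}
# Prints the version of the local Hadoop installation
hadoop version
{% endhighlight %}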
For Apache Hadoop versions 1.x, Cloudera CDH "mr1" distributions, and other Hadoop versions without YARN, use:
{% highlight bash %}
# Apache Hadoop 1.2.1
mvn -Dhadoop.version=1.2.1 -DskipTests clean package

# Cloudera CDH 4.2.0 with MapReduce v1
mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DskipTests clean package

# Apache Hadoop 0.23.x
mvn -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package
{% endhighlight %}
For Apache Hadoop 2.x, 0.23.x, Cloudera CDH, and other Hadoop versions with YARN, you can enable the `yarn-alpha` or `yarn` profile and optionally set the `yarn.version` property if it is different from `hadoop.version`. The additional build profile required depends on the YARN version:
YARN version | Profile required
-------------|-----------------
0.23.x to 2.1.x | yarn-alpha
2.2.x and later | yarn
Examples:
{% highlight bash %}
# Apache Hadoop 2.0.5-alpha
mvn -Pyarn-alpha -Dhadoop.version=2.0.5-alpha -DskipTests clean package

# Cloudera CDH 4.2.0
mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests clean package

# Apache Hadoop 0.23.x
mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package

# Apache Hadoop 2.2.X
mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package

# Apache Hadoop 2.3.X
mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package

# Apache Hadoop 2.4.X
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package

# Different versions of HDFS and YARN.
mvn -Pyarn-alpha -Phadoop-2.3 -Dhadoop.version=2.3.0 -Dyarn.version=0.23.7 -DskipTests clean package
{% endhighlight %}
# Building Thrift JDBC server and CLI for Spark SQL
Spark SQL supports a Thrift JDBC server and a CLI. See sql-programming-guide.md for more information about these features. To enable them, add the `-Phive-thriftserver` profile when building Spark, as follows:
{% highlight bash %}
mvn -Phive-thriftserver assembly
{% endhighlight %}
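After the build completes, the server and CLI can be launched from the distribution's scripts; a sketch, assuming the `sbin/start-thriftserver.sh` and `bin/spark-sql` scripts shipped with Spark:

{% highlight bash %}
# Start the Thrift JDBC server (see sql-programming-guide.md for options)
./sbin/start-thriftserver.sh

# Launch the Spark SQL CLI
./bin/spark-sql
{% endhighlight %}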
# Spark Tests in Maven
Tests are run by default via the ScalaTest Maven plugin. Some of the tests require Spark to be packaged first, so always run `mvn package` with `-DskipTests` the first time. The following is an example of a correct (build, test) sequence:
    mvn -Pyarn -Phadoop-2.3 -DskipTests -Phive clean package
    mvn -Pyarn -Phadoop-2.3 -Phive test
The ScalaTest plugin also supports running only a specific test suite as follows:
    mvn -Dhadoop.version=... -DwildcardSuites=org.apache.spark.repl.ReplSuite test
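Since Spark is a multi-module build, Maven's standard `-pl` flag can also scope the test run to a single sub-module; for example (a sketch, using the `core` module):

{% highlight bash %}
# Run only the tests of the core module (same profiles as the full build)
mvn -Pyarn -Phadoop-2.3 -Phive -pl core test
{% endhighlight %}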
# Continuous Compilation
We use the scala-maven-plugin which supports incremental and continuous compilation. E.g.

    mvn scala:cc

should run continuous compilation (i.e. wait for changes).
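This can also be run from within a single sub-module to recompile just that module as its sources change; a minimal sketch, assuming the plugin is usable from the sub-module's directory:

{% highlight bash %}
# Continuously compile only the core module on source changes
cd core
mvn scala:cc
{% endhighlight %}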