Building Apache Spark from source is never quick: a full build with either mvn or sbt takes a considerable amount of time. Incremental or continuous compilation is the time saver here.

1. sbt

The official Spark documentation recommends sbt as the more suitable tool for day-to-day builds:

But SBT is supported for day-to-day development since it can provide much faster iterative compilation.

The first step is to create a fat jar that bundles all of Spark's dependencies. This slow step normally needs to be repeated only when the dependencies change:

./build/sbt -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.3 -Dscala-2.11 -Phive -Phive-thriftserver -DskipTests assembly 

This produces a single large jar file:

$ ls -hl assembly/target/scala-2.11
total 184M
-rw-rw-r-- 1 ubuntu ubuntu 184M Oct 31 15:28 spark-assembly-1.6.1-hadoop2.7.3.jar

Then create a separate jar for Spark itself; incremental compilation will operate on this jar:

./build/sbt -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.3 -Dscala-2.11 -Phive -Phive-thriftserver -DskipTests package

Then we will have:

~/spark-1.6.1
$ ls -hl assembly/target/scala-2.11
total 184M
-rw-rw-r-- 1 ubuntu ubuntu 184M Oct 31 15:28 spark-assembly-1.6.1-hadoop2.7.3.jar
-rw-rw-r-- 1 ubuntu ubuntu  281 Oct 31 15:42 spark-assembly_2.11-1.6.1.jar

Alternatively, run both steps from the interactive sbt shell:

$ ./build/sbt -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.3 -Dscala-2.11 -Phive -Phive-thriftserver -DskipTests
> assembly 
....
> package

To enter continuous compilation mode, prefix the compile task with ~:

$ ./build/sbt -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.3 -Dscala-2.11 -Phive -Phive-thriftserver -DskipTests
> ~compile
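
With ~compile, sbt watches the source tree and recompiles only the files affected by each change. A minimal sketch of the edit-compile loop (the edited file is just an example):

# Terminal 1: leave sbt waiting in continuous compilation mode
$ ./build/sbt -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.3 -Dscala-2.11 -Phive -Phive-thriftserver -DskipTests
> ~compile

# Terminal 2: edit any Spark source file
$ vim core/src/main/scala/org/apache/spark/SparkContext.scala

sbt in terminal 1 detects the change, recompiles the affected files, and goes back to waiting; press Enter there to leave watch mode.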

To launch Spark against the freshly compiled classes instead of the assembly jar, we have to set an environment variable:

$ export SPARK_PREPEND_CLASSES=true
$ ./sbin/start-all.sh
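
SPARK_PREPEND_CLASSES is honored by the launch scripts in general, so the quickest way to try out a change is an interactive shell; no assembly rebuild is needed between edits:

$ export SPARK_PREPEND_CLASSES=true
$ ./bin/spark-shell

Remember to unset SPARK_PREPEND_CLASSES afterwards to fall back to the assembly jar.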

For more details, please refer to http://www.voidcn.com/blog/lovehuangjiaju/article/p-4669432.html

With this environment variable set, the launch path of Spark is different: the freshly compiled classes are placed ahead of the assembly jar on the classpath. From bin/spark-class:

# Add the launcher build dir to the classpath if requested.
if [ -n "$SPARK_PREPEND_CLASSES" ]; then
  LAUNCH_CLASSPATH="${SPARK_HOME}/launcher/target/scala-$SPARK_SCALA_VERSION/classes:$LAUNCH_CLASSPATH"
fi

2. mvn

For mvn, the incremental workflow is similar. First run a full build and install the artifacts into the local Maven repository:

$ ./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.3 -Phive -Phive-thriftserver -Dscala-2.11 -DskipTests clean install
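
Once this full install has run, Maven's reactor options let us rebuild a single module instead of the whole tree. A sketch for spark-core (-pl accepts the module directory or its :artifactId):

$ ./build/mvn -pl core -Phadoop-2.7 -Dhadoop.version=2.7.3 -DskipTests package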

Then, as the official Building Spark documentation describes, run the scala-maven-plugin's continuous compilation from within a module:

$ cd core
$ ../build/mvn scala:cc

Also set SPARK_SCALA_VERSION, since the launch scripts use it to locate the compiled classes (see the spark-class snippet above).
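
For the Scala version used throughout this post (-Dscala-2.11), that is:

$ export SPARK_SCALA_VERSION=2.11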