Building Apache Spark from source is never quick: a full build with either mvn or sbt takes a considerable amount of time. Incremental or continuous compilation is the time saver here.

1. sbt

The official Spark documentation recommends sbt as the more suitable tool for day-to-day builds:

But SBT is supported for day-to-day development since it can provide much faster iterative compilation.

The first step is to create a fat jar that bundles all of Spark's dependencies. This slow step normally needs to be repeated only when the dependencies change:

./build/sbt -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.3 -Dscala-2.11 -Phive -Phive-thriftserver -DskipTests assembly 

This produces a single large jar file:

$ ls -hl assembly/target/scala-2.11
total 184M
-rw-rw-r-- 1 ubuntu ubuntu 184M Oct 31 15:28 spark-assembly-1.6.1-hadoop2.7.3.jar

Then create a separate jar for Spark itself; incremental compilation will operate on this jar:

./build/sbt -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.3 -Dscala-2.11 -Phive -Phive-thriftserver -DskipTests package

Then we will have:

~/spark-1.6.1
$ ls -hl assembly/target/scala-2.11
total 184M
-rw-rw-r-- 1 ubuntu ubuntu 184M Oct 31 15:28 spark-assembly-1.6.1-hadoop2.7.3.jar
-rw-rw-r-- 1 ubuntu ubuntu  281 Oct 31 15:42 spark-assembly_2.11-1.6.1.jar

Alternatively, run both steps from the interactive sbt shell:

$ ./build/sbt -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.3 -Dscala-2.11 -Phive -Phive-thriftserver -DskipTests
> assembly 
....
> package

To enter continuous compilation mode, prefix the compile task with ~:

$ ./build/sbt -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.3 -Dscala-2.11 -Phive -Phive-thriftserver -DskipTests
> ~compile
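
With ~compile, sbt watches the source tree and recompiles only the files affected by each change. A minimal sketch of the edit-compile loop (the edited file is just an example):

# Terminal 1: leave sbt waiting in continuous compilation mode
$ ./build/sbt -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.3 -Dscala-2.11 -Phive -Phive-thriftserver -DskipTests
> ~compile

# Terminal 2: edit any Spark source file
$ vim core/src/main/scala/org/apache/spark/SparkContext.scala

sbt in terminal 1 detects the change, recompiles the affected files, and goes back to waiting; press Enter there to leave watch mode.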

To launch Spark against the freshly compiled classes instead of the assembly jar, we have to set an environment variable:

$ export SPARK_PREPEND_CLASSES=true
$ ./sbin/start-all.sh
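
SPARK_PREPEND_CLASSES is honored by the launch scripts in general, so the quickest way to try out a change is an interactive shell; no assembly rebuild is needed between edits:

$ export SPARK_PREPEND_CLASSES=true
$ ./bin/spark-shell

Remember to unset SPARK_PREPEND_CLASSES afterwards to fall back to the assembly jar.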

For more details, please refer to http://www.voidcn.com/blog/lovehuangjiaju/article/p-4669432.html

With this environment variable set, the launch path of Spark is different: the freshly compiled classes are placed ahead of the assembly jar on the classpath. From bin/spark-class:

# Add the launcher build dir to the classpath if requested.
if [ -n "$SPARK_PREPEND_CLASSES" ]; then
  LAUNCH_CLASSPATH="${SPARK_HOME}/launcher/target/scala-$SPARK_SCALA_VERSION/classes:$LAUNCH_CLASSPATH"
fi

2. mvn

For mvn, the incremental workflow is similar. First run a full build and install the artifacts into the local Maven repository:

$ ./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.3 -Phive -Phive-thriftserver -Dscala-2.11 -DskipTests clean install
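
Once this full install has run, Maven's reactor options let us rebuild a single module instead of the whole tree. A sketch for spark-core (-pl accepts the module directory or its :artifactId):

$ ./build/mvn -pl core -Phadoop-2.7 -Dhadoop.version=2.7.3 -DskipTests package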

Then, as the official Building Spark documentation describes, run the scala-maven-plugin's continuous compilation from within a module:

$ cd core
$ ../build/mvn scala:cc

Also set SPARK_SCALA_VERSION, since the launch scripts use it to locate the compiled classes (see the spark-class snippet above).
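
For the Scala version used throughout this post (-Dscala-2.11), that is:

$ export SPARK_SCALA_VERSION=2.11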