Under most configurations, a Spark cluster's input comes from HDFS (a standalone Spark instance can use the local file system instead). So the first step in building a Spark cluster is setting up a Hadoop cluster.

Since I hadn't read through Spark's requirements carefully, I simply chose to install Hadoop 2.2.0, the version mentioned in Spark's README.md file.

Unfortunately, Hadoop 2.2.0 has some issues. First of all, if you download the pre-built binary package and run it on a 64-bit platform, you will see an ugly INFO message every time you run a Hadoop command:

    INFO util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Why? Because the pre-built package ships native libraries compiled for 32-bit systems. So we need to build Hadoop 2.2.0 from its source code.
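
You can confirm the cause yourself: the native library shipped in the binary tarball is a 32-bit ELF object. A quick check, assuming the pre-built tarball has been unpacked into hadoop-2.2.0/ (a sketch, not required for the build itself):

    # Inspect the bundled native library; on the stock 2.2.0 binary
    # release, file reports a 32-bit ELF shared object.
    file hadoop-2.2.0/lib/native/libhadoop.so.1.0.0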

To build the source code, the HADOOP-10110 patch is needed; without it the hadoop-auth module fails to build.
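
Assuming the patch file has been downloaded from the HADOOP-10110 JIRA issue into the top of the source tree, applying it looks roughly like this:

    # Apply the patch from the root of the Hadoop 2.2.0 source tree;
    # use -p1 instead of -p0 if the patch was generated with git diff.
    patch -p0 < HADOOP-10110.patch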

In addition, some libraries need to be installed, such as:

  • libssl-devel (the OpenSSL development headers)
  • protoc 2.5.0 (built from source; run sudo ldconfig at the end, as sketched below)
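
Building protoc 2.5.0 from source follows the usual autotools routine; here is a minimal sketch, assuming the protobuf-2.5.0 tarball has already been downloaded:

    # Build and install Protocol Buffers 2.5.0 from source
    tar xzf protobuf-2.5.0.tar.gz
    cd protobuf-2.5.0
    ./configure
    make
    sudo make install
    sudo ldconfig           # refresh the shared-library cache
    protoc --version        # should report libprotoc 2.5.0

With the patch applied and the dependencies in place, the build itself is a single Maven command, run from the top of the Hadoop source tree (assuming Maven and a JDK are installed); the resulting 64-bit distribution ends up under hadoop-dist/target/:

    # Build the full distribution, including the native libraries
    mvn package -Pdist,native -DskipTests -Dtar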

For more details, refer to http://blog.csdn.net/linshao_andylin/article/details/12307747 and Google…