Monday, August 10, 2015

Apache Spark on UDOO (Arch Linux)

Recently I started to play with Apache Spark and wanted to run it on my UDOO Quad (well, everyone is doing Apache Spark on a Raspberry Pi... and I think the UDOO is more powerful, right? :)

Here are the steps to get Apache Spark running.  Basically, it comes down to installing Python, a JDK, and then the Spark package.  Note that I am using Arch Linux on my UDOO.

Get Python 2.7, not 3.  The easiest way is to install it via pacman:

$ sudo pacman -Sy python2

Apache Spark runs on Scala, which in turn requires a JVM.  By default, most Linux distributions will install OpenJDK.  However, the performance of the OpenJDK Zero VM really lags behind Oracle's HotSpot engine.  So we are going to get Oracle's VM running on the UDOO first.

Go ahead and follow the ArchWiki page to install OpenJDK 8.  Although we don't want to use it, we want the Java environment to be set up properly.  When done, you will find the OpenJDK VM installed under a subfolder in /usr/lib/jvm.

Then, go to Oracle's Java site, accept the license and download the JDK for ARM (I downloaded the 1.8.0_51 version).  Unpack the tar.gz file and place the whole package under /usr/lib/jvm in its own folder (e.g. jdk1.8.0_51).  The Arch Linux Java configuration should pick it up:

$ sudo archlinux-java status
Available Java environments:
  java-8-openjdk (default)
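The unpack-and-place step above can be sketched like this (the archive file name here is an assumption based on the 1.8.0_51 ARM download; use the name of the file you actually got from Oracle):

```shell
# Unpack the Oracle JDK archive downloaded from Oracle's Java site
# (the file name is an assumption -- adjust to your actual download)
$ tar -xzf jdk-8u51-linux-arm32-vfp-hflt.tar.gz

# Place the whole package under /usr/lib/jvm in its own folder
$ sudo mv jdk1.8.0_51 /usr/lib/jvm/
```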

Run the following commands to switch to Oracle's VM:

$ sudo archlinux-java set jdk1.8.0_51
$ sudo archlinux-java status
Available Java environments:
  jdk1.8.0_51 (default)
$ java -version
java version "1.8.0_51"
Java(TM) SE Runtime Environment (build 1.8.0_51-b07)
Java HotSpot(TM) Client VM (build 25.51-b07, mixed mode)

The preparation is done!  Now go to the Apache Spark website and download the pre-built package (I downloaded the 1.4.0 version, pre-built for Hadoop 2.6 and later.  Some say it is better to use the Hadoop 2.4 pre-built package...).
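If you prefer to fetch it from the command line, something like this should work (the URL is an assumption based on the Apache archive layout; the download page will give you a mirror link for your region):

```shell
# Fetch the pre-built Spark 1.4.0 package for Hadoop 2.6
# (URL is an assumption -- use the link from the download page)
$ wget https://archive.apache.org/dist/spark/spark-1.4.0/spark-1.4.0-bin-hadoop2.6.tgz
```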

Unpack the file.  Try to run the SparkPi example:

$ cd spark-1.4.0-bin-hadoop2.6
$ bin/run-example SparkPi 10

Among all the log messages, you should see an *estimate* of the value of Pi:
Pi is roughly 3.14258

Apache Spark should be fully functional by now.  You can go ahead and try the Spark shell, the PySpark shell, etc.
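For example, a quick sanity check in each shell (both snippets count the even numbers below 1000 and should return 500):

```shell
# Scala shell (exit with :quit or Ctrl-D)
$ bin/spark-shell
scala> sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()

# Python shell (exit with Ctrl-D)
$ bin/pyspark
>>> sc.parallelize(range(1000)).filter(lambda x: x % 2 == 0).count()
```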
