Sunday, August 23, 2015

Apache Spark Standalone cluster on UDOO and Raspberry Pi

After setting up Apache Spark on UDOO, I tried to set up a cluster by adding a Raspberry Pi.  Here are the steps.

Environment

In this setup, UDOO will be both the master and a worker.  Another worker will run on a Raspberry Pi.  A user account "spark" is created on both machines.  Each machine has Apache Spark placed under that account's home directory (e.g. /home/spark/spark-1.4.1-bin-hadoop2.6).

For reference, my UDOO is running Arch Linux (kernel 4.1.6) with Oracle JDK 1.8.0_60, Scala 2.10.5, and Python 2.7.10.  Its hostname is "maggie".

The Raspberry Pi (512MB Model B) is running Raspbian (kernel 4.1.6) with Oracle JDK 1.8.0, Scala 2.10.5, and Python 2.7.3.  Its hostname is "spark01".  (Note that you may want to modify the line in /etc/hosts that maps 127.0.0.1 to the hostname.  Change it to use the IP address of eth0 instead.  Otherwise Spark may bind to the wrong address.)
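
To see whether the /etc/hosts note above applies to a machine, something like the following Python 3 sketch could help.  It only resolves the local hostname and reports whether the result is a loopback address.  The function names are my own and it is not part of Spark.

```python
import ipaddress
import socket


def is_loopback(address: str) -> bool:
    """Return True if the address is in a loopback range such as 127.0.0.0/8."""
    return ipaddress.ip_address(address).is_loopback


def check_hostname_binding() -> str:
    """Resolve this machine's hostname and report whether it maps to loopback."""
    hostname = socket.gethostname()
    try:
        address = socket.gethostbyname(hostname)
    except OSError as err:
        return f"{hostname} could not be resolved: {err}"
    if is_loopback(address):
        return f"{hostname} -> {address} (loopback; consider editing /etc/hosts)"
    return f"{hostname} -> {address} (not loopback)"
```

Running print(check_hostname_binding()) on each node before starting the cluster shows which address the hostname currently resolves to.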


Setup

First we need to enable password-less ssh login with a key pair, as the Spark master needs to log in to the slaves.  On UDOO (the master), log in as spark and execute the following command to generate the keys:

$ ssh-keygen -t rsa -b 4096

Use an empty passphrase (just press Enter) when prompted to encrypt the private key.  Then we need to install the public key as "authorized_keys" on both UDOO (as UDOO will run as one of the slaves) and the Raspberry Pi:

$ cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
$ chmod go-r-w-x ~/.ssh/authorized_keys
$ ssh-copy-id spark@spark01

Enter the password of the spark account when copying the key to spark01 (the Raspberry Pi).  Once it is done, ssh to spark01 again; no password should be needed to log in.
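
The chmod step above strips all group and other permissions from authorized_keys.  If key login still asks for a password, file permissions are one thing worth checking.  Here is a small Python 3 sketch that tests whether a file is private to its owner.  The helper names are hypothetical and the check simply mirrors the chmod above; it does not reproduce the exact rules sshd applies.

```python
import os
import stat


def mode_is_private(mode: int) -> bool:
    """Return True if the permission bits grant nothing to group or others."""
    return (mode & 0o077) == 0


def is_private(path: str) -> bool:
    """Return True if the file at path is accessible by its owner only."""
    return mode_is_private(stat.S_IMODE(os.stat(path).st_mode))
```

For example, is_private(os.path.expanduser("~/.ssh/authorized_keys")) should return True after the chmod command above.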

Once ssh is configured, we can setup Spark itself.

Under the conf folder of Spark, create a file called slaves to list our Spark worker machines:

spark01
localhost

Note that the second line is "localhost", which will start a worker locally on UDOO.

Also on the master machine, copy conf/spark-env.sh.template to conf/spark-env.sh and uncomment / edit the following:

SPARK_MASTER_IP=192.168.1.10
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=256m
SPARK_EXECUTOR_MEMORY=256m
SPARK_DRIVER_MEMORY=256m

SPARK_MASTER_IP points to the address of my UDOO.  The number of worker cores is set to 2: since the master also runs on UDOO, it is better not to let the worker use all 4 cores, which is the default.  Worker memory is set to 256MB.  For other options, refer to the Spark site.

Create the same conf/spark-env.sh on the Raspberry Pi.  For my Raspberry Pi B with only 1 core, I changed SPARK_WORKER_CORES to 1.  If you have multiple slaves to deploy to, you may use scp to copy the file to each of them (and test out the password-less ssh login at the same time):

$ scp ~/spark-1.4.1-bin-hadoop2.6/conf/spark-env.sh spark@spark01:spark-1.4.1-bin-hadoop2.6/conf/
$ scp ~/spark-1.4.1-bin-hadoop2.6/conf/spark-env.sh spark@spark02:spark-1.4.1-bin-hadoop2.6/conf/
$ scp ~/spark-1.4.1-bin-hadoop2.6/conf/spark-env.sh spark@spark03:spark-1.4.1-bin-hadoop2.6/conf/
...
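
Since the only per-node difference in my setup is the core count, the file contents could also be generated instead of edited by hand.  Below is a minimal Python 3 sketch of that idea.  The render_spark_env helper and the NODES table are made up for illustration; the variable names and values are the ones listed above.

```python
def render_spark_env(master_ip: str, worker_cores: int, memory: str = "256m") -> str:
    """Build the contents of conf/spark-env.sh for one node."""
    settings = [
        ("SPARK_MASTER_IP", master_ip),
        ("SPARK_WORKER_CORES", str(worker_cores)),
        ("SPARK_WORKER_MEMORY", memory),
        ("SPARK_EXECUTOR_MEMORY", memory),
        ("SPARK_DRIVER_MEMORY", memory),
    ]
    return "\n".join(f"{name}={value}" for name, value in settings) + "\n"


# Core counts from the text: 2 on the UDOO, 1 on the Raspberry Pi Model B.
NODES = {"maggie": 2, "spark01": 1}

for host, cores in NODES.items():
    print(f"# {host}")
    print(render_spark_env("192.168.1.10", cores), end="")
```

The output for each host can then be written to a file and copied over with scp as shown above.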

The cluster should be ready by now.  We can start the master and slaves by running the following command on the master:

$ sbin/start-all.sh

You can also start the master and individual slaves separately.  Refer to the Spark documentation for other commands.

Check the console and the logs folder for any error messages.  Strangely, on my setup, the console shows "failed to launch" messages even though both the master and the slaves start up fine.

......
failed to launch org.apache.spark.deploy.master.Master
full log in /home/spark/spark-1.4.1-bin-hadoop2.6/sbin/../logs/spark-spark-org.apache.spark.deploy.master.Master-1-maggie.out
......
spark01: failed to launch org.apache.spark.deploy.worker.Worker:
spark01: full log in /home/spark/spark-1.4.1-bin-hadoop2.6/sbin/../logs/spark-spark-org.apache.spark.deploy.worker.Worker-1-spark01.out
......

To make sure everything is working, point your browser to the cluster web UI on the master node, e.g. http://192.168.1.10:8080/

You should see two workers connected to the cluster.  (Wait for a minute or two, as it takes some time for the worker on the Raspberry Pi to start up.)


We can then submit a job to the cluster, e.g.

$ bin/spark-submit  --master spark://192.168.1.10:7077 --executor-memory=256m examples/src/main/python/pi.py 10

(Note the use of --executor-memory.  Since I configured my slaves with only 256MB of worker memory, I need to limit the application memory when submitting a job too.  Otherwise, no worker will have enough memory to pick up the job.)
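
To illustrate the note above, here is a simplified Python 3 model of the rule: a worker can host an executor only if the worker has at least as much memory as the executor asks for.  This is my own sketch and not Spark's scheduler code.  The "1g" request is a hypothetical example of asking for more than the workers offer.

```python
import re


def to_mb(size: str) -> int:
    """Convert a size string such as '256m' or '1g' to megabytes."""
    match = re.fullmatch(r"(\d+)([mg])", size.strip().lower())
    if match is None:
        raise ValueError(f"unsupported size: {size!r}")
    number, unit = int(match.group(1)), match.group(2)
    return number * (1024 if unit == "g" else 1)


def eligible_workers(workers: dict, executor_memory: str) -> list:
    """Return the workers with enough memory for one executor of the given size."""
    needed = to_mb(executor_memory)
    return [name for name, memory in workers.items() if to_mb(memory) >= needed]


# Worker memory as configured in conf/spark-env.sh on both machines.
workers = {"maggie": "256m", "spark01": "256m"}

print(eligible_workers(workers, "256m"))  # both workers qualify
print(eligible_workers(workers, "1g"))    # hypothetical larger request: no worker qualifies
```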

Refresh the web UI to monitor the progress.


On my setup, the UDOO worker usually picks up applications much quicker than the Raspberry Pi, so with a simple test like this, no job will run on the Raspberry Pi at all.  Remove UDOO (the localhost entry) from the conf/slaves list to test only the Raspberry Pi if necessary.




Thursday, August 20, 2015

Comparing JVM options when tuning for performance

Sometimes, when tuning JVM performance by trying different command line options, it is useful to see which flags are actually in effect.

We can specify the options "-XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal" to do just that.

The following is the diff (diff -U 0) between the -client and -server options, running Java 1.8.0_60 on ARM hf.

$ java -server -XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal > server

$ java -client -XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal > client

$ diff -U 0 client server
--- client      2015-08-20 18:25:36.208946610 -0700
+++ server      2015-08-20 18:25:24.664324031 -0700
@@ -12,0 +13,2 @@
+     intx AliasLevel                                = 3                                   {C2 product}
+     bool AlignVector                               = true                                {C2 product}
@@ -30,0 +33 @@
+     intx AutoBoxCacheMax                           = 128                                 {C2 product}
@@ -33 +36 @@
-     intx BackEdgeThreshold                         = 100000                              {pd product}
+     intx BackEdgeThreshold                         = 140000                              {pd product}
@@ -41,0 +45,3 @@
+     bool BlockLayoutByFrequency                    = true                                {C2 product}
+     intx BlockLayoutMinDiamondPercentage           = 20                                  {C2 product}
+     bool BlockLayoutRotateLoops                    = true                                {C2 product}
@@ -42,0 +49 @@
+     bool BranchOnRegister                          = false                               {C2 product}
@@ -45,9 +52 @@
-     bool C1OptimizeVirtualCallProfiling            = true                                {C1 product}
-     bool C1PatchInvokeDynamic                      = true                                {C1 diagnostic}
-     bool C1ProfileBranches                         = true                                {C1 product}
-     bool C1ProfileCalls                            = true                                {C1 product}
-     bool C1ProfileCheckcasts                       = true                                {C1 product}
-     bool C1ProfileInlinedCalls                     = true                                {C1 product}
-     bool C1ProfileVirtualCalls                     = true                                {C1 product}
-     bool C1UpdateMethodData                        = false                               {C1 product}
-     intx CICompilerCount                           = 1                                   {product}
+     intx CICompilerCount                           = 2                                   {product}
@@ -148 +147 @@
-     intx CompileThreshold                          = 1500                                {pd product}
+     intx CompileThreshold                          = 10000                               {pd product}
@@ -153,0 +153 @@
+     intx ConditionalMoveLimit                      = 4                                   {C2 pd product}
@@ -161,0 +162 @@
+     bool DebugInlinedCalls                         = true                                {C2 diagnostic}
@@ -171,0 +173 @@
+ccstrlist DisableIntrinsic                          =                                     {C2 diagnostic}
@@ -174,0 +177,2 @@
+     bool DoEscapeAnalysis                          = true                                {C2 product}
+     intx DominatorSearchLimit                      = 1000                                {C2 diagnostic}
@@ -180,0 +185,5 @@
+     intx EliminateAllocationArraySizeLimit         = 64                                  {C2 product}
+     bool EliminateAllocations                      = true                                {C2 product}
+     bool EliminateAutoBox                          = true                                {C2 product}
+     bool EliminateLocks                            = true                                {C2 product}
+     bool EliminateNestedLocks                      = true                                {C2 product}
@@ -188,0 +198 @@
+   double EscapeAnalysisTimeout                     = 20.000000                           {C2 product}
@@ -210 +220 @@
-     intx FreqInlineSize                            = 325                                 {pd product}
+     intx FreqInlineSize                            = 175                                 {pd product}
@@ -265,0 +276 @@
+     bool IncrementalInline                         = true                                {C2 product}
@@ -267 +278 @@
-    uintx InitialCodeCacheSize                      = 163840                              {pd product}
+    uintx InitialCodeCacheSize                      = 1572864                             {pd product}
@@ -276 +287,2 @@
-     bool InlineSynchronizedMethods                 = true                                {C1 product}
+     bool InsertMemBarAfterArraycopy                = true                                {C2 product}
+     intx InteriorEntryAlignment                    = 16                                  {C2 pd product}
@@ -290 +301,0 @@
-     bool LIRFillDelaySlots                         = false                               {C1 pd product}
@@ -293,0 +305 @@
+     intx LiveNodeCountInliningCutoff               = 40000                               {C2 product}
@@ -300,0 +313,6 @@
+     bool LoopLimitCheck                            = true                                {C2 diagnostic}
+     intx LoopMaxUnroll                             = 16                                  {C2 product}
+     intx LoopOptsCount                             = 43                                  {C2 product}
+     intx LoopUnrollLimit                           = 60                                  {C2 pd product}
+     intx LoopUnrollMin                             = 4                                   {C2 product}
+     bool LoopUnswitching                           = true                                {C2 product}
@@ -320,0 +339,4 @@
+     intx MaxJumpTableSize                          = 65000                               {C2 product}
+     intx MaxJumpTableSparseness                    = 5                                   {C2 product}
+     intx MaxLabelRootDepth                         = 1100                                {C2 product}
+     intx MaxLoopPad                                = 15                                  {C2 product}
@@ -325 +347,2 @@
- uint64_t MaxRAM                                    = 1073741824                          {pd product}
+     intx MaxNodeLimit                              = 75000                               {C2 product}
+ uint64_t MaxRAM                                    = 0                                   {pd product}
@@ -330 +353,2 @@
-    uintx MetaspaceSize                             = 12582912                            {pd product}
+     intx MaxVectorSize                             = 8                                   {C2 product}
+    uintx MetaspaceSize                             = 16777216                            {pd product}
@@ -334,0 +359 @@
+     intx MinJumpTableSize                          = 16                                  {C2 pd product}
@@ -341,0 +367 @@
+     intx MultiArrayExpandLimit                     = 6                                   {C2 product}
@@ -350 +376 @@
-     bool NeverActAsServerClassMachine              = true                                {pd product}
+     bool NeverActAsServerClassMachine              = false                               {pd product}
@@ -357,0 +384 @@
+     intx NodeLimitFudgeFactor                      = 2000                                {C2 product}
@@ -358,0 +386 @@
+     intx NumberOfLoopInstrToAlign                  = 4                                   {C2 product}
@@ -365 +393,6 @@
-     intx OnStackReplacePercentage                  = 933                                 {pd product}
+     intx OnStackReplacePercentage                  = 140                                 {pd product}
+     bool OptimizeExpensiveOps                      = true                                {C2 diagnostic}
+     bool OptimizeFill                              = true                                {C2 product}
+     bool OptimizePtrCompare                        = true                                {C2 product}
+     bool OptimizeStringConcat                      = true                                {C2 product}
+     bool OptoBundling                              = false                               {C2 pd product}
@@ -366,0 +400 @@
+     bool OptoScheduling                            = true                                {C2 pd product}
@@ -382,0 +417,3 @@
+     bool PartialPeelAtUnsignedTests                = true                                {C2 product}
+     bool PartialPeelLoop                           = true                                {C2 product}
+     intx PartialPeelNewPhiDelta                    = 0                                   {C2 product}
@@ -442,0 +480 @@
+     bool PrintIntrinsics                           = false                               {C2 diagnostic}
@@ -453,0 +492,2 @@
+     bool PrintPreciseBiasedLockingStatistics       = false                               {C2 diagnostic}
+     bool PrintPreciseRTMLockingStatistics          = false                               {C2 diagnostic}
@@ -473 +513,2 @@
-     bool ProfileInterpreter                        = false                               {pd product}
+     bool ProfileDynamicTypes                       = true                                {C2 diagnostic}
+     bool ProfileInterpreter                        = true                                {pd product}
@@ -482,0 +524,5 @@
+     bool RangeLimitCheck                           = true                                {C2 diagnostic}
+     bool ReassociateInvariants                     = true                                {C2 product}
+     bool ReduceBulkZeroing                         = true                                {C2 product}
+     bool ReduceFieldZeroing                        = true                                {C2 product}
+     bool ReduceInitialCardMarks                    = true                                {C2 product}
@@ -496,3 +542,2 @@
-     bool RewriteBytecodes                          = false                               {pd product}
-     bool RewriteFrequentPairs                      = false                               {pd product}
-     intx SafepointPollOffset                       = 0                                   {C1 pd product}
+     bool RewriteBytecodes                          = true                                {pd product}
+     bool RewriteFrequentPairs                      = true                                {pd product}
@@ -515,0 +561,2 @@
+     bool SpecialEncodeISOArray                     = true                                {C2 product}
+     bool SplitIfBlocks                             = true                                {C2 product}
@@ -578 +624,0 @@
-     bool TimeLinearScan                            = false                               {C1 product}
@@ -599,0 +646,2 @@
+     bool TraceTypeProfile                          = false                               {C2 diagnostic}
+     intx TrackedInitializationLimit                = 50                                  {C2 product}
@@ -601,0 +650 @@
+     bool TrapBasedRangeChecks                      = false                               {C2 pd product}
@@ -603,0 +653 @@
+     intx TypeProfileMajorReceiverPercent           = 90                                  {C2 product}
@@ -608,0 +659 @@
+     bool UnrollLimitCheck                          = true                                {C2 diagnostic}
@@ -622,0 +674 @@
+     bool UseBimorphicInlining                      = true                                {C2 product}
@@ -632,0 +685 @@
+     bool UseCondCardMark                           = false                               {C2 product}
@@ -633,0 +687 @@
+     bool UseDivMod                                 = true                                {C2 product}
@@ -634,0 +689 @@
+     bool UseFPUForSpilling                         = true                                {C2 product}
@@ -643,0 +699 @@
+     bool UseImplicitStableValues                   = true                                {C2 diagnostic}
@@ -644,0 +701 @@
+     bool UseInlineDepthForSpeculativeTypes         = true                                {C2 diagnostic}
@@ -645,0 +703 @@
+     bool UseJumpTables                             = true                                {C2 product}
@@ -653 +711,2 @@
-     bool UseLoopInvariantCodeMotion                = true                                {C1 product}
+     bool UseLoopPredicate                          = true                                {C2 product}
+     bool UseMathExactIntrinsics                    = true                                {C2 product}
@@ -655,0 +715 @@
+     bool UseMultiplyToLenIntrinsic                 = false                               {C2 product}
@@ -661,0 +722 @@
+     bool UseOldInlining                            = true                                {C2 product}
@@ -662,0 +724 @@
+     bool UseOnlyInlinedBimorphic                   = true                                {C2 product}
@@ -663,0 +726 @@
+     bool UseOptoBiasInlining                       = false                               {C2 product}
@@ -669 +732,2 @@
-     bool UsePopCountInstruction                    = false                               {product}
+     bool UsePopCountInstruction                    = true                                {product}
+     bool UseRDPCForConstantTableBase               = false                               {C2 product}
@@ -680,0 +745 @@
+     bool UseSuperWord                              = true                                {C2 product}
@@ -684,0 +750 @@
+     bool UseTypeSpeculation                        = true                                {C2 product}
@@ -690,2 +756 @@
-     intx ValueMapInitialSize                       = 11                                  {C1 product}
-     intx ValueMapMaxLoopSize                       = 8                                   {C1 product}
+     intx ValueSearchLimit                          = 1000                                {C2 product}
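
A raw diff like the one above mixes flags that exist in only one VM with flags whose values differ.  As a rough Python 3 sketch, the two dumps could be parsed and grouped by flag name instead.  The regular expression is my reading of the line layout shown above, and the sample lines are copied from the diff; the function names are my own.

```python
import re

# type, name, "=" or ":=", value (may be empty), {category}
FLAG_LINE = re.compile(r"^\s*(\S+)\s+(\S+)\s+:?=\s*(.*?)\s*\{(.+)\}\s*$")


def parse_flags(text: str) -> dict:
    """Map flag name -> (type, value, category) from -XX:+PrintFlagsFinal output."""
    flags = {}
    for line in text.splitlines():
        match = FLAG_LINE.match(line)
        if match:
            flag_type, name, value, category = match.groups()
            flags[name] = (flag_type, value, category)
    return flags


def compare(client: dict, server: dict) -> dict:
    """Group flags into those present in one VM only and those whose values differ."""
    return {
        "client_only": sorted(set(client) - set(server)),
        "server_only": sorted(set(server) - set(client)),
        "changed": {
            name: (client[name][1], server[name][1])
            for name in sorted(set(client) & set(server))
            if client[name][1] != server[name][1]
        },
    }


client_sample = """\
     intx CICompilerCount                           = 1                                   {product}
     intx CompileThreshold                          = 1500                                {pd product}
     bool TimeLinearScan                            = false                               {C1 product}
"""
server_sample = """\
     intx CICompilerCount                           = 2                                   {product}
     intx CompileThreshold                          = 10000                               {pd product}
     bool DoEscapeAnalysis                          = true                                {C2 product}
"""

result = compare(parse_flags(client_sample), parse_flags(server_sample))
print(result)
```

In practice the two sample strings would be replaced by the contents of the client and server files created above.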

Wednesday, August 19, 2015

Testing I2C connections between Raspberry Pi and Arduino

Can a 5V Arduino connect directly to a 3.3V Raspberry Pi via I2C?  Yes.  Sort of.  You will need to disable the Arduino's internal pullup resistors.  The pullup resistors of the Raspberry Pi, in theory, should then make the connection work without using a level shifter.

Rpi          Arduino
---------    ---------
Gnd          Gnd
SDA (GPIO0)  SDA (A4)
SCL (GPIO1)  SCL (A5)


Code also available on GitHub.

Saturday, August 15, 2015

Fun with IPC

Checking out POSIX IPC with Python

https://github.com/kitsook/python_ipc

Friday, August 14, 2015

Using the -Xcomp flag to disable interpreted method invocations

A little experiment with the -Xcomp flag of the Java VM.  By default, a method is executed in interpreted mode until the number of invocations reaches a certain threshold.  The threshold can be set by runtime options (e.g. -client, -server, and -XX:CompileThreshold).  The -Xcomp flag forces compilation of the code on first invocation.

The following is a comparison of running the sample jmh benchmark (basically an empty method) with and without the flag.  Hardware platform is armv7l with 1GB of RAM.  Software is Arch Linux with kernel 4.1.5-1-ARCH, running Oracle Java 1.8.0_51 with HotSpot Server VM.

The benchmark uses the default jmh settings, i.e. 10 forks of 20 measurement iterations, each fork with 20 warmup iterations.

When run without -Xcomp:

Result "testMethod":
  46954362.329 ±(99.9%) 121531.693 ops/s [Average]
  (min, avg, max) = (44999194.538, 46954362.329, 47305769.412), stdev = 514572.799
  CI (99.9%): [46832830.635, 47075894.022] (assumes normal distribution)


# Run complete. Total time: 00:06:55

Benchmark                Mode  Cnt         Score        Error  Units
MyBenchmark.testMethod  thrpt  200  46954362.329 ± 121531.693  ops/s


When run with -Xcomp:

Result "testMethod":
  47114913.482 ±(99.9%) 89902.070 ops/s [Average]
  (min, avg, max) = (46113318.172, 47114913.482, 47304787.674), stdev = 380650.996
  CI (99.9%): [47025011.412, 47204815.553] (assumes normal distribution)


# Run complete. Total time: 00:07:41

Benchmark                Mode  Cnt         Score       Error  Units
MyBenchmark.testMethod  thrpt  200  47114913.482 ± 89902.070  ops/s


Note that:

1. When -Xcomp is specified, the overall runtime is 11% longer (7:41 vs. 6:55).  That is because the VM has to wait for the code to be compiled before running it, i.e. the efficiency is lower.

2. The error (i.e. variation) is 26% lower when -Xcomp is specified.  That is probably because all code has already been compiled, so no time is spent on compilation during the measurement.

3. The average score (i.e. throughput) is almost the same.  That is because jmh runs warmup iterations in each fork before measuring.
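
For reference, the percentages in the notes above can be reproduced from the two result blocks.  Here is a short Python 3 sketch; the helper names are my own, and the numbers are copied from the jmh output.  It also checks whether the two 99.9% confidence intervals overlap, which is one way to read "almost the same".

```python
def to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


# "before" is the default run, "after" is the run with -Xcomp.
runtime = percent_change(to_seconds("00:06:55"), to_seconds("00:07:41"))
error = percent_change(121531.693, 89902.070)
score = percent_change(46954362.329, 47114913.482)

# Two intervals overlap when each lower bound is below the other's upper bound.
ci_default = (46832830.635, 47075894.022)
ci_xcomp = (47025011.412, 47204815.553)
overlap = ci_default[0] <= ci_xcomp[1] and ci_xcomp[0] <= ci_default[1]

print(f"runtime {runtime:+.1f}%, error {error:+.1f}%, score {score:+.2f}%, CIs overlap: {overlap}")
```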

Monday, August 10, 2015

Apache Spark on UDOO (Arch Linux)

Recently started to play with Apache Spark and so wanted to run it on my UDOO Quad (well everyone is doing Apache Spark on Raspberry Pi... and I think UDOO is more powerful, right? :)

Here are the steps to get Apache Spark running.  Basically, we install Python, the JDK, and then the Spark package.  Note that I am using Arch Linux on my UDOO.

Get Python 2.7, not 3.  The easiest way is to install it via pacman:

$ sudo pacman -Sy python2

Apache Spark runs on Scala, which in turn requires a JVM.  By default, most Linux distributions will install OpenJDK.  However, the performance of the OpenJDK Zero VM really lags behind when compared with Oracle's HotSpot engine.  So we are going to get the Oracle VM running on UDOO first.

Go ahead and follow the ArchWiki page to install OpenJDK 8.  Although we don't want to use it, we want the Java environment to be set up properly.  When done, you will find the OpenJDK VM installed under a subfolder in /usr/lib/jvm.

Then, go to Oracle's Java site, accept the license and download the JDK for ARM (I downloaded the 1.8.0_51 version).  Unpack the tar.gz file and place the whole package under /usr/lib/jvm in its own folder (e.g. jdk1.8.0_51).  The Arch Linux Java configuration should pick it up:

$ sudo archlinux-java status
Available Java environments:
  java-8-openjdk (default)
  jdk1.8.0_51

Run the following commands to switch to Oracle's VM:

$ sudo archlinux-java set jdk1.8.0_51
$ sudo archlinux-java status
Available Java environments:
  java-8-openjdk
  jdk1.8.0_51 (default)
$ java -version
java version "1.8.0_51"
Java(TM) SE Runtime Environment (build 1.8.0_51-b07)
Java HotSpot(TM) Client VM (build 25.51-b07, mixed mode)

The preparation is done!  Now go to the Apache Spark website and download the pre-built package (I downloaded the 1.4.0 version, pre-built for Hadoop 2.6 and later.  Some say it is better to use the Hadoop 2.4 pre-built package...).

Unpack the file.  Try to run the SparkPi example:

$ cd spark-1.4.0-bin-hadoop2.6
$ bin/run-example SparkPi 10

Among all the log messages, you should see an *estimation* of the Pi value:
Pi is roughly 3.14258
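
The value is only an estimate because the SparkPi example uses Monte Carlo sampling: it picks random points in a square and counts how many fall inside the inscribed circle.  Here is a plain Python 3 sketch of the same idea, without Spark.  The function name, seed, and sample count are arbitrary choices of mine, so the printed digits will differ from the run above.

```python
import random


def estimate_pi(samples: int, seed: int = 42) -> float:
    """Estimate Pi from random points in the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        # The point lies inside the unit circle if x^2 + y^2 <= 1.
        if x * x + y * y <= 1.0:
            inside += 1
    # Area of circle / area of square = Pi / 4.
    return 4.0 * inside / samples


print(f"Pi is roughly {estimate_pi(200_000)}")
```

More samples give a closer estimate.  In Spark the samples are split across the workers and the counts are summed at the end.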

Apache Spark should be fully functional by now.  You can go ahead and try the Spark shell, the PySpark shell, etc.

Sunday, August 9, 2015

TM4C123 with DHT11 sensor

Connected the DHT11 sensor to TM4C123.  The sensor is not that accurate, but simple to use.  Result displayed on Nokia 5110 LCD.

Programs written with Energia.  There are libraries for the sensor and LCD display, but both require slight modification.

Code available on github.



Updates 2016-05-14:

Here is the wiring.  Note that my Nokia 5110 board supports 3V to 5V input.  Your mileage may vary.

TM4C123 -  LCD 5110        Comment
==================================
VBUS    -  Vcc             My version of 5110 supports 3v to 5v
VBUS    -  BL              Backlight
GND     -  GND
PB_5    -  RST             Reset 
PB_4    -  Clk             SCK(2) to Clock
PB_7    -  Din             MOSI(2) to Serial data in
PA_7    -  CE              Chip Select
PA_2    -  DC              Select between data or command



TM4C123 -  DHT11        Comment
==================================
PD_7    -  Data
VBUS    -  Vcc
GND     -  GND


Saturday, August 8, 2015

Nokia 5110 LCD connected to Raspberry Pi

The wiring follows the sample wiring method shown on Adafruit.



Source code available on github:
https://github.com/kitsook/lcd5110

Wednesday, August 5, 2015

JMH result on UDOO

Just for fun.  As a follow-up to Java 8 on UDOO, here are the results of running JMH on UDOO.

OpenJDK 1.8.0_51 (Zero VM in interpreted mode only.  No Cacao for the default build):

Benchmark                Mode  Cnt        Score      Error  Units
MyBenchmark.testMethod  thrpt  200  2593688.336 ± 4998.196  ops/s


Oracle JVM 1.8.0_51:

Benchmark                Mode  Cnt         Score       Error  Units
MyBenchmark.testMethod  thrpt  200  47042989.345 ± 97733.584  ops/s


That is an 18x improvement in throughput.
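
The ratio can be computed directly from the two summary lines.  Here is a Python 3 sketch that parses a JMH summary row and divides the scores.  The regular expression is my own guess at the column layout shown above, and it accepts any single character as the separator between score and error, since terminals do not always render it the same way.

```python
import re

# benchmark, mode, count, score, separator, error, units
SUMMARY = re.compile(r"^(\S+)\s+(\S+)\s+(\d+)\s+([\d.]+)\s+\S\s+([\d.]+)\s+(\S+)$")


def parse_summary(line: str) -> dict:
    """Parse one row of the JMH result table into a dictionary."""
    match = SUMMARY.match(line.strip())
    if match is None:
        raise ValueError(f"not a JMH summary line: {line!r}")
    name, mode, count, score, error, units = match.groups()
    return {
        "benchmark": name,
        "mode": mode,
        "count": int(count),
        "score": float(score),
        "error": float(error),
        "units": units,
    }


zero_vm = parse_summary("MyBenchmark.testMethod  thrpt  200  2593688.336 ? 4998.196  ops/s")
hotspot = parse_summary("MyBenchmark.testMethod  thrpt  200  47042989.345 ? 97733.584  ops/s")

speedup = hotspot["score"] / zero_vm["score"]
print(f"{speedup:.1f}x")
```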

Java 8 on UDOO

The performance of OpenJDK 8 on UDOO (Arch Linux) is not that great when compared with the BeagleBone Black:

$ uname -a
Linux maggie 4.1.3-1-ARCH #1 SMP Wed Jul 22 18:44:39 MDT 2015 armv7l GNU/Linux

$ java -version
openjdk version "1.8.0_51"
OpenJDK Runtime Environment (build 1.8.0_51-b16)
OpenJDK Zero VM (build 25.51-b03, interpreted mode)

$ time java -XX:+TieredCompilation -XX:+AggressiveOpts fastaredux 25000000 > /dev/null 2>&1

real    6m52.692s
user    6m52.095s
sys     0m0.480s


Time to download the Oracle JDK 8 for ARM.  As was the case with the BBB and the Raspberry Pi, the result is much better:

$ java -version
java version "1.8.0_51"
Java(TM) SE Runtime Environment (build 1.8.0_51-b07)
Java HotSpot(TM) Client VM (build 25.51-b07, mixed mode)

$ time java -XX:+TieredCompilation -XX:+AggressiveOpts fastaredux 25000000 > /dev/null 2>&1

real    0m20.618s
user    0m20.355s
sys     0m0.290s
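
To put a number on "much better", the two real times can be converted to seconds and divided.  Here is a small Python 3 sketch; the real_seconds helper is my own and only handles the minutes-and-seconds format printed by time above.

```python
import re


def real_seconds(value: str) -> float:
    """Convert a value such as '6m52.692s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", value)
    if match is None:
        raise ValueError(f"unexpected time format: {value!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


zero_vm = real_seconds("6m52.692s")   # OpenJDK Zero VM
hotspot = real_seconds("0m20.618s")   # Oracle HotSpot Client VM

print(f"{zero_vm / hotspot:.1f}x faster")
```

That works out to roughly 20x for this benchmark.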