Why build this cluster?

The good:

  • The Raspberry Pi runs the same Linux and Hadoop stack as an Amazon EC2 instance, so it is good for learning and testing.
  • Cheap: less than $200 for the whole system, with about 40 W of power consumption while running 24x7.
  • Everything sits on my own local network, so there is no remote server to go down and command responses have essentially zero lag.
  • Versatile: it could also serve as a multi-machine web server or a personal cloud.
  • It sits on the desk, and actually owning the server feels good.

But:

  • The ARM CPUs run at low clock speeds, so the cluster is slow and not suited for large datasets.
  • However, you can test a small portion of a large dataset locally before shipping the full job to AWS.

Cluster Physical Setup

Cluster Specs

  • 4 x Raspberry Pi 3 with 64 GB SD card
  • Gigabit Ethernet switch for inter-node communication
  • Cables and Power

Assembly

  • To stack the boards together, I ordered a stacking kit designed for the Raspberry Pi.
  • For power, I used a 40W 5-port USB charger with a maximum output of 2.4A per port.
    The official recommendation is a 2.5A power supply for each Raspberry Pi 3, but I highly doubt they draw that much.
  • Since the Raspberry Pi 3 has a built-in WiFi adapter, wired connections aren’t strictly necessary for the cluster to work.
    However, for the sake of stability and adaptability, I used a Gigabit Ethernet switch for inter-node communication.
    Whenever the cluster needs to move somewhere else, only the master node has to be connected to the Internet, by wire or WiFi.
  • The fully assembled cluster:

[photo: assembled cluster]

Cluster Software Setup

Install Linux Image

For the Linux distribution, I used RASPBIAN, the officially supported operating system. Specifically, I used RASPBIAN JESSIE LITE, which has no GUI and is only 292 MB.

To avoid silly mistakes that take hours to track down, I install the Linux system and Hadoop on only one machine,
configure and test Hadoop to check that everything is OK, and then
dump the whole SD card image to the other three.

Installation:

  1. Insert the SD card into my MacBook
  2. diskutil list to list the mounted disks
  3. unmount the SD card before writing the image:

    diskutil unmountDisk /dev/diskN

    in my case it is disk3

  4. write the image to the SD card:

    sudo dd bs=1m if=image.img of=/dev/rdiskN

    and in my case:

    sudo dd bs=1m if=/Users/DLI/Downloads/2016-05-27-raspbian-jessie-lite.img of=/dev/rdisk3
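
Writing the image takes a while with no output; on macOS you can press Ctrl-T, which sends SIGINFO to BSD dd and makes it print its progress. Once dd finishes, eject the card before moving it to the Pi (a small extra step; disk3 is my card, yours may differ):

diskutil eject /dev/disk3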
    

OS setup and Hadoop Compilation

Connect to RPi:

  • Insert the SD card into one of the RPis and connect that RPi to my home router. For clarity, I will name this RPi masterPi.
  • Then, log into my router to find the IP address of masterPi,
    which is 192.168.1.125, and reserve that address for masterPi
    so I don’t need to look it up again after every reboot.
  • log into masterPi:

    ssh pi@192.168.1.125

    with the default password: raspberry

  • expand the file system and configure the locale setting with sudo raspi-config

  • do a

    sudo apt-get update && sudo apt-get upgrade

    to keep the system fresh

  • create a new group hadoop and add a new user HadoopUser to it:

    sudo addgroup hadoop
    sudo adduser --ingroup hadoop HadoopUser
    

    the shell gives me a warning: adduser: Please enter a username matching the regular expression configured
    after some research, it seems the default naming rules don’t allow usernames that mix upper and lower case,
    so I used

    sudo adduser --ingroup hadoop hadoopuser

    instead (lesson learned…), then add hadoopuser to the sudo group:

    sudo adduser hadoopuser sudo
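
    To sanity-check the account, the group membership can be listed; I would expect it to show both hadoop and sudo:

    groups hadoopuser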
    

Install JDK

  • it is reported that the Oracle JDK has better performance than OpenJDK, so: sudo apt-get install oracle-java8-jdk
    note: the JDK is installed under /usr/lib/jvm/
  • check if

    java -version

    points to Oracle JDK
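
    If it still reports OpenJDK, the default can be switched with Debian’s alternatives system (a quick sketch; the menu entries shown will vary by machine):

    sudo update-alternatives --config java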

Download source code

From hadoop-2.7.2

Compile

A quick look at the building instructions for Hadoop from Apache shows that Hadoop has the following requirements:

  • Unix System
  • JDK 1.6
  • Maven 3.0
  • Forrest 0.8 (if generating docs)
  • Findbugs 1.3.9 (if running findbugs)
  • ProtocolBuffer 2.4.1+ (for MapReduce and HDFS)
  • CMake 2.6 or newer (if compiling native code)
  • Internet connection for first build (to fetch all Maven and Hadoop dependencies)

In addition to the JDK, I need at least Maven, ProtocolBuffer, and CMake, so install them accordingly:

  • apt-get install maven pulls in a lot of dependencies, but apt-get handled them for me nicely :)
  • installing ProtocolBuffer is a little tricky
  • we must download the source and then run ./autogen.sh, as specified on the official site
  • by default, ProtocolBuffer installs to /usr/local, which may not be on the system’s search path, so reconfigure it to /usr:

    ./configure --prefix=/usr
    

    and then

    make
    
  • make check to verify the build; it returns:

    =====================================================
    Testsuite summary for Protocol Buffers 3.0.0-beta-3
    =====================================================
    # TOTAL: 6
    # PASS:  6
    # SKIP:  0
    # XFAIL: 0
    # FAIL:  0
    # XPASS: 0
    # ERROR: 0
    =====================================================
    
  • download the hadoop-2.7.2 source code and unpack it with

    tar xzvf hadoop-2.7.2-src.tar.gz

  • compile Hadoop natively for the RPi:

    cd hadoop-2.7.2-src/
    sudo mvn package -Pdist,native -DskipTests -Dtar

  • However, I get the following error message in the middle of the compile:

    [ERROR] Failed to execute goal org.apache.hadoop:hadoop-maven-plugins:2.7.2:protoc (compile-protoc) on project hadoop-common: org.apache.maven.plugin.MojoExecutionException: 'protoc --version' did not return a version -> [Help 1]

    After some searching online: building Hadoop requires protocol buffers to be exactly version 2.5, so I need to remove the one I just installed, build 2.5 instead, and rebuild Hadoop (a sketch of the rebuild appears after this list).

  • Another 45 minutes of my life later, I finally have protocol buffers version 2.5 compiled.
  • Again:

    sudo mvn package -Pdist,native -DskipTests -Dtar

  • Finally:

    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 53:22.700s
    [INFO] Finished at: Thu Jun 09 23:08:57 CDT 2016
    [INFO] Final Memory: 81M/224M
    [INFO] ------------------------------------------------------------------------
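
For reference, the protocol buffers downgrade mentioned above looks roughly like this (a sketch: the directory names and the 2.5.0 release tarball are assumptions based on the default archives, and the build again takes a long time on the RPi):

cd protobuf-3.0.0-beta-3 && sudo make uninstall && cd ..
tar xzvf protobuf-2.5.0.tar.gz && cd protobuf-2.5.0
./configure --prefix=/usr
make
sudo make install
protoc --version    # expect: libprotoc 2.5.0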

Test the build:

bin/hadoop checknative -a

And the output:

Native library checking:
hadoop:  true /home/hadoopuser/hadoop-2.7.2-src/hadoop-dist/target/hadoop-2.7.2/lib/native/libhadoop.so.1.0.0
zlib:    true /lib/arm-linux-gnueabihf/libz.so.1
snappy:  true /usr/lib/libsnappy.so.1
lz4:     true revision:99
bzip2:   true /lib/arm-linux-gnueabihf/libbz2.so.1
openssl: true /usr/lib/arm-linux-gnueabihf/libcrypto.so

Test with

$ bin/hdfs dfs -put etc/hadoop input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'

View the output with

$ bin/hdfs dfs -cat output/*

Output:

6       dfs.audit.logger
4       dfs.class
3       dfs.server.namenode.
2       dfs.period
2       dfs.audit.log.maxfilesize
2       dfs.audit.log.maxbackupindex
1       dfsmetrics.log
1       dfsadmin
1       dfs.servers
1       dfs.replication
1       dfs.file

Everything seems OK; moving on to the next step:

Clone image to other RPi

Insert the SD card from the Pi into my laptop, and then

$ sudo dd if=/dev/sdb of=node1.img
$ sudo dd if=node1.img of=/dev/sdb

Swap in a fresh SD card and repeat the write three times to clone the image onto the other three slave nodes.

Pseudo-distributed mode test

Setup

In etc/hadoop/core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

And for etc/hadoop/hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
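
One prerequisite worth noting from the standard single-node setup guide: sbin/start-dfs.sh connects over ssh even in pseudo-distributed mode, so passwordless ssh to localhost needs to be in place first (a sketch, assuming no key exists yet):

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost    # should log in without asking for a password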

Then, format the namenode:

$ bin/hdfs namenode -format

Start HDFS (NameNode, DataNode, and SecondaryNameNode):

$ sbin/start-dfs.sh

And jps shows:

2608 DataNode
2774 SecondaryNameNode
2520 NameNode
2968 Jps

wordcount test

Download the bioproject.xml from HW1:

wget http://rasinsrv07.cstcis.cti.depaul.edu/CSC555/bioproject.xml

put into HDFS:

$ bin/hdfs dfs -put bioproject.xml bioproject

run wordcount:

root@masterPi:~/hadoop-2.7.2# time bin/hadoop jar \
    share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar \
    wordcount bioproject bioprojectOutput2

time report:

real    8m39.367s
user    8m51.510s
sys     0m15.360s

Not bad for a Raspberry Pi, though much slower than a single-node Amazon EC2 cluster, which takes about 1m30s.

Two, Three, Four nodes comparison

two-node cluster

setup

hdfs-site.xml: I lowered the block size because the test sample file is small

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.blocksize</name>
        <value>32m</value>
        <description>Block size</description>
    </property>
</configuration>

and yarn-site.xml:

<configuration>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master:8025</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>master:8050</value>
    </property>
</configuration>
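
For the multi-node setup, core-site.xml also needs fs.defaultFS to point at the master node instead of localhost; a sketch, reusing the master hostname from yarn-site.xml above and the port from the pseudo-distributed setup:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
</configuration>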

then:

bin/hdfs namenode -format

and:

sbin/start-all.sh

Check the NameNode web UI on port 50070:

Node                                 Last contact   Admin State   Capacity   Used        Non DFS Used   Remaining   Blocks
masterPi:50010 (192.168.3.1:50010)   1              In Service    58.61 GB   222.19 MB   9.91 GB        48.48 GB    7
slavePi1:50010 (192.168.3.2:50010)   1              In Service    58.58 GB   222.19 MB   10.44 GB       47.92 GB    7

Each node was allocated 7 blocks.
Time to start everything and try wordcount on bioproject.xml again:

time bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar \
    wordcount bioproject bioprojectOutput1

the result:

real    8m55.691s
user    8m43.950s
sys     0m15.560s

This is no better than the single-node cluster. Why? Maybe because I changed the block size; change it back and re-run the task (now each node has only 2 blocks).
result:

real    8m35.001s
user    7m73.950s
sys     0m15.560s

Still about the same. Why? After carefully observing the MapReduce job, I see a lot of map -> map tasks that seem unrelated to the
actual MapReduce work. After searching the web, I think the problem is that I’m still using the plain hadoop jar ... command, which has not been
replaced by the YARN-based command that is new in the 2.x versions of Hadoop.

After spending more time researching YARN, which splits up the old JobTracker’s responsibilities so that container (job runner) allocation and resource monitoring are handled out on the slave nodes,
I find YARN is a great idea: central administration becomes harder and harder as the Hadoop ecosystem grows. With YARN and MapReduce 2, there are no longer pre-configured static slots for Map and Reduce tasks; the entire cluster is available for dynamic allocation of Maps and Reduces as needed by the job. So I decided to use YARN as the task manager instead of the generic Hadoop MapReduce runner. However, after getting YARN up and running, I keep getting:

Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137

which is the famous OOM error: the tasks are killed by the system because there is not enough memory…

Of course I get this error message: the RPi has only 1 GB of RAM, but YARN allocates a minimum of 1024 MB of memory for each container by default, and each node runs multiple containers for its tasks.
So, some parameters must be changed:

I added the following to yarn-site.xml:

<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>500</value>
</property>

Then, run with the command:

bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar \
    wordcount /bioproject /bioprojectop1

However, I still get

Diagnostics: Container is running beyond virtual memory limits. 
Current usage: 15.1 MB of 256 MB physical memory used; 1.1 GB of 537.6 MB virtual memory used. Killing container.

and the containers were shut down by the system.
Running top shows:

4301 root      20   0  305860 150512   5808 S  70.2 15.1   0:24.66 java
4556 root      20   0  284684  36320  15324 S  62.6  3.6   0:04.16 java
4228 root      20   0  306728 151200   6364 S  61.0 15.2   0:28.99 java
4584 root      20   0  283660  35136  15232 S  60.6  3.5   0:03.36 java
4538 root      20   0  284684  36084  15156 S  59.0  3.6   0:04.15 java
4321 root      20   0  305704 151508   7268 S  56.7 15.2   0:21.70 java
3140 root      20   0  512012  77436   6784 S  17.5  7.8   1:09.14 java

YARN seems to give too much work to a single node…

After another several hours of configuration, I still couldn’t set the parameters properly to get the YARN tasks to run successfully. I decided to move on to the next topic and leave this to solve later.
Remove the parameter:

<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>

in order to bypass the YARN framework and run the MapReduce tasks directly on HDFS.

Various tasks in fully distributed mode

Adding the other three nodes to the cluster is just a matter of adding more names to the slaves file, and everything runs smoothly.
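
For reference, the slaves file ends up looking something like this (slavePi2 and slavePi3 are assumed hostnames following the naming pattern above; masterPi is listed because it also runs a DataNode in my setup):

masterPi
slavePi1
slavePi2
slavePi3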

wordcount on bioproject

In full-cluster mode, the wordcount job on bioproject gives us

real    6m65.001s
user    7m33.050s
sys     0m18.560s

About a 20% decrease in time, which is not a big performance gain at all. Running top gives us a hint as to why:

[screenshot: top output]

The I/O performance of the RPi is so low that the CPU sits idle waiting for input from the SD card. Moreover, wordcount is an I/O-intensive rather than a CPU-intensive task, so the increase in cluster size doesn’t provide much benefit on the RPi.

The bottleneck and improvements

From the result of wordcount on bioproject.xml, we know the bottleneck is disk I/O. To optimize it,
we can lower the block size, since an SD card, unlike a traditional hard drive, does not have to spin a disk to read the data.

Also, as indicated by my unsuccessful experiment of
running YARN on top of HDFS, YARN’s ResourceManager could max out the RPi’s CPU by spawning many (>10) MapReduce containers to run tasks in parallel, which I think
could potentially increase performance. However, containers that use too much RAM get killed by the system. I needed to find the sweet spot in the parameter settings, but I didn’t succeed.
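
If I were to revisit the YARN tuning, these are the standard YARN/MapReduce 2 memory knobs I would experiment with; the values below are only guesses for a 1 GB node, not settings I verified on this cluster:

<!-- yarn-site.xml: cap what YARN may hand out per node and per container -->
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>768</value>
</property>
<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>128</value>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>768</value>
</property>
<property>
    <!-- disable the virtual memory check that killed my containers above -->
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>

<!-- mapred-site.xml: keep each task container and its JVM heap small -->
<property>
    <name>mapreduce.map.memory.mb</name>
    <value>256</value>
</property>
<property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>256</value>
</property>
<property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx200m</value>
</property>
<property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx200m</value>
</property>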

Re-run tasks after optimization

After tweaking the block size to 35 MB, I ran the same bioproject.xml job again, with the result:

real    6m35.833s
user    7m33.343s
sys     0m13.560s

Still not a big improvement. Searching online, I found that the SD card I/O on the RPi can be overclocked by modifying /boot/config.txt:

sudo bash -c 'printf "dtoverlay=sdhost,overclock_50=100\n" >> /boot/config.txt'

The default is 50 MHz and I increased it to 100 MHz. Another run of Hadoop:

real    4m52.303s
user    5m23.183s
sys     0m14.560s

Another 20%! I could still push the RPi a little further by overclocking the CPU and RAM, but I think the additional gain would be limited.
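
For completeness, a CPU/RAM overclock would also go in /boot/config.txt, along these lines (illustrative values only, which I did not apply in this project; higher clocks generally need extra voltage and decent cooling):

# CPU clock in MHz (Pi 3 default is 1200)
arm_freq=1300
# RAM clock in MHz
sdram_freq=500
# raise the core voltage to keep the higher clocks stable
over_voltage=4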

Final thoughts

By moving the emphasis from writing to disk toward scheduling container tasks that can potentially max out the system’s resources, the new Hadoop 2.x with the YARN architecture is one step toward a more Spark-like model, and I think it is an elegant answer to the proliferation of the Hadoop ecosystem.
I tried to configure the YARN parameters correctly, but after two days of work I still get out-of-memory interruptions in the middle of the task. The reason is that I still haven’t figured out how the various settings relate to one another (e.g., the minimum size allocated to each container should be large enough to be useful, but the total can’t exceed the memory limit of the system).

In conclusion, I learned a lot from this project. I have a much better understanding of Hadoop’s distributed nature, the Linux system, and the command-line environment. Also, presenting my cluster and running a demo for my classmates was a fun experience.