Apache Spark 1.5.2 on Raspberry PI 2 cluster

This is the second post of a series about BigData cluster for OS X and Raspberry PI 2.

  1. Build Hadoop Cluster with 5 clicks.
  2. Apache Spark 1.5.2 on Raspberry PI 2 cluster
  3. The Next Version of PocketCluster

** update: Oct 2, 2017. The next version of PocketCluster is coming soon!

Once we have a foundation for a Big Data analytics stack, the next piece that should come is Apache Spark. Spark gives MapReduce-based algorithms a speed boost by executing computations in memory and performing DAG optimization. You can read more details in this paper.

Spark tremendously helps analytics workloads, since such computations are iterative in nature. Suppose you had to read and write 100 GB of data on disk again and again, and imagine how painfully slow that would be. (Some folks handle a couple of petabytes every day. Let's not go there yet.) Spark, by contrast, keeps such operations in memory. You just have to experience the difference. If you're going to do anything with Big Data analytics, Spark is something you will encounter whichever direction you go.
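To make the in-memory point concrete, here is a minimal sketch of caching an RDD so repeated passes skip the disk. The master URL and HDFS path are placeholders, and it assumes Spark 1.5.2's spark-shell is on your PATH:

```shell
# Hypothetical sketch: the master URL and HDFS path are placeholders.
spark-shell --master spark://pocketcluster-master:7077 <<'EOF'
val logs = sc.textFile("hdfs:///data/access.log")      // read once from HDFS
val errors = logs.filter(_.contains("ERROR")).cache()  // mark for in-memory caching
errors.count()  // first pass: reads from disk, materializes the cache
errors.count()  // second pass: served from memory, no disk re-scan
EOF
```

The second `count()` is where the iterative speedup shows up: without `.cache()`, Spark would re-read the file from HDFS on every action.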

In fact, the very first two posts of this blog are about running Spark on Raspberry PI 2 (henceforth RPI2). Nevertheless, it wasn't much of a joy to build and run such a thing. If you've been with me, you know we've crossed some serious creeks. Now, here comes an OS X application that deploys Apache Spark & Hadoop with a few mouse clicks.


PocketCluster 0.1.3

Just like the previous post, I'm going to play a video and talk about a few more details.

First of all, PocketCluster supports Vagrant and Raspberry PI 2 at the same time. If you want to carry a multi-node Big Data environment with you all the time, I definitely recommend going with the Vagrant version. The installation and operation process is exactly the same as the one depicted in the video.

Meanwhile, I would recommend going with RPI2 if you're working in a stationary environment. Six RPI2s provide roughly the same computation power as one Intel i7 processor. You'd be able to 1) quickly test your hypothesis or 2) debug your prototype in a real, multi-node environment.

(*Old generation Raspberry PI is not supported. Only Raspberry PI 2 is supported at the moment.)

Secondly, in order for PocketCluster to install and operate smoothly, you'll need a solid internet connection. All the software is downloaded and configured at run-time, so a jumpy connection could really ruin your experience. I am working on improving this.

Third, Spark supports five different modes of operation: 1) Standalone, 2) Pseudo cluster, 3) Standalone cluster, 4) YARN client, and 5) Mesos client. PocketCluster supports Standalone cluster mode only at the moment.
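For reference, here is a hedged sketch of how submitting a job differs between plain Standalone mode and the Standalone cluster mode PocketCluster sets up. The master hostname and the examples jar path are assumptions; adjust them to your install:

```shell
# Standalone (everything in one local JVM, 4 worker threads):
spark-submit --master local[4] \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/lib/spark-examples-1.5.2-hadoop2.6.0.jar 100

# Standalone cluster (the mode PocketCluster configures);
# "pocketcluster-master" is a hypothetical hostname:
spark-submit --master spark://pocketcluster-master:7077 \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/lib/spark-examples-1.5.2-hadoop2.6.0.jar 100
```

Only the `--master` URL changes; the same application jar runs locally or across all the slave nodes.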

Lastly, as soon as the Spark installation completes, SparkR is configured to run across the slave nodes. It's just that the Homebrew installation of R takes forever, since it needs to compile gcc to provide Fortran for R. Hence, should you need to use SparkR, just type the following in a Terminal or iTerm shell on your Mac after installation. (*It could take about 40 mins.)

brew tap homebrew/science && brew install r && brew untap homebrew/science
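Once R finishes installing, you should be able to launch SparkR against the cluster. A minimal sketch, assuming Spark's bin directory is under $SPARK_HOME and with a placeholder master hostname:

```shell
# Launch the SparkR shell against the standalone cluster;
# the hostname below is hypothetical, use your cluster's master URL.
$SPARK_HOME/bin/sparkR --master spark://pocketcluster-master:7077
```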

For more detailed instructions about PocketCluster installation, please go to my previous post.

Here comes PocketCluster 0.1.3 again. I'm looking for the next package to install. If you have a suggestion, please leave a comment below or tweet me @stkim1.

E-mail Subscription
Subscribe for upcoming posts!
Join Slack
Join my channel!

