This is the third post in a series about a Raspberry Pi 2 BigData Cluster for OSX.
- Apache Spark on Raspberry Pi 2
- Apache Spark 1.4.0 on Raspberry PI 2 Cluster
- One Step Spark/Hadoop Installer for OSX v0.1.0
- Build Hadoop Cluster with 5 clicks
TLDR: This installer gives you a single-node Spark/Hadoop cluster on your laptop that behaves much like a data center cluster. You can download it here.
Today, I am releasing the One-step Apache Spark/Hadoop installer for OSX v0.1.0. This has been on my mind for some time, but I couldn’t get to it until I had finished other errands, such as building a cluster case.
As soon as you saw the title, you probably had a few questions.
Q.1 Apache Spark and Hadoop are Big Data software that runs in gigantic data centers with a gazillion computers. Why on a puny laptop?
Yes. You are absolutely right. There is no doubt in my mind that Apache Spark & Hadoop need many computers to do what they are supposed to do. However, it all starts on your laptop. Whatever data analysis or machine learning you’re going to run in a gigantic data center or on AWS, it starts from a rough draft that you type in yourself. You want to write it in the language you prefer, with an editor you like, and you want to quickly build & test your code before rolling it out. I think your laptop is the best place to do so.
Q.2 What’s up with OSX?
Firstly, it is my preference. I’ve been building iOS/Android apps for about 5 years, and OSX forgives me for not remembering all the gory details of *NIX. Secondly, it is still a UNIX, so all the libraries and commands I need are there for me to play with. Lastly, wherever I go, I see more MacBooks than Windows or Linux laptops.
Q.3 On OSX, there is already at least one option, Homebrew, for installing Apache Spark/Hadoop with ease. Why reinvent the wheel?
A Homebrew installation of Hadoop and Spark only provides standalone mode, with a minimal set of executables on your system. It is OK for debugging, but you wouldn’t know what is going to happen until you get your code onto a real, big-ass cluster.
This installer takes away all the pain of setting things up, and gives you the Hadoop Distributed File System (HDFS) in pseudo-distributed mode and Spark in YARN cluster mode. The environment is very similar to that of a multi-node cluster in a data center or on a cloud service like AWS.
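To make the difference between the two setups a bit more concrete, here is a minimal Python sketch that only builds the `spark-submit` command line for each case. It does not run Spark and it is not part of the installer. The helper name `spark_submit_args`, the script name `wordcount.py`, and the HDFS URL are all made up for illustration. I am also assuming the `--master yarn --deploy-mode cluster` spelling; Spark 1.x releases documented the same thing as `--master yarn-cluster`, so adjust it to your version.

```python
def spark_submit_args(app, mode="local", app_args=()):
    """Build a spark-submit command line (hypothetical helper, for illustration)."""
    if mode == "local":
        # Everything runs in a single JVM on the laptop; no HDFS or YARN involved.
        master = ["--master", "local[*]"]
    elif mode == "yarn-cluster":
        # The driver runs inside YARN, as it would on a multi-node cluster.
        # Assumption: older Spark 1.x releases spell this "--master yarn-cluster".
        master = ["--master", "yarn", "--deploy-mode", "cluster"]
    else:
        raise ValueError("unknown mode: %s" % mode)
    return ["spark-submit"] + master + [app] + list(app_args)


# Same application, two targets. The HDFS URL is an assumed example value.
print(" ".join(spark_submit_args("wordcount.py")))
print(" ".join(spark_submit_args(
    "wordcount.py", "yarn-cluster",
    ["hdfs://localhost:9000/user/me/input.txt"])))
```

The point is that the application stays the same and only the target changes, which is why testing against a pseudo-distributed setup on a laptop gets you closer to what happens on a real cluster.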
Q.4 How about Docker?
Docker is a great way to set up a bigdata cluster, especially on a single Linux PC. It is very fast, reliable, and, most importantly, very easy to operate. Nonetheless, on OSX it requires you to install VirtualBox to interact with an underlying Linux. I am not against it. I just feel that Docker is meant for bigger things. For a single-node pseudo-distributed mode, it would be overkill.
The installer is neither polished nor complete. It handles only the bare minimum setup for you, and there is a lot of work left to do. Wish me luck, and help me by using it. Here comes the download link again.