Amazon EMR Tutorial: Apache Zeppelin & HBase Interpreters

We use Amazon EMR heavily for both customer projects and internal use-cases when we need to crunch huge datasets in the cloud.

What is Amazon EMR?

Amazon Elastic MapReduce (EMR) is an easier alternative to running cluster computing in-house. By providing an expandable, low-configuration service, it simplifies spinning up and maintaining Hadoop and Spark clusters in the cloud. This drastically lowers the barrier to entry for data teams and lets users process huge amounts of data easily and cost-effectively.

How does Amazon EMR work?

Amazon EMR uses a hosted Hadoop framework as its data processing engine, running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). Using the MapReduce programming model, Hadoop divides the data into small fragments of work, each of which can be executed on any node in the cluster. Instead of relying on one large computer to store and process the data, Hadoop spreads the work across many computers in parallel, so huge datasets are processed much faster.
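The map, shuffle, and reduce phases can be imitated on a single machine with standard Unix tools. The sketch below is only an analogy for the programming model, not how Hadoop or EMR is implemented, and the function name word_count is made up for this illustration.

```shell
# Single-machine analogy of a MapReduce word count (illustrative only).
# word_count is a hypothetical helper name, not part of Hadoop or EMR.
word_count() {
  tr -s '[:space:]' '\n' |      # "map": emit one word per line
    grep -v '^$' |              # drop empty lines left by leading whitespace
    sort |                      # "shuffle": bring identical words together
    uniq -c |                   # "reduce": count each group of identical words
    awk '{ print $2 "\t" $1 }'  # format the result as word<TAB>count
}
```

For example, `printf 'to be or not to be\n' | word_count` prints four tab-separated lines: be 2, not 1, or 1, to 2. On a real cluster, the map and reduce steps would run in parallel on different nodes rather than in one pipeline.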

We have found increased interest from internal and external data teams wanting to collaborate using notebooks. Apache Zeppelin is an open-source project that provides web-based interactive notebooks for data analytics. Typical use-cases include data ingestion, discovery, analytics, visualization, and collaboration. Luckily for us, Amazon EMR comes with Apache Zeppelin support baked right in. At the time of writing, Zeppelin 0.6.1 is supported on Amazon EMR 5.0.0 (the latest release).

In addition, Amazon EMR supports Apache HBase (1.2.2) and Apache Phoenix (4.7.0) natively. HBase is a highly reliable NoSQL store built upon the Hadoop Distributed File System, and Phoenix is a JDBC “front-end” to the HBase engine that converts standard SQL into native HBase scans and queries. This enables a powerful point-lookup use-case, capable of returning small results from billions of rows in milliseconds, or larger queries in seconds, using standard SQL. Finally, the HBase and Phoenix combo supports full ACID transactions, enabling OLTP workloads to run on a highly available, scale-out database in AWS.
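To make the point-lookup idea concrete, here is a minimal sketch of the kind of SQL Phoenix accepts. The table demo_events, its columns, and the helper name demo_phoenix_sql are all hypothetical and are not taken from the notebook used later in this tutorial. In Phoenix, the primary key maps to the HBase row key, so a WHERE clause on the full primary key can be served as a direct row lookup instead of a table scan.

```shell
# Prints example Phoenix SQL. Table and column names are hypothetical.
demo_phoenix_sql() {
  cat <<'SQL'
-- The PRIMARY KEY column becomes the HBase row key.
CREATE TABLE IF NOT EXISTS demo_events (
  event_id VARCHAR PRIMARY KEY,
  category VARCHAR,
  amount   DECIMAL
);
-- Phoenix uses UPSERT rather than INSERT.
UPSERT INTO demo_events VALUES ('e-1', 'books', 12.50);
-- Point lookup: filtering on the primary key targets a single row key.
SELECT category, amount FROM demo_events WHERE event_id = 'e-1';
SQL
}
```

The printed statements could be saved to a file and run through a Phoenix SQL client such as the sqlline script that ships with Phoenix. The exact client path on an EMR master node is not covered here.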

Of course, we wanted to see the power of using HBase and Phoenix for ourselves on EMR using Zeppelin notebooks, but we noted a few extra steps in order to get this running. We hope we can “spark” (pun intended) your interest in exploring big data sets in the cloud, using EMR and Zeppelin.

How Do I Set Up an EMR? (An Amazon EMR Tutorial)

A summary of the steps that we’ll follow in order to experiment with Zeppelin, Phoenix, and HBase on Amazon EMR is provided below:

1. Start an EMR cluster with Zeppelin, Phoenix, and HBase pre-configured
2. SSH and web proxy into the EMR master node
3. Install and configure Zeppelin interpreters for HBase and Phoenix
4. Connect to the Zeppelin UI and set up the interpreters
5. Load data into HBase and query it using Phoenix in Zeppelin to create charts and graphs
6. Terminate the cluster

You will need:

1. Familiarity with foundational AWS concepts
2. An AWS account
3. A VPC with a public subnet in US East (N. Virginia)
4. Basic knowledge of bash, the *nix command line, and SQL (helpful but not required)

1. Start an EMR cluster with Zeppelin, Phoenix, and HBase pre-configured

Click the following link to start an EMR cluster in your own AWS account using a JSON CloudFormation template provided by us. Make sure you are logged in to an AWS account in your web browser; otherwise, the link may not work.

https://us-east-1.console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/new?stackName=Zeppelin-Phoenix-Blog&templateURL=https:%2F%2Fs3-us-west-2.amazonaws.com%2Fcis-samples%2Fzeppelin-demo%2Fcfn%2Fzeppelin-demo.json

Follow the prompts on the CloudFormation page to build the cluster. Once the stack has completed, make a note of the table in the Outputs tab, as you will need the URLs in later steps.

A. Create the stack.

B. Specify the parameters using your own VPC, subnet, and EC2 key.

C. Optionally, specify some tags.

D. Click Create.

E. Make a note of the values in the Outputs tab.
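If you prefer the command line, the same template could probably be launched with the AWS CLI. The following is a minimal sketch, assuming the AWS CLI is installed and configured. The stack name, region, and template URL are taken from the launch link above; the function names are made up for this example. The template's parameter keys are not listed in this post, so they are left for the caller to pass through.

```shell
# Values taken from the launch link in this tutorial.
STACK_NAME="Zeppelin-Phoenix-Blog"
REGION="us-east-1"
TEMPLATE_URL="https://s3-us-west-2.amazonaws.com/cis-samples/zeppelin-demo/cfn/zeppelin-demo.json"

# Hypothetical helper: create the stack. Any extra arguments (for example
# --parameters with the template's own keys) are passed through unchanged.
create_demo_stack() {
  aws cloudformation create-stack \
    --region "$REGION" \
    --stack-name "$STACK_NAME" \
    --template-url "$TEMPLATE_URL" \
    "$@"
}

# Hypothetical helper: print the stack outputs (the same values that the
# Outputs tab shows in the console) as a table.
show_stack_outputs() {
  aws cloudformation describe-stacks \
    --region "$REGION" \
    --stack-name "$STACK_NAME" \
    --query 'Stacks[0].Outputs' \
    --output table
}
```

Calling create_demo_stack with the required --parameters arguments and then show_stack_outputs once the stack is complete should give you the same URLs as the console walkthrough.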

2. SSH and Web Proxy into the Master Node

You’ll need to be able to connect to the EMR master node via SSH.

A. Go to the EMR page and select your cluster.

B. Scroll down until you find Security groups for Master and click the “sg-********” hyperlink in the Console.

C. Select the security group with the Group Name “ElasticMapReduce-master”, click Inbound, then click Edit.

D. Add your own IP address: click Add Rule, select SSH from the drop-down on the far right, select My IP in the second to last drop-down, then click Save.

E. Now you can SSH into the master node from your local workstation! To find the instructions, go back to your EMR cluster page (Step 2.A) and click the SSH hyperlink.

F. Finally, you will also need to set up a web connection to the EMR master node via a proxy. To find the instructions, go back to your EMR cluster page (Step 2.A) and click the Enable Web Connections hyperlink.
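The console steps above have rough command-line equivalents. The sketch below assumes the AWS CLI and an OpenSSH client are installed. The function names are hypothetical, hadoop is the default login user on EMR nodes, and local port 8157 is just the port commonly used in EMR proxy examples; any free local port should work if your browser proxy is configured to match.

```shell
# Hypothetical helper: allow inbound SSH (TCP 22) from a single IP address.
# $1 = security group ID of the master node, $2 = your public IP address
allow_ssh_from() {
  aws ec2 authorize-security-group-ingress \
    --group-id "$1" --protocol tcp --port 22 --cidr "$2/32"
}

# Hypothetical helper: open an interactive SSH session on the master node.
# $1 = path to your EC2 private key file, $2 = master node public DNS name
ssh_to_master() {
  ssh -i "$1" "hadoop@$2"
}

# Hypothetical helper: open a SOCKS proxy tunnel for the web interfaces.
# -N runs no remote command; -D enables dynamic port forwarding.
open_proxy_tunnel() {
  ssh -i "$1" -N -D 8157 "hadoop@$2"
}
```

For example, `allow_ssh_from sg-0123 203.0.113.7` would add the inbound rule, with both values replaced by your own security group ID and IP address. The tunnel command keeps running in the foreground until you stop it, so leave that terminal open while you use the browser.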

3. Install and Configure Zeppelin Interpreters for HBase and Phoenix

A. Start an SSH terminal session (Step 2.E).

B. Run the following commands in sequence in the terminal:

cd /usr/lib/zeppelin
sudo bash bin/install-interpreter.sh -a
sudo bash bin/zeppelin-daemon.sh restart

C. These commands change into the Zeppelin installation directory, install all available interpreters, and restart the Zeppelin service so that the changes take effect.

4. Connect to the Zeppelin UI and set up the interpreters

A. Open a new terminal window and start the SSH tunnel (Step 2.F).

B. Open your browser and enable your web proxy, e.g. FoxyProxy (Step 2.F).

C. Point your browser to the URL in the “zeppelin” output from CloudFormation. You should see the Zeppelin home page.

D. Click the drop-down that says anonymous and select Interpreter.

E. Click the Create button in the top left corner.

F. Name the interpreter jdbc, and select jdbc from the Interpreter group drop-down.

G. Scroll down in the Properties section until you see the “phoenix.” settings. Set phoenix.url to: jdbc:phoenix:localhost:8765/hbase

H. Scroll down to the Dependencies section, add the artifact org.apache.phoenix:phoenix-core:4.7.0-HBase-1.1, then click the Save button.
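To double-check the saved setting without clicking through the UI, you could query Zeppelin's REST API, which lists interpreter settings at /api/interpreter/setting. This is a rough sketch: the function names are hypothetical, the base URL is whatever the “zeppelin” CloudFormation output gives you, and the command has to run somewhere that can reach that URL (for example, on the master node itself). The grep pattern assumes the property appears as a plain "phoenix.url": "..." pair in the JSON response.

```shell
# Hypothetical helper: fetch all interpreter settings as JSON.
# $1 = Zeppelin base URL, e.g. the "zeppelin" output from CloudFormation
list_interpreter_settings() {
  curl -s "$1/api/interpreter/setting"
}

# Hypothetical helper: extract just the phoenix.url property from the JSON.
show_phoenix_url() {
  list_interpreter_settings "$1" | grep -o '"phoenix\.url"[^,]*'
}
```

If the interpreter was saved as described, the output of show_phoenix_url should include the jdbc:phoenix URL entered in step G.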

5. Query the data using Zeppelin

You are now ready to create a new Zeppelin notebook and start loading and querying data!

A. Return to the Zeppelin homepage, click Import Note, then click Add from URL. Paste the URL below into the URL field and click Import Note in the pop-up:

https://gist.githubusercontent.com/laithalsaadoon/566407d2c0700f785eed87d5b73bdbf8/raw/9755ead50ab05e32e6c8af7bd170a09d49e90eb7/zeppelin-blog.json

B. Click the notebook that you just added.

C. You should see a notebook containing a few paragraphs of code and scripts.

D. You can run each white box (these are called “paragraphs” in Zeppelin vernacular) by clicking the small “play” button in its top right corner.

E. Click the “play” button on the first paragraph (the one that includes “### Create an empty table in HBase”). Wait for it to say FINISHED in the top right corner before proceeding.

F. In the following paragraph, update the EMR endpoint URL on line 9 with your cluster’s corresponding URL.

G. Repeat Step D, running each paragraph in sequence and waiting for each one to say FINISHED before proceeding.

H. At the end, you should see a pie chart visualization.

6. Terminate the cluster

A. Return to the CloudFormation web page.

B. Select the stack that you created as part of this demo.

C. From the Actions drop-down, select Delete Stack.

D. Click Yes, Delete.
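The teardown can also be done from the command line. This is a minimal sketch, assuming the AWS CLI is configured and the stack still has the default name from the launch link; the function names are made up for this example.

```shell
# Hypothetical helper: delete the demo stack (and the EMR cluster it created).
# $1 = stack name (defaults to the name used by the launch link)
delete_demo_stack() {
  aws cloudformation delete-stack \
    --region us-east-1 \
    --stack-name "${1:-Zeppelin-Phoenix-Blog}"
}

# Hypothetical helper: block until CloudFormation reports the delete is done.
wait_for_stack_delete() {
  aws cloudformation wait stack-delete-complete \
    --region us-east-1 \
    --stack-name "${1:-Zeppelin-Phoenix-Blog}"
}
```

Running delete_demo_stack followed by wait_for_stack_delete returns once the stack is gone, which is a handy way to confirm that nothing is left running and billing.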

Amazon EMR Tutorial Conclusion

We hope you enjoyed our Amazon EMR tutorial on Apache Zeppelin, and that it has sparked your interest in exploring big data sets in the cloud using EMR and Zeppelin.

Related posts:
Learn more about our big data and analytics services by downloading our AWS Data Pipeline Whitepaper or watching our latest Big Data video.
