We use Amazon EMR heavily for both customer projects and internal use-cases when we need to crunch huge datasets in the cloud.
What is Amazon EMR?
Amazon Elastic MapReduce (EMR) is an easier alternative to running an in-house computing cluster. By providing an expandable, low-configuration service, it simplifies the process of spinning up and maintaining Hadoop and Spark clusters in the cloud. This drastically lowers the barrier to entry for data teams and allows users to easily and cost-effectively process huge amounts of data.
How does AWS EMR work?
Amazon EMR uses a hosted Hadoop framework as its data processing engine, running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). Using the MapReduce programming model, Hadoop divides the work into small fragments, each of which can be executed on any node in the cluster. Instead of storing and processing the data on one large computer, Hadoop spreads it across many machines and crunches huge datasets in parallel, far faster.
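As a rough illustration of the model (an analogy only, not Hadoop itself), the classic word-count example can be mimicked with a plain Unix pipeline, where each stage plays the role of a MapReduce phase:

```shell
# Word count in the MapReduce style using a plain Unix pipeline:
#   map:     split each line into one word per line (tr)
#   shuffle: group identical words together (sort)
#   reduce:  aggregate a count per word (uniq -c)
printf 'big data big cluster\ndata cluster big\n' \
  | tr -s ' ' '\n' \
  | sort \
  | uniq -c \
  | sort -rn
```

In Hadoop, the same three phases run distributed across the cluster, with the "shuffle" moving intermediate results between nodes.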
We have seen increased interest from internal and external data teams wanting to collaborate using notebooks. Apache Zeppelin is an open-source project providing web-based interactive notebooks for data analytics. Typical use-cases include data ingestion, discovery, analytics, visualization, and collaboration. Luckily for us, Amazon EMR comes with Apache Zeppelin support baked right in. At the time of writing, Zeppelin version 0.6.1 is supported on Amazon EMR 5.0.0 (the latest release).
In addition, Amazon EMR natively supports Apache HBase (1.2.2) and Apache Phoenix (4.7.0). HBase is a highly reliable NoSQL store built on the Hadoop Distributed File System, and Phoenix is a JDBC “front-end” to the HBase engine that converts standard SQL into native HBase scans and queries. This enables a powerful point-lookup use-case, capable of returning small results from billions of rows in milliseconds, or larger queries in seconds, using standard SQL. Finally, the HBase and Phoenix combination supports full ACID transactions, enabling OLTP workloads to run in a highly available, scale-out database architecture on AWS.
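To make the SQL-over-HBase idea concrete, here is a small, hypothetical Phoenix session you could run later from a shell on the EMR master node. The sqlline.py path, table, and values are illustrative, and it assumes a running cluster, so treat it as a sketch rather than a required step:

```shell
# Launch the Phoenix SQL shell against the local ZooKeeper quorum.
# The path below is where EMR typically installs Phoenix; adjust if needed.
/usr/lib/phoenix/bin/sqlline.py localhost <<'SQL'
-- Create a table; Phoenix maps it onto an HBase table behind the scenes.
CREATE TABLE IF NOT EXISTS visitors (
    id BIGINT NOT NULL PRIMARY KEY,
    name VARCHAR,
    visits INTEGER
);
-- Phoenix uses UPSERT (insert-or-update) rather than INSERT.
UPSERT INTO visitors VALUES (1, 'alice', 42);
UPSERT INTO visitors VALUES (2, 'bob', 7);
-- A point lookup on the primary key becomes a fast HBase row get.
SELECT name, visits FROM visitors WHERE id = 1;
SQL
```

The point lookup in the last statement is the millisecond-scale use-case described above: Phoenix translates it into a single HBase get on the row key.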
Of course, we wanted to see the power of HBase and Phoenix on EMR for ourselves using Zeppelin notebooks, but we found that a few extra steps are needed to get everything running. We hope we can “spark” (pun intended) your interest in exploring big data sets in the cloud, using EMR and Zeppelin.
How Do I Set Up an EMR? (An Amazon EMR Tutorial)
A summary of the steps that we’ll follow in order to experiment with Zeppelin, Phoenix, and HBase on Amazon EMR is provided below:
1. Start an EMR cluster with Zeppelin, Phoenix, and HBase pre-configured
2. SSH and web proxy into the EMR Master Node
3. Install and configure Zeppelin interpreters for HBase and Phoenix
4. Load data into HBase
5. Query the data using Phoenix in Zeppelin to create charts and graphs
6. Terminate the cluster
You will need:
1. Familiarity with foundational AWS Concepts
2. An AWS account
3. A VPC with a public subnet, in US East (N. Virginia)
4. Basic knowledge of bash, *nix command line, and SQL helps but is not required
1. Start an EMR cluster with Zeppelin, Phoenix, and HBase pre-configured
Click the following link to start an EMR cluster in your own AWS account using a JSON CloudFormation template we provide. Make sure you are logged in to an AWS account in your web browser, otherwise the link may not work. Follow the prompts on the CloudFormation page to build the cluster. Once it has completed, make a note of the table in the Outputs tab, as you will need the URLs in later steps.
A. Create the Stack:
B. Specify the Parameters using your own VPC, subnet, and EC2 key:
C. Optionally, specify some tags:
D. Click Create:
E. Take note of the Outputs tab:
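As an aside, if you prefer the command line, a broadly similar cluster can be started with the AWS CLI instead of CloudFormation. This is a sketch only; the cluster name is arbitrary, and the key name and subnet ID are placeholders you must replace with your own values:

```shell
# Sketch: start an EMR 5.0.0 cluster with Zeppelin, HBase, and Phoenix.
# KeyName and SubnetId are placeholders -- substitute your own values.
aws emr create-cluster \
  --name "zeppelin-hbase-phoenix-demo" \
  --release-label emr-5.0.0 \
  --applications Name=Zeppelin Name=HBase Name=Phoenix \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key,SubnetId=subnet-xxxxxxxx \
  --region us-east-1
```

Note that a cluster started this way will not have the CloudFormation Outputs tab; you would find the master node's DNS name on the EMR console instead.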
2. SSH and Web Proxy into the Master Node
You’ll need to set yourself up to be able to connect to the master EMR node via SSH.
A. Go to the EMR page and select your cluster:
B. Scroll down until you find Security groups for Master and click the hyperlink “sg-********” in the Console:
C. Select the Security Group with the Group Name “ElasticMapReduce-master”, click Inbound, then click Edit:
D. Add your own IP by clicking Add Rule, then select SSH from the Type drop-down, select My IP from the Source drop-down, and click Save:
E. Now you can SSH into the master node from your local workstation! Instructions can be found by going back to your EMR cluster page (Step 2.A) and clicking the SSH hyperlink:
F. Finally, you will also need to set up a web connection to the EMR master node via a proxy. Instructions can be found by going back to your EMR cluster page (Step 2.A) and clicking the Enable Web Connections hyperlink:
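For reference, the two connections typically look like the following from a local terminal. The key file and master DNS name below are placeholders; the console's SSH link shows your cluster's exact values. EMR's login user on the master node is `hadoop`:

```shell
# SSH session to the master node (placeholder key file and hostname):
ssh -i ~/my-key.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

# SOCKS proxy for the web UIs, run in a separate terminal; port 8157 is
# the port the EMR FoxyProxy instructions assume:
ssh -i ~/my-key.pem -N -D 8157 hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
```

The second command stays open without a shell prompt (`-N`); leave it running while you browse the cluster's web UIs through the proxy.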
3. Install and Configure Zeppelin Interpreters for HBase and Phoenix
A. Start an SSH terminal session (Step 2.E)
B. Run the following commands in sequence in the Terminal command line interface:
cd /usr/lib/zeppelin
sudo bash bin/install-interpreter.sh -a
sudo bash bin/zeppelin-daemon.sh restart
C. These commands change into the Zeppelin installation directory, install all available interpreters, and restart the Zeppelin service so the changes take effect.
4. Connect to the Zeppelin UI and set up the interpreters
A. Open a new Terminal window and start the SSH tunnel (Step 2.F)
B. Open your browser and enable your web proxy, e.g. FoxyProxy (Step 2.F)
C. Open your browser and point it to the URL in the “zeppelin” output from CloudFormation. You should see the following page:
D. Click the drop down that says anonymous and select Interpreter
E. Click the Create button in the top left corner
F. Give the interpreter a name of jdbc, and select jdbc from the Interpreter group drop-down
G. Scroll down in the Properties section until you see “phoenix.” settings. Adjust the phoenix.url setting to: jdbc:phoenix:localhost:8765/hbase
H. Scroll down to the Dependencies section, add the artifact org.apache.phoenix:phoenix-core:4.7.0-HBase-1.1, then click the Save button
5. Query the data using Zeppelin
You are now ready to create a new Zeppelin notebook and start loading and querying data!
A. Return to the Zeppelin homepage, click Import Note, then click Add from URL. Copy and paste the URL below into the URL field, then click Import Note in the pop-up screen:
B. Click the notebook that you just added
C. You should see a few lines of code and scripts in a notebook similar to the following:
D. You can now hit the small “play” button in the top right corner of each white box (these are called “paragraphs” in Zeppelin vernacular)
E. Click the “play” button on the first paragraph (the one that includes “### Create an empty table in HBase”). Wait for it to say FINISHED in the top right corner before proceeding.
F. In the following paragraph, update the EMR endpoint URL on line 9 with your cluster’s corresponding URL.
G. Repeat Step D above, running each paragraph in sequence and waiting for each to say “FINISHED” before proceeding.
H. You should end up with a pie chart visualization similar to the following:
6. Terminate the cluster
A. Click here to return to the CloudFormation web page
B. Select the template that you created as part of this demo
C. From the Actions drop-down, select Delete Stack
D. Click Yes, Delete
Amazon EMR Tutorial Conclusion
We hope you enjoyed our Amazon EMR tutorial on Apache Zeppelin, and that it has truly sparked your interest in exploring big data sets in the cloud using EMR and Zeppelin.
Learn more about our big data and analytics services by downloading our AWS Data Pipeline Whitepaper or watching our latest Big Data video.