AWS Data Pipeline Whitepaper

Onica

Data Analytics
August 4, 2016


Simplify Data Workflow with AWS Data Pipeline – Get the Whitepaper

Businesses around the globe are looking to tap into a growing number and volume of data sources in order to make better, data-driven decisions, perform advanced analysis, and predict future outcomes. AWS Data Pipeline is a service that simplifies those data workflow challenges, bringing large volumes of data into and out of the AWS ecosystem with tools such as Amazon S3, RDS, EMR, and Redshift.

What Does AWS Data Pipeline Do?

Data Pipeline runs on top of a highly scalable and elastic architecture, with data stored and moved inside customer-managed AWS accounts and Virtual Private Cloud networks. Data Pipeline comes with zero upfront costs and on-demand pricing that’s up to 1/10 the cost of competitors. Data Pipeline manages many complex parts of workflows, letting big data architects and engineers focus on what matters most: the business logic and the source and target systems behind the data flows.

What Can You Do With Your Data?

Experts estimate that global Internet traffic will exceed a Zettabyte (1 trillion Gigabytes) in 2016, with 40 Zettabytes of data existing by 2020. In the current technology climate, most companies store at least Terabytes of data drawn from transactional, operational, campaign, and third-party market research sources. Some companies embrace even more data sources such as clickstream, event processing, and Internet of Things (IoT) data, thereby exponentially increasing the amount of data ingested. One survey even found that 71% of companies have a near-term plan to use the simplest form of analytics in everyday decision-making. The motivation for data usage and storage is clear: top-performing organizations use analytics five times more than lower performers.

Simplify Data Workflows (Affordably and Effectively)

Given the surge of data into business technology infrastructure, IT departments may find it difficult to use data effectively. Activities such as standardizing and scaling data ingest and extraction, data transformation and cleansing, and data loading into storage are best suited for advanced analytical processing engines. Some organizations may turn to complex, expensive data integration tools to meet the demands of data operations and data governance. The complexities and cost of these tools can be a show-stopper. Amazon Web Services (AWS) provides AWS Data Pipeline, a data integration web service that is robust and highly available at nearly 1/10th the cost of other data integration tools. AWS Data Pipeline enables data-driven integration workflows to move and process data both in the cloud and on-premises.

AWS Data Pipeline enables users to create advanced data processing workflows that are fault tolerant, repeatable, and highly available. Data engineers, integrators, and system operations staff don’t have to worry about ensuring resource availability, provisioning, managing inter-task dependencies, retrying transient failures or timeouts in individual tasks, or creating a failure notification system. With AWS Data Pipeline, IT and data teams can move and process data once locked up in data silos, with the benefit of zero upfront costs and only paying for what they use.

Learn More About What Data Pipeline Can Do For You

Our whitepaper is intended for big data architects, data engineers, data integrators, and system operations administrators faced with the challenge of orchestrating and Extracting, Transforming, and Loading (ETL) vast amounts of data from across the enterprise and/or external data sources. This whitepaper will also help familiarize readers with AWS Data Pipeline by sharing:

AWS Data Pipeline Overview: Use-Cases, Architecture & Components, Automation & Agility, Security, Elasticity, and Cost
Onica Best Practices and Knowledge Share: Guidance from Onica’s real-world big data scenarios and experience
Hands-On Example and Demonstration: Relational database export Data Pipeline

Typical Use-Cases

Curious about when Data Pipeline will make a big difference in your data analytics and use? Here’s a bit of a teaser to give you an idea of what’s in this thorough whitepaper. Common use-cases for AWS Data Pipeline include, but are not limited to, the following:

Extraction – Extract relational data from a transactional relational database management system (RDBMS) into S3 object storage for later use
Transformation – Run Spark or Hadoop MapReduce jobs to transform and process data sets on EMR
Production analytics jobs – Run MapReduce or Spark jobs on a schedule on EMR for advanced analytics that are SLA-bound
Loading – Load data into Redshift, AWS’s reliable OLAP data warehousing solution
Scheduled maintenance/administration scripts
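
To make the extraction use-case a little more concrete, here is a minimal sketch of what a daily relational export might look like as a pipeline definition, built as a Python dictionary and serialized to JSON. It is an illustration rather than the definition used in the whitepaper. The object types (Schedule, RdsDatabase, SqlDataNode, S3DataNode, Ec2Resource, CopyActivity) and field names follow the Data Pipeline definition format as we understand it, so check them against the current AWS documentation before use. Every id, the table name, the bucket path, and the unresolved_refs helper are hypothetical.

```python
import json

# All ids, the table name, and the S3 path are hypothetical placeholders.
# Object types and field names are assumed from the AWS Data Pipeline
# definition format; verify them against the current documentation.
PIPELINE_DEFINITION = {
    "objects": [
        {
            "id": "Default",
            "name": "Default",
            "scheduleType": "cron",
            "schedule": {"ref": "DailySchedule"},
            "failureAndRerunMode": "CASCADE",
            "role": "DataPipelineDefaultRole",
            "resourceRole": "DataPipelineDefaultResourceRole",
        },
        {
            "id": "DailySchedule",
            "type": "Schedule",
            "period": "1 day",
            "startAt": "FIRST_ACTIVATION_DATE_TIME",
        },
        {
            # Connection details are parameter references, not literals.
            "id": "SourceDatabase",
            "type": "RdsDatabase",
            "rdsInstanceId": "#{myRDSInstanceId}",
            "username": "#{myRDSUsername}",
            "*password": "#{*myRDSPassword}",
        },
        {
            "id": "SourceTable",
            "type": "SqlDataNode",
            "database": {"ref": "SourceDatabase"},
            "table": "orders",
            "selectQuery": "select * from #{table}",
        },
        {
            "id": "ExportLocation",
            "type": "S3DataNode",
            "directoryPath": "s3://example-bucket/exports/orders/",
        },
        {
            # Compute that Data Pipeline provisions for the copy and then
            # terminates.
            "id": "Worker",
            "type": "Ec2Resource",
            "terminateAfter": "2 Hours",
        },
        {
            "id": "ExportOrders",
            "type": "CopyActivity",
            "input": {"ref": "SourceTable"},
            "output": {"ref": "ExportLocation"},
            "runsOn": {"ref": "Worker"},
        },
    ]
}


def unresolved_refs(definition):
    """Return the sorted ids that are referenced but never defined."""
    ids = {obj["id"] for obj in definition["objects"]}
    missing = set()
    for obj in definition["objects"]:
        for value in obj.values():
            if isinstance(value, dict) and "ref" in value:
                if value["ref"] not in ids:
                    missing.add(value["ref"])
    return sorted(missing)


if __name__ == "__main__":
    # A quick local sanity check before submitting the definition.
    assert unresolved_refs(PIPELINE_DEFINITION) == []
    print(json.dumps(PIPELINE_DEFINITION, indent=2))
```

Keeping the definition as plain data means it can be versioned and reviewed like any other code, and a small local check such as unresolved_refs can catch a mistyped reference before the pipeline is submitted.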

Learn About Automation and Agility

Still not convinced you need this download? Check out Data Pipeline’s features that will help you automate and quickly iterate on data integration workflows:

Deployment
Schedule & On-Demand
Preconditions
Notifications & Retries
Automated Provisioning
Pipelines as Code
Logging

Find out more about each of these features and exactly how they will help you do more with your data in our free whitepaper.

Get Inside Tips on How To Enhance Your Security

Are you interested in security? IT and technology management should evaluate security before adopting any new systems or services. We’ve outlined AWS’s and Data Pipeline’s security controls within the whitepaper. Here’s an outline of what we cover:

Shared Responsibility Model
Security in the Virtual Private Cloud
Identity & Access Management Roles
Storage Security

Because we know you’re curious, here are a few of the included storage security tips to keep your credentials and data safe:

Avoid storing and committing credentials in your Data Pipeline definitions – pass them as parameters when you submit a pipeline, instead.
In your JSON definitions, prefix your authentication parameters with the ‘*’ (asterisk) special character to encrypt the credentials in transit and in the Data Pipeline console.
It’s recommended that you create a user on your RDBMS systems specifically for Data Pipeline so you can control its access to certain schemas, databases, and tables, and audit its activity regularly. Use a cryptographically strong password and rotate it regularly.
Limit access to the AWS Data Pipeline API using AWS Identity and Access Management (IAM). We recommend permitting the “GetPipelineDefinition” action only to privileged IAM users so that less-privileged users cannot retrieve submitted passwords from the AWS Console or API.
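
As a rough illustration of the tips above, the sketch below checks a pipeline definition for credentials that are stored as literals or that lack the ‘*’ prefix, and shows one possible IAM policy that denies “GetPipelineDefinition”. This is a sketch under stated assumptions, not the whitepaper’s own tooling. The credential_problems helper, the object ids, and the parameter names are hypothetical. The #{...} parameter reference syntax and the datapipeline:GetPipelineDefinition action name are what we understand AWS to use, so confirm both against the AWS documentation. An explicit Deny attached to less-privileged users is only one way to implement the recommendation.

```python
import re

# Matches a parameter reference such as "#{myRDSUsername}" or
# "#{*myRDSPassword}" (syntax assumed from the Data Pipeline format).
PARAM_REF = re.compile(r"^#\{\*?\w+\}$")


def credential_problems(definition):
    """List password-like fields that break the tips above (hypothetical helper)."""
    problems = []
    for obj in definition.get("objects", []):
        for field, value in obj.items():
            if "password" not in field.lower():
                continue
            if not field.startswith("*"):
                problems.append(f"{obj['id']}.{field}: missing '*' prefix")
            if not (isinstance(value, str) and PARAM_REF.match(value)):
                problems.append(f"{obj['id']}.{field}: literal value, not a parameter")
    return problems


# Credentials arrive as parameters when the pipeline is submitted.
GOOD_DEFINITION = {
    "objects": [
        {
            "id": "SourceDatabase",
            "type": "RdsDatabase",
            "rdsInstanceId": "#{myRDSInstanceId}",
            "username": "#{myRDSUsername}",
            "*password": "#{*myRDSPassword}",
        }
    ],
    "parameters": [
        {"id": "myRDSInstanceId", "type": "String"},
        {"id": "myRDSUsername", "type": "String"},
        {"id": "*myRDSPassword", "type": "String"},
    ],
}

# A literal password committed with the definition: what to avoid.
BAD_DEFINITION = {
    "objects": [
        {"id": "SourceDatabase", "type": "RdsDatabase", "password": "example-literal"}
    ]
}

# One possible policy for less-privileged users (an assumption, not the
# only approach): explicitly deny reading submitted definitions.
DENY_GET_DEFINITION_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": "datapipeline:GetPipelineDefinition",
            "Resource": "*",
        }
    ],
}

if __name__ == "__main__":
    print(credential_problems(GOOD_DEFINITION))
    print(credential_problems(BAD_DEFINITION))
```

A check like this can run in a pre-commit hook or a build step, so that a definition with a hard-coded password never reaches source control.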

That’s just the beginning of the overview, detailed descriptions, expert insights, and hands-on demonstrations of everything Data Pipeline, which we’ve compiled in this one document just for you. Get all the information before you make the purchase by downloading the free whitepaper today.

Download the Whitepaper

We are an AWS Premier Consulting and Big Data Partner. We specialize in guiding customers with big data challenges on their journey into the cloud. Our data practice enables businesses to focus on extracting immediate competitive business insights from their data instead of provisioning servers, storage, and other non-differentiating tasks. Contact us to learn more about how AWS Data Pipeline can help your organization make better business decisions.