The Rise of Cloud Data Lakes

Introduction

Data lakes emerged to solve a growing problem: the need for a scalable, low-cost data repository where organizations could easily store all data types from a diverse set of sources, and then analyze that data to make evidence-based business decisions.

“But what about the data warehouse, which was the de facto solution for storing and analyzing structured data and preceded the data lake by 30 years?”

It couldn’t accommodate these new big data projects and their fast-paced data acquisition models, many of which envisioned easily storing petabytes of data in structured and semi-structured forms. With the future of big data looming large, the data lake seemed like the answer: an ideal way to gather, store, and analyze enormous amounts of data in one location.

Interest in data lakes skyrocketed for one simple reason: most organizations consider data a very important asset, and the systems of the time couldn’t handle its variety. For decades, organizations have collected structured data from enterprise applications, and now they’re supplementing it with newer forms of semi-structured data from web pages, social media sites, mobile phones, Internet of Things (IoT) devices, and many other sources, including shared data sets. With all this data piling up, companies started hiring data scientists. But data scientists, business analysts, and line-of-business professionals still needed a way to easily capture, store, access, and analyze that data. The data scientists and their entourage soon realized they couldn’t deal with this on their own, and they started shouting for help: we need a Data Engineer, PLEASE.

Flowing data into the lake

Prior to the data lake, most analytic systems stored specific types of data, using a predefined database structure. For example, data warehouses were built primarily for analytics, using relational databases that included a schema to define tables of structured data in orderly columns and rows.

By contrast, the hope for data lakes was to store many data types in their native formats and make that data available to the business community for reporting and analytics.

[Figure: the original goal of the data lake, which failed to deliver the desired rapid insights]

The goal was to enable organizations to explore, refine, and analyze petabytes of information without a predetermined notion of structure.

The most important thing to understand about a data lake is not how it is constructed, but what it enables.

It’s a comprehensive way to explore, refine, and analyze petabytes of information constantly arriving from multiple data sources.

One petabyte of data is equivalent to 1 million gigabytes: about 500 billion pages of standard, printed text or 58,333 high-definition, two-hour movies.
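
A quick back-of-the-envelope check of those figures, assuming roughly 2 KB of plain text per printed page and about 17 GB per two-hour HD movie (both rough assumptions):

```python
# Sanity-check the petabyte comparisons above; the per-page and per-movie sizes
# are rough assumptions, not figures from the original text.
PETABYTE = 10**15                 # 1 PB = 1,000,000 GB (decimal units)
BYTES_PER_PAGE = 2_000            # ~2 KB of plain text per printed page
BYTES_PER_HD_MOVIE = 17 * 10**9   # ~17 GB per two-hour HD movie

print(PETABYTE // BYTES_PER_PAGE)       # 500_000_000_000 -> ~500 billion pages
print(PETABYTE // BYTES_PER_HD_MOVIE)   # ~58_823 -> close to the ~58,333 quoted above
```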

Data lakes were conceived for business users to explore and analyze petabytes of data.

Understanding the Problems with Data Lakes

The initial data lake concept was compelling, and many organizations rushed to build on-premises data lakes. The core technology was based on the Apache Hadoop ecosystem, a software framework that distributes data storage and processing across commodity hardware in on-premises data centers (at our org, this was the Big Data team's territory). Hadoop includes a file system called HDFS (the Hadoop Distributed File System) that lets customers store data in its native form. The Hadoop ecosystem also includes open source equivalents to Structured Query Language (SQL), the standard language used to communicate with a database, along with batch and interactive data processing technologies, cluster management utilities, and other necessary data platform components.
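
To make that concrete, here is a minimal PySpark sketch, assuming raw JSON events have been landed in a hypothetical HDFS path, showing how native-format data on a Hadoop cluster can be queried with SQL (Hive is the classic SQL-on-Hadoop engine; Spark SQL is used here for brevity):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs_raw_query").getOrCreate()

# Read semi-structured events exactly as they were landed in HDFS,
# with no predefined schema (the path and field names are hypothetical).
events = spark.read.json("hdfs:///datalake/raw/clickstream/")

# Expose the raw data to SQL, the familiar language layered on top of Hadoop.
events.createOrReplaceTempView("clickstream")
daily_counts = spark.sql("""
    SELECT event_date, COUNT(*) AS events
    FROM clickstream
    GROUP BY event_date
    ORDER BY event_date
""")
daily_counts.show()
```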

Unfortunately, many of these on-premises data lake projects failed to fulfill the promise of data lake computing, thanks to:

  • burdensome complexity
  • slow time to value
  • heavy system management efforts

The inherent complexities of a distributed architecture and the need for custom coding for data transformation and integration, mainly handled by highly skilled data engineers, made it difficult to derive useful analytics and contributed to Hadoop’s demise. Although many Hadoop-based data lake projects aren’t delivering their promised value, organizations of all types still want all the insights from all their data by all their users.

The original promise of the data lake remains: a way for organizations to collect, store, and analyze all of their data in one place.

Today, as cloud computing takes center stage and legacy technology models fade into the background, this new paradigm is revealing its potential. Modern cloud technologies allow you to create innovative, cost-effective, and versatile data lakes, or to extend existing data lakes built on Hadoop, cloud object stores (a storage architecture that manages data as objects), and other technologies.

What are the requirements for these modern Data Lakes?

To be truly useful, a data lake must:

  • Easily store data in native formats
  • Facilitate user-friendly exploration of that data
  • Automate routine data management activities
  • Support a broad range of analytics use cases
Most of today’s data lakes, however, can’t effectively organize all of an organization’s data. What’s more, they must be filled from a number of data streams, each of which delivers data at a different frequency.

Without adequate data quality and data governance, even well-constructed data lakes can quickly become data swamps: unorganized pools of data that are difficult to use, understand, and share with business users. The greater the quantity and variety of data, the more significant this problem becomes, and the harder it gets to derive meaningful insights. Other common problems include slow performance, difficulty managing and scaling the environment, and high license costs for hardware and software.

[Figure: slow performance, complexity, and poor governance are among the reasons traditional data lakes often fail]

Enter Cloud Data Lakes

History: When the data lake emerged back in 2010, few people anticipated its management complexity, lackluster performance, limited scaling, and weak governance. As a result, Hadoop-based data lakes became data swamps: places for dumping data. These early data lakes left many organizations struggling to produce the needed insights.

Rise of Object Stores: As the cloud computing industry matured, Amazon, Microsoft, and other vendors introduced object stores that served as interim data lake solutions, such as Amazon Simple Storage Service (S3), Microsoft Azure Blob Storage, and Google Cloud Storage. Some organizations leveraged these storage environments to create their own data lakes from scratch. These highly elastic cloud storage services allow customers to store unlimited amounts of data in their native formats, against which they can conduct analytics.
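
As a concrete illustration of what "managing data as objects" means, here is a minimal boto3 sketch against S3; the bucket and key names are hypothetical, and the same pattern applies to Azure Blob Storage and Google Cloud Storage through their own SDKs:

```python
import boto3

s3 = boto3.client("s3")

# Land a raw JSON event file in the lake, exactly as it arrived: you PUT bytes
# under a key, and no schema is imposed at write time.
s3.upload_file(
    Filename="events-2024-01-01.json",            # local raw file (hypothetical)
    Bucket="company-data-lake",                   # hypothetical bucket
    Key="raw/clickstream/2024/01/01/events.json"  # the key acts as the object's "path"
)

# Later, any engine (or this script) can read the object back unchanged.
obj = s3.get_object(Bucket="company-data-lake",
                    Key="raw/clickstream/2024/01/01/events.json")
raw_bytes = obj["Body"].read()
```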

However, although customers no longer have to manage the hardware stack (goodbye, Big Data team), as they did with Hadoop, they still have to create, integrate, and manage the software environment. This involves:

  • Setting up procedures to transform data
  • Establishing policies and procedures for:
    • identity management
    • security
    • data governance
Finally, customers have to figure out how to obtain high-performance analytics.

Rise of Modern Data Lakes: Three things contributed to the rise of modern data lakes: cloud-based analytics, the data warehouse (people still can’t live without a star schema or snowflake schema), and cloud-based object stores. These solutions have become the foundation for the modern data lake: a place where structured and semi-structured data can be staged in its raw form, either in the data warehouse itself or in an associated object storage service.

The lure of Cloud Data Lakes: The ability to store unlimited amounts of diverse data makes the cloud particularly well-suited for data lakes. And the entire environment can be operated with familiar SQL tools. Because all storage objects and necessary compute resources are internal to the modern data lake platform, data can be accessed and analytics can be executed quickly and efficiently.

This is much different from the original data lake architectures, where data was always stored in an external data bucket and then copied to another loosely integrated storage-compute layer to achieve adequate analytics performance.

Where are these modern data lakes used? Modern data lakes have the potential to play an important role in every industry. For example:

  • E-Commerce Retailers: E-commerce retailers use modern data lakes to collect clickstream data for monitoring web shopping activities. They analyze browser data in conjunction with customer buying histories to predict outcomes. Armed with these insights, retailers can provide timely, relevant, and consistent messaging and offers for acquiring, serving, and retaining customers.

Traditional data lakes vs. cloud data lakes: Traditional data lakes fail because of their inherent complexity, poor performance, and lack of governance, among other issues. Modern cloud data lakes overcome these challenges thanks to foundational tenets such as:

  1. No Silos: Easily ingest petabytes of structured, semi-structured, and unstructured data into a single repository.
  2. Instant elasticity: Supply any amount of computing resources to any user or workload. Dynamically resize a compute cluster without affecting running queries, or scale the service out with additional compute clusters so intense workloads complete faster.
  3. Concurrent operation: Allow a near-unlimited number of users and workloads to access a single copy of your data, all without affecting performance.
  4. Embedded governance: Present fresh and accurate data to users, with a focus on collaboration, data quality, access control, and metadata (data about data) management.
  5. Fully managed: With a software-as-a-service (SaaS) solution, the data platform itself largely handles provisioning, data protection, security, backups, and performance tuning, allowing you to focus on analytics rather than on managing hardware and software. You just set up the modern data lake platform and go.

Case Study 1: A Gaming Company analyzes semi-structured JSON data in the cloud

ABCGames is a mobile and online gaming company. Analytics plays a central role in helping the company meet its revenue goals, allowing ABCGames to continually experiment with new features, functionalities, and platforms. ABCGames’s analytics team captures event data from gaming activities in JavaScript Object Notation (JSON), which is a language-independent data format, and makes it available for exploration, reporting, and predictive analytics.

Previously, the event data had to be converted into relational form so it could be queried with Structured Query Language (SQL, the language typically used by databases), because that’s what ABCGames’s analytics platform required. It was a convoluted, multi-step process:

  • Funnel JSON event data into an Apache Kafka pipeline, which stores streams of records in categories called topics.
  • Process the data on a Hadoop cluster.
  • Import the results into Hive tables, providing a SQL-like interface for querying data.
  • Transform and load data into relational tables for analysis.

To simplify this data transformation and processing cycle, ABCGames migrated its analytics environment to a cloud-built data lake that natively supports semi-structured data. In this new architecture, JSON event data flows directly from Kafka to an Amazon S3 storage service, where it is loaded into the cloud data lake and made accessible for processing and analysis, with each business group at ABCGames able to operate independent virtual warehouses (compute clusters) for transforming, slicing, aggregating, and analyzing. The process is an order of magnitude faster than before, and the company pays by the second for the storage and compute power it needs.
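
Here is a minimal sketch of that new ingestion path, assuming the kafka-python client, a hypothetical topic name, and a hypothetical S3 bucket. In production this hand-off is typically done by a managed connector rather than a hand-rolled script, but the flow is the same: JSON goes from Kafka to S3 unchanged, with no relational conversion in between.

```python
import json

import boto3
from kafka import KafkaConsumer

# Consume JSON game events from a hypothetical Kafka topic.
consumer = KafkaConsumer(
    "game-events",
    bootstrap_servers=["kafka:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
s3 = boto3.client("s3")

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 1000:
        # Write the events to S3 as newline-delimited JSON, still in native form;
        # the cloud data lake loads them from this bucket for analysis.
        key = f"raw/game-events/offset={message.offset}.json"
        s3.put_object(
            Bucket="abcgames-raw-events",   # hypothetical bucket
            Key=key,
            Body="\n".join(json.dumps(e) for e in batch).encode("utf-8"),
        )
        batch = []
```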

Additional Stuff

What does a data engineer do?

At a high level, a data engineer performs five tasks: collect, store, process/analyze/transform, orchestrate, and consume.

[Figure: the five high-level tasks of a data engineer: collect, store, process/analyze/transform, orchestrate, consume]

What are some data pipeline architecture best practices?

The data pipeline architecture is designed around a separation of concerns: collect, store, process/analyze/transform, orchestrate, consume.

The architecture addresses these concerns in the following way:

  • Collect: Data is extracted from on-premises databases by using Apache Spark, then loaded to AWS S3 (see the PySpark sketch after this list).
  • Store: Data is stored in its original form in S3, which serves as an immutable staging area for the data warehouse.
  • Process/Analyze/Transform: Data is transformed by using dbt and inserted into AWS Redshift. Keep in mind that dbt only performs transformations inside the data warehouse itself; the workaround is to expose the S3 data through Redshift Spectrum external tables so it can be transformed and materialized in Redshift by dbt.
  • Consume: Data is consumed by users through BI tools such as Metabase and Tableau.
  • Orchestrate: Data processes are orchestrated by Airflow, which solves most of the problems that arise when orchestrating decoupled systems (see the Airflow sketch after this list).
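
Here is a minimal PySpark sketch of the Collect step, assuming a hypothetical on-premises Postgres database, table, and S3 bucket (the Postgres JDBC driver jar and S3 credentials must be available to the Spark job):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("extract_orders_to_s3").getOrCreate()

# Pull a source table from the on-premises database over JDBC
# (host, credentials, and table name are placeholders).
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://onprem-db:5432/erp")
    .option("dbtable", "public.orders")
    .option("user", "etl_user")
    .option("password", "***")
    .option("driver", "org.postgresql.Driver")
    .load()
)

# Land the untransformed extract in S3, the immutable staging area,
# under a dated prefix so each load is preserved as-is.
orders.write.mode("overwrite").parquet(
    "s3a://company-raw-zone/orders/load_date=2024-01-01/"
)
```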
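
And here is a minimal Airflow sketch of the Orchestrate step, assuming Airflow 2.x, a hypothetical spark-submit entry point for the extract job, and a hypothetical dbt project directory; a real DAG would add retries, alerting, and data-quality checks:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="warehouse_daily_load",       # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Collect + Store: run the Spark job that lands raw data in S3.
    extract_to_s3 = BashOperator(
        task_id="extract_to_s3",
        bash_command="spark-submit extract_orders_to_s3.py",  # hypothetical job file
    )

    # Process/Transform: dbt materializes models in Redshift, reading the raw
    # S3 data through Redshift Spectrum external tables.
    dbt_transform = BashOperator(
        task_id="dbt_transform",
        bash_command="dbt run --project-dir /opt/analytics",  # hypothetical project path
    )

    # Consume happens downstream in the BI tools; orchestration only has to make
    # sure the transforms finish before users start querying.
    extract_to_s3 >> dbt_transform
```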