Level Up Your Data Lake

Take your data lake game to new heights with these two architecture improvements.

Paul Singman

Published in Whispering Data

4 min read · Feb 22, 2022

What is the Basic Data Lake?

A data lake is primarily two things: an object store and the objects being stored. It might look something like this:

Even with this basic setup, your data is in a good position to support all three of the main use cases for data: 1. BI Analytics 2. Data-Intensive APIs and 3. Machine Learning Algorithms.

The fact that this architecture is flexible enough to support all three speaks to the strength of object stores, particularly their flexibility in integrating with a diverse set of data processing engines.

In-memory distributed processing with Spark? No problem. A columnar data warehouse like Snowflake? Piece of cake. A distributed query engine like Trino? Go for it.

Level 1: Modern Table Formats

As data lakes exploded in adoption, a number of improvements were made upon this basic architecture. The first and most obvious improvement to make is to replace those pesky CSV files.

A popular replacement for CSV was, and still is, the Parquet file format. Parquet works well for analytic use cases because it is:

Columnar.
Highly-compressible.
Able to support complex, nested data types.
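To make the first two properties concrete, here is a minimal sketch in plain Python. It is not how Parquet is implemented. The sample records and the helper names `to_columns` and `run_length_encode` are made up for illustration, and Parquet's real encodings are more elaborate. The point is only that storing values column by column lets a query read just the columns it needs, and puts similar values next to each other where they compress well.

```python
from itertools import groupby

# Hypothetical sample records, as they would appear row by row in a CSV file.
rows = [
    {"user_id": 1, "country": "US", "amount": 9.99},
    {"user_id": 2, "country": "US", "amount": 4.50},
    {"user_id": 3, "country": "US", "amount": 12.00},
    {"user_id": 4, "country": "DE", "amount": 7.25},
    {"user_id": 5, "country": "DE", "amount": 3.10},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    return {name: [r[name] for r in records] for name in records[0]}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into [value, count] pairs."""
    return [[value, len(list(run))] for value, run in groupby(values)]


columns = to_columns(rows)

# An aggregate over one column only touches that column's values.
total_amount = sum(columns["amount"])

# Values of one column sit next to each other, so repeats collapse nicely.
encoded_country = run_length_encode(columns["country"])
print(encoded_country)  # [['US', 3], ['DE', 2]]
```

In a row-oriented file like CSV, the same sum would require parsing every field of every row.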

While these are key improvements, objects in an object store, however optimized they may be, can never be anything more than a loose collection of objects (unless you adopt a separate metastore service).

What’s missing from these collections of objects, people realized, is the abstraction of a table. In databases, tables are everywhere, and all the benefits they provide are equally valid in object stores.

This is where the table formats come in: Apache Iceberg, Apache Hudi, and Delta Lake. When data is saved in one of these formats, it becomes much easier to create tables within the object store itself, each with a defined schema, a version history, and the ability to be updated atomically.

This greatly enhances the performance and usability of a data lake. And soon enough our basic data lake will look something more like this:

How do these table formats work? Well, the idea behind them is to maintain a transaction log of the objects added to (and removed from) certain prefixes in the lake. This provides the important guarantee of atomicity for write operations, and lets us avoid query errors when reading and writing data simultaneously.
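The transaction log idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the on-disk format of Iceberg, Hudi, or Delta Lake: the `TableLog` class, its method names, and the object keys are all hypothetical. Each commit records which objects were added and removed, and a reader rebuilds the table's contents by replaying the log up to a chosen version.

```python
import json


class TableLog:
    """Toy append-only transaction log for one table (a sketch, not a real format)."""

    def __init__(self):
        self._entries = []  # each entry is one JSON-encoded commit

    def commit(self, add=(), remove=()):
        """Record added and removed object keys as a single all-or-nothing commit."""
        entry = {
            "version": len(self._entries),
            "add": list(add),
            "remove": list(remove),
        }
        self._entries.append(json.dumps(entry))
        return entry["version"]

    def snapshot(self, version=None):
        """Replay the log up to `version` and return the set of live object keys."""
        if version is None:
            version = len(self._entries) - 1
        live = set()
        for raw in self._entries[: version + 1]:
            entry = json.loads(raw)
            live -= set(entry["remove"])
            live |= set(entry["add"])
        return live


log = TableLog()

# Version 0: two small files are written to the table's prefix.
v0 = log.commit(add=["events/part-000.parquet", "events/part-001.parquet"])

# Version 1: a compaction swaps both files for one larger file in a single commit.
v1 = log.commit(
    add=["events/part-002.parquet"],
    remove=["events/part-000.parquet", "events/part-001.parquet"],
)

print(sorted(log.snapshot(v0)))  # the table as a reader pinned to version 0 sees it
print(sorted(log.snapshot()))    # the latest version of the table
```

A reader pinned to version 0 keeps seeing the two original files while the compaction commits, which is the behavior that prevents errors when reads and writes overlap. Keeping old versions readable is also what makes a table's version history possible.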

For more details, here are two articles that dive into more of the specifics:

An Introduction to Modern Data Lake Storage Layers

In recent years we've seen a rise in new storage layers for data lakes. In 2017, Uber announced Hudi - an incremental…

dacort.dev

Hudi, Iceberg and Delta Lake: Data Lake Table Formats Compared

When building a data lake, there is perhaps no more consequential decision than the format data will be stored in. The…

lakefs.io

Level 2: Source Control for Data

While table formats made our data lake much more impressive, we’re not done improving it. All of the benefits modern table formats provide at the table level can be extended even further to encompass our entire data lake!

How, you ask? With a data source control tool like lakeFS, which turns an object store’s bucket into a data repository in which we can track multiple datasets.

While the previous architecture is still fresh in your mind, here’s what our data lake looks like at this level:

A new layer is added to the folder hierarchy, which corresponds to the name of a branch. lakeFS lets us create, alter, and merge as many branches as we want, making it possible to do things like:

Create multiple copies of all tables (without duplicating objects!)
Save cross-collection snapshots of tables as commits and time-travel between them

For example, it is possible to synchronize updates to two Iceberg tables (or even a Hudi and Iceberg table) in the same lakeFS repository via a merge operation from one branch to another.
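The branch, commit, and merge ideas can be modeled with a short Python sketch. This is not the lakeFS API or its internal design. The `ToyRepo` class, the paths, and the object ids are hypothetical, and the merge is simplified to a fast-forward. It illustrates two points from above: a branch is only a pointer to a commit, so creating one copies no objects, and a merge publishes changes to several tables in one step.

```python
class ToyRepo:
    """Toy model of branch/commit/merge over an object store (not the lakeFS API)."""

    def __init__(self):
        self.commits = [{}]          # commit id -> {path: object id}
        self.branches = {"main": 0}  # branch name -> commit id

    def create_branch(self, name, source):
        # A branch is only a pointer to a commit, so no objects are copied.
        self.branches[name] = self.branches[source]

    def commit(self, branch, changes):
        """Apply {path: object id} changes on top of the branch head as one commit."""
        snapshot = dict(self.commits[self.branches[branch]])  # copies metadata only
        snapshot.update(changes)
        self.commits.append(snapshot)
        self.branches[branch] = len(self.commits) - 1
        return self.branches[branch]

    def merge(self, source, destination):
        """Simplified fast-forward merge: destination adopts source's head at once."""
        self.branches[destination] = self.branches[source]

    def read(self, branch, path):
        return self.commits[self.branches[branch]][path]


repo = ToyRepo()
repo.commit("main", {"tables/orders": "obj-1", "tables/customers": "obj-2"})

# Update both tables on an isolated branch; readers of main are unaffected.
repo.create_branch("etl-run", "main")
repo.commit("etl-run", {"tables/orders": "obj-3", "tables/customers": "obj-4"})
before_merge = repo.read("main", "tables/orders")

# The merge makes both table updates visible on main together.
repo.merge("etl-run", "main")
after_merge = (repo.read("main", "tables/orders"), repo.read("main", "tables/customers"))
print(before_merge, after_merge)
```

Because earlier commits are never modified, reading an older commit id gives back the lake exactly as it was at that point, which is the time-travel behavior described in the list above.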

When it comes to reproducing the state of training data in an ML experiment or updating the data assets that power critical APIs, having your data nimble in this way makes it possible to work efficiently over even the largest data lakes.

For more information on these git-inspired workflows, see the following article:

Guarantee Consistency in Your Delta Lake Tables With lakeFS

A walkthrough on how to guarantee consistency in your Delta Lake tables and ensure data quality with lakeFS