Guarantee Consistency in Your Delta Lake Table(s)

Learn how to integrate lakeFS hooks to validate data on commits.

Paul Singman
Whispering Data

--

One of the most common questions we receive from existing and potential users of lakeFS is “Can it work with Delta Tables?”

It’s understandable why we hear this question given Delta’s rapid adoption and advanced capabilities including:

  • Table-level ACID operations
  • Data mutations including deletes and “in-place” updates
  • Advanced partitioning and indexing abilities (with z-order)

While the above features are powerful, combining Delta Tables with a lakeFS repository is even more powerful. With Delta Lake and lakeFS together you can enable additional data safety guarantees while simplifying operations.

For example:

  • ACID operations can span across multiple Delta tables
  • CI/CD hooks can be used to validate data quality and even ensure referential integrity
  • Tables can be cloned in zero-copy fashion, without duplicating data

lakeFS & Delta in Action

To prove this point, we’ll demonstrate how to guarantee data quality in a Delta table by incorporating lakeFS branches and hooks into the workflow for adding new data.

We’ll start by creating two Delta tables, representing loans and loan payments on top of data stored in a lakeFS repository. To do this we’ll use a Databricks notebook, configured to use lakeFS as the underlying storage:
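A minimal PySpark sketch of such a notebook is shown below. The repository name (example-repo), branch, raw file locations, table paths, and column layout are illustrative assumptions, and the cluster is assumed to route s3a:// paths through the lakeFS S3 gateway:

```python
# Assumed cluster Spark config, pointing s3a:// at the lakeFS S3 gateway:
#   spark.hadoop.fs.s3a.endpoint          https://<lakefs-host>
#   spark.hadoop.fs.s3a.access.key        <lakeFS access key id>
#   spark.hadoop.fs.s3a.secret.key        <lakeFS secret access key>
#   spark.hadoop.fs.s3a.path.style.access true

# Read the raw loan data (assumed to already exist on the repo's main branch)
loans = spark.read.option("header", "true").option("inferSchema", "true") \
    .csv("s3a://example-repo/main/raw/loans.csv")
payments = spark.read.option("header", "true").option("inferSchema", "true") \
    .csv("s3a://example-repo/main/raw/loan_payments.csv")

# Write each dataset out as a Delta table inside the main branch of the repository
loans.write.format("delta").mode("overwrite") \
    .save("s3a://example-repo/main/tables/loans")
payments.write.format("delta").mode("overwrite") \
    .save("s3a://example-repo/main/tables/loan_payments")
```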

Once we run the above commands, we will see Delta’s data and transaction log files added to the repository. In lakeFS, we can commit them on the active branch, as shown below.
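The commit can also be scripted against the lakeFS REST API; here is a hedged sketch, with the host, repository name, and credentials as placeholders:

```python
import requests

LAKEFS_API = "https://<lakefs-host>/api/v1"        # placeholder: your lakeFS endpoint
AUTH = ("<access-key-id>", "<secret-access-key>")  # placeholder: lakeFS credentials

# Commit the newly written Delta files on the main branch
resp = requests.post(
    f"{LAKEFS_API}/repositories/example-repo/branches/main/commits",
    json={"message": "Create loans and loan_payments Delta tables"},
    auth=AUTH,
)
resp.raise_for_status()
```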

Great! Now let’s create a second notebook containing a set of validation rules between the two loan tables that will serve as the data quality pre-merge hook.
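Here is a rough sketch of what such a validation notebook could look like. The branch widget, table paths, and column names (loan_id, payment_amount, loan_amount) are assumptions; what matters is that a failing check raises an error and fails the notebook run:

```python
# Parameterize the branch so the same notebook can validate main or dev-reports
dbutils.widgets.text("branch", "main")
branch = dbutils.widgets.get("branch")

base = f"s3a://example-repo/{branch}/tables"
spark.read.format("delta").load(f"{base}/loans").createOrReplaceTempView("loans")
spark.read.format("delta").load(f"{base}/loan_payments").createOrReplaceTempView("loan_payments")

# Check 1: referential integrity - every payment must reference an existing loan
orphans = spark.sql("""
    SELECT p.loan_id
    FROM loan_payments p
    LEFT JOIN loans l ON p.loan_id = l.loan_id
    WHERE l.loan_id IS NULL
""").count()
assert orphans == 0, f"{orphans} payments reference a loan that does not exist"

# Check 2: no payment may exceed the total amount of its loan
overpaid = spark.sql("""
    SELECT p.loan_id
    FROM loan_payments p
    JOIN loans l ON p.loan_id = l.loan_id
    WHERE p.payment_amount > l.loan_amount
""").count()
assert overpaid == 0, f"{overpaid} payments exceed the total amount of their loan"
```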

As you can see above, we have created two data validation checks in the form of SQL queries:

  1. Check referential integrity between the loan_payments foreign key and the loans table’s primary key
  2. Check that no payment is higher than the total amount of its loan

Since branching and merging in lakeFS are zero-copy metadata operations, we can use a separate branch from main for ingesting new files. This way, new data gets added in isolation, and a lakeFS hook can run the validation before the branch is merged back into main.

The first step is to create the lakeFS branch, which we will call dev-reports. We can create it using the API, CLI or the lakeFS UI:
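For example, here is a hedged sketch using the REST API (lakectl branch create or the UI work just as well; the host and credentials are placeholders):

```python
import requests

LAKEFS_API = "https://<lakefs-host>/api/v1"
AUTH = ("<access-key-id>", "<secret-access-key>")

# Create dev-reports as a zero-copy branch off main
requests.post(
    f"{LAKEFS_API}/repositories/example-repo/branches",
    json={"name": "dev-reports", "source": "main"},
    auth=AUTH,
).raise_for_status()
```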

In the proposed branching scheme, we’ll have a main and a dev-reports branch. Most consumers should read from the main branch, where data is guaranteed to have been tested and validated.

A consumer that is OK with reading “dirty” data in order to see the absolute latest can do so from the dev-reports branch:
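Assuming the same table layout as above, reading from either branch is just a matter of the path prefix:

```python
# Validated data: read the table from the main branch
validated = spark.read.format("delta") \
    .load("s3a://example-repo/main/tables/loan_payments")

# Freshest (possibly unvalidated) data: read the same table from dev-reports
latest = spark.read.format("delta") \
    .load("s3a://example-repo/dev-reports/tables/loan_payments")
```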

Automating Data Deployment with lakeFS Hooks

To provide this guarantee, we’ll configure the tests we created to run automatically before we expose new data to consumers.

To do this, let’s first create a Databricks Job that executes the validation notebook created earlier:
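As a rough sketch, the same job can also be created with the Databricks Jobs 2.1 API; the workspace URL, token, cluster id, and notebook path below are placeholders:

```python
import requests

WORKSPACE = "https://<databricks-workspace>"
HEADERS = {"Authorization": "Bearer <databricks-token>"}

# Create a job that runs the validation notebook on an existing cluster
job = requests.post(
    f"{WORKSPACE}/api/2.1/jobs/create",
    headers=HEADERS,
    json={
        "name": "loan-tables-validation",
        "tasks": [{
            "task_key": "validate",
            "existing_cluster_id": "<cluster-id>",
            "notebook_task": {
                "notebook_path": "/Users/<you>/loan_validations",
                "base_parameters": {"branch": "dev-reports"},
            },
        }],
    },
).json()

print(job["job_id"])  # we'll need this id in the webhook below
```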

We can define a lakeFS webhook by uploading a config (shown below) under the _lakefs_actions/ prefix on the main branch. This will automatically execute the job as a pre-merge hook on main:
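A hedged sketch of such an action file follows, uploaded here via the REST API (the webhook URL and the exact upload call are assumptions; lakectl fs upload works just as well). Remember to commit the file after uploading it:

```python
import requests

LAKEFS_API = "https://<lakefs-host>/api/v1"
AUTH = ("<access-key-id>", "<secret-access-key>")

# lakeFS Actions definition: call the webhook before any merge into main
action_yaml = """\
name: Validate loan tables before merge
on:
  pre-merge:
    branches:
      - main
hooks:
  - id: validate_loan_tables
    type: webhook
    properties:
      url: "http://<webhook-host>:5000/webhooks/validate"
"""

# Upload the config under the _lakefs_actions/ prefix on main
requests.post(
    f"{LAKEFS_API}/repositories/example-repo/branches/main/objects",
    params={"path": "_lakefs_actions/validate_loans.yaml"},
    files={"content": ("validate_loans.yaml", action_yaml)},
    auth=AUTH,
).raise_for_status()
```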

Let’s save this file and deploy the following Flask webhook to execute the Databricks job:
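Below is a hedged Flask sketch of such a webhook. It assumes the job id created above, uses the Databricks Jobs 2.1 run-now and runs/get endpoints, and assumes the hook payload carries the merge’s source branch in a source_ref field. Returning a non-2xx status is what makes the lakeFS hook, and therefore the merge, fail:

```python
import time
import requests
from flask import Flask, request

app = Flask(__name__)

WORKSPACE = "https://<databricks-workspace>"
HEADERS = {"Authorization": "Bearer <databricks-token>"}
JOB_ID = 123  # placeholder: the validation job created above

@app.route("/webhooks/validate", methods=["POST"])
def validate():
    event = request.get_json(force=True)
    # Assumption: the pre-merge event carries the source branch as "source_ref"
    branch = event.get("source_ref", "dev-reports")

    # Trigger the validation notebook against the branch being merged
    run = requests.post(
        f"{WORKSPACE}/api/2.1/jobs/run-now",
        headers=HEADERS,
        json={"job_id": JOB_ID, "notebook_params": {"branch": branch}},
    ).json()

    # Poll the run until it terminates
    while True:
        state = requests.get(
            f"{WORKSPACE}/api/2.1/jobs/runs/get",
            headers=HEADERS,
            params={"run_id": run["run_id"]},
        ).json()["state"]
        if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
            break
        time.sleep(15)

    if state.get("result_state") != "SUCCESS":
        return "validation failed", 400  # non-2xx -> lakeFS blocks the merge
    return "validation passed", 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```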

For reusable webhooks that you could simply deploy and use, check out the examples in the lakeFS-hooks repository!

Now, let’s add a “bad” record and insert it into our loan_payments table — this record refers to a loan that doesn’t exist.
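For example, with illustrative column names:

```python
from pyspark.sql import Row

# A payment whose loan_id doesn't match any loan, appended on the dev-reports branch
bad_payment = spark.createDataFrame(
    [Row(payment_id="p-999999", loan_id="no-such-loan", payment_amount=250.0)]
)
bad_payment.write.format("delta").mode("append") \
    .save("s3a://example-repo/dev-reports/tables/loan_payments")
```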

Let’s attempt to commit and merge this change into our main branch:
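Sketching this step with the REST API again (same host and credential placeholders; the merge endpoint shape is an assumption):

```python
import requests

LAKEFS_API = "https://<lakefs-host>/api/v1"
AUTH = ("<access-key-id>", "<secret-access-key>")

# Commit the new (bad) record on dev-reports
requests.post(
    f"{LAKEFS_API}/repositories/example-repo/branches/dev-reports/commits",
    json={"message": "Add new loan_payments records"},
    auth=AUTH,
).raise_for_status()

# Attempt to merge dev-reports into main; the pre-merge hook runs the validation job
merge = requests.post(
    f"{LAKEFS_API}/repositories/example-repo/refs/dev-reports/merge/main",
    json={"message": "Merge new payments into main"},
    auth=AUTH,
)
print(merge.status_code, merge.text)  # expect a failure caused by the hook
```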

Hurray! The pre-merge hook on main failed the merge, and consumers of that branch will never see this record in the dataset.

Wrapping Up

When using lakeFS together with Delta, we can introduce changes to data and schema safely, providing powerful guarantees about the data contained within.

In this architecture, each technology is responsible for what it was designed to do: Delta Lake for scalable, transaction-friendly tables, and lakeFS for managing the data lifecycle.

Want to learn more?

To learn more about lakeFS and its benefits in data lake architectures, check out the lakeFS GitHub repo and say “Hi” in the Slack group!

Originally published by Oz Katz on the lakeFS blog.

Paul Singman
Whispering Data

Data @ Meta. Whisperer of data and productivity wisdom. Standing on the shoulders of giants.