Use Docker Compose to create local replicas of a modern data stack with one command.

Introduction

An important part of developing an open source project is assisting and advising users. When they run into an issue and feel pain, we want to feel that pain, too. Quite literally.

This means recreating the environment, running the same code, and raising the same error.

In complex, modern data…


A closer look at Hive Metastore today. Specifically, how it solves a problem (that it sort of created itself, many years ago), and where it falls short with the shifting tides in computing.

Designing a SQL Table in a Filesystem Directory

Eleven years ago, Apache Hive was born with an ambitious mission: Allow querying massive amounts of data using plain SQL instead of writing complex MapReduce jobs.

For this to work efficiently, Hive needed a way to structure SQL tables on top of a Filesystem(-like) interface. …


A majority of data architectures still feature Hive Metastore. Why has it survived and what can finally replace it in the future?

Hive & Hadoop — A Brief History

Apache Hive burst onto the scene in 2010 as a component of the Hadoop ecosystem, when Hadoop was the novel and innovative way of doing big data analytics.

What Hive did was implement a SQL interface to Hadoop. Its architecture consisted of two main services:

  1. A Query Engine — responsible…


Useful datasets are created through a process that involves several predictable steps. Learn how to organize your workflows in this framework.

What is Data Lifecycle Management

Datasets are the foundational output of a data team. They do not appear out of thin air. No one has ever snapped their fingers and created an orders_history table.

Instead, useful sets of data are created and maintained through a process that involves several predictable steps. …


Learn how to integrate lakeFS hooks to validate data on commits.

One of the most common questions we receive from existing and potential users of lakeFS is “Can it work with Delta Tables?”

It’s understandable why we hear this question given Delta’s rapid adoption and advanced capabilities including:

  • Table-level ACID operations
  • Data mutations including deletes and “in-place” updates
  • Advanced partitioning and…


Data teams love calculating and tracking everything with metrics. We already have the infrastructure in place to do so… yet often fail to apply the same strategy for our own work. Let’s fix that!

In my old office, there was a shelf that held tchotchkes and stuffed animals collected over the years.

When people visited me during my time as CTO of SimilarWeb, they would sometimes fiddle with these items. Before leaving work on most days, I made sure everything was neatly returned to its proper place.

I remember hurrying through this therapeutic ritual one afternoon to make it to an evening…


Fixing a data issue is never fun, but why make it harder?

Introducing Data Reproducibility

There are two types of issues in the world — reproducible and unreproducible.

A reproducible issue is one where the original conditions for an error can be recreated, allowing for the controlled manufacture of its occurrence. …


A look at the entire ecosystem of data engineering tools and technologies.

Note: This article was written by Einat Orr, PhD and published on the lakeFS blog on May 5, 2021.

Let’s start with the obvious: the lakeFS project doesn’t exist in isolation. It belongs to a larger ecosystem of data engineering tools and technologies adjacent and complementary to the problems we…


What the AWS CLI don’t know won’t hurt it.

It makes perfect sense that if you type aws s3 ls s3://my-bucket to list the contents of an S3 bucket, you would expect to connect to the genuine bucket and have its contents listed.

But there’s no hard rule that you have to connect to the real bucket. And in…


Note: This article was written by Oz Katz and was published on the lakeFS blog on 4/12/21.

Introduction

When building a data lake, there is perhaps no more consequential decision than the format data will be stored in. The outcome will have a direct effect on its performance, usability, and compatibility.

Paul Singman

DevRel @lakeFS. Ex-ML Engineering Lead @Equinox. Whisperer of data and productivity wisdom. Standing on the shoulders of giants.
