A closer look at the Hive Metastore today: specifically, how it solves a problem (one it largely created itself, many years ago) and where it falls short amid the shifting tides in computing

Photo by おにぎり on Unsplash

Designing a SQL Table in a Filesystem Directory

Eleven years ago, Apache Hive was born with an ambitious mission: Allow querying massive amounts of data using plain SQL instead of writing complex MapReduce jobs.

For this to work efficiently, Hive needed a way to structure SQL tables on top of a Filesystem(-like) interface. It was built on top of Hadoop, with HDFS being the most widely adopted distributed file system at the time.
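
In practice, this means a table is a directory and its partitions are subdirectories. Here is a minimal sketch of that layout on HDFS, using a hypothetical events table partitioned by date (the warehouse path shown is Hive’s conventional default, and the output is abbreviated):

$ hdfs dfs -ls -R /user/hive/warehouse/events
# /user/hive/warehouse/events/dt=2021-05-01/part-00000
# /user/hive/warehouse/events/dt=2021-05-02/part-00000

A query that filters on dt only needs to list and read the matching subdirectories, which is how Hive keeps scans tractable without a storage engine of its own.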

Traditional RDBMSs and data warehouses achieve good performance through elaborate indexing schemes: clustered or non-clustered, primary and secondary, partial or complete. …


A majority of data architectures still feature the Hive Metastore. Why has it survived, and what can finally replace it?

Hive & Hadoop — A Brief History

Apache Hive burst onto the scene in 2010 as a component of the Hadoop ecosystem, back when Hadoop was the novel, innovative way of doing big data analytics.

What Hive did was implement a SQL interface to Hadoop. Its architecture consisted of two main services:

  1. A Query Engine — responsible for SQL statement execution.
  2. A Metastore — responsible for virtualizing collections of files in HDFS as tables.
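
To make the split concrete, here is a hedged sketch of how a table gets registered, using the hive CLI and a hypothetical events table (the name, columns, and location are illustrative only):

$ hive -e "CREATE EXTERNAL TABLE events (user_id BIGINT, action STRING)
           PARTITIONED BY (dt STRING)
           STORED AS PARQUET
           LOCATION '/user/hive/warehouse/events';"

The Metastore records the schema, partition columns, and HDFS location; the Query Engine later compiles a SELECT against events into jobs that read the files under that path.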


Useful datasets are created through a process that involves several predictable steps. Learn how to organize your workflows in this framework.

What is Data Lifecycle Management?

Datasets are the foundational output of a data team. They do not appear out of thin air. No one has ever snapped their fingers and created an orders_history table.

Instead, useful sets of data are created and maintained through a process that involves several predictable steps. Managing this process is often referred to as data lifecycle management.

Put another way, data lifecycle management encapsulates two things:

  1. What is required to publish new data?
  2. What is required to ensure published data is useful?

The most effective data teams have a clear understanding of the steps and processes their data is subject…


Learn how to integrate lakeFS hooks to validate data on commits.

One of the most common questions we receive from existing and potential users of lakeFS is “Can it work with Delta Tables?”

It’s understandable why we hear this question, given Delta’s rapid adoption and advanced capabilities, including:

  • Table-level ACID operations
  • Data mutations including deletes and “in-place” updates
  • Advanced partitioning and indexing abilities (with z-order)

While these features are powerful on their own, Delta Tables inside a lakeFS repository are even more powerful. With Delta Lake and lakeFS together you can enable additional data safety guarantees while simplifying operations.

For example:

  • ACID operations can span multiple Delta tables (sketched below)
  • CI/CD…
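
As a taste of how the first point plays out, here is a minimal sketch using the lakeFS CLI, lakectl (the repository and branch names are hypothetical, and the Delta writes themselves happen through whatever engine you normally use):

# create an isolated branch, update several Delta tables on it, then publish atomically
$ lakectl branch create lakefs://my-repo/etl-run --source lakefs://my-repo/main
# ... run the job, writing to the orders and customers Delta tables on etl-run ...
$ lakectl commit lakefs://my-repo/etl-run -m "update orders and customers together"
$ lakectl merge lakefs://my-repo/etl-run lakefs://my-repo/main

Either the merge lands and every table advances together, or nothing is published to main; that is what stretches ACID semantics across more than one table.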


Data teams love calculating and tracking everything with metrics. We already have the infrastructure in place to do so… yet often fail to apply the same strategy for our own work. Let’s fix that!

In my old office, there was a shelf that held tchotchkes and stuffed animals collected over the years.

When people would visit me (I was serving as CTO of SimilarWeb at the time), they would sometimes fiddle with these items. Before leaving work on most days, I made sure everything was neatly returned to its proper place.

I remember hurrying through this therapeutic ritual one afternoon to make it to an evening performance of The Book of Mormon… when I was interrupted by a regional sales director, Adia, asking if we could speak.

She was struggling to convert new trial users to enterprise plans. …


Fixing a data issue is never fun, but why make it harder?

Photo by Marcus Dall Col on Unsplash

Introducing Data Reproducibility

There are two types of issues in the world — reproducible and unreproducible.

A reproducible issue is one where the original conditions for an error can be recreated, allowing it to be triggered at will under controlled conditions. This is the state a seasoned engineer strives to reach when debugging an error.

It is extremely hard to solve a problem you don’t yet understand, and reproducing it is the best way to establish the foothold needed to start solving it.

A challenge when working in data-intensive environments is recreating the prior state of a dataset — what…
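
For a flavor of what this can look like when data is versioned, here is a hedged sketch using lakeFS commit references (the repository and path names are hypothetical):

# find the commit the failing job read its inputs from, then address data by that commit ID
$ lakectl log lakefs://my-repo/main
$ lakectl fs ls lakefs://my-repo/<commit_id>/tables/orders/

Reading through an immutable commit reference reproduces exactly the bytes the job saw, no matter what has been written to main since.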


A look at the entire ecosystem of data engineering tools and technologies.

Note: This article was written by Einat Orr, PhD and published on the lakeFS blog on May 5, 2021.

Let’s start with the obvious: the lakeFS project doesn’t exist in isolation. It belongs to a larger ecosystem of data engineering tools and technologies adjacent and complementary to the problems we are solving. What better way to visualize our place in this ecosystem, I thought, than by creating a cross-sectional LUMAscape to depict it?

What’s more, I believe it is critical to understand where lakeFS resides to identify opportunities where we can bring additional value to users by addressing pain points…


What the AWS CLI doesn’t know won’t hurt it.

Photo by ThisisEngineering RAEng on Unsplash

It makes perfect sense that if you type aws s3 ls s3://my-bucket to list the contents of an S3 bucket, you would expect to connect to the genuine bucket and have its contents listed.

But there’s no hard rule that you have to connect to the real bucket. And in fact, there’s a simple parameter you can pass to the above CLI command to easily connect instead to any URL of your choice.

Consider the following:

$ aws s3 ls s3://my-bucket --endpoint-url http://localhost:5000

Notice the inclusion of the argument --endpoint-url pointing to localhost on port 5000? Rather than list…
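
If you’re wondering what might answer on port 5000: one common choice for a local S3 stand-in is the moto library’s server mode, which has historically used that port by default. A sketch, assuming moto is what’s running (the package extra and flags may vary by version):

$ pip install 'moto[server]'
$ moto_server -p 5000 &
# the CLI now talks to the mock server instead of AWS
$ aws s3 mb s3://my-bucket --endpoint-url http://localhost:5000
$ aws s3 ls s3://my-bucket --endpoint-url http://localhost:5000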


Note: This article was written by Oz Katz and was published on the lakeFS blog on 4/12/21.

Introduction

When building a data lake, there is perhaps no more consequential decision than the format data will be stored in. The outcome will have a direct effect on its performance, usability, and compatibility.

It is inspiring that by simply changing the format data is stored in, we can unlock new functionality and improve the performance of the overall system.

Apache Hudi, Apache Iceberg, and Delta Lake are the current best-of-breed formats designed for data lakes. …
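
What sets these formats apart from a plain directory of Parquet files is the metadata they carry alongside the data. A Delta table, for instance, keeps a transaction log of JSON commit files under _delta_log; a quick look, with a hypothetical bucket and table path (output abbreviated):

$ aws s3 ls s3://my-bucket/tables/orders/_delta_log/
# 00000000000000000000.json
# 00000000000000000001.json

That log is what enables features like ACID transactions and time travel on top of object storage.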


Thoughts on a personal journey into the world of developer advocacy at an open-source data project.

In March of 2021, I chose to leave the data team at Equinox Media and join lakeFS, a nascent open-source project, as its first developer advocate. In this post, I share a few reasons why I’m excited about starting this new chapter and the goals I hope to accomplish.

lakeFS the Project

Your first instinct when altering a piece of code is to simply open it in an editor and make the desired changes. But somewhere along the way, somebody convinced you it is worthwhile to wait for a second and do one thing first…

> git checkout -b my-amazing-branch

It’s a small…

Paul Singman

DevRel @lakeFS. Ex-ML Engineering Lead @Equinox. Whisperer of data and productivity wisdom. Standing on the shoulders of giants.
