Fixing a data issue is never fun, but why make it harder?

Photo by Marcus Dall Col on Unsplash

Introducing Data Reproducibility

There are two types of issues in the world — reproducible and unreproducible.

A reproducible issue is one where the original conditions for an error can be recreated, so the error can be triggered on demand in a controlled way. This is the state a seasoned engineer will strive to reach when debugging an error.

It is extremely hard to solve a problem you don’t yet understand. And reproducing it is the best way to establish the necessary foothold to start down the path of solving it.

A challenge when working in data-intensive environments is recreating the prior state of a dataset — what…

A look at the entire ecosystem of data engineering tools and technologies.

Note: This article was written by Einat Orr, PhD and published on the lakeFS blog on May 5, 2021.

Let’s start with the obvious: the lakeFS project doesn’t exist in isolation. It belongs to a larger ecosystem of data engineering tools and technologies adjacent and complementary to the problems we are solving. What better way to visualize our place in this ecosystem, I thought, than by creating a cross-sectional LUMAscape to depict it?

What’s more, I believe it is critical to understand where lakeFS resides to identify opportunities where we can bring additional value to users by addressing pain points…

What the AWS CLI don’t know won’t hurt it.

Photo by ThisisEngineering RAEng on Unsplash

It makes perfect sense that if you type aws s3 ls s3://my-bucket to list the contents of an S3 bucket, you would expect to connect to the genuine bucket and have its contents listed.

But there’s no hard rule that you have to connect to the real bucket. And in fact, there’s a simple parameter you can pass to the above CLI command to easily connect instead to any URL of your choice.

Consider the following:

$ aws s3 ls s3://my-bucket --endpoint-url http://localhost:5000

Notice the inclusion of the argument --endpoint-url pointing to the localhost on port 5000? Rather than list…
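To make the idea concrete, here is a rough conceptual sketch in Python of what an endpoint override changes: only the base address the request is sent to, not the bucket being asked for. This is not how the AWS CLI is implemented. The function name `list_request_url`, the `DEFAULT_ENDPOINT` constant, and the path-style URL layout are all assumptions made for illustration.

```python
from typing import Optional
from urllib.parse import urlsplit

# Assumption: a generic S3 endpoint used purely for illustration.
DEFAULT_ENDPOINT = "https://s3.amazonaws.com"


def list_request_url(s3_uri: str, endpoint_url: Optional[str] = None) -> str:
    """Return the (simplified, path-style) URL a bucket listing would target.

    Hypothetical helper: it only shows that overriding the endpoint swaps
    the host being contacted while the bucket name stays the same.
    """
    parts = urlsplit(s3_uri)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"expected an s3://bucket URI, got {s3_uri!r}")
    bucket = parts.netloc
    # Fall back to the default endpoint when no override is given.
    base = (endpoint_url or DEFAULT_ENDPOINT).rstrip("/")
    return f"{base}/{bucket}"


default_url = list_request_url("s3://my-bucket")
local_url = list_request_url("s3://my-bucket", "http://localhost:5000")
print(default_url)
print(local_url)
```

In this simplified picture, whatever service happens to be listening on localhost port 5000 receives the request and decides how to answer it.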

Note: This article was written by Oz Katz and was published on the lakeFS blog on 4/12/21.


When building a data lake, there is perhaps no more consequential decision than the format data will be stored in. The outcome will have a direct effect on its performance, usability, and compatibility.

It is inspiring that by simply changing the format data is stored in, we can unlock new functionality and improve the performance of the overall system.

Apache Hudi, Apache Iceberg, and Delta Lake are the current best-in-breed formats designed for data lakes. …

Thoughts on a personal journey into the world of developer advocacy at an open-source data project.

In March of 2021, I chose to leave the data team at Equinox Media and join a nascent open-source project, lakeFS, as the first developer advocate. In this post, I share a few reasons why I’m excited about starting this new chapter and the goals I hope to accomplish.

lakeFS the Project

Your first instinct when altering a piece of code is to simply open it in an editor and make the desired changes. But somewhere along the way, somebody convinced you it is worthwhile to wait for a second and do one thing first…

> git checkout -b my-amazing-branch

It’s a small…

Rid yourself of these troubling habits and start the journey towards data lake mastery

Photo by Ali Zbeeb on Unsplash


Data lakes offer tantalizing performance upside, which is a major reason for their high rate of adoption. Sometimes though, the promise of technological performance can overshadow an unpleasant developer experience.

This is troublesome, since I believe the developer experience is just as important, if not more so, in proving the worth of a technology or paradigm.

When creating and maintaining a complex system like a data lake, unfriendly user workflows and interfaces can sap productivity, similar to an application with too much tech debt or poor documentation.

Anti-Pattern #1

You click around the S3 (or comparable) storage console often

One symptom of unfriendly workflows with a data lake is spending too much time in the…

What can we learn from high performers in other disciplines?

Photo by Evgeniya Litovchenko on Unsplash

You have a burning passion inside you.

Maybe you feel it in this moment, or maybe it’s a feeling you lost touch with lately due to circumstance.

The beauty of this passion — with it, you are capable of great things.

“The emotions you are feeling at this very moment are a gift, a guideline, a support system, a call to action. If you suppress your emotions and try to drive them out of your life… you’re squandering one of life’s most precious resources” — Tony Robbins, Awaken the Giant Within

Keeping the flame alive

One of the ways to keep the flame of your passion alive is to envision a…

If you’re as sick of this three-letter phrase as I am, you’ll be happy to know there is another way.

Take a Look Around You…

If you work in data in 2021, the acronym ETL is everywhere.

Ask certain people what they do, and their whole response will be “ETL.” On LinkedIn, there are thousands of people with the title ETL Developer. It can be a noun, verb, adjective, and even a preposition. (Yes, a mouse can ETL a house.)

Standing for “Extract, Transform, and Load,” ETL refers to the general process of taking batches of data out of one database or application and loading it into another.
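For anyone who has not seen one up close, a bare-bones ETL job might look something like the sketch below. It uses Python's built-in sqlite3 module with two in-memory databases standing in for the source and the destination. The table names, columns, and the `run_etl` function are made up for illustration, and a real pipeline would involve far more than this.

```python
import sqlite3


def run_etl(source: sqlite3.Connection, target: sqlite3.Connection) -> int:
    """Copy orders from source to target, cleaning them along the way."""
    # Extract: pull a batch of rows out of the source database.
    rows = source.execute(
        "SELECT id, email, amount_cents FROM orders"
    ).fetchall()

    # Transform: normalize emails and convert cents to dollars.
    cleaned = [
        (order_id, email.strip().lower(), cents / 100)
        for order_id, email, cents in rows
    ]

    # Load: write the cleaned rows into the target database.
    target.execute(
        "CREATE TABLE IF NOT EXISTS orders_clean "
        "(id INTEGER PRIMARY KEY, email TEXT, amount_usd REAL)"
    )
    target.executemany("INSERT INTO orders_clean VALUES (?, ?, ?)", cleaned)
    target.commit()
    return len(cleaned)


# Hypothetical sample data, for illustration only.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, email TEXT, amount_cents INTEGER)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, " Ada@Example.com ", 1250), (2, "bob@example.com", 999)],
)
target = sqlite3.connect(":memory:")

loaded = run_etl(source, target)
print(f"Loaded {loaded} rows")
```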

Data teams are the masters of ETL as they often have to stick their grubby fingers into…

Can you consider yourself a great developer if you aren’t producing quality code?

Photo by Markus Spiske on Unsplash

You want to work with great code

Code that makes you grow. Code that motivates you to write great code. Code that demonstrates mastery of fundamental concepts. Code that reflects thoughtfulness and care by its creator. Code that inspires.

Code that blows the hair you have left, straight back.

But I don’t have time to write beautiful code…

Look, I get it. Projects have deadlines. Jira tickets have point estimations. Getting something working in the quickest way possible is often a smart approach in a practical, corporate setting.

But if you find yourself always in that mindset, it’s a sign that you need to invest in yourself and grow. The lack of time you perceive for…

Learning why the Webster’s Director of Analytics doesn’t immediately fix problems, but takes the time to understand them first.

They say to not wait for a promotion.

Instead, to start assuming the responsibilities of the job you want.

Someone who embodied this advice is Luigi de Guzman, a common Data Analyst when I joined the data team at Vroom in 2017.

Uncommon was the fact that if you spoke to anyone who also worked there, they’d tell you he was one of the most integral and respected employees in the whole company.

How did he do this?

If I had to boil it down to one sentence, I’d say it was a relentless attention to the details of how business processes worked.

To learn more about his mindset…

Paul Singman

DevRel @lakeFS. Ex-ML Engineering Lead @Equinox. Whisperer of data and productivity wisdom. Standing on the shoulders of giants.
