Eleven years ago, Apache Hive was born with an ambitious mission: allow querying massive amounts of data using plain SQL instead of writing complex MapReduce jobs.
For this to work efficiently, Hive needed a way to structure SQL tables on top of a filesystem-like interface. It was built on top of Hadoop, with HDFS being the most widely adopted distributed file system at the time.
Apache Hive burst onto the scene in 2010 as a component of the Hadoop ecosystem, when Hadoop was the novel and innovative way of doing big data analytics.
What Hive did was implement a SQL interface to Hadoop. Its architecture consisted of two main services:
Datasets are the foundational output of a data team. They do not appear out of thin air. No one has ever snapped their fingers and created an orders_history table.
Instead, useful sets of data are created and maintained through a process that involves several predictable steps. Managing this process is often referred to as data lifecycle management.
Put another way, data lifecycle management encapsulates two things:
The most effective data teams have a clear understanding of the steps and processes their data is subject…
One of the most common questions we receive from existing and potential users of lakeFS is “Can it work with Delta Tables?”
While the above features are powerful on their own, using Delta Tables within a lakeFS repository unlocks even more. With Delta Lake and lakeFS together, you gain additional data safety guarantees while simplifying operations.
When people visited me (I was serving as CTO of SimilarWeb at the time), they would sometimes fiddle with these items. Before leaving work on most days, I made sure everything was neatly returned to its proper place.
I remember hurrying through this therapeutic ritual one afternoon to make it to an evening performance of The Book of Mormon… when I was interrupted by a regional sales director, Adia, asking if we could speak.
She was struggling to convert new trial users to enterprise plans. …
There are two types of issues in the world — reproducible and unreproducible.
A reproducible issue is one where the original conditions for an error can be recreated, allowing the error to be triggered on demand under controlled conditions. This is the state a seasoned engineer will strive to reach when debugging an error.
It is extremely hard to solve a problem you don’t yet understand. And reproducing it is the best way to establish the necessary foothold to start down the path of solving it.
A challenge when working in data-intensive environments is recreating the prior state of a dataset — what…
Note: This article was written by Einat Orr, PhD and published on the lakeFS blog on May 5, 2021.
Let’s start with the obvious: the lakeFS project doesn’t exist in isolation. It belongs to a larger ecosystem of data engineering tools and technologies adjacent and complementary to the problems we are solving. What better way to visualize our place in this ecosystem, I thought, than by creating a cross-sectional LUMAscape to depict it.
What’s more, I believe it is critical to understand where lakeFS resides to identify opportunities where we can bring additional value to users by addressing pain points…
It makes perfect sense that if you type `aws s3 ls s3://my-bucket` to list the contents of an S3 bucket, you would expect to connect to the genuine bucket and have its contents listed.
But there’s no hard rule that you have to connect to the real bucket. And in fact, there’s a simple parameter you can pass to the above CLI command to easily connect instead to any URL of your choice.
Consider the following:
$ aws s3 ls s3://my-bucket --endpoint-url http://localhost:5000
Notice the inclusion of the argument `--endpoint-url` pointing to the localhost on port 5000? Rather than list…
When building a data lake, there is perhaps no more consequential decision than the format in which data will be stored. This choice directly affects the lake's performance, usability, and compatibility.
It is inspiring that by simply changing the format in which data is stored, we can unlock new functionality and improve the performance of the overall system.
In March of 2021, I chose to leave the data team at Equinox Media and join lakeFS, a nascent open-source project, as its first developer advocate. In this post, I share a few reasons why I’m excited about starting this new chapter and the goals I hope to accomplish.
Your first instinct when altering a piece of code is to simply open it in an editor and make the desired changes. But somewhere along the way, somebody convinced you it is worthwhile to wait for a second and do one thing first…
> git checkout -b my-amazing-branch
It’s a small…