An important part of developing an open source project is assisting and advising users. When they run into an issue and feel pain, we want to feel that pain, too. Quite literally.
This means recreating the environment, running the same code, and raising the same error.
Eleven years ago, Apache Hive was born with an ambitious mission: Allow querying massive amounts of data using plain SQL instead of writing complex MapReduce jobs.
For this to work efficiently, Hive needed a way to structure SQL tables on top of a filesystem(-like) interface. …
Apache Hive burst onto the scene in 2010 as a component of the Hadoop ecosystem, when Hadoop was the novel and innovative way of doing big data analytics.
What Hive did was implement a SQL interface to Hadoop. Its architecture consisted of two main services:
Datasets are the foundational output of a data team. They do not appear out of thin air. No one has ever snapped their fingers and created an orders_history table.
Instead, useful sets of data are created and maintained through a process that involves several predictable steps. …
One of the most common questions we receive from existing and potential users of lakeFS is “Can it work with Delta Tables?”
When people visited me during my time as CTO of SimilarWeb, they would sometimes fiddle with these items. Before leaving work on most days, I made sure everything was neatly returned to its proper place.
I remember hurrying through this therapeutic ritual one afternoon to make it to an evening…
There are two types of issues in the world: reproducible and unreproducible.
A reproducible issue is one where the original conditions for an error can be recreated, allowing for the controlled manufacture of its occurrence. …
Note: This article was written by Einat Orr, PhD and published on the lakeFS blog on May 5, 2021.
It makes perfect sense that if you type aws s3 ls s3://my-bucket to list the contents of an S3 bucket, you would expect to connect to the genuine bucket and have its contents listed.
When building a data lake, there is perhaps no more consequential decision than the format data will be stored in. The outcome will have a direct effect on its performance, usability, and compatibility.