Note: This article was written by Einat Orr, PhD and published on the lakeFS blog on May 5, 2021.
Let’s start with the obvious: the lakeFS project doesn’t exist in isolation. It belongs to a larger ecosystem of data engineering tools and technologies adjacent and complementary to the problems we are solving. What better way to visualize our place in this ecosystem, I thought, than by creating a cross-sectional LUMAscape to depict it?
What’s more, I believe it is critical to understand where lakeFS resides to identify opportunities where we can bring additional value to users by addressing pain points…
It makes perfect sense that if you type aws s3 ls s3://my-bucket to list the contents of an S3 bucket, you would expect to connect to the genuine bucket and have its contents listed.
But there’s no hard rule that you have to connect to the real bucket. And in fact, there’s a simple parameter you can pass to the above CLI command to easily connect instead to any URL of your choice.
Consider the following:
$ aws s3 ls s3://my-bucket --endpoint-url http://localhost:5000
Notice the inclusion of the --endpoint-url argument pointing to localhost on port 5000? Rather than list…
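To make the override concrete, here is a minimal sketch of what an endpoint override means conceptually: the client rewrites an s3:// URI into an HTTP request against whatever endpoint you supply instead of the real AWS S3 endpoint. The helper function and the path-style URL layout below are illustrative assumptions, not the AWS CLI’s actual internals:

```python
# Hypothetical helper illustrating what --endpoint-url does conceptually:
# an s3://bucket/key URI is resolved into an HTTP URL against either the
# default AWS endpoint or the custom one you pass in.
def resolve_s3_url(s3_uri: str, endpoint_url: str = "https://s3.amazonaws.com") -> str:
    assert s3_uri.startswith("s3://"), "expected an s3:// URI"
    bucket, _, key = s3_uri[len("s3://"):].partition("/")
    # Path-style addressing: <endpoint>/<bucket>/<key>
    return f"{endpoint_url.rstrip('/')}/{bucket}/{key}".rstrip("/")

# Default: requests go to the genuine AWS S3 endpoint.
print(resolve_s3_url("s3://my-bucket/data.csv"))
# → https://s3.amazonaws.com/my-bucket/data.csv

# With an override: the very same URI is served by localhost:5000 instead.
print(resolve_s3_url("s3://my-bucket/data.csv", "http://localhost:5000"))
# → http://localhost:5000/my-bucket/data.csv
```

Any service that speaks the S3 API and listens on that endpoint can then answer the request in place of AWS.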
When building a data lake, there is perhaps no more consequential decision than the format data will be stored in. The outcome will have a direct effect on its performance, usability, and compatibility.
It is inspiring that by simply changing the format data is stored in, we can unlock new functionality and improve the performance of the overall system.
In March of 2021, I chose to leave the data team at Equinox Media and join lakeFS, a nascent open-source project, as its first developer advocate. In this post, I share a few reasons why I’m excited about starting this new chapter and the goals I hope to accomplish.
Your first instinct when altering a piece of code is to simply open it in an editor and make the desired changes. But somewhere along the way, somebody convinced you it is worth pausing for a second to do one thing first…
> git checkout -b my-amazing-branch
It’s a small…
Data lakes offer tantalizing performance upside, which is a major reason for their high rate of adoption. Sometimes though, the promise of technological performance can overshadow an unpleasant developer experience.
This is troublesome, since I believe the developer experience is just as important, if not more so, in proving the worth of a technology or paradigm.
When creating and maintaining a complex system like a data lake, unfriendly user workflows and interfaces can sap productivity, similar to an application with too much tech debt or poor documentation.
One symptom of unfriendly workflows with a data lake is spending too much time in the…
Maybe you feel it in this moment, or maybe it’s a feeling you lost touch with lately due to circumstance.
The beauty of this passion: with it, you are capable of great things.
“The emotions you are feeling at this very moment are a gift, a guideline, a support system, a call to action. If you suppress your emotions and try to drive them out of your life… you’re squandering one of life’s most precious resources” — Tony Robbins, Awaken the Giant Within
One of the ways to keep the flame of your passion alive is to envision a…
If you work in data in 2021, the acronym ETL is everywhere.
Ask certain people what they do, and their whole response will be “ETL.” On LinkedIn, there are thousands of people with the title ETL Developer. It can be a noun, verb, adjective, and even a preposition. (Yes, a mouse can ETL a house.)
Standing for “Extract, Transform, and Load,” ETL refers to the general process of taking batches of data out of one database or application and loading it into another.
Data teams are the masters of ETL as they often have to stick their grubby fingers into…
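The extract-transform-load cycle described above can be sketched end to end in a few lines. This is a toy illustration with invented data: both “databases” are plain dicts, and the field names and transform are assumptions, not any particular team’s pipeline:

```python
# A toy ETL pipeline: extract a batch of rows from one store, transform
# them to fit the destination schema, and load them into another store.
# In practice the stores would be real databases or applications.
source_db = {"orders": [
    {"id": 1, "amount": "19.99", "currency": "usd"},
    {"id": 2, "amount": "5.00", "currency": "usd"},
]}
target_db = {"orders_clean": []}

def extract(db, table):
    # Pull a batch of raw records out of the source system.
    return list(db[table])

def transform(rows):
    # Normalize types and field names for the destination schema:
    # string dollar amounts become integer cents.
    return [{"order_id": r["id"],
             "amount_cents": round(float(r["amount"]) * 100)}
            for r in rows]

def load(db, table, rows):
    # Write the transformed batch into the target system.
    db[table].extend(rows)

load(target_db, "orders_clean", transform(extract(source_db, "orders")))
print(target_db["orders_clean"])
# → [{'order_id': 1, 'amount_cents': 1999}, {'order_id': 2, 'amount_cents': 500}]
```

Real pipelines add scheduling, incremental extraction, and error handling on top, but the three-stage shape stays the same.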
Code that makes you grow. Code that motivates you to write great code. Code that demonstrates mastery of fundamental concepts. Code that reflects thoughtfulness and care by its creator. Code that inspires.
Code that blows the hair you have left, straight back.
Look, I get it. Projects have deadlines. Jira tickets have point estimates. Getting something working in the quickest way possible is often a smart approach in a practical, corporate setting.
But if you find yourself always in that mindset, it’s a sign that you need to invest in yourself and grow. The lack of time you perceive for…
Instead, start assuming the responsibilities of the job you want.
Someone who embodied this advice is Luigi de Guzman, a common Data Analyst when I joined the data team at Vroom in 2017.
Uncommon was the fact that if you spoke to anyone who worked there, they’d tell you he was one of the most integral and respected employees in the whole company.
How did he do this?
If I had to boil it down to one sentence, I’d say it was a relentless attention to detail in how business processes worked.
To learn more about his mindset…
Say you are an awesome developer sitting contentedly at your desk when a Slack message suddenly interrupts your peaceful mental flow:
It would appear there is a data issue with the new Activity History service released last month… Or at least a couple of people think there is.
Instead of making progress on new tasks, you now need to drop them and look into what’s happening here.
What this Activity History service does is calculate and then expose counts of how many times users have used the company’s application.
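At its core, a service like that boils down to aggregating usage events per user. The sketch below is hypothetical: the event shape, the sample data, and the `activity_counts` function are all invented for illustration:

```python
from collections import Counter

# Hypothetical raw usage events, as the service might receive them.
events = [
    {"user_id": "alice", "action": "open_app"},
    {"user_id": "bob",   "action": "open_app"},
    {"user_id": "alice", "action": "open_app"},
]

def activity_counts(events):
    # Count how many times each user has used the application.
    return Counter(e["user_id"] for e in events)

counts = activity_counts(events)
print(counts["alice"])  # 2
print(counts["bob"])    # 1
```

A data issue in such a service typically means these counts drift from the raw event log, which is exactly what the interrupted developer now has to verify.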
DevRel @lakeFS. Ex-ML Engineering Lead @Equinox. Whisperer of data and productivity wisdom. Standing on the shoulders of giants.