Thoughts on the Future of the Databricks Ecosystem

Databricks has come a long way since growing out of UC Berkeley's AMPLab in 2013 with an open-source distributed computing framework called Apache Spark.

Paul Singman
Whispering Data

--

Fast forward eight years, and in addition to the core Spark product there are a dizzying number of new features in various stages of public preview within the Databricks platform. In case you haven’t been keeping up, these include:

Taken together, these features provide functionality comparable to the set of tools commonly referred to as the “Modern Data Stack”.

The result is a noticeably consolidated data stack, almost entirely contained within the Databricks ecosystem.

Some people cheer for this type of consolidation, tired of spending time fitting together pieces of an analytics puzzle that don’t necessarily want to get along. Others believe an unbundled architecture is preferable, allowing users to mix and match tools specialized for a specific purpose.

In truth, there’s no clear answer as to who is right. It depends largely on the execution of the different companies competing in the space. For its part, lakeFS is largely agnostic in this battle, as it fits at a foundational level of nearly any stack.

Given its positioning, Databricks sees value in growing the data lake ecosystem, which includes lakeFS. Consequently, we’ve started to collaborate more closely with members of the Databricks team, on both content and product.

Data + AI Online Meetup Recap

One of the first outcomes of this collaboration is a joint meetup presentation that Denny Lee and I gave.

The Topic: Multi-Transactional Guarantees with Delta Lake and lakeFS.

The Key Takeaway: The version-controlled workflows enabled by lakeFS allow you to expose new data from multiple datasets in one atomic merge operation. This prevents consumers of the data from seeing an inconsistent view, which can lead to incorrect metrics.
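To make the atomicity point concrete, here’s a minimal sketch of that workflow using the lakefs_client Python SDK. The endpoint, credentials, repository, and branch names are hypothetical placeholders, not values from the talk:

```python
from lakefs_client import Configuration
from lakefs_client.client import LakeFSClient
from lakefs_client.models import CommitCreation

# Hypothetical endpoint and credentials; substitute your own installation's values.
configuration = Configuration(host="https://lakefs.example.com")
configuration.username = "<LAKEFS_ACCESS_KEY_ID>"
configuration.password = "<LAKEFS_SECRET_ACCESS_KEY>"
client = LakeFSClient(configuration)

# Commit updates to multiple datasets together on an isolated branch...
client.commits.commit(
    repository="example-repo",
    branch="new-daily-data",
    commit_creation=CommitCreation(message="Update events and users tables together"),
)

# ...then expose all of them to consumers in a single atomic merge to main.
client.refs.merge_into_branch(
    repository="example-repo",
    source_ref="new-daily-data",
    destination_branch="main",
)
```

Until that final merge runs, anyone reading from main sees the previous, fully consistent state of every dataset involved.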

After showing how to configure a Spark cluster to read/write from a lakeFS repo, I hopped into a demo of running a data validation check with a Databricks Job and lakeFS pre-merge hook.
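For reference, the Spark-side setup amounts to pointing the S3A filesystem at the lakeFS S3 gateway, after which paths address data as s3a://&lt;repository&gt;/&lt;branch&gt;/&lt;path&gt;. Here’s a minimal PySpark sketch assuming a hypothetical lakeFS endpoint and repository name:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakefs-delta-demo")
    # Route S3A calls through the lakeFS S3 gateway (hypothetical endpoint/credentials).
    .config("spark.hadoop.fs.s3a.endpoint", "https://lakefs.example.com")
    .config("spark.hadoop.fs.s3a.access.key", "<LAKEFS_ACCESS_KEY_ID>")
    .config("spark.hadoop.fs.s3a.secret.key", "<LAKEFS_SECRET_ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read a Delta table from the main branch and write changes to an isolated branch.
df = spark.read.format("delta").load("s3a://example-repo/main/tables/events")
df.write.format("delta").mode("overwrite").save(
    "s3a://example-repo/new-daily-data/tables/events"
)
```

On a Databricks cluster you’d typically put the same fs.s3a.* settings in the cluster’s Spark configuration rather than the session builder, but the effect is the same.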

Check out the full talk below!

Interested in learning more?

Originally published on the lakeFS blog.
