3 Ways to Add Data to lakeFS

Few people start using lakeFS without first having some data collected. Consequently, one of the first things people do after getting it up and running is import their existing data into lakeFS.

Two Things to Know Before We Begin

You should know the values for two critical pieces of information before we go any further.

  1. Your lakeFS endpoint URL — This is the address of your lakeFS installation’s S3 Gateway. If testing locally, it will likely be http://localhost:8000. If you have a cloud-deployed lakeFS installation, you should have a DNS record pointing to the server, something like lakefs.example.com. Know this value; it’ll be used in several places.
  2. Your lakeFS credentials — These are the Key ID and Secret Key generated when you first set up lakeFS and downloaded a lakectl.yaml file, or that your lakeFS administrator created for you. Depending on which method below you use, these credentials will live in your ~/.aws/credentials file (for the AWS CLI) or your .lakectl.yaml file (for lakectl).

Single Local File Copy (AWS CLI)

The Situation — The marketing expert at your company sends you a CSV file of all the customers he sent promotional emails to in the past month. You would like to add this file (currently sitting in your local Downloads folder) to your data lake so it’s available for any future analysis of these customers.

The Command

The following command copies a file called customer_promo_2021-11.csv from my local ~/Downloads folder onto the main branch of a lakeFS repository called my-repo, under the path marketing/customer_promo_2021-11.csv.

aws --profile lakefs \
--endpoint-url https://penv.lakefs.dev \
s3 cp ~/Downloads/customer_promo_2021-11.csv s3://my-repo/main/marketing/customer_promo_2021-11.csv
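
If the copy succeeds, you can sanity-check it by listing the destination path through the same S3 gateway. This is just a quick verification sketch reusing the endpoint, profile, and repo names from above:

aws --profile lakefs \
--endpoint-url https://penv.lakefs.dev \
s3 ls s3://my-repo/main/marketing/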

Configuration

In order for this to work, our ~/.aws/credentials file needs an entry for lakeFS. Here’s what mine looks like:

[default]
aws_access_key_id=AKIAMYACTUALAWSCREDS
aws_secret_access_key=EXAMPLEj2fnHf73J9jkke/e3ea4D
[lakefs]
aws_access_key_id=AKIAJRKP6EXAMPLE
aws_secret_access_key=EXAMPLEYC5wcWOgF36peXniwEJn5kwncw32
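
If you’d rather not edit the file by hand, you can create the same profile interactively and paste in your lakeFS Key ID and Secret Key at the prompts (the profile name lakefs is arbitrary, as long as it matches what you pass to --profile):

aws configure --profile lakefs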

Copy Data Without Copying Data (lakectl ingest)

As the lakeFS docs beautifully state:

The lakectl command line tool supports ingesting objects from a source object store without actually copying the data itself. This is done by listing the source bucket (and optional prefix), and creating pointers to the returned objects in lakeFS.

Note that unlike the AWS CLI file copy command above, this works for data already in an object store. We’ll show how to ingest the same customer_promo_2021-11.csv file as last time; this time, instead of sitting on our local computer, it lives in an S3 bucket named my-beautiful-s3-bucket.

The Command

The parameters for the lakectl ingest command are quite straightforward. We simply use the --from and --to parameters to point to the S3 path where the file(s) live and to where in the lakeFS repo we want the objects to exist. The first command below ingests a single object; the second ingests everything under a prefix.

lakectl ingest \
--from s3://my-beautiful-s3-bucket/customer_promo_2021-11.csv \
--to lakefs://my-repo/main/marketing/customer_promo_2021-11.csv

lakectl ingest \
--from s3://my-beautiful-s3-bucket/customer_promos/ \
--to lakefs://my-repo/main/marketing/customer_promos/
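
Keep in mind that, like any other write to a branch, the ingested pointers show up as uncommitted changes, so you’ll still want to commit them. A minimal sketch, assuming the same repo and branch as above:

lakectl commit lakefs://my-repo/main \
-m "Ingest customer promo data from my-beautiful-s3-bucket"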

Configuration

In order for the lakectl ingest command to work, lakectl needs to be installed and configured with your lakeFS credentials and endpoint. By default it reads these from a .lakectl.yaml file in your home directory, which looks like this:

credentials:
  access_key_id: AKIAJRKP6EXAMPLE
  secret_access_key: EXAMPLEYC5wcWOgF36peXniwEJn5kwncw32
server:
  endpoint_url: https://penv.lakefs.dev
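
A quick way to confirm that lakectl can reach your server with these credentials is to list the repositories it can see:

lakectl repo list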

Large-Scale Imports (lakeFS inventory imports)

For even larger data collections, the lakeFS binary comes pre-packaged with an import utility that can handle many, many millions of objects. It works by taking advantage of S3’s Inventory feature to create an efficient snapshot of your bucket.

The Command

The following command imports the data summarized in an S3 Inventory stored in the bucket my-beautiful-s3-bucket-inventory into a lakeFS repository named my-repo. The lines after the command show the output of a successful run:

lakefs import \
lakefs://my-repo \
-m s3://my-beautiful-s3-bucket-inventory/my-beautiful-bucket/my-beautiful-inventory/2021-10-25T00-00Z/manifest.json \
--config .lakefs.yaml

Inventory (2021-10-24) Files Read     1 / 1    done
Inventory (2021-10-24) Current File   1 / 1    done
Commit progress                       0        done
Objects imported                      1        done
Added or changed objects: 1
Commit ref: 3c1e4222cf2ac89a5c3a9fdd99d106f8bf225e2a17ac013ffae6d19f844420d0
Import to branch import-from-inventory finished successfully.
To list imported objects, run:
$ lakectl fs ls lakefs://my-repo@3c1e4222cf2ac89a5c3a9fdd99d106f8bf225e2a17ac013ffae6d19f844420d0/
To merge the changes to your main branch, run:
$ lakectl merge lakefs://my-repo@3c1e4222cf2ac89a5c3a9fdd99d106f8bf225e2a17ac013ffae6d19f844420d0 lakefs://my-repo@main

Creating an S3 Inventory

To create the inventory in the AWS console, go to the Management tab of the data bucket, click “Create inventory configuration”, and follow the prompts to choose a destination bucket, an output format, and a schedule.
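
If you prefer scripting this over clicking through the console, the same inventory can be created with the AWS CLI. The sketch below is illustrative only: the bucket names, inventory ID, and schedule are placeholders, and you should check the lakeFS docs for the inventory format and fields it expects before relying on it.

aws s3api put-bucket-inventory-configuration \
--bucket my-beautiful-s3-bucket \
--id my-beautiful-inventory \
--inventory-configuration '{
  "Id": "my-beautiful-inventory",
  "IsEnabled": true,
  "IncludedObjectVersions": "Current",
  "Schedule": {"Frequency": "Daily"},
  "OptionalFields": ["Size", "LastModifiedDate", "ETag"],
  "Destination": {
    "S3BucketDestination": {
      "Bucket": "arn:aws:s3:::my-beautiful-s3-bucket-inventory",
      "Format": "Parquet"
    }
  }
}'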

Configuration

Running lakeFS depends on a configuration file, documented in the lakeFS configuration reference; a minimal sketch of one follows the list below. By default, lakeFS looks for the following configuration files:

./config.yaml
$HOME/lakefs/config.yaml
/etc/lakefs/config.yaml
$HOME/.lakefs.yaml
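
For reference (this is the kind of file passed via --config .lakefs.yaml in the import command above), here’s a minimal sketch of what it might contain for an S3-backed installation. The values are placeholders; the authoritative list of keys is in the lakeFS configuration reference:

database:
  connection_string: "postgres://user:password@localhost:5432/postgres?sslmode=disable"
auth:
  encrypt:
    secret_key: "replace-with-a-random-secret"
blockstore:
  type: s3
  s3:
    region: us-east-1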

Wrapping Up

Whether adding data big or small, I hope this article has been helpful for getting your lakeFS instance hydrated with data! Although we covered three ways to do so, it’s worth noting that two other methods exist — Rclone and distcp.

Still have questions about data and lakeFS?
