3 Ways to Add Data to lakeFS
Few people start using lakeFS without first having some data collected. Consequently, it is common that after getting it up and running, one of the first things people do is import their existing data to lakeFS.
There isn’t a one-size-fits-all approach to importing data. Instead, there are ways that work great for a single file, and some that are designed to handle millions of them.
Let’s walk through, in detail, how it’s done for each situation.
Ready? Let’s go.
Two Things to Know Before We Begin
You should know the values for two critical pieces of information before we go any further.
- Your lakeFS endpoint URL — This is the address of your lakeFS installation’s S3 Gateway. If testing locally, it will likely be http://localhost:8000. If you have a cloud-deployed lakeFS installation, you should have a DNS record pointing to the server, something like lakefs.example.com. Know this value; it’ll be used in several places.
- Your lakeFS credentials — These are the Key ID and Secret Key generated when you first set up lakeFS and downloaded a lakectl.yaml file. Alternatively, your lakeFS administrator may have set up a user for you and sent you the key and ID.
These credentials will be used in the following configuration files:
~/.aws/credentials
.lakectl.yaml
With that out of the way, let’s get started!
Single Local File Copy (AWS CLI)
The Situation — The marketing expert at your company sends you a CSV file of all customers he sent promotional emails to in the past month. You would like to add this file (which currently sits in your local Downloads folder) to your data lake for availability in potential analyses of these customers.
To do this, we’ll use the AWS CLI to copy the file into our lakeFS repo.
The Command
The following command copies a file called customer_promo_2021-11.csv in my local ~/Downloads folder onto the main branch of a lakeFS repository called my-repo, under the path marketing/customer_promo_2021-11.csv.
aws --profile lakefs \
--endpoint-url https://penv.lakefs.dev \
s3 cp ~/Downloads/customer_promo_2021-11.csv s3://my-repo/main/marketing/customer_promo_2021-11.csv
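If the copy succeeds, you can verify the object landed where you expect by listing the prefix through the same S3 Gateway. A quick sanity check, reusing the profile and endpoint from above:
aws --profile lakefs \
--endpoint-url https://penv.lakefs.dev \
s3 ls s3://my-repo/main/marketing/
If everything is configured correctly, you should see customer_promo_2021-11.csv in the listing.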
Configuration
In order for this to work, we need to make sure our ~/.aws/credentials file has an entry for lakeFS. Here’s what mine looks like:
[default]
aws_access_key_id=AKIAMYACTUALAWSCREDS
aws_secret_access_key=EXAMPLEj2fnHf73J9jkke/e3ea4D

[lakefs]
aws_access_key_id=AKIAJRKP6EXAMPLE
aws_secret_access_key=EXAMPLEYC5wcWOgF36peXniwEJn5kwncw32
This works because the --endpoint-url parameter overrides the AWS CLI’s default S3 endpoint, allowing you to direct the requests to your lakeFS installation instead.
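If you end up running many commands against lakeFS this way, one optional convenience is to wrap the profile and endpoint in a shell alias (the alias name below is just something I made up; use whatever you like):
alias awslakefs='aws --profile lakefs --endpoint-url https://penv.lakefs.dev'
awslakefs s3 ls s3://my-repo/main/
Every awslakefs command then goes through the lakeFS S3 Gateway without retyping the flags.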
Copy Data Without Copying Data (lakectl ingest)
As the lakeFS docs beautifully state:
The lakectl command line tool supports ingesting objects from a source object store without actually copying the data itself. This is done by listing the source bucket (and optional prefix), and creating pointers to the returned objects in lakeFS.
Note that unlike the AWS CLI file copy command above, this works for data already in an object store. We’ll show how to copy the same customer_promo_2021-11.csv file as last time. Instead of being on our local computer though, now it’ll be located in an S3 bucket named my-beautiful-s3-bucket.
The Command
The parameters for the lakectl ingest command are quite straightforward. We simply use the --from and --to params to point to the S3 prefix where the file(s) are located, and where in the lakeFS repo we want the objects to exist.
lakectl ingest \
--from s3://my-beautiful-s3-bucket/customer_promo_2021-11.csv \
--to lakefs://my-repo/main/marketing/customer_promo_2021-11.csv
This works for both single files and multiple files. If instead we had an S3 prefix /customer_promos/ in our beautiful S3 bucket with multiple CSV files, we could ingest all of them to lakeFS with the command:
lakectl ingest \
--from s3://my-beautiful-s3-bucket/customer_promos/ \
--to lakefs://my-repo/main/marketing/customer_promos/
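One thing to keep in mind: like any other write to lakeFS, the ingested objects land as uncommitted changes on the target branch, so you’ll want to commit them when you’re done. A minimal sketch, with an example commit message:
lakectl commit lakefs://my-repo/main \
-m "Ingest customer promo CSVs from my-beautiful-s3-bucket"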
Configuration
In order for the lakectl ingest command to work, we need to make sure our command line is set up to run lakectl commands.
As part of the lakeFS tutorial series, I made a 4-minute video explaining exactly how to do this.
Note in the video I use the lakectl config helper command to configure lakectl. You could also edit the $HOME/.lakectl.yaml config file directly. Using the same example lakeFS host and credentials as before, my .lakectl.yaml file looks like:
credentials:
access_key_id: AKIAJRKP6EXAMPLE
secret_access_key: EXAMPLEYC5wcWOgF36peXniwEJn5kwncw32
server:
endpoint_url: https://penv.lakefs.dev
Of course, you would not use these exact values; instead, use your own lakeFS credentials and domain host.
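Once the file is in place, a quick way to confirm lakectl can talk to your installation is to list the repositories it can see:
lakectl repo list
If the endpoint or credentials are wrong, this fails with a connection or authentication error, which makes it a handy sanity check before running lakectl ingest.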
Large-Scale Imports (lakeFS inventory imports)
For even larger data collections, the lakeFS binary comes pre-packaged with an import utility that can handle many, many millions of objects. The way it works is by taking advantage of S3’s Inventory feature to create an efficient snapshot of your bucket.
The Command
The following command imports the data described by an S3 inventory stored in the bucket my-beautiful-s3-bucket-inventory to a lakeFS repository named my-repo.
lakefs import \
lakefs://my-repo \
-m s3://my-beautiful-s3-bucket-inventory/my-beautiful-s3-bucket/my-beautiful-inventory/2021-10-25T00-00Z/manifest.json \
--config .lakefs.yaml
Note that we shouldn’t save the inventory of an S3 bucket in the bucket itself, as this would create a recursive mess that is better discussed in the pages of Gödel, Escher, Bach.
Anyway, for the data stored in my-beautiful-s3-bucket that we want to import to lakeFS, we create a second bucket named my-beautiful-s3-bucket-inventory (though it could be called anything) and point the S3 Inventory configuration at it as the report destination.
If the inventory import works, you’ll see a response in the terminal like this:
Inventory (2021-10-24) Files Read 1 / 1 done
Inventory (2021-10-24) Current File 1 / 1 done
Commit progress 0 done
Objects imported 1 done
Added or changed objects: 1
Commit ref:3c1e4222cf2ac89a5c3a9fdd99d106f8bf225e2a17ac013ffae6d19f844420d0
Import to branch import-from-inventory finished successfully.
To list imported objects, run:
$ lakectl fs ls lakefs://my-repo@3c1e4222cf2ac89a5c3a9fdd99d106f8bf225e2a17ac013ffae6d19f844420d0/
To merge the changes to your main branch, run:
$ lakectl merge lakefs://my-repo@3c1e4222cf2ac89a5c3a9fdd99d106f8bf225e2a17ac013ffae6d19f844420d0 lakefs://my-repo@main
And if you open the lakeFS UI, you’ll see a new import-from-inventory branch with the added objects on the latest commit.
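Once you’ve run the merge command suggested in the output, you can confirm the imported objects are visible on main using the same fs ls pattern, pointed at the branch instead of the commit:
lakectl fs ls lakefs://my-repo@main/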
Let’s walk through step-by-step creating this S3 Inventory, shall we?
Creating an S3 Inventory
Step 1: Go to the Management tab of the data bucket and click “Create Inventory Configuration”.
Step 2: Configure the Inventory. Fill in the inventory config name, which can be any value. For this example, I’ll go with my-beautiful-inventory.
If you want to limit the inventory to only a certain prefix within the bucket, you can specify that under Inventory Scope. I’ll leave it blank to capture the whole bucket.
Next, I’ll find my-beautiful-s3-bucket-inventory in the S3 Browser and select it as the inventory destination. Note: If you recently created the destination bucket, it can take a few minutes for it to appear in the S3 Browser. Be patient.
Next, choose options for the inventory frequency, format, and status. I recommend Daily, Apache Parquet, and Enable. If you have sensitive data you can turn Server-side encryption on; I’ll leave it off for this example.
For Additional Fields, as the lakeFS documentation states, make sure you check Size, Last modified, and ETag. Once checked, click “Create”.
Step 3: Wait! It takes around 24 hours for the first inventory report to be generated, after which we can run the lakefs import command. Eventually you will see that the inventory job ran, with a link to the generated manifest.json file.
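If you’d rather script this than click through the console, the same inventory configuration can be created with the AWS CLI’s s3api commands. Here’s a rough sketch using the example bucket and inventory names from above; treat it as a starting point rather than a copy-paste recipe:
aws s3api put-bucket-inventory-configuration \
--bucket my-beautiful-s3-bucket \
--id my-beautiful-inventory \
--inventory-configuration '{
  "Id": "my-beautiful-inventory",
  "IsEnabled": true,
  "IncludedObjectVersions": "Current",
  "Schedule": { "Frequency": "Daily" },
  "Destination": {
    "S3BucketDestination": {
      "Bucket": "arn:aws:s3:::my-beautiful-s3-bucket-inventory",
      "Format": "Parquet"
    }
  },
  "OptionalFields": ["Size", "LastModifiedDate", "ETag"]
}'
Either way, the destination bucket needs a bucket policy allowing S3 to deliver the inventory reports to it; the console typically offers to add that policy for you, while with the CLI you have to attach it yourself.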
Configuration
Running lakeFS depends on the configuration file documented here. By default, lakeFS looks for the configuration file in the following locations:
./config.yaml
$HOME/lakefs/config.yaml
/etc/lakefs/config.yaml
$HOME/.lakefs.yaml
Saving this file in plaintext on a laptop is not recommended best practice. For temporary testing purposes, however, it is okay to take your lakeFS config file, save it to your executable path (same as the .lakectl.yaml), and try out the lakefs import command.
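For reference, the file the import command reads is the same server configuration lakeFS itself runs with, not the .lakectl.yaml. Here is a rough sketch of a minimal S3-backed config with placeholder values; substitute your own database connection string, secret, and region:
database:
  connection_string: "postgres://user:password@localhost:5432/postgres?sslmode=disable"  # placeholder
auth:
  encrypt:
    secret_key: "some-long-random-secret"  # placeholder
blockstore:
  type: s3
  s3:
    region: us-east-1  # placeholder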
Wrapping Up
Whether adding data big or small, I hope this article has been helpful for getting your lakeFS instance hydrated with data! Although we covered three ways to do so, it’s worth noting that two other methods exist: Rclone and DistCp.
Look out for future articles diving into how those work.
Still have questions about data and lakeFS?
- Check out the lakeFS repository on GitHub
- Follow us on Twitter or LinkedIn
- Say “Hi” in our friendly Slack Group