Virtual Data Lake (VDL)

The Virtual Data Lake (VDL) is the central storage system in Squirrels that holds all materialized data models, seeds, and source data. It is built using DuckLake, a lakehouse architecture that provides the convenience of building your data lake locally while still allowing for scalability if needed by using cloud storage systems (like AWS S3).

Overview

The VDL serves as the foundation for your Squirrels project’s data layer. When you run sqrl build, Squirrels:

Loads seed data from CSV files
Extracts and stores source data (when load_to_vdl: true is configured)
Executes build models and materializes them as tables or views
Stores all of this data in the VDL

The VDL is then attached as a schema named vdl in DuckDB during realtime queries, making all your materialized models accessible for querying by federate models.

Architecture

The VDL uses DuckLake which consists of two main components:

Catalog database

The catalog database (vdl_catalog.duckdb by default) stores metadata about the VDL, including table schemas, view definitions, and other structural information. The catalog database can be configured via the SQRL_VDL__CATALOG_DB_PATH environment variable (see the Configuration section below for more details).

Data storage

The data storage location (target/vdl_data/ by default) contains the actual data files for the VDL. These files store the physical data for materialized tables. The location can be configured via the SQRL_VDL__DATA_PATH environment variable (see the Configuration section below for more details).

Configuration

The catalog database and data location can be configured using environment variables in your .env or .env.local file:

SQRL_VDL__CATALOG_DB_PATH

string

Path to the DuckLake catalog database for the VDL. Supports the {project_path} placeholder. Using a Postgres database for the catalog database is possible.For more information, see the DuckLake documentation on choosing a catalog database.

SQRL_VDL__DATA_PATH

string

default:"{project_path}/target/vdl_data/"

Storage path for the DuckLake data files for the VDL. Supports the {project_path} placeholder. Using a cloud storage system like AWS S3 is also possible.For more information, see the DuckLake documentation on choosing storage.

Both the catalog database path and data storage path support the {project_path} placeholder, allowing you to use relative paths that resolve to your project directory.

Squirrels does not allow changing the data path for an existing Virtual Data Lake. If you need to change the data path, you must rebuild the VDL from scratch using sqrl build --full-refresh after updating the environment variable.

Building the VDL

The VDL is created and updated when you run sqrl build:

# Build all models
sqrl build

# Rebuild from scratch (drops all tables)
sqrl build --full-refresh

# Build a specific model and its dependencies
sqrl build --select my_model

During the build process:

Seeds are loaded from CSV files in the seeds/ directory
Sources with load_to_vdl: true are extracted and stored
Build models are executed in dependency order and materialized
All data is stored in the VDL

Accessing VDL data

Once built, data in the VDL can be accessed in several ways:

In data models

Build models and federate models in SQL can reference VDL tables using the ref() macro:

-- Reference a seed
SELECT * FROM {{ ref("seed_categories") }}

-- Reference a source (if loaded to VDL)
SELECT * FROM {{ ref("src_customers") }}

-- Reference another build model
SELECT * FROM {{ ref("build_transactions") }}

Build models and federate models in Python can access VDL data through the sqrl.ref() method:

def main(sqrl: args.BuildModelArgs) -> pl.LazyFrame:
    # Access a seed
    categories = sqrl.ref("seed_categories")
    
    # Access a source (if loaded to VDL)
    customers = sqrl.ref("src_customers")
    
    # Access another build model
    transactions = sqrl.ref("build_transactions")
    
    return transactions.join(customers, on="customer_id")

Direct SQL queries

You can also query the VDL directly using DuckDB SQL. When using sqrl duckdb or sqrl duckdb --ui, the VDL is automatically attached as the vdl schema (in READ ONLY mode). This schema is in use by default such that you don’t need to run USE vdl; to use it.

-- Querying a seed from the VDL
SELECT * FROM vdl.seed_categories;

-- Can also query from the VDL without specifying the schema
SELECT * FROM seed_categories;

Build models - Create materialized data models in the VDL
Sources - Configure source tables and load them to the VDL
Seeds - Load CSV data into the VDL
Environment variables - Configure VDL paths and settings
sqrl build - CLI reference for building the VDL
sqrl duckdb - CLI reference for running DuckDB SQL queries against the VDL

Get started

Concepts

Project files

Overview

Architecture

Catalog database

Data storage

Configuration

Building the VDL

Accessing VDL data

In data models

Direct SQL queries

Get started

Concepts

Project files

​Overview

​Architecture

​Catalog database

​Data storage

​Configuration

​Building the VDL

​Accessing VDL data

​In data models

​Direct SQL queries

​Related pages

Overview

Architecture

Catalog database

Data storage

Configuration

Building the VDL

Accessing VDL data

In data models

Direct SQL queries

Related pages