Skip to main content
Sources define the metadata for database tables that exist in your external data sources. They are configured in the models/sources.yml file and can be loaded into the Virtual Data Lake (VDL) for use by other data models.

Configuration structure

The sources.yml file contains a list of source definitions under the sources key:
sources:
  - name: src_customers
    description: Customer data from the CRM system
    connection: default
    table: customers
    load_to_vdl: true
    columns:
      - name: customer_id
        type: integer
        description: Unique customer identifier
        category: dimension

Configuration fields

name
string
required
A unique identifier for the source. This name is used to reference the source in your models using the ref() function.
description
string
default:""
A human-readable description of the source table. Used for documentation and data consumers (such as AI agents).
connection
string
default:"default"
The name of the database connection to use, as defined in squirrels.yml or pyconfigs/connections.py. If not specified, uses the default connection defined by the SQRL_CONNECTIONS__DEFAULT_NAME_USED environment variable, or “default” if that is not set.
table
string
The name of the table in the external database. If not specified, defaults to the source name.
load_to_vdl
boolean
default:"false"
Whether to load this source’s data into the Virtual Data Lake (VDL) during the build process. When true, the source data is extracted from the external database and stored in the VDL, making it available for use by build and federate models. See VDL loading behavior for details on how columns are loaded.
primary_key
list[string]
default:"[]"
A list of column names that form the primary key of the source table. Used for incremental loading and upsert operations.
update_hints
object
Configuration for incremental data loading.
columns
list[object]
default:"[]"
A list of column configurations that define which columns to load and their metadata.
If no columns are specified and load_to_vdl is set to true, all columns from the source table are loaded with types inferred by DuckDB.
Unlike seeds, sources require column type specifications. Each column in the columns list must have a type field set. If a column is missing a type, an error will be raised.

Example configurations

sources:
  # Basic example (with default connection)
  - name: src_customers
    description: Customer master data
    table: customers
    load_to_vdl: true
    columns:
      - name: customer_id
        type: integer
        description: Customer identifier
        category: dimension
      
      - name: name
        type: string
        description: Customer name
        category: dimension

  # Source with incremental loading and using a custom connection
  - name: src_transactions
    description: Financial transactions with incremental loading
    connection: finance_db
    table: transactions
    load_to_vdl: true
    primary_key:
      - transaction_id
    update_hints:
      increasing_column: created_at
      strictly_increasing: true
    columns:
      - name: transaction_id
        type: integer
        description: Unique transaction identifier
        category: dimension
      
      - name: created_at
        type: datetime
        description: Transaction timestamp
        category: dimension
      
      - name: amount
        type: decimal
        description: Transaction amount
        category: measure

  # Using Jinja templating
  - name: src_data
    description: Data from {{ env_vars.DATA_SOURCE_NAME }}
    table: {{ env_vars.TABLE_NAME }}
    load_to_vdl: true
    columns:
      - name: id
        type: integer

VDL loading behavior

When load_to_vdl: true is set, the source data is extracted from the external database and stored in the Virtual Data Lake during sqrl build. The behavior depends on whether columns are specified:
Columns specified?Behavior
YesOnly the specified columns are loaded, and each column is cast to the specified type.
NoAll columns from the source table are loaded with types inferred by DuckDB.
When specifying columns, make sure the column types match or are compatible with the actual database column types to avoid casting errors during the build process.

Using sources in models

Sources can be referenced in your data models, but the method depends on the model type and whether load_to_vdl is enabled.

In build models (SQL)

Build models written in SQL can reference sources using the ref() macro which provides the fully qualified name of the source model:
-- models/builds/customer_orders.sql
SELECT 
    c.customer_id,
    c.name,
    COUNT(o.order_id) AS order_count,
    SUM(o.total_amount) AS total_spent
FROM {{ ref("src_customers") }} c
LEFT JOIN {{ ref("src_orders") }} o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name

In build models (Python)

Build models written in Python can reference sources using the sqrl.ref() function which returns a LazyFrame:
# models/builds/customer_orders.py
from squirrels import arguments as args
import polars as pl


def main(sqrl: args.BuildModelArgs) -> pl.LazyFrame:
    customers = sqrl.ref("src_customers")
    orders = sqrl.ref("src_orders")
    
    result = customers.join(
        orders.group_by("customer_id").agg([
            pl.count("order_id").alias("order_count"),
            pl.sum("total_amount").alias("total_spent")
        ]),
        on="customer_id",
        how="left"
    )
    
    return result

In federate models

Similar to build models, sources with load_to_vdl: true can be referenced directly in federate models using the ref() macro for SQL models or the sqrl.ref() function for Python models.
-- models/federates/recent_orders.sql
SELECT *
FROM {{ ref("src_orders") }}
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'

In dbview models

Dbview models reference sources using the source() function (though ref() is also supported as an alias). Unlike build and federate models, dbview models have specific rules for referencing sources:
  1. Same connection required: A dbview model can only reference sources that share the same database connection. This is because dbview queries run directly on the external database (or at least give the illusion of doing so).
  2. translate_to_duckdb restriction: If a dbview model has translate_to_duckdb: true, it can only reference sources with load_to_vdl: true. This is because the translated query actually runs on DuckDB using VDL data.
-- models/dbviews/customer_summary.sql
-- This dbview must use the same connection as src_customers
SELECT 
    customer_id,
    COUNT(*) AS order_count
FROM {{ source("src_customers") }}
GROUP BY customer_id

Best practices

  1. Use descriptive names: Prefix your source names with src_ to distinguish them from other model types.
  2. Document columns: Add descriptions to all columns to help other developers and data consumers (including AI agents) understand the data.
  3. Specify column types: Always provide explicit column types for sources to ensure data is loaded correctly.
  4. Consider incremental loading: For large or frequently updated tables, use update_hints to enable incremental loading and reduce build times.
  5. Keep VDL loading selective: Only set load_to_vdl: true for sources that need to be used by build or federate models. Sources used only by dbview models don’t need VDL loading.