2022-12-30

dbt model

What is a model in dbt

In dbt, a model is a SQL SELECT statement. Models are defined in .sql (.py) files and are stored by default in the models directory.

SQL Models

Each .sql file consists of one model with one SELECT statement. Also, the name of the file is used as the model name. When you run the dbt run command, dbt will wrap the model in CREATE VIEW as or CREATE TABLE as statements and execute the SQL.

For example, consider the following customers model.

models/customers.sql

with customer_orders as (
    select
        customer_id,
        min(order_date) as first_order_date,
        max(order_date) as most_recent_order_date,
        count(order_id) as number_of_orders
    from jaffle_shop.orders
    group by 1
)
select
    customers.customer_id,
    customers.first_name,
    customers.last_name,
    customer_orders.first_order_date,
    customer_orders.most_recent_order_date,
    coalesce(customer_orders.number_of_orders, 0) as number_of_orders
from jaffle_shop.customers
left join customer_orders using (customer_id)

When you run the dbt run command, dbt will execute the following SQL to create a view table named customers. The view table will be created in the specified schema (dataset). In the following example, the schema (dataset) is dbt_alice. Also, the name of the view table, customers, comes from the file name customers.sql.

create view dbt_alice.customers as (
    with customer_orders as (
        select
            customer_id,
            min(order_date) as first_order_date,
            max(order_date) as most_recent_order_date,
            count(order_id) as number_of_orders

        from jaffle_shop.orders

        group by 1
    )

    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order_date,
        customer_orders.most_recent_order_date,
        coalesce(customer_orders.number_of_orders, 0) as number_of_orders

    from jaffle_shop.customers

    left join customer_orders using (customer_id)
)

dbt will wrap the model with a CREATE TABLE as statement.

Configure models

Models can be configured in dbt_project.yml or in the config block of each model. For example, if you want to configure materialization and schema,

dbt_project.yml

name: jaffle_shop
config-version: 2
...

models:
  jaffle_shop: # this matches the `name:`` config
    +materialized: view # this applies to all models in the current project
    marts:
      +materialized: table # this applies to all models in the `marts/` directory
      marketing:
        +schema: marketing # this applies to all models in the `marts/marketing/`` directory

models/customer.sql

{{ config(
    materialized="view",
    schema="marketing"
)}}

with customer_orders as ...

Settings are applied hierarchically. A configuration applied to a subdirectory will override any general configurations.For example, if a model is configured in both dbt_project.yml and config, the configuration in config takes precedence.

Python models

Python models allow you to use tools available in the Python ecosystem, including state-of-the-art data science and statistics packages.

Each Python model is stored in a .py file in the models/ folder. A function called model() is defined and takes the following two parameters.

dbt
A class specific to each model compiled by dbt Core that can represent model dependencies.
session
A class representing the connection from the data platform to the Python backend. session is needed to read tables as DataFrames and write DataFrames back to tables.

The model() function must return a single DataFrame.

models/my_python_model.py

import ...

def model(dbt, session):
    final_df = ...  # stuff you can't write in SQL!
    return final_df

In addition to defining model functions, you can import other functions or define your own functions.

models/my_python_model.py

def add_one(x):
    return x + 1

def model(dbt, session):
    dbt.config(materialized="table")
    temps_df = dbt.ref("temperatures")

    # warm things up just a little
    df = temps_df.withColumn("degree_plus_one", add_one(temps_df["degree"]))
    return df

models/my_python_model.py

import holidays

def is_holiday(date_col):
    # Chez Jaffle
    french_holidays = holidays.France()
    is_holiday = (date_col in french_holidays)
    return is_holiday

def model(dbt, session):
    dbt.config(
        materialized = "table",
        packages = ["holidays"]
    )

    orders_df = dbt.ref("stg_orders")

    df = orders_df.to_pandas()

    # apply our function
    # (columns need to be in uppercase on Snowpark)
    df["IS_HOLIDAY"] = df["ORDER_DATE"].apply(is_holiday)

    # return final dataset (Pandas DataFrame)
    return df

Configure models

Models can be configured in the following three patterns:

Configure them in dbt_project.yml.
Configure by creating a .yml file in the models/ directory.
Configure by calling the dbt.config() method in each model .py.

If you create a .yml file in the models/ directory and configure it, it will look like this

models/config.yml

version: 2

models:
  - name: my_python_model
    config:
      materialized: table
      target_name: "{{ target.name }}"
      specific_var: "{{ var('SPECIFIC_VAR') }}"
      specific_env_var: "{{ env_var('SPECIFIC_ENV_VAR') }}"

models/my_python_model.py

def model(dbt, session):
    target_name = dbt.config.get("target_name")
    specific_var = dbt.config.get("specific_var")
    specific_env_var = dbt.config.get("specific_env_var")

    orders_df = dbt.ref("fct_orders")

    # limit data in dev
    if target_name == "dev":
        orders_df = orders_df.limit(500)

If you call the dbt.config() method in .py to configure it, it will look like this

models/my_python_model.py

def model(dbt, session):

    # setting configuration
    dbt.config(materialized="table")

Limitations

The Python model has the following drawbacks:

Time and cost
The Python model is slower to execute than the SQL model, and the cloud resources to run it are more expensive; Python requires more general purpose computing to run, and may require different services or architecture than the SQL model.
Syntax differences
dbt has worked to abstract the differences between SQL dialects in a typical data warehouse through packages such as the dispatch pattern and dbt_utils. python has far more dialects than SQL. if there are 5 ways to do something in SQL, there are 500 ways to do it in Python. If there are five ways to do something in SQL, there are 500 ways to write it in Python, with varying degrees of performance and standards compliance.

If there is a conversion that can be written equally well in SQL and Python, SQL is likely the better choice; if there is a conversion that cannot be written in SQL, or if you can save 1000 lines of the unreadable Jinja-SQL with 10 lines of elegant, annotated Python, then Python is likely the better choice. Python seems to be the better choice if you can save 1000 lines of Jinja-SQL which is hard to read.

References

What is dbt

Materializations

Descriptive Statistics

Differential Equation

Dimensionality Reduction

Discrete Choice Model

Google Search Console

Hugging Face

Hypothesis Testing

Inferential Statistics

Probability Distribution

Ryusei Kakujo

Weave the future of cities through data

Transportation modeling/ Urban planning/ Machine learning/ Computer science/ GIS