What is a model in dbt
In dbt, a model is a SQL SELECT
statement. Models are defined in .sql
(.py
) files and are stored by default in the models
directory.
SQL Models
Each .sql
file consists of one model with one SELECT
statement. Also, the name of the file is used as the model name. When you run the dbt run
command, dbt will wrap the model in CREATE VIEW as
or CREATE TABLE as
statements and execute the SQL.
For example, consider the following customers
model.
with customer_orders as (
select
customer_id,
min(order_date) as first_order_date,
max(order_date) as most_recent_order_date,
count(order_id) as number_of_orders
from jaffle_shop.orders
group by 1
)
select
customers.customer_id,
customers.first_name,
customers.last_name,
customer_orders.first_order_date,
customer_orders.most_recent_order_date,
coalesce(customer_orders.number_of_orders, 0) as number_of_orders
from jaffle_shop.customers
left join customer_orders using (customer_id)
When you run the dbt run
command, dbt will execute the following SQL to create a view table named customers
. The view table will be created in the specified schema (dataset). In the following example, the schema (dataset) is dbt_alice
. Also, the name of the view table, customers
, comes from the file name customers.sql
.
create view dbt_alice.customers as (
with customer_orders as (
select
customer_id,
min(order_date) as first_order_date,
max(order_date) as most_recent_order_date,
count(order_id) as number_of_orders
from jaffle_shop.orders
group by 1
)
select
customers.customer_id,
customers.first_name,
customers.last_name,
customer_orders.first_order_date,
customer_orders.most_recent_order_date,
coalesce(customer_orders.number_of_orders, 0) as number_of_orders
from jaffle_shop.customers
left join customer_orders using (customer_id)
)
dbt will wrap the model with a CREATE TABLE as
statement.
Configure models
Models can be configured in dbt_project.yml
or in the config
block of each model. For example, if you want to configure materialization and schema,
name: jaffle_shop
config-version: 2
...
models:
jaffle_shop: # this matches the `name:`` config
+materialized: view # this applies to all models in the current project
marts:
+materialized: table # this applies to all models in the `marts/` directory
marketing:
+schema: marketing # this applies to all models in the `marts/marketing/`` directory
{{ config(
materialized="view",
schema="marketing"
)}}
with customer_orders as ...
Settings are applied hierarchically. A configuration applied to a subdirectory will override any general configurations.For example, if a model is configured in both dbt_project.yml
and config
, the configuration in config
takes precedence.
Python models
Python models allow you to use tools available in the Python ecosystem, including state-of-the-art data science and statistics packages.
Each Python model is stored in a .py
file in the models/
folder. A function called model()
is defined and takes the following two parameters.
- dbt
A class specific to each model compiled by dbt Core that can represent model dependencies. - session
A class representing the connection from the data platform to the Python backend.session
is needed to read tables as DataFrames and write DataFrames back to tables.
The model()
function must return a single DataFrame.
import ...
def model(dbt, session):
final_df = ... # stuff you can't write in SQL!
return final_df
In addition to defining model functions, you can import other functions or define your own functions.
def add_one(x):
return x + 1
def model(dbt, session):
dbt.config(materialized="table")
temps_df = dbt.ref("temperatures")
# warm things up just a little
df = temps_df.withColumn("degree_plus_one", add_one(temps_df["degree"]))
return df
import holidays
def is_holiday(date_col):
# Chez Jaffle
french_holidays = holidays.France()
is_holiday = (date_col in french_holidays)
return is_holiday
def model(dbt, session):
dbt.config(
materialized = "table",
packages = ["holidays"]
)
orders_df = dbt.ref("stg_orders")
df = orders_df.to_pandas()
# apply our function
# (columns need to be in uppercase on Snowpark)
df["IS_HOLIDAY"] = df["ORDER_DATE"].apply(is_holiday)
# return final dataset (Pandas DataFrame)
return df
Configure models
Models can be configured in the following three patterns:
- Configure them in
dbt_project.yml
. - Configure by creating a
.yml
file in themodels/
directory. - Configure by calling the
dbt.config()
method in each model.py
.
If you create a .yml
file in the models/
directory and configure it, it will look like this
version: 2
models:
- name: my_python_model
config:
materialized: table
target_name: "{{ target.name }}"
specific_var: "{{ var('SPECIFIC_VAR') }}"
specific_env_var: "{{ env_var('SPECIFIC_ENV_VAR') }}"
def model(dbt, session):
target_name = dbt.config.get("target_name")
specific_var = dbt.config.get("specific_var")
specific_env_var = dbt.config.get("specific_env_var")
orders_df = dbt.ref("fct_orders")
# limit data in dev
if target_name == "dev":
orders_df = orders_df.limit(500)
If you call the dbt.config()
method in .py
to configure it, it will look like this
def model(dbt, session):
# setting configuration
dbt.config(materialized="table")
Limitations
The Python model has the following drawbacks:
- Time and cost
The Python model is slower to execute than the SQL model, and the cloud resources to run it are more expensive; Python requires more general purpose computing to run, and may require different services or architecture than the SQL model. - Syntax differences
dbt has worked to abstract the differences between SQL dialects in a typical data warehouse through packages such as the dispatch pattern anddbt_utils
. python has far more dialects than SQL. if there are 5 ways to do something in SQL, there are 500 ways to do it in Python. If there are five ways to do something in SQL, there are 500 ways to write it in Python, with varying degrees of performance and standards compliance.
If there is a conversion that can be written equally well in SQL and Python, SQL is likely the better choice; if there is a conversion that cannot be written in SQL, or if you can save 1000 lines of the unreadable Jinja-SQL with 10 lines of elegant, annotated Python, then Python is likely the better choice. Python seems to be the better choice if you can save 1000 lines of Jinja-SQL which is hard to read.
References