2022-12-30

DAG in dbt

What is DAG

DAG stands for Directed Acyclic Graph, a type of graph in which nodes are related to each other in a directional manner and do not form a directional closed loop. For example, the execution sequence "A -> B -> C" is a DAG, while a cyclic execution sequence such as "A -> B -> C -> A" is not a DAG.

In the context of data pipeline, a DAG is used in the sense that it is a graphical representation of a set of tasks that need to be ordered, and it helps to understand the dependencies among data models.

For example, the following diagram is a DAG.

DAG

From this DAG we can see the following:

  • int_users is generated from stg_users and stg_user_groups.
  • dim_users is created from stg_orgs and int_users.
  • dim_users is the end of the DAG and is generated from 4 different models.

The DAG allows us to quickly understand the model dependencies and data lineage.

DAG in dbt

In dbt, the ref function can be used to construct model dependencies, or DAGs. The parameter of ref is the name of another model.

For example, suppose you want to construct the following DAG.

dbt DAG

stg_customers.sql and stg_orders.sql are as follows respectively.

models/stg_customers.sql
select
    id as customer_id,
    first_name,
    last_name
from xxx.raw_customers
models/stg_orders.sql
select
    id as order_id,
    user_id as customer_id,
    order_date,
    status
from xxx.raw_orders

In customers.sql, use the ref function to reference stg_customers and stg_orders.

models/customers.sql
with customers as (
    select * from {{ ref('stg_customers') }}
),
orders as (
    select * from {{ ref('stg_orders') }}
),
customer_orders as (
    select
        customer_id,
        min(order_date) as first_order_date,
        max(order_date) as most_recent_order_date,
        count(order_id) as number_of_orders
    from orders
    group by 1
),
final as (
    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order_date,
        customer_orders.most_recent_order_date,
        coalesce(customer_orders.number_of_orders, 0) as number_of_orders
    from customers
    left join customer_orders using (customer_id)
)
select * from final

The above will first generate stg_customers and stg_orders. Then customers is created, and stg_customers and stg_orders are referenced at that time.

Thus, dbt can easily express model dependencies by using ref.

References

https://docs.getdbt.com/terms/dag

Ryusei Kakujo

researchgatelinkedingithub

Focusing on data science for mobility

Bench Press 100kg!