What is DAG
DAG stands for Directed Acyclic Graph, a type of graph in which nodes are related to each other in a directional manner and do not form a directional closed loop. For example, the execution sequence "A -> B -> C" is a DAG, while a cyclic execution sequence such as "A -> B -> C -> A" is not a DAG.
In the context of data pipeline, a DAG is used in the sense that it is a graphical representation of a set of tasks that need to be ordered, and it helps to understand the dependencies among data models.
For example, the following diagram is a DAG.
From this DAG we can see the following:
int_users
is generated fromstg_users
andstg_user_groups
.dim_users
is created fromstg_orgs
andint_users
.dim_users
is the end of the DAG and is generated from 4 different models.
The DAG allows us to quickly understand the model dependencies and data lineage.
DAG in dbt
In dbt, the ref
function can be used to construct model dependencies, or DAGs. The parameter of ref
is the name of another model.
For example, suppose you want to construct the following DAG.
stg_customers.sql
and stg_orders.sql
are as follows respectively.
select
id as customer_id,
first_name,
last_name
from xxx.raw_customers
select
id as order_id,
user_id as customer_id,
order_date,
status
from xxx.raw_orders
In customers.sql
, use the ref
function to reference stg_customers
and stg_orders
.
with customers as (
select * from {{ ref('stg_customers') }}
),
orders as (
select * from {{ ref('stg_orders') }}
),
customer_orders as (
select
customer_id,
min(order_date) as first_order_date,
max(order_date) as most_recent_order_date,
count(order_id) as number_of_orders
from orders
group by 1
),
final as (
select
customers.customer_id,
customers.first_name,
customers.last_name,
customer_orders.first_order_date,
customer_orders.most_recent_order_date,
coalesce(customer_orders.number_of_orders, 0) as number_of_orders
from customers
left join customer_orders using (customer_id)
)
select * from final
The above will first generate stg_customers
and stg_orders
. Then customers
is created, and stg_customers
and stg_orders
are referenced at that time.
Thus, dbt can easily express model dependencies by using ref
.
References