Introduction
Data transformation is a critical process in the data pipeline, and dbt (data build tool) is an increasingly popular solution for managing these transformations. This article will provide you with a comprehensive understanding of how to structure a dbt project effectively. We will discuss sources, staging models, and marts models, each with examples to help you get started.
Defining Sources
Sources are the raw data that you will be working with in your dbt project. They represent the tables and views in your database.
Identifying Data Sources
Before configuring sources in dbt, you need to identify the tables and views in your database that you will be using for your project.
Configuring Source YAML
In your dbt project, create a sources.yml
file inside the models
folder. This file will define your sources and their configurations.
Example: Setting Up a Source
To set up a source, add the following configuration to your sources.yml
file:
version: 2
sources:
- name: my_source
database: my_database
schema: my_schema
tables: - name: my_table
Building Staging Models
Staging models are the first level of transformation in dbt. They provide a consistent interface for raw data, allowing you to clean and standardize it.
The Purpose of Staging Models
Staging models serve as a bridge between raw data and business-specific transformations. They allow you to apply basic data cleaning and standardization before moving on to more complex transformations.
Naming and Organizing Staging Models
Staging models should be named stg_
followed by the source name. Organize them in a folder called staging
within your models
folder.
Example: Creating a Staging Model
To create a staging model for the my_table
source, create a file named stg_my_table.sql
in the staging
folder with the following content:
SELECT
column1,
column2,
...
FROM {{ source('my_source', 'my_table') }}
Designing Marts Models
Marts models are the final stage of your data transformation pipeline. They create aggregated and summarized datasets tailored to the needs of specific business use cases.
Understanding Marts Models
Marts models represent the final output of your dbt project. They transform data into a format that can be easily consumed by analysts, data scientists, and other stakeholders.
Types of Marts Models
Marts models can be classified into two main types: dimension models and fact models. Dimension models, represented by dim_
in the naming convention, create a schema for your data warehouse by organizing descriptive attributes. Fact models, represented by fct_
in the naming convention, store quantitative data and represent the key performance indicators or measurements of interest.
Example: Implementing a Marts Model
To create dimension and fact marts models, first ensure that you have a folder called "marts" within your "models" folder.
For a dimension model, create a file named dim_customer.sql
inside the marts
folder with the following content:
SELECT
customer_id,
first_name,
last_name,
email,
...
FROM {{ ref('stg_customers') }}
This example demonstrates how to create a dimension model for customer data using the stg_customers
staging model.
For a fact model, create a file named fct_sales.sql
inside the marts
folder with the following content:
WITH sales_data AS (
SELECT
order_id,
customer_id,
product_id,
SUM(quantity) AS total_quantity,
SUM(price) AS total_price,
...
FROM {{ ref('stg_orders') }}
GROUP BY order_id, customer_id, product_id
)
SELECT
order_id,
customer_id,
product_id,
total_quantity,
total_price,
...
FROM sales_data
This example demonstrates how to create a fact model that aggregates sales data from the stg_orders
staging model.
Example Models Directory
The article introduces a directory structure based on the example of payment records being loaded into a data warehouse using the Stripe API.
├── dbt_project.yml
└── models
├── marts
│ ├── core
│ │ ├── core.yml
│ │ ├── dim_customers.sql
│ │ ├── fct_orders.sql
│ ├── finance
│ ├── marketing
│ └── product
└── staging
└── stripe
├── src_stripe.yml
├── stg_stripe.yml
├── stg_stripe__customers.sql
└── stg_stripe__invoices.sql
References