2022-12-30

Designing Data Transformation Pipelines with dbt

Introduction

Data transformation is a critical process in the data pipeline, and dbt (data build tool) is an increasingly popular solution for managing these transformations. This article will provide you with a comprehensive understanding of how to structure a dbt project effectively. We will discuss sources, staging models, and marts models, each with examples to help you get started.

Defining Sources

Sources are the raw data that you will be working with in your dbt project. They represent the tables and views in your database.

Identifying Data Sources

Before configuring sources in dbt, you need to identify the tables and views in your database that you will be using for your project.

Configuring Source YAML

In your dbt project, create a sources.yml file inside the models folder. This file will define your sources and their configurations.

Example: Setting Up a Source

To set up a source, add the following configuration to your sources.yml file:

models/sources.yml
version: 2
sources:

- name: my_source
  database: my_database
  schema: my_schema
  tables: - name: my_table

Building Staging Models

Staging models are the first level of transformation in dbt. They provide a consistent interface for raw data, allowing you to clean and standardize it.

The Purpose of Staging Models

Staging models serve as a bridge between raw data and business-specific transformations. They allow you to apply basic data cleaning and standardization before moving on to more complex transformations.

Naming and Organizing Staging Models

Staging models should be named stg_ followed by the source name. Organize them in a folder called staging within your models folder.

Example: Creating a Staging Model

To create a staging model for the my_table source, create a file named stg_my_table.sql in the staging folder with the following content:

models/staging/stg_my_table.sql
SELECT
  column1,
  column2,
...
FROM {{ source('my_source', 'my_table') }}

Designing Marts Models

Marts models are the final stage of your data transformation pipeline. They create aggregated and summarized datasets tailored to the needs of specific business use cases.

Understanding Marts Models

Marts models represent the final output of your dbt project. They transform data into a format that can be easily consumed by analysts, data scientists, and other stakeholders.

Types of Marts Models

Marts models can be classified into two main types: dimension models and fact models. Dimension models, represented by dim_ in the naming convention, create a schema for your data warehouse by organizing descriptive attributes. Fact models, represented by fct_ in the naming convention, store quantitative data and represent the key performance indicators or measurements of interest.

Example: Implementing a Marts Model

To create dimension and fact marts models, first ensure that you have a folder called "marts" within your "models" folder.

For a dimension model, create a file named dim_customer.sql inside the marts folder with the following content:

models/marts/dim_customer.sql
SELECT
  customer_id,
  first_name,
  last_name,
  email,
  ...
FROM {{ ref('stg_customers') }}

This example demonstrates how to create a dimension model for customer data using the stg_customers staging model.

For a fact model, create a file named fct_sales.sql inside the marts folder with the following content:

models/marts/fct_sales.sql
WITH sales_data AS (
  SELECT
    order_id,
    customer_id,
    product_id,
    SUM(quantity) AS total_quantity,
    SUM(price) AS total_price,
    ...
  FROM {{ ref('stg_orders') }}
  GROUP BY order_id, customer_id, product_id
)

SELECT
  order_id,
  customer_id,
  product_id,
  total_quantity,
  total_price,
  ...
FROM sales_data

This example demonstrates how to create a fact model that aggregates sales data from the stg_orders staging model.

Example Models Directory

The article introduces a directory structure based on the example of payment records being loaded into a data warehouse using the Stripe API.

├── dbt_project.yml
└── models
    ├── marts
    │   ├── core
    │   │   ├── core.yml
    │   │   ├── dim_customers.sql
    │   │   ├── fct_orders.sql
    │   ├── finance
    │   ├── marketing
    │   └── product
    └── staging
        └── stripe
            ├── src_stripe.yml
            ├── stg_stripe.yml
            ├── stg_stripe__customers.sql
            └── stg_stripe__invoices.sql

References

https://github.com/dbt-labs/corp/blob/main/dbt_style_guide.md
https://discourse.getdbt.com/t/how-we-used-to-structure-our-dbt-projects/355
https://docs.getdbt.com/faqs/project/example-projects
https://github.com/flexanalytics/dbt-business-intelligence

Ryusei Kakujo

researchgatelinkedingithub

Focusing on data science for mobility

Bench Press 100kg!