2022-12-30

dbtのモデル

dbt のモデルとは

dbtにおいてモデルとはSQLのSELECT文のことです。モデルは.sql（.py）ファイルで定義され、デフォルトではmodelsディレクトリに格納されます。

SQL モデル

各.sqlファイルは1つのモデルとして1つのSELECT文で構成されます。また、ファイルの名前がモデル名として使用されます。dbt runコマンドを実行するとdbtはモデルをCREATE VIEW asまたCREATE TABLE asのステートメントでラップしてSQLを実行します。

例えば次のcustomersモデルを考えます。

models/customers.sql

with customer_orders as (
    select
        customer_id,
        min(order_date) as first_order_date,
        max(order_date) as most_recent_order_date,
        count(order_id) as number_of_orders
    from jaffle_shop.orders
    group by 1
)
select
    customers.customer_id,
    customers.first_name,
    customers.last_name,
    customer_orders.first_order_date,
    customer_orders.most_recent_order_date,
    coalesce(customer_orders.number_of_orders, 0) as number_of_orders
from jaffle_shop.customers
left join customer_orders using (customer_id)

dbt runコマンドを実行すると、dbtは次のSQLを実行してcustomersというviewテーブルを作成します。このとき、viewテーブルは指定されたスキーマ（データセット）に作成されます。次の例でいうとスキーマ（データセット）はdbt_aliceです。また、viewテーブルの名前であるcustomersはファイル名のcustomers.sqlから来ています。

create view dbt_alice.customers as (
    with customer_orders as (
        select
            customer_id,
            min(order_date) as first_order_date,
            max(order_date) as most_recent_order_date,
            count(order_id) as number_of_orders

        from jaffle_shop.orders

        group by 1
    )

    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order_date,
        customer_orders.most_recent_order_date,
        coalesce(customer_orders.number_of_orders, 0) as number_of_orders

    from jaffle_shop.customers

    left join customer_orders using (customer_id)
)

dbtはモデルをCREATE TABLE asステートメントでラップしていることになります。

モデルの設定

モデルはdbt_project.ymlや各モデルのconfigブロックで設定することができます。例えばマテリアライゼーションとスキーマを設定したい場合は次のようになります。

dbt_project.yml

name: jaffle_shop
config-version: 2
...

models:
  jaffle_shop: # this matches the `name:`` config
    +materialized: view # this applies to all models in the current project
    marts:
      +materialized: table # this applies to all models in the `marts/` directory
      marketing:
        +schema: marketing # this applies to all models in the `marts/marketing/`` directory

models/customer.sql

{{ config(
    materialized="view",
    schema="marketing"
)}}

with customer_orders as ...

設定は階層的に適用されます。サブディレクトリに適用された設定は、上位に設定された条件よりも優先されます。例えばあるモデルについてdbt_project.ymlとconfigの両方で設定されている場合はconfigの設定値が優先されます。

Python モデル

Pythonモデルでは、データサイエンスや統計学の最先端パッケージを含むPythonエコシステムで利用できるツールを使うことができます。

各Pythonモデルはmodels/フォルダにある.pyファイルに格納されます。model()というの関数が定義され、次の2つのパラメータを受け取ります。

dbt
dbt Coreによってコンパイルされた各モデルに固有のクラスで、モデルの依存関係を表現することができます。
session
データプラットフォームからPythonバックエンドへの接続を表すクラスです。sessionはテーブルをDataFrameとして読み込んだり、DataFrameをテーブルに書き戻したりするために必要とされます。

model()関数は、1つのDataFrameを返す必要があります。

models/my_python_model.py

import ...

def model(dbt, session):
    final_df = ...  # stuff you can't write in SQL!
    return final_df

モデル関数の定義に加えて、他の関数をインポートしたり独自の関数を定義したりすることもできます。

models/my_python_model.py

def add_one(x):
    return x + 1

def model(dbt, session):
    dbt.config(materialized="table")
    temps_df = dbt.ref("temperatures")

    # warm things up just a little
    df = temps_df.withColumn("degree_plus_one", add_one(temps_df["degree"]))
    return df

models/my_python_model.py

import holidays

def is_holiday(date_col):
    # Chez Jaffle
    french_holidays = holidays.France()
    is_holiday = (date_col in french_holidays)
    return is_holiday

def model(dbt, session):
    dbt.config(
        materialized = "table",
        packages = ["holidays"]
    )

    orders_df = dbt.ref("stg_orders")

    df = orders_df.to_pandas()

    # apply our function
    # (columns need to be in uppercase on Snowpark)
    df["IS_HOLIDAY"] = df["ORDER_DATE"].apply(is_holiday)

    # return final dataset (Pandas DataFrame)
    return df

モデルの設定

モデルは次の3パターンで設定することができます。

dbt_project.ymlで設定する
models/ディレクトリに.ymlファイルを作成して設定する
各モデル.py内のdbt.config()メソッドを呼び出して設定する

models/ディレクトリの.ymlファイルを作成して設定する場合は次のようになります。

models/config.yml

version: 2

models:
  - name: my_python_model
    config:
      materialized: table
      target_name: "{{ target.name }}"
      specific_var: "{{ var('SPECIFIC_VAR') }}"
      specific_env_var: "{{ env_var('SPECIFIC_ENV_VAR') }}"

models/my_python_model.py

def model(dbt, session):
    target_name = dbt.config.get("target_name")
    specific_var = dbt.config.get("specific_var")
    specific_env_var = dbt.config.get("specific_env_var")

    orders_df = dbt.ref("fct_orders")

    # limit data in dev
    if target_name == "dev":
        orders_df = orders_df.limit(500)

.py内のdbt.config()メソッドを呼び出して設定する場合は次のようになります。

models/my_python_model.py

def model(dbt, session):

    # setting configuration
    dbt.config(materialized="table")

Limitations

Pythonモデルは次の欠点があります。

時間とコスト
PythonモデルはSQLモデルよりも実行速度が遅く、それを実行するクラウドリソースも高価になります。Pythonの実行には、より汎用的なコンピューティングが必要であり、また、SQLモデルとは別のサービスやアーキテクチャが必要な場合もあります。
Syntaxの違い
dbtはディスパッチパターンやdbt_utilsなどのパッケージを通じて、一般的なデータウェアハウスにおけるSQL方言の違いの抽象化に努めてきました。PythonはSQLよりも遥かに多くの方言が存在します。SQLで何かをする方法が5つあるとしたら、Pythonでそれを書く方法は500通りあり、性能も標準への準拠も様々です。

SQLでもPythonでも同じように書ける変換がある場合はSQLの方が良さそうです。SQLで書けない変換がある、あるいは10行のエレガントで注釈付きのPythonで、読みにくいJinja-SQLの1000行を節約できる場合であればPythonを選択するのが良さそうです。

参考

dbtとは

マテリアライゼーション

Descriptive Statistics

Differential Equation

Dimensionality Reduction

Discrete Choice Model

Google Search Console

Hugging Face

Hypothesis Testing

Inferential Statistics

Probability Distribution

Ryusei Kakujo

Weave the future of cities through data

Transportation modeling/ Urban planning/ Machine learning/ Computer science/ GIS

dbtのモデル

dbt のモデル とは

SQL モデル

モデルの設定

Python モデル

モデルの設定

Limitations

参考

dbtとは

マテリアライゼーション

Ryusei Kakujo

dbt のモデルとは