What is Source in dbt
In dbt, a Source is the raw data in DWH. dbt provides the following functions for handling Sources:
- Name a Source and refer to it from a model.
- Test the data in a Source.
- Checking the freshness of data in a Source.
Declaring Sources
Place arbitrarily named .yml
files in the directories declared in the model-paths
of dbt_project.yml
.
Suppose you have the following raw data in BigQuery:
- Project ID:
my_project
- Dataset:
jaffle_shop
- Table:
raw_customers
raw_orders
raw_payments
- Table:
If you want to use the raw data in the dataset jaffle_shop
as Source, write the following.
version: 2
sources:
- name: jaffle ## Any name
database: my_project
schema: jaffle_shop
tables:
- name: raw_customers
- name: raw_orders
- name: raw_payments
Use Sources
Declared Sources can be referenced from the model. The following example references the raw_orders
and raw_customers
tables from a Source declared as jaffle
.
SELECT
*
FROM {{ source('jaffle', 'raw_orders') }}
LEFT JOIN {{ source('jaffle', 'raw_customers') }} USING (customer_id)
dbt compiles internally as follows.
SELECT
*
FROM jaffle.jaffle_shop.raw_orders
LEFT JOIN jaffle.jaffle_shop.raw_customers USING (customer_id)
Test Sources
You can test your Sources by writing the following and running dbt test
.
version: 2
sources:
- name: jaffle
database: my_project
schema: jaffle_shop
tables:
- name: raw_customers
+ columns:
+ - name: id
+ tests:
+ - unique
+ - not_null
- name: raw_orders
+ columns:
+ - name: id
+ tests:
+ - unique
- name: raw_payments
Check freshness of Sources
You can check the freshness of the Sources by writing about loaded_at_field
and freshness
.
version: 2
sources:
- name: jaffle
database: my_project
schema: jaffle_shop
+ freshness: # default freshness
+ warn_after: {count: 12, period: hour}
+ error_after: {count: 24, period: hour}
+ loaded_at_field: _etl_loaded_at
tables:
- name: raw_customers # this will use the freshness defined above
- name: raw_orders
+ freshness: # make this a little more strict
+ warn_after: {count: 6, period: hour}
+ error_after: {count: 12, period: hour}
- name: raw_payments
+ freshness: null # do not check freshness for this table
The loaded_at_field
is a column of time data to be used in the freshness calculation. dbt checks the freshness of the data by taking the MAX value of the specified column and measuring the difference from the current time. However, the time must be in timestamp and UTC. Otherwise, you need to write a query to cast the time as follows.
loaded_at_field: "convert_timezone('UTC', 'Asia/Tokyo', _etl_loaded_at)"
The freshness
parameter defines warnings or errors about the difference between the current time and the time of the data. For example, warn_after: {count: 12, period: hour}
means to warn if the latest data is more than 12 hours behind the current time.
You can run dbt source freshness
to measure the freshness of your Sources.
$ dbt source freshness
Running with dbt=0.18.1
Found 5 models, 11 tests, 0 snapshots, 0 analyses, 155 macros, 0 operations, 0 seed files, 2 sources
20:35:05 | Concurrency: 4 threads (target='learn')
20:35:05 |
20:35:06 | 1 of 1 START freshness of jaffle.raw_orders........................... [RUN]
20:35:09 | 1 of 1 WARN freshness of jaffle.raw_orders............................ [WARN in 3.98s]
20:35:09 | Done.
Since WARN
is displayed, it means that the freshness level specified by warn_after
has been exceeded, i.e., more than 6 hours have passed since the current time (command execution time). Note that PASSS
will be displayed if there is no freshness problem, and ERROR STALE
will be displayed if error_after
is applied.
References