dbt (Data Build Tool) supports incremental data loading through its incremental materialization, which you can combine with dbt's filtering and macro capabilities and with a database's upsert functionality. This can be done with the following steps:
- Use the database's upsert feature (e.g. MySQL's INSERT INTO … ON DUPLICATE KEY UPDATE) to load only new or updated rows into a staging table.
- In dbt, create a model that filters the data from the staging table to include only new or updated rows. This can be done with a SQL statement that selects rows from the staging table based on a timestamp or other high-water mark, guarded by dbt's `is_incremental()` macro.
- Create a dbt model that transforms the incremental data and loads it into the final table. This can be done by giving the model the `materialized='incremental'` configuration so that it inserts only new rows and, with a `unique_key` set, updates existing rows instead of rebuilding the whole table.
- Run dbt with the target models in each run. This can be done by specifying the `-m` option (shorthand for `--select`) and the name of the model to run.
- Schedule the dbt run in a cron job or cloud function to update the final table periodically.
Here is an example of how you might use dbt to handle incremental data loading:
- Use the database's upsert feature (e.g. MySQL's INSERT INTO … ON DUPLICATE KEY UPDATE) to load only new or updated rows into a staging table, as in the sketch below.
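A minimal sketch of that upsert, assuming a hypothetical MySQL source table raw_customers and staging table staging_table, each keyed on id and carrying an updated_at timestamp:

```sql
-- Hypothetical MySQL upsert: insert new rows, update rows whose
-- primary key (id) already exists in the staging table.
INSERT INTO staging_table (id, name, address, updated_at)
SELECT id, name, address, updated_at
FROM raw_customers
ON DUPLICATE KEY UPDATE
    name       = VALUES(name),
    address    = VALUES(address),
    updated_at = VALUES(updated_at);
```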
- In dbt, create a model that filters the data from the staging table to include only new or updated rows.
```sql
{{ config(materialized='incremental') }}

select *
from {{ ref('staging_table') }}
{% if is_incremental() %}
-- only rows newer than what is already loaded; skipped on the first run
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```
- Create a dbt model that transforms the incremental data and loads it into the final table.
```sql
{{ config(materialized='incremental', unique_key='id') }}

-- unique_key lets dbt update existing rows instead of duplicating them
select
    id,
    name,
    address
from {{ ref('incremental_data') }}
```
- Run dbt with the target models in each run.
```bash
dbt run -m incremental_data
dbt run -m final_table
```
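If a model ever needs to be rebuilt from scratch (for example, after a schema change or to backfill history), dbt's --full-refresh flag recreates an incremental model's table instead of loading incrementally:

```bash
# Drop and rebuild both incremental models from scratch.
dbt run --full-refresh -m incremental_data final_table
```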
- Schedule the dbt run with incremental data loading in a cron job or cloud function to update the final table periodically.
Schedule the dbt Run: This step automates the execution of your dbt workflow. Instead of triggering it manually, you set up a scheduled process that runs dbt at specified intervals.
Incremental Data Loading: The term “incremental data loading” means that rather than processing all the data from scratch each time, you’re updating your data incrementally. In other words, you’re only loading and processing the new or changed data since the last run. This approach is often more efficient and faster, especially for large datasets.
Cron Job or Cloud Function: These are two different methods for scheduling and automating tasks:
Cron Job: A time-based job scheduler in Unix-like operating systems. You can use a cron job to execute commands or scripts at specified intervals, such as hourly, daily, or weekly; see the crontab sketch after these definitions.
Cloud Function: In a cloud computing environment (like AWS Lambda or Google Cloud Functions), a cloud function is a piece of code that can be triggered by various events, including time-based triggers. You can set up a cloud function to run your dbt process periodically in response to a specified schedule or trigger.
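As a sketch of the cron approach, assuming dbt is on the PATH and the project lives at the hypothetical path /opt/dbt_project, a crontab entry that refreshes the models every hour might look like:

```bash
# Hypothetical crontab entry: run the incremental models at the top of
# every hour, appending output to a log file.
0 * * * * cd /opt/dbt_project && dbt run -m incremental_data final_table >> /var/log/dbt.log 2>&1
```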
Update the Final Table Periodically: The ultimate goal of this automation is to periodically update or refresh the “final table” with the most recent and relevant data. The “final table” likely refers to the result or output of your data transformation process. By scheduling dbt runs with incremental data loading, you ensure that this final table is kept up-to-date without reprocessing all the data every time.