Keeping your data fresh and reliable: Implementing incremental model updates in dbt


As your data warehouse continues to grow and accumulate more data, it becomes increasingly inefficient to refresh your entire dataset with every DBT run. This is where incremental models step in, offering a more scalable and efficient way of updating your data by only processing new or updated records. In this article, we’ll delve into how to implement incremental model updates using DBT while ensuring data integrity.

What are incremental models?

Incremental models are a materialization type in DBT that allows you to add new records, and optionally update existing ones, without having to reprocess your entire data set. This model is particularly useful when working with large tables, where a full refresh would be time-consuming and resource-intensive.

Setting up an incremental model

To set up an incremental model, set the materialized configuration parameter to 'incremental' in your model's SQL file:

-- freshers_in_orders.sql
{{
  config(
    materialized='incremental'
  )
}}

SELECT *
FROM {{ source('raw_data', 'freshers_in_orders') }}

{% if is_incremental() %}
WHERE timestamp_column_value > (SELECT MAX(timestamp_column_value) FROM {{ this }})
{% endif %}

In this example, {{ this }} refers to the table built by this model, and the WHERE clause ensures that only records newer than the latest one in the target table get processed. Wrapping the filter in is_incremental() is essential: on the very first run (and on a full refresh) the target table does not exist yet, so dbt skips the filter and builds the table from the full source; on every subsequent run, only new data is processed and added.

Ensuring data integrity

While incremental models are a boon for efficiency, they can also introduce challenges in terms of data integrity, especially when dealing with updates or deletions in your source data. Here’s how you can tackle these challenges:

Handling updated records

When you configure a unique key, dbt can generate a merge statement that combines the insert and update operations, allowing you to both add new records and update existing ones. Keep in mind that merge requires your warehouse to support this operation; on warehouses without it, dbt typically falls back to a delete-and-insert strategy.

First, specify a unique key using the unique_key configuration parameter in your model. dbt then uses this key to decide whether an incoming row should be inserted as new or should overwrite an existing row.

-- freshers_in_orders.sql
{{
  config(
    materialized='incremental',
    unique_key='order_id'
  )
}}

SELECT *
FROM {{ source('raw_data', 'freshers_in_orders') }}

{% if is_incremental() %}
WHERE timestamp_column_value > (SELECT MAX(timestamp_column_value) FROM {{ this }})
{% endif %}

On each run, dbt will insert rows whose order_id does not yet exist in freshers_in_orders and overwrite rows whose order_id already exists, so source records that were updated (and therefore carry a newer timestamp_column_value) replace their older versions.
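To make the mechanics concrete, on a merge-capable warehouse the statement dbt compiles looks roughly like the following sketch. This is simplified for illustration: the real SQL varies by adapter and version, lists every column explicitly, and uses dbt-generated aliases; the analytics schema and the two-column table are assumptions, not output copied from dbt.

```sql
-- Simplified sketch of the merge dbt issues for this model
MERGE INTO analytics.freshers_in_orders AS dest
USING (
    -- the model's SELECT, staged as a temporary table of new/changed rows
    SELECT * FROM analytics.freshers_in_orders__dbt_tmp
) AS src
ON dest.order_id = src.order_id          -- match on the unique_key
WHEN MATCHED THEN UPDATE SET             -- existing row: overwrite its columns
    order_id = src.order_id,
    timestamp_column_value = src.timestamp_column_value
WHEN NOT MATCHED THEN                    -- new row: insert it
    INSERT (order_id, timestamp_column_value)
    VALUES (src.order_id, src.timestamp_column_value);
```

Reading the compiled SQL in your project's target/ directory after a run is a good way to see exactly what your adapter generates.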

Handling deleted records

One limitation of incremental models is that they do not inherently account for deleted records. A common way to handle this is by using a soft delete method, where you keep a column (often called is_deleted or deleted_at) in your source data that indicates if a record has been deleted.

Then, in your DBT model, you can exclude these deleted records:

-- freshers_in_orders.sql
{{
  config(
    materialized='incremental',
    unique_key='order_id'
  )
}}

SELECT *
FROM {{ source('raw_data', 'freshers_in_orders') }}

WHERE deleted_at IS NULL
{% if is_incremental() %}
  AND timestamp_column_value > (SELECT MAX(timestamp_column_value) FROM {{ this }})
{% endif %}

With this model, dbt will not bring soft-deleted records into the table when adding or updating data. One caveat: a record that was loaded earlier and soft-deleted in the source afterwards will remain in the target table, since this filter only prevents it from being inserted or updated again.
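One wrinkle with filtering deletes out of the source is that rows deleted after they were first loaded can linger in the target table. A sketch of an alternative (assuming a soft delete updates the source row in place and advances timestamp_column_value, which is an assumption about your pipeline) is to load deleted rows too, flag them, and filter downstream:

```sql
-- freshers_in_orders.sql (sketch: carry the soft-delete flag through)
{{
  config(
    materialized='incremental',
    unique_key='order_id'
  )
}}

SELECT
  *,
  deleted_at IS NOT NULL AS is_deleted  -- boolean flag for downstream filters
FROM {{ source('raw_data', 'freshers_in_orders') }}

{% if is_incremental() %}
  -- because deleting a row also advances timestamp_column_value (assumed),
  -- the deletion is picked up here and merged over the old version
  WHERE timestamp_column_value > (SELECT MAX(timestamp_column_value) FROM {{ this }})
{% endif %}
```

Downstream models then simply SELECT ... WHERE NOT is_deleted, and the incremental table stays an accurate mirror of the source, deletions included.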

Testing your incremental models

Regularly testing your models is crucial to maintaining data integrity. DBT offers a variety of tests, such as unique, not_null, and relationships, which you can use to validate your incremental models.

# models/schema.yml
version: 2

models:
  - name: freshers_in_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: timestamp_column_value
        tests:
          - not_null
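With the tests declared, a typical incremental workflow from the command line looks like this (standard dbt CLI commands; --full-refresh rebuilds the table from scratch, which is useful after changing the model's logic or backfilling):

```shell
# process only new/changed records (incremental run)
dbt run --select freshers_in_orders

# validate the results with the tests declared in schema.yml
dbt test --select freshers_in_orders

# rebuild the whole table from scratch, bypassing the incremental filter
dbt run --select freshers_in_orders --full-refresh
```

Running dbt test right after each incremental run is a cheap way to catch duplicate keys or missing timestamps before downstream models consume the data.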

Author: user