DBT Sources: Streamlining Raw Data Management in Your DBT Project

DBT sources are a powerful feature that helps in managing raw data in your DBT project. They provide a way to define and document raw data tables from your source databases, making it easier to reference and work with them in your models. In this article, we’ll explore what DBT sources are, their benefits, and how to set them up in your DBT project.

  1. Understanding DBT sources:

DBT sources are a way to define and document your raw data tables in the source database. By using sources, you create an abstraction layer between your raw data and DBT models. This allows you to reference the raw tables more easily and ensures that the raw data is consistent throughout the project. DBT sources are defined in a YAML file called sources.yml that resides in the project’s root directory or within individual subdirectories.

  1. Benefits of using DBT sources:

  • Consistency: DBT sources create a single point of reference for raw data tables, ensuring that table names and schema information are consistent throughout your project.
  • Documentation: Defining sources helps document your raw data tables, making it easier for team members to understand the underlying data structure and relationships.
  • Version control: DBT sources can be version-controlled along with your project, allowing you to track changes to your raw data over time.
  • Validation: You can use built-in DBT tests to validate the raw data against predefined data quality checks, ensuring that your transformations are built on reliable data.
  1. Setting up DBT sources:

To set up DBT sources in your project, follow these steps:

Step 1: Create a sources.yml file in your project’s root directory or within a subdirectory.

Step 2: Define your sources in the sources.yml file.

For each source, provide the database, schema, and table information, as well as any additional metadata or description.

Example sources.yml:

version: 2

sources:
  - name: e-commerce
    database: raw_data
    schema: public
    tables:
      - name: orders
        description: "Orders data including order_id, user_id, product_id, order_date, and revenue."
        columns:
          - name: order_id
            tests:
              - unique
              - not_null
      - name: users
        description: "User data including user_id, name, email, signup_date, and country."
        columns:
          - name: user_id
            tests:
              - unique
              - not_null

Step 3: Reference your sources in your DBT models using the source() function.

This function takes two arguments: the name of the source and the name of the table.

Example model orders_daily.sql:

{{
    config(
        materialized='table'
    )
}}

SELECT
    order_date,
    SUM(revenue) as daily_revenue
FROM {{ source('e-commerce', 'orders') }}
GROUP BY order_date
  1. Running DBT with sources:

When you run your DBT project, it will use the sources defined in your sources.yml file to reference raw data tables. DBT will also generate documentation for your sources, making it easier for team members to understand the raw data structure.

By defining and documenting your raw data tables, you can ensure consistency, improve collaboration, and streamline the development of your data transformation pipeline

Get more useful articles on dbt

  1. ,
Author: user

Leave a Reply