Let's get something out of the way: dbt is excellent. If you're doing analytics engineering - modelling data, managing transformations, building tested and documented data pipelines - dbt is probably the best tool available. This article isn't a critique of dbt. It's a critique of a specific pattern that's become widespread in dbt projects, one that dbt itself was never designed to support.
That pattern is using seed files as your reference data management strategy.
If you've worked on a dbt project of any meaningful size, you've seen it. There's a /seeds directory in the repo. Inside it, CSV files: store_mapping.csv, product_categories.csv, country_codes.csv, cost_centre_hierarchy.csv. They were committed months ago. Maybe years ago. They get loaded by dbt seed and joined into your models as if they were any other source table.
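In a typical project the pattern looks something like this (all file, column, and model names here are illustrative, not from any specific project):

```sql
-- seeds/store_mapping.csv, loaded into the warehouse by `dbt seed`:
--   store_id,store_name,region
--   101,Leeds Central,North
--   102,Brighton Pier,South

-- models/sales_by_region.sql then joins the seed
-- exactly as if it were any other table:
select
    s.region,
    sum(f.sales_amount) as total_sales
from {{ ref('fct_sales') }} f
join {{ ref('store_mapping') }} s
    on f.store_id = s.store_id
group by 1
```

Nothing about the join tells you the right-hand table is a hand-edited CSV sitting in the repo.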
It works. Until it doesn't.
And the moment it stops working is usually the moment a business user needs to make a change.
To understand why this pattern is so common, you have to understand what seed files replaced.
Before dbt seed files, most teams managed reference data in one of three ways: Excel spreadsheets on a shared drive, hard-coded SQL CASE WHEN statements scattered across models, or manual INSERT scripts run against warehouse tables. All three approaches were fragile, ungoverned, and disconnected from the transformation pipeline.
dbt seed files solved a real problem. They brought reference data into the same version-controlled, deployable workflow as everything else in the dbt project. Suddenly, your product category mapping was in Git. It went through pull requests. It was deployed alongside your models. It was reproducible. Compared to what came before, this was a genuine improvement.
And dbt's own documentation is clear about the intended scope. Seeds are designed for small, static datasets that change infrequently - things like country code lookups or test fixture data. The documentation explicitly warns against using seeds for data that changes frequently or is large in volume.
The problem is that reference data doesn't stay small or static. What starts as a fifty-row country code list grows into a multi-tab mapping of store hierarchies, product classifications, customer tier definitions, and cost-centre structures - business-critical data that changes whenever the business changes. And the moment that data needs to change, the seed file workflow reveals its fundamental limitation.
This is the core issue, and it's the one that no amount of process improvement can fix within the seed file paradigm.
Reference data is business-owned data. The operations team defines which stores belong to which regions. The merchandising team defines product categories and subcategories. The finance team defines cost-centre hierarchies and allocation rules. These aren't technical decisions - they're business decisions that happen to be encoded as data.
Now imagine you're the operations manager. Three new stores opened last month. You need to add them to the regional hierarchy so they appear correctly in sales reports. The data lives in a seed file.
To make this change, you need to:

1. Clone or pull the dbt repository from Git.
2. Open store_hierarchy.csv in a text editor and add the new rows, without breaking the CSV formatting.
3. Commit the change to a branch and open a pull request.
4. Wait for a data engineer to review and merge it.
5. Wait for the next dbt seed + dbt run execution to load and propagate the change.
If you're a data engineer reading this, each step seems reasonable. If you're the operations manager, this process is absurd. You need to change three rows in a lookup table, and the workflow requires you to interact with Git, understand CSV formatting, navigate a code review process designed for software changes, and wait for a deployment cycle.
In practice, what actually happens is one of two things. Either the business user emails the change to a data engineer and waits - days or weeks - for someone to commit it. Or the business user maintains their own spreadsheet alongside the seed file, creating exactly the kind of shadow reference data that the seed file was supposed to eliminate.
This isn't a training problem. You're not going to teach every operations manager, merchandiser, and finance analyst to use Git. And you shouldn't have to. The question is whether reference data management should conform to the developer workflow, or whether it needs its own workflow that accommodates both business users and engineers.
Even setting aside the business user issue, seed files have practical limitations that surface as your reference data grows.
File size and performance. dbt's documentation recommends keeping seed files under 1MB. For a simple country code list, that's fine. For a product catalogue mapping with 50,000 SKUs, a supplier crosswalk with 10,000 entries, or a historical store hierarchy with effective dating - you've exceeded the practical limit. dbt seed loads data by issuing INSERT statements, which becomes painfully slow at scale. Some teams have resorted to splitting large seeds across multiple files, which creates its own maintenance nightmare.
No validation or data types. A CSV file is just text. There's no schema enforcement on a seed file before it's loaded. If someone adds a row with a mistyped column, an extra comma, or a value in the wrong format, you won't know until dbt seed fails - or worse, until the data loads successfully but the downstream model produces wrong results. There's no equivalent of a CHECK constraint, a foreign key reference, or even a basic data type validation on the CSV itself.
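The closest dbt itself gets is schema tests and column type configs declared on the seed - but these apply during or after dbt seed runs, not before a bad row is committed to the repo. A hedged sketch of what that looks like (column names and accepted values are illustrative):

```yaml
# seeds/schema.yml - these checks run as part of the dbt workflow,
# after the CSV has already been merged and loaded
version: 2

seeds:
  - name: store_mapping
    config:
      column_types:
        store_id: integer   # coerces the column; a non-numeric value still fails at load time
    columns:
      - name: store_id
        tests:
          - not_null
          - unique
      - name: region
        tests:
          - accepted_values:
              values: ['North', 'South', 'East', 'West']
```

These are post-hoc checks on the warehouse table, not validation of the CSV at edit time - the malformed row is caught only after it has entered the pipeline.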
No effective dating. Reference data changes over time, and your warehouse needs to handle that. When a store is reclassified from "Medium" to "Large," does that apply retroactively? From a specific date? Seed files have no concept of temporal validity. You either overwrite the current value (losing history) or build a versioning scheme on top of CSV files (which gets ugly fast). Either way, answering "what was the store classification on March 15th?" requires custom logic that the seed file format can't natively support.
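Teams that need history typically hand-roll it: extra valid_from/valid_to columns maintained by hand in the CSV, plus a range join in every model that needs point-in-time accuracy. A sketch of that custom logic (column names assumed):

```sql
-- store_hierarchy seed extended with hand-maintained validity columns:
--   store_id,classification,valid_from,valid_to
--   101,Medium,2023-01-01,2024-03-01
--   101,Large,2024-03-01,9999-12-31

-- "What was the classification on March 15th?" now requires
-- a range join instead of a simple equality join:
select
    f.sale_date,
    f.store_id,
    h.classification
from {{ ref('fct_sales') }} f
join {{ ref('store_hierarchy') }} h
    on f.store_id = h.store_id
   and f.sale_date >= h.valid_from
   and f.sale_date <  h.valid_to
```

Every editor of the CSV now has to keep the date ranges non-overlapping and gap-free by hand - one typo and the range join silently drops or duplicates rows.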
No audit trail. Yes, seed files are in Git, so you have commit history. But Git commit history is not an audit trail. It tells you what changed and who committed it. It doesn't tell you who requested the change, who approved it, why it was made, or whether it went through any kind of governance process. For regulated industries or any organisation that needs to demonstrate data lineage, Git history is a developer tool, not a compliance artefact.
No access controls. Anyone with write access to the dbt repo can modify any seed file. There's no mechanism for saying "the finance team can edit cost-centre mappings but not product hierarchies" or "changes to the store hierarchy require approval from the VP of Operations." In a small team, this doesn't matter. In a mid-size or large organisation, it's a governance gap.
There's a subtler issue that's worth naming explicitly: seed files couple reference data changes to code deployments.
When you change a seed file, the change only takes effect after dbt seed runs as part of a deployment. That means a simple reference data update - adding a new store, correcting a product classification, updating a cost-centre mapping - goes through the same deployment pipeline as model changes, macro updates, and schema migrations.
This creates two problems. First, it means reference data changes are batched with code changes, which makes deployments riskier and harder to roll back. If a seed file change and a model refactor are in the same deployment, and something breaks, you have to untangle which change caused the issue.
Second, it means reference data updates are gated by your deployment cadence. If you deploy daily, a reference data change submitted today might go live tomorrow. If you deploy weekly, it might take a week. For a business user who needs a new store to appear in reports now, this lag is unacceptable - and it's why they end up maintaining a separate spreadsheet.
Reference data changes should be decoupled from code deployments. A new store in the regional hierarchy is not a code change. It's a data change. It should flow through a data governance process, not a software release process.
To be fair - and this article would be dishonest without this section - seed files are genuinely good for certain things.
Static, universal reference data that rarely changes. ISO country codes. Currency codes. US state abbreviations. Data that is defined by an external standard and changes once a decade, if ever. For these, seed files are fine. The business user problem doesn't apply because nobody in your organisation is editing the ISO country code list. The scale problem doesn't apply because the dataset is tiny. The deployment coupling is acceptable because the data almost never changes.
Test fixtures and development data. Seed files are useful for loading test data into your development or CI environment. Small datasets used to validate model logic, edge cases for testing, or sample data for local development - this is a legitimate and common use.
Bootstrapping a project. When you're standing up a new dbt project and need some reference data to get started, seed files are the fastest way to get lookup tables into your warehouse. They're a good starting point. They're not a good permanent solution.
The pattern that breaks is when seed files graduate from "bootstrap data" to "the system of record for business-critical reference data." That transition usually happens silently - nobody decides to make seed files the permanent approach; the bootstrap setup simply never gets replaced.
The alternative isn't going back to spreadsheets. It's giving reference data a proper home - infrastructure designed for the job, not repurposed from a development tool.
Here's what that home needs to provide:
A business-user interface. The people who understand product categories, store hierarchies, and cost-centre mappings should be able to maintain them directly, without interacting with Git, CSV files, or deployment pipelines. This is the single most important requirement, and it's the one that seed files fundamentally cannot meet.
Governance and audit trails. Every change should be versioned, attributable, and auditable. Not just "who committed to Git" but "who requested, who approved, what was the business reason, and what did the data look like before." For regulated industries, this is non-negotiable. For everyone else, it's the difference between "we think the data is right" and "we can prove it."
API and SDK access for engineers. Your dbt models should be able to consume reference data from an external source - via a source definition that points to an API-loaded table, not a CSV file in the repo. This decouples reference data changes from code deployments. The operations manager updates the store hierarchy through the business interface; the change flows into the warehouse automatically, without a PR, a merge, or a dbt seed run.
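Concretely, this means swapping the ref() on a seed for a source() that points at a warehouse table the reference data system keeps loaded. A sketch, with the schema and field names as assumptions:

```yaml
# models/sources.yml - the table is maintained outside the dbt repo
version: 2

sources:
  - name: reference_data
    schema: ref_data            # loaded via the external system's API, not dbt seed
    tables:
      - name: store_hierarchy
        loaded_at_field: updated_at
        freshness:
          warn_after: {count: 24, period: hour}
```

In models, {{ ref('store_hierarchy') }} becomes {{ source('reference_data', 'store_hierarchy') }}. The freshness check gives you an alert if the external feed stalls - something a static seed file can never tell you, because a stale CSV looks identical to a current one.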
Validation and data types. Reference data should have defined schemas with enforced data types, allowed values, and referential integrity. A new row in the store hierarchy should be validated against the expected format before it's accepted, not after it breaks a downstream model.
Temporal awareness. The system should support effective dating natively. When a classification changes, the historical record is preserved. Your models can query reference data as of a point in time, enabling accurate historical analysis without custom logic.
You can build this yourself. A database, a CRUD interface, an API layer, and some governance logic. Some teams do. But maintaining a custom-built reference data management system is its own commitment - one that tends to accumulate the same kind of neglect that created the seed file problem in the first place.
TitanRDM was built for exactly this pattern. It gives your reference data a proper home: business-user interface for stewards and analysts, API and Python SDK for data engineers, governance and audit trails by default. You can import your existing seed file data in minutes, point your dbt models at the new source, and retire the CSV files from your repo.
But the tool matters less than the pattern. Whether you build it, buy it, or cobble it together from existing infrastructure - move your reference data out of seed files and into something purpose-built. Your dbt project will be cleaner, your business users will be unblocked, and your reference data will finally have the governance it deserves.
If you're ready to move reference data out of your dbt repo:
Import your seed file data in minutes. Serve it to your warehouse via API. Give business users an interface. Free tier - no credit card, no sales call.