James Swirhun
Credible

Rethinking Data Transformation with Malloy

TL;DR: dbt revolutionized data transformation by bringing software engineering practices to SQL. But as pipelines scale, the combination of SQL, Jinja, and YAML creates complexity that's hard to navigate and maintain. Malloy offers a different approach: a purpose-built declarative language where transformation logic, semantic modeling, and materialization are unified — not bolted together. The result: type-safe, composable transformations that compile in seconds and read clearly.

The dbt Revolution—and Its Growing Pains

dbt changed data engineering for the better. Before dbt, transformation logic lived in scattered scripts, scheduled jobs, and stored procedures with no version control, no testing, and no clear lineage. dbt introduced software engineering fundamentals: ref() functions for dependency management, YAML for configuration, and a model-driven approach to SQL.

Your dbt project probably started clean. A few models. Clear dependencies. Readable SQL with light Jinja templating.

But two years later, it looks different:

  • 400+ models spread across dozens of directories
  • Macros calling macros calling macros
  • {% if %} blocks nested five levels deep
  • YAML files that duplicate logic from SQL files
  • 10-minute dbt compile times
  • That one critical model nobody dares refactor

This isn't a dbt problem—it's a language problem.

SQL was designed for querying relational data, not orchestrating multi-step transformation pipelines. Jinja was designed for templating, not controlling flow. YAML was designed for configuration, not business logic. When you combine three languages that weren't designed to work together, complexity compounds. The dependency graph exists, but it's implicit — inferred from strings. Your IDE can't help you navigate it. Refactoring requires courage.
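To make "inferred from strings" concrete, here is a rough Python sketch of how a dbt-style tool might recover the dependency graph: by regex-scanning SQL text for ref('...') calls. The model names and SQL below are invented for illustration; note that a typo'd reference parses fine and only fails when the warehouse can't find the relation.

```python
import re

# Sketch of dbt-style dependency inference: the graph is recovered by
# scanning SQL text for {{ ref('...') }} calls. The names are just strings.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*'([^']+)'\s*\)\s*\}\}")

def infer_deps(models):
    """Map each model name to the set of model names its SQL references."""
    return {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}

models = {
    "stg_orders": "select * from raw.orders",
    # Typo: 'stg_order' instead of 'stg_orders'. Parsing alone can't tell.
    "order_metrics": "select * from {{ ref('stg_order') }}",
}

deps = infer_deps(models)
dangling = {d for targets in deps.values() for d in targets} - models.keys()
# dangling == {'stg_order'}: the bad reference survives until build time
```

Nothing in this scheme can distinguish a typo from a model that simply hasn't been created yet, which is exactly why the error surfaces so late.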

There's a better way.

Malloy: A Declarative Language Built for Data

Malloy is a high-level language for data. You express the logical structure and semantic meaning of your data — the result you want, not how to compute it. Documentation, visualization hints, and metadata all live directly in the model. Performance annotations are optional and additive — never required to understand the model itself.

SQL (and dbt) are low-level. They force you to describe how to compute what you want: join strategies, intermediate tables, CTEs, clustering keys, materializations, dependency graphs. Key metadata is scattered across SQL, YAML, macros, and downstream BI layers.

This matters beyond maintainability, in two ways that are becoming urgent:

For the engineer who owns the pipeline: when someone needs a new column or table for analysis, the work today means going deep into the weeds of dbt — understanding macro behavior, tracing implicit dependencies, hoping a change doesn't break something downstream. Even with AI agents helping write the code, the underlying complexity makes every change risky. Declarative structure makes the pipeline safe to change, because the compiler catches errors and the dependency graph is explicit.

For teams that don't have a dbt expert: as AI-powered analytics tools put more queries against your warehouse, materialization becomes essential to keep costs and latency under control. But if standing up a new materialized table requires deep dbt knowledge, only one team can do it. Malloy makes materialization accessible — any team can add it, because the language is readable and the tooling catches mistakes. That's the difference between a data stack that scales with headcount and one that scales with demand.

Both of these point in the same direction: declarative structure is the foundation for an AI-ready data stack — whether that's warehouse-native ML pipelines or traditional analytics.

Here's what that looks like in practice:

Type Safety

In dbt, column references are just strings — dbt compile won't catch a bad one. The error shows up when the warehouse runs the SQL.

-- dbt: This compiles fine but fails at runtime
select customer_id, nonexistent_column from {{ ref('customers') }}

In Malloy, every field reference is validated against the warehouse schema at compile time — before any query runs:

// Malloy: Compiler error - field 'nonexistent_column' not found
run: customers -> { select: customer_id, nonexistent_column }
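The difference is where the check happens. As a toy illustration (our own sketch, not Malloy's internals), a compiler that holds the warehouse schema can reject unknown field references before generating any SQL at all:

```python
# Toy sketch of compile-time field validation against a warehouse schema.
# The schema contents and table name here are invented for illustration.
SCHEMA = {
    "customers": {"customer_id": "string", "signup_date": "date"},
}

def validate(table, fields):
    """Return compile-time errors for field names missing from the schema."""
    known = SCHEMA.get(table, {})
    return [f"field '{f}' not found in '{table}'" for f in fields if f not in known]

print(validate("customers", ["customer_id"]))         # [] -- clean
print(validate("customers", ["nonexistent_column"]))  # one error, no query run
```

The check costs a dictionary lookup at compile time instead of a failed warehouse query at runtime.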

Explicit Dependencies

In dbt, dependencies are strings parsed at compile time. In Malloy, they're named variables — the compiler builds the dependency graph from the language itself:

source: order_metrics is orders extend {
  join_one: customers with customer_id
  join_one: products with product_id
 
  dimension:
    product_name is products.product_name  // Compiler verifies field exists and type
}

Breaking changes are caught before runtime. Your IDE can navigate definitions with Cmd+click.

Clean Imports and Flexible Metadata

Malloy has a proper module system and an annotation system for attaching metadata anywhere in your model:

import "sources/customers.malloy"
import "sources/orders.malloy"
 
source: sales is warehouse.table('sales') extend {
  dimension:
    #(doc) "Customer segment based on lifetime value"
    customer_segment is
      pick 'high_value' when lifetime_revenue > 10000
      pick 'medium_value' when lifetime_revenue > 1000
      else 'standard'
 
  measure:
    #(doc) "Total revenue across all sales"
    total_revenue is sum(amount)
 
    # percent
    margin_rate is sum(profit) / sum(amount)
}

Tags encode documentation (#(doc)), rendering hints (# percent, # line_chart), or application-specific metadata. The system is intentionally flexible — applications interpret tags as needed.

Transformation Power Built Into the Language

dbt's transformation capabilities come from macros and packages. Malloy builds common patterns directly into the language.

The CTE Problem

In most SQL dialects, you can't reference a column alias in the same SELECT where it's defined. When transformations build on each other, you need CTEs. Here's a real-world example from our event logs at Credible:

with parsed as (
  select *,
    regexp_extract(raw_user, r'user:[\\"]*([^\\"]+)[\\"]*') as user
  from raw_logs
),
cleaned as (
  select *,
    coalesce(user, regexp_extract(raw_user, r'([^:]+@[^:]+)')) as cleaned_user
  from parsed
),
classified as (
  select *,
    cleaned_user like '%@mycompany.com' as is_internal_user
  from cleaned
)
select *, not is_internal_user as is_external_user from classified

In Malloy, dimensions reference other dimensions directly:

source: logs is warehouse.table('raw_logs') extend {
  dimension:
    user is regexp_extract(raw_user, r'user:[\\"]*([^\\"]+)[\\"]*')
    cleaned_user is coalesce(user, regexp_extract(raw_user, r'([^:]+@[^:]+)'))
    is_internal_user is cleaned_user ~ r'@mycompany\.com$'
    is_external_user is not is_internal_user and cleaned_user != null
}

Four lines instead of three CTEs and a final SELECT. Each dimension can reference any earlier dimension — no restructuring needed when logic changes.
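Under the hood, a compiler can flatten such a chain by substituting each dimension's expression into the ones that reference it. A rough Python sketch of the idea (our illustration, not Malloy's actual compiler, with the expressions simplified to placeholders):

```python
import re

# Simplified dimension chain: each entry may reference earlier names.
dimensions = {
    "user": "regexp_extract(raw_user, pattern)",
    "cleaned_user": "coalesce(user, fallback)",
    "is_internal_user": "cleaned_user like '%@mycompany.com'",
}

def inline(name):
    """Recursively substitute dimension names with their defining expressions."""
    expr = dimensions[name]
    for dep in dimensions:
        # \b keeps 'user' from matching inside 'raw_user' or 'cleaned_user'.
        if dep != name and re.search(rf"\b{dep}\b", expr):
            expr = re.sub(rf"\b{dep}\b", f"({inline(dep)})", expr)
    return expr

# 'is_internal_user' expands to one self-contained expression,
# so the generated SQL needs no CTE scaffolding.
print(inline("is_internal_user"))
```

Because every dimension resolves to a single expression, the compiler is free to emit one SELECT, or to introduce stages only where the target dialect requires them.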

Developer Experience

The Malloy extension for VS Code and Cursor provides LSP support — errors are caught as you type with autocomplete, hover documentation, and go-to-definition. The Malloy CLI compiles and validates transformations. And when you need database-specific functionality, sql() blocks give you an escape hatch with type safety intact.

Materialization as a First-Class Language Feature

If you've used dbt, you know materialization: saving the result of a transformation as a table so downstream queries read from that table instead of recomputing it. Malloy brings the same concept into the language itself — no separate YAML config, no external orchestration. (Malloy calls this feature "persistence" using the #@ persist annotation, but it's the same idea as dbt's materializations.)

Marking a source with #@ persist saves the backing query's result as a warehouse table, and subsequent queries read from that table instead of recomputing it. It's available today as an experimental feature:

#@ persist name=customer_metrics
source: customer_metrics is orders -> {
  join_one: customers with customer_id
 
  group_by:
    customers.customer_id
    customers.signup_date
    customers.tier
  aggregate:
    order_count is count()
    total_revenue is sum(amount)
    avg_order_value is avg(amount)
}

  • Materialization is visible in the model — no separate YAML config files
  • Dependencies are automatic — the compiler knows what to refresh when sources change
  • Builds are incremental — each source gets a content-addressed BuildID; if the logic hasn't changed, the table isn't rebuilt
  • Logic stays unified — the same model definition powers both materialization and queries

Under the hood, malloy-cli build walks the dependency graph, creates tables in order, and writes a manifest. The Malloy extension reads that manifest automatically. This becomes especially powerful as AI agents query your data more frequently — materialized tables serve results at warehouse query speed instead of recomputing transformations on every request. It's also critical for expensive operations like LLM inference and embedding generation — see Warehouse-Native ML Pipelines and Entity Matching with Embeddings for real-world examples. See the Malloy persistence documentation to try it yourself.
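The incremental behavior is easy to picture with a sketch. The hashing scheme below is our illustration of the content-addressed idea, not malloy-cli's actual implementation: a source's BuildID covers its own definition plus its dependencies' BuildIDs, so an upstream edit ripples downstream while an unchanged chain keeps its IDs and is skipped.

```python
import hashlib

def build_id(definition, dep_ids):
    """Content-addressed BuildID: hash of the source's definition text
    plus the BuildIDs of everything it depends on."""
    h = hashlib.sha256(definition.encode())
    for dep in sorted(dep_ids):  # order-independent over dependencies
        h.update(dep.encode())
    return h.hexdigest()[:12]

orders_v1 = build_id("source: orders is warehouse.table('orders')", [])
metrics_v1 = build_id("source: order_metrics is orders -> ...", [orders_v1])

# Identical definitions yield identical IDs: the table is not rebuilt.
assert metrics_v1 == build_id("source: order_metrics is orders -> ...", [orders_v1])

# Editing the upstream source changes its ID, and therefore downstream IDs too.
orders_v2 = build_id("source: orders is warehouse.table('orders') extend {}", [])
assert build_id("source: order_metrics is orders -> ...", [orders_v2]) != metrics_v1
```

This is the same property that makes content-addressed caches in build tools reliable: correctness doesn't depend on timestamps or manual invalidation.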

Example: Full Transformation Comparison

A common pattern: building customer aggregates with joins, computed fields, and persistence.

The dbt Way

dbt total: 4 SQL files plus ~40 lines of YAML declaring tests, constraints, and docs separately from the SQL.

-- stg_orders.sql
select
  order_id,
  customer_id,
  product_id,
  order_date,
  order_total
from {{ source('raw', 'orders') }}
 
-- stg_customers.sql
select
  customer_id,
  customer_tier,
  signup_date
from {{ source('raw', 'customers') }}
 
-- stg_products.sql
select
  product_id,
  product_name,
  category,
  price
from {{ source('raw', 'products') }}
 
-- order_metrics.sql
{{
  config(materialized='table')
}}
 
select
  o.order_id,
  o.order_date,
  c.customer_id,
  c.customer_tier,
  case
    when c.signup_date >= current_date - interval '30 days' then 'new'
    else 'established'
  end as customer_segment,
  p.product_name,
  p.category,
  o.order_total
from {{ ref('stg_orders') }} o
left join {{ ref('stg_customers') }} c using (customer_id)
left join {{ ref('stg_products') }} p using (product_id)

The Malloy Way

#@ persist name=order_metrics
source: order_metrics is warehouse.table('orders') extend {
  join_one: customers is warehouse.table('customers') with customer_id
  join_one: products is warehouse.table('products') with product_id
 
  dimension:
    customer_segment is customers.signup_date ?
      pick 'new' when >= now - 30 days
      else 'established'
 
  measure:
    order_count is count()
    total_revenue is sum(order_total)
    avg_product_price is products.price.avg()
 
  // Reusable query shape — run against the persisted source
  view: summary is {
    group_by:
      customers.customer_id
      customers.customer_tier
      customer_segment
      products.category
    aggregate:
      order_count
      total_revenue
      avg_product_price
  }
}

Malloy total: 1 file, 1 language. join_one encodes the relationship cardinality explicitly — one customer per order, one product per order. No duplicated structure definitions across files and languages — the join declaration is the documentation. #@ persist is the materialization — declared inline, right where the logic lives. Everything is in one place.

Get Started

The dbt revolution taught us that data transformation deserves software engineering practices. Malloy takes that further: a unified, purpose-built language for transformation, semantic modeling, and materialization. Not YAML. Not Jinja. Not SQL with config blocks. A language designed for the problem.