James Swirhun
Credible

The Future of ML Pipelines

TL;DR: While everyone was focused on LLMs, something equally important happened: data warehouses gained native ML capabilities. BigQuery, Snowflake, and Databricks can now run embeddings, classification, and text generation as SQL functions—no training clusters, no model registries, no separate serving infrastructure. Anyone can build an ML pipeline now. The harder problem is building one you can trust: governed, version-controlled, with evaluation that proves it works. Malloy is purpose-built for this: a single, composable language where data transformation, LLM operations, evaluation, and materialization all live together.

Something Quietly Changed

If you work with data, you've probably noticed the shift—even if you haven't named it yet.

A few years ago, building an ML pipeline meant exporting CSVs to a training cluster, managing model artifacts, and standing up serving infrastructure. It was a whole discipline, and only well-resourced teams could pull it off.

Now? Your warehouse can classify text, generate embeddings, and run LLM inference directly in a query. BigQuery has ML.GENERATE_TEXT. Snowflake has Cortex. Databricks has AI Functions. The ML operation happens where the data already lives.

This is a genuine inflection point. But it creates a new problem: the tools we use to work with warehouse data weren't designed for ML workflows. SQL gets you part of the way. dbt gets you further. But neither was built for pipelines where transformation, ML execution, evaluation, and cost management need to work together as one coherent system.

That's the gap Malloy fills.

Why This Matters Now: Two Shifts Colliding

Two things happened at roughly the same time, and their intersection is what makes warehouse-native ML practical.

Shift 1: Warehouses became ML platforms. BigQuery ML, Snowflake Cortex, and Databricks AI Functions brought model inference into SQL. No data egress. No separate infrastructure. Just a function call in your query.

Shift 2: LLMs replaced custom training for many use cases. Problems that once required months of labeled data, custom model architectures, and GPU clusters—sentiment classification, entity extraction, text summarization—can now be solved with a well-crafted prompt. The "training" step became prompt engineering.

Together, these shifts mean that an enormous class of ML work can now be expressed entirely inside your data warehouse. Sentiment analysis on customer reviews. Categorization of support tickets. Entity extraction from contracts. Embedding-based matching across datasets. All of it can run where the data lives.

The infrastructure requirements for these pipelines have collapsed. But the need for rigor hasn't.

The Stages Don't Go Away

Google's TFX platform defined what production ML pipelines need. The stages it identified are ones every serious ML system still requires:

  1. Data Analysis — understand your input distributions
  2. Data Transformation — engineer features consistently
  3. Data Validation — catch quality issues before they reach your model
  4. Model Execution — run the ML operation
  5. Evaluation — measure quality against baselines before trusting results
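
To make the flow concrete, here is a rough sketch of those five stages as plain Python functions composed into a linear pipeline. All names and the record shape are invented for illustration; this is not TFX code:

```python
# Illustrative sketch: the five TFX-style stages as plain functions.

def analyze(records):
    # Stage 1: understand input distributions
    lengths = [len(r["text"]) for r in records]
    return {"n": len(records), "avg_len": sum(lengths) / len(lengths)}

def transform(records):
    # Stage 2: engineer features consistently
    return [{**r, "text": r["text"].strip().lower()} for r in records]

def validate(records):
    # Stage 3: catch quality issues before the model sees them
    return [r for r in records if r["text"]]

def execute(records, model):
    # Stage 4: run the ML operation
    return [{**r, "label": model(r["text"])} for r in records]

def evaluate(records, golden):
    # Stage 5: measure quality against known-good labels
    correct = sum(1 for r in records if golden.get(r["id"]) == r["label"])
    return correct / len(records)
```

The point of the sketch is the ordering: evaluation is a first-class stage, not an afterthought bolted on once something is in production.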

TFX enforced discipline on what had been chaos: models trained in notebooks, thrown over the wall, silently degrading in production. That discipline was—and remains—essential.

The handoff problem ran deeper than infrastructure, too. Data scientists explored data and built models in Python notebooks. When something worked, they handed it to engineers to rewrite in SQL, dbt, and Airflow for production. The translation introduced bugs, the feedback loop stretched to weeks, and the person who understood the problem best — the data scientist — lost control of the solution the moment it left their notebook.

But TFX-era infrastructure was built for a world where "model execution" meant distributed gradient descent across GPU clusters. It required Apache Beam pipelines, TensorFlow Transform, ML Metadata stores, Kubeflow orchestration, protocol buffer configs, intermediate artifacts scattered across GCS buckets. A "simple" pipeline might span 15+ files across multiple languages and formats.

That overhead was justified when training meant custom neural architectures. For warehouse-native LLM pipelines, it's wildly disproportionate.

What This Looks Like in Practice

Let's make this concrete. Say you have a table of customer reviews and you want to classify sentiment, validate quality, and measure accuracy. This is the kind of pipeline that used to require dedicated ML infrastructure. Now it's a warehouse operation.

Here's the entire pipeline in Malloy:

Analysis & Transformation

In Malloy, you define a source—a semantic layer over your warehouse table—and add dimensions (computed columns) and measures (aggregations) that compose on top of each other:

source: customer_reviews is bigquery.table('project.dataset.reviews') extend {
  dimension:
    // Clean the text
    cleaned_text is trim(lower(review_text))
    word_count is array_length(split(cleaned_text, ' '))
 
    // Build the prompt — this IS the "feature vector" for the LLM
    classification_prompt is concat(
      'Classify this review as exactly one word - POSITIVE, NEGATIVE, or NEUTRAL. Review: ',
      cleaned_text
    )
 
  measure:
    total_reviews is count()
    avg_rating is avg(rating)
    reviews_with_text is count() { where: word_count > 0 }
}

Each dimension builds on the ones before it. cleaned_text feeds word_count feeds classification_prompt. Change a definition, and everything downstream updates. No separate transformation framework. No preprocessing graph.
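
The same dependency chain can be sketched outside Malloy in plain Python (illustrative only; the Malloy dimensions above are the real definitions):

```python
# Each derived value builds on the previous one, mirroring how the
# Malloy dimensions compose: cleaned_text -> word_count -> prompt.

def cleaned_text(review_text):
    return review_text.strip().lower()

def word_count(review_text):
    return len(cleaned_text(review_text).split(" "))

def classification_prompt(review_text):
    return (
        "Classify this review as exactly one word - POSITIVE, NEGATIVE, "
        "or NEUTRAL. Review: " + cleaned_text(review_text)
    )
```

The difference is that in Malloy these definitions live in the semantic layer, so every downstream query, view, and source reuses them instead of re-deriving them.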

Validation

Instead of a separate validation component with its own configuration, validation is a view inside the same source:

source: reviews_validated is customer_reviews extend {
  view: quality_checks is {
    aggregate:
      total_records is count()
      null_text is count() { where: review_text is null }
      too_short is count() { where: word_count < 3 }
      duplicate_ids is count() - count(distinct review_id)
  }
}

Run quality_checks before your ML operation. It's version-controlled with your pipeline code. No separate tool, no separate config file.
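
For intuition, here is what those checks compute, sketched in plain Python over a list of records. Field names match the Malloy source; the function itself is invented for this example:

```python
def quality_checks(records):
    # Mirrors the Malloy quality_checks view: each aggregate is a
    # simple count over the same set of records.
    def word_count(text):
        return len(text.split(" "))
    ids = [r["review_id"] for r in records]
    return {
        "total_records": len(records),
        "null_text": sum(1 for r in records if r["review_text"] is None),
        "too_short": sum(
            1 for r in records
            if r["review_text"] is not None and word_count(r["review_text"]) < 3
        ),
        "duplicate_ids": len(ids) - len(set(ids)),
    }
```

In the warehouse, Malloy compiles the view to a single aggregating query, so these checks run at scale rather than row by row in application code.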

ML Execution

Here's where the paradigm shift is most visible. The "model" is a warehouse function call:

source: reviews_classified is reviews_validated extend {
  dimension:
    ai_response is sql_string("""
      JSON_EXTRACT_SCALAR(
        (SELECT ml_generate_text_result
         FROM ML.GENERATE_TEXT(
           MODEL `project.gemini_model`,
           (SELECT ${classification_prompt} AS prompt),
           STRUCT(0.1 AS temperature, 10 AS max_output_tokens)
         )),
        '$.candidates[0].content.parts[0].text'
      )
    """)
 
    sentiment is
      pick 'positive' when upper(ai_response) ~ r'POSITIVE'
      pick 'negative' when upper(ai_response) ~ r'NEGATIVE'
      else 'neutral'
}

The "training" is the prompt. The "hyperparameters" are temperature and token limits. The model is hosted by your warehouse provider. No training cluster. No model registry. No serving infrastructure.
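
The pick expression above normalizes whatever the model returns. In plain Python, the same post-processing might look like this (an illustrative sketch, not part of the Malloy model):

```python
import re

def to_sentiment(ai_response):
    # Mirrors the Malloy pick expression: case-insensitive match on the
    # raw model output, checked in order, with 'neutral' as the fallback.
    text = (ai_response or "").upper()
    if re.search(r"POSITIVE", text):
        return "positive"
    if re.search(r"NEGATIVE", text):
        return "negative"
    return "neutral"
```

A low temperature makes a clean one-word answer likely, but this guard still matters: it absorbs whitespace, punctuation, and the occasional chatty response without breaking the pipeline.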

Materialization: The Cost Problem, Solved

LLM calls are expensive. You don't want to re-classify every review on every query. This is where Malloy's materialization becomes essential—and it's one of the most important pieces of the architecture.

#@ persist name=reviews_with_sentiment
source: reviews_with_sentiment is reviews_classified -> {
  select: review_id, review_text, sentiment, created_at
}

The #@ persist annotation tells Malloy to materialize results as a warehouse table. (Malloy uses the term "persistence" for what dbt calls "materialization" — same concept, different name.)

The LLM runs once per record. Subsequent queries, dashboards, and downstream analytics all read from the materialized table — pennies instead of dollars. When new reviews arrive, only the new records hit the LLM. For more on how materialization works as a language feature, see Rethinking Data Transformation.
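
The cost logic is easy to see in a sketch: treat the materialized table as a cache keyed by record ID, and call the model only for IDs not already in it. This is plain Python with invented names; Malloy's persistence does the equivalent at the warehouse level:

```python
def classify_incremental(reviews, materialized, classify):
    # 'materialized' plays the role of the persisted warehouse table:
    # a dict of review_id -> sentiment. Only unseen records pay the
    # LLM cost; everything else is a cheap lookup.
    calls = 0
    for r in reviews:
        if r["review_id"] not in materialized:
            materialized[r["review_id"]] = classify(r["review_text"])
            calls += 1
    return calls
```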

Evaluation

Finally, evaluation is another view—not a separate component with its own configuration:

source: sentiment_eval is reviews_with_sentiment extend {
  join_one: golden_labels on review_id = golden_labels.review_id
 
  dimension:
    is_correct is sentiment = golden_labels.true_sentiment
 
  view:
    # dashboard
    accuracy_report is {
      aggregate:
        # percent
        accuracy is count() { where: is_correct } / count()
      nest: by_class is {
        group_by: golden_labels.true_sentiment
        aggregate:
          # percent
          recall is count() { where: is_correct } / count()
      }
    }
 
    error_analysis is {
      where: not is_correct
      select: review_text, sentiment, golden_labels.true_sentiment
      limit: 50
    }
}

Run accuracy_report to see metrics. Run error_analysis to understand failures. The evaluation logic lives with the pipeline—version-controlled, reproducible, queryable at warehouse scale.
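
The arithmetic behind accuracy_report is simple. Here it is in plain Python, where 'golden' maps review_id to the labeled true sentiment (an illustrative sketch with invented names):

```python
def accuracy_report(predictions, golden):
    # predictions: list of {"review_id", "sentiment"} rows.
    # Returns overall accuracy plus a per-true-class hit rate (recall):
    # the share of each golden class the pipeline got right.
    correct = [p for p in predictions if p["sentiment"] == golden[p["review_id"]]]
    by_class = {}
    for p in predictions:
        truth = golden[p["review_id"]]
        hits, total = by_class.get(truth, (0, 0))
        by_class[truth] = (hits + (p["sentiment"] == truth), total + 1)
    return {
        "accuracy": len(correct) / len(predictions),
        "by_class": {k: hits / total for k, (hits, total) in by_class.items()},
    }
```

The Malloy version computes the same ratios, but as one warehouse query over the full dataset, with the join to golden_labels handled by the semantic model.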

Why Not Just Use SQL (or dbt)?

You can build warehouse-native ML pipelines in raw SQL or dbt, but managing a full pipeline across those tools gets complicated fast.

Here's what a comparable dbt pipeline requires:

  • packages.yml to install the evals package
  • dbt_project.yml with global variables for judge model, criteria, sampling rates, pass thresholds—plus warehouse-specific variables (GCP project, location, connection ID, endpoint)
  • A YAML model config with post_hook, meta.llm_evals settings, input_columns, output_column, prompt, baseline_version, sampling_rate
  • A separate SQL model file per warehouse dialect (the prompt assembly alone is a wall of concat() calls)
  • Multiple dbt run commands to set up tables, execute the model, and run evaluations separately
  • Raw SQL queries against evaluation output tables to see results

That's six or more files, three languages (YAML, SQL, Jinja), and a multi-step execution workflow—for a single classification pipeline. And the transformation logic is scattered: some in YAML config, some in SQL, some in Jinja macros.

Malloy collapses this into one language where every stage composes naturally. The transformation, execution, validation, and evaluation are all defined in the same file, using the same constructs. You can read the entire pipeline top to bottom and understand what it does.

This isn't just about fewer files. It's about composability. In Malloy, a dimension defined in one source can be referenced in downstream sources. A measure defined for validation can be reused in evaluation. Views layer on top of each other without duplicating logic. This is what it means to be a declarative modeling language rather than a templating system on top of SQL.

And because every stage — transformation, ML execution, evaluation, visualization — lives in the same language, you don't need separate tools for each. The same Malloy file you use for exploratory analysis can evolve into the governed production pipeline — no rewrite into a different tool required.

Who This Is For

Anyone can stand up an ML pipeline now. The harder question is: how do you know it works? Without evaluation, you're shipping a black box. Without governed definitions, every team builds their own version of "sentiment" or "match quality" and gets different answers.

Malloy changes who can ship ML to production. With warehouse-native ML, the person who understands the problem can own the entire workflow — explore data, engineer features, call the model, evaluate results, materialize for production, and build dashboards — all in one language, one tool, one file. No handoff to a separate engineering team. No rewriting notebooks into SQL. No waiting weeks to see your model in production.

If your team works with unstructured data in a warehouse — customer reviews, support tickets, contracts, product descriptions — and you want pipelines you can evaluate, version-control, and iterate on, this is for you.

Get Started