Blog
/
AI & ML
Ofer Mendelevitch
DevRel @ Credible

Making Healthcare Data AI-Ready

TL;DR: AI agents are quickly learning to read documents, but the harder enterprise challenge is teaching them to use the structured data that already runs the business. Raw warehouse access is not enough: schemas are ambiguous, joins are easy to misuse, and confident SQL can still produce wrong answers. Making OMOP data AI-ready requires a governed model that captures clinical meaning: vocabulary mappings, cohort logic, valid joins, trusted measures, and safe query patterns. Malloy provides that layer for complex schemas like OMOP, while Credible makes those models discoverable, governed, and available to production agents through MCP.

Over the last couple of years, a lot of the focus around AI agents has gone into unstructured data. Retrieval over PDFs, Word docs, slide decks, support tickets, and web pages is now a well-trodden path: chunk it, embed it, retrieve it, and let the LLM summarize. It works, and plenty of useful products have been built on it.

But that is not where most enterprises keep the data that actually runs the business. In healthcare, much of that data lives as structured data represented using the OMOP Common Data Model that teams rely on for cohort studies and real-world evidence research.

As enterprise AI agents move from impressive demos toward mission-critical use, the question I keep running into is: "how do we let an AI agent use structured data — reliably, and at scale?"

That turns out to be a harder problem than reading a document. A document is self-describing; the words are the meaning. Healthcare data in a data warehouse is not. A column called gender_concept_id does not explain itself (and it's even worse if the column name is called gdr_concept_id), two tables that look joinable sometimes are not, and the difference between a correct patient count and a plausible-looking wrong one might be a join the LLM didn't know it needed. Point an agent at raw tables and it will sometimes write confident SQL that double-counts patients or invents a relationship, and you may not even realize it.

This is where Malloy — an open source language for describing data relationships and transformations — comes in. Unlike SQL, Malloy allows you to define analytical logic in a governed and reusable way. Publishing Malloy models gives AI agents a governed semantic layer: reusable definitions, approved relationships, and trusted transformations that help them answer questions correctly instead of improvising over raw tables.

To make this concrete, the examples below use a synthetic COVID-19 dataset in OMOP format, the Observational Medical Outcomes Partnership Common Data Model maintained by OHDSI.

OMOP gives organizations a shared schema for representing patients, visits, conditions, drug exposures, procedures, measurements, observations, and outcomes. That is a powerful foundation, but it is still only a foundation: a schema can standardize where data lives without making it obvious how the data should be used. An AI agent still needs to know which concepts matter, how entities relate, which joins are valid, which definitions are approved, and which fields are sensitive.

That is what it means for data to be truly "AI-ready." The data is not just available to an agent; it is exposed through a layer that makes correct use more likely and incorrect use harder.

For analytical work, data is AI-ready when it is:

  • Reliable: the agent can ask questions without inventing joins, double-counting rows, or misusing fields
  • Meaningful: the model exposes clinical concepts and reusable measures, not just table and column names
  • Governed: sensitive fields, access rules, and approved definitions are enforced by the system
  • Discoverable: the agent can find the relevant fields, measures, and views from a user's intent

That is the gap Malloy helps close.

Why OMOP is both valuable and hard to query

OMOP standardizes observational health data, and that standardization is exactly what lets a phenotype or cohort study remain portable across institutions, but it also creates complexity. OMOP is not one flat table — it is a graph of clinical entities. One patient has many visits; each visit relates to conditions, procedures, drug exposures, and observations. And clinical facts are stored as numeric concept IDs that must be joined back to the OMOP vocabulary to become readable. A drug exposure is not recorded as "amlodipine"; it is drug_concept_id = 1332419. A patient's gender is not "Female"; it is 8532.

So even simple-sounding questions quickly become complex in SQL. For example, consider the following question:

How many COVID-19 patients died, and how did their baseline characteristics differ from those who survived?

In SQL you need to follow a sequence of steps:

  1. Find patients with COVID-19 diagnoses in condition_occurrence
  2. Pick each patient's first diagnosis as the "index event"
  3. Join to person for demographics and death for vital status
  4. Compute age at the index event (not age today)
  5. Look back before the index event to summarize baseline characteristics such as sex, race, ethnicity, age bands, visits, or co-morbidities
  6. Keep patient-level counts correct while joining to several one-to-many tables

None of these steps is hard on its own, but all together they are easy to get wrong. An agent pointed at raw OMOP tables sees person, condition_occurrence, drug_exposure, visit_occurrence, and concept. It still has to infer which concept IDs mean what, how to count distinct patients, how to avoid fan-out across event tables, how to define a cohort, and which analyses are unsupported because the underlying data is missing. That can be a lot to ask of an LLM on every query, and it is often why agents get the wrong results.

This is exactly why ATLAS and other OHDSI tools are so important: they give researchers proven workflows for cohort definition, standardized vocabularies, phenotype development, and reproducible observational studies. The question is not whether those tools matter; they clearly do. The question is what additional context an AI agent needs when it is asked to discover concepts, compose queries, and explain results dynamically at runtime.

That is the gap Malloy helps close: it gives the agent a governed model where relationships, measures, clinical definitions, vocabulary mappings, and safe query patterns are already encoded, making OMOP data usable by AI systems without asking the model to rediscover OMOP from raw tables on every query.

In this blog post, we will use a realistic OMOP-shaped dataset — the synthea-covid19-10k dataset — which contains over 10,000 synthetic patients (no PHI) to demonstrate this idea in more detail, based on the Malloy model for OMOP and associated notebooks in the Malloy samples repository.

Malloy captures OMOP's meaning in a reusable model

In the Malloy-samples repository, the omop.malloy model adds OMOP semantics to any OMOP-shaped dataset. For example, it extends the raw person table with joins to the vocabulary table so demographic codes become readable concepts:

source: person is omop_person extend {
  primary_key: person_id
  join_one: gender is concept on gender_concept_id = gender.concept_id
  join_one: race is concept on race_concept_id = race.concept_id
  join_one: ethnicity is concept on ethnicity_concept_id = ethnicity.concept_id
  dimension:
    age is year(now) - year_of_birth
    age_band is age ?
      pick '0-17'  when < 18
      pick '18-39' when < 40
      pick '40-64' when < 65
      else '65+'
  measure: person_count is count()
}

It also defines a person_count measure and two demographic dimensions: age and age_band. Note that this age is "current age" — a convenience helper for simple demographic cuts. Age at a clinical event — for example, a patient's age at their first COVID-19 diagnosis, which is what the Table 1 in Example 3 below reports — is a different quantity, computed at the cohort level from the event date and year_of_birth. Keeping "age now" and "age at index" distinct is something every OMOP study has to get right, and the model makes both available rather than conflating them.

Let's dive into three examples to better understand how this Malloy semantic model for OMOP helps your AI agents use the data in the right way and get correct results.

Example 1: the data becomes readable

Let's start with a simple question: "how many patients have coronary arteriosclerosis?"

OMOP stores clinical and demographic facts as numeric concept IDs. In SQL, answering that requires joining condition_occurrence to concept so the numeric condition_concept_id can be translated into a readable condition name:

select c.concept_name, count(distinct co.person_id) as patients
from condition_occurrence co
join concept c on co.condition_concept_id = c.concept_id
where c.concept_name = 'Coronary arteriosclerosis'
group by c.concept_name;

This pattern repeats for gender, race, ethnicity, condition, drug, procedure, visit type, cause of death, and many other fields. The same vocabulary lookup logic gets repeated everywhere, resulting in long and complex SQL queries.

With Malloy, the vocabulary relationship is defined once in the model. After that, the query can work with the readable concept name directly:

run: condition_occurrence -> {
  where: condition_concept.concept_name = 'Coronary arteriosclerosis'
  group_by: condition_concept.concept_name
  aggregate: patient_count
}

And the output is:

concept_namepatient_count
Coronary arteriosclerosis1,292

The query now reads much closer to the question, and the LLM does not need to reason about foreign keys or guess which concept column maps to which lookup table. With Malloy, you avoid including a mini-tutorial on OMOP vocabulary joins in every prompt.

Example 2: correct numbers across one-to-many tables

Many useful OMOP questions start at the patient and fan out to several event tables at once: how many conditions, drugs, visits, and procedures does each patient have, and how do those differ by outcome?

In SQL, joining several one-to-many tables off the patient is a classic source of double-counting. If a patient has 5 conditions, 3 drug exposures, and 4 visits, a naive join across all three produces 5 × 3 × 4 = 60 rows for that one patient. Count anything after that and the numbers are wrong unless you are very careful.

The OMOP Malloy model encodes the fan-out structure explicitly in a patient-centric source:

source: patient is person extend {
  join_many: conditions is condition_occurrence on person_id = conditions.person_id
  join_many: drugs is drug_exposure on person_id = drugs.person_id
  join_many: visits is visit_occurrence on person_id = visits.person_id
  join_many: procedures is procedure_occurrence on person_id = procedures.person_id
  join_one: death on person_id = death.person_id
  dimension:
    vital_status is pick 'Died' when death.person_id is not null else 'Survived'
  measure:
    total_conditions is conditions.count()
    total_drugs is drugs.count()
    total_visits is visits.count()
    total_procedures is procedures.count()
}

Because Malloy understands the graph of relationships rather than flattening everything into one row space, its symmetric aggregates make total_conditions and total_drugs both correct in the same query, even though both fan out under person.

With that in place, a real analysis script remains short. For example, the following compares healthcare utilization in the COVID-19 cohort, split by whether the patient died:

run: condition_occurrence -> {
  where: condition_concept.concept_name = 'COVID-19'
  group_by: person_id
} -> {
  join_one: patient with person_id
  group_by: person_id, vital_status is patient.vital_status
  aggregate:
    visits is patient.total_visits
    drugs is patient.total_drugs
    procedures is patient.total_procedures
} -> {
  group_by: vital_status
  aggregate:
    cohort is count()
    mean_visits is round(visits.avg(), 1)
    mean_drugs is round(drugs.avg(), 1)
    mean_procedures is round(procedures.avg(), 1)
}

The output we get is:

vital_statuscohortmean_visitsmean_drugsmean_procedures
Survived91611.337.64.4
Died4813.668.130.3

The important thing here is that the query does not try to answer everything in one flat join. It moves through the analysis at the right grain. First, it defines the COVID-19 cohort from condition_occurrence. Then it collapses that cohort to one row per patient, so each person is counted once. Only after that does it enrich each patient with utilization measures from the patient model: total visits, total drug exposures, and total procedures. Finally, it summarizes those per-patient measures by vital_status.

That order matters. The utilization counts are not being computed over a large (accidental) join of conditions × drugs × visits × procedures. They are computed at the patient level, where they belong, and then averaged across the cohort. The result is a query that matches the intent: compare per-patient healthcare utilization between patients who survived and patients who died.

Underneath, this logic still converts to SQL for execution, but the point is not that SQL cannot express it; of course it can. The point is that the correct SQL has to preserve the cohort grain, avoid fan-out, and aggregate at the patient level before summarizing by outcome. The equivalent generated SQL looks like this:

WITH __stage0 AS (
  SELECT
     base."person_id" as "person_id"
  FROM 'omop/data/condition_occurrence.parquet' as base
   LEFT JOIN 'omop/data/concept.parquet' AS condition_concept_0
    ON base."condition_concept_id"=condition_concept_0."concept_id"
  WHERE condition_concept_0."concept_name"='COVID-19'
  GROUP BY 1
)
, __stage1 AS (
  SELECT
     base."person_id" as "person_id",
     (CASE WHEN death_0."person_id" IS NOT NULL THEN 'Died' ELSE 'Survived' END) as "vital_status",
     (COUNT(DISTINCT visits_0."visit_occurrence_id")) as "visits",
     (COUNT(DISTINCT drugs_0."drug_exposure_id")) as "drugs",
     (COUNT(DISTINCT procedures_0."procedure_occurrence_id")) as "procedures"
  FROM __stage0 as base
   LEFT JOIN 'omop/data/person.parquet' AS patient_0
    ON patient_0."person_id"=base."person_id"
   LEFT JOIN 'omop/data/death.parquet' AS death_0
    ON patient_0."person_id"=death_0."person_id"
   LEFT JOIN 'omop/data/visit_occurrence.parquet' AS visits_0
    ON patient_0."person_id"=visits_0."person_id"
   LEFT JOIN 'omop/data/drug_exposure.parquet' AS drugs_0
    ON patient_0."person_id"=drugs_0."person_id"
   LEFT JOIN 'omop/data/procedure_occurrence.parquet' AS procedures_0
    ON patient_0."person_id"=procedures_0."person_id"
  GROUP BY 1,2
  ORDER BY 3 desc NULLS LAST
)
SELECT
   base."vital_status" as "vital_status",
   COUNT(1) as "cohort",
   ROUND(AVG(base."visits"),1) as "mean_visits",
   ROUND(AVG(base."drugs"),1) as "mean_drugs",
   ROUND(AVG(base."procedures"),1) as "mean_procedures"
FROM __stage1 as base
GROUP BY 1
ORDER BY 2 desc NULLS LAST

Notice everything that SQL has to get right by hand: define the cohort, keep the aggregation grain correct, avoid multiplying rows across several one-to-many tables, join death status, and only then compute the outcome-level summaries. In Malloy each stage has one job, so the pipeline stays readable.

The staged shape with Malloy is exactly what makes this type of query safe for an agent. The agent can reason over cohort → patient enrichment → summary far more reliably than it can emit one large multi-join statement in a single pass.

Example 3: a real study, defined once

As our final example, let's look at how to create the famous "Table 1" that is common in almost any clinical study: a summary of the cohort, stratified by an outcome.

The Malloy code for this is shown in the baseline_characteristics example notebook. The Malloy code is longer than the simple examples above because the analysis is real, but the important point is that the complexity is expressed as reusable clinical logic rather than buried inside one large SQL statement.

The resulting "Table 1" looks like a familiar clinical cohort summary:

Table 1. Baseline characteristics of the COVID-19 cohort, by vital status
CharacteristicSubgroupSurvivedDied
Patients — no. (%)916 (95.0%)48 (5.0%)
Age at index — yr, mean (SD)44.6 (28.8)81.5 (14.9)
Visits per patient — mean11.313.6
Follow-up — yr, mean10.015.7
Sex — no. (%)Female482 (52.6%)15 (31.3%)
Sex — no. (%)Male434 (47.4%)33 (68.8%)
Race — no. (%)Asian56 (6.1%)1 (2.1%)
Race — no. (%)Black or African American66 (7.2%)2 (4.2%)
Race — no. (%)No matching concept27 (2.9%)1 (2.1%)
Race — no. (%)White767 (83.7%)44 (91.7%)
Ethnicity — no. (%)Hispanic or Latino108 (11.8%)6 (12.5%)
Ethnicity — no. (%)Not Hispanic or Latino808 (88.2%)42 (87.5%)
Age band — no. (%)0-17216 (23.6%)0 (0.0%)
Age band — no. (%)18-49301 (32.9%)3 (6.3%)
Age band — no. (%)50-64135 (14.7%)2 (4.2%)
Age band — no. (%)65-7479 (8.6%)5 (10.4%)
Age band — no. (%)75+185 (20.2%)38 (79.2%)

We can see that the deceased cohort is much older at diagnosis than the survivor cohort, with a mean age of 81.5 years versus 44.6 years. Deaths are also more male-skewed: 68.8% male among those who died versus 47.4% among survivors. Age stratification shows the same pattern clearly: 79.2% of deaths occurred in patients aged 75+, while there were no deaths among patients under 18 in this cohort.

Because the logic lives in the model, re-targeting these analyses with Malloy is usually a one-line change: swap the condition in the where clause and the same demographic profile, comorbidity breakdown, or mortality-by-age query runs against a different cohort.

Conclusion

Making structured data ready for AI is not about giving an LLM a database connection. It is about giving the AI system a trusted model of the data.

OMOP gives us a standardized healthcare schema, but the schema alone is not enough. Malloy turns it into a semantic model with reusable joins, measures, dimensions, and views — making vocabulary lookups trivial, patient-level aggregation safe, and real studies reusable.

Malloy captures OMOP's meaning in a readable model. Credible turns that model into governed, enterprise-ready AI infrastructure: it publishes the model, enriches it with searchable documentation and metadata, enforces access controls, and delivers that meaning as structured context to agents over MCP. There are four parts to this:

  • Publishing and serving. A model sitting in a repo is useful to developers, but an AI-ready model has to be reachable by applications, agents, and users through governed interfaces. Credible publishes and serves the Malloy model as that interface.
  • The Context Engine. A user does not ask for condition_concept.concept_name or patient.total_visits. They ask: "For COVID patients, compare utilization between those who died and those who survived." Credible indexes the model's metadata and documentation through its Context Engine so those concepts can be retrieved semantically.
  • MCP access for agents. Credible exposes the model to agents over the Model Context Protocol (MCP), with tools such as get_context (discover relevant model entities and Malloy syntax guidance) and execute_query (run Malloy against the governed model and get JSON back). Through these tools, your AI agent is not guessing against an unknown schema — it discovers context, executes against a governed Malloy model, and returns results grounded in the same curated definitions used everywhere else.
  • Governance. In a real healthcare setting this is the non-negotiable part. The public sample is synthetic with no PHI, but real OMOP datasets carry sensitive patient-level data. Production AI workflows need role-based access, controlled environments, row- and field-level policies, and auditability. Credible enforces role-based access controls for Malloy queries with native support for trusted identities, groups and row-level security.

The unstructured-data wave taught agents to read; the next wave is teaching them to use the data warehouses enterprises already trust. OMOP is a particularly demanding test case, and Malloy shows its strengths particularly well for data represented using OMOP.

That is the real shift. Instead of pointing an LLM at raw tables and asking it to guess its way through a clinical schema, Malloy and Credible let you build a governed semantic model that humans can inspect, applications can reuse, and AI agents can query safely and with confidence.

To learn how Credible Data serves and governs Malloy models for production AI workflows, visit Credible Data.