Context is King: Why AI Validation is Driven by Validity Context

Jun 12

It is worth separating two senses of "context," because the relationship between them is itself part of why AI validation departs from CSV. The runtime context is the data a model sees when it runs; the context of validity is the set of conditions under which it was qualified: what it replaces, the input space it was tested against, the acceptance bar. In a deterministic system there is no meaningful coupling between the two; a validated program processes whatever record it is handed, and correctness does not depend on that record falling inside some trained distribution. For a large language model, the context of validity draws a boundary, and the runtime context is trustworthy only while it stays inside that boundary.

This is why draft Annex 22 pairs intended-use and sample-space definition at qualification with input-drift and performance monitoring in operation. The relationship runs both ways. The use context draws the boundary at the outset; then the operational data — the runtime context, captured under ordinary data governance — becomes the evidence that the model is still inside that boundary and still meeting its acceptance criteria. The runtime context is not a rival to the use context. It is the evidence stream that keeps the use-context claim alive after deployment. This is the same move CSV professionals already know from continued process verification: validity is asserted once, then earned continuously.

There's a tempting shortcut in our industry right now. A team has a mature, audited computer system validation (CSV) practice: requirements traceability, IQ/OQ/PQ, risk-based testing, change control.

Then, an LLM-based feature lands on the roadmap. The natural instinct is to treat it as one more system: keep the framework, add a few requirements lines about "accuracy" and "bias," write some test cases, and sign the report.

It feels rigorous. It satisfies the checklist. And it quietly misses the point, because the thing that made CSV work is the very thing an LLM doesn't give you.

CSV assumes a world LLMs don't live in

Traditional CSV, and its risk-based descendant CSA, rest on a set of assumptions so foundational we rarely state them. CSA changed how much evidence each function demands, scaling it to risk, but it left the underlying assumptions intact:

The system is deterministic: the same input produces the same output, every time.
Behavior is fully specified by the code, so a requirement can be traced to a function and a function to a test.
The validated state is stable: you qualify a frozen version at a single “point in time” and use change control to track and justify updates.

An LLM violates all three.

It's stochastic, so "expected result equals X" stops being a coherent acceptance criterion. Its behavior isn't written anywhere you can read; it's distributed across billions of weights, and no requirement traces cleanly to any of them. Its input space is natural language, so you can't enumerate test cases; you can only sample a space you can't fully map. And if you're calling a hosted model, the system underneath you changes on the vendor's schedule, not yours.

None of this is solved by adding requirements. You can write "the system shall not hallucinate" into a URS, but you've written an aspiration, not a verifiable spec. The CSV machinery still runs; it just no longer touches the actual sources of risk.

The shift is from specification to characterization

The deepest change isn't a new technique: it's a different type of evidence.

CSV verifies that a system does what it was specified to do; CSA scales the evidence required to the risk. Both still assume the system can be specified and will behave exactly the same way twice. LLM validation has to characterize how a system behaves across a representative slice of the conditions it'll face, and then argue that the behavior is good enough for a specific purpose. You move from binary pass/fail against discrete requirements to statistical performance against a curated, ground-truthed evaluation set: accuracy, precision and recall where they apply, and use-case-specific measures like faithfulness or groundedness for retrieval-augmented systems. A single deterministic test can prove that a function works. A probabilistic test suite has to show that performance stays within its defined boundaries, across repeated runs and over time.
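One way to make a statistical acceptance criterion concrete is to require that the lower bound of a confidence interval, not just the observed pass rate, clears the bar. The sketch below uses the Wilson score interval. The function names, the 95% confidence level, and the example counts and thresholds are illustrative assumptions, not values taken from any guidance.

```python
import math


def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a pass rate.

    z = 1.96 corresponds to a two-sided 95% interval (an assumed choice).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = p_hat + z**2 / (2 * trials)
    margin = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    )
    return (centre - margin) / denom


def meets_acceptance(successes: int, trials: int, threshold: float) -> bool:
    """Pass only if the interval's lower bound clears the pre-specified bar."""
    return wilson_lower_bound(successes, trials) >= threshold


if __name__ == "__main__":
    # Hypothetical evaluation run: 470 correct answers on a 500-item set.
    print(round(wilson_lower_bound(470, 500), 4))
    print(meets_acceptance(470, 500, threshold=0.90))
    print(meets_acceptance(470, 500, threshold=0.93))
```

In this hypothetical run the observed accuracy is 94%, but the lower bound is about 91.6%, so the evidence supports a 90% bar and does not yet support a 93% one. A larger evaluation set narrows the interval, which is one reason the size of the ground-truth set belongs in the validation plan.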

This is, notably, where the regulators have already landed. The FDA's January 2025 draft guidance on AI to support regulatory decision-making doesn't ask whether the model meets a spec. It asks whether the model is credible for a defined context of use, and it scales the evidence required to the model's risk: a function of how much the model influences a decision and how serious the consequence of being wrong is. That's not CSV with extra steps. It's a different organizing principle: credibility for a purpose, not conformance to a specification.
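To illustrate the idea of risk as a function of two factors, here is a minimal sketch of a risk-tiering lookup. The three-level scale, the multiplication, and the cut-offs are assumptions made for illustration; the draft guidance describes the two factors but does not prescribe this matrix.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def model_risk(influence: Level, consequence: Level) -> Level:
    """Combine model influence and decision consequence into one tier.

    influence:   how much the model output drives the decision
    consequence: how serious a wrong decision would be
    The scoring and cut-offs below are hypothetical.
    """
    score = int(influence) * int(consequence)  # possible values: 1, 2, 3, 4, 6, 9
    if score >= 6:
        return Level.HIGH
    if score >= 3:
        return Level.MEDIUM
    return Level.LOW


if __name__ == "__main__":
    # A model that drafts text for mandatory human review of a batch record:
    # low influence, high consequence.
    print(model_risk(Level.LOW, Level.HIGH).name)
    # A model whose output is acted on directly in the same setting.
    print(model_risk(Level.HIGH, Level.HIGH).name)
```

The point of writing the mapping down is that the tier, and therefore the amount of credibility evidence, is fixed before testing starts rather than argued after the results are in.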

The same is true of Annex 11 (Computerized Systems). The 2025 draft is framed as a risk- and lifecycle-based validation strategy, where effort is “determined based on the intended use of the system and potential risks to product quality, patient safety and data integrity.”

We are not validating the AI model; we are validating that the model is fit-for-purpose for its specified context of use.

This is not new. The word “validation” traces its origins to the Latin validare, “to make valid.” Validare itself derives from validus, the root of “valid” (“supporting the intended point or claim”).

This stands in contrast to “verification”, which derives from the Latin word verus, “true.”

Barry Boehm described this dichotomy cleanly: verification = “Am I building the product right?”, validation = “Am I building the right product?”

In essence, validation’s origins, both semantic and industrial, trace to fitness for purpose. In this sense, validating large language models and agentic systems is conceptually aligned with validation as a discipline. While the shape of testing shifts with the development and evolution of the technology, the foundation of validation as a practice remains consistent.

The validated state is a moving target

Point-in-time validation assumes that if nothing changes, the system stays valid. With LLMs, things change even when you change nothing. Hosted models are updated and deprecated. New SOPs and processes enter the facility. The population of real-world inputs drifts away from the distribution you tested against. A system that was credible in March can degrade by September without a single line of your code changing.

So validation can't end at go-live. It has to extend into ongoing monitoring, periodic re-evaluation against held-out reference sets, drift detection, and a change-control regime that treats a vendor model update as a change requiring assessment, not an invisible event. The lifecycle thinking is the same instinct CSV always had; the difference is that the system no longer holds still while you validate it.
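Drift detection can start very simply. The sketch below compares the mix of input categories seen in production against the mix in the qualification set using the population stability index (PSI). The category labels are hypothetical, and the 0.25 alert level is a common rule of thumb, not a regulatory threshold; a real plan would pre-specify its own.

```python
import math
from collections import Counter


def population_stability_index(reference, current, eps=1e-6):
    """PSI between two samples of categorical labels.

    reference: labels from the qualification (validated) input set
    current:   labels from recent production inputs
    eps guards against log(0) when a category is missing on one side.
    """
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref_counts, cur_counts = Counter(reference), Counter(current)
    psi = 0.0
    for category in set(ref_counts) | set(cur_counts):
        p = max(ref_counts[category] / len(reference), eps)
        q = max(cur_counts[category] / len(current), eps)
        psi += (q - p) * math.log(q / p)
    return psi


PSI_ALERT = 0.25  # assumed trigger for a documented re-evaluation

if __name__ == "__main__":
    # Hypothetical document types routed to the model.
    qualified = ["deviation"] * 50 + ["capa"] * 50
    last_month = ["deviation"] * 80 + ["capa"] * 20
    psi = population_stability_index(qualified, last_month)
    print(round(psi, 3), "re-evaluate" if psi > PSI_ALERT else "within bounds")
```

A check like this does not say the model is wrong. It says the inputs no longer look like the ones the validation evidence covers, which is the trigger for re-running the reference set.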

You're not validating a model; you're validating a composite

In CSV the boundary of the system is usually obvious. With an LLM application it's a design decision, and getting it wrong is how teams end up validating the wrong thing. The model alone is rarely the unit of risk. The behavior that reaches a user is produced by the model plus the prompt, the retrieval layer, the guardrails and output filters, the orchestration logic, and the human in the loop. Guardrails and human review aren't tangential to the validated system; they're part of it, and often they're where the real risk control lives.
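One practical consequence of treating the composite as the validated unit is that every component needs to be under configuration control together. The sketch below fingerprints a composite so that a change to any part, including a vendor model identifier, becomes visible to change control. All component names and values are hypothetical placeholders.

```python
import hashlib
import json


def composite_fingerprint(components: dict) -> str:
    """SHA-256 over a canonical JSON form of the composite's configuration."""
    canonical = json.dumps(components, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_components(validated: dict, running: dict) -> list:
    """Names of the components that differ from the validated baseline."""
    keys = set(validated) | set(running)
    return sorted(k for k in keys if validated.get(k) != running.get(k))


# Hypothetical baseline captured at qualification.
VALIDATED = {
    "model_id": "vendor-model-2025-03",
    "prompt_template": "Answer only from the retrieved SOP excerpts.",
    "retriever": {"index": "sop-index-v4", "top_k": 5},
    "guardrails": ["pii_filter", "citation_required"],
    "human_review": "required_for_gmp_records",
}

if __name__ == "__main__":
    # Same composite, but the vendor has moved the model underneath it.
    running = dict(VALIDATED, model_id="vendor-model-2025-09")
    if composite_fingerprint(running) != composite_fingerprint(VALIDATED):
        print("change assessment required for:", changed_components(VALIDATED, running))
```

The fingerprint only detects that something changed. Deciding whether the change invalidates existing evidence is still a documented, risk-based assessment.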

That reframing also brings in concerns CSV never had to name. Hallucination is a failure mode with no analog in deterministic software: the system is confidently, fluently wrong. Prompt injection turns natural-language input into an attack surface. Training-data provenance and bias become validation concerns because they shape behavior you can't otherwise see.

What this means in practice

Keep the discipline of validation: the lifecycle rigor, the risk-based proportionality, the documented evidence, the traceability of decisions. That discipline is exactly why life-science teams are positioned to do this well. But retire the assumption that the methods transfer unchanged.

Concretely, that means:

Defining a context of use before anything else
Sizing the effort to the consequence of error
Building and maintaining ground-truth evaluation datasets as first-class validated assets
Writing statistical acceptance criteria instead of deterministic expected results
Validating the full composite rather than the model in isolation
Treating monitoring and re-evaluation as part of the validated state

The shortcut, CSV/CSA plus a few requirements, produces an artifact that may pass an initial audit yet proves very little. The harder path produces evidence that the system is actually credible for what you're asking it to do, and that the evidence holds up over the long term. In a regulated setting, that distinction isn't academic. It's the whole job.

The FDA has made it clear that it does not need AI-specific guidance to cite organizations that outsource human accountability to AI; the Purolea warning letter cited 21 CFR 211.22(c) for failure to comply with GMPs. The FDA's January 2025 draft guidance on AI to support regulatory decision-making is expected to be finalized in Q2 2026, and the FDA-EMA Guiding Principles of Good AI Practice for Drug Development were released on January 14, 2026. You don't need finalized guidance to take the first steps of AI governance: mapping AI use throughout your organization, risk-tiering potential AI use cases, and building out an initial context of use, risk assessment evidence, and the outline of a model credibility plan.

The FDA and EMA chose, at the point of finalization, to converge on principles rather than a prescriptive method, which suggests the agencies intend to set the expectations, including risk-based proportionality, and leave the how to sponsors.

If you can’t answer the following five questions, it’s time to build a plan of action:

If the model gave a different output tomorrow for the same input, would your validation evidence already account for that, or does it assume that can't happen?
Is "validated" a state you certified once, or a bounded state with pre-specified triggers that tell you when it's no longer true?
Is your context of use part of what you validated, or did you validate the model regardless of where it's used?
For any model you didn't build, do you know which validation evidence you can inherit and which you must generate?
If the underlying model updated next week, would you know, and would your validated state survive it?

If you can't answer these cleanly, your validation is CSV with more steps, and that's the gap an inspector will find before you do.

Kayla Britt