Why AI Validation Architecture Belongs in the FDA's Clinical Trial Pilot — And What We Told Them

Jun 29

What FDA Asked

In plain language, the RFI for FDA's AI-Enabled Optimization of Early-Phase Clinical Trials Pilot Program asks industry how AI should be governed in the highest-stakes validation environment that exists: first-in-human dosing, small patient populations, and short windows between signal and decision. The cost of validation failure here isn't a deviation report; it's a patient.

The RFI invites industry input on how AI should be deployed, governed, and evaluated within early-phase clinical trials: first-in-human dosing, oncology dose escalation, rare disease trials, Phase 1 to Phase 2 transition decisions. These are the highest-stakes validation environments in drug development: small patient populations that limit statistical power, short windows between signal acquisition and consequential decision, and high per-decision stakes where a missed safety signal or a miscalibrated efficacy inference can shape the trajectory of an entire development program.

The pilot program's design will shape industry practice. What FDA hears during this comment period — and what it doesn't hear — will help shape the structure of the governance architecture around the pilot.

Why Early-Phase Is the Hardest Validation Problem

Our submission's core argument is that early-phase clinical trials present a validation environment where traditional frameworks are structurally insufficient. Traditional computerized system validation assumes determinism: the same input produces the same output, every time. AI-enabled systems operating in early-phase contexts don't preserve that guarantee.

This matters acutely in early-phase settings for three compounding reasons.

First, small cohorts mean you can't average out miscalibration. An overconfident safety classification at cohort N propagates directly into the dosing decision for cohort N+1. In my published error taxonomy, this is Class 4 (Confidence Miscalibration) - and in a dose escalation context, it's not an abstract taxonomy entry. It's a clinical consequence.

Second, short decision windows mean you can't defer governance questions. The time between data acquisition and consequential decision is compressed in ways that later-phase trials don't face.

Third, high per-decision stakes mean silent failures have immediate consequences. This is the inverse of the conditions CSV was built for, and it's exactly where the argument that human-in-the-loop (HITL) review, in isolation, is a sufficient control breaks down most visibly.

What We Argued

Three core arguments run through the submission:

Unified capture architecture
Provenance as a property of data, not platform
Validation metrics appropriate for probabilistic systems

Unified capture architecture: Tanya Sharma's contribution argued that the data responsibility model must be defined before the AI tool is used, not after deployment and not during the first deviation investigation. This means explicitly documenting who owns source data, which source system is authoritative, who reviews AI output, who makes the final clinical decision, and what evidence must be retained to reconstruct an AI-supported recommendation. At minimum, the infrastructure should support:

role-based access controls;
patient privacy and consent controls;
source data traceability;
data lineage across transformations;
audit trails;
data, protocol, and model version control;
evidence retention;
and the ability to reconstruct which data and model version supported an AI-generated recommendation.
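
As an illustration only (this is not drawn from the submission), the responsibility model above could be captured as a record that is filled in before any AI output is acted on. The class name, field names, and the completeness check are hypothetical; a real system would sit on top of the EDC's own access controls and audit trail.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RecommendationEvidence:
    """Hypothetical evidence record for one AI-supported recommendation."""
    recommendation_id: str
    source_system: str            # which source system is authoritative
    source_data_owner: str        # who owns the source data
    data_version: str
    protocol_version: str
    model_version: str
    ai_output_reviewer: str       # who reviewed the AI output
    clinical_decision_maker: str  # who made the final clinical decision


def missing_fields(evidence: RecommendationEvidence) -> list:
    """Return the names of fields left blank, in declaration order."""
    return [name for name, value in asdict(evidence).items() if not value.strip()]


evidence = RecommendationEvidence(
    recommendation_id="REC-0001",
    source_system="site-edc",
    source_data_owner="site-data-manager",
    data_version="2026-01-01T00:00:00Z",
    protocol_version="v3",
    model_version="",  # left blank on purpose to show the check
    ai_output_reviewer="safety-physician",
    clinical_decision_maker="principal-investigator",
)
print(missing_fields(evidence))
```

A blank entry here means the recommendation could not later be reconstructed, which is the gap the list above is meant to close.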

Provenance as a property of data, not platform: James Canterbury's contribution to the submission centered on content-addressed capture: deriving each record's identifier from a cryptographic hash of its content at the point of capture, so that from the moment data enters the system its origin, exact form, and time are verifiable, and any later change is detectable. This layers verifiable provenance onto the audit trail the EDC already maintains, without requiring a new platform, without blockchain, and without raising the barrier to entry for smaller sites. Integrity becomes a property of the data itself rather than an assertion by the platform that holds it.
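
A minimal sketch of content-addressed capture, assuming SHA-256 over a canonical JSON serialization. The function names, field names, and sample values are hypothetical and are not taken from the submission; the point is only that the identifier is derived from the content, so any later change is detectable by recomputing it.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def canonical_bytes(payload: dict) -> bytes:
    """Serialize deterministically so identical content yields identical bytes."""
    text = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def capture(content: dict, source: str, captured_at: Optional[str] = None) -> dict:
    """Bind content to its origin and capture time, then derive the ID from the hash."""
    envelope = {
        "content": content,
        "source": source,
        "captured_at": captured_at or datetime.now(timezone.utc).isoformat(),
    }
    record_id = hashlib.sha256(canonical_bytes(envelope)).hexdigest()
    return {"id": record_id, **envelope}


def verify(record: dict) -> bool:
    """Recompute the hash from the stored envelope and compare it to the stored ID."""
    envelope = {key: value for key, value in record.items() if key != "id"}
    return hashlib.sha256(canonical_bytes(envelope)).hexdigest() == record["id"]


record = capture(
    {"subject": "S-001", "alt_u_per_l": 42},
    source="site-lab-feed",
    captured_at="2026-01-01T00:00:00+00:00",
)
tampered = {**record, "content": {"subject": "S-001", "alt_u_per_l": 24}}
print(verify(record), verify(tampered))  # prints: True False
```

Nothing here requires a new platform: the hash can be stored as one more column next to the existing audit trail.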

Validation metrics appropriate for probabilistic systems: Collectively, we proposed three specific metric axes for evaluating AI systems in early-phase trials, and deliberately mapped them to a regulatory framework FDA already owns.

Calibration: whether the model's confidence scores match observed frequencies, so a stated 90% confidence corresponds to being correct roughly 90% of the time. This corresponds to Stage 1 (process design) of the FDA's process validation lifecycle.
Perturbation testing: characterizing the stochastic envelope by measuring how far a model's output moves across clinically equivalent, non-identical inputs, bounded against pre-specified thresholds. This converts an unreproducible output into a reproducible distribution of outputs — the form of reproducibility that probabilistic validation actually requires. This corresponds to Stage 2 (process qualification).
Adaptive conformal prediction: producing validated coverage guarantees for generalizability assessment. This corresponds to Stage 3 (continued process verification).
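
The three axes above can be sketched in a few lines each. This is an illustrative sketch under stated assumptions, not the method specified in the submission: calibration is measured here as expected calibration error with equal-width bins, perturbation testing as the maximum shift of a scalar output, and adaptive conformal prediction as the common update rule alpha_{t+1} = alpha_t + gamma * (target_alpha - err_t) applied to absolute-residual scores. All function names, parameters, and thresholds are hypothetical.

```python
import math


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between stated confidence and observed accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the top bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


def perturbation_envelope(model, base_input, perturbed_inputs, threshold):
    """Largest output shift across equivalent inputs, checked against a pre-set bound."""
    baseline = model(base_input)
    worst = max(abs(model(x) - baseline) for x in perturbed_inputs)
    return worst, worst <= threshold


def adaptive_conformal_intervals(cal_scores, predictions, outcomes,
                                 target_alpha=0.1, gamma=0.05):
    """Widen or narrow intervals online so long-run miscoverage tracks target_alpha."""
    scores = sorted(cal_scores)
    n = len(scores)
    alpha = target_alpha
    results = []
    for pred, y in zip(predictions, outcomes):
        level = min(max(1.0 - alpha, 0.0), 1.0)
        rank = max(1, math.ceil(level * (n + 1)))  # conformal quantile rank
        width = math.inf if rank > n else scores[rank - 1]
        covered = abs(y - pred) <= width
        results.append((pred - width, pred + width, covered))
        err = 0.0 if covered else 1.0
        alpha = alpha + gamma * (target_alpha - err)  # a miss lowers alpha, widening the next interval
    return results


# Overconfident toy model: states 90% confidence but is right half the time.
print(expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5))
print(perturbation_envelope(lambda dose: 2 * dose, 1.0, [1.01, 0.99], threshold=0.05))
print(adaptive_conformal_intervals([1, 2, 3, 4, 5, 6, 7, 8, 9], [0.0, 0.0], [100.0, 0.0]))
```

In each case the acceptance criterion (the calibration gap, the perturbation threshold, the target miscoverage) would be fixed before the evidence is collected, which is what lets the result be read as validation evidence rather than a post hoc description.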

The mapping is the point. CSV professionals reading this submission should recognize the skeleton. The muscle is adapted for the failure characteristics of probabilistic systems, but the lifecycle is the one they already know.

HITL in Isolation Is Not a Control, and Early-Phase Trials Show Why

A thread running through our submission — and through much of my published work — is that human-in-the-loop review, in isolation, does not constitute a validation control.

Early-phase clinical trials expose why human review is fallible. Small cohorts, compressed timelines, high-consequence decisions, and automation bias create conditions where human review is least reliable precisely when it matters most. A reviewer under time pressure, processing AI outputs that are confidently wrong but linguistically fluent, is not a control. They are a liability transfer mechanism.

This is not a fringe position. At the recent AI in Life Sciences conference, Nikolai Makaranka, Daikon’s CEO and founder, presented a session arguing explicitly that, while absolutely necessary, HITL in isolation does not constitute a validation control.

Anthropic's own published work acknowledges that the silent failure mode, in which wrong but plausible output is used without objection, is the one they "don't have a robust solution yet" for (“How Anthropic enables self-service data analytics with Claude,” published June 3rd, 2026, on Anthropic’s blog). That silent-failure problem is precisely the gap that HITL is supposed to fill and, because of automation bias, systematically doesn't.

The docket has attracted submissions from across the industry: BIO, McKinsey, Eli Lilly, and others. The range reflects where different parts of the industry are, and what they think the FDA needs to hear.

BIO's submission is thorough and procedurally careful. It asks FDA to clarify scope, define terms, establish governance expectations, and distinguish between real-time clinical trials and AI more broadly. It is fundamentally asking FDA to be clearer before industry commits, a protective posture that serves its membership well.

McKinsey's submission leads with capabilities — production-deployed clinical AI systems, 5,800+ digital and analytics experts, partnerships with MLCommons — and proposes evaluation frameworks organized around NIST AI RMF dimensions. The technical content is competent and comprehensive, drawing on established metrics and referencing federated benchmarking infrastructure.

Eli Lilly's submission is the most substantive among the major pharma respondents, and it converges with our position in several important ways. Lilly explicitly anchors AI system qualification in GAMP 5, calling for "qualification of the surrounding GxP computer system (GAMP 5; 21 CFR Part 11)" alongside the credibility assessment of the AI model itself. For anyone working in the GAMP community, this matters: a top-ten pharma organization (by revenue) is publicly stating that AI validation sits inside the GAMP framework, not outside it. That is a signal about where the professional discipline is headed.

Lilly's HITL treatment is also notably rigorous. They observe that "a high acceptance rate without modification can indicate either well-calibrated AI or inadequate review capacity", which is essentially the argument I've been making throughout my published work, stated from the pharma side. They recommend measuring HITL as a control: reviewer override rates, time-to-review for high-risk outputs, and the proportion of AI outputs entering the regulatory record without modification. That's a meaningful departure from the industry's default posture of treating HITL as a sufficient answer rather than a measurable, validatable control.
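
To make "measuring HITL as a control" concrete, here is a minimal sketch that computes the three measures named above from a review log. The log schema, field names, and the choice of median for time-to-review are assumptions for illustration; Lilly's submission does not prescribe them.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Review:
    """One human review of one AI output (hypothetical log schema)."""
    high_risk: bool
    seconds_to_review: float
    overridden: bool      # reviewer rejected or replaced the AI output
    modified: bool        # reviewer edited the output before it was used
    entered_record: bool  # output ended up in the regulatory record


def hitl_metrics(reviews):
    """Summarize reviewer behavior so the review step itself can be monitored."""
    total = len(reviews)
    high_risk_times = [r.seconds_to_review for r in reviews if r.high_risk]
    recorded = [r for r in reviews if r.entered_record]
    return {
        "override_rate": sum(r.overridden for r in reviews) / total,
        "median_high_risk_review_seconds": median(high_risk_times) if high_risk_times else None,
        "unmodified_share_of_record": (
            sum(not r.modified for r in recorded) / len(recorded) if recorded else None
        ),
    }


log = [
    Review(high_risk=True, seconds_to_review=30.0, overridden=False, modified=False, entered_record=True),
    Review(high_risk=True, seconds_to_review=90.0, overridden=True, modified=True, entered_record=True),
    Review(high_risk=False, seconds_to_review=10.0, overridden=False, modified=False, entered_record=True),
    Review(high_risk=False, seconds_to_review=5.0, overridden=False, modified=False, entered_record=False),
]
print(hitl_metrics(log))
```

None of these numbers is meaningful alone: a low override rate paired with very short high-risk review times is the pattern that points toward inadequate review capacity rather than a well-calibrated model.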

Lilly also proposes a DMF-style mechanism for third-party AI vendors: having vendors submit model documentation directly to FDA in confidence, with sponsors authorized to reference it. This is an architectural solution to the supplier qualification problem that complements the VALID Trust framework I've published: it addresses the validation-inheritance boundary by creating a direct channel between the vendor and the regulator, rather than forcing the sponsor to validate layers they don't control.

Where Lilly anchors in GAMP 5, we build on it, mapping our three metric axes (calibration, perturbation testing, conformal prediction) to the three stages of FDA's own process validation lifecycle. Where Lilly proposes measuring HITL effectiveness, we argue that HITL as currently practiced is insufficient as a primary control and that the harness architecture around the AI system is where regulatory load must be carried. The positions are complementary rather than contradictory, and taken together they represent a rigorous vision for the future of AI-enabled regulated infrastructure.

Why This Matters Beyond Clinical Trials

The validation architecture questions FDA is asking about early-phase trials are the same questions that apply across the full range of GxP AI applications - deviation triage, pharmacovigilance, manufacturing process control, regulatory submission support. The early-phase context concentrates the stakes, but the structural problem is universal: probabilistic systems require different validation evidence than deterministic ones, and the frameworks we inherited from CSV don't cover the gap.

This is the work being done by the broader GAMP community, and other communities working at the intersection of regulation, AI, and science.

The Comment Period Closes Today

The submissions are public record and are available on regulations.gov [https://www.regulations.gov/document/FDA-2026-N-4390-0001/comment].

The pilot program's design will shape how AI-enabled clinical trials are governed - not just for the pilot participants, but for the industry standards that follow. What the FDA hears now matters.

Later this week, I will release the harness piece, which explores the deeper theoretical infrastructure behind what we argued on the FDA docket. Stay tuned.

This submission was co-authored with Tanya Sharma (Assurea) and James Canterbury (Zeroth Technology). All three authors are members of the ISPE Decentralized Information Network Special Interest Group (DIN-SIG). The comment reflects individual views informed by, but not formally representing, the SIG.

This piece reflects the individual views of the author, and not the official position of any organization.

Kayla Britt
