Probabilistic Error Taxonomy — Britt Biocomputing

Architecture · Failure classification

The Probabilistic Error Taxonomy

Probabilistic failures don't have stable identities, so classifying them by what went wrong is only half the picture. This taxonomy adds a second dimension — where in the pipeline the failure was introduced — because the corrective action depends on the combination, not either axis alone.

A Control Layer (L3) component of the House of AI Trust · feeds evaluation design & monitoring

Two orthogonal dimensions

Error type is the what. Origin is the where.

The two axes are orthogonal: the same error type can arise from several origins, and the same origin can produce several error types. That's why the framework is a matrix of cells rather than a one-to-one list — and why it mirrors how quality teams already investigate deviations, identifying both what happened and where in the process it happened.

Dimension 1 · Error type

What went wrong

Fabrication — citation, data-point, and entity fabrication
Misinterpretation — recognition errors, reasoning errors, omission
Contextual misapplication — population mismatch, temporal validity, spec non-compliance
Confidence miscalibration — overconfidence, hedging, unflagged conflicts
Boundary violation — scope creep, authority creep, adversarial breach
Population bias — demographic, safety-signal, and site/transferability bias

Dimension 2 · Origin

Where it was introduced

Training data — data poisoning, privacy, distribution shift, source error
Retrieval / RAG layer — chunking, embedding drift, grounded hallucination, multi-hop
Model inference — hallucination, sycophancy, non-determinism, prompt injection
Human–AI interface — automation bias, confidence miscommunication, deskilling
Agent orchestration — multi-step planning, compounding propagation, tool-use
Supplier — silent model updates, deprecation, multi-tenancy leakage, DPA drift
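The two dimensions can be sketched as a pair of enumerations whose Cartesian product gives the matrix cells. This Python sketch is illustrative and not part of the published taxonomy; the names `ErrorType`, `Origin`, `cells`, `fabrication_row`, and `inference_column` are assumptions made for the example.

```python
from enum import Enum
from itertools import product


class ErrorType(Enum):
    """Dimension 1: what went wrong."""
    FABRICATION = "Fabrication"
    MISINTERPRETATION = "Misinterpretation"
    CONTEXTUAL_MISAPPLICATION = "Contextual misapplication"
    CONFIDENCE_MISCALIBRATION = "Confidence miscalibration"
    BOUNDARY_VIOLATION = "Boundary violation"
    POPULATION_BIAS = "Population bias"


class Origin(Enum):
    """Dimension 2: where it was introduced."""
    TRAINING_DATA = "Training data"
    RETRIEVAL_RAG = "Retrieval / RAG layer"
    MODEL_INFERENCE = "Model inference"
    HUMAN_AI_INTERFACE = "Human–AI interface"
    AGENT_ORCHESTRATION = "Agent orchestration"
    SUPPLIER = "Supplier"


# Every (error type, origin) pair is its own cell because the axes are orthogonal.
cells = list(product(ErrorType, Origin))

# One error type spans several origins (a row of the matrix) ...
fabrication_row = [origin for etype, origin in cells if etype is ErrorType.FABRICATION]
# ... and one origin spans several error types (a column).
inference_column = [etype for etype, origin in cells if origin is Origin.MODEL_INFERENCE]
```

Keying records on the pair rather than on either axis alone is what lets a corrective action be attached to the combination.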

The 6 × 6 matrix

Same error class, different origin, different fix

A fabrication is the same error class whether it originated at model inference, in the retrieval layer, or at the human review point — but the corrective control is different in each case. The canonical case below is Mata v. Avianca: one fabrication, two origins, two different fixes. Highlighted cells are worked beneath the grid; the fully populated 36-cell matrix is the client deliverable.

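	Training data	Retrieval / RAG	Model inference	Human–AI interface	Agent orchestration	Supplier
Fabrication	-	Worked	Featured	Featured	-	-
Misinterpretation	-	-	Worked	-	-	-
Contextual misapplication	Worked	-	-	-	-	Worked
Confidence miscalibration	-	-	Worked	-	-	-
Boundary violation	-	-	Worked	-	-	-
Population bias	Worked	-	-	-	-	-

Key: Featured = featured case (Mata v. Avianca); Worked = worked example; - = part of the full client matrix.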

Featured case · Mata v. Avianca

Fabrication × Model inference

The canonical hallucination. In Mata v. Avianca (S.D.N.Y. 2023), the model fabricated six federal case citations in a single inference run — correct citation structure, invented substance — with no grounding in any retrieved source.

Corrective action: Retrieval-grounding requirement at the RAG layer; every citation must trace to a real source.

Featured case · secondary origin

Fabrication × Human–AI interface

The same case, one layer out. When the attorney asked whether the cases were real, the model confirmed they were — and no independent check stood between that answer and the filed brief.

Corrective action: Forced verification workflow with structured event logging before sign-off.
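One way to picture a forced verification step is a gate that refuses sign-off until every citation has a logged, independent check. The sketch below is a minimal illustration under stated assumptions: the `SignOffGate` class, its field names, and the placeholder citations are hypothetical and do not describe any particular product or the workflow used in the case.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class SignOffGate:
    """Blocks sign-off until every citation has a logged independent check."""
    citations: list[str]
    events: list[dict] = field(default_factory=list)

    def record_check(self, citation: str, reviewer: str, found_in_source: bool) -> None:
        # Structured event: who checked what, when, and with what result.
        self.events.append({
            "citation": citation,
            "reviewer": reviewer,
            "found_in_source": found_in_source,
            "checked_at": datetime.now(timezone.utc).isoformat(),
        })

    def unverified(self) -> list[str]:
        confirmed = {e["citation"] for e in self.events if e["found_in_source"]}
        return [c for c in self.citations if c not in confirmed]

    def sign_off(self) -> str:
        missing = self.unverified()
        if missing:
            raise PermissionError(f"Sign-off blocked; unverified citations: {missing}")
        return json.dumps(self.events)  # audit trail that travels with the document


# Placeholder citations, not real cases.
gate = SignOffGate(citations=["Case A (hypothetical)", "Case B (hypothetical)"])
gate.record_check("Case A (hypothetical)", reviewer="reviewer-1", found_in_source=True)
gate.record_check("Case B (hypothetical)", reviewer="reviewer-1", found_in_source=False)
```

In this sketch the model's own confirmation never counts as a check; only a reviewer-recorded event with `found_in_source=True` clears a citation.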

Worked example

Fabrication × Retrieval / RAG

RAG reduces but doesn't eliminate fabrication. Grounded hallucination occurs when the model extrapolates beyond the retrieved context — producing a citation the source doesn't actually support.

Corrective action: Citation-grounding checks that bind every claim to retrieved text.
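A citation-grounding check can be sketched as a test that each claim's supporting quote actually appears in the retrieved text. The version below uses strict substring matching after whitespace and case normalisation, which is a deliberately simple stand-in rather than a recommended matcher; the function names and sample data are assumptions for illustration.

```python
import re


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so formatting differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def ungrounded_claims(claims: dict[str, str], retrieved_chunks: list[str]) -> list[str]:
    """Return the claims whose supporting quote is absent from every retrieved chunk.

    `claims` maps each claim to the verbatim quote offered as its support.
    """
    corpus = [normalise(chunk) for chunk in retrieved_chunks]
    missing = []
    for claim, quote in claims.items():
        q = normalise(quote)
        # An empty quote supports nothing, so it is treated as ungrounded.
        if not q or not any(q in chunk for chunk in corpus):
            missing.append(claim)
    return missing


# Invented sample text, used only to exercise the check.
chunks = ["The study enrolled 120 adult patients\nacross three sites."]
claims = {
    "Enrolment was 120 adults": "enrolled 120 adult patients across three sites",
    "Enrolment included children": "enrolled 40 paediatric patients",
}
flagged = ungrounded_claims(claims, chunks)
```

Exact matching will miss paraphrased support, so a real check would likely need a looser comparison; the point here is only that each claim is bound to retrieved text or flagged.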

Worked example

Misinterpretation × Model inference

Right data, wrong reasoning. Cabral et al. (2024) documented a model that correctly identified every relevant clinical finding from a vignette, then drew an incorrect inferential conclusion from it.

Corrective action: Reasoning-chain audits and inferential checks, not just fact-presence tests.

Worked example

Contextual misapplication × Training data

Right-looking, wrong-context. A clinical LLM cites a withdrawn or superseded FDA guidance document because its knowledge predates the change (temporal validity / knowledge cutoff).

Corrective action: Temporal-validity checks and currency guards against a maintained reference set.
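A currency guard can be sketched as a lookup of each cited document against a maintained reference set that records its status. The schema, status labels, and document identifiers below are hypothetical placeholders, not real guidance documents or a real registry format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReferenceEntry:
    """One row of a maintained reference set (hypothetical schema)."""
    doc_id: str
    status: str                          # "current", "withdrawn", or "superseded"
    superseded_by: Optional[str] = None


def currency_issues(cited_ids: list[str],
                    reference_set: dict[str, ReferenceEntry]) -> dict[str, str]:
    """Map each problematic cited document to the reason it fails the guard."""
    issues = {}
    for doc_id in cited_ids:
        entry = reference_set.get(doc_id)
        if entry is None:
            # Unknown documents are flagged rather than assumed current.
            issues[doc_id] = "not in reference set; cannot confirm currency"
        elif entry.status != "current":
            detail = f" by {entry.superseded_by}" if entry.superseded_by else ""
            issues[doc_id] = f"{entry.status}{detail}"
    return issues


# Placeholder identifiers, not real guidance documents.
reference_set = {
    "GUIDANCE-A": ReferenceEntry("GUIDANCE-A", "current"),
    "GUIDANCE-B": ReferenceEntry("GUIDANCE-B", "superseded", superseded_by="GUIDANCE-C"),
}
issues = currency_issues(["GUIDANCE-A", "GUIDANCE-B", "GUIDANCE-X"], reference_set)
```

The guard is only as current as the reference set behind it, which is why the corrective action specifies a maintained one.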

Worked example

Contextual misapplication × Supplier

The model changed underneath a validated workflow. Between the March and June 2023 GPT-4 versions, USMLE accuracy fell from 86.6% to 82.1% with no announcement (Chen, Zaharia & Zou, 2023) — enough to break a workflow validated against the earlier endpoint.

Corrective action: Continuous monitoring, version pinning, and change control on supplier endpoints.
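Version pinning and monitoring can be sketched as a comparison between what a workflow was validated against and what the endpoint currently reports. In the example below the two accuracy figures are the ones quoted above; the version labels, the 2-point tolerance, and the names `ValidatedBaseline` and `change_control_flags` are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidatedBaseline:
    """What the workflow was validated against (hypothetical record)."""
    model_version: str
    accuracy: float      # fraction correct on the validation benchmark
    tolerance: float     # largest acceptable drop before change control triggers


def change_control_flags(baseline: ValidatedBaseline, observed_version: str,
                         observed_accuracy: float) -> list[str]:
    """Return the reasons, if any, that the endpoint needs re-validation."""
    flags = []
    if observed_version != baseline.model_version:
        flags.append("version differs from pinned version")
    drop = baseline.accuracy - observed_accuracy
    if drop > baseline.tolerance:
        flags.append(f"accuracy dropped by {drop * 100:.1f} points")
    return flags


# Version labels and tolerance are assumptions; accuracies are those quoted in the text.
baseline = ValidatedBaseline("march-2023", accuracy=0.866, tolerance=0.02)
flags = change_control_flags(baseline, "june-2023", observed_accuracy=0.821)
```

Checking accuracy as well as the version label matters in the silent-update case, where the label a supplier exposes may not change when the behaviour does.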

Worked example

Confidence miscalibration × Model inference

Maximum confidence on fabricated content. The Mata model didn't hedge — it asserted the invented cases were real, collapsing uncertainty exactly where it mattered most.

Corrective action: Calibration and uncertainty quantification; stated confidence must track accuracy.
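"Stated confidence must track accuracy" can be made measurable with expected calibration error (ECE), a standard metric that bins predictions by stated confidence and averages the gap between confidence and observed accuracy in each bin. The sketch below is one plain-Python formulation with equal-width bins; the toy inputs are invented and the taxonomy itself does not prescribe this metric.

```python
def expected_calibration_error(confidences: list[float], correct: list[bool],
                               n_bins: int = 10) -> float:
    """Weighted average gap between stated confidence and observed accuracy per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need equal-length, non-empty inputs")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 belongs in the top bin rather than an out-of-range one.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Toy data: four answers asserted at full confidence, only one of them right.
overconfident = expected_calibration_error([1.0, 1.0, 1.0, 1.0], [True, False, False, False])
# Toy data: confidence 0.5 with half the answers right is perfectly calibrated.
calibrated = expected_calibration_error([0.5, 0.5], [True, False])
```

The first toy case gives an ECE of 0.75 and the second gives 0, which matches the intuition that maximum confidence on mostly wrong answers is the worst-calibrated outcome.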

Worked example

Boundary violation × Model inference

Guardrails that don't hold under pressure. Lee et al. (2025) found prompt-injection attacks succeeded in 94.4% of 216 controlled medical dialogues, including most high-harm trials.

Corrective action: Adversarial / red-team testing and input sanitisation at the inference boundary.

Worked example

Population bias × Training data

Accurate in aggregate, wrong for the underrepresented. Larrazabal et al. (2020) showed a consistent performance drop for underrepresented genders once training data fell below a minimum balance.

Corrective action: Subgroup performance testing against a representative reference set.
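Subgroup performance testing amounts to reporting accuracy per subgroup instead of one aggregate figure. The sketch below shows how a healthy-looking overall number can hide a gap; the subgroup labels "A" and "B" and the records are invented toy data, and the function names are assumptions.

```python
from collections import defaultdict


def subgroup_accuracy(records: list[tuple[str, bool]]) -> dict[str, float]:
    """Accuracy per subgroup from (subgroup label, prediction was correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, ok in records:
        totals[group] += 1
        hits[group] += int(ok)
    return {group: hits[group] / totals[group] for group in totals}


def largest_gap(per_group: dict[str, float]) -> float:
    """Difference between the best-served and worst-served subgroup."""
    return max(per_group.values()) - min(per_group.values())


# Toy records: subgroup "B" is smaller and served worse than subgroup "A".
records = [("A", True)] * 9 + [("A", False)] + [("B", True)] * 3 + [("B", False)] * 2
per_group = subgroup_accuracy(records)
overall = sum(ok for _, ok in records) / len(records)
gap = largest_gap(per_group)
```

Here the overall accuracy is 0.8 while subgroup "B" sits at 0.6, which is the "accurate in aggregate, wrong for the underrepresented" pattern in miniature.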

The rest of the grid

Available in engagement

Every remaining intersection carries its own representative failure mode, detection method, and corrective protocol — the fully populated 36-cell matrix is delivered as part of a client engagement rather than published here.

36 cells: per-cell failure mode and control.

Where this sits in the architecture

The taxonomy is a Control Layer (L3) component of the House of AI Trust. It gives the Validation Lifecycle's evaluation-design step (03) a structured map of failure modes to test against, and gives the continuous-monitoring step (06) a vocabulary for classifying what it catches.


License the Error Taxonomy

Want to run the full 6 × 6 matrix — every failure mode and control — in your own validation work? Book a licensing conversation.


Cite this work

Versioned & citable

The taxonomy is formally archived with a permanent DOI. Version 1.0 is deposited on Zenodo — citable in submissions, papers, and validation documentation.

Britt, K. (2026). Bounding the Uncertainty: An Error Taxonomy for Non-Deterministic AI in Regulated Life Sciences (Version 1.0). Zenodo. https://doi.org/10.5281/zenodo.20702867


Suggest a revision

This taxonomy is a living document — it gets stronger through use and challenge. If you've found a failure mode, origin, or worked example that should be added, refined, or corrected, send it along. Substantive suggestions will be considered for the next version.