False hope of a single generalisable AI sepsis prediction model: bias and proposed mitigation strategies for improving performance based on a retrospective multisite cohort study.

BMJ Quality & Safety | 2025-03-27 | PubMed
Total: 77.5 | Innovation: 8 | Impact: 8 | Rigor: 7 | Citation: 9

Summary

In a nine-hospital cohort of 969,292 admissions, a single sepsis ML model showed substantial performance variability by care location. Training separate models for ED and ward patients reduced alert burden (lower NNE) across most sites without changing the time window to Sepsis-3 events (HTS3).

Key Findings

  • The baseline model needed fewer alert evaluations in EDs than on wards: NNE 6.1 vs 7.5.
  • Prediction window differed by care location: median HTS3 5 h (ED) vs 20 h (wards).
  • Bias mitigation significantly reduced NNE but did not change HTS3.
  • Best-performing approach trained models separately for ED and ward patients, lowering NNE across 7/9 hospitals.
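
The two metrics in the findings above can be illustrated with a minimal sketch. It assumes the common reading of NNE as "number needed to evaluate" (alerts fired per true sepsis case, i.e. 1/PPV) and treats HTS3 as the lead time in hours from alert to the Sepsis-3 event; the summary does not spell out either definition, so both are assumptions. The function name, record keys (`location`, `is_sepsis`, `lead_time_h`), and toy data are hypothetical and are not taken from the study.

```python
from collections import defaultdict
from statistics import median


def summarize_alerts(alerts):
    """Group alerts by care location and compute NNE and median lead time.

    `alerts` is an iterable of dicts with hypothetical keys:
      location    - care location label, e.g. "ED" or "ward"
      is_sepsis   - True if the alert preceded a confirmed Sepsis-3 event
      lead_time_h - hours from alert to the event (only used for true positives)
    """
    by_location = defaultdict(list)
    for alert in alerts:
        by_location[alert["location"]].append(alert)

    summary = {}
    for location, rows in by_location.items():
        true_positives = [r for r in rows if r["is_sepsis"]]
        # Assumed definition: NNE = alerts fired / true-positive alerts = 1 / PPV.
        nne = len(rows) / len(true_positives) if true_positives else float("inf")
        lead = (
            median(r["lead_time_h"] for r in true_positives)
            if true_positives
            else None
        )
        summary[location] = {
            "alerts": len(rows),
            "nne": nne,
            "median_lead_time_h": lead,
        }
    return summary


# Toy data, invented for illustration only.
toy_alerts = [
    {"location": "ED", "is_sepsis": True, "lead_time_h": 4},
    {"location": "ED", "is_sepsis": False, "lead_time_h": None},
    {"location": "ED", "is_sepsis": True, "lead_time_h": 6},
    {"location": "ED", "is_sepsis": False, "lead_time_h": None},
    {"location": "ward", "is_sepsis": True, "lead_time_h": 20},
    {"location": "ward", "is_sepsis": False, "lead_time_h": None},
    {"location": "ward", "is_sepsis": False, "lead_time_h": None},
]
toy_summary = summarize_alerts(toy_alerts)
```

Reporting both numbers per location, rather than pooled, is what makes a gap such as the ED versus ward difference visible. A pooled NNE would average over care settings with different alert burdens.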

Clinical Implications

Hospitals should avoid one-size-fits-all sepsis AI models. Deploy care location–specific models, monitor NNE and calibration, and evaluate bias mitigation strategies to reduce alert burden and improve usability.

Why It Matters

This study provides rigorous, multisite evidence that site- and location-specific models are needed to minimize alert burden while preserving early detection windows in sepsis AI.

Limitations

  • Retrospective design with Sepsis-3–derived labels; not a prospective implementation trial
  • Generalizability to nonparticipating systems or international settings remains uncertain

Future Directions

Prospective deployment trials should assess patient outcomes, alarm fatigue, fairness across subgroups, and adaptive model maintenance by location.

Study Information

Study Type
Cohort
Research Domain
Diagnosis
Evidence Level
III - Retrospective multisite cohort analysis evaluating ML model performance and bias mitigation.
Study Design
OTHER