
Retrieval augmented generation for 10 large language models and its generalizability in assessing medical fitness.

npj Digital Medicine · 2025-04-05 · PubMed
Total: 80.5 · Innovation: 9 · Impact: 8 · Rigor: 7 · Citation: 9

Summary

In 14 standardized preoperative scenarios drawing on 58 guidelines, GPT-4 with RAG achieved 96.4% accuracy, significantly outperforming human responses, with no hallucinations and faster turnaround. The results suggest guideline-grounded LLMs can support safe, efficient preoperative fitness assessments.
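
To illustrate what "guideline-grounded" means in practice, the sketch below shows the general shape of a RAG answer flow: retrieve the guideline passages most relevant to a scenario, then prompt the model to answer only from them. This is a minimal illustration under stated assumptions, not the study's implementation; the guideline snippets, the naive term-overlap retriever, and the llm_answer() placeholder are hypothetical stand-ins for a real guideline corpus, retriever, and chat-completion API.

```python
# Minimal sketch of a guideline-grounded RAG answer flow (illustrative only).
# The guideline snippets and llm_answer() are hypothetical placeholders, not
# the study's corpus or model interface.

from collections import Counter

GUIDELINES = {
    "anticoagulation": "Hold warfarin about 5 days before elective surgery; bridge if high thrombotic risk.",
    "cardiac-risk": "Defer elective surgery for unstable angina or decompensated heart failure.",
    "diabetes": "Optimise glycaemic control; review HbA1c before elective procedures.",
}

def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Rank guideline passages by naive term overlap with the query
    (a real system would use embeddings or another retriever)."""
    q_terms = Counter(query.lower().split())
    ranked = sorted(
        corpus.values(),
        key=lambda text: -sum(q_terms[t] for t in text.lower().split()),
    )
    return ranked[:k]

def llm_answer(question: str, passages: list[str]) -> str:
    """Hypothetical placeholder for a chat-completion call (e.g. GPT-4):
    builds a prompt that restricts the model to the retrieved passages."""
    prompt = (
        "Answer strictly from the guideline excerpts below.\n\n"
        + "\n---\n".join(passages)
        + f"\n\nQuestion: {question}"
    )
    return f"[model completion of a {len(prompt)}-character grounded prompt]"

if __name__ == "__main__":
    q = "Patient on warfarin listed for elective hernia repair: fit to proceed?"
    print(llm_answer(q, retrieve(q, GUIDELINES)))
```

Restricting the prompt to retrieved guideline text is what the summary credits for the absence of hallucinations: the model is asked to ground every answer in the supplied excerpts rather than its parametric memory.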

Key Findings

  • Evaluated 10 LLMs with RAG across 14 clinical preoperative scenarios using 58 guidelines.
  • Generated 3,234 model responses versus 448 human answers; GPT-4 with RAG achieved 96.4% accuracy vs. 86.6% for humans (p=0.016; see the sketch after this list).
  • No hallucinations were observed; AI outputs were more consistent and delivered within ~20 seconds.
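
The reported p=0.016 implies a statistical comparison of the two accuracy rates. As a rough illustration only (the summary does not state the exact test or per-group denominators), the sketch below runs a two-sided two-proportion z-test on hypothetical counts chosen to match the 96.4% and 86.6% accuracies; it does not reproduce the study's analysis or its exact p-value.

```python
# Hypothetical two-proportion z-test for an accuracy gap like 96.4% vs. 86.6%.
# Counts are illustrative stand-ins, not the study's data.

from math import sqrt, erf

def two_proportion_z(successes_a: int, n_a: int, successes_b: int, n_b: int):
    """Two-sided z-test for the difference between two proportions."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Normal-approximation two-sided p-value via the error function.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Illustrative counts matching the reported accuracies (106/110 ≈ 96.4%, 97/112 ≈ 86.6%).
z, p = two_proportion_z(successes_a=106, n_a=110, successes_b=97, n_b=112)
print(f"z = {z:.2f}, p = {p:.3f}")
```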

Clinical Implications

Hospitals could pilot RAG-based preoperative decision support to standardize risk triage and instructions, with governance for data security, audit trails, clinician oversight, and local guideline integration.

Why It Matters

Demonstrates clinically relevant, guideline-grounded AI outperforming humans in perioperative decision tasks, addressing consistency and hallucination risks central to clinical adoption.

Limitations

  • Scenario-based evaluation rather than real-world patient care; external validity requires clinical implementation studies.
  • Performance may depend on guideline quality/coverage and prompt engineering; generalizability beyond included guidelines is uncertain.

Future Directions

Prospective clinical trials should test AI-RAG-assisted preoperative clinics, measuring impact on cancellations, safety events, workflow efficiency, and cost-effectiveness, alongside fairness and governance evaluation.

Study Information

Study Type: Cohort
Research Domain: Diagnosis
Evidence Level: III (prospective comparative methods evaluation without randomization)
Study Design: Other