Retrieval augmented generation for 10 large language models and its generalizability in assessing medical fitness.
Summary
In 14 standardized preoperative scenarios drawing on 58 guidelines, GPT-4 with RAG achieved 96.4% accuracy, significantly outperforming human responses, with no hallucinations and faster turnaround. Results suggest guideline-grounded LLMs can support safe, efficient preoperative fitness assessments.
Key Findings
- Evaluated 10 LLMs with RAG across 14 clinical preoperative scenarios using 58 guidelines.
- Generated 3234 model responses versus 448 human answers; GPT-4 RAG achieved 96.4% accuracy vs. 86.6% for humans (p=0.016).
- No hallucinations were observed; AI outputs were more consistent and delivered within ~20 seconds.
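The guideline-grounded RAG pattern evaluated here can be illustrated with a minimal sketch: retrieve the guideline passages most relevant to a clinical question, then constrain the model's answer to that retrieved text. The guideline snippets, the term-overlap scoring, and the prompt wording below are illustrative assumptions, not the study's actual pipeline, which would use embedding-based retrieval and an LLM API for generation.

```python
import re

# Toy guideline corpus; real deployments would index full local and
# international preoperative guidelines in a vector store.
GUIDELINES = [
    "Continue beta-blockers perioperatively in patients already taking them.",
    "Fast from solids for six hours before elective surgery.",
    "Check HbA1c before elective surgery in patients with diabetes.",
]

def tokenize(text):
    """Lowercase and strip punctuation so overlap scoring is robust."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, corpus, k=2):
    """Rank guideline snippets by term overlap with the query (stand-in
    for embedding similarity) and return the top k."""
    return sorted(
        corpus,
        key=lambda g: len(tokenize(g) & tokenize(query)),
        reverse=True,
    )[:k]

def build_prompt(query, snippets):
    """Ground the answer: the model may only use retrieved guideline text,
    which is what suppresses hallucination in this pattern."""
    context = "\n".join(f"- {s}" for s in snippets)
    return (
        "Answer using ONLY these guideline excerpts:\n"
        f"{context}\n\nQuestion: {query}"
    )

query = "How long should a patient fast before elective surgery?"
prompt = build_prompt(query, retrieve(query, GUIDELINES))
print(prompt)  # this prompt would then be sent to the generation model
```

The key design point is that generation is conditioned on retrieved guideline text rather than the model's parametric memory, which is what the study credits for the absence of hallucinations.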
Clinical Implications
Hospitals could pilot RAG-based preoperative decision support to standardize risk triage and instructions, with governance for data security, audit trails, clinician oversight, and local guideline integration.
Why It Matters
Demonstrates clinically relevant, guideline-grounded AI outperforming humans in perioperative decision tasks, addressing consistency and hallucination risks central to clinical adoption.
Limitations
- Scenario-based evaluation rather than real-world patient care; external validity requires clinical implementation studies.
- Performance may depend on guideline quality/coverage and prompt engineering; generalizability beyond included guidelines is uncertain.
Future Directions
Prospective clinical trials of AI-RAG-assisted preoperative clinics, measuring impact on surgical cancellations, safety events, workflow efficiency, and cost-effectiveness, alongside fairness and governance evaluation.
Study Information
- Evidence Level: III (prospective comparative methods evaluation without randomization)
- Study Design: Other