Daily Anesthesiology Research Analysis
Summary
Three anesthesia-focused studies advance perioperative decision support and risk stratification. A multicentre ultrasound–ML model accurately predicts gastric volume and outperforms the Perlas model; a real-world LLM chatbot (PEACH) safely operationalizes perioperative protocols with ~98% accuracy; and an explainable ML model predicts postoperative delirium and identifies outcome-relevant subtypes.
Research Themes
- Perioperative AI/ML implementation and validation
- Ultrasound-driven aspiration risk stratification
- Delirium prediction and phenotyping for personalized care
Selected Articles
1. Development and validation of machine learning predictive models for gastric volume based on ultrasonography: A multicentre study.
In a multicentre cohort (n=793), ML models using age, antral cross-sectional area in the right lateral decubitus position (RLD-CSA), and Perlas grade provided accurate gastric volume estimates and markedly improved detection of medium-to-high and high gastric volumes versus the Perlas model. External validation achieved AUC up to 0.96 for high gastric volume, supporting better aspiration risk stratification.
Impact: This work modernizes gastric ultrasound by providing validated, externally generalizable ML models that correct the systematic bias of the Perlas method and achieve excellent discrimination for high-risk volumes.
Clinical Implications: Integrating the model into point-of-care ultrasound could improve fasting status assessment, refine rapid sequence induction decisions, and reduce aspiration risk through accurate, threshold-based gastric volume prediction.
Key Findings
- Eight ML models built on age, RLD-CSA, and Perlas grade outperformed the Perlas model in internal validation (Perlas mean bias +23.5 mL vs ML −0.1 to +2.0 mL).
- Improved discrimination for medium-high gastric volume (AUC 0.74–0.77 vs 0.63) and high gastric volume (AUC 0.85–0.94 vs 0.74).
- External validation showed AUCs of 0.80–0.81 (medium-high gastric volume) and 0.96 (high gastric volume) for the top-performing models.
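The mean bias figures quoted above come from agreement analysis. A minimal sketch of a Bland-Altman calculation follows, assuming paired predicted and aspirated volumes and the conventional 1.96 SD limits of agreement. The function name `bland_altman` and the volumes are hypothetical, invented for illustration, and are not study data.

```python
from statistics import mean, stdev


def bland_altman(predicted, reference):
    """Return (mean bias, lower limit, upper limit) for paired measurements."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two equal-length series with at least 2 pairs")
    # Positive differences mean the method overestimates the reference.
    diffs = [p - r for p, r in zip(predicted, reference)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical volumes in mL, invented purely for illustration.
aspirated = [20.0, 45.0, 80.0, 110.0, 150.0]
estimated = [30.0, 60.0, 100.0, 140.0, 185.0]

bias, lower, upper = bland_altman(estimated, aspirated)
print(f"bias={bias:.1f} mL, limits of agreement=({lower:.1f}, {upper:.1f}) mL")
```

A bias near zero with narrow limits indicates good agreement, whereas a consistently positive bias, as reported for the Perlas model, indicates systematic overestimation.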
Methodological Strengths
- Multicentre design with prospective enrollment and external validation.
- Objective reference standard for gastric volume via immediate endoscopic aspiration.
- Feature selection with LASSO and rigorous agreement assessment (Bland-Altman).
Limitations
- Cross-sectional design under intravenous anesthesia for endoscopy may limit generalizability to broader surgical populations.
- Low prevalence of high gastric volume in the external cohort (1.5%) leaves few events, so discrimination estimates for that class may have wide confidence intervals.
- Operator-dependent ultrasound measures were not standardized across all sonographers.
Future Directions: Prospective perioperative studies linking model-guided decisions to aspiration events and outcomes; integration into ultrasound devices with real-time thresholds; calibration across BMI and pregnancy.
2. Real-world deployment and evaluation of PEri-operative AI CHatbot (PEACH): a large language model chatbot for peri-operative medicine.
PEACH, a perioperative LLM chatbot embedding 35 protocols, achieved 97.9% accuracy after iterative updates with minimal hallucinations (1/240) and high usability, expediting decisions in 95% of cases. Inter-rater reliability for PEACH exceeded that of experienced physicians across iterations.
Impact: This is an early real-world demonstration that a securely scoped LLM can reliably operationalize perioperative protocols with measurable accuracy and safety, potentially standardizing and accelerating complex decisions.
Clinical Implications: Hospitals can trial scoped LLM tools to harmonize practice with local protocols, reduce cognitive load, and improve turnaround time for perioperative decisions, provided governance, auditing, and safety guardrails are in place.
Key Findings
- Overall accuracy improved to 97.9% (235/240), significantly higher than the 95% benchmark (p=0.018).
- Very low rates of hallucinations (1/240) and deviations (2/240) with explicit harm categorization.
- Decision-making was expedited in 95% of uses; inter-rater reliability (κ up to 0.893) surpassed that of experienced physicians.
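The accuracy comparison against the 95% benchmark can be checked with simple arithmetic. The sketch below assumes a one-sided exact binomial test, which the summary does not confirm as the authors' method; `binomial_upper_tail` is a hypothetical helper name.

```python
from math import comb


def binomial_upper_tail(successes, trials, p0):
    """P(X >= successes) for X ~ Binomial(trials, p0)."""
    return sum(
        comb(trials, k) * p0**k * (1 - p0) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Counts taken from the reported result: 235 correct responses out of 240.
accuracy = 235 / 240
# Assumption: one-sided test of H0 "true accuracy = 0.95".
p_value = binomial_upper_tail(235, 240, 0.95)
print(f"accuracy={accuracy:.3f}, one-sided p={p_value:.3f}")
```

Under this assumption the tail probability comes out at roughly 0.018, in line with the reported p-value. A two-sided or approximate test would give a different figure.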
Methodological Strengths
- Real-world silent deployment with head-to-head comparison against institutional guidelines and expert consensus.
- Iterative evaluation with predefined safety taxonomy for hallucinations/deviations and usability assessment.
- Inter-rater reliability benchmarking against experienced clinicians.
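Inter-rater reliability of the kind reported here is typically summarized with a kappa statistic. The sketch below assumes a two-rater Cohen's kappa, since the summary does not say which variant was used. The function `cohens_kappa` and the ratings are hypothetical and are not study data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty, equal-length rating lists")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of ten chatbot answers, invented for illustration.
rater_a = ["ok"] * 8 + ["deviation"] * 2
rater_b = ["ok"] * 7 + ["deviation"] * 3

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa={kappa:.3f}")
```

Because kappa discounts agreement expected by chance, it can be much lower than raw percentage agreement when one category dominates.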
Limitations
- Single-institution protocol scoping limits generalizability; content confined to 35 local protocols.
- No assessment of downstream patient outcomes or error interception in live (non-silent) deployment.
- Potential anchoring to institutional norms may propagate local practice biases.
Future Directions: Multisite deployments with outcome metrics (process adherence, complications, time-to-decision), continuous monitoring for drift, and integration with EHR context for patient-specific recommendations.
3. Leveraging data-driven machine learning: From explainable risk prediction to hierarchical clustering-based subtypes of postoperative delirium in a prospective non-cardiac surgery cohort.
In 1106 non-cardiac surgeries, a random forest model predicted postoperative delirium (POD) with AUC 0.85 and was explainable via SHAP, with surgery duration, MMSE score, and frailty as the top contributors. Clustering of POD cases identified three subtypes with distinct length-of-stay and survival patterns.
Impact: Combining explainable prediction with clinically meaningful POD phenotypes bridges risk stratification and targeted perioperative interventions.
Clinical Implications: Preoperative screening that incorporates the model's predictors (cognition, frailty) and planned surgery duration could direct resources such as delirium prevention bundles to high-risk subtypes and inform discharge planning.
Key Findings
- Random forest achieved AUC 0.85 (95% CI 0.78–0.91) for POD prediction among six compared algorithms.
- SHAP identified surgery duration, MMSE, and Edmonton Frail Scale as top contributors to risk.
- Hierarchical clustering revealed three POD subtypes with significantly different length-of-stay (e.g., 21.5 vs 5 days) and 12-month survival (Subtype 2 > 3 > 1; p < 0.001).
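The AUC values above can be read as the probability that a randomly chosen delirium case receives a higher predicted risk than a randomly chosen non-case. A minimal sketch of that pairwise definition follows. The function name `auc` and the scores and labels are hypothetical, invented for illustration, and are not study data.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    # Count case/non-case pairs where the case is ranked higher.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted risks; label 1 = delirium, 0 = no delirium.
risk_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
outcomes = [1, 1, 1, 0, 0, 0]

example_auc = auc(risk_scores, outcomes)
print(f"AUC={example_auc:.3f}")
```

The pairwise form is quadratic in the number of cases, so it suits small illustrations. Rank-based implementations are preferable for cohorts of realistic size.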
Methodological Strengths
- Secondary analysis of prospective cohorts with external validation.
- Model interpretability via SHAP enhances clinical transparency.
- Outcome-based clustering linking phenotypes to LOS and survival.
Limitations
- Secondary analysis with potential for residual confounding and missing variables.
- Generalizability beyond non-cardiac surgeries and across institutions needs further testing.
- Heterogeneity in delirium assessment methods and timing may affect outcome labels.
Future Directions: Prospective interventional trials targeting identified subtypes, EHR-integrated risk alerts, and calibration across diverse surgical populations.