Daily Report

Daily Anesthesiology Research Analysis

09/20/2025
3 papers selected
3 analyzed

Summary

Three anesthesia-focused studies advance perioperative decision support and risk stratification. A multicentre ultrasound–ML model accurately predicts gastric volume and outperforms the Perlas model, a real-world LLM chatbot (PEACH) safely operationalizes perioperative protocols with ~98% accuracy, and an explainable ML model predicts postoperative delirium and identifies outcome-relevant subtypes.

Research Themes

  • Perioperative AI/ML implementation and validation
  • Ultrasound-driven aspiration risk stratification
  • Delirium prediction and phenotyping for personalized care

Selected Articles

1. Development and validation of machine learning predictive models for gastric volume based on ultrasonography: A multicentre study.

75.5 · Level III · Cohort
Journal of Clinical Anesthesia · 2025 · PMID: 40972267

In a multicentre cohort (n=793), ML models using age, RLD-CSA, and Perlas grade provided accurate gastric volume estimates and markedly improved detection of medium-to-high and high gastric volumes versus the Perlas model. External validation achieved AUC up to 0.96 for high gastric volume, supporting better aspiration risk stratification.

Impact: This work modernizes gastric ultrasound by providing validated, externally generalizable ML models that correct the systematic bias of the Perlas method and achieve excellent discrimination for high-risk volumes.

Clinical Implications: Integrating the model into point-of-care ultrasound could improve fasting status assessment, refine rapid sequence induction decisions, and reduce aspiration risk through accurate, threshold-based gastric volume prediction.

Key Findings

  • Eight ML models built on age, RLD-CSA, and Perlas grade outperformed the Perlas model in internal validation (Perlas mean bias +23.5 mL vs ML −0.1 to +2.0 mL).
  • Improved discrimination for medium-high gastric volume (AUC 0.74–0.77 vs 0.63) and high gastric volume (AUC 0.85–0.94 vs 0.74).
  • In external validation, the two top-performing models achieved AUCs of 0.81 and 0.80 for medium-high gastric volume, and 0.96 for both models for high gastric volume.
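For readers less familiar with AUC-based discrimination, the sketch below computes and compares AUCs for two hypothetical risk scores on synthetic data; the event rate, scores, and effect sizes are invented for illustration and are not the study's data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 200

# Synthetic outcome: 5% of patients with "high gastric volume",
# loosely mirroring the low event rates reported above (illustrative only).
y = np.zeros(n, dtype=int)
y[:10] = 1

# Two hypothetical risk scores: the "new" score separates classes better.
noise = rng.normal(0.0, 0.5, size=(2, n))
score_old = 0.4 * y + noise[0]
score_new = 1.2 * y + noise[1]

auc_old = roc_auc_score(y, score_old)
auc_new = roc_auc_score(y, score_new)
print(f"AUC old: {auc_old:.2f}  AUC new: {auc_new:.2f}")
```

An AUC of 0.5 is chance-level ranking of cases versus non-cases; 1.0 is perfect separation, which is why the reported external-validation values near 0.96 indicate excellent discrimination.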

Methodological Strengths

  • Multicentre design with prospective enrollment and external validation.
  • Objective reference standard for gastric volume via immediate endoscopic aspiration.
  • Feature selection with LASSO and rigorous agreement assessment (Bland-Altman).
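As a rough sketch of the LASSO-then-agreement workflow named above (not the study's code or data), the following fits a cross-validated LASSO on synthetic predictors, keeps the non-zero coefficients, and reports a Bland-Altman style mean bias for the fitted predictions. Feature roles and effect sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n = 300

# Six standardized candidate predictors; only the first three are
# informative (stand-ins for age, RLD-CSA, and Perlas grade).
X = rng.normal(size=(n, 6))
true_coef = np.array([0.8, 1.5, 0.6, 0.0, 0.0, 0.0])
volume = X @ true_coef + rng.normal(0.0, 0.5, n)  # simulated outcome

# Cross-validated LASSO shrinks uninformative coefficients toward zero.
lasso = LassoCV(cv=5, random_state=0).fit(X, volume)
selected = np.flatnonzero(lasso.coef_)
print("selected feature indices:", selected)

# Bland-Altman mean bias: average (predicted - measured) difference.
bias = np.mean(lasso.predict(X) - volume)
print(f"mean bias: {bias:.3f}")
```

The reported Perlas-model bias of +23.5 mL corresponds to a systematic offset in exactly this predicted-minus-measured sense, which the new models reduce to near zero.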

Limitations

  • Cross-sectional design under intravenous anesthesia for endoscopy may limit generalizability to broader surgical populations.
  • Low prevalence of high gastric volume in the external cohort (1.5%) may widen confidence intervals.
  • Operator-dependent ultrasound measures were not standardized across all sonographers.

Future Directions: Prospective perioperative studies linking model-guided decisions to aspiration events and outcomes; integration into ultrasound devices with real-time thresholds; calibration across BMI and pregnancy.

STUDY OBJECTIVE: Aspiration of gastric contents is a serious complication associated with anaesthesia. Accurate prediction of gastric volume may assist in risk stratification and help prevent aspiration. This study aimed to develop and validate machine learning models to predict gastric volume based on ultrasound and clinical features. METHODS: This cross-sectional multicentre study was conducted at two hospitals and included adult patients undergoing gastroscopy under intravenous anaesthesia. Patients from Centre 1 were prospectively enrolled and randomly divided into a training set (Cohort A, n = 415) and an internal validation set (Cohort B, n = 179), while patients from Centre 2 were used as an external validation set (Cohort C, n = 199). The primary outcome was gastric volume, which was measured by endoscopic aspiration immediately following ultrasonographic examination. Least absolute shrinkage and selection operator (LASSO) regression was used for feature selection, and eight machine learning models were developed and evaluated using Bland-Altman analysis. The models' ability to predict medium-to-high and high gastric volumes was assessed. The top-performing models were externally validated, and their predictive performance was compared with the traditional Perlas model. MAIN RESULTS: Among the 793 enrolled patients, the number and proportion of patients with high gastric volume were as follows: 23 (5.5 %) in the development cohort, 10 (5.6 %) in the internal validation cohort, and 3 (1.5 %) in the external validation cohort. Eight models were developed using age, cross-sectional area of gastric antrum in right lateral decubitus (RLD-CSA) position, and Perlas grade, with these variables selected through LASSO regression. In internal validation, Bland-Altman analysis showed that the Perlas model overestimated gastric volume (mean bias 23.5 mL), while the new models provided accurate estimates (mean bias -0.1 to 2.0 mL). 
The models significantly improved prediction of medium-high gastric volume (area under the curve [AUC]: 0.74-0.77 vs. 0.63) and high gastric volume (AUC: 0.85-0.94 vs. 0.74). The best-performing adaptive boosting and linear regression models underwent external validation, with AUCs of 0.81 (95 % confidence interval [CI], 0.74-0.89) and 0.80 (95 %CI, 0.72-0.89) for medium-high and 0.96 (95 %CI, 0.91-1) and 0.96 (95 %CI, 0.89-1) for high gastric volume. CONCLUSIONS: We propose a novel machine learning-based predictive model that outperforms the Perlas model by incorporating the key features of age, RLD-CSA, and Perlas grade, enabling accurate prediction of gastric volume.

2. Real-world deployment and evaluation of PEri-operative AI CHatbot (PEACH): a large language model chatbot for peri-operative medicine.

74.5 · Level III · Cohort
Anaesthesia · 2025 · PMID: 40973491

PEACH, a perioperative LLM chatbot embedding 35 protocols, achieved 97.9% accuracy after iterative updates with minimal hallucinations (1/240) and high usability, expediting decisions in 95% of cases. Inter-rater reliability for PEACH exceeded that of experienced physicians across iterations.

Impact: This is an early real-world demonstration that a securely scoped LLM can reliably operationalize perioperative protocols with measurable accuracy and safety, potentially standardizing and accelerating complex decisions.

Clinical Implications: Hospitals can trial scoped LLM tools to harmonize practice with local protocols, reduce cognitive load, and improve turnaround time for perioperative decisions, provided governance, auditing, and safety guardrails are in place.

Key Findings

  • Overall accuracy improved to 97.9% (235/240) and was statistically higher than a 95% benchmark (p=0.018).
  • Very low rates of hallucinations (1/240) and deviations (2/240) with explicit harm categorization.
  • Decision-making was expedited in 95% of uses; inter-rater reliability (κ up to 0.893) surpassed that of experienced physicians.

Methodological Strengths

  • Real-world silent deployment with head-to-head comparison against institutional guidelines and expert consensus.
  • Iterative evaluation with predefined safety taxonomy for hallucinations/deviations and usability assessment.
  • Inter-rater reliability benchmarking against experienced clinicians.
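The κ statistics quoted in this study are chance-corrected agreement measures. The toy example below (hypothetical ratings, not PEACH data) shows how Cohen's kappa is computed for two raters grading the same set of responses:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical: two raters label 12 chatbot responses as accurate (1)
# or inaccurate (0); they disagree on exactly one response.
rater_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
rater_b = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # -> Cohen's kappa: 0.75
```

Note that raw agreement here is 11/12 (0.92), yet kappa is only 0.75, because kappa discounts the agreement expected by chance; values of 0.772-0.893, as reported for PEACH, indicate substantial-to-almost-perfect reliability.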

Limitations

  • Single-institution protocol scoping limits generalizability; content confined to 35 local protocols.
  • No assessment of downstream patient outcomes or error interception in live (non-silent) deployment.
  • Potential anchoring to institutional norms may propagate local practice biases.

Future Directions: Multisite deployments with outcome metrics (process adherence, complications, time-to-decision), continuous monitoring for drift, and integration with EHR context for patient-specific recommendations.

INTRODUCTION: Large Language Models are emerging as powerful tools in healthcare, particularly for complex, domain-specific tasks. This study describes the development and evaluation of PEri-operative AI CHatbot (PEACH). It was developed by embedding 35 institutional peri-operative protocols into a secure large language model environment, with iterative prompt engineering and internal testing to ensure clinical relevance and accuracy. METHODS: The system was tested with a silent deployment using real-world data. Accuracy, safety and usability were assessed. Accuracy was evaluated by comparing the responses from PEACH against institutional guidelines and expert consensus. Deviations and hallucinations were categorised based on potential harm, and user feedback was evaluated using the Davis' Technology Acceptance Model. Updates to PEACH were made after the initial silent deployment to make minor amendments to one of the protocols. RESULTS: In total, 240 real-world clinical iterations were evaluated. First-generation accuracy was 97.5% (78/80), with an overall accuracy of 96.7% (232/240) across three iterations. In the updated PEACH, accuracy improved to 97.9% (235/240), with a statistically significant difference from the null hypothesis of 95% accuracy (p = 0.018). Hallucinations and deviations were minimal (1/240 and 2/240, respectively). There was high usability, with clinicians noting that PEACH expedited decisions in 95% of cases. The κ statistic for inter-rater reliability for PEACH was 0.772 and 0.893 between three iterations, compared with 0.610 and 0.784 for experienced peri-operative physicians. DISCUSSION: PEACH is an accurate, adaptable tool that enhances consistency and efficiency in peri-operative decision-making. Future research should explore scalability across specialties and its impact on clinical outcomes.

Plain-language summary: Computer programs called large language models (LLMs) are becoming helpful in healthcare. This paper talks about how a special healthcare chatbot named PEACH was created. PEACH helps doctors make decisions before, during and after surgery. It learned from 35 sets of hospital rules to give useful and safe advice. The team tested it many times to make sure it gave correct answers. PEACH was tested using real hospital cases. The team checked how correct, safe and easy it was to use. They compared what PEACH said with what the hospital guidelines and expert doctors would say. Any wrong or made-up answers were looked at carefully. The team also asked users what they thought about using PEACH. After the first test, they made a few small improvements to one of the rules PEACH used. The team tested PEACH on 240 real hospital cases. In the first test, it was right 97.5% of the time. After some changes, it got even better, being right 97.9% of the time. It almost never made things up or gave wrong advice. Doctors said it helped them make decisions faster 95% of the time. The tool worked well when different people used it and gave similar results. PEACH is a smart and reliable tool that helps doctors make better choices during surgeries. The team hopes to test it more and use it in other areas of medicine too.

3. Leveraging data-driven machine learning: From explainable risk prediction to hierarchical clustering-based subtypes of postoperative delirium in a prospective non-cardiac surgery cohort.

70 · Level III · Cohort
Journal of Clinical Anesthesia · 2025 · PMID: 40972266

In 1106 non-cardiac surgical patients, a random forest model predicted postoperative delirium with an AUC of 0.85 and was explainable via SHAP (top contributors: surgery duration, MMSE, and frailty). Clustering of POD cases identified three subtypes with distinct length-of-stay and survival patterns.

Impact: Combining explainable prediction with clinically meaningful POD phenotypes bridges risk stratification and targeted perioperative interventions.

Clinical Implications: Preoperative screening incorporating model predictors (cognition, frailty) and surgery duration planning could allocate resources (delirium prevention bundles) to high-risk subtypes and inform discharge planning.

Key Findings

  • Random forest achieved AUC 0.85 (95% CI 0.78–0.91) for POD prediction among six compared algorithms.
  • SHAP identified surgery duration, MMSE, and Edmonton Frail Scale as top contributors to risk.
  • Hierarchical clustering revealed three POD subtypes with significantly different length-of-stay (e.g., 21.5 vs 5 days) and 12-month survival (Subtype 2 > 3 > 1; p < 0.001).

Methodological Strengths

  • Secondary analysis of prospective cohorts with external validation.
  • Model interpretability via SHAP enhances clinical transparency.
  • Outcome-based clustering linking phenotypes to LOS and survival.
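To illustrate the clustering step in the abstract (with synthetic data, not the study's cohort), the sketch below applies Ward-linkage hierarchical clustering to simulated patient risk-factor profiles and cuts the dendrogram into three subtypes; the features and group structure are invented for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(2)

# Synthetic stand-in: 90 "POD patients" with three scaled risk factors
# (e.g. frailty, cognition, inflammation), drawn from three distinct profiles.
centers = np.array([[0, 0, 0], [4, 4, 0], [0, 4, 4]], dtype=float)
X = np.vstack([c + rng.normal(0.0, 0.5, size=(30, 3)) for c in centers])

# Ward linkage builds the dendrogram; fcluster cuts it into 3 subtypes.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")
print("subtype sizes:", np.bincount(labels)[1:])
```

In the study, the analogous cluster labels were then compared against length of stay and survival, turning an unsupervised grouping into clinically interpretable phenotypes.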

Limitations

  • Secondary analysis with potential for residual confounding and missing variables.
  • Generalizability beyond non-cardiac surgeries and across institutions needs further testing.
  • Delirium assessment methods and timing heterogeneity may affect labels.

Future Directions: Prospective interventional trials targeting identified subtypes, EHR-integrated risk alerts, and calibration across diverse surgical populations.

STUDY OBJECTIVE: To leverage perioperative indicators in developing an explainable machine learning (ML) model for postoperative delirium (POD) prediction, discover distinct data-driven POD subtypes through hierarchical clustering analysis, and enhance personalized risk stratification to inform targeted clinical interventions. METHODS: This is a secondary analysis of several prospective observational studies, including 1106 patients who had non-cardiac surgery. Univariate analysis and the least absolute shrinkage and selection operator (LASSO) regression were used to screen essential features associated with POD. We compared six algorithms: adaptive boosting with classification trees, random forest (RF), neural networks, support vector machines, extreme gradient boosting with classification trees and logistic regression. SHapley Additive exPlanations (SHAP) was used to interpret the best-performing model, which was then externally validated in another large tertiary hospital. Among patients who developed POD, we conducted hierarchical clustering analysis on the risk factors (identified through univariate screening in the prediction model) to delineate distinct subtypes. We then compared the length of postoperative hospital stay and mortality rates (at 1, 3, 6, and 12 months postoperatively) between the identified clusters. MAIN RESULTS: We identified 14 POD risk factors to develop ML models. The RF model performed best among the six ML models (area under the curve [AUC] of 0.85, 95 % confidence interval [CI], 0.78-0.91). SHAP analysis highlighted surgery duration, preoperative mini-mental state examination score, and Edmonton Frail Scale as the top predictors of POD.
Hierarchical clustering identified three distinct POD subtypes: Subtype 1 (high-risk profile with significant comorbidity and inflammatory dysregulation, longest hospitalization: 21.5 days [interquartile range (IQR) 19-28]; p < 0.001), Subtype 2 (resilient majority with optimal survival; Log-rank p < 0.001), and Subtype 3 (advanced age, frailty and low cognitive reserve, shortest hospitalization: 5 days [IQR 4-8]). Kaplan-Meier analysis showed significant 12-month survival differences among the subtypes (Subtype 2 > Subtype 3 > Subtype 1; p < 0.001). CONCLUSION: Our study validated the utility of ML models, particularly RF, in predicting POD and identified three novel data-driven subtypes with distinct clinical characteristics.