315

GASTROENTEROLOGY SPECIFIC AI MODEL OUTPERFORMS ATTENDING PHYSICIAN CLINICAL NOTES IN A REAL-WORLD DATA EVALUATION

Date
May 19, 2024

Introduction: Artificial intelligence (AI) large language models (LLMs) show promise in medicine; however, general-purpose AI models underperform on clinical tasks. To address this gap, our team developed a specialty-specific, multi-task clinical LLM, GastroGPT (Fig 1). We previously demonstrated the superiority of the platform over general-purpose LLMs in a simulated environment and now seek to compare GastroGPT’s note-taking abilities with those of attending physicians using real-world clinical data.

Materials and Methods: GastroGPT was evaluated on 3,530 selected gastroenterology-focused intensive care admissions in the Medical Information Mart for Intensive Care III (MIMIC-III) database, which includes de-identified, comprehensive clinical patient data. Selected attending physician notes were assessed across seven domains mirroring clinical flow and were used to select cases representing the gastroenterology subspecialties. A novel guideline-based, expert-derived, weighted objective rubric, the Clinical Language Model Evaluation Rubric (CLEAR), was used to assess GastroGPT and physician performance on key clinical tasks: assessment, diagnostic workup, treatment planning, follow-up, multidisciplinary care, history gathering, and patient education. CLEAR incorporates subtasks and essential skills under each task to enable standardized evaluation. In total, CLEAR encompasses 57 benchmarked and weighted subtasks across the clinical tasks. Overall weighted performance was the primary outcome; secondary outcomes were individual task performance and consistency across case complexity. Multivariable regression was used to identify predictors of scores.
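The weighted scoring described above can be illustrated with a minimal sketch. The abstract does not publish the CLEAR subtasks, weights, or scale, so every name and number below (`Subtask`, `weighted_score`, the example subtasks, the 0 to 10 scale) is a hypothetical placeholder, and a plain weighted mean is an assumed aggregation rule rather than the authors' method.

```python
from dataclasses import dataclass


@dataclass
class Subtask:
    name: str      # hypothetical subtask label, not taken from CLEAR
    weight: float  # assumed relative importance of the subtask
    score: float   # assumed rater score on a 0-10 scale


def weighted_score(subtasks):
    """Return the weighted mean of the subtask scores."""
    total_weight = sum(s.weight for s in subtasks)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s.weight * s.score for s in subtasks) / total_weight


# Illustrative values only: two invented subtasks for one clinical task.
diagnostic_workup = [
    Subtask("orders appropriate labs", weight=2.0, score=8.0),
    Subtask("justifies imaging choice", weight=1.0, score=5.0),
]
print(weighted_score(diagnostic_workup))
```

The same function could be applied a second time to task-level scores to obtain an overall weighted note score, again assuming a weighted mean.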

Results: GastroGPT achieved higher note scores than attending physicians for gastroenterology-focused cases (8.1 ± 0.6 vs 6.5 ± 1.4, p<0.001). Across all clinical tasks in the notes, GastroGPT showed superior performance to attending physicians (Fig 2): 1) assessment and summary (8.5 ± 0.3 vs 7.18 ± 0.59), 2) diagnostic workup (8.5 ± 0.4 vs 7.6 ± 0.3, p<0.001), 3) treatment planning and management (7.6 ± 0.4 vs 6.5 ± 0.4, p<0.001), 4) follow-up and 5) multidisciplinary care (8.5 ± 0.3 vs 6.7 ± 0.6, p<0.001). For 6) additional history and 7) patient education, GastroGPT was compared only with ChatGPT. GastroGPT was superior to ChatGPT4 in all cases (p<0.001), and ChatGPT4 scored lower than physicians (5.2 ± 2.1 vs 6.5 ± 1.4; p<0.05). In multivariable analysis, GastroGPT remained a predictor of higher scores after adjusting for other clinical factors. Subgroup analysis demonstrated consistent GastroGPT performance across levels of case complexity.
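To put the reported overall difference on a standardized scale, the sketch below computes Cohen's d from the means and standard deviations given above. This is an illustration only: the abstract does not report an effect size or name the statistical test, and the function assumes the two groups are of equal size so that the pooled standard deviation is the root mean square of the two reported values.

```python
import math


def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Standardized mean difference, assuming equal group sizes."""
    # With equal n, the pooled SD reduces to the root mean square of the SDs.
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd


# Overall note scores as reported: GastroGPT 8.1 ± 0.6, physicians 6.5 ± 1.4.
d_overall = cohens_d(8.1, 0.6, 6.5, 1.4)
print(round(d_overall, 2))
```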

Conclusion: The gastroenterology-specific AI model GastroGPT outperformed attending physician note-taking across all tasks, while the general-purpose AI model, ChatGPT4, was inferior to both GastroGPT and physician notes.

Figure 1. Diagram illustrating the design and working principles of the gastroenterology-specific artificial intelligence large language model, GastroGPT.

Figure 2. Bar chart comparing the evaluation scores of GastroGPT, attending physician and ChatGPT.
* p < 0.05 vs ChatGPT
** p < 0.05 vs Attending Physician and ChatGPT

