GASTROENTEROLOGY SPECIFIC AI MODEL OUTPERFORMS ATTENDING PHYSICIAN CLINICAL NOTES IN A REAL-WORLD DATA EVALUATION

Date

May 19, 2024

Introduction: Artificial intelligence (AI) large language models (LLMs) show promise in medicine; however, general-purpose AI models underperform on clinical tasks. To address this gap, our team developed a specialty-specific, multi-task clinical LLM: GastroGPT (Fig 1). We previously demonstrated the superiority of the platform over general-purpose LLMs in a simulated environment and now seek to compare GastroGPT's note-taking abilities with those of attending physicians using real-world clinical data.

Materials and Methods: GastroGPT was evaluated on 3,530 selected gastroenterology-focused intensive care admissions in the Medical Information Mart for Intensive Care III (MIMIC-III) database, which includes de-identified, comprehensive clinical patient data. Selected attending physician notes were assessed across seven domains mirroring clinical flow and were used to select cases representing the gastroenterology subspecialties. A novel guideline-based, expert-derived, weighted objective rubric, the Clinical Language Model Evaluation Rubric (CLEAR), was used to assess GastroGPT and physician performance on key clinical tasks, including assessment, diagnostic workup, treatment planning, follow-up, multidisciplinary care, history gathering, and patient education. CLEAR incorporates subtasks and essential skills under each task to enable standardized evaluation. In total, CLEAR encompasses 57 benchmarked and weighted subtasks across the clinical tasks. The primary outcome was overall weighted performance; secondary outcomes were individual task performance and consistency across case complexity. Multivariable regression was used to identify predictors of score.
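
The abstract does not publish the CLEAR weights or subtask definitions, so the following is only a minimal sketch of how a two-level weighted rubric of this kind could be aggregated. The task weights, subtask names, subtask weights, ratings, and the 0 to 10 scale are all hypothetical placeholders, and the authors' actual scoring procedure may differ.

```python
def weighted_mean(pairs):
    """Return the weighted mean of (weight, score) pairs."""
    total_weight = sum(weight for weight, _ in pairs)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weight * score for weight, score in pairs) / total_weight


def score_note(rubric, ratings):
    """Aggregate subtask ratings into per-task scores and one overall score.

    rubric:  {task: (task_weight, {subtask: subtask_weight})}
    ratings: {task: {subtask: score}}
    """
    task_scores = {
        task: weighted_mean(
            [(weight, ratings[task][name]) for name, weight in subtasks.items()]
        )
        for task, (_, subtasks) in rubric.items()
    }
    overall = weighted_mean(
        [(rubric[task][0], score) for task, score in task_scores.items()]
    )
    return task_scores, overall


# Hypothetical two-task rubric; the real CLEAR rubric has 57 weighted subtasks.
RUBRIC = {
    "assessment": (2.0, {"problem summary": 3.0, "severity": 1.0}),
    "diagnostic workup": (1.0, {"labs": 1.0, "imaging": 1.0}),
}
# Hypothetical ratings for one note, assumed to be on a 0 to 10 scale.
RATINGS = {
    "assessment": {"problem summary": 8.0, "severity": 6.0},
    "diagnostic workup": {"labs": 9.0, "imaging": 7.0},
}

task_scores, overall = score_note(RUBRIC, RATINGS)
print(task_scores)
print(round(overall, 2))
```

Normalizing by the sum of weights at each level keeps every task score and the overall score on the same scale as the subtask ratings, which makes scores comparable between notes even if a rubric revision changes the number of subtasks.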

Results: GastroGPT achieved higher note scores than attending physicians for gastroenterology-focused cases (8.1 ± 0.6 vs 6.5 ± 1.4, p<0.001). Across all clinical tasks in the notes, GastroGPT showed superior performance to attending physicians (Fig 2): 1) assessment and summary (8.5 ± 0.3 vs 7.18 ± 0.59), 2) diagnostic workup (8.5 ± 0.4 vs 7.6 ± 0.3, p<0.001), 3) treatment planning and management (7.6 ± 0.4 vs 6.5 ± 0.4, p<0.001), 4) follow-up and 5) multidisciplinary care (8.5 ± 0.3 vs 6.7 ± 0.6, p<0.001). For 6) additional history and 7) patient education, GastroGPT was compared only with ChatGPT. GastroGPT was superior to ChatGPT4 in all cases (p<0.001), and ChatGPT4 in turn scored lower than physicians (5.2 ± 2.1 vs 6.5 ± 1.4; p<0.05). In multivariable analysis, GastroGPT was a predictor of higher scores after adjusting for other clinical factors. Subgroup analysis demonstrated consistent GastroGPT performance across case complexity.
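
To put the reported score gap in context, a reader could convert the summary statistics into a standardized mean difference. The sketch below assumes that the ± values are standard deviations and that both groups are the same size; the abstract states neither, so the output is illustrative only and is not a result reported by the authors.

```python
import math


def standardized_mean_difference(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d for two groups, assuming equal group sizes (pooled SD)."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd


# Overall note scores from the abstract: GastroGPT 8.1 ± 0.6, physicians 6.5 ± 1.4.
smd = standardized_mean_difference(8.1, 0.6, 6.5, 1.4)
print(f"Standardized mean difference: {smd:.2f}")
```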

Conclusion: The gastroenterology-specific AI model GastroGPT outperformed attending physician note-taking across all tasks, while the general-purpose AI model, ChatGPT4, was inferior to both GastroGPT and physician notes.

Figure 1. Diagram illustrating the design and working principles of the gastroenterology-specific artificial intelligence large language model, GastroGPT.

Figure 2. Bar chart comparing the evaluation scores of GastroGPT, attending physicians, and ChatGPT.
* p < 0.05 vs ChatGPT
** p < 0.05 vs Attending Physician and ChatGPT

Presenter

Cem Simsek

Speakers

Christopher C. Thompson

Brigham and Women's Hospital

Tracks

AGA
