Introduction: Artificial intelligence (AI) large language models (LLMs) show promise in medicine; however, general-purpose AI models underperform on clinical tasks. Recognizing this potential, our team developed a specialty-specific, multi-task clinical LLM, GastroGPT (Fig 1). We previously demonstrated superiority of the platform over general-purpose LLMs in a simulated environment and now seek to compare GastroGPT’s note-taking ability with that of attending physicians using real-world clinical data.
Materials and Methods: GastroGPT was evaluated on 3,530 selected gastroenterology-focused intensive care admissions from the Medical Information Mart for Intensive Care III (MIMIC-III) database, which contains de-identified, comprehensive clinical patient data. Attending physician notes were assessed across seven domains mirroring clinical workflow, and cases were selected to represent the gastroenterology subspecialties. A novel guideline-based, expert-derived, weighted objective rubric, the Clinical Language Model Evaluation Rubric (CLEAR), was used to assess GastroGPT and physician performance on key clinical tasks: assessment, diagnostic workup, treatment planning, follow-up, multidisciplinary care, history gathering, and patient education. CLEAR incorporates subtasks and essential skills under each task to enable standardized evaluation, encompassing 57 benchmarked and weighted subtasks in total. Overall weighted performance was the primary outcome; secondary outcomes were individual task performance and consistency across case complexity. Multivariable regression was used to identify predictors of score.
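For illustration, the sketch below shows how an overall weighted CLEAR score could in principle be aggregated from benchmarked subtask scores and expert-derived weights. This is not the study's implementation; the subtask names, weights, and 0–10 scale are hypothetical.

```python
# Illustrative sketch only: hypothetical subtask names, weights, and scores;
# not the actual CLEAR rubric content or the study's scoring code.

def weighted_clear_score(subtask_scores: dict, weights: dict) -> float:
    """Return the weight-normalized mean of subtask scores (assumed 0-10 scale)."""
    total_weight = sum(weights[name] for name in subtask_scores)
    weighted_sum = sum(score * weights[name] for name, score in subtask_scores.items())
    return weighted_sum / total_weight

# Three hypothetical subtasks standing in for the 57 benchmarked CLEAR subtasks
weights = {"summarize_presenting_illness": 3.0,
           "order_appropriate_workup": 2.0,
           "document_follow_up_interval": 1.0}
scores = {"summarize_presenting_illness": 8.5,
          "order_appropriate_workup": 7.0,
          "document_follow_up_interval": 9.0}

print(round(weighted_clear_score(scores, weights), 2))  # 8.08
```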
Results: GastroGPT achieved higher note scores than attending physicians for gastroenterology-focused cases (8.1 ± 0.6 vs 6.5 ± 1.4, p<0.001). Across all clinical tasks in the notes, GastroGPT showed superior performance to attending physicians (Fig 2): 1) assessment and summary (8.5 ± 0.3 vs 7.18 ± 0.59), 2) diagnostic workup (8.5 ± 0.4 vs 7.6 ± 0.3, p<0.001), 3) treatment planning and management (7.6 ± 0.4 vs 6.5 ± 0.4, p<0.001), and 4) follow-up and 5) multidisciplinary care (8.5 ± 0.3 vs 6.7 ± 0.6, p<0.001). For 6) additional history and 7) patient education, GastroGPT was compared only with ChatGPT-4 and was superior in all cases (p<0.001); ChatGPT-4 in turn scored lower than physicians (5.2 ± 2.1 vs 6.5 ± 1.4, p<0.05). In multivariable analysis, GastroGPT authorship was an independent predictor of higher scores after adjustment for other clinical factors. Subgroup analysis demonstrated consistent GastroGPT performance across case complexity.
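As a hedged illustration of the adjusted analysis described above, a multivariable model could take a form like the ordinary least squares sketch below, regressing note score on note author while adjusting for covariates such as case complexity and subspecialty. The data frame, variable names, and model specification are hypothetical and show only the type of adjustment step, not the study's actual analysis.

```python
# Illustrative sketch: hypothetical data and covariates; the study's actual
# variables, model form, and software are not specified in the abstract.
import pandas as pd
import statsmodels.formula.api as smf

# One hypothetical row per evaluated note
df = pd.DataFrame({
    "score":        [8.1, 6.4, 7.9, 6.8, 8.3, 6.1, 7.7, 6.9],
    "author":       ["GastroGPT", "Attending"] * 4,
    "complexity":   [1, 1, 2, 2, 3, 3, 2, 1],  # ordinal case-complexity rating
    "subspecialty": ["luminal", "luminal", "hepatology", "hepatology",
                     "pancreaticobiliary", "pancreaticobiliary",
                     "luminal", "hepatology"],
})

# Does note author predict score after adjusting for other factors?
model = smf.ols("score ~ C(author) + complexity + C(subspecialty)", data=df).fit()
print(model.params)  # coefficient on C(author)[T.GastroGPT] is the adjusted effect
```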
Conclusion: The gastroenterology-specific AI model GastroGPT achieved superior performance to attending physician note-taking across all tasks, while the general-purpose AI model, ChatGPT-4, was inferior to both GastroGPT and physician notes.

Figure 1. Diagram illustrating the design and working principles of the gastroenterology-specific artificial intelligence large language model, GastroGPT.
Figure 2. Bar chart comparing the evaluation scores of GastroGPT, attending physicians, and ChatGPT-4.
* p < 0.05 vs ChatGPT-4
** p < 0.05 vs Attending Physician and ChatGPT-4