Background: Patient-reported outcomes (PROs) are a core measure of disease activity in clinical studies of inflammatory bowel disease (IBD). While PROs are commonly captured in structured form in prospective trials, they are typically recorded as free text in electronic health records (EHRs), significantly limiting their utility for research and quality improvement. This contributes to missing data in IBD registries or to the exclusion of unstructured information in favor of analysis-ready data. Computational methods for extracting information from free text have recently undergone dramatic changes, particularly following the release of GPT-4, OpenAI’s latest large language model (LLM).
Aims: To develop, evaluate, and compare natural language processing (NLP) methods for extracting three IBD PROs from clinical notes: abdominal pain, diarrhea, and fecal blood.
Methods: We queried a deidentified EHR database at the University of California, San Francisco (UCSF) to establish a corpus of IBD clinic notes for model training and evaluation (Table 1). Two physicians independently annotated 1,050 notes for the presence or absence of each PRO using a predefined protocol. Of these, 900 notes (85%) were used to train custom extraction algorithms that combine rule-based approaches (regular expressions), NLP libraries (SciSpacy), and supervised learning models to predict PROs in each note. The remaining notes (15%) were used for internal testing and model selection. The top-performing model was externally tested on notes from Stanford University. As a third comparison, we evaluated the model against the zero-shot performance (i.e., without task-specific training data) of GPT-4 on the UCSF test set.
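To illustrate the general shape of such a custom extraction pipeline (not the study's exact implementation: the regex pattern, feature choices, and hyperparameters below are illustrative assumptions), a minimal sketch in Python might look like:

```python
# Minimal sketch of a PRO-extraction pipeline combining a regex flag,
# SciSpacy preprocessing, and an XGBoost classifier (illustrative only).
import re

import spacy                      # assumes the SciSpacy model en_core_sci_sm is installed
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# Rule-based (regex) flag for one PRO, e.g. abdominal pain.
PAIN_PATTERN = re.compile(r"\babdominal (pain|discomfort|cramping)\b", re.IGNORECASE)

nlp = spacy.load("en_core_sci_sm")  # SciSpacy biomedical tokenizer/tagger

def preprocess(note: str) -> str:
    """Lemmatize the note with SciSpacy and append a regex-derived keyword flag."""
    doc = nlp(note)
    lemmas = " ".join(tok.lemma_.lower() for tok in doc if not tok.is_punct)
    flag = " REGEX_PAIN_HIT" if PAIN_PATTERN.search(note) else ""
    return lemmas + flag

# Supervised classifier predicting presence (1) / absence (0) of the PRO.
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("xgb", XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss")),
])

# notes, labels would come from the annotated training set (n=900):
# model.fit([preprocess(n) for n in notes], labels)
# preds = model.predict([preprocess(n) for n in test_notes])
```

One such binary classifier would be trained per PRO (abdominal pain, diarrhea, fecal blood), with the held-out notes used to select the best-performing variant.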
Results: Inter-rater reliability between annotators was >90%. The top-performing UCSF models were XGBoost classifiers, with accuracies of 92% (abdominal pain), 82% (diarrhea), and 80% (fecal blood) on the internal test set (n=50). On external validation at Stanford (n=250), all model accuracies fell to 61-62%, with better performance identifying the absence than the presence of PROs (Table 2). Zero-shot GPT-4 achieved accuracies of 91% (abdominal pain), 90% (diarrhea), and 88% (fecal blood) on the same UCSF test set.
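For reference, a zero-shot query of this kind could be issued through the OpenAI chat API roughly as follows (the prompt wording and output format are illustrative assumptions, not the study's actual prompt):

```python
# Minimal sketch of zero-shot PRO extraction with GPT-4 (illustrative prompt).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "You are reviewing an IBD clinic note. For each symptom -- abdominal pain, "
    "diarrhea, fecal blood -- answer 'present' or 'absent'. Return JSON with "
    "keys abdominal_pain, diarrhea, fecal_blood.\n\nNote:\n{note}"
)

def extract_pros(note: str) -> str:
    """Ask GPT-4 to classify the three PROs in a single note, with no task-specific training."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # favor deterministic output for classification
        messages=[{"role": "user", "content": PROMPT.format(note=note)}],
    )
    return response.choices[0].message.content
```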
Conclusions: Traditional NLP models achieve high institution-specific accuracy, and our open-source code provides a framework for building such custom tools. However, variation in documentation styles across institutions likely limits their generalizability. In contrast, GPT-4 outperforms our custom NLP models despite no prior training on institutional data. GPT-4 shows resilience to domain shift, with outstanding performance on a test set that spanned 10 years and included multiple authors, writing styles, note templates, and changes in IBD management guidelines. LLMs like GPT-4 represent an exciting opportunity to leverage unstructured EHR data for future IBD outcomes research.

