Background
Early identification of overt signs of gastrointestinal bleeding (GIB), namely melena, hematochezia, and hematemesis, in hospitalized patients may enable expedited evaluation for inpatient endoscopy. Overt signs of GIB are often documented in nursing notes as plain-language descriptions of stool or emesis (e.g., "black stool" rather than "melena"), which are more challenging to identify than medical terms. Moreover, the language used to refer to stool or emesis varies considerably (e.g., "stool" vs. "bowel movement"). Large language models (LLMs) have been shown to outperform traditional methods on natural language processing (NLP) tasks using data from the electronic health record (EHR). We present a hybrid LLM-based pipeline for automated detection of GIB symptoms in the EHR notes of hospitalized patients.
Methods
Algorithms were developed and evaluated with 17,712 nursing notes from 1,114 patients who presented with acute GIB and underwent endoscopy from 2014 to 2023 at an academic medical center. Data were extracted from nursing notes written between the date of initial endoscopy and discharge. Gold standard labels for the presence of melena, hematochezia, and/or hematemesis in each note were derived by manual review. We compared a hybrid pipeline combining regular expressions and LLM (Llama 2) prompt engineering for named entity recognition with three alternatives: a traditional NLP algorithm based on regular expressions; an LLM-based (Clinical-Longformer) algorithm trained for document classification; and an LLM (Llama 2) prompt-engineering algorithm for named entity recognition alone. Training and hyperparameter tuning were performed with nested five-fold cross-validation. We performed all analyses with local LLMs on a HIPAA-compliant secure server. Model performance was reported as positive predictive value (PPV), negative predictive value (NPV), sensitivity, specificity, and F1 score (the harmonic mean of PPV and sensitivity, used as a metric of machine learning model performance; scale of 0-1 with 0 = worst and 1 = best).
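To make the idea of a hybrid pipeline concrete, the following is a minimal sketch, not the study's implementation. It assumes a two-stage design in which regular expressions first flag candidate sentences and a second step then confirms or rejects each candidate; the abstract does not specify how the two components are combined, so this ordering is an assumption. The patterns, the function names (find_candidates, hybrid_label, toy_confirm), and the negation rule are all hypothetical. In particular, toy_confirm is only a stand-in for a prompted local LLM call.

```python
import re
from typing import Callable, Dict, List

# Hypothetical, deliberately small patterns for illustration only;
# these are NOT the regular expressions used in the study.
CANDIDATE_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "melena": re.compile(
        r"\b(melena|(black|tarry)\s+(stool|bowel movement|bm))\b", re.I),
    "hematochezia": re.compile(
        r"\b(hematochezia|(bloody|maroon)\s+(stool|bowel movement|bm))\b", re.I),
    "hematemesis": re.compile(
        r"\b(hematemesis|(bloody|coffee[- ]ground)\s+(emesis|vomit))\b", re.I),
}


def find_candidates(note: str) -> Dict[str, List[str]]:
    """Stage 1 (assumed): regex screen returning sentences that mention each symptom."""
    sentences = re.split(r"(?<=[.!?])\s+", note.strip())
    hits: Dict[str, List[str]] = {}
    for symptom, pattern in CANDIDATE_PATTERNS.items():
        matched = [s for s in sentences if pattern.search(s)]
        if matched:
            hits[symptom] = matched
    return hits


def hybrid_label(note: str, confirm: Callable[[str, str], bool]) -> Dict[str, bool]:
    """Stage 2 (assumed): keep a symptom only if `confirm` accepts a candidate sentence.

    `confirm(symptom, sentence)` is where a prompted LLM would be called.
    """
    candidates = find_candidates(note)
    return {
        symptom: any(confirm(symptom, s) for s in candidates.get(symptom, []))
        for symptom in CANDIDATE_PATTERNS
    }


def toy_confirm(symptom: str, sentence: str) -> bool:
    """Stand-in for the LLM step: rejects sentences containing a simple negation cue."""
    return re.search(r"\b(no|denies|without)\b", sentence, re.I) is None


if __name__ == "__main__":
    note = "Pt had large black stool overnight. Denies bloody emesis."
    print(hybrid_label(note, toy_confirm))
```

Under this assumed design, the regex stage favors sensitivity and the confirmation stage removes false positives such as negated mentions, which is one plausible way a hybrid could trade a little sensitivity for higher PPV.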
Results
Compared with the traditional NLP algorithm, the hybrid LLM-based pipeline had higher PPV at the cost of slightly lower sensitivity for melena (PPV = 0.972 vs. 0.625; sensitivity = 0.900 vs. 0.981) and hematemesis (PPV = 0.859 vs. 0.484; sensitivity = 0.932 vs. 0.967). For hematochezia, it had both higher PPV (0.900 vs. 0.764) and higher sensitivity (0.908 vs. 0.893) (Figure 1). The hybrid model achieved the highest F1 scores of all models for melena (0.934), hematochezia (0.904), and hematemesis (0.894).
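As a consistency check, the F1 scores reported for the hybrid model can be recomputed from its PPV and sensitivity, since F1 is the harmonic mean of the two. The short sketch below uses only the values stated in the Results; the function and variable names are illustrative.

```python
def f1_score(ppv: float, sensitivity: float) -> float:
    """Harmonic mean of PPV (precision) and sensitivity (recall)."""
    if ppv + sensitivity == 0:
        return 0.0
    return 2 * ppv * sensitivity / (ppv + sensitivity)


# (PPV, sensitivity, reported F1) for the hybrid pipeline, taken from the Results.
hybrid_results = {
    "melena": (0.972, 0.900, 0.934),
    "hematochezia": (0.900, 0.908, 0.904),
    "hematemesis": (0.859, 0.932, 0.894),
}

if __name__ == "__main__":
    for symptom, (ppv, sens, reported) in hybrid_results.items():
        computed = f1_score(ppv, sens)
        print(f"{symptom}: computed F1 = {computed:.4f}, reported F1 = {reported:.3f}")
```

The recomputed values agree with the reported F1 scores to within about 0.001, a difference attributable to the inputs being rounded to three decimal places.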
Conclusion
A hybrid LLM-based pipeline can accurately and efficiently identify overt symptoms of GIB in clinical note text from EHRs. Such a pipeline could enable real-time monitoring for symptoms of new or recurrent GIB during a patient's hospital course, providing opportunities for early endoscopic evaluation.

Figure 1. Performance of the traditional natural language processing (NLP) algorithm, the large language model (LLM)-based algorithms, and the hybrid algorithm combining traditional NLP and LLMs. The best-performing model for each metric (row) is shown in bold.