Background:
The lack of an accurate prognostic tool for acute pancreatitis (AP) remains a critical knowledge gap. Machine learning (ML) techniques have been used to develop high-performing prognostic models in AP, but their methodologic quality has received little attention. High-quality reporting and study methodology are critical to model validity, reproducibility, generalizability, and clinical implementation. In collaboration with content experts in ML methodology, we performed a systematic review critically appraising ML-based prognostic models in AP.
Methods:
Using a validated search strategy, we identified non-regression ML prognostic studies in AP published between January 2021 and June 2023 in the MEDLINE, PubMed, and EMBASE databases. Eligible studies were those that developed or validated new or existing ML models in patients with AP. We used the well-established Prediction Model Risk of Bias Assessment Tool (PROBAST) to assess risk of bias (ROB) in four domains: participants, predictors, outcomes, and statistical analysis. Quality of reporting was assessed against the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis – Artificial Intelligence (TRIPOD-AI) statement, which specifies 27 items to be reported in ML prognostic model studies.
Results:
We identified 4240 studies, of which 27 met eligibility criteria. The most commonly predicted outcomes were AP severity (40.7%) and mortality (22.2%). Studies originated from China (19), the U.S. (4), Hungary (2), Turkey (1), and New Zealand (1). All studies developed a new ML model (i.e., none externally validated an existing ML model). The mean area under the curve across all models was 0.9 (SD 0.08), but ROB was high in at least one domain in every study (Figure 1). In the statistical analysis domain, 89% of studies were at high ROB; notably, steps to minimize over-optimistic model performance were rarely taken (not taken in 63% of studies). Studies reported only 55.6% of the 27 TRIPOD-AI items, with notable deficiencies in sample size justification (74.1%), data quality assessment (40.7%), and model updating techniques (66.7%). Model implementation considerations were frequently omitted, including human-AI interaction (88.9%), handling of low-quality or incomplete data (88.9%), and integration of models into the care pathway (55.6%). Additionally, reporting of source data (63%), analytical code (92.6%), and study protocols (81.5%) was lacking.
Conclusion:
Despite an expansion of newly developed ML prognostic models in AP, important limitations in study design, reporting, and open science practices undermine these models' validity, reproducibility, and generalizability. Multifaceted, interdisciplinary efforts involving content experts in both AP and ML methodology are needed to improve the rigor of studies developing ML prognostic models in AP.
