Society: AGA
Background: Tools that predict incident esophageal adenocarcinoma (EAC) and gastric cardia adenocarcinoma (GCA) and that can be automated in electronic health records (EHRs) to guide screening decisions are needed. However, EHRs often have missing data for smoking, body mass index (BMI), and gastroesophageal reflux disease (GERD), which are important factors used by currently validated tools (Trøndelag Health [HUNT] and Kunzmann). We aimed to accurately predict EAC/GCA even with missing data using machine learning.
Methods: We performed retrospective analyses in the Veterans Health Administration (VHA) Corporate Data Warehouse among Veterans with ≥1 encounter between 2005 and 2018. Cases diagnosed with EAC/GCA were identified in the VHA Central Cancer Registry. The index date was the date of diagnosis for cases and a randomly selected date for controls. We collected prescriptions, laboratory results, and International Classification of Diseases diagnoses recorded 1 to 5 years before the index date. We randomly divided the cohort into training (50%), preliminary validation (25%), and testing (25%) sets. In the preliminary validation set, simple random sampling imputation and extreme gradient boosting machine learning were most accurate. In the test set, we compared the final model, the Kettles Esophageal and Cardia Adenocarcinoma predictioN (K-ECAN) Tool, to HUNT, Kunzmann, and published guidelines. To simulate a non-VHA population, we randomly under-sampled males. We ranked variables by their proportion of the total gain in the loss function and by their mean Shapley Additive Explanations (SHAP) values.
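As an illustration of two steps named in the Methods, the following is a minimal sketch of simple random sampling imputation and a 50/25/25 random split. It is not the authors' implementation: the function names, the toy BMI values, and the seed are all hypothetical, and it assumes missing values are drawn from the observed values of the same variable.

```python
import random


def random_sample_impute(values, rng):
    """Fill each missing value (None) with a random draw from the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to sample from")
    return [v if v is not None else rng.choice(observed) for v in values]


def split_indices(n, rng, fractions=(0.50, 0.25, 0.25)):
    """Shuffle row indices and cut them into training, validation, and testing sets."""
    idx = list(range(n))
    rng.shuffle(idx)
    n_train = int(n * fractions[0])
    n_valid = int(n * fractions[1])
    return idx[:n_train], idx[n_train:n_train + n_valid], idx[n_train + n_valid:]


# Hypothetical toy data: BMI with three missing entries.
rng = random.Random(0)
bmi = [29.3, None, 31.0, 27.5, None, 24.8, 33.1, None]
bmi_imputed = random_sample_impute(bmi, rng)
train, valid, test = split_indices(len(bmi), rng)
```

Drawing from the observed values preserves the marginal distribution of each variable, which is one plausible reason such a simple method can perform well with a tree-based learner. In a real pipeline the donor values would normally come from the training set only.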
Results: We identified 8,430 cases of EAC, 2,965 of GCA, and 10,256,887 controls. The mean age was 59.6 years, 92% were male, 80% were white, and the mean BMI was 29.3 kg/m². In the test set, K-ECAN was well calibrated (Figure 1) and had better discrimination (area under the receiver operating characteristic curve [AUC] = 0.77) than HUNT (AUC = 0.68), Kunzmann (AUC = 0.64), or guidelines (Figure 2). Using only data from 4 to 5 years prior to index slightly diminished its accuracy (AUC = 0.75). When men were under-sampled to simulate a non-VHA population, the AUCs of HUNT and Kunzmann improved, but K-ECAN remained the most accurate (AUC = 0.85, Figure 2). The most important variables influencing K-ECAN included 4 known risk factors (age, race, sex, BMI) and 9 novel factors (COPD, greater Hct, lower HDL, greater LDL, lower serum CO2, lower Na, lower BUN, lower ALT, and greater WBC). Although GERD was strongly associated with EAC, it contributed only a small proportion of the gain in information.
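For readers less familiar with the discrimination metric reported above, the AUC can be read as the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control. The sketch below computes it by direct pairwise comparison; the function name and the toy scores are hypothetical and are not data from this study.

```python
def auc(case_scores, control_scores):
    """Fraction of case/control pairs in which the case scores higher (ties count 0.5)."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical predicted risks for 3 cases and 4 controls.
cases = [0.9, 0.7, 0.4]
controls = [0.8, 0.3, 0.2, 0.1]
toy_auc = auc(cases, controls)  # 10 of 12 pairs are ranked correctly
```

The pairwise loop is quadratic in cohort size, so at the scale of this cohort a rank-based formulation would be needed; the sketch is meant only to make the definition concrete.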
Conclusions: We developed and internally validated a novel prediction tool for incident EAC/GCA using EHR data. K-ECAN identifies individuals who are at increased risk for EAC/GCA ≥3 years in advance and is more accurate than published guidelines. Further work is needed to validate K-ECAN outside VHA and to assess how best to implement it within EHRs.

Figure 1. Calibration Plot
Predicted and observed risks are the cumulative incidences per 100,000 individuals in the testing set over the 14 years of ascertainment. Each dot represents 2% of the testing set (51,398 individuals).
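A plot of this kind can be approximated by sorting individuals by predicted risk, cutting them into 50 equal-sized groups (2% each), and comparing the mean predicted risk with the observed event proportion in each group. The sketch below assumes a simple observed proportion rather than the cumulative incidence estimate used in the figure, and its function name and toy inputs are hypothetical.

```python
def calibration_points(predicted, observed, n_bins=50):
    """Return (mean predicted, observed) risk per 100,000 for equal-sized risk groups."""
    pairs = sorted(zip(predicted, observed))  # order individuals by predicted risk
    n = len(pairs)
    points = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # more bins than individuals
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(o for _, o in chunk) / len(chunk)
        points.append((mean_pred * 100_000, obs_rate * 100_000))
    return points


# Hypothetical toy example with 4 individuals and 2 bins.
toy_points = calibration_points(
    predicted=[0.001, 0.002, 0.003, 0.004],
    observed=[0, 0, 0, 1],
    n_bins=2,
)
```

A well-calibrated model yields points close to the diagonal, where predicted and observed risks agree.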
![Figure 2. Receiver Operating Characteristic Curves](https://assets.prod.dp.digitellcdn.com/api/services/imgopt/fmt_webp/akamai-opus-nc-public.digitellcdn.com/uploads/ddw/abstracts/3852054_File000002.jpg.webp)
Figure 2. Receiver Operating Characteristic Curves.
Panel A: Entire Testing Set. AUCs [95% CIs] are displayed in parentheses.
Panel B: Simulated non-VHA Population
Because Kunzmann and HUNT were both developed in populations that were roughly 50% male and both rely heavily on sex for classifying risk, this analysis simulated a non-VHA population by under-sampling men. It used all available female controls and female cancer cases, a random selection of male controls equal in number to the female controls, and a random sample of male cancer cases sized to match the expected odds ratio for male sex of 8.33 in the US population.
ACG: American College of Gastroenterology, ACP: American College of Physicians, AGA: American Gastroenterological Association, ASGE: American Society for Gastrointestinal Endoscopy, BSG: British Society of Gastroenterology, ESGE: European Society for Gastrointestinal Endoscopy, HUNT: Trøndelag Health, VHA: Veterans Health Administration