Introduction: Patients with early-onset colorectal cancer (EOCRC), defined as CRC diagnosed before age 50, have a significant risk of metachronous CRC after curative resection. Intensive endoscopic surveillance is often offered to these patients, but the practice has not been formally validated. To address the lack of prognostic biomarkers that predict disease-free survival (DFS) in EOCRC patients, we developed and validated a miRNA signature to identify patients at high risk of recurrence.
Methods: We performed miRNA expression profiling by high-throughput small-RNA sequencing in a discovery dataset of stage II-III EOCRC patients, and prioritized a panel of miRNAs differentially expressed between recurrent and non-recurrent cases using Cox-LASSO regression and AUC analysis. We then quantified the expression of these miRNAs by RT-qPCR in stage I-III EOCRC cases from two independent, ethnically diverse cohorts (European, n=88; Asian, n=69). We trained a machine learning algorithm (eXtreme Gradient Boosting, XGB) in the European cohort, optimizing for AUC and limiting overfitting by using a slow learning rate (ε=0.01), restricting the maximal branching to 4, and adopting a high pruning strategy (γ=4). The assay was then locked and independently validated in the Asian cohort.
Results: Univariate Cox regression analysis in the biomarker discovery phase yielded 10 miRNA candidates that were differentially expressed in stage II-III EOCRC patients with a DFS <5 years (Fig.1A). We then assessed the expression levels of the individual biomarkers in surgically resected specimens from both cohorts (Fig.1B). We trained an XGB machine learning model on the RT-qPCR results to identify patients with DFS <5 years, reaching an AUC of 0.96 in the training cohort (93.18% accuracy, 84% sensitivity, 96.83% specificity). The model was subsequently validated in the independent validation cohort (79.17% accuracy, 71.43% sensitivity, and 80.65% specificity, Fig.1C).
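For readers less familiar with these metrics, the sketch below shows how accuracy, sensitivity, and specificity follow from a confusion matrix. The counts are hypothetical: the abstract does not report the confusion matrix, and these values were chosen only because they sum to 88 and are arithmetically consistent with the training-cohort percentages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # high-risk call, recurrence within 5 years
    fn: int  # low-risk call, recurrence within 5 years
    tn: int  # low-risk call, no recurrence within 5 years
    fp: int  # high-risk call, no recurrence within 5 years


def sensitivity_pct(c: ConfusionCounts) -> float:
    """Share of true recurrences that were labeled high-risk."""
    return 100 * c.tp / (c.tp + c.fn)


def specificity_pct(c: ConfusionCounts) -> float:
    """Share of non-recurrences that were labeled low-risk."""
    return 100 * c.tn / (c.tn + c.fp)


def accuracy_pct(c: ConfusionCounts) -> float:
    """Share of all patients that were labeled correctly."""
    return 100 * (c.tp + c.tn) / (c.tp + c.fn + c.tn + c.fp)


# Hypothetical counts (NOT from the study), consistent with n=88.
example = ConfusionCounts(tp=21, fn=4, tn=61, fp=2)

print(f"sensitivity: {sensitivity_pct(example):.2f}%")
print(f"specificity: {specificity_pct(example):.2f}%")
print(f"accuracy:    {accuracy_pct(example):.2f}%")
```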
Patients labeled as low-risk had a significantly longer median DFS than those labeled as high-risk in the training (mDFS = 14.6 vs. 87.2 months for high- vs. low-risk, p<0.0001), validation (mDFS = 60.1 vs. 60.8 months, p=0.047), and full analysis set cohorts (mDFS = 21.7 vs. 69.5 months, p<0.0001, Fig.1D). The cumulative hazard of disease recurrence in the five years after surgery was significantly higher for patients labeled as high-risk (Fig.1E).
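Median DFS values such as those above are typically read off a Kaplan-Meier curve as the first time point at which the estimated survival probability falls to 0.5 or below. The following is a minimal pure-Python sketch of that calculation on toy data; the follow-up times and event flags are invented for illustration and are not taken from the study.

```python
from typing import List, Optional, Tuple


def kaplan_meier(times: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct event time.

    events[i] is 1 for a recurrence at times[i] and 0 for censoring.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all subjects that share the same time point.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve


def median_dfs(curve: List[Tuple[float, float]]) -> Optional[float]:
    """First time at which survival is <= 0.5, or None if the median is not reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Toy follow-up data in months (hypothetical, NOT study data).
high_risk = kaplan_meier([6, 10, 14, 20, 60], [1, 1, 1, 1, 0])
low_risk = kaplan_meier([12, 40, 60, 60, 60], [1, 0, 0, 0, 0])

print("high-risk median DFS:", median_dfs(high_risk))
print("low-risk median DFS:", median_dfs(low_risk))
```

When fewer than half of a group has recurred by the end of follow-up, the median is not reached and the function returns `None`, which is one reason medians in low-risk groups can cluster near the maximum follow-up time.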
Conclusion: We trained and validated a novel miRNA-based signature, powered by machine learning, that predicts 5-year DFS after curative-intent surgery for EOCRC. This signature offers an alternative to overly intensive endoscopic surveillance.

Figure 1. (A) Heat map showing the differential expression of the 10 candidate miRNAs from small RNA sequencing-based discovery in patients with recurrent and non-recurrent EOCRC; (B) Ridge-line plot of ΔCT values from reverse transcription quantitative polymerase chain reaction (RT-qPCR) in patients with recurrent and non-recurrent EOCRC; (C) ROC performance characteristics of the XGB model; (D) Kaplan-Meier curves of disease-free survival probabilities after curative-intent surgery for EOCRC in patients labeled as high- and low-risk; (E) Cumulative hazard of disease recurrence in patients labeled as high- and low-risk.