Society: AGA
Background & Aim:
Application-specific data for training deep learning algorithms in medical imaging are often scarce. Deep learning systems are therefore generally pretrained on large, publicly available, labeled data sets of general imagery unrelated to the envisioned application, so that the algorithm learns basic features from these widely available data before refinement training on the scarce application-specific images. Pretraining might be more effective if the pretraining images resemble the envisioned application, i.e., domain-specific pretraining. We investigated whether pretraining on general endoscopic imagery results in better performance of five existing AI systems for gastrointestinal (GI) endoscopy, compared to current state-of-the-art pretraining approaches (i.e., supervised pretraining with ImageNet and semi-weakly supervised pretraining with the Billion-scale data set).
Methods:
Our group has created an endoscopy-specific dataset called GastroNet for pretraining deep learning systems in endoscopy. GastroNet consists of 5,084,494 endoscopic images retrospectively collected between 2012 and 2020 in seven Dutch hospitals. We created four pretrained models: one using GastroNet and three using ImageNet and/or the Billion-scale data set. The pretraining method was either supervised, self-supervised, or semi-weakly supervised. The pretrained models were subsequently trained towards five independent, commonly used applications in GI endoscopy, using their original application-specific datasets. Outcome parameters were: 1) classification and/or localization performance of the five trained applications; 2) change in performance when the amount of available application-specific training data was reduced, to investigate a possible difference in performance drop between the pretrained models. The different combinations of pretraining data and methods, test sets, and downstream tasks are visualized in Figure 1.
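For illustration, a minimal sketch of the refinement-training (fine-tuning) step, assuming a PyTorch/torchvision ResNet-50 backbone; the checkpoint filename, the two-class task head, and the training settings are placeholders rather than the authors' actual configuration.

```python
# Sketch of fine-tuning a pretrained backbone on a (scarce) application-specific
# endoscopy dataset. "gastronet_pretrained.pth" and the 2-class head are
# illustrative assumptions, not the study's implementation.
import torch
import torch.nn as nn
from torchvision import models

def build_finetune_model(checkpoint_path: str, num_classes: int = 2) -> nn.Module:
    backbone = models.resnet50(weights=None)                # architecture only
    state = torch.load(checkpoint_path, map_location="cpu")
    backbone.load_state_dict(state, strict=False)           # pretrained weights
    backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)  # new task head
    return backbone

model = build_finetune_model("gastronet_pretrained.pth", num_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
# ...a standard training loop over the application-specific images follows...
```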
Results:
Overall, the domain-specific pretrained model resulted in statistically superior performance for the five different GI applications. More detailed results are presented in Table 1. The superiority was also reflected in a smaller drop in performance when the amount of application-specific training data was artificially reduced.
Conclusion:
Domain-specific pretraining, using unlabeled general endoscopic images, is superior to current state-of-the-art pretraining approaches for developing deep learning algorithms in GI endoscopy. It also allows more effective use of the generally scarce application-specific endoscopy images. These findings might cause a paradigm shift in the development of AI systems in endoscopy.

Figure 1. Flow diagram of different pretraining methods, data sets and downstream tasks.
Table 1. Overview of performance of the five different application-specific data sets using four different pretrained models. Cells highlighted in green represent the highest scoring pretrained model per application-specific data set.
Introduction
With recent successful applications of computer vision in gastroenterology and endoscopy, there has been strong interest among physicians to develop practical skills in artificial intelligence. Automated Machine Learning (AutoML) platforms may increase access to complex deep learning algorithms that may otherwise be inaccessible and allow physicians to build complex models for a variety of use-cases simply by providing labeled data.
We focused on three commonly used AutoML platforms created by Microsoft, Amazon, and Google that market their ability to create image classification and object detection models. Using labeled data from the publicly available SUN[1] colonoscopy data set, we developed computer-aided diagnosis (CADx) and computer-aided detection (CADe) models on all three AutoML platforms.
Methods
The dataset used to evaluate model performance is the SUN (Showa University and Nagoya University) Colonoscopy Video Database. To create the models, the data were uploaded to the respective platforms and the annotation files were parsed into a format readable by each platform. The dataset was split 70/10/20 for training, validation, and testing. The CADx models were evaluated with sensitivity, specificity, PPV, NPV, F1 score, AUROC, accuracy, precision, and recall. CADe models were evaluated using precision, recall, and F1 score. We used analysis of variance (ANOVA) testing with an alpha of 0.05 to determine whether CADx model performance differed across platforms.
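For illustration, a minimal sketch of the evaluation step, assuming each platform's per-image test-set predictions have been exported to arrays of binary labels and scores; the function names, the bootstrap grouping used to feed the ANOVA, and the input format are our assumptions, not the platforms' APIs.

```python
# Sketch: compute CADx metrics per platform, then compare platforms with
# one-way ANOVA over bootstrap-resampled F1 scores (alpha = 0.05).
import numpy as np
from scipy import stats
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

def cadx_metrics(y_true, y_pred, y_score):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "f1": f1_score(y_true, y_pred),
        "auroc": roc_auc_score(y_true, y_score),
    }

def bootstrap_f1(y_true, y_pred, n_boot=100, seed=0):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_boot):
        idx = rng.choice(len(y_true), size=len(y_true), replace=True)
        scores.append(f1_score(y_true[idx], y_pred[idx]))
    return scores

# f_stat, p_value = stats.f_oneway(f1_microsoft, f1_google, f1_amazon)
```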
Results
The sensitivity of the three CADx models was 0.9996, 0.9801, and 0.9770 for Microsoft, Google, and Amazon, respectively. The specificity was 0.9993, 0.9665, and 0.9633. There was a statistically significant difference in the performance of the three CADx models: the F1 scores of the models built using the Microsoft, Google, and Amazon platforms were 0.9996, 0.9800, and 0.9768, respectively (P=0.0044). The F1 scores for the CADe models made by the Microsoft, Google, and Amazon platforms (using an IoU threshold of 0.5) were 0.9929, 0.9650, and 0.8980, respectively.
Conclusions
Using minimal coding, we were able to create three algorithms, all of which achieved high F1 scores (> 0.9) on CADe and CADx use-cases. There was a statistically significant difference in the F1 scores of the models created by the AutoML platforms. Further analysis on larger datasets and on different landmarks is needed to determine whether the Microsoft AutoML platform consistently performs best on all endoscopic computer vision tasks. AutoML platforms represent a practical entry point for endoscopists interested in exploring computer vision for GI endoscopy and may be an important catalyst for physician-driven innovation.
[1] Itoh H, Misawa M, Mori Y, Oda M, Kudo S-E, Mori K. SUN Colonoscopy Video Database. 2020. http://amed8k.sundatabase.org/

Table 1: CADx Model Performance on Testing Dataset
Background: Hepatic fibrosis is a pathological consequence of chronic liver injury and culminates in cirrhosis. Various studies in humans and animal models have been conducted to establish fibrosis regression after removal of the underlying liver injury and with novel therapeutic agents. However, there is a paucity of methods to evaluate and predict outcomes post-treatment. Transient elastography has been used for treatment monitoring, but it is unreliable for indicating regression of liver fibrosis. Liver biopsy remains the “gold standard” for scoring cirrhosis of the liver by assessing the stages of fibrosis. However, intra- and inter-observer variability must be taken into account in the interpretation of liver biopsies. Computational techniques using digital pathology images can overcome these limitations and augment the interpretation of liver biopsies, which can also be used for treatment monitoring.
Aim: The purpose of this work was to develop a deep-learning (DL) model to classify healthy and fibrotic liver histopathology images and subsequently predict treatment outcomes of liver fibrosis in mouse models.
Methods: A liver dataset of mouse models consisting of 201 pathology images (110 healthy and 91 fibrotic) was used to develop a DL model based on the VGG16 architecture to distinguish between fibrotic and healthy liver sections. Of the 201 images, the training set included 160 images, the test set 21 images, and the validation set 20 images. Ten-fold cross-validation was used to validate the model’s performance. The model was then used to predict healthy, fibrotic, and treated mice using the AI-score generated from training, on a distinct set of 75 images (healthy = 25; fibrotic = 30; treated = 20). Accuracy, precision, recall, F1 score, and AUC for the DL model, and the AI-score based on the percentage accuracy of the model’s predictions on this distinct test set, are reported.
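A minimal sketch of a VGG16-based transfer-learning classifier of the kind described above, assuming Keras/TensorFlow; the input size, dense-layer width, and training settings are illustrative assumptions rather than the study's exact configuration.

```python
# Sketch of a healthy-vs-fibrotic classifier built on a frozen VGG16 backbone.
# Hyperparameters and layer sizes are illustrative only.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                      # freeze convolutional features

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),  # healthy (0) vs fibrotic (1)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])
# model.fit(train_images, train_labels,
#           validation_data=(val_images, val_labels), epochs=20, batch_size=16)
```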
Results: The DL model classifying healthy and fibrotic liver sections had an accuracy of 95%, precision of 96%, F1 score of 95%, and recall of 95%, with an AUC of 0.95, demonstrating strong model performance. In the mixed dataset, AI-scores of 1.0 for healthy images, 0.67 for fibrotic images, and 0.85 for treated histopathology images were achieved. These prediction scores suggest the possibility of using an AI-assisted digital pathology system for treatment monitoring that complements biopsy analysis. Fig. 1 shows representative sample images of healthy (A), fibrotic (B), and treated (C) liver, respectively.
Conclusion: We developed a DL model capable of accurately classifying healthy and fibrotic histopathology images and generating an AI-score to predict treatment outcomes for liver fibrosis.


BACKGROUND & AIMS: Acute upper gastrointestinal bleeding (UGIB) remains a common problem in clinical practice with a wide range of severity. For high-risk rebleeding lesions, endoscopic intervention is the mainstay treatment to control and prevent rebleeding. Prediction of the need for endoscopic intervention would be beneficial in resource-limited areas for selective referral of patients from primary healthcare centers to endoscopy centers. Previously proposed risk stratification scores have limited accuracy and are difficult to remember. We developed a machine learning model to predict the need for endoscopic intervention in patients with acute UGIB.
METHODS: Data from patients with acute UGIB were collected prospectively and retrospectively from 2011 to 2020 (n = 1,389). Eighty percent of the patient dataset was used as a training set in Python to derive 16 supervised machine-learning models to identify patients who need endoscopic hemostasis. These interventions included treatments for both variceal and non-variceal lesions. The input data comprised demographic characteristics, clinical presentation, and laboratory parameters. The performance of each model was compared and internally validated with the remaining 20% of the patient dataset using area under the receiver operating characteristic curve (AUC) analysis.
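For illustration, a minimal sketch of the 80/20 split and one of the candidate models (the linear discriminant analysis classifier, which ultimately performed best), assuming scikit-learn and a tabular feature matrix of the clinical parameters; the file name, column name, and preprocessing are placeholders, not the study's pipeline.

```python
# Sketch: 80/20 train/validation split, linear discriminant analysis model,
# and AUC evaluation. "ugib_cohort.csv" and the label column are hypothetical.
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

df = pd.read_csv("ugib_cohort.csv")                     # clinical + lab parameters
y = df["needs_endoscopic_hemostasis"]                   # binary outcome label
X = df.drop(columns=["needs_endoscopic_hemostasis"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

auc_train = roc_auc_score(y_train, lda.predict_proba(X_train)[:, 1])
auc_valid = roc_auc_score(y_test, lda.predict_proba(X_test)[:, 1])
print(f"AUC training = {auc_train:.3f}, AUC validation = {auc_valid:.3f}")
```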
RESULTS: Of 1,389 patients, 615 (44.3%) of the cohort received endoscopic intervention (293 variceal bleeding interventions and 336 non-variceal bleeding interventions). Patients’ characteristics, clinical presentation, and laboratory findings are presented in Table 1. After analysis for significant factors, 21 parameters, which included age, sex, presence of cirrhosis, active malignancy, anti-thrombotic drug use, previous history of UGIB, vomitus and stool characteristics (red emesis, coffee-ground emesis, melena, maroon stool, red stool), duration of UGIB before hospitalization, resuscitation requirement, systolic blood pressure, heart rate, hemoglobin level, platelet count, serum albumin, blood urea nitrogen, creatinine, and coagulogram, were used to derive the machine learning models. Among the 16 developed models, the most accurate model for predicting the need for endoscopic intervention was the linear discriminant analysis model (Figure 1). The performance of this model was moderate, with a sensitivity of 57.5% and specificity of 80.1% in the training set and a sensitivity of 57.1% and specificity of 86.9% in the validation set. The AUCs of the training set and validation set were 0.744 and 0.790, respectively.
CONCLUSIONS: Our machine-learning model that identifies acute UGIB patients who need hemostatic endoscopic intervention showed fair accuracy. Further development and identification of more specific parameters might improve the model’s performance.

Table 1. Baseline characteristics and clinical presentation of the cohorts
Figure 1. Parameters for the linear discriminant analysis model and an example of the result
Background and aims: EUS-guided needle-based confocal laser endomicroscopy (EUS-nCLE) can differentiate high-grade dysplasia/adenocarcinoma (HGD-Ca) in branch duct intraductal papillary mucinous neoplasms (BD-IPMNs) but requires manual interpretation and is prone to subjectivity. We have derived predictive artificial intelligence (AI) algorithms to facilitate nCLE-guided risk stratification of BD-IPMNs using manually pre-edited, high-yield nCLE videos. We sought to develop an nCLE-AI algorithm to automatically edit a full-length, unedited nCLE video and risk-stratify BD-IPMNs.
Methods: Patients with a reference histopathological diagnosis of BD-IPMNs were enrolled from two prospective studies evaluating EUS-nCLE: (1) INDEX, a single-center study (n=145; 2015-2018; NCT02516488), and (2) CLIMB, an ongoing multi-center study (2018-present, n=183; NCT03492151). We designed two CAD convolutional neural network (CNN) algorithms: (1) one to edit a full-length nCLE video into a high-yield enriched video containing pathognomonic papillary structures, and (2) a CNN-AI system to automatically extract nCLE features for risk stratification of BD-IPMNs (Figure 1). The ability of the AI algorithm to detect HGD-Ca was compared to the reference standard of histopathology, and diagnostic parameters were calculated. Diagnostic parameters for the revised 2017 International Consensus Guidelines (ICG) were also computed.
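For illustration, a minimal sketch of the automated "editing" idea: a frame-level CNN scores each nCLE frame for papillary structures and high-scoring frames are retained as the enriched video. OpenCV/PyTorch are assumed; the model, threshold, and preprocessing are illustrative and do not represent the authors' pipeline.

```python
# Sketch: keep only high-yield nCLE frames (e.g., likely papillary structures)
# for downstream risk stratification. Resizing/normalization omitted for brevity;
# frame_model is assumed to return a single logit per frame.
import cv2
import torch

def select_high_yield_frames(video_path, frame_model, threshold=0.5, device="cpu"):
    cap = cv2.VideoCapture(video_path)
    kept = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        x = torch.from_numpy(frame).permute(2, 0, 1).float().unsqueeze(0) / 255.0
        with torch.no_grad():
            score = torch.sigmoid(frame_model(x.to(device))).item()
        if score >= threshold:            # frame likely shows papillary structures
            kept.append(frame)
    cap.release()
    return kept                           # frames forming the "enriched" video
```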
Results: A total of 60 subjects were analyzed. The mean (±SD) duration of full-length nCLE was 6.45 ± 2.8 minutes. Twenty-five subject videos were used for developing and training the AI algorithm, and the model was tested on the remaining 35 cases. In the final 35-subject test group (female 34.3%, mean age 67±9 years, mean cyst diameter 36.3 ± 10.3 mm), 20% of the BD-IPMNs were histopathologically diagnosed as HGD-Ca. The nCLE-AI algorithm successfully edited full-length nCLE videos from the test set (n=35) and detected HGD-Ca with a sensitivity, specificity, and diagnostic accuracy of 71% (95% CI 36-92%), 64% (95% CI 46-79%), and 66% (95% CI 49-79%), respectively. In comparison, the revised 2017 ICG performance in the same test set was 43% sensitive (95% CI 16-75%), 89% specific (95% CI 73-96%), and 80% accurate (95% CI 64-90%) (Table 1).
Conclusion: We have demonstrated that a preliminary CNN-AI model can successfully edit full-length nCLE videos and risk-stratify IPMNs within a limited dataset, and that it was not inferior to current standard-of-care guidelines. Future AI model improvements utilizing larger databases and multicenter validation will enhance accuracy and fully automate the process for real-time interpretation.

