Introduction: Clinicians struggle to classify biliary tract strictures as benign or malignant because of inadequate sampling techniques. Current sampling modalities include brush cytology (BC) and forceps biopsy (FB), which have poor sensitivity for identifying malignancy. Cholangioscopy allows for direct visualization and sampling of biliary pathology; however, this technology is also associated with inaccurate classification of biliary disease. Previously, an artificial intelligence (AI) model that analyzes cholangioscopy footage was shown to be more accurate in identifying biliary tract malignancies than BC or FB. The aim of this study was to validate this same cholangioscopy AI on a new series of examinations obtained from multiple centers.
Methods: Three academic centers (including two that previously provided no training data for the AI) collected all available, unedited cholangioscopy recordings. The cholangioscopy videos were then processed and analyzed by the same cholangioscopy AI. After reviewing the entire video, the AI provided predictions as to whether malignancy was present during the examination. For this study, the AI underwent no additional retraining prior to review of these new cholangioscopy recordings. The performance of the AI in classifying strictures as being benign or malignant was compared to the performance of BC and FB.
Results: A total of 118 cholangioscopy examinations (average length 25.7 minutes) were collected from 103 patients. The two institutions that had contributed no prior training data for the AI provided 75 (63.6%) examinations. Of all cholangioscopy cases, 49 (41.5%) had recent biliary stenting, 6 (5.1%) had primary sclerosing cholangitis, and 67 (56.8%) were for the evaluation of biliary strictures (32 [47.8%] benign, 35 [52.2%] malignant). Most strictures examined (61.2%) were perihilar, and the most common malignancy was cholangiocarcinoma (77.1%). The remaining examinations were for other indications, including treatment of choledocholithiasis. For the classification of strictures as benign or malignant, the AI was 91.4% sensitive and 81.3% specific. The AI was significantly more accurate (86.6%) for stricture classification than BC (50%; p < 0.001), FB (68.1%; p = 0.024), or BC/FB combined (66.7%; p = 0.009) (Figure 1). Across all cases, the AI was 91.4% sensitive, 90.4% specific, and 90.7% accurate in identifying whether malignancy was present (Figure 2).
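The reported percentages can be checked against the case counts with a short calculation. The sketch below is illustrative only: the confusion-matrix counts are not reported in the abstract and are back-calculated here as the integer counts consistent with the stated totals (35 malignant and 32 benign strictures; 118 examinations overall). It further assumes that every non-stricture examination was non-malignant. The names `Confusion` and `pct` are hypothetical helpers introduced for this example.

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP


@dataclass(frozen=True)
class Confusion:
    """Counts for a binary malignant-vs-benign classification."""
    tp: int  # malignant, called malignant
    fn: int  # malignant, called benign
    tn: int  # benign, called benign
    fp: int  # benign, called malignant

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    @property
    def accuracy(self) -> float:
        total = self.tp + self.fn + self.tn + self.fp
        return (self.tp + self.tn) / total


def pct(value: float) -> str:
    """Format a proportion as a percentage with one decimal, rounding half up."""
    scaled = Decimal(str(value * 100))
    return f"{scaled.quantize(Decimal('0.1'), rounding=ROUND_HALF_UP)}%"


# ASSUMPTION: counts inferred from the reported percentages, not taken from the study.
# Stricture cases: 35 malignant + 32 benign = 67.
strictures = Confusion(tp=32, fn=3, tn=26, fp=6)
# All cases: the same 35 malignant cases plus 83 non-malignant examinations = 118.
all_cases = Confusion(tp=32, fn=3, tn=75, fp=8)

for label, c in (("strictures", strictures), ("all cases", all_cases)):
    print(label, pct(c.sensitivity), pct(c.specificity), pct(c.accuracy))
```

Under these assumed counts, the computed values match the sensitivity, specificity, and accuracy stated for both the stricture subset and the full series, which suggests the reported figures are internally consistent.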
Discussion: In this study, a previously established cholangioscopy AI continued to outperform sampling modalities when exposed to new video recordings from other centers without any additional retraining. Prospective trials deploying the AI are necessary to further confirm the benefits of this new technology.

Figure 1. Table reporting performance characteristics of the cholangioscopy AI compared to brush cytology, forceps biopsy, and brush cytology/forceps biopsy combined for the evaluation of biliary strictures. Table also reports the performance of the cholangioscopy AI on all cases.
Figure 2. Receiver operating characteristic curves demonstrating performance of cholangioscopy AI thresholds for stricture cases only (left) as well as for all cases (right).