Objective: Despite high stand-alone performance, studies demonstrate that artificial intelligence (AI)-supported endoscopic diagnostics often fall short in clinical applications due to human-AI interaction factors. This video-based trial on Barrett's esophagus aimed to investigate how examiner behavior, their levels of confidence, and system usability influence the diagnostic outcomes of AI-assisted endoscopy.
Methods: The present analysis employed data from a multicenter randomized controlled tandem video trial involving 22 endoscopists with varying degrees of expertise. Participants were tasked with evaluating a set of 96 endoscopic videos of Barrett's esophagus in two distinct rounds, with and without AI assistance. Diagnostic confidence levels were recorded, and decision changes were categorized according to the AI prediction. Additional surveys assessed user experience and system usability ratings.
Results: AI assistance significantly increased examiner confidence levels (p < 0.001) and accuracy. Withdrawing AI assistance decreased confidence (p < 0.001), but not accuracy. Experts consistently reported higher confidence than non-experts (p < 0.001), regardless of performance. Despite improved confidence, correct AI guidance was disregarded in 16% of all cases, and 9% of initially correct diagnoses were changed to incorrect ones. Overreliance on AI, algorithm aversion, and uncertainty in AI predictions were identified as key factors influencing outcomes. The System Usability Scale questionnaire scores indicated good to excellent usability, with non-experts scoring 73.5 and experts 85.6.
Conclusions: Our findings highlight the pivotal function of examiner behavior in AI-assisted endoscopy. To fully realize the benefits of AI, implementing explainable AI, improving user interfaces, and providing targeted training are essential. Addressing these factors could enhance diagnostic accuracy and confidence in clinical practice.
Background and aims
Celiac disease with its endoscopic manifestation of villous atrophy is underdiagnosed worldwide. The application of artificial intelligence (AI) for the macroscopic detection of villous atrophy at routine esophagogastroduodenoscopy may improve diagnostic performance.
Methods
A dataset of 858 endoscopic images of 182 patients with villous atrophy and 846 images from 323 patients with normal duodenal mucosa was collected and used to train a ResNet 18 deep learning model to detect villous atrophy. An external data set was used to test the algorithm, in addition to six fellows and four board certified gastroenterologists. Fellows could consult the AI algorithm’s result during the test. From their consultation distribution, a stratification of test images into “easy” and “difficult” was performed and used for classified performance measurement.
Results
External validation of the AI algorithm yielded values of 90 %, 76 %, and 84 % for sensitivity, specificity, and accuracy, respectively. Fellows scored values of 63 %, 72 % and 67 %, while the corresponding values in experts were 72 %, 69 % and 71 %, respectively. AI consultation significantly improved all trainee performance statistics. While fellows and experts showed significantly lower performance for “difficult” images, the performance of the AI algorithm was stable.
Conclusion
In this study, an AI algorithm outperformed endoscopy fellows and experts in the detection of villous atrophy on endoscopic still images. AI decision support significantly improved the performance of non-expert endoscopists. The stable performance on “difficult” images suggests a further positive add-on effect in challenging cases.