Gottlieb K, Requa J, Karnes W, et al.
Gastroenterology 2021;160:710–9.e2
Omer F. Ahmad
Endoscopic grading of the severity of Ulcerative Colitis (UC) is a critical component of disease assessment and particularly important for guiding therapy. Despite the availability of numerous scoring systems, such as the Mayo Endoscopic Score (eMS) and the Ulcerative Colitis Endoscopic Index of Severity (UCEIS), widespread use in routine clinical practice is often limited, primarily due to inter-observer variability and lack of training for standardised use [1,2].
The rapid translation of artificial intelligence (AI)-based algorithms in colonoscopy, most notably to assist polyp detection and characterisation, has naturally led to great interest in potential applications for Inflammatory Bowel Disease (IBD) [3]. Central reading of endoscopic videos by experts is standard practice for clinical trials and has generally led to improved accuracy of reporting [4]. Unfortunately, the process is time-consuming and expensive, which makes it impractical for routine clinical practice and therefore an ideal potential use case for AI models.
In this study, the authors describe the post hoc development of an AI algorithm to predict eMS and UCEIS using 795 prospectively collected colonoscopy and sigmoidoscopy videos (19.5 million frames) from the phase 2, multicentre, randomised, double-blind, parallel, placebo-controlled study of mirikizumab in patients with moderately to severely active UC. Procedural videos were collected at baseline, 12 weeks and 52 weeks and were centrally read [5]. These de-identified videos, accompanied only by the final eMS and UCEIS scores, were available to the investigators in this AI study. To develop the AI model, the authors used an autonomous data pre-processing strategy. This consisted of proprietary models that cleaned the video data, e.g. by removing poor-clarity frames and filtering out frames with poor bowel preparation (based on Boston Bowel Prep scores), followed by abnormal feature extraction from the remaining frames. The dataset was split randomly at a patient level into a training set (80%) and a test set (20%). A recurrent neural network (RNN) was trained to output a predicted eMS and UCEIS score for the entire video.
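The patient-level split matters because each patient can contribute several videos (baseline, week 12, week 52), and a video-level split could place the same patient in both sets. The sketch below is only an illustration of that idea, not the authors' pipeline: the function name `split_by_patient`, the input mapping and the seed are all assumptions made for the example.

```python
import random


def split_by_patient(video_to_patient, test_fraction=0.2, seed=0):
    """Split video IDs into train/test so that no patient spans both sets.

    video_to_patient: dict mapping a video ID to its patient ID
    (hypothetical input format, chosen for this illustration).
    """
    # Shuffle the unique patients, not the videos.
    patients = sorted(set(video_to_patient.values()))
    rng = random.Random(seed)
    rng.shuffle(patients)

    # Reserve roughly test_fraction of patients for testing.
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])

    train = [v for v, p in video_to_patient.items() if p not in test_patients]
    test = [v for v, p in video_to_patient.items() if p in test_patients]
    return train, test


# Toy example: 10 patients, each with three time points.
videos = {
    f"patient{p}_{visit}": f"patient{p}"
    for p in range(10)
    for visit in ("baseline", "week12", "week52")
}
train_videos, test_videos = split_by_patient(videos)
```

Note that the 80/20 ratio applies to patients here, so the ratio of videos is only approximately 80/20 when patients contribute different numbers of videos.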
The primary evaluation metric was the quadratic weighted kappa (QWK), a measure of inter-observer agreement that penalises each disagreement in proportion to the square of the distance between the two scores, so that larger disagreements are penalised progressively more heavily. It was used to assess agreement between the AI and human central reader scores; a score of 1 indicates perfect agreement. The authors also evaluated two dichotomised outcomes: accuracy in determining qualification for trials (defined as an eMS of 2–3 or a UCEIS of 5–8), and accuracy in identifying endoscopic healing (EH), defined as an eMS of 0 versus all other levels and a UCEIS of 0 versus all other levels.
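For readers unfamiliar with the metric, the following is a minimal sketch of the standard quadratic weighted kappa for two raters, written in plain Python. It is not the authors' code; the function name and the toy scores are assumptions for illustration. Scores are assumed to be integers from 0 to `n_levels - 1` (4 levels for eMS, 9 for UCEIS).

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, n_levels):
    """Cohen's kappa with quadratic weights for ordinal scores 0..n_levels-1."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) pairs
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B
    max_dist_sq = (n_levels - 1) ** 2

    weighted_observed = 0.0  # weighted disagreement actually seen
    weighted_expected = 0.0  # weighted disagreement expected by chance
    for i in range(n_levels):
        for j in range(n_levels):
            weight = (i - j) ** 2 / max_dist_sq  # 0 on the diagonal, 1 at the extremes
            weighted_observed += weight * observed[(i, j)] / n
            weighted_expected += weight * hist_a[i] * hist_b[j] / (n * n)

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use a single identical score")
    return 1.0 - weighted_observed / weighted_expected


# Toy eMS scores (0-3) from a central reader and a model; values are invented.
central_reader = [0, 1, 2, 3]
model = [0, 1, 2, 2]
kappa = quadratic_weighted_kappa(central_reader, model, n_levels=4)
```

Because the weight grows with the squared distance, a model that confuses eMS 2 with eMS 3 loses far less agreement than one that confuses eMS 0 with eMS 3.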
Following the automated self-cleaning data pre-processing, there were 786 videos containing 7.4 million video frames. 61.5% of the original 19.5 million frames were automatically excluded due to poor clarity, frames being outside of the colon or poor bowel preparation. In total, 11 complete videos (1.4% of the total dataset) were removed due to an unacceptably low ‘clean’ frame count. Overall, the model achieved a QWK of 0.844 (95% CI, 0.787–0.901) for eMS and 0.855 (95% CI, 0.80–0.91) for UCEIS. For trial qualification accuracy, the model achieved an overall score of 92.54% (95% CI, 88.43%–96.65%) for eMS and 91.04% (95% CI, 86.57%–95.51%) for UCEIS. For EH accuracy, the algorithm achieved an overall score of 95.52% (95% CI, 92.28%–98.76%) for eMS and 97.04% (95% CI, 94.39%–99.69%) for UCEIS.
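The qualification and endoscopic healing figures are accuracies on dichotomised scores. The sketch below shows one plausible reading of that calculation, using the thresholds given in this commentary (eMS 2–3 or UCEIS 5–8 for qualification, a score of 0 for healing). The function names and the toy scores are assumptions, and the authors' exact procedure may differ.

```python
def dichotomised_accuracy(reference, predicted, is_positive):
    """Fraction of videos where reference and prediction fall on the same side of a cut-off."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("score lists must be non-empty and of equal length")
    agree = sum(is_positive(r) == is_positive(p) for r, p in zip(reference, predicted))
    return agree / len(reference)


def ems_qualifies(score):
    return score >= 2  # eMS of 2-3 qualifies for trial entry


def uceis_qualifies(score):
    return score >= 5  # UCEIS of 5-8 qualifies for trial entry


def healed(score):
    return score == 0  # endoscopic healing: score of 0 versus all other levels


# Toy eMS scores; values are invented for illustration.
central_reader = [0, 1, 2, 3]
model = [0, 2, 2, 1]
qualification_accuracy = dichotomised_accuracy(central_reader, model, ems_qualifies)
healing_accuracy = dichotomised_accuracy(central_reader, model, healed)
```

A dichotomised accuracy can be high even when exact scores differ, since only disagreements that cross the cut-off count as errors.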
This study evaluated the performance of an AI model in predicting endoscopic disease activity scores from a multicentre, international clinical trial in which central reading scores were used as the reference standard. The neural network was able to predict eMS and UCEIS scores with a high level of agreement. A particular strength of this study, when compared to earlier published studies, is the evaluation on full-length videos.
Clearly the study findings need to be externally validated, particularly with videos from phase 3 trials where a 2+1 central reading scheme is used instead of a single central read. Moreover, there are inherent limitations in using human central readers as the reference standard instead of histopathology. However, using AI to predict histological remission on the basis of conventional white light endoscopy is currently unrealistic, whereas the incorporation of algorithms with image enhancement and magnification endoscopy is likely to yield more success [6,7].
The findings of this study are promising: AI models could be deployed to determine endoscopic qualification for trials and to assess therapeutic efficacy. Moreover, the proposed benefits of using AI models in clinical trials include possible lower costs, improved reliability and greater efficiency. More importantly, this study lays the foundations for the translation of automated IBD disease scoring into routine clinical practice. However, robust frameworks for further prospective evaluation of AI models need to be developed, to demonstrate efficacy and real-world benefit, before widespread clinical implementation can occur [8].
Omer Ahmad - Short Biography
Omer Ahmad is a senior clinical research fellow at the Wellcome/EPSRC Centre for Interventional and Surgical Sciences (WEISS), University College London. His research interests include artificial intelligence, computer vision and advanced endoscopic imaging.