P123 Development of an end-to-end Mayo endoscopic score machine learning algorithm for use in clinical trials of ulcerative colitis
Vieira, M.(1)*;Mishkin, D.S.(2);Gaither, K.W.(3);Safran, M.(4);Dashuta, A.(4);Tarakanov, A.(4);Zfira, L.(4);Fuerst, T.(5);
(1)Clario Inc., Medical and Scientific Affairs, Cleveland, United States;(2)Atrius Health, Gastroenterology, Boston, United States;(3)Clario Inc., Medical and Scientific Affairs, Portland, United States;(4)RSIP Vision, Research and Development, Jerusalem, Israel;(5)Clario Inc., Medical and Scientific Affairs, San Mateo, United States;
Background
Scoring colonoscopy videos of patients with ulcerative colitis (UC) requires a high level of expertise, yet even trained expert readers disagree. In clinical trials, such disagreement can negatively affect both subject selection and the assessment of treatment response. Consistency among central readers might be improved by algorithms that automatically process videos to identify salient features and estimate the level of disease activity using established scoring methods.
Methods
We propose an end-to-end system that uses machine learning (ML) models to process colonoscopy videos. The system shortens the time a human expert reader needs to score a video by filtering out non-informative parts, and it supplies a second-opinion Mayo Endoscopic Score (MES) for each informative frame and for the full video. Our dataset included videos from UC patients with a representative range of disease activity, acquired at a Ukrainian hospital (n=505) and an Israeli hospital (n=227). Video annotation was performed by 12 reviewers, trained and supervised by an expert gastroenterologist (DM) with >10 years of central reading experience. Of almost 3 million frames, approximately 34% were classified by human reviewers as non-informative (e.g., out of focus, motion blurring, stool). The remaining informative frames were further annotated by the reviewers as containing ulcers (6%) and hence classified as MES=3, erosions (22%) as MES=2, loss of vascularity or erythema (23%) as MES=1, or none of the above (49%), classified as MES=0. Informative frames were split into an 80% training set, a 10% validation set, and a 10% test set. The dataset included low-quality frames on which the reviewers could still estimate the MES. Models consisting of image processing algorithms, Convolutional Neural Networks (CNN), a CNN+Long Short-Term Memory (LSTM) model, and classical ML classifiers were trained to filter out non-informative frames and to independently score each frame and the full video for MES.
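The frame labelling and the 80/10/10 split described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the finding names, the rule that the most severe annotated finding determines the frame's MES, and the random frame-level split with a fixed seed are all assumptions made for the example.

```python
import random

# Findings in decreasing order of severity, paired with the MES they imply.
# The string labels are hypothetical; the abstract does not name them.
SEVERITY_ORDER = [
    ("ulcer", 3),
    ("erosion", 2),
    ("loss_of_vascularity", 1),
    ("erythema", 1),
]


def frame_mes(findings):
    """Return the MES implied by the most severe finding on a frame.

    Assumption: when several findings are annotated, the most severe wins.
    A frame with none of the listed findings is scored MES=0.
    """
    for name, score in SEVERITY_ORDER:
        if name in findings:
            return score
    return 0


def split_frames(frame_ids, seed=0):
    """Shuffle informative frames and split them 80/10/10.

    Assumption: a simple random split at frame level; the abstract does not
    state how frames were assigned to the three sets.
    """
    ids = list(frame_ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 8 // 10
    n_val = len(ids) // 10
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test
```

For example, `frame_mes({"erosion", "erythema"})` yields 2 under the assumed precedence rule, and `split_frames(range(100))` returns three disjoint lists of 80, 10, and 10 frame identifiers.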
Results
Model performance was evaluated using Cohen’s Quadratic Weighted Kappa (QWK) with 95% Confidence Interval [95% CI]. Model agreement with humans is shown in the middle column. Inter-reviewer agreement based on a subset of ‘clear frames’ from various videos of the test set is shown in the last column.
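For readers unfamiliar with the metric, the sketch below computes a quadratic weighted kappa for two sets of MES ratings (0 to 3), where disagreements are penalised by the squared distance between the two scores. The abstract does not say how the 95% CI was obtained; the percentile bootstrap shown here is an assumption chosen for illustration, and the function names and parameters are hypothetical.

```python
import numpy as np


def quadratic_weighted_kappa(rater_a, rater_b, n_classes=4):
    """Cohen's kappa with quadratic weights for ordinal scores 0..n_classes-1."""
    a = np.asarray(rater_a, dtype=int)
    b = np.asarray(rater_b, dtype=int)

    # Observed confusion matrix between the two raters.
    observed = np.zeros((n_classes, n_classes), dtype=float)
    np.add.at(observed, (a, b), 1.0)

    # Quadratic disagreement weights: 0 on the diagonal, 1 at maximum distance.
    idx = np.arange(n_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2

    # Expected matrix if the raters were independent, given their marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    denom = (weights * expected).sum()
    if denom == 0:
        # Degenerate case: both raters used a single identical class.
        return float("nan")
    return float(1.0 - (weights * observed).sum() / denom)


def bootstrap_ci(rater_a, rater_b, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for QWK (assumed method, not from the abstract)."""
    a = np.asarray(rater_a, dtype=int)
    b = np.asarray(rater_b, dtype=int)
    rng = np.random.default_rng(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        # Resample rated items with replacement, keeping pairs together.
        idx = rng.integers(0, n, size=n)
        stats.append(quadratic_weighted_kappa(a[idx], b[idx]))
    lo, hi = np.nanpercentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)
```

Identical ratings give a kappa of 1, and ratings that are always at opposite ends of the scale give -1. Resampling individual frames treats them as independent; frames from the same video are correlated, so resampling whole videos may be more appropriate in practice.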
Conclusion
We developed a full end-to-end pipeline for processing colonoscopy videos that estimated MES with accuracy comparable to human-human agreement. Such a model has the potential to make human readers more efficient by highlighting frames with relevant pathology. It can also aid inter-reader agreement by providing a second-opinion MES. Future work will expand model training and testing with new data sources and explore paradigms to combine the model with human readers to improve clinical trial central reading.