OP07 Artificial intelligence surpasses gastrointestinal experts in the classification of endoscopic severity among Ulcerative Colitis
Lo, B.Z.S.(1);Liu, Z.(1,2);Bendtsen, F.(1);Igel, C.(2);Vind, I.(1);Burisch, J.(1)
(1)Copenhagen University Hospital Hvidovre, The Gastrounit, Hvidovre, Denmark;(2)University of Copenhagen, Department of Computer Science, Copenhagen, Denmark The Presager Project
Background
Evaluation of endoscopic disease severity is a key component in the management of patients with ulcerative colitis (UC). However, endoscopic assessment suffers from substantial intra- and interobserver variation of up to 75 %, which limits the reliability of individual assessments. Our aim was to develop an artificial intelligence (AI) model capable of distinguishing active from healed mucosa and of differentiating between levels of endoscopic disease activity.
Methods
A total of 1484 unique endoscopic images from 467 patients were extracted for classification. Two experts independently classified all images according to the Mayo endoscopic subscore (MES). In case of disagreement, a third expert classified the image.
Different convolutional neural network architectures were applied in the development of the AI model. Five-fold cross-validation was employed to select the best model. Unseen test data were used for evaluation.
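As an illustration of the model selection step, the following is a minimal Python sketch of five-fold cross-validation over several candidate architectures. It is not the study's code: the function names, the `train_and_score` callback, and the random image-level split are assumptions made for this example. The abstract does not state how folds were formed; with several images per patient, folds might instead be grouped by patient.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(train_and_score, n_samples, k=5, seed=0):
    """Return the mean validation score of one candidate over k folds.

    train_and_score(train_idx, val_idx) is a hypothetical callback that
    trains the candidate on train_idx and returns its score on val_idx.
    """
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, val_idx in enumerate(folds):
        # Every fold except fold i is used for training.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(train_and_score(train_idx, val_idx))
    return mean(scores)


def select_best(candidates, n_samples, k=5):
    """Pick the candidate name with the highest mean cross-validated score."""
    results = {name: cross_validate(fn, n_samples, k)
               for name, fn in candidates.items()}
    return max(results, key=results.get), results
```

Here `candidates` would map an architecture name to its training callback. The held-out test images are kept out of this loop entirely and are only used once the best candidate has been chosen.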
The final model was evaluated on its performance in distinguishing MES 0 from 1–3, MES 0–1 (i.e. mucosal healing) from 2–3, and between all MES levels.
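The three evaluation tasks amount to different groupings of the same four-level MES label. A minimal Python sketch of that mapping is given here; the task names and the function `to_task_label` are hypothetical and are not taken from the study.

```python
def to_task_label(mes, task):
    """Map a Mayo endoscopic subscore (0-3) to the class label of one task.

    Task names are illustrative assumptions:
      "mes0_vs_1to3": 0 for MES 0,   1 for MES 1-3
      "mes01_vs_23":  0 for MES 0-1 (mucosal healing), 1 for MES 2-3
      "all_mes":      the subscore itself (four classes)
    """
    if mes not in (0, 1, 2, 3):
        raise ValueError(f"MES must be 0, 1, 2 or 3, got {mes!r}")
    if task == "mes0_vs_1to3":
        return int(mes >= 1)
    if task == "mes01_vs_23":
        return int(mes >= 2)
    if task == "all_mes":
        return mes
    raise ValueError(f"unknown task: {task!r}")


# Example: the same expert labels viewed under each task.
expert_mes = [0, 1, 2, 3]
healing_labels = [to_task_label(m, "mes01_vs_23") for m in expert_mes]
```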
The accuracy, sensitivity, specificity, positive and negative predictive value, and Cohen’s Kappa were used to evaluate the final models.
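For reference, the listed metrics can be computed from paired reference and predicted labels as in the sketch below. This is an illustrative implementation under stated assumptions, not the study's code: the function names are hypothetical, and because the abstract does not say which weighting scheme was used for the weighted kappa, both linear and quadratic weights are offered as options.

```python
def _ratio(numerator, denominator):
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, PPV and NPV for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": _ratio(tp + tn, len(pairs)),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }


def cohen_kappa(y_true, y_pred, n_classes, weights=None):
    """Cohen's kappa for integer labels 0..n_classes-1.

    weights: None (unweighted), "linear" or "quadratic" disagreement weights.
    """
    n = len(y_true)
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    row = [sum(r) for r in cm]
    col = [sum(cm[i][j] for i in range(n_classes)) for j in range(n_classes)]

    def w(i, j):
        if weights is None:
            return 0 if i == j else 1
        if weights == "linear":
            return abs(i - j)
        if weights == "quadratic":
            return (i - j) ** 2
        raise ValueError(f"unknown weights: {weights!r}")

    cells = [(i, j) for i in range(n_classes) for j in range(n_classes)]
    observed = sum(w(i, j) * cm[i][j] for i, j in cells) / n
    expected = sum(w(i, j) * row[i] * col[j] for i, j in cells) / n ** 2
    return 1 - _ratio(observed, expected)
```

With weights, a prediction of MES 3 for an image labelled MES 0 is penalised more than a prediction of MES 1, which is why a weighted kappa is commonly reported for ordinal scores such as the MES.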
Results
On the most difficult task (distinguishing between all four MES categories), our final model achieved a mean accuracy of 0.82, a mean AUC of 0.99, a test accuracy of 0.84, a sensitivity of 0.88, a specificity of 0.81, and a weighted Cohen’s Kappa of 0.83 (p<0.001 compared to the experts).
The results from the other tasks are shown in Table 1.
Task | Test accuracy | Sensitivity | Specificity | PPV | NPV | Cohen’s Kappa | P-value |
---|---|---|---|---|---|---|---|
Distinguish between all MES | 0.84 (0.64–0.96) | 0.88 (0.80–0.93) | 0.81 (0.73–0.87) | 0.80 (0.72–0.86) | 0.89 (0.82–0.94) | Unweighted: 0.76 (0.70–0.83) Weighted: 0.83 (0.79–0.88) | Unweighted: p<0.001 Weighted: p<0.001 |
MES 0 from 1–3 | 0.94 (0.85–0.97) | 0.95 (0.89–0.98) | 0.93 (0.87–0.97) | 0.94 (0.88–0.97) | 0.84 (0.88–0.97) | 0.88 (0.82–0.94) | p<0.001 |
MES 0–1¹ from 2–3 | 0.93 (0.84–0.97) | 0.78 (0.66–0.87) | 0.99 (0.96–1.00) | 0.96 (0.86–0.99) | 0.93 (0.88–0.96) | 0.82 (0.74–0.90) | p<0.001 |

Values in parentheses are 95 % confidence intervals. MES = Mayo endoscopic subscore; PPV = positive predictive value; NPV = negative predictive value. ¹ Mucosal healing.
Conclusion
We propose a new standardised way of evaluating endoscopic images from UC patients for both clinical and academic purposes. The proposed AI model distinguished very well between all four MES levels of activity. This will optimise and unify the evaluation of disease severity, as measured by the Mayo endoscopic subscore, across all centres and hospitals regardless of the level of medical expertise.