Grader variability and the importance of reference standards for evaluating machine learning models for diabetic retinopathy
This work addresses the challenge of inconsistent grading in medical AI for diabetic eye disease. The contribution is incremental: it builds on existing methods to improve reference standards.
The study tackled the problem of grader variability in diabetic retinopathy (DR) diagnosis by examining how different methods of obtaining reference standards affect deep learning model performance. It found that a small set of adjudicated DR grades led to substantial improvements, with the resulting algorithm performing on par with individual U.S. board-certified ophthalmologists and retinal specialists.
Diabetic retinopathy (DR) and diabetic macular edema are common complications of diabetes which can lead to vision loss. The grading of DR is a fairly complex process that requires the detection of fine features such as microaneurysms, intraretinal hemorrhages, and intraretinal microvascular abnormalities. Because of this, there can be a fair amount of grader variability. There are different methods of obtaining the reference standard and resolving disagreements between graders, and while it is usually accepted that adjudication until full consensus will yield the best reference standard, the difference between various methods of resolving disagreements has not been examined extensively. In this study, we examine the variability in different methods of grading, definitions of reference standards, and their effects on building deep learning models for the detection of diabetic eye disease. We find that a small set of adjudicated DR grades allows substantial improvements in algorithm performance. The resulting algorithm's performance was on par with that of individual U.S. board-certified ophthalmologists and retinal specialists.
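To make the idea of resolving grader disagreements concrete, the following is a minimal sketch, not the procedure used in the study. It assumes each image receives an integer DR grade from several graders, derives a majority-vote reference grade, and flags images on which graders disagree as candidates for adjudication. The function names, the example grades, and the tie-breaking rule (falling back to the low median of all grades) are illustrative assumptions.

```python
from collections import Counter
from statistics import median_low


def majority_grade(grades):
    """Return the most frequent grade among the graders.

    If no single grade is most frequent, fall back to the low median
    of all grades (an assumed rule, chosen only for illustration).
    """
    counts = Counter(grades)
    top = max(counts.values())
    winners = [g for g, c in counts.items() if c == top]
    if len(winners) == 1:
        return winners[0]
    return median_low(grades)


def needs_adjudication(grades):
    """Flag an image when the graders are not unanimous."""
    return len(set(grades)) > 1


# Hypothetical grades from three graders per image (not data from the study).
image_grades = {
    "img_a": [0, 0, 0],
    "img_b": [1, 2, 2],
    "img_c": [0, 2, 3],
}

majority = {name: majority_grade(g) for name, g in image_grades.items()}
to_adjudicate = [name for name, g in image_grades.items() if needs_adjudication(g)]

print(majority)
print(to_adjudicate)
```

In this sketch a majority vote always produces a grade, even for an image like img_c where no two graders agree, whereas adjudication would send such images back to the graders for discussion until they reach consensus.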