  AI Outperformed Humans in Assessing Ophthalmic Images

    By Lynda Seminara
    Selected by Prem S. Subramanian, MD, PhD

    Journal Highlights

    British Journal of Ophthalmology
    Published online Jan. 31, 2023


    Pandey et al. trained an artificial intelligence (AI) algorithm to classify retinal disorders from fundus photographs alone, then compared the algorithm’s performance with that of human experts. They found that their ensemble of deep convolutional neural networks (CNNs) was more accurate and reliable than image assessment by board-certified ophthalmologists.

    The authors predetermined four conditions to explore in their study: diabetic retinopathy (DR), glaucoma, age-related macular degeneration (AMD), and normal fundus appearance (no pathology). The CNN architecture was based on the InceptionV3 model, and initial weights were pretrained on the ImageNet dataset. The training data comprised 43,055 fundus images pooled from 12 public datasets. An ensemble of five trained CNNs was then tested on an “unseen” set of 100 images, and seven board-certified ophthalmologists were asked to classify the same images. Evaluation metrics included overall accuracy, defined as the percentage of correct predictions among all test images, as well as per-condition and overall (macro-averaged) F1-score, positive predictive value, sensitivity, and specificity. To gauge the reliability of predictions, the authors looked for agreement between a classifier’s confidence level and the accuracy of each prediction, on the assumption that confidence would be higher for correct predictions.
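
    The article does not reproduce the authors’ code, but the described setup maps onto standard transfer-learning practice. Below is a minimal sketch, in Keras, of one plausible way to build such a classifier: an InceptionV3 backbone with ImageNet-pretrained weights, a new four-class head, and softmax averaging across an ensemble of five models. The class label order, input size, and averaging scheme are illustrative assumptions, not details confirmed by the study.

    ```python
    import numpy as np
    import tensorflow as tf

    # The study's four predetermined conditions (label order assumed).
    CLASSES = ["DR", "glaucoma", "AMD", "normal"]

    def build_member() -> tf.keras.Model:
        """One ensemble member: InceptionV3 pretrained on ImageNet,
        with a fresh softmax head for the four fundus categories."""
        base = tf.keras.applications.InceptionV3(
            weights="imagenet",         # initial weights pretrained on ImageNet
            include_top=False,          # drop the original 1,000-class head
            input_shape=(299, 299, 3),  # InceptionV3's default input size (assumed)
        )
        x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
        out = tf.keras.layers.Dense(len(CLASSES), activation="softmax")(x)
        return tf.keras.Model(base.input, out)

    # Five CNNs; in the study, each would be trained on the pooled
    # 43,055 fundus images before being evaluated on the unseen test set.
    ensemble = [build_member() for _ in range(5)]

    def ensemble_predict(images: np.ndarray) -> np.ndarray:
        """Average the members' softmax outputs; the maximum of the
        averaged probabilities serves as the prediction's confidence."""
        return np.mean([m.predict(images, verbose=0) for m in ensemble], axis=0)
    ```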

    According to the analyses, the overall mean accuracy rate was 72.7% for the ophthalmologists and 79.2% for the ensemble of deep CNNs (p = .03). The AI system also had a significantly better mean F1-score for identifying DR (76.8% vs. 57.5% for ophthalmologists; p = .01) and numerically higher F1-scores for recognizing glaucoma (83.9% vs. 75.7%; p = .10), AMD (85.9% vs. 85.2%; p = .69), and absent pathology (73.0% vs. 70.5%; p = .39). Moreover, the mean agreement between accuracy and confidence was higher for the CNN ensemble (81.6% vs. 70.3%; p < .001).
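
    For readers who want to compute these metrics on their own predictions, the quantities above can be derived along the following lines with scikit-learn. This is a generic sketch, not the authors’ evaluation code; in particular, the confidence-accuracy agreement shown here (a prediction “agrees” when it is confident and correct, or unconfident and incorrect, at an assumed 0.5 threshold) is one plausible reading, since the article does not give the exact formula.

    ```python
    import numpy as np
    from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

    def evaluate(y_true, y_pred, confidences, threshold=0.5):
        """Overall accuracy, macro-averaged F1, per-class sensitivity and
        specificity, and an assumed confidence-accuracy agreement score."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        acc = accuracy_score(y_true, y_pred)                  # % correct over all test images
        macro_f1 = f1_score(y_true, y_pred, average="macro")  # F1 averaged over the 4 conditions

        # Per-condition sensitivity and specificity from the confusion matrix.
        cm = confusion_matrix(y_true, y_pred)
        tp = np.diag(cm)
        fn = cm.sum(axis=1) - tp
        fp = cm.sum(axis=0) - tp
        tn = cm.sum() - (tp + fn + fp)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)

        # Assumed definition of confidence-accuracy agreement (not stated
        # in the article): confident-and-correct or unconfident-and-incorrect.
        correct = y_true == y_pred
        confident = np.asarray(confidences) >= threshold
        agreement = np.mean(confident == correct)
        return acc, macro_f1, sensitivity, specificity, agreement
    ```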

    “This work provides proof-of-principle that an algorithm is capable of accurate and reliable recognition of multiple retinal diseases using only fundus photographs,” the authors wrote. They believe that their AI model could be a cost-effective adjunct for decision-making in specialty ophthalmology clinics and in general health care settings, including family practice and emergency departments. They emphasized that automated AI classifiers may be helpful for community-based eye screening programs.

    The original article can be found here.