Using ROC curves to compare different screening thresholds with a continuous predictive variable
When the screening test measures a continuous value, such as IOP, evaluating the test becomes more complicated. Figure 1-5 uses data from the Baltimore Eye Survey to graphically display sensitivity and specificity for each value of IOP. The usual cutoff for normal IOP, 21 mm Hg, has a sensitivity of 49% and a specificity of 90%. The point where the sensitivity and specificity curves intersect marks the threshold at which the 2 measures are approximately equal; it is often taken as the optimal threshold when sensitivity and specificity are weighted equally. This intersection occurs at 18 mm Hg, where the sensitivity is 65% and the specificity is 66%. With continuous variables such as IOP, there is a trade-off between sensitivity and specificity: a higher sensitivity results in a lower specificity, and vice versa.
Figure 1-5 Sensitivity and specificity of an intraocular pressure (IOP) cutoff as a screening tool for glaucoma. For each IOP level (along the x-axis), the values for sensitivity and specificity are plotted. This demonstrates that with a higher level of IOP as a screening cutoff for glaucoma (for example, IOP >30 mm Hg), the sensitivity decreases and the specificity increases.
(Used with permission from Tielsch JM, Katz J, Singh K, et al. A population-based evaluation of glaucoma screening: the Baltimore Eye Survey. Am J Epidemiol. 1991;134:1102–1110.)
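The trade-off shown in Figure 1-5 can be reproduced on a small scale. The following Python sketch uses invented IOP readings (not Baltimore Eye Survey data) and assumes that a test is positive when IOP is at or above the cutoff; the function name `sensitivity_specificity` and both lists are hypothetical.

```python
# Toy IOP readings in mm Hg. These values are invented for illustration only.
GLAUCOMA_IOP = [14, 17, 19, 20, 22, 24, 27, 31]
HEALTHY_IOP = [11, 13, 14, 15, 16, 17, 18, 19, 21, 23]


def sensitivity_specificity(cutoff, diseased, healthy):
    """Return (sensitivity, specificity) when 'IOP >= cutoff' counts as positive."""
    true_positives = sum(1 for iop in diseased if iop >= cutoff)
    true_negatives = sum(1 for iop in healthy if iop < cutoff)
    return true_positives / len(diseased), true_negatives / len(healthy)


# Sweep a few cutoffs to see sensitivity fall as specificity rises.
for cutoff in (16, 18, 21, 24):
    sens, spec = sensitivity_specificity(cutoff, GLAUCOMA_IOP, HEALTHY_IOP)
    print(f"cutoff {cutoff} mm Hg: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

In this toy data set, raising the cutoff lowers sensitivity and raises specificity, the same pattern the figure shows for the survey data.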
Figure 1-6 depicts another graphical representation of sensitivity and specificity called a receiver operating characteristic (ROC) curve. By convention, an ROC curve plots sensitivity on the y-axis and (1 – specificity) on the x-axis. The larger the area under the curve, the more diagnostically precise the screening test. The line with the diamond-shaped symbols represents a hypothetical screening test with optimal results; the line with the triangles represents a poor screening test with an ROC area of only 50%; and the line with the circles (the middle curve) represents the Baltimore Eye Survey data used in Figure 1-5. An ROC curve can inform selection of an optimal cutoff point for a screening test by identifying the sensitivity–specificity pair located closest to the upper left corner of the ROC plot.
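The 2 uses of an ROC curve described above, summarizing a test by its area and picking the operating point nearest the upper left corner, can be sketched in Python. The helper names `roc_area` and `distance_to_upper_left` are hypothetical, the area is a simple trapezoidal estimate, and the only real numbers used are the 2 sensitivity–specificity pairs quoted in the text for cutoffs of 18 and 21 mm Hg.

```python
import math


def roc_area(points):
    """Trapezoidal area under an ROC curve given (1 - specificity, sensitivity) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def distance_to_upper_left(sensitivity, specificity):
    """Euclidean distance from an operating point to the ideal corner (0, 1)."""
    return math.hypot(1 - specificity, 1 - sensitivity)


# Reference curves: a useless test follows the diagonal; an ideal test hugs the corner.
useless = [(0.0, 0.0), (1.0, 1.0)]
ideal = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print("useless area:", roc_area(useless), "ideal area:", roc_area(ideal))

# (sensitivity, specificity) quoted in the text for the two IOP cutoffs.
cutoffs = {18: (0.65, 0.66), 21: (0.49, 0.90)}
closest = min(cutoffs, key=lambda c: distance_to_upper_left(*cutoffs[c]))
print("cutoff closest to the upper left:", closest, "mm Hg")
```

Distance to the corner weights sensitivity and specificity equally; a different weighting, reflecting the relative cost of missed and false diagnoses, could favor a different cutoff.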
Overall, these figures demonstrate that IOP measurement is not a very good screening tool for glaucoma because no cutoff reaches the ideal sensitivity and specificity (the upper left of the diamond line). Other significant factors in choosing a cutoff point for a screening test are the population to be screened and the relative importance of sensitivity and specificity. If the consequence of missing a diagnosis is severe, such as blindness, an investigator may choose a test with high sensitivity but poor specificity. For example, a low cutoff for erythrocyte sedimentation rate might be chosen for a person who has recent vision loss and who is suspected of having giant cell arteritis.
Tielsch JM, Katz J, Singh K, et al. A population-based evaluation of glaucoma screening: the Baltimore Eye Survey. Am J Epidemiol. 1991;134(10):1102–1110.
Figure 1-6 ROC curve of IOP as a screening tool for glaucoma with sensitivity on the y-axis and (1 − specificity) on the x-axis. The middle line replots the data from Figure 1-5, showing all combinations of IOP. Two boxes identify the diagnostic precision of IOP ≥18 mm Hg and IOP ≥21 mm Hg. The other lines represent an optimal (upper line) and a useless (lower line) screening test, respectively.
(Produced with data from Tielsch JM, Katz J, Singh K, et al. A population-based evaluation of glaucoma screening: the Baltimore Eye Survey. Am J Epidemiol. 1991;134:1102–1110.)
Using ROC curves to compare different screening devices
Studies can use ROC curves to compare new diagnostic tests, including tests that use different units or different scales. Figure 1-7 shows 3 ROC curves illustrating the ability of 3 glaucoma imaging devices to discriminate between healthy eyes and eyes with glaucomatous visual field loss via imaging of the optic nerve head and nerve fiber layer. The area under each ROC curve represents a summary measure of the relative efficacy of the screening test. The ROC curves appear similar for inferior average nerve fiber layer thickness as measured with OCT and for scanning laser polarimetry with variable corneal compensation (GDx VCC nerve fiber index), while the ROC curve for confocal scanning laser ophthalmoscopy (HRT linear discriminant function) is lower. In other words, the figure suggests a higher diagnostic precision for scanning laser polarimetry and OCT than for confocal scanning laser ophthalmoscopy.
Medeiros FA, Zangwill LM, Bowd C, Weinreb RN. Comparison of the GDx VCC scanning laser polarimeter, HRT II confocal scanning laser ophthalmoscope, and Stratus OCT optical coherence tomograph for the detection of glaucoma. Arch Ophthalmol. 2004;122(6):827–837.
The effect of pretest probability of disease
The pretest probability of disease is estimated from what is known about the patient before any screening or diagnostic tests are performed. For example, the investigator may know that the patient has a first-degree relative with glaucoma as well as a thinner-than-average central corneal thickness (both are risk factors for glaucoma). This information suggests a pretest probability of glaucoma about 3 times higher than that of a person picked at random from the general population. How much does a diagnostic test improve the ability to diagnose glaucoma in this patient with a higher pretest probability? How much higher is the relative risk of glaucoma if the test result is positive?
Figure 1-7 ROC curve of 3 glaucoma imaging devices. The single parameter chosen for display for each instrument was the one that performed the best in the authors’ study. There was no statistically significant difference in the area under the ROC curves for these 3 parameters.
(The HRT linear discriminant function is from a paper by Bathija et al, referenced by Medeiros et al; the GDx and OCT parameters are standard test outputs provided by the manufacturers. Graph drawn with data from Medeiros FA, Zangwill LM, Bowd C, Weinreb RN. Comparison of the GDx VCC scanning laser polarimeter, HRT II confocal scanning laser ophthalmoscope, and Stratus OCT optical coherence tomograph for the detection of glaucoma. Arch Ophthalmol. 2004;122:827–837.)
Bayes' theorem allows the pretest probability of disease to be combined with the diagnostic precision of a screening test to produce a posttest probability of disease. To use this theorem, the likelihood ratio must be calculated. The likelihood ratio of a positive test is the sensitivity divided by (1 − specificity). For a sample test with 80% sensitivity and 90% specificity (0.8/[1 − 0.9]), the positive likelihood ratio is 8. The likelihood ratio of a negative test is (1 − sensitivity) divided by the specificity. For the same sample test, the negative likelihood ratio is (1 − 0.8)/0.9, or 0.22. For an informative test, positive likelihood ratios start at 1 and continue to infinity; the bigger, the better. Negative likelihood ratios range from 0 to 1; the smaller, the better. If the goal is to diagnose disease, the test with the larger positive likelihood ratio is the better test; conversely, if the goal is to rule out disease, the test with the smaller negative likelihood ratio is better.
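The 2 likelihood ratio formulas above translate directly into code. This is a minimal Python sketch; the function name `likelihood_ratios` is hypothetical, and the inputs are the sample test from the text (80% sensitivity, 90% specificity).

```python
import math


def likelihood_ratios(sensitivity, specificity):
    """Return (positive LR, negative LR) for a test with the given accuracy."""
    # A perfectly specific test has no false positives, so LR+ is unbounded.
    lr_positive = math.inf if specificity == 1 else sensitivity / (1 - specificity)
    # A test with zero specificity gives no basis for a negative LR.
    lr_negative = math.inf if specificity == 0 else (1 - sensitivity) / specificity
    return lr_positive, lr_negative


lr_positive, lr_negative = likelihood_ratios(0.80, 0.90)
print(f"LR+ = {lr_positive:.2f}, LR- = {lr_negative:.2f}")  # LR+ = 8.00, LR- = 0.22
```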
If the positive likelihood ratio is multiplied by the pretest odds of disease, the result is the posttest odds of disease; when disease is uncommon, odds and probability are nearly equal, so the same multiplication approximates the posttest probability. Thus, for the example patient with the positive family history, thin cornea, and a pretest probability 3 times that of the general population, a positive test with a positive likelihood ratio of 8 will result in a posttest probability of glaucoma that is approximately 24 times that of a person drawn at random from the population.
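Strictly, Bayes' theorem applies the likelihood ratio to the pretest odds rather than to the pretest probability, and multiplying probabilities directly is a close approximation only when disease is uncommon. The Python sketch below illustrates this for the example patient; the function name `posttest_probability` is hypothetical, and the baseline prevalences are assumed values chosen for illustration, not figures from the text.

```python
def posttest_probability(pretest_probability, likelihood_ratio):
    """Convert probability to odds, apply the likelihood ratio, convert back."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)


LR_POSITIVE = 8.0            # sample test from the text
RELATIVE_PRETEST_RISK = 3.0  # example patient: 3 times the population risk

# Hypothetical population prevalences (assumptions for illustration only).
for baseline in (0.001, 0.01, 0.05):
    pretest = RELATIVE_PRETEST_RISK * baseline
    posttest = posttest_probability(pretest, LR_POSITIVE)
    print(f"baseline {baseline:.1%}: posttest {posttest:.1%}, "
          f"{posttest / baseline:.1f} times baseline")
```

The multiple of baseline risk stays close to 24 when the assumed prevalence is very low and falls below it as prevalence rises, because probabilities cannot exceed 100% whereas odds are unbounded.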
Table 1-3 Changes in PPV and NPV Depending on Pretest Probability in a Test With 80% Sensitivity, 90% Specificity
Table 1-3 demonstrates another important consideration regarding pretest probability of disease. Consider the case of a 65-year-old woman with no risk factors for glaucoma and a pretest probability of disease of 1.0%. A positive test result for glaucoma would raise her probability of disease to 7.5%. Most patients with a positive test result would not actually have the disease! Similarly, an 85-year-old man with a strong positive family history, thin central corneal thickness, and an IOP of 30 mm Hg might have a pretest probability of disease of 50.0%. If his test result were negative, he would still have a posttest probability of disease of 18.2%, greater than that of the 65-year-old woman! This example illustrates the importance of considering the pretest probability of disease in deciding whether to employ a diagnostic test. In general, screening tests do not perform well when the prevalence of disease is low.
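The 2 cases discussed above can be checked with a short calculation. This Python sketch applies Bayes' theorem directly to the pretest probability, sensitivity, and specificity; the function name `predictive_values` is hypothetical, and only the 2 pretest probabilities mentioned in the text (1.0% and 50.0%) are evaluated.

```python
def predictive_values(pretest_probability, sensitivity, specificity):
    """Return (PPV, NPV) for a test applied at the given pretest probability."""
    p = pretest_probability
    true_pos = sensitivity * p                # diseased, test positive
    false_pos = (1 - specificity) * (1 - p)   # healthy, test positive
    false_neg = (1 - sensitivity) * p         # diseased, test negative
    true_neg = specificity * (1 - p)          # healthy, test negative
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Test with 80% sensitivity and 90% specificity, as in Table 1-3.
for pretest in (0.01, 0.50):
    ppv, npv = predictive_values(pretest, 0.80, 0.90)
    print(f"pretest {pretest:.1%}: PPV {ppv:.1%}, NPV {npv:.1%}, "
          f"disease despite a negative result {1 - npv:.1%}")
```

The posttest probability after a negative result is 1 minus the NPV, which is how the 18.2% figure for the 85-year-old man arises.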
Intermediate diagnostic categories, such as “glaucoma suspect,” are often encountered in clinical practice. Sensitivity–specificity and ROC curves cannot account for such categories, because they require borderline subjects to be categorized as either having the disease (eg, glaucoma) or not having it (eg, no glaucoma). However, a likelihood ratio can be calculated for a borderline category, which reflects the risk of patients exhibiting that characteristic (eg, “glaucoma suspect”).
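A likelihood ratio for each result category is the proportion of diseased patients with that result divided by the proportion of nondiseased patients with the same result. The Python sketch below shows the calculation; the function name `category_likelihood_ratios` and all counts are hypothetical, invented purely to illustrate the arithmetic, and every category is assumed to have at least 1 nondiseased patient.

```python
def category_likelihood_ratios(diseased_counts, healthy_counts):
    """Likelihood ratio per category: P(category | disease) / P(category | no disease)."""
    total_diseased = sum(diseased_counts.values())
    total_healthy = sum(healthy_counts.values())
    return {
        category: (diseased_counts[category] / total_diseased)
        / (healthy_counts[category] / total_healthy)
        for category in diseased_counts
    }


# Invented counts of test results among patients with and without glaucoma.
diseased = {"normal": 10, "glaucoma suspect": 30, "glaucoma": 60}
healthy = {"normal": 80, "glaucoma suspect": 15, "glaucoma": 5}

for category, ratio in category_likelihood_ratios(diseased, healthy).items():
    print(f"{category}: likelihood ratio {ratio:.2f}")
```

With these made-up counts, the borderline category carries a likelihood ratio between those of the normal and abnormal categories, so the borderline result still shifts the posttest probability without forcing the patient into either group.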
Use of tests in combination
Studies can combine tests in series or in parallel. An example of combining 2 tests in series is when a clinician performs the second test only if the first is positive. The correlation between the 2 tests must be considered when they are used in series. Consider the following case: a study uses cup–disc ratio, determined via optic nerve head photography, as a diagnostic test. If the result is positive, the peripapillary retinal nerve fiber layer thickness observed on OCT imaging is used to confirm the diagnosis. The study provides likelihood ratios for each test. Although it may be tempting to use the product of the 2 likelihood ratios and the pretest probability to calculate a posttest probability, if the screening tests are correlated with one another, the predictive ability will appear artificially higher. Because cup–disc ratio and retinal nerve fiber layer thickness both examine tissues of the optic nerve head, albeit using different technologies, they are highly correlated. Thus, because the 2 tests are not independent, the actual performance of the 2-test strategy is likely to differ from the posttest probability calculated from the product of the 2 likelihood ratios and the pretest probability.
Other studies have combined tests in parallel, considering the result positive if either test result is positive. This strategy works best when the tests have good specificity (combining tests this way makes overall specificity deteriorate) and address different aspects of a disease. Kopplin and colleagues found that a visual acuity of less than 20/40, abnormal/poor-quality nonmydriatic imaging, abnormal frequency doubling perimetry, or abnormal/poor-quality confocal scanning laser ophthalmoscopy resulted in an ROC curve area of 0.827 for detection of visually significant eye disease (eg, cataract, macular degeneration, glaucoma).
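The effect of the series and parallel strategies on sensitivity and specificity can be sketched under the simplifying assumption that the 2 tests are conditionally independent, which, as noted above, does not hold for correlated tests. In this Python sketch, the function names and the accuracy figures for the 2 tests are hypothetical.

```python
def combine_in_series(sens1, spec1, sens2, spec2):
    """Positive only if BOTH tests are positive (assumes independent tests)."""
    sensitivity = sens1 * sens2
    specificity = 1 - (1 - spec1) * (1 - spec2)
    return sensitivity, specificity


def combine_in_parallel(sens1, spec1, sens2, spec2):
    """Positive if EITHER test is positive (assumes independent tests)."""
    sensitivity = 1 - (1 - sens1) * (1 - sens2)
    specificity = spec1 * spec2
    return sensitivity, specificity


# Invented accuracy figures for two tests, for illustration only.
test_a = (0.80, 0.90)  # (sensitivity, specificity)
test_b = (0.70, 0.95)

series = combine_in_series(*test_a, *test_b)
parallel = combine_in_parallel(*test_a, *test_b)
print(f"series:   sensitivity {series[0]:.3f}, specificity {series[1]:.3f}")
print(f"parallel: sensitivity {parallel[0]:.3f}, specificity {parallel[1]:.3f}")
```

Under this assumption, series testing trades sensitivity for specificity and parallel testing does the reverse, which is why parallel combinations work best when each test already has good specificity. When the tests are positively correlated, the real gains are likely to be smaller than these formulas suggest.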
Clinical acceptance and ethics of testing
Clinicians should avoid tests that provide a small increment in the likelihood ratio of detecting disease or that are expensive or painful. Also, all tests carry some burden, including the potential for adverse effects (eg, corneal abrasion from tonometry), psychological fear of a disease (eg, related to a screening test for glaucoma), and additional testing and follow-up examinations for abnormal or unusual results. A clinician should avoid a test if it will not change patient management. Similarly, screening for eye disease should include a process for follow-up of those who have abnormal results, regardless of their insurance status. Screening provides little value to participants who are told they might have a disease but are given no method of obtaining a follow-up evaluation or treatment.
Most studies investigate new screening or diagnostic tests in a clinical setting before evaluating them in a population-based sample (largely because of the high cost of performing population-based research). Clinicians should consider whether the data for a new test will apply to their screening population. Even a clinic-based study may not have patients like those in another practice. For example, a study may include only young glaucoma patients without other eye diseases, such as macular degeneration. This leads to excellent sensitivity and specificity, but the results may differ in a sample of patients who have borderline glaucoma and are older.
Excerpted from BCSC 2020-2021 series: Section 1 - Update on General Medicine.