Evaluating diagnostic tests

Authors: Neal G Mahutte, MD; Antoni J Duleba, MD
Section Editor: Joann G Elmore, MD, MPH
Deputy Editor: Carrie Armsby, MD, MPH

Literature review current through: Apr 2025. | This topic last updated: Mar 21, 2025.

INTRODUCTION

The introduction of new diagnostic tests that claim to improve screening or provide definitive diagnosis is a major dilemma for all clinicians. The decision to embrace or reject these tests is often made individually with incomplete information and without thoughtful reflection.

In this topic review, we will outline a simple stepwise process for evaluating the utility of any diagnostic test:

●Can the test be reliably performed?

●Was the test evaluated in an appropriate population?

●Was an appropriate reference standard used?

●Was an appropriate cutoff value chosen?

●What are the test performance characteristics (eg, sensitivity, specificity) relative to the reference standard?

●How well does the test perform in populations with different prevalences of the disease?

●What is the balance between the cost of the test and the burden(s) of the disease?

A glossary of some of the biostatistical and epidemiologic terms used in this topic and the principles of hypothesis testing are presented separately. (See "Glossary of common biostatistical and epidemiological terms" and "Hypothesis testing in clinical research: Proof, p-values, and confidence intervals".)

Reference ranges for common laboratory tests in adults are also presented separately. (See "Laboratory test reference ranges in adults".)

RELIABILITY OF THE TEST

To judge objectively whether a test can be reliably performed, it is helpful to determine the extent to which the test is accurate, precise, and user independent.

Accuracy and precision — "Accuracy" refers to the ability of the test to actually measure what it claims to measure. It is defined as the number of correct test results (including both true positives and true negatives) divided by the total number of observations (table 1). However, accuracy by itself is often not the best indicator of test performance, since it obscures important information related to its component parts. Other test characteristics (sensitivity and specificity) are often more informative, as discussed below. (See 'Test performance characteristics' below.)

Precision refers to the ability of the test to reproduce the same result when repeated on the same patient or sample.

The properties of accuracy and precision are related but somewhat distinct. For example, a test could be precise but not accurate if on three occasions it produced roughly the same result, but that result differed substantially from the true value determined by a reference standard.
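The "precise but not accurate" scenario can be made concrete with a minimal sketch. The three measurements and the reference value below are hypothetical, and for a continuous measurement the sketch expresses accuracy as the average deviation from the reference value (bias), which is an illustrative choice rather than the two-by-two definition given above.

```python
from statistics import mean, stdev

# Hypothetical values: three repeat measurements of one sample whose
# reference-standard ("true") value is assumed to be 10.0 units.
true_value = 10.0
measurements = [8.1, 8.0, 8.2]

# Bias: how far the average result sits from the true value (accuracy).
bias = mean(measurements) - true_value

# Standard deviation: how tightly repeat results cluster (precision).
spread = stdev(measurements)

print(f"bias = {bias:.2f} units, SD = {spread:.2f} units")
```

The small standard deviation with a large bias corresponds to a test that reproduces nearly the same result each time but is consistently far from the reference value.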

Skill and inter-observer variability — For diagnostic tests that require skill to perform the test and/or interpret results, a major challenge in evaluating test performance is determining to what extent user skill and experience influence accuracy and precision. Published studies often come from tertiary care centers with advanced capabilities, equipment, and personnel. Such environments may not reflect the resources found at less specialized centers. For example, point-of-care ultrasonography (POCUS) is an accurate and precise tool in an expert's hands, but it may be less reliable when performed by a less experienced practitioner. (See "Bedside pleural ultrasonography: Equipment, technique, and the identification of pleural effusion and pneumothorax", section on 'Choosing bedside versus consultative ultrasound' and "Emergency ultrasound in adults with abdominal and thoracic trauma" and "Indications for bedside ultrasonography in the critically ill adult patient".)

For tests that are prone to interrater variability, the kappa statistic can be calculated to assess the degree of agreement on repeated measurements. (See "Glossary of common biostatistical and epidemiological terms", section on 'Kappa statistic'.)
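As a rough illustration of how agreement between two readers might be quantified, the sketch below computes Cohen's kappa (one common form of the kappa statistic) for a dichotomous reading. The function name and the counts are hypothetical and are not drawn from any study.

```python
def cohen_kappa(both_pos, a_pos_b_neg, a_neg_b_pos, both_neg):
    """Cohen's kappa for two raters who each read a test as positive or negative."""
    n = both_pos + a_pos_b_neg + a_neg_b_pos + both_neg
    # Proportion of cases on which the two raters actually agree.
    observed = (both_pos + both_neg) / n
    # Proportion of positive readings for each rater.
    a_pos = (both_pos + a_pos_b_neg) / n
    b_pos = (both_pos + a_neg_b_pos) / n
    # Agreement expected by chance alone, given each rater's positive rate.
    expected = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)
    return (observed - expected) / (1 - expected)

# Hypothetical counts: 100 scans read independently by raters A and B.
kappa = cohen_kappa(both_pos=40, a_pos_b_neg=10, a_neg_b_pos=5, both_neg=45)
print(f"kappa = {kappa:.2f}")
```

In this example the raters agree on 85 percent of scans, but because 50 percent agreement would be expected by chance, kappa is lower than the raw agreement.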

When deciding whether to incorporate a new diagnostic tool into practice, the local resources and the clinicians’ expertise with use of the new technology should be considered.

POPULATION STUDIED

Generalizability — When evaluating the utility of a diagnostic test, it is important to examine the population in which the test was studied. Optimally, a test should be evaluated in a broad spectrum of patients with and without the disease so that the results are generalizable [1]. Those with the disease should represent all stages and manifestations of the condition. Perhaps more importantly, individuals without the disease should have some clinical manifestations similar to the disease in question. This is critical in demonstrating the ability of the test to distinguish those with the condition of interest from those with other diagnostic considerations in the differential diagnosis.

As an example, the utility of obtaining a serum CA125 concentration for detection of endometriosis depends upon studying a population that includes a range of patients with minimal, mild, moderate, and severe endometriosis. If the study population has a disproportionate number of individuals with severe disease, this might lead to an overestimation of the test’s ability to identify cases. It is also essential to include a large cohort of patients without endometriosis but with similar signs or symptoms (eg, dysmenorrhea, dyspareunia, pelvic pain, infertility, adnexal mass, fibroids). Neglecting to include these patients might falsely inflate the performance of the test.

Sample size — Sample size is part of the question of population appropriateness. An adequate number of patients must be studied to encompass a broad spectrum of manifestations in diseased and nondiseased subjects. However, an overly large sample size may detect a statistically significant test difference that is not clinically meaningful, while a sample size that is too small may yield inconclusive results due to low power.

One direct way of evaluating sample size is to examine the confidence intervals (CIs) for sensitivity, specificity, and likelihood ratios reported in the study. Wide CIs indicate imprecise estimates and suggest that the sample was too small. (See 'Sensitivity and specificity' below and 'Likelihood ratios' below.)
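To show how sample size affects the width of a CI, the following sketch computes a 95% CI for an observed sensitivity of 90 percent in a small and a large study. It uses the Wilson score interval, which is one of several accepted methods and is chosen here as an assumption; the counts are hypothetical.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives about 95% coverage)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Hypothetical studies: the same 90% sensitivity in 20 versus 200 diseased patients.
small_low, small_high = wilson_interval(18, 20)
large_low, large_high = wilson_interval(180, 200)

print(f"n = 20:  {small_low:.3f} to {small_high:.3f}")
print(f"n = 200: {large_low:.3f} to {large_high:.3f}")
```

With 20 patients the interval spans roughly 70 to 97 percent, whereas with 200 patients it narrows to roughly 85 to 93 percent, even though the point estimate is identical.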

REFERENCE STANDARD

Any new test that purports to have value must be compared with a reference standard. Ideally, the reference standard allows unambiguous determination of whether the patient does or does not have the disease. However, in the real world, many reference standards are imperfect, and they often involve some degree of error or user dependence.

Real-world considerations compel us to use practical definitions. Reference standards represent "the best we have" for distinguishing normal from abnormal. The reference standard is the test that thus far has been shown to most reliably detect the disease.

As an example, histopathology is often used as a reference standard for the diagnosis of endometriosis; however, histopathology is not infallible. Cases can be misdiagnosed because of sampling error or individual differences among pathologists in histologic interpretation. The presence of ectopic endometrial glands, but not stroma (or vice versa), in an individual with clinical signs and symptoms of endometriosis is suggestive of this disorder but does not meet strict criteria for the disease (ie, ectopically located endometrial glands and stroma). By comparison, does an asymptomatic individual have endometriosis if a random biopsy of normal-appearing peritoneum finds endometrial glands and stroma? These questions address issues of both disease definition and what is normal.

For some diseases, lack of a sensitive gold standard can present a challenge for evaluating new diagnostic technologies, particularly when the accuracy of the newer test potentially exceeds that of the reference standard (eg, PCR testing versus culture for certain infectious diseases) [2].

CUTOFF VALUE

The trade-off of higher versus lower cutoff — A cutoff value must be chosen to define an abnormal test result. Selecting the cutoff value involves balancing sensitivity and specificity. The approach is context dependent and should consider the clinical circumstances in which the test will be used. For example, the main priority for a screening test might be to optimize sensitivity; a lower specificity may be acceptable in this setting. By contrast, a diagnostic test used to identify candidates for a treatment that carries substantial risk should have high specificity. (See 'Sensitivity and specificity' below.)

A schematic illustration of the relationship between the cutoff value, sensitivity, and specificity is shown in the figure (figure 1).
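The trade-off can also be seen numerically. The sketch below applies several cutoffs to a small set of hypothetical marker values, assuming that higher values indicate disease; the data and the function name are invented for illustration only.

```python
# Hypothetical marker values; higher values are assumed to indicate disease.
diseased = [35, 42, 48, 55, 61, 70, 88, 95]
healthy = [10, 15, 22, 28, 33, 38, 41, 52]

def sens_spec(cutoff):
    """Sensitivity and specificity when results >= cutoff are called abnormal."""
    true_positives = sum(value >= cutoff for value in diseased)
    true_negatives = sum(value < cutoff for value in healthy)
    return true_positives / len(diseased), true_negatives / len(healthy)

for cutoff in (30, 40, 50, 60):
    sensitivity, specificity = sens_spec(cutoff)
    print(f"cutoff {cutoff}: sensitivity {sensitivity:.3f}, specificity {specificity:.3f}")
```

Raising the cutoff from 30 to 60 moves sensitivity from 1.0 down to 0.5 while specificity rises from 0.5 to 1.0, which is the pattern a screening test (low cutoff) and a confirmatory test (high cutoff) would exploit differently.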

Receiver operating characteristic (ROC) curves — The receiver operating characteristic (ROC) curve plots sensitivity against 1 − specificity (ie, the false-positive rate) for all cutoff values measured for the diagnostic test (figure 2). The ROC curve demonstrates the trade-off between sensitivity and specificity for different cutoff values. As one moves from left to right along the ROC curve, the sensitivity increases while the specificity decreases.

The area under the curve (AUC) of the ROC curve represents the overall accuracy of the test. A test that performs no better than chance would be represented by a straight line with an AUC of 0.5. A near-perfect test would have a rectangular configuration with an AUC approaching 1.0. The closer the AUC is to 1, the more accurate the test. Similarly, if one wants to select a cutoff value for a test that minimizes both false positives and false negatives (and hence maximizes both sensitivity and specificity), one would select the point on the ROC curve closest to the far upper left corner.

However, finding the right balance between optimal sensitivity and specificity may not involve simultaneously minimizing false positives and false negatives in all situations. For example, when screening for a deadly disease that is curable, it may be desirable to accept more false positives (lower specificity) in return for fewer false negatives (higher sensitivity). ROC curves allow for more thorough evaluation of a test and potential cutoff values, but they are not the ultimate arbiters of how to set sensitivity and specificity.
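A minimal sketch of how an ROC curve, its AUC, and the point closest to the upper left corner might be computed is shown below. The marker values are hypothetical, higher values are assumed to indicate disease, and the AUC is estimated with the trapezoidal rule; none of these choices comes from a specific study.

```python
from math import hypot

# Hypothetical marker values; higher values are assumed to indicate disease.
diseased = [35, 42, 48, 55, 61, 70, 88, 95]
healthy = [10, 15, 22, 28, 33, 38, 41, 52]

def roc_points(pos, neg):
    """Return (cutoff, false-positive rate, sensitivity) for each observed cutoff."""
    points = [(float("inf"), 0.0, 0.0)]  # a cutoff above every value calls no one positive
    for cutoff in sorted(set(pos + neg), reverse=True):
        sensitivity = sum(v >= cutoff for v in pos) / len(pos)
        false_positive_rate = sum(v >= cutoff for v in neg) / len(neg)
        points.append((cutoff, false_positive_rate, sensitivity))
    return points

points = roc_points(diseased, healthy)

# Area under the curve by the trapezoidal rule over consecutive ROC points.
auc = sum((x1 - x0) * (y0 + y1) / 2
          for (_, x0, y0), (_, x1, y1) in zip(points, points[1:]))

# Cutoff whose ROC point lies closest to the upper left corner (0, 1).
best_cutoff, best_fpr, best_sens = min(points, key=lambda p: hypot(p[1], 1 - p[2]))

print(f"AUC = {auc:.3f}")
print(f"closest to corner: cutoff {best_cutoff}, "
      f"sensitivity {best_sens:.3f}, specificity {1 - best_fpr:.3f}")
```

The "closest to the corner" cutoff treats false positives and false negatives as equally undesirable; as noted above, a different cutoff may be preferable when one type of error is clinically more costly.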

TEST PERFORMANCE CHARACTERISTICS

Sensitivity and specificity — The sensitivity and specificity can easily be calculated using a two-by-two table (table 1).

●Sensitivity is the number of patients with a positive test who have a disease (true positives) divided by all patients who have the disease (table 1). A test with high sensitivity will not miss many patients who have the disease (ie, low false-negative rate).

●Specificity is the number of patients who have a negative test and do not have the disease (true negatives) divided by the number of patients who do not have the disease (table 1). A test with high specificity will infrequently identify patients as having a disease when they do not (ie, low false-positive rate).

A schematic illustration of the relationship between the cutoff value, sensitivity, and specificity is shown in the figure (figure 1).

False-positive and false-negative rates — Two-by-two tables can also be used for calculating the false-positive and false-negative rates (table 1):

●The false positive rate = false positives/(false positives + true negatives). It is also equal to 1 − specificity.

●The false negative rate = false negatives/(false negatives + true positives). It is also equal to 1 − sensitivity.

An ideal test minimizes false-positive and false-negative rates.
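The definitions above can be collected into a short sketch that works from the four cells of a two-by-two table. The counts are hypothetical and were chosen to show how a high overall accuracy can coexist with poor sensitivity when the disease is uncommon in the study sample.

```python
def two_by_two_metrics(tp, fp, fn, tn):
    """Summarize a two-by-two table of test results versus the reference standard."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),   # equals 1 - specificity
        "false_negative_rate": fn / (fn + tp),   # equals 1 - sensitivity
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Hypothetical counts: 50 people with the disease and 950 without it.
metrics = two_by_two_metrics(tp=10, fp=5, fn=40, tn=945)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here the test is "correct" in 95.5 percent of people, yet it misses 80 percent of those with the disease, which is why sensitivity and specificity are usually reported alongside, or instead of, overall accuracy.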

Likelihood ratios — Another method by which the performance of diagnostic tests can be judged is to assess the positive and negative likelihood ratios, which, like sensitivity and specificity, are independent of disease prevalence.

●Positive likelihood ratio – The positive likelihood ratio is the true positive rate divided by the false positive rate. It can be calculated from the sensitivity and specificity as follows:

Positive likelihood ratio = Sensitivity/(1 – Specificity)

The higher the positive likelihood ratio, the better the test (a perfect test has a positive likelihood ratio equal to infinity).

●Negative likelihood ratio – The negative likelihood ratio is the false negative rate divided by the true negative rate. It can be calculated from the sensitivity and specificity as follows:

Negative likelihood ratio = (1 – Sensitivity)/Specificity

The lower the negative likelihood ratio, the better the test (a perfect test has a negative likelihood ratio of 0).

●Interpretation – The following general guidelines can be used to characterize test performance based on positive and negative likelihood ratios:

•Excellent test performance – Positive likelihood ratio ≥10; negative likelihood ratio <0.1

•Good test performance – Positive likelihood ratio 5 to <10; negative likelihood ratio 0.1 to ≤0.2

•Fair to poor test performance – Positive likelihood ratio <5; negative likelihood ratio >0.2

As an example, consider the value of an elevated CA125 level in distinguishing ovarian cancer from a benign ovarian cyst in an individual with an ovarian mass. If the sensitivity of this test for identifying ovarian cancer is 70 percent (ie, 70 percent of patients with ovarian cancer have elevated CA125 levels), but the specificity is only 65 percent (ie, 35 percent of patients with benign cysts also have elevated CA125 levels), then the positive likelihood ratio would be only 2 (ie, 70 percent divided by 35 percent). This would be considered a poor test for the diagnosis of ovarian cancer.

●Limitations of positive and negative likelihood ratios – Although likelihood ratios are independent of disease prevalence, their direct validity is only within the original study population. They are generalizable to other populations to the extent that:

•The test can be reliably performed with minimal interobserver and intraobserver variation

•The study population(s) from which the values were derived was adequate in size and composition of normal and diseased phenotypes

•An appropriate reference standard was used

If a diagnostic test was investigated in a narrow subpopulation or the test relied heavily on user skill/interpretation, then the sensitivity, specificity, and likelihood ratios reported in the study may not be generalizable outside of the original research population. In other words, the test performance parameters may have internal validity but not external validity.
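The likelihood ratio formulas and the general interpretation thresholds described in this section can be checked with a short sketch. It uses the CA125 figures from the example above (sensitivity 70 percent, specificity 65 percent); the function names are illustrative only.

```python
from math import inf

def likelihood_ratios(sensitivity, specificity):
    """Return (positive LR, negative LR) from sensitivity and specificity."""
    false_positive_rate = 1 - specificity
    lr_positive = inf if false_positive_rate == 0 else sensitivity / false_positive_rate
    lr_negative = inf if specificity == 0 else (1 - sensitivity) / specificity
    return lr_positive, lr_negative

def rate_positive_lr(lr_positive):
    """Apply the general guideline for positive likelihood ratios."""
    if lr_positive >= 10:
        return "excellent"
    if lr_positive >= 5:
        return "good"
    return "fair to poor"

def rate_negative_lr(lr_negative):
    """Apply the general guideline for negative likelihood ratios."""
    if lr_negative < 0.1:
        return "excellent"
    if lr_negative <= 0.2:
        return "good"
    return "fair to poor"

# CA125 example from the text: sensitivity 70 percent, specificity 65 percent.
lr_pos, lr_neg = likelihood_ratios(0.70, 0.65)
print(f"LR+ = {lr_pos:.2f} ({rate_positive_lr(lr_pos)})")
print(f"LR- = {lr_neg:.2f} ({rate_negative_lr(lr_neg)})")
```

Both ratios fall in the "fair to poor" range, consistent with the conclusion that CA125 performs poorly for this purpose under the stated assumptions.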

PRETEST PROBABILITY AND PREDICTIVE VALUES

In addition to the sensitivity, specificity, and likelihood ratios, the other major determinants of the test’s utility are the disease prevalence and pretest probability (calculator 1 and calculator 2).

The usefulness of a positive test decreases as disease prevalence (or pretest probability) decreases. This concept is the basis of predictive values (also called post-test probabilities).

●Positive predictive value (PPV) refers to the probability that a positive test correctly identifies an individual who actually has the disease. It is computed from two-by-two tables: true positives/(true positives + false positives) (table 1).

●Negative predictive value (NPV) refers to the probability that a negative test correctly identifies an individual who does not have the disease. It is computed from two-by-two tables: true negatives/(false negatives + true negatives) (table 1).

For example, assuming a constant sensitivity and specificity, the PPV and NPV for a disease with prevalence of 10, 1, or 0.1 percent are shown in a table (table 2). This example illustrates how a positive result from the same test with near-perfect sensitivity (99 percent) and high specificity (90 percent) may have completely different significance depending upon the baseline prevalence of disease in the population. When applied to a population in which the disease is common (prevalence = 10 percent), the PPV is approximately 52 percent. By comparison, when applied to a different population in which the disease is uncommon (prevalence = 0.1 percent), the PPV is only 1 percent; thus, 99 percent of all individuals who test positive are actually free of the disease. All that the test has accomplished in this population is to slightly upgrade the probability of disease from extremely unlikely (0.1 percent) to very unlikely (1 percent) and, in the process, to subject numerous individuals without the disease to further testing.
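The dependence of predictive values on prevalence can be reproduced with a few lines of arithmetic. The sketch below uses the sensitivity (99 percent), specificity (90 percent), and prevalences from the example above; the function name is illustrative.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    # Expected fraction of the whole population in each cell of the two-by-two table.
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

for prevalence in (0.10, 0.01, 0.001):
    ppv, npv = predictive_values(0.99, 0.90, prevalence)
    print(f"prevalence {prevalence:.1%}: PPV {ppv:.1%}, NPV {npv:.2%}")
```

With sensitivity and specificity held fixed, the PPV falls from roughly one half at 10 percent prevalence to about 1 percent at 0.1 percent prevalence, while the NPV remains very high throughout.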

A second example, using a different combination of sensitivity, specificity, and prevalence, is illustrated in the figure (figure 3).

A clinically relevant example of the impact of pretest probability on diagnostic test performance is the use of the Wells score in the diagnostic evaluation of suspected new-onset lower extremity deep vein thrombosis (DVT). The Wells score is used to estimate pretest probability; the subsequent decisions to measure D-dimer and to perform compression ultrasonography are then based upon that pretest probability, as discussed separately. (See "Clinical presentation and diagnosis of the nonpregnant adult with suspected deep vein thrombosis of the lower extremity", section on 'Initial approach (pretest probability)'.)
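Pretest probability and likelihood ratios can be combined directly using the standard odds form of Bayes' theorem (post-test odds = pretest odds × likelihood ratio). This relationship is not spelled out in the text above, so the sketch below should be read as a supplementary illustration; it reuses the sensitivity of 99 percent and specificity of 90 percent from the earlier predictive value example, and the function name is hypothetical.

```python
def post_test_probability(pretest_probability, likelihood_ratio):
    """Convert a pretest probability (strictly between 0 and 1) to a post-test probability."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    post_test_odds = pretest_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)

# Positive likelihood ratio for sensitivity 99 percent and specificity 90 percent.
lr_positive = 0.99 / (1 - 0.90)

for pretest in (0.10, 0.01, 0.001):
    post = post_test_probability(pretest, lr_positive)
    print(f"pretest {pretest:.1%} -> post-test {post:.1%} after a positive result")
```

When the pretest probability is set equal to the disease prevalence, the post-test probability after a positive result is the same quantity as the PPV, so the two approaches give matching answers.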

COST

The final judgment involved in considering the value of a test is the balance between cost of the test and the burden(s) of the disease. These may include direct costs to the individual or insurer as well as broader societal costs. Cost is often the determinant in deciding when, where, and how a diagnostic test is utilized.

A society and its health care system might be willing to accept low positive predictive values (PPVs) in return for saved lives for a rare disease that is universally fatal but easily curable. By comparison, an accurate but extremely expensive test might be less desirable than an inexpensive but less accurate test if the consequences of misdiagnosis are not serious.

Cost-effectiveness analysis, which involves estimating direct monetary costs as well as all of the indirect costs of disease, testing, and misdiagnosis, is discussed in greater detail separately. (See "A short primer on cost-effectiveness analysis".)

SUMMARY

●Accuracy and precision – "Accuracy" is defined as the number of correct test results (including both true positives and true negatives) divided by the total number of observations (table 1). Precision refers to the ability of the test to reproduce the same result when repeated on the same patient or sample. Both properties are important in determining the utility of a diagnostic test. (See 'Reliability of the test' above.)

●Generalizability – Optimally, a test should be evaluated on a broad spectrum of patients with and without the disorder in question to maximize generalizability. (See 'Population studied' above.)

●Reference standard – Optimally, new diagnostic tests should be compared with a reference standard that allows unambiguous determination of whether the patient does or does not have the disease. However, in the real world, many reference standards are imperfect. (See 'Reference standard' above.)

●Cutoff value – A cutoff value must be chosen to define an abnormal test result. Selecting this value involves balancing sensitivity and specificity (figure 1). The receiver operating characteristic (ROC) curve demonstrates the trade-off between sensitivity and specificity for different cutoff values (figure 2). (See 'Cutoff value' above.)

●Test characteristics

•Sensitivity and specificity – Sensitivity is the number of patients with a positive test who have a disease (true positives) divided by all patients who have the disease (table 1). Specificity is the number of patients who have a negative test and do not have the disease (true negatives) divided by the number of patients who do not have the disease. (See 'Sensitivity and specificity' above.)

•Positive and negative likelihood ratios – The positive likelihood ratio (= sensitivity/[1 – specificity]) and negative likelihood ratio (= [1 − sensitivity]/specificity), like the sensitivity and specificity, are independent of disease prevalence. General guidelines for characterizing test performance based on positive and negative likelihood ratios are provided above. (See 'Likelihood ratios' above.)

●Disease prevalence and pretest probability – In addition to the sensitivity, specificity, and likelihood ratios, the other major determinants of the test’s utility are the disease prevalence and pretest probability (calculator 1). The usefulness of a positive test decreases as disease prevalence (or pretest probability) decreases. The table illustrates how the positive and negative predictive values of a test vary depending on disease prevalence (table 2). (See 'Pretest probability and predictive values' above.)

REFERENCES

1. Schünemann HJ, Mustafa RA, Brozek J, et al. GRADE guidelines: 21 part 1. Study design, risk of bias, and indirectness in rating the certainty across a body of evidence for test accuracy. J Clin Epidemiol 2020; 122:129.
2. Patel R, Tsalik EL, Evans S, et al. Clinically Adjudicated Reference Standards for Evaluation of Infectious Diseases Diagnostics. Clin Infect Dis 2023; 76:938.

Topic 2769 Version 27.0
