Glossary of common biostatistical and epidemiological terms

INTRODUCTION — This topic review provides a catalog of common biostatistical and epidemiological terms encountered in the medical literature.

STATISTICS THAT DESCRIBE HOW DATA ARE DISTRIBUTED

Measures of central tendency — Three measures of central tendency are most frequently used to describe data:

●**Mean** – The mean equals the sum of values divided by the number of values.

●**Median** – The median is the middle value when all values are ordered from smallest to largest; when there are an even number of values, the median is defined as the mean of the middle two data points.

●**Mode** – The mode is the value that occurs most frequently.
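As an illustration of the three definitions above, the following minimal Python sketch computes each measure with the standard library. The blood pressure readings are hypothetical values chosen only for demonstration.

```python
from statistics import mean, median, mode

# Hypothetical systolic blood pressure readings (mm Hg), for illustration only
values = [118, 120, 120, 126, 134, 150]

mean_value = mean(values)      # sum of values divided by the number of values
median_value = median(values)  # even count, so the mean of the two middle values
mode_value = mode(values)      # value that occurs most frequently

print(mean_value, median_value, mode_value)
```

In this example, the single high reading (150) pulls the mean above the median, which is one reason the median is often preferred for skewed data.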

Measures of dispersion — Dispersion refers to the degree to which data are scattered around a specific value (such as the mean). The most commonly used measures of dispersion are:

●**Range** – The range equals the difference between the largest and smallest observation.

●**Standard deviation (SD)** – The SD measures the variability of data around the mean. It provides information on how much variability can be expected among individuals within a population. In samples that follow a "normal" distribution (ie, Gaussian), approximately 68 and 95 percent of values fall within one and two SDs of the mean, respectively.

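●**Standard error of the mean (SEM)** – The SEM describes how much variability can be expected when measuring the mean from several different samples. It equals the SD divided by the square root of the number of observations in the sample, so it decreases as the sample size increases.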

●**Percentile** – The percentile is the percentage of a distribution that is below a specific value. As an example, a child is in the 90^{th} percentile for weight if only 10 percent of children the same age weigh more than they do.

●**Interquartile range** – The interquartile range refers to the upper and lower values defining the central 50 percent of observations. The boundaries are equal to the observations representing the 25^{th} and 75^{th} percentiles. The interquartile range is depicted in a box and whiskers plot (figure 1).
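The measures of dispersion listed above can be sketched in Python as follows. The data are hypothetical. The SEM is computed here as the SD divided by the square root of the sample size, and the quartile boundaries use the default method of `statistics.quantiles`; other software may use slightly different quartile conventions and return slightly different boundaries.

```python
from math import sqrt
from statistics import quantiles, stdev

# Hypothetical systolic blood pressure readings (mm Hg), for illustration only
values = [118, 120, 120, 126, 134, 150]

value_range = max(values) - min(values)  # largest minus smallest observation
sd = stdev(values)                       # sample SD (n - 1 in the denominator)
sem = sd / sqrt(len(values))             # SEM = SD / sqrt(n)

# Cut points dividing the data into four equal groups:
# q1 is the 25th percentile, q2 the median, q3 the 75th percentile
q1, q2, q3 = quantiles(values, n=4)

print("range:", value_range)
print("SD:", round(sd, 2), "SEM:", round(sem, 2))
print("interquartile range boundaries:", q1, "to", q3)
```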

TERMS USED TO DESCRIBE THE FREQUENCY OF AN EVENT

Incidence — Incidence represents the number of new events that have occurred in a specific time interval divided by the population at risk at the beginning of the time interval. For example, the annual incidence of colon cancer would be reported as the number of new cases per 100,000 people per year.

Prevalence — Prevalence refers to the number of individuals with a given disease at a given point in time divided by the population at risk at that point in time. Prevalence has been further defined as being "point" or "period." Point prevalence refers to the proportion of individuals with a condition at a specified point in time, while period prevalence refers to the proportion of individuals with a condition during a specified interval (eg, a year). For example, the point prevalence of colon cancer in 2022 would be reported as the number of people living with colon cancer per 100,000 people.
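The distinction between incidence and point prevalence can be illustrated with a short Python sketch. The population size and case counts are hypothetical, and the helper name `per_100k` is introduced here only for illustration.

```python
def per_100k(count, population):
    """Express a count as a rate per 100,000 people."""
    return count * 100_000 / population

# Hypothetical figures, for illustration only
population_at_risk = 250_000     # people at risk at the start of the year
new_cases_in_year = 100          # new diagnoses during the year
existing_cases_on_date = 625     # people living with the disease on a given date

annual_incidence = per_100k(new_cases_in_year, population_at_risk)
point_prevalence = per_100k(existing_cases_on_date, population_at_risk)

print("annual incidence per 100,000:", annual_incidence)
print("point prevalence per 100,000:", point_prevalence)
```

Prevalence exceeds annual incidence in this example because prevalent cases accumulate over time when patients live with the disease for more than one year.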

TERMS USED TO DESCRIBE THE MAGNITUDE OF AN EFFECT

Relative risk — The relative risk (RR; also called risk ratio) equals the incidence in exposed individuals divided by the incidence in unexposed individuals (figure 2). The RR can be calculated from studies in which the proportion of patients exposed and unexposed to a risk is known, such as a cohort study. (See 'Cohort study' below.)

RR is also commonly used to describe effect sizes in randomized trials, in which case, the RR equals the proportion of patients who had the outcome in the treatment arm divided by the proportion of patients who had the outcome in the control arm. (See 'Randomized controlled trial' below.)

Odds ratio — The odds ratio (OR) equals the odds that an individual with a specific condition has been exposed to a risk factor divided by the odds that a control has been exposed. The OR is used in case-control studies (see 'Case-control study' below). In addition, multivariate analyses often generate ORs and, therefore, other types of studies may report effect sizes using ORs. The OR provides a reasonable estimate of the RR if the outcome is uncommon, but it will tend to overestimate the effect size if the outcome is more common (figure 2).

Hazard ratio — A hazard ratio (HR) is the effect estimate generated by a time-to-event analysis (see 'Time-to-event analysis (survival analysis)' below). The HR is analogous to an OR. Thus, an HR of 10 means that, at any given time during follow-up, a group of patients exposed to a specific risk factor has 10 times the rate of developing the outcome compared with unexposed controls.

The RR, OR, and HR are interpreted relative to the number 1. An OR of 0.6, for example, suggests that patients exposed to a variable of interest were 40 percent less likely to develop a specific outcome compared with the control group. Similarly, an OR of 1.5 suggests that the risk was increased by 50 percent.
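The relationship between the RR and the OR can be illustrated with a minimal Python sketch based on a 2x2 table. The counts are hypothetical, and the function names are introduced here only for illustration. The OR is computed as the cross-product ratio, which gives the same value as the ratio of exposure odds described above.

```python
def relative_risk(a, b, c, d):
    """a, b: exposed with/without the outcome; c, d: unexposed with/without."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """Cross-product ratio of the same 2x2 table."""
    return (a * d) / (b * c)

# Hypothetical cohorts of 1000 exposed and 1000 unexposed individuals
uncommon = (10, 990, 5, 995)     # outcome in 1 percent vs 0.5 percent
common = (600, 400, 300, 700)    # outcome in 60 percent vs 30 percent

for label, table in (("uncommon outcome", uncommon), ("common outcome", common)):
    rr = relative_risk(*table)
    odds = odds_ratio(*table)
    print(label, "RR:", round(rr, 2), "OR:", round(odds, 2))
```

Both tables have an RR of 2, but the OR is close to 2 only when the outcome is uncommon; with the common outcome, the OR is 3.5.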

Absolute risk difference — The RR and OR provide an estimate of the relative effect size. However, it is more often desirable to know information about the absolute risk difference (ARD; or absolute risk reduction [ARR]). The ARD depends on the baseline rate of the outcome. If the baseline rate is low, the ARD may not be clinically important despite a large RR reduction. For example, consider a therapy that results in a 50 percent relative reduction in mortality. If the baseline mortality rate is 40 percent, the ARD would be 20 percent, which is clearly a clinically meaningful reduction. However, if the baseline mortality rate is 1 percent, the ARD would be 0.5 percent, which may not be clinically important.
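The dependence of the ARD on the baseline rate in the example above can be reproduced with a short Python sketch. The function name is hypothetical and is used only for illustration.

```python
def absolute_risk_difference(baseline_risk, relative_risk_reduction):
    """ARD for a therapy that lowers the baseline risk by a given relative amount."""
    treated_risk = baseline_risk * (1 - relative_risk_reduction)
    return baseline_risk - treated_risk

# Same 50 percent relative reduction, two different baseline mortality rates
for baseline in (0.40, 0.01):
    ard = absolute_risk_difference(baseline, 0.50)
    print(f"baseline {baseline:.0%}: ARD {ard:.1%}")
```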

Number needed to treat — The benefit of an intervention can be expressed by the "number needed to treat (NNT)." NNT is the reciprocal of the ARR (event rate in the control arm minus the event rate in the treatment arm). The NNT can be interpreted as follows: "This study suggests that for every five patients treated with the new treatment, one additional death would be prevented compared with the control treatment."

As an example, consider a clinical trial involving 100 patients randomized to treatment with a new drug or placebo, with 50 patients in each arm. Thirty patients died during the study period (10 receiving active drug and 20 receiving placebo), yielding a mortality rate of 20 percent with active drug versus 40 percent with placebo, as shown in the left panel of the figure (figure 3). The ARR between the two treatment arms is used to calculate NNT.

●ARR = 40 percent minus 20 percent = 20 percent = 0.2

●NNT = 1 divided by ARR = 1 divided by 0.2 = 5

Thus, this study suggests that only five patients need to be treated with the drug to prevent one death (compared with placebo).
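The same calculation can be expressed as a minimal Python sketch, starting from the event counts in the example trial. The function name is hypothetical.

```python
def number_needed_to_treat(control_events, control_n, treated_events, treated_n):
    """NNT = 1 / ARR, where ARR = control event rate minus treatment event rate."""
    arr = control_events / control_n - treated_events / treated_n
    return 1 / arr

# Example trial: 20 of 50 deaths with placebo, 10 of 50 deaths with active drug
nnt = number_needed_to_treat(20, 50, 10, 50)
print("NNT:", round(nnt, 1))
```

The function assumes that the event rate is lower in the treatment arm; if the two rates are equal, the ARR is zero and the NNT is undefined.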

Because it is intuitive, the NNT is a popular way to express absolute benefit or risk, potentially allowing for comparison of the relative benefit (or harm) of different interventions. However, the NNT can be misleading:

●It implies that the option is to treat or not to treat, rather than to treat or switch to another more effective treatment [1].

●There are variations on how NNT is determined; NNTs from different studies cannot be compared unless the methods used to determine them are identical [2]. This may be a particular consideration when NNTs are calculated for treatment of chronic diseases in which outcomes (such as mortality) do not cluster in time.

●Calculation of the NNT depends upon the control rate (ie, the rate of events in the control arm). The control rate can be variable (particularly in small controlled trials, which are more vulnerable to random effects). As a result, the NNT may not accurately reflect the benefit of an intervention if events occurred in the control arm more or less than would be expected based upon the biology of the disease. This effect can be particularly problematic when comparing the NNTs among placebo-controlled trials (figure 3) [3].

When the outcome is a harm rather than a benefit, a number needed to harm (NNH) can be calculated similarly. (See 'Number needed to harm' below.)

Other variations that sometimes appear in the medical literature include number needed to prevent and number needed to diagnose.

Number needed to harm — NNH is a measure of harm caused by the investigational treatment. Like the NNT, the NNH is the reciprocal of the ARD, which in this case, is an increase rather than reduction (ie, the event rate in the treatment arm minus the event rate in the control arm). The NNH can be interpreted as follows: "This study suggests that treating 20 patients with the investigational treatment would result in one additional adverse event compared with the control treatment."

As an example, consider a randomized trial comparing an investigational new drug versus the current standard treatment for a certain condition. Adverse drug reactions occurred in 20 percent of patients treated with the new drug compared with 15 percent with the standard therapy. Thus, the ARD is 5 percent (20 minus 15), and the NNH is 20 (1 divided by 0.05). This means that for every 20 patients treated with the new drug, there would be one additional adverse drug reaction compared with standard therapy.
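For comparison with the NNT, the NNH in this example can be sketched in Python as follows. The function name is hypothetical; note that the subtraction is reversed relative to the NNT because the event rate is higher in the treatment arm.

```python
def number_needed_to_harm(treated_rate, control_rate):
    """NNH = 1 / ARD, where ARD = treatment event rate minus control event rate."""
    return 1 / (treated_rate - control_rate)

# Adverse drug reactions: 20 percent with the new drug, 15 percent with standard therapy
nnh = number_needed_to_harm(0.20, 0.15)
print("NNH:", round(nnh))
```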

TERMS USED TO DESCRIBE RELIABILITY OF MEASUREMENTS — Reliability refers to the extent to which repeated measurements of a relatively stable phenomenon fall close to each other. Several different types of reliability can be measured, such as inter- and intraobserver reliability and test-retest reliability.

●**Kappa statistic** – The kappa statistic is the most commonly used measure for assessing interobserver agreement. It can range from -1.0 to +1.0. If there is perfect agreement, the value is 1.0, whereas if the observed agreement is what would be expected by chance alone, the value is 0. If the degree of agreement is worse than what would be expected by chance, the kappa value will be negative, with complete disagreement resulting in a value of -1.0. Kappa statistics are often interpreted as:

•Excellent agreement – 0.8 to 1.0

•Good agreement – 0.6 to 0.8

•Moderate agreement – 0.4 to 0.6

•Fair agreement – 0.2 to 0.4

•Poor agreement – Less than 0.2
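As an illustrative sketch, the kappa statistic for two observers making a yes/no judgment can be computed as the observed agreement minus the agreement expected by chance, divided by 1 minus the agreement expected by chance. This assumes the standard Cohen's kappa for two raters, which is not spelled out above; the counts and function name are hypothetical.

```python
def cohen_kappa(both_yes, a_yes_b_no, a_no_b_yes, both_no):
    """Cohen's kappa for two observers (A and B) rating the same cases yes/no."""
    n = both_yes + a_yes_b_no + a_no_b_yes + both_no
    observed = (both_yes + both_no) / n          # proportion of cases with agreement
    a_yes = (both_yes + a_yes_b_no) / n          # how often observer A says yes
    b_yes = (both_yes + a_no_b_yes) / n          # how often observer B says yes
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)  # agreement by chance
    return (observed - expected) / (1 - expected)

# Hypothetical readings of 100 radiographs by two observers
kappa = cohen_kappa(both_yes=40, a_yes_b_no=10, a_no_b_yes=5, both_no=45)
print("kappa:", round(kappa, 2))
```

In this example, the observers agree on 85 percent of cases, but half of that agreement would be expected by chance, giving a kappa in the "good agreement" range.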

MEASURES OF DIAGNOSTIC TEST PERFORMANCE — The most common terms used to describe the performance of a diagnostic test are sensitivity and specificity.

Sensitivity and specificity

●**Sensitivity** is the number of patients with a positive test who have a disease divided by all patients who have the disease (table 1). A test with high sensitivity will not miss many patients who have the disease (ie, few false-negative results).

●**Specificity** is the number of patients who have a negative test and do not have the disease divided by the number of patients who do not have the disease. A test with high specificity will infrequently identify patients as having a disease when they do not (ie, few false-positive results).
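The two definitions above can be sketched in Python from the four cells of a 2x2 table. The counts are hypothetical, and the function names are used only for illustration.

```python
def sensitivity(true_pos, false_neg):
    """Positive tests among all patients who have the disease."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Negative tests among all patients who do not have the disease."""
    return true_neg / (true_neg + false_pos)

# Hypothetical study: 100 patients with the disease, 200 without
sens = sensitivity(true_pos=90, false_neg=10)
spec = specificity(true_neg=160, false_pos=40)
print("sensitivity:", sens, "specificity:", spec)
```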

Sensitivity and specificity are properties of tests that should be considered when tests are obtained. In addition, sensitivity and specificity are interdependent. Thus, for a given test, an increase in sensitivity is accompanied by a decrease in specificity and vice versa. (See "Evaluating diagnostic tests", section on 'Balancing sensitivity and specificity'.)

For example, consider two populations of patients: One has chronic hepatitis as defined by a reference standard such as a liver biopsy, and the other does not. The diagnostic test being used to evaluate for chronic hepatitis is the serum alanine aminotransferase (ALT) concentration. The sensitivity and specificity of the ALT depend upon the value chosen as a cutoff (figure 4).

The interdependence of sensitivity and specificity can be depicted graphically using a receiver operating characteristic (ROC) curve, as summarized in the figure (figure 5) and discussed in detail separately. (See "Evaluating diagnostic tests", section on 'Receiver operating characteristic curves'.)

Predictive values — In addition to sensitivity and specificity, the predictive values of a diagnostic test must be considered when interpreting the results of a test (calculator 1).

●The **positive predictive value (PPV)** of a test represents the likelihood that a patient with a positive test has the disease

●The **negative predictive value (NPV)** represents the likelihood that a patient who has a negative test is free of the disease (table 1)

The PPV and NPV depend upon the prevalence of a disease within a population. Thus, for given values of sensitivity and specificity, a patient with a positive test is more likely to truly have the disease if the patient belongs to a population with a high prevalence of the disease (figure 6).

This observation has significant implications for screening tests, in which false-positive results may lead to expensive and sometimes dangerous testing and false-negative tests may be associated with morbidity or mortality. As an example, a positive stool test for occult blood is much more likely to predict colon cancer in a 70-year-old compared with a 20-year-old. Thus, routine screening of stools in young patients would lead to a high rate of subsequent false-positive examinations and is not recommended. The predictive values of a test should be considered when selecting among diagnostic tests for an individual patient in whom demographic or other clinical risk factors influence the likelihood that the disease is present (ie, the "prior probability" of the disease).
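The dependence of predictive values on prevalence can be illustrated with a minimal Python sketch. The sensitivity, specificity, and prevalence values are hypothetical, and the function names are used only for illustration.

```python
def positive_predictive_value(sens, spec, prevalence):
    """Probability of disease given a positive test."""
    true_pos = sens * prevalence
    false_pos = (1 - spec) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

def negative_predictive_value(sens, spec, prevalence):
    """Probability of no disease given a negative test."""
    true_neg = spec * (1 - prevalence)
    false_neg = (1 - sens) * prevalence
    return true_neg / (true_neg + false_neg)

# Same hypothetical test (90 percent sensitive and specific), different populations
for prevalence in (0.01, 0.10, 0.50):
    ppv = positive_predictive_value(0.9, 0.9, prevalence)
    npv = negative_predictive_value(0.9, 0.9, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

With these assumed values, fewer than 1 in 10 positive results reflects true disease when the prevalence is 1 percent, whereas 9 in 10 do when the prevalence is 50 percent.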

Likelihood ratio — As discussed above, a limitation to predictive values as expressions of test characteristics is their dependence upon disease prevalence. To overcome this limitation, the likelihood ratio has been used as an expression of the performance of diagnostic tests [4]. Likelihood ratios are an expression of sensitivity and specificity that can be used to estimate the odds that a condition is present or absent (calculator 2). (See "Evaluating diagnostic tests", section on 'What are the positive and negative likelihood ratios?'.)

The likelihood ratio represents a measure of the odds of having a disease relative to the prior probability of the disease. The estimate is independent of the disease prevalence. A positive likelihood ratio is calculated by dividing sensitivity by 1 minus specificity: sensitivity/(1-specificity). Similarly, a negative likelihood ratio is calculated by dividing 1 minus sensitivity by specificity: (1-sensitivity)/specificity. Positive and negative likelihood ratios of 9 and 0.25, for example, can be interpreted as meaning that a positive result is seen 9 times as frequently, and a negative result 0.25 times as frequently, in those with a specific condition as in those without it. Likelihood ratios can be established for many cutoff points for a diagnostic test, permitting an appreciation for the relative importance of a large versus small increase in a test result.
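The two formulas above, and one common way of applying them, can be sketched in Python as follows. The sensitivity, specificity, and pretest probability are hypothetical. The conversion step assumes the standard relationship in which the posttest odds equal the pretest odds multiplied by the likelihood ratio, which is not spelled out above.

```python
def likelihood_ratios(sens, spec):
    """Return (positive LR, negative LR) from sensitivity and specificity."""
    return sens / (1 - spec), (1 - sens) / spec

def posttest_probability(pretest_probability, likelihood_ratio):
    """Convert probability to odds, apply the LR, and convert back."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)

# Hypothetical test with 90 percent sensitivity and 80 percent specificity
lr_pos, lr_neg = likelihood_ratios(0.9, 0.8)
print("LR+:", round(lr_pos, 2), "LR-:", round(lr_neg, 3))

# Hypothetical patient with a 20 percent pretest probability of disease
print("after a positive test:", round(posttest_probability(0.20, lr_pos), 2))
print("after a negative test:", round(posttest_probability(0.20, lr_neg), 2))
```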

Accuracy — The performance of a diagnostic test is sometimes expressed as accuracy, which refers to the number of true positives and true negatives divided by the total number of observations (table 2). However, accuracy by itself is not a good indicator of test performance, since it obscures important information related to its component parts. (See "Evaluating diagnostic tests", section on 'Accuracy and precision'.)

TERMS USED WHEN MAKING INFERENCES ABOUT DATA

Confidence interval — A point estimate (ie, a single value) from a sample population may not reflect the "true" value from the entire population. As a result, it is often helpful to provide a range that is likely to include the true value. A confidence interval (CI) is commonly used for this purpose. The boundaries of a CI give values within which there is a high probability (95 percent by convention) that the true population value can be found. The CIs are calculated based upon the number of observations and the standard deviation (SD) of the data. As the number of observations increases or the variance (dispersion) of the data decreases, the CIs become narrower. The interpretation of CIs is discussed in more detail separately. (See "Hypothesis testing in clinical research: Proof, p-values, and confidence intervals", section on 'Confidence intervals'.)
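As a minimal sketch of how the number of observations and the SD determine the width of a CI, the following Python example computes a CI for a mean using a normal approximation. This is an assumption made for simplicity; small samples are usually handled with the t distribution instead. The summary values and function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sd, n, level=0.95):
    """Normal-approximation CI for a mean: mean +/- z * SD / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95 percent CI
    half_width = z * sd / sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

# Hypothetical mean systolic pressure of 128 mm Hg with an SD of 12 mm Hg
for n in (36, 144):
    low, high = mean_confidence_interval(128, 12, n)
    print(f"n = {n}: 95% CI {low:.1f} to {high:.1f}")
```

Quadrupling the number of observations halves the width of the interval in this example.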

Credible interval — A credible interval is used in Bayesian analysis to describe the range in which a posterior probability estimate is likely to reside. As an example, a 95 percent credible interval for a posterior probability estimate of 40 percent could range from 30 to 50 percent, indicating that there is a 95 percent chance that the true posterior probability estimate lies within the 30 to 50 percent range. There are fundamental differences in how credible intervals are derived compared with the more commonly used CIs. Nevertheless, their intuitive interpretation is similar.

Errors — Two potential errors are commonly recognized when testing a hypothesis:

●A type I error (also referred to as an "alpha error") is incorrectly concluding that there is a statistically significant difference in a dataset; the probability of making a type I error is called "alpha." A typical value for alpha is 0.05. Thus, a p<0.05 leads to a decision to reject the null hypothesis, although lower values for claiming statistical significance have been proposed [5]. (See "Hypothesis testing in clinical research: Proof, p-values, and confidence intervals".)

●A type II error (also referred to as a "beta error") is incorrectly concluding that there was no statistically significant difference in a dataset; the probability of making a type II error is called "beta." This error often reflects insufficient power of the study.

Power — The term "power" (calculated as 1 – beta) refers to the ability of a study to detect a true difference. Negative findings in a study may reflect that the study was underpowered to detect a difference. A "power calculation" should be performed prior to conducting a study to be sure that there is a sufficient number of observations to detect a desired degree of difference. The larger the difference, the fewer the number of observations that will be required. As an example, it takes fewer patients to detect a 50 percent difference in blood pressure from a new antihypertensive medication compared with placebo than a 5 percent difference. The interpretation of power calculations is discussed in more detail separately. (See "Hypothesis testing in clinical research: Proof, p-values, and confidence intervals", section on 'Power'.)
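The relationship between the size of the difference and the number of observations can be illustrated with a simplified power calculation. The sketch below assumes a two-sided comparison of two means with equal group sizes, a known common SD, and a normal approximation; actual trials often require more detailed methods. The function name and input values are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(difference, sd, alpha=0.05, power=0.80):
    """Approximate n per group: 2 * ((z_alpha + z_beta) * SD / difference) ** 2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided type I error
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta
    return ceil(2 * ((z_alpha + z_beta) * sd / difference) ** 2)

# Hypothetical antihypertensive trial with an SD of 10 mm Hg
print("to detect a 10 mm Hg difference:", sample_size_per_group(10, 10))
print("to detect a 5 mm Hg difference:", sample_size_per_group(5, 10))
```

Under these assumptions, halving the difference to be detected roughly quadruples the required number of patients per group.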

MULTIVARIATE ANALYSIS — It is often necessary to consider the effects of multiple variables together when predicting an outcome. As an example, when assessing the risk of lung cancer, the effects of smoking, age, and other exposures (occupational exposures, prior radiation, etc) need to be simultaneously considered.

Statistical methods that can simultaneously account for multiple variables are known as "multivariate" (or multivariable) analysis. These methods help to "control" (or "adjust") for variables that are extraneous to the main question and might confound it. Commonly encountered forms of multivariable analysis include:

●**Logistic regression**, which is used in models assessing dichotomous outcomes (eg, alive versus dead, having a disease versus not having it)

●**Linear regression**, which is used in models assessing continuous outcomes (eg, blood pressure)

TIME-TO-EVENT ANALYSIS (SURVIVAL ANALYSIS) — Many examples of medical research deal with an event that may or may not occur in a given period of time (such as death, stroke, myocardial infarction). During the study, several outcomes are possible in addition to the outcome of interest (eg, patients might die of other causes or drop out from the analysis). Furthermore, the duration of follow-up can vary among individuals in the study. A patient who is observed for five years should count more in the statistical analysis than one observed for five months.

Several methods are available to account for these considerations. The most commonly used methods in medical research are Kaplan-Meier and Cox proportional hazards analyses.

Kaplan-Meier analysis — Kaplan-Meier analysis calculates the proportion of patients surviving (or free from an outcome) among the total number of patients at risk for the outcome. Every time a patient has an outcome, the proportion is recalculated. Using these calculations, a curve can be generated that graphically depicts the probability of survival as time passes (figure 7).
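The recalculation described above can be sketched in Python as a simple product-limit estimate, in which the survival probability is multiplied at each event time by the fraction of at-risk patients who survive that time. The follow-up data and function name are hypothetical; patients who leave the study without the outcome are treated as censored and simply stop contributing to the number at risk.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time with at least one event.

    times: follow-up time for each patient
    events: True if the outcome occurred at that time, False if censored
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        at_risk = sum(1 for x in times if x >= t)  # still under observation at t
        outcomes = sum(1 for x, e in zip(times, events) if x == t and e)
        if outcomes:
            survival *= (at_risk - outcomes) / at_risk
            curve.append((t, survival))
    return curve

# Hypothetical follow-up (months) for eight patients
times = [2, 3, 3, 5, 8, 8, 9, 12]
events = [True, True, False, True, False, True, True, False]

for t, s in kaplan_meier(times, events):
    print(f"month {t}: survival {s:.3f}")
```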

In many studies, the benefit of a drug or intervention on an outcome is compared with a control population, permitting the construction of two or more Kaplan-Meier curves. Curves that are close together or cross are unlikely to reflect a statistically significant difference. Several formal statistical tests can be used to assess a significant difference. Examples include the log-rank test and the Breslow test.

Cox proportional hazards analysis — Cox proportional hazards analysis is similar to logistic regression because it can account for many variables that are relevant for predicting a dichotomous outcome. However, unlike logistic regression, Cox proportional hazards analysis permits time to be included as a variable and for patients to be counted only for the period of time in which they were observed. The summary effect estimate generated by this type of analysis is the hazard ratio (HR). (See 'Hazard ratio' above.)

STUDY DESIGNS

Cohort study — A cohort is a clearly identified group of people to be studied. A cohort study might identify persons specifically because they were or were not exposed to a risk factor or by taking a random sample of a given population. A cohort study can then move forward to observing the outcome of interest, even if the data are collected retrospectively. As an example, a group of patients who have variable exposure to a risk factor of interest can be followed over time for an outcome.

The Nurses' Health Study is an example of a cohort study. A large number of nurses are followed over time for an outcome such as colon cancer, providing an estimate of the risk of colon cancer in this population. In addition, dietary intake of various components can be assessed, and the risk of colon cancer in those with high and low intake of fiber can be evaluated to determine if fiber is a risk factor (or a protective factor) for colon cancer. The relative risk (RR) of colon cancer in those with high or low fiber intakes can be calculated from such a cohort study. (See 'Relative risk' above.)

Case-control study — A case-control study starts with the outcome of interest and works backward to the exposure. For instance, patients with a disease are identified and compared with controls for exposure to a risk factor. This design does not permit measurement of the proportion of the population who were exposed to the risk factor and then developed or did not develop the disease; thus, the RR or the incidence of disease cannot be calculated. However, in case-control studies, the odds ratio (OR) provides a reasonable estimate of the RR (figure 2). (See 'Odds ratio' above.)

If one were to perform a case-control study to assess the role of dietary fiber in colon cancer as noted above for the cohort study, a group of patients with colon cancer could be compared with matched controls without colon cancer; the fiber intake in the two groups would then be compared. The case-control study is most useful for uncommon diseases in which a very large cohort would be required to accumulate enough cases for analysis.

Randomized controlled trial — A randomized controlled trial (RCT) is an experimental design in which patients are assigned to two or more interventions. Often, one group of patients is assigned to a placebo, but a randomized trial can involve two active therapies (active control).

RCTs are generally the best evidence for proving causality because randomization is the most effective method to minimize bias and confounding, particularly if the trial is well powered (ie, large number of patients). (See "Hypothesis testing in clinical research: Proof, p-values, and confidence intervals".)

A central principle in minimizing risk of bias in RCTs is that trial participants should be analyzed according to the groups in which they were randomized, even if they did not receive or comply with treatment. This is called an "intention-to-treat" (ITT) analysis. The advantage of ITT analysis is that it preserves randomization (ie, assuring that all of the unmeasured factors that could differ in the treatment and control groups remain accounted for in the analysis). For example, it is possible that patients who complied with treatment differed in some important ways from those who did not. Another way to consider the advantage of ITT analysis is that it better accounts for factors that can influence the outcomes of a prescribed treatment, not just the effects on those who adhered to it. For instance, a drug that is highly effective but has serious side effects might look favorable in an "as-treated" analysis but less favorable in an ITT analysis if the majority of patients stopped taking it.

However, an "as-treated" (or "per-protocol") analysis, in which subjects are analyzed according to the actual treatment that they received, can be useful when assessing adverse effects of a treatment. In this case, the ITT analysis may yield an underestimate.

Systematic review and meta-analysis — A systematic review is a comprehensive summary of all available evidence that meets predefined eligibility criteria to address a specific clinical question or range of questions. Meta-analysis, which is commonly included in systematic reviews, is a statistical method that quantitatively combines the results from different studies. Terms used in these types of studies are explained in a separate topic review. (See "Systematic review and meta-analysis", section on 'Glossary of terms'.)

1. Moriarty PM. Relative risk reduction versus number needed to treat as measures of lipid-lowering trial results. Am J Cardiol 1998; 82:505.
2. Lubsen J, Hoes A, Grobbee D. Implications of trial results: the potentially misleading notions of number needed to treat and average duration of life gained. Lancet 2000; 356:1757.
3. de Craen AJ, Vickers AJ, Tijssen JG, Kleijnen J. Number-needed-to-treat and placebo-controlled trials. Lancet 1998; 351:310.
4. Weissler AM. A perspective on standardizing the predictive power of noninvasive cardiovascular tests by likelihood ratio computation: 1. Mathematical principles. Mayo Clin Proc 1999; 74:1061.
5. Ioannidis JPA. The Proposal to Lower P Value Thresholds to .005. JAMA 2018; 319:1429.

Topic 2759 Version 25.0
