Glossary of common biostatistical and epidemiological terms

Literature review current through: May 2024.
This topic last updated: May 02, 2024.

INTRODUCTION — This topic provides definitions of common biostatistical and epidemiological terms encountered in the medical literature.

Related topics include:

(See "Evidence-based medicine".)

(See "Hypothesis testing in clinical research: Proof, p-values, and confidence intervals".)

(See "Evaluating diagnostic tests".)

(See "Real-world evidence in health care".)

TERMS USED TO DESCRIBE STUDY DESIGN — Primary research involves collecting data from individuals or groups of individuals. The appropriate study design depends on the question being investigated (figure 1). Questions regarding benefits and harms of a treatment or intervention are best answered with randomized controlled trials (RCTs), whereas questions regarding risk factors for disease and prognosis are best answered with prospective cohort studies. (See "Evidence-based medicine", section on 'Categories of evidence'.)

Observational studies — Observational studies are those in which health outcomes are assessed in patients receiving interventions (eg, medications, procedures, diagnostic tests, medical devices, other medical products) as part of their routine medical care [1]. Participants are not assigned to a specific intervention according to an investigator-specified research protocol (as in a clinical trial), thus the term "noninterventional study" is sometimes used to describe observational studies. Observational studies represent a broad category, and there are many different types of study design within this category (eg, case series, case-control study, cohort study). (See 'Case series' below and 'Case-control study' below and 'Cohort study' below.)

Case series — A case series is a description of the characteristics and outcomes of a group of individuals with either a condition or an exposure (eg, patients who received a specific treatment or procedure) over a period of time. Data may be collected retrospectively or prospectively. There is no control group, and the objective is merely to describe the population and outcomes, rather than compare risks across groups. It is not possible to draw inferences about causality from case series given the lack of control group.

Case-control study — A case-control study is a type of observational (or noninterventional) study that provides a method for examining a potential relationship between an outcome and a hypothesized causal factor. For example, patients with a disease are identified and compared with controls who are similar to the cases but do not have the disease. The researchers then explore various exposures that are present in the cases but not the controls to understand possible associations.

This design does not permit measurement of the proportion of the population who were exposed to the risk factor and then developed or did not develop the disease; thus, the incidence of disease and the relative risk (RR) cannot be calculated. However, the odds ratio (OR) can be calculated and may provide a reasonable estimate of the RR (figure 2). (See 'Odds ratio' below.)

The case-control study design can be useful for studying uncommon diseases, for which a cohort study would require observing a large group of people over an extended time to accumulate enough cases for analysis.

Cohort study — A cohort study is a type of observational (or noninterventional) study in which participants do not have the outcome(s) of interest at the outset of the study. Participants are followed over time and assessed for development of the outcome(s) of interest. A cohort study might identify persons specifically because they were or were not exposed to a risk factor or by taking a random sample of a given population. A cohort study may follow participants prospectively over time (prospective cohort study) or it may collect data retrospectively (retrospective cohort study).

The Nurses' Health Study is an example of a prospective cohort study. A large number of nurses were enrolled in the study and followed over time for development of various health outcomes. For example, the study recorded the number of individuals diagnosed with colon cancer, providing an estimate of the incidence of colon cancer in this population. In addition, participants provided information regarding their dietary fiber intake, and the risk of colon cancer in those with high- and low-fiber diets was evaluated. The RR of colon cancer in those with a high-fiber diet versus a low-fiber diet can be calculated to determine whether fiber is a risk factor (or a protective factor) for colon cancer. (See 'Relative risk' below.)

Real-world evidence — Real-world evidence (RWE) is evidence about the usage and potential benefits or risks of a medical product derived from analysis of real-world data. Real-world data are data relating to patient health status and/or the delivery of health care, routinely collected from a variety of sources. Data sources frequently used for RWE are summarized in the table (table 1). Most RWE studies are observational (or noninterventional) in nature, and the terms RWE and observational studies are commonly used interchangeably. (See "Real-world evidence in health care".)

Randomized controlled trial — An RCT is a prospective experimental study designed to examine cause and effect relationships between an intervention and outcome. Participants are randomly assigned to one of two or more interventions (called treatment arms). Often, one treatment arm consists of placebo or other control. However, some randomized trials involve two active therapies (active control).

RCTs are generally the best evidence for proving causality because randomization is the most effective method to minimize bias and confounding, particularly if the trial is well powered (ie, large number of patients). (See "Hypothesis testing in clinical research: Proof, p-values, and confidence intervals", section on 'Explanation for the results of a study'.)

A central principle in minimizing risk of bias in RCTs is that trial participants should be analyzed according to the groups to which they were randomized, even if they did not receive or comply with treatment. This is called an "intention-to-treat" (ITT) analysis. The advantage of ITT analysis is that it preserves randomization (ie, ensuring that unmeasured factors that could differ between the treatment and control groups remain accounted for in the analysis). For example, it is possible that patients who complied with treatment differed in some important way from those who did not. Another way to consider the advantage of ITT analysis is that it better accounts for factors that can influence the outcomes of a prescribed treatment, not just the effects on those who adhered to it. For example, a drug that is highly effective but has serious side effects might look favorable in an "as-treated" analysis but less favorable in an ITT analysis if most patients stopped taking it.

However, an "as-treated" (or "per-protocol") analysis, in which participants are analyzed according to the actual treatment that they received, can be useful when assessing adverse effects of a treatment. In this case, the ITT analysis may yield an underestimate of the harms of therapy.

More detailed discussions regarding assessment of risk of bias in RCTs and meta-analyses of RCTs are provided separately. (See "Evidence-based medicine", section on 'Internal validity' and "Systematic review and meta-analysis", section on 'Risk of bias assessment'.)

Systematic review — A systematic review is a comprehensive summary of all available evidence that meets predefined eligibility criteria to address a specific clinical question or range of questions. (See "Systematic review and meta-analysis", section on 'Systematic review'.)

Meta-analysis — Meta-analysis, which is commonly included in systematic reviews, is a statistical method that quantitatively pools the results of different studies. Additional details are provided separately. (See "Systematic review and meta-analysis", section on 'Meta-analysis'.)

Network meta-analysis — Network meta-analysis is a type of methodology used to simultaneously evaluate different interventions across multiple trials, including direct and indirect pairwise comparisons. A schematic representation of a network diagram is shown in the figure (figure 3). In practice, some network diagrams are far more complex (figure 4). Additional details are provided separately. (See "Systematic review and meta-analysis", section on 'Network meta-analysis'.)

BIOSTATISTICAL TERMS

Statistics describing distribution of data

Measures of central tendency — Three measures of central tendency are most frequently used to describe data:

Mean – The mean equals the sum of values divided by the number of values.

Median – The median is the middle value when all values are ordered from smallest to largest; when there are an even number of values, the median is defined as the mean of the middle two data points.

Mode – The mode is the value that occurs most frequently.

Measures of dispersion — Dispersion refers to the degree to which data are scattered around a specific value (such as the mean). The most commonly used measures of dispersion are:

Range – The range equals the difference between the largest and smallest observation.

Standard deviation (SD) – The SD measures the variability of data around the mean. It provides information on how much variability can be expected among individuals within a population. In samples that follow a "normal" distribution (ie, Gaussian), 68 and 95 percent of values fall within one and two SDs of the mean, respectively.

Standard error of the mean (SEM) – The SEM describes how much variability can be expected when measuring the mean from several different samples.

Percentile – The percentile is the percentage of a distribution that is below a specific value. As an example, a child is in the 90th percentile for weight if only 10 percent of children the same age weigh more than they do.

Interquartile range – The interquartile range refers to the upper and lower values defining the central 50 percent of observations. The boundaries are equal to the observations representing the 25th and 75th percentiles. The interquartile range is depicted in a box and whiskers plot (figure 5).
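
The measures of central tendency and dispersion defined above can be illustrated with a minimal Python sketch (standard library only; the sample values are hypothetical and chosen only for illustration):

```python
import statistics

values = [2, 3, 3, 5, 7, 8, 9, 12, 13, 15]    # hypothetical sample

mean = statistics.mean(values)                 # sum of values / number of values
median = statistics.median(values)             # middle value (mean of the middle two here)
mode = statistics.mode(values)                 # most frequent value

value_range = max(values) - min(values)        # largest minus smallest observation
sd = statistics.stdev(values)                  # sample standard deviation
sem = sd / len(values) ** 0.5                  # standard error of the mean = SD / sqrt(n)

q1, _, q3 = statistics.quantiles(values, n=4)  # 25th, 50th, and 75th percentile cut points
iqr = (q1, q3)                                 # boundaries of the interquartile range

print(mean, median, mode, value_range, round(sd, 2), round(sem, 2), iqr)
```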

Statistics describing rates of disease or events

Incidence — Incidence represents the number of new events that have occurred in a specific time interval divided by the population at risk at the beginning of the time interval. For example, the annual incidence of colon cancer would be reported as the number of new cases per 100,000 people per year.

Prevalence — Prevalence refers to the number of individuals with a given disease at a given point in time divided by the population at risk at that point in time. Prevalence has been further defined as being "point" or "period." Point prevalence refers to the proportion of individuals with a condition at a specified point in time, while period prevalence refers to the proportion of individuals with a condition during a specified interval (eg, a year). For example, the point prevalence of colon cancer in 2022 would be reported as the number of people living with colon cancer per 100,000 people.

Statistics describing effect sizes

Relative risk — The relative risk (RR; also called risk ratio) equals the incidence in exposed individuals divided by the incidence in unexposed individuals (figure 2). The RR can be calculated from studies in which the proportion of patients exposed and unexposed to a risk is known, such as a cohort study. (See 'Cohort study' above.)

RR is also commonly used to describe effect sizes in randomized trials, in which case, the RR equals the proportion of patients who had the outcome in the treatment arm divided by the proportion of patients who had the outcome in the control arm. (See 'Randomized controlled trial' above.)
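
As a brief, hedged sketch (the counts are hypothetical and used only for illustration), the RR can be computed from cohort-style data as follows:

```python
# Hypothetical cohort: 1000 exposed and 1000 unexposed participants
exposed_with_outcome, exposed_total = 30, 1000
unexposed_with_outcome, unexposed_total = 10, 1000

incidence_exposed = exposed_with_outcome / exposed_total        # 0.03
incidence_unexposed = unexposed_with_outcome / unexposed_total  # 0.01

relative_risk = incidence_exposed / incidence_unexposed
print(f"RR = {relative_risk:.1f}")   # RR = 3.0
```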

Odds ratio — The odds ratio (OR) equals the odds that an individual with a specific condition has been exposed to a risk factor divided by the odds that a control has been exposed. The OR is used in case-control studies (see 'Case-control study' above). In addition, multivariate analyses often generate ORs and, therefore, other types of studies may report effect sizes using ORs. The OR provides a reasonable estimate of the RR if the outcome is uncommon, but it will tend to overestimate the effect size if the outcome is more common (figure 2).
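
The following sketch (again with hypothetical counts) shows the OR calculation and illustrates why the OR approximates the RR only when the outcome is uncommon:

```python
def odds_ratio(a, b, c, d):
    # a = exposed cases, b = exposed non-cases, c = unexposed cases, d = unexposed non-cases
    return (a / b) / (c / d)              # equivalently (a * d) / (b * c)

def relative_risk(a, b, c, d):
    return (a / (a + b)) / (c / (c + d))

# Uncommon outcome: the OR closely approximates the RR
print(odds_ratio(20, 980, 10, 990), relative_risk(20, 980, 10, 990))      # ~2.02 vs 2.0

# Common outcome: the OR overestimates the effect size relative to the RR
print(odds_ratio(400, 600, 200, 800), relative_risk(400, 600, 200, 800))  # ~2.67 vs 2.0
```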

Hazard ratio — A hazard ratio (HR) is the effect estimate generated by a time-to-event analysis. (See 'Time-to-event analysis (survival analysis)' below.)

The HR is analogous to an OR. Thus, an HR of 10 means that a group of patients exposed to a specific risk factor has 10 times the chance of developing the outcome compared with unexposed controls.

Interpretation of relative effect estimates — The RR, OR, and HR are interpreted relative to the number 1. An RR of 0.6, for example, indicates that patients with the exposure of interest were 40 percent less likely to develop a specific outcome compared with the control group. Similarly, an RR of 1.5 indicates they were 50 percent more likely to have the outcome.

Absolute risk difference — The absolute risk difference (ARD; or absolute risk reduction [ARR]) is the event rate in the control group minus the event rate in the treatment/exposed group.

Often, the ARD is easier to explain and of greater interest to patients than are RRs, ORs, or HRs. However, the ARD varies depending on baseline risk. If the baseline risk is low, the ARD may not be clinically important despite a large RR reduction. For example, consider a therapy that reduces the risk of myocardial infarction (MI) with an RR of 0.5. If the baseline rate of MI is 40 percent, the ARD would be 20 percent, which is clearly a clinically meaningful reduction. However, if the baseline rate is only 1 percent, the ARD would be 0.5 percent, which may not be a clinically important difference.
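
A short sketch of the MI example above, showing how the same RR yields very different absolute risk differences at different baseline risks:

```python
def absolute_risk_difference(baseline_risk, relative_risk):
    treated_risk = baseline_risk * relative_risk
    return baseline_risk - treated_risk

rr = 0.5  # therapy halves the risk of MI
print(absolute_risk_difference(0.40, rr))  # 0.20 -> 20 percent, clinically meaningful
print(absolute_risk_difference(0.01, rr))  # 0.005 -> 0.5 percent, possibly not meaningful
```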

Number needed to treat — The number needed to treat (NNT) is the reciprocal of the ARR (which is the event rate in the control arm minus the event rate in the treatment arm). A trial for a therapy that reduces mortality with an NNT of 5 can be interpreted as follows: "The findings suggest that for every five patients treated with the new treatment, one additional death would be prevented compared with the control treatment."

As an example, consider a clinical trial involving 100 patients randomized to treatment with a new drug or placebo, with 50 patients in each arm. Thirty patients died during the study period (10 receiving the new drug and 20 receiving placebo), yielding a mortality rate of 20 percent in the new drug group versus 40 percent in the placebo arm, as shown in the left panel of the figure (figure 6). The ARR between the two treatment arms is used to calculate NNT.

ARR = 40 percent minus 20 percent = 20 percent = 0.2

NNT = 1 divided by ARR = 1 divided by 0.2 = 5

Thus, this study suggests that only five patients need to be treated with the drug to prevent one death (compared with placebo).
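
The same arithmetic can be expressed as a brief sketch, using the hypothetical trial counts above:

```python
def number_needed_to_treat(control_event_rate, treatment_event_rate):
    arr = control_event_rate - treatment_event_rate   # absolute risk reduction
    return 1 / arr

# 50 patients per arm: 20 deaths with placebo, 10 deaths with the new drug
control_rate = 20 / 50     # 0.40
treatment_rate = 10 / 50   # 0.20
print(number_needed_to_treat(control_rate, treatment_rate))   # 5.0
```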

Because it is intuitive, the NNT is a popular way to express absolute effect size, potentially allowing for comparison of the relative benefit (or harm) of different interventions. However, the NNT can be misleading:

It implies that the option is to treat or not to treat, rather than to treat or switch to another more effective treatment [2].

There are variations on how NNT is determined; NNTs from different studies cannot be compared unless the methods used to determine them are identical [3]. This may be a particular consideration when NNTs are calculated for treatment of chronic diseases in which outcomes (such as mortality) do not cluster in time.

Calculation of the NNT depends upon the control rate (ie, the rate of events in the control arm). The control rate can be variable (particularly in small controlled trials, which are more vulnerable to random effects). As a result, the NNT may not accurately reflect the benefit of an intervention if events occurred in the control arm more or less than would be expected based upon the biology of the disease. This effect can be particularly problematic when comparing the NNTs among placebo-controlled trials (figure 6) [4].

When the outcome is a harm rather than a benefit, a number needed to harm (NNH) can be calculated similarly. (See 'Number needed to harm' below.)

Other variations that sometimes appear in the medical literature include number needed to prevent and number needed to diagnose.

Number needed to harm — NNH is a measure of harm caused by the investigational treatment. Like the NNT, the NNH is the reciprocal of the ARD, which, in this case, is an increase rather than a reduction (ie, the event rate in the treatment arm minus the event rate in the control arm). A trial for a therapy that has adverse effects with an NNH of 20 can be interpreted as follows: "The findings suggest that treating 20 patients with the investigational treatment would result in one additional adverse event compared with the control treatment."

As an example, consider a randomized trial comparing an investigational new drug versus the current standard treatment for a certain condition. Adverse drug reactions occurred in 20 percent of patients treated with the new drug compared with 15 percent with the standard therapy. Thus, the ARD is 5 percent (20 minus 15), and the NNH is 20 (1 divided by 0.05). This means that for every 20 patients treated with the new drug, there would be one additional adverse drug reaction compared with standard therapy.

Terms describing types of analyses

Multivariate analysis — It is often necessary to consider the effects of multiple variables together when predicting an outcome. As an example, when assessing the risk of lung cancer, the effects of smoking, age, and other exposures (occupational exposures, prior radiation, etc) need to be simultaneously considered.

Statistical methods that can simultaneously account for multiple variables are known as "multivariate" (or multivariable) analysis. These methods help to "control" (or "adjust") for variables that are extraneous to the main question and might confound it. Commonly encountered forms of multivariable analysis include:

Logistic regression, which is used in models assessing dichotomous outcomes (eg, alive versus dead, having a disease versus not having it)

Linear regression, which is used in models assessing continuous outcomes (eg, blood pressure)
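
As a hedged sketch (assuming the statsmodels and numpy packages are available; the data are simulated and the "true" coefficients are arbitrary assumptions), a multivariable logistic regression producing adjusted ORs might look like this:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated (hypothetical) data: smoking status and age as predictors of a binary outcome
n = 200
smoking = rng.integers(0, 2, size=n)
age = rng.normal(60, 10, size=n)
log_odds = -8 + 1.0 * smoking + 0.1 * age              # assumed "true" relationship
outcome = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

X = sm.add_constant(np.column_stack([smoking, age]))   # intercept + predictors
model = sm.Logit(outcome, X).fit(disp=0)               # multivariable logistic regression

print(np.exp(model.params))   # exponentiated coefficients = adjusted odds ratios
```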

Time-to-event analysis (survival analysis) — Many examples of medical research deal with an event that may or may not occur in a given period of time (such as death, stroke, myocardial infarction). During the study, several outcomes are possible in addition to the outcome of interest (eg, patients might die of other causes or drop out from the analysis). Furthermore, the duration of follow-up can vary among individuals in the study. A patient who is observed for five years should count more in the statistical analysis than one observed for five months.

Several methods are available to account for these considerations. The most commonly used methods in medical research are Kaplan-Meier and Cox proportional hazards analyses.

Kaplan-Meier analysis – Kaplan-Meier analysis estimates the proportion of patients surviving (or remaining free from an outcome) among the total number of patients at risk for the outcome. Every time a patient has an outcome, the proportion is recalculated. Using these calculations, a curve can be generated that graphically depicts the probability of survival as time passes (figure 7), as illustrated in the computational sketch after this list.

In many studies, the benefit of a drug or intervention on an outcome is compared with a control population, permitting the construction of two or more Kaplan-Meier curves. Curves that are close together or cross are unlikely to reflect a statistically significant difference. Several formal statistical tests can be used to assess a significant difference. Examples include the log-rank test and the Breslow test.

Cox proportional hazards analysis – Cox proportional hazards analysis is similar to logistic regression because it can account for many variables that are relevant for predicting a dichotomous outcome. However, unlike logistic regression, Cox proportional hazards analysis permits time to be included as a variable and for patients to be counted only for the period of time in which they were observed. The summary effect estimate generated by this type of analysis is the HR. (See 'Hazard ratio' above.)
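
The following is a minimal sketch of the Kaplan-Meier calculation referenced above, using hypothetical follow-up times (in months) and event indicators; it assumes no two patients share the same follow-up time:

```python
# Hypothetical follow-up data: (time in months, 1 = event occurred, 0 = censored)
data = [(2, 1), (3, 0), (6, 1), (7, 1), (10, 0), (15, 1), (16, 0), (20, 1)]

data.sort()
at_risk = len(data)
survival = 1.0

# At each event time, multiply the running survival estimate by the
# proportion of patients still at risk who did not have the event
for time, event in data:
    if event:
        survival *= (at_risk - 1) / at_risk
        print(f"t = {time:>2} months: estimated survival = {survival:.2f}")
    at_risk -= 1   # patients leave the risk set after an event or censoring
```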

Terms used for hypothesis testing

Confidence interval — The confidence interval (CI) conveys the range of values that could be considered reasonably likely based upon statistical testing. The boundaries of the CI give values within which there is a high probability (95 percent by convention) that the true population value can be found. For the purposes of hypothesis testing, a finding is generally considered statistically significant if the 95 percent CI excludes the null value (ie, if the 95 percent CI around the RR does not include 1.0) (figure 8). The interpretation of CIs is discussed in greater detail separately. (See "Hypothesis testing in clinical research: Proof, p-values, and confidence intervals", section on 'Confidence intervals'.)

Credible interval — A credible interval is used in Bayesian analysis to describe the range in which a posterior probability estimate is likely to reside. As an example, a 95 percent credible interval for a posterior probability estimate of 40 percent could range from 30 to 50 percent, indicating that there is a 95 percent chance that the true posterior probability estimate lies within the 30 to 50 percent range. There are fundamental differences in how credible intervals are derived compared with the more commonly used CIs. Nevertheless, their intuitive interpretation is similar.

P-value — A simplistic view of the p-value is to think of it as the probability that the observed result could have occurred by chance alone (ie, due to random variation in the sample). More accurately, it is the probability that, if the null hypothesis were true and the results were not affected by bias or confounding, a result as extreme as or more extreme than the one observed in the study would have occurred. P-values are discussed in greater detail separately. (See "Hypothesis testing in clinical research: Proof, p-values, and confidence intervals", section on 'P-values'.)

Type I and type II errors — Two potential errors are commonly recognized when testing a hypothesis:

A type I error (also referred to as an "alpha error") is incorrectly concluding that there is a difference when in truth there is no difference. In most studies, a threshold for type I (alpha) error is 0.05 and if the p-value is <0.05, the null hypothesis is rejected. However, this threshold is somewhat arbitrary, and some experts favor a more restrictive threshold, as discussed separately. (See "Hypothesis testing in clinical research: Proof, p-values, and confidence intervals", section on 'Thresholds for statistical significance'.)

A type II error (also referred to as a "beta error") is incorrectly concluding that there was no statistically significant difference in a dataset; the probability of making a type II error is called "beta." This error often reflects insufficient power of the study.

Power — Power is the statistical probability of avoiding a type II error in a study. That is, it is the probability that a study will not mistakenly conclude that there was no effect or difference when there really was one. If a study fails to detect a statistically significant difference, it is reasonable to ask whether there was "adequate power." In other words, one possible explanation for the result is that a small sample size and/or random chance may have led to a failure to detect a difference that really existed. The concept of statistical power is discussed in greater detail separately. (See "Hypothesis testing in clinical research: Proof, p-values, and confidence intervals", section on 'Power'.)
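
As a hedged illustration (a normal-approximation calculation for comparing two proportions; the event rates and sample sizes are hypothetical), power can be estimated as follows:

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(p1, p2, n_per_arm, alpha=0.05):
    # Normal-approximation power for a two-sided test comparing two proportions
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    pooled = (p1 + p2) / 2
    se_null = sqrt(2 * pooled * (1 - pooled) / n_per_arm)
    se_alt = sqrt(p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm)
    z = (abs(p1 - p2) - z_alpha * se_null) / se_alt
    return NormalDist().cdf(z)

print(power_two_proportions(0.40, 0.20, n_per_arm=50))    # ~0.59: underpowered
print(power_two_proportions(0.40, 0.20, n_per_arm=100))   # ~0.88
```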

Measures of diagnostic test performance — The following terms are used to describe diagnostic test performance. A more detailed discussion of these concepts is provided separately. (See "Evaluating diagnostic tests", section on 'Test performance characteristics'.)

Sensitivity — Sensitivity is the number of patients with a positive test who have a disease (true positives) divided by all patients who have the disease (table 2). A test with high sensitivity will not miss many patients who have the disease (ie, low false-negative rate). (See "Evaluating diagnostic tests", section on 'Sensitivity and specificity'.)

Specificity — Specificity is the number of patients who have a negative test and do not have the disease (true negatives) divided by the number of patients who do not have the disease (table 2). A test with high specificity will infrequently identify patients as having a disease when they do not (ie, low false-positive rate).
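
A brief sketch computing sensitivity and specificity from a hypothetical 2 x 2 table of test results against a reference standard:

```python
# Hypothetical counts from comparing a test against a reference standard
true_positives, false_negatives = 90, 10     # patients with the disease
true_negatives, false_positives = 160, 40    # patients without the disease

sensitivity = true_positives / (true_positives + false_negatives)   # 0.90
specificity = true_negatives / (true_negatives + false_positives)   # 0.80

print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```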

Receiver operating characteristic (ROC) curve — Selecting the cutoff value for a diagnostic test involves balancing sensitivity and specificity (figure 9). The ROC curve plots sensitivity against 1-specificity (ie, the false-positive rate) for all cutoff values measured for the diagnostic test (figure 10). The ROC curve demonstrates the interchange between sensitivity and specificity for different cutoff values. The area under the curve (AUC) of the ROC curve represents the overall accuracy of the test. (See "Evaluating diagnostic tests", section on 'Receiver operating characteristic (ROC) curves'.)

Likelihood ratios — Another method by which the performance of diagnostic tests can be judged is to assess the positive and negative likelihood ratios, which, like sensitivity and specificity, are independent of disease prevalence. (See "Evaluating diagnostic tests", section on 'Likelihood ratios'.)

Positive likelihood ratio – The positive likelihood ratio is the true positive rate divided by the false positive rate. It can be calculated from the sensitivity and specificity as follows:

Positive likelihood ratio = Sensitivity/(1 – Specificity)

The higher the positive likelihood ratio, the better the test (a perfect test has a positive likelihood ratio equal to infinity).

Negative likelihood ratio – The negative likelihood ratio is the false negative rate divided by the true negative rate. It can be calculated from the sensitivity and specificity as follows:

Negative likelihood ratio = (1 – Sensitivity)/Specificity

The lower the negative likelihood ratio, the better the test (a perfect test has a negative likelihood ratio of 0).
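
A short sketch applying the formulas above, continuing the hypothetical sensitivity of 0.90 and specificity of 0.80:

```python
sensitivity, specificity = 0.90, 0.80   # hypothetical values

positive_lr = sensitivity / (1 - specificity)   # 0.90 / 0.20 = 4.5
negative_lr = (1 - sensitivity) / specificity   # 0.10 / 0.80 = 0.125

print(positive_lr, negative_lr)
```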

Positive and negative predictive values — The positive predictive value (PPV) and negative predictive value (NPV) can be calculated based upon the sensitivity and specificity of the test and the prevalence of disease in the population (or the pretest probability) (table 2) (calculator 1).

The PPV of a test represents the likelihood that a patient with a positive test has the disease.

The NPV represents the likelihood that a patient who has a negative test is free of the disease.

The PPV and NPV depend upon the prevalence of a disease within a population. Thus, for given values of sensitivity and specificity, a patient with a positive test who belongs to a population with a high prevalence of the disease is more likely to truly have the disease than a patient with a positive test who belongs to a population with a low prevalence of the disease (figure 11). (See "Evaluating diagnostic tests", section on 'Pretest probability and predictive values'.)
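
The prevalence dependence can be illustrated with a short sketch (hypothetical sensitivity, specificity, and prevalence values):

```python
def predictive_values(sensitivity, specificity, prevalence):
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

# Same test (sensitivity 0.90, specificity 0.80) in low- and high-prevalence populations
print(predictive_values(0.90, 0.80, prevalence=0.01))  # PPV ~0.04, NPV ~1.00
print(predictive_values(0.90, 0.80, prevalence=0.30))  # PPV ~0.66, NPV ~0.95
```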

Accuracy — The performance of a diagnostic test is sometimes expressed as accuracy, which refers to the number of correct test results (including both true positives and true negatives) divided by the total number of observations (table 2). However, accuracy by itself is often not the best indicator of test performance, since it obscures important information related to its component parts. (See "Evaluating diagnostic tests", section on 'Accuracy and precision'.)

Measures of interrater reliability — Reliability refers to the extent to which repeated measurements of a relatively stable phenomenon fall closely to each other. Several different types of reliability can be measured, such as inter- and intraobserver reliability and test-retest reliability.

Kappa statistic — The kappa statistic is the most commonly used measure for assessing interobserver agreement. It can range from -1.0 to +1.0. If there is perfect agreement, the value is 1.0, whereas if the observed agreement is what would be expected by chance alone, the value is 0. If the degree of agreement is worse than what would be expected by chance, the kappa value will be negative, with complete disagreement resulting in a value of -1.0.

Kappa statistics are generally interpreted as follows:

Excellent agreement – 0.8 to 1.0

Good agreement – 0.6 to 0.8

Moderate agreement – 0.4 to 0.6

Fair agreement – 0.2 to 0.4

Poor agreement – Less than 0.2
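
A minimal sketch of the kappa calculation (hypothetical agreement counts for two raters classifying the same 100 cases as positive or negative):

```python
# Hypothetical 2 x 2 agreement table for two raters
both_pos, rater1_only, rater2_only, both_neg = 40, 10, 5, 45
total = both_pos + rater1_only + rater2_only + both_neg   # 100

observed_agreement = (both_pos + both_neg) / total        # 0.85

# Expected agreement by chance, from each rater's marginal positive rates
rater1_pos = (both_pos + rater1_only) / total             # 0.50
rater2_pos = (both_pos + rater2_only) / total             # 0.45
expected_agreement = (rater1_pos * rater2_pos
                      + (1 - rater1_pos) * (1 - rater2_pos))   # 0.50

kappa = (observed_agreement - expected_agreement) / (1 - expected_agreement)
print(round(kappa, 2))   # 0.70 -> "good agreement" by the scale above
```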

Topic 2759 Version 26.0
