Hypothesis testing in clinical research: Proof, p-values, and confidence intervals

Literature review current through: Jan 2024.
This topic last updated: Jan 23, 2024.

INTRODUCTION — Biostatistical concepts can be confusing to clinicians. The meaning of a p-value in particular is commonly misunderstood and yet is central to the way most clinicians interpret the results of scientific studies [1].

This topic will review the interpretation of p-values and confidence intervals, the idea of proof, and the concept of statistical power.

A glossary of these and other biostatistical and epidemiological terms is provided separately. (See "Glossary of common biostatistical and epidemiological terms".)

PROOF — A common question to be addressed in clinical research, and in scientific research in general, is "what constitutes proof?" How do we decide when the evidence for or against a hypothesis is adequate to consider the matter proven?

Certain clinical research methodologies are considered "higher quality" than other methodologies. For instance, randomized clinical trials are generally considered better evidence than case-control studies. (See "Evidence-based medicine", section on 'Categories of evidence'.)

Proof, however, never exists in a single trial result or a single piece of evidence. Proof is a human concept having to do with the rational thought process. Information may be sufficient to allow one person to consider something proven, whereas another may not.

As an example, there are no clinical trials demonstrating that cigarette smoking causes lung cancer in humans. However, evidence from epidemiologic studies overwhelmingly shows a relationship between smoking and lung cancer. A dose-response relationship in these studies and evidence from animal studies provide strong support for a causal relationship (ie, smoking is not just associated with lung cancer but it causes lung cancer). Most people consider it proven that smoking causes lung cancer despite the absence of clinical trials in humans.

In contrast, there are claims that certain homeopathic preparations (essentially extremely dilute preparations that, on average, retain very little of the original "therapeutic" substance) have "proven" efficacy because a randomized clinical trial achieved a p-value of <0.05 (see 'P-values' below). However, a much higher statistical standard of proof may be appropriate for such claims given the implausibility of the underlying hypothesis.

Thus, when discussing whether an issue in medicine has been proven, disproven, or remains uncertain, it is important to remember that no single statistic or value will provide the answer.

In the practice of evidence-based medicine, assessing the certainty or validity of a body of evidence also involves addressing questions about the internal validity (Could the study's findings be biased or confounded?) and external validity (Do the results apply to my patients?). This is discussed separately. (See "Evidence-based medicine", section on 'Assessing the validity of the evidence'.)

STATISTICAL TESTS AND THE NULL HYPOTHESIS

Samples — Statistical testing is performed to assess how likely it is that the findings in the study are different from what would be expected by chance or random variation in sampling. Because of random variation, a sample is likely to differ in various ways from the population from which it was selected. Statistical testing is used to estimate the effects of random variation in samples and to predict how likely it is that the results in the sample accurately reflect what would be seen in the entire population. In general, the size of the sample, but not the size of the population, matters when considering random variation.

As an example, if three people are randomly selected from a population of 1000 and given a drug for high blood pressure, the results in those three people are unlikely to accurately reflect what would be seen in the entire population. By contrast, if 500 people were randomly selected, the results would be expected to more accurately reflect the underlying population.
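The effect of sample size on random variation can be illustrated with a short simulation. The following sketch (in Python, with entirely hypothetical numbers for the blood pressure example) draws samples of 3 and of 500 people from a simulated population of 1000 and compares each sample mean with the population mean.

```python
import random
import statistics

# Hypothetical population of 1000 people whose true mean blood pressure
# reduction on the drug is 10 mmHg, with person-to-person variation (SD 8 mmHg).
# All numbers are made up purely for illustration.
random.seed(42)
population = [random.gauss(10, 8) for _ in range(1000)]
true_mean = statistics.mean(population)

for n in (3, 500):
    sample = random.sample(population, n)
    print(f"n = {n:3d}: sample mean = {statistics.mean(sample):5.1f} mmHg "
          f"(population mean = {true_mean:.1f} mmHg)")

# A sample of 3 can stray far from the population mean; a sample of 500
# lands much closer, which is why sample size, not population size,
# drives the impact of random variation.
```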

Null hypothesis — The null hypothesis proposes that there is no association between the exposure or intervention that is being studied and the outcome of interest. When statistical tests are used in research, it is generally to decide whether to reject the null hypothesis. Thus, if a certain level of statistical significance is reached, the null hypothesis will be rejected (ie, the study will conclude that there is an association). Otherwise, the null hypothesis will not be rejected. (See 'P-values' below and 'Statistical significance' below.)

As an example, consider a clinical trial studying the effect of beta blockade on mortality in patients with heart failure. The null hypothesis would be that the effect of beta blockade is not different from placebo. Even if beta blockers really have no effect on mortality, the mortality rates in patients receiving beta blockers will likely not be exactly the same as in patients receiving placebo due to random variation in the study population. Thus, some method is needed to decide how different is "different enough" to reject this null hypothesis and conclude that beta blockade has an effect. Statistical tests are used for this purpose.
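As an illustration only (the mortality counts below are hypothetical and not taken from any actual trial), a pooled two-proportion z-test is one common statistical test for this situation; it converts the observed difference in mortality rates into a p-value calculated under the null hypothesis.

```python
import math

def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Two-sided z-test of the null hypothesis that two groups share the
    same underlying event rate (pooled-variance normal approximation)."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical trial: 120/1000 deaths on beta blocker vs 160/1000 on placebo.
z, p = two_proportion_z_test(120, 1000, 160, 1000)
print(f"z = {z:.2f}, two-sided p = {p:.3f}")  # a small p favors rejecting the null
```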

INFORMATION PROVIDED BY STATISTICAL TESTS — Once data are gathered from a study, statistical tests are performed on the results. Statistical tests pool the data and assess the probability that the findings would have occurred given some assumptions about the underlying population being studied.

Statistical tests calculate many different values that are used for interpreting the data. Some of these statistics are purely descriptive (eg, mean age of the cohort). However, hypothesis testing requires making comparisons (eg, was the mortality rate in treated patients different from that of the control group?).

For comparative data, three key pieces of information provided by statistical tests are:

The effect estimate (see 'Effect estimates' below)

The confidence interval (see 'Confidence intervals' below)

The p-value (see 'P-values' below)

Effect estimates — The effect estimate conveys the magnitude of the association between the exposure (or treatment) and the outcome of interest. In studies investigating treatments, this is often referred to as the treatment effect or treatment effect size. Effect estimates measure the magnitude of the difference in outcomes between two groups (eg, patients who received the treatment versus the control group). This can be expressed as a relative difference (eg, relative risk [RR], odds ratio [OR], hazard ratio [HR]) or absolute difference. These terms are discussed in greater detail separately. (See "Glossary of common biostatistical and epidemiological terms", section on 'Terms used to describe the magnitude of an effect'.)
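As a minimal sketch with hypothetical counts (not drawn from any study cited in this topic), the following code computes the common effect estimates from a 2x2 table: the risk in each arm, the relative risk, the odds ratio, and the absolute risk difference.

```python
def effect_estimates(events_t, n_t, events_c, n_c):
    """Common effect estimates from the counts of a hypothetical two-arm trial."""
    risk_t, risk_c = events_t / n_t, events_c / n_c
    return {
        "risk_treated": risk_t,
        "risk_control": risk_c,
        "relative_risk": risk_t / risk_c,
        "odds_ratio": (events_t / (n_t - events_t)) / (events_c / (n_c - events_c)),
        "absolute_risk_difference": risk_t - risk_c,
    }

# Hypothetical trial: 120/1000 deaths with treatment vs 160/1000 with control.
for name, value in effect_estimates(120, 1000, 160, 1000).items():
    print(f"{name}: {value:.3f}")  # the relative risk works out to 0.75
```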

Confidence intervals — Confidence intervals (CIs) convey the range of values that could be considered reasonably likely based upon statistical testing.

The most commonly reported interval is the 95% CI. The CI and the p-value are related; the narrower the CI, the smaller the p-value. A finding is generally considered statistically significant if the 95% CI excludes the null value (ie, for RR, if the 95% CI does not include 1.0; for absolute risk difference, if the 95% CI does not include 0). However, the threshold used to determine statistical significance is somewhat arbitrary, as discussed below. (See 'Thresholds for statistical significance' below.)

Despite its name, the CI cannot be used to directly infer how confident one should be in the result. Consider a clinical trial that finds lower mortality in patients treated with beta blockers compared with placebo, with a RR of 0.75 and 95% CI 0.7-0.8. This does not mean that there is only a 5 percent chance that the real RR is below 0.7 or above 0.8. Instead, it means that in correctly performed studies, we would expect the CI to surround the true value for the RR 95 percent of the time. The difference between these two views is that the actual likelihood that the RR is between 0.7 and 0.8 depends upon the prior likelihood (before the study was performed) that the RR was in that range. If it were very unlikely prior to the study, then the likelihood after the study would not be 95 percent despite the apparent meaning of the term "confidence interval."

However, because it may be very difficult to know the prior probability, CIs are often interpreted as representing a range of believable values. This is particularly useful in deciding whether a study had an adequate number of patients (ie, whether there was adequate power). (See 'Power' below.)

Using the same example as above, if the trial found that the RR was 0.75 with 95% CI 0.5-1.12, the study might have been reported as "negative," when in reality, the study was simply too small to answer the clinical question. A 50 percent relative reduction in mortality would clearly be clinically meaningful, and a 12 percent relative increase in mortality would likely also be meaningful. If, instead, the trial found a RR of 0.99 with 95% CI 0.97-1.01, it would be appropriate to conclude that there was little to no effect on mortality.
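To make the relationship between sample size and CI width concrete, the following sketch uses the common log relative risk normal approximation (one of several methods; the counts are hypothetical) to show that the same event rates produce a wide CI in a small trial and a narrow CI in a large one.

```python
import math

def rr_with_95ci(events_t, n_t, events_c, n_c):
    """Relative risk with a 95% CI from the usual log-RR normal approximation."""
    rr = (events_t / n_t) / (events_c / n_c)
    se_log = math.sqrt(1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c)
    lower = math.exp(math.log(rr) - 1.96 * se_log)
    upper = math.exp(math.log(rr) + 1.96 * se_log)
    return rr, lower, upper

# Hypothetical small trial: 15% vs 20% mortality with only 100 patients per arm.
print("small trial: RR %.2f, 95%% CI %.2f-%.2f" % rr_with_95ci(15, 100, 20, 100))

# Hypothetical large trial with the same event rates and 10,000 per arm:
# the point estimate stays at 0.75 but the CI narrows to roughly 0.71-0.80.
print("large trial: RR %.2f, 95%% CI %.2f-%.2f" % rr_with_95ci(1500, 10000, 2000, 10000))
```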

The interpretation of confidence intervals, like the interpretation of p-values, should ideally consider the totality of evidence for or against a hypothesis.

P-values — A simplistic view of the p-value is to think of it as the probability that the observed result could have occurred by chance alone (ie, due to random variation in the sample). More accurately, it is the probability that, if the null hypothesis were true (see 'Null hypothesis' above) and the results were not affected by bias or confounding, a result as extreme as or more extreme than the one seen in the study would have been observed. The p-value is not the probability that the result of the study is true or false.

In most studies, a threshold of p <0.05 is chosen as the threshold for claiming a "statistically significant" finding. However, the threshold is somewhat arbitrary, as discussed below. (See 'Statistical significance' below.)

As an example, consider a hypothetical clinical trial comparing beta blockers to placebo in patients with heart failure. If the trial is methodologically high-quality (ie, low risk of bias and confounding) and finds that mortality is lower in the beta blocker group with a RR of 0.75 and p-value of 0.03, this means that if beta blockers truly had no effect, we would have expected to see a RR ≤0.75 only 3 percent of the time.

Importantly, the p-value says nothing directly about the probabilities that we are most interested in: The probability that beta blockers actually work or the probability that the RR is truly 0.75. These probabilities are not knowable from a single study. If the prior probability (the probability before the study was performed) that beta blockers affect mortality was very low, then even after the study was performed and resulted in a p-value of 0.03, the likelihood that beta blockers affect mortality would be much lower than 97 percent. In contrast, if the prior probability was very high (eg, because of evidence from other studies), then the probability that beta blockers affect mortality after the above study was performed would be higher than 97 percent.
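The dependence on the prior probability can be illustrated with the standard "positive predictive value of a significant result" calculation. This is a simplification (it conditions on crossing the significance threshold rather than on the exact p-value) and is not a formula taken from this topic; the power and alpha values are assumptions.

```python
def prob_effect_is_real(prior, power=0.8, alpha=0.05):
    """Rough post-study probability that an effect is real, given a
    'statistically significant' result, the prior probability that the
    effect exists, the study's power, and the significance threshold."""
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

for prior in (0.01, 0.10, 0.50, 0.90):
    print(f"prior probability {prior:.2f} -> probability the effect is real "
          f"after a significant result: {prob_effect_is_real(prior):.2f}")

# With a very low prior (0.01) the post-study probability is only about 0.14,
# far below the 0.95 or 0.97 that a significant p-value seems to suggest.
```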

It is sometimes assumed that if the p-value is very low (<0.005), there is a high likelihood that the finding would be confirmed in subsequent studies (ie, that there is a high "replication probability"). However, this may not be the case. In an analysis of >23,000 clinical trials in the Cochrane Database of Systematic Reviews, trials with an initial p-value between 0.001 and 0.005 had only a 60 percent chance of yielding a subsequent p-value <0.05 upon attempted replication [2].

INTERPRETATION

Explanation for the results of a study — There are four possible explanations for findings in clinical research (regardless of whether the study has a positive or negative result):

Truth – The findings in the study may accurately reflect the answer to the underlying question that was being asked.

Bias – There may be one or more errors in the way the study was performed that distort the results and affect the findings.

Confounding – There may be one or more variables that are associated both with the exposure being studied and also with the outcome of interest that affect the results of the study.

Chance – Random variations that occurred within the sample of the population being studied may lead to erroneous conclusions. If random chance leads to a mistaken conclusion that there was an effect, the mistake is called a type 1 error (alpha error); if random chance leads to a mistaken conclusion that there was no effect, the mistake is called a type 2 error (beta error). (See "Glossary of common biostatistical and epidemiological terms", section on 'Errors'.)

Of these four explanations, statistical tests that yield p-values and confidence intervals (CIs) only address the final one (ie, whether chance could explain the findings). P-values and CIs do not help in determining whether the study's findings represent the truth or whether the study has problematic bias or confounding.

Statistical significance

Thresholds for statistical significance — The choice of a specific threshold for a p-value or degree of confidence for a confidence interval is arbitrary. In most studies, a threshold of p <0.05 (and the corresponding 95% CI) is chosen as the threshold for claiming a "statistically significant" finding. But there is no particular reason why a p-value of 0.02 (and a corresponding 98% CI) could not instead be the standard for calling a result "statistically significant."

There is debate as to what the most appropriate threshold is for defining statistical significance in clinical research [3-6]. Some experts favor a more restrictive threshold (eg, p <0.005), whereas other experts advocate for de-emphasizing or abandoning the p-value altogether.

Although the choice of the threshold is arbitrary, it is important to set a threshold of statistical significance when testing hypotheses. This allows us to know what percentage of "statistically significant" findings are likely false-positives (ie, type I error or alpha error). Since the usual threshold for statistical significance is a p-value of <0.05, this means that approximately 5 percent of all "statistically significant" findings could have resulted from chance alone. (See "Glossary of common biostatistical and epidemiological terms", section on 'Errors'.)

A result with a p-value of 0.001 is less likely to be due to chance (all other things being equal) than one with a p-value of 0.05, and a result with a p-value of 0.06 is only slightly more likely to be due to chance than one with a p-value of 0.05. However, we generally do not interpret p-values on a scale. If we try to use these values in any way other than as falling on one side or the other of a threshold (arbitrarily p = 0.05), we will no longer be able to say with certainty what percentage of "statistically significant" results were really the result of type 1 error.
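The role of the threshold as a cap on type 1 error can be shown by simulation: when many trials are run in which the null hypothesis is actually true, roughly 5 percent of them still cross the p <0.05 threshold by chance. The sketch below uses hypothetical event rates and a pooled two-proportion z-test; none of the numbers come from real studies.

```python
import math
import random

def two_sided_p(events_a, n_a, events_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (p_a - p_b) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Simulate many trials in which the null hypothesis is TRUE: both arms share
# a 15% event rate, so any "significant" result is a false positive.
random.seed(1)
n_trials, n_per_arm, false_positives = 2000, 500, 0
for _ in range(n_trials):
    a = sum(random.random() < 0.15 for _ in range(n_per_arm))
    b = sum(random.random() < 0.15 for _ in range(n_per_arm))
    if two_sided_p(a, n_per_arm, b, n_per_arm) < 0.05:
        false_positives += 1

print(f"type 1 error rate: {false_positives / n_trials:.3f} (expected ~0.05)")
```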

Misconceptions about statistical significance — Common misconceptions about statistical significance include:

A statistically significant result (eg, p-value <0.05) means the study's finding is true. As discussed above, p-values cannot answer the question of whether a finding is true or false. Rather, they assess how likely it would be to observe the result (or a more extreme result) if there truly were no difference. (See 'P-values' above.)

A statistically significant result (eg, p-value <0.05) means the study's finding is clinically important or meaningful. This is not true. There are many examples of reports of statistically significant findings that are of questionable clinical significance. For example, a trial may find that systolic blood pressure in treated patients was on average 1 mmHg lower than in the control group. If the study is large enough, the finding might reach statistical significance. However, it likely has no clinical significance.

The smaller the p-value, the more meaningful the finding is. This claim is similar to the previous misconception, and it is similarly incorrect. The p-value says nothing about the clinical importance of the finding or magnitude of the effect.

Lack of statistical significance (eg, p-value >0.05) means that there is no difference or no effect. This is a very common misinterpretation of p-values and statistical significance. Failure to detect a difference is not the same as demonstrating no difference. In fact, it is very difficult for a study to confidently demonstrate that there is no difference. Consider a trial investigating the effect of beta blockers on mortality in patients with heart failure. If the relative risk (RR) were 0.75 with 95% CI 0.6-1.01 and p-value 0.06, it would not be appropriate to conclude that beta blockers have no effect on mortality. In fact, based on these data, it seems plausible that beta blockers may have an effect, and if the trial had been larger, it may have detected a statistically significant difference (see 'Power' below). On the other hand, if the RR were 0.99 with 95% CI 0.97-1.02 and p-value 0.8, it might be reasonable to conclude from the study that there is little to no effect, though this conclusion should ideally be based upon consideration of the totality of evidence, not just a single trial.

Power — Power is the statistical probability of avoiding a type 2 error in a study. That is, it is the probability that a study will not mistakenly accept the null hypothesis and conclude that there was no effect or difference when there really was one. (See "Glossary of common biostatistical and epidemiological terms", section on 'Errors'.)

If a study fails to detect a statistically significant difference, it is reasonable to ask whether there was "adequate power." In other words, one possible explanation for the result is that a small sample size and/or random chance may have led to a failure to detect a difference that really existed. This issue is particularly pressing when the effect estimate in the study appears clinically important. In the above hypothetical study, a relative risk for mortality of 0.75 with a 95% CI of 0.50-1.12 would not be statistically significant; however, the point estimate of a 25 percent reduction in mortality would clearly be clinically meaningful if true.

It is possible to calculate the power a study has to find a given result (for instance, a 25 percent reduction in mortality) given a particular sample size (and also given the underlying variation in the population). Power calculations are useful in the design of studies to decide whether a study is large enough to have a reasonable chance of finding a positive result or to calculate the number of patients required to achieve a certain power.
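A minimal sketch of such a prospective power calculation for comparing two proportions, using the standard normal-approximation formula (the event rates and sample sizes below are assumptions chosen for illustration, not values from any cited trial):

```python
from statistics import NormalDist

def power_two_proportions(p_control, p_treated, n_per_arm, alpha=0.05):
    """Approximate power of a two-arm trial to detect a difference between
    two event rates at a two-sided significance level alpha
    (normal-approximation formula for comparing proportions)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    p_bar = (p_control + p_treated) / 2
    se_null = (2 * p_bar * (1 - p_bar) / n_per_arm) ** 0.5
    se_alt = (p_control * (1 - p_control) / n_per_arm
              + p_treated * (1 - p_treated) / n_per_arm) ** 0.5
    delta = abs(p_control - p_treated)
    return nd.cdf((delta - z_crit * se_null) / se_alt)

# Hypothetical design: 20% control mortality, hoped-for 15% treated mortality
# (a 25 percent relative reduction), at several per-arm sample sizes.
for n in (100, 500, 1000, 1500):
    print(f"n per arm = {n:4d}: approximate power = "
          f"{power_two_proportions(0.20, 0.15, n):.2f}")
```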

However, when studies try to address the issue of whether there was "adequate power" after a negative study result by performing a power calculation using the effect estimate found in the study, the result is meaningless. The power in such a calculation will always be less than 50 percent [7]. Instead, as discussed above, the way to decide after the fact whether a nonsignificant finding had sufficient power is to look at the 95% CI and see whether clinically important values exist within the range of the statistically likely values represented by the CI. (See 'Confidence intervals' above.)

SUMMARY

Proof – Proof is a human concept and never comes from a single piece of evidence or a statistical test. (See 'Proof' above.)

Statistical testing – In comparative studies that involve hypothesis testing, three key pieces of information are provided by statistical testing:

The effect estimate – Effect estimates measure the magnitude of the difference between outcomes for two groups (eg, patients who received the treatment of interest versus the control group). This can be expressed as a relative difference (eg, relative risk [RR], odds ratio [OR], hazard ratio [HR]) or absolute difference. (See 'Effect estimates' above and "Glossary of common biostatistical and epidemiological terms", section on 'Terms used to describe the magnitude of an effect'.)

Confidence interval – The confidence interval (CI) conveys the range of values that could be considered reasonably likely based upon statistical testing. (See 'Confidence intervals' above.)

The p-value – The p-value conveys how likely it would be to observe the result (or a more extreme result) if there truly were no difference. It is not the probability that the result of the study is true or false. (See 'P-values' above.)

Possible explanations for study findings – Possible explanations for the observed result in a study are (see 'Explanation for the results of a study' above):

Truth

Bias

Confounding

Chance

Statistical tests that yield p-values and CIs only address the last of these (ie, whether chance could explain the finding).

Statistical significance – The choice of a specific threshold for a p-value or degree of confidence for a CI is arbitrary. In most studies, a threshold of p <0.05 (and the corresponding 95% CI) is chosen as the threshold for claiming a "statistically significant" finding. (See 'Statistical significance' above.)

A statistically significant result (eg, p-value <0.05) does not mean the study's finding is true nor that the finding is clinically meaningful. Similarly, lack of statistical significance (eg, p-value >0.05) does not mean that there is no effect. If a study fails to detect a statistically significant difference, one possible explanation is small sample size (inadequate power). The way to determine whether a nonsignificant finding had sufficient power is to look at the 95% CI. (See 'Misconceptions about statistical significance' above and 'Power' above.)

Topic 2777 Version 20.0
