So there is a difference, but how big is it? Measuring the effect size for binary outcomes

Introduction

In medical research, we frequently wish to compare the outcome or response of groups of individuals. We often use a hypothesis test and its associated P value to decide whether we believe we have sufficient evidence to conclude that there are (true) differences between the groups. But, even if there is evidence of a difference in response between groups, are these differences clinically relevant? As well as presenting P values, we must also place emphasis on the effect size of any comparison. In this short report, we shall consider binary outcomes, i.e. situations in which individuals can have one of two responses (e.g. yes/no, dead/alive, responder/non-responder).

Example

Consider a hypothetical randomized controlled trial investigating treatment for varicose veins. We shall imagine that we have a new intervention/treatment that we wish to compare with a standard of care (which shall be traditional surgery). We randomized 1004 individuals in a 1:1 ratio, with 501 individuals randomized to receive the new treatment and 503 individuals to receive the standard of care. We assume no individuals were lost to follow-up and that all individuals received their assigned treatment. The primary outcome of interest is patient satisfaction after treatment, which we have coded into a binary outcome of ‘yes’ or ‘no’. The results of the study are presented in Table 1. We find that 475 (94.8%) individuals who received the new technique, and 446 (88.7%) who received the standard of care, were satisfied with the outcome of their treatment. Thus, there is very strong evidence that the new treatment is associated with increased patient satisfaction (P < 0.001; chi-squared test).

Table 1

Results of a hypothetical randomized controlled trial of treatment for varicose veins

Treatment group	Response (patient satisfied with outcome)	No response (patient not satisfied with outcome)	Total
New treatment (new technique)	475 (94.8%)	26 (5.2%)	501 (100.0%)
Standard of care (traditional surgery)	446 (88.7%)	57 (11.3%)	503 (100.0%)

Absolute risk reduction (new treatment versus standard of care): 94.8% − 88.7% = 6.1% (95% CI 2.8–9.5%)

Relative risk (new treatment versus standard of care): (475/501)/(446/503) = 94.8%/88.7% = 1.07 (95% CI 1.03–1.11)

Odds ratio (new treatment versus standard of care): (475/26)/(446/57) = (94.8%/5.2%)/(88.7%/11.3%) = 18.3/7.8 = 2.33 (95% CI 1.44–3.78), calculated from the unrounded odds

P value (chi-squared test): <0.001
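The chi-squared test reported with Table 1 can be reproduced from the four cell counts. The following is a minimal sketch, assuming the Pearson statistic without a continuity correction (the report does not state which variant was used); the function names are illustrative only.

```python
import math

# Cell counts from Table 1: (satisfied, not satisfied) in each arm
a, b = 475, 26   # new treatment
c, d = 446, 57   # standard of care


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic for a 2x2 table (no continuity correction)."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def p_value_1df(chi2):
    """Upper-tail P value for a chi-squared statistic with 1 degree of freedom."""
    # With 1 df, P(X > x) = erfc(sqrt(x / 2)), so no external library is needed.
    return math.erfc(math.sqrt(chi2 / 2))


chi2 = chi_squared_2x2(a, b, c, d)
p = p_value_1df(chi2)
print(f"chi-squared = {chi2:.2f}, P = {p:.4f}")
```

The P value tells us only how compatible the data are with no difference between the arms; it does not say how large the difference is, which is the role of the effect size measures that follow.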

Absolute risk reduction

The absolute risk reduction is the difference between the response percentages. In our example, the absolute risk reduction is calculated as 94.8% minus 88.7%, which provides a value of +6.1% (Table 1). Thus, based on the data obtained from our study sample, our best estimate of the effect of the new treatment in the population is that it raises the level of patient satisfaction by 6.1 percentage points (see ref.¹ for a discussion of the difference between samples and populations). Note that if there were no difference in patient satisfaction between the two groups (i.e. an 88.7% response was seen in both groups), then we would expect the absolute risk reduction to take the value of zero (as 88.7% minus 88.7% equals zero).

The corresponding 95% confidence interval (95% CI) for this effect is +2.8% to +9.5% (see refs²,³ for details of the calculation). We can interpret this as the range of ‘plausible values’ for the true effect of the new treatment on patient satisfaction. Therefore, we believe that the new treatment raises the level of patient satisfaction by somewhere between 2.8 and 9.5 percentage points. Note that if we assume that values outside this 95% CI are not plausible values for the true difference between groups, then we are ruling out the possibility that there is no difference between treatment arms (as the value zero is not contained within the CI). This is consistent with the obtained P value (P < 0.001).
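The calculation above can be sketched in a few lines. This is a minimal example, assuming the usual normal-approximation (Wald) interval with a multiplier of 1.96; the cited references may describe other methods, and the function name is hypothetical.

```python
import math


def risk_difference_ci(events_1, n_1, events_2, n_2, z=1.96):
    """Difference in proportions (group 1 minus group 2) with a Wald 95% CI."""
    p1 = events_1 / n_1
    p2 = events_2 / n_2
    diff = p1 - p2
    # Standard error of a difference between two independent proportions
    se = math.sqrt(p1 * (1 - p1) / n_1 + p2 * (1 - p2) / n_2)
    return diff, diff - z * se, diff + z * se


# New treatment: 475 of 501 satisfied; standard of care: 446 of 503 satisfied
arr, arr_lower, arr_upper = risk_difference_ci(475, 501, 446, 503)
print(f"Absolute difference: {arr:.1%} (95% CI {arr_lower:.1%} to {arr_upper:.1%})")
```

Under these assumptions the sketch agrees with the values in Table 1 to one decimal place.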

Relative risk

As the name implies, the relative risk (RR; or risk ratio) is the response rate in the treatment arm (94.8%), relative to the response rate in the standard of care arm (88.7%). Therefore, it is calculated as 94.8% divided by 88.7%, which gives a relative risk of 1.07. Thus, the rate of patient satisfaction in the new treatment group was 7% higher, in relative terms, than in the standard of care group.

Note that if there were no difference in patient satisfaction between the two groups (i.e. 88.7% response in both groups), then we would expect the RR to take the value of one (as 88.7% divided by 88.7% equals one).

As with the absolute risk reduction, we can present a 95% CI for our estimate,²,³ which is 1.03–1.11. Thus, we believe that the true effect of the new treatment is somewhere in the range of a 3% to an 11% relative increase in patient satisfaction. Note that we are again ruling out the possibility that there is no difference between treatment arms (as the value one is not within the 95% CI).
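The relative risk and its interval can be sketched as follows. This example assumes the common approach of building the interval on the log scale and then exponentiating, with a multiplier of 1.96; the function name is hypothetical.

```python
import math


def relative_risk_ci(events_1, n_1, events_2, n_2, z=1.96):
    """Relative risk (group 1 versus group 2) with a log-scale 95% CI."""
    rr = (events_1 / n_1) / (events_2 / n_2)
    # Standard error of log(RR)
    se_log = math.sqrt(1 / events_1 - 1 / n_1 + 1 / events_2 - 1 / n_2)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


rr, rr_lower, rr_upper = relative_risk_ci(475, 501, 446, 503)
print(f"RR = {rr:.2f} (95% CI {rr_lower:.2f} to {rr_upper:.2f})")
```

Working on the log scale keeps the interval limits positive and makes the interval asymmetric around the estimate, which is appropriate for a ratio.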

Odds ratio

Although odds ratios (OR) are perhaps not as intuitive as relative risks and absolute risk reductions, they have attractive statistical properties, which mean that they are frequently used in practice (e.g. they can be calculated in case-control studies, and multivariable estimates can be easily calculated using logistic regression). Before one can calculate an OR, one must first understand the concept of odds, which is familiar from the betting world. The odds of a response are calculated by dividing the number of individuals who have a response by the number of individuals who do not have a response. In our example, the odds of a response in the standard of care group are 446 divided by 57 (equivalently 88.7% divided by 11.3%). This gives odds of 7.8 (or ‘7.8 to 1’ in betting terms). Therefore, for every 7.8 individuals who were satisfied with their treatment there was one individual who was not. Similarly, the odds of a response in the new treatment group were 18.3. The OR is simply the ratio of these two values (18.3 divided by 7.8), giving an estimate of 2.33 when the unrounded odds are used. Thus, those who received the new treatment had more than twice (2.33 times) the odds of being satisfied with their treatment compared with those who received the standard of care.

Note again that if there were no difference in patient satisfaction between the two groups, then we would expect the OR to take the value of one (as 88.7%/11.3% divided by 88.7%/11.3% equals one). The 95% CI for our OR is 1.44–3.78.⁴ Thus, we are once again ruling out the chance that there may be no difference between treatment arms (as the value one is not contained within the CI).
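The odds, the OR and its interval can be sketched in the same way. This example assumes the standard log-scale interval, with the standard error of log(OR) taken as the square root of the sum of the reciprocals of the four cell counts, and a multiplier of 1.96; the function name is hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio for a 2x2 table with a log-scale 95% CI.

    a, b: responders and non-responders in group 1
    c, d: responders and non-responders in group 2
    """
    odds_1 = a / b
    odds_2 = c / d
    odds_ratio = odds_1 / odds_2
    # Standard error of log(OR)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_1, odds_2, odds_ratio, lower, upper


odds_new, odds_standard, odds_ratio, or_lower, or_upper = odds_ratio_ci(475, 26, 446, 57)
print(f"Odds: {odds_new:.1f} (new treatment), {odds_standard:.1f} (standard of care)")
print(f"OR = {odds_ratio:.2f} (95% CI {or_lower:.2f} to {or_upper:.2f})")
```

Note that the OR is computed from the unrounded odds; dividing the rounded values 18.3 and 7.8 gives a slightly different figure.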

Comparison of methods

In our example above, we can see from the results of the chi-squared test that the new treatment is clearly associated with better patient satisfaction than the standard of care. However, we must also consider the clinical benefits of the new treatment: the satisfaction rate in the standard of care arm is already almost 90%. Is a relative increase of 7%, or an absolute increase of 6 percentage points, clinically important once other factors have been taken into account (e.g. cost, ease of carrying out the treatment)? If only the P value is presented, without an estimate of the effect size, information on the clinical importance of our finding may be missed.

We can also see that there are situations in which one might obtain quite a different impression of the effectiveness of a treatment depending on the summary measure used. When the percentage experiencing the outcome is small, the RR and OR are similar; however, this is not true for our example, where the outcome is very common (RR 1.07 versus OR 2.33). Furthermore, there are situations in which the OR and RR give a very different impression from the absolute risk reduction. For example, imagine that a secondary outcome of our study was serious adverse events (SAEs), and that 0.2% of individuals in the new treatment arm and 0.4% of individuals in the standard of care arm had an SAE. The OR and RR for this effect are both approximately 0.50, suggesting that SAEs occur half as often on the new treatment, which seems like a clinically important effect. However, the absolute risk reduction is only 0.2 percentage points (0.4% minus 0.2%), perhaps suggesting a more modest benefit of the new treatment. Thus, one must make sure that both the relative and absolute effects are considered when drawing conclusions about treatments.
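The contrast between the measures can be illustrated by computing all three from a pair of proportions. This is a minimal sketch with a hypothetical helper function; the difference is taken as new treatment minus standard of care, so a negative value for SAEs means fewer events on the new treatment.

```python
def effect_measures(p_treatment, p_control):
    """Return (risk difference, relative risk, odds ratio) for two proportions."""
    risk_difference = p_treatment - p_control
    relative_risk = p_treatment / p_control
    odds_treatment = p_treatment / (1 - p_treatment)
    odds_control = p_control / (1 - p_control)
    return risk_difference, relative_risk, odds_treatment / odds_control


# Rare outcome: hypothetical SAE rates of 0.2% (new) and 0.4% (standard of care)
sae_rd, sae_rr, sae_or = effect_measures(0.002, 0.004)
print(f"SAEs: difference = {sae_rd:+.1%}, RR = {sae_rr:.2f}, OR = {sae_or:.2f}")

# Common outcome: satisfaction in 475/501 (new) and 446/503 (standard of care)
sat_rd, sat_rr, sat_or = effect_measures(475 / 501, 446 / 503)
print(f"Satisfaction: difference = {sat_rd:+.1%}, RR = {sat_rr:.2f}, OR = {sat_or:.2f}")
```

For the rare outcome the RR and OR almost coincide while the absolute difference is tiny; for the common outcome the RR and OR diverge markedly.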

References

1. Smith, Fox. The use and abuse of hypothesis tests: how to present P values. Phlebology 2010;25:107–12

2. Petrie, Sabin. Medical Statistics at a Glance. 3rd edn. London, UK: Wiley Blackwell, 2009

3. Campbell, Swinscow TDV. Statistics at Square One. 11th edn. Chichester, West Sussex, UK: John Wiley and Sons, 2009

4. Bland, Altman. The odds ratio. BMJ 2000;320:1468