Interpreting confidence intervals

Introduction

In the previous two Research Design and Statistics reports in Phlebology, we have considered potential ways of measuring the effect size for continuous and binary outcomes.¹,² In discussing these issues, we have also briefly mentioned confidence intervals. In this report, we shall consider the definition of confidence intervals and their interpretation in more detail.

We shall use the example previously presented in ‘So there's a difference, but how big is it? Measuring the effect size for binary outcomes’.¹ Here, we wished to investigate patient satisfaction with a new treatment for varicose veins, compared with the gold standard of traditional surgery.¹ The two interventions were compared in a randomized controlled trial. We found that 475/501 (94.8%) of individuals who received the new technique, and 446/503 (88.7%) who received the standard of care, were satisfied with the outcome of their treatment. There was very strong evidence that the new treatment is associated with increased patient satisfaction (P < 0.0001; chi-squared test). The observed relative risk (RR) was 1.07 (=94.8%/88.7%).
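The arithmetic behind these figures can be checked with a short sketch. The variable names are ours, chosen for illustration; only the four counts come from the example above.

```python
# Counts reported in the example trial
satisfied_new, total_new = 475, 501   # new technique
satisfied_std, total_std = 446, 503   # standard of care

# Proportion satisfied in each arm
p_new = satisfied_new / total_new
p_std = satisfied_std / total_std

# Relative risk: ratio of the two proportions
rr = p_new / p_std

print(f"New treatment: {p_new:.1%}")
print(f"Standard care: {p_std:.1%}")
print(f"Relative risk: {rr:.2f}")
```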

Samples and populations

We have discussed samples and populations in a previous publication in Phlebology.³ In brief, when performing our studies, we are usually interested in answering a research question in a complete population. However, it is nearly always impossible to obtain the relevant information on every single person in our population. We therefore usually take a representative sample of individuals from our population of interest to include in our study. In our example, our study population was all adults requiring treatment for varicose veins. Our sample may perhaps have been 1004 individuals recruited from hospitals across the UK.

Therefore, the results we obtain (i.e. our RR of 1.07) describe the effectiveness of the new treatment in our study sample, not in the entire population. So, how do we use the results of our study to discover the effectiveness of the new intervention in the population?

Sample statistics and population estimates

We wish to estimate the RR for our population; that is, if our intervention were given to all individuals requiring treatment for varicose veins, how would patient satisfaction compare with that achieved using the gold standard? What is our ‘best guess’ of the true efficacy of our treatment? One logical choice is to use the RR obtained in our sample of 1004 individuals to estimate the RR in the population. Thus, we could conclude that our estimate of the population treatment effect is 1.07.

However, it is clear that we cannot definitively state that the true effect of our intervention is 1.07, as we have only estimated the population RR using our sample RR. We have therefore introduced random error. If another sample of individuals were instead included in our study, we would have obtained a slightly different result by chance. How sure are we of the accuracy of the estimate of 1.07? Further information is gained by calculating a confidence interval.
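A small simulation may help to illustrate this random error. The sketch below assumes, purely for illustration, that the true satisfaction proportions in the population are 0.948 and 0.887. These values are borrowed from our sample and are not known population values. It then draws five hypothetical trials of the same size as ours; the function name and the random seed are also our own choices.

```python
import random

def simulate_trial_rr(p_new, p_std, n_new, n_std, rng):
    """Simulate one trial and return its sample relative risk."""
    # Each participant is satisfied with the assumed population probability
    sat_new = sum(rng.random() < p_new for _ in range(n_new))
    sat_std = sum(rng.random() < p_std for _ in range(n_std))
    return (sat_new / n_new) / (sat_std / n_std)

# Assumed (hypothetical) population proportions, for illustration only
P_NEW, P_STD = 0.948, 0.887

rng = random.Random(2024)  # fixed seed so the run is repeatable
rrs = [simulate_trial_rr(P_NEW, P_STD, 501, 503, rng) for _ in range(5)]

for i, rr in enumerate(rrs, start=1):
    print(f"Hypothetical trial {i}: RR = {rr:.3f}")
```

Each simulated trial samples from the same population, yet the sample RRs differ slightly from one another by chance alone.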

Definition and interpretation of a 95% confidence interval

The formal definition of a 95% confidence interval is that, ‘if we were to draw several independent, random samples from the same population and calculate 95% confidence intervals for each of them, then on average 95% of such confidence intervals would contain the true population estimate’ (quote taken from ref.⁴, see also ref.⁵). As we have previously seen, the 95% confidence interval for the RR in our hypothetical study was 1.03–1.11.
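The formal definition can be illustrated by simulation. The sketch below rests on two assumptions that are not taken from the text: that the true population proportions are 0.948 and 0.887, and that the interval is calculated using the common log-transformation method for a relative risk. It draws 2000 hypothetical trials and counts how often the 95% confidence interval contains the assumed true RR.

```python
import math
import random

def rr_ci_95(sat_new, n_new, sat_std, n_std):
    """95% CI for a relative risk using the log-transformation method."""
    rr = (sat_new / n_new) / (sat_std / n_std)
    # Standard error of log(RR)
    se_log_rr = math.sqrt(1 / sat_new - 1 / n_new + 1 / sat_std - 1 / n_std)
    margin = 1.96 * se_log_rr
    return math.exp(math.log(rr) - margin), math.exp(math.log(rr) + margin)

# Hypothetical "true" population values, assumed for illustration only
P_NEW, P_STD = 0.948, 0.887
TRUE_RR = P_NEW / P_STD

rng = random.Random(1)  # fixed seed so the run is repeatable
n_trials = 2000
covered = 0
for _ in range(n_trials):
    # Draw one hypothetical trial of the same size as the example
    sat_new = sum(rng.random() < P_NEW for _ in range(501))
    sat_std = sum(rng.random() < P_STD for _ in range(503))
    lower, upper = rr_ci_95(sat_new, 501, sat_std, 503)
    if lower <= TRUE_RR <= upper:
        covered += 1

coverage = covered / n_trials
print(f"{coverage:.1%} of the simulated intervals contain the true RR")
```

The proportion printed should be close to 95%, although it will not be exactly 95% in any finite simulation.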

So, how do we interpret this confidence interval in practice? Although not strictly correct, many interpret the 95% confidence interval as the range of values in which we are 95% sure that the true population RR lies. Thus, they would say that we are 95% sure that the true effect of the new treatment is somewhere between 1.03 and 1.11. Alternatively (and perhaps preferably), we can think of this interval as the range of plausible values for the population RR. Thus, we would say that we believe that the population RR could plausibly take any value in the range of 1.03–1.11, but we believe that values outside of this range are not plausible (as they are unlikely to be the true effect).

The width of the confidence interval gives us some idea about how uncertain we are about the population RR. A common approach is to look at the lower limit and upper limit of the confidence interval in turn, and imagine what our conclusions would be based on these values. If our conclusions would be the same then we can state that our confidence interval is precise. If our conclusion would be different, then we state that our confidence interval is imprecise. Thus, in our example, if the new intervention really improved patient satisfaction rates by 3% (i.e. RR = 1.03), would we think this a worthwhile new intervention? What would our conclusions be if the new intervention resulted in 11% greater efficacy (i.e. RR = 1.11)? If we would reach the same conclusion regardless of whether the RR were truly 1.03 or 1.11, then our estimate is sufficiently precise for our needs. Otherwise, our estimate is imprecise. Note that determining whether an estimate is precise is therefore usually a subjective decision.

How to calculate the confidence interval

In practice, confidence intervals are usually calculated using statistical software rather than by hand. As this is a practical review of confidence intervals, readers are encouraged to refer to other sources if they are interested in the formulae used to calculate confidence intervals.⁶ However, it is important to note that it is possible to calculate a confidence interval for nearly every sample estimate.⁷ Furthermore, the two factors that primarily determine the width of the confidence interval are the sample size of the study and, for continuous variables, how variable the measure is.⁸ The reasons for this are clear: the larger the number of individuals in a study, the more certain we will be that the result is accurate and we will therefore have a narrower confidence interval. Similarly, the less variable a continuous variable is, the more certain we will be of the accuracy of the findings, and we will again obtain a narrower confidence interval.
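For readers who would like to see the arithmetic, the following is a minimal sketch of one common approach, the log-transformation method for a relative risk. We assume this method for illustration; the function name is ours, and statistical software may use a different method. The second call uses a hypothetical trial with the same proportions but four times as many participants, to show the effect of sample size on the width of the interval.

```python
import math

def rr_confidence_interval(sat_new, n_new, sat_std, n_std, z=1.96):
    """Relative risk with a CI from the log-transformation method.

    z = 1.96 corresponds to a 95% confidence interval.
    """
    rr = (sat_new / n_new) / (sat_std / n_std)
    # Standard error of log(RR)
    se_log_rr = math.sqrt(1 / sat_new - 1 / n_new + 1 / sat_std - 1 / n_std)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

# The example trial: 475/501 versus 446/503 satisfied
rr, lower, upper = rr_confidence_interval(475, 501, 446, 503)
print(f"RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")

# Hypothetical larger trial: same proportions, four times the participants
_, lower_big, upper_big = rr_confidence_interval(4 * 475, 4 * 501, 4 * 446, 4 * 503)
print(f"Width with 1004 participants: {upper - lower:.3f}")
print(f"Width with 4016 participants: {upper_big - lower_big:.3f}")
```

Under these assumptions, the first line reproduces the interval of 1.03 to 1.11 quoted for our example, and the larger hypothetical trial gives a narrower interval.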

Other confidence intervals

Here we have considered 95% confidence intervals, as these are typically used in the medical literature. There is no magical reason why 95% confidence intervals are presented, other than the fact that this is probably a reasonable level of certainty. It would be equally valid to present 90% or 99% confidence intervals. For the same data, a 90% confidence interval will be narrower than a 95% confidence interval, and a 99% confidence interval will be wider.
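The effect of the confidence level on the width of the interval can be sketched as follows. As before, we assume the log-transformation method for the relative risk, which is our choice for illustration; the function name is also ours. The multiplier for each level is taken from the standard normal distribution.

```python
import math
from statistics import NormalDist

def rr_confidence_interval(sat_new, n_new, sat_std, n_std, level):
    """CI for a relative risk (log method) at the given confidence level."""
    # Two-sided multiplier, e.g. about 1.96 when level = 0.95
    z = NormalDist().inv_cdf(0.5 + level / 2)
    rr = (sat_new / n_new) / (sat_std / n_std)
    se_log_rr = math.sqrt(1 / sat_new - 1 / n_new + 1 / sat_std - 1 / n_std)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return lower, upper

# The example trial at three confidence levels
intervals = {
    level: rr_confidence_interval(475, 501, 446, 503, level)
    for level in (0.90, 0.95, 0.99)
}

for level, (lower, upper) in intervals.items():
    print(f"{level:.0%} CI: {lower:.2f} to {upper:.2f}")
```

The same data give a progressively wider interval as the confidence level increases from 90% to 99%.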

Conclusion

Confidence intervals provide additional information about the certainty of the results of a study, and about the likely effect size of any intervention or risk factor. Guidelines for reporting results of randomized controlled trials and observational studies now recommend that confidence intervals are always presented in medical studies, and therefore understanding how they should be interpreted is vital.⁹,¹⁰

References

1. Smith. So there's a difference, but how big is it? Measuring the effect size for binary outcomes. Phlebology 2012;27:38–40
2. Smith. So there's a difference, but how big is it? Measuring the effect size for continuous outcomes. Phlebology 2012;27:96–8
3. Smith, Fox. The use and abuse of hypothesis tests: how to present P values. Phlebology 2010;25:107–12
4. Kirkwood, Sterne JAC. Essential Medical Statistics. 2nd edn. Oxford, UK: Blackwell Sciences
5. Gardner, Altman. Confidence intervals rather than P values: estimation rather than hypothesis testing. BMJ 1986;292:746–50
6. Altman, Gardner. Statistics with Confidence. 2nd edn. Bristol, UK: BMJ Books, 2000
7. Petrie, Sabin. Medical Statistics at a Glance. 3rd edn. London, UK: Wiley Blackwell, 2009
8. Smith. How many patients do I need? Sample size and power calculations. Phlebology 2011;26:44–5
9. Altman, Schulz, Moher. The revised CONSORT statement for reporting randomized trials: explanation and elaboration. Ann Intern Med 2001;134:663–94
10. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) Statement. Guidelines for reporting observational studies. Lancet 2007;370:1453–7