Abstract
Background:
Randomized controlled trials (RCTs) are generally regarded as the gold standard for demonstrating causality because they effectively mitigate bias from both known and unknown confounders. However, conducting an RCT is not always feasible because of logistical and ethical considerations. This is especially true when evaluating surgical interventions, and non-randomized study designs must be utilized instead.
Methods:
Statistical methods that adjust for baseline differences in non-randomized studies were reviewed.
Results:
The three methods used most commonly to adjust for confounding factors are multiple logistic regression, Cox proportional hazard, and propensity scoring. Multiple logistic regression (MLR) is implemented to analyze the influence of categorical and/or continuous variables on a single dichotomous outcome. The model controls for multiple covariates while also quantifying the magnitude of each covariate's influence on the outcome. Selecting which variables to include in a model should be the most important consideration, and authors must report how and why variables were chosen. Cox proportional hazards modeling is conceptually similar to logistic regression and is used when analyzing survival data. When applied to survival curves, Cox proportional hazards can adjust for baseline group differences and provide a hazard ratio to quantify the effect that any single factor contributes to the survival curve. Propensity scores (PS) range from zero to one and are defined as the probability of receiving an intervention based on observed baseline characteristics. Propensity score matching (PSM) is especially useful when the outcome of interest is a rare event. Treated and untreated subjects with similar propensity scores are paired, forming balanced samples for further analysis.
Conclusions:
The method by which to address confounding should be selected according to the data format and sample size. Reporting of methods should provide justification for selected covariates, confirmation that data did not violate model assumptions, and measures of model performance.
The integration of clinical research and medical decision-making occurs through the use of evidence-based practice whereby physicians use the best available evidence to guide clinical decision making [1]. Ideally, the evidence should demonstrate a definitive cause-and-effect relation, or causality, between an exposure and outcome, i.e., a given intervention will lead to patient improvement. Randomized controlled trials (RCTs) are generally regarded as the gold standard for proving causality because successful randomization mitigates bias and confounding for both known and unknown factors [2]. However, conducting an RCT is not always feasible because of logistical, financial, or ethical considerations. This is especially true when evaluating surgical interventions because there are inherent obstacles to recruiting subjects, blinding patients and physicians, and standardizing treatments [3]. Thus, a different approach is usually selected out of necessity.
When carefully designed and conducted, non-randomized studies may serve as a reasonable alternative to RCTs. Non-randomized studies are useful for identifying correlations that, depending on the robustness of the study, may allow for inferences on causation to be made. For example, although there are no RCTs comparing mortality after early (<12 hours) and late (>12 hours) surgical debridement for necrotizing soft tissue infections, early debridement is widely accepted as the preferred treatment based on several large observational studies [4]. However, unlike RCTs, non-randomized studies always carry an inherent amount of confounding, because subjects are intentionally allocated to treatment groups according to baseline factors. Multiple statistical methods have been developed to adjust for these differences and improve balance between groups. However, it is important to note that unlike RCTs, only known and measured confounders can be adjusted. Any bias caused by unknown or unmeasured differences between groups will persist.
In the following sections, we highlight three of the methods used most commonly to adjust for confounding factors: multiple logistic regression (MLR), Cox proportional hazard, and propensity scoring (PS). For each method, we provide a non-technical overview, some common pitfalls, and critical reporting guidelines. Examples of these methods are provided using articles published previously in Surgical Infections.
Multiple Logistic Regression
A simple correlation between two variables is the amount of change expected in variable x when variable y is increased or decreased. The strength of the correlation is measured by Pearson and Spearman correlation coefficients, which range from −1 to 1. Refer to the detailed description in Ho and Pieracci [5]. Using the example of body mass index and surgical site infection, examples of positive, negative, and no correlations are displayed in Figure 1. Multiple logistic regression can be conceptualized as a multidimensional version of the correlation and is a statistical tool used for both explanatory and predictive modeling [6]. Using this technique, one tries to determine the strength of association between a predictive (independent) variable and an outcome (dependent) variable while holding all other predictive variables constant. In non-randomized studies, Multiple logistic regression is frequently implemented to analyze the influence of categorical and/or continuous variables (independent variables) on a single dichotomous outcome (dependent variable) [7]. Its utility resides in its ability to control for multiple covariates while quantifying the magnitude of each covariate's influence on the outcome, which is reported as an odds ratio [8]. By this same process, predictive models may be generated to estimate an individual's odds of an outcome based on the presence or absence of known prognostic factors [9]. We use the study by McGillen et al. [10] as an example to illustrate these concepts. In this study, the authors explored whether the duration of post-operative antibiotic agents was associated with infection after appendectomy for complicated appendicitis. When the authors compared patients who developed an infection with those who did not, they found that patients with an infection were more likely to have free fluid on their initial computed tomography (CT) scan, and were also more likely to have had a drain placed intra-operatively. Therefore, these two potentially confounding factors were used in the MLR, as well as duration of antibiotic therapy (independent variables), to determine their impact on the odds of post-operative infection (dependent variable).

Depictions of strong positive correlation
Selecting the appropriate covariates to include in the model is a critical consideration when preparing for the analysis. The conventional process for selecting variables begins by conducting multiple comparisons between outcome groups, such as control group and intervention group, to identify significant variables, with a more liberal cutoff for significance (commonly p < 0.2) [11]. Results of these univariate comparisons are shown in Table 1. Variables with statistically significant differences between the outcome groups should then be reviewed to ensure their relation to the outcome is both plausible and clinically meaningful [11]. In McGillen et al. [10], both free fluid on CT and intra-operative decision to leave a drain are markers of severity of appendicitis, so a relation to post-operative infections is indeed plausible. All included variables should be independent without relation to other variables. For example, white blood cell count and absolute neutrophil count would have highly related values, and therefore should not both be used; this colinearity weakens the predictive model. An additional assumption for MLR is that the relation between the predictive variables and the outcome is constant (linear correlation) across all values for the variable [12]. This assumption may not always be true, for example at the physiologic extremes of variables such as age, height, weight, etc.
Baseline Demographic and Clinical Characteristics According to Post-Operative Surgical Site Infection Development a
Reprinted with permission [10].
NS = not significant; HIV = human immunodeficiency virus; CAD = coronary artery disease; WBC = white blood cells; ASA = American Society of Anesthesiologists; LOS = length of stay; IR = interventional radiology.
Mean ± standard deviation.
t-test is only between intravenous antibiotic agents and intravenous plus oral antibiotic agents.
Patient underwent delayed abdominal closure and did not develop surgical site infection.
Including too many variables in the model can be problematic, especially for rare events. One issue, termed overfitting the model, occurs when the number of events is small compared with the number of variables included in the regression [13]. In this situation, the model is fit to “noise” in addition to underlying variables, which limits the utility of the model with a different patient sample [13]. Therefore, it has been suggested that the number of covariates be limited by the sample size and the event rate (i.e., the incidence of the dependent variable) [14]. In the referenced article [10], there were 25 subjects with a post-operative infection, and three variables included in the model, which comes out to roughly eight events per variable (EPV). The conventionally accepted rule of thumb states that there should be a minimum 10 EPV, and no fewer than 20 EPV for risk prediction [15,16]. However, this remains a controversial topic, because simulations have shown that results using five to nine EPV are comparable with those with 10–16 EPV [16].
The validity of a given model is assessed through measures of discrimination and calibration. Discrimination is the ability of a model to correctly distinguish subjects at high risk of experiencing the event from subjects at low risk [9]. In the study by McGillen et al. [10], discrimination is the ability of the three included variables to predict who would develop an infection and who would not. For logistic regression, which has a binary categorical outcome (infection or no infection), discrimination is best determined by a concordance measure known as the C-index (or C-statistic) [9]. This measure is equivalent to the area under the receiver operating characteristics (AUROC) curve and is usually reported as such [9]. Calibration, which is also referred to as goodness-of-fit, is the model's ability to estimate the absolute risk correctly [17]. In the referenced article [10], the authors report that 24 hours of post-operative antibiotic agents was associated with an increase in the odds of infection by a factor of 10 compared with no post-operative antibiotic agents (Table 2). In a well-calibrated model, for every infection in a subject who received no antibiotic agents, we would expect 10 infections in subjects with 24 hours of antibiotic agents. Calibration is most often reported using the Hosmer-Lemeshow test, which compares the difference between predicted (what we would expect based on the odds ratios) and observed (what the sample actually shows) event rates and determines a probability that the difference is attributable to chance alone [17]. Therefore, a Hosmer-Lemeshow p value that is statistically significant (p < 0.05) suggests that the difference between the observed and expected event rates cannot be explained by chance alone and is instead the result of a poorly calibrated model [17]. It is worth noting that the Hosmer-Lemeshow test may be misleading with large samples, because a statistically significant p value may correspond to a clinically insignificant difference between expected and observed odds [17]. These values should always be reported in predictive models, however, they are less important for explanatory models [6].
Odds of Surgical Site Infection Development Based on Post-Operative Antibiotic Duration a
Reprinted with permission [10].
CI = confidence interval.
Adjusted for post-operative drain placement and free fluid on imaging.
Multiple logistic regression is among the methods most commonly utilized to control for confounding in non-randomized studies. Selecting which variables to include in a model should be the most important consideration. As such, authors must report how and why the variables in their model were selected. Performance of the model should be reported in terms of discrimination and calibration, and we suggest using the AUROC and the Hosmer-Lemeshow test for these values, respectively. Whereas model performance should always be reported for predictive models, it is unnecessary for explanatory models.
Cox Proportional Hazards
Cox proportional hazards modeling is conceptually similar to MLR and is used when analyzing survival data over time [9]. Proper application of Cox proportional hazards is contingent upon understanding the unique aspects of survival analyses, including censorship and time to event outcomes. Although multiple methods exist, Kaplan-Meier (KM) curves are the most commonly utilized and will serve as the focus of this section. The example used in this section comes from Guidry et al. [18], whose study explored how aggressive and conservative antimicrobial protocols in the intensive care unit (ICU) might impact long-term survival.
The use of KM curves is preferred because they allow subjects with incomplete data, seen commonly in non-randomized studies, to be included in the analysis as censored subjects [19]. Censored subjects are those who did not experience the event (usually death, but may also be infection, cancer recurrence, etc.), either because they are lost to follow-up or because the study ends [9]. This allows those subjects with incomplete records to contribute partially to the analysis for the time period in which they were followed, and to be excluded from the analysis after that timepoint. For each subject, three datapoints are required to perform the analysis: (1) serial time, which is the total time of observation of the subject; (2) status at specific timepoints, which is either the outcome (death) or the censoring (last encounter) of the subject; and (3) subject's treatment group (intervention or control) [20]. In the study by Guidry et al. [18], serial time would be the interval from ICU admission until the date of death, last known survival date, or the date of termination of the study, depending upon the patient. Each patient was followed for four years. It therefore follows that all patients who survived to four years become censored at the end of the study period, because they are no longer being followed. The two treatment groups in this study were based on aggressive and conservative antimicrobial protocols. Curves for each group are generated to display survival, which is the proportion of all subjects without the event over set time intervals [21]. The hazard rate, which instead represents the incidence of event occurrence, can also be calculated from the curve [22]. To determine if a significant difference in survival (or hazard) exists between groups, the log-rank test is performed [20]. In the cited study, there was no significant difference in survival between patients in the aggressive and conservative protocol groups at four years.
Cox proportional hazards may be applied to the survival curves to adjust for baseline group differences and provide a hazard ratio to quantify the effect that any single factor contributes to the survival curve [22,23]. Guidry et al. [18] performed Cox proportional hazards to control for differences in demographics and comorbidities between patients in each of the protcols. Two key assumptions must be met for the model to hold. The first is called the proportionality assumption, which stipulates that the hazards between the study groups are proportional (i.e., curves do not cross). The second assumption is that all variable effects are roughly constant throughout the study period [23]. The authors of this study clearly state that variables were checked to make sure these assumptions held true, and variables that did not were excluded. As with MLR, the rule of 10 EPV has been recommended for Cox proportional hazards [15,16] After controlling for demographics and comorbidities on Cox proportional hazards, the authors found a significant association between the aggressive antimicrobial protocol and death at four years (Table 3).
Cox Proportional Hazards Model Results
Reprinted with permission.
CI = confidence interval; APACHE = Age and Acute Physiology and Chronic Health Evaluation.
Defined as baseline creatine ≥2.0 mg/dL.
No significant association was identified at 1 year and therefore was not included in the model.
Refers to any renal replacement required at the time of diagnosis.
One important consideration in KM curves and Cox proportional hazards is the presence of any competing risks [24]. Competing risks are events that prevent the subject from experiencing the outcome of interest. For example, the authors in the referenced article [18] might have chosen re-admission to the ICU with an infection as their outcome of interest. Any patient who died during the study period is no longer at risk of being re-admitted to the ICU with an infection, therefore, death would be considered a competing risk. There are several strategies available for handling data with competing risks, however, this is outside of the scope of this article.
Kaplan-Meier curves and Cox proportional hazards are best applied when the outcome of interest is the time to an event. Kaplan-Meier curves are generated to compare survival between two treatment groups; the log-rank test is subsequently performed to determine if differences in survival are statistically significant, and this result should be reported. Cox proportional hazards may be applied to adjust for baseline differences between groups. Similar to MLR, variables must be selected carefully and authors must provide justification for all variables included.
Propensity Score
Propensity score (PS) methods are a category of analyses that includes four distinct approaches to address confounding in non-randomized studies. The four analyses are based on calculating a PS for each subject, which ranges from zero to one and is defined as the probability of receiving an intervention based on observed baseline characteristics [14,25]. One way to understand the PS score is to consider “assuming that my patient has these observed baseline characteristics, what is the likelihood that he/she will receive this intervention?” A PS of 0 means that the patient would never receive the intervention, a PS of 0.5 means that the patient would receive the intervention 50% of the time, and a PS of 1 means that the patient would always receive the intervention. The example in this section comes from Gabriel et al. [26], who performed a PS matched analysis between patients who underwent rapid source control laparotomy (RSCL) and those who had primary fascial closure (PFC) to compare hospital length of stay (LOS) and mortality. All baseline variables that predict outcome independently should be included, whereas those that only affect the probability of receiving the intervention (that are not associated with the outcome) should be excluded [14,25]. The authors in the cited study selected indication for surgery (based on International Classification of Disease [ICD]-10 diagnostic code), patient comorbidities, and presence of pre-operative sepsis as their matching variables, given their influence on LOS and mortality. It therefore holds that for any given PS, baseline characteristics should have similar distributions in both treatment groups, which is demonstrated in Table 4 [27]. Because PS consolidates multiple variables into one numeric value, these methods are ideal for their ability to handle even few events per variable [28].
Indications for Intervention, Baseline Patient Characteristics, and Operative Risk in RSCL and PFC Cohorts
Reprinted with permission [26].
ICD = International Classification of Diseases; PFC = primary fascial closure; RSCL = rapid source-control laparotomy.
The four methods for PS analysis are: (1) propensity score matching (PSM), (2) PS stratification, (3) PS adjustment, and (4) PS analysis by inverse probability of treatment weighting [14]. Of these methods, PSM is used most frequently because it has been shown to balance the greatest proportion of baseline differences [25]. Therefore, we focus here exclusively on PSM. For a more detailed discussion of all four PS analysis techniques, the authors refer the reader to an article by Zakrison et al. [25]. With PSM, subjects from the treatment group are matched to subjects from the control group according to PS [14]. Several matching algorithms exist to identify a match between subjects. The most common matching algorithm is greedy nearest neighbor matching, whereby subjects from the intervention group are matched to the nearest subject from the control group [14,29]. Greedy matching will always produce a match, even if the matched pair are dissimilar in their PS. A variation on this algorithm is greedy nearest neighbor matching within specified caliber widths [29]. This is similar to greedy nearest neighbor with the exception that a match will only occur if the difference between the PS does not exceed a fixed value (most commonly 0.2 of the standard deviation of the logit of the PS) [30]. An alternative approach is optimal matching, which minimizes the average difference in PS between all matched pairs [27]. Finally, matching may occur with or without replacement. In matching without replacement, once a control is matched to a treated subject, that control is no longer available to match to a different treated subject [29]. This is in contrast to matching with replacement, which allows a control to match to more than one treated subject [29]. In comparing these methods, greedy nearest neighbor caliper matching without replacement appears to result in the least amount of bias, and is therefore the recommended algorithm [29]. In the cited example from Gabriel et al. [26], PS matching was integrated into their NSQIP database query. Although not explicitly stated by the authors, this strategy most resembles the greedy nearest neighbor matching without replacement algorithm.
Matching is performed most commonly in a 1:1 ratio, however, this may be expanded so that each treated subject is matched to multiple controls [25,27]. Matching multiple controls to each subject has the advantage of increasing the precision of the estimated treatment effect (by increasing the sample size), however, this may also result in dissimilar matches that will increase bias [31]. It is therefore recommend to limit matches to two controls per treated subject [31]. To determine the adequacy of matching and ensure balance between the two groups, the covariate balance must be assessed and reported. Table 4 confirms that Gabriel et al. [26] achieved perfect balance in the matched variables between the RSCL and PFC groups. Balance is critical to appropriate matching and unbiased effect estimation [14]. It is recommended that the standardized difference of variables be used for this analysis, with an accepted cutoff of <0.1 [25]. Once matching is completed, an analysis that reflects the paired relation of the samples (rather than independent samples), i.e., paired samples t-test, McNemer test, and Cox proportional hazards, may be performed [25]. In the above example, the authors performed an MLR to identify risk factors for death. Subjects who underwent RSCL had approximately two-fold increased odds of death compared with those who underwent PFC.
Analyses based on PS can be highly variable given the numerous methods and algorithms discussed above. As such, it is essential that authors be comprehensive in reporting their methods and results. All variables used to generate the PS, as well as a rationale for inclusion, should be provided. The PS method, matching algorithm (if applicable), and any additional tests used for analysis should be explicitly stated. Finally, authors should provide confirmation that covariate balance was assessed and achieved.
Conclusion
Randomized controlled trials are the preferred study design for mitigating bias from both known and unknown factors. When RCTs are not feasible, non-randomized studies may serve as a reasonable alternative if adjustment for baseline differences is performed. The method by which to address confounding should be selected according to the data format and sample size. Reporting of these methods should provide justification for the selected covariates, confirmation that data did not violate the model assumptions, and measures of model performance.
Footnotes
Funding Information
No funding was received.
Author Disclosure Statement
None of the authors report any financial disclosures/conflicts of interest
