Post-Operative Infection Prediction and Risk Factor Analysis in Colorectal Surgery Using Data Mining Techniques: A Pilot Study

Abstract

Background:

Post-operative infections have many negative consequences for patients' health and the healthcare system. Among other things, they increase the recovery time and the risk of re-admission. Infections also result in penalties for hospitals and lower quality performance measures. Surgeons can take preventive actions if they can identify high-risk patients. The purpose of this study was to develop a model to help predict which patients are at risk for post-operative infection.

Methods:

A retrospective analysis was conducted on patients with colorectal post-operative infections. Univariable analysis was used to identify the features associated with post-operative infection. Then, a support vector classification-based method was employed to select the most informative features and build prediction models. Decision tree, support vector machine (SVM), logistic regression, naïve Bayes, neural network, and random forest algorithms were implemented and compared to determine the algorithm that best predicted high-risk patients.

Results:

From 2016 to the first quarter of 2017, 208 patients who underwent colorectal resection were analyzed. The factors with a statistically significant association (p < 0.05) with post-operative infections were elective surgery, origin status, steroid or immunosuppressant use, >10% loss of body weight in the prior six months, serum creatinine concentration, length of stay, unplanned return to the operating room, administration of steroids or immunosuppressants for inflammatory bowel disease, use of a mechanical bowel preparation, various Current Procedural Terminology (CPT) codes, and discharge destination. However, accurate prediction models can be developed with seven factors: age, serum sodium concentration, blood urea nitrogen, hematocrit, platelet count, surgical procedure time, and length of stay. Logistic regression and SVM were stable models for predicting infections.

Conclusion:

The models developed using the pre-operative features along with the full list of features helped us interpret the results and determine the significant factors contributing to infections. These factors present opportunities for proper interventions to mitigate infection risks and their consequences.

Surgical site infections (SSI) have become a focal point for evaluating outcomes after colorectal surgery. They are a significant cause of morbidity and increased healthcare costs as they prolong hospital stays, increase re-admission rates, and can mandate additional procedures [1,2]. Every SSI occurrence adds at least $17,000 to an individual's bill, which results in approximately $1.6 billion in direct costs yearly in the United States [1,3]. Estimated rates of infections after colectomy range from 3% to 25% [4,5].

Multiple risk factors have been linked to high incidences of SSIs in colorectal operations. However, there is wide variation between studies when looking at specific variables, and no single intervention has been shown definitively to decrease SSI rates [6,7].

Data mining, a concept that originated in the middle of the 1990s, is the process of finding patterns, clustering, and classifying data to obtain insights and knowledge and build models to predict outcomes [8]. These techniques are used widely in many areas outside healthcare. In healthcare systems, data mining is applied increasingly because of the power of its techniques to analyze complex and voluminous data. Data mining can facilitate decision making by identifying hidden patterns and trends in complex data. The analysis can help hospitals and healthcare systems reduce costs, increase revenue, and enhance operating efficiency.

Data mining applications in healthcare include identification and classification of at-risk populations and the discovery of patterns in the treatment of diseases. There is an extensive application of mining algorithms in clinical decision making (diagnosis, choice of treatment, and prognosis prediction) and administrative decision making (staff estimation, market trends analysis, process improvement, and insurance payments management) [9].

The purpose of this study was to develop and compare various prediction models using data mining techniques and determine the best-performing model in identifying those patients who are more likely to develop SSI after colectomy.

Patients and Methods

After approval was obtained from the Institutional Review Board, data were collected from a tertiary referral hospital in Georgia using the electronic medical records. Patients admitted from 2016 through the first quarter of 2017 who underwent colorectal resection were included in the study. We observed that 18 of the 208 patients having colorectal surgery during that period developed post-operative infections. The data were used to develop models to predict such infections. Table 1 lists the 38 patient characteristics examined and compares these features in the infected and non-infected patients. Continuous variables are reported as median values and categorical variables as the number and percentage of patients. Significant differences between infected and non-infected patients were evaluated with a t-test for continuous features and a χ² test for categorical features.

Table 1.

Patient Characteristics

Index	Variable	Not infected (188)	Infected (18)	p
1	Age	60.73	55.28	0.186
2	Gender (%)			0.264
	Male	89 (47.3)	11 (61.1)
	Female	99 (52.7)	7 (38.9)
3	Race (%)			0.641
	White	115 (61.2)	10 (55.6)
	African-American	73 (38.8)	8 (44.4)
4	Elective surgery (%)			0.005
	No	54 (28.7)	11 (61.1)
	Yes	134 (71.3)	7 (38.9)
5	Origin (%)			0.043
	Not transferred	167 (88.8)	13 (72.2)
	Outside emergency department	21 (11.2)	5 (27.8)
6	Mean Body Mass Index	30.15	27	0.22
7	Smoker (%)			0.796
	No	151 (80.3)	14 (77.8)
	Yes	37 (19.7)	4 (22.2)
8	Hypertension (%)			0.547
	No	70 (37.2)	8 (44.4)
	Yes	118 (62.8)	10 (55.6)
9	Disseminated cancer (%)			0.885
	No	179 (95.2)	17 (94.4)
	Yes	9 (4.8)	1 (5.6)
10	Steroid/immunosuppressant use for chronic condition (%)			<0.001
	No	186 (98.9)	14 (77.8)
	Yes	2 (1.1)	4 (22.2)
11	>10% loss of body weight in 6 mos prior to surgery (%)			<0.001
	No	185 (98.4)	14 (77.8)
	Yes	3 (1.6)	4 (22.2)
12	Mean serum sodium concentration	139.68	138.94	0.321
13	Mean blood urea nitrogen	17.07	14	0.116
14	Mean serum creatinine concentration	1.1	0.8	0.001
15	Mean white blood cell count	8.4	7.8	0.384
16	Mean hematocrit	37.08	34.26	0.125
17	Mean platelet count	271.08	311	0.158
18	Emergency case (%)			0.461
	No	167 (88.8)	17 (94.4)
	Yes	21 (11.2)	1 (5.6)
19	Mean surgical procedure time	162	188	0.253
20	Mean no of other procedures	0.59	1.22	0.098
21	Mean no of concurrent procedures	0.14	0.17	0.812
22	Mean no superficial incisional SSI	0.04	0	0.008
23	Mean no deep incisional SSI	0	12	<0.001
24	Mean length of stay (d)	7.07	28.27	<0.001
25	Readmission within 30 d (%)	28 (14.9)	6 (33.3)	0.213
26	Unplanned return to operating room (%)			<0.001
	No	182 (96.8)	14 (77.8)
	Yes	6 (3.2)	4 (22.2)
27	Colectomy steroid immunosuppressant for IBD (%)			0.001
	No	184 (97.9)	15 (83.3)
	Yes	4 (2.1)	3 (16.7)
28	Colectomy mechanical bowel preparation (%)			0.001
	No	56 (29.8)	12 (66.7)
	Yes	132 (70.2)	6 (33.3)
29	Colectomy oral antibiotic preparation (%)			0.066
	No	46 (24.5)	8 (44.4)
	Yes	142 (75.5)	10 (55.6)
30	Colectomy chemotherapy within 90 d (%)			0.796
	No	180 (95.7)	17 (94.4)
	Yes	8 (4.3)	1 (5.6)
31	CPT code (%)			<0.001
	140	17 (9.0)	2 (11.1)
	141	3 (1.6)	1 (5.6)
	143	9 (4.8)	4 (22.2)
	144	1 (0.5)	0
	145	11 (5.9)	2 (11.1)
	146	3 (1.6)	0
	150	2 (1.1)	3 (16.7)
	160	22 (11.7)	2 (11.1)
	204	55 (29.3)	0
	205	29 (15.4)	2 (11.1)
	206	5 (2.7)	0
	207	22 (11.7)	1 (5.6)
	210	9 (4.8)	1 (5.6)
32	Surgeon no (%)			0.353
	1	4 (2.1)	0
	2	5 (2.7)	1 (5.6)
	3	27 (14.4)	1 (5.6)
	4	14 (7.5)	1 (5.6)
	5	11 (5.9)	0
	6	1 (0.5)	0
	7	1 (0.5)	0
	8	58 (30.9)	11 (61.1)
	9	12 (6.4)	0
	10	1 (0.5)	0
	11	16 (8.5)	0
	12	9 (4.8)	3 (16.7)
	13	6 (3.2)	0
	14	9 (4.8)	0
	15	6 (3.2)	0
	16	8 (4.3)	1 (5.6)
33	Diabetes mellitus (%)			0.5
	No	155 (82.5)	15 (83.3)
	Insulin	11 (5.9)	0
	Non-insulin	22 (11.7)	3 (16.7)
34	Sepsis (%)			0.376
	None	172 (91.5)	16 (88.9)
	Sepsis	7 (3.7)	0
	Septic shock	7 (3.7)	1 (5.6)
	SIRS	2 (1.1)	1 (5.6)
35	Incision classification (%)			0.095
	Clean	1 (0.5)	0
	Clean/contaminated	140 (74.5)	10 (55.6)
	Contaminated	18 (9.6)	1 (5.6)
	Dirty/infected	29 (15.4)	7 (38.9)
36	Incision closure (%)			0.85
	All layers of incision (deep and superficial) fully closed = 0	167 (88.8)	16 (88.9)
	No layers of incision are surgically closed = 1	3 (1.6)	0
	Only deep layers closed; superficial left open = 2	18 (9.6)	2 (11.1)

BMI = body mass index; BUN = blood urea nitrogen; CPT = current procedural terminology; SIRS = systemic inflammatory response syndrome; SSI = surgical site infection; WBC = white blood cells.
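The χ² comparison used for the categorical rows of Table 1 can be illustrated with a minimal sketch. It takes the elective-surgery counts from row 4 and computes a Pearson χ² statistic without continuity correction; whether the original analysis applied a correction is not stated, so this choice is an assumption, and the function name `chi_square_2x2` is hypothetical. Under that assumption, the resulting p-value is in line with the 0.005 reported in the table.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    table[i][j] is the observed count in row i, column j.
    Returns (statistic, p_value) for 1 degree of freedom.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square survival function
    # reduces to the complementary error function.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Elective surgery (Table 1, row 4): rows = No/Yes,
# columns = not infected/infected.
elective = [[54, 11], [134, 7]]
stat, p = chi_square_2x2(elective)
print(f"chi2 = {stat:.2f}, p = {p:.4f}")
```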

Dummy variables

For a qualitative attribute that has two or more categories, representing each level by a number (1, 2, 3, and so on) does not make sense because these numbers are used to indicate various categories of that attribute and do not have intrinsic meaning of their own. Thus, artificial variables are created that take values of 0 or 1 to indicate whether the sample does or does not belong to the specific category of the attribute. These artificial variables are called “dummy” variables, and the technique is widely used in regression, logistic regression, and other data mining algorithms [10]. For instance, to identify various categories of incision classification (clean, clean/contaminated, contaminated, and dirty/infected), three dummy variables are created as substitutes for the original variable in the dataset. These dummy variables are binary, so 1 indicates membership in the class; 0 means the class does not apply. The dummy variables used for incision are incision_clean/contaminated, incision_contaminated, and incision_dirty/infected. The original dataset's category of clean was eliminated because if all three of the dummy variables have a value of 0, the incision is regarded as clean.

The dataset contains 38 characteristics that are either continuous (e.g., age) or categorical (e.g., incision type). After the dummy variable technique is applied, the total number of features is 83.
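The incision classification example can be sketched in a few lines of Python. The helper `dummy_encode` and its parameters are hypothetical names chosen for illustration; the sketch only shows that the baseline category (clean) needs no column of its own because it is implied when all three dummy variables are 0.

```python
def dummy_encode(value, categories, baseline, prefix):
    """Return 0/1 dummy variables for one categorical value.

    The baseline category gets no column; it is implied when every
    dummy variable equals 0.
    """
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return {
        f"{prefix}_{cat}": int(value == cat)
        for cat in categories
        if cat != baseline
    }


INCISION_CLASSES = ["clean", "clean/contaminated", "contaminated", "dirty/infected"]

encoded_dirty = dummy_encode("dirty/infected", INCISION_CLASSES, "clean", "incision")
encoded_clean = dummy_encode("clean", INCISION_CLASSES, "clean", "incision")
print(encoded_dirty)
print(encoded_clean)
```

Four categories therefore yield three binary features, which is how the 38 original characteristics expand into a larger feature set.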

Feature Selection

Usually, not all the features are relevant and useful for application in prediction models. Feature selection or variable selection is a technique that tries to choose the best features that contain the most useful information to predict the dependent variable [11]. Feature selection removes irrelevant and redundant features and can improve the performance of prediction algorithms [12]. Feature selection techniques reduce the complexity of generated models and result in more cost-effective predictors; because the techniques hunt for the most valuable features, they provide a better understanding of the importance of the features [13].

There are three classes of feature selection techniques: filter, wrapper, and embedded. Filter methods select a variable for inclusion in the subset of best predictors based on some scores and independent of any prediction algorithms; for instance, the correlation between the independent variables and the dependent variable. In wrapper methods, the prediction model evaluates the interaction between variables and selects the combination that has the best predictive power. Embedded methods perform feature selection as part of a training procedure and add a penalty to an objective function to eliminate the insignificant variables [13].
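As an illustration of the filter class only, the sketch below scores each feature by the absolute Pearson correlation with the dependent variable, independent of any prediction algorithm, and ranks the features by that score. The data and the names `pearson`, `rank_features`, and `toy_features` are invented for the example; this is not the selection method used in the present study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_features(columns, target):
    """Rank features by |correlation| with the target, highest first."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented toy data: three features and a binary outcome.
toy_features = {
    "feature_a": [1, 2, 3, 4, 5, 6],
    "feature_b": [3, 1, 4, 1, 5, 9],
    "feature_c": [2, 2, 2, 2, 2, 2],
}
toy_outcome = [0, 0, 0, 1, 1, 1]

ranking = rank_features(toy_features, toy_outcome)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A filter of this kind scores each variable separately, so it cannot account for interactions between variables, which is the gap that wrapper and embedded methods address.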

In this study, an L1-based feature selection technique, linear support vector classification [14], was applied to select the features with non-zero coefficients and reduce the dimensionality of the data for the various prediction algorithms. This technique can be classified as a wrapper method that considers not only the impact of the individual factors on the dependent variable but also the relation between the selected variables. Thus, a set of factors is selected that can classify the dependent variable efficiently.

Imbalanced data

In general, most prediction algorithms perform better if the dataset is balanced; i.e., the ratios of the positive and negative cases are almost equal [15]. However, in the current dataset, only 8.7% (18 of 208) of the cases were positive (i.e., infected). The class that has the most members is considered the dominant class.

Many techniques have been used in data mining to deal with imbalanced data. The random under-sampling technique eliminates samples randomly from the dominant class to obtain balanced classes [16]. The disadvantage of this technique is that important information can be lost when cases from the majority class are discarded. In contrast, in the random over-sampling technique [16], random cases of the minority class are replicated to balance the class distribution. This can increase the risk of overfitting, as it generates exact copies of the minority examples. Therefore, in this study, a heuristic method, the synthetic minority over-sampling technique (SMOTE) [17], was used to over-sample the minority cases by generating a new sample between two adjacent minority samples. Because the new samples are not exact copies, SMOTE can reduce the risk of overfitting and help the prediction algorithm learn from the dataset.
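The interpolation idea behind SMOTE can be sketched as follows. This is a simplified illustration, not the implementation used in the study: the function name `smote_sketch`, the toy coordinates, and the parameter defaults are assumptions, and the neighbor search is a plain Euclidean sort over the minority samples.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority samples.

    Each synthetic point lies on the segment joining a randomly chosen
    minority sample and one of its k nearest minority neighbors.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Sort the other minority samples by distance to the base sample.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Invented two-dimensional minority samples.
minority_points = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_sketch(minority_points, n_new=6, k=2)
for point in new_points:
    print([round(v, 3) for v in point])
```

Because every synthetic point is a convex combination of two existing minority samples, it stays inside the region already occupied by the minority class instead of duplicating a single case.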

Performance measurements

Each prediction model produces a discrete output that indicates the class label of the predicted instance. When there are two classes (infected and not infected, or positive and negative), there are four possible outcomes. If the patient is a positive case and is predicted to be positive, it is counted as a true positive (TP); if the patient is predicted to be negative, it is counted as a false negative (FN). Likewise, if the patient is a negative instance and is predicted to be negative by the model, it is a true negative (TN); if the patient is predicted to be positive, it is counted as a false positive (FP). A two-by-two matrix, called a confusion matrix, can be constructed from the four outcomes to demonstrate how well the model predicts the class of the instances (Fig. 1).

FIG. 1.

Confusion matrix. TN = true negative; FN = false negative; TP = true positive; FP = false positive.

The measures used to assess the algorithms' performances in predicting the class of the instances are accuracy, sensitivity, specificity, and precision.

Accuracy shows the percentage of patients whose infection status is predicted correctly: $\mathrm{Accuracy} = \frac{TN + TP}{TN + FN + TP + FP}$

Sensitivity shows the proportion of infected cases that are correctly predicted to be positive: $\mathrm{Sensitivity} = \frac{TP}{TP + FN}$

Specificity represents the percentage of the not-infected cases that are correctly predicted as negative: $\mathrm{Specificity} = \frac{TN}{TN + FP}$

Precision illustrates the percentage of cases that are predicted to be positive and actually develop infections: $\mathrm{Precision} = \frac{TP}{TP + FP}$
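The four formulas above can be checked with a short sketch. The function names are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here rather than one stated in the study. The worked example uses the naïve Bayes counts reported in Table 2 (TN = 30, FP = 4, FN = 1, TP = 7).

```python
def confusion_counts(y_true, y_pred):
    """Tally (TN, FP, FN, TP) from true and predicted binary labels (1 = infected)."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def _ratio(numerator, denominator):
    # Assumed convention: report 0.0 when the measure is undefined.
    return numerator / denominator if denominator else 0.0


def classification_metrics(tn, fp, fn, tp):
    """Accuracy, sensitivity, specificity, and precision from the four counts."""
    return {
        "accuracy": _ratio(tn + tp, tn + fn + tp + fp),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "precision": _ratio(tp, tp + fp),
    }


# Naive Bayes counts from Table 2.
nb_metrics = classification_metrics(tn=30, fp=4, fn=1, tp=7)
for name, value in nb_metrics.items():
    print(f"{name}: {value:.3f}")
```

The rounded values agree with the naïve Bayes row of Table 3 (0.881, 0.875, 0.882, and 0.636).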

Another important measurement is the receiver operating characteristic (ROC) curve, which plots the true-positive rate (TPR) on the Y axis against the false-positive rate (FPR) on the X axis. A powerful prediction model would score near the upper left-hand corner (higher TPR, lower FPR). This diagram can be used to compare the performance of various prediction models. A model with a larger area under the ROC curve (AUC; closer to 1) is better at predicting the class of the instances [18].
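The AUC has a convenient interpretation: it equals the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below computes the AUC directly from that definition; the function name `roc_auc` and the toy scores are assumptions made for illustration.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    A pair is ranked correctly when the positive case has the higher
    score; tied pairs contribute one half.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented example: two negative and two positive cases.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(toy_labels, toy_scores)
print(auc)
```

In the toy example, three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75.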

Six prediction algorithms were used to predict the class (infected or not infected) of the patients: naïve Bayes (NB) [19], decision tree (DT) [19], logistic regression (LR) [19], SVM [18], artificial neural network (ANN) [20], and random forest (RF) [21]. The dataset was split into 80% training data, to which SMOTE was applied and on which the prediction models were built, and 20% testing data, which were used to investigate the models' performance in predicting the class of the data.
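A minimal sketch of the 80/20 split is given below. How the split was drawn in the study (for example, whether it was stratified by infection status) is not stated, so the simple random shuffle, the seed, and the function name `split_train_test` are assumptions. With 208 patients, the split yields 166 training and 42 testing cases, and 42 is also the number of test cases that appear in each confusion matrix in Table 2.

```python
import random


def split_train_test(items, train_fraction=0.8, seed=0):
    """Shuffle the items and split them into training and testing parts."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder identifiers standing in for the 208 patient records.
patient_ids = list(range(208))
train_ids, test_ids = split_train_test(patient_ids)
print(len(train_ids), len(test_ids))
```

Over-sampling is applied to the training part only, so the testing data contain no synthetic cases.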

Internal validation

A small-sample dataset has a higher chance of overfitting in predictive modeling. However, techniques such as bootstrapping and cross-validation can help detect and limit overfitting. This research used a bootstrap resampling technique, which generates random samples with replacement from the original dataset [22], to train the prediction models. The optimism in model performance is the difference between the performance measured on the original dataset and the mean performance across the bootstrap samples; it indicates how much the original estimate should be corrected. This study used 100 bootstrap samples, each the size of the original dataset, and employed AUC as the performance measurement.
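The bootstrap procedure can be outlined with a generic sketch. The caller supplies a hypothetical `score_fn` that returns a performance value for a given sample; in the study that value is the AUC of a trained model, whereas the toy example below uses plain accuracy on invented label pairs to stay self-contained. The optimism is taken as the original score minus the mean bootstrap score, which matches the arithmetic of Table 5 (for naïve Bayes, 0.90 − 0.68 = 0.22).

```python
import random


def bootstrap_optimism(data, score_fn, n_boot=100, seed=0):
    """Return (original score, mean bootstrap score, optimism).

    score_fn(sample) -> float is supplied by the caller. Each bootstrap
    sample has the same size as the original data and is drawn with
    replacement.
    """
    rng = random.Random(seed)
    original = score_fn(data)
    boot_scores = []
    for _ in range(n_boot):
        resample = [rng.choice(data) for _ in data]
        boot_scores.append(score_fn(resample))
    boot_mean = sum(boot_scores) / n_boot
    return original, boot_mean, original - boot_mean


def accuracy(sample):
    """Fraction of (true label, predicted label) pairs that agree."""
    return sum(truth == pred for truth, pred in sample) / len(sample)


# Invented (true label, predicted label) pairs.
toy_pairs = [(1, 1), (0, 0), (0, 1), (1, 1), (0, 0), (1, 0), (0, 0), (0, 0)]
original, boot_mean, optimism = bootstrap_optimism(toy_pairs, accuracy)
print(f"original = {original:.3f}, bootstrap mean = {boot_mean:.3f}, "
      f"optimism = {optimism:.3f}")
```

A large positive optimism suggests that the original estimate was inflated, whereas a value near zero indicates stable performance across resamples.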

Figure 2 summarizes the various steps of the techniques implemented to analyze the data. The analysis was implemented in Python 2.7 using the scikit-learn package [14].

FIG. 2.

Flow diagram of various steps of the analysis. SMOTE = synthetic minority over-sampling technique.

We developed the prediction models based on all 83 features derived from the characteristics listed in Table 1 and identified the significant factors. Furthermore, the models were built again using only pre-operative features (i.e., Nos. 1–18, 27–31, and 33 and 34 in Table 1) to investigate the likelihood, at the time of admission, of developing a post-operative infection when only pre-operative features have been collected. The results of both sets of prediction models are presented below.

Results

There were 208 patients who underwent colorectal resection during the study period; 18 developed post-operative infections. When patient characteristics were examined, the following factors showed a statistically significant difference (p < 0.05): elective surgery, whether the patient was transferred in from another hospital, steroid or immunosuppressant use, use of the same agents for inflammatory bowel disease, >10% loss of body weight in the prior six months, serum creatinine concentration, length of stay (LOS), unplanned return to the operating room, use of a mechanical bowel preparation, various CPT codes, and discharge destination. There was no association between SSI and age, BMI, smoking status, surgical procedure length, oral antibiotic bowel preparation, diabetes mellitus, wound classification, or American Society of Anesthesiologists (ASA) score.

To select important factors, the SVC-based feature selection technique was used. The features selected by the technique were age, serum sodium concentration, BUN concentration, hematocrit, platelet count, surgical procedure time, and LOS. These factors contributed most to predicting the likelihood of developing infections after colorectal surgery. By recording these factors, the probability of infection could be calculated, and high-risk patients could be identified for interventions and further care. We produced well-performing prediction models by selecting just seven features.

Confusion matrices for various classifiers are depicted in Table 2.

Table 2.

Confusion Matrix for Various Algorithms

Prediction algorithm	True class	Predicted negative	Predicted positive
Naïve Bayes	Negative	TN (30)	FP (4)
Naïve Bayes	Positive	FN (1)	TP (7)
Logistic regression	Negative	TN (27)	FP (7)
Logistic regression	Positive	FN (2)	TP (6)
Decision tree	Negative	TN (32)	FP (2)
Decision tree	Positive	FN (4)	TP (4)
Support vector machine	Negative	TN (28)	FP (6)
Support vector machine	Positive	FN (2)	TP (6)
Artificial neural network	Negative	TN (31)	FP (3)
Artificial neural network	Positive	FN (2)	TP (6)
Random forest	Negative	TN (33)	FP (1)
Random forest	Positive	FN (2)	TP (6)

FN = false negative; FP = false positive; TN = true negative; TP = true positive.

The values of the performance measurements are shown in Table 3. Most algorithms perform well in predicting true-positive cases, i.e., patients who actually developed an infection and were predicted by the model to be positive; the decision tree, with a sensitivity of 0.5, is the exception.

Table 3.

Performance Measurements for Various Classifiers (Full Features)

Prediction algorithm	Accuracy	Sensitivity	Specificity	Precision
Naïve Bayes	0.881	0.875	0.882	0.636
Logistic regression	0.786	0.75	0.794	0.462
Decision tree	0.857	0.5	0.941	0.667
Support vector machine	0.81	0.75	0.824	0.5
Artificial neural network	0.881	0.75	0.912	0.667
Random forest	0.929	0.75	0.971	0.857

The AUC of the various methods based on full features is shown in Figure 3. These measurements were obtained with the seven features selected by the SVC-based feature-selection method from the full feature set.

FIG. 3.

Area under the curve (AUC) for various techniques (full features). NB = naïve Bayes; LR = logistic regression; DT = decision tree; SVM = support vector machine, ANN = artificial neural network; RF = random forest.

The results from developing prediction models based on pre-surgery features only are shown in Table 4 and Figure 4. In this case, 14 features were selected by the feature selection algorithm for building the prediction model (age, gender, elective surgery, BMI, steroid/immunosuppressant use for chronic condition, body weight loss, serum sodium, BUN concentration, serum creatinine concentration, WBC count, hematocrit, platelet count, pre-colectomy mechanical bowel preparation, CPT Code_8). Generally, the performance of algorithms worsens when a smaller number of features is used to develop the prediction models. Naïve Bayes and SVM perform best in classifying data in terms of AUC, although SVM has a greater accuracy, and naïve Bayes works better as judged by sensitivity.

FIG. 4.

Area under the curve (AUC) for various techniques (pre-operative features). NB = naïve Bayes; LR = logistic regression; DT = decision tree; SVM = support vector machine, ANN = artificial neural network; RF = random forest.

Table 4.

Performance Measurements for Various Classifiers (Pre-Operative Features)

Prediction algorithm	Accuracy	Sensitivity	Specificity	Precision
Naïve Bayes	0.643	0.875	0.588	0.333
Logistic regression	0.786	0.5	0.853	0.444
Decision tree	0.81	0.125	0.971	0.5
Support vector machine	0.81	0.625	0.853	0.5
Artificial neural network	0.19	1.0	0.0	0.19
Random forest	0.81	0.0	1.0	0.0

For internal validation, we considered the AUC measure, as it best represents the various algorithms' performance in predicting the dependent variable among all other measures. The mean AUC of the various methods after bootstrapping for full-feature models is presented in Table 5. It can be seen that although random forest has the highest AUC in the original dataset, considering the bootstrap samples and the optimism values, logistic regression and SVM have a more stable performance in predicting the result.

Table 5.

Area under the Curve Measure of Various Algorithms for Original and Bootstrap Data Sets

Prediction algorithm	Original	Bootstrap (100 repetitions)	Optimism in model performance
Naïve Bayes	0.9	0.68	0.22
Logistic regression	0.9	0.83	0.07
Decision tree	0.72	0.74	−0.02
Support vector machine	0.89	0.84	0.05
Artificial neural network	0.63	0.76	−0.13
Random forest	0.95	0.78	0.17

Discussion

As SSIs continue to gain attention, ways of predicting and preventing them become more important. They are a significant cause of morbidity and result in nearly one million excess hospital days per year, in addition to increasing hospital costs significantly [1,2].

Our results indicated that elective surgery, transfer from another institution, steroid or immunosuppressant use, >10% reduction in body weight in the prior six months, serum creatinine level, LOS, unplanned return to the operating room, inflammatory bowel disease, undergoing a mechanical bowel preparation, CPT code, and discharge destination all were significant factors in whether a patient developed a post-operative infection. Although some of the above factors correlate with the patient's health status (emergency surgery, steroid use, body mass loss, unplanned return to the operating room, inflammatory bowel disease, discharge destination) and therefore can explain the difference in SSI rates, others (such as the type of procedure, CPT code, or transfer in from another facility) do not. Interestingly, age, BMI, smoking status, surgical procedure length, oral antibiotic bowel preparation, diabetes mellitus, wound classification, and ASA score did not correlate with the SSI rate. These factors have been shown in various studies to increase the SSI rate, which further illustrates the complexity and variability in predicting SSIs after colorectal operations [2,3,23]. From a prediction performance perspective, however, the features age, serum sodium, BUN concentration, hematocrit, platelet count, surgical procedure time, and length of stay yielded the most powerful prediction models for identifying high-risk patients.

The above prediction models, when used appropriately, can identify those patients at higher risk of post-operative infections and therefore help the clinician better control the modifiable factors that are known from the literature to be significant in decreasing the risk of post-operative infectious complications. These factors include optimizing nutrition and performance status pre-operatively [24–26], following an enhanced recovery after surgery (ERAS) pathway [27–30], using sterile intra-operative techniques such as a separate closing tray [31], changing gown and gloves prior to closing [32,33], and implementing minimally invasive techniques [34,35].

Pedroso-Fernandez et al. [36] tried to predict SSI after colorectal surgery by identifying the surgical predictors. They conducted univariable and multivariable analysis on 911 patients undergoing colorectal surgery, of whom 221 developed SSI. They found that female sex, open surgery, a pre-operative stay of more than 72 hours, and dirty or contaminated surgery were significant predictors of SSI.

Fusco et al. [37] conducted a retrospective cohort study for 115 patients who underwent colon surgery from January 2010 to December 2013. They investigated the risk factors associated with SSI by univariable analysis. Also, they performed logistic regression analysis with stepwise backward variable selection and identified the significant factors. Male sex, Charlson index, and mechanical bowel preparation were among the significant predictors in both univariable and multivariable analysis.

Paulson, Thompson, and Mahmoud [38] analyzed the prospectively collected data of 787 patients who underwent colorectal procedures between 2011 and 2015. They identified incision length, body mass index, surgical indication, and surgical approach as the significant factors associated with SSI. The risk factors were identified by the multivariable logistic regression method.

A limitation of this study is that it was conducted within one healthcare facility; thus, it may not capture the variation among patient groups or types of hospitals. Also, the sample is relatively small, and using data from a broader timeframe or further experimentation on large and diverse datasets could strengthen the validation of the results. We conducted bootstrap analysis as internal validation to limit overfitting and investigated the stability of the prediction models' performance. However, applying the proposed techniques to a larger dataset could help obtain more reliable results. Overall, the developed prediction models may help surgeons identify patients at risk of a post-operative infection in a similar hospital with a similar patient population.

Conclusion

Colorectal post-operative infections remain a large burden on healthcare systems, and many studies and efforts have aimed to identify the factors associated with these complications and ways to prevent them. Data mining classification methods provide a useful tool to predict the probability of infection after colorectal surgery. Prediction models developed using pre-operative features enable physicians to estimate the risk of an infection at the time of a patient's admission. Our study showed that logistic regression and SVM algorithms performed better than the other classification methods when all features were used in the model. However, when only pre-operative characteristics were examined, SVM had the best performance. These models can help identify patients at high risk of SSIs and aid clinicians in decreasing this risk. Future studies could aim to collect more detailed data and develop models to help predict specific events, such as anastomotic leaks and cardiopulmonary complications, in pre-operative patients.

Footnotes

Author Disclosure Statement

No competing financial interests exist. No funding was received for the preparation of this manuscript.

References

1. Cima, Dankbar, Lovely, et al. Colorectal surgery surgical site infection reduction program: A National Surgical Quality Improvement Program–driven multidisciplinary single-institution experience. J Am Coll Surg, 2013; 216:23–33.
2. Tang, Chen, Wang, et al. Risk factors for surgical site infection after elective resection of the colon and rectum: A single-center prospective study of 2,809 consecutive patients. Ann Surg, 2001; 234:181–189.
3. Wick, Hirose, Shore, et al. Surgical site infections and cost in obese patients undergoing colorectal surgery. Arch Surg, 2011; 146:1068–1072.
4. National Nosocomial Infections Surveillance System. Data summary from January 1992 through June 2004; issued October 2004. Am J Infect Control, 2004; 32:470–485.
5. Smith, Bohl, McElearney, et al. Wound infection after elective colorectal resection. Ann Surg, 2004; 239:599.
6. Keenan, Speicher, Thacker, et al. The preventive surgical site infection bundle in colorectal surgery: An effective approach to surgical site infection reduction and health care cost savings. JAMA Surg, 2014; 149:1045–1052.
7. Anthony, Murray, Sum-Ping, et al. Evaluating an evidence-based bundle for preventing surgical site infection: A randomized trial. Arch Surg, 2011; 146:263–269.
8. Yoo, Alafaireet, Marinov, et al. Data mining in healthcare and biomedicine: A survey of the literature. J Med Syst, 2012; 36:2431–2448.
9. Tekieh, Raahemi. Importance of data mining in healthcare: A survey. Proc IEEE ACM Int Conf Adv Soc Netw Anal Min, 2015; 1057–1062.
10. Garavaglia, Sharma. A smart guide to dummy variables: Four applications and a macro. In Proceedings of the Northeast SAS Users Group Conference, 1998; 43.
11. Bermingham, Pong-Wong, Spiliopoulou, et al. Application of high-dimensional feature selection: Evaluation for genomic prediction in man. Sci Rep, 2015; 5:10312.
12. Aladeemy, Tutun, Khasawneh. A new hybrid approach for feature selection and support vector machine model selection based on self-adaptive cohort intelligence. Expert Syst Appl, 2017; 88:118–131.
13. Guyon, Elisseeff. An introduction to variable and feature selection. J Mach Learn Res, 2003; 3:1157–1182.
14. Pedregosa, Varoquaux, Gramfort, et al. Scikit-learn: Machine learning in Python. J Mach Learn Res, 2011; 12:2825–2830.
15. Longadge, Dongre. Class imbalance problem in data mining review. arXiv preprint arXiv:1305.1707, 2013.
16. Batista, Prati, Monard. A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor, 2004; 6:20–29.
17. Chawla, Bowyer, Hall, Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. J Artif Intell Res, 2002; 16:321–357.
18. Fawcett. An introduction to ROC analysis. Pattern Recognit Lett, 2006; 27:861–874.
19. Harrington. Machine Learning in Action. Greenwich: Manning; 2012.
20. Daniel. Principles of Artificial Neural Networks. World Scientific; 2013.
21. Ho TK. Random decision forests. Proc Int Conf Doc Anal Recognit, 1995; 1:278–282.
22. Efron, Tibshirani. An introduction to the bootstrap. Monographs on Statistics and Applied Probability. New York: Chapman & Hall; 1993.
23. Poon, Law, Wong, et al. Impact of laparoscopic colorectal resection on surgical site infection. Ann Surg, 2009; 249:77–81.
24. Barreiro, Sánchez, Diz, et al. Impact of preoperative oral nutrition therapy in patients undergoing surgery for colorectal cancer. Nutr Hosp, 2019; 36:1150–1156.
25. Moya, Soriano-Irigaray, Ramirez, et al. Perioperative standard oral nutrition supplements versus immunonutrition in patients undergoing colorectal resection in an enhanced recovery (ERAS) protocol: A multicenter randomized clinical trial (SONVI study). Medicine, 2016; 95:e3704.
26. Onerup, Angenete, Bock, et al. The effect of pre- and post-operative physical activity on recovery after colorectal cancer surgery (PHYSSURG-C): Study protocol for a randomised controlled trial. Trials, 2017; 18:212.
27. Gustafsson, Scott, Hubner, et al. Guidelines for perioperative care in elective colorectal surgery: Enhanced Recovery after Surgery (ERAS®) Society recommendations: 2018. World J Surg, 2019; 43:659–695.
28. Kehlet, Wilmore. Evidence-based surgical care and the evolution of fast-track surgery. Ann Surg, 2008; 248:189–198.
29. Ljungqvist, Scott, Fearon. Enhanced Recovery after Surgery: A review. JAMA Surg, 2017; 152:292–298.
30. Shida, Tagawa, Inada, et al. Modified Enhanced Recovery after Surgery (ERAS) protocols for patients with obstructive colorectal cancer. BMC Surg, 2017; 17:18.
31. Schiavone, Moukarzel, Leong, et al. Surgical site infection reduction bundle in patients with gynecologic cancer undergoing colon surgery. Gynecol Oncol, 2017; 147:115–119.
32. Zywot, Lau CSM, Fletcher, et al. Bundles prevent surgical site infections after colorectal surgery: Meta-analysis and systematic review. J Gastrointest Surg, 2017; 21:1915–1930.
33. Edmiston, Leaper, Barnes, et al. An incision closure bundle for colorectal surgery. AORN J, 2018; 107:552–568.
34. Marks, Do Nassif, Frenkel. Minimally invasive colorectal surgery. Minerva Gastroenterol Dietol, 2012; 58:201–211.
35. Abu Gazala, Wexner. Re-appraisal and consideration of minimally invasive surgery in colorectal cancer. Gastroenterol Rep, 2017; 5:1–10.
36. Pedroso-Fernandez, Aguirre-Jaime, Ramos, et al. Prediction of surgical site infection after colorectal surgery. Am J Infect Control, 2016; 44:450–454.
37. Fusco, Massarico, Alves, et al. Surgical site infection and its risk factors in colon surgeries. Rev Esc Enferm USP, 2016; 50:43–49.
38. Paulson, Thompson, Mahmoud. Surgical site infection and colorectal surgical procedures: A prospective analysis of risk factors. Surg Infect, 2017; 18:520–526.