Abstract
Background:
Despite the widespread utilization of a four-stage wound classification system to risk-adjust operations for surgical site infection (SSI) rates, we are not aware of any study evaluating the definitions of the wound classes for clarity. We limited our study of wound classifications to appendectomies and posed the question whether different reviewers classify individual cases differently.
Methods:
We evaluated the wound classifications of 105 consecutive appendectomies in our community hospital. Four reviewers graded retrospectively the wound classifications, first after reading the description of the appendix in the operative report and again after reading the pathology report. The wound classifications of the four reviewers were evaluated for concordance with the original operating room nurse (ORN) assignment.
Results:
The kappa scores for inter-observer concordance of wound classifications among the four reviewers based on their interpretation of the operative report and the ORN who originally classified the operation ranged from 0.1028 to 0.1597. By conventional standards, this represents no better than “slight agreement” for any of the reviewers. We found that 19%, 50%, 94%, 95%, or 96% of our appendectomies would be considered “high risk,” Class 3 or 4, operations depending on which rater classified the operation. The additional information contained in the pathology reports did not change the distribution of wound classifications of the four reviewers significantly.
Conclusions:
Our study demonstrated considerable differences in the distribution of wound classifications of appendectomies among our ORNs and retrospective reviewers. A review of the surgical literature supports our finding that the incision classification system utilized commonly lacks precision, at least in the rating of appendectomies. We recommend that further studies be performed to determine whether changes in the definitions of wound classes are warranted.
Another risk-adjustment tool utilized commonly that requires observer interpretation is the four-level wound classification system introduced in 1964 during a multi-institutional study designed to assess the potential benefits and risks of placing ultraviolet lights in operating rooms in an attempt to reduce surgical site infections (SSIs) [10]. This classification system, which characterizes operations as clean, clean–contaminated, contaminated, or dirty-infected, was adopted subsequently by the American College of Surgeons (ACS) and the U.S. Centers for Disease Control and Prevention (CDC) [6,11].
Although several authors have documented differences of opinion among surgeons as to the proper wound classifications for some neonatal and dermatologic procedures, we are aware of only two studies evaluating differences in the assignment of the wound class by the operating room nurse (ORN) present at operation and a review conducted by someone else [2,3,16,17]. In one study, the authority who assigned the wound class was the surgeon/author, who observed operations and then rated the ORN's classification as “correct” or “incorrect” [2]. The ORNs reportedly assigned a correct wound classification in 94% of general surgical cases and 88% of trauma cases. The interpretation of wound class definitions was not blinded in the sense that a one-month “practice period” was provided to train the nurses in the proper interpretation of wound classes, and additional training was given during the study regarding the proper classification of incisions for trauma cases. The intra-study retraining was necessitated because of low agreement of the ORN and reviewer classifications.
The second article reported that between 10% and 60% of surgical cases at a major teaching hospital were assigned an incorrect wound class by the ORN at the time of operations, as judged retrospectively by various members of the hospital's infectious disease team [3]. The primary author of the article, a nurse reviewer, conducted a study of wound classifications and reported that initially, 19% of wounds were assigned an incorrect class by ORNs. With further ORN education, the error rate of the classifications decreased to 14%. No explanation is given for the marked disparity (10%–60%) in the calculated percentage of incorrectly classified operations, as determined by various members of the surgical infectious diseases team.
The conclusions of these articles were predicated on the assumption that the author/reviewer was a higher authority who knew the correct wound classification [2,3]. The possibility of ambiguity in the classification definitions was not considered.
Because each type of operation may present unique challenges in wound classification, we chose to limit our study of inter-rater concordance to a single operation. We chose appendectomy for study because: (1) Appendectomies are one of the most frequently performed surgical procedures; (2) the same operation can be assigned different classifications in different cases; and (3) pathology reports are available to correlate with the operative findings. We tested our hypothesis that different reviewers might classify appendectomies differently retrospectively than did the ORN who classified the operation initially.
Patients and Methods
The study began at our community hospital after an in-service for ORNs in which the CDC wound classification definitions were read and distributed. Examples of proper wound classifications were not given in order to not prejudice the ORN's interpretation of the definitions. The definitions were posted in all operating rooms, and nurses were instructed to discuss the wound classification with the operating surgeon at the end of the procedure and before recording a classification. The 105 consecutive appendectomies performed between January 1 and September 1, 2008, formed the basis for our study. Fourteen different ORNs were involved in recording the wound classifications at the time of appendectomies. The operations were performed by seven general surgeons.
Forty-two of the appendectomies were performed laparoscopically, and 63 were performed as open operations. Operations were placed in one of the four CDC-approved wound classes (Table 1). For the sake of brevity, we refer to the wound classes in this article as 1, 2, 3, and 4.
Wound classifications were assigned retrospectively by each of four independent reviewers: (1) The chairman of the Department of Surgery (PRD); (2) the chief of the section of General Surgery (AKM); (3) a general surgeon with 20 years' experience (RML); and (4) our hospital's infection prevention and control coordinator (EJF). The reviewers first assigned a wound classification for each operation on the basis of the surgeon's description of the appendix in the operative note (O). Each reviewer was then asked to reconsider the wound classification after reading both the operative note and the pathology report (P). If there was a discrepancy between the O and the P, precedence was to be given to P.
The reviewers were asked to study the wound classification definitions before applying them to the individual cases and to refrain from discussing the interpretation of the definitions with other reviewers until after the reviews were completed. The reviewers were not aware of the wound classification given by the ORN or by other reviewers at the time of their classification. After the reviewers completed wound reassignments, they met to discuss possible reasons for uncertainty in applying the definitions to individual cases.
Inter-rater concordance in the application of the wound classification was calculated utilizing kappa values as described by Cohen [12]. A kappa value of 0.0 represents no agreement beyond what would be predicted by chance. A value between 0.00 and 1.00 can be interpreted as follows: 0.01–0.20=slight agreement; 0.21–0.40=fair agreement; 0.41–0.60=moderate agreement; 0.61–0.80=substantial agreement; 0.81–0.99=almost perfect agreement [13].
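The kappa calculation described above can be illustrated with a minimal sketch. It assumes two raters who classified the same cases; the function names `cohen_kappa` and `landis_koch_label` and the four example ratings are hypothetical and are not the study data. The labels for values at or below zero and for a value of exactly 1.00 are assumptions, because the bands quoted above do not cover them.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who classified the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must classify the same non-empty set of cases")
    n = len(rater_a)
    # Observed agreement: fraction of cases given the same class by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over classes of the product of each rater's class frequency.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # assumption: both raters used one identical class throughout
    return (observed - expected) / (1 - expected)


def landis_koch_label(kappa):
    """Map a kappa value to the agreement bands quoted in the text."""
    if kappa <= 0.0:
        return "no agreement beyond chance"  # assumption for values at or below zero
    if kappa <= 0.20:
        return "slight agreement"
    if kappa <= 0.40:
        return "fair agreement"
    if kappa <= 0.60:
        return "moderate agreement"
    if kappa <= 0.80:
        return "substantial agreement"
    return "almost perfect agreement"  # assumption: also returned for exactly 1.00


# Hypothetical wound classes for four cases (not the study data).
orn_classes = [2, 2, 3, 3]
reviewer_classes = [2, 3, 3, 3]
kappa = cohen_kappa(orn_classes, reviewer_classes)
print(kappa, landis_koch_label(kappa))
```

In this hypothetical example the raters agree on three of four cases (observed agreement 0.75), chance agreement is 0.50, and kappa is therefore 0.5, which falls in the “moderate agreement” band.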
A chart review was conducted to determine the percentage of cases in which the operating surgeon recorded a post-operative diagnosis of “appendicitis,” and the P was reviewed to determine the percentage of cases in which the pathologist recorded a finding of “inflammation” of the appendix.
Results
As shown in Table 2, the original assignments of wound class recorded by the ORN at the time of operation resulted in 1% of cases being rated as Class 1, 49% as Class 2, 38% as Class 3, and 12% as Class 4. Table 2 documents the distributions of wound classifications by the ORN and the four reviewers based on the description of the appendix in the operative report.
Table 3 illustrates the concordance among the four retrospective reviewers of wound classification based on the description of the appendix in the O and the original assignment by the ORN. The kappa scores for inter-rater concordance between the wound classification of the ORN and the four reviewers based on a retrospective review of the O ranged from 0.1028 to 0.1507. These scores represent no better than “slight agreement” in evaluating operations for appendicitis according to the conventionally utilized interpretation of Landis and Koch [13].
We found that 19%, 50%, 94%, 95%, or 96% of our appendectomies would be classified as “high risk” (Class 3 or 4) operations under the Study on the Efficacy of Nosocomial Infection Control (SENIC) and National Nosocomial Infections Surveillance (NNIS) systems, depending on whether we accepted the incision classification as recorded by the ORN (50%, the sum of the 38% Class 3 and 12% Class 4 assignments) or the retrospective classification by one of the four reviewers based on the description of the appendix in the O [6].
Table 4 illustrates the differences in the wound classification by the four reviewers based first on a review of the description of the appendix in the operative note and second on a review of both the O and the P. Precedence was given to the P if there appeared to be disagreement on the stage of the appendicitis. Table 5 illustrates the aggregate classifications of reviewers 1, 2, 3, and 4 of operations based on an evaluation of O alone and the wound classification based on O+P with precedence given to P.
Chi-square tests of the change in the distribution of wound class totals between the O and O+P assignments were not significant, whether for the individual raters (p=0.46, 0.49, 0.69, and 0.95) or for the four combined (p=0.79).
In a post-classification discussion, all four reviewers expressed uncertainty about how classifications should be applied to individual cases. A literal interpretation of the statement under the description of contaminated wounds (Class 3): “and incisions in which acute, non-purulent inflammation are included in this category” implied to all reviewers that every case of appendicitis is at least Class 3. It seemed implausible to the reviewers, however, that the risk of SSI would be the same for cases involving “gross spillage from the gastrointestinal tract” and those involving appendectomies performed for an inflamed, but non-gangrenous, appendix. This issue was of such a concern to reviewer 3 that he downgraded cases to Class 2 unless there was evidence of peritonitis or perforation. He acknowledged that this was not based on what he considered to be a literal interpretation of the definition of Class 3 operations.
The notation at the conclusion of the Class 4 description that “This definition suggests that the organisms causing post-operative infection were present in the operative field before the operation” was found to be confusing. Because transudation of bacteria may occur through an inflamed, but intact, appendix, the reviewers wondered if this definition should result in all cases of appendicitis being classified as Class 4 [14].
The post-operative diagnosis recorded by the operating surgeon was “appendicitis” in 104 of 105 cases (99%), whereas pathologists reported inflammatory changes of the appendix in 99 cases (94%). This difference did not reach significance (p=0.12 by the two-tailed Fisher exact test).
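The reported p value can be approximately reproduced with a short sketch of the two-tailed Fisher exact test. It assumes the comparison was set up as a 2×2 table treating the surgeon and pathologist findings as two independent groups of 105 cases each (104 vs. 1 and 99 vs. 6); the text does not state how the table was constructed, and the function name `fisher_exact_two_tailed` is hypothetical.

```python
from math import comb


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher exact p value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    denominator = comb(total, col1)

    def table_probability(x):
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    p_observed = table_probability(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    # The small relative tolerance guards against floating-point ties.
    return sum(
        p
        for p in (table_probability(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Rows: surgeon diagnosis, pathology report. Columns: appendicitis, no appendicitis.
p_value = fisher_exact_two_tailed(104, 1, 99, 6)
print(round(p_value, 2))
```

Under these assumptions the sketch gives a p value of about 0.12, consistent with the value reported above. Because the same 105 specimens were assessed by both surgeon and pathologist, a paired analysis would be an alternative design choice; the sketch follows the unpaired table only because that layout matches the reported result.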
Discussion
We believe ours is the first systematic study of the clarity of the CDC's four-stage wound classification system for the concordance of multiple raters. We demonstrated considerable differences in the wound classification of appendectomies between the ORNs, who classified the operation at the time of the procedure, and four retrospective reviewers. The kappa scores for inter-rater concordance between the classifications based on the description of the appendix in the O demonstrated “only slight agreement.” In a discussion of the classification process, none of our reviewers considered himself or herself a higher authority with certainty as to the correct wound classification.
Methods other than the one we chose could be used to test concordance of wound classification. Reviewers could have been present at each operation to see and hear what the ORNs rating the cases saw and what description they heard from the operating surgeon. Although this would have been the most rigorous manner of testing for ORN concordance with retrospective reviewers, we believe it would not have been feasible to have multiple reviewers present during large numbers of operations.
Although our ORNs viewed cases from a different vantage point than our retrospective reviewers, we believe that a comparison of the wound classifications of the two groups is appropriate. Devaney and Rowell reported that in their National Surgical Quality Improvement Program (NSQIP)-participating hospital, the wound classifications recorded by ORNs were reviewed for correctness before national reporting [3]. Although it is not stated whether Os or Ps were reviewed, a retrospective analysis by someone deemed to be a higher authority is given precedence over the original classification by the ORN.
Published studies have reported that the histopathology findings in appendectomies are sometimes at variance with the visual appearance of the appendix by the operating surgeon, but disagreement exists over which should take precedence in determining the stage of the appendicitis [14,18–20]. A validation of the intra-operative staging of a wound class as a risk-adjustment tool should demonstrate a similar pattern of classification on the basis of the appearance of the appendix at the time of the operation and the subsequent histopathology findings. We found that although surgeons reported a higher percentage of cases as demonstrating appendicitis than did the pathologists, this difference did not reach statistical significance in this relatively small series. The additional information provided by the pathology report did not change substantially the distribution of wound classes for the retrospective reviewers. Thus, we found that although the visual appearance of the appendix correlated closely with the histopathology findings, ambiguity in the wound classification definitions led to markedly different distributions of wound classes.
The essential conclusion of our study, that the four-level wound classification system used commonly lacks clarity for risk-adjusting appendectomies, is supported by a review of the relevant literature. The original description of the wound classifications from 1964 listed appendectomies as Clean (Class 1) operations “unless inflammation was noted” [10]. As recently as 2001, others have offered the same opinion [15,21]. However, an ACS document from 1984 categorized “appendectomy” as an example of a Class 2 wound [11]. Other interpretations of the wound descriptions have graded appendectomies as at least Class 2 but without clarity as to which cases should be upgraded [3].
In an attempt to standardize wound classifications, Devaney and Rowell stated that “all cholecystectomies and appendectomies start out as clean/contaminated (Class 2) procedures because the gastrointestinal tract is entered. If the surgeon encounters inflammation or pus or a major break in a surgical technique, the procedure must be reclassified accordingly” [3]. The authors described a “routine appendectomy” as an example of a Class 2 incision and an “appendectomy for inflamed appendicitis” as a Class 3 wound [3]. Our reviewers were uncertain as to whether the authors meant that a Class 2 “routine appendectomy” procedure only included non-inflamed (normal) appendices.
Our reviewers believed that the term “inflamed appendicitis” was redundant rather than explanatory. All reviewers thought that a literal interpretation of the CDC's definition of a Class 3 incision as well as Devaney and Rowell's description of a Class 3 appendectomy suggest that between 94% and 99% of our appendectomies should be classified as at least Class 3 operations depending on whether we used the surgeon's post-operative diagnosis or the pathologist's findings. This interpretation would seem to be at considerable variance from the published wound class distributions for appendectomies from a variety of institutions.
We identified three previous studies reporting the distribution of wound classes for appendectomies. A Veterans' Affairs NSQIP participating hospital study from 2003 reported a distribution of incision classifications for appendectomies as follows: Class 1, 18%; Class 2, 43%; Class 3, 24%; and Class 4, 15% [22]. An analysis of this report indicates that 39% of appendectomies performed in VA hospitals fell into the CDC's “high-risk” categories of Classes 3 or 4. The authors state that all of the cases that were classified as 1 should have been “more appropriately” graded as Class 2. It is not reported, however, whether the O or P reports were reviewed to determine if any of the wound classifications were believed by the authors to be correct.
A compilation of data from 40 German hospitals found that 41% of appendectomies were considered “high risk” Class 3 or 4 operations [1], whereas an NNIS study from 27 hospitals in Victoria, Australia, reported that 57% of appendectomies were classified similarly [4]. We found that 19%, 50%, 94%, 95%, or 96% of our appendectomies would be considered “high-risk” (Classes 3 or 4) operations depending on whether we accepted the wound classification given by the ORN or the retrospective classification by Reviewers 1–4 (Table 4).
We do not believe we can conclude from our study that our community hospital has a higher or lower risk severity of appendectomy cases than the NSQIP VA study cited above or the aggregated reports of hospitals from Australia and Germany [1,4]. At least for appendectomies, we believe that the lack of clarity of the four-level classification system makes hospital-to-hospital risk adjustment comparisons unreliable [1,4,22].
Nichols has offered the opinion that, although the descriptions of the four wound classes “overlap,” there is a clearer distinction between Classes 1 and 2 as a group and Classes 3 and 4 as a second group [23]. We did not find evidence of such a clear division in the classification of appendectomies, and we believe that Nichols' use of the term “overlap” in his descriptions of the wound classifications is an acknowledgment of ambiguity.
Although morbidity has been reported to increase with the presenting stage of the disease in patients with appendicitis [1,28], the varied stage classifications utilized in the literature make comparison of studies difficult. We note that the four-level wound classification system is rarely utilized in the academic literature involving appendicitis. We speculate that authors have understood intuitively what we have demonstrated: That the system is not sufficiently descriptive to adjust appendectomy risk reliably.
Unfortunately, no other risk-classification system has generally been accepted. The failure to use a common risk classification for appendectomies makes comparisons of different institutional experiences difficult and hampers our understanding of quality issues surrounding the approximately 280,000 appendectomies performed yearly in the United States [29].
Several authors have utilized International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) data to adjust appendectomy risk [28,29]. It has been noted, however, that codes have changed over the years, and Newman et al. found such a marked disparity in results from hospital to hospital that the use of ICD-9-CM codes for comparing quality data must be considered unreliable [30].
Pieper et al. utilized a four-tiered classification for appendectomies: Normal, phlegmonous, gangrenous, or ruptured [31]. Many authors divide appendicitis into either simple (or early or acute) or complex (or late or severe), but the distinction between the two categories is not always evident [24,25,27,32–35].
Precision of incision classifications requires that terms such as “acute,” “bland,” “simple,” “routine,” “complex,” “late,” and “high risk” be used only if carefully defined by the authors. We found the term “acute” to be particularly common and confusing in the literature discussing appendicitis. The definition of “acute” in Taber's Cyclopedic Medical Dictionary reads: “1. Sharp, severe; 2. Having a rapid onset, severe symptoms, and a short course” [36]. However, many authors use the term “acute” to suggest that the appendicitis being described is “simple,” early, or less severe [24,26,32,34,35]. Carr's pathologic subclassifications of “acute appendicitis” cover the spectrum of disease ranging from acute intraluminal inflammation to gangrenous appendicitis to periappendicitis [37].
The possibility that the four-level wound classification system lacks clarity in risk-adjusting operations other than appendectomies is suggested by the findings of Devaney and Rowell, who reported considerable (10%–60%) variance in the error rates of ORNs in assigning proper wound classifications, as judged by different members of the hospital's infectious disease team [3]. These magnitudes of disagreement over proper wound classification deserve further study. We recommend that the definitions of the four commonly utilized incision classifications be reviewed for clarity by the ACS and the CDC.
All quality definitions that require observer interpretation, whether for risk adjustment or for outcome measurements, should be subjected to concordance studies of multiple independent raters. Only those demonstrating clarity, as shown by high inter-rater concordance, should be utilized for comparing hospitals and individual surgeons.
Another tool that requires rater interpretation and that has been utilized for risk-adjusting operations for SSI rates is the ASA score [1]. Concern has been raised that aggregate ASA scores of surgical patients rose in some NSQIP participating hospitals over just a two-year period (2006–2007) [8]. This observation suggested to Cohen et al. the possibility of “incentive-driven inflation of ASA” status as hospitals vie for better quality statistics. They warn that if this inflationary trend continues, it could “impact on our ability to evaluate overall trends in quality” [8]. Whether wound classification assignments are increasing in hospitals cannot be determined from published data, as far as we are aware.
A demonstration of inflation in the reporting of interpretation-dependent quality data should not be assumed to be a sign of volitional misrepresentation. Data inflation might be the result of institutions legitimately attempting to present their data in the most favorable light when presented with ambiguous risk criteria. It is incumbent on organizations dealing with quality initiatives to ensure that definitions requiring observer interpretation are defined as clearly as possible.
It is important to recognize that kappa calculations for inter-rater concordance differ from statistical significance analyses in that the author is left to determine what level of concordance is “good enough” for the purpose of the study. For instance, Ragheb et al. consider a kappa value of >0.41 “acceptable” in the inter-rater reliability of anesthesia providers in assigning ASA scores [9]. However, a value between 0.41 and 0.60 by statistical convention demonstrates only “moderate agreement” [13].
Although no quality measurement requiring observer interpretation may be entirely free of ambiguity, we speculate that kappa values for assessing inter-rater concordance can be employed most usefully for comparing the relative clarity of two different descriptions of the same quality measurement.
Footnotes
Author Disclosure Statement
No competing financial interests exist.
