Abstract
Objective:
The aim of this study was to determine if peer review conducted under real-world conditions is systematically biased.
Study Design:
A repeated-measures design was effectively created when two board-certified obstetrician-gynecologists reviewed the same 26 medical records of patients treated by the same physician, and provided written evaluations of each case and a summary of their criticisms. The reviews were conducted independently for two different, unaffiliated hospitals. Neither reviewer was aware of the other's review, and neither was affiliated with either hospital or knew the physician under review. This study reports the degree of agreement between the two reviewers over the care rendered to these 26 patients.
Results:
Three of the 26 cases reviewed had complications. Both reviewers criticized these cases, but criticized 2 of them for different reasons. At least one of the reviewers criticized 14 (61%) of the 23 uncomplicated cases, about which no quality concerns had been raised prior to the review. With one exception, they criticized completely different cases and criticized this 1 case for different reasons. Thus, only 4 of the 17 cases criticized by at least one of the reviewers were criticized by both of them, and only 1 of the 4 cases was criticized for the same reason. The kappa statistic was −0.024, indicating no agreement between the reviewers (P = 0.98).
Conclusions:
As presently conducted, peer review can be systematically biased even when conducted independently by external reviewers. Dual-process theory of reasoning can account for the bias and predicts how the bias may potentially be eliminated or reduced.
Introduction
Materials and Methods
Twenty-six cases treated by the same obstetrician-gynecologist, OBG, were reviewed independently by two board-certified obstetrician-gynecologists, R1 and R2, for two unaffiliated hospitals, H1 and H2, where OBG held privileges. Neither reviewer was aware of the other's review, neither was affiliated with H1 or H2, and neither knew OBG. Five bowel complications within a 3.5-month period triggered the reviews. All the cases reviewed were treated at H1. H2 requested copies of the medical records and had them independently reviewed by its own reviewer.
H1 selected 30 cases for review, but 4 cases were excluded because only one of the reviewers reviewed them. These included 2 of the bowel complications, which, for unknown reasons, were not sent to R1. Both involved enterotomies that were recognized and repaired during the primary operation without sequelae. One occurred during laparoscopic resection of endometriosis and was repaired laparoscopically; the other occurred during the replacement of a suprapubic tube under cystoscopic guidance in a patient who had had prior abdominal surgery and radiation for a pelvic malignancy in the distant past, and whose urinary incontinence was being managed by suprapubic drainage.
H1 instructed its reviewer to “evaluate the quality and appropriateness of the care” and to prepare a report “which contains case-by-case, specific findings and conclusions, as well as overall findings and conclusions, as to the quality and appropriateness of care.” H2 requested its reviewer to express an opinion regarding “the general judgment displayed and care rendered [and] to focus upon the specific issues of whether there were appropriate indications for the surgical procedure performed, whether patient selection for open versus laparoscopic approach was appropriate, and whether the complications that occurred were expected, recognized appropriately, and managed appropriately.” Cohen's kappa statistic was used to test for significance of agreement between the reviewers. The statistical significance of proportions and of differences between proportions was determined using the normal approximation to the binomial distribution.
Results
Each reviewer provided written comments on each case and summaries of their criticisms. The agreement between the reviewers' criticisms of the 26 cases is shown in Table 1. At least one of the reviewers criticized 17 (65.4%) of the 26 cases, but only 4 were criticized by both of them, and only 1 of the 4 cases was criticized for the same reason. The kappa statistic was −0.024 (P = 0.98), indicating that there was no agreement between the reviewers.
Cohen's kappa statistic = −0.024.
R1, reviewer 1; R2, reviewer 2.
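The reported kappa can be checked from the counts given in the text. The sketch below is a minimal illustration, not the author's own computation. It assumes the 2 × 2 agreement counts implied by the Results: 4 cases criticized by both reviewers, 9 in total by R1 (3 complications plus 6 uncomplicated cases), 12 in total by R2 (3 plus 9), and therefore 9 cases criticized by neither. The function name `cohens_kappa` and its parameters are hypothetical.

```python
def cohens_kappa(both, r1_only, r2_only, neither):
    """Cohen's kappa for two raters and a binary rating (criticized or not)."""
    n = both + r1_only + r2_only + neither
    # Observed agreement: cases on which the two reviewers gave the same rating.
    p_observed = (both + neither) / n
    # Marginal proportion of cases each reviewer criticized.
    r1_yes = (both + r1_only) / n
    r2_yes = (both + r2_only) / n
    # Agreement expected by chance if the reviewers rated independently.
    p_expected = r1_yes * r2_yes + (1 - r1_yes) * (1 - r2_yes)
    return (p_observed - p_expected) / (1 - p_expected)


# Counts reconstructed from the text (an assumption, since Table 1 itself
# is not reproduced here): both = 4, R1 only = 9 - 4, R2 only = 12 - 4.
kappa = cohens_kappa(both=4, r1_only=5, r2_only=8, neither=9)
print(round(kappa, 3))  # -0.024
```

Observed agreement is 13/26 = 0.50, while chance agreement from the marginals is about 0.51, so kappa falls just below zero.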
Both reviewers criticized the 3 bowel complications, but criticized only 1 of them for the same reason, and R1 was not critical of OBG's management of 1 of the cases. R1 and R2 were both critical that OBG performed laparoscopy in a 65-year-old woman for partial small-bowel obstruction, claiming that bowel obstruction is an absolute contraindication to laparoscopic surgery. The patient was admitted with vomiting, fever, and hypokalemia onto the medical service, treated with nasogastric suction, and infectious disease, general surgery, and gastroenterology were consulted. A computed tomography (CT) scan was interpreted as showing ileus and a fibroid uterus. OBG was consulted on hospital day 12 after sigmoidoscopy showed extrinsic compression of the rectum by the fibroid. Since a Pap smear showed glandular cells, and fibroids rarely cause bowel obstruction, OBG scheduled the patient for hysteroscopy, laparoscopy, and possible laparotomy and hysterectomy. Hysteroscopically directed biopsies and frozen section showed marked inflammation but no malignancy. At laparoscopy, a left tubo-ovarian abscess was found that had loops of small bowel densely adherent to it and the left side of the uterus. During attempted dissection of a bowel loop off the side of the uterus, bleeding was encountered from the mesentery and an incidental enterotomy was made. OBG converted to a laparotomy, repaired the enterotomy, and performed an otherwise uneventful hysterectomy. At the conclusion of the hysterectomy, OBG was concerned about the viability of the bowel and consulted a general surgeon, who resected 8 inches of small bowel. The patient made an uneventful recovery.
R1 and R2 criticized OBG's handling of a patient with a tubo-ovarian abscess, adult respiratory distress syndrome (ARDS), and incipient septic shock for different reasons. A different attending physician had misdiagnosed the patient on examination under anesthesia as having a fibroid uterus and canceled a scheduled laparoscopy so the patient could be consented for a hysterectomy. OBG was consulted 2 days later after the patient had deteriorated and developed ARDS and incipient septic shock. OBG drained the abscess laparoscopically, removed the right tube and ovary and appendix, noted a small 5-mm laceration in the cecum at the conclusion of the procedure, and repaired it through a small extension of the suprapubic trocar insertion site without sequelae. The patient had a stormy postoperative course due to sepsis but made a full recovery. R1 claimed that peritonitis and abscess are absolute contraindications to laparoscopic surgery, and that OBG should have merely drained the abscess and not removed the right tube, ovary, and appendix. R2 attributed the patient's stormy postoperative course to the laparoscopic route for surgery, which, he claimed, added 60–90 minutes of operating time (the actual operating time was 2.5 hours).
R1 and R2 did not agree over their criticisms of the third complication, the only one associated with untoward sequelae. This patient underwent a laparoscopic resection of severe endometriosis, including resection of the cul-de-sac peritoneum, and sustained an injury to the anterior rectal wall. No injury was noted during surgery, and photodocumentation, although of poor quality, also showed no obvious injuries. The injury was recognized postoperatively prior to discharge and was treated with a colostomy. R1 was not critical of OBG, noting that the surgery was complex, the injury a recognized risk of this type of surgery, and promptly recognized, but he was highly critical of the surgeon's management of the complication. R2 was critical of OBG's use of electrosurgery and hydrodissection in this case, which, he contended, placed the patient at high risk of rectal injury.
Fourteen (61%) of the 23 uncomplicated cases were criticized by at least one reviewer, 6 (26.1%) by R1, and 9 (39.1%) by R2. The difference between these proportions was not statistically significant (z = 0.95, P > 0.1), but both were significantly different from zero (P < 0.01). Only 1 of the 14 uncomplicated cases was criticized by both reviewers: a vaginal hysterectomy for recurrent postmenopausal bleeding and negative curettages (the uterus weighed 166 g and had adenomyosis). It was criticized for different reasons. R1 claimed the surgery was unindicated, and R2 claimed it took too long (3 hours). (The actual operating time was 140 minutes, and the surgery, performed by residents under OBG's supervision, included bilateral salpingo-oophorectomy, McCall culdoplasty, and cystoscopy.)
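The comparison of the two reviewers' criticism rates can be sketched as a two-proportion z test under the normal approximation to the binomial distribution. This is an illustration only: the article does not say which standard error was used, and the unpooled form below is an assumption, chosen because it gives a value close to the reported z of 0.95. The test also treats the two proportions as independent, although both reviewers rated the same 23 cases. The function names are hypothetical.

```python
import math


def two_proportion_z(x1, x2, n):
    """z statistic for the difference between two proportions, each out of n.

    Uses the unpooled standard error (an assumption; the article does not
    specify which form was used).
    """
    p1, p2 = x1 / n, x2 / n
    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    return (p2 - p1) / se


def two_sided_p(z):
    """Two-sided P value from the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# R1 criticized 6 and R2 criticized 9 of the 23 uncomplicated cases.
z = two_proportion_z(x1=6, x2=9, n=23)
p = two_sided_p(z)
print(f"z = {z:.2f}, P = {p:.2f}")
```

With these inputs z comes out at roughly 0.95 and the two-sided P value is well above 0.1, in line with the reported result.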
For ease of presentation, the 23 uncomplicated cases were grouped into four groups, A–D: group A, minor procedures (5 cases); group B, level I laparoscopies and postpartum tubal ligations (7 cases); group C, level II laparoscopies consisting of adnexal surgery and lysis of adhesions (6 cases); and group D, laparoscopic and vaginal hysterectomies (5 cases). Table 2 shows the number of cases in each group criticized by each reviewer.
Cone biopsy (1); laser vaporization of warts (2); cervical cerclage (1); dilatation and curettage (1).
Diagnostic laparoscopy (3); postpartum tubal ligation (3); laparoscopic sterilization (1).
Bilateral salpingectomy for failed sterilization (1); salpingostomy for ectopic pregnancy (1); salpingo-oophorectomy and adhesiolysis for pain (2); ovarian cystectomy (2).
Vaginal hysterectomy ± salpingo-oophorectomy (4); laparoscopically assisted vaginal hysterectomy (1).
R1, reviewer 1; R2, reviewer 2.
R2 criticized 2 cases in group A: a laser vaporization of vulvovaginal warts for taking too long (2 hours; actual time, 75 minutes), and a cerclage for excessive bleeding and for taking 45 minutes. (The procedure actually lasted 15 minutes, and the estimated blood loss was listed as minimal on the anesthetic record, as 30 cc in the hand-written operative note, and as 500 cc in the dictated operative note, an obvious transcription error.) R2 criticized 2 group B cases: one because OBG used the direct trocar-insertion technique, which R2 claimed increased the risk of visceral injuries, and the other for failure to document the preparations made for converting a diagnostic laparoscopy in an 18-year-old girl with a questionable 5-cm complex adnexal mass in case it proved malignant. (The patient did not, in fact, have an adnexal mass, but she was consented for possible laparotomy.) R2 criticized 1 level II laparoscopy (bilateral salpingectomy for failed sterilization) for excessive bleeding (the estimated blood loss was 25 cc). Finally, in addition to the hysterectomy he criticized for taking too long, R2 also criticized the route of 3 of the 5 hysterectomies (2 vaginal, 1 laparoscopic), because he claimed the fibroids would have made the procedures more difficult. He also opined that 2.5 hours for the laparoscopically assisted vaginal hysterectomy was at the upper limit of the acceptable duration for this operation (the actual operating time was 130 minutes). Table 3 summarizes the relevant information about the 5 hysterectomies reviewed by the reviewers.
R2, reviewer 2; LAVH, laparoscopically assisted vaginal hysterectomy; TVH, total vaginal hysterectomy; RSO, right salpingo-oophorectomy; BSO, bilateral salpingo-oophorectomy.
R1, as already noted, criticized one of the hysterectomies and claimed it was unindicated. R1 also criticized 5 level II laparoscopies in group C: 2 because he claimed peritonitis is an absolute contraindication to laparoscopic surgery; 1 because he believed extensive laparoscopic lysis of dense adhesions had a high risk of enterotomy; 1 because he believed a 5-cm ovarian cyst in a 48-year-old woman should be left unless it contains solid areas; and 1 salpingostomy for ectopic pregnancy because he claimed rapid fall in human chorionic gonadotropin levels presaged resolution of the pregnancy (the rapid fall was between pre- and postoperative levels).
Discussion
This study differs in two important respects from previous ones that have investigated the reproducibility of peer review. First, it is the only comparison based on real-world reviews; previous inter-reviewer comparisons were conducted under research conditions.1–6 Second, most cases on which this comparison is based had no complications or adverse outcomes and effectively served as negative controls, whereas previous comparisons involved only or mostly complications or adverse outcomes.1–6 Inasmuch as agreement is greater over normal than abnormal clinical findings, 5 and irreproducibility of peer reviews has been attributed to hindsight bias (i.e., knowledge of outcome and its severity 2 ), one would have anticipated good agreement between the reviewers, at least over uncomplicated cases. This was not, however, observed.
This study found that the peer review conducted by each external reviewer was systematically biased to find quality problems in the care rendered by the physician under review. This is evident from the fact that each reviewer criticized more uncomplicated cases, about which no quality concerns had been raised prior to the review, than could be expected by chance. That these were not valid criticisms of deficiencies missed by the hospital's quality-oversight procedures is evidenced by the complete lack of agreement over the cases the reviewers criticized (Table 2). These findings cannot be explained by differences in the reviewers' propensity to criticize, 1 since the difference between the proportions of cases criticized by each reviewer was not statistically significant. They also indicate that irreproducibility of peer review cannot be explained on the basis of hindsight bias alone. 2
Why would two reviewers from different parts of the country, with no affiliation with the hospitals, be independently biased to find fault with a physician neither of them knew? One explanation is that the reviewers were acting out of improper motives. 9 Two observations support this explanation: R2 misrepresented the operating times of 8 patients, 7 of which he called “excessive,” and declined to correct his report when the “errors” were pointed out to him; and each reviewer made numerous assertions about clinical facts and standards that have no empirical basis. For example, there is no basis for R1's contention that peritonitis remains a contraindication to laparoscopic surgery, 10 for R2's assertion that the direct trocar-insertion technique is associated with more visceral injuries than the more commonly used Veress needle technique 11 or that 2.5 hours is the upper acceptable limit for the duration of a laparoscopic hysterectomy, 12 or for both their claims that laparoscopic surgery is contraindicated in partial small-bowel obstruction of uncertain etiology after decompression by nasogastric suction. 13 The alternative explanation is that the context of peer review, and the manner in which it was conducted, engendered cognitive biases in the peer reviewers. This second explanation is consistent with dual-process models of reasoning and decision making14–18 and leads to testable hypotheses about the causes of bias and irreproducibility, and potential means to avoid them, whereas the improper-motives explanation is speculative and untestable.
Modern cognitive science has refuted many long-held beliefs about human rationality. 19 One of its most startling findings has been that much of a person's mental life occurs outside conscious awareness,20,21 and that most decisions are made by rapid, nonconscious mental processes over which the person has little control, but may override by deliberate, conscious effort.14–18,22,23 Dual-process theory was originally proposed by Evans to explain the biasing effects of content and context on logical reasoning tasks, 17 but the idea that there are two distinct kinds of reasoning is far from new. 24
It is now widely accepted that two distinct mental systems, referred to as System 1 (also called implicit, heuristic, and associative) and System 2 (also called analytic), are involved in making judgements as well as performing other cognitive tasks.14–18,24 System 1 processes are automatic, operate in parallel, rely on pattern recognition, and make few demands on working memory.14–18,24 They contextualize and solve judgment problems by recruiting prior knowledge and beliefs cued by context and task goals. 15 These systems are capable of processing a vast amount of information rapidly and produce quick outputs with little effort.15,17 We have no control over, or conscious insight into, how these processes provide their outputs, and simply become aware of them when they appear in consciousness as “pop-ups.”14–18,24 We use these processes to execute learned tasks and to make quick decisions in familiar domains. 25 In fact, expertise in a domain is measured by the extent to which domain tasks have been automated, as in driving a car. 25
By contrast, System 2 is slow and effortful, uses sequential, rule-based processing, and is limited by working memory capacity.14–18,24 These processes are uniquely human, provide the basis for analytical thought,14,17 and are the ones we usually associate with “thinking.” We are conscious of these processes, can control them by directing attention and retrieving information from long-term memory, 25 and we can report on the process by thinking out loud and revealing the contents of working memory.14,24 We use System 2 processes to learn new tasks and to solve new problems or unexpected ones encountered during System 1 operations. 25 However, limitations on working memory capacity constrain System 2 and preclude us from considering all possible solutions to problems to find the optimal one (i.e., “bounded rationality”).14,17 We generally consider and test solutions to problems one at a time (i.e., “singularity principle”), and accept the solutions that satisfy task goals without continuing to consider alternatives until an optimal solution is found (i.e., “satisficing”). 14
Systems 1 and 2 operate in tandem and provide outputs in a two-step, default-and-revise sequence.14–17,24 System 1 provides an initial output based on selective features of problem content and associated knowledge retrieved from long-term memory. System 1 outputs are then passed along to System 2, which may revise, replace, or accept the original output.14,16 However, unless the task is difficult or new, System 2 monitors the default outputs of System 1 only perfunctorily, and people readily accept plausible judgments that quickly come to mind. 16
Therefore, when people are dealing with familiar problems about which they have well-established beliefs, which is the case when a physician reviews the medical records of his or her peers, their judgments are produced by automatic outputs of System 1 processes. Limitations on working memory 15 and environmental factors, such as fatigue, divided attention, and cognitive load, 25 will often prevent System 2 from revising these outputs even when they are biased. Context-engendered cognitive biases account for many medical and surgical errors.26–29 It is the thesis of this article that they also explain systematic biases and irreproducibility of peer review.
Context biases peer review because the reviewer will know that there is a perceived problem with the competence, technical skills, or judgment of the physician under review from the mere fact that a review has been requested, even though nonclinical factors may have actually influenced the decision to review. This initial belief will be reinforced by information the hospital provides to the reviewer about the physician or his or her cases, and the complications or adverse outcomes in the cases he or she reviews.
It is difficult to shake off the biasing influences of this kind of information because Gilbert et al. have shown that when people are presented with completely new information, they do not reserve judgment about whether or not to believe the information until they have sufficient evidence to make a judgment, as Descartes maintained. 22 Rather, they accept the information as true through System 1 processes, as Spinoza posited, unless they are overridden by System 2. 22 In most situations, the Cartesian and Spinozan models of information processing will lead to identical beliefs, but if the second corrective phase is impeded, the person will be left believing the original information. 22 Once such a belief has been formed, review of the medical records and application of prior knowledge and past experience will be biased in predictable ways,26,30 and, surprisingly, the strength of the prior belief does not affect the extent of these biases. 31
In his or her review of medical records, the reviewer will tend to look for, remember, and give more weight to evidence consistent with the belief that medical care was suboptimal and to ignore and devalue disconfirming evidence. 30 Management at variance with the reviewer's own practices will be taken as evidence of this belief and lead to the construction of after-the-fact norms, 32 which will be overgeneralized, 33 exactly as both reviewers did in this case. Complications and adverse outcomes will trigger counterfactual thinking that links antecedents selectively with the outcome and to the causal judgment that the outcome could have been avoided by the reviewer's treatment preferences, and that the actual treatment violated norms constructed ad hoc after the events in question have occurred.32,34
All these judgments are made automatically at a preconscious level and form the predicates for analytic thought. 17 However, satisficing will cause the reviewer prematurely to terminate his or her evaluation, once evidence for a problem is found. 17 These cognitive biases are insidious, because they are automatic and reviewers will not be aware of them. They can only be avoided if reviews are structured to force reviewers to consider all relevant evidence and to engage in System 2 processing by suitably designed instructions and specific debiasing questions.
Another well-known bias of System 1 processes—the false attribution error—will cause reviewers to attribute complications and adverse outcomes, preferentially, to errors, lack of skill, or poor judgment on the part of the physician under review, rather than the pathology being treated, the inherent risks of treatment, or other technical factors, unless overridden by System 2. 23 False attribution error is a general cognitive propensity of people to attribute the actions of others to their personal characteristics, rather than situational factors, but to attribute their own decisions to external factors, rather than idiosyncratic personal experiences or character traits. 23 This explains why reviewers generally believe that judgments that are actually often parochial and based on personal experiences are normative (i.e., as shared by others to a greater extent than they actually are) and declare them as the standard of care. 33
Conclusions
In conclusion, this study has demonstrated that peer review conducted by external reviewers can be systematically biased, and that the bias can be explained by dual-process theories of cognition. The thesis that systematic bias is caused by cognitive biases engendered by how peer review is conducted is novel and helps explain judgments that seem objectively inexplicable and, therefore, venal. It also leads to testable predictions about how the biases may be avoided, which are currently under investigation by the author.
Footnotes
Disclosure Statement
The author represented OBG but there are no outstanding proceedings involving this matter, and the author has no financial stake or interest in the subject matter.
