Abstract
This article focuses on anchoring effects in the process of peer reviewing research proposals. Anchoring effects are commonly seen as the result of flaws in human judgment, as cognitive biases that stem from specific heuristics that guide people when they involve their intuition in solving a problem. Here, the cognitive biases will be analyzed from a sociological point of view, as interactional and aggregated phenomena. The article is based on direct observations of ten panel groups evaluating research proposals in the natural and engineering sciences for the Swedish Research Council. The analysis suggests that collective anchoring effects emerge as a result of the combination of the evaluation techniques that are being used (grading scales and average ranking) and the efforts of the evaluators to reach consensus in the face of disagreements and uncertainty in the group. What many commentators and evaluators have interpreted as an element of chance in the peer review process may also be understood as partly a result of the dynamic aspects of collective anchoring effects.
Within contemporary cognitive psychology, there is a generally accepted understanding that human judgment is often influenced by what are called ‘anchoring effects’. The theory of anchoring effects refers to an inclination or bias arising in an act of judgment in relation to a certain numerical value (‘the anchor’). In a given context, the first numerical value that an individual encounters – such as the asking price that a real estate agent gives for a house – tends to influence his or her judgment of what is to be assessed. The figure creates a heuristic reference point. Once the anchor is set, it comes almost naturally to the individual to adjust and reinterpret new information based on this anchor. This phenomenon has been considered problematic because it entails arbitrariness in judgments, sometimes affecting crucial decisions (Englich and Mussweiler, 2001; Tversky and Kahneman, 1974).
Here, I use the theory of anchoring effects in relation to a sociological study based on direct observations of ten panel groups reviewing grant proposals in the natural and engineering sciences for the Swedish Research Council. This government agency is one of the largest in Sweden responsible for supporting basic scientific research. To understand the connection between academic judgment and anchoring effects, I pay attention to the organization’s review methods, grading criteria, sharing of responsibilities, and average ranking. In addition, I take into account conditions and problems associated with creating consensus, rooted in uncertainty and disagreement. Because the study deals with an organized evaluation process, in which individuals translate their own judgments into different numerical grades – which are subsequently compared and adjusted through dialogue – my primary focus is anchoring effects at the group level, or collective anchoring effects. Although both the reviewers and the Swedish Research Council see and attempt to counteract these undesired and problematic biases during the evaluation process, such biases are partly beyond their control. To the best of my knowledge, anchoring effects have not previously been studied in connection with reviews of research grant proposals.
It should be obvious that the theory of anchoring effects does not constitute a sociologically comprehensive explanatory model for understanding this evaluation process. The process centrally concerns expert reviewers’ ability to discern, assess, compare, and communicate what scientific quality is or may be. Power, hierarchy, status, network, and morality are of relevance, but I will touch on them only indirectly.
Anchoring effects, double heuristics, and academic judgment
What are anchoring effects?
The discovery of anchoring effects, described as such by Amos Tversky and Daniel Kahneman (1974), gave rise to what could most simply be likened to a new paradigm in cognitive psychology research. The exploration of anchoring effects and similar forms of bias has been important not only in psychology but also in political science and the branch of economics that studies people’s judgments of profit and risk on markets: In many situations, people make estimates by starting from an initial value that is adjusted to yield the final answer. The initial value, or starting point, may be suggested by the formulation of the problem, or it may be the result of a partial computation. In either case, adjustments are typically insufficient. That is, different starting points yield different estimates, which are biased toward the initial values. We call this phenomenon anchoring. (Tversky and Kahneman, 1974: 1128)
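The logic of insufficient adjustment in this definition can be illustrated with a toy model. The sketch below is purely illustrative and is not a model proposed by Tversky and Kahneman: the function name, the `adjustment` parameter, and all numbers are hypothetical assumptions. It only shows how the same private belief can yield different estimates when the starting points differ.

```python
def anchored_estimate(anchor, private_belief, adjustment=0.5):
    """Start at the anchor and move only part of the way toward the private belief.

    `adjustment` is a hypothetical parameter: 1.0 would mean full adjustment
    (no anchoring), while values below 1.0 mean insufficient adjustment.
    """
    return anchor + adjustment * (private_belief - anchor)


# Hypothetical numbers: one and the same private belief, two starting points.
belief = 40.0
low = anchored_estimate(anchor=10.0, private_belief=belief)
high = anchored_estimate(anchor=70.0, private_belief=belief)

# Both estimates stay on the anchor's side of the private belief.
print(low, high)
```

With any adjustment factor below 1.0, the two estimates remain pulled toward their respective starting points, which is the pattern the quoted passage calls anchoring.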
Tversky and Kahneman understand anchoring effects as a kind of cognitive inclination that has its roots in one of several different heuristic principles that people use spontaneously in everyday problem-solving. These heuristics can be regarded as ‘shortcuts’ people use when judgments involve intuition. Intuition is crucial and typically works very well. This fact, however, makes it interesting to study intuition in situations where it leads us astray.
How can we demonstrate anchoring effects? Tversky and Kahneman conducted various experiments in an attempt to illustrate how these effects are manifested. In one experiment, they gave subjects a numerical value between 0 and 100. Subjects were asked whether the percentage of African countries that are members of the United Nations was higher or lower than this figure, and were then asked to make their own estimate starting from the given figure. One group of subjects was given 10 as the anchoring value, which resulted in a median estimate of approximately 25. Another group was given 65 as the anchoring value, which instead resulted in a median estimate of approximately 45. The marked difference demonstrates that anchoring values have a significant effect.
Anchoring effects may also influence researchers and experts, as Tversky and Kahneman (1974) point out: ‘Experienced researchers are also prone to the same biases – when they think intuitively’ (p. 1130). However, the research program of Tversky and Kahneman does not really cover the intuitive thinking of experts or experts gathered in groups as that has come to be understood. The standard view of intuition (which they adopt in their studies) is captured in dual-process approaches pitting emotion, sensibility, and intuition against logic and deliberation. However, consonant with much work in Science and Technology Studies (e.g. Collins and Evans, 2007) and according to some of the proponents of the so-called fuzzy-trace theory within cognitive psychology, ‘intuitive thinking underlies the most advanced thinking’ (Reyna, 2012: 333; see also Reyna and Lloyd, 2006). An interesting study in which expert judges were exposed to anchoring effects was reported by Englich and Mussweiler (2001). This study, based on experiments involving experienced court judges, clearly indicated that anchoring effects had an influence on decisions. The researchers were able to observe differences in sentences as great as eight months for exactly the same crimes, correlated with the number of months in jail for which the prosecutor first pressed.
Studies of anchoring effects have been carried out in many different contexts, such as in price assessments (Beggs and Graddy, 2009; Mussweiler et al., 2000; Northcraft and Neale, 1987), negotiations (Galinsky and Mussweiler, 2001; Ritov, 1996), estimates of the likelihood of nuclear war (Plous, 1989), and in connection with various general knowledge questions (Mussweiler and Englich, 2005). Anchoring effects have frequently appeared to be almost inescapable. In one study, even subjects who had been warned of anchoring effects beforehand were not able to avoid them (Wilson et al., 1996). Today’s research on anchoring effects has become partly detached from the old anchoring paradigm of Tversky and Kahneman, with a more multifaceted approach having been developed gradually during recent years (Epley and Gilovich, 2010).
The diversity of anchoring effects in everyday life: Looking for novel approaches
The generalizability and possible limitations of anchoring effects have been common topics of discussion (Chapman and Johnson, 1994; Jacowitz and Kahneman, 1995; Mussweiler, 2001; Oppenheimer et al., 2008; Simmons et al., 2010). Although these effects have been fairly easy to prove in experiments, some have seen them as somewhat enigmatic. What exactly causes anchoring effects to emerge and affect human judgment? What conditions are required for the effects to arise at all? Are there variations and differences of degree? Do temporary frames of mind or certain personality traits play a role (Englich and Soder, 2009; McElroy and Dowd, 2007)? These are a few examples of recently discussed questions that may be important to consider when investigating anchoring effects.
Epley and Gilovich (2010) claim that in order to describe and explain all of the everyday life scenarios in which anchoring effects actually emerge, the anchoring paradigm must be properly opened up to further exploration. They elaborate their view in the following way: The main benefit of this broader perspective is that it brings into view sources of variability in anchoring effects that have so far been largely ignored: variability in the kinds of anchors that influence judgment, variability in processes that give rise to anchoring effects, variability in the contexts in which anchoring effects are likely to be elicited, and variability in consequences of anchoring that go beyond an immediate influence on numerical judgment. (Epley and Gilovich, 2010: 20)
This is where sociology, by focusing on anchoring effects occurring in groups and the creation of consensus, can broaden this discovery originally made in cognitive psychology. Attention can be directed to collective anchoring effects or effects arising on the aggregate level. The anchoring effects, then, are indirectly understood as the result of an interplay, in which a multitude of judgments and negotiations affect the efficiency with which they work. I relate anchoring effects to the epistemological concept of ‘double heuristics’ described by Turner (2012): Thinking in terms of double heuristics compels us to think about collective decision procedures in terms of the same problems of bias, selectivity, and so forth that characterize the individual knowledge related activities of which collective activity is composed. (p. 5)
This approach can help us to target the collective dimension of the anchoring phenomenon in relation to expert knowledge and academic judgment. The concept of double heuristics introduces the idea that individuals, each with their own capacities to judge, each with biases and blind spots, are aggregated by an organized procedure, and this second-order procedure generates its own heuristics, with its own biases and limitations (Turner, 2012: 1).
In the organizational context in focus here, specifically elaborated rules, methods, and technologies are applied to make judgments and decisions. The review work carried out in the observed panel groups requires a series of adjustments of the groups’ respective provisional assessments. Because the Research Council aims at consensus on the merits of a large number of grant proposals, negotiations create a breeding ground for the spread of anchoring effects to the aggregate level. The work of the panel groups can without doubt be regarded as an organized, practical attempt to arrive at reasonable agreements despite differences in outlook. This is a matter of having a dialogue and narrowing the gap between different panel members’ perspectives: ‘People “jump” some amount from their original egocentric anchor and evaluate whether this new perspective plausibly captures the other’s perception’ (Epley et al., 2004: 328). What is special in the case of the Research Council – and described in more detail below – is the significance the average ranking is given as a collection of anchoring values. The average ranking is constantly in focus during the panel group meetings and affects the consensus arrived at in the group. Ultimately, it is in light of the average ranking, which by reflecting all the individual judgments acts as an ‘intermediary link’, that the anchoring effects acquire a new collective breeding ground.
The making of consensus involves adjustments of individual scores during group negotiations. However, this fact does not imply that all the adjustments should automatically be interpreted as the result of collective anchoring effects. For example, some of the proposals that appear to constitute very strong anchors in the average ranking may not be adjusted much, despite pressures from members of the panel group. During the wide range of comparisons of the variations in scores and the plausibility of the rankings, collective anchoring effects affect the final outcome. It is important to note that the problems with anchoring effects are related to the uncertainties and the lack of accuracy of groups’ efforts to adjust scores from initial values (self-generated or externally provided).
Cognitive particularism, communication and the creation of consensus
To understand the academic context of anchoring effects, it is important to add a few points that touch upon the very core of the evaluation process. How is consensus achieved in such a highly specialized, differentiated, and heterogeneous world? In previous research, it has repeatedly been claimed that academic judgment must inevitably bring about various forms of cognitive and institutional bias (Bourdieu, 1975; Chubin and Hackett, 1990; Hemlin, 1996; Huutoniemi, 2012; Langfeldt, 2006; Travis and Collins, 1991; Turner, 2014). At the same time, implicit in the peer review process is that reviewers can disregard their own preferences in order to, as a group, single out the strongest and most promising proposals through dialogue. In fact, the funding agencies use peer review in order to mobilize acknowledged experts as an evaluative technology that can reduce biases and corrupting forces that may occur in the evaluation process (Lamont, 2009). Still, however, the process depends on which particular individuals are selected to review the proposals. Cole et al. (1981) conducted an experimental study of differences in assessments of 150 grant proposals, reviewed by the National Science Foundation (NSF) panel group as well as by a control group, and found a remarkable amount of variation: The fate of a particular grant application is roughly half determined by the characteristics of the proposal and the principal investigator, and about half by apparently random elements which might be characterized as the ‘luck of the reviewer draw’. (p. 885)
A certain measure of chance thus seems to have been confirmed. In addition, there are tensions inherent in the creation of consensus, particularly the effects that reviewer disagreements may have on decisions (see also Mayo et al., 2006; Porter, 2005).
In interviews, researchers have emphasized that academic judgment is about having a specifically developed sensitivity, ‘researcher’s intuition’, through which they learn to recognize worthwhile questions, problems, and methods. Merton (1973) refers to a study in which well-known American Nobel Laureates talked about their personal experiences as researchers: ‘They uniformly express the strong conviction that what matters most in their work is a developing sense of taste, of judgment’ (p. 453). Research intuition is an integral part of academic judgment and can play a constructive role during the process of research evaluation. It might be noted that intuition may not be a homogeneous concept but a label used for different cognitive mechanisms. Glöckner and Witteman (2010: 7–14) have distinguished between four different types of intuition: associative, matching, accumulative, and constructive.
Both agreement and disagreement between reviewers are related to what has come to be known as ‘cognitive particularism’, the existence of cognitive boundaries within and between different scientific specialties and disciplines (Travis and Collins, 1991: 327). The very idea of using panel discussions is to ensure fair evaluations and to reduce differences between the reviewers (Fogelholm et al., 2012; Obrecht et al., 2007). Assessing research in this way is related to what Polanyi (1962) has called an ‘organizational principle’, which he explains as follows: We thus have a considerable degree of overlapping between the areas over which a scientist can exercise a sound, critical judgment. And, of course, each scientist who is a member of a group of overlapping competences will also be a member of other groups of the same kind, so that the whole of science will be covered by chains and networks of overlapping neighbourhoods. Each link in these chains and networks will establish agreement between the valuations made by scientists overlooking the same overlapping fields, and so, from one overlapping neighbourhood to the other, agreement will be established on the valuation of scientific merit throughout all the domains of science. (p. 59)
The organization of peer review is a construction of overlapping competences, and consensus is eventually formed by joint deliberations and compromises. However, it remains an open question whether a particular group of reviewers is able to identify the truly novel and potentially groundbreaking projects. The evaluation of proposals requires that reviewers make sound critical judgments about research yet to be performed, and this is a task that is often fraught with uncertainties and disagreements. Previous studies have also shown that grant peer review tends to be conservative and risk-averse (Braben, 2004; Langfeldt, 2006; Luukkonen, 2012).
Nevertheless, when the Research Council organizes its panel groups, it is self-evident that all judgments count; at the same time, however, it actively attempts to counteract various forces that may have undesired effects on the end result of the review process. Individual value judgments should lead to collective decisions through dialogue. According to Lamont (2009), ‘[d]ebating plays a crucial role in creating trust: fair decisions emerge from dialogue among various types of experts, a dialogue that leaves room for discretion, uncertainty, and the weighing of a range of factors and competing forms of excellence’ (p. 7). Organizing meetings that allow reviewers to have a dialogue with one another is crucial to the expression of different kinds of academic judgments and, thus, to the legitimacy of the entire process. The question is, however, to what extent the stated average ranking and the latent group disagreements can actually be changed. What do compromises lead to in these kinds of decisions? Is there actually scope for reappraising proposals from the beginning if necessary, or is the average value in combination with the considerable time pressure simply too dominant? In a recent article, Fogelholm et al. (2012) emphasized the results of their study, stating that ‘panel discussions per se did not improve the reliability of the evaluation’ (p. 48). The theory of anchoring effects can help to broaden our understanding of the problems that panel members can experience while carrying out their review work in a group that is forced to reach consensus despite uncertainties and disagreements.
The Research Council’s peer review process: A brief overview
The Research Council’s peer review process in the field of natural and engineering sciences proceeds as follows: First, the reviewers read each proposal independently. It is at this stage that they assign their preliminary grades and rank the grant proposals. The reviewers do not receive any formal training in the evaluation of the proposals and must therefore instead rely on their own individualized approaches (Lamont, 2009: 43). The evaluation is made based on four fundamental criteria: (a) novelty and originality, (b) scientific quality, (c) the applicant’s merits, and (d) feasibility. In addition, there is a summary judgment of the proposal’s scientific merits. A seven-point scale is used for all criteria except feasibility, which is instead assessed on a three-point scale. The steps of the seven-point scale are 7 = Outstanding, 6 = Excellent, 5 = Very good to excellent, 4 = Very good, 3 = Good, 2 = Weak, and 1 = Poor. The Research Council also expects the reviewers to take note of other relevant criteria, such as gender, diversity, and societal importance. In the next step, the grades are compiled and an average ranking is established by the Research Council administrators. The average ranking provides a crucial point of departure for the later review work, as it gives a picture of how the group, as a collective, has judged the proposals. This average ranking, around which some of the ensuing discussions are centered, forms a basis for whatever consensus is reached. As I will argue in this article, the average ranking produces a powerful multitude of anchoring values that from the very beginning influences the group in a considerable way. Each panel group has between 10 and 13 reviewers. The results finally agreed upon by the panel groups during the 2-day meetings provide the basis for the subsequent funding decisions, which are made by the Board of the Research Council later.
The joint decision of the panel group concerning the ranking of the proposals is therefore not the final decision in a formal sense – unlike some other granting bodies, such as the US NSF (Holbrook and Frodeman, 2011). Nevertheless, the Board usually follows these recommendations. The making of consensus in the panel is therefore very important for the final decisions.
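The compilation of individual grades into an average ranking, as described in the overview above, can be sketched in a few lines. The source does not specify the exact formula used by the administrators, so the sketch assumes a plain arithmetic mean of the reviewers' summary grades on the seven-point scale. The proposal names, the grades, and the function name `average_ranking` are hypothetical.

```python
from statistics import mean

# Hypothetical summary grades (seven-point scale), three reviewers per proposal.
summary_grades = {
    "proposal_A": [6, 6, 4],
    "proposal_B": [7, 6, 6],
    "proposal_C": [5, 4, 4],
    "proposal_D": [5, 5, 5],
}


def average_ranking(grades):
    """Return (proposal, mean grade) pairs sorted from highest to lowest mean.

    Assumption: the ranking is based on the arithmetic mean of summary grades.
    """
    averages = {name: mean(scores) for name, scores in grades.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


ranking = average_ranking(summary_grades)
for position, (name, avg) in enumerate(ranking, start=1):
    print(f"{position}. {name}: {avg:.2f}")
```

Note that in this toy data proposal_A receives a mixed rating (6, 6, 4) and proposal_D a uniform one (5, 5, 5), yet the two sit next to each other in the list. The averaged list hides this difference, which is one reason why the list can function as a set of anchoring values for the later discussion.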
Some of the stronger unfunded proposals can be sent to a special redistribution group, where a ‘second opportunity’ is given. This opportunity sometimes makes fine-tuning of the grades important; the administrators who are monitoring the meetings sometimes urge the panelists not to use the grades strategically, to give some proposals a better chance in the redistribution group. If the reviewers make efforts to ‘manipulate’ the grades on some of the proposals, the administrators might report it directly to the redistribution group.
Because of the large number of proposals, deliberations are often carried out at a fast pace. However, the discussions do not begin until those who may have a conflict of interest have left the room, which leads to a constant flow of people leaving and entering. Conflict of interest rules are strict, but compliance is largely based on reviewers’ personal conscience and honesty before the group. These provisions made, the discussion can begin. An estimated funding cut-off line, calculated using both the available budget for the competition and the budget requested in the grant applications, is displayed within the list of applications. The line is meant as a tool to facilitate discussion and may not be entirely accurate, as budget reductions may occur. Some proposals that are on the borderline for funding have to be discussed longer and proposals that have a reasonable chance of being funded are allotted the greatest amount of time. Discussions may take quite some time, particularly if the reviewers are not in full agreement. When the conversations get out of hand, the panel group’s chairperson has to intervene and restore order.
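The estimated funding cut-off line mentioned above combines the available budget with the budgets requested in the applications. The source does not describe the exact calculation, so the following is only one plausible reading: walk down the ranked list and draw the line where the cumulative requested budget would exceed the available budget. The function name `cutoff_index` and all amounts are hypothetical.

```python
def cutoff_index(requested_budgets, available_budget):
    """Return how many top-ranked proposals fit within the available budget.

    `requested_budgets` must be ordered by rank (best first). The cut-off
    line would be drawn after the returned number of proposals.
    """
    total = 0
    for count, amount in enumerate(requested_budgets):
        if total + amount > available_budget:
            return count
        total += amount
    return len(requested_budgets)


# Hypothetical requested amounts (arbitrary units), ordered by average ranking.
requested = [4, 3, 5, 2, 4]
line = cutoff_index(requested, available_budget=12)
print(f"Cut-off line after proposal {line} of {len(requested)}")
```

Because the line depends on the order of the list, any change in the ranking near the line also changes which proposals fall above it, which is consistent with the observation that borderline proposals receive the longest discussions.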
As a rule, each proposal is scrutinized by three reviewers; occasionally, external reviewers are involved. The names of all applicants are displayed on a large screen along with the collected grades assigned by the reviewers. The chairperson plays an important leading role; from time to time, he or she must remind the reviewers that time is scarce and press them to compromise. Two administrators and one senior adviser from the Board are also present as observers during the meeting, to guide and monitor the process.
In a typical case, the process may proceed as follows: The main reviewer starts by pointing out that the application is well written and exciting. She also mentions that the project is quite innovative and may lead to important knowledge. Moreover, the applicant has published eighteen articles since 2009, mostly in prestigious journals. The main reviewer then indicates that out of the seventeen proposals she has reviewed, she has ranked the one in question second. She reports the following grades: 6 for novelty and originality, 6 for scientific quality, 6 for merits, and 3 for feasibility. She assigns a summary grade of 6. Following the main reviewer’s remarks, the other reviewers present their own views and grades. The next reviewer agrees with the main reviewer, suggesting a somewhat lower grade for novelty and originality but still agreeing on the summary grade of 6. The third reviewer, however, feels that the suggested grades are too high and should be lowered. Not being as impressed, this reviewer has ranked the proposal much lower, assigning a grade of 4 for the novelty and merits criteria, instead of the other reviewers’ 6s. The chairperson now inquires which preliminary grades should be entered into the system. Grades 5, 6, 5, and 3 are agreed upon. This results in a summary grade of 5. The main reviewer will then have to produce a written judgment that matches the scores.
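The arithmetic of this typical case can be laid out explicitly. The sketch below uses only the grades reported in the example: the main reviewer's grades, the third reviewer's two reported grades, and the agreed grades. Grades that the example does not report are left out rather than guessed. The variable names are illustrative.

```python
# Grades reported in the example above.
main_reviewer = {"novelty": 6, "quality": 6, "merits": 6, "feasibility": 3, "summary": 6}
agreed = {"novelty": 5, "quality": 6, "merits": 5, "feasibility": 3, "summary": 5}
# The third reviewer's grades are only reported for these two criteria.
third_reviewer = {"novelty": 4, "merits": 4}

# How far each agreed grade moved from the main reviewer's opening grade.
shifts = {criterion: agreed[criterion] - main_reviewer[criterion] for criterion in agreed}

# Midpoint between the highest and lowest reported grade on the disputed criteria.
midpoints = {
    criterion: (main_reviewer[criterion] + low) / 2
    for criterion, low in third_reviewer.items()
}

for criterion, shift in shifts.items():
    print(f"{criterion}: {main_reviewer[criterion]} -> {agreed[criterion]} (shift {shift:+d})")
for criterion, midpoint in midpoints.items():
    print(f"{criterion}: midpoint {midpoint}, agreed {agreed[criterion]}")
```

On both disputed criteria, the agreed grade of 5 coincides with the midpoint between the main reviewer's 6 and the third reviewer's 4, while the undisputed grades do not move at all.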
The ranking of a proposal with a mixed rating might undergo significant shifts relative to other proposals during the evaluation process. However, it is generally understood from the outset that the proposals that received the highest total rating prior to the meeting will get funded: No one really disputes the top-rated proposals. The initial scores and rankings of these proposals thus represent ‘strong anchors’, and ones that have direct consequences for the rest of the consensus-making in the panel group.
More complicated anchoring effects are evident in cases of disagreement on the grading or when there is some uncertainty as to how a certain proposal should be assessed in relation to others. Maintaining consistency in the evaluation of quality is a complicated issue. The reviewers compare different subsets of proposals and use slightly different standards that they continuously must try to calibrate. It is on these efforts that I focus below.
Methodology
The empirical investigation that forms the basis of this study was carried out in 2013. Out of nineteen Research Council panel groups in the natural and engineering sciences section, ten were observed during their respective two-day meetings. The panel groups to be observed were selected to obtain variation in scientific discipline as well as based on the varying proportions of international and domestic reviewers: According to the rules of the Research Council, at least 20 percent of the reviewers in each group are to originate from countries other than Sweden. In some of the groups, all members came from the Nordic countries; in those cases, the discussion language was ‘Scandinavian’, rather than English, as it was in other groups. The panel group meetings were held at a major Stockholm conference hotel. Two colleagues and I carried out observations of the meetings, which resulted in a total of twenty observation days. Having a small observation team entailed some advantages. First, it was possible to collect extensive observational data. Second, our observations could be constructively compared and discussed following the meetings. This approach enhanced, in several respects, our picture of what is actually relevant and important to understand regarding the question of how expertise and knowledge are organized in this specific review context.
What was observed?
Our main focus was on the social interaction between the involved individuals and on how the review work was carried out in practice. The general strategy was to try to document as much as possible of what we were able to see and apprehend during the meetings. In line with this ambition, we tried to remain as unnoticed as possible and to perform our observations without disturbing the proceedings in any appreciable way. This turned out to be quite easy, because once the meetings had begun, the pace was so fast that no one seemed to notice or care about our presence. On the other hand, several reviewers were eager to ask us questions during the breaks. Several even expressed some enjoyment in being for once the subjects of a scientific investigation. For the purposes of the study, what occurred during the meetings themselves, when the review work was fully underway, was of primary relevance. But even conversations between reviewers during lunches and coffee breaks conveyed a good deal of interesting information. These conversations often related directly to the assessment work, but they could also concern completely different matters. The observations amounted to a kind of ethnography of meetings. In addition to the obvious importance of these particular meetings, there are more general methodological merits to studying formal meetings: ‘Meetings reveal themselves to be condensed field sites for the examination of the workings of organisational logics’ (Thedvall, 2013: 117; see also Schwartzman, 1989).
Retrospective analysis of anchoring effects on the group level
The most common methods for discovering and proving anchoring effects applied to date have involved designing psychological experiments of various kinds. In these experiments, researchers have studied the existence and degree of these effects in different contexts. The numerical values used have frequently spanned larger ranges, making the effects clearer and easier to test. From the very outset, anchoring effects have been the focus of these studies. But my challenge here is somewhat different: How can one describe and explain anchoring effects on the basis of large-scale empirical data that are not the result of experiments designed specifically to investigate the presence of such effects? How should one approach the small numerical ranges provided by the grading scales used for the quality criteria? Is it possible to see anchoring effects in data of this kind? Can this be done in an acceptable manner at all, without resorting to mere speculation?
I examine anchoring effects through an interpretative analysis that uses the idea of anchoring analogically (see, for example, Holton, 1998; Shelley, 2003; Swedberg, 2014). Anchoring effects became apparent retrospectively – after the observations were completed, after carefully reading through the extensive notes taken during observation sessions and after thoroughly surveying the literature on human judgment. In contrast to common practice in cognitive psychology, I localize the anchoring effects to the group level as well, not only to the individual level. The effects came to be considered the result of a collection of comparisons, adjustments, and rankings, where numerical values constantly served a crucial purpose in the groups’ work to create consensus. Although the initial anchoring values were created when individuals, working in isolation, assigned their preliminary numerical grades, the group work was nonetheless organized around the group’s average ranking. This itself creates a new collective ‘anchoring situation’.
However, understanding anchoring effects in terms of organized group interaction does not entail neglecting the individuals who do the actual work. On the contrary, it is important here to consider both the mutual dependence and the individual variations involved in the review processes. As I noted above, cognitive particularism and research intuition are central aspects of many accounts of academic judgment. Anchoring effects should nevertheless be interpreted on the interaction-aggregate level. But achieving this in the total absence of experimentation requires a different kind of reconstruction work. In this study, anchoring effects are instead elucidated through a kind of retrospective meta-analysis. By describing what was said and done during the panel group meetings, evidence for the existence of collective anchoring effects can be presented. This approach is rendered scientific by virtue of its capacity to elucidate and interpret, rather than to measure and provide clear-cut proof. To be sure, the retrospective method used here entails certain limitations regarding its precision because it may be difficult to determine, after the fact, exactly where in the evaluation process anchoring effects take effect.
Supported by Epley and Gilovich’s (2010) ambition to broaden our perspective on anchoring effects – an ambition that includes acceptance of non-experimental methods – I wish to claim that, regarding the evaluation process, the extent of the numerical scale per se is not as important as the translation from meaning to figures, which may contain a large measure of uncertainty, vagueness, and ambivalence. Even strongly value-loaded words can weigh heavily during the negotiations and can reinforce the gravity of the numerical anchor value. It is therefore important to acknowledge the ‘semantics of anchoring’ (Mussweiler and Strack, 2001). However, in this study, I will first of all focus on the issue of grading and on the influence of the numbers in the review process.
Anchorings and adjustments of academic judgments: Observations and examples from the Research Council panel groups
The organization of the peer review meeting
The task of reaching consensus is one of the most defining conditions of the review process. Reviewers’ chances of achieving consensus depend not only on their individual expert knowledge, shared standards, and colleagueship but also on how the respective funding agencies have organized the entire procedure. Two factors of particular relevance to the review work are how the review methods have been worked out and the size of the overall budget. The available funds obviously set limits on the number of proposals that can be granted financing. Moreover, the shaping of the review methods – the grading scales, ranking procedures, examination responsibility, and so on – may have tangible and interesting effects on the review process. In Norway, political scientist Liv Langfeldt (2001) has analyzed ‘such seemingly irrelevant or “innocent” factors as rating scales and peer panels ranking methods’ (p. 820). Funding agencies frequently differ in their examination methods, and different methods have different consequences. For example, use of preliminary ranking procedures and open decision-making tends to allow more scope for scientific diversity and innovative projects. Enthusiastic reviewers can thus be observed to change their colleagues’ views on proposals that had initially been considered too risky, peripheral, or immature. Langfeldt further pointed out that the combination of some factors – namely, having several reviewers in a group that grades all proposals, using a fine-grained grading scale (1.0–4.0), and employing an average ranking – tends to contribute to a more solid and predictable end result. She further stated that ‘few reviewers per application and no scholars to compare and rank the whole portfolio of applications give ample room for randomness’ (Langfeldt, 2001: 832).
In some respects, the Research Council’s review methods can be seen as a mixture of the types of methods discussed by Langfeldt. For example, there is open discussion of each proposal, which has been scrutinized by three reviewers and assigned an average ranking. However, many of the observed groups made provisional use of decimals to sort and rank the proposals. For obvious reasons, the average ranking serves an important purpose in the group. The idea is that it should give a rough picture of how the group, as a collective, has assessed the grant proposals, thus providing a reference value for reviewers to relate to during the meeting. However, it also creates a new anchoring position that influences future collective adjustments in various ways. Furthermore, the fact that only three reviewers have the responsibility for judging each proposal can make it somewhat difficult for the group to compare all the grades and ratings for a more comprehensive overview. One seemingly inevitable problem that has been identified in relation to the practice of assigning scores to proposals is that it involves certain degrees of arbitrariness and chance (Day, 2015; Mayo et al., 2006).
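To make the mechanics of the average ranking concrete, the following is a minimal sketch and not the Research Council's actual procedure. The proposal names, the grades, and the 1–7 scale are all hypothetical assumptions chosen for illustration. The sketch shows how three individual grades per proposal collapse into one reference value, and how two proposals can share the same average although the reviewers disagree strongly about one of them.

```python
from statistics import mean

# Hypothetical preliminary grades (assumed 1-7 scale), three reviewers per proposal.
preliminary = {
    "A": [6, 6, 5],
    "B": [4, 4, 6],
    "C": [5, 5, 5],
    "D": [7, 3, 5],
}

def average_ranking(grades):
    """Return (proposal, mean grade, spread) tuples sorted from highest to lowest mean."""
    rows = [(name, round(mean(g), 2), max(g) - min(g)) for name, g in grades.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)

ranking = average_ranking(preliminary)
for name, avg, spread in ranking:
    print(f"{name}: average {avg}, spread {spread}")
```

In this invented example, proposals C and D receive the same average, but the spread for D is four grade steps while the spread for C is zero. The single averaged figure that the group sees first does not reveal that difference.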
On uncertainty and the magic of numbers
In our collected observations of the Research Council panel groups, we saw evidence of constellations of expert reviewers working together systematically to achieve consensus. None of the groups failed, in any formal sense, to produce grounds for decision-making during the time allotted to them. Still, the review work was far from problem-free. Several reviewers expressed feelings of uncertainty, powerlessness, and ambivalence in connection with adjusting figures and comparing the various proposals. Some of them had been more conservative in their gradings; others appeared to be somewhat uncertain when they gave in to others during the deliberations. In contrast, other reviewers seemed to be quite confident in their own judgments. And it was there, in the zone between certainty and uncertainty, that grade adjustments were to be made and consensus reached. The possible power of collective anchoring effects depends on the variability in scores and the social interaction, in which tugs of war between different reviewers often end in a kind of pragmatic acceptance of the situation. As shown in psychology studies, reviewers’ varying degrees of confidence in their own and/or others’ judgments may depend, for example, on personality traits and emotions (Englich and Soder, 2009; McElroy and Dowd, 2007). The very point of organizing meetings is to give reviewers a chance to modify the mechanically calculated average value through discussion and joint reflection. In this way, they are able to express their opinions and attempt to convince one another of the proposals’ scientific value. 
But the opportunity they have to influence and change things does not always seem to be fully utilized, as pointed out to the panel group by one reviewer: ‘I’m worried that we’re sticking with our initial ranking, and not changing things.’ In part, this confirms the inertia that is inherent in the review process and that is associated with the fixed review methods drawn up by the Research Council, methods that affect the conditions in which consensus can be created.
The exchange of views presented below provides an example of how it could sound when several reviewers discussed issues they experienced as highly problematic in relation to their work. It took place on the second meeting day, when the reviewers continued to adjust the details of the ranking in order to arrive at a final consensus:
I couldn’t sleep. It really says something about us when a person like that gets that low on scientific quality. … And I know why: because no one who has judged it really understands it. But not understanding the proposal is not the same as low scientific quality. I say this because I feel bad.
But everybody understands the rules of the game!
This is a dilemma. No one is an expert on everything, and the grades fluctuate. At this point we can’t change anything. But I agree with you, there is always a bad taste after these meetings. But let’s save the discussion for afterwards, because we have more to do.
We’re ‘killing’ people. They have no place to go without funding.
We’re not finished; we need to have lunch now.
The reviewers seemed unable to do very much about the circumstances. Although they seemed to work well together on the whole and demonstrated good colleagueship, occasional moments of concern arose that were directly tied to the very task of evaluating, comparing, and assigning new, joint grades to the various proposals. Feelings of inadequacy, injustice, and lack of time were among the problems that could be observed during the meetings. In the above conversation, the group’s chairperson mentioned that the grades ‘fluctuate’. To a certain extent, this captures an important aspect of the conditions under which the entire review process takes place. Grades are undeniably changed during the course of the meetings, which is as intended; reviewers are encouraged to calibrate their grades in order to come to agreements that are as reasonable as possible.
However, if we disregard the evaluation method per se for a moment, the numbers seemed to – remarkably often – exercise a kind of power over the reviewers, which suggested the existence of anchoring effects. Many reviewers bore witness to things that puzzled them, things that seemed to exceed their own and the organization’s control. During a lunch break, one reviewer declared that ‘the figures tend to become strong’. In another group, three of the Research Council’s administrators said that they saw a serious dilemma in the fact that the reviewers become ‘spellbound by the numbers’. All three considered it quite problematic that reviewers sometimes referred to the values almost mechanically, although ‘everybody knows how misleading the figures can be’. According to one of the administrators, when the figures are in place – once they have been anchored – they exercise a magical power over the reviewers, a power that the reviewers seem to find hard to ignore when they adjust and adapt their grades. Different reviewers and administrators frequently mentioned magical forces and chance as inevitable elements of the review process. This seems to be fairly common in many different social contexts:
In everyday life, chance is something that enters our conversation when we get into certain types of trouble and want a certain class of solutions. It is not the only class of solutions – another, historically more prevalent, has been witchcraft. (Levi Martin, 2011: 55; see also Manis and Meltzer, 1994)
Reviewers’ and administrators’ references to chance and magic provide important qualitative evidence that corroborates the idea that the complex review work involves anchoring effects at the interactional-aggregated level. Other relevant factors are that no one has a comprehensive overview and that the grades sometimes vary considerably depending on who has assigned them. What comparisons are the reviewers making and according to what standards? This was a recurrent dilemma, often brought up in the reviewers’ conversations. The amount of time reviewers spend scrutinizing proposals may vary considerably; one reviewer, a bit embarrassed, pointed out during lunch that she had had only 20 minutes to spend on some of the proposals, which might have been highly disadvantageous for their authors.
In an important respect, anchoring effects begin at the level of the individual, when the reviewers read and grade the proposals in isolation: The reviewers anchor their judgments subjectively, assigning preliminary numerical grades to the proposals that will subsequently affect the average ranking. But what do the figures represent in relation to the estimated values? And how does this affect the average ranking, if it is subsequently to be adjusted during the discussions? Several reviewers indicated on different occasions that it was sometimes incredibly difficult to evaluate proposals in terms of the quality criteria. One reviewer pointed out, among other things, that ‘our use of the grades and what they correspond to in words is sometimes a bit vague’. What kind of consensus has been reached if the grades, from the very beginning, contain vagueness and uncertainty? In such situations, the reviewers must often rely on their expert intuition. And as shown in the psychology research, this is when anchoring effects emerge and influence future decisions. Adjusting the grades thus becomes a problem, which one of the chairpersons pointed out to his panel group: ‘It is quite obvious that we have some difficulty in handling the grades.’ This statement is serious because it shows that the proposals’ fates are determined on the basis of a fundamental difficulty in establishing grades – something that could be observed, in principle, in all of the 10 panel groups.
The uncertainty found in the review process may also be explained by what another chairperson pointed out during a break, namely that her group had too many specialties to cover. The consequence of this, according to her, is that certain algorithms discussed in the proposals may be very difficult to understand or that the value of certain publications may be difficult to determine. Such difficulties were mentioned by several group members. Even in groups where these problems were not mentioned explicitly, a surprisingly large number of reviewers stressed their feelings of uncertainty in relation to some of the proposals. One reviewer pointed out, for example, that it was highly disagreeable to be the main reviewer of a proposal in a field one does not master completely, adding, ‘How do you choose between assigning grades 3 and 5?’ Moreover, several reviewers indicated that it could be very difficult to assess the novelty and originality of a proposal. Some even felt that this was completely impossible in some cases. One reviewer found a proposal he had reviewed less than interesting, but he added that he had not completely understood it. The phrase ‘difficult to judge’ was used on a regular basis in all of the observed panel groups. The feasibility criterion was also said to be difficult to judge for a reviewer who is not entirely familiar with the methods and techniques to be used. This became evident in various situations, particularly one in which a reviewer raised objections to her colleagues’ views concerning the feasibility of a project: ‘I disagree on the feasibility. But again, it is not certain what feasibility is.’
Another problem, associated with the large number of proposals to be reviewed, is grade inflation. Many reviewers expressed their frustration over the problem of grade inflation and its effect on the average ranking. When in doubt, many reviewers tend to place themselves in the middle of the grading scale. An older reviewer commented on this in the following way: ‘I don’t know so much about this, and if you don’t know so much you tend to stay in the middle.’ In another panel group, the problem seemed instead to be that too many 6 grades were being assigned. This caused the chairperson to tell the group, half in jest, that they needed to stop giving so many high grades. Thus, grade inflation can manifest itself in various ways. A dialogue between a middle-aged reviewer, who had participated in many panel group meetings, and one of the Research Council administrators went like this:
I’ve tried for many years to make the Research Council understand that the grading scale does not work. Had we really applied it, that would have been a disaster.
This is a well-known problem. Grade 5 is frequent because it’s convenient to assign it. But of course we are here precisely in order to discuss grades and everything. The discussions are meant to level out unclear and vague points.
It appears difficult to totally re-evaluate all grades; the existing average ranking and the need to reach a consensus are always taken into account. One reviewer critically expressed his wonder as to the meaning of the whole thing: ‘But did I get it right: Is the idea that I should be able to reappraise my entire assessment here in the group? If not, what point is there in our sitting here?’ The reviewers may reappraise their assessment, but only so much. That is the great challenge to the panel group as a collective. The trick is to find a reasonable and acceptable consensus while avoiding biases. However, biases still arise when, in comparisons of the proposals’ quality, reviewers’ interpretations diverge owing to uncertainty, vagueness, and disagreement. When adjustments are to be made, the anchoring effects creep, often quite subtly, into the review process. The weight relations are not always easy to see through, but the magic of numbers seems to wield a certain power over the panel groups.
Which anchors carry the most weight? Anchoring effects are always relatively context-dependent phenomena. Reviewers’ fights for their grades are often a matter of negotiation, and negotiations can hover between flashy improvisations and seemingly standard arguments. Anchored values can have considerable effects on the consensus ranking, and this seems to be an inevitable part of the achievement of consensus. No one is immune to what has been given. A chairperson pointed out during a break that everyone in the group seemed to wish to ‘stick with the grades they had assigned at home’. This may have been an exaggeration, however, as most reviewers displayed their willingness to consider and incorporate the other reviewers’ opinions. In some cases, though, the grades seemed to cause one, several, or all reviewers to hold tight to their own figures, often directly or indirectly justifying this with reference to the average ranking. Even when reviewers did not hold tight to their earlier grades, this ultimately led to collective adjustments, which affected the remaining proposals in the ranking. As explained by Graves et al. (2011), ‘most proposals are likely to occupy a tightly packed middle ground. These proposals are the most difficult to separate and a slight change in score can push a proposal below or above a funding line’ (p. 2). When some reviewers went back on their grades and others held on to theirs, this caused the group to ‘drift away’ from certain original grades and to stick to others. This seems to be an essential feature of the dynamics of collective anchoring effects, and it has to do with the multiple gravities between the many judgmental anchors during the deliberation.
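The sensitivity of the funding line in a tightly packed middle ground can be illustrated with a small sketch. Everything in it is a hypothetical assumption made for illustration: the proposal names, the grades, the 1–7 scale, and the rule that a fixed number of top-ranked proposals is funded. It is not a reconstruction of any observed panel. It shows what happens when one reviewer gives way by a single grade step on one proposal while another reviewer raises a grade by a single step on a neighbouring proposal.

```python
from statistics import mean

def funded(grades, budget_slots):
    """Return the set of proposals whose average grade places them above the funding line."""
    ranked = sorted(grades, key=lambda name: mean(grades[name]), reverse=True)
    return set(ranked[:budget_slots])

# Hypothetical grades on an assumed 1-7 scale; most proposals sit in the middle.
grades = {
    "P1": [6, 6, 6],
    "P2": [5, 5, 6],
    "P3": [5, 5, 5],
    "P4": [5, 4, 5],
}

before = funded(grades, budget_slots=2)

# During deliberation, one reviewer lowers a grade for P2 by one step
# and another reviewer raises a grade for P3 by one step.
adjusted = dict(grades, P2=[5, 5, 5], P3=[5, 5, 6])
after = funded(adjusted, budget_slots=2)

print(sorted(before))
print(sorted(after))
```

In this invented case, two single-step adjustments are enough to move P3 above the assumed funding line and P2 below it, although no reviewer has changed an overall assessment by more than one grade.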
One reviewer emphasized, during a discussion in one of the groups, that ‘[w]e really have to agree on how the average grades are arrived at, so that it won’t be fraught with randomness and improprieties’. However, how much can the figures be twisted and turned without causing distortions and biases to emerge during the deliberations? The Research Council officials are well aware of this dilemma. In every group, there was almost always a critical point at which the grades were bandied back and forth and where the question of funding was in the balance. In these situations, the problem was often to discern and agree upon those proposals that were on the threshold of being financed. How should adjustments be made? Every quality criterion was compared and reconsidered from various angles. It was sometimes impossible to see what was actually going on, as there were long intervals of silence and persistent reflection. Grade proposals were presented, taking other suggestions into account. Advances aimed at raising or lowering the grades were made. In connection with a critical decision, one reviewer pointed out that ‘It is not completely satisfying when different proposals get such similar grades.’ In many cases, improvisations were inevitable, and, as noted, improvisation is not without its risk. For example, in a situation where the group clearly began to improvise figures during the second meeting day, it was pointed out by one of the Research Council administrators that they were not allowed to start ‘manipulating’ the figures. After a while he added, ‘It’s really very important that you stick to what was said yesterday, and don’t start changing too much. Because then part of yesterday’s work will be wasted.’
In some cases, there were long discussions about how to interpret the different criteria. Comments could sound like this: ‘But in relation to the preceding proposal, you considered that to be a strength, but in this one you think it’s a weakness. We have to be more consistent.’ Discussions of whether the grading scale was used relatively or absolutely were not uncommon. As a rule, there was agreement that it was both relative and absolute. But how the individual reviewers had proceeded previously, when they worked in isolation and made their assessments, cannot be determined. Several situations arose in which reviewers said that they had made assessments based on quite different standards from the outset, which was construed as having resulted in a distorted and biased point of departure. When they found that too many proposals had been placed somewhere in the middle, around grade 5, for example, several groups began using decimals to distinguish between them, and thus two proposals could receive grades of 5.8 and 5.9. Sometimes several proposals were placed around the same decimal, and in some panel groups, the reviewers arrived at a decision by show of hands.
A numerical grade can mean different things depending on who assigned it. A reviewer from one panel made this point very well: ‘I gave her a 5, which means a lot in my case!’ Does this mean that the same grade could actually be interpreted as a 7 if it were translated to the other reviewers’ ways of defining their value judgments in terms of figures? Another reviewer referred to having been ‘in a bad mood’ when she read some of the proposals, which could explain her stringent grading – on the difference between happy and depressed reviewers in relation to the emergence of anchoring effects, see Englich and Soder (2009). We see here a general element of uncertainty in the review process and in the interpretation of how the different values should be weighed in. Thus, groups sometimes started focusing on interpreting the meaning attached to the evaluations carried out by the individual reviewers.
The anchoring effects occur at a somewhat elusive stage in the review process. This is because the reviewers are constantly making their own private comparisons between the different proposals, and these may not be evident to their colleagues (or other observers), although they push the adjustments in a certain direction. In one of the panel groups, a situation arose in which one reviewer had evaluated a certain proposal as the highest in a ranking. Later on, the ranking itself came to serve as an argument in the deliberations. At one point, this reviewer, showing her frustration, pointed out that she had awarded 7, 7, 6 and 3, and that a grade of 4 for scientific quality, which her colleagues wanted to assign, was much, much too low. After a while, she exclaimed, ‘Come on!’ Another reviewer then replied that proposals are always judged and evaluated through people’s different standards and interpretations of what constitutes quality. She then brought up another example from a previous discussion, in which a certain proposal had been evaluated much lower on one of the grading criteria, but now it seemed that the same criterion was being construed completely differently. Thus, the figures acquire a solidity of their own and constantly influence the review process in a way that is not always desirable or fair. When there is complete agreement on which proposals should receive the top rankings, everything is perceived as relatively unproblematic. When disagreement arises, however, it becomes evident that chance elements make themselves felt in the review process.
During a coffee break, two reviewers with whom I was talking pointed out that there is frequently agreement about which proposals are near the top and which are near the bottom. This does not imply, however, that the reviewers agree completely on those cases. For example, one reviewer joked with his colleagues, claiming that there were top proposals at the bottom of the group’s average ranking. His point was that the high variability in the scores between the reviewers had a profound impact on whether a proposal would be pushed above the funding line. This variability might be based on differences in expertise or on insufficient adjustment of the grades during the meeting. To be sure, any reviewer can suggest ‘raising’ a proposal that has a low ranking. When this occurs, the proposal is brought up for discussion even if it does not qualify for funding. In principle, this never results in funding but only in putting a finishing touch on the grades, with more positive signals being sent back to the applicant.
Sometimes, in this game with figures, it appeared that groups had lost sight of where all these adjustments were actually leading. In an annoyed tone of voice, one of the international reviewers pointed out to the rest of the group (in a case in which she felt they were too lax in using the higher grades), ‘“Excellent” must mean something!’ This is the very stumbling block being dealt with here: How can the content of opinions and numerical grades be combined to form a mutual group evaluation?
Collective anchoring effects and the power of chance
As I have noted above, use of the average ranking is essential to the work of the panel group, because it offers a guiding overview that sheds new light on the individual grades. Without the average ranking, the reviewers would have a much more difficult time getting a firm grip on the general starting position. In a positive spirit, one chairperson emphatically pointed out, at the beginning of the first meeting day, the advantage of using the average ranking as a basis for evaluation: ‘It provides an important idea of the situation on the group level.’ Even so, the average ranking gives rise to a certain ambivalence. Later the same day, the same chairperson repeatedly indicated the need for a ‘better system’ for arriving at this average ranking.
During our observations, several reviewers talked about the effect of chance on the review work. A somewhat extreme view was expressed during a coffee break when one reviewer declared that evaluation of proposals written by younger researchers in particular is more like a lottery that could just as well be taken care of by a computer. Other reviewers suggested that, to a certain extent, this also applies to proposals written by more established researchers. Apparent chance effects can be seen in the following exchange:
So what should we say? You two have given him 4, and X was the only one who understood it, he gave it a 6.
You had too many good ones.
I can go either way.
I’m not an expert in this field. But I could easily bring this guy up. I just had too many in the middle, that’s why I ranked him a bit low.
In certain cases, when reviewers hold on tight to their grades and, in other cases, when they give in (directly or in relation to the average ranking), these behaviors are bound to have effects on the aggregate level. However, it is not easy to establish either why reviewers act as they do or exactly how their actions will influence the collective end result.
Lack of time, uncertainty, and disagreement are the key factors contributing to the emergence of collective anchoring effects during the process of adjusting judgments. Several reviewers testified to the influence of chance. In addition, inertia was felt in most groups; sometimes, one reviewer would struggle in vain to change the group’s standpoint. Inertia could be seen in the reviewers’ view that granting funding to certain proposals often felt like a given from the very outset. Very often the topmost proposals enjoyed an early consensus. The panel group never really considered an alternative outcome, but instead leaned on the agreement inherent in the grades. As a result, the reviewers did not generate reasons for why a certain ‘strong anchor’ may be inappropriate.
When we consider the entire process, including the fact that tendencies in the review process actually change on both the group and the individual level, this mixture of predictability and unpredictability seems an obvious consequence. Despite the fact that the whole process originally emerges from individual judgments, these judgments are reshaped by the organizational conditions and the specific assessment methods used. This involves a complicated interaction between a large number of different anchoring values that are expected to be adjusted and that cause the power of chance to penetrate what would seem to be self-evident in the ranking. The extent to which the figures move away from the previously anchored values varies from case to case, as does the degree of arbitrariness. Moreover, the underlying factors guiding the groups’ consensus-making sometimes appear rather unclear, as observed by one reviewer who said with astonishment, ‘How did we reach this – considering the original grades given?’
Concluding remarks
Peer review is a cornerstone of science. However, as previous research has shown, it comes with imperfections. To study the peer review of grant proposals, and processes for creating consensus, I started from the cognitive psychological theory of anchoring effects. However, I broadened the interpretive framework, making it more sociological: I identified anchoring effects at the aggregate group level and in relation to conditions of social interaction. In order to further explain the sociological meaning of collective anchoring effects, I applied the notion of double heuristics to the issue. The concept of double heuristics introduces the idea that experts, each with their own special capacities to judge, each with preferences, biases, and blind spots, are aggregated by an organized procedure, and this second-order procedure generates its own heuristics, with its own biases and limitations (Turner, 2012).
My goal has not been to prove or measure anchoring effects, which would have required other methods altogether. Instead, it has been my intention to show, using a retrospective analysis, that the anchoring effects emerge through combinations of different conditions that guide the panel groups’ deliberations, not least owing to the use of numerical grades and an average ranking. In the study, several people in the different groups talked about the ‘magic of numbers’. In almost every group, the influence of chance on the review process was mentioned. In the context under study, this experience of chance can be related to the collective anchoring effects that emerge through the large number of anchorings, adjustments, and reappraisals that are made during panel meetings. These effects are in part linked to the uncertainty and disagreements that occur in relation to the evaluation of scientific quality. Most of the reviewers struggled, on several occasions, with their own as well as one another’s judgments, as I have attempted to depict in this analysis. Many small biases in scoring can have elusive consequences for the creation of consensus (Day, 2015; Graves et al., 2011). I regard my identification of collective anchoring effects not as an alternative explanation to what Cole et al. (1981) presented as the ‘luck of the reviewer draw’, but rather as an explanatory complement to help understand the relation between chance and consensus in the review process. I focus on the type of interactional-aggregated bias that is connected to such seemingly innocent factors as grading and ranking.
An important question is whether it would be possible to minimize or even to avoid anchoring effects in the review process. It has, for instance, been suggested that individual anchoring effects can be reduced by corrective strategies in which efforts to question the plausibility of anchors by counterfactual thinking are employed – ‘to consider the opposite’ (Mussweiler et al., 2000). But what exactly would it mean to consider the opposite in the review process? Even if it is possible to develop strategies for reducing individual anchoring effects in peer review, the nature of the review process as a whole is likely to make it difficult to neutralize collective anchoring effects. As I have argued in this article, the average ranking produces a powerful multitude of anchoring values that influences the panel in a considerable way from the very beginning and throughout the negotiations.
This study does not provide any solutions concerning how to deal with collective anchoring effects, nor does it provide any policy recommendations. However, improved knowledge of this very important process within the scientific communities may further benefit the ongoing discussions concerning peer review as a method for funding research.
Acknowledgements
I would like to thank Staffan Furusten and Adrienne Sörbom for their generous help in doing observations in some of the panel groups. I also want to thank Moa Bursell, Martin Gustafsson, Mikaela Sundberg, and Richard Swedberg for valuable comments on earlier versions of the manuscript. I am also grateful to the journal’s anonymous reviewers and the editors for insightful comments that substantially improved the clarity of this article. And last but not least, I want to thank the Swedish Research Council for giving me access to the field.
Funding
The postdoctoral project of the author between 2012 and 2014 was funded by Stockholm Centre for Organizational Research (Score).
