Abstract
Currently, guidelines do not exist for applying interrater agreement indices to the vast majority of methodological and theoretical problems that organizational and applied psychology researchers encounter. For a variety of methodological problems, we present critical values for interpreting the practical significance of observed average deviation (AD) values relative to either single items or scales. For a variety of theoretical problems, we present null ranges for AD values, relative to either single items or scales, to be used for determining whether an observed distribution of responses within a group is consistent with a theoretically specified distribution of responses. Our discussion focuses on important ways to extend the usage of interrater agreement indices beyond problems relating to the aggregation of individual level data.
Assessments of interrater agreement, or the degree to which raters are interchangeable (Kozlowski & Hattrup, 1992), 1 are integral to many types of organizational and applied psychology research. For instance, interrater agreement assessments have recently been central with respect to addressing substantive questions within domains such as organizational climate and leadership (e.g., Dawson, Gonzalez-Roma, Davis, & West, 2008; Walumbwa & Schaubroeck, 2009), conducting quantitative and qualitative research, as well as laboratory and field studies (e.g., Katz-Navon, Naveh, & Stern, 2009; Kreiner, Hollensbe, & Sheep, 2009; Van Kleef et al., 2009), developing measures (e.g., Bledow & Frese, 2009; Lawrence, Lenk, & Quinn, 2009), dealing with various types of data analysis problems (e.g., Grant & Mayer, 2009; Nicklin & Roch, 2009; Trougakos, Beal, Green, & Weiss, 2008), and deciding whether or not to aggregate data (e.g., Borucki & Burke, 1999; Takeuchi, Chen, & Lepak, 2009). Further, usage of interrater agreement statistics is on the rise. In the Journal of Applied Psychology and Personnel Psychology alone, there has been a largely linear increase in the use of these statistics over the past decade (see Figure 1). Notably, in 2010 almost half of the articles published in these journals used interrater agreement statistics.

Figure 1. Percentage of articles published in Personnel Psychology and the Journal of Applied Psychology that used interrater agreement statistics, including rWG, average deviation (AD), intraclass correlation (ICC), percentage agreement, and Cohen’s kappa.
Despite the relevance of interrater agreement assessments for dealing with a broad array of theoretical and methodological issues and their widespread usage, systematically derived guidelines for applying interrater agreement indices to the vast majority of problems that researchers and practitioners encounter do not exist. The primary objective of this article is to derive practical guidelines to assist researchers using the average deviation (AD) index in making more informed decisions about interrater agreement problems. We focus on the AD index, the average deviation from the mean or median of ratings, for two primary reasons. First, AD is straightforward. It measures agreement only, whereas intraclass correlations (ICCs) measure both agreement and reliability simultaneously (LeBreton & Senter, 2008), potentially complicating inferences. Further, for both ICC and rWG, researchers must choose from among numerous variations of the statistic (see LeBreton & Senter, 2008, for a review). Second, AD performs well. In a simulation study, Roberson, Sturman, and Simons (2007) found that the AD index performs as well as other similar statistics. Kline and Hambley (2007) reported similar findings.
Importantly, we are concerned with practical significance, or “whether an index indicates that interrater agreement is sufficiently strong or disagreement is sufficiently weak so that one can trust that the average opinion of a group is interpretable or representative” 2 (Dunlap, Burke, & Smith-Crowe, 2003, p. 356), as practical significance is the basis on which agreement is typically evaluated. We present critical values for addressing the frequently asked methodological question concerning practical significance, “How much agreement/dispersion is there?” These critical values can be used to assess agreement on a single item or a scale. This question concerns the level of agreement in a set of ratings. An answer to this question often informs decisions about the quality of a measure of central tendency, such as a group’s mean, as an indicator of the group’s standing on a phenomenon or construct of interest. While previous work has also addressed this question, as we will discuss in what follows, the guidelines provided are of very limited use.
In particular, we go beyond the work of Burke and Dunlap (2002), who previously provided a decision rule for interpreting the practical significance of observed AD values, to provide decision rules that cover many more circumstances. As we detail in what follows, though the calculation of AD does not require the specification of a null distribution representing no agreement, the interpretation of AD does. In other words, while one can calculate AD in the absence of a specified null distribution, one cannot draw conclusions regarding observed AD values without comparing them to some notion of “no agreement.” Burke and Dunlap’s guideline is based exclusively on the uniform distribution as the null distribution; there are no guidelines for interpreting the practical significance of AD relative to any other null distributions. In what follows, we discuss the criticisms of researchers’ overreliance on the uniform distribution despite other distributions often being more appropriate. Herein, we provide guidelines for interpreting AD in terms of the level of agreement relative to numerous other distributions. Our guidelines will allow researchers to interpret interrater agreement relative to null distributions more appropriate to their research than the uniform distribution.
Furthermore, we present guidelines for addressing the less commonly posed yet theoretically important question of “How well does the pattern of observed agreement/dispersion match the theoretically specified pattern of agreement/dispersion?” These guidelines can be used in relation to either agreement on a single item or a scale. An answer to this question informs decisions regarding the scoring of the group as consistent or not with the theoretically specified distribution and, thus, the use of such scores in subsequent analyses at the group level of analysis. Addressing questions related to the pattern of dispersion will be of increasing importance as researchers attempt to test new theories concerning group and other higher level phenomena that specify patterns of dispersion as variables (e.g., see DeRue, Hollenbeck, Ilgen, & Feltz, 2010; Harrison & Klein, 2007). By focusing on the pattern in addition to level of agreement/dispersion, our work promotes conceptual advances in research and goes beyond previous work on interrater agreement (e.g., Burke & Dunlap, 2002).
For the purpose of demonstrating how our guidelines would be used to address problems relating to the pattern of dispersion, we will focus on notions of diversity and team efficacy dispersion, as theories relating to these phenomena have recently been presented. For the purpose of demonstrating how our guidelines would be applied to the assessment of the level of agreement, we focus our discussion on the common use of interrater agreement indices for data aggregation decisions. The guidelines we present, however, would apply to the study of a broad array of interrater agreement problems.
To unfold our discussion, we begin with a brief summary of research on multilevel modeling and data aggregation to set the stage for a discussion related to assessments of the level of agreement. This discussion also includes an overview of the relevance of interrater agreement assessments for determining whether or not the observed pattern of dispersion matches a theoretically specified pattern of dispersion. Then, we present interpretive standards for assessments of interrater agreement for both the level of agreement and pattern of dispersion, with detailed discussions of how the derived guidelines can be applied to a variety of research problems.
Issues Related to the Level of Agreement and Pattern of Dispersion
In this section, we discuss the use of interrater agreement in multilevel research to justify the aggregation of lower level data to higher levels of analysis based on the level of observed agreement. Then, we discuss a second possible usage of interrater agreement statistics, which is to assess the goodness of fit between an observed pattern of dispersion and a theoretically specified pattern of dispersion. We give examples of recent multilevel theories that predict outcomes based on patterns of dispersion. For both level of agreement and pattern of dispersion, we discuss the limited guidelines available to researchers for interpreting agreement.
Level of Agreement
Multilevel research commonly entails researchers aggregating data so as to create measures or indicators of higher level constructs. The appropriateness of representing higher level constructs by aggregating individual-level data is established by a composition model, which represents theory on how multilevel constructs are related at each level of analysis (Chan, 1998; Kozlowski & Klein, 2000; see also Klein, Dansereau, & Hall, 1994; Rousseau, 1985). For instance, Chan’s (1998, p. 236) direct consensus model is the idea that the “meaning of [the] higher level construct is in the consensus among lower levels”; the referent-shift consensus model is the idea that the “lower level units being composed by consensus are conceptually distinct though derived from the original individual-level units”; and the dispersion model is the idea that the “meaning of [the] higher level construct is in the dispersion or variance among lower level units.” Importantly, composition arguments indicate the type of evidence needed to justify the aggregation of individual-level data, with several models, including the direct consensus and referent-shift models (Chan, 1998), specifying interrater agreement, or the interchangeability of raters, as the appropriate type of evidence. Interrater agreement is also important for dispersion models (Chan, 1998); in this case, the degree of agreement itself represents the higher level construct.
Essentially, interrater agreement via the average deviation index is established by demonstrating that observed agreement is sufficiently greater than no agreement. Thus, though it is not necessary to the calculation of AD, in order to assess, or interpret, observed AD values, researchers must identify an appropriate random response distribution, or null distribution, to which observed variability in responses can be compared. A number of scholars have cited the choice of a null distribution as key to interpreting indices of interrater agreement, and thus drawing appropriate inferences from data (e.g., Brown & Hauenstein, 2005; A. Cohen, Doveh, & Nahum-Shani, 2009; James, Demaree, & Wolf, 1984; LeBreton & Senter, 2008; Lindell & Brandt, 1997; Lüdtke & Robitzsch, 2009; Meyer, Mumford, & Campion, 2010). In practice, however, researchers routinely rely on the uniform distribution as the null distribution, though doing so is likely often inappropriate (e.g., Brown & Hauenstein, 2005; Meyer et al., 2010). In fact, LeBreton and Senter (2008) recently called for a moratorium on the unconditional reliance on the uniform distribution.
The consequences of inappropriately comparing observed data to the uniform null distribution can be (a) that researchers mistakenly do not read interrater agreement as being sufficient for aggregation to higher levels of analysis, (b) that researchers mistakenly read interrater agreement as being sufficient for aggregation to higher levels of analysis (e.g., see Meyer et al., 2010), or (c) that researchers fail to appropriately interpret a group’s standing on a variable of interest. Thus, comparing observed data to an inappropriate null distribution can lead to erroneous inferences that have important implications for researchers. Nonetheless, the only decision rule for interpreting the practical significance of observed AD values is Burke and Dunlap’s (2002) decision rule, which is based on the uniform distribution as the null distribution. Currently, there are no guidelines for interpreting practical significance relative to any other distributions.
While assessments of within-group agreement for methodological purposes, such as data aggregation as discussed previously, address the question, “How much agreement/dispersion is there?” another question researchers can answer using interrater agreement indices is “How well does the pattern of observed agreement/dispersion match the theoretically specified pattern of agreement/dispersion?” In the following we discuss the issue of the pattern of dispersion and the theoretical distributions to which observed patterns can be compared.
Pattern of Dispersion
Harrison and Klein (2007) recently argued for the theoretical import of considering the pattern of dispersion. They distinguished among separation diversity (e.g., differences in opinions, beliefs, or attitudes), variety diversity (e.g., differences in knowledge or experience), and disparity diversity (e.g., differences in proportionate ownership or control over socially valued assets). They argued that depending on the type of diversity, minimum, moderate, and maximum diversity would be associated with differently shaped distributions; that is, both the type and degree of diversity determine the shapes of distributions. For instance, maximum separation diversity is characterized by a bimodal distribution, maximum variety diversity is characterized by a uniform distribution, and maximum disparity diversity is characterized by a skewed distribution. For separation diversity, minimum, moderate, and maximum degrees of diversity are characterized as unimodal, uniform, and bimodal, respectively. Considering both type of diversity and pattern of dispersion, they argued that maximum separation diversity (bimodal distribution) and maximum disparity diversity (skewed distribution) will have negative outcomes, such as reduced cohesion and group member input, respectively, while maximum variety diversity (uniform distribution) will have positive outcomes, such as increased creativity.
Importantly, according to their theory, both the type of diversity and the pattern of dispersion must be known in order to effectively predict outcomes. For example, separation diversity could be measured with regard to team members’ opinions about what their teams’ goals are (Harrison & Klein, 2007). For each team, the pattern of the distribution of these opinions would be compared to unimodal, uniform, and bimodal distributions as these are the distributions theoretically specified by Harrison and Klein (2007) as representing minimum, moderate, and maximum separation diversity. The degree of separation diversity, then, would be indicated by the theoretical distribution that is most similar to the observed distribution. With this measure of degree of separation diversity for each team, in addition to measures of cohesion, conflict, trust, and performance, researchers could test Harrison and Klein’s hypothesis that as the degree of separation diversity increases, team outcomes will be more negative: less cohesion and trust, more conflict, and lower performance.
DeRue et al.’s (2010) work on team efficacy provides another example of the potential theoretical importance of the pattern of dispersion above and beyond the level of dispersion. They argued that teams could have the same level of dispersion in their team efficacy ratings, but have different theoretically meaningful patterns of dispersion. These different patterns of dispersion, they argued, would predict different outcomes. Thus, according to DeRue et al.’s theory of team efficacy dispersion, assessing the pattern of dispersion in team efficacy ratings is essential for making predictions about team effectiveness. For instance, they argued that while a bimodal distribution of team efficacy ratings would lead to both positive and negative outcomes, a uniform distribution would lead to positive outcomes. Regarding the effects of a uniform distribution, their argument was that disagreement among team members will lead them to share their differing views, thus enhancing team structuring, planning, and learning, while simultaneously allowing the team to avoid two problems: extreme magnitudes of efficacy, which can lead either to overconfidence or helplessness, and social factions, which create dysfunctional conflict. In contrast, they argued that a bimodal distribution will similarly lead to team members sharing their differing views and thus enhancing team processes, but due to the existence of social factions, will also lead to dysfunctional conflict.
While the question of the level of dispersion has been important for various reasons, especially justifying aggregating individual-level data to form higher level variables, it is likely that the question of the pattern of dispersion will become increasingly important as more researchers consider the theoretical import of response distributions in and of themselves. This forecast is consistent with a recent call from Edwards and Berry (2010) to increase the theoretical precision in management research by developing hypotheses that specify effects in terms of magnitude, form (linear, nonlinear, etc.), and conditions (i.e., moderators). In reviewing 25 years (1985-2009) of articles published in the Academy of Management Review, Edwards and Berry (2010) found that 10.4% of the propositions stated only that a relationship would exist, and 89.6% only indicated the direction of the relationship. The theories presented by DeRue et al. (2010) and Harrison and Klein (2007) are important steps toward more precise management theories because they consider the shapes of distributions rather than simply measures of central tendency.
In cases for which the pattern of dispersion is of interest, it will be necessary to specify a “null response range,” analogous to a null range with regard to a formal test of the null hypothesis (see Greenwald, 1975), to determine whether the observed pattern of responses, or the relative percentages of individuals within the respective categories, is consistent with the theoretical distribution. To date, though researchers have suggested that observed patterns of dispersion can be quantitatively assessed (DeRue et al., 2010; Harrison & Klein, 2007), no one has developed practical guidelines for drawing inferences about the goodness of fit between an observed distribution and a theoretically specified distribution. As such, practical guidelines are needed for addressing both the methodological question of the level of agreement/dispersion and the theoretical question of the pattern of responses.
Summary
In order to address this dearth of guidelines, we specify a variety of response distributions that researchers could use to address a number of theoretical and methodological issues, and we derive decision rules for the AD index relevant to each of these distributions to aid researchers in making inferences about interrater agreement. We explain why and how the critical values presented must be used differently to answer different research questions. Our intention is to help researchers to interpret interrater agreement under the specified conditions, and importantly, the results will help researchers to make more appropriate decisions, including those regarding the aggregation of data and more appropriate inferences regarding the interpretation of group phenomena. In what follows, we discuss the AD index, relevant distributions, and interpretive standards for the AD index.
The AD Index of Interrater Agreement
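Burke, Finkelstein, and Dusig (1999) introduced the average deviation as an index of interrater agreement, which represents the average absolute deviation in ratings from the mean rating of an item (ADM), 3 and as such is interpretable in the metric of the original scale. ADM for an item is calculated as follows:

$$AD_{M(j)} = \frac{\sum_{k=1}^{N} \left| x_{jk} - \bar{x}_{j} \right|}{N}$$

where N is the number of raters, x_jk is the rating of item j by rater k, and x̄_j is the mean rating of item j.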
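As a minimal sketch of the definition just given, the following computes ADM for a single item from raw ratings. The function name `ad_m` and the example ratings are hypothetical and serve only as an illustration; they are not taken from the article.

```python
from statistics import mean


def ad_m(ratings):
    """Average absolute deviation of ratings from their mean (AD_M) for one item.

    `ratings` holds one rating per judge for the same item.
    """
    if not ratings:
        raise ValueError("ratings must contain at least one value")
    item_mean = mean(ratings)
    return sum(abs(x - item_mean) for x in ratings) / len(ratings)


# Hypothetical ratings from five judges on a 5-point item (illustration only).
example_ratings = [4, 4, 5, 3, 4]
print(round(ad_m(example_ratings), 2))  # 0.4
```

Because the result is expressed in the metric of the rating scale, a value of 0.4 here means that judges deviate from the group mean by four tenths of a scale point on average.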
As noted previously, Burke and Dunlap (2002) derived a decision rule for inferring the practical significance of observed AD values. This decision rule has two critical limitations. First, it only addresses assessments of the level of agreement, not the pattern of distributions, which may be theoretically important. With the advance of theories such as DeRue et al.’s (2010) theory of team efficacy dispersion and Harrison and Klein’s (2007) theoretical classification of types of diversity, multilevel researchers will need to consider agreement/dispersion as a theoretically meaningful issue. As such, guidelines addressing interpretations of the shapes of distributions are needed. Second, this decision rule applies only when the uniform distribution is the appropriate null distribution. As discussed previously, though the uniform distribution is widely applied, it is thought to be quite often inappropriately applied. There is a mounting push from the scholarly community to justify the choice of a particular null distribution, rather than using the uniform distribution unconditionally, yet too few guidelines exist for researchers who do opt to use alternative null distributions.
In what follows, we identify the null and theoretical distributions used in our article. Then we explain how we derived critical values for evaluating the practical significance of interrater agreement in relation to these null distributions. These critical values can be used to assess the level of interrater agreement in regard to data aggregation, which is a within-group assessment, as well as a host of other problems relating to interrater agreement. Further, based on these critical values, we calculated null ranges to be used in relation to studying theoretical problems; that is, assessing the fit between an observed pattern of dispersion and a theoretical distribution.
Interpretive Standards for the AD Index
Here we present our derivations and resulting critical values and null ranges for the AD index given a number of different response distributions. First, the distributions are described in brief. Then, we explain our derivations of interpretive standards for the AD index. Finally, detailed discussions of problems that relate to these theoretical and methodological reasons are presented.
Distributions
The distributions and their methodological and theoretical bases are listed in Table 1. The proportions endorsing each value for 5-point and 7-point scales are listed for each distribution in Tables 2 and 3. Graphical depictions of these distributions are presented in Figures 2 through 5.
Table 1. Example Theoretical and Methodological Bases for Different Response Distributions.
Table 2. Critical Values and Null Ranges for ADM Given Distributions Defined by Skew.
Note: The critical values were calculated without restricting decimal places, but they were rounded to two decimal places for reporting purposes. The only exception was ADM, which was restricted to two decimal places when inputted into the calculations.
a. The proportions are rounded such that they do not sum to 1. For this scale, equal proportions summing to 1 require 15 decimal places.
Table 3. Critical Values and Null Ranges for ADM Given Distributions Defined by Kurtosis and Variance.
Note: The critical values were calculated without restricting decimal places, but they were rounded to two decimal places for reporting purposes. The only exception was ADM, which was restricted to two decimal places when inputted into the calculations.
a. “Moderate” and “extreme” refer to the distance between subgroups.
b. “A” and “B” refer to the differential proportion of subgroup responses.
c. The proportions are rounded such that they do not sum to 1. For this scale, equal proportions summing to 1 require 15 decimal places.

Figure 2. Slight, moderate, and heavy skew distributions for a 5-point scale.
Figure 3. Bimodal distributions for a 5-point scale.
Figure 4. Subgroup distributions for a 5-point scale.
Figure 5. Triangular-shaped, bell-shaped, and uniform distributions for a 5-point scale.
First, we developed critical values for three basic forms of skewed distributions: slight skew, moderate skew, and heavy skew (see Table 2 and Figure 2). Second, while one could model many forms of bimodal distributions, here we simplify our presentation by suggesting two: “moderate” bimodal and “extreme” bimodal. As shown in Table 3 and Figure 3, the size of the subgroups in both cases is 50% of the raters; the difference between the two distributions is that in the moderate bimodal distribution, the subgroups are less divergent than they are in the extreme bimodal distribution. Third, though there are numerous possible ways in which one could model subgroup distributions, we have simplified our presentation by considering four possibilities based on two dimensions: the size of the subgroup and the distance between the subgroup ratings and the majority of ratings. These distributions are shown in Table 3, and they are graphically depicted for a 5-point scale in Figure 4. We define a smaller subgroup as 10% of the raters (labeled as “A” in Table 3 and Figure 4) and a more moderately sized subgroup to be 20% of the raters (labeled as “B” in Table 3 and Figure 4). We define extreme distance as the subgroup responses and the majority of responses being on opposite ends of the Likert-type scale and moderate distance as the subgroup responses being at the midpoint of the scale, while the majority of responses are at one extreme of the scale (these are labeled “extreme” and “moderate” in Table 3 and Figure 4). Finally, we present triangular-shaped, bell-shaped, and uniform distributions in Table 3; they are graphically represented in Figure 5. The triangular-shaped distributions are based on a formula presented by Messick (1982) and the bell-shaped distributions are based on LeBreton and Senter (2008). 
Note that the upper limits for the uniform distribution (presented in both Tables 2 and 3) are consistent with Burke and Dunlap’s (2002) c/6 decision rule for assessing the practical significance of AD, where c is equal to the number of response categories.
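To make the c/6 rule concrete, the short sketch below evaluates it for 5-point and 7-point scales and applies it to an observed value. The function names and the observed value of 0.70 are hypothetical, and the sketch covers only the uniform null distribution to which the rule applies.

```python
def uniform_null_upper_limit(num_categories):
    """Burke and Dunlap's (2002) c/6 upper limit for AD under a uniform null."""
    if num_categories < 2:
        raise ValueError("a scale needs at least two response categories")
    return num_categories / 6


def shows_agreement(observed_ad, num_categories):
    """True when the observed AD is at or below the c/6 upper limit."""
    return observed_ad <= uniform_null_upper_limit(num_categories)


for c in (5, 7):
    print(c, round(uniform_null_upper_limit(c), 2))  # 5 0.83, then 7 1.17

# Hypothetical observed AD of 0.70 on a 5-point item (illustration only).
print(shows_agreement(0.70, 5))  # True
```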
Critical Values for Level of Agreement
In order to simplify our derivations, we begin with the basic case of agreement across judges on a single item with respect to two categories. 4 In the case of a dichotomy, AD can be calculated based on the proportion of judges falling into one of the two categories (Burke & Dunlap, 2002). 5 Based on an upper limit for AD of .35, where .35 or lower represents meaningful agreement, Burke and Dunlap (2002) demonstrated that meaningful agreement could be defined as 77% of the judges endorsing one category. Based on a more stringent upper limit of AD, .33, they indicated that meaningful agreement could be defined as 79% agreement. 6 They noted that this notion of 77% to 79% agreement being meaningful is consistent with many practical examples and problems relating to proportional agreement, such as 60% to 80% agreement being required for including critical incidents when creating behaviorally anchored rating scales (BARS; Cascio, 1998). Based on Burke and Dunlap’s calculations for the AD index, as well as conventional interpretations of meaningful agreement in percentage or proportional terms, we adopted a starting value of 80% agreement.
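The link between the AD limits and the percentages cited above can be checked numerically. The sketch below assumes the two categories are coded 0 and 1 (an assumption made here for illustration); under that coding, when a proportion p of judges endorses one category, the mean is p and the average absolute deviation works out to 2p(1 - p). The function name is hypothetical.

```python
def ad_dichotomous(p):
    """AD_M for a two-category item coded 0/1, with proportion p endorsing one category.

    Judges coded 1 (share p) deviate from the mean p by (1 - p); judges coded 0
    (share 1 - p) deviate by p, so AD = p(1 - p) + (1 - p)p = 2p(1 - p).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    return 2 * p * (1 - p)


for p in (0.77, 0.79, 0.80):
    print(p, round(ad_dichotomous(p), 2))  # 0.35, 0.33, and 0.32 respectively
```

Under this coding, 77% and 79% agreement reproduce the limits of .35 and .33, and the 80% starting value corresponds to an AD of .32.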
We note that assumptions are necessary to the derivation process. By making ours explicit, readers can readily revise the starting value as needed; yet, we suggest that a starting value of 80% agreement, or 20% disagreement, relative to the AD index is likely to suit most readers’ situations. Notably, the value of 20% disagreement is comparable to the upper limits of acceptable disagreement for the AD index for scales that range from 3 to 99 response options (see Burke & Dunlap, 2002); that is, they are comparable in the sense of being approximately equal to the maximum level of allowable disagreement. In other words, our use of the dichotomous case here does not limit the applicability of our derivations to dichotomies.
Given our intent of proposing interrater agreement cut-offs and null ranges for response distributions for Likert-type scales with markedly different dispersion, which have different numbers of response options and reflect a variety of response patterns including non-normal distributions, we next convert a proportion of .80 (or 80%) to a standardized effect size (a correlation coefficient, r) to work further with variances as indicators of dispersion. Since non-normal response distributions are expected in many cases for theoretical reasons, we initially employ an arcsine transformation to convert the proportion of .80 to a standardized effect size (i.e., a d-statistic; see Lipsey & Wilson, 2001) and then use a maximum likelihood transformation of this d-value to obtain a correlation coefficient.
The d-value is computed as the difference between the arcsine of the proportion representing meaningful agreement (i.e., .80) and the arcsine of the proportion representing no agreement (.00) using Lipsey and Wilson’s (2001) formula:

$$d = 2\arcsin\left(\sqrt{.80}\right) - 2\arcsin\left(\sqrt{.00}\right)$$

The resulting d-value is 2.214. Next, we transform the value of 2.214 to a correlation coefficient via the maximum likelihood formula (Hunter & Schmidt, 2004):

$$r = \frac{d}{\sqrt{d^{2} + 4}} = \frac{2.214}{\sqrt{2.214^{2} + 4}} \approx .74$$
7
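The two transformations can be reproduced in a few lines. This is a sketch under stated assumptions: the arcsine effect size is taken as the difference between 2·arcsin(√p) for the two proportions, which reproduces the reported d of 2.214, and the conversion to r uses the equal-group-size form d / √(d² + 4). The function names are hypothetical.

```python
import math


def arcsine_d(p_agree, p_null=0.0):
    """Arcsine-transformed effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p_agree)) - 2 * math.asin(math.sqrt(p_null))


def d_to_r(d):
    """Convert d to r, assuming equal group sizes: r = d / sqrt(d^2 + 4)."""
    return d / math.sqrt(d ** 2 + 4)


d = arcsine_d(0.80)
r = d_to_r(d)
print(round(d, 3), round(r, 2))  # 2.214 0.74
```

Readers who prefer a different starting value than 80% agreement can pass another proportion to `arcsine_d` to see how the resulting correlation shifts.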
We note that arriving at approximately the same value for a correlation as Burke and Dunlap (2002) does not indicate circularity in our derivations, but it does reflect our explicit assumption that the underlying response distribution may meaningfully deviate from normality, thus calling for the arcsine transformation of percentage agreement to produce a correlation. As we discuss in the appendix, assuming that the underlying distribution of responses is normal would call for a probit transformation of the proportion to produce a correlation. The resulting value for the correlation would become approximately .8 (Lipsey & Wilson, 2001). Furthermore, Burke and Dunlap’s starting point of .7 for a correlation was, in large part, based on empirical data relating to stability coefficients and correlations based on ratings of targets by alternate sources. Their judgment and ours that a correlation of .7 is a reasonably high correlation is consistent with J. Cohen (1977) who indicated that correlations greater than or equal to .5 can be considered large.
Next, as is recognized in a number of quantitative fields (e.g., see Burke & Dunlap, 2002; Greene, 1997; Guion, 1998; McCall, 1970; Parsons, 1978), we define a correlation (r) in terms of variances as
Recall that in Equation 5, AD² was substituted as an approximation for the observed variance; that is to say that AD approximates the standard deviation (σ). In fact, the standard and average deviations vary by a constant that is dependent on the specified response distribution. As Burke and Dunlap (2002) noted, for the uniform distribution the σ:AD ratio is 1.2. Thus, in order to calculate the upper limits, or critical values, for the AD index, assuming a uniform distribution, they divided AD (the result of Equation 8) by 1.2. That is, they corrected for the difference between σ and AD introduced in Equation 5. The same adjustment is needed here. The resulting value of Equation 8 must be divided by the σ:AD ratio relevant to a given response distribution. Thus, upper limits for acceptable interrater agreement for ADM must be calculated separately for each null or theoretical response distribution using the following equation:
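As a rough illustration of the correction described above, the sketch below computes the σ:AD ratio for a response distribution given as proportions per scale point and then divides an uncorrected limit by that ratio. The function names are hypothetical, σ is computed as a population (not sample) standard deviation, and the uncorrected limit must be supplied by the user because it is not derived here.

```python
import math


def sigma_to_ad_ratio(proportions):
    """Ratio of the standard deviation to the average deviation for a distribution.

    proportions[i] is the share of raters choosing scale point i + 1.
    """
    if not math.isclose(sum(proportions), 1.0, abs_tol=1e-6):
        raise ValueError("proportions must sum to 1")
    points = range(1, len(proportions) + 1)
    mean = sum(p * x for p, x in zip(proportions, points))
    variance = sum(p * (x - mean) ** 2 for p, x in zip(proportions, points))
    avg_dev = sum(p * abs(x - mean) for p, x in zip(proportions, points))
    if avg_dev == 0:
        raise ValueError("ratio is undefined when all raters give the same rating")
    return math.sqrt(variance) / avg_dev


def corrected_upper_limit(uncorrected_limit, ratio):
    """Divide an uncorrected AD limit by the distribution's sigma:AD ratio."""
    return uncorrected_limit / ratio


uniform_5_point = [0.2] * 5
print(round(sigma_to_ad_ratio(uniform_5_point), 2))  # 1.18
```

For the discrete 5-point uniform distribution, the computed ratio of about 1.18 is close to the value of 1.2 cited from Burke and Dunlap (2002); other distributions yield different ratios, which is why the limits differ by distribution.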
The resulting critical values are listed in Tables 2 and 3 along with the pattern of responses for each of the distributions identified and the relevant statistics. For use as decision heuristics, we have rounded the critical values in Tables 2 and 3 to two decimal places; they can be applied to individual items or to multi-item scales. For the purpose of assessing the level of agreement, the critical values should be used in the conventional way (Burke & Dunlap, 2002): An observed ADM value equal to or less than the relevant critical value indicates an acceptable level of interrater agreement.
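The σ:AD ratio noted above for the uniform distribution can be checked numerically. The following is a minimal sketch, assuming a discrete 5-point scale with equal response proportions and probability-weighted (population) moments; the function name is ours and purely illustrative.

```python
import math

def sigma_to_ad_ratio(proportions):
    """Return sigma / AD for a discrete distribution over options 1..c.

    `proportions` lists the share of responses on each option and is
    assumed to sum to 1. Both moments are probability-weighted.
    """
    options = range(1, len(proportions) + 1)
    mean = sum(x * p for x, p in zip(options, proportions))
    variance = sum(p * (x - mean) ** 2 for x, p in zip(options, proportions))
    avg_dev = sum(p * abs(x - mean) for x, p in zip(options, proportions))
    return math.sqrt(variance) / avg_dev

# Uniform responding on a 5-point scale: sigma = sqrt(2), AD = 1.2.
uniform_5 = [0.2] * 5
ratio = sigma_to_ad_ratio(uniform_5)
print(round(ratio, 2))  # 1.18, which rounds to the 1.2 cited in the text
```

Passing a different set of proportions returns the ratio for that response distribution, which is why the adjustment must be made separately for each distribution.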
Null Ranges for Pattern of Dispersion
In addition to developing critical values, in response to recent advances in multilevel theory, we also developed null ranges to facilitate researchers’ ability to assess how well the shape of an observed distribution fits a theoretically specified distribution. This issue of comparing the pattern of observed dispersion with a theoretical distribution is analogous to judging the goodness of fit between one’s data and the theoretical response distribution. Cortina and Folger (1998) described tests of goodness of fit as a matter of accepting the null hypothesis of no statistically significant difference between observed data and theoretical models. Here, we are dealing with practical significance rather than statistical significance, meaning that goodness of fit in this context is a matter of concluding that there is no meaningful difference between an observed distribution and the theoretically specified response distribution.
The values for ADM shown in Tables 2 and 3 quantify the dispersion of different distributions. Therefore, if an observed value is equal to the relevant tabled ADM value, then the pattern of observed dispersion should fit perfectly with the pattern of theoretical dispersion. It is unlikely, however, that observed and tabled values will perfectly match; thus, the question becomes what is the “null range”? In other words, how far can an observed value be from the tabled value before researchers must conclude that their observed distribution has a poor fit with the theoretical distribution?
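To illustrate how a tabled ADM value quantifies the dispersion of a theoretical distribution, the sketch below computes the expected average deviation directly from a pattern of response proportions. The slight-skew proportions used here (.05, .15, .20, .35, .25 on a 5-point scale) are an assumption on our part, following the pattern commonly attributed to LeBreton and Senter (2008); Table 2 itself is not reproduced in this section.

```python
def expected_ad(proportions):
    """Expected average deviation about the mean for options 1..c.

    `proportions` gives the theoretical share of responses on each
    option and is assumed to sum to 1.
    """
    options = range(1, len(proportions) + 1)
    mean = sum(x * p for x, p in zip(options, proportions))
    return sum(p * abs(x - mean) for x, p in zip(options, proportions))

# Assumed slight-skew pattern on a 5-point scale (not read from Table 2).
slight_skew_5 = [0.05, 0.15, 0.20, 0.35, 0.25]
print(round(expected_ad(slight_skew_5), 2))  # 0.98
```

Under these assumed proportions the result agrees with the value of .98 used for the slightly skewed distribution in the example discussed in the text.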
Analogous to Greenwald’s (1975) discussion of how to accept a null hypothesis gracefully (also see discussions by Cashen & Geiger, 2004; Cortina & Folger, 1998), researchers would need to decide in advance of collecting data what magnitude of effect, in this case, the magnitude of ADM, would be considered nontrivial. We suggest defining this magnitude as the difference between the expected ADM value for a distribution and the respective upper limit for that distribution. While the decision to specify this magnitude is arguably somewhat arbitrary, it is nevertheless made in advance of collecting data and tied to our derivations for assessments of practical agreement. Consistent with Greenwald’s arguments about establishing a null range for the formal test of a null hypothesis, this minimum magnitude of ADM that the researcher is willing to consider nontrivial is then the boundary of the null range. That is, for observed ADM values, this magnitude would be the difference between the tabled ADM value and the relevant tabled critical value.
Although we present null ranges that are symmetrical around the expected ADM value, researchers can readily define w and the width of the null range relative to the purposes of their investigations. In these cases, larger values for w will result in smaller, more conservative null ranges than those reported in Tables 2 and 3 for the respective response distributions. In addition, researchers may desire to consider the construction of nonsymmetrical, one-tailed null ranges for some types of theoretical response distributions. As with the use of critical values, the researcher may desire to consider a priori several theoretical response distributions when making judgments about whether the observed and theoretical response distributions are meaningfully different.
Using the null ranges to gauge the goodness of fit between an observed and a theoretical distribution is straightforward. Using the previous example of a slightly skewed null distribution and a 5-point scale, an observed ADM of .98 would suggest a perfect match between the observed and theoretical distributions (see Table 2). Yet, how can a researcher interpret an observed ADM of .70? The relevant range is .84 to 1.12. Thus, a researcher who observes an ADM of .70 would conclude that the observed distribution is meaningfully different from the theoretical distribution. That is, the observed ADM of .70 falls outside of the null range.
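The worked example above can be sketched in a few lines. This is an illustration under stated assumptions rather than a definitive procedure: the range is taken to be symmetrical around the expected ADM value, with a half-width equal to the difference between the expected value and the critical value, and the critical value of .84 is inferred from the reported lower bound rather than read from Table 2. The function names are ours.

```python
def null_range(expected_ad, critical_value):
    """Symmetric null range around the expected AD value.

    The half-width is the difference between the expected (tabled) AD
    value and the critical value, as described in the text.
    """
    half_width = expected_ad - critical_value
    return expected_ad - half_width, expected_ad + half_width

def fits_theoretical(observed_ad, expected_ad, critical_value):
    """True if the observed AD falls inside the null range."""
    lower, upper = null_range(expected_ad, critical_value)
    return lower <= observed_ad <= upper

# Slight skew, 5-point scale: expected AD = .98; critical value = .84
# (the latter is an inference from the reported range of .84 to 1.12).
lower, upper = null_range(0.98, 0.84)
print(round(lower, 2), round(upper, 2))      # 0.84 1.12
print(fits_theoretical(0.98, 0.98, 0.84))    # True
print(fits_theoretical(0.70, 0.98, 0.84))    # False
```

Because the bounds are compared as floating-point numbers, observed values lying exactly on a boundary are best rounded to two decimal places before the comparison, in keeping with the rounding of the tabled values.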
What researchers would do after determining a lack of fit would depend upon the theoretical context. In some cases it may be that a lack of fit suggests that the phenomena researchers are attempting to study are not represented in the data. This eventuality is analogous to researchers who do multilevel research finding a lack of agreement such that aggregation to a higher level of analysis cannot be justified (e.g., Chan, 1998). Or, it may be the case that shapes of observed distributions are compared to multiple theoretically specified distributions. While .70 does not fall into the null range for slight skew, it does fall into the range for moderate skew. In this case, the researcher would be able to categorize the group as a “moderate skew” group and make theoretically based predictions accordingly. More broadly, researchers can use the null ranges provided in Tables 2 and 3 in order to classify groups according to the pattern of their distributions of scores, and then based on this classification, make theoretically derived predictions about group outcomes. These null ranges and those relevant to the other distributions discussed in the following can be used relative to a single item or a scale.
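Classifying a group against several theoretically specified distributions, as described above, amounts to checking the observed ADM value against each null range in turn. In the hedged sketch below, the slight-skew range is the one reported in the text, whereas the moderate-skew bounds are placeholders chosen only so that the example runs; researchers should substitute the ranges from Tables 2 and 3.

```python
def matching_distributions(observed_ad, null_ranges):
    """Return the names of all distributions whose null range
    contains the observed AD value."""
    return [
        name
        for name, (lower, upper) in null_ranges.items()
        if lower <= observed_ad <= upper
    ]

# 5-point scale. Only the slight-skew range comes from the text.
NULL_RANGES_5PT = {
    "slight skew": (0.84, 1.12),
    "moderate skew": (0.60, 0.80),  # PLACEHOLDER bounds, not from Table 2
}

print(matching_distributions(0.70, NULL_RANGES_5PT))  # ['moderate skew']
print(matching_distributions(0.98, NULL_RANGES_5PT))  # ['slight skew']
```

Returning a list rather than a single label allows for an observed value that fits no distribution (an empty list) or, where null ranges overlap, more than one.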
It is important to note that researchers must visually check the observed distribution of responses. For instance, the direction of skew may be of theoretical relevance. Because the AD index is calculated via absolute values, the direction of skew cannot be determined from AD values. Quantifying agreement as well as visually checking the direction of skew is necessary. This point holds for other distributions discussed as well.
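The point that AD values are blind to the direction of skew can be demonstrated with two hypothetical groups whose ratings are mirror images of one another on a 5-point scale. The sketch below computes the average deviation about the mean for each group and uses the sign of the third central moment as a simple numerical companion to a visual check; the ratings and function names are illustrative assumptions, not data from the study.

```python
def average_deviation(ratings):
    """Average absolute deviation of ratings about their mean."""
    mean = sum(ratings) / len(ratings)
    return sum(abs(x - mean) for x in ratings) / len(ratings)

def skew_direction(ratings, tol=1e-9):
    """Sign of the third central moment: 'negative', 'positive', or 'none'."""
    mean = sum(ratings) / len(ratings)
    third_moment = sum((x - mean) ** 3 for x in ratings) / len(ratings)
    if third_moment > tol:
        return "positive"
    if third_moment < -tol:
        return "negative"
    return "none"

# Hypothetical ratings; group_b mirrors group_a around the scale midpoint.
group_a = [5, 5, 4, 4, 3, 1]
group_b = [6 - x for x in group_a]  # [1, 1, 2, 2, 3, 5]

print(round(average_deviation(group_a), 2), skew_direction(group_a))
print(round(average_deviation(group_b), 2), skew_direction(group_b))
```

Both groups yield the same average deviation, yet the first is negatively skewed and the second positively skewed, which is why the AD value alone cannot settle questions about direction.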
Distribution Choice
In the following, we discuss examples of when these different response distributions might be relevant (see also Table 1). Note that we do not assume that only one distribution is relevant in any given research context; rather, as others have suggested (e.g., James et al., 1984), we think it is reasonable that multiple distributions may be appropriate. We organize our discussion by first considering the issue of level of agreement and then considering the issue of pattern of dispersion. Within these sections, we refer to distributions defined by skew (Table 2) and those defined by kurtosis and variance (Table 3).
Level of Agreement
A number of response biases suggest that the appropriate null distribution is a skewed distribution. James et al. (1984) and LeBreton and Senter (2008) have discussed the likelihood of leniency and social desirability in contexts of assessing interrater agreement. Leniency may apply, for instance, in the performance appraisal domain where subordinates tend to judge their supervisors in relatively positive terms (Schriesheim, 1981). Klein, Conn, Smith, and Sorra (2001) found social desirability to be applicable in a survey of organizational members’ workplace perceptions. Agreement among members was related to the social desirability of the survey items (e.g., “The supervisor to whom I report praises me for excellent performance” and “My work here is enjoyable”; Klein et al., 2001, p. 11). To the extent that these biases are expected to be strong versus weak, and to the extent that multiple biases are expected to be relevant, researchers could utilize moderately to heavily skewed distributions as their null distributions.
Though skewed distributions have most often been suggested as alternatives to the uniform null distribution, other distributions are relevant as well. Likert-type response formats that convey or have different informational value may result in subgroups or small to moderate percentages of respondents using particular response options. For instance, Schwarz, Knauper, Hippler, Noelle-Neumann, and Clark (1991) showed that participants responded differently to the question “How successful would you say you have been in life?” when the 11-point scale ranged from –5 to 5 rather than 0 to 10, even though the anchors were identical (not at all successful to extremely successful). For the former scale, 34% endorsed –5 to 0; for the latter scale, 13% endorsed 0 to 5. Schwarz (1999) argued that the question of success is somewhat ambiguous in that success could be marked by the presence of positive features or the absence of negative features and that participants use the scale numbers as well as the anchors to interpret the items. In addition, Lindell and Brandt (1997) discussed the possibility of distinct factions among raters due to characteristics such as clinical orientations in assessments of psychotherapy, raters’ academic disciplines in rating research proposals, raters’ functional department in ratings of organizational climate, and so on. The possibility of such factions might call for the use of a bimodal response distribution as the baseline distribution for assessing level of agreement among a set of raters. We present four different subgroup distributions and two different bimodal distributions (see Table 3). In addition, triangular-shaped or bell-shaped distributions are applicable null distributions if one expects raters to succumb to the central tendency bias (e.g., James et al., 1984; LeBreton & Senter, 2008). For instance, James et al. (1984, p. 91) suggested that the central tendency bias may occur “when judges are purposefully cautious or evasive because responses to items are not collected on a confidential basis, and political reasons exist for not departing from the neutral alternatives on the scales.” They also suggested that naïve and unmotivated participants may exhibit the central tendency bias when responding to ambiguous or complicated items.
Finally, while the uniform distribution has been described as an often inappropriate null distribution, there are circumstances under which it is the appropriate null distribution. It is applicable if no rater bias is expected. It may also be applicable if raters face conceptual ambiguity. For instance, Heidemeier and Moser (2009) found that raters demonstrated less agreement in job performance ratings when the work being evaluated was less straightforward; that is, there was less agreement regarding white-collar work and work high in job complexity compared to agreement regarding blue-collar work and work low in job complexity.
Pattern of Dispersion
There are also theoretical bases for modeling agreement on most of these distributions. DeRue et al. (2010) provided an example theoretical basis for choosing a bimodal distribution as a theoretical response distribution: Equally sized subgroups within teams that judge team efficacy differently will have mixed effects on team effectiveness by impairing social processes, but enhancing task processes. They went on to propose that the greater the divergence between the subgroups, the more negative the effect on team effectiveness will be. In discussing maximum separation diversity, such as diversity in team members’ judgments of team efficacy, Harrison and Klein (2007) discussed an extreme bimodal distribution, where subgroups exist on opposite ends of a continuum. Consistent with DeRue et al., they argued that this extreme bimodal distribution would have negative outcomes: reduced cohesiveness, interpersonal conflict, distrust, and decreased task performance.
Related to bimodal distributions are unimodal distributions that have distinct subgroups. From a theoretical perspective, DeRue et al. (2010) discussed “minority belief” dispersion where one team member rates team efficacy differently than the other team members. We previously reported their proposition that when minority belief dispersion is characterized by one individual rating team efficacy lower than everyone else, the effect on team effectiveness will be negative. DeRue et al. also theorized about the opposite distribution: One individual rates team efficacy more highly than the other team members. They proposed that this pattern of dispersion, which is the mirror-image of the first scenario, will have mixed effects on team effectiveness because the dispersion is likely to impair social processes, but enhance task processes.
Finally, DeRue et al. (2010) provided an example of when the uniform distribution would be theoretically specified as the expected response distribution: Fragmentation, characterized by a uniform distribution of team efficacy beliefs, should augment team effectiveness by positively impacting social and task processes. Their argument is based on the idea that fragmented teams may communicate more effectively than other teams because they do not have subgroups, coalitions, and factions that can hinder effective communication in teams and they are motivated to create a shared understanding of team efficacy. These teams are likely to openly discuss issues like goals and expectations that can help in teams’ task-related processes, as well as helping to establish a shared belief about team efficacy. Harrison and Klein (2007) proposed similarly positive effects for variety diversity, such as diversity in educational background, when it is at a maximum level, which is characterized by a uniform distribution: more creativity, greater innovation, higher decision quality, more task conflict, and increased unit flexibility.
Conclusion
Given numerous calls for researchers who use interrater agreement indices to stop their unconditional use of the uniform response distribution, a primary purpose of our study was to provide researchers with guidelines for using alternative null distributions and theoretical distributions to make judgments about practical significance, when addressing both methodological and theoretical issues. In doing so, we derived critical values for a variety of response distributions that vary in terms of skew, kurtosis, and variance. We also discussed how to use the critical values shown in Tables 2 and 3 differently depending on whether one seeks to ascertain the level of agreement or the pattern of dispersion. While the question of the level of agreement is familiar, the question of the pattern of dispersion is more novel, but likely to become more and more important with advances in multilevel theory and research. The current paper stands to promote such conceptual advances.
Although we focused the substantive discussion of interrater agreement problems on data aggregation in relation to the level of agreement and team efficacy dispersion and diversity in relation to the pattern of dispersion, the derived critical values and null ranges can be applied to numerous other research questions. For instance, the alternative null distributions and critical values could assist in addressing interrater agreement questions related to job analysts’ ratings of task items for a job, or judges’ ratings of critical or cut-off scores on the items of a test (e.g., using the Angoff method whereby cut-off scores are based on subject matter experts’ estimates of the probability that a competent person will respond to an item accurately; e.g., see Hudson & Campion, 1994) just to name a few types of pertinent research questions. As another example, the notion of a theoretical distribution could be used to specify the demographic makeup of a community (e.g., racial/ethnic composition in terms of percentages within each category), thereby permitting the quantification of demographic similarity/dissimilarity between employees and residents (i.e., the difference between an observed ADM value and the relevant tabled value for a theoretical distribution). Quantifying the effects of community demographic similarity in this manner may meaningfully extend the measurement and study of employee-community racial/ethnic similarity from the individual level of analysis (e.g., see Avery, McKay, & Wilson, 2008; Brief et al., 2005) to the organization or business unit levels of analysis. Importantly, irrespective of the group phenomena under study, the AD index itself and the derived null ranges provide a means for tracking or studying expected changes in group phenomena possibly relative to stages of group development or shocks that the group might encounter. Further, practical significance critical values could be similarly developed for other interrater agreement indices.
Future research should also address the problem of assessing the statistical significance of AD values relative to a variety of null or theoretical distributions. As discussed earlier, the work to date in this area is limited. Burke and Dunlap (2002) and Dunlap et al. (2003) used an approximate randomization test to establish statistical significance cut-offs for AD for judges’ ratings of a single item relative to the uniform distribution. Cohen et al. (2009) built upon this work to establish statistical significance cut-offs for AD for judges’ agreement on multi-item scales relative to the uniform distribution and a slightly skewed distribution. In order to assist researchers in inferring whether levels of agreement (i.e., AD values) are due to chance, cut-off values for statistical significance should be established relative to more distributions, such as those identified in Tables 2 and 3. Without additional guidelines, researchers are likely to continue to over-rely on the uniform distribution when making inferences about their data.
In closing, we emphasize that the practical guidelines presented herein are just that: guidelines. As others have advised, it is important that researchers take a common sense approach to interpreting observed agreement. Speaking in terms of whether interrater agreement is sufficient to justify the aggregation of individual-level data, LeBreton and Senter (2008, p. 836) asserted that “the value used to justify aggregation ultimately should be based on a researcher’s consideration of (a) the quality of the measures, (b) the seriousness of the consequences resulting from the use of aggregate scores, and (c) the particular composition model to be tested.” James et al. (1984), in addressing the problem of uncertainty over which null distribution applies in a given situation, suggested interpreting observed agreement on the basis of several null distributions: “The rationale here is that even though we cannot pinpoint a particular null with a high degree of confidence, we can place bounds on the most likely types of nulls and thereby increase the likelihood that the true null lies somewhere in this range of distributions” (p. 95).
Similarly, we urge researchers to consider their particular circumstances when assessing interrater agreement and to consider the use of a range of critical values based on several different null or theoretical distributions. For instance, researchers should consider whether they have missing data (e.g., see Newman & Sin, 2009). Our guidelines do not account for systematically missing data and thus may be sensitive to this problem, particularly in the cases of certain distributions, such as the bimodal distribution, which may appear as a unimodal distribution if data are systematically missing from one of the two subgroups. In other cases, researchers may need to apply a null or theoretical distribution not included in Tables 2 and 3, or they may need to adjust the starting value for interrater agreement of 80% agreement used in the present derivations. Moving away from 80% agreement or considering another transformation of percentage agreement for derivational purposes, such as a probit transformation of a proportion, will result in more stringent or more lenient critical values and decisions concerning interrater agreement depending on whether one adjusts this value upward or downward, or whether one employs a more versus less conservative transformation of the proportion, such as arcsine versus probit transformations. Recognizing the possibility that the research context may dictate the consideration of other assumptions or response distributions than those used in the study, we present in the appendix a general procedure for researchers to use in establishing critical values and null ranges based on other assumptions not considered here. In this regard, our proposed guidelines offer a uniform and parsimonious means for studying interrater agreement given a variety of methodological and theoretical problems.
Acknowledgments
We would like to thank Greg Oldham and Isaac Smith for helpful comments on previous drafts of the article and Julie Seidel and Teng Zhang for research assistance.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
