When measuring group-level psychological properties (e.g., organizational climate, leadership, team motivation), researchers typically aggregate individual perceptions to represent the group. L. R. James provided the groundbreaking insight that, in order to justify aggregating individual perceptions to represent a group-level property, one must first establish that there exist shared perceptions (or shared psychological meaning) within the group. Here we label and describe two distinct theoretical parameters that can both be used to define within-group agreement: (a) a parameter that defines within-group agreement as Individual True-Score Consensus, which arises from the theoretical work of L. R. James and colleagues in the 1970s, and (b) a parameter that treats within-group agreement as a Group True-Score Reliability Analog, which forms the theoretical basis for the rWG(J) index. We extend the work of L. R. James by offering a systematic comparison of different estimators of the two within-group agreement parameters. Recommendations are provided for estimating within-group agreement, to continue the legacy of justified measurement of group-level psychological properties.
Over the past few decades, organizational researchers have made remarkable progress developing theories and methods for studying the multilevel or meso nature of organizational phenomena (Hitt, Beamish, Jackson, & Mathieu, 2007; Klein & Kozlowski, 2000; Roberts, Hulin, & Rousseau, 1978; Rousseau, 1985). One critical issue when conducting multilevel research is the way in which higher level (group-level) variables are operationalized. Perhaps the most common method is to aggregate data from lower level measures (e.g., individual perceptions) to form higher level variables (e.g., team motivation or organizational climate; for discussion, see Bliese, 2000; Chan, 1998; James & Jones, 1974; Ostroff, 1993). In this regard, the notion of within-group agreement as proposed by L. R. James and colleagues (James, 1982; James & Jones, 1974) has been an extremely useful concept to justify aggregating lower level measures to represent group-level properties.
Specifically, high within-group agreement suggests that individual members hold shared perceptions, permitting a combination (typically an average) of their responses to represent the higher level concept. According to Kozlowski and Klein’s (2000) typology, such operationalizations reflect a convergent form of group construct emergence, consistent with an isomorphic composition process. Similarly, Chan’s (1998) typology articulates two such composition models: direct consensus and referent-shift consensus. L. R. James and colleagues’ work served as a major catalyst for getting people to think about issues of measurement, aggregation, and the meaning of group-level constructs; and has thus been central to the inception of the multilevel age of organizational research.
The current review is organized into four sections, each of which fundamentally reflects the contributions of L. R. James to the study of multilevel organizational behavior. First, we explain the origins of the theoretical concept of within-group agreement, in particular as the concept emerged from debates over the phenomenon of organizational climate. Second, we formally codify two theoretical parameters of within-group agreement (which we label Individual True-Score Consensus and Group True-Score Reliability Analog). The first theoretical parameter, Individual True-Score Consensus, represents the notion of shared psychological meaning that forms the foundation of group-level psychological constructs, as described verbally in the work of L. R. James and colleagues prior to the introduction of the rWG(J) index in 1984. The second theoretical parameter, the Group True-Score Reliability Analog, emerges from the derivation of the rWG(J) index (James, Demaree, & Wolf, 1984, 1993; LeBreton, James, & Lindell, 2005), and treats within-group agreement as a reliability analog that emerges from a measurement model that includes a group-level true score but no individual-level true score. Third, we review several attempts by organizational researchers to derive empirical estimators of within-group agreement, the most dominant of which is James et al.’s (1984) rWG(J) index. Fourth and finally, we compare different estimators and provide recommendations for how to best estimate the two theoretical parameters of within-group agreement.
Origins of the Theoretical Concept of Within-Group Agreement
The theoretical notion of within-group agreement was born from discussions in the 1970s and 1980s over how best to conceptualize and measure organizational climate. The concept of organizational climate rose to popularity as a consequence of early attempts to specify the psychological situation faced by members of organizations (e.g., Campbell, Dunnette, Lawler, & Weick, 1970; Forehand & Gilmer, 1964), and coincident with the emerging zeitgeist of person-situation interactionism (Bowers, 1973; Sells, 1963). Following some discussion over whether climate resides at the individual versus organizational level of analysis (Guion, 1973; James & Jones, 1974) and whether climate is a subjective/perceptual versus objective feature of organizations (Glick, 1985; James, Joyce, & Slocum, 1988), modern definitions have come to treat organizational climate as a collective psychological concept, owing to individuals’ shared perceptions of organizational practices, policies, and procedures (Reichers & Schneider, 1990; see also Ostroff, Kinicki, & Tamkins, 2003). Schneider (2000) specifies, “When there is sharedness of the psychological life of organizations [italics added], that psychological life is a property of the organization” (p. xvii). Thus, climate is conceptualized to reside at the organizational level, although it can be seen only through its individual-level manifestations (i.e., individual climate perceptions).
Organizational climate is an emergent construct in that “it originates in the cognition…and other characteristics of individuals, is amplified by their interactions, and manifests as a higher-level, collective phenomenon” (Kozlowski & Klein, 2000, p. 55; Katz & Kahn, 1966). Individual employees’ experiences of an organization’s internal environment or climate come to be shared through mechanisms of interpersonal interaction (Giddens, 1993; Kenny, 1991; Roberson, 2006), negotiated responses to events (Morgeson & Hofmann, 1999), leadership (Kozlowski & Doherty, 1989), and personality homogeneity (Schneider & Reichers, 1983). Climate can be codified in the mutual understandings of what behavior is rewarded, supported, and expected (Schneider, 1990), as well as in habitual routines of the group (Gersick & Hackman, 1990). The important conceptual point is that organizational climate is a property of the group itself, although it both emerges from, as well as gives rise to, individual perceptions of the organization that are held in common by the perceivers.
The conceptual role of individual differences in climate research is worthy of further discussion. A watershed moment for climate research came when James and Jones (1974) introduced the concepts “psychological climate” and “organizational climate” to distinguish climate perceptions as reflective of individual-level versus organization-level attributes. This cogent call to acknowledge the two levels of analysis in climate specification helped to focus more attention on the development of composition models, designed expressly to propose bottom-up linkages from the individual to the organizational level (Roberts et al., 1978; Rousseau, 1985). The dominant composition model for climate is the convergent composition model (Kozlowski & Klein, 2000; also known as the consensus model; Chan, 1998). In this composition model, organizational climate is conceptualized as functionally isomorphic to individual psychological climate (and can be represented as the average of individuals’ climate scores) under the precondition that there exists sufficient agreement among group members to suggest shared assignment of psychological meaning.
An early expression of this key sentiment—that organizational climate only exists when there are shared perceptions of it—can be found in James and Jones’s (1974) statement:
Returning to the perceptual definition of organizational climate, it would seem that the reliance on perceptual measurement may be interpreted as meaning that organizational climate includes not only descriptions of situational characteristics, but also individual differences in perceptions and attitudes. This is somewhat confusing if one wishes to employ organizational climate as an organizational attribute or main effect, since the use of perceptual measurement introduces variance which is a function of differences between individuals and is not necessarily descriptive of organizations or situations. Therefore, the accuracy and/or consensus of perception must be verified if accumulated perceptual organizational climate measures are used to describe organizational attributes (Guion, 1973). (p. 1103, italics added)
In a reiteration of this position, L. R. James and colleagues noted that, because organizational climate is of necessity perceived by individuals (James et al., 1988), it is subject to individual differences in perception and concept formation. Jones and James (1979) summarize,
The [conceptual] argument for aggregating perceptually based climate scores (i.e., psychological climate scores) appears to rest heavily on three basic assumptions: first, that psychological climate scores describe perceived situations; second, that individuals exposed to the same set of situational conditions will describe these conditions in similar ways; and third, that aggregation will emphasize perceptual similarities and minimize individual differences. Based on this logic, it is generally presumed that empirically demonstrated agreement among different perceivers implies that these perceivers have experienced common situational conditions (Guion, 1973; Insel & Moos, 1974; James & Jones, 1974; Schneider, 1975a). (p. 206, italics added)
Of particular interest here are the theoretical assumptions of the climate composition model, vis-à-vis the role of individual differences in traits and perceptions. Namely, the model assumes that climate is a property of the organization (i.e., the climate true score is at the group level, not at the individual level; James et al., 1984), and it also implies that individual differences which might influence climate perception (either directly or through interaction with the situation) are symmetrically distributed, with a mean of zero (i.e., they will thus tend to be removed through aggregation across individuals; Jones & James, 1979, see quote above). Both these assumptions derive from the theoretical climate composition model, and are appropriate when a group-level construct is defined according to consensus (Chan, 1998).
Further, L. R. James and colleagues repeatedly emphasized that within-group agreement is not just one from a list of many equivalent criteria that could be used to justify aggregating individual perceptions to represent a group-level construct. Rather, within-group agreement was the criterion for justifying aggregation. George and James (1993) made this position clear:
The key statistical test of the appropriateness of aggregation to the group level of analysis is that there is within-group agreement on the variable in question. If there is agreement within groups on the theorized group-level variable, then the aggregate may be used in subsequent analyses.
…Agreement within a group is not conditional on between-groups differences. For example, in a scenario that Yammarino and Markham portray, in which all members in each group have the same moderately high score, both agreement and aggregation may be justified provided that aggregation to the group level was theoretically based. However, there would be no group effect inasmuch as the group means do not vary under these conditions. (p. 799, emphasis added)
According to this position, intraclass correlations (ICC) and comparisons of within-group variance versus between-group variance are not the key tests needed to justify aggregation; within-group agreement is the paramount parameter.
To summarize, the section above illuminates three main assertions:
These constructs are reflected in individual-level, psychological perceptions.
Individual perceptions can be aggregated to represent a group-level psychological property, if the individual perceptions are shared [i.e., “shared psychological meaning”].1
Within-Group Agreement: The Two Theoretical Parameters
When employing a consensus model to compose individual-level perceptions into a group-level construct, demonstration of agreement or consensus among individuals can help to establish the construct validity of the group-level construct scores (James, 1982; Kozlowski & Hattrup, 1992). In this model, the group-level construct (e.g., organizational climate) is a group-level property (has a group-level true score; James et al., 1984), and individual-level perceptual measures of it vary within groups for reasons unrelated to the latent group construct itself. Because individuals are rating only a single, group-level target, within-group variance is framed as random error, and an agreement index reflects interchangeability of judges’ ratings (James et al., 1993). That is, under a consensus model (where there exists a group-level true score), group agreement is defined as the absence of random (or chance) disagreement.
Formally, the single-item index of within-group agreement (James et al., 1984) can be written as

$$r_{WG} = 1 - \frac{S^2_{x_j}}{\sigma^2_E} \qquad (1)$$

where
$\sigma^2_E$ is the within-group variance between individuals’ perceptions that would be expected due to chance responding, and is determined theoretically (not empirically), according to one’s conceptual definition of zero agreement. It can be thought of as theoretical chance disagreement.2
$S^2_{x_j}$ is the empirical within-group variance of an item (j). It can be thought of as observed disagreement.
The major advantage of this index of within-group agreement, in comparison to simply inspecting the variance or the standard deviation within groups, is that the expression in Equation 1 is unitless. Its scaling positions observed disagreement as a proportion of theoretical chance disagreement (i.e., $S^2_{x_j}/\sigma^2_E$).
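The single-item logic described above (observed disagreement scaled by theoretical chance disagreement) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the ratings are hypothetical, and the chance variance is assumed to come from a discrete uniform null distribution, which is only one possible definition of zero agreement.

```python
from statistics import variance


def uniform_null_variance(num_options: int) -> float:
    """Variance of a discrete uniform distribution over num_options scale points.

    ASSUMPTION: zero agreement is defined as uniform (rectangular) responding.
    """
    return (num_options ** 2 - 1) / 12.0


def rwg_single_item(item_ratings, expected_variance):
    """One minus the ratio of observed to theoretical chance disagreement."""
    observed_variance = variance(item_ratings)  # sample variance (n - 1 denominator)
    return 1.0 - observed_variance / expected_variance


# Hypothetical ratings of one item by five members of one group (5-point scale).
item_ratings = [4, 4, 5, 4, 3]
sigma2_e = uniform_null_variance(5)
agreement = rwg_single_item(item_ratings, sigma2_e)
print(sigma2_e, agreement)
```

Because the result is a ratio of two variances on the same response scale, it carries no units, which is the property the text emphasizes.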
At this point, however, we would like to introduce a distinction between the theoretical parameter of within-group agreement, versus the empirical estimators of within-group agreement. Such a distinction is essential if we are to compare the different indices/estimators of within-group agreement. That is, to compare different estimators, we must first specify what is actually being estimated. To this end, we present the following two parameters: Individual True-Score Consensus and the Group True-Score Reliability Analog.
Agreement: Individual True-Score Consensus
We begin by defining the following theoretical parameter:

$$\text{Individual True-Score Consensus} = 1 - \frac{\sigma^2_{\text{true}}}{\sigma^2_E} \qquad (2)$$

where
$\sigma^2_E$ is still the within-group variance between individuals’ perceptions that would be expected due to chance responding, and is determined theoretically (not empirically), according to one’s conceptual definition of zero agreement. It can be thought of as theoretical chance disagreement.
$\sigma^2_{\text{true}}$ is the actual within-group variance between individuals’ perceptions; that is, the within-group variance between individuals’ latent true scores (e.g., psychological climate perceptions). It can be thought of as actual disagreement, in terms of latent individual perceptions.
The central feature of the Individual True-Score Consensus parameter is that it is a theoretical parameter. Notice how each term in Equation 2 is a theoretical entity, as opposed to an empirical estimate. As the parameter approaches 1.0, actual within-group disagreement (the within-group variance in individuals’ latent true scores) is approaching zero, a condition that may be thought of as theoretically complete agreement. As the parameter approaches 0.0, actual disagreement is approaching theoretically complete disagreement ($\sigma^2_E$), or zero agreement. Once the theoretical parameter has been defined, only then can we begin to trace how different estimators of within-group agreement estimate that parameter under various empirical conditions (e.g., sample size, measurement error, number of items).
Importantly, the numerator of Equation 2 includes the within-group variance in individual-level latent (true score) perceptions of the group-level construct (e.g., individual-level true-score variance in psychological climate perceptions). In our interpretation, the relevance of this theoretical variance term is directly implied by the early writings of L. R. James on organizational climate (see James, 1982; James et al., 1988; James & Jones, 1974; as well as the more recent summary by James et al., 2008). That is, this variance term (and thus the Individual True-Score Consensus parameter) captures the theoretical notion of shared psychological meaning between individuals within a group, and as such can be thought of as a key parameter for justifying the existence of a group-level construct.
As we explain in a later section, however, the individual-level true-score variance term is missing from the derivations of the estimators of within-group agreement now widely in use, including the family of indices that followed from the popular multi-item rWG(J) index (James et al., 1984, 1993). To further explain, we will eventually review the derivations of some alternative estimators of within-group agreement. First, we describe another (and more dominant) theoretical parameter that has been used to define within-group agreement: the Group True-Score Reliability Analog.
Agreement: Group True-Score Reliability Analog
Beyond conceptualizing within-group agreement and emphasizing its critical role in the measurement of group-level psychological properties, a second seminal contribution by L. R. James was the development of an agreement index that is regularly used to justify aggregation in multilevel research (and has been cited over 4,000 times). The dominant index of within-group agreement is James et al.’s (1984) multi-item extension of the single-item rWG index, which is labeled rWG(J) [note the subscript (J), which refers to the number of items on a multi-item scale]. James et al. (1984) originally defined rWG(J) as an index of reliability, and as such they derived rWG(J) by applying the Spearman-Brown correction to the single-item index (as was done by Finn, 1970). The resulting formula that extends the single-item index to the multi-item case is:

$$r_{WG(J)} = \frac{J\left(1 - \bar{S}^2_{x_j}/\sigma^2_E\right)}{J\left(1 - \bar{S}^2_{x_j}/\sigma^2_E\right) + \bar{S}^2_{x_j}/\sigma^2_E} \qquad \text{(3a)}$$
where J is the number of items on the multi-item scale, and $\bar{S}^2_{x_j}$ is the average itemwise variance, averaged over the J items. Some controversy ensued over whether rWG(J) could rightly be considered a reliability index (Kozlowski & Hattrup, 1992; Schmidt & Hunter, 1989).3 This controversy led to (a) an extended explanation of the status of rWG(J) as a reliability versus agreement index (see James et al., 1993, which we elaborate in the following section), (b) the introduction of an alternative index of within-group agreement called rWG(J)* (Lindell, Brandt, & Whitney, 1999) that did not derive from the Spearman-Brown reliability formula, and (c) a rederivation of rWG(J) (LeBreton et al., 2005) that did not explicitly involve a Spearman-Brown correction, but which resulted in the same expression for rWG(J) shown in Equation 3a.
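As a rough illustration of the multi-item index, the sketch below computes it from a small ratings matrix: take the variance of each item within the group, average those variances over the J items, divide by the theoretical chance variance, and apply the Spearman-Brown-style expression. The function name, the data, and the choice of 2.0 as the chance variance (a uniform null on a 5-point scale) are assumptions made for the example.

```python
from statistics import mean, variance


def rwg_j(ratings, expected_variance):
    """Multi-item within-group agreement index for one group.

    ratings: list of rows, one row per group member, one column per item.
    expected_variance: theoretical chance variance (supplied by the analyst).
    """
    num_items = len(ratings[0])
    item_columns = list(zip(*ratings))  # transpose: one tuple per item
    mean_item_variance = mean(variance(col) for col in item_columns)
    ratio = mean_item_variance / expected_variance
    numerator = num_items * (1.0 - ratio)
    return numerator / (numerator + ratio)


# Hypothetical group: four members rating three items on a 5-point scale.
group = [
    [4, 4, 5],
    [4, 5, 4],
    [5, 4, 4],
    [4, 4, 4],
]
result = rwg_j(group, expected_variance=2.0)
print(round(result, 3))
```

With a single item the expression collapses to one minus the variance ratio, so the single-item index is the J = 1 special case.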
What is the theoretical parameter that rWG(J) was designed to estimate? Although this has not been explicitly discussed as such, the theoretical parameter that rWG(J) is intended to estimate derives from a psychometric model that includes a group-level true score but no individual-level true score, and that treats within-group agreement as an analog of reliability. As explained in Appendix C, the rWG(J) index was designed to assess a theoretical parameter that we here label the Group True-Score Reliability Analog:
where J is the number of items on the multi-item scale, and the variance term is a theoretical term representing a combination of both itemwise variance and personwise variance around the group-level true score (see Appendix C). The important thing to note at this point is that the Group True-Score Reliability Analog (Equation 3b) and Individual True-Score Consensus (Equation 2) are not the same; they represent two distinct definitions of what within-group agreement is. To summarize:
The Individual True-Score Consensus parameter defines within-group agreement as the absence of person-level variance in true perceptions of the group, as a proportion of theoretically complete disagreement ($\sigma^2_E$) (i.e., individual true-score consensus = shared psychological meaning).
In contrast, the Group True-Score Reliability Analog parameter defines within-group agreement as the proportional reduction in error variance (cf. reliability) for a composite of multiple items, where error variance is defined as a combination of person-level variance and item-specific variance (Appendix C), total variance is defined as theoretically complete disagreement ($\sigma^2_E$), the analog to true score variance is defined as the difference between the two (total variance minus error variance), and Cronbach’s alpha is constrained to zero (within groups) [this definition of within-group agreement was explained clearly by LeBreton et al. (2005); note that the constraints implicit in the definition of this parameter emerge naturally from a measurement model that contains a group-level true score, but has no individual level true score; see Appendices B and C].
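The verbal definition of the reliability-analog parameter can be turned into a short worked derivation. The sketch below is an illustration under assumed notation, not a reproduction of the article's Equation 3b: $\sigma^2_{\text{err}}$ is a placeholder symbol for the combined person-level and item-specific error variance, and $\rho_1$ is a placeholder for the single-item reliability analog. Treating $\sigma^2_E$ as total variance and $\sigma^2_E - \sigma^2_{\text{err}}$ as the true-score analog, and then stepping up to J items with the Spearman-Brown formula, gives an expression with the same form as Equation 3a.

```latex
\begin{align*}
\rho_1 &= \frac{\sigma^2_E - \sigma^2_{\text{err}}}{\sigma^2_E}
        = 1 - \frac{\sigma^2_{\text{err}}}{\sigma^2_E}
        && \text{(single-item analog of reliability)} \\[4pt]
\frac{J\,\rho_1}{1 + (J-1)\,\rho_1}
       &= \frac{J\left(1 - \sigma^2_{\text{err}}/\sigma^2_E\right)}
               {1 + (J-1)\left(1 - \sigma^2_{\text{err}}/\sigma^2_E\right)}
        && \text{(Spearman-Brown step-up to $J$ items)} \\[4pt]
       &= \frac{J\left(1 - \sigma^2_{\text{err}}/\sigma^2_E\right)}
               {J\left(1 - \sigma^2_{\text{err}}/\sigma^2_E\right) + \sigma^2_{\text{err}}/\sigma^2_E}
        && \text{since } 1 + (J-1)(1-e) = J(1-e) + e .
\end{align*}
```

The last line makes the dependence on J explicit: for any fixed ratio below 1, the expression rises toward 1.0 as items are added.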
In the section above we made three main assertions:
Within-group agreement, as conceptualized by L. R. James and colleagues in theoretical work prior to 1984, can be represented by the Individual True-Score Consensus parameter, which is defined as lack of variance in individual perceptions of the group-level property (or shared psychological meaning, see Equation 2).
Within-group agreement, as conceptualized by L. R. James and colleagues in theoretical work in 1984 and thereafter, can be represented by a second theoretical parameter, which is defined as a group true-score reliability analog (see Equation 3b, and Appendix C).
Several within-group agreement estimators have been developed (e.g., rWG(J), rWG(J)*), which are typically used for the purpose of estimating within-group agreement (under either of the two parameter definitions).
Estimators of Within-Group Agreement: rWG(J), rWG(J)*, ADM(J), rWG(α), and (J)
Before we move on, we must be very clear about one point. We are not advocating that one of the above parameters is “right” and the other is “wrong.” As noted by a helpful reviewer, such a comparison would be like saying that Cronbach’s α based on items is right and the coefficient of stability is wrong. Cronbach (1947) himself made a similar observation, when—after describing four distinct definitions of “reliability”—coefficient of stability, coefficient of equivalence, coefficient of stability and equivalence, and hypothetical self-correlation—he stated:
The important thing is to recognize that in the past all four of these and many approximations to them have been called “the reliability coefficient.” No one of these is the “right” coefficient. They measure different things, and each is useful. What is important is to avoid confusing one with another, and using one as an estimate of another. (Cronbach, 1947, p. 6)
To restate, each index estimates a different definition of reliability, and as such each is useful for its own purpose. Every estimator must necessarily be estimating some parameter [e.g., the retest correlation is best suited to estimate the coefficient of stability; Cronbach’s α (1951) is best suited to estimate the coefficient of equivalence; and Schmidt, Le, and Illies’s (2003) CES is best suited to estimate the coefficient of stability and equivalence]. Further, if we cannot specify which parameter is being estimated, then we have no idea what the purpose of a particular estimator is. So it is important to be clear about which parameter an index was designed to assess.
For example, if you use Cronbach’s α to estimate the coefficient of stability parameter, there are certain conditions under which your estimate will be inaccurate. This is not a criticism of Cronbach’s α; Cronbach’s α simply was not designed to estimate measurement stability over time/occasions (Cronbach, 1951; Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Feldt & Brennan, 1989; Schmidt et al., 2003). Likewise, we are not currently criticizing rWG(J); rWG(J) was simply not designed to estimate individual true-score consensus. Rather, rWG(J) was designed to estimate the group true-score reliability analog.
The rWG(J) Index
As an index of the Group True-Score Reliability Analog parameter of within-group agreement (i.e., the notion of within-group agreement that is based on a measurement model that includes a group-level true score but no individual-level true score; Appendix C), the rWG(J) index is excellent. That is, rWG(J) is an essentially unbiased estimator of that parameter.
In contrast, the use of rWG(J) as an index of the Individual True-Score Consensus parameter of within-group agreement (i.e., shared psychological meaning) suffers two primary complications. First, rWG(J) depends heavily on the number of items (J) used to measure a construct (e.g., the number of items on a multi-item climate scale). As such, large rWG(J) values can always be obtained by simply adding more items to the scale, regardless of the amount of true within-group agreement between individuals. Second, rWG(J) relies upon the mean of item variances ($\bar{S}^2_{x_j}$), rather than the variance of scale composite scores ($S^2_X$). This results in an index that focuses on item-specific variance, in addition to within-group variance in latent individual perceptions (e.g., in addition to variance in individual-level psychological climate true scores). Both of these features of rWG(J) can affect its ability to estimate within-group agreement.
Issue 1: The role of number of items (J) in rWG(J)
The rWG(J) index was originally derived as a reliability index, and despite suggestions that rWG(J) should be used as an index of within-group agreement instead of reliability (e.g., Kozlowski & Hattrup, 1992), the formula for rWG(J) has never changed (Finn, 1970; James et al., 1984, 1993; LeBreton et al., 2005; see Equation 3a above). As for whether rWG(J) should be considered an index of reliability, James et al. (1993) made very clear their position that rWG was both an agreement index and a reliability index:
Kozlowski and Hattrup are also correct in stating that our intention was to suggest a measure of agreement, and not consistency [reliability], and that rWG is an estimator of agreement. However, what cannot be done, at least not the way things are presently set up, is to follow Kozlowski and Hattrup’s recommendation to sever all ties between interrater reliability and rWG and to treat rWG as strictly a measure of agreement with, in effect, no ties to classic measurement theory. It is not possible to follow this recommendation because rWG is currently derived in terms of classic measurement theory as an interchangeability (agreement) index of interrater reliability. (p. 306, emphasis added)
That is, James et al. (1993) clearly understood that rWG was an index of reliability, as well as an index of agreement.
Acknowledgment that rWG(J) is an index of reliability (in addition to being an index of agreement) leads to an important insight about the role of scale length in rWG(J) estimates. A quick inspection of Equation 3a makes clear that, as the number of items (J) on the multi-item scale increases, rWG(J) will converge to 1.0. That is, similar to other reliability indices, rWG(J) increases with scale length. As a concrete example of this, we offer Figure 1. Figure 1 displays rWG(J) estimates across a range of scale lengths (i.e., the number of items on the multi-item climate scale, J, ranges from 1 to 23 items),4 and across a range of within-group variance values (i.e., mean of itemwise variances, $\bar{S}^2_{x_j}$, ranges from 0.2 to 1.8), with theoretical chance disagreement $\sigma^2_E$ = 2.0. Figure 1 shows that rWG(J) increases as the number of items (J) increases, and conveys the point that rWG(J) can almost always exceed .70 whenever there are enough items on the scale (regardless of the level of within-group agreement among individual perceptions). This is perhaps best illustrated by the condition in Figure 1 where $\bar{S}^2_{x_j}$ = 1.8 (i.e., where the within-group variance term is very close to the theoretical definition of zero agreement, $\sigma^2_E$ = 2.0), but there are 23 items on the scale. In such cases, although within-group agreement is very close to zero, rWG(J) = .72.
Figure 1. The rWG(J) index increases with scale length. Note: Even when the mean of item variances nears zero agreement (i.e., when $\bar{S}^2_{x_j}$ = 1.8, which is very close to $\sigma^2_E$ = 2.0), we can still get rWG(J) = .72 by using 23 items on the scale.
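The scale-length effect can be reproduced numerically. The sketch below evaluates the multi-item expression over a few scale lengths and mean item variances, holding the theoretical chance variance at 2.0 as in the text. The function name and the particular grid of values are choices made for this example only.

```python
def rwg_j_from_variance(mean_item_variance, expected_variance, num_items):
    """Multi-item agreement index computed directly from summary variances."""
    ratio = mean_item_variance / expected_variance
    numerator = num_items * (1.0 - ratio)
    return numerator / (numerator + ratio)


SIGMA2_E = 2.0                    # theoretical chance variance used in the text
scale_lengths = [1, 5, 10, 23]    # illustrative subset of the 1-to-23 range
mean_variances = (0.2, 1.0, 1.8)  # illustrative subset of the 0.2-to-1.8 range

table = {
    v: [rwg_j_from_variance(v, SIGMA2_E, j) for j in scale_lengths]
    for v in mean_variances
}

for v, row in table.items():
    print(f"mean item variance {v}: " + ", ".join(f"{x:.2f}" for x in row))
```

In every row the index rises as items are added, even though the amount of within-group variance in that row never changes. The final entry of the last row corresponds to the condition highlighted in the figure note (mean item variance 1.8, 23 items).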
What does all this mean? It means that, as an estimator of within-group agreement defined as consensus among individuals (Individual True-Score Consensus), the rWG(J) index can be unreasonably influenced by the number of items on the scale. As a practical example, if we sought to estimate within-group agreement among a group of citizens as to their attitudes toward a contemporary social topic (e.g., war, abortion, immigration), we would certainly not expect individuals in the group to suddenly agree more with each other if we simply added more items to the attitude scale. But that is exactly the sort of inference that organizational researchers are routinely making if they use rWG(J) to assess the notion of within-group agreement as consensus among individuals’ perceptions.
Issue 2: The use of mean of item variances ($\bar{S}^2_{x_j}$) versus variance of scale scores ($S^2_X$)
Next, we point out a key notational distinction that is relevant to many estimators of within-group agreement: the difference between $\bar{S}^2_{x_j}$ (mean of itemwise variances) and $S^2_X$ (variance of scale scores). Notice that $\bar{S}^2_{x_j}$ (which is the basis for traditional rWG(J) calculations) is different from $S^2_X$ (which is the familiar term for simple variance of a scale score).
The conceptual distinction between mean of item variances ($\bar{S}^2_{x_j}$) and variance of scale scores ($S^2_X$) can be described as follows (see Sin & Newman, 2005). The mean of item variances $\bar{S}^2_{x_j}$ (which is estimated by first calculating the variance of each item, then averaging across these item-level variance terms) includes all of the item-specific variances [i.e., variance in each item that does not overlap with the latent individual-level true score (e.g., psychological climate) construct]. In contrast, $S^2_X$ (which is estimated by averaging over items to create a scale score prior to calculating within-group variance) effectively removes much of the item-specific variance from the estimate of within-group variance (i.e., item-specific variances tend to be removed when averaging across multiple items to form a scale score). As a consequence, $\bar{S}^2_{x_j}$ will generally be larger than $S^2_X$, because $\bar{S}^2_{x_j}$ (which is the basis for rWG(J)) includes item-specific variances, whereas $S^2_X$ does not.
The distinction between $\bar{S}^2_{x_j}$ and $S^2_X$ is depicted in Figure 2. In the language of factor analysis, $S^2_X$ tends to approximate variance of the latent psychological climate factor (i.e., the individual true-score variance term in Equation 2), whereas $\bar{S}^2_{x_j}$ includes variance of this latent factor plus item-level uniquenesses. We introduce a formula that explicitly relates the two quantities:

$$\bar{S}^2_{x_j} = S^2_X\,\left[\,J - (J-1)\,\alpha\,\right] \qquad (4)$$

where J is the number of scale items and α is Cronbach’s α within groups. As Equation 4 shows, $S^2_X$ will generally be smaller than $\bar{S}^2_{x_j}$, because $\bar{S}^2_{x_j}$ includes the item-specific variance.

Figure 2. Why the mean of itemwise variances ($\bar{S}^2_{x_j}$) is almost always larger than scale score variance ($S^2_X$).
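The link between the two variance terms can be checked numerically. From the standard definition of Cronbach's α, the mean of itemwise variances equals the variance of the (item-averaged) scale score multiplied by J − (J − 1)α. The sketch below verifies this on simulated within-group data. The data-generating model (a person-level component plus item-specific noise), the seed, and all names are assumptions for illustration.

```python
import random
from statistics import mean, variance


def cronbach_alpha(ratings):
    """Cronbach's alpha from a persons-by-items list of rows."""
    num_items = len(ratings[0])
    item_variances = [variance(col) for col in zip(*ratings)]
    total_scores = [sum(row) for row in ratings]
    return (num_items / (num_items - 1)) * (
        1.0 - sum(item_variances) / variance(total_scores)
    )


random.seed(0)
NUM_PERSONS, NUM_ITEMS = 12, 4

# Hypothetical within-group data: latent person perception + item-specific noise.
ratings = []
for _ in range(NUM_PERSONS):
    person_level = random.gauss(0.0, 1.0)
    ratings.append([person_level + random.gauss(0.0, 1.0) for _ in range(NUM_ITEMS)])

mean_item_variance = mean(variance(col) for col in zip(*ratings))
scale_score_variance = variance([mean(row) for row in ratings])
alpha = cronbach_alpha(ratings)

# Mean item variance implied by the scale-score variance and alpha.
implied = scale_score_variance * (NUM_ITEMS - (NUM_ITEMS - 1) * alpha)
print(mean_item_variance, implied, scale_score_variance)
```

Because sample α cannot exceed 1, the multiplier J − (J − 1)α is at least 1, so the mean of itemwise variances is never smaller than the scale-score variance in this setup. The two coincide only when α equals 1.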
Conceptually, any within-group agreement index based upon $\bar{S}^2_{x_j}$ (e.g., the rWG(J) index, the rWG(J)* index, and the ADM(J) index, see below) defines disagreement to include item-specific sources of disagreement that do not relate to between-persons differences in latent individual true score (climate) perceptions, whereas a within-group agreement index based upon $S^2_X$ would tend to focus exclusively on between-persons disagreement in latent individual climate perceptions. As such, focusing on $\bar{S}^2_{x_j}$ as opposed to $S^2_X$ can lead to underestimation of within-group agreement (in the sense of individual true-score consensus), because the term $\bar{S}^2_{x_j}$ includes item-specific variance (item uniquenesses), in addition to latent true score variance between individuals within the group.
The rWG(J)* Index and the ADM(J) Index
As alternatives to the dominant rWG(J) index of within-group agreement (James et al., 1984), two other indices of within-group agreement have received considerable attention. The first is Lindell et al.’s (1999) rWG(J)* index:

$$r^{*}_{WG(J)} = 1 - \frac{\bar{S}^2_{x_j}}{\sigma^2_E} \qquad (5)$$
where $\bar{S}^2_{x_j}$ is the average itemwise variance (see Equation 4). Lindell et al. (1999) developed rWG(J)* to be an index of within-group agreement that is not explicitly influenced by the number of items (J) on a multi-item scale (cf. Figure 1), and to avoid a discontinuity problem in rWG(J) that occurs when the denominator of rWG(J) is near zero.
A second widely used index of within-group agreement is Burke, Finkelstein, and Dusig’s (1999) ADM(J) index. The ADM(J) index is calculated in two steps: (a) compute the average absolute value deviation of individual raters’ scores from the mean, for each item j (called ADM(j)), and then (b) average the itemwise ADM(j) values across items, to get ADM(J). Importantly, we note that the itemwise average deviation index (ADM(j)) of within-group agreement (Burke et al., 1999) is roughly equivalent to an item standard deviation under conditions of distributional normality [as Burke and Dunlap (2002) note, “the AD is a reasonable approximation to the standard deviation” (p. 163)]. When used for multi-item scales, ADM(J) is calculated as an average across item-level estimates (Burke & Dunlap, 2002). Therefore, ADM(J) can be considered essentially equivalent to an average of itemwise standard deviations. For this reason, the ADM(J) index conveys essentially redundant information with Lindell et al.’s (1999) rWG(J)* index (see Equation 5).5
The major issue with the r*WG(J) and ADM(J) estimators of within-group agreement is that they are both based on the mean of itemwise variances (see Equation 4). As such, both r*WG(J) and ADM(J) tend to capture item-specific variance, in addition to within-group variance in individual true-score perceptions (see Equation 2 and Figure 2). This feature would tend to make these indices underestimates of within-group agreement (i.e., inclusion of item-specific variance leads to underestimation of within-group agreement among individuals’ true scores).
The rWGα Index and the Disattenuated rWGα Index
What if our goal were to estimate the individual true-score consensus parameter of within-group agreement? For this purpose, it can become problematic that James et al.’s rWG(J) index is influenced by (a) the number of items (J), and (b) item-specific variance. Further, Lindell et al.’s r*WG(J) (as well as Burke et al.’s ADM(J)) is also influenced by item-specific variance (see Figure 2). Due to these issues, Newman and Sin (2008) recommended that within-group agreement (the individual true-score consensus parameter) would be more appropriately estimated using the rWGα index (Equation 6), which equals one minus the ratio of the variance of scale scores to the variance expected under a null response distribution.
In Equation 6, the numerator term is the familiar variance of scale scores (i.e., first, average across items to get the scale score for each person, then calculate the variance of the scale score across persons). The advantage of using the variance of scale scores in the numerator is that, by virtue of averaging across multiple items, this term largely removes item-specific variance and thereby homes in on true-score variance in individuals’ perceptions of the group (see Figure 2). This feature tends to make rWGα a preferable estimator of the theoretical parameter of within-group agreement among individuals’ true scores.
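As a minimal sketch of how the indices discussed so far differ in practice, the following Python code computes rWG(J), r*WG(J), rWGα, and ADM(J) for a single group. It assumes a uniform null distribution and sample (n − 1) variances; the function names and the four-rater example data are hypothetical and are not taken from the original article or its Appendix D syntax.

```python
from statistics import mean, variance


def uniform_null_variance(num_options):
    """Variance of a discrete uniform distribution over num_options scale points."""
    return (num_options ** 2 - 1) / 12.0


def agreement_indices(ratings, num_options):
    """Compute four agreement indices for one group.

    ratings: list of rows, one row per rater, each row holding J item scores.
    num_options: number of response options (e.g., 5 for a 5-point scale).
    """
    n_items = len(ratings[0])
    null_var = uniform_null_variance(num_options)
    items = list(zip(*ratings))  # one tuple of rater scores per item

    # Average itemwise variance: variance across raters per item, then mean over items.
    mean_item_var = mean(variance(item) for item in items)
    # Variance of scale scores: average items within rater, then variance across raters.
    scale_var = variance([mean(row) for row in ratings])

    ratio = mean_item_var / null_var
    return {
        "r_wg_j": n_items * (1 - ratio) / (n_items * (1 - ratio) + ratio),
        "r_star_wg_j": 1 - ratio,
        "r_wg_alpha": 1 - scale_var / null_var,
        # ADM(J): mean absolute deviation from the item mean, averaged over items.
        "ad_m_j": mean(mean(abs(x - mean(item)) for x in item) for item in items),
    }


# Hypothetical group: 4 raters answering 3 items on a 5-point scale.
example = [[4, 5, 4], [4, 4, 5], [5, 4, 4], [3, 3, 3]]
indices = agreement_indices(example, num_options=5)
```

In this made-up group the three rWG-type values fall in the order r*WG(J) < rWGα < rWG(J), which is the ordering the surrounding text leads one to expect when items share only part of their variance.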
The theoretical measurement model that underlies the derivation of rWGα differs markedly from the theoretical measurement model that underlies the derivation of rWG(J). Most importantly, rWG(J) was derived under a structural model that includes a group-level true score, but that assumes no individual-level true score (James et al., 1984, 1993; LeBreton et al., 2005). This assumption is seen clearly in the derivation of rWG(J) (LeBreton et al., 2005, p. 132), which explicitly assumes that items on a multi-item climate scale are completely uncorrelated within groups (in other words, rWG(J) is based on the assumption that Cronbach’s α = 0, within groups).
The recommendation to use rWGα (Equation 6) is based on a different theoretical measurement model, which includes individual-level true scores and thus does not assume that Cronbach’s α = 0, within groups. Indeed, the fact that we do not impose an a priori constraint that Cronbach’s α = 0 is the reason that we labeled the index of within-group agreement “rWGα” (i.e., the subscript “α” suggests that items on a climate scale are allowed to correlate, a condition forbidden in the model equation that forms the basis for rWG(J)). The theoretical measurement model underlying the rWGα index is presented in Appendix B (“Multilevel Measurement Model,” Equation B5a), and includes both a group-level true score (κ) and an individual-level true score (T).
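Finally, Newman and Sin (2008) provided an expression for the unbiased estimator of within-group agreement (i.e., consensus among individuals’ true scores), the disattenuated rWGα index (Equation 7). In Equation 7, the variance of scale scores between persons is multiplied by Cronbach’s α within groups before it is compared with the expected null variance. Equation 7 simply applies the classical test theory identity that true-score variance equals reliability times observed-score variance, in order to create a disattenuated estimate of within-group true-score variance (see Equation 2).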
When estimating within-group Cronbach’s α as a precursor to estimating within-group agreement via Equation 7, there is often likely to be some degree of sampling error in α due to small group size. As a numerical example, in the extreme case when a climate scale has 7 items and a tiny group size of n = 5, the standard error of α will be SE = .085 (for average interitem correlation = .5), and SE = .121 (for average interitem correlation = .4; see Duhachek & Iacobucci, 2004). To mitigate some of the sampling error involved in estimating α on tiny groups, it is permissible to use the α pooled across groups (i.e., a compound artifact correction) when calculating the disattenuated index via Equation 7. The pooled within-group α can be computed by first group mean-centering each item, and then calculating Cronbach’s α in the usual fashion. Alternatively, it is also permissible to calculate α separately for each group.
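The pooling procedure just described (group mean-center each item, then compute Cronbach’s α as usual) can be sketched in Python as follows. This is an illustrative sketch rather than the article’s own syntax: the function names, the dictionary-of-groups layout, and the tiny two-group dataset are assumptions, and the disattenuation step follows the verbal description of Equation 7 (α times the scale-score variance, relative to the null variance).

```python
from statistics import mean, variance


def cronbach_alpha(rows):
    """Cronbach's alpha for rows of item scores (one row per person)."""
    n_items = len(rows[0])
    item_vars = [variance(item) for item in zip(*rows)]
    total_var = variance([sum(row) for row in rows])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


def pooled_within_alpha(groups):
    """Pooled within-group alpha: group mean-center each item, then compute alpha.

    groups: dict mapping a group label to that group's list of rows.
    """
    centered = []
    for rows in groups.values():
        item_means = [mean(item) for item in zip(*rows)]
        centered.extend([x - m for x, m in zip(row, item_means)] for row in rows)
    return cronbach_alpha(centered)


def disattenuated_agreement(scale_var, alpha_within, null_var):
    """One minus (alpha times scale-score variance) over the null variance."""
    return 1 - alpha_within * scale_var / null_var


# Hypothetical data: two groups, three persons each, two items.
groups = {
    "A": [[1, 2], [2, 2], [3, 5]],
    "B": [[4, 4], [5, 5], [3, 3]],
}
alpha_pooled = pooled_within_alpha(groups)
```

Because α is a ratio of variances, using n − 1 rather than a pooled within-group degrees-of-freedom term in each variance does not change the pooled α.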
In the section above, we have shown:
The rWG(J) index is greatly influenced by the number of items (J), in part because it is a reliability analog, in addition to being an index of within-group agreement.
The rWG(J) index is an excellent estimator of the group true-score reliability analog parameter of within-group agreement.
The rWG(J) index, r*WG(J) index, and ADM(J) index are all influenced by item-specific variance, due to their reliance on the average itemwise variance. The reliance of rWG(J) on the average itemwise variance is partly a result of rWG(J) having been derived from a measurement model that has a group-level true score but no individual-level true score, and which assumes Cronbach’s α within groups is zero.
The new rWGα index is relatively insensitive to both number of items (J) and item-specific variance, and as such it tends to cleanly estimate true within-group variance in latent individual perceptions of the group property. This is partly a result of rWGα having been derived from a multilevel measurement model (Appendix B) that has a group-level true score and an individual-level true score, and which does not assume Cronbach’s α within groups is zero.
The disattenuated rWGα index is an unbiased estimator of within-group agreement among individuals’ true perceptions, which is calculated by applying a small reliability disattenuation correction to rWGα, using within-group Cronbach’s α.
Comparing Estimators of Within-Group Agreement (Individual True-Score Consensus)
In the preceding section, we reviewed several estimators of within-group agreement. Among the estimators, only the original rWG(J) index does an adequate job of estimating the group true-score reliability analog definition of within-group agreement. In contrast, we will now compare these estimators of within-group agreement (i.e., rWG(J), r*WG(J), rWGα, and the disattenuated rWGα) in terms of their ability to estimate the theoretical parameter of within-group agreement among individuals’ perceptions (i.e., individual true-score consensus). As reviewed previously, the individual true-score consensus parameter is inspired by the early theoretical work of L. R. James (i.e., it denotes shared psychological meaning). Our comparison of agreement indices will involve an empirical example, followed by mathematical expressions relating each agreement index to the consensus parameter.
Empirical Example of Within-Group Agreement Estimators: Comparing the rWG(J), r*WG(J), and rWGα Indices
We now differentiate the new rWGα index from James et al.’s traditional rWG(J) index and Lindell et al.’s r*WG(J) index, to illustrate the bottom-line implications of the new index. That is, we want to answer the question, “Does the field of organizational research really need another within-group agreement index?” Our answer to this question is, “It depends upon what one means by ‘within-group agreement.’” If within-group agreement is defined as the absence of within-group variability in individual-level perceptions of the group-level property (i.e., low within-group variance in individual-level true scores; Appendix B), then the traditional rWG(J) index and the r*WG(J) index (which is conceptually identical to the ADM(J) index) will not suffice. On the other hand, if within-group agreement is defined as a reliability analog (Appendix B; James et al., 1993; LeBreton et al., 2005), then only rWG(J) will suffice. That is, the rWG(J) index would be appropriate if within-group agreement were defined as the absence of within-group random error, where random error includes variability in individual-level perceptions of the group-level property (Appendix B), plus item-specific variability (Appendix B), with an adjustment for number of scale items (J). Similarly, Lindell et al.’s r*WG(J) index and Burke et al.’s ADM(J) index would only be appropriate if within-group agreement were defined as the absence of within-group random error, where random error includes variability in individual-level perceptions of the group-level property (Appendix B), plus item-specific variability (Appendix B). We want to create better alignment between how agreement is defined and how it is indexed.
How big are the differences likely to be between these various agreement indices, empirically speaking? To exemplify the answer to this question, we use an illustrative multilevel dataset of 2,467 supermarket employees, nested in 249 departments (see Schneider, Ehrhart, Mayer, Saltz, & Niles-Jolly, 2005, for a description of the sample and method). The focal variable was service climate [rated on an 8-item, 5-point Likert-type scale, where the item referent was the department (Chan, 1998)]. For this climate scale, α = .89, pooled within-group α = .88, the average interitem correlation was .51, and the pooled within-groups interitem correlation was .49. These data were used to calculate James et al.’s (1984) rWG(J), Lindell et al.’s (1999) r*WG(J), and the newer rWGα index (details on the latter calculation are presented in Appendix D), for each of the 249 groups. Figure 3a is a scatterplot of these agreement indices, plotted against each other to show that the various within-group agreement indices are not empirically identical.
Figure 3. (a) Empirical comparison of within-group agreement indices. Note: k = 249 groups; N = 2,467 individuals; J = 8 items on service climate scale; average interitem correlation within groups = .5. (b) Mathematical comparison of within-group agreement indices. Note: J = 8 items on climate scale; average interitem correlation within groups = .5.
Before we interpret Figure 3a, it is useful to remind the reader that the individual true-score consensus parameter defines within-group agreement as the absence of within-group variability in individual-level perceptions of the group-level property (see James & Jones, 1974; Jones & James, 1979). First, as seen in Figure 3a, James et al.’s rWG(J) index is notably larger than agreement defined in this way. That is, rWG(J) appears to be an overestimate of within-group agreement. This overestimation in James et al.’s rWG(J) is especially apparent at agreement values greater than .7. Second, Figure 3a shows that Lindell et al.’s (1999) r*WG(J) index appears to be a strong underestimate of within-group agreement. Third and finally, Figure 3a shows that the rWGα index appears to provide a nearly unbiased estimate of within-group agreement that very slightly underestimates true agreement among individuals within the group.
Mathematical Comparison of Within-Group Agreement Estimators: Comparing the rWG(J), r*WG(J), and rWGα Indices
Despite our empirical illustration, the main argument we are making does not depend upon any single empirical dataset (i.e., the data in Figure 3a are merely demonstrative). For drawing more general conclusions, we provide formulas to express the relationships between within-group agreement and James et al.’s rWG(J), Lindell et al.’s r*WG(J), and Newman and Sin’s rWGα. Equations 8, 9, and 10 give rWG(J), r*WG(J), and rWGα, respectively, as a function of within-group agreement, the number of items (J), and Cronbach’s α. Equations 8, 9, and 10 were used to create Figure 3b, under the condition where J = 8 items on the climate scale, and the average interitem correlation for the climate scale, within groups, is .5 (note that standardized Cronbach’s α = J × r / [1 + (J − 1) × r], where r is the average interitem correlation; here α ≈ .89).
Figure 3b shows the same patterns as Figure 3a. The new rWGα index is a fairly good approximation of within-group agreement. The rWGα index only slightly underestimates true agreement, to the extent that Cronbach’s α and true agreement are below the perfect 1.0. Lindell et al.’s (1999) r*WG(J) strongly underestimates true agreement, especially as true agreement falls below 1.0. Finally, James et al.’s (1984) rWG(J) tends to underestimate within-group agreement only when agreement is small, but then overestimates within-group agreement across most of the realistic range of agreement values (including everywhere that agreement is above .7; see Figures 3a and 3b).
To reiterate the point about the relative performance of rWG(J), r*WG(J), and rWGα as estimators of within-group agreement (individual true-score consensus), we now offer three specific numerical examples (see Table 1). As shown in Table 1, when individual true-score within-group agreement = .60, then rWG(J) = .75, r*WG(J) = .38, and rWGα = .56. Also, when true within-group agreement = .70, then rWG(J) = .85, r*WG(J) = .53, and rWGα = .67. The examples in Table 1 give some indication of the amount of bias that can be expected when using rWG(J), r*WG(J), or rWGα to estimate within-group agreement (individual true-score consensus).
Table 1. Examples of Estimates of Within-Group Agreement (Individual Consensus)

Index      True agreement = .60   True agreement = .65   True agreement = .70
rWG(J)     .75                    .81                    .85
r*WG(J)    .38                    .46                    .53
rWGα       .56                    .61                    .67

Note: True agreement is true within-group agreement (i.e., the absence of variability in individual perceptions of the group-level property). Examples calculated under J = 5 items and Cronbach’s α = .9.
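The Table 1 values can be reproduced with a short Python sketch. The expressions below are a reconstruction under classical test theory with parallel items, not a verbatim copy of Equations 8 through 10: within-group α is converted to an average interitem correlation by inverting the Spearman-Brown relation, and each index is then written as a function of true agreement. The function and variable names are hypothetical.

```python
def mean_interitem_r(alpha, n_items):
    """Average interitem correlation implied by standardized alpha (Spearman-Brown inverse)."""
    return alpha / (n_items - (n_items - 1) * alpha)


def expected_indices(consensus, n_items, alpha):
    """Expected rWG(J), r*WG(J), and rWGalpha for a given true agreement value.

    consensus: true within-group agreement among individuals' true scores.
    n_items:   number of scale items (J).
    alpha:     Cronbach's alpha within groups.
    """
    r_bar = mean_interitem_r(alpha, n_items)
    # Scale-score variance equals true-score variance divided by alpha.
    r_wg_alpha = 1 - (1 - consensus) / alpha
    # Item variance equals true-score variance divided by item reliability (r_bar).
    r_star = 1 - (1 - consensus) / r_bar
    # rWG(J) steps r* up by the number of items.
    r_wg_j = n_items * r_star / (n_items * r_star + (1 - r_star))
    return r_wg_j, r_star, r_wg_alpha


# Conditions stated in the Table 1 note: J = 5 items, alpha = .9.
table_1 = {
    psi: tuple(round(value, 2) for value in expected_indices(psi, 5, 0.9))
    for psi in (0.60, 0.65, 0.70)
}
```

Under these assumptions the rounded results match the table row by row, which suggests the reconstruction is at least algebraically equivalent to the published equations for this case.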
As for the related issue of how the observed differences among indices matter when it comes time to justify aggregation for a group-level consensus composition model (Chan, 1998), we return to this issue in the Discussion section (where we address the heuristic cutoff of .7; Lance, Butts, & Michels, 2006). The point we are making at this juncture is simply that the way one defines and then indexes agreement will affect the final agreement estimate one gets. The new rWGα index fits the theoretical definition of agreement as absence of within-group variability in individual-level perceptions of the group-level property. By using this intuitive definition of agreement based upon the early theoretical work of James and colleagues (James, 1982; James & Jones, 1974; Jones & James, 1979), we can propose an agreement index (rWGα) that (a) is not founded on the assumption that within-group Cronbach’s α equals zero for all scale measures of group-level properties and (b) does not converge to 1.0 as the number of scale items increases.
In the section above, we made three major points about estimating the individual true-score consensus parameter of within-group agreement:
James et al.’s (1984) rWG(J) index tends to overestimate the within-group agreement parameter (individual true-score consensus) by a notable amount, especially at agreement values greater than .7.
Lindell et al.’s (1999) r*WG(J) index tends to strongly underestimate the within-group agreement parameter (individual true-score consensus).
The new rWGα index tends to slightly underestimate the within-group agreement parameter (individual true-score consensus), by an amount that is small and typically negligible.
Discussion
The current article reviewed the influential contributions of L. R. James to the study of within-group agreement, which have formed a basis for the study of multilevel phenomena in organizations. James and colleagues provided a persuasive rationale that within-group agreement, or shared psychological meaning, was the key feature required for conceptualizing psychological variables at the group level of analysis (George & James, 1993; James, 1982; James & Jones, 1974; Jones & James, 1979). This classic and well-known contribution is perhaps best reflected in James’s own words:
Perceptual agreement implies a shared assignment of psychological meaning, from which it follows that an aggregate (mean) climate score provides the opportunity to describe an environment in psychological terms. (James, 1982, p. 221)
Shared assignment of meaning justifies aggregation to a higher level of analysis (e.g., groups, subsystems, organizations) because it furnishes a way of relating a construct (psychological climate) that is defined and operationalized at one level of analysis (the individual) to another form of the construct at a different level of analysis (e.g., group climate, subsystem climate, organizational climate). Although the unit of analysis for the aggregate psychological variable is the situation (e.g., group, subsystem, organization), the definition and basic unit of theory remains psychological. (James et al., 1988, p. 130)
Organizational climate is the overall meaning derived from the aggregation of individual perceptions of a work environment (i.e., the typical or average way people in an organization ascribe meaning to that organization) (James, 1982; Schneider, 1981). Thus, organizational climate can be viewed as the outcome of aggregating individuals’ psychological climates. The important caveat is that these psychological climates are shared in order to make the inference that an organizational climate exists. (James et al., 2008, pp. 15-16)
To briefly restate, L. R. James’s classic contributions advanced the theoretical notion of within-group agreement as the shared assignment of psychological meaning, where agreement was initially defined as the lack of individual variance in psychological climate perceptions within a group (James, 1982; James & Jones, 1974). In the current article, we mathematically formalize James and colleagues’ classic notion of agreement by describing a theoretical parameter, individual true-score consensus, which simply defines within-group agreement as the absence of individual variance in perceptions of the group (see Equation 2).
After reviewing the importance and the theoretical origins of the within-group agreement concept from the 1970s and early 1980s, we also pointed out a different theoretical parameter, the group true-score reliability analog, that was essentially introduced in 1984 when James et al. derived the multi-item rWG(J) index. The consensus parameter and the reliability analog parameter offer two different theoretical definitions of what within-group agreement is (similar to how the coefficient of equivalence and the coefficient of stability offer two different theoretical definitions of what reliability is; see Cronbach, 1947).
Subsequently, we reviewed various empirical estimators of within-group agreement. Among these indices, James et al.’s (1984) rWG(J) provides the best estimate of the group true-score reliability analog parameter (which it was designed to do). In contrast, James et al.’s rWG(J) tends to overestimate the individual true-score consensus parameter when agreement is high (see Figure 3b and Table 1). The reason James et al.’s rWG(J) tends to overestimate this form of within-group agreement is that rWG(J) increases toward 1.0 as the number of scale items (J) increases (see Figure 1, and Equation 3a).
Lindell et al.’s (1999) r*WG(J) tends to greatly underestimate both within-group agreement as individual true-score consensus and within-group agreement as a group true-score reliability analog (see Figure 3b). The reason that r*WG(J) underestimates individual true-score consensus is that r*WG(J) includes item-specific within-group variance, in addition to true-score within-group variance; that is, r*WG(J) is based on the average itemwise variance (see Figure 2, and Equation 5). However, unlike James et al.’s (1984) rWG(J) index, r*WG(J) does not contain a countervailing upward correction for the number of items on the scale. Burke et al.’s (1999) ADM index suffers from the same underestimation issue.
Perhaps not surprisingly, the newer rWGα index is a fairly unbiased estimator of within-group agreement as individual true-score consensus (see Figures 3a and 3b). Because rWGα is based on the variance of scale scores, instead of the average itemwise variance (see Figure 2), rWGα tends to focus on true-score variance in individual perceptions (in contrast to Lindell’s r*WG(J), which includes item-specific variance). However, averaging items together does not remove 100% of the item-specific variance, unless Cronbach’s α reaches a perfect 1.00; this is the reason that rWGα retains a small degree of underestimation bias (see Figures 3a and 3b). On the other hand, the rWGα index fairly substantially underestimates the group true-score reliability analog parameter whenever the number of scale items (J) is large. This is because rWGα does not contain the same correction for number of items that was incorporated into James et al.’s (1984) rWG(J) (i.e., rWGα does not use the group true-score reliability analog definition of within-group agreement).
Finally, we presented an unbiased estimator of the individual true-score consensus parameter, the disattenuated rWGα index (Equation 7). To more formally express how this index relates to alternative rWG-based estimators of within-group agreement, we present conversion formulas.
The relationships expressed by these conversion formulas simply reflect the same patterns shown in Figure 3b, but across a wider range of possible conditions (i.e., see Equations 8 through 10). rWGα is a slight underestimate of the disattenuated index, and Lindell et al.’s r*WG(J) is a large underestimate of it; but both converge toward true agreement as Cronbach’s α approaches 1.0 (because item-specific variance is removed as α approaches 1.0). James et al.’s rWG(J) substantially overestimates true agreement (particularly at agreement values above .7), and a large Cronbach’s α does not remove this overestimation, due to the correction for number of items (J; see Figure 1).
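One way to make these conversions concrete is to invert the classical test theory relations that link each index to true agreement. The Python sketch below is a reconstruction under the assumption of parallel items and a known within-group α, and should not be read as the article’s own conversion equations; all function names are hypothetical.

```python
def consensus_from_r_wg_alpha(r_wg_alpha, alpha):
    """Disattenuate rWGalpha: shrink the observed disagreement by alpha."""
    return 1 - alpha * (1 - r_wg_alpha)


def consensus_from_r_star(r_star, n_items, alpha):
    """Convert Lindell et al.'s r*WG(J) to true agreement via item reliability."""
    r_bar = alpha / (n_items - (n_items - 1) * alpha)  # average interitem correlation
    return 1 - r_bar * (1 - r_star)


def consensus_from_r_wg_j(r_wg_j, n_items, alpha):
    """Convert rWG(J) to true agreement: undo the J-item step-up, then convert r*."""
    r_star = r_wg_j / (n_items - (n_items - 1) * r_wg_j)
    return consensus_from_r_star(r_star, n_items, alpha)


# Rounded index values from Table 1 (J = 5, alpha = .9, true agreement = .60).
from_alpha_index = consensus_from_r_wg_alpha(0.56, 0.9)
from_lindell = consensus_from_r_star(0.38, 5, 0.9)
from_james = consensus_from_r_wg_j(0.75, 5, 0.9)
```

Applied to the rounded Table 1 entries, all three conversions land within about .01 of the true agreement value of .60. As α approaches 1.0 the first two conversions leave the index essentially unchanged, whereas the rWG(J) conversion still removes the step-up for the number of items.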
Recommendations
Given the position we are forwarding in the current article (see previous sections), our chief recommendations are:
Recommendation 1: To estimate within-group agreement as individual consensus, researchers should use the new rWGα index or the disattenuated rWGα index (Equations 6 and 7).
Recommendation 2: To estimate within-group agreement as a group true-score reliability analog (James et al., 1984), researchers should use the rWG(J) index (Equation 3a).
As for determining whether the researcher is more interested in the individual true-score consensus parameter or the group true-score reliability analog parameter, that will depend upon what one’s definition of within-group agreement is (see the previous section on the theoretical parameters of within-group agreement). Arguments could be made for using either parameter as a basis for justifying aggregation. Again, our goal in the current article was to articulate an alternative perspective on within-group agreement that is grounded in a new multilevel psychometric model (Appendices B and C) that includes both a group-level true score and an individual-level true score, inspired by the multilevel theoretical work of James (1982), James and Jones (1974), and Jones and James (1979).
If one chooses to estimate the rWGα index or the disattenuated rWGα index, procedures for obtaining a pooled estimate are given below, and detailed SAS syntax for estimating both indices is presented in Appendix D (which uses an artificial sample of 43 individuals nested in 4 groups).
How to Calculate Pooled Within-Group Agreement
Next, we demonstrate how to calculate the unbiased estimate of within-group agreement (the disattenuated rWGα index), which is based on the multilevel measurement model (see Appendices B and C; and Equation B4). A multi-item dataset is inherently a repeated measures dataset (because items are presented consecutively). To conduct this analysis, the usual persons × items dataset can be rearranged into a stacked format, with responses to the eight climate items (items C1 through C8) occupying consecutive rows, with each set of eight items/rows nested within the corresponding respondent’s ID (i.e., an items-within-persons format). The SAS syntax for rearranging the data is:
Next, we estimated the multilevel model shown in Equation B5a (Appendix B), as follows (see Singer, 1998).
This is a three-level model, with items (time) nested within persons (ID), which are in turn nested within groups (Group). Using Schneider et al.’s (2005) climate data (2,467 employees in 249 departments) as an example, the parameter estimates corresponding to Equation B5a are:
Grand mean = 3.35.
Group-level variance = .123 (variance of organizational climate around the grand mean).
Person-level variance = .554 (variance of individual-level psychological climate perception around group-level organizational climate).
Item-level residual variance = .758 (variance of item-level response around individual-level psychological climate perception).
There exists an alternative—but empirically identical—way to estimate this same multilevel model (Equation B5a). This alternative approach explicitly models the repeated measures aspect of the items within-persons (note the use of the “repeated” statement):
The above syntax, which was created for a repeated-measures design (see Singer, 1998), is an alternative specification that treats items as crossed with (rather than nested within) persons (i.e., in the same manner that repeated measurement waves are crossed with respondents in a longitudinal design). It is important to note that the above repeated-measures specification gives identical parameter estimates and fit indices to the previous, three-level nested model.
It is also possible to estimate the model using a single-group, two-level model (see Equation B5b), by simply dropping “Group” from the multilevel model specification above. One advantage of using the entire three-level model (Equation B5a), however, is that the parameter estimates can be used to directly calculate pooled within-group agreement (i.e., the sample-size weighted mean agreement across groups), as well as the intraclass correlations (ICCs), in a single step (see Appendix D for another example):
Readers should note that the above estimate is derived from the three-level unconditional means model that includes no individual- or group-level predictor variables, although such covariates could be added directly into the model later (see Bliese, 2002; Lance & James, 1999); e.g., if one wanted to systematically study individual differences in climate perception formation.
Heuristic Usage of Within-Group Agreement Indices
The current article’s demonstration that rWGα and the disattenuated rWGα index are relatively unbiased estimators of within-group agreement (in comparison to rWG(J)) will still leave some readers unsatisfied. Because most researchers use rWG(J) by comparing it to a heuristic cutoff of .70 (see Lance, Butts, et al., 2006), readers may believe that our demonstrations in the current article have no practical use unless they cause agreement estimates to jump to the other side of the .70 barrier. In this regard, we note it is not our intention to criticize the heuristic cutoff, nor to defend it. The cutoff is heuristic. Instead, our contribution is about delineating what agreement is. That is, we have shown that multilevel modeling can be used to tease apart within-groups, between-persons variance from between-items, within-persons variance (see Multilevel Measurement Model, Appendix B and Appendix C). We have also shown that doing so produces indices of within-group agreement (rWGα and its disattenuated version) that are more a function of how much individual group members’ perceptions agree, and less a function of the number of items on a scale (J).
If we were asked to propose a less heuristic, more statistically grounded test of within-group agreement for the purpose of justifying aggregation, we would recommend a demonstration that the within-group true-score variance (Equation 14) is significantly smaller than the expected null variance. Expected variances under a null distribution could be taken from LeBreton and Senter (2008, p. 832), and compared to the observed within-group variance using the standard error of the variance (Equation 16; Kendall & Stuart, 1977; as reported by Newman & Sin, 2009).
So for the above example of service climate in grocery store departments, the within-group true-score variance was .554 on a 5-point scale. The expected null variance from a 5-point uniform distribution, for instance, would be 2.00 (and for a normal distribution, would be 1.04; LeBreton & Senter, 2008). The standard error for this variance is .236 (Equation 16), so the z-score for the observed within-group variance is the difference between the observed variance and the expected null variance, divided by this standard error. The resulting z-score is statistically significant (p < .05). This means the true within-group variance estimate is smaller than the expected null variance. So aggregation is justified, according to this straightforward statistical test.
Recommendation 3: To justify aggregation, one could conduct a simple statistical significance test of the within-group true-score variance against the expected null variance (see Equation 16).
Recommendation 4: The heuristic that rWG(J) should be ≥ .70 need not apply to the rWGα index or the disattenuated rWGα index. Because rWG(J) is an overestimate of within-group agreement (consensus among individual-level perceptions), a lower heuristic cutoff value for rWGα and the disattenuated rWGα would seem appropriate (e.g., rWGα > .50 or disattenuated rWGα > .50).
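The significance test in Recommendation 3 can be sketched in a few lines of Python using the values reported above for the service climate example. The sketch assumes the usual z form (observed variance minus null variance, divided by the standard error) and a one-tailed test; it takes the reported standard error of .236 as given rather than recomputing it, and the function name is hypothetical.

```python
from statistics import NormalDist


def variance_z_test(observed_var, null_var, standard_error):
    """One-tailed z-test that the observed within-group variance is below the null variance."""
    z = (observed_var - null_var) / standard_error
    p_value = NormalDist().cdf(z)  # lower-tail probability
    return z, p_value


# Reported values: within-group true-score variance = .554,
# uniform null variance on a 5-point scale = 2.00, standard error = .236.
z_uniform, p_uniform = variance_z_test(0.554, 2.00, 0.236)
```

A large negative z indicates that the observed within-group variance is well below what random (uniform) responding would produce, which is the pattern that supports aggregation under this test.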
Conclusion
In the current article, we reviewed L. R. James’s seminal theoretical work on measuring group-level psychological properties, which has been pivotal in the advent of the multilevel age of organizational research. Two major innovations in James’s work were (a) drawing the distinction between individual-level perceptions of the group-level property (e.g., psychological climate) and the group-level property (e.g., organizational climate), and (b) articulating the notion that within-group agreement (absence of variability between individuals’ perceptions of the group, or shared psychological meaning) was necessary in order to claim that a group-level construct exists. From James’s theoretical notion of within-group agreement (James, 1982; James & Jones, 1974; Jones & James, 1979), we formalized the theoretical parameter of within-group agreement as individual true-score consensus. This parameter (individual true-score consensus, or shared psychological meaning) is distinguished from the group true-score reliability analog parameter (which forms the theoretical basis for the rWG(J) index; James et al., 1984). Next, we traced the origins and properties of several estimators of within-group agreement (i.e., different indices in the rWG family; beginning with James et al., 1984), to illustrate that the commonly used rWG(J) index and the r*WG(J) index are not the preferred estimators of individual true-score consensus, although the rWG(J) index is the preferred estimator of the group true-score reliability analog. We also presented the rWGα estimator (and the disattenuated rWGα estimator), and showed how to calculate pooled within-group agreement in a single step, on the basis of a newly derived multilevel measurement model (i.e., items within persons within groups; Appendix B).
Inspired by L. R. James’s theoretical work on within-group agreement (James, 1982; James & Jones, 1974; Jones & James, 1979), our new multilevel measurement model specifies both an individual-level true score (e.g., psychological climate perceptions) and a group-level true score (e.g., organizational climate). This is a marked distinction from the measurement model upon which the traditional rWG(J) index was based (James et al., 1984), which involved only a group-level true score and no interitem correlations within groups (see Appendices B and C). It is our hope that the current demonstrations will continue the legacy of L. R. James, by offering estimators of within-group agreement that yield a clear picture of his theoretical notions of agreement among individuals’ perceptions of the group (i.e., the individual true-score consensus parameter), while enabling the specification of an individual-level true score in the perceptions of one’s group.
Supplemental Material
Supplemental_Material_Appendix for Within-Group Agreement (rWG): Two Theoretical Parameters and their Estimators by Daniel A. Newman and Hock-Peng Sin in Organizational Research Methods
Acknowledgments
The authors would like to thank Ben Schneider and Dave Mayer for the data used in the empirical example. This work was presented in a worldwide webcast honoring the career contributions of Larry James on April 26, 2013. An earlier version of the article received the Sage Best Paper Award from the Academy of Management Research Methods Division, and was presented at the annual meetings of the Academy of Management and published in the Academy of Management Proceedings (Newman & Sin, 2008).
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
References
1. Aguinis, H., O’Boyle, E., Gonzalez-Mulé, E., & Joo, H. (2016). Cumulative advantage: Conductors and insulators of heavy-tailed productivity distributions and productivity stars. Personnel Psychology, 69, 3–66.
2. Bliese, P. D. (2000). Within-group agreement, non-independence, and reliability: Implications for data aggregation and analysis. In K. J. Klein & S. W. J. Kozlowski (Eds.), Multilevel theory, research, and methods in organizations (pp. 349–381). San Francisco, CA: Jossey-Bass.
3. Bliese, P. D. (2002). Multilevel random coefficient modeling in organizational research: Examples using SAS and S-PLUS. In F. Drasgow & N. Schmitt (Eds.), Measuring and analyzing behavior in organizations: Advances in measurement and data analysis (pp. 401–445). San Francisco, CA: Jossey-Bass.
4. Bliese, P. D., & Ployhart, R. E. (2002). Growth modeling using random coefficient models: Model building, testing and illustrations. Organizational Research Methods, 5, 362–388.
5. Bowers, K. S. (1973). Situationism in psychology: An analysis and a critique. Psychological Review, 80, 307–336.
6. Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296–322.
7. Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models: Applications and data analysis methods. Thousand Oaks, CA: Sage.
8. Burke, M. J., & Dunlap, W. P. (2002). Estimating interrater agreement with the average deviation index: A user’s guide. Organizational Research Methods, 5, 159–172.
9. Burke, M. J., Finkelstein, L. M., & Dusig, M. S. (1999). On average deviation indices for estimating interrater agreement. Organizational Research Methods, 2, 49–68.
10. Campbell, J. P., Dunnette, M. D., Lawler, E. E., III, & Weick, K. E., Jr. (1970). Managerial behavior, performance, and effectiveness. New York, NY: McGraw-Hill.
11. Chan, D. (1998). Functional relations among constructs in the same content domain at different levels of analysis: A typology of composition models. Journal of Applied Psychology, 83, 234–246.
12. Chen, G., Bliese, P. D., & Mathieu, J. E. (2005). Conceptual framework and statistical procedures for delineating and testing multilevel theories of homology. Organizational Research Methods, 8, 375–409.
13. Chen, G., Mathieu, J. E., & Bliese, P. D. (2004). A framework for conducting multilevel construct validation. Research in Multi-Level Issues, 3, 273–303.
14. Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78, 98–104.
15. Cronbach, L. J. (1947). Test “reliability”: Its meaning and determination. Psychometrika, 12, 1–16.
16. Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.
17. Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements. New York, NY: John Wiley.
18. Dalton, D. R., & Aguinis, H. (2013). Measurement malaise in strategic management studies: The case of corporate governance research. Organizational Research Methods, 16, 88–99.
19. Dansereau, F., & Yammarino, F. J. (2006). Is more discussion about levels of analysis really necessary? When is such discussion sufficient? Leadership Quarterly, 17, 537–552.
20. Duhachek, A., & Iacobucci, D. (2004). Alpha’s standard error (ASE): An accurate and precise confidence interval estimate. Journal of Applied Psychology, 89, 792–808.
21. Dyer, N. G., Hanges, P. J., & Hall, R. J. (2005). Applying multilevel confirmatory factor analysis techniques to the study of leadership. Leadership Quarterly, 16, 149–167.
22. Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105–146). Washington, DC: American Council on Education/Macmillan.
23. Finn, R. H. (1970). A note on estimating the reliability of categorical data. Educational and Psychological Measurement, 30, 71–76.
24. Forehand, G. A., & Gilmer, B. V. H. (1964). Environmental variation in studies of organizational behavior. Psychological Bulletin, 62, 361–382.
25. George, J. M., & James, L. R. (1993). Personality, affect, and behavior in groups revisited: Comment on aggregation, levels of analysis, and a recent application of within and between analysis. Journal of Applied Psychology, 78, 798–804.
26. Gersick, C. J. G., & Hackman, J. R. (1990). Habitual routines in task-performing groups. Organizational Behavior and Human Decision Processes, 47, 65–97.
27. Gibson, C. B. (1999). Do they do what they believe they can? Group efficacy and group effectiveness across tasks and cultures. Academy of Management Journal, 42, 138–152.
28. Giddens, A. (1993). New rules of sociological method (2nd ed.). Stanford, CA: Stanford University Press.
29. Glick, W. H. (1985). Conceptualizing and measuring organizational and psychological climate: Pitfalls in multi-level research. Academy of Management Review, 10, 601–616.
30. Graham, J. M. (2006). Congeneric and (essentially) tau-equivalent estimates of score reliability: What they are and how to use them. Educational and Psychological Measurement, 66, 930–944.
31. Grant, A. M. (2013). Rocking the boat but keeping it steady: The role of emotion regulation in employee voice. Academy of Management Journal, 56, 1703–1723.
32. Guion, R. M. (1973). A note on organizational climate. Organizational Behavior and Human Performance, 9, 120–125.
33. Gulliksen, H. (1950). Theory of mental tests. New York, NY: John Wiley.
34. Hanges, P. J., & Dickson, M. W. (2006). Agitation over aggregation: Clarifying the development of and the nature of the GLOBE scales. Leadership Quarterly, 17, 522–536.
35. Hitt, M. A., Beamish, P. W., Jackson, S. E., & Mathieu, J. E. (2007). Building theoretical and empirical bridges across levels: Multilevel research in management. Academy of Management Journal, 50(6), 1385–1399.
36. Hofmann, D. A., Morgeson, F. P., & Gerras, S. J. (2003). Climate as a moderator of the relationship between leader-member exchange and content specific citizenship: Safety climate as an exemplar. Journal of Applied Psychology, 88, 170–178.
37. Hofstede, G. (1980). Culture’s consequences: International differences in work-related values. Beverly Hills, CA: Sage.
38. Hulin, C. L., Drasgow, F., & Parsons, C. K. (1983). Item response theory: Applications to psychological measurement. Homewood, IL: Dow Jones-Irwin.
39. Insel, P. M., & Moos, R. H. (1974). Psychological environments: Expanding the scope of human ecology. American Psychologist, 29, 179–188.
40. James, L. R. (1982). Aggregation bias in estimates of perceptual agreement. Journal of Applied Psychology, 67, 219–229.
41. James, L. R., Choi, C. C., Ko, C. H. E., McNeil, P. K., Minton, M. K., Wright, M. A., & Kim, K. I. (2008). Organizational and psychological climate: A review of theory and research. European Journal of Work and Organizational Psychology, 17(1), 5–32.
42. James, L. R., Demaree, R. G., & Wolf, G. (1984). Estimating within-group interrater reliability with and without response bias. Journal of Applied Psychology, 69, 85–98.
43. James, L. R., Demaree, R. G., & Wolf, G. (1993). rwg: An assessment of within-group interrater agreement. Journal of Applied Psychology, 78, 306–309.
44. James, L. R., & Jones, A. P. (1974). Organizational climate: A review of theory and research. Psychological Bulletin, 81, 1096–1112.
45. James, L. R., Joyce, W. F., & Slocum, J. W., Jr. (1988). Comment: Organizations do not cognize. Academy of Management Review, 13, 129–132.
46. Jones, A. P., & James, L. R. (1979). Psychological climate: Dimensions and relationships of individual and aggregated work environment perceptions. Organizational Behavior and Human Performance, 23, 201–250.
47. Katz, D., & Kahn, R. L. (1966). The social psychology of organizations. New York, NY: John Wiley.
48. Kendall, M. G., & Stuart, A. (1977). The advanced theory of statistics. London, UK: Griffin.
49. Kenny, D. A. (1991). A general model of consensus and accuracy in interpersonal perception. Psychological Review, 98, 155–163.
50. Kenny, D. A., & Berman, J. S. (1980). Statistical approaches to the correction of correlational bias. Psychological Bulletin, 88, 288–295.
51. King, L. M., Hunter, J. E., & Schmidt, F. L. (1980). Halo in a multidimensional forced-choice performance evaluation scale. Journal of Applied Psychology, 65, 507–516.
52. Klein, K. J., & Kozlowski, S. W. J. (Eds.). (2000). Multilevel theory, research, and methods in organizations. San Francisco, CA: Jossey-Bass.
53. Kozlowski, S. W. J., & Doherty, M. L. (1989). Integration of climate and leadership: Examination of a neglected issue. Journal of Applied Psychology, 74, 546–553.
54. Kozlowski, S. W. J., & Hattrup, K. (1992). A disagreement about within-group agreement: Disentangling issues of consistency versus consensus. Journal of Applied Psychology, 77, 161–167.
55. Kozlowski, S. W. J., & Klein, K. J. (2000). A multilevel approach to theory and research in organizations: Contextual, temporal, and emergent processes. In K. J. Klein & S. W. J. Kozlowski (Eds.), Multilevel theory, research, and methods in organizations (pp. 3–90). San Francisco, CA: Jossey-Bass.
56. Lance, C. E., Baxter, D., & Mahan, R. P. (2006). Evaluation of alternative perspectives on source effects in multisource performance measures. In W. Bennett, Jr., C. E. Lance, & D. J. Woehr (Eds.), Performance measurement: Current perspectives and future challenges (pp. 49–76). Mahwah, NJ: Lawrence Erlbaum.
57. Lance, C. E., Butts, M. M., & Michels, L. C. (2006). The sources of four commonly-reported cutoff criteria: What did they really say? Organizational Research Methods, 9, 202–220.
58. Lance, C. E., & James, L. R. (1999). ν2: A proportional variance-accounted-for index for some cross-level and person-situation research designs. Organizational Research Methods, 2, 395–418.
59. Lance, C. E., Noble, C. L., & Scullen, S. E. (2002). A critique of the correlated trait-correlated method and correlated uniqueness models for multitrait-multimethod data. Psychological Methods, 7, 228–244.
60. LeBreton, J. M., James, L. R., & Lindell, M. K. (2005). Recent issues regarding rWG, r*WG, rWG(J), and r*WG(J). Organizational Research Methods, 8, 128–138.
61. LeBreton, J. M., & Senter, J. L. (2008). Answers to 20 questions about interrater reliability and interrater agreement. Organizational Research Methods, 11, 815–852.
62. Leung, K., & Bond, M. H. (1989). On the empirical identification of dimensions for cross-cultural comparisons. Journal of Cross-Cultural Psychology, 20, 133–151.
63. Lindell, M. K. (2001). Assessing and testing interrater agreement on a single target using multi-item rating scales. Applied Psychological Measurement, 25, 89–99.
64. Lindell, M. K., & Brandt, C. J. (1997). Measuring interrater agreement for ratings of a single target. Applied Psychological Measurement, 21, 271–278.
65. Lindell, M. K., & Brandt, C. J. (1999). Assessing interrater agreement on the job relevance of a test: A comparison of the CVI, T, rWG(J) and r*WG indexes. Journal of Applied Psychology, 84, 640–647.
66. Lindell, M. K., & Brandt, C. J. (2000). Climate quality and climate consensus as mediators of the relationship between organizational antecedents and outcomes. Journal of Applied Psychology, 85, 331–348.
67. Lindell, M. K., Brandt, C. J., & Whitney, D. J. (1999). A revised index of interrater agreement for multi-item ratings of a single target. Applied Psychological Measurement, 23, 127–135.
68. Lord, F. M., & Novick, M. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
69. McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1, 30–46.
70. Meyer, R. D., Mumford, T. V., Burrus, C. J., Campion, M. A., & James, L. R. (2014). Selecting null distributions when calculating rwg: A tutorial and review. Organizational Research Methods, 17, 324–345.
71. Morgeson, F. P., & Hofmann, D. A. (1999). The structure and function of collective constructs: Implications for multilevel research and theory development. Academy of Management Review, 24, 249–285.
Newman, D. A. (2014). Missing data: Five practical guidelines. Organizational Research Methods, 17, 372–411.
74. Newman, D. A., & Sin, H. P. (2008). Within-group agreement for multi-item scales: Considering interitem correlations. In Academy of Management best paper proceedings (pp. 1–7). New York, NY: Academy of Management.
75. Newman, D. A., & Sin, H. P. (2009). How do missing data bias estimates of within-group agreement? Sensitivity of SDWG, CVWG, rWG(J), rWG(J)*, and ICC to systematic nonresponse. Organizational Research Methods, 12, 113–147.
76. Ostroff, C. (1993). Comparing correlations based on individual level and aggregate data. Journal of Applied Psychology, 78, 569–582.
77. Ostroff, C., Kinicki, A. J., & Tamkins, M. M. (2003). Organizational culture and climate. In W. C. Borman, D. R. Ilgen, & R. J. Klimoski (Eds.), Handbook of psychology: Industrial and organizational psychology (Vol. 12, pp. 565–593). New York, NY: John Wiley.
78. Peterson, M. F., & Castro, S. L. (2006). Measurement metrics at aggregate levels of analysis: Implications for organization culture research and the GLOBE project. Leadership Quarterly, 17, 506–521.
79. Raykov, T. (1997). Estimation of composite reliability for congeneric measures. Applied Psychological Measurement, 21, 173–184.
80. Reichers, A. E., & Schneider, B. (1990). Climate and culture: An evolution of constructs. In B. Schneider (Ed.), Organizational climate and culture (pp. 5–39). San Francisco, CA: Jossey-Bass.
81. Roberson, Q. M. (2006). Justice in teams: The effects of interdependence and identification on referent choice and justice climate strength. Social Justice Research, 19, 323–344.
82. Roberts, K. H., Hulin, C. L., & Rousseau, D. M. (1978). Developing an interdisciplinary science of organizations. San Francisco, CA: Jossey-Bass.
83. Rousseau, D. M. (1985). Issues of level in organizational research: Multilevel and cross-level perspectives. In L. L. Cummings & B. M. Staw (Eds.), Research in organizational behavior (Vol. 7, pp. 1–37). Greenwich, CT: JAI.
84. Schmidt, F. L., & Hunter, J. E. (1989). Interrater reliability coefficients cannot be computed when only one stimulus is rated. Journal of Applied Psychology, 74, 368–370.
85. Schmidt, F. L., Le, H., & Ilies, R. (2003). Beyond alpha: An empirical examination of the effects of different sources of measurement error on reliability estimates for measures of individual differences constructs. Psychological Methods, 8, 206–224.
86. Schneider, B. (1975). Organizational climates: An essay. Personnel Psychology, 28, 447–479.
87. Schneider, B. (1981). Work climates: An interactionist perspective (Research Report No. 81-2). East Lansing: Michigan State University, Department of Psychology.
88. Schneider, B. (1990). The climate for service: An application of the climate construct. In B. Schneider (Ed.), Organizational climate and culture (pp. 383–412). San Francisco, CA: Jossey-Bass.
89. Schneider, B. (2000). The psychological life of organizations. In N. M. Ashkanasy, C. P. M. Wilderom, & M. F. Peterson (Eds.), Handbook of organizational culture & climate (pp. xvii–xxi). Thousand Oaks, CA: Sage.
90. Schneider, B., Ehrhart, M. G., Mayer, D. M., Saltz, J. L., & Niles-Jolly, K. (2005). Understanding organization-customer links in service settings. Academy of Management Journal, 48, 1017–1032.
91. Schneider, B., & Reichers, A. E. (1983). On the etiology of climates. Personnel Psychology, 36, 19–39.
92. Sells, S. B. (1963). An interactionist looks at the environment. American Psychologist, 18, 696–702.
93. Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420–428.
94. Sin, H. P., & Newman, D. A. (2005). Variance of means versus mean of variances: A contrarian view on operationalizing group dispersion. In Academy of Management best paper proceedings (pp. B1–B6). New York, NY: Academy of Management.
95. Singer, J. D. (1998). Using SAS PROC MIXED to fit multilevel models, hierarchical models, and individual growth models. Journal of Educational and Behavioral Statistics, 24, 323–355.
96. Sirotnik, K. A. (1980). Psychometric implications of the unit of analysis problem with examples from the measurement of organizational climate. Journal of Educational Measurement, 17, 245–282.
97. Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 271–295.
98. Triandis, H. C., McCusker, C., Betancourt, H., Iwao, S., Leung, K., Salazar, J. M., … Zaleski, Z. (1993). An etic–emic analysis of individualism and collectivism. Journal of Cross-Cultural Psychology, 24, 366–383.
99. Wherry, R. J., Sr., & Bartlett, C. J. (1982). The control of bias in ratings: A theory of rating. Personnel Psychology, 35, 521–551.
100. Yammarino, F. J., & Markham, S. E. (1992). On the application of within and between analysis: Are absence and affect really group-based phenomena? Journal of Applied Psychology, 77, 168–176.