Abstract
A latent variable modeling method for evaluation of interrater agreement is outlined. The procedure is useful for point and interval estimation of the degree of agreement among a given set of judges evaluating a group of targets. In addition, the approach allows one to test for identity in underlying thresholds across raters as well as to identify possibly aberrantly evaluating judges. A measure of interrater agreement is proposed, which is related to popular indexes of interrater reliability for observed variables and composite reliability. The outlined method also permits the examination of underlying common sources of ratings variability, provides a useful complement to the literature on interrater agreement with manifest measures, and relaxes some of its assumptions. The procedure is illustrated with numerical data.
In behavioral, educational, social, and biomedical research, a study design is often used in which several “raters” (or “judges”), such as psychiatrists, counselors, teachers, clinicians, evaluators, or observers, are asked to evaluate a group of “targets,” for example, subjects, students, teachers, clients, objects, or patients. In this context, a researcher is typically interested in estimating the interrater agreement (IRA) as the degree to which there is consistency in the ratings (cf. Shrout & Fleiss, 1979). A large body of literature is currently available that deals with this and related topics in various circumstances characterized by different inferential aims (e.g., Burke & Dunlap, 2002; Congdon & McQueen, 2000; Dimitrov, 2012; LeBreton & Senter, 2008; Lindell, 2001; Raymond, Harik, & Clauser, 2011; Shoukri, 2011; von Eye & Mun, 2005; see Note 1).
A useful approach to evaluating IRA would be to adopt a widely employed assumption across a number of empirical contexts in the behavioral and social sciences. Specifically, each rater may be presumed to be using an underlying continuous dimension along which he or she implicitly evaluates each target. However, the rater can provide only ordinal information about the location of each rated target on that continuum, with such information resulting from a process of coarse measurement by the judge of the extent to which the characteristic of concern may be possessed by that target (e.g., Skrondal & Rabe-Hesketh, 2004). Thereby, the rater states only one of a limited set of several possible integer numbers―the pertinent individual rating.
The assumption of the existence of an underlying latent dimension (continuum) along which rater evaluation proceeds is widely used in the popular latent variable modeling (LVM) framework when dealing with categorical dependent variables (e.g., Muthén, 1984). This assumption, which has proved useful in theoretical and empirical behavioral and social research, opens here an opportunity to capitalize on recent advances in LVM for the purpose of IRA evaluation (see Raykov & Marcoulides, 2012, for examining this assumption in an empirical setting; cf. Jöreskog & Sörbom, 1996). To our knowledge, an LVM-based approach to IRA evaluation with this assumption has not been published in previous IRA studies. In addition, interval estimation of IRA has rarely been carried out in past research, and only under assumptions that could not be readily considered tenable (cf. Raykov, 2011; Shrout & Fleiss, 1979; Shrout & Lane, in press).
The present article intends to contribute to bridging this gap. The following discussion describes an LVM-based approach to evaluating IRA when multiple targets are evaluated by a fixed set of raters. The outlined procedure can be used for point and interval estimation of the extent to which the raters are consistent in their assessment of the targets. The procedure uses less stringent assumptions compared with methods that are based on manifest variables. The proposed approach can also be used when the goals are (a) to examine a set of raters for possible common sources in their ratings’ variability, (b) to test underlying thresholds for identity across raters, and (c) to identify possibly aberrantly evaluating raters.
With these features, the article contributes to a substantial body of literature on IRA evaluation that already contains several prior attempts to devise particular models for IRA. Within the context of manifest variables, Schuster (2001), for example, advanced an interpretation of Cohen’s (1960) kappa as a parameter of a log-linear symmetry model. von Eye and Mun (2005) discussed a family of log-linear models of rater agreement that incorporates, among others, Tanner and Young’s (1985) equal weight agreement model, the weight-by-response-category agreement model, the symmetry model, models with covariates, weighted kappa (Cohen, 1968), and models with ordinal categories (Schuster & von Eye, 2001; see also Schuster & Smith, 2005; Uebersax, 1993). Alternatively, within the latent variable framework, mixture models (Schuster, 2002) have been discussed as well as latent class models (Schuster & Smith, 2002) that similarly aimed at IRA evaluation. von Eye and Mun (2005) further proposed a structural equation model for the comparison of ratings from two or more groups of raters.
The method outlined in the remaining discussion differs from all these approaches in a number of important respects. First, this is a latent variables approach, and as such it is distinct from manifest variables procedures in that a measure of the degree of agreement is based on underlying latent dimensions here. In addition, the following procedure differs from the latent variable models proposed in the above-mentioned work by Schuster and colleagues (Schuster & Smith, 2002; Schuster & von Eye, 2001) in that (a) the present one uses groups of raters instead of individuals and (b) the raters are considered random variables here. This article’s method is also distinct from von Eye and Mun’s (2005) structural equation modeling approach in that the former goes beyond comparing groups of raters by proposing measures of the degree of agreement and tools to also evaluate individual raters. Finally, the following procedure differs from all those alternative approaches in that this method makes assumptions about underlying processes that are continuous in nature.
Background, Notation, and Assumptions
In the remainder of this article, we assume that n targets are evaluated by each of r raters using an ordinal scale with possible scores 1, 2, . . ., m (n, m > 1; r > 2). That is, each rater is asked to evaluate a certain characteristic for each target and can use thereby only one of these m possible values. Below, we will also refer to the raters as “evaluators” or “judges,” to the objects of measurement as “targets” or “objects,” and on a few occasions to IRA as “inter-observer agreement” (cf. Shavelson & Webb, 1991). For example, a set of teachers may be evaluated by a set of inspectors, whereby the teacher’s performance in a prespecified area of professional activity (e.g., degree of being successful in engaging students in discussion during class) is graded using the rating 1, 2, . . ., or m, independently of how any other teacher is evaluated on the same characteristic by the same or another judge. In this setting, the goal is to obtain point and interval estimates of an IRA index, that is, the extent to which the judges are consistent in their ratings. This IRA index should reflect correspondingly the amount of overlap among observers in their evaluations of the examined objects.
As indicated earlier, this article adopts an LVM-based approach to IRA evaluation. The method outlined below is motivated by the assumption that a set of judges would be consistent in their ratings of targets to the extent to which they might be using a common underlying “metric” for the purpose of target assessment. As part of this assumption, that metric may be conceptualized as a common underlying continuum along which the targets may be thought of as being approximately positioned. This assumption is testable within the proposed LVM-based approach. The following method is also readily extended to the case when more than a single latent dimension may be underlying the raters’ evaluations, which will be addressed in a later section.
To develop an LVM-based approach to IRA estimation, we view the raters as separate observed random variables that are formally “taken” or measured on each of the targets. Because of the coarse evaluation involved, these variables represent categorical manifest measures, which are denoted here $y_1, \ldots, y_r$ and viewed as elements of the vector $y = (y_1, \ldots, y_r)'$ (priming denotes transposition in this article). Furthermore, as typically done in applications of LVM and indicated earlier, we assume the existence of a normal latent variable underlying each categorical observed variable (Muthén, 1984; Muthén & Muthén, 2010; see also Raykov & Marcoulides, 2012, for testing of this assumption). Let us designate these underlying variables by $y_1^*, \ldots, y_r^*$ and collect them in the vector $y^* = (y_1^*, \ldots, y_r^*)'$.
To use the LVM framework with categorical observed variables for IRA estimation, it is next noted that the observed ratings (scores) produced by the raters can be viewed as resulting from the relationship of their underlying latent variable realization to an associated set of thresholds (e.g., Skrondal & Rabe-Hesketh, 2004). Specifically, denoting these thresholds by $\tau_{j1}, \tau_{j2}, \ldots, \tau_{j,m-1}$ for the jth rater, this relationship is
$$y_j = k \quad \text{if} \quad \tau_{j,k-1} < y_j^* \le \tau_{jk}, \qquad k = 1, \ldots, m, \tag{1}$$
where $\tau_{j0} = -\infty$ and $\tau_{jm} = \infty$ ($j = 1, 2, \ldots, r$). As is often done in applications of LVM with categorical manifest variables, a confirmatory factor analysis (CFA) model can be fitted to the underlying variables $y^*$ (Muthén, 1984). In the context of IRA evaluation, it may be argued that oftentimes it would be of interest to fit a single-factor model. Indeed, it is readily realized that rater agreement may well result from the fact that the underlying variables $y^*$ have a dominant source of shared variability, which would then be represented by their common factor (see the Discussion and Conclusion section for alternatives). That is, the following testable model
$$y^* = \Lambda \eta + \varepsilon, \tag{2}$$
where Λ is the r × 1 matrix of factor loadings, η the factor, and ε the r × 1 vector of unique factors, can be employed as a useful starting point when studying IRA. As usual, the unique factors in Model (2) are assumed uncorrelated among themselves and with the common factor η (e.g., Muthén, 1984). We stress that Model (2) is testable when empirical data are available using LVM. If this model fits the data well, the single factor η can be straightforwardly interpreted as the common source of variability in the rater-specific dimensions $y_1^*, \ldots, y_r^*$.
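The coarse-measurement step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `coarsen` is hypothetical, and the convention that a rating equals k when the latent value lies above the (k − 1)th threshold and at or below the kth threshold is assumed here; the cutoffs are those used later in the illustration section.

```python
from bisect import bisect_left

def coarsen(y_star, thresholds):
    """Map a latent value y* to an ordinal rating 1..m.

    `thresholds` holds the m - 1 increasing cutoffs of one rater.
    Assumed convention: rating k  <=>  tau_{k-1} < y* <= tau_k,
    with tau_0 = -infinity and tau_m = +infinity.
    """
    # bisect_left counts the thresholds that are strictly below y*
    return 1 + bisect_left(thresholds, y_star)

# Cutoffs taken from the illustration section (m = 4 rating categories)
thresholds = [0.20, 0.50, 0.80]
ratings = [coarsen(v, thresholds) for v in (-1.3, 0.20, 0.35, 0.79, 2.1)]
print(ratings)
```

A latent value exactly equal to a threshold falls into the lower category under this convention; the opposite convention would use `bisect_right` instead.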
Adopting Model (2) permits us also to address an issue that does not seem to have received the deserved attention in the extant literature on IRA. Specifically, when evaluating IRA, a natural question to pose initially is whether there are commonalities in the specific “cutoffs” that raters use to evaluate the targets. This question translates here to that of identity across judges of each of the m − 1 thresholds in Equation (1), that is, across the r observed random variables involved. In our view, it is important to assume that these thresholds are at least comparable if not the same for all raters, since otherwise it may be hard to speak of IRA to begin with. The proposed LVM approach is applicable whether the thresholds are the same or not across the raters, and in addition allows testing for their identity across raters as a by-product (see below). Moreover, Model (2) permits one to identify judges that provide possibly aberrant evaluations compared with the other raters. Specifically, a comparison of each factor loading in Model (2) with each of the remaining (r − 1) factor loadings for the judges makes it possible to identify potentially “outlying” raters. In addition, the method does not need to assume that each judge “draws” to the same extent from the underlying latent dimension when evaluating the targets. That is, the rater-specific loadings $\lambda_1, \ldots, \lambda_r$ in the single-factor Model (2) need not be assumed equal across judges, and in fact their differences permit a considerable degree of flexibility in the model when applied in empirical research, while the identity of these loadings for all or a subset of the raters is similarly testable with the present approach. (Equality of any of these factor loadings is testable using essentially the same approach as that for testing threshold identity; see below.)
Furthermore, the procedure outlined next does not assume equal residual variances across the raters, that is, the variances of the rater-specific residual terms $\varepsilon_1, \ldots, \varepsilon_r$ need not be equal (cf. Shrout & Fleiss, 1979; Shrout & Lane, in press; the equality of these variances is similarly testable with essentially the same method as that used for threshold identity testing below).
A Latent Variable Modeling–Based Index of Interrater Agreement
Typically when working with observed categorical variables, of actual interest are the individual values on their underlying normal variables, $y_1^*, \ldots, y_r^*$.
Interrater Agreement as Proportion Common Variance
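One possible way to conceptualize IRA in the currently considered setting is as the proportion of variance in the average underlying rating value that is common to all raters. Accordingly, we define IRA as the proportion of variance in the average underlying rating,
$$\bar{y}^* = (y_1^* + \cdots + y_r^*)/r,$$
which is accounted for by the common factor η of the pertinent rater variables $y_1^*, \ldots, y_r^*$.
Then the IRA index proposed in this article, denoted ρ1, is defined as follows (Var(·) designates variance in the sequel):
$$\rho_1 = \frac{\left(\sum_{j=1}^{r} \lambda_j\right)^2 \mathrm{Var}(\eta)}{\left(\sum_{j=1}^{r} \lambda_j\right)^2 \mathrm{Var}(\eta) + \sum_{j=1}^{r} \mathrm{Var}(\varepsilon_j)}, \tag{5}$$
where $\lambda_1, \ldots, \lambda_r$ are the factor loadings and $\varepsilon_1, \ldots, \varepsilon_r$ the unique factors in Model (2).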
We observe that since the IRA index (5) instrumentally depends on the underlying latent or error-free dimension η, this index can also be interpreted as the extent of “true agreement” among the raters. This interpretation is further corroborated by noting that the observed individual ratings $y_{ij}$ provided by the judges cannot usually be considered very precise (perfect) measures of examined target characteristics in the behavioral and social sciences (i = 1, . . ., n; j = 1, . . ., r).
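As a hedged numerical sketch of the index, the following Python snippet assumes that Equation (5) is the ratio of the common variance, (λ1 + ⋯ + λr)² Var(η), to that quantity plus the sum of the residual variances. The function name `ira_index` is hypothetical. The residual variances are those of the illustration section; the loadings are derived from them on the assumption that each underlying variable has unit variance, so that each loading is the square root of 1 minus the residual variance.

```python
import math

def ira_index(loadings, factor_var, residual_vars):
    """Proportion of variance in the sum (or average) of the underlying
    ratings that is due to the common factor."""
    total_loading = sum(loadings)
    common = total_loading ** 2 * factor_var
    return common / (common + sum(residual_vars))

# Residual variances reported for the simulated data in the illustration
residual_vars = [0.5100, 0.4375, 0.3600, 0.2775, 0.1900]
# Assumption: unit-variance underlying variables, so loading = sqrt(1 - residual)
loadings = [math.sqrt(1.0 - v) for v in residual_vars]

rho1 = ira_index(loadings, factor_var=1.0, residual_vars=residual_vars)
print(round(rho1, 3))
```

Under these assumptions the result agrees to three decimals with the population value of .900 reported in the illustration section.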
Interrater Agreement as “Scale Reliability”
An alternative and related approach to obtaining the IRA index in Equation (5) is to formally use expressions for reliability of the “scale score,” $z = y_1^* + \cdots + y_r^*$. Because Model (2) implies
$$z = \left(\sum_{j=1}^{r} \lambda_j\right)\eta + \sum_{j=1}^{r} \varepsilon_j, \tag{6}$$
we may view for our aims the first term on the right-hand side of Equation (6) as a “true score,” and its second term as an “error score” within the well-known classical test theory decomposition of the “observed score” z. (In this section, we use quotation marks when referring verbally to z, in order to indicate that z is not the usual scale score, as it is not actually observable, but is conceptually treated as such here for the purpose of an alternative justification of the proposed IRA index (5).)
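With this in mind, an application of the scale reliability formula (e.g., Raykov & Marcoulides, 2011) to the expression in Equation (6) leads to
$$\rho_2 = \frac{\mathrm{Var}\left[\left(\sum_{j=1}^{r} \lambda_j\right)\eta\right]}{\mathrm{Var}(z)} = \frac{\left(\sum_{j=1}^{r} \lambda_j\right)^2 \mathrm{Var}(\eta)}{\left(\sum_{j=1}^{r} \lambda_j\right)^2 \mathrm{Var}(\eta) + \sum_{j=1}^{r} \mathrm{Var}(\varepsilon_j)} = \rho_1.$$
An implication of the equality ρ1 = ρ2 shown here is that the proposed IRA index (5) can be alternatively defined as a reliability coefficient of the “scale score” z.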
Interrater Agreement as a Correlation
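A third justification of the proposed IRA index (5) is provided by the following demonstration that this index equals also the degree of linear interrelationship between (a) the common factor η representing the common sources of variance in the underlying rating variables $y_1^*, \ldots, y_r^*$ and (b) their sum $z = y_1^* + \cdots + y_r^*$. Because the unique factors are uncorrelated with η, $\mathrm{Cov}(z, \eta) = \left(\sum_{j=1}^{r} \lambda_j\right)\mathrm{Var}(\eta)$, so that
$$\mathrm{Corr}^2(z, \eta) = \frac{\mathrm{Cov}^2(z, \eta)}{\mathrm{Var}(z)\,\mathrm{Var}(\eta)} = \frac{\left(\sum_{j=1}^{r} \lambda_j\right)^2 \mathrm{Var}(\eta)}{\left(\sum_{j=1}^{r} \lambda_j\right)^2 \mathrm{Var}(\eta) + \sum_{j=1}^{r} \mathrm{Var}(\varepsilon_j)} = \rho_1.$$
Thus, the proposed IRA index (5) is also the squared correlation between the sum of the underlying rating variables $y_1^*, \ldots, y_r^*$ and their common factor η.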
A by-product of the discussion thus far is also that the IRA coefficient ρ1 equals the $R^2$ index associated with the regression of the sum z of the underlying rating variables on their common source of variability, η.
Interrater Agreement as Interrater Reliability
In the present setting of fixed raters, it is informative to revisit the corresponding developments in Shrout and Fleiss (1979; see also Shrout & Lane, in press) with regard to their interrater reliability (IRR) index ICC(3, r), under the assumptions made there in its derivation. The latter index was shown by those authors to represent the reliability of the average observed rating.
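It can be directly demonstrated that the IRA index (5) of this article equals the ICC(3, r) index in Shrout and Fleiss (1979; see also Shrout & Lane, in press) when applied to the underlying latent variables $y_1^*, \ldots, y_r^*$. To this end, let us rewrite the equation in Model (2) that pertains to the ith target when evaluated by the jth rater,
$$y_{ij}^* = \lambda_j \eta_i + \varepsilon_{ij},$$
in the following useful way:
$$y_{ij}^* = \lambda_{\bullet}\,\eta_i + (\lambda_j - \lambda_{\bullet})\,\eta_i + \varepsilon_{ij}, \tag{10}$$
where $\lambda_{\bullet}$ is the average factor loading in Model (2), that is, $\lambda_{\bullet} = (\lambda_1 + \cdots + \lambda_r)/r$ (i = 1, . . ., n; j = 1, . . ., r).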
In the terminology of Shrout and Fleiss (1979), the first term on the right-hand side of Equation (10) can be interpreted as the “true score” associated with the ith target, considering formally
which they showed in their manifest variable context to represent the reliability of the average observed rating, assuming that the error variances were equal (and denoted
For our purposes in this section, we can formally use Equation (11) with regard to the underlying latent variables, in order to define the IRR of the average underlying rating. We emphasize that there is no need to assume equality of the corresponding error variances within our LVM approach. When these error variances are equal, however, using the same developments as in Shrout and Fleiss (1979) that yield Equation (11), one can show that the IRA coefficient (5) in this article equals their IRR coefficient (11) when applied to the underlying latent rating variables $y_1^*, \ldots, y_r^*$.
To this end, we first note that the terms on the right-hand side of Equation (11) are variance components, which were obtained from an appropriate analysis of variance on observed scores in Shrout and Fleiss (1979; assuming equal error variances; see also Shrout & Lane, in press). Therefore, to use the logic behind their formula (11) within our LVM-based approach, we need to work out the “true score” variance and “error” variance with regard to the underlying latent variable scores $y_{ij}^*$.
where
In Equation (13), ICC*(3, r) designates the IRR coefficient ICC(3, r) in Equation (11) from Shrout and Fleiss (1979) when applied to the average rating for the underlying rater latent variables $y_1^*, \ldots, y_r^*$.
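The equality discussed in this section can be sketched as follows. The derivation is illustrative rather than a reproduction of the original equations: it assumes that the reliability of an average of r ratings has the familiar variance-ratio form $\sigma_T^2 / (\sigma_T^2 + \sigma_E^2 / r)$, and the symbols $\sigma_T^2$ (“true score” variance) and $\sigma_E^2$ (common error variance) are introduced here only for this purpose. Averaging the rater-specific equations over the r raters removes the deviations of the loadings from their mean, since these deviations sum to zero.

```latex
\begin{align*}
\bar{y}_i^* &= \frac{1}{r}\sum_{j=1}^{r} y_{ij}^*
  = \lambda_{\bullet}\,\eta_i + \frac{1}{r}\sum_{j=1}^{r}\varepsilon_{ij}, \\
\sigma_T^2 &= \lambda_{\bullet}^2\,\mathrm{Var}(\eta), \qquad
\sigma_E^2 = \mathrm{Var}(\varepsilon_j)
  \quad \text{(assumed equal for all } j\text{)}, \\
\frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2 / r}
  &= \frac{r^2 \lambda_{\bullet}^2\,\mathrm{Var}(\eta)}
          {r^2 \lambda_{\bullet}^2\,\mathrm{Var}(\eta) + r\,\sigma_E^2}
  = \frac{\bigl(\sum_{j} \lambda_j\bigr)^2 \mathrm{Var}(\eta)}
         {\bigl(\sum_{j} \lambda_j\bigr)^2 \mathrm{Var}(\eta)
          + \sum_{j} \mathrm{Var}(\varepsilon_j)} .
\end{align*}
```

The last expression is the common-variance proportion that defines the IRA index, using $r\lambda_{\bullet} = \sum_j \lambda_j$ and, under equal error variances, $r\sigma_E^2 = \sum_j \mathrm{Var}(\varepsilon_j)$.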
Likewise, one can also show that the IRA index (5) proposed in this article equals the longitudinal reliability coefficient for fixed time points in Laenen, Alonso, and Molenberghs (2009), if applied to the underlying rater variables $y_1^*, \ldots, y_r^*$.
Testing Identity in Rating Thresholds
As indicated earlier, testing for identity of the rater-specific thresholds under the present LVM-based approach can shed further light on the nature of IRA in an empirical study. To test the hypothesis that the set of thresholds in Equation (1) are the same across all raters, one tests for significance the difference in fit of two nested models (e.g., Muthén & Muthén, 2010). Specifically, these are (a) Model (2) and (b) the same model after imposing the constraints of equal thresholds, that is, Model (2) with the restrictions
$$\tau_{1k} = \tau_{2k} = \cdots = \tau_{rk}, \qquad k = 1, \ldots, m - 1. \tag{14}$$
The test of the null hypothesis (14) is readily carried out by fitting both models described in (a) and (b) to the analyzed rating data set and evaluating the significance of the (corrected) difference in associated chi-square values (e.g., Muthén & Muthén, 2010; see next section for an illustration).
Interval Estimation of Interrater Agreement
Because the proposed IRA index (5) is defined as a proportion of common variance, one can readily construct a confidence interval for this index. To this end, once Model (2) is fitted and found plausible for a given data set, a large-sample standard error for the IRA index (5) can be obtained using the popular delta method in a first step (e.g., Raykov & Marcoulides, 2004). This is achieved by introducing an “external” parameter in Model (2), which is defined as the right-hand side of Equation (5) (see the appendixes for Mplus source code and pertinent model constraint accomplishing this aim). Using this standard error and the observation that Equation (5) is a bounded ratio of two variances that cannot be negative or larger than 1, the confidence interval procedure outlined in Raykov and Marcoulides (2011, chap. 7) can be directly applied in a second step via the logit transformation and then its inverse (the logistic transformation) to obtain a confidence interval of the proposed IRA index. This second step can be achieved using the R function “ci.ira” presented in Appendix C. The IRA interval estimation procedure is demonstrated in the illustration section.
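The second step can be sketched in Python as follows. This is not the R function “ci.ira” referred to above, whose code is not reproduced here, but a minimal analogue under stated assumptions: the function name `ci_ira` is hypothetical, the standard error on the logit scale is obtained by the delta method as SE / [ρ(1 − ρ)], and the standard error value used in the example (.006) is hypothetical and chosen only for illustration.

```python
import math

def ci_ira(estimate, se, z=1.959964):
    """Logit-based confidence interval for a parameter bounded in (0, 1).

    estimate : point estimate of the index (strictly between 0 and 1)
    se       : its large-sample standard error
    z        : standard normal quantile (default gives a 95% interval)
    """
    if not 0.0 < estimate < 1.0:
        raise ValueError("estimate must lie strictly between 0 and 1")
    logit = math.log(estimate / (1.0 - estimate))
    # Delta method: d logit(p) / dp = 1 / (p (1 - p))
    se_logit = se / (estimate * (1.0 - estimate))
    lower_logit = logit - z * se_logit
    upper_logit = logit + z * se_logit
    # Back-transform the endpoints with the logistic function
    def expit(x):
        return 1.0 / (1.0 + math.exp(-x))
    return expit(lower_logit), expit(upper_logit)

# Hypothetical standard error; the estimate .905 is the one reported later
lower, upper = ci_ira(0.905, 0.006)
print(round(lower, 3), round(upper, 3))
```

Because the endpoints are back-transformed from the logit scale, the interval always lies within (0, 1) and is generally not symmetric around the point estimate.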
Identification of Possible “Outlying” Raters
In certain empirical settings, it is possible that one or more of the judges do not evaluate the targets in a manner consistent with the remaining judges (or a certain group of them). In such cases, it is of relevance to be in a position to identify possibly “outlying” judges. The IRA evaluation procedure proposed in this article can be straightforwardly used to address this issue as well. Specifically, once Model (2) is fitted to the data and found plausible, one compares the factor loading confidence intervals that can be requested from the software used (e.g., Muthén & Muthén, 2010) across the raters. A judge whose confidence interval lies entirely below (or above) the confidence intervals of all remaining evaluators of interest may be considered possibly aberrant in his or her ratings (see Note 2).
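The comparison of confidence intervals just described can be sketched in Python. The function name `flag_outlying_raters`, the rater labels, and the interval values below are all hypothetical; in practice the intervals would be taken from the software output for the fitted Model (2).

```python
def flag_outlying_raters(intervals):
    """Flag raters whose loading CI lies entirely below or above
    the loading CIs of all other raters.

    intervals : dict mapping a rater label to a (lower, upper) tuple
    returns   : dict mapping each flagged rater to "below" or "above"
    """
    flagged = {}
    for rater, (low, high) in intervals.items():
        others = [ci for name, ci in intervals.items() if name != rater]
        if not others:
            continue  # nothing to compare a single rater against
        if all(high < other_low for other_low, _ in others):
            flagged[rater] = "below"
        elif all(low > other_high for _, other_high in others):
            flagged[rater] = "above"
    return flagged

# Hypothetical 95% CIs of factor loadings for four raters
cis = {
    "rater1": (0.66, 0.74),
    "rater2": (0.70, 0.78),
    "rater3": (0.68, 0.77),
    "rater4": (0.30, 0.45),
}
flagged = flag_outlying_raters(cis)
print(flagged)
```

A flagged rater is only a candidate for closer inspection; the rule is a descriptive screen rather than a formal test.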
We illustrate next the outlined IRA evaluation procedure with an example.
Illustration on Data
In this section, to demonstrate the utility and applicability of the proposed IRA index (5) and related evaluation procedures, we employ simulated data. (A main reason for using simulated data here is to have a knowledge of all underlying model parameters, and hence the resulting possibility to compare them with the results obtained using the outlined procedure in this article.) Specifically, first we generate data for n = 1,000 “targets” and r = 5 “raters” using a Likert-type scale with m = 4 possible values (1 through 4, say). The data are simulated under the following model (see Equation 2):
$$\begin{aligned} y_1^* &= .70\,\eta + \varepsilon_1, \\ y_2^* &= .75\,\eta + \varepsilon_2, \\ y_3^* &= .80\,\eta + \varepsilon_3, \\ y_4^* &= .85\,\eta + \varepsilon_4, \\ y_5^* &= .90\,\eta + \varepsilon_5, \end{aligned} \tag{15}$$
where η is a zero-mean normal variable with variance Var(η) = 1, and ε1 through ε5 are independent normal zero-mean residual terms with variances .5100, .4375, .3600, .2775, and .1900, respectively, while the thresholds are $\tau_{1,1} = \cdots = \tau_{5,1} = .20$, $\tau_{1,2} = \cdots = \tau_{5,2} = .50$, and $\tau_{1,3} = \cdots = \tau_{5,3} = .80$ (see Note 3). (Further details on the simulation procedure can be obtained from the authors on request.)
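A data set of this kind can be generated with a short Python sketch. This is not the authors' simulation procedure, whose details are available only on request; the function name `simulate_ratings`, the seed, and the threshold convention (rating k when the latent value lies above the (k − 1)th and at or below the kth threshold) are assumptions. The loadings are derived from the stated residual variances on the assumption that each underlying variable has unit variance.

```python
import random
from bisect import bisect_left

def simulate_ratings(n, loadings, residual_vars, thresholds, seed=0):
    """Simulate ordinal ratings from a single-factor model with
    normally distributed factor and residuals, then coarsen them."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        eta = rng.gauss(0.0, 1.0)  # common factor, Var(eta) = 1
        row = []
        for lam, theta in zip(loadings, residual_vars):
            y_star = lam * eta + rng.gauss(0.0, theta ** 0.5)
            # rating = 1 + number of thresholds strictly below y*
            row.append(1 + bisect_left(thresholds, y_star))
        data.append(row)
    return data

residual_vars = [0.5100, 0.4375, 0.3600, 0.2775, 0.1900]
# Assumption: unit-variance underlying variables
loadings = [(1.0 - v) ** 0.5 for v in residual_vars]
thresholds = [0.20, 0.50, 0.80]  # same for all five raters

ratings = simulate_ratings(1000, loadings, residual_vars, thresholds)
share_lowest = sum(row[0] == 1 for row in ratings) / len(ratings)
print(len(ratings), len(ratings[0]), round(share_lowest, 2))
```

With unit-variance underlying variables and a first threshold of .20, somewhat more than half of the ratings of each judge are expected to fall in the lowest category.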
We start by examining the latent structure associated with the resulting data set, which allows us also to test if the thresholds are the same across the five “judges.” To this end, we first fit the single-factor model with categorical indicators (see Equation 2) and with thresholds not constrained to be equal across “raters.” This model, referred to as full model below, is associated with the following tenable goodness-of-fit indexes: chi-square (χ²) = 6.588, degrees of freedom (df) = 5, p value (p) = .253, and root mean square error of approximation (RMSEA) = .018. (The needed Mplus source code is provided in Appendix A.) Adding the corresponding equality constraints (14) for the triple of thresholds across raters (see Note in Appendix A for needed code) leads to a tenable model as well, which is referred to as restricted model below: χ² = 27.020, df = 17, p = .058, RMSEA = .024. Next, to test the thresholds for identity across raters, we use the corrected difference in chi-square values, which incorporates appropriately the chi-square fit indexes and degrees of freedom of the full and restricted models (Muthén & Muthén, 2010; Satorra, 2000; see Note 2 in Appendix A for code). The associated test statistic is thereby found to be Δcχ² = 20.181, df = 12, p = .064. This result indicates plausibility of the assumption of equal thresholds across judges, which is a correct finding since this identity was built into the data generation process.
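The p value attached to the reported difference statistic can be checked with a small Python sketch. The sketch does not reproduce the correction itself (the reported value of 20.181 differs from the simple difference 27.020 − 6.588 = 20.432); it only converts a given chi-square value into an upper-tail probability. The function name `chi2_sf_even_df` is hypothetical, and the closed-form expression used holds only for even degrees of freedom.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df.

    For df = 2k the survival function has the closed form
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0  # i = 0 term
    for i in range(1, df // 2):
        term *= half / i    # builds (x/2)^i / i! recursively
        total += term
    return math.exp(-half) * total

# Reported corrected difference statistic and its degrees of freedom
p_value = chi2_sf_even_df(20.181, 12)
print(round(p_value, 3))
```

The 12 degrees of freedom correspond to the 3 thresholds being equated across 5 raters, that is, 3 × (5 − 1) constraints.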
Given the plausibility of the hypothesis of equal thresholds across raters, we proceed with the estimation of the IRA coefficient (5) proposed in this article. To this end, we include in the restricted model the definition of this coefficient (see Equation 5) in the form of an “external parameter.” (The needed Mplus source code is given in Appendix B; note that the pertinent “model constraint” does not affect the fit of the restricted model or its degrees of freedom.) Table 1 contains the parameter estimates obtained with the last model, along with their standard errors and 95% confidence intervals (95% CIs). We note that the parameter estimates are close to the true values, which are well covered by the associated confidence intervals.
Table 1. Factor Loadings and Interrater Agreement (IRA) Index Estimates, Standard Errors, and 95% Confidence Intervals for the Illustration Example.
Note. 95% CI = confidence interval at .95 level; φ = Var(η) = latent variance (see Equations 15); — = not applicable (because of pertinent parameter being fixed, for identification purposes). Estimates of residual variances, not presented in this table, are obtained by subtracting from 1 the square of the associated factor loading (Muthén & Muthén, 2010; see also Note 3).
See Appendix C for obtaining this confidence interval (using the R function “ci.ira,” with substituted IRA estimate and SE reported here; see also Note 2).
As can be seen from Table 1, the resulting IRA index estimate is .905, with a 95% CI of [.892, .916].
Because in this example we know the true parameters used for data simulation (see Equations 15 and following discussion), we can work out the population IRA coefficient by substituting them into its defining Equation (5). This yields the latter IRA index as ρ1 = .900, which is very close to its reported estimate of .905 and covered by the 95% CI [.892, .916] obtained above. As an alternative approach to examining IRA, one may evaluate the IRA in this example using the widely used intraclass correlation (ICC) estimation procedure in Shrout and Fleiss (1979; see also Shrout & Lane, in press), which we stress however is an observed-variable-only approach to IRA evaluation. Applying that procedure to the raw data set here directly, one obtains .845 as the ICC for rater consistency (see Shrout & Lane, in press, for the SPSS estimation input file needed thereby). This ICC estimate of IRA, namely .845, is considerably below the true agreement coefficient obtained here (ρ1 = .900) and is also notably below the left endpoint of its 95% CI [.892, .916]. This finding can be explained by the realization that the IRA index (5), as shown earlier, is a squared correlation between underlying latent variables, namely, the sum of the “true” ratings $y_1^*, \ldots, y_r^*$ and their common factor η, whereas the ICC is computed from the coarsely measured observed ratings.
Discussion and Conclusion
This article was concerned with the evaluation of IRA in behavioral, educational, social, and biomedical research. A latent variable-based IRA index was proposed, and a procedure for its point and interval estimation that uses widely circulated popular software was outlined. The proposed IRA index was shown to be derivable using different analytic approaches and, at the level of latent variables, consistent with certain IRA indexes available at the manifest level. This IRA index can be interpreted also as a “true agreement” among the raters, given that their individual ratings cannot be often considered to be very precise (error free) in the behavioral and social disciplines.
This article contributes to the literature on IRA evaluation by proposing an index developed within an alternative framework in comparison with traditional methods of IRA estimation. The framework is inherently concerned with the underlying possible sources of rater agreement in terms of one or more underlying latent dimensions. This IRA index (5) is also instrumentally related to the query of whether raters use the same thresholds leading up to their typically ordinal data. The possibility to test for this identity is not shared with many extant IRA methods and seems to be one particular benefit of the approach adopted in this article.
Although the preceding discussion in the article was couched nearly exclusively in terms of a single latent variable (factor, dimension) underlying the raters’ evaluations, it is readily extended to the case where more than one latent variable represents sources of rater agreement. To this end, one can (a) use the IRA index definition as proportion of common variance in the normal variables underlying each individual rater’s evaluations (e.g., Rabe-Hesketh & Skrondal, 2008) and then (b) apply the method in Raykov and Shrout (2002) to obtain point and interval estimates of this proportion, that is, the IRA index of this article in that case.
Another extension of the present IRA estimation procedure is easily obtained for the case of clustered targets, such as teachers nested within schools, patients nested within facilities, respondents nested within cities, students nested within classes, and so forth. Such nesting may be more often the case than initially thought in the IRA literature. A particular benefit stemming from adopting the LVM approach of this article resides in the possibility to address many realistic empirical settings where the objects of (coarse) measurement by a set of judges are clustered in higher order units, such as facilities, schools, classes, cities, and so on. In such circumstances, the assumption of independence of the objects, which underlies the majority of traditional IRA estimation methods, is no longer fulfilled. The resulting violations of this assumption are readily handled with this article’s LVM-based procedure, namely by fitting Model (2) as a two-level model while accounting for the clustering effects of the targets within corresponding higher level units. (At the software level, this is readily achieved by using the corresponding two-level model fitting and parameter estimation approach as implemented in Mplus; Muthén & Muthén, 2010.) Similarly, under the assumption of data missing at random (MAR; e.g., Little & Rubin, 2002), the procedure of point and interval estimation of the IRA index (5) is straightforwardly applicable as well. With violations of MAR, inclusion of auxiliary variables “predictive” of the missing values is advisable (e.g., Enders, 2010), which is readily accommodated within the estimation procedure outlined in this article.
The proposed method is directly applicable also in cases where one is interested in including covariates in the analysis. Specifically, one could regress the latent variable(s) η on the covariates in the pertinent extension of Model (2) used for IRA index estimation to explain individual differences in the common continuum underlying the observed ratings. In addition, one can also examine whether individual raters’ evaluations could be affected uniquely by particular covariates, that is, over and above their effect that is assumed to be mediated by the latent variable(s). In the affirmative case (e.g., findings of significant and large/small unique regression parameters for individual raters), such a covariate-extended model may provide a possible explanation for the reasons why particular judges may be rendering inconsistent evaluations relative to the remaining raters.
Limitations of the outlined IRA evaluation approach stem from its requirement for large samples with regard to targets. The reason is that the underlying modeling method is based on an asymptotic statistical theory (Muthén, 1984). We stress that there is no restriction with respect to raters: their number can be as low as 2, so long as Model (2) is identified (which it will then be with the additional assumption of equal factor loadings). The large-sample requirement will be particularly important with a sizable number of judges. Although no specific guidelines are at present available for determining appropriate sample size, because of a multitude of rather complicated issues involved, it may be conjectured that with more than, say, 10 judges, it could be advisable to have perhaps more than 1,000 targets being evaluated by them. Such evaluations typically occur in the context of large-scale assessment programs where multiple raters evaluate the performance of a large number of subjects, say, in educational assessments (e.g., Congdon & McQueen, 2000; Wolfe, 1996), licensure examinations (e.g., Raymond et al., 2011), and so forth. One may also submit that with fairly large fractions of missing values, the results of the outlined procedure should be interpreted with a great deal of caution. We strongly encourage future research in the direction of developing possible guidelines for sample size in relation to number of judges, targets, and fraction of missing data.
In conclusion, this article offers a readily applicable, LVM-based approach to point and interval estimation of IRA, which extends the arsenal available to empirical social and behavioral researchers involved in the measurement of personal characteristics.
Footnotes
Appendix A
Appendix B
Appendix C
Acknowledgements
We are indebted to R. Brennan and P. Shrout for valuable discussions on interrater reliability and to T. Asparouhov for software simulation procedure details. We are grateful to B. May, J. Piert, and C. Dunbar for posing questions related to the issues discussed in this article and to two anonymous referees for valuable criticism on an earlier version of the article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
