Abstract
Existing tests of interrater agreement have high statistical power; however, they lack specificity. If the ratings of the two raters do not show agreement but are not random, the current tests, some of which are based on Cohen’s kappa, will often reject the null hypothesis, leading to the wrong conclusion that agreement is present. A new test of interrater agreement, applicable to nominal or ordinal categories, is presented. The test statistic can be expressed as a ratio (labeled QA, ranging from 0 to infinity) or as a proportion (labeled PA, ranging from 0 to 1). This test weighs information supporting agreement against information supporting disagreement. This new test’s effectiveness (power and specificity) is compared with five other tests of interrater agreement in a series of Monte Carlo simulations. The new test, although slightly less powerful than the other tests reviewed, is the only one sensitive to agreement only. We also introduce confidence intervals on the proportion of agreement.
Keywords
Introduction
Quantifying interrater agreement is useful in contexts where two raters must judge into what categories a series of observations shall be classified. If agreement between raters is perfect, all judgments will be identical, whereas if both raters use completely different criteria, agreement will occur only by chance. Agreement is often assessed using Cohen’s kappa (Cohen, 1960), for which a value greater than 0.40 is commonly considered a moderate agreement (Landis & Koch, 1977). Recently, Kraemer, Periyakoil, and Noda (2002) explained that a Cohen’s kappa for more than two categories is equal to a weighted average of individual kappas. The individual kappa
As an example, consider the following situation in which two psychiatrists examine N = 100 patients suffering from depression in order to appraise the category of depression (this example is inspired from von Eye, Schauerhuber, & Mair, 2006). The psychiatrists used k = 3 categories of severity. Raters’ categorizations are summarized in Table 1 using a
Example Data of Ratings Performed by Two Raters (Psychiatrist 1 and Psychiatrist 2) Appraising the Severity of Depression of 100 Patients (N = 100) Using Three Categories of Depression (k = 3).
As seen, both raters share the belief that most patients belong to the first category of severity. This high prevalence does not explain the large number of concordant judgments in that category, as chance would only predict approximately 70 agreement ratings for the first category. Hence, there is good agreement regarding Category 1. On the other hand, the two psychiatrists rarely agree on Categories 2 and 3. For example, Psychiatrist 2 found a total of 8 cases belonging to the third category of severity; of those, only 2 cases are also put in the third category by Psychiatrist 1, a figure below what chance would predict. Instead, the results seem to indicate that what looks like a Level 2 depression to Rater 1 is a Level 3 depression to Rater 2 and vice versa. This suggests disagreement between the two raters on these two categories.
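The comparison with "what chance would predict" rests on the expected cell frequencies under independence, that is, row total times column total divided by N. The following is a minimal sketch of that computation. The 3 × 3 table in it is hypothetical and is not the data of Table 1; the function name `expected_counts` is likewise only illustrative.

```python
def expected_counts(observed):
    """Expected frequencies under independence: row total * column total / N."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical ratings (rows: Rater 1, columns: Rater 2); NOT the paper's Table 1.
table = [
    [40, 5, 5],
    [4, 6, 15],
    [6, 14, 5],
]
expected = expected_counts(table)

# Compare each diagonal cell with its chance expectation.
for i, row in enumerate(table):
    print(f"Category {i + 1}: observed {row[i]}, expected {expected[i][i]:.2f}")
```

In this hypothetical table, the first category shows more agreements than chance predicts, whereas the second and third show slightly fewer, a pattern of the same kind as the one described above.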
Overall, 86% of the cases are judged identically by the two raters (the cases found in the main diagonal). Should we conclude that, overall, the two raters are in good agreement? Cohen’s kappa suggests a moderate but significant agreement (
Similarly, consider the results of Table 2. In this example, 127 cases were classified by two raters in one of 5 categories. Again, agreement is not clear. The raters seem to agree well regarding Categories 2 and 5. They also agree very well on Category 1 (nearly half of the ratings in this category are agreements). However, they agree little regarding Categories 3 and 4. Inspection of the data in Table 2, outside the diagonal, shows that the two raters have opposite interpretations regarding Categories 3 and 4: cases that are instances of Category 3 for Rater 1 are instances of Category 4 for the other. The two raters are not responding randomly, but they are not agreeing. Hence, whereas the rate of agreement is moderately good (32.3%), it does not mean that the two raters agree a little on all categories. The results instead suggest that the raters agree well on a few categories. Overall agreement is missing. Yet Cohen’s kappa is weak but still significantly different from zero (
An Example Where Agreement for Three Categories Accompanies Disagreement for Two Categories.
Note. Contrary to Table 1, there is no significant difference in prevalence between the categories.
These two examples show the need for a statistic of agreement that can distinguish situations with much agreement for a few categories from situations with some agreement for most categories.
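To make the limitation concrete, recall that Cohen's kappa is (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of diagonal cases and p_e the proportion expected from the marginals. The sketch below computes it for a hypothetical table (not Table 1 or Table 2); the function name `cohen_kappa` is illustrative.

```python
def cohen_kappa(observed):
    """Cohen's (1960) kappa from a square table of observed frequencies."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    # Observed agreement: proportion of cases on the main diagonal.
    p_obs = sum(observed[i][i] for i in range(len(observed))) / n
    # Chance agreement: sum of products of the matching marginal proportions.
    p_exp = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical ratings: agreement on Category 1, swapped use of Categories 2 and 3.
table = [
    [40, 5, 5],
    [4, 6, 15],
    [6, 14, 5],
]
kappa = cohen_kappa(table)
print(round(kappa, 3))
```

Only the diagonal and the marginal totals enter the formula, so a positive kappa here reflects the surplus in the first category alone and says nothing about where the off-diagonal cases concentrate.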
The purpose of this article is to present two new measures of interrater agreement, which we call QA and PA. The first is more convenient as it uses the Fisher F tables for critical values, whereas the second is possibly more intuitive, as it is a proportion between 0 and 1. In the next section, we describe these measures, which can be used on nominal or ordinal classifications. Next, we examine the reliability of these measures along with their confidence intervals. Finally, we assess the statistical power but, more important, the specificity of this approach, that is, the ability of a test based on QA (or PA) to distinguish mixtures of agreeing and disagreeing category ratings from solely agreeing ratings. The approach developed is akin to analysis of variance, as it partitions the variance found in each cell (not just those in the main diagonal) as supporting agreement or supporting disagreement.
In the remainder of this article, the judgments are structured in the form of a square k × k table of observed frequencies. Observed frequencies in cell {i, j} will be noted oi,j. As per the
Pearson’s chi-square test of independence has been used on occasion to ascertain the significance of the agreement. However, this test only examines the null hypothesis (
The reader can find a brief description of five alternative tests of interrater agreements in Appendix A. These tests will be compared to the present approach in the subsequent section.
Measures of the Global Structure of the Judgment Matrix
The new approach is a nonparametric one based on the same assumptions as Cohen’s kappa (Brennan & Prediger, 1981). As will be seen, it is sensitive because it uses information outside the main diagonal as well as in the main diagonal. The crucial observation is that high agreement rates in the diagonal cells necessarily imply a shortage of cases outside the diagonal, and conversely. Hence, the whole matrix, not just the diagonal, is informative as to whether there is agreement or not.
The present approach partitions the matrix based on whether the observed cell frequencies deviate from chance (judged by whether the observed count is different from its expected value) and whether such ratings are supportive of agreement or not.
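The partition can be sketched in a few lines. A diagonal cell with more cases than expected, or an off-diagonal cell with fewer cases than expected, is counted as supporting agreement; the two opposite cases are counted as supporting disagreement. The sketch only labels the cells and does not compute the components of the statistic. The table is hypothetical, the function names are illustrative, and treating a cell whose count exactly equals its expectation as neutral (0) is an assumption made here.

```python
def expected_counts(observed):
    """Expected frequencies under independence: row total * column total / N."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    return [[r * c / n for c in col_totals] for r in row_totals]


def agreement_signs(observed):
    """Label each cell +1 (supports agreement), -1 (supports disagreement), or 0."""
    expected = expected_counts(observed)
    k = len(observed)
    signs = [[0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            diff = observed[i][j] - expected[i][j]
            if diff == 0:
                continue  # assumption: an exact match with chance is neutral
            surplus = diff > 0
            on_diagonal = i == j
            # Surplus on the diagonal or shortage off the diagonal favors agreement.
            signs[i][j] = 1 if surplus == on_diagonal else -1
    return signs


# Hypothetical ratings (rows: Rater 1, columns: Rater 2).
table = [
    [40, 5, 5],
    [4, 6, 15],
    [6, 14, 5],
]
signs = agreement_signs(table)
for row in signs:
    print(row)
```

For this table, every cell involving Category 1 supports agreement, while the four cells crossing Categories 2 and 3 support disagreement.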
More formally, under
The first statistic, QA, is a ratio of cells favorable to agreement against cells unfavorable to agreement, the former being those that (i) on the main diagonal have counts higher than expected and (ii) off the main diagonal have counts smaller than expected, whereas the latter are those that (iii) on the main diagonal have counts smaller than expected and (iv) off the main diagonal have counts larger than expected. Hence, QA is computed as
where the terms in the sums (the components) will be added depending on the sign of the difference between the observed frequency and the expected frequency using
QA values range from 0 to +∞, and under
The second statistic, PA, is the proportion of the total variance that supports agreement. It can be obtained from the z scores or from QA with
The statistic PA is akin to the eta squared (η2) statistic and ranges from 0 to 1. When the ratings are random, its central value is ½; that is, 50% of the variance in the ratings suggests agreement and the other 50% suggests disagreement. Hence, a value of ½ supports neither interpretation and indicates random ratings. As found in the literature (e.g., Forbes, Evans, Hastings, & Peacock, 2010), a ratio of the form
PA and QA are totally interchangeable. To detect whether the ratings deviate from random, one option is to use QA. The null hypothesis is
Under the null hypothesis, approximately half of the cells,
where
Appendix B shows how QA and PA can be computed with the SPSS statistical package. It also shows how to get the p value of the test and confidence intervals for PA for any level
Interpretation of the Statistics PA.
An Illustration With Computation of Formulas
We illustrate the computation of the QA statistic with an example involving two raters having to examine and codify 200 observations into a system of k = 5 categories, labelled 1 to 5. The judgment matrix and the marginal sums
Cross-Classification Data From Two Simulated Raters, Indicating (Top) the Number of Ratings oi,j in Each i, j Cell, (Middle) Its Corresponding Theoretical Value
In this example, the component (i),
This value, compared to an F critical value at α = .05, F(8, 8) = 3.438, suggests that the raters are in agreement, QA(8, 8) = 8.23, p < .05.
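The comparison with the F distribution can be checked numerically. The sketch below is not the authors' SPSS code from Appendix B; it uses only the Python standard library and relies on the fact that, when both degrees of freedom are even, the upper tail of the F distribution reduces to a finite binomial sum. The function name `f_upper_tail_even_df` is illustrative, and the degrees of freedom (8, 8) are simply those reported above.

```python
from math import comb


def f_upper_tail_even_df(f_value, df1, df2):
    """P(F > f_value) for an F(df1, df2) variable; exact only for even df1 and df2."""
    if df1 % 2 or df2 % 2:
        raise ValueError("this closed form requires even degrees of freedom")
    a, b = df1 // 2, df2 // 2
    # P(F <= f) equals the regularized incomplete beta function I_x(a, b) at this x.
    x = df1 * f_value / (df1 * f_value + df2)
    n = a + b - 1
    # For integer a and b, I_x(a, b) = P(Binomial(n, x) >= a), so the upper
    # tail of F is P(Binomial(n, x) <= a - 1).
    return sum(comb(n, j) * x ** j * (1 - x) ** (n - j) for j in range(a))


p_at_critical = f_upper_tail_even_df(3.438, 8, 8)  # close to .05 by construction
p_value = f_upper_tail_even_df(8.23, 8, 8)         # right-tail p for QA = 8.23
print(f"{p_at_critical:.4f} {p_value:.4f}")
```

For arbitrary degrees of freedom, a general-purpose routine such as `scipy.stats.f.sf` would be the usual choice.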
Note that the sum of all the components of QA,
The second statistic, PA, is obtained with
For this example, five statistical tests (the four from Appendix A and QA) make the same decision, rejecting the null hypothesis even at the 0.01 level. However, the same does not happen for the previous examples of Tables 1 and 2 where QA alone concludes that agreement is absent. Table 5 summarizes the results for the examples of Tables 1, 2, and 4.
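As a check on the two forms of the statistic, the conversion between them can be worked out from their definitions. The derivation below is a sketch under one assumption: that QA is the ratio of the summed components supporting agreement to the summed components supporting disagreement, written here with the illustrative symbols S_A and S_D (these symbols are not the article's notation). PA, being the share of the total that supports agreement, then follows directly, and so does its value for the example with QA = 8.23.

```latex
\begin{align*}
Q_A &= \frac{S_A}{S_D}, \qquad
P_A = \frac{S_A}{S_A + S_D} = \frac{Q_A}{1 + Q_A}, \qquad
Q_A = \frac{P_A}{1 - P_A}, \\[4pt]
P_A &= \frac{8.23}{1 + 8.23} = \frac{8.23}{9.23} \approx 0.892 .
\end{align*}
```

Under this reading, QA = 1 corresponds to PA = ½, the value expected for random ratings.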
Note. *Denotes the test assuming the indifference principle. n.s. = nonsignificant at the .05 level.
Reliability of the Statistics and Their Distributions
To examine the merit of the present approach, we explored the statistic PA using Monte Carlo simulations (the same results were obtained exploring QA). In particular, we examined whether the scores’ distributions correspond to their theoretical counterparts. To do so, we generated agreement matrices with various amounts of true agreement and checked whether the theoretical 95% confidence intervals contained the results of 95% of the simulations (using the same methodology as in Harding, Tremblay, & Cousineau, 2014). We also checked 99% and 99.9% confidence intervals but the results were comparable and so we do not report the findings.
We manipulated the sample sizes (N, from 50 to 500 in increments of 50), the number of categories k (from 4 to 25), and the true probability of an agreement, ρA. Details are given in Appendix C.
A subset of the results is shown in Figure 1. The first thing to note is that the theoretical confidence intervals (shown using error bars) match very closely the limits within which 95% of the simulated PA fall (shown using gray areas). The theoretical error bars are the least accurate for small numbers of categories and strong effect sizes, where they overestimate the spread of the results.

Mean PA (thick line), 95% confidence intervals (error bars), and spread where 95% of the simulated PA fell (gray area) as a function of sample size N, when ρA is varied from 0 to 0.10 (columns) and k is varied from 5 to 10 (rows). We see that PA is biased downward in the first column: with no effect, mean PA should be equal to zero for all N and all k.
Less visible but more critical, the average PA (as well as the average QA, not shown) is biased downward, underestimating the strength of the agreement. This bias is visible in the left panels where there is no agreement (ρA = 0), more so for large numbers of categories (e.g., k > 8) and for small sample sizes. We could not find a simple way to undo the bias, which would be suitable to all k. Because of the downward bias, these measures are biased toward detecting disagreement and against detecting agreement. This is why these measures should not be used for detecting disagreement (performing a left tail test of QA or PA). The bias results in a conservative test (i.e., less powerful than it could be). However, as we will see next, this has little impact on its specificity.
Sensitivity and Specificity of the Test
In order to evaluate the relative merits of the present test of agreement, we ran three more series of Monte Carlo experiments, comparing our ratio test to four other tests of agreement. In the first series, we explored statistical power by manipulating the true rate of agreement ρA from 0 to 1. In all the simulations, we manipulated the number of categories k (from 5 to 25), the total sample sizes N (from 50 to 500), and the significance levels (α, .05 and .01). The statistical tests to be compared (described in Appendix A) are the following:
QA, the ratio of agreement test (equivalent to PA)
We included the
Apart from being powerful, a good test should also be mostly, if not exclusively, sensitive to agreeing judgments, and not to any other type of consistent categorizations. For example, if observations classified as instances of Category 2 by the first rater are systematically put in Category 3 by the second rater (as seen in Table 1), this is a consistent pairing, but not an agreement, and the tests should not reject
The purpose of the last two series of Monte Carlo experiments is to evaluate the specificity of the tests. As before, we manipulated the parameters ρC, k, and N. However, in the present context, the parameter ρC represents the rate of consistent categorizations between raters. This rate is large when raters have consistent decision rules even if these rules are not in agreement. Overall, with large ρC, cases will be more frequent in cells not necessarily on the main diagonal.
In these two series of simulations, there is no overall agreement, but possibly accidental agreements for one or a few categories. Hence, the probability of rejecting
To control precisely for the presence of accidental agreements, the two series present two conditions. In the Random with possible coincident pairings condition, Category “x” from Rater 1 is coupled to some Category “y” of Rater 2, where y may be any category among 1 to k, including by chance the same as Category x (an agreement for this category). In the Random excluding coincident pairings condition, there cannot be a single agreeing pair in the main diagonal.
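The two conditions can be illustrated with a small generator. The authors' actual algorithm is given in Appendix C; what follows is only one plausible reading, made under several assumptions that are not taken from the article: each category of Rater 1 is paired independently with one category of Rater 2, Rater 1 picks categories uniformly, Rater 2 follows the pairing with probability ρC, and otherwise rates at random. All function and parameter names are hypothetical.

```python
import random


def draw_pairing(k, allow_coincident, rng):
    """Pair each category x of Rater 1 with a category of Rater 2 (assumed scheme)."""
    pairing = []
    for x in range(k):
        if allow_coincident:
            y = rng.randrange(k)          # may coincide with x by chance
        else:
            y = rng.randrange(k - 1)      # draw among the k - 1 other categories
            if y >= x:
                y += 1
        pairing.append(y)
    return pairing


def simulate_table(n, k, rho_c, allow_coincident, rng):
    """Simulate a k x k judgment table with consistent but not agreeing ratings."""
    pairing = draw_pairing(k, allow_coincident, rng)
    table = [[0] * k for _ in range(k)]
    for _ in range(n):
        r1 = rng.randrange(k)             # assumption: uniform prevalence
        if rng.random() < rho_c:
            r2 = pairing[r1]              # consistent categorization
        else:
            r2 = rng.randrange(k)         # unrelated, random categorization
        table[r1][r2] += 1
    return table


rng = random.Random(2015)                 # fixed seed for reproducibility
# With rho_c = 1 and coincident pairings excluded, the diagonal stays empty.
table = simulate_table(n=125, k=5, rho_c=1.0, allow_coincident=False, rng=rng)
for row in table:
    print(row)
```

With ρC below 1, the random ratings still place some cases on the main diagonal, so the excluded-pairing condition only removes systematic agreement, not chance agreement.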
A detailed description of the simulations’ parameters and algorithms is given in Appendix C.
Results
Because of the very large number of conditions explored, we only report illustrative results, shown in Figures 2 to 4, all using a significance level of .05; these results are typical of what was found in the other conditions and with a significance level of .01. Figure 2 presents the results for one set of parameters (k = 5 and N = 125), one panel per condition. In Figure 3, everything is the same as in Figure 2, except that the sample size is doubled (N = 250); in Figure 4, everything is the same as in Figure 2, except that the number of categories is doubled (k = 10).

(Top panel) Power curve for 5 tests of interrater agreement as a function of the true agreement rate (ρA) when k = 5 and N = 125; (Middle and bottom panels) specificity for the same tests in the same condition as a function of the rate of consistent categorizations (ρC).

Power curve and specificity curves when k = 5 and N = 250 in the same format as Figure 2.

Power curve and specificity curves when k = 10 and N = 125 in the same format as Figure 2.
As seen, the results follow three patterns.
Chi-Square Test
As expected, all our results disqualify the
Cohen’s, Fleiss’, and the Sum-of-z Tests
For the next three tests, Cohen’s simple formula (
QA Test
As can be seen in the top panels of Figures 2 and 3, power curves of the QA test follow well the power curves of the
The QA test considers all circumstances in which the raters may agree and disagree. Hence, it rejects the null hypothesis only if the evidence gathered from the agreeing cells outweighs the evidence from the disagreeing cells. Consistent pairings outside the main diagonal are thus, for QA, a strong cue against agreement.
In sum, the new test QA was found to be just a little less powerful than the other tests when k is small (k < 8). However, these tests proved to be far less specific. One possibility is that on a small percentage of the simulations, random pairings might have occurred, triggering these tests to significance but not QA. Hence, it is not clear whether the new test is truly less powerful or is simply more selective.
General Discussion
We examined the performance of five tests for determining whether two observers, each one classifying N observations into a set of k categories, agree or not above chance level. Monte Carlo simulations were used to compare the five tests with respect to their sensitivity to true agreement—a quality that translates into statistical power—and their specificity, that is, their insensitivity to anything but true agreement. As could be expected (see for instance Cohen, 1960), the standard
The statistics described here provide a useful set of tools to quantify agreement and assess its significance. Confidence intervals are also defined, allowing for easy comparisons between studies. The only limitation of the present statistics is that they are all biased downward (and the bias is substantial for k ≥ 10 and small N). Future work should try to find a correction for this bias. In the meantime, the present statistics should not be used to detect significant disagreement.
Our results show that a significant test on kappa can indicate that agreement is strongly present in a few categories or weakly present in all categories (or a continuum between these extremes). On the other hand, a significant QA test only indicates that agreement is present in all the categories. The simulations showed that the QA test is the only test that is sensitive to agreement within all categories, not just a few, as seen by comparing the markedly different results in the Random excluding coincident pairings with the Random with possible coincident pairings simulations.
What distinguishes the κ-to-z and sum-of-z statistics, on the one hand, from QA, on the other, and where does the latter get its high specificity? The uniqueness of QA comes from the fact that the whole matrix of agreement is used. Kappa-based and sum-of-z tests exploit only the main diagonal. Hence, they implicitly assume that the cells outside it are uninformative, which is false, as the first two examples showed.
Footnotes
Appendix A
Appendix B
Appendix C
Acknowledgements
We would like to thank Bradley Harding for his comments on an earlier version of this article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
