Abstract
Comparative judgment (CJ) is an alternative method for assessing competences based on Thurstone’s law of comparative judgment. Assessors are asked to compare pairs of students’ work (representations) and judge which one is better on a certain competence. These judgments are analyzed using the Bradley–Terry–Luce model, resulting in logit estimates for the representations. In this context, the Scale Separation Reliability (SSR), coming from Rasch modeling, is typically used as a reliability measure. But, to the knowledge of the authors, it has never been systematically investigated whether the meaning of the SSR can be transferred from Rasch to CJ. As the meaning of the reliability is an important question for both assessment theory and practice, the current study looks into this. A meta-analysis is performed on 26 CJ assessments. For every assessment, split-halves are performed based on assessor. The rank orders of the whole assessment and the halves are correlated and compared with SSR values using Bland–Altman plots. The correlation between the halves of an assessment was compared with the SSR of the whole assessment, showing that the SSR is a good measure for split-half reliability. Comparing the SSR of one of the halves with the correlation between the two respective halves showed that the SSR can also be interpreted as an interrater correlation. Regarding SSR as expressing a correlation with the truth, the results are mixed.
Introduction
There is a constant need for reliable assessments, whether in everyday classroom assessment or in high-stakes selection procedures in professional contexts. In this context, comparative judgment (CJ) has been proposed as an assessment method providing reliable results (Pollitt, 2004, 2009). As the name states, the method of CJ is based on comparisons, in contrast to absolute judgments. Judges are presented with pairs of students’ work (further called representations) and are asked to judge which one is better with regard to the competence under assessment. Based on these judgments, made by several judges, a scale value can be estimated.
Since the early days of CJ as an assessment method, it has been considered inefficient, requiring a large number of comparisons to obtain estimates with acceptable or good reliability. Or, as stated by Bramley, Bell, and Pollitt (1998), “The most salient difficulty from a practical point of view is the monotony of the task and the time it takes to get a sufficient number of comparisons for reliable results” (p. 14).
Therefore, one of the most important methodological questions in CJ to date is, how can the efficiency (in number of comparisons) of a CJ assessment be increased without affecting the reliability of the final estimates?
But how can this question ever be answered if it is not known what the reliability measure means? Because of the similarities between the models behind CJ, item response theory (IRT), and Rasch measurement, the reliability measure used in CJ has been adopted from Rasch measurement (Bramley, 2007, 2015; Pollitt, 2012). Although this adoption is defensible, the differences between the CJ method and the Rasch measurement method are substantial enough not to assume that this measure has the same meaning in both contexts (further details follow below). Therefore, the main focus of the current study is how the reliability measure in CJ can be interpreted.
The first section of this article will be structured as follows. First, what CJ is will be discussed. In doing so, the theoretical underpinnings leading up to the measurement model and its related reliability measure will be presented. Next, a theoretical framework on reliability and elaboration on the ways reliability can be estimated will be presented. Finally, the reasoning behind the current study will be discussed.
What Is CJ
CJ was introduced in educational assessment in 1995 by Pollitt and Murray who derived the method from Thurstone’s law of comparative judgment (Thurstone, 1927a, 1927b). The starting point for this law was the psychophysical observation that an object in the environment or representation (e.g., an essay) has a psychological impact on an observer (e.g., assessor) and that this impact or impression can change over time even as the object remains constant. Consequently, any statement (e.g., judgment on the quality) based on this impression will change accordingly (Thurstone, 1927b). This was later formulated again, in educational assessment, by Laming (2003) who stated that an absolute judgment does not exist and that every judgment is a comparison. The latter can be found in Thurstone’s derivation of the law of comparative judgment. Thurstone assumed that the psychological impact cannot be observed directly and that if the objects producing these impacts can be ordered based on a certain characteristic, then the corresponding impacts must also follow the same ordering. Therefore, the only way that one can measure the impact is by asking the observer to compare two objects and state which one is better on a certain characteristic (Thurstone, 1927b). For example, assessors are asked to compare two essays on their quality regarding the competence argumentative writing.
If an observer is thus presented with several pairs of representations of which she or he has to judge which one of the two possesses more of a specified quality, it is possible to estimate from these judgments a scaled location of the representations based on the normal function (Thurstone, 1927b). These estimates are commonly called ability estimates or ability values. Something similar can be found in paired comparison research. There, scale values are estimated using the Bradley–Terry–Luce model (BTL model; Bradley & Terry, 1952; Luce, 1959), which can be obtained from the original formulation of Thurstone’s law after some simplifying assumptions (Thurstone’s case V; Thurstone, 1927a) and by replacing the normal function with a logit function (Andrich, 1978). Earlier, Thurstone’s law was already identified as “a comparable method of analysis” (Bradley, 1953, p. 32) for paired comparison data. Andrich (2004) also pointed out that although IRT with the two-parameter logistic model (2PL model) and the Rasch model are conceptually different, Thurstone’s law of comparative judgment is considered the forerunner of both paradigms. All these analysis models are mathematically related, which becomes clear if the BTL model is formulated as follows:
$$p(x_{ij} = 1 \mid v_i, v_j) = \frac{\exp(v_i - v_j)}{1 + \exp(v_i - v_j)}$$
with $x_{ij} = 1$ if representation $i$ is judged better than representation $j$, and $v_i$ and $v_j$ the logit scores of representations $i$ and $j$,
and if the 2PL model is formulated as follows:
$$p(x_{ni} = 1 \mid \theta_n, \alpha_i, \beta_i) = \frac{\exp\left(\alpha_i(\theta_n - \beta_i)\right)}{1 + \exp\left(\alpha_i(\theta_n - \beta_i)\right)}$$
with $\theta_n$ the ability of person $n$, $\beta_i$ the difficulty of item $i$, and $\alpha_i$ the discrimination of item $i$.
Thus, the person parameter of the 2PL model or the Rasch model is replaced by a second item parameter and the discrimination parameter of the 2PL model is fixed to 1. Despite the mathematical similarity, as proven by Andrich (1978), the BTL model and the IRT and Rasch models clearly have a different parametrization. Therefore, it seems not justifiable to just copy measures of reliability from Rasch measurement or IRT, as carried out in previous research (Bramley, 2007; Heldsinger & Humphry, 2010; Pollitt, 2012), without checking whether their meaning is generalizable to the context of CJ. To the knowledge of the authors, these checks have not been done up until now.
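The relation between the two models can be illustrated numerically. The following Python sketch is not taken from the study; the function names and the example logit values are hypothetical. It shows that a 2PL probability with the discrimination fixed to 1, and with the person parameter replaced by the logit of a second representation, returns the same value as the BTL probability.

```python
import math


def btl_probability(v_i: float, v_j: float) -> float:
    """Probability that representation i is judged better than j (BTL model)."""
    return 1.0 / (1.0 + math.exp(-(v_i - v_j)))


def two_pl_probability(theta: float, beta: float, alpha: float = 1.0) -> float:
    """Probability of a correct response under the 2PL model.

    With alpha fixed to 1 this reduces to the Rasch model.
    """
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))


# Hypothetical logit scores of two representations.
v_a, v_b = 0.8, -0.4

p_btl = btl_probability(v_a, v_b)
# Same numbers, read as "person ability" and "item difficulty", alpha = 1.
p_rasch = two_pl_probability(theta=v_a, beta=v_b, alpha=1.0)
```

The numerical identity only reflects the algebraic similarity; it says nothing about whether reliability measures carry the same meaning in both contexts, which is the question the study addresses.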
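The reliability measure in CJ is called the Scale Separation Reliability (SSR), by analogy with the naming in the Rasch literature from which the measure was adopted (see Bramley, 2015, for details), and is formulated as follows (Bramley, 2015):
$$SSR = \frac{G^2}{1 + G^2} \qquad (2)$$
with
$$G = \frac{\sigma_T}{RMSE}$$
where $\sigma_T$ is the estimated standard deviation of the true scores and $RMSE$ is the root mean squared error of the logit estimates.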
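As a minimal sketch of how such a separation reliability can be computed from logit estimates and their standard errors, the Python function below assumes the separation-index form described by Bramley (2015): the MSE is the mean of the squared standard errors, and the true variance is the observed variance minus the MSE. The function name, the use of the sample variance (n − 1), and the example values are assumptions made for illustration, not details reported in the study.

```python
import math
from statistics import mean, variance


def scale_separation_reliability(logits, standard_errors):
    """SSR = G^2 / (1 + G^2), with G = true SD / RMSE."""
    mse = mean(se ** 2 for se in standard_errors)  # mean squared error
    observed_var = variance(logits)                # sample variance (assumption)
    true_var = max(observed_var - mse, 0.0)        # cannot be negative
    g = math.sqrt(true_var / mse)                  # separation coefficient
    return g ** 2 / (1.0 + g ** 2)


# Hypothetical estimates for five representations.
example_logits = [-2.0, -1.0, 0.0, 1.0, 2.0]
example_se = [0.5, 0.5, 0.5, 0.5, 0.5]

example_ssr = scale_separation_reliability(example_logits, example_se)
```

In this example the observed variance is 2.5 and the MSE is 0.25, so the true variance is 2.25, G² is 9, and the SSR is 0.9.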
Reliability Theory
In classical test theory (CTT), reliability is defined as the proportion of the variance in observed scores that is attributable to true scores (Brennan, 2011; Webb, Shavelson, & Haertel, 2006) or to what is assumed to be the truth (Brennan, 2011). And although IRT does not entirely conform with CTT (see Brennan, 2011, for a brief discussion and further references), this perspective can also be recognized in IRT and Rasch measurement and thus in CJ.
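As shown in Appendix A, the SSR (Equation 2) can be expressed as
$$SSR = \frac{\sigma_T^2}{\sigma_O^2}$$
where $\sigma_T^2$ is the variance of the true scores and $\sigma_O^2$ is the variance of the observed scores.
The variance of the true scores can be estimated from the variance of the observed scores using this formula:
$$\sigma_T^2 = \sigma_O^2 - MSE$$
where MSE stands for the mean squared error (Andrich, 1982).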
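Appendix A is not reproduced here. The algebra linking the two expressions of the SSR can be sketched as follows, assuming the separation-index form reported by Bramley (2015) and that the squared RMSE equals the MSE. The symbols $G$, $\sigma_T^2$, and $\sigma_O^2$ are notation chosen for this illustration.

$$G^2 = \frac{\sigma_T^2}{MSE}, \qquad \sigma_O^2 = \sigma_T^2 + MSE$$

$$SSR = \frac{G^2}{1 + G^2} = \frac{\sigma_T^2 / MSE}{1 + \sigma_T^2 / MSE} = \frac{\sigma_T^2}{MSE + \sigma_T^2} = \frac{\sigma_T^2}{\sigma_O^2}$$

For example, with $\sigma_O^2 = 2.5$ and $MSE = 0.25$, both forms give $SSR = 2.25 / 2.5 = 0.9$.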
In practice, reliability can also be estimated from the correlation between two variations of the same assessment or parallel forms (Bramley, 2015; Webb et al., 2006). One way to create these parallel forms, in tests with multiple items, is to split this test in multiple halves on the items and then correlate the respective pairs. The mean of these correlations is then coefficient alpha (Cronbach’s alpha; Cronbach, 1951) or the equivalent KR20 for dichotomous items (Webb et al., 2006).
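For reference, coefficient alpha is usually computed directly from item and total-score variances rather than by enumerating split-halves. The sketch below uses that variance formula; the function name and the score matrices are hypothetical and serve only to illustrate the calculation.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Coefficient alpha for a list of rows (one row of item scores per person)."""
    n_items = len(scores[0])
    # Variance of each item across persons.
    item_variances = [variance([row[i] for row in scores]) for i in range(n_items)]
    # Variance of the total score across persons.
    total_variance = variance([sum(row) for row in scores])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: three perfectly parallel items.
parallel_scores = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
alpha_parallel = cronbach_alpha(parallel_scores)

# Hypothetical data: three items with some disagreement.
noisy_scores = [[2, 3, 3], [4, 4, 5], [1, 2, 1], [5, 4, 4], [3, 3, 2]]
alpha_noisy = cronbach_alpha(noisy_scores)
```

Perfectly parallel items yield an alpha of 1; any disagreement between items lowers the value.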
In CJ, however, it is impossible to do a split-half on the representations. Doing this could result in a reduction in the overlap between the pairs, which leads to incorrect or even missing ability estimates. As in CJ, the assessor group can be seen as an integral part of the results—the judgments of all the assessors are pooled in the analysis—split-halves can be obtained by splitting the assessor group. This approach has already been taken in a few CJ studies (e.g., Jones, Inglis, Gilmore, & Hodgen, 2013; Jones, Swan, & Pollitt, 2015). However, none of these studies have made the connection to the SSR.
The Current Study
Extending the idea of Jones and colleagues (Jones et al., 2013; Jones et al., 2015), this study combines the idea of split-half correlations (on assessors) with the calculation of the SSR to check the interpretation and the validity of this reliability measure in the context of CJ assessments. This is done using an empirical approach.
The current study investigates the value of three types of reliability in a CJ assessment context: the split-half reliability, the interrater reliability, and reliability as a correlation with the truth. Based on the idea of Jones and colleagues (Jones et al., 2013; Jones et al., 2015) of triangulating the SSR with split-half correlations as a way to support the reliability of CJ assessments, the meaning of the SSR measure is checked in the current study by directly comparing it with several correlations. Namely, assessments are split in halves and the estimated logit scores of the respective halves are correlated. This correlation is then compared with the SSR of the whole assessment, providing information on the SSR as split-half reliability. Furthermore, as a CJ assessment can only be split in halves by judge, as argued by Bramley (2015) and demonstrated by Jones and colleagues, information can be obtained on the SSR as interrater reliability. This can be done by comparing the SSR of the estimates of one of the halves with the correlation between the estimates of the two respective groups. Finally, if one considers the whole assessment as the truth, then correlating the ability scores of the whole assessment with the scores of one of the halves can support the interpretation of the SSR as a correlation with the truth when this correlation is compared with the SSR of the scores of the respective half. This latter notion was extended in the following way. As the correlation of observed values with the truth is the main idea behind model fit, and the measure for model fit R² is in essence the squared Pearson’s r correlation, the squared correlation between the logit scores of the whole assessment and those of one half was compared with the SSR value of the logit scores of the respective half.
It should be remarked that the error variance in CTT is different from that in IRT because the latter framework does not take item variance into account (Kim, 2012). This might have consequences for the comparability of reliability and Pearson’s r correlation measures. Nevertheless, it does not pose a problem for split-half and interrater reliabilities as it was shown that the parallel forms reliability in IRT is equivalent to that in CTT (Kim, 2012) and thus with Pearson’s r. Differences might arise when correlation with the truth, in other words squared-correlation reliability, is considered (Kim, 2012). In this study, this might lead to biased or inconclusive results.
As a correct interpretation of the reliability measure is methodologically important and practically relevant in assessments, the current study aims to question what the meaning/value is of the SSR in contexts where CJ is used. This is done using an empirical method.
Method
The Data
A meta-analysis is conducted on 15 CJ assessments, 26 assessor groups, in total. This difference in numbers is due to how assessments are defined here. One CJ assessment can consist of multiple assessor groups resulting in multiple sets of estimates. In a CJ assessment, representations (products, for example, essays) are compared regarding a specific competence. In one assessment, two competences needed to be judged resulting in two rank orders. This leads to 27 data sets being used.
Here follows a general description of the assessment characteristics to provide an idea of the range of assessments included in the analysis. For more specific details on the assessments, see Appendix B. The majority of the assessments were conducted in higher education (n = 13), followed by secondary education (n = 6) and primary education (n = 1). The remaining assessments were conducted outside the context of education (n = 7). The assessments were conducted with 51 representations on average (minimum = 6, maximum = 201) and judged by an average of 28 assessors (minimum = 4, maximum = 93). The representations were compared 27 times on average (minimum = 13, maximum = 105), leading to an average total of 548 comparisons (minimum = 60, maximum = 2,193) per assessment. The assessments resulted in an average SSR of 0.80 (minimum = 0.62, maximum = 0.93).
Procedure and Analyses
First, the split-half procedure is discussed. Next, it is explained how the correlation coefficients were interpreted in the light of reliability. Afterward, details are provided on the opportunity offered by some of the data for interpreting and confirming some results. All analyses were conducted in R (R Core Team, 2016).
Every assessment is split in halves by assessor group in every possible way. For instance, an assessment with 10 assessors results in 126 different possible split-halves of the data. In 54% of the assessments, the authors limit the number of split-halves to 1,000 because the number of assessors is too large for all possible split-halves to be computed in a manageable way. When there is an odd number of assessors, one of the two split-half groups contains one assessor more than the other.
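The count of 126 follows from choosing 5 of 10 assessors (252 ways) and halving, because a group and its complement define the same split. The Python sketch below illustrates this enumeration. The function names are hypothetical, and the random sampling used when the cap is exceeded is an assumption: the text only states that the number of split-halves was limited to 1,000, not how they were selected.

```python
import itertools
import math
import random


def count_split_halves(n_assessors):
    """Number of distinct ways to split n assessors into two (near-)equal groups."""
    k = n_assessors // 2
    n_choices = math.comb(n_assessors, k)
    # With an even number, a group and its complement give the same split.
    return n_choices // 2 if n_assessors % 2 == 0 else n_choices


def split_halves(assessors, max_splits=1000, seed=0):
    """Yield (group_1, group_2) tuples; all splits if feasible, else a random subset."""
    assessors = sorted(assessors)
    n, k = len(assessors), len(assessors) // 2
    if count_split_halves(n) <= max_splits:
        for group_1 in itertools.combinations(assessors, k):
            # For even n, keep only groups containing the first assessor
            # so that mirrored duplicates are skipped.
            if n % 2 == 0 and assessors[0] not in group_1:
                continue
            group_2 = tuple(a for a in assessors if a not in group_1)
            yield group_1, group_2
    else:
        rng = random.Random(seed)  # assumption: random sampling without repeats
        seen = set()
        while len(seen) < max_splits:
            group_1 = frozenset(rng.sample(assessors, k))
            group_2 = frozenset(assessors) - group_1
            key = frozenset((group_1, group_2))
            if key not in seen:
                seen.add(key)
                yield tuple(sorted(group_1)), tuple(sorted(group_2))


all_splits_10 = list(split_halves(range(10)))
```

With an odd number of assessors, the first group receives the smaller half, so one group contains one assessor more than the other.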
For every assessment as a whole and every split-half group, logit scores are estimated and the SSR is calculated. The logits of the corresponding split-half groups are correlated, as are the logits of each split-half group and those of the corresponding whole assessment. This leads to three SSRs and three correlation coefficients per assessment per split-half, or 55,662 correlations and as many SSRs in total. Per assessment and split-half group, the mean of the SSRs and the mean of the correlations are calculated. It is then possible to compare each of these mean SSRs with each mean correlation coefficient, as shown in Figure 1. However, only five of the nine combinations (colored black) are interpretable. These five combinations can be clustered into three interpretations. (a) If the correlation between the split-half groups is compared with the SSR of the whole assessment (bottom left plot), this provides information on the split-half reliability. (b) In the plots at the bottom middle and right, the correlation between the split-half groups is compared with the SSR of each group separately. Therefore, the correlation can be interpreted as an interrater reliability. (c) The reliability as a correlation between what is observed and what is considered as the truth can be found in the middle plot of the top row and in the right plot of the second row. In these two plots, the correlation between one of the groups and the whole assessment is compared with the SSR of the respective group.
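The procedure for a single split-half can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors’ R code: the data are simulated, the BTL logits are estimated with a Newton-Raphson routine on a lightly ridge-penalized likelihood (the penalty keeps estimates finite and is a choice made for this example), standard errors are approximated from the inverse information matrix, and all names are hypothetical.

```python
import numpy as np


def fit_btl(comparisons, n_items, ridge=0.05, n_iter=50):
    """Estimate BTL logits from (winner, loser, assessor) tuples."""
    winners = np.array([c[0] for c in comparisons])
    losers = np.array([c[1] for c in comparisons])
    v = np.zeros(n_items)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(v[winners] - v[losers])))  # P(winner wins)
        grad = -ridge * v                      # gradient of penalized log-likelihood
        np.add.at(grad, winners, 1.0 - p)
        np.add.at(grad, losers, p - 1.0)
        w = p * (1.0 - p)
        info = ridge * np.eye(n_items)         # penalized information matrix
        np.add.at(info, (winners, winners), w)
        np.add.at(info, (losers, losers), w)
        np.add.at(info, (winners, losers), -w)
        np.add.at(info, (losers, winners), -w)
        step = np.linalg.solve(info, grad)
        v = v + step
        if np.max(np.abs(step)) < 1e-8:
            break
    se = np.sqrt(np.diag(np.linalg.inv(info)))  # approximate standard errors
    return v, se


def ssr(v, se):
    """True variance over observed variance, with true = observed - MSE."""
    mse = np.mean(se ** 2)
    true_var = max(np.var(v, ddof=1) - mse, 0.0)
    return true_var / (true_var + mse)


# Simulated assessment: 20 representations, 10 assessors, 600 comparisons.
rng = np.random.default_rng(1)
n_items, n_assessors = 20, 10
true_v = rng.normal(0.0, 1.5, n_items)
comparisons = []
for _ in range(600):
    i, j = (int(x) for x in rng.choice(n_items, size=2, replace=False))
    assessor = int(rng.integers(n_assessors))
    p_i_wins = 1.0 / (1.0 + np.exp(-(true_v[i] - true_v[j])))
    comparisons.append((i, j, assessor) if rng.random() < p_i_wins else (j, i, assessor))

# One split-half by assessor.
group_1 = {0, 1, 2, 3, 4}
half_1 = [c for c in comparisons if c[2] in group_1]
half_2 = [c for c in comparisons if c[2] not in group_1]

v_all, se_all = fit_btl(comparisons, n_items)
v_1, se_1 = fit_btl(half_1, n_items)
v_2, se_2 = fit_btl(half_2, n_items)

ssr_values = {"whole": ssr(v_all, se_all), "group_1": ssr(v_1, se_1), "group_2": ssr(v_2, se_2)}
correlations = {
    "group_1_group_2": np.corrcoef(v_1, v_2)[0, 1],
    "group_1_whole": np.corrcoef(v_1, v_all)[0, 1],
    "group_2_whole": np.corrcoef(v_2, v_all)[0, 1],
}
```

Each split-half thus yields three SSRs and three correlations, matching the count given in the text. In the study these quantities are averaged over all split-halves of an assessment before being compared.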

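Figure 1. Example of SSR against correlation plot.
Plots of the mean SSR against the mean correlation are hard to interpret. The Bland–Altman plot (BA plot), or Tukey’s mean difference plot, provides more information (Bland & Altman, 1986; Kozak & Wnuk, 2014). In a BA plot, the mean of two values (measures) is plotted against the difference between them, together with the limits of agreement (LoA), conventionally the mean difference plus and minus 1.96 standard deviations of the differences (Bland & Altman, 1986).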
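The quantities behind a BA plot can be sketched in a few lines of Python. The function name and the example values are hypothetical. The direction of the difference (SSR minus correlation) and the conventional 1.96 multiplier are assumptions: the first is inferred from how the Results section reads negative differences as the SSR underestimating the correlation, and the second follows Bland and Altman (1986).

```python
from statistics import mean, stdev


def bland_altman(ssr_values, correlations):
    """Return the points and limits of agreement for a Bland-Altman plot."""
    differences = [s - r for s, r in zip(ssr_values, correlations)]
    means = [(s + r) / 2 for s, r in zip(ssr_values, correlations)]
    bias = mean(differences)          # mean difference
    spread = stdev(differences)       # SD of the differences
    return {
        "means": means,               # x-axis values
        "differences": differences,   # y-axis values
        "bias": bias,
        "lower_loa": bias - 1.96 * spread,
        "upper_loa": bias + 1.96 * spread,
    }


# Hypothetical mean SSRs and mean correlations for three assessments.
ba = bland_altman([0.8, 0.7, 0.9], [0.7, 0.7, 0.7])
```

If zero lies between the lower and upper limits, the two measures do not differ systematically beyond what the spread of the differences allows.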
It should be remarked that the correlation as interrater reliability might be an overestimation. This might also, but to a lesser extent, be the case with split-half reliability. This is inherent to the CJ method. As was noted earlier, the judges are an integral part of the results. This is reinforced by the fact that the algorithm constructing the pairs takes into account all previously judged pairs, so as not to send out the same pair multiple times. Due to this partial dependence of pairs, it is impossible to create completely independent halves.
This issue could in part be countered by the setup of some assessments. As can be seen in the table in Appendix B, some assessments (n = 5) were repeated by different assessor groups (2 to 3), thus providing assessment variations in which different groups of assessors compare the same representations with the same algorithm. If these variations are correlated, a more correct estimate of interrater reliability can be obtained. This could then provide further support for this interpretation of the SSR.
Pearson’s r is used as a correlation measure and the squared Pearson correlation is included as further support for reliability as model fit. As remarked earlier, the latter should be interpreted with caution as there might be a difference in value between the squared Pearson’s r correlation and the squared-correlation reliability in IRT (Kim, 2012).
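The link between the squared Pearson correlation and model fit can be checked numerically: for a simple linear regression with an intercept, R² equals the squared Pearson’s r. The sketch below uses hypothetical logit scores for a whole assessment and one half; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical logit scores of the same six representations.
whole_logits = np.array([-1.8, -0.9, -0.2, 0.3, 1.1, 1.5])
half_logits = np.array([-1.5, -1.1, 0.1, 0.0, 0.9, 1.6])

# Pearson correlation between the half and the whole.
pearson_r = np.corrcoef(half_logits, whole_logits)[0, 1]

# R^2 of the least-squares line predicting the whole from the half.
slope, intercept = np.polyfit(half_logits, whole_logits, 1)
predicted = slope * half_logits + intercept
ss_residual = np.sum((whole_logits - predicted) ** 2)
ss_total = np.sum((whole_logits - whole_logits.mean()) ** 2)
r_squared = 1.0 - ss_residual / ss_total
```

This identity holds for Pearson’s r in general; whether the squared-correlation reliability in IRT takes the same value is the caveat raised by Kim (2012).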
Results
In this section, only the results of the BA plots are presented. Interested readers can find the plots of the SSRs against the correlations in Appendix C. The results are ordered according to the type of reliability they provide information for. The authors first focus on split-half reliability, then on interrater reliability, and finally on reliability as a correlation with the truth.
To investigate whether the SSR could be interpreted as some form of split-half reliability, the SSR measure of the whole assessment is compared with the mean of the split-half correlations for that assessment. In the BA plot (Figure 2), zero (black dotted line) is clearly within the LoA (dashed lines), and the most extreme estimates of the LoA (outermost gray dotted lines) are between –.1 and .4 which is just acceptable for correlations. Thus, the SSR is a quite good estimate of the split-half correlation.

Figure 2. BA plots for split-half reliability: Comparison between the SSR of the whole assessment and Pearson’s r correlation between both halves of the assessment.
Comparing the mean of the correlations between two halves of an assessment and the SSR of one of these halves provides information on the SSR as interrater reliability (Figures 3a and 3b). Zero is within the LoA boundaries and the most extreme estimates for these boundaries are –.25 and .25 for the comparison with the Group 1 SSR and between –.3 and .3 for the comparison with the Group 2 SSR. This can be considered small. Therefore, the SSR could be considered as an interrater reliability.

Figure 3. BA plots for interrater reliability: Comparisons (a) between the SSR of Group 1 and the correlation between the two split-half groups, (b) between the SSR of Group 2 and the correlation between the two split-half groups, and (c) between the SSR of selected assessments and the true interrater correlation.
As the split-half groups are not completely independent because of the CJ design, as stated earlier, these correlations might be an overestimate of the true interrater correlation. Therefore, where possible, the SSRs of separate assessor groups within an assessment are compared with real interrater correlations between these groups. These results confirm the results with the correlation between the split-half groups, as zero is inside the LoA and the extreme estimate boundaries are around –.15 and .3 (Figure 3c). Again, the SSR appears to be a good measure for interrater reliability.
It should be remarked that the assessor groups are not completely comparable, in number and assessment expertise for all assessments, which could again result in an underestimate of the interrater reliability. Therefore, it might be possible that these results are an overestimation of the agreement.
For reliability as a correlation with the truth, the SSR of each half is compared with the correlation between the whole assessment and the respective half. If the difference between the SSR of either one of the groups and the correlation between the whole assessment and the respective group is considered (Figures 4a and 4b), the LoAs concerning either group are below zero. This shows that these SSRs are underestimates of the respective correlations. But, as argued, the correlation between observed values and the truth is better expressed by the measure of model fit (R²). When the difference between the SSRs of either group and the squared Pearson correlation between the whole assessment and the respective group is considered instead (Figures 4c and 4d), the results prove difficult to interpret. Zero lies above the LoA boundaries but still within the estimation error boundaries of the LoAs. The most extreme boundaries are still within the acceptable limits of around –.1 and around .3. It can be cautiously concluded that the SSR might be a good estimate for correlation with the truth, but there is not enough data to be certain. However, such results can be expected, as values of the SSR calculations might not completely correspond to the squared-correlation values, as remarked earlier. This might also explain the inconclusive results with the squared Pearson’s r.

Figure 4. BA plots for reliability as correlation with truth: Comparisons (a and c) between the SSR of Group 1 and the correlation between Group 1 and the whole assessment and (b and d) between the SSR of Group 2 and the correlation between Group 2 and the whole assessment. Plots (a) and (b) display Pearson’s correlations and plots (c) and (d) the R².
Discussion
The SSR measure from CJ has been adopted from Rasch measurement because of the algebraic similarity of the measurement models. However, the methods of CJ and Rasch measurement are different enough not to assume that the reliability measures mean the same in both contexts. Therefore, this study set out to answer the question of how the SSR can be interpreted, more specifically what the meaning is of the SSR in the context of CJ. To this end, a meta-analysis was conducted on 27 data sets (15 assessments or 26 assessor groups; see Appendix B). Using a split-half methodology, SSR values were compared with several types of correlation using BA plots and corresponding LoAs. The assessments are diverse enough and the data set large enough that some generalizing statements can be made. However, as this study set out to investigate the meaning of the SSR measure in CJ, it has to be remarked that these conclusions cannot be generalized to Rasch and IRT. Furthermore, it should be kept in mind that the analyses were conducted on the data from a set of 27 specific data sets. Therefore, it is necessary that the findings are replicated with more experimental studies.
The results strongly point in the direction that the SSR reflects the interrater reliability as the SSR of each split-half group shows congruency with Pearson’s r correlation between both groups. However, it has been remarked that this correlation might be an overestimate of the interrater reliability because the groups are not independent. Therefore, the confirmation was sought in assessments with different assessor groups. Here, the SSRs were also close to Pearson’s r. As these assessments were not set up to test interrater correlations however, the assessor groups were not constructed to be equivalent, so the correlations could be an underestimate of the potential interrater correlation. In sum, there are good and strong indications that the SSR reflects the interrater correlations but some results call for caution. These results should be further confirmed with a more experimental and controlled approach.
Regarding the most theoretical view on reliability, namely, the correlation of the observed values with the truth, the SSR of each group differs from Pearson’s r correlation. This could be due to the fact that reliability as a correlation with the truth is better reflected by the squared Pearson correlation or the measure of model fit (R²). The squared Pearson correlation values indeed appear to lie closer to the corresponding SSR values. Caution is, however, warranted. The results present a borderline case, meaning there is not enough data to be certain about the results. Also, the difference in conceptualization between CTT and IRT (Kim, 2012) might contribute to the fact that these values do not completely correspond. The authors can tentatively conclude that there is some evidence and a slight confirmation that the SSR might be interpreted as a theoretical reliability, namely, a correlation with the truth.
Finally, there was also evidence that the SSR expresses split-half reliability. The SSR of the whole group appears not that different from Pearson’s r correlation between both split-half groups. Evidence is thus pointing in the direction of the SSR as a split-half reliability but further research is needed.
It can be concluded that there are strong indications that the SSR provides an interrater reliability index, which can be informative when using CJ. Some results also point in the direction of the SSR as a correlation with the truth and/or a split-half correlation. However, these indications are less strong and further research is recommended. Studies conducting assessments with tighter control over the equivalence of assessor groups are important, and assessments where the rank order is known beforehand might also provide some interesting findings.
The findings of this meta-analysis, based on a substantial yet specific sample of assessments, provide a first step toward a strong theoretical basis for the interpretation of CJ results. As this study takes an empirical approach, these results need to be confirmed in more systematic studies.
These results provide initial information in the search for adaptive algorithms to increase the efficiency of the CJ method. Furthermore, these results might inspire the analyses of future simulation studies on these algorithms. In addition, this study offers assessment practitioners some handles for interpreting the results of their CJ assessments.
Regarding the use of CJ in assessment practice, the efficiency question is an equally important methodological question with important practical implications. This question also cannot be answered if it is unknown how many comparisons are actually needed to reach a certain level of reliability. This article focused on the basic methodological and theoretical question of the meaning of the reliability, and future research is needed to investigate the number of comparisons actually needed.
Appendix A
Appendix B
Appendix C
Acknowledgements
The authors thank the two anonymous reviewers whose comments helped improve the manuscript. The majority of the data was collected within and outside the University of Antwerp and with the cooperation of the following persons: Prof. Dr. Kris Aerts, Cynthia De Bruycker, Benedicte De Winter, Ann-Kathrin Hennes, Stefan Martens, Prof. Dr. Nele Michels, Prof. Dr. Jean-Michel Rigo, Dr. Pierpaolo Settembri, Dr. Joke Spildooren, Daniëlle Van Ast, Tine van Daal, Marie-Thérèse van de Kamp, Kristel Vandermolen, Kristof Vermeiren, and Ellen Volkaert. The authors also thank these people for their efforts.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research is part of a larger project (D-PAC) funded by the Flanders Innovation & Entrepreneurship and the Research Foundation (Grant number: 130043).
Supplemental Material
Supplementary material is available for this article online.
