Abstract
This inquiry is focused on three indicators of the precision of measurement—conditional on fixed values of θ, the latent variable of item response theory (IRT). The indicators that are compared are (1) The traditional, conditional standard errors,
The overall, marginal precision of measurement for test scores is estimated using reliability coefficients, standard errors of measurement (SEM), and information functions. Currently, conditional measurement precision is estimated using traditional conditional SEMs (CSEMs) =
As a prelude to describing the paradox, some notation and definitions are introduced; they allow for both dichotomous and polytomous items:
Let
Let Pij(θ) be the conditional probability of scoring in category j of item i; then the conditional mean and variance of item i are given by
where θ is the latent variable being measured and,
If xi is a dichotomous item, scored 0 or 1, then,
Let
The score information function of IRT is defined as
It is important to remember that this is the score information function for the NC score—as contrasted to the test information function (Lord, 1980, pp. 73-74).
The IRT-based conditional SEM
Note: This standard error is appropriate for the number-correct score. It reflects the accuracy with which θ can be predicted from any score, including the number-correct score; it is not limited to θ-estimates (Lord, 1980, p. 67).
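The quantities defined in this section can be sketched numerically. The following is a minimal illustration, not the computation used for MK9a: it assumes dichotomous items that follow a two-parameter logistic (2PL) model without the 1.7 scaling constant, it assumes local independence so that the conditional NC variance is the sum of the item variances, and the item parameters A and B are hypothetical values chosen only for the example. The function names are likewise illustrative.

```python
import numpy as np

# Hypothetical 2PL item parameters for illustration only; these are NOT the
# MK9a parameters, which are not listed in this article.
A = np.array([0.8, 1.0, 1.2, 1.5, 1.1])    # a-values (discrimination)
B = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])  # b-values (difficulty)


def item_probs(theta):
    """P_i(theta) under the assumed 2PL model; rows = theta values, columns = items."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    return 1.0 / (1.0 + np.exp(-A * (theta[:, None] - B)))


def tcc(theta):
    """Conditional mean of the NC score (the test characteristic curve)."""
    return item_probs(theta).sum(axis=1)


def csem(theta):
    """Traditional conditional SEM: square root of the summed item variances P(1 - P)."""
    p = item_probs(theta)
    return np.sqrt((p * (1.0 - p)).sum(axis=1))


def tcc_slope(theta):
    """Derivative of the TCC; for a 2PL item, dP/dtheta = a * P * (1 - P)."""
    p = item_probs(theta)
    return (A * p * (1.0 - p)).sum(axis=1)


def score_information(theta):
    """Score information for the NC score: squared TCC slope over conditional error variance."""
    return tcc_slope(theta) ** 2 / csem(theta) ** 2


def cirt_sem(theta):
    """IRT-based conditional SEM: reciprocal square root of the score information."""
    return 1.0 / np.sqrt(score_information(theta))


if __name__ == "__main__":
    for t in (-2.0, 0.0, 2.0):
        print(f"theta={t:5.1f}  TCC={tcc(t)[0]:.3f}  "
              f"CSEM={csem(t)[0]:.3f}  CirtSEM={cirt_sem(t)[0]:.3f}")
```

Under these assumptions the IRT-based standard error is expressed on the θ-scale, whereas the traditional conditional standard error is expressed in NC score units.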
A Paradox
The two indicators of conditional score precision, the traditional CSEM =
The test characteristic curve (TCC) for MK9a is presented in Figure 1.

Figure 1. Test characteristic curve (TCC) for MK9a.
Figure 2 displays a graph of
Computing
When using

Figure 2. CSEMs and CirtSEMs for NC scores from MK9a. CSEM = conditional standard errors of measurement; NC = number-correct.
Notice that in Figure 2 the correlation between the two conditional standard errors is −0.01; they are uncorrelated. Logic would seem to imply that this medium-difficulty test should measure most precisely in the middle of the θ-distribution, and
In order to examine the nature of
It is interesting to note that Equation (5) gives additional significance to the information function (which, in isolation, is rather barren of meaning): It is approximately the number of times greater the conditional true variance is than the conditional error variance,
A reviewer of this article made the important point that Equation (5) is an approximation that should be considered a heuristic. This approximation may be reasonable for values of θ that are not in the extremes of the θ-distribution. A TCC is a sigmoidal curve that has a convex portion (near the lower asymptote), an essentially linear portion, and a concave portion (near the upper asymptote). The Taylor series estimate of the conditional true variance should be most accurate in the nearly linear portion of the curve and least accurate in the two curved portions, especially if the item a-values are large (>2). If the nearly linear portion of a TCC is coincident with the area of maximum density for the θ-distribution (i.e., the central portion), then the conditional true-variance approximation should be good except at the most unlikely values of θ and for items with extremely large a-values.
To provide some data on the approximation for conditional true variances and conditional reliabilities, a small study of the reasonableness of
Resolution of the Paradox
Given the interpretation of the information function provided in Equation (4), the paradox is resolved: the root reciprocal information is approximately equal to the traditional CSEM divided by the conditional true-score standard deviation,
Equation (6) indicates that
Following from Cronbach and Gleser’s work, the conditional signal/noise, S/N(θ), may also be expressed as
where
Discussion
Now consider three ways for indicating conditional precision of a test score:

Figure 3. Traditional CSEMs, IRT-based CSEMs, and reliabilities for MK9a. CSEM = conditional standard error of measurement; IRT = item response theory.
Physical measurement differs from psychological and educational measurement: in physical measurement, precision is indicated when the variance of measurement errors is small, whereas in psychological-educational measurement, precision is indicated when the variance of measurement errors is small relative to the true variance. Therefore, given the nature of psychological-educational measurement, it is arguable that conditional indicators of precision should involve comparisons of the relative magnitudes of the conditional true and error variances. It now seems clear that the answer to the question posed in the title of this study is, “No.” Use of the CSEM (or conditional error variance) alone can give misleading information concerning the conditional precision of a test; however, it is useful in conjunction with other indices of precision.
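To make the contrast between the indicators concrete, the sketch below computes the traditional CSEM, the IRT-based CirtSEM, and a conditional reliability side by side. It is an illustration under stated assumptions rather than the MK9a analysis: the items are assumed to be 2PL, the parameters A and B are hypothetical, and conditional reliability is taken as I/(I + 1), which follows from reading the score information I as the approximate ratio of conditional true variance to conditional error variance.

```python
import numpy as np

# Hypothetical 2PL parameters for a short, medium-difficulty test (not MK9a).
A = np.array([0.8, 1.0, 1.2, 1.5, 1.1])    # a-values
B = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])  # b-values


def indicators(theta):
    """Return the three conditional precision indicators at each theta."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    p = 1.0 / (1.0 + np.exp(-A * (theta[:, None] - B)))
    error_var = (p * (1.0 - p)).sum(axis=1)   # conditional error variance of the NC score
    slope = (A * p * (1.0 - p)).sum(axis=1)   # slope of the test characteristic curve
    info = slope ** 2 / error_var             # score information for the NC score
    return {
        "csem": np.sqrt(error_var),           # traditional conditional SEM (NC units)
        "cirt_sem": 1.0 / np.sqrt(info),      # IRT-based conditional SEM (theta units)
        "reliability": info / (info + 1.0),   # assumed form of conditional reliability
    }


grid = np.array([-3.0, -1.5, 0.0, 1.5, 3.0])
out = indicators(grid)
for i, t in enumerate(grid):
    print(f"theta={t:5.1f}  CSEM={out['csem'][i]:.3f}  "
          f"CirtSEM={out['cirt_sem'][i]:.3f}  rel={out['reliability'][i]:.3f}")
```

For these hypothetical items the traditional CSEM is larger at θ = 0 than at θ = ±3, while the CirtSEM is smaller at θ = 0 than at the extremes, so the two standard errors point in opposite directions about where the test is most precise.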
In fairness to traditional CSEMs, it should be mentioned that the concept of conditional true variance was foreign to IRT until very recently. All that researchers and test developers had for determining conditional precision were estimates of the traditional CSEMs (in the case of true score theory) or score information functions and IRT-based CirtSEMs (in the case of IRT). When CTT dominated psychological and educational measurement, and all that was available were the overall (or marginal) test reliability and the overall SEM, it was realized that both of these indicators were necessary to describe the precision of a test. The reliability coefficient gives the proportion of score variability that is due to true differences among persons and the proportion that is due to measurement error. However, reliability alone, being a proportion, provides no indication of measurement precision in terms of the units of the test’s score scale. The marginal SEM expresses precision in such units and is used to set the length of the confidence intervals placed around a test’s true scores.
Looking back at Figure 3, there is something puzzling—in the lower half of the θ-distribution the traditional CSEMs are relatively large and nearly constant (indicating low and equal precision), but the conditional reliabilities are rapidly increasing in the same interval. How can this be? The answer is found in the fact that the magnitude of the CirtSEM (as in the case of the marginal SEMs) depends on both the conditional reliability coefficient and the conditional total variance of a measure—the larger the total variance of measure, the larger the CSEM, and vice-versa. Conditionally, the total variance for NC scores is equal to
Figure 4 displays the conditional true, error, and total variances. It is clear that, in the lower half of the θ-distribution, error variance increases very slowly, from 4.5 to 5.0, whereas true variance increases rapidly, from 0.04 to 41.7, over the same interval. It is easy to show that
which is in accord with CTT’s expression for error variance,

Figure 4. Conditional true, error, and total variances for MK9a.
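The relation among the conditional true, error, and total variances can be checked numerically. The sketch below is illustrative only and rests on several assumptions: 2PL items with hypothetical parameters A and B, conditional true variance approximated by the squared slope of the TCC (with the variance of θ set to one), and conditional total variance taken as the sum of the true and error variances. It then confirms that the error variance equals the total variance times one minus the conditional reliability, which mirrors the CTT expression.

```python
import numpy as np

# Hypothetical 2PL item parameters (not the MK9a parameters).
A = np.array([0.8, 1.0, 1.2, 1.5, 1.1])    # a-values
B = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])  # b-values


def conditional_variances(theta):
    """Approximate conditional true variance, error variance, and their sum."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    p = 1.0 / (1.0 + np.exp(-A * (theta[:, None] - B)))
    error_var = (p * (1.0 - p)).sum(axis=1)
    # Squared TCC slope, used here as the approximate conditional true variance.
    true_var = ((A * p * (1.0 - p)).sum(axis=1)) ** 2
    return true_var, error_var, true_var + error_var


# Lower half of the theta range, as in the discussion of Figure 4.
theta = np.linspace(-3.0, 0.0, 7)
true_var, error_var, total_var = conditional_variances(theta)
reliability = true_var / total_var

# CTT-style identity: error variance = total variance * (1 - reliability).
error_from_ctt = total_var * (1.0 - reliability)

for row in zip(theta, true_var, error_var, total_var, reliability):
    print("  ".join(f"{v:7.3f}" for v in row))
```

The printed columns are θ, true variance, error variance, total variance, and conditional reliability; the numerical values depend entirely on the hypothetical parameters and are not comparable to those reported for MK9a.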
Then viewing Figures 3 and 4, it can be seen that what is happening in the lower portion of the θ-distribution, is that conditional reliability,
What Does All This Mean Practically?
The results presented here indicate that if test developers want to demonstrate where a test measures well and where it does not, the indices best suited for this purpose are, putatively, the conditional reliability coefficient and the CirtSEM. However, to fully understand the practical effects of measurement precision in applications of these indicators, it can be of considerable value to relate conditional reliabilities and SEMs to the NC true-score scale, as opposed to the latent θ-scale. The θ-scale values themselves are devoid of meaning; they take on meaning only through their normal percentiles. Associating indices of measurement precision with the NC scale can be useful, as shown below. Of particular value is that precision (e.g., conditional reliability coefficients) can be computed at various cut-scores on the NC score scale for a test.
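One way to evaluate precision at a cut-score on the NC true-score scale is to invert the TCC for the θ that produces that true score and then evaluate the indicator at that θ. The sketch below does this by bisection. It is a minimal illustration under assumptions not taken from the article: the items are 2PL with hypothetical parameters A and B, conditional reliability is taken as I/(I + 1), and the function names and cut-scores are made up for the example.

```python
import numpy as np

# Hypothetical 2PL item parameters for a 5-item test (not MK9a).
A = np.array([0.8, 1.0, 1.2, 1.5, 1.1])    # a-values
B = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])  # b-values


def tcc(theta):
    """NC true score at a single theta value."""
    return float((1.0 / (1.0 + np.exp(-A * (theta - B)))).sum())


def theta_at_true_score(cut, lo=-8.0, hi=8.0, tol=1e-10):
    """Invert the TCC by bisection; valid because the TCC is increasing when all a > 0."""
    if not tcc(lo) < cut < tcc(hi):
        raise ValueError("cut-score lies outside the range covered by the search interval")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tcc(mid) < cut:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def conditional_reliability(theta):
    """Assumed form I / (I + 1), with I the score information for the NC score."""
    p = 1.0 / (1.0 + np.exp(-A * (theta - B)))
    error_var = float((p * (1.0 - p)).sum())
    slope = float((A * p * (1.0 - p)).sum())
    info = slope ** 2 / error_var
    return info / (info + 1.0)


if __name__ == "__main__":
    for cut in (1.0, 2.5, 4.0):  # hypothetical cut-scores on the NC true-score scale
        th = theta_at_true_score(cut)
        print(f"cut={cut:.1f}  theta={th:6.3f}  reliability={conditional_reliability(th):.3f}")
```

Bisection is used here only because it needs no additional libraries; any root finder would serve, since the TCC is monotone.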
Figure 5 is a plot of the three measures of conditional measurement precision against the NC true score scale for MK9a. The pattern of results is the same as it is for the θ-scale, but the scale is based on the NC true scores. In Figure 5, one can see that, according to the conditional reliabilities, this test varies quite a bit in precision across the score scale, with conditional reliabilities ranging between .17 and .91 for NC scores of 6 and 19, respectively. Conditional reliability drops off rapidly for scores lower than 10 and higher than 23. Scores lower than 10 have conditional reliabilities between .09 and .56; scores higher than 23 have reliabilities between .23 and .67. Between the scores of 10 and 23 (77% of the NC-score distribution), the reliabilities are more uniform, ranging from .75 to .91. The CirtSEMs are consonant with the conditional reliabilities in that they are larger below 10 and above 23, and smaller and fairly uniform between these two scores.

Figure 5. Conditional standard errors of measurement (SEMs) and reliability coefficients plotted against number-correct true scores for MK9a.
Using the conventional CSEM, the picture of this test is quite different. The scores between 6 and 23 show nearly equal CSEMs (that are larger than the CirtSEMs) in this 15-point interval, which constitutes 93% of the NC score distribution. The CSEMs erroneously indicate that the test is most precise for scores between 23 and 25 (about 7% of the population). This pattern of conditional measurement precision on the NC-score scale is the same as the pattern obtained using the θ-scale; only the scale units differ.
It is sometimes desirable to have approximately equal measurement precision across the true-score or θ-scales; for example, this near-equality is one of the important criteria in the development of the ACT® Assessment Test (see ACT Inc., 2014, p. 51). How would MK9a fare in terms of approximately equal conditional measurement precision across the NC score scale? Using the conventional CSEM, one could logically conclude that MK9a has very uniform precision (CSEMs of about 2.0) in the lower 93% of the NC-score distribution. In the upper 7% of this distribution, the CSEMs are lowest and range from .32 to 1.2. The conclusion from conventional CSEMs is that MK9a measures with very uniform precision across 93% of the score scale, and with greatest precision in the upper 7%.
Using the conditional reliability coefficients, or CirtSEMs, the measurement precision looks more variable across the score scale than indicated by the CSEMs. However, one could argue that MK9a has uniform measurement precision in the sense that 77% of the conditional reliabilities are between .75 and .91 in the central portion of the score distribution (10 ≤ NC ≤ 23) [or the corresponding θ-interval (−1 ≤ θ ≤ 1.5)]. What is labelled as uniform score precision depends very heavily on the situation, and, of course, uniform precision is not always desirable; it may be more important to have higher precision in a particular portion of the score distribution. As an aside, the only way that broad, uniform conditional precision can be attained is through use of a computer adaptive test that draws on an item bank of widely varying item difficulties. However, medium-difficulty tests with a-values in the vicinity of 1.0 or higher (like the one analyzed here) should have nearly uniform conditional reliabilities over the majority of a score distribution.
Summary and Conclusions
This inquiry involved three indicators of conditional precision of test scores: (1) the traditional CSEM,
Traditional conditional SEM =
IRT-based conditional SEM =
Score information function =
Conditional reliability =
It was argued that, by itself,
Hopefully, this research will find practical utility in the construction and analysis of test scores.
Appendix A
Assessed here is the reasonableness of the use of the squared slope of a test characteristic curve as an approximation of conditional true variance, namely
The average of the estimated conditional true variances, that is, the marginal true variance, is given by
The marginal true-and-error variances were found using,
These were then used to compute the marginal reliability coefficient,
Also, an estimated marginal reliability coefficient, using
In order to assess the generalizability of this support for,
as a reasonable approximation to conditional true variance—over a range of test difficulty and reliability, the following was done. The item parameters for MK9a were modified to produce tests that differed from MK9a in terms of difficulty (by manipulating b-values), reliability (by manipulating a-values), and difficulty and reliability (by manipulating a- and b-values). The following “tests” were generated by adding and subtracting constants from the MK9a a- and b-parameters:
Table A1 summarizes the effects of manipulating test difficulty and reliability on marginal and estimated marginal true variance, marginal error variance, marginal reliability, and estimated marginal reliability (using the approximate conditional true variance discussed earlier).
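The kind of comparison described in this appendix can be sketched as follows. This is an illustration under assumptions that are not drawn from the article: 2PL items with hypothetical parameters A and B, a standard normal θ-distribution integrated by Gauss-Hermite quadrature, marginal true variance taken as the variance of the TCC over that distribution, and marginal error variance taken as the average conditional error variance. The estimated marginal true variance is the average of the squared TCC slopes.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Hypothetical 2PL item parameters (not the MK9a parameters).
A = np.array([0.8, 1.0, 1.2, 1.5, 1.1])    # a-values
B = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])  # b-values

# Quadrature nodes and weights for the weight function exp(-x^2 / 2);
# dividing by the weight sum turns them into N(0, 1) probabilities.
nodes, weights = hermegauss(61)
weights = weights / weights.sum()

p = 1.0 / (1.0 + np.exp(-A * (nodes[:, None] - B)))
true_score = p.sum(axis=1)                         # TCC at each node
error_var = (p * (1.0 - p)).sum(axis=1)            # conditional error variance
slope_sq = ((A * p * (1.0 - p)).sum(axis=1)) ** 2  # approximate conditional true variance

# Marginal quantities over the assumed N(0, 1) theta-distribution.
marg_true_var = weights @ true_score ** 2 - (weights @ true_score) ** 2
est_marg_true_var = weights @ slope_sq
marg_error_var = weights @ error_var

marg_reliability = marg_true_var / (marg_true_var + marg_error_var)
est_marg_reliability = est_marg_true_var / (est_marg_true_var + marg_error_var)

print(f"marginal true variance:           {marg_true_var:.4f}")
print(f"estimated marginal true variance: {est_marg_true_var:.4f}")
print(f"marginal reliability:             {marg_reliability:.4f}")
print(f"estimated marginal reliability:   {est_marg_reliability:.4f}")
```

Under the normality assumption made here, the averaged squared slope cannot fall below the variance of the TCC (the Gaussian Poincaré inequality), so in this sketch the estimated values err on the high side; how large the gap is depends on the item parameters.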
Appendix B
Presented here is a sketch of a proof that the true variance, conditional on a fixed value of θ,
Let
since σ²(θ) is generally set equal to one in IRT. A more detailed proof requires a Taylor series expansion, but the idea and result are the same; see Nicewander (2018).
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
