Conditional Precision of Measurement for Test Scores: Are Conditional Standard Errors Sufficient?

Abstract

This inquiry focuses on three indicators of the precision of measurement, each conditional on fixed values of θ, the latent variable of item response theory (IRT). The indicators compared are (1) the traditional conditional standard error, $σ (e_{X} | θ)$ = CSEM; (2) the IRT-based conditional standard error, $σ_{irt} (e_{X} | θ) = C_{irt} SEM = \sqrt{1 / I (θ, X)}$ (where $I (θ, X)$ is the IRT score information function); and (3) a new conditional reliability coefficient, $ρ (X, X' | θ)$. These indicators of conditional precision are shown to be functionally related to one another. The IRT-based standard error, $C_{irt} SEM$, and the conditional reliability, $ρ (X, X' | θ)$, involve an estimate of the conditional true variance, $σ^{2} (t_{X} | θ)$, which is shown to be approximately equal to the numerator of the score information function. It is argued, and illustrated with an example, that the traditional CSEM is not sufficient for determining conditional score precision when used as the lone indicator of precision; used alone, it can misidentify the portions of a score distribution where scores are most and least precise.

Keywords

item response theory, conditional SEMs, information, conditional reliability

The overall, marginal precision of measurement for test scores is estimated using reliability coefficients, standard errors of measurement (SEM), and information functions. Currently, conditional measurement precision is estimated using traditional conditional SEMs (CSEMs) = $σ (e_{X} | θ)$ , information functions, $I (θ, X)$ and item response theory (IRT)–based conditional standard errors, $C_{irt} SEM = σ_{irt} (e_{X} | θ)$ (Lord, 1980, pp. 67-68). The present inquiry is focused on statistical methods for computing the conditional precision of test scores. It begins with the description of a paradox involving conditional precision as measured by the traditional CSEMs, and the IRT-based C_irtSEMs for number-correct scores.

As a prelude to describing the paradox, some notation and definitions are described, which allow for both dichotomous and polytomous items:

Let $X = \sum_{i = 1}^{n} x_{i}$ be the number-correct (NC) test score composed of n items, and let $s_{ij} = j - 1$ be the score for the jth category of the ith item, where j = 1, 2, . . ., m categories (m ≥ 2).

Let P_ij(θ) be the conditional probability of scoring in category j of item i; then the conditional mean and variance of item i are given by

E (x_{i} | θ) = \sum_{j} s_{ij} P_{ij} (θ),

where θ is the latent variable being measured and,

σ^{2} (x_{i} | θ) = \sum_{j} s_{ij}^{2} P_{ij} (θ) - {[E (x_{i} | θ)]}^{2} .

If x_i is a dichotomous item, scored 0 or 1, then, $E (x_{i} | θ) = P_{i} (θ)$ , and

σ^{2} (x_{i} | θ) = P_{i} (θ) [1 - P_{i} (θ)] .

Let $σ (e_{X} | θ) = \sqrt{\sum_{i} σ^{2} (x_{i} | θ)}$ be the traditional CSEM.
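As a minimal sketch of the definitions above, the following Python computes the conditional mean and variance of a single item from its category scores and category probabilities, then combines the item variances into the traditional CSEM. The function names and the category probabilities are hypothetical, chosen only for illustration; they are not values from any test discussed here.

```python
import math

def item_mean_var(scores, probs):
    """Conditional mean and variance of one item at a fixed theta.

    scores: category scores s_ij (for example 0, 1, ..., m - 1)
    probs:  category probabilities P_ij(theta), assumed to sum to 1
    """
    mean = sum(s * p for s, p in zip(scores, probs))
    second_moment = sum(s * s * p for s, p in zip(scores, probs))
    return mean, second_moment - mean ** 2

def traditional_csem(items):
    """Square root of the summed conditional item variances."""
    return math.sqrt(sum(item_mean_var(s, p)[1] for s, p in items))

# Hypothetical category probabilities at a single theta (illustration only).
dichotomous = ([0, 1], [0.4, 0.6])
three_category = ([0, 1, 2], [0.2, 0.5, 0.3])
csem = traditional_csem([dichotomous, three_category])
```

For the dichotomous item the variance reduces to P(1 − P), which is the special case given above.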

The score information function of IRT is defined as

I (θ, X) = {[\sum_{i} \frac{\partial}{\partial θ} E (x_{i} | θ)]}^{2} / \sum_{i} σ^{2} (x_{i} | θ) = {[\sum_{i} \frac{\partial}{\partial θ} E (x_{i} | θ)]}^{2} / σ^{2} (e_{X} | θ) .

It is important to remember that this is the score information function for the NC score—as contrasted to the test information function (Lord, 1980, pp. 73-74).

The IRT-based conditional SEM is

C_{irt} SEM = σ_{irt} (e_{X} | θ) = \sqrt{\frac{1}{I (θ, X)}} .

Note: This standard error is appropriate for the number-correct score. It reflects the accuracy with which θ can be predicted from any score, including the number-correct score; it is not limited to θ-estimates (Lord, 1980, p. 67).
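The score information function and the IRT-based standard error can be sketched for dichotomous items as follows. The sketch assumes a three-parameter logistic model without the 1.7 scaling constant, and the item parameters are hypothetical; neither choice is taken from the article.

```python
import math

def p3pl(theta, a, b, c):
    """3PL probability of a correct response (logistic metric, no 1.7 scaling)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def dp3pl(theta, a, b, c):
    """Derivative of the 3PL probability with respect to theta."""
    p = p3pl(theta, a, b, c)
    return a * (p - c) * (1.0 - p) / (1.0 - c)

def score_information(theta, items):
    """Squared slope of the TCC divided by the conditional error variance."""
    slope = sum(dp3pl(theta, *it) for it in items)
    err_var = sum(p3pl(theta, *it) * (1.0 - p3pl(theta, *it)) for it in items)
    return slope ** 2 / err_var

def c_irt_sem(theta, items):
    """IRT-based conditional SEM: root reciprocal of the score information."""
    return math.sqrt(1.0 / score_information(theta, items))

# Hypothetical (a, b, c) triples, for illustration only.
ITEMS = [(1.0, -1.0, 0.2), (1.1, 0.0, 0.2), (1.2, 1.0, 0.2)]
info_mid = score_information(0.0, ITEMS)
sem_mid = c_irt_sem(0.0, ITEMS)
```

Because the numerator is the squared sum of item slopes rather than a sum of item informations, this is the information for the NC score, not the test information function.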

A Paradox

The two indicators of conditional score precision, the traditional CSEM = $σ (e_{X} | θ)$ and C_irtSEM = $σ_{irt} (e_{X} | θ)$, can produce radically different views of conditional precision. The paradox is illustrated using a mathematics test, MK9a, a retired form of the Math Knowledge test from the Armed Services Vocational Aptitude Battery. It is composed of 25 medium-difficulty, dichotomous items (test mean = 15.68; SD = 5.43; reliability = .86). The average three-parameter logistic IRT item parameter estimates were

\bar{a} = 1.10 (SD = 0.38); \bar{b} = - 0.25 (SD = 0.99); \bar{c} = 0.22 (SD = 0.07) .

The test characteristic curve (TCC) for MK9a is presented in Figure 1.

Figure 1.

Test characteristic curve (TCC) for MK9a.

Figure 2 displays a graph of $σ (e_{X} | θ)$ and $σ_{irt} (e_{X} | θ)$ plotted against the latent knowledge variable, θ. The paradox illustrated in Figure 2 is rather striking, with the two types of precision measures telling very different stories about conditional measurement precision in various regions of the θ-distribution:

Computing $σ (e_{X} | θ)$ for the NC scores, X, indicates that at the low end of the θ-distribution, in the interval $(- 3 \leq θ \leq 0)$, X has low levels of precision that are fairly constant. The standard errors are largest near the center of the distribution, in the interval $(- 1 \leq θ \leq 0)$; they decrease considerably in the interval $(0 \leq θ \leq 3)$, and the score precision is greatest at the top end of the θ-distribution.

When using $σ_{irt} (e_{X} | θ)$ to determine precision, the results are nearly the opposite of those for $σ (e_{X} | θ)$ . In the interval $(- 3 \leq θ \leq 0)$ the score, X, increases rapidly in precision, and is maximally precise in the central portion of the θ-distribution, in the interval $(- 1 \leq θ \leq + 1)$ . The score, X, decreases rapidly in precision at the top of the θ-distribution where it is lowest.

Figure 2.

CSEMs and C_irtSEMs for NC scores from MK9a. CSEM = conditional standard errors of measurement; NC = number-correct.

Notice that in Figure 2 the correlation between the two conditional standard errors is −0.01; they are uncorrelated. Logic would seem to imply that this medium-difficulty test should measure most precisely in the middle of the θ-distribution; $σ_{irt} (e_{X} | θ)$ supports this logic, but $σ (e_{X} | θ)$ contradicts it. An examination of the nature of $σ^{2} (e_{X} | θ)$ may provide a partial explanation of the paradox. The test, MK9a, is composed of medium-difficulty, dichotomous, multiple-choice items. For this type of item, the IRT-based conditional p values, P_i(θ), define the conditional item means, and the conditional error variances are P_i(θ)(1 − P_i(θ)). Assuming local independence, the conditional test means are $E (Σ x_{i} | θ) = Σ P_{i} (θ)$, and the conditional test error variances are $σ^{2} (e_{X} | θ) = Σ σ^{2} (x_{i} | θ) = Σ P_{i} (θ) (1 - P_{i} (θ))$. For a medium-difficulty test, it is easy to see that in the central part of the θ-distribution, the P_i(θ)s are in the vicinity of .5, and the conditional item error variances are not much less than their maximum value of .25. In the upper tail of the distribution, they become smaller as the P_i(θ)s approach 1.0; in the lower tail they are intermediate because guessing prevents really small P_i(θ)s. Putting this all together results in an array of CSEMs following the pattern shown in Figure 2; that is, moderately high in the lower portion of the θ-distribution, largest in the central portion, and smallest in the upper region of the θ-distribution. Summarizing, CSEMs are largest in the portion of the θ-distribution where the majority of item b-values are close in value to the values of θ (in the case of MK9a, the majority (13/25) of the item b-values fall between −.47 and .43; seven are less than −.47, and four are greater than .427).

What accounts for the fact that the C_irtSEMs are in substantial disagreement with the CSEMs concerning where this example test measures best? It is shown below that the discrepancy is due to the composition of the two indicators of conditional precision of measurement, and not to different types of scores (NC versus θ-estimates). Both of these indicators of precision are appropriate for the NC score.
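The pattern of conditional error variances described above (intermediate in the lower tail, largest in the center, smallest in the upper tail) can be reproduced with a rough sketch. It uses a single stand-in item whose parameters are the average MK9a estimates reported earlier, and it assumes a three-parameter logistic model without the 1.7 scaling constant; a single average item is a simplification and not the actual 25-item test.

```python
import math

def p3pl(theta, a, b, c):
    """3PL probability of a correct response (no 1.7 scaling; an assumption)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_error_variance(theta, a, b, c):
    """Conditional error variance P(1 - P) of one dichotomous item."""
    p = p3pl(theta, a, b, c)
    return p * (1.0 - p)

# One stand-in item built from the average estimates reported for MK9a.
A_BAR, B_BAR, C_BAR = 1.10, -0.25, 0.22

variances = {
    theta: item_error_variance(theta, A_BAR, B_BAR, C_BAR)
    for theta in (-3.0, 0.0, 3.0)
}
for theta, var in variances.items():
    print(f"theta = {theta:+.1f}  P(1 - P) = {var:.3f}")
```

The lower asymptote c keeps P(θ) well above zero at low θ, so P(1 − P) stays moderately large there, while at high θ it shrinks as P(θ) approaches 1.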

In order to examine the nature of $σ (e_{X} | θ)$ and $σ_{irt} (e_{X} | θ)$, it is first necessary to define the conditional true variance, $σ^{2} (t_{X} | θ)$, for the score, X. The concept of conditional true variance has long been absent from the IRT literature; however, Nicewander (2018) recently showed that the numerator of the information function, $[\sum_{i} \frac{\partial}{\partial θ} E (x_{i} | θ)]^{2}$, the squared slope of the TCC, is a Taylor series approximation of the conditional true variance for the NC score in an infinitesimally small interval around a fixed value of θ (see Appendix B for a sketch of the derivation). Therefore, the information function in Equation (3) can be rewritten as

I (X, θ) ≅ σ^{2} (t_{X} | θ) / σ^{2} (e_{X} | θ) .

It is interesting to note that Equation (5) gives additional significance to the information function (which, in isolation, is rather barren of meaning): It is approximately the number of times greater the conditional true variance is than the conditional error variance, $σ^{2} (e_{X} | θ)$ . For example, if I(X, θ) is 3, then the conditional true variance is about three times larger than the conditional error variance (which is just enough information to compute an estimate of the conditional reliability as, $3 σ^{2} (e_{X} | θ) / [3 σ^{2} (e_{X} | θ) + σ^{2} (e_{X} | θ)] = 3 / 4 = 0.75$ —more on this later).

A reviewer of this article made the important point that Equation (5) is an approximation that should be considered as a heuristic. This approximation may be reasonable for values of θ that are not in the extremes of the θ-distribution. A TCC is a sigmoidal curve that has a concave portion (near the lower asymptote), an essentially linear portion and a convex portion (near the upper asymptote). The Taylor series estimate of the conditional true variance should be most accurate in the nearly linear portion of the curve, and least accurate in the concave and convex portions—especially if the item a-values are large (>2). If the nearly linear portion of a TCC is coincident with the area of maximum density for the θ-distribution (i.e., the central portion), then the conditional true-variance approximation should be good for all but the most unlikely values of θ and for items with extremely large a-values.

To provide some data on the approximation for conditional true variances and conditional reliabilities, a small study of the reasonableness of $σ^{2} (t_{X} | θ)$ is presented in Appendix A. Before describing the study, keep in mind that this is not a situation where one could set known values of $σ^{2} (t_{X} | θ)$ and $ρ (X, X' | θ)$, and then see how accurately they can be estimated. However, what one can do is compute the marginal true variance, $σ^{2} (t_{X})$, and the reliability coefficient for a test, and then see how accurately these are estimated when the marginal average of the conditional true variances, $E [σ^{2} (t_{X} | θ)] = {\bar{σ}}^{2} (t_{X} | θ)$, is used as an estimate of $σ^{2} (t_{X})$. Details are given in Appendix A. In this study, six tests that were more and less difficult, and more and less reliable, than MK9a were generated by adding and subtracting constants from the MK9a a- and b-values. The principal statistics for these six “tests” formed from MK9a were (1) the marginal true variances; (2) the marginal averages, ${\bar{σ}}^{2} (t_{X} | θ)$, of the $σ^{2} (t_{X} | θ)$; (3) the marginal reliabilities of the tests; and (4) the approximate marginal reliabilities, computed using ${\bar{σ}}^{2} (t_{X} | θ)$ in place of the marginal true variance, $σ^{2} (t_{X})$, in the ratio of true variance to true plus error variance. In a nutshell, the basic findings were (1) the marginal reliabilities and the approximate marginal reliabilities never differed by more than .03 (for one case) and differed by .00 to .02 for the rest, and (2) the approximate marginal true variances were all larger than the actual marginal true variances, one by as much as 25% and one by as little as 3%. It is also shown that the average of the conditional reliability coefficients was a poor estimate of the marginal reliability.

Overall, the data in Appendix A give some information about where the $σ^{2} (t_{X} | θ)$ are and are not reasonable approximations; they were most reasonable for the easier tests with lower a-values and worst for difficult tests with high a-values. In all cases, it is arguable that the conditional reliabilities were reasonable approximations, though less so the conditional true variances themselves.

Resolution of the Paradox

Given the interpretation of the information function provided in Equation (5), the paradox is resolved: the root reciprocal information is approximately equal to the traditional CSEM divided by the conditional true-score standard deviation,

σ_{irt} (e_{X} | θ) ≅ σ (e_{X} | θ) / σ (t_{X} | θ) .
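Written out, this step combines the definition of $C_{irt} SEM$ as the root reciprocal of the score information with the approximation of the information function as the ratio of conditional true variance to conditional error variance:

σ_{irt} (e_{X} | θ) = \sqrt{\frac{1}{I (θ, X)}} ≅ \sqrt{\frac{σ^{2} (e_{X} | θ)}{σ^{2} (t_{X} | θ)}} = \frac{σ (e_{X} | θ)}{σ (t_{X} | θ)} .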

Equation (6) indicates that $σ_{irt} (e_{X} | θ)$ takes into account conditional true variance as well as error variance; consequently, $σ_{irt} (e_{X} | θ)$ is smaller than $σ (e_{X} | θ)$ when $σ (t_{X} | θ)$ is greater than 1. Of course, another way to include conditional true variance in determining measurement precision is through use of a conditional reliability coefficient. Recall that the information function was shown in Equation (5) to be approximately $σ^{2} (t_{X} | θ) / σ^{2} (e_{X} | θ)$. Nicewander (2018) pointed out that this ratio of conditional true-to-error variances is interpretable as the conditional version of Cronbach and Gleser’s (1964) signal/noise (S/N) ratio,

S / N = ρ (X, X') / [1 - ρ (X, X')] .

Following from Cronbach and Gleser’s work, the conditional signal/noise, S/N(θ), may also be expressed as

S / N (θ) = \frac{ρ (X, X' | θ)}{1 - ρ (X, X' | θ)} ≅ I (θ, X),

where $ρ (X, X' | θ)$ is the conditional reliability coefficient for X at a fixed value of θ. Solving Equation (7) for $ρ (X, X' | θ)$ yields

ρ (X, X' | θ) ≅ \frac{I (X, θ)}{I (X, θ) + 1} .
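The conversion between score information and conditional reliability is simple enough to sketch in a few lines of Python. The function names are hypothetical, and the result inherits the approximate status of the true-variance interpretation of the information function.

```python
def reliability_from_information(info):
    """Approximate conditional reliability, I / (I + 1), from score information."""
    if info < 0:
        raise ValueError("information must be non-negative")
    return info / (1.0 + info)

def information_from_reliability(rho):
    """Inverse mapping: the signal/noise ratio rho / (1 - rho)."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("reliability must lie in [0, 1)")
    return rho / (1.0 - rho)

# The worked example from the text: information of 3 gives reliability .75.
rho_example = reliability_from_information(3.0)
```

Read in the other direction, conditional reliabilities of .70, .80, and .90 correspond to information values of roughly 2.33, 4, and 9.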

Discussion

Now consider three ways of indicating the conditional precision of a test score: $σ (e_{X} | θ)$, $σ_{irt} (e_{X} | θ)$, and the conditional reliability coefficient, $ρ (X, X' | θ)$ (the information function is not included because of its mathematical relationship to the conditional reliability coefficient). These three indicators of conditional measurement precision are graphed in Figure 3. Both $σ_{irt} (e_{X} | θ)$ and $ρ (X, X' | θ)$ agree that the most precise measurement is in the central portion of the θ-distribution; $σ (e_{X} | θ)$ indicates that precision is greatest in the upper portion of the θ-distribution. If it is agreed that the IRT-based C_irtSEM and conditional reliability are preferred indicators of conditional precision, then one can argue that, of the two, the conditional reliability coefficient is more useful for practical testing situations than the IRT-based C_irtSEM. The conditional reliability, being bounded by 0 and 1, has restricted values that are more familiar to most users of tests. For example, there is probably substantial agreement that coefficients in the range .90 to .95 indicate high reliability; reliabilities of .80 to .90 are moderately high; reliabilities in the range of .70 to .80 are adequate; and reliabilities less than .70 are low. So, if the above “rules of thumb” are applied when one looks at Figure 3, statements such as the following clearly describe the conditional precision in familiar terms: “The MK9a math test has moderately high conditional reliability equal to .80 or more in the central portion of the θ-distribution in the interval $(- 0.8 \leq θ \leq 1.6)$, and adequate reliability equal to .70 or more in the interval $(- 1.2 \leq θ \leq 2.0)$.” One can also evaluate the conditional reliability at cut scores, for example, “This test has adequate reliability equal to .75 at a cut score of θ = −1, and high reliability equal to .90 at a cut score of θ = 1.”

Figure 3.

Traditional CSEMs, IRT-based CSEMs, and reliabilities for MK9a. CSEM = conditional standard error of measurement; IRT = item response theory.

Physical measurement differs from psychological and educational measurement in that its precision is indicated when the variance of measurement errors is small. In psychological-educational measurement, precision is indicated when the variance of measurement errors is small relative to the true variance. Therefore, given the nature of psychological-educational measurement, it is arguable that conditional indicators of precision should involve comparisons of the relative magnitudes of the conditional true and error variances. It now seems clear that the answer to the question posed in the title of this study is “No.” Used alone, the traditional CSEM (or conditional error variance) can give misleading information concerning the conditional precision of a test; however, it is useful in conjunction with other indices of precision.

In fairness to traditional CSEMs, it should be mentioned that the concept of conditional true variance was foreign to IRT until very recently. All that researchers and test developers had for use in determining conditional precision were estimates of the traditional CSEMs—in the case of true score theory—or score information functions and IRT-based C_irtSEMs—in the case of IRT. Back in the days when CTT dominated psychological and educational measurement—and all that was available were the overall (or marginal) test reliability, and the overall SEM—it was realized that, in order to describe the precision of a test, both of these indicators were necessary for determining precision. The reliability coefficient provides the proportion of score variability that is due to true differences among persons, and what proportion is due to measurement error. However, reliability alone, being a proportion, provides no indication of measurement precision in terms of the units of the test’s score scale. The marginal SEM expresses precision in terms of such units, and is used to express precision in terms of the length of the confidence intervals placed around a test’s true scores.

Looking back at Figure 3, there is something puzzling: in the lower half of the θ-distribution the traditional CSEMs are relatively large and nearly constant (indicating low and equal precision), but the conditional reliabilities are rapidly increasing in the same interval. How can this be? The answer is found in the fact that the magnitude of the traditional CSEM (as in the case of the marginal SEM) depends on both the conditional reliability coefficient and the conditional total variance of a measure: the larger the total variance of a measure, the larger the CSEM, and vice versa. Conditionally, the total variance for NC scores is equal to

σ^{2} (X_{total} | θ) ≅ σ^{2} (t_{X} | θ) + σ^{2} (e_{X} | θ) .

Figure 4 displays the conditional true, error, and total variances, and it is clear that in the lower half of the θ-distribution, error variance is increasing very slowly, from 4.5 to 5.0, but true variance is increasing rapidly, from 0.04 to 41.7, in the same interval. It is easy to show that

σ^{2} (e_{X} | θ) = [1 - ρ (X, X' | θ)] σ^{2} (X_{total} | θ),

which is in accord with CTT’s expression for error variance, $σ^{2} (e) = [1 - ρ (X, X')] σ^{2} (X)$.
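The algebra behind this identity is short. As a sketch, using the approximate forms of the conditional reliability and the conditional total variance given earlier:

1 - ρ (X, X' | θ) ≅ 1 - \frac{σ^{2} (t_{X} | θ)}{σ^{2} (t_{X} | θ) + σ^{2} (e_{X} | θ)} = \frac{σ^{2} (e_{X} | θ)}{σ^{2} (t_{X} | θ) + σ^{2} (e_{X} | θ)} ≅ \frac{σ^{2} (e_{X} | θ)}{σ^{2} (X_{total} | θ)} ,

so multiplying both sides by $σ^{2} (X_{total} | θ)$ recovers $σ^{2} (e_{X} | θ)$. As a rough numerical illustration, and assuming that the upper values quoted above for Figure 4 (error variance 5.0, true variance 41.7) belong to the same value of θ, the conditional reliability there is about 41.7 / 46.7 ≈ .893, and (1 − .893)(46.7) ≈ 5.0.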

Figure 4.

Conditional true, error and total variances for MK9a.

Viewing Figures 3 and 4 together, it can be seen that in the lower portion of the θ-distribution the conditional reliability, $ρ (X, X' | θ)$, and the total conditional variance, $σ^{2} (X_{total} | θ)$, are increasing in such a way that $σ^{2} (e_{X} | θ)$ remains nearly constant.

What Does All This Mean Practically?

The results presented here indicate that if test developers want to demonstrate where a test measures well and not so well, the indices best suited for this purpose are, putatively, the conditional reliability coefficient and the C_irtSEM. However, in order to fully understand the practical effects of measurement precision in applications of these indicators, it can be of considerable practical value to relate conditional reliabilities and SEMs to the NC true-score scale, as opposed to the latent θ-scale. The θ-scale values themselves are devoid of meaning, and only through their normal percentiles do they take on meaning. Associating indices of measurement precision with the NC scale can be useful as shown below. Of particular value, is that precision (e.g., conditional reliability coefficients) can be computed at various cut-scores on the NC score scale for a test.
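One way to relate conditional reliability to the NC true-score scale is to invert the TCC numerically: find the θ whose true score equals a given NC cut score, then evaluate the conditional reliability at that θ. The sketch below does this by bisection. It assumes three-parameter logistic items without the 1.7 scaling constant, and the item parameters, the cut score, and all function names are hypothetical.

```python
import math

def p3pl(theta, a, b, c):
    """3PL probability of a correct response (no 1.7 scaling; an assumption)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def dp3pl(theta, a, b, c):
    """Derivative of the 3PL probability with respect to theta."""
    p = p3pl(theta, a, b, c)
    return a * (p - c) * (1.0 - p) / (1.0 - c)

def tcc(theta, items):
    """NC true score at theta: the sum of the item probabilities."""
    return sum(p3pl(theta, *it) for it in items)

def cond_reliability(theta, items):
    """Approximate conditional reliability, I / (1 + I)."""
    slope = sum(dp3pl(theta, *it) for it in items)
    err_var = sum(p3pl(theta, *it) * (1.0 - p3pl(theta, *it)) for it in items)
    info = slope ** 2 / err_var
    return info / (1.0 + info)

def theta_for_true_score(target, items, lo=-8.0, hi=8.0, tol=1e-10):
    """Invert the monotone increasing TCC by bisection."""
    if not tcc(lo, items) < target < tcc(hi, items):
        raise ValueError("target true score is outside the attainable range")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tcc(mid, items) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Eleven hypothetical items; the cut score of 7 is also hypothetical.
ITEMS = [(1.1, -0.25 + 0.2 * k, 0.22) for k in range(-5, 6)]
theta_cut = theta_for_true_score(7.0, ITEMS)
rel_at_cut = cond_reliability(theta_cut, ITEMS)
```

Bisection is used only because the TCC is monotone in θ when all a-values are positive; with guessing, true scores below the sum of the c-values cannot be reached, which is why the function rejects targets outside the attainable range.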

Figure 5 is a plot of the three measures of conditional measurement precision against the NC true score scale for MK9a. The pattern of results is the same as it is for the θ-scale, but the scale is based on the NC true scores. In Figure 5, one can see that, according to the conditional reliabilities, this test varies quite a bit in precision across the score scale, with conditional reliabilities ranging between .17 and .91 for NC scores of 6 and 19, respectively. Conditional reliability drops off rapidly for scores lower than 10 and higher than 23. Scores less than 10 have conditional reliabilities between .09 and .56; scores greater than 23 have reliabilities between .23 and .67. In between the scores of 10 and 23 (77% of the NC-score distribution), the reliabilities are rather more uniform ranging from .75 to .91. The C_irtSEMs are consonant with the conditional reliabilities in that they are larger below 10 and above 23, and smaller and fairly uniform in between these two scores.

Figure 5.

Conditional standard errors of measurement (SEMs) and reliability coefficients plotted against number-correct true scores for MK9a.

Using the conventional CSEM, the picture of this test is quite different. The scores between 6 and 23 show nearly equal CSEMs (that are larger than the C_irtSEMs) in this 15-point interval, which constitutes 93% of the NC score distribution. The CSEMs erroneously indicate that the test is most precise for scale scores between 23 and 25 (about 7% of the population). This pattern of results for conditional measurement precision on the NC-score scale is the same as the pattern obtained using the θ-scale; only the scale units differ.

It is sometimes desirable to have approximately equal measurement precision across the true-score or θ-scales; for example, this near-equality is one of the important criteria in the development of the ACT® Assessment Test (see ACT Inc., 2014, p. 51). How would MK9a fare in terms of approximately equal conditional measurement precision across the NC score scale? Using the conventional CSEM, one could logically conclude that MK9a has very uniform precision (CSEMs of about 2.0) in the lower 93% of the NC-score distribution. In the upper 7% of this distribution, the CSEMs are lowest and range from .32 to 1.2. The conclusion from conventional CSEMs is that MK9a measures with very uniform precision across 93% of the score scale, and with greatest precision in the upper 7%.

Using the conditional reliability coefficients, or C_irtSEMs, the measurement precision looks more variable across the score scale than indicated by the CSEMs. However, one could argue that MK9a has uniform measurement precision in the sense that 77% of the conditional reliabilities are between .75 and .91 in the central portion of the score distribution (10 ≤ NC ≤ 23) [or the corresponding θ-interval (−1 ≤ θ ≤ 1.5)]. What is labelled as uniform score precision depends very heavily on the situation, and, of course, uniform precision is not always desirable; it may be more important to have higher precision in a particular portion of the score distribution. As an aside, the only way that broad, uniform conditional precision can be attained is through use of a computer adaptive test that draws on an item bank of widely varying item difficulties. However, medium-difficulty tests with a-values in the vicinity of 1.0 or higher (like the one analyzed here) should have nearly uniform conditional reliabilities over the majority of a score distribution.

Summary and Conclusions

This inquiry involved three indicators of conditional precision of test scores: (1) the traditional CSEM, $σ (e_{X} | θ)$; (2) the IRT-based C_irtSEM = $\sqrt{1 / I (θ, X)}$; (3) the conditional reliability coefficient, $ρ (X, X' | θ) ≅ I (X, θ) / [1 + I (X, θ)]$; and, tangentially, the score information function, $I (θ, X)$. It is interesting to note that all of these measures are functions of the conditional true-and-error variances, $σ^{2} (t_{X} | θ)$ and $σ^{2} (e_{X} | θ)$:

Traditional conditional SEM = $σ (e_{X} | θ)$ ;

IRT-based conditional SEM ≅ $σ (e_{X} | θ) / σ (t_{X} | θ)$ ;

Score information function = $I (X, θ) ≅ σ^{2} (t_{X} | θ) / σ^{2} (e_{X} | θ)$ ;

Conditional reliability = $ρ (X, X' | θ) ≅ \frac{I (X, θ)}{1 + I (X, θ)} ≅ \frac{σ^{2} (t_{X} | θ)}{σ^{2} (t_{X} | θ) + σ^{2} (e_{X} | θ)}$ .

It was argued that, by itself, $σ (e_{X} | θ)$ cannot adequately determine conditional score precision. The two indicators, $ρ (X, X' | θ)$ and $σ_{irt} (e_{X} | θ)$, are functions of both $σ^{2} (t_{X} | θ)$ and $σ^{2} (e_{X} | θ)$, the conditional true-and-error variances, and both are acceptable indicators of conditional score precision. Because a reliability coefficient is so universally used and understood through rules of thumb, it is arguable that the preferred method of characterizing the degree of conditional precision is the conditional reliability coefficient. Nevertheless, the CSEM is still a required index of precision, especially when used in conjunction with conditional reliability. Finally, in order to make the above measures of conditional precision more relevant in applied situations, they can be computed for a test’s score scale. Since in IRT true scores and θs are functionally related, an observed score scale can be constructed using the true scores corresponding to a set of θ-values, as was done in Figure 5. Using the true-score scale as a basis, rather than the latent θ-scale, allows one to directly determine the effects of measurement precision in various regions of the score scale for a test. For example, in applied testing situations, it is more helpful to be able to say, “This test has a conditional reliability of approximately .75 at a cut score of 10, and .90 for a cut score of 19,” compared with “This test has a conditional reliability of approximately .75 for a cut score of −1, and .90 for a score of 1 on the latent θ-scale.”

Hopefully, this research will find practical utility in the construction and analysis of test scores.

Appendix A

Assessed here is the reasonableness of the use of the squared slope of a test characteristic curve as an approximation of conditional true variance, namely

σ^{2} (t_{X} | θ) ≅ [\sum_{i} \frac{\partial}{\partial θ} E (x_{i} | θ)]^{2} = {\hat{σ}}^{2} (t_{X} | θ) .

The average of the estimated conditional true variances, that is, the estimated marginal true variance, is given by

{\hat{σ}}_{t_{X}}^{2} = E [{\hat{σ}}^{2} (t_{X} | θ)] = \int_{- \infty}^{+ \infty} {\hat{σ}}^{2} (t_{X} | θ) φ (θ) d θ .

The marginal true-and-error variances were found using,

σ_{t_{X}}^{2} = \int_{- \infty}^{+ \infty} {[t_{X | θ} - E (t_{X | θ})]}^{2} φ (θ) d θ and σ_{e_{X}}^{2} = \int_{- \infty}^{+ \infty} σ^{2} (e_{X} | θ) φ (θ) d θ .

These were then used to compute the marginal reliability coefficient,

ρ (X, X') = σ_{t_{X}}^{2} / (σ_{t_{X}}^{2} + σ_{e_{X}}^{2}) .

Also, an estimated marginal reliability coefficient, using ${\hat{σ}}_{t_{X}}^{2}$ , was computed as,

ρ_{est} (X, X') = {\hat{σ}}_{t_{X}}^{2} / ({\hat{σ}}_{t_{X}}^{2} + σ_{e_{X}}^{2}) .

In order to assess how well the support for

{\hat{σ}}^{2} (t_{X} | θ) = [\sum_{i} \frac{\partial}{\partial θ} E (x_{i} | θ)]^{2}

as a reasonable approximation to conditional true variance generalizes over a range of test difficulty and reliability, the following was done. The item parameters for MK9a were modified to produce tests that differed from MK9a in terms of difficulty (by manipulating b-values), reliability (by manipulating a-values), and difficulty and reliability (by manipulating a- and b-values). The following “tests” were generated by adding and subtracting constants from the MK9a a- and b-parameters:

Table A1 summarizes the effects of manipulating test difficulty and reliability on marginal and estimated marginal true variance; marginal error variance; marginal reliability; and estimated marginal reliability (using the approximate conditional true variance discussed earlier).
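The marginal quantities defined in this appendix can be approximated by numerical integration over a standard normal θ. The sketch below uses a composite trapezoid rule on the interval from −6 to 6. It assumes three-parameter logistic items without the 1.7 scaling constant; the item parameters and the constant added to the b-values are hypothetical and are not the MK9a values or the constants used in the study.

```python
import math

def p3pl(theta, a, b, c):
    """3PL probability of a correct response (no 1.7 scaling; an assumption)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def dp3pl(theta, a, b, c):
    """Derivative of the 3PL probability with respect to theta."""
    p = p3pl(theta, a, b, c)
    return a * (p - c) * (1.0 - p) / (1.0 - c)

def normal_pdf(theta):
    """Standard normal density, phi(theta)."""
    return math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)

def integrate(f, lo=-6.0, hi=6.0, n=2400):
    """Composite trapezoid rule for the integral of f(theta) * phi(theta)."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n + 1):
        theta = lo + k * h
        weight = 0.5 if k in (0, n) else 1.0
        total += weight * f(theta) * normal_pdf(theta)
    return total * h

def marginal_summary(items):
    """Marginal and estimated marginal true variance, error variance, reliabilities."""
    def tcc(th):
        return sum(p3pl(th, *it) for it in items)

    def slope(th):
        return sum(dp3pl(th, *it) for it in items)

    def err(th):
        return sum(p3pl(th, *it) * (1.0 - p3pl(th, *it)) for it in items)

    mean_t = integrate(tcc)
    true_var = integrate(lambda th: (tcc(th) - mean_t) ** 2)
    est_true_var = integrate(lambda th: slope(th) ** 2)
    err_var = integrate(err)
    return {
        "true_var": true_var,
        "est_true_var": est_true_var,
        "err_var": err_var,
        "rel": true_var / (true_var + err_var),
        "rel_est": est_true_var / (est_true_var + err_var),
    }

BASE = [(1.1, -0.25 + 0.2 * k, 0.22) for k in range(-5, 6)]  # hypothetical items
HARDER = [(a, b + 0.5, c) for a, b, c in BASE]               # hypothetical shift of b
base, harder = marginal_summary(BASE), marginal_summary(HARDER)
```

When θ is standard normal, the Gaussian Poincaré inequality implies that the averaged squared slope is at least as large as the marginal true variance, which is consistent with the direction of the discrepancy reported in the main text.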

Appendix B

Presented here is a sketch of a proof that the true variance, conditional on a fixed value of θ, $σ^{2} (t_{X} | θ)$ , is approximately equal to the numerator of the score information function.

Let $t_{X | θ} = \sum_{i} E (x_{i} | θ)$ be the true score for the NC score, X, for a fixed value of θ = θ₀. In an infinitesimally small interval around θ₀, the function relating $t_{X | θ}$ to θ is approximately linear with slope $b_{θ} = \sum_{i} \frac{\partial}{\partial θ} E (x_{i} | θ) |_{θ = θ_{0}}$. Then, the variance of $t_{X | θ}$ at a fixed value of θ = θ₀ is approximately

σ^{2} (t_{X} | θ_{0}) \approx b_{θ}^{2} σ^{2} (θ) = [\sum_{i} \frac{\partial}{\partial θ} E (x_{i} | θ) |_{θ = θ_{0}}]^{2} σ^{2} (θ) = [\sum_{i} \frac{\partial}{\partial θ} E (x_{i} | θ) |_{θ = θ_{0}}]^{2},

since σ²(θ) is generally set equal to one in IRT. A more detailed proof requires a Taylor series expansion, but the idea and result are the same; see Nicewander (2018).

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

ACT Inc. (2014). ACT® technical manual. Iowa City, IA: Author.

Cronbach, L. J., & Gleser, G. C. (1964). The signal/noise ratio in the comparison of reliability coefficients. Educational and Psychological Measurement, 24, 467-480.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.

Nicewander, W. A. (2018). Conditional reliability coefficients for test scores. Psychological Methods, 23, 351-362.