Conditional Subscore Reporting Using Iterated Discrete Convolutions

Abstract

The literature showing that subscores fail to add value is vast; yet despite their typical redundancy and the frequent presence of substantial statistical errors, many stakeholders remain convinced of their necessity. This article describes a method for identifying and reporting unexpectedly high or low subscores by comparing each examinee’s observed subscore with a discrete probability distribution of subscores conditional on the examinee’s overall ability. The proposed approach turns out to be somewhat conservative due to the nature of subscores as finite sums of item scores associated with a subdomain. Thus, the method may be a compromise that satisfies score users by reporting subscore information as well as psychometricians by limiting misinterpretation, at most, to the rates of Type I and Type II error.

Keywords

discrete convolutions, compound binomial distribution, symmetric functions, subscores, proportional reduction of mean squared error

Encouraged by at least 25 years of federal government mandates, beginning with the Improving America’s Schools Act of 1994 and continuing with No Child Left Behind of 2001 and, more recently, the Every Student Succeeds Act of 2015, subscores have become an expected component of feedback for testing programs. Subscore information can be presented at the individual level, with the goal of, for example, helping examinees focus their remedial activities by identifying their strengths and weaknesses, or at the institutional level, for example, to facilitate more nuanced comparisons of cohort performance. Some stakeholders may believe this additional information is needed to justify the high direct and indirect costs associated with an examination.

It is fundamentally important to distinguish between subscores from multidimensional and unidimensional tests. For tests constructed to be multidimensional, the overall score is essentially a composite of several distinct constructs, and the subscores can often be easily identified and supported through dimensionality analysis (e.g., factor analysis) and divergent validity. For instance, the SAT Math and Verbal section scores can be considered subscores. More commonly, subscores are computed from tests that were constructed to be unidimensional, where the underlying subscales are often designed post hoc by content expert judgments or sometimes defined by item types/features. The reading component score within the SAT Verbal section can be considered a subscore of this type. In this article, we focus on subscores from essentially unidimensional assessments, where the primary goal is to produce a single overall score, the dominant paradigm in large-scale testing.

Literature on the utility of subscores derived from unidimensional tests has found them to be notoriously difficult to justify from both item response theory (IRT) and classical test theory perspectives (Feinberg & Jurich, 2017; Haberman, 2008; Jiang & Raymond, 2018; Puhan et al., 2010; Rijmen et al., 2014; Sinharay, 2010; Thissen & Wainer, 2001; Verhelst, 2012). As previous research has demonstrated, subscores are typically based on a relatively small number of subdomain-specific items, and in many cases, these subdomain scores are better predicted by an individual’s overall score. Hence, subscores are often not sufficiently reliable or distinct enough from the rest of the test to justify separate reporting. This is unsurprising as Brennan (2012) observed, “If test scores fit a unidimensional model, a psychometrically compelling argument cannot be mounted for reporting any subscores since, by definition, there is only one proficiency or latent trait” (p. 14). Thus, subscores often reflect the jangle fallacy (Marsh, 1994) where two or more subscores have different names but represent the same construct.

With a growing body of literature suggesting that reporting subscores is seldom defensible, testing programs are beginning to change their score reporting policies. These changes have not always been warmly received by stakeholders. In 2014, the National Conference of Bar Examiners eliminated subscore reporting on the Multistate Bar Exam (Albanese, 2014); however, stakeholder response was so negative that some subscore information was eventually restored for failing candidates (Pieper Bar Review, 2017). Thus, simply not reporting subscores, though perhaps psychometrically ideal, may be politically impractical. Instead, a compromise is needed between whatever psychometric limitations a given exam’s subscores impose and the expectations and desired inferences of score users.

Perhaps the best way to create subscores that add value would be to use an evidence-centered design approach wherein the intended subscore inferences can be factored into the test design and blueprinting process (Mislevy et al., 2003), but this is likely to change the nature of the test and to produce a multidimensional instrument that may be less aligned with the intended goal of the assessment. Post hoc options include lengthening the test, combining similar content areas after the testing event, or calculating augmented subscores to boost subscore reliability. Yet, most often these strategies have been ineffective because they are applied to preexisting exams that were built to be unidimensional (Feinberg & Wainer, 2014; Sinharay et al., 2011). Moreover, the costs of these strategies may be difficult to justify given that the common uses of subscores (e.g., remedial feedback) are typically subordinate to the primary purpose of the exam (e.g., overall score, pass/fail classification).

Reporting subscores in alternative formats does not alleviate the problem. For instance, some testing programs report subscore information as profile bands where the width is a function of the subscore’s standard error. This common practice has been a way for psychometricians to appease stakeholders by reporting subscores while also aligning their practice with the testing standards by incorporating a measure of imprecision, as highlighted by Standard 2.3 of the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014). However, research has suggested that profile bands can be confusing and difficult to interpret accurately, even when detailed explanatory text is provided (Clauser & Rick, 2016; Rick & Clauser, 2016).

Yet, despite their typical redundancy and the frequent presence of substantial statistical errors, many stakeholders remain convinced that subscores are needed. The purpose of this study is to investigate an approach to subscore reporting intended to reduce many of the most common mistakes score users make when interpreting subscores on a sufficiently unidimensional test. We assume that test data can be fitted by a unidimensional model or, more specifically, that a multidimensional model does not provide better model-data fit than, for example, a one-factor model or, equivalently, a unidimensional IRT model (Lord & Novick, 1968). Our discussion here is limited to the Rasch model (Rasch, 1966; von Davier, 2016); however, the proposed method applies to more general IRT models as well (Lord & Novick, 1968).

Conditional Subscore Distributions Are Convolutions

Mathematically, the distribution of a sum of independent random variables, in this case a subscore, is called a convolution. Subscores are composed of identifiable units (items), each contributing a summand to the subscore, with probabilities for each summand that depend on respondent ability and item difficulty. This is true whether the subscores are based on dichotomous 0/1 item scores or a polytomous Likert-type rating scale ranging from 1 to 5. If all variables in the sum are binary and identically distributed, the well-known binomial distribution for 0 < p < 1 and K trials provides the probability of the sum score s as

P(S = s) = \binom{K}{s} p^{s} (1 - p)^{K - s}.

This expression describes the distribution of sums of independent Bernoulli trials that all share the same success probability. The expected value of the sum score is well known and equals

E(S) = K p,

and the variance equals

V(S) = K p (1 - p).

In testing, sum scores consist of (locally) independent trials (item scores) with different probabilities, as each item i is different and likely carries a different probability of success $P_{iω}$ for each respondent ω. Assuming local independence, the (binary) item responses are conditionally independent Bernoulli trials where the expected value for respondent ω equals

E(S) = \sum_{i = 1}^{K} P_{iω},

and the variance is

V(S) = \sum_{i = 1}^{K} P_{iω} (1 - P_{iω}),

which reduce to the binomial mean and variance if $P_{iω} = P_{ω} = P$ for all items i = 1, …, K. For polytomous ordinal variables with $x_{i} \in \{0, \dots, m_{i}\}$, the expected score is $E(S | ω) = \sum_{i = 1}^{K} E(X_{i} | ω)$ with $E(X_{i} | ω) = \sum_{x = 1}^{m_{i}} x P_{iωx}$ and $P_{iωx} = P_{i}(X = x | ω)$, and the variance is $V(S | ω) = \sum_{i = 1}^{K} V(X_{i} | ω)$ with $V(X_{i} | ω) = E(X_{i}^{2} | ω) - [E(X_{i} | ω)]^{2}$.

The probability of each sum of binary item scores for any respondent $ω$ with ability $θ_{ω}$ can be computed using a recursive formula that is customarily applied in latent trait theory (Lord & Wingersky, 1984) and in particular is well known from conditional inference in the Rasch model (Andersen, 1972; Gustafsson, 1980). This approach is also utilized for the $S - X^{2}$ item fit index (Orlando & Thissen, 2003). For items with a partial credit format, von Davier and Rost (1995) provide a summation algorithm for these types of conditional sums. This approach of recursive summation of probabilities to obtain the distribution of sums of scores gets rediscovered frequently (e.g., Evans & Leemis, 2004; more recently Biscarri et al., 2018; Gonzalez et al., 2016), showing how ubiquitous this problem is in applied statistics, even though it has been a long-solved problem. The algorithm we use can be applied to binary, polytomous, and mixed format tests (Henson, 1994; Thissen et al., 1995; von Davier & Rost, 1995).
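To make the convolution view concrete, the following minimal Python sketch builds the sum-score distribution by convolving one two-point item distribution at a time and checks that its mean and variance agree with the closed-form sums given above. The function name `sum_score_pmf` is illustrative, NumPy is an assumed dependency, and the five success probabilities are those of the illustrative example reported in Table 1.

```python
import numpy as np

def sum_score_pmf(success_probs):
    """Distribution of a sum of independent, non-identical Bernoulli item scores.

    Each item contributes the two-point distribution [1 - p, p]; convolving
    these one item at a time yields P(S = s) for s = 0, ..., K.
    """
    pmf = np.array([1.0])  # empty sum: S = 0 with probability 1
    for p in success_probs:
        pmf = np.convolve(pmf, [1.0 - p, p])
    return pmf

# Success probabilities from the five-item illustration (Table 1).
probs = np.array([0.9644, 0.9089, 0.5744, 0.3318, 0.1545])
pmf = sum_score_pmf(probs)
scores = np.arange(len(pmf))

# Moments computed directly from the convolved distribution.
mean_from_pmf = float(np.sum(scores * pmf))
var_from_pmf = float(np.sum(scores**2 * pmf) - mean_from_pmf**2)

# Closed-form moments under local independence.
mean_closed = float(probs.sum())
var_closed = float(np.sum(probs * (1.0 - probs)))

print(np.round(pmf, 4))
print(mean_from_pmf, mean_closed)
print(var_from_pmf, var_closed)
```

For a polytomous item, the two-point vector would be replaced by that item's vector of category probabilities for scores 0 through its maximum, and the same loop applies.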

Predicting Subscore Distributions From Overall Ability Estimates

In the case of the Rasch model, for person ω with estimated ability ${\hat{θ}}_{ω}$ , we can obtain the conditional probability of a correct response (score = 1) on item i with difficulty parameter $b_{i}$ as

p_{iω1} = \frac{e^{({\hat{θ}}_{ω} - b_{i})}}{1 + e^{({\hat{θ}}_{ω} - b_{i})}}.

The probability of an incorrect response (score = 0) is then given by $p_{iω0} = 1 - p_{iω1}$. Using these respondent-specific probabilities for each item in a particular subtest, we can calculate the probability of a given raw score of s on the first k items as the joint probability of scoring $s - 1$ on the first $k - 1$ items and responding correctly to the $k$th item, $p_{kω1} P_{(k - 1)ω}(S = s - 1)$, plus the joint probability of scoring s on the first $k - 1$ items and responding incorrectly to the $k$th item, $p_{kω0} P_{(k - 1)ω}(S = s)$:

P_{kω}(S = s | ω) = p_{kω1} P_{(k - 1)ω}(S = s - 1) + p_{kω0} P_{(k - 1)ω}(S = s).

The nested structure of this relationship allows $P_{kω}(S = s | ω)$ to be computed recursively for any k, s, and ω by calculating $P_{k'ω}(S = s' | ω)$ $(s' = 0, 1, \dots, k')$ for $k' = 2, 3, \dots, k$ until $k' = k$ and $s' = s$. (Note that when $s = 0$ or $s = k$, $P_{(k - 1)ω}(S = 0 - 1 | ω) \equiv 0$ and $P_{(k - 1)ω}(S = k | ω) \equiv 0$, respectively.) For the first level of the recursion, where $k = 1$, we must define the initial probabilities $P_{1ω}(S = x_{1})$ so that they agree with the item-level model as

P_{1ω}(S = x_{1}) = p_{1ω1}^{x_{1}} p_{1ω0}^{(1 - x_{1})},

for the first item and the (trivial) partial sum that contains only this item (see von Davier & Rost, 1995, as well as Grinstead & Snell, 1997, for an extension of this recursion to polytomous response data).
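The recursion can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the code used for the analyses: the function names are hypothetical, and the five item difficulties are not given in the article but were back-calculated so that, at an estimated ability of 1.3, the Rasch model returns the success probabilities of the illustrative example (.9644, .9089, .5744, .3318, .1545).

```python
import math

def rasch_prob(theta, b):
    """Rasch probability of a correct response for ability theta and difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def subscore_distribution(theta, difficulties):
    """Return [P(S = 0), ..., P(S = k)] by recursing over the k items."""
    p_first = rasch_prob(theta, difficulties[0])
    dist = [1.0 - p_first, p_first]  # first level of the recursion, k = 1
    for b in difficulties[1:]:
        p1 = rasch_prob(theta, b)    # correct response to the new item
        p0 = 1.0 - p1                # incorrect response to the new item
        new = []
        for s in range(len(dist) + 1):
            # P_{k-1}(S = s - 1), defined as 0 when s = 0
            prev_s_minus_1 = dist[s - 1] if s >= 1 else 0.0
            # P_{k-1}(S = s), defined as 0 when s exceeds k - 1
            prev_s = dist[s] if s < len(dist) else 0.0
            new.append(p1 * prev_s_minus_1 + p0 * prev_s)
        dist = new
    return dist

# Hypothetical difficulties, back-calculated from the example's probabilities.
theta_hat = 1.3
b_hypothetical = [-2.0, -1.0, 1.0, 2.0, 3.0]

dist = subscore_distribution(theta_hat, b_hypothetical)
print([round(p, 4) for p in dist])
```

Up to rounding in the fourth decimal, the printed values correspond to the final recursion row of Table 1.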

The recursion can be used for calculating the raw score probabilities for any subscale q consisting of a subset of $k_{q} \leq k$ items

\{x_{q(1)}, \dots, x_{q(k_{q})}\} \subseteq \{x_{1}, \dots, x_{k}\},

with lower thresholds

P_{qω}(S \leq s) = \sum_{t = 0}^{s} P_{qω}(S = t),

and upper thresholds for subscore probabilities

P_{qω}(S \geq s) = \sum_{t = s}^{k} P_{qω}(S = t).

These thresholds can lead to sets of categorical subscores by defining

C_{qω} = \begin{cases} L & \text{if } P_{qω}(S \leq s) < \frac{α_{crit}}{2}, \\ H & \text{if } P_{qω}(S \geq s) < \frac{α_{crit}}{2}. \end{cases}

A categorical subscore of $C_{qω} = L$ would be considered “lower than expected” and $C_{qω} = H$ “higher than expected” for subscale q and respondent ω with estimated overall ability ${\hat{θ}}_{ω}$. See Table 1 for an illustration of the recursion procedure for an individual with ${\hat{θ}}_{ω} = 1.3$ on a 5-item subscale where $P_{iω1} = (.9644, .9089, .5744, .3318, .1545)$ and $α_{crit} = .05$. In this illustration, if the test taker received a score of either 0 or 1 ($S_{qω}$ = 0 or 1), then they would receive a $C_{qω}$, or categorical performance indicator (CPI), of L, signifying that their performance was low compared to what was expected given their estimated overall ability. Similarly, if the test taker had received a score of 5 ($S_{qω}$ = 5), then they would receive a CPI of H, indicating that their performance was higher than expected. All other possible scores would be reported as not significantly different (NS) from what was expected given their estimated overall ability, with observed differences considered within the margin of error.

Table 1.

Recursion Iteration Procedure for Illustrative Example

	$S_{qω} = s$
Recursion Iteration	0	1	2	3	4	5
$k = (1)$	.0356	.9644
$k = (1:2)$	.0032	.1202	.8765
$k = (1:3)$	.0014	.0530	.4421	.5035
$k = (1:4)$	.0009	.0359	.3130	.4831	.1671
$k = (1:5)$	.0008	.0305	.2702	.4569	.2159	.0258
$P_{qω}(S \leq s)$	.0008	.0313	.3014	.7583	.9742	1.0000
$P_{qω}(S \geq s)$	1.0000	.9992	.9687	.6986	.2417	.0258
CPI	L	L	NS	NS	NS	H

Note. k = number of items, $S_{q ω} = s$ = subscore for test taker $ω$ , L = lower than expected, NS = nonsignificant, H = higher than expected, CPI = categorical performance indicator.
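The step from tail probabilities to categorical indicators can be sketched as follows. The function name `classify_cpi` and its `tail_cutoff` argument are hypothetical; the input is the final recursion row of Table 1. The per-tail cutoff is left as a parameter on purpose: with these rounded values, the L, L, NS, NS, NS, H pattern shown in Table 1 is obtained when each tail probability is compared with .05, whereas comparing with $α_{crit}/2 = .025$ flags only a score of 0.

```python
from itertools import accumulate

def classify_cpi(pmf, tail_cutoff):
    """Label each possible subscore as 'L', 'H', or 'NS'.

    pmf[s] is P(S = s); tail_cutoff is the per-tail probability threshold.
    """
    lower = list(accumulate(pmf))                  # P(S <= s)
    upper = list(accumulate(reversed(pmf)))[::-1]  # P(S >= s)
    labels = []
    for low_tail, high_tail in zip(lower, upper):
        if low_tail < tail_cutoff:
            labels.append("L")
        elif high_tail < tail_cutoff:
            labels.append("H")
        else:
            labels.append("NS")
    return labels

# Final row of the recursion in Table 1 (all five items).
pmf = [0.0008, 0.0305, 0.2702, 0.4569, 0.2159, 0.0258]

labels_05 = classify_cpi(pmf, 0.05)       # per-tail cutoff of .05
labels_025 = classify_cpi(pmf, 0.05 / 2)  # per-tail cutoff of alpha_crit / 2
print(labels_05)
print(labels_025)
```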

Example With Operational and Resampled Data

Operational Data

Data sets from two operational exam programs were obtained. Data from Test 1, a multiple-form high-stakes pass/fail licensure test, included a total of 2,278 items with 300 scored items per form and 5,832 examinees who tested in 2016 and 2017. In addition to their pass/fail classification and overall score, examinees on this test receive six subscores based on six subscales ranging from 17 to 72 items in length. Data from Test 2, a smaller scale single-form achievement test, included 193 items that were administered to 316 examinees in 2017. In addition to their overall score, examinees receive six subscores that, as with Test 1, were based on six subscales, which in this case ranged from 21 to 64 items in length. For both exams, CPI scores were calculated as either L (lower than expected), H (higher than expected), or NS (not significantly different) for each examinee ($ω$) using their estimated overall ability ${\hat{θ}}_{ω}$, their observed subscores ($S_{qω}$), and $α_{crit} = .05$. Using existing item difficulty estimates (${\hat{b}}_{i}$) from each test’s respective operational item bank, ${\hat{θ}}_{ω}$ was estimated through maximum likelihood with the R package ltm (Rizopoulos, 2006), accessed through irtoys (Partchev, 2014).

To facilitate comparisons with previous research, the value added ratio (VAR; Haberman, 2008) was calculated for each subscore

VAR = \frac{{PRMSE}_{s}}{{PRMSE}_{x}} = \frac{r_{1}}{r_{3}^{2} r_{4}},

where $r_{1}$, $r_{3}$, and $r_{4}$ represent subscore reliability, the disattenuated correlation between the subscore and total score, and total score reliability, respectively. VAR is the ratio of the proportional reduction in mean squared error (PRMSE) when the true subscore is predicted from the observed subscore (${PRMSE}_{s}$) to the PRMSE when it is predicted from the observed total score (${PRMSE}_{x}$). Using the VAR methodology, subscores have value if ${PRMSE}_{s} > {PRMSE}_{x}$ (i.e., $VAR > 1$) or, in other words, when the observed subscores explain more variance in the true subscores than do the observed total scores. Additionally, Feinberg and Jurich (2017) recommend a threshold of $VAR \geq 1.1$ to report subscores with a meaningful effect size.
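As a small worked illustration of the VAR formula, the sketch below evaluates it for two rows of Table 2. The function and argument names are hypothetical. Because the inputs are the two-decimal values printed in the table, the results can differ slightly from the tabled VAR values.

```python
def value_added_ratio(subscore_rel, disattenuated_corr, total_rel):
    """VAR = PRMSE_s / PRMSE_x = r1 / (r3**2 * r4)."""
    prmse_s = subscore_rel                       # r1
    prmse_x = disattenuated_corr**2 * total_rel  # r3^2 * r4
    return prmse_s / prmse_x

# Test 1, Subscale D (reliability .39, disattenuated correlation .89, total reliability .93)
var_test1_d = value_added_ratio(0.39, 0.89, 0.93)

# Test 2, Subscale A (reliability .85, disattenuated correlation .87, total reliability .94)
var_test2_a = value_added_ratio(0.85, 0.87, 0.94)

print(round(var_test1_d, 2), round(var_test2_a, 2))
```

The first value falls well below 1, so the total score is the better predictor of the true subscore; the second exceeds the 1.1 threshold.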

Resampled Data

Given that the true value of subscores was unknown for the operational data sets, resampled data reflecting the null condition (i.e., subscores have no value) under test structures similar to those of Test 1 and Test 2 were used as a comparison. If our null hypothesis is that a subscore has no value, then all items within the test will have the property of exchangeability, and subscore value can be examined using a permutation (randomization) test (e.g., Good, 2005) that ignores the operational subtest allocation.

For the purposes of this comparison, and using the available operational data, the null condition is defined as a random set of items within the same test. To this end, a null distribution of CPI scores was generated by replicating, 250 times, a process like the one just described, except that instead of using the items from the actual subtest, the same number of items was randomly chosen from the full test. For a given subscale, the resampling results indicate the expected frequency with which the categorical scores of L, H, or NS would be reported if there was no subscore value under a similar test structure. In this way, differences between the operational and resampled data would be an indicator of subscore value captured by this CPI method.
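The resampling idea can be sketched with simulated data, since the operational responses are not available here. Everything in this block is an assumption made for illustration: the test size, the subscale length, the use of generating abilities in place of estimated ones, the helper names, and the use of 25 rather than 250 replications to keep the run short.

```python
import numpy as np

rng = np.random.default_rng(2017)

def pmf_of_sum(p):
    """P(S = s) for a sum of independent Bernoulli item scores."""
    pmf = np.array([1.0])
    for pi in p:
        pmf = np.convolve(pmf, [1.0 - pi, pi])
    return pmf

def cpi(observed, p, alpha_crit=0.05):
    """Categorical performance indicator for one examinee on one item set."""
    pmf = pmf_of_sum(p)
    lower = pmf[: observed + 1].sum()  # P(S <= s)
    upper = pmf[observed:].sum()       # P(S >= s)
    if lower < alpha_crit / 2:
        return "L"
    if upper < alpha_crit / 2:
        return "H"
    return "NS"

# Hypothetical unidimensional Rasch test: 60 items, 200 examinees.
n_items, n_persons, subscale_len, n_reps = 60, 200, 15, 25
b = rng.normal(0.0, 1.0, n_items)
theta = rng.normal(0.0, 1.0, n_persons)
P = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
X = (rng.random(P.shape) < P).astype(int)

counts = {"L": 0, "NS": 0, "H": 0}
for _ in range(n_reps):
    # Null condition: a random item set of the same length as the subscale.
    items = rng.choice(n_items, size=subscale_len, replace=False)
    for w in range(n_persons):
        s = int(X[w, items].sum())
        counts[cpi(s, P[w, items])] += 1

total = sum(counts.values())
rates = {label: n / total for label, n in counts.items()}
print(rates)
```

The resulting rates play the role of the null columns in a table of CPI frequencies: they show how often L, NS, and H would be reported when item membership in a subscale carries no information.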

Results From Operational and Resampled Data

Summary results for Tests 1 and 2 are presented in Table 2. For both tests, the subscore reliabilities are typical for scores reported for secondary purposes, but note that the disattenuated correlations are extremely high, indicating that the subscores are not measuring anything distinct from the total score. This tendency has been observed with other operational programs as well (Sinharay, 2010). Thus, it is unsurprising that only Subscores A and F on Test 2, which have slightly lower disattenuated correlations, had $VAR \geq 1.1$. The two tests therefore present different challenges. Using the VAR methodology, none of the subscores on Test 1 were informative enough to report, and only two of the six subscores were worth reporting on Test 2. However, it may be impractical to not report any subscores on Test 1 or to only report some subscores on Test 2.

Table 2.

Value Added Ratio (VAR) and Categorical Performance Indicator (CPI) Scores for Real and Simulated Data

Subscale	Items	Subscore Reliability	Total Test Reliability	Disattenuated Subscore-Total Score Correlation	VAR	Real Data CPI %L	Real Data CPI %NS	Real Data CPI %H	Null Data CPI %L	Null Data CPI %NS	Null Data CPI %H
Test 1
A	70	.78	.93	.98	0.88	3	95	2	2	96	2
B	68	.81	.93	.98	0.91	2	95	3	2	96	2
C	45	.67	.93	.96	0.78	3	94	3	2	97	2
D	17	.39	.93	.89	0.53	1	97	2	1	97	1
E	47	.68	.93	.96	0.79	3	94	4	2	97	2
F	72	.75	.93	.98	0.84	3	93	4	2	96	2
Test 2
A	64	.85	.94	.87	1.21	10	79	11	2	96	2
B	21	.64	.94	.94	0.76	2	96	2	1	97	1
C	28	.70	.94	1.00	0.74	2	94	3	2	97	1
D	27	.75	.94	.98	0.83	4	93	3	2	97	1
E	27	.79	.94	.97	0.90	3	92	6	2	97	1
F	26	.79	.94	.88	1.10	5	89	6	2	97	1

In line with the VAR findings, the CPI results in Table 2 indicate that most subscores would be reported as nonsignificant (NS). Subscores A and F on Test 2 would be flagged as L or H more often, which is also consistent with the VAR findings. Note that in the null CPI results, scores were identified as relatively high or relatively low less often than the nominal α level would imply.

Figure 1 presents analogous results as a function of the number of subscale items (k) after undertaking a separate null resampling of Test 1. These results show that the nominal α level is recovered only for very long subtests ($k \geq$ 2,000). This relationship between the nominal threshold attenuation and subscale length is the expected result when looking at the probabilities of exceedance $P_{qω}(S \leq s)$ and $P_{qω}(S \geq s)$ above. These probabilities can often be larger than $α = .05$, or whatever α was chosen, so that even the most extreme subscores (0 or the maximum score) carry a conditional probability that is larger than the critical value (α). In such cases, due to the finite length of the subscale, it would be impossible to identify a high or low CPI score. Moreover, even when a given subscore does have probabilities below α, the de facto α will be the highest attainable probability less than the nominal α, which makes the method inherently conservative. Further, notwithstanding the issue of subscale length, the CPI method requires an estimate of ability, ${\hat{θ}}_{ω}$ (since true ability $θ_{ω}$ is unknown in practice), and thus, the proportion of flagged subscores should not be expected to exactly match the α level.
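The conservativeness that follows from discreteness can be illustrated numerically. The sketch below computes, for a hypothetical examinee whose success probability is .6 on every item, the probability of actually being flagged L or H when the model holds, for several subscale lengths. The function names, the common success probability, and the chosen lengths are assumptions made for illustration only.

```python
import numpy as np

def pmf_of_sum(p):
    """P(S = s) for a sum of independent Bernoulli item scores."""
    pmf = np.array([1.0])
    for pi in p:
        pmf = np.convolve(pmf, [1.0 - pi, pi])
    return pmf

def attained_tail_rates(pmf, alpha_crit=0.05):
    """Model-implied probabilities of receiving an L and an H indicator."""
    lower = np.cumsum(pmf)              # P(S <= s)
    upper = np.cumsum(pmf[::-1])[::-1]  # P(S >= s)
    flag_low = lower < alpha_crit / 2
    flag_high = (upper < alpha_crit / 2) & ~flag_low
    return float(pmf[flag_low].sum()), float(pmf[flag_high].sum())

results = {}
for k in (5, 20, 80, 320):
    pmf = pmf_of_sum(np.full(k, 0.6))  # hypothetical: k items, each with P = .6
    results[k] = attained_tail_rates(pmf)
    print(k, results[k])
```

With five items, the only flaggable outcome is a score of 0, which has probability $.4^{5} = .01024$, and no score can be flagged H because even a perfect score has probability $.6^{5} = .07776$. In every case the attained rates stay at or below the nominal per-tail level of .025.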

Figure 1.

Frequency of nonsignificant Test 1 categorical performance indicator results based on varying subscale length.

Figure 2 presents the percent of examinees receiving any L or H CPI scores (i.e., any indicators of relative strength or weakness) on each test for examinees in the bottom, middle, and top quintile of overall performance, for both the real and null (permutation-based) data. For instance, 52%, 31%, 16%, 2%, and <1% of average performers on Test 2 would receive 0, 1, 2, 3, and 4 indicators of deviance from their overall performance, respectively, compared to 81%, 17%, 2%, and <1% receiving 0, 1, 2, and 3 indicators as expected from the null.

Figure 2.

Percent of H and L categorical performance indicators (strength/weaknesses) reported by test and overall ability groups separated by bottom/middle/top quintile.

More CPI scores would be reported on Test 2, particularly for Subscores A and F, which, again, is consistent with the VAR results. Additionally, slightly more CPI scores are reported for average examinees than for high and low performers because of restriction of range in the observed scores. For instance, it would be difficult to demonstrate a relative strength on a subscale of finite length when an examinee is already a high performer overall.

Figure 3 illustrates how CPI reporting might compare to a traditional subscore profile by plotting sample results for two Test 2 examinees. Both versions provide similar information, yet the CPI profile is more explicit in directly communicating the appropriate inference. For an examinee with some meaningful variability, the CPI profile is less likely to be misinterpreted and more clearly indicates areas of relative strength/weaknesses. Similarly, for an examinee where differences between subscore profiles are likely due to random error, the CPI plot unambiguously presents consistent performance across subscores. The results from Table 2 and Figure 2 indicate that most examinees from both Tests 1 and 2 would receive a profile that shows subscores not deviant but consistent with overall performance. Thus, reporting the CPI profile should help promote the desired inferences by minimizing the burden on score users to correctly interpret the score information, particularly when the expected outcome from a unidimensional test will be that performance is similar across all subscales for most respondents.

Figure 3.

Example of categorical performance indicator reporting compared to a traditional subscore profile.

Supplementary Simulation Study

The evidence presented thus far illustrates the CPI methodology for a unidimensional test, where subscores are unlikely to have meaningful variability, as expected given the intended purpose. However, what would happen if the test has useful underlying subscore information? To investigate this question, and following simulation recommendations by Feinberg and Rubright (2016), data were generated as described in the steps below, based on the multidimensional structure of the English, Reading, Math, and Science components of the ACT, where the correlations among the components were reported as (ACT, 2019)

c = \begin{pmatrix} 1.00 & 0.76 & 0.81 & 0.77 \\ 0.76 & 1.00 & 0.69 & 0.78 \\ 0.87 & 0.69 & 1.00 & 0.75 \\ 0.77 & 0.78 & 0.75 & 1.00 \end{pmatrix}

1. True item discrimination and difficulty parameters for four subscales of similar length as the operational ACT were generated from $a \sim N(0, 1)$ and $b \sim N(0, 1)$, respectively. To investigate the effect of multidimensionality on CPI scores, true ability parameters (n = 1,000) were generated from $θ \sim N(0, 1)$ using correlation matrix c above, where the average correlation p was manipulated to range from 0.5 to 1.0. Using a method similar to that of Sinharay (2010), the average correlation was controlled by subtracting the mean of all off-diagonal elements of c, m, from each off-diagonal element e and then adding p. Thus, for each condition, each off-diagonal element e was replaced by e − m + p. This approach kept the average correlation fixed for a given p but also allowed realistic fluctuation in the correlations among the generated ability parameters.

2. Using the true parameters from Step 1 above, data were generated using a multidimensional item response theory (MIRT) model (Reckase, 2007):

P(U_{ij} = 1 | θ_{j}, a_{i}, b_{i}) = \frac{1}{1 + e^{-(a_{qi} θ_{q} - b_{i})}},

where the probability of a correct response to item i for examinee j is a function of item difficulty $b_{i}$, item discrimination vector $a_{i} = (a_{1i}, a_{2i}, \dots, a_{qi})$, and ability vector $θ = (θ_{1}, θ_{2}, \dots, θ_{q})$ corresponding to subscale q. Response vectors were created by dichotomizing the probabilities against a random draw from the uniform distribution. The sum of each section of this response vector associated with subscale q equals an examinee’s simulated subscore ($S_{qω}$).

3. Using the response vectors based on the true parameters from Step 2 above, a unidimensional model was fit to the data, and item difficulty (${\hat{b}}_{i}$) and examinee ability (${\hat{θ}}_{ω}$) estimates were obtained through joint maximum likelihood with the R package ltm (Rizopoulos, 2006), accessed through irtoys (Partchev, 2014).

4. Using Rasch item probabilities corresponding with ${\hat{θ}}_{ω}$ and ${\hat{b}}_{i}$ from Step 3 above, CPI scores were calculated for each examinee and subscale using $α = .05$ as the criterion.

5. Lastly, the frequency of each possible CPI score for each subscale was calculated across examinees to reflect how often those particular categorical subscores would be reported.
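The first two steps of the procedure above can be sketched as follows. Several choices here are assumptions rather than details taken from the study: the subscale lengths are hypothetical, and a lognormal distribution is swapped in for the stated $N(0, 1)$ generating distribution of the discriminations only to keep the illustrative values positive. The correlation matrix as printed is not symmetric in its (1, 3) and (3, 1) entries, so this sketch mirrors the upper triangle. Note also that for target averages near 1.0 the shifted entries can exceed 1, and the text does not say how such cases were handled.

```python
import numpy as np

rng = np.random.default_rng(42)

def shift_mean_correlation(corr, target_mean):
    """Replace each off-diagonal element e by e - m + p (m = off-diagonal mean)."""
    corr = np.asarray(corr, dtype=float)
    off = ~np.eye(corr.shape[0], dtype=bool)
    m = corr[off].mean()
    out = corr.copy()
    out[off] = corr[off] - m + target_mean
    return out

# Upper triangle of the printed matrix, mirrored (assumption).
base = np.array([
    [1.00, 0.76, 0.81, 0.77],
    [0.76, 1.00, 0.69, 0.78],
    [0.81, 0.69, 1.00, 0.75],
    [0.77, 0.78, 0.75, 1.00],
])
corr = shift_mean_correlation(base, target_mean=0.70)

# Step 1: correlated true abilities for 1,000 simulated examinees.
n_persons = 1000
items_per_subscale = [10, 10, 10, 10]  # hypothetical lengths
theta = rng.multivariate_normal(np.zeros(4), corr, size=n_persons)

# Step 2: item responses and subscores, one subscale (dimension) at a time.
subscores = np.empty((n_persons, 4), dtype=int)
for q, k_q in enumerate(items_per_subscale):
    a = rng.lognormal(0.0, 0.3, size=k_q)  # assumption: positive discriminations
    b = rng.normal(0.0, 1.0, size=k_q)
    logits = a[None, :] * theta[:, [q]] - b[None, :]
    prob = 1.0 / (1.0 + np.exp(-logits))
    # Dichotomize each probability against a uniform random draw.
    responses = (rng.random(prob.shape) < prob).astype(int)
    subscores[:, q] = responses.sum(axis=1)

print(np.round(corr, 2))
print(subscores[:5])
```

The remaining steps would fit a unidimensional model to the stacked responses and apply the CPI classification to each simulated subscore.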

Results from the simulation are illustrated in Figure 4. As the correlation between the subscales increased, a linear increase can be observed in the frequency of CPI scores reported as nonsignificant. Beyond correlations of .95, the CPI results begin to converge on the null findings reported in Table 2. Thus, when there is meaningful variability, as with multidimensional data, the CPI method does flag a higher proportion of scores.

Figure 4.

Percent of nonsignificant (NS) CPI results based on varying subscale intercorrelations.

Discussion

Previous research has demonstrated that seemingly distinct subscores on unidimensional tests are more likely to represent noise than unique information; yet, eliminating subscores may be an unacceptable option for stakeholders. Consequently, practitioners must identify ways to provide diagnostic feedback that is defensible.

A limitation of the existing methods of determining the value of subscores, including VAR, is that they are based on reliability and correlation coefficients that are not sensitive to individual differences. As a result, it is possible that some subscores may be useful for some examinees—even for subscales that fail to meet the criteria for adding value overall. In contrast, the CPI approach is distinct from these other methods in that subscores are evaluated individually. Using CPI can potentially capture meaningful variability, if it exists, but also presents an option for providing a consistent subscore profile when there is no signal (or when there is insufficient power to detect one), which is the expected outcome for most examinees when an exam is unidimensional. Thus, reporting CPIs may be a compromise that satisfies score users by reporting subscore information as well as psychometricians by limiting misinterpretation, at most, to the rates of Type I and Type II error.

Nevertheless, the CPI approach is not a silver bullet regarding the limitations of short subscales that restrict the range of observed scores and create situations in which it is less likely, and sometimes impossible, to achieve certain CPI scores. For short subscales, a test taker with a high overall performance is less likely to receive a CPI indicating a relative strength just as a test taker with a low overall performance is less likely to receive a CPI indicating a relative weakness. On the other hand, subscales with very few items do not provide exceedingly reliable information, so that flagging someone as a higher or lower performer based on a very short subscale also bears some risk of providing meaningless information.

Yet guiding the remediation efforts of low-ability examinees is one of the principal (if not the principal) goals of reporting subscores. As a result, policy guidelines may need to be implemented to qualify when a CPI score should be presented as feedback, to limit the Type II errors due to subscale length. Arguably, these policies should be conditional on proficiency, although in practice differential reporting might be politically infeasible.

Additionally, and similar to score reporting conventions, some interpretive language will be needed to communicate with score users about the criterion used for the CPI calculation. The illustrative example presented in this article provides performance feedback relative to oneself, though the method could easily be adapted to instead use estimated ability from a normed comparison group or another absolute criterion (e.g., passing score). Thus, a high overall performer who receives a low CPI or, more importantly, a low overall performer who receives a high CPI on some subscale may need some additional information to know how to proceed.

In assessing the generalizability of this CPI methodology, future research should investigate other model frameworks beyond Rasch. For example, the rate of flagging higher-than-expected subscores for low overall performers might be different when using more complex IRT models that factor in guessing. Also, future research may want to compare different variations of the recursive computation for the subscore probability distribution. As described in this article, an individual’s point estimate of θ associated with their total score was used to calculate conditional probabilities. An alternative, yet more computationally intensive, approach could be to instead compute the subscore probability distribution by integrating across the posterior distribution of θ for their total score, thus treating θ as a random variable. Differences between these approaches may also arise depending on the number of test items and corresponding confidence interval around the θ point estimates.
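One way the posterior-based alternative might look is sketched below. This is an assumption-laden illustration, not a procedure evaluated in this article: it uses a standard normal prior for θ, a fixed grid in place of formal quadrature, the Rasch likelihood of the observed total score, and it treats the subscore as a predictive draw given θ. All function names and item difficulties are hypothetical.

```python
import numpy as np

def pmf_of_sum(p):
    """P(S = s) for a sum of independent Bernoulli item scores."""
    pmf = np.array([1.0])
    for pi in p:
        pmf = np.convolve(pmf, [1.0 - pi, pi])
    return pmf

def rasch(theta, b):
    """Rasch success probabilities for one ability and a vector of difficulties."""
    return 1.0 / (1.0 + np.exp(-(theta - np.asarray(b))))

def posterior_subscore_pmf(total_score, b_all, b_sub, grid=None):
    """Subscore distribution averaged over a grid posterior for theta."""
    if grid is None:
        grid = np.linspace(-4.0, 4.0, 81)
    prior = np.exp(-0.5 * grid**2)  # N(0, 1) kernel (assumption)
    # Likelihood of the observed total score at each grid point.
    like = np.array([pmf_of_sum(rasch(t, b_all))[total_score] for t in grid])
    weights = prior * like
    weights /= weights.sum()
    # Mixture of conditional subscore distributions, weighted by the posterior.
    return sum(w * pmf_of_sum(rasch(t, b_sub)) for w, t in zip(weights, grid))

# Hypothetical 20-item test with a 5-item subscale and a total score of 12.
b_all = np.linspace(-2.0, 2.0, 20)
b_sub = b_all[:5]
mix = posterior_subscore_pmf(12, b_all, b_sub)
print(np.round(mix, 4))
```

The resulting distribution could then be passed to the same tail-probability classification in place of the distribution obtained from the point estimate.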

Additionally, particularly for exam programs reporting a large number of subscores, there is a greater risk of Type I error. Future research on this methodology might consider how this risk can be best managed either through α-level adjustments or setting criteria for null comparisons. Further, the utility of the CPI approach may extend beyond identifying useful subscore information. For instance, a disproportionately higher flagging rate on a particular subscale (or specified item subset) might serve as an indicator of item preknowledge. Similarly, research in the person-fit literature has investigated whether a pattern of responses to a subset of items is inconsistent with the responses on the rest of the test (Belov, 2013; Sinharay, 2017). Thus, different facets of this proposed CPI approach should be explored.

When subscores are not worth reporting to the vast majority of test takers, the inference that an examinee performed similarly across all subscales is still important to communicate. Otherwise, in the absence of any subscore feedback, score users could mistakenly infer relative strengths and weaknesses based on their own recollections and beliefs about their performance. Providing a clear, consistent profile, as opposed to no information at all, could help reduce misinterpretation as well as support the best guidance possible for remediation on a unidimensional test: focus on the content domain weighting in the test blueprint and study everything.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

ACT. (2019). ACT technical manual. https://www.act.org/content/dam/act/unsecured/documents/ACT_Technical_Manual.pdf

Albanese, M. A. (2014). The testing column: Differences in subject area subscores on the MBE and other illusions. The Bar Examiner, 83(2), 26–31.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.

Andersen, E. B. (1972). The numerical solution of a set of conditional estimation equations. The Journal of the Royal Statistical Society, 34(Series B), 42–54.

Belov, D. I. (2013). Detection of test collusion via Kullback–Leibler divergence. Journal of Educational Measurement, 50, 141–163.

Biscarri, Zhao, S. D., & Brunner, R. J. (2018). A simple and fast method for computing the Poisson binomial distribution function. http://doi.org/10.1016/j.csda.2018.01.007

Brennan, R. L. (2012). Utility indexes for decisions about subscores (CASMA Research Report 33). Center for Advanced Studies in Measurement and Assessment.

Clauser, A. L., & Rick (2016). Evaluating score report prototypes for a licensure examination [Paper presentation]. Washington, DC: American Educational Research Association Annual Meeting.

Evans, D. L., & Leemis, L. M. (2004). Algorithms for computing the distributions of sums of discrete random variables. Mathematical and Computer Modelling, 40(13), 1429–1452.

Feinberg, R. A., & Jurich, D. P. (2017). Guidelines for interpreting and reporting subscores. Educational Measurement: Issues and Practice, 36(1), 5–13.

Feinberg, R. A., & Rubright, J. D. (2016). Conducting simulations in psychometrics. Educational Measurement: Issues and Practice, 35(2), 36–49.

Feinberg, R. A., & Wainer (2014). When can we improve subscores by making them shorter? The case against subscores with overlapping items. Educational Measurement: Issues and Practice, 33(3), 47–54.

Gonzalez, Wiberg, & von Davier, A. A. (2016). A note on the Poisson binomial distribution in item response theory. Applied Psychological Measurement, 40(4), 302–310.

Good (2005). Permutation, parametric and bootstrap tests of hypotheses (3rd ed.). Springer.

Grinstead, C. M., & Snell, J. L. (1997). Introduction to probability (2nd rev. ed.). AMS Publications. https://math.dartmouth.edu//~prob/prob/prob.pdf

Gustafsson, J.-E. (1980). A solution of the conditional estimation problem for long tests in the Rasch model for dichotomous items. Educational and Psychological Measurement, 40, 327–385.

Haberman, S. J. (2008). When can subscores have value? Journal of Educational and Behavioral Statistics, 33(2), 204–229.

Henson, B. A. (1994). Note: Extension of Lord–Wingersky algorithm to computing test score distributions for polytomous items. Retrieved February 1, 1994, from http://www.b-a-h.com/papers/note9401.html

Jiang, & Raymond (2018). The use of multivariate generalizability theory to evaluate the quality of subscores. Applied Psychological Measurement, 42(8), 595–612.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Lord, F. M., & Wingersky (1984). Comparison of IRT true-score and equipercentile observed-score equatings. Applied Psychological Measurement, 8, 453–461.

Marsh, H. W. (1994). Sport motivation orientations: Beware of jingle-jangle fallacies. Journal of Sport & Exercise Psychology, 16(4), 365–380.

Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1(1), 3–67.

Orlando, & Thissen (2003). Further investigation of the performance of S-X²: An item fit index for use with dichotomous item response theory models. Applied Psychological Measurement, 27(4), 289–298.

Partchev (2014). irtoys: Simple interface to the estimation and plotting of IRT models (R package Version 0.1.7). http://CRAN.R-project.org/package=irtoys

Pieper Bar Review. (2017). Bar examiners to provide (slightly) more information to candidates who fail the bar exam. http://news.pieperbar.com/bar-examiners-to-provide-slightly-more-information-to-candidates-who-fail-the-bar-exam

Puhan, Sinharay, Haberman, S. J., & Larkin (2010). The utility of augmented subscores in a licensure exam: An evaluation of methods using empirical data. Applied Measurement in Education, 23, 266–285.

Rasch (1966). An individualistic approach to item analysis. In P. F. Lazarsfeld & N. W. Henry (Eds.), Readings in mathematical social science (pp. 89–107). The MIT Press.

Reckase, M. D. (2007). Multidimensional item response theory. In C. R. Rao & Sinharay (Eds.), Handbook of statistics (Vol. 26, pp. 607–642). Amsterdam, The Netherlands: North-Holland.

Rick, & Clauser, A. L. (2016). What score report features promote accurate remediation? Insights from cognitive interviews [Paper presentation]. Annual Meeting of the National Council on Measurement in Education (NCME), Washington, DC, United States.

Rijmen, Jeon, von Davier, & Rabe-Hesketh (2014). A third-order item response theory model for modeling the effects of domains and subdomains in large-scale educational assessment surveys. Journal of Educational and Behavioral Statistics, 38, 32–60.

Rizopoulos (2006). ltm: An R package for latent variable modelling and item response theory analyses. Journal of Statistical Software, 17(5), 1–25.

Sinharay (2010). How often do subscores have added value? Results from operational and simulated data. Journal of Educational Measurement, 47(2), 150–174.

Sinharay (2017). Detection of item preknowledge using likelihood ratio test and score test. Journal of Educational and Behavioral Statistics, 42, 46–68.

Sinharay, Haberman, S. J., & Wainer (2011). Do adjusted subscores lack validity? Don’t blame the messenger. Educational and Psychological Measurement, 71(5), 789–797.

Thissen, Pommerich, Billeaud, & Williams, V. S. L. (1995). Item response theory for scores on tests including polytomous items with ordered responses. Applied Psychological Measurement, 19(1), 39–49.

Thissen, & Wainer (2001). An overview of test scoring. In Thissen & Wainer (Eds.), Test scoring (pp. 1–19). Lawrence Erlbaum.

Verhelst, N. D. (2012). Profile analysis: A closer look at the PISA 2000 reading data. Scandinavian Journal of Educational Research, 56(3), 315–332.

von Davier (2010). Why sum scores may not tell us all about test takers. In Wang, Leigh (Ed.), Special issue on Quantitative research methodology. Newborn and Infant Nursing Reviews, 10(1), 27–36.

von Davier (2016). The Rasch model. In van der Linden (Ed.), Handbook of item response theory (Chapter 3, Vol. 1, 2nd ed., pp. 31–48). CRC Press.

von Davier, & Rost (1995). Polytomous mixed Rasch models. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models—Foundations, recent developments and applications (pp. 371–379). Springer.