Abstract
The literature showing that subscores fail to add value is vast; yet despite their typical redundancy and the frequent presence of substantial statistical errors, many stakeholders remain convinced of their necessity. This article describes a method for identifying and reporting unexpectedly high or low subscores by comparing each examinee’s observed subscore with a discrete probability distribution of subscores conditional on the examinee’s overall ability. The proposed approach turns out to be somewhat conservative due to the nature of subscores as finite sums of item scores associated with a subdomain. Thus, the method may be a compromise that satisfies score users by reporting subscore information as well as psychometricians by limiting misinterpretation, at most, to the rates of Type I and Type II error.
Encouraged by at least 25 years of federal government mandates, beginning with the Improving America's Schools Act of 1994 and continuing with the No Child Left Behind Act of 2001 and, more recently, the Every Student Succeeds Act of 2015, subscores have become an expected component of feedback for testing programs. Subscore information can be presented at the individual level with the goal of, for example, helping examinees focus their remedial activities by identifying their strengths and weaknesses, or at the institutional level to, for example, facilitate more nuanced comparisons of cohort performance. Some stakeholders may believe this additional information is needed to justify the high direct and indirect costs associated with an examination.
It is fundamentally important to distinguish between subscores from multidimensional and unidimensional tests. For tests constructed to be multidimensional, the overall score is essentially a composite of several distinct constructs, and the subscores can often be easily identified and supported through dimensionality analysis (e.g., factor analysis) and divergent validity. For instance, the SAT Math and Verbal section scores can be considered subscores. More commonly, subscores are computed from tests that were constructed to be unidimensional, where the underlying subscales are often designed post hoc by content expert judgments or sometimes defined by item types/features. The reading component score within the SAT Verbal section can be considered a subscore of this type. In this article, we focus on subscores from essentially unidimensional assessments, where the primary goal is to produce a single overall score, the dominant paradigm in large-scale testing.
Literature on the utility of subscores derived from unidimensional tests has found them to be notoriously difficult to justify from both item response theory (IRT) and classical test theory perspectives (Feinberg & Jurich, 2017; Haberman, 2008; Jiang & Raymond, 2018; Puhan et al., 2010; Rijmen et al., 2014; Sinharay, 2010; Thissen & Wainer, 2001; Verhelst, 2012). As previous research has demonstrated, subscores are typically based on a relatively small number of subdomain-specific items, and in many cases, these subdomain scores are better predicted by an individual’s overall score. Hence, subscores are often not sufficiently reliable or distinct enough from the rest of the test to justify separate reporting. This is unsurprising as Brennan (2012) observed, “If test scores fit a unidimensional model, a psychometrically compelling argument cannot be mounted for reporting any subscores since, by definition, there is only one proficiency or latent trait” (p. 14). Thus, subscores often reflect the jangle fallacy (Marsh, 1994) where two or more subscores have different names but represent the same construct.
With a growing body of literature suggesting that reporting subscores is seldom defensible, testing programs are beginning to change their score reporting policies. These changes have not always been warmly received by stakeholders. In 2014, the National Conference of Bar Examiners eliminated subscore reporting on the Multistate Bar Exam (Albanese, 2014); however, stakeholder response was so negative that some subscore information was eventually restored for failing candidates (Pieper Bar Review, 2017). Thus, simply not reporting subscores, though perhaps psychometrically ideal, may be politically impractical. Instead, a compromise is needed between whatever psychometric limitations a given exam’s subscores impose and the expectations and desired inferences of score users.
Perhaps the best way to create subscores that add value would be to use an evidence-centered design approach wherein the intended subscore inferences can be factored into the test design and blueprinting process (Mislevy et al., 2003), but this is likely to change the nature of the test and to produce a multidimensional instrument that may be less aligned with the intended goal of the assessment. Post hoc options include lengthening the test, combining similar content areas after the testing event, or calculating augmented subscores to boost subscore reliability. Yet, most often these strategies have been ineffective because they are applied to preexisting exams that were built to be unidimensional (Feinberg & Wainer, 2014; Sinharay et al., 2011). Moreover, the costs of these strategies may be difficult to justify given that the common uses of subscores (e.g., remedial feedback) are typically subordinate to the primary purpose of the exam (e.g., overall score, pass/fail classification).
Reporting subscores in alternative formats does not alleviate the problem. For instance, some testing programs report subscore information as profile bands where the width is a function of the subscore’s standard error. This common practice has been a way for psychometricians to appease stakeholders by reporting subscores while also aligning their practice with the testing standards by incorporating a measure of imprecision, as highlighted by Standard 2.3 of the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014). However, research has suggested that profile bands can be confusing and difficult to interpret accurately, even when detailed explanatory text is provided (Clauser & Rick, 2016; Rick & Clauser, 2016).
Yet, despite their typical redundancy and the frequent presence of substantial statistical errors, many stakeholders remain convinced that subscores are needed. The purpose of this study is to investigate an approach to subscore reporting intended to reduce many of the most common mistakes score users make when interpreting subscores on a sufficiently unidimensional test. We assume that test data can be fitted by a unidimensional model or, more specifically, that a multidimensional model does not provide better model data fit than, for example, a one-factor model or, equivalently, a unidimensional IRT model (Lord & Novick, 1968). Moreover, our discussion here is limited to the Rasch model (Rasch, 1966; von Davier, 2016); however, our proposed method applies to more general IRT models as well (Lord & Novick, 1968).
Conditional Subscore Distributions Are Convolutions
Mathematically, the distributions of sums of random variables, in this case subscores, are called convolutions. Subscores are composed of identifiable units (items), each contributing a summand to the subscore, with probabilities for each summand that depend on respondent ability and item difficulty. This is true whether the subscores are based on dichotomous 0/1 item scores or a polytomous Likert-type rating scale ranging from 1 to 5. If all variables in the sum are binary and identically distributed, the well-known binomial distribution for 0 < p < 1 and K trials provides the probability of the sum score s as
$$P(S = s) = \binom{K}{s} p^{s} (1 - p)^{K - s}.$$
This expression describes the distribution of sums of independent Bernoulli trials with an identical success probability for all trials. The expected value of the sum score is well known and equals
$$E(S) = Kp,$$
and the variance equals
$$\operatorname{Var}(S) = Kp(1 - p).$$
In testing, sum scores consist of (locally) independent trials (item scores) with different probabilities, as each item i is different and likely carries a different probability of success $p_{i\omega}$ for each respondent ω. Assuming local independence, the (binary) item responses are conditionally independent Bernoulli trials where the expected value for respondent ω equals
$$E(S_\omega) = \sum_{i=1}^{K} p_{i\omega},$$
and the variance is
$$\operatorname{Var}(S_\omega) = \sum_{i=1}^{K} p_{i\omega}(1 - p_{i\omega}),$$
which reduces to the binomial case if $p_{i\omega} = p$ for all items $i$.
The probability of each sum of binary item scores for any respondent
Predicting Subscore Distributions From Overall Ability Estimates
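In the case of the Rasch model, for person ω with estimated ability $\hat{\theta}_\omega$ and item $i$ with difficulty $b_i$, the probability of a correct response (score = 1) is
$$p_{i\omega} = \frac{\exp(\hat{\theta}_\omega - b_i)}{1 + \exp(\hat{\theta}_\omega - b_i)}.$$
The probability of an incorrect response (score = 0) is then given by
$$1 - p_{i\omega} = \frac{1}{1 + \exp(\hat{\theta}_\omega - b_i)}.$$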
The nested structure of this relationship allows
for the first item and the (trivial) partial sum that contains only this item (see von Davier & Rost, 1995, as well as Grinstead & Snell, 1997, for an extension of this recursion to polytomous response data).
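As an illustration of the kind of recursion described here, the following minimal sketch builds the sum score distribution one item at a time: the partial sum over the first j items scores s either because the first j − 1 items scored s and item j was incorrect, or because they scored s − 1 and item j was correct. The function names, the difficulty values, and the ability estimate are hypothetical and chosen only for the example; this is a sketch of the standard convolution of independent Bernoulli trials under the Rasch model, not necessarily the exact formulation used by the authors.

```python
import math


def rasch_prob(theta, b):
    """Rasch probability of a correct response for ability theta and difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def sum_score_distribution(probs):
    """Distribution of a sum of independent Bernoulli trials.

    probs holds one success probability per item. The returned list has
    len(probs) + 1 entries; entry s is the probability of sum score s.
    """
    dist = [1.0]  # with no items, the score 0 has probability 1
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for s, mass in enumerate(dist):
            new[s] += mass * (1.0 - p)  # current item answered incorrectly
            new[s + 1] += mass * p      # current item answered correctly
        dist = new
    return dist


# Hypothetical five-item subscale and ability estimate (assumptions).
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]
theta_hat = 0.3
probs = [rasch_prob(theta_hat, b) for b in difficulties]
dist = sum_score_distribution(probs)

for score, mass in enumerate(dist):
    print(score, round(mass, 4))
```

After the first pass through the loop the list holds only the probabilities of scoring 0 and 1 on the first item, which corresponds to the trivial partial sum mentioned in the text. When all item probabilities are equal, the result coincides with the binomial distribution.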
The recursion can be used for calculating the raw score probabilities for any subscale q consisting of a subset of kq ≤ k items
with lower thresholds
and upper thresholds for subscore probabilities
These thresholds can lead to sets of categorical subscores by defining
A categorical subscore of
Table 1. Recursion Iteration Procedure for Illustrative Example
Note. k = number of items,
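To make the idea of categorical subscores concrete, the sketch below classifies an observed subscore as L, H, or NS from its conditional distribution. The decision rule is an assumption made for illustration: a two-sided rule that flags a score when its lower or upper tail probability is at most α/2. The function `classify_cpi`, the 20-item subscale, and the ability value are hypothetical, and the authors' exact threshold definitions may differ.

```python
import math


def rasch_prob(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def sum_score_distribution(probs):
    """Distribution of a sum of independent Bernoulli trials (item-by-item recursion)."""
    dist = [1.0]
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for s, mass in enumerate(dist):
            new[s] += mass * (1.0 - p)
            new[s + 1] += mass * p
        dist = new
    return dist


def classify_cpi(observed, dist, alpha=0.05):
    """Assumed two-sided rule: flag a subscore whose tail probability is <= alpha / 2."""
    lower_tail = sum(dist[: observed + 1])  # P(S <= observed)
    upper_tail = sum(dist[observed:])       # P(S >= observed)
    if lower_tail <= alpha / 2:
        return "L"
    if upper_tail <= alpha / 2:
        return "H"
    return "NS"


# Hypothetical subscale: 20 items of difficulty 0, examinee ability 0.
dist = sum_score_distribution([rasch_prob(0.0, 0.0)] * 20)
labels = [classify_cpi(s, dist) for s in range(len(dist))]

# Probability that this examinee is flagged at all under the conditional model.
flag_rate = sum(m for s, m in enumerate(dist) if labels[s] != "NS")
print(labels)
print(round(flag_rate, 4))
```

In this example, subscores of 5 or fewer are labeled L and subscores of 15 or more are labeled H. The resulting flag rate is about .041, below the nominal α of .05. Because the subscore can only take integer values, the tail probabilities jump in discrete steps, which is one way to see why a rule of this kind tends to be conservative.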
Example With Operational and Resampled Data
Operational Data
Data sets from two operational exam programs were obtained. Data from Test 1, a multiple-form, high-stakes pass/fail licensure test, included a total of 2,278 items, with 300 scored items per form, and 5,832 examinees who tested in 2016 and 2017. In addition to their pass/fail classification and overall score, examinees on this test receive six subscores based on six subscales ranging from 17 to 72 items in length. Data from Test 2, a smaller scale single-form achievement test, included 193 items that were administered to 316 examinees in 2017. In addition to their overall score, examinees receive six subscores that, as with Test 1, were based on six subscales, which in this case ranged from 21 to 64 items in length. For both exams, categorical performance indicator (CPI) scores were calculated as either L (lower than expected), H (higher than expected), or NS (nonsignificant) for each examinee (
To facilitate comparisons with previous research, the value added ratio (VAR; Haberman, 2008) was calculated for each subscore
where $r_1$, $r_3$, and $r_4$ represent subscore reliability, the disattenuated correlation between the subscore and total score, and total score reliability, respectively. VAR is a ratio of the proportional reduction in mean squared error between the observed subscore (
Resampled Data
Given that the true value of subscores was unknown for the operational data sets, resampled data reflecting the null condition (i.e., subscores have no value) under similar test structures as Test 1 and Test 2 were used as a comparison. If our null hypothesis is that a subscore has no value, then all items within the test have the property of exchangeability, and subscore value can be examined using a permutation (randomization) test (e.g., Good, 2005) that ignores the operational subtest allocation.
For the purposes of this comparison, and using the available operational data, the null condition is defined as a random set of items within the same test. To this end, a null distribution of CPI scores was generated by replicating, 250 times, a process like the one just described, except that instead of using the items from the actual subtest, the same number of items was chosen at random. For a given subscale, the resampling results indicate the expected frequency with which the categorical scores of L, H, or NS would be reported if there were no subscore value under a similar test structure. In this way, differences between the operational and resampled data would be an indicator of subscore value captured by this CPI method.
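The resampling logic can be sketched as follows. Everything specific in this example is an assumption: the data are simulated from a Rasch model rather than taken from an operational test, the generating abilities stand in for the ability estimates that would be obtained from total scores in practice, the CPI rule is the two-sided tail rule with α = .05, and the function `null_cpi_frequencies` is a hypothetical name. Only the overall design (250 replications, each drawing a random item set of the same size as the subscale) follows the description above.

```python
import math
import random
from collections import Counter


def rasch_prob(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def sum_score_distribution(probs):
    dist = [1.0]
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for s, mass in enumerate(dist):
            new[s] += mass * (1.0 - p)
            new[s + 1] += mass * p
        dist = new
    return dist


def classify_cpi(observed, dist, alpha=0.05):
    # Assumed two-sided tail rule; the original thresholds may differ.
    if sum(dist[: observed + 1]) <= alpha / 2:
        return "L"
    if sum(dist[observed:]) <= alpha / 2:
        return "H"
    return "NS"


def null_cpi_frequencies(responses, thetas, difficulties, k, n_reps=250, seed=1):
    """Relative frequency of L, NS, and H when subscales are random sets of k items."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_reps):
        subset = rng.sample(range(len(difficulties)), k)  # random "subscale"
        for resp, theta in zip(responses, thetas):
            probs = [rasch_prob(theta, difficulties[i]) for i in subset]
            observed = sum(resp[i] for i in subset)
            counts[classify_cpi(observed, sum_score_distribution(probs))] += 1
    total = sum(counts.values())
    return {label: counts[label] / total for label in ("L", "NS", "H")}


# Hypothetical unidimensional data: 100 examinees, 40 items.
data_rng = random.Random(0)
difficulties = [data_rng.gauss(0, 1) for _ in range(40)]
thetas = [data_rng.gauss(0, 1) for _ in range(100)]
responses = [
    [int(data_rng.random() < rasch_prob(t, b)) for b in difficulties]
    for t in thetas
]

freqs = null_cpi_frequencies(responses, thetas, difficulties, k=10)
print(freqs)
```

Comparing these null frequencies with the frequencies obtained from the operational item allocation is what indicates whether a subscale carries information beyond the overall score.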
Results From Operational and Resampled Data
Summary results for Tests 1 and 2 are presented in Table 2. For both tests, the subscore reliabilities are typical for scores reported for secondary purposes, but note that the disattenuated correlations are extremely high indicating that the subscores are not measuring anything distinct from the total score. This tendency has been observed with other operational programs as well (Sinharay, 2010). Thus, it is unsurprising that only Subscores A and F on Test 2, which have slightly lower disattenuated correlations, had
Table 2. Value Added Ratio (VAR) and Categorical Performance Indicator (CPI) Scores for Real and Simulated Data
In line with the VAR findings, the CPI results in Table 2 indicate that most subscores would be reported as nonsignificant (NS). Subscores A and F on Test 2 would be reported more often, which is also consistent with the VAR findings. Note that the null CPI results found that scores were identified as relatively high and relatively low less often than expected.
Figure 1 presents analogous results as a function of the number of subscale items (k) after undertaking a separate null resampling of Test 1. These results show that the nominal α level is recovered only for very long subtests (k

Figure 1. Frequency of nonsignificant Test 1 categorical performance indicator results based on varying subscale length.
Figure 2 presents the percent of examinees receiving any L or H CPI scores (i.e., any indicators of relative strength or weakness) for each test, with examinees in the bottom, middle, and top quintiles of overall performance, for both the real and null (permutation-based) data. For instance, 52%, 31%, 16%, 2%, and <1% of average performers on Test 2 would receive 0, 1, 2, 3, and 4 indicators of deviance from their overall performance, respectively, compared to 81%, 17%, 2%, and <1% as expected from the null.

Figure 2. Percent of H and L categorical performance indicators (strength/weaknesses) reported by test and overall ability groups separated by bottom/middle/top quintile.
More CPI scores would be reported on Test 2, particularly for Subscores A and F, which, again, is consistent with the VAR results. Additionally, slightly more CPI scores are reported for average examinees than for high and low performers because of restriction of range in the observed scores. For instance, it would be difficult to demonstrate a relative strength based on a subscale of finite length when an examinee is already a high performer overall.
Figure 3 illustrates how CPI reporting might compare to a traditional subscore profile by plotting sample results for two Test 2 examinees. Both versions provide similar information, yet the CPI profile is more explicit in directly communicating the appropriate inference. For an examinee with some meaningful variability, the CPI profile is less likely to be misinterpreted and more clearly indicates areas of relative strength/weaknesses. Similarly, for an examinee where differences between subscore profiles are likely due to random error, the CPI plot unambiguously presents consistent performance across subscores. The results from Table 2 and Figure 2 indicate that most examinees from both Tests 1 and 2 would receive a profile that shows subscores not deviant but consistent with overall performance. Thus, reporting the CPI profile should help promote the desired inferences by minimizing the burden on score users to correctly interpret the score information, particularly when the expected outcome from a unidimensional test will be that performance is similar across all subscales for most respondents.

Figure 3. Example of categorical performance indicator reporting compared to a traditional subscore profile.
Supplementary Simulation Study
The evidence presented thus far illustrates the CPI methodology for a unidimensional test, where subscores are unlikely to have meaningful variability—as expected given the intended purpose. However, what would happen if the test has useful underlying subscore information? To investigate this question, using simulation recommendations by Feinberg and Rubright (2016), data were generated as follows based on the multidimensional structure of the English, Reading, Math, Science components of the ACT where the correlations among the components were reported as (ACT, 2019)
1. True item discrimination and difficulty parameters for four subscales of similar length to the operational ACT were generated from
2. Using the true parameters in Step 1 above, data were generated using a multidimensional item response theory (MIRT) model (Reckase, 2007):
3. Using the response vectors based on the true parameters from Step 2 above, a unidimensional model was fit to the data and item difficulty (
4. Using Rasch item probabilities corresponding with
5. Lastly, the frequency of each possible CPI score for each subscale was calculated across examinees to reflect how often those particular categorical subscores would be reported.
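The data generation part of these steps can be sketched roughly as follows. Nothing in this example reproduces the authors' actual settings: the parameter distributions (lognormal discriminations, standard normal intercepts), the subscale lengths, the single common intercorrelation `rho`, and the simple-structure logit `a * theta + d` are all assumptions made for illustration, and `simulate_mirt` is a hypothetical helper. Correlated abilities are produced with a shared factor, so every pair of dimensions has correlation `rho`.

```python
import math
import random


def simulate_mirt(n_persons, items_per_dim, rho, seed=0):
    """Simulate 0/1 responses where each item measures exactly one dimension."""
    rng = random.Random(seed)
    n_dims = len(items_per_dim)

    # Step 1 (assumed distributions): discrimination a and intercept d per item.
    items = []
    for dim, n_items in enumerate(items_per_dim):
        for _ in range(n_items):
            a = rng.lognormvariate(0.0, 0.25)
            d = rng.gauss(0.0, 1.0)
            items.append((dim, a, d))

    # Step 2: abilities with common correlation rho, then item responses.
    thetas, responses = [], []
    for _ in range(n_persons):
        shared = rng.gauss(0.0, 1.0)
        theta = [
            math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for _ in range(n_dims)
        ]
        resp = [
            int(rng.random() < 1.0 / (1.0 + math.exp(-(a * theta[dim] + d))))
            for dim, a, d in items
        ]
        thetas.append(theta)
        responses.append(resp)
    return items, thetas, responses


def correlation(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical design: four subscales, common intercorrelation of .80.
items, thetas, responses = simulate_mirt(
    n_persons=2000, items_per_dim=(15, 10, 10, 10), rho=0.8
)
r01 = correlation([t[0] for t in thetas], [t[1] for t in thetas])
print(round(r01, 2))
```

Repeating the simulation over a range of `rho` values, fitting a unidimensional model to each data set, and tabulating CPI scores per subscale would then give the kind of comparison summarized in the results that follow.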
Results from the simulation are illustrated in Figure 4. As the correlation between the subscales increased, a linear increase can be observed in the frequency of CPI scores flagged as nonsignificant. Beyond correlations of .95, the CPI results begin to converge on the same null findings reported in Table 2. Thus, when there is meaningful variability, as in multidimensional data, the CPI method does flag a higher proportion of scores.

Figure 4. Percent of nonsignificant (NS) CPI results based on varying subscale intercorrelations.
Discussion
Previous research has demonstrated that seemingly distinct subscores on unidimensional tests are more likely to represent noise than unique information; yet, eliminating subscores may be an unacceptable option for stakeholders. Consequently, practitioners must identify ways to provide diagnostic feedback that is defensible.
A limitation of the existing methods of determining the value of subscores, including VAR, is that they are based on reliability and correlation coefficients that are not sensitive to individual differences. As a result, it is possible that some subscores may be useful for some examinees—even for subscales that fail to meet the criteria for adding value overall. In contrast, the CPI approach is distinct from these other methods in that subscores are evaluated individually. Using CPI can potentially capture meaningful variability, if it exists, but also presents an option for providing a consistent subscore profile when there is no signal (or when there is insufficient power to detect one), which is the expected outcome for most examinees when an exam is unidimensional. Thus, reporting CPIs may be a compromise that satisfies score users by reporting subscore information as well as psychometricians by limiting misinterpretation, at most, to the rates of Type I and Type II error.
Nevertheless, the CPI approach is not a silver bullet regarding the limitations of short subscales that restrict the range of observed scores and create situations in which it is less likely, and sometimes impossible, to achieve certain CPI scores. For short subscales, a test taker with a high overall performance is less likely to receive a CPI indicating a relative strength just as a test taker with a low overall performance is less likely to receive a CPI indicating a relative weakness. On the other hand, subscales with very few items do not provide exceedingly reliable information, so that flagging someone as a higher or lower performer based on a very short subscale also bears some risk of providing meaningless information.
At the same time, guiding the remediation efforts of low-ability examinees is one of the principal (if not the principal) goals of reporting subscores. As a result, policy guidelines may need to be implemented to qualify when a CPI score should be presented as feedback to limit the Type II errors due to subscale length. Arguably, these policies should be conditional on proficiency, although in practice differential reporting might be politically infeasible.
Additionally, and similar to score reporting conventions, some interpretive language will be needed to communicate with score users on the criterion used for the CPI calculation. The illustrative example presented in this article provides performance feedback relative to oneself, though the method could easily be adapted to instead use estimated ability from a normed comparison group or another absolute criterion (e.g., passing score). Thus, a high overall performer who receives a low CPI or, more importantly, a low overall performer who receives a high CPI on some subscale may need some additional information to know how to proceed.
In assessing the generalizability of this CPI methodology, future research should investigate other model frameworks beyond Rasch. For example, the rate of flagging higher-than-expected subscores for low overall performers might be different when using more complex IRT models that factor in guessing. Also, future research may want to compare different variations of the recursive computation for the subscore probability distribution. As described in this article, an individual’s point estimate of θ associated with their total score was used to calculate conditional probabilities. An alternative, yet more computationally intensive, approach could be to instead compute the subscore probability distribution by integrating across the posterior distribution of θ for their total score, thus treating θ as a random variable. Differences between these approaches may also arise depending on the number of test items and corresponding confidence interval around the θ point estimates.
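The alternative mentioned above can be sketched by averaging the conditional subscore distributions over a distribution for θ instead of plugging in a single point estimate. The sketch below assumes, purely for illustration, that the posterior of θ is approximately normal with mean equal to the point estimate and standard deviation equal to its standard error, and it approximates the integral on an evenly spaced grid. The function `posterior_averaged_distribution`, the grid settings, and all numeric values are hypothetical.

```python
import math


def rasch_prob(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def sum_score_distribution(probs):
    dist = [1.0]
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for s, mass in enumerate(dist):
            new[s] += mass * (1.0 - p)
            new[s + 1] += mass * p
        dist = new
    return dist


def posterior_averaged_distribution(theta_hat, se, difficulties, n_points=41, width=4.0):
    """Mix conditional subscore distributions over a normal approximation for theta."""
    step = 2.0 * width / (n_points - 1)
    grid = [theta_hat + se * (-width + step * j) for j in range(n_points)]
    weights = [math.exp(-0.5 * ((t - theta_hat) / se) ** 2) for t in grid]
    total = sum(weights)
    mixed = [0.0] * (len(difficulties) + 1)
    for t, w in zip(grid, weights):
        dist = sum_score_distribution([rasch_prob(t, b) for b in difficulties])
        for s, mass in enumerate(dist):
            mixed[s] += (w / total) * mass
    return mixed


def variance(dist):
    mean = sum(s * m for s, m in enumerate(dist))
    return sum((s - mean) ** 2 * m for s, m in enumerate(dist))


# Hypothetical 20-item subscale, ability estimate 0 with standard error 0.5.
difficulties = [0.0] * 20
point = sum_score_distribution([rasch_prob(0.0, b) for b in difficulties])
mixed = posterior_averaged_distribution(0.0, 0.5, difficulties)
print(round(variance(point), 2), round(variance(mixed), 2))
```

Under these assumptions the averaged distribution is more spread out than the point-estimate distribution, so fewer subscores would fall in its tails. The size of that difference depends on the standard error of θ, which is one reason the two approaches may diverge more for shorter tests.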
Additionally, particularly for exam programs reporting a large number of subscores, there is a greater risk of Type I error. Future research on this methodology might consider how this risk can be best managed either through α-level adjustments or setting criteria for null comparisons. Further, the utility of the CPI approach may extend beyond identifying useful subscore information. For instance, a disproportionately higher flagging rate on a particular subscale (or specified item subset) might serve as an indicator of item preknowledge. Similarly, research in the person-fit literature has investigated whether a pattern of responses to a subset of items is inconsistent with the responses on the rest of the test (Belov, 2013; Sinharay, 2017). Thus, different facets of this proposed CPI approach should be explored.
When subscores are not worth reporting to the vast majority of test takers, the inference that an examinee performed similarly across all subscales is still important to communicate. Otherwise, in the absence of any subscore feedback, score users could mistakenly infer relative strengths and weaknesses based on their own recollections and beliefs about their performance. Providing a clear, consistent profile, as opposed to no information at all, could help reduce misinterpretation as well as support the best guidance possible for remediation on a unidimensional test: focus on the content domain weighting in the test blueprint and study everything.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
