Abstract
An approach to scoring tests with binary items, referred to as the D-scoring method, was previously developed as a classical analog to basic models in item response theory (IRT) for binary items. As some tests include polytomous items, this study offers an approach to D-scoring of such items and compares the results with those obtained under the graded response model (GRM) for ordered polytomous items in the framework of IRT. The proposed design of using D-scoring with “virtual” binary items generated from polytomous items provides (a) ability scores that are consistent with their GRM counterparts and (b) item category response functions analogous to those obtained under the GRM. This approach provides a unified framework for D-scoring and psychometric analysis of tests with binary and/or polytomous items that can be efficient in different scenarios of educational and psychological assessment.
There are ongoing efforts in the research on classical test theory and item response theory (IRT) to achieve simplicity and efficiency in test scoring and interpretations of test scores under a specific context and purpose of measurement (e.g., DeMars, 2008; Dimitrov, 2003, 2016, 2017; Fan, 1998; Hambleton & Jones, 1993; Lin, 2008; Oswald, Shaw, & Farmer, 2015). In line with this trend, an approach to scoring and equating tests with binary items, referred to as D-scoring, was developed as a classical analog to basic IRT models for binary items (Dimitrov, 2016, 2017). The D-scoring method is currently implemented for automated use with large-scale assessments at the National Center for Assessment in Saudi Arabia (e.g., Atanasov & Dimitrov, 2015). As some tests include polytomous items with ordered categories (e.g., to measure levels of proficiency in language testing or teacher certification tests), the purpose of this study is to propose a D-scoring analog to the graded response model (GRM) in IRT (Samejima, 1969, 1996). This will provide a unified framework for efficient D-scoring of tests that consist of binary and/or polytomous items. Presented next is a theoretical framework of basic concepts related to the GRM and D-scoring method, followed by the proposed design for D-scoring of polytomous items with a simulation study for illustration, and discussion of the results and related issues.
Theoretical Framework
Graded Response Model
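The GRM works for polytomous items with ordered categories (x = 0, 1, . . ., m). The analytic form of the GRM is expressed as

$$P^{*}_{ix}(\theta) = \frac{\exp[a_i(\theta - b_{ix})]}{1 + \exp[a_i(\theta - b_{ix})]}, \qquad (1)$$

where $P^{*}_{ix}(\theta)$ is the probability that a person with ability $\theta$ scores in category x or higher on item i, $a_i$ is the discrimination parameter of item i, and $b_{ix}$ is the threshold of category x (x = 1, 2, . . ., m).

The analytic function in Equation (1) is the cumulative category response function (CCRF). The probability of scoring exactly in category x, referred to as the score category response function (SCRF), is

$$P_{ix}(\theta) = P^{*}_{ix}(\theta) - P^{*}_{i(x+1)}(\theta), \qquad (2)$$

where x = 1, 2, . . ., m − 1. The probabilities at the two extreme categories are computed as follows: (a) at x = 0, $P_{i0}(\theta) = 1 - P^{*}_{i1}(\theta)$, and (b) at x = m, $P_{im}(\theta) = P^{*}_{im}(\theta)$.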
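The GRM quantities described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard logistic form of the GRM without a scaling constant; the function names `ccrf` and `scrf` and the item parameters are hypothetical and are not taken from the article.

```python
import math


def ccrf(theta, a, b):
    """Cumulative probability of scoring in a category with threshold b or higher."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def scrf(theta, a, thresholds):
    """Probabilities of scoring exactly in categories 0..m.

    `thresholds` holds the m ordered thresholds b_1 < ... < b_m of one item.
    The cumulative probabilities are padded with 1 (category 0 or higher)
    and 0 (above the top category), then differenced.
    """
    cum = [1.0] + [ccrf(theta, a, b) for b in thresholds] + [0.0]
    return [cum[x] - cum[x + 1] for x in range(len(thresholds) + 1)]


# Hypothetical item with four ordered categories (three thresholds).
a, thresholds = 1.2, [-1.0, 0.0, 1.5]
probs = scrf(0.0, a, thresholds)
for x, p in enumerate(probs):
    print(f"P(score = {x} | theta = 0) = {p:.3f}")
```

Padding the cumulative probabilities with 1 and 0 handles the two extreme categories with the same subtraction used for the interior categories, so the category probabilities sum to 1 by construction.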
D-Scoring Model for Binary Items
Under the D-scoring of unidimensional tests with binary items, the D-score of a person is based on the person’s response vector weighted by the expected difficulties of the items for the population of test takers (Dimitrov, 2016, 2017). If
For a test with n binary items, Dimitrov (2017) defined the D-score of person s as a linear combination of the person’s binary scores,
The D-scores range from 0 to 1
Table 1. Computation of D-Scores for Four Response Vectors on Five Binary Items.
Note. The D-scores are computed with the use of Equation (3) (s = 1, 2, 3, 4; i = 1, 2, 3, 4, 5).
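The difficulty-weighted scoring idea can be illustrated with a short Python sketch. It rests on two assumptions that are not spelled out in the text as extracted here: that the expected difficulty of an item is one minus its proportion of correct responses, and that the weighted sum is divided by the sum of all item difficulties so that scores fall between 0 and 1. The function names and the five difficulty values are hypothetical.

```python
def expected_difficulties(responses):
    """Assumed item difficulty: 1 minus the proportion of correct responses."""
    n_persons = len(responses)
    n_items = len(responses[0])
    return [
        1.0 - sum(row[i] for row in responses) / n_persons
        for i in range(n_items)
    ]


def d_score(response, deltas):
    """Assumed D-score: difficulty-weighted sum of binary scores, scaled to [0, 1]."""
    return sum(d * x for d, x in zip(deltas, response)) / sum(deltas)


# Hypothetical difficulties of five binary items (easiest to hardest).
deltas = [0.2, 0.3, 0.5, 0.6, 0.8]

# Four hypothetical response vectors; the middle two have the same number correct.
vectors = [
    [0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [1, 1, 1, 1, 1],
]
scores = [d_score(v, deltas) for v in vectors]
for v, s in zip(vectors, scores):
    print(v, round(s, 3))
```

Under these assumptions, two examinees with the same number of correct responses receive different scores when they answer items of different difficulty, which is the sense in which the response vector is weighted.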
D-Model of Item Response Function
The probability for correct response on item i by person s, given the Ds score of that person on the D-scale, is estimated as a predicted item score,
where Ds is the independent variable, obtained via Equation (3), whereas
Let

Figure 1. Item characteristic curves (ICCs) on the D-scale for three items, selected from 20 simulated items, with 2PLR parameters as follows: item 3 (
D-Scoring Design for Polytomous Items
Proposed here is a design for using D-scoring under Equation (3) with data on ordered categories of polytomous test items. In IRT, such data are typically analyzed using the GRM. To illustrate the idea, assume that a test consists of n polytomous items with four ordered categories (0, 1, 2, 3) indicating, say, levels of proficiency. To dichotomize the category scores for the purpose of using Equation (3), while preserving the hierarchical nature of these categories, each polytomous item i generates three “virtual” binary items,
Table 2. Binary Scores of Three “Virtual” Items Generated by Possible Category Scores (0, 1, 2, 3) of One Polytomous Item.
Note. A test of n polytomous items, with four ordered categories each, generates a test of 3n binary items analyzed under the D-scoring method.
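The dichotomization described above can be sketched as follows. The rule used here follows the design as stated in the Discussion: a virtual item receives a score of 1 when the examinee scored in the corresponding category or a higher one. The function names `virtual_items` and `expand_response_vector` are hypothetical.

```python
def virtual_items(score, n_categories=4):
    """Map one category score (0..n_categories-1) to n_categories-1 binary scores.

    Virtual item k (k = 1, 2, ...) is scored 1 when the category score is at
    least k, which preserves the ordering of the categories.
    """
    if not 0 <= score < n_categories:
        raise ValueError(f"score must be in 0..{n_categories - 1}, got {score}")
    return [1 if score >= k else 0 for k in range(1, n_categories)]


def expand_response_vector(scores, n_categories=4):
    """Expand category scores on n polytomous items into binary virtual items."""
    expanded = []
    for s in scores:
        expanded.extend(virtual_items(s, n_categories))
    return expanded


# The four possible category scores of one item and their virtual binary scores.
for x in range(4):
    print(x, virtual_items(x))

# Three polytomous items become nine virtual binary items.
print(expand_response_vector([2, 0, 3]))
```

The expanded vector can then be treated like any response vector on binary items for the purpose of D-scoring.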
It is important to note that the hierarchical dependency among virtual items generated by a polytomous item under the proposed scoring design is not a problem for the computation of D-scores because Equation (3) does not assume statistical local independence. In support of this argument, a previous simulation study showed that D-scoring is fairly robust to violations of IRT assumptions, including local independence (Luo & Dimitrov, 2018). In contrast, statistical local independence is a key assumption in IRT estimation procedures that use the likelihood function of a response vector on binary items, such as the widely used maximum likelihood estimation (see Hambleton, Swaminathan, & Rogers, 1991, pp. 33-35). Thus, the proposed design of using virtual binary items generated by polytomous items with ordered response categories is appropriate under the D-scoring method but not under maximum likelihood estimation in IRT.
It is also important to note that under the GRM, the discrimination parameter of a polytomous item,
Illustration With Simulated Data
Data were simulated (in R) under the GRM for a test of 15 polytomous items with four ordered categories per item (0, 1, 2, 3), with the generating item parameters given in Table 3 and ability scores of 1,000 examinees randomly drawn from the distribution θ ~ N(0,1). As shown in Table 2, each polytomous item generates three “virtual” binary items, so 45 such items were obtained and analyzed in the framework of D-scoring. The item parameters of these 45 virtual items are provided in Table 4. It was expected that the resulting D-scores would correlate highly with the θ scores obtained via the GRM on the 15 polytomous items. It was also expected that the cumulative and score category response functions (CCRF and SCRF) obtained under the GRM and the D-scoring would be similar in the type of information they provide, but not directly comparable because they are represented on different scales, namely the IRT logit scale for the GRM and the D-scale (from 0 to 1) under the D-scoring model. Reported next are the results for one simulated data set; the results from all replications were practically the same.
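A simulation of this kind can be sketched in Python using only the standard library. This is an illustration under several assumptions, not the authors' R code: the generating item parameters are drawn at random rather than taken from Table 3, the GRM is used in its standard logistic form, the D-score is assumed to be the difficulty-weighted sum of virtual item scores divided by the sum of difficulties, and the D-scores are correlated with the generating θ values rather than with GRM ability estimates.

```python
import math
import random


def simulate_grm_score(theta, a, thresholds, rng):
    """Draw one category score under the GRM from a single uniform draw.

    The cumulative probabilities decrease across ordered thresholds, so the
    number of them that exceed the draw is a valid category score.
    """
    u = rng.random()
    return sum(
        1 for b in thresholds if u < 1.0 / (1.0 + math.exp(-a * (theta - b)))
    )


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(2019)  # fixed seed for reproducibility
n_persons, n_items = 1000, 15

# Hypothetical generating parameters: one discrimination, three ordered thresholds.
items = [
    (rng.uniform(0.8, 2.0), sorted(rng.uniform(-2.0, 2.0) for _ in range(3)))
    for _ in range(n_items)
]
thetas = [rng.gauss(0.0, 1.0) for _ in range(n_persons)]
scores = [[simulate_grm_score(t, a, b, rng) for a, b in items] for t in thetas]

# Expand each category score into three virtual binary items (45 in total).
virtual = [[1 if s >= k else 0 for s in row for k in (1, 2, 3)] for row in scores]

# Assumed difficulty of a virtual item: 1 minus its proportion of 1s.
n_virtual = len(virtual[0])
deltas = [
    1.0 - sum(row[j] for row in virtual) / n_persons for j in range(n_virtual)
]

# Assumed D-score: difficulty-weighted sum scaled to the interval [0, 1].
total = sum(deltas)
d_scores = [sum(d * x for d, x in zip(deltas, row)) / total for row in virtual]

r = pearson(thetas, d_scores)
print(f"virtual items: {n_virtual}, correlation with theta: {r:.3f}")
```

The correlation printed by this sketch is not expected to match the value reported in the article, because the item parameters differ and the comparison is made with the generating abilities rather than with GRM estimates.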
Table 3. Graded Response Model (GRM) Item Parameters for Generating Simulated Data on 15 Polytomous Items With Four Ordered Categories Each.
Note. The threshold
Table 4. Item Parameters of “Virtual” Binary Items Generated by Polytomous Items With Simulated Data.
Note.
The resulting D-scores varied from 0.002 to 0.954 (Mean = 0.316 and SD = 0.180) on the D-scale (from 0 to 1). As expected, the D-scores highly correlated with the GRM ability scores (
The CCRFs under the GRM were obtained via Equation (1), whereas their counterparts under the D-scoring model were obtained via Equation (4) (i.e., they represent the IRFs of the virtual binary items generated by the respective polytomous item). For illustration, the CCRFs under the GRM and D-scoring model are depicted for one polytomous item—namely, Item 3. The GRM parameters of this item are its discrimination,

Figure 2. Cumulative category response function (CCRF) for categories 1, 2, and 3 of Item 3 with simulated data (obtained via Equation 1 under the graded response model [GRM]).
The D-model parameters under Equation (4) for the virtual binary items

Figure 3. Cumulative category response function (CCRF) for categories 1, 2, and 3 of Item 3 with simulated data (obtained via Equation 4 on the D-scale).
In Figures 4 and 5, the SCRFs for the polytomous Item 3 were obtained via Equation (2), with the probabilities

Figure 4. Score category response function (SCRF) for categories 0, 1, 2, and 3 of Item 3 with simulated data under the graded response model (GRM; obtained with the

Figure 5. Score category response function (SCRF) for categories 0, 1, 2, and 3 of Item 3 with simulated data under the D-scoring model (obtained with the
Discussion
The D-scoring method for tests with binary items was developed as a classical analog of basic IRT models for binary items (Dimitrov, 2016, 2017). This effort was motivated by practical needs for simplicity, efficiency, and transparency in automated test scoring and equating in the framework of large-scale assessments at the National Center for Assessment in Saudi Arabia. As some tests include polytomous items with ordered categories, the purpose of this article was to propose an approach to using the D-scoring method with polytomous items as an analog to the GRM in IRT (Samejima, 1969, 1996).
Under the proposed approach, each polytomous item with ordered categories generates “virtual” binary items. If an examinee scored in a given category of the polytomous item, the scoring design assumes that he or she mastered this category and its preceding categories, so the virtual items corresponding to these categories are assigned a score of 1 (see Table 2). This design raises the issue of local dependence for the set of virtual items generated by a polytomous item. However, as noted earlier, the computation of D-scores via Equation (3) is robust to violation of the assumption of statistical local independence (e.g., Luo & Dimitrov, 2018). In contrast, this assumption plays a key role in the IRT estimation of ability under the widely used method of maximum likelihood estimation (e.g., Hambleton et al., 1991). Therefore, the proposed scoring design of generating virtual binary items that correspond to ordered response categories of a polytomous item is suitable under the D-scoring method but not under maximum likelihood methods of ability estimation in IRT.
The results from the simulation study in this article indicate that the approach to D-scoring of polytomous items provides dependable estimation of the examinees’ ability on such items, with a very high correlation (0.970) between the GRM ability scores on the IRT logit scale and the D-scores as ability estimates on the D-scale (from 0 to 1). The value of this high correlation is enhanced by the fact that the IRT logit scale and the D-scale are both (close to) interval scales. Specifically, previous studies on this matter showed that the D-scale performs slightly better than the IRT logit scale in terms of intervalness by criteria of additive conjoint measurement, with the difference tending to decrease as test length increases (Dimitrov, 2016; Domingue & Dimitrov, 2015). Thus, there is high consistency in the estimation of the underlying ability under the GRM and the D-scoring model for polytomous items.
The category response functions (CCRF and SCRF) obtained via the D-scoring model with virtual items are similar to their GRM counterparts in the type of psychometric information that they provide. For example, under both the GRM and the D-scoring model, the CCRF for a response category shows the probability of scoring in that category or higher (e.g., see Figures 2 and 3). Also in both cases, the intersection of the SCRFs of two adjacent response categories shows the scale point where the examinees have equal chances of scoring in either of these two categories (e.g., see Figures 4 and 5). However, the category response functions obtained via the GRM and the D-scoring model are not directly comparable because they are based on different probabilistic models and different scales. An important difference is that the GRM item discrimination parameter does not vary across the response categories of the item, whereas under the D-scoring model each virtual item has its own slope, thus providing information about the discrimination level of each response category.
In conclusion, the main contribution of this article is the design for generating virtual binary items from polytomous items and the use of D-scoring for such virtual items, which provides (a) ability scores that are consistent with their GRM counterparts and (b) item category response functions analogous to those obtained under the GRM. An advantage of the CCRFs obtained under the D-scoring model over their GRM counterparts is that they differentiate the discrimination power of the item response categories, whereas the GRM-based discrimination parameter does not vary across the item response categories. A further contribution of the proposed approach is that it provides a unified framework for scoring and psychometric analysis of tests with binary and/or polytomous items with ordered response categories. Although this approach is illustrated in the context of its implementation at the National Center for Assessment in Saudi Arabia, it can be efficiently used in the assessment practices of other institutions for educational and psychological assessment.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
