Abstract
This article describes an approach to test scoring, referred to as delta scoring (D-scoring), for tests with dichotomously scored items. The D-scoring uses information from item response theory (IRT) calibration to facilitate computations and interpretations in the context of large-scale assessments. The D-score is computed from the examinee’s response vector, which is weighted by the expected difficulties (not “easiness”) of the test items. The expected difficulty of each item is obtained as an analytic function of its IRT parameters. The D-scores are independent of the sample of test-takers as they are based on expected item difficulties. It is shown that the D-scale performs a good bit better than the IRT logit scale by criteria of scale intervalness. To equate D-scales, it is sufficient to rescale the item parameters, thus avoiding tedious and error-prone procedures of mapping test characteristic curves under the method of IRT true score equating, which is often used in the practice of large-scale testing. The proposed D-scaling proved promising under its current piloting with large-scale assessments and the hope is that it can efficiently complement IRT procedures in the practice of large-scale testing in the field of education and psychology.
There are ongoing efforts in the theory and practice of measurement on comparing and bridging concepts and procedures from the classical test theory (CTT) and item response theory (IRT) (e.g., Bechger, Maris, Verstralen, & Beguin, 2003; DeMars, 2008; Dimitrov, 2003; Fan, 1998; Hambleton & Jones, 1993; Kohli, Koran, & Henn, 2015; Lin, 2008; MacDonald & Paunonen, 2002; Oswald, Shaw, & Farmer, 2015; Raykov & Marcoulides, 2016). Numerous CTT–IRT studies focus on the practical usefulness of combining CTT and IRT procedures of test scoring and item analysis to achieve simplicity in computations and interpretations, taking into account the specific context and purpose of measurement. The literature on CTT–IRT suggests that the trait-level estimation of individuals using the CTT often highly correlates with its more complex IRT counterpart (e.g., Embretson & Reise, 2000; Fan, 1998; Thorndike, 1982).
Without providing an extensive review of CTT-IRT integrations, we refer to a brief example in personality assessments with the use of the Navy Computer Adaptive Personality Scales (NCAPS; Houston, Borman, Farmer, & Bearden, 2006). Under the NCAPS, the examinee must choose between two stems that reflect different levels of a given trait, where stem levels were estimated by averaging subject matter expert ratings. In a study on NCAPS scoring, Oswald, Shaw, and Farmer (2015) compared an IRT-based scoring to much simpler alternative scoring methods. For example, under an alternative dichotomous scoring method, a test taker is given 1 point for endorsing the higher level stem in a pair and 0 points for the lower level stem; then the points across the number of attempted items are averaged. The score under this method is the proportion of the time a test taker endorsed the stem in the item that had the higher subject matter expert level. The authors concluded that IRT-driven test scoring is certainly no worse than simpler methods but may not always be decisively better . . . when computerized tests are unavailable, then it is possible that simple CTT-driven approaches to item selection and item scoring may do no worse, which is heartening as a matter of convenience. (Oswald, Shaw, & Farmer, 2015, p. 152)
In line with psychometric efforts to use IRT information on test data to simplify test scoring and interpretations, this paper provides an approach to test scoring and equating that can be suitable for large-scale assessments using tests with dichotomously scored items (as this is the main goal of the study, it should be kept in mind for a better understanding of the purpose of the methods and procedures presented in this paper). In the context of such assessments, item and person parameters are usually estimated with the use of IRT. Also, multiple forms of a given test are often equated to a base form of the test using, say, IRT true score equating under the nonequivalent groups with anchor test (NEAT) design (e.g., Angoff, 1971; Dorans, Moses, & Eignor, 2010; von Davier, Holland, & Thayer, 2004; Kolen & Brennan, 2014). Under this approach, the first step in equating a new test form, A, to the scale of an old (base) form, B, is to rescale the item parameters of Form A onto the ability scale of Form B through linear transformations by using item characteristic curve methods (Haebara, 1980; Stocking & Lord, 1983) or the mean/mean and mean/sigma methods (Loyd & Hoover, 1980; Marco, 1977). The second step is to map the test characteristic curve (TCC) of Form A onto the TCC of Form B (e.g., Hambleton, Swaminathan, & Rogers, 1991; Kolen & Brennan, 2014; Lord, 1980). The outcome is that true scores on Form A are mapped onto the true-score scale of Form B, thus equating them. In practice, the equated true scores are usually treated as equated raw scores on the test (e.g., see Kolen & Brennan, 2014, p. 197).
The IRT-based approach to test scoring and equating has advantages over CTT-based methods (e.g., van der Linden, 2013), but its practical implementation raises conceptual and technical issues that deserve attention. For example, under the IRT true score equating described above, the test performance of a person is reported and interpreted on the basis of his or her raw score (the IRT score, θ, plays an intermediate role in the equating process). With this, the ability information encoded in the person’s response vector is “lost” because different response vectors generate the same raw score. Furthermore, the procedures for equating multiple test forms are very complex and run into technical problems with the mapping of multiple TCCs. A particular source of complexity and estimation error in mapping TCCs is the Newton–Raphson method, which involves tedious iterations and can converge to erroneous solutions when the initial values are poorly chosen (e.g., see Kolen & Brennan, 2014, p. 194).
In an attempt to deal with these issues, the present article provides an approach to scoring and equating of tests with binary items which uses their IRT calibration to obtain test scores that depend on the person’s response vector, but not on the sample of examinees who took the test. Under the proposed approach, referred to here as delta-scoring (or D-scoring), the D-score of a person is derived from the person’s response vector weighted by the expected difficulty (delta, hence the name “delta-scoring”) of the items for the population of test takers. The equating of D-scores from multiple test forms on the D-scale of a base form, under the NEAT design, is greatly simplified as it avoids mapping of multiple TCCs (thus, the complexity and errors associated with the use of Newton–Raphson iterations are eliminated).
The procedures of D-scoring and equating, presented next, are currently under pilot applications with large-scale assessments at the National Center for Assessment (NCA) in Saudi Arabia. The motivation behind this effort came from the NCA call for developing an automated system of computerized scoring and equating. The currently existing system at the NCA provides the item scores (1/0) of the examinees, but all additional procedures of scoring and equating are conducted outside the system with the use of computer programs for IRT calibration under the three-parameter logistic (3PL) model and IRT true score equating of multiple test forms under the NEAT design. The integration of such procedures into an automated system for scoring and equating, including item bank feeding, runs into technical difficulties that relate to complex, tedious, and error-prone procedures of mapping multiple TCCs and other computations in sequential test scoring and equating. Another task in the context of NCA testing is that, given the IRT item parameters of a test assembled from an item bank, the test score of an examinee should be known directly from his or her response vector; that is, the test score should reflect not only how many items, but which specific items, were answered correctly by that examinee. The effort is to address these issues by using the proposed D-scoring, which is described next and illustrated with real data from large-scale assessments at the NCA (of course, applications of the proposed method are not limited to the context of NCA testing).
Theoretical Framework and Method
The idea behind the method of D-scoring and equating of tests with binary items is that (a) the D-score is based on the person’s response vector weighted by the expected difficulties of the items for the target population of test-takers, (b) the expected difficulties of the items are obtained as an analytic function of their IRT parameters, and (c) to equate the D-scores of two test forms, it is sufficient to rescale the item parameters of the new form to the scale of the base form. Thus, given the IRT estimates of item parameters (e.g., from an item bank), one can obtain the D-score for any response vector (pattern of 1/0 item scores) in two steps: (a) the expected item difficulties are obtained as a function of their IRT parameters, using an analytic formula, and (b) the D-score is the sum of the 1/0 scores in the person’s response vector weighted by the expected difficulties of the items in that response vector. As the expected item difficulties are sample independent, the person’s D-score, which is based on the expected item difficulties, is also independent of the sample of test takers. Furthermore, the equating of D-scores under the IRT-based NEAT design eliminates the complex and error-prone procedure of mapping TCCs. Details on the D-scoring and equating method are provided next.
Expected Item Score
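For the purposes of D-scoring, the expected item score, $\pi_i$, is obtained as a function of the item parameters under the two-parameter IRT model:

$$\pi_i = \frac{1 - \operatorname{erf}(X_i)}{2}, \qquad (1)$$

where $X_i = a_i b_i / \sqrt{2(1 + a_i^2)}$, $a_i$ is the item discrimination, $b_i$ is the item difficulty, and erf is the error function. For $X \ge 0$, the error function can be approximated as

$$\operatorname{erf}(X) \approx 1 - \left(1 + m_1 X + m_2 X^2 + m_3 X^3 + m_4 X^4\right)^{-4},$$

where m1 = 0.278393, m2 = 0.230389, m3 = 0.000972, and m4 = 0.078108. When X < 0, one can use that $\operatorname{erf}(-X) = -\operatorname{erf}(X)$.
In case of IRT calibration under the 1PL, Equation (1) is used with ai = 1, whereas under the 3PL, the expected item score is given by

$$\pi_i = c_i + (1 - c_i)\,\frac{1 - \operatorname{erf}(X_i)}{2},$$

where $c_i$ is the pseudo-guessing parameter of item i.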
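The computation of the expected item score can be sketched in a few lines of Python. This is a minimal illustration rather than the software used at the NCA: the function names are hypothetical, and the sketch assumes that Equation (1) has the closed form $\pi_i = [1 - \operatorname{erf}(X_i)]/2$ with $X_i = a_i b_i/\sqrt{2(1 + a_i^2)}$ (discrimination in the normal-ogive metric, standard normal ability), that the constants m1 to m4 belong to the four-term polynomial approximation of erf, and that guessing enters as $c_i + (1 - c_i)\pi_i$.

```python
import math

# Constants of the four-term polynomial approximation quoted in the text.
M1, M2, M3, M4 = 0.278393, 0.230389, 0.000972, 0.078108


def erf_approx(x):
    """Approximate erf(x); negative arguments use erf(-x) = -erf(x)."""
    if x < 0:
        return -erf_approx(-x)
    poly = 1.0 + M1 * x + M2 * x**2 + M3 * x**3 + M4 * x**4
    return 1.0 - poly ** -4


def expected_item_score(a, b, c=0.0):
    """Expected proportion correct (pi_i) for one item.

    Assumption: pi = (1 - erf(X)) / 2 with X = a*b / sqrt(2*(1 + a^2)),
    adjusted for pseudo-guessing as c + (1 - c) * pi.
    """
    x = a * b / math.sqrt(2.0 * (1.0 + a * a))
    pi_no_guessing = (1.0 - erf_approx(x)) / 2.0
    return c + (1.0 - c) * pi_no_guessing


def expected_item_difficulty(a, b, c=0.0):
    """delta_i = 1 - pi_i, the expected proportion of incorrect responses."""
    return 1.0 - expected_item_score(a, b, c)
```

Under these assumptions, an item with b = 0 and no guessing has an expected score of 0.5, and raising b while holding a fixed raises the expected difficulty, which is the behavior the text describes.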
D-Scoring
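As can be noticed, $\pi_i$ is the expected proportion of correct responses on item i (the expected item “easiness”), so the expected item difficulty is $\delta_i = 1 - \pi_i$. The D-score of person s on a test of n binary items is

$$D_s = \sum_{i=1}^{n} X_{is}\,\delta_i, \qquad (3)$$

where $X_{is}$ is the score (1/0) of person s on item i.
The D-scale can be treated as a continuous numeric scale, with the scores on a test of n items ranging from 0 to $D_{\max} = \sum_{i=1}^{n} \delta_i$.
Consider a test of five binary items with expected item difficulties δ1, δ2, δ3, δ4, and δ5. For a person with the response vector 1 1 0 0 1, we have D = δ1 + δ2 + δ5.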
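The five-item example can be made concrete with a short Python sketch. The function name `d_score` and the numeric difficulties below are hypothetical values chosen only for illustration; they are not taken from any NCA test.

```python
def d_score(responses, deltas):
    """Sum of expected item difficulties over the correctly answered items."""
    if len(responses) != len(deltas):
        raise ValueError("responses and deltas must have the same length")
    return sum(x * d for x, d in zip(responses, deltas))


# Hypothetical expected difficulties for a five-item test.
deltas = [0.20, 0.35, 0.50, 0.65, 0.80]

person_a = [1, 1, 0, 0, 1]  # items 1, 2, and 5 correct
person_b = [0, 0, 1, 1, 1]  # also three correct, but harder items

d_a = d_score(person_a, deltas)  # delta1 + delta2 + delta5
d_b = d_score(person_b, deltas)  # delta3 + delta4 + delta5
d_max = sum(deltas)              # D-score of an all-correct response vector
```

Both persons have the same number-correct score, yet `d_b` exceeds `d_a` because the second response vector contains harder items.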
Standard Error of D-Scores
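The derivation of the standard error of $D_s$, $SE(D_s)$, uses the conditional independence of the item responses given the ability of the person.
The resulting formula is

$$SE(D_s) = \sqrt{\sum_{i=1}^{n} \delta_i^2\, P_{is}\,(1 - P_{is})}, \qquad (5)$$

where $P_{is}$ is the probability of a correct response of person s on item i under the IRT model.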
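A conditional standard error of this kind can be sketched as follows. The sketch is an assumption-laden illustration, not the article's derivation: it treats the item scores as conditionally independent given ability, so that the variance of the weighted sum is $\sum_i \delta_i^2 P_i(\theta)[1 - P_i(\theta)]$, and it uses a 3PL response function with a scaling constant of 1.7, which is also an assumption. The function names are hypothetical.

```python
import math


def p_3pl(theta, a, b, c, scale=1.7):
    """3PL probability of a correct response (scale=1.7 is an assumption)."""
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))


def se_d(theta, items, deltas):
    """Conditional standard error of the D-score at ability theta.

    items: list of (a, b, c) tuples; deltas: matching expected difficulties.
    Assumes conditionally independent item scores, so the variance of the
    weighted sum is sum(delta_i^2 * P_i * (1 - P_i)).
    """
    variance = 0.0
    for (a, b, c), delta in zip(items, deltas):
        p = p_3pl(theta, a, b, c)
        variance += delta * delta * p * (1.0 - p)
    return math.sqrt(variance)
```

Because $P(1 - P)$ is largest when $P$ is near 0.5, a standard error of this form is largest for examinees in the middle of the ability range and smaller toward the extremes, at least for items with little guessing.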
Item Reliability Under D-Scoring
Let Xi denote the observed score on item i and
This finding can be useful, say, in the selection of items that maximize the internal consistency reliability as the respective procedure uses the item reliability (e.g., Allen & Yen, 1979, p. 126).
Equating of D-Scales
The equating of D-scores on a new test form A onto the scale of an old (base) form B can be performed in two steps. First, the IRT item parameters of form A are rescaled onto the scale of form B through linear transformations by using methods such as the item characteristic curve methods (Haebara, 1980; Stocking & Lord, 1983), mean/mean method, and mean/sigma method (Loyd & Hoover, 1980; Marco, 1977). Second, by representing the IRT item parameters of two test forms on a common scale, the D-scores on these two forms, obtained through the use of Equations (1) to (3), are also on a common scale because they are direct functions of the IRT item parameters. The D-score equating approach can be particularly efficient when multiple new test forms (say, A1, A2, . . ., Am) need to be equated to a base form, B. Specifically, after rescaling the IRT item parameters of the new forms onto the ability scale of form B through a sequence of scale transformations over a “chain” of test forms A1 → A2 → ⋯ → Am → B, the item parameters of all test forms are on a common scale, so the D-scores obtained as a function of these item parameters for each test form (via Equations 1-3) are also on a common scale. For details and formulas for such a chain rescaling, the reader may refer to Li, Jiang, and von Davier (2012).
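The first step can be illustrated with the mean/sigma method, the simplest of the rescaling methods cited above. The sketch below uses the standard form of that method (slope from the ratio of anchor-item difficulty standard deviations, intercept from their means); the function names and the choice of population standard deviations are assumptions made for illustration, and the article does not state which rescaling method is used in practice.

```python
import statistics


def mean_sigma_constants(b_new_anchor, b_base_anchor):
    """Slope and intercept that place new-form difficulties on the base scale.

    Both arguments hold the difficulty estimates of the same anchor items,
    calibrated separately on the new form and on the base form.
    """
    slope = statistics.pstdev(b_base_anchor) / statistics.pstdev(b_new_anchor)
    intercept = statistics.mean(b_base_anchor) - slope * statistics.mean(b_new_anchor)
    return slope, intercept


def rescale_item(a, b, c, slope, intercept):
    """Linear rescaling: a* = a / slope, b* = slope * b + intercept, c* = c."""
    return a / slope, slope * b + intercept, c
```

Once every item of the new form has been rescaled in this way, its expected difficulty is recomputed from the rescaled parameters, and the D-scores follow from the response vectors without any mapping of test characteristic curves.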
Intervalness of the D-Scale
A key question about the delta scale (D-scale) is whether it is an interval scale and how D-scores compare with IRT ability scores (θs, “thetas”) in this regard. It is known that an interval scale exists when the axioms of additive conjoint measurement (ACM) hold within a given dataset (Luce & Tukey, 1964; see also, Karabatsos, 2001). Referring to a scale, the term intervalness is used in the literature to indicate the degree to which the scale data are consistent with the axioms of ACM (e.g., Domingue, 2014). From this perspective, the task here is to compare the D-scale and θ-scale on intervalness. As D-scores and their standard errors, SE(D), are obtained in the framework of IRT, where the θ-scale is supposed to be (close to) interval, the intervalness of the D-scale is compared to that of the θ-scale using a method proposed by Domingue (2014). As shown with the real-data example in the next section, the D-scale behaves better than the θ-scale on criteria of intervalness, with the difference in this regard decreasing with the increase of the number of test items.
Scaling of D-Scores
For practical reports and interpretations of test scores at the NCA, the D-scores are transformed, at the current piloting stage, into scale scores that range from 0 to 100, to be in line with the widely adopted scaling from 0 to 100 with educational assessments in Saudi Arabia. Specifically, D-scores are transformed into scale scores, SD, using a linear transformation that results in a proportional “stretch” of the D-scale from 0 to 100, namely: $S_D = 100\,D / D_{\max}$, where $D_{\max}$ is the maximum possible D-score on the test.
Illustration With Real Data
As noted at the beginning, automated procedures of D-scoring and equating are under pilot applications with large-scale assessments at the NCA in Saudi Arabia. Most of these assessments are based on (a) aptitude and achievement tests administered to high school graduates, as a part of their application to Saudi universities, and (b) multiple tests for teacher certification in Saudi Arabia. All tests are standardized and consist of dichotomously scored multiple-choice items, with an ongoing development of test forms and their equating using the IRT true score equating under the NEAT design. Because of the high complexity and effort of scoring and equating in this context, the use of automated procedures of D-scoring and equating is deemed very efficient, especially with the availability of an item bank which contains IRT item parameters (under the 3PL model). This allows for direct computations of expected item difficulties, δi, D-scores for response vectors of examinees, and (relatively fast and simple) D-score equating of multiple new test forms to the scale of a target test form. The procedures of D-scoring and equating are under piloting with real data on multiple forms for different tests at the NCA. The software developed for this purpose is named System for Automated Scoring and Equating (SATSE; Atanasov & Dimitrov, 2015).
Because of space considerations, provided here are only some results and clarifications related to D-scoring and equating with real data from two tests at the NCA: (a) the General Aptitude Test–Verbal Part (GAT-V), which is administered to high school graduates, and (b) the General Teacher Test (GTT), which is used for certification of teacher candidates in Saudi Arabia. First, two test forms of GAT-V are used to illustrate D-scoring and equating. Second, a comparison of D-scores and IRT ability scores, θ, in terms of their intervalness is provided with the use of data from GAT-V and GTT. Although GAT-V and GTT data were found to be unidimensional in previous studies on their validity and psychometric features, testing for dimensionality and estimation of reliability were also performed with the data used here. Specifically, the unidimensionality of the sample data on the two GAT-V forms and the GTT was supported by a tenable data fit of a one-factor model tested in the framework of confirmatory factor analysis (CFA) with the use of the computer program Mplus (Muthén & Muthén, 2010). The results are summarized in Table 1.
Table 1. Testing for Unidimensionality of Data From Two GAT-V Test Forms (Base and New) and GTT.
Note. GAT-V = General Aptitude Test–Verbal; GTT = General Teacher Test; CFI = comparative fit index, TLI = Tucker–Lewis index, WRMR = weighted root mean square residual, RMSEA = root mean square error of approximation; CI = confidence interval; LL = lower limit; UL = upper limit. A tenable data fit is in place with CFI > .90, TLI > .90, WRMR is close to 1, and RMSEA < .05.
In Mplus, WRMR is used with categorical variables, which is the case with the study data (with continuous variables, standardized root mean square residual [SRMR] is used).
GAT-V base form.
GAT-V new form.
The reliability of the sample data was estimated under the latent variable modeling (LVM) approach using Mplus (e.g., Raykov, 2007; Raykov, Dimitrov, & Asparouhov, 2010). The resulting reliability estimates, provided in Table 2 with their 95% confidence intervals, range from .848 to .883, which is adequate for the purpose of this illustration. The Cronbach’s coefficient alpha for internal consistency reliability is also provided in Table 2. As can be seen, the alphas are smaller than their LVM-based counterparts. A plausible explanation is that the Cronbach’s alpha requires essentially tau-equivalent measures (i.e., all observed measures have equal loadings to the latent factor that they represent; e.g., Raykov & Marcoulides, 2016). However, this assumption is difficult to meet with congeneric binary measures, whereas it is not required with the LVM approach to reliability estimation.
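For reference, Cronbach's coefficient alpha can be computed directly from a persons-by-items matrix of binary scores. The sketch below uses the standard formula $\alpha = \frac{k}{k-1}\left(1 - \sum_i \sigma_i^2 / \sigma_X^2\right)$; the function name is hypothetical, and the sketch does not reproduce the LVM-based estimates, which require a latent variable modeling program such as Mplus.

```python
import statistics


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of per-person lists of item scores.

    Every inner list must have the same length k (number of items), and the
    total scores must not all be equal (otherwise the variance is zero).
    """
    k = len(scores[0])
    item_variances = [
        statistics.pvariance([row[i] for row in scores]) for i in range(k)
    ]
    total_variance = statistics.pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1.0 - sum(item_variances) / total_variance)
```

Using population variances for both the items and the total score keeps the ratio consistent; sample variances for both would give the same value.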
Table 2. Estimates of Score Reliability for Two Test Forms of GAT-V and GTT Under Two Approaches to Estimation (α and LVM).
Note. GAT-V = General Aptitude Test–Verbal; GTT = General Teacher Test; LVM = latent variable modeling; CI = confidence interval. Cronbach’s α assumes that the measures are essentially tau-equivalent, whereas the LVM approach does not require this assumption.
GAT-V base form.
GAT-V new form.
Computation of D-Scores
The computation of D-scores is illustrated with data from a base test form of GAT-V, which has 20 dichotomously scored multiple-choice items that measure the examinees’ ability in reading comprehension and sentence completion. The data consist of the binary scores (1/0) of 9,937 high school graduates on the 20 items of this GAT-V test form. The distribution of total test scores (number of correct responses) was close to normal, ranging from 1 to 18 (M = 8.43, SD = 2.96). The IRT estimates of the item parameters under the 3PL model are provided in Table 3 (a = discrimination, b = difficulty, and c = pseudo-guessing; e.g., see Hambleton, Swaminathan, & Rogers, 1991). The IRT calibration was performed under maximum likelihood estimation with the EM algorithm using the computer program Xcalibre 4.2 (Guyer & Thompson, 2013). The expected item difficulty, δi (i = 1, 2, . . ., 20), is also given in Table 3. Recall that δi is the proportion of the target population of examinees who provided an incorrect response on the item; that is, δi shows how difficult the item is for the entire target population (δi = 1 – πi, where πi, the expected “easiness” of the item, is computed as a function of the item parameters via Equation 1).
Table 3. Item Parameters and D-Scores for 3 Examinees on 20 Test Items of GAT-V Base Form, B.
Note. GAT-V = General Aptitude Test–Verbal; 3PL = three-parameter logistic. δ i = expected item difficulty (the population proportion of incorrect item responses). Column Dis is the product of columns Xis and δi; that is, Dis = δiXis; (i = 1, . . ., 20; s = 1, 2, 3). The D-score of person s is the sum of the entries in column Dis; that is, Ds = D1s+⋯+D20s. The maximum possible D-score (for all item responses correct) is Dmax = δ1+⋯+δ20 = 7.327. Given in boldface are the numbers of seven items in the base form, B, which are used as common items with the new test form, A (see Table 4).
In Table 3, the columns labeled Xi1, Xi2, and Xi3 contain the response vectors of three examinees, with the first two having the same total test score (X1 = X2 = 5), but different response vectors, whereas the third person has all items correct (X3 = 20). The response vectors Xi1, Xi2, and Xi3 are multiplied by the item difficulty vector, δi, and the resulting products are stored in the columns labeled Di1, Di2, and Di3, respectively. Then, by virtue of Equation (3), the sum of the entries in column Dis renders the Ds score of person s (s = 1, 2, 3), namely D1 = 1.717, D2 = 1.907, and D3 = 7.327. Note that the score of the third person equals the maximum possible D-score on the test, which is the sum of all δi values (Dmax = 7.327), because that person has answered all 20 items correctly. On the other hand, although the first two persons have the same total score (X1 = X2 = 5), they have different D-scores because of having different response vectors; that is, they have answered items with different difficulties for the target population. The D-scores of the other examinees in the sample (N = 9,937) are obtained in the same way. The distribution of D-scores was close to normal, ranging from 0 to 7.327 (M = 3.802; SD = 1.272). The correlation of the D-scores with the total test score (X = number of correct responses) was very high (0.962). However, the X-scores can take only 21 different values (from 0 to 20), whereas the D-scores can take values generated from thousands of different response vectors on 20 binary items.
The conditional standard error for each Ds score, SE(Ds), was computed via Equation (5); (the true scores,

Figure 1. Standard errors of D-scores on GAT-V (form B) data. GAT-V = General Aptitude Test–Verbal.
It is also worth noting that the D scores provide higher differentiation of examinees with low or high abilities compared with IRT ability scores reported with the use of computer programs for IRT calibration. For example, although the theoretical values of IRT ability vary from −∞ to +∞, they are always reported in a practically reasonable interval, say, from −7 to 7 on the logit scale. Thus, the examinees assigned to an extreme category (say, −7 or 7) in IRT calibrations are much better differentiated on the D-scale. For example, under the IRT scoring, via Xcalibre 4.2, with the data on GAT-V form B (N = 9,937) (Figure 2), it was found that 204 examinees were assigned to the lowest score category (θ = −7) on the logit scale, whereas 172 of them were assigned different scores on the D-scale, ranging from 0 to 3.713 (M = 1.468, SD = 0.654).

Figure 2. Standard errors of item response theory ability scores (thetas) on GAT-V (form B) data. GAT-V = General Aptitude Test–Verbal.
Equating of D-Scales
In this example, the D scores on a new test form of GAT-V (Form A) are equated to the D-scale of the base form of GAT-V (Form B) described in the previous section. The data on test form A consist of binary scores of 9,781 high school graduates on 20 items, seven of which are common items with the items of form B (as described earlier, test form B was administered to 9,937 high school graduates). The samples of examinees who took forms A and B, respectively, are treated as “nonequivalent groups” coming from two different populations of test takers on Forms A and B. The items of Forms A and B are calibrated under the 3PL model in IRT using Xcalibre 4.2. Items 1, 2, 3, 4, 11, 12, and 13 in Form B are common (anchor) items that correspond to Items 1, 2, 3, 4, 18, 19, and 20, respectively, in Form A. The correlation between the scores on the set of common items and the total test score is .883 and .887 for Forms A and B, respectively. The reliability estimates for the scores on the two test forms are also very similar, .883 and .848 for Forms A and B, respectively (see Table 2). These results are in support of the appropriateness of equating test Forms A and B (e.g., Kolen & Brennan, 2014).
The D-scale equating is performed in three major steps. First, the item parameters of the new Form A (a, b, c) are transformed onto the scale of the base Form B, thus obtaining rescaled item parameters A (a*, b*, c*). Second, the expected difficulties for the items of Form A are also “rescaled” by computing them as a function of the rescaled item parameters (a*, b*, c*), as described earlier. Thus, if
Table 4. Estimates of Item Parameters and Expected Item Difficulties for Test Form A, Before and After Their Rescaling Onto the Scale of Base Form B.
Note. Given in boldface are the numbers of seven items used as common items with the base test form B; (Items 1, 2, 3, 4, 18, 19, and 20 in form A are used as Items 1, 2, 3, 4, 11, 12, and 13, respectively, in the base test form B; see Table 3).
The examination of Table 4 shows that the rescaled values of expected item difficulties,
For illustration, consider the response vector 1100110101000000000 of a person with 6 correct responses on the 20 items of Form A. The Ds score of that person on Form A, prior to its equating, is obtained by using Equation (3) with the given response vector and the expected item difficulties
D-Scale Intervalness
A key question about the delta scale (D-scale) is whether it is an interval scale and how D-scores compare with IRT ability scores (thetas) in terms of intervalness. This question was addressed with a previous study at the NCA by comparing the D-scale with the IRT theta scale from the perspective of additive conjoint measurement using an approach referred to as ConjointChecks (Domingue, 2014). The details in methodology and findings, provided with a technical report on that study (Domingue & Dimitrov, 2015), are not presented here for space consideration, but some main points and results are replicated and illustrated with data in this example. Specifically, used are the data with the base form of GAT-V, described in the previous section, and data on the GTT. The GTT data consist of the binary scores (1/0) of 45,749 teacher candidates on 79 multiple-choice items.
In the case of item response data, the ACM axioms are concerned with orderings amongst the probabilities for individuals at different abilities responding to items with different difficulties. The ConjointChecks approach (Domingue, 2014) implements an algorithm for checking the axioms of ACM. The question is whether the observed proportion of correct item responses for a given set of respondents assumed to be at some common ability is consistent with the posterior distribution of the probabilities for correct responses generated by the algorithm. If not, the ConjointChecks algorithm is said to have detected a “violation.” The violation percentages (Vp) are checked in 3 × 3 submatrices of the full data matrix. These matrices are either formed via a random selection of items and groups of individuals or via the collection of adjacent items and groups of individuals.
The ConjointChecks approach is readily applied to discrete ability estimates, but in the case of continuous data, such as D-scores and IRT thetas, a discretization of the continuous number line is achieved by a division of the line referred to as “banding.” Some bandings are more “stringent” than others in the sense that they are more likely to place a person in the wrong band given the error associated with the person’s score (here, D or theta). One can expect that a more stringent banding would produce fewer axiom violations (for details, see Domingue, 2014). In this example, the violation percentages (Vp) produced by common bandings of the D and theta scores are examined across levels of stringency.
Violation percentages based on the natural banding of the sum scores were computed for a set of 5,000 respondents. For the continuous abilities, the mean score (either theta or D) for individuals at a given sum score was considered. The banding was then defined by the midpoints between all consecutive means. The Vp were examined from two types of checks. The first check looks at all adjacent 3-matrices from the full data matrix while the second one considers 5,000 randomly chosen 3-matrices. Along with the unweighted Vp, weighted Vp were also considered, where violations at a given portion of the scale are weighted based on the number of individuals at that part of the scale. The results are summarized in Table 5. The sum scores generate the smallest percentages of violations (for all Vp), but this can be expected given that this banding is based on the sum scores. The D scores look very similar to the sum scores, especially on the weighted metrics. The theta scores produce more violations, notably more so for the randomly chosen 5,000 3-matrices. Also, the weighted metric performs better than its unweighted counterpart in terms of smaller percentage of violations.
Table 5. Violation Percentages (Vp) for Natural Banding of Sum Scores.
Note. Vp-Adj = all adjacent 3-matrices from the full data matrix; Vp-5k = 5,000 randomly chosen 3-matrices; Uw = unweighted; W = weighted; NCR = sum score (number correct responses).
Larger values indicate more stringent bandings (Domingue, 2014).
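The natural banding described above (mean continuous score at each sum score, with cutpoints at the midpoints between consecutive means) can be sketched as follows. The function name is hypothetical, and the sketch covers only the construction of the cutpoints, not the ConjointChecks algorithm itself.

```python
from collections import defaultdict


def natural_banding(sum_scores, continuous_scores):
    """Cutpoints at the midpoints between consecutive per-sum-score means.

    sum_scores: number-correct score of each respondent.
    continuous_scores: matching theta or D-score of each respondent.
    """
    groups = defaultdict(list)
    for total, value in zip(sum_scores, continuous_scores):
        groups[total].append(value)
    # Mean continuous score for each observed sum score, in increasing order.
    means = [sum(groups[t]) / len(groups[t]) for t in sorted(groups)]
    return [(low + high) / 2.0 for low, high in zip(means, means[1:])]
```

With k distinct observed sum scores, the function returns k − 1 cutpoints, so the continuous scale is divided into the same number of bands as the sum-score scale.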
As there is no obvious banding available for the continuous theta and D scores, the effect of banding stringency on Vp was investigated for a number of potential bandings. Bandings are characterized by the number of cutpoints and the starting point of the first band in the banding (the cutpoints are evenly spaced). The number of cutpoints was varied between 10 and 190 in increments of 20. For the GTT with 79 items, there are 80 bands in the sum score banding, which is roughly the middle of the range used here. The first cutpoint was at either the 0.005, 0.01, or 0.015 quantile of the score distribution and then intervals were evenly spaced across the scale of the abilities, with the last cutpoint being at either the 0.995, 0.99, or 0.985 quantile, respectively. Unlike the case where the sum score banding was used as the basis for bandings for theta and D, now the banding is defined within the scale so that the Vp, based on a choice of banding, are optimal for each scale.
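An evenly spaced banding of this kind can be sketched as below. The sketch assumes symmetric tails (first cutpoint at quantile q, last at quantile 1 − q) and a linear-interpolation definition of the quantile; both are assumptions, as are the function names, and other quantile definitions would shift the cutpoints slightly.

```python
def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list."""
    position = q * (len(sorted_values) - 1)
    low = int(position)
    high = min(low + 1, len(sorted_values) - 1)
    fraction = position - low
    return sorted_values[low] * (1.0 - fraction) + sorted_values[high] * fraction


def even_banding(scores, n_cutpoints, tail=0.01):
    """n_cutpoints evenly spaced from the `tail` to the `1 - tail` quantile."""
    if n_cutpoints < 2:
        raise ValueError("at least two cutpoints are needed")
    values = sorted(scores)
    first = quantile(values, tail)
    last = quantile(values, 1.0 - tail)
    step = (last - first) / (n_cutpoints - 1)
    return [first + i * step for i in range(n_cutpoints)]
```

Calling `even_banding(scores, n, tail)` for n = 10, 30, ..., 190 and tail = 0.005, 0.01, 0.015 would generate a grid of bandings like the one varied in the text.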
The results for the 20-item GAT-V and 79-item GTT are depicted in Figure 3. Used are only weighted Vp because they perform better (smaller percentage of violations) compared with unweighted Vp (see Table 5). The theta scores were obtained with IRT calibration under the 3PL using (a) maximum likelihood estimation (MLE) and (b) expected a posteriori (EAP) estimation (as a side note, the correlations between the D-scores and IRT theta scores were 0.928 for the 79-item GTT data and 0.896 for the 20-item GAT-V data). As can be seen in Figure 3, the percentage of violations decreases with the increase of stringency in the banding, which supports the intuitively expected tradeoff between Vp and stringency (see Domingue, 2014). The D-scores consistently produce lower Vp compared with the IRT ability scores (thetas), regardless of the approach to theta estimation (MLE or EAP), with the difference tending to decrease with the increase of the test length. Thus, the D-scores produce fewer violations of the ordering axioms of ACM than do the IRT theta scores. In other words, the D-scale performs better than the IRT theta scale in terms of intervalness from the perspective of ACM, under the ConjointChecks approach to checking the ACM axioms (Domingue, 2014).

Figure 3. Comparison of weighted Vp and stringency for GAT-V (20 items) and GTT (79 items).
Conclusion
Under the delta-scoring (D-scoring), the score of an examinee on a test of n binary items is the sum of expected difficulties (
The results related to psychometric features of the D-scale, reported with the illustrative example, were replicated with numerous sets of real data from large-scale assessments at the NCA (not provided here for space considerations). In summary, (a) the D-scores highly correlate (in the neighborhood of .90) with the IRT ability scores, θ; (b) the precision of the D-scores is higher for low- and high-ability examinees, which is just the opposite of the IRT case, where the precision of θ estimates decreases for low- and high-ability examinees; (c) the D-scale produces fewer violations of the axioms of additive conjoint measurement than the IRT theta scale, and thus performs better in terms of intervalness, with the difference tending to decrease with the increase of the test length; and (d) the D-scores differentiate better between examinees who, under IRT estimation, are assigned to the extreme categories (say, −7 and 7) of a practically reasonable interval on the logit scale. These properties of the D-scale are particularly useful in testing that aims at differentiating among low test performers (e.g., to identify students “at risk”) or high test performers, say, in the context of medical education testing, admission of students to universities, teacher certification, and so forth.
As noted earlier, the development of D-scoring was motivated by a call at the NCA in Saudi Arabia for the development of procedures for automated test scoring and equating that are methodologically sound and technically feasible. An important aspect of this call was the request to use IRT item bank information for (a) test assembling; (b) direct scoring of tests based on the item parameters available in the bank and the response vectors of examinees; (c) sequential equating of multiple test forms; and (d) feeding the item bank with new trial items. The call was addressed with the development and piloting of D-scoring and equating at the NCA, with the procedures being implemented in a computerized system for automated test scoring and equating (SATSE; Atanasov & Dimitrov, 2015). Along with IRT estimates of the item parameters under the 3PL model, the item bank at the NCA is now upgraded to include the expected item difficulty, δi, as a direct function of these item parameters. As the item parameters in the item bank are on the same scale, the δi values are also on a common scale; that is, δi for an item shows how difficult that item is (as a “hurdle”) for the population of test takers on the scale of a designated base form of the test. When trial items are used with a new form of a test, their IRT parameters and expected difficulty, δi, for the population of test takers for the new test form are rescaled to the common scale for the target population of test takers for the base test form.
In conclusion, the proposed method of D-scoring and equating proved promising under its current piloting with large-scale assessments in Saudi Arabia and the hope is that this method can efficiently complement IRT procedures in the practice of large-scale testing in the field of education and psychology.
Footnotes
Appendix A
Appendix B
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
