Abstract
In the nonequivalent groups with anchor test (NEAT) design, the responses that are absent by design can be treated as a planned missing-data scenario. In the context of small sample sizes, we present a machine learning (ML)-based imputation technique called chaining random forests (CRF) to perform equating tasks within the NEAT design. Specifically, seven CRF-based imputation equating methods are proposed based on different data augmentation methods. The equating performance of the proposed methods is examined through a simulation study. Five factors are considered: (a) test length (20, 30, 40, 50), (b) sample size per test form (50 versus 100), (c) ratio of common/anchor items (0.2 versus 0.3), (d) equivalent versus nonequivalent groups taking the two forms (no mean difference versus a mean difference of 0.5), and (e) three different types of anchors (random, easy, and hard), resulting in 96 conditions. In addition, five traditional equating methods were considered: (1) the Tucker method; (2) the Levine observed score method; (3) the equipercentile equating method; (4) the circle-arc method; and (5) concurrent calibration based on the Rasch model. Together with the seven CRF-based imputation equating methods, a total of 12 methods were compared in this study. The findings suggest that, benefiting from the advantages of ML techniques, CRF-based methods that incorporate the equating result of the Tucker method, such as the IMP_total_Tucker, IMP_pair_Tucker, and IMP_Tucker_circle methods, can yield more robust and trustworthy estimates for the “missingness” in an equating task and therefore result in more accurate equated scores than their counterparts in short-length tests with small samples.
Introduction
In educational settings, producing interchangeable scores on different test forms (i.e., equating) is essential to keep an assessment fair and comparable when examinees respond to non-identical items/questions (Kolen & Brennan, 2004). A majority of researchers and practitioners perform equating through the nonequivalent groups with anchor test (NEAT) design, which adjusts items’ properties to estimate how an examinee would have performed had this examinee been administered items that were, in fact, never administered (Maris et al., 2010). To illustrate without loss of generality, a typical NEAT design with two forms made of three batches of items is given here: one batch for the base/reference form only, the second batch for the target form only, and the third batch shared between the forms (i.e., the anchor set). Traditionally, statistical techniques for equating are about transformations of both modeling parameters and item responses, including the ones based on equipercentile equating, linear equating methods, item response theory (IRT) observed score and true score equating, van der Linden local equating, the Levine nonlinear method, kernel equating (KE), and others (see Kolen & Brennan, 2004 for details). Furthermore, post-stratification (PSE), Levine observed score linear, and chained equating (CE) methods are typically used in KE when a NEAT design is present (von Davier et al., 2004). In addition to being treated as a transformation, equating can also be handled as a missing data problem.
Following a popular definition of missing data in the statistics literature, the part of responses absent in a NEAT design can be regarded as following the missing at random (MAR) mechanism (Little & Rubin, 2002). Accordingly, values underlying the missing areas depend on the part of the design for which responses are observable. On the other hand, Sinharay and Holland (2010) claimed that since the missingness in the NEAT design is deliberately planned, it is, from a theoretical perspective, likely to be missing completely at random (MCAR) instead. We believe this assumption applies in most cases, except possibly when responses are affected by testing time. Methodologically speaking, techniques for handling missing data problems can be applied to both MAR and MCAR settings. Previous studies have treated the NEAT design as an incomplete-data issue (Liou & Cheng, 1995; Liou et al., 2001), involving imputation methods designed for the MAR problem, including a kernel estimator, a log-linear model-based estimator, and an iterative moment estimator.
Conventionally, imputation techniques can be either model-based or model-free. Readers interested in comprehensive imputation approaches can see Little and Rubin (2019) and Enders (2010) for details. To illustrate the model-based kind in relation to equating tasks, Holland and Thayer (2000) proposed an algorithm based on “expectation-maximization” (EM), where the aforementioned log-linear model was deployed to produce values for the missing part (i.e., the equating target), and Moses and colleagues (2011) found that the approach was fairly reliable in many NEAT conditions. On the other hand, model-free imputations (e.g., K-nearest neighbors; fuzzy K-means, an extension of K-means that does not simply assign targets to a definitive class but provides class-probability estimates like a mixture model [Bezdek, 1981; Equihua, 1990]; singular value decomposition; principal component analysis; and others) seem to be less favored in this kind of study of multiple imputations by chained equations (MICE). These model-free approaches are now labeled machine learning (ML)-based imputation techniques (Lakshminarayan et al., 1996; Lin & Tsai, 2020), emphasizing predictive accuracy rather than interpretability.
Like other ML-based approaches, ML-based imputation techniques need minimal assumptions about the data-generating systems, are compatible with complex variable patterns, subsume various input formats, and produce more trustworthy predictions. These advantages make them a popular choice in both research and practice (Athey, 2018; Ij, 2018), especially in conditions where simple linear associations between the missing and the observed data do not exist (Hong et al., 2020). These properties make ML-based imputation techniques promising for equating tasks. As illustrated in Figure 1, a visual comparison bridges the essence of equating tasks and imputation. Group X takes test form #1, while group Y takes test form #2, and the common items designed to be the same in both tests are called anchor items. True responses from group X on test form #2 and group Y on test form #1 are missing except for anchor items. Imputation techniques used to deal with missing data can therefore be used to obtain equated scores for individuals on the test form they did not take.

Figure 1. The Bridge Between Equating Task and Imputation.
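To make the layout in Figure 1 concrete, the following minimal Python sketch arranges a toy NEAT data set as a single examinee-by-item matrix in which `None` marks responses that were never collected. The item names, the three toy examinees, and the helper `neat_row` are hypothetical and serve illustration only; the study itself was carried out in R.

```python
# Toy NEAT layout: None marks a response that was never collected.
FORM1_ONLY = ["f1_1", "f1_2", "f1_3"]   # items unique to test form #1
ANCHOR = ["anc_1", "anc_2"]             # items shared by both forms
FORM2_ONLY = ["f2_1", "f2_2", "f2_3"]   # items unique to test form #2
COLUMNS = FORM1_ONLY + ANCHOR + FORM2_ONLY


def neat_row(group, responses):
    """Place one examinee's observed responses into the full item layout."""
    taken = FORM1_ONLY + ANCHOR if group == "X" else ANCHOR + FORM2_ONLY
    if len(responses) != len(taken):
        raise ValueError("one response per administered item is required")
    observed = dict(zip(taken, responses))
    # Items the examinee never saw are absent from `observed` and become None.
    return [observed.get(col) for col in COLUMNS]


data = [
    neat_row("X", [1, 0, 1, 1, 0]),   # group X: form #1 items, then anchors
    neat_row("X", [0, 0, 1, 1, 1]),
    neat_row("Y", [1, 1, 0, 1, 0]),   # group Y: anchors, then form #2 items
]

# Number of missing responses per item: only the anchor columns are complete.
missing_per_column = [sum(v is None for v in col) for col in zip(*data)]
```

Under this layout, equating group Y onto test form #1 amounts to filling the `None` cells of the form #1 columns for group Y and summing each completed row over the form #1 items.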
The primary inquiry of equating is about yielding more accurate estimates for hypothetical scores obtained from the base form that an examinee never actually takes. This inquiry matches the ML advantages mentioned above well and, therefore, inspires the possibility of applying the techniques to situations where traditional equating approaches commonly used in testing organizations fail to deliver reliable results. Unsurprisingly, small sample equating is one of the situations; it has received more attention in the literature nowadays. The literature review found that linear equating methods have been suggested for use with small samples (Kolen & Brennan, 2004; Skaggs, 2005). In addition, several new methods for small-sample equating have been proposed, including circle-arc equating (Livingston & Kim, 2009), synthetic equating (Kim et al., 2008), nominal weights mean equating (Babcock et al., 2012), and so on.
Recent studies addressing small sample equating have primarily focused on evaluating the performance of existing methods. For instance, circle-arc equating and nominal weights mean equating yielded less-biased estimates in small samples compared to applications within standard settings for each administration (Dwyer, 2016). A recent study by Babcock and Hodge (2020) showed that Rasch-based approaches could produce acceptable results in the context of small sample exams, especially for non-Bayesian ones.
In this study, we propose using an ML-based imputation technique called chaining random forests (CRF) to perform equating tasks within a NEAT design, given a scenario of small sample sizes, defined as a low volume of both examinees and items. The equating performance of the proposed methods was also compared with other equating methods likely to be used in small-sample situations through a simulation study. We hypothesized that by benefiting from ML techniques’ advantages, CRF would yield more robust estimates for the “missingness” in an equating task and therefore result in more reliable equated scores than other counterparts.
Method
Initially introduced by Stekhoven and Bühlmann (2012), CRF is an iterative imputation technique built on Breiman’s random forest algorithm (Breiman, 2001). As CRF’s name suggests, the major components are random forests, trained on the observed values to predict the missing ones. The advantage of this method is that it considers complex interactions and non-linear relations among variables. Studies have shown that CRF and MICE can be equivalent in many situations, where the former is not only a better fit for mixed-type data (Penone et al., 2014; Yadav & Roychoudhury, 2018) but also consumes less computational power when a similar task is present (Wong et al., 2021). Published evidence supports this choice; for instance, Shah and colleagues (2014) compared random forest to MICE and showed that random forest produced less biased parameter estimates.
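The chaining idea can be sketched as follows. This is a simplified illustration rather than the algorithm of Stekhoven and Bühlmann (2012): to stay within the Python standard library, a k-nearest-neighbour majority vote stands in for the random forest that CRF would fit at each step, and the loop stops once no imputed cell changes, which is an assumed simplification of the stopping rule. The function names and the toy data are hypothetical.

```python
from collections import Counter


def knn_vote(train_x, train_y, query, k=3):
    """Stand-in learner (NOT a random forest): majority vote among the
    k training rows closest to the query in Hamming distance."""
    ranked = sorted(
        range(len(train_x)),
        key=lambda i: sum(a != b for a, b in zip(train_x[i], query)),
    )
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def chained_impute(data, max_iter=5, k=3):
    """Impute None cells column by column, cycling until the fill is stable."""
    n_rows, n_cols = len(data), len(data[0])
    missing = [[v is None for v in row] for row in data]
    filled = [list(row) for row in data]   # work on a copy of the input

    # Starting guess: the most frequent observed value of each column.
    for j in range(n_cols):
        observed = [row[j] for row in data if row[j] is not None]
        start = Counter(observed).most_common(1)[0][0]
        for i in range(n_rows):
            if missing[i][j]:
                filled[i][j] = start

    for _ in range(max_iter):
        changed = 0
        for j in range(n_cols):
            target_rows = [i for i in range(n_rows) if missing[i][j]]
            if not target_rows:
                continue
            # Train on rows where column j was observed; predictors are all
            # other columns, using their current (possibly imputed) values.
            train_rows = [i for i in range(n_rows) if not missing[i][j]]
            others = [c for c in range(n_cols) if c != j]
            train_x = [[filled[i][c] for c in others] for i in train_rows]
            train_y = [filled[i][j] for i in train_rows]
            for i in target_rows:
                query = [filled[i][c] for c in others]
                new_value = knn_vote(train_x, train_y, query, k)
                changed += new_value != filled[i][j]
                filled[i][j] = new_value
        if changed == 0:   # assumed stopping rule: nothing moved this pass
            break
    return filled


toy = [
    [1, 1, None],
    [1, 1, 1],
    [1, 1, 1],
    [0, 0, 0],
    [0, 0, 0],
    [0, None, 0],
]
completed = chained_impute(toy)
```

In this toy example the missing cell of the first row is filled with 1 and that of the last row with 0, matching the rows they most resemble. Every column needs at least one observed value for the starting guess, which holds in a NEAT design because each item is administered to at least one group.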
Consider that in a data set where an arbitrary variable
To deploy CRF in equating tasks, we defined six augmentation strategies for the imputations and used the original data set to impute all missing values as a baseline method (IMP); the corresponding implementations are listed in Figure 2. Specifically, the first augmentation incorporated each student’s total score on anchor items as a new column into the data (IMP_total). The second one augments the data by adding the sum scores of each item pair nested within the anchor test (IMP_pair). The latter four data augmentation methods were constructed by exploiting benefits from well-known equating methods (e.g., the Tucker method and the circle-arc method); these augmentation methods can be divided into two steps: (1) the equating procedure (the Tucker method or the circle-arc method) is first implemented to calculate the equated scores of the target group (group Y) on the reference test (test form #1), and (2) the equated scores and the total scores of the reference group (group X) on the reference test (test form #1) are combined to form a new variable to augment the original data set. The method using both the total anchor test score and the equating results of the Tucker method is named the IMP_total_Tucker method, while the one using both the sum scores of each item pair in the anchor test and the equating results of the Tucker method is called the IMP_pair_Tucker method. To compare the outcomes using different equating methods, the method called IMP_total_circle (i.e., using both the total anchor test score and information from the circle-arc method) was used for comparison. Finally, total scores of the anchor test, sum scores of each item pair in the anchor test, and the equating results from both the Tucker and the circle-arc methods were added to the data simultaneously to form the IMP_Tucker_circle method to investigate whether the equating performance could be further improved by using more information.

Figure 2. Six Methods of Augmentation for Imputations.
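The first two augmentations and the equating-based column can be illustrated with a short Python sketch. It assumes that “each item pair” means every unordered pair of anchor items, which the text does not specify; the function names are hypothetical, and the equated score is taken as given rather than computed here.

```python
from itertools import combinations


def anchor_total(anchor_responses):
    """IMP_total-style column: total score on the anchor items."""
    return sum(anchor_responses)


def anchor_pair_sums(anchor_responses):
    """IMP_pair-style columns: one sum score per pair of anchor items.
    Assumption: all unordered pairs of anchor items are used."""
    return [a + b for a, b in combinations(anchor_responses, 2)]


def equated_or_observed(group, form1_total=None, equated_score=None):
    """Equating-based column: the observed form #1 total for group X and
    the equated form #1 score (e.g., from the Tucker method) for group Y."""
    return form1_total if group == "X" else equated_score


def augment(row, extra_columns):
    """Return a copy of the response row with the new columns appended."""
    return list(row) + list(extra_columns)


anchor = [1, 0, 1, 1]
total_column = anchor_total(anchor)
pair_columns = anchor_pair_sums(anchor)
augmented = augment(anchor, [total_column] + pair_columns)
```

With four anchor items, the pair-based augmentation adds six columns, so the number of added columns grows quadratically with the number of anchor items under this reading.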
Simulation Study
When comparing different equating methods in simulation studies of this kind, common factors include the sample size (Arai & Mayekawa, 2011; Hanson & Béguin, 2002; Kang & Petersen, 2012; Kim & Cohen, 1998; Sinharay & Holland, 2007), the number or proportion of anchor items (Arai & Mayekawa, 2011; Hanson & Béguin, 2002; Kang & Petersen, 2012; Kim & Cohen, 1998; Sinharay & Holland, 2007; T. Wang et al., 2008), the ability distribution of both the target group and reference group (Hanson & Béguin, 2002; Kang & Petersen, 2012; Kim & Cohen, 1998; Sinharay & Holland, 2007; T. Wang et al., 2008), the difficulty distribution of anchor items (Hanson & Béguin, 2002; Kang & Petersen, 2012; Sinharay & Holland, 2007), and the test length (Sinharay & Holland, 2007; T. Wang et al., 2008).
Summarizing the aforementioned designs to accommodate the small sample context (e.g., a classroom setting; Perry & Dickens, 1987; Stewart & Gibson, 2010), this study considered the following factors:
Test length. Four levels of test length were considered: 20, 30, 40, and 50.
Sample size. The sample sizes for X and Y were equally set with two levels: 50 and 100.
Proportion of anchor items: 0.2 and 0.3.
Two (latent) ability distributions for the target group Y: N(0,1) and N(0.5,1).
Difficulty distribution of anchor items: random, easy, and hard. When the type of anchor items was random, the anchor items were randomly selected from test form #1. Otherwise, the difficulty distribution of anchor items was biased relative to test form #1. When the type of anchor items was easy, the anchor items were randomly selected from the half of the items with lower difficulty values in test form #1. Conversely, when the type of anchor items was hard, the anchor items were randomly selected from the half of the items with higher difficulty values in test form #1.
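Crossing the five factors above yields the full simulation grid, which can be enumerated with a few lines of Python. The dictionary keys are hypothetical labels, and the number of anchor items is assumed to be the anchor ratio times the test length, rounded to the nearest integer.

```python
from itertools import product

FACTORS = {
    "test_length": [20, 30, 40, 50],
    "sample_size": [50, 100],
    "anchor_ratio": [0.2, 0.3],
    "group_y_mean": [0.0, 0.5],
    "anchor_type": ["random", "easy", "hard"],
}

# One dictionary per fully crossed condition.
conditions = [
    dict(zip(FACTORS, levels)) for levels in product(*FACTORS.values())
]


def n_anchor_items(condition):
    """Assumed rule: anchor ratio times test length, rounded."""
    return round(condition["test_length"] * condition["anchor_ratio"])


anchor_counts = sorted({n_anchor_items(c) for c in conditions})
```

Under the assumed rounding rule, the anchor test contains between 4 and 15 items across the grid.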
There were 96 conditions in total (i.e., 4 × 2 × 2 × 2 × 3). The three-parameter logistic (3PL) IRT model (Birnbaum, 1968) was adopted for data generation in all conditions via the NEAT design, assuming that group X took test form #1 and group Y took test form #2. While the specific procedures can be found in the Online Appendix, the simulation and analyses involved the following steps:
Step 1: The discrimination parameters, difficulty parameters, and guessing parameters of both tests (test form #2 did not include anchor items at this step) were randomly generated from N(0.8, 0.2), N(0, 1), and Unif(0, 0.25), respectively.
Step 2: Ability values for group X were randomly generated from the standard normal distribution N(0,1). The ability values for group Y were generated according to its factor levels. The full data set was generated using the IRT model based on the item parameters and ability values.
Step 3: The items in test form #1 were sorted according to their difficulty values. A predefined number (according to the simulation condition) of anchor items was randomly selected from test form #1 in alignment with their difficulty levels. The responses of group X on test form #2 and group Y on test form #1 were retained as the true values and then set to missing in the analyzed data, as shown in Figure 1.
Step 4: Given group X’s responses on test form #1 and group Y’s responses on test form #2, different equating methods were used to compute equivalent scores converted from test form #2 to test form #1.
Step 5: Steps 2 to 4 were repeated 100 times. Two measures were used according to the literature (i.e., Wolkowitz & Wright, 2019; Zeng, 1993)—the average absolute bias (BIAS) and root mean square difference (RMSD):
where
Step 6: Steps 1 to 5 were repeated for each simulation condition, where the indexes were recorded for further comparisons.
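Steps 1 to 3 can be sketched for a single replication as follows. This is an illustrative Python translation, not the R script from the Appendix. Several details are assumptions: the second argument of N(0.8, 0.2) is read as a standard deviation, the 3PL model is written without the scaling constant 1.7, test form #2 is assumed to have the same total length as test form #1 once the anchor items are added, and all function names are hypothetical.

```python
import math
import random


def p_correct(theta, a, b, c):
    """3PL probability of a correct response (no 1.7 scaling; assumption)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def draw_items(n_items, rng):
    """Step 1: draw (discrimination, difficulty, guessing) per item."""
    return [
        (rng.gauss(0.8, 0.2), rng.gauss(0.0, 1.0), rng.uniform(0.0, 0.25))
        for _ in range(n_items)
    ]


def simulate_responses(thetas, items, rng):
    """Step 2: one 0/1 response per examinee and item."""
    return [
        [int(rng.random() < p_correct(t, a, b, c)) for (a, b, c) in items]
        for t in thetas
    ]


def pick_anchor_indices(items, n_anchor, anchor_type, rng):
    """Step 3: choose anchor items from form #1 by difficulty."""
    by_difficulty = sorted(range(len(items)), key=lambda i: items[i][1])
    half = len(items) // 2
    if anchor_type == "easy":
        pool = by_difficulty[:half]     # easier half of form #1
    elif anchor_type == "hard":
        pool = by_difficulty[half:]     # harder half of form #1
    else:
        pool = by_difficulty            # random: any item of form #1
    return sorted(rng.sample(pool, n_anchor))


def simulate_neat(n_items, n_per_group, n_anchor, anchor_type, mean_y, seed=0):
    """Return the complete data, the NEAT data with None for unseen items,
    and the anchor item indices (columns of form #1)."""
    rng = random.Random(seed)
    form1 = draw_items(n_items, rng)
    form2_unique = draw_items(n_items - n_anchor, rng)
    anchors = set(pick_anchor_indices(form1, n_anchor, anchor_type, rng))
    theta_x = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    theta_y = [rng.gauss(mean_y, 1.0) for _ in range(n_per_group)]
    full = simulate_responses(theta_x + theta_y, form1 + form2_unique, rng)
    observed = []
    for person, row in enumerate(full):
        in_x = person < n_per_group
        masked = []
        for j, value in enumerate(row):
            on_form1 = j < n_items
            # Group X sees form #1; group Y sees anchors and form #2 items.
            seen = on_form1 if in_x else (not on_form1 or j in anchors)
            masked.append(value if seen else None)
        observed.append(masked)
    return full, observed, sorted(anchors)


full, observed, anchors = simulate_neat(20, 50, 4, "easy", 0.5, seed=1)
```

The complete matrix `full` plays the role of the true responses against which equated scores are later judged, while `observed` is what the equating and imputation methods are allowed to see.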
The descriptive statistics of the mean and standard deviation of the difficulty for the anchor items under different anchor types are shown in Table 1.
Table 1. Descriptive Statistics for the Difficulty Parameter of the Anchor Items.
To compare the proposed methods with equating methods commonly used in large-scale testing organizations, especially those that perform better with small samples, linear equating methods (the Tucker method and the Levine observed score method), the equipercentile equating method, the circle-arc method, and concurrent calibration (CC) (Hanson & Béguin, 2002; Hu et al., 2008) based on the Rasch model with true score equating (Kolen & Brennan, 2004) were selected to serve as references. R software was used (R Core Team, 2022); the packages “equate” (Albano, 2016), “equateIRT” (Battauz, 2015), and “SNSequate” (González, 2014) were used to execute the reference methods, and “missRanger” (Mayer & Mayer, 2022) was used for CRF imputation. All package settings were left at their defaults. The R script for data generation, imputation/equating, and result gathering is documented in the Appendix.
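As an illustration of the simplest reference method, the Tucker linear method can be written in a few lines, following the synthetic-population formulas in Kolen and Brennan (2004). The sketch below is in Python rather than R and does not reproduce the “equate” package; synthetic-population weights proportional to sample size and population (rather than sample) variances are assumptions, and the function names are hypothetical.

```python
from statistics import mean, pvariance


def covariance(u, v):
    """Population covariance of two equally long score lists."""
    mu_u, mu_v = mean(u), mean(v)
    return sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v)) / len(u)


def tucker_equate(x_total, x_anchor, y_total, y_anchor, w_x=None):
    """Return a function mapping form #2 total scores onto the form #1 scale.

    x_total, x_anchor: group X totals on form #1 and on the anchor test.
    y_total, y_anchor: group Y totals on form #2 and on the anchor test.
    w_x: synthetic-population weight for group X (default: proportional
         to sample size, which is an assumption of this sketch).
    """
    n_x, n_y = len(x_total), len(y_total)
    w_x = n_x / (n_x + n_y) if w_x is None else w_x
    w_y = 1.0 - w_x

    # Regression slopes of total score on anchor score within each group.
    g_x = covariance(x_total, x_anchor) / pvariance(x_anchor)
    g_y = covariance(y_total, y_anchor) / pvariance(y_anchor)

    d_mean = mean(x_anchor) - mean(y_anchor)
    d_var = pvariance(x_anchor) - pvariance(y_anchor)

    # Synthetic-population means and variances of the two forms.
    mu_x = mean(x_total) - w_y * g_x * d_mean
    mu_y = mean(y_total) + w_x * g_y * d_mean
    var_x = (pvariance(x_total) - w_y * g_x ** 2 * d_var
             + w_x * w_y * g_x ** 2 * d_mean ** 2)
    var_y = (pvariance(y_total) + w_x * g_y ** 2 * d_var
             + w_x * w_y * g_y ** 2 * d_mean ** 2)

    slope = (var_x / var_y) ** 0.5
    return lambda y: mu_x + slope * (y - mu_y)


# Toy check: equal anchor performance, form #2 totals two points lower.
to_form1 = tucker_equate(
    x_total=[10, 12, 14, 16], x_anchor=[2, 3, 4, 5],
    y_total=[8, 10, 12, 14], y_anchor=[2, 3, 4, 5],
)
```

In the toy check the two groups perform identically on the anchor test, so the two-point gap in total scores is attributed to form difficulty and a form #2 score of 10 maps to 12 on the form #1 scale. The sketch requires non-zero anchor score variance in both groups.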
Result
Aggregated results are presented across all conditions in Table 2. The smallest RMSD and BIAS values among the equating methods are boldfaced. The three imputation methods using only raw data or its integrated information (the IMP, IMP_total, and IMP_pair methods) did not perform as well as the traditional equating methods. Among these three, those that used data augmentations (IMP_total and IMP_pair) yielded smaller RMSD values than the one that used the original data set only (i.e., IMP). Using the sum of item pairs to augment the data (IMP_pair) was slightly better than using the total scores of anchor items (IMP_total). Meanwhile, adding information from other equating methods substantially improved the performance of the imputation methods, with a marked decrease in both averaged RMSD and BIAS. IMP_total_Tucker and IMP_total_circle outperformed the other methods as they had the lowest RMSD and BIAS values. The IMP_pair_Tucker method, which also uses Tucker method information, was inferior to the IMP_total_Tucker method. In addition, the equating accuracy could not be further enhanced when multiple sources were used together (i.e., the total scores of anchor items, the sum of anchor item pairs, and the equating results of both the Tucker and circle-arc methods) to augment the data (IMP_Tucker_circle), and its RMSD and BIAS were close to those of the IMP_total_Tucker method.
Table 2. Average Equating Errors From Different Equating Methods.
Note. The smallest values among the equating methods are boldfaced. RMSD = root mean square difference.
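The two error indices reported in the tables can be illustrated with a minimal Python sketch. It assumes that BIAS is the mean absolute difference between equated and criterion scores and that RMSD is the square root of the mean squared difference; how the averaging runs over examinees, score points, and replications in the study is not specified here, so that part is an assumption, and the function names are hypothetical.

```python
import math


def average_absolute_bias(equated, criterion):
    """Mean absolute difference between equated and criterion scores."""
    return sum(abs(e - c) for e, c in zip(equated, criterion)) / len(equated)


def rmsd(equated, criterion):
    """Root mean square difference between equated and criterion scores."""
    squared = sum((e - c) ** 2 for e, c in zip(equated, criterion))
    return math.sqrt(squared / len(equated))


equated_scores = [10, 12, 15]     # toy equated scores on form #1
criterion_scores = [11, 12, 13]   # toy true scores on form #1
toy_bias = average_absolute_bias(equated_scores, criterion_scores)
toy_rmsd = rmsd(equated_scores, criterion_scores)
```

Because RMSD squares the differences before averaging, it penalizes a few large errors more heavily than BIAS does, which is why the two indices can rank methods differently.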
The Tucker method performed best among the traditional equating methods, with the lowest RMSD and BIAS values, followed by the circle-arc method. The equipercentile equating and the Rasch-based methods both performed poorly, producing the largest RMSD values, and the Levine method generated the largest BIAS among the traditional methods.
The results for the average equating errors of the 12 equating methods across different test conditions are presented in Tables 3 to 10 and discussed in the following paragraphs.
Table 3. The Overall RMSD for Different Equating Methods (Number of Items = 20).
Note. The smallest values among the equating methods are boldfaced. RMSD = root mean square difference.
Table 4. The Overall RMSD for Different Equating Methods (Number of Items = 30).
Note. The smallest values among the equating methods are boldfaced. RMSD = root mean square difference.
Table 5. The Overall RMSD for Different Equating Methods (Number of Items = 40).
Note. The smallest values among the equating methods are boldfaced. RMSD = root mean square difference.
Table 6. The Overall RMSD for Different Equating Methods (Number of Items = 50).
Note. The smallest values among the equating methods are boldfaced.
Table 7. The Averaged BIAS for Different Equating Methods (Number of Items = 20).
Note. The smallest values among the equating methods are boldfaced.
Table 8. The Averaged BIAS for Different Equating Methods (Number of Items = 30).
Note. The smallest values among the equating methods are boldfaced.
Table 9. The Averaged BIAS for Different Equating Methods (Number of Items = 40).
Note. The smallest values among the equating methods are boldfaced.
Table 10. The Averaged BIAS for Different Equating Methods (Number of Items = 50).
Note. The smallest values among the equating methods are boldfaced.
Test Length and Sample Size
A comparison of Tables 3 to 6 and Tables 7 to 10 indicates that all the equating methods followed a similar pattern, in which they tended to produce larger RMSD and BIAS values as the number of items on a test form increased. When the test length was 20 or 30, imputation methods incorporating results from other methods (i.e., IMP_total_Tucker, IMP_pair_Tucker, IMP_total_circle, and IMP_Tucker_circle) outperformed every single (non-incorporated) equating method. Among them, the IMP_Tucker_circle method, which uses more information, was more accurate than its counterparts in most cases. When the test length was 40, the Tucker method started to show a slight advantage in a small set of conditions, and when the test length reached 50, the advantage became more apparent; however, imputation methods, such as the IMP_total_Tucker and IMP_total_circle methods, still produced the highest equating accuracy in some cases. In addition, the advantage of the IMP_Tucker_circle method over the IMP_total_Tucker and IMP_total_circle methods faded.
The equating accuracy of traditional equating methods tends to improve as the sample size increases. When the sample size was 100, they produced smaller RMSD and BIAS values in most cases than when the sample size was 50. In contrast, for CRF-based imputation methods, the effect of sample size was inconsistent: they yielded larger RMSD or BIAS values when the sample size became larger in several conditions, except for the IMP_pair_Tucker method, whose equating accuracy improved as the sample size increased. Although the performance of the IMP_pair_Tucker method was not as good as that of the IMP_total_Tucker method on average, the RMSD and BIAS values of the IMP_pair_Tucker method were lower than those of the IMP_total_Tucker method in cases of 100 examinees. This held especially when the test length was 20, where the RMSD and BIAS values of the IMP_pair_Tucker method were always lower than those of the IMP_total_Tucker method and were even the smallest among all equating methods in most cases.
Ratio of Anchor Items
As the ratio of anchor items increased, the equating accuracy of all methods improved as their RMSD and BIAS values dropped without exception. In particular, the RMSD and BIAS values of the three methods using only raw data or their integrated information (IMP, IMP_total, and IMP_pair method) declined the most as the ratio of anchor items increased.
Ability Distribution
The equating accuracy of all equating methods was very similar regardless of the ability distribution of the second group. That is, the difference in the mean ability of the two groups did not affect the performance of the methods in terms of equating accuracy.
Anchor Type
The gaps in RMSD and BIAS values among the three anchor types were not substantial for any equating method. The RMSD and BIAS values of the imputation methods were slightly larger when hard anchor items were used than when the random or the easy sets were used.
Conclusion and Discussion
Based on different data augmentation methods, seven CRF-based imputation methods were proposed to perform equating in a NEAT design. The performance of the seven imputation methods and several traditional equating methods was investigated under varying sample sizes, test lengths, ratios of anchor items, differences in examinee ability, and types of anchors. The findings suggest that imputation methods incorporating the results of other methods (e.g., the Tucker method or the circle-arc method) yield the highest equating accuracy when the test length is short, but when the test length reaches 50, the Tucker method shows a slight advantage. Increasing the sample size does not always reduce equating errors for the proposed methods; this finding marks the largest difference between the imputation methods and the reference ones. Furthermore, the lower the proportion of anchor items, the worse the performance of all equating methods, while the type of anchor items and group ability differences had little impact on the equating results.
The imputation methods are highly flexible in subsuming good results from various augmenting strategies and equating methods. Some methods of the former kind (i.e., those using different data augmentation methods) can be unstable, especially the one using response data of the anchor test only, leading to low equating accuracy. On the other hand, the latter kind, which combines information from other equating methods’ results, can significantly improve the performance, even beyond that of the original methods selected for augmentation. In particular, when the Tucker method is incorporated into the proposed methods, the equating accuracy obtained by using the total scores of the anchor test (IMP_total_Tucker) is better than that obtained by using the sum scores of anchor item pairs (IMP_pair_Tucker) in most cases. The IMP_pair_Tucker method performs better only when the test length is the shortest (20) and the sample size is relatively large (100). In addition, the IMP_Tucker_circle method, which uses the most information, is more advantageous in short tests.
Therefore, when the test length is not more than 40, we recommend imputation methods that use more information to augment the data set, such as aggregated scores from the test itself and information from other equating methods (e.g., IMP_Tucker_circle). Moreover, when the test length is extremely short, say, less than 20, and the sample size is relatively large, the IMP_pair_Tucker method is also applicable.
In this study, we not only set a small sample situation but also limited the test length to a short range (50 items and below) to suit the equating analysis of short tests. Short tests, such as quizzes, unit tests, and subtests in comprehensive tests, are very popular in educational measurement. There have been some equating studies on short tests that contain 40 or fewer items (Dimitrov, 2018; Lim & Lee, 2020), but few studies have focused on small-sample equating with short tests, although such scenarios are not rare in educational practice. While comparing the performance of various equating methods under this condition, this study proposes several imputation methods that are particularly suitable for this case. The imputation methods can also be directly extended to polytomous scoring situations for equating mental health questionnaires using Likert-type scales, which are usually short in length.
Although the current research successfully used the ML-based imputation technique to perform equating tasks within the NEAT design in the small sample scenario, several limitations should be considered in future studies. (a) As the proposed methods were developed and applied for dichotomous items only, future research can extend the application to polytomous or mixed-format cases. (b) The proposed methods all relied on CRF with the augmentation strategies described above; other augmentation strategies, such as adding group Y’s total scores on test form #2 into the data, can be considered in future equating research. (c) Unidimensionality and local independence are the main assumptions of IRT models. In some psychological or educational tests, unidimensionality may not be fully satisfied, or local dependence often exists in practice. Multidimensional equating methods or testlet-based equating methods can be considered to treat multidimensional measures or address local dependence between items in future research. (d) Although the evaluation criteria used in most equating research studies were based on the recovery of the true values, the evaluation criteria of equating errors have always been a difficulty in equating research. Determining or finding consistent evaluation criteria or a “gold standard” in equating research deserves further investigation.
Supplemental Material
Supplemental material, sj-pdf-1-epm-10.1177_00131644221120899 for The NEAT Equating Via Chaining Random Forests in the Context of Small Sample Sizes: A Machine-Learning Method by Zhehan Jiang, Yuting Han, Lingling Xu, Dexin Shi, Ren Liu, Jinying Ouyang and Fen Cai in Educational and Psychological Measurement
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Z.J. was supported by the National Natural Science Foundation of China for Young Scholars (grant no. 72104006) and Peking University Health Science Center (grant no. BMU2021YJ010).
Supplemental Material
Supplementary material for this article is available online.
References
