Abstract
An equating procedure for a testing program with an evolving distribution of examinee profiles is developed. No anchor is available because the original scoring scheme was based on expert judgment of the item difficulties. Pairs of examinees from two administrations are formed by matching on coarsened propensity scores derived from a set of background variables. These two subsets of scores are then equated, treating the associated sets of test performances as equivalent. The method is applied to the scores in 2 years of a testing program for admission to tertiary education.
1. Introduction
Equating is applied in large-scale testing programs in which the awarded scores are meant to be treated with no regard for the test form, the administration (year) in which the test was taken, the location of the testing center, nor any other aspect of the test or of the examinee’s background. Scores adjusted by equating refer to an underlying construct that transcends administrations. In established testing programs, founded many years ago, the distributions of the examinees’ profiles and abilities are unlikely to change a great deal from one year to the next. If the assumption of identical annual distributions were tenable, some simple methods of equating could be applied. Without relying on this assumption, an anchor test, administered to both groups (cohorts) of examinees, provides an effective means of adjusting the scores in the later administration to the earlier one. The current practice and much of the theory supporting it are discussed by Kolen and Brennan (2004) and von Davier, Holland, and Thayer (2004). Holland and Rubin (1982) is an important reference to earlier developments.
There is less research and literature on equating when an anchor is not available and the two cohorts of examinees have different distributions of abilities (latent scores). Such settings are usually avoided by design. We describe a procedure for equating when there is no anchor and the cohort distributions are likely to differ. It is applied to a test for admission to institutions of tertiary education (universities and colleges) administered in 2 consecutive years in a country. The donor of the relevant data set has requested that neither the testing program nor the (recent) years when the test was administered be identified.
The testing program was approved by the national body charged with maintenance of standards in education. Institutions of tertiary education were required to state well in advance of the admissions process how they would treat the test results. In most cases, the decisions were devolved to academic departments or even individual courses. At one extreme, the test would not be required and the result, if submitted with the application, would be ignored. At the other extreme, the test would be mandatory, or not having taken it would amount to a substantial handicap. A person can take the test only once a year and only twice within a 5-year period. After the second attempt, he or she is barred from reporting the result of the first attempt. Without searching through past applications, which are paper forms, a score user can see only one score of an applicant.
The decisions about how the test results are treated can be altered every year, but the changes have to be announced well in advance to enable students to adjust their conduct regarding preparation and taking the test. Changes in the first few years of the test were quite common, mostly in the direction of requiring the test, often informed by the experiences of other institutions. A marketing campaign may also have influenced these changes. Test-takers have to pay for the test, and the fee is substantial for the poorest students. However, students in tertiary education who come from households with low income can apply for a grant to cover their everyday expenses. Some employers in the country have indicated that they would regard favorably applicants for early career appointments if they submitted their test scores. As a result, the test is taken by some who do not intend to enter tertiary education.
We refer to the 2 years we study as Years 1 and 2. We seek an adjustment of the scores in Year 2 that would make them equivalent to the scores in Year 1. In Year 1, a total of 8,400 tests were taken, 1,500 of them (17.8%) for the second time. In Year 2, a total of 11,000 tests were taken, 2,100 of them (19.0%) for the second time. For every test-taker, a set of background variables is recorded and their details are given in Table 1. The table shows that in the second year there were relatively more older examinees (aged 21 years or over, 25% vs. 21%), more women (44% vs. 42%), more examinees who qualify for financial support (27% vs. 21%), and more minorities (22% vs. 17%). The grade point average (GPA) is a single-figure summary of the student’s assessments in the first 3 years of the (4-year) secondary school. The highest achievement is 1 and the lowest is 4. The average GPA is lower (better) in Year 2 (2.45 vs. 2.58). The four types of secondary school (variable Type) are generally regarded as ordinal, from the academically superior (SS) to the inferior (D). The fraction of examinees from category D dropped in Year 2 substantially, as did the fraction of those with the poorest GPA score (from 19% in Year 1 to 13% in Year 2). Most of the between-year differences in the percentages are not trivial and are statistically highly significant. Assuming independence, the standard error of the difference of two percentages is given by:
$$\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \;\le\; \frac{1}{2}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}},$$
where $p_1$ and $p_2$ are the proportions of a category in respective Years 1 and 2, and $n_1$ and $n_2$ are the numbers of tests taken in the two years. The upper bound is attained when $p_1 = p_2 = 0.5$. For example, the standard error of the between-year difference for Type 4 (the estimate is 15.2 − 12.0 = 3.2%) is 0.49%. Arguably, the odds ratio is a more suitable scale for comparing two related proportions or percentages. Thus, 15.2% and 12.0% correspond to the odds ratio 15.2/12.0 × 88.0/84.8 = 1.31.
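The arithmetic in this paragraph can be checked with a short sketch. It assumes the usual binomial formula for two independent proportions and uses the rounded cohort sizes quoted in the text (8,400 and 11,000), so it reproduces the reported standard error only approximately (about 0.5%, against the 0.49% quoted). The function names are illustrative and not part of the original analysis.

```python
import math


def se_diff_proportions(p1, n1, p2, n2):
    """Standard error of p1 - p2 for two independent binomial proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


def odds_ratio(p1, p2):
    """Odds of p1 divided by the odds of p2."""
    return (p1 / (1 - p1)) / (p2 / (1 - p2))


# Rounded cohort sizes from the text; the exact counts may differ slightly.
n1, n2 = 8400, 11000
p1, p2 = 0.152, 0.120  # shares of the fourth school type in Years 1 and 2

se = se_diff_proportions(p1, n1, p2, n2)
upper = 0.5 * math.sqrt(1 / n1 + 1 / n2)  # bound attained at p1 = p2 = 0.5

print(f"SE of the difference: {100 * se:.2f}%")
print(f"Upper bound:          {100 * upper:.2f}%")
print(f"Odds ratio:           {odds_ratio(p1, p2):.2f}")
```

The difference of 3.2 percentage points is more than six standard errors, which is what the text means by statistically highly significant.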
Table 1. Background Variables of the Test-Takers
Note. Each distribution is described by percentages, with the categories in parentheses. Y1 = Year 1; Y2 = Year 2. Y/N = Yes or No.
The test has several sections, some of them comprising multiple-choice items and others essays and other types of items with open-ended response. The latter are scored by appointed graders at a central location, with suitable arrangements for anonymity and quality control. The scores are integers in the range 0 to 100, although scores outside the range 11 to 90 are exceptional. For example, there were only five and two scores in excess of 90 points in respective Years 1 and 2. The histograms of the scores are presented in Figure 1. The means (and standard deviations) of the scores are 42.8 (13.0) and 42.9 (13.3) in respective Years 1 and 2. These summaries differ very little but, given the substantial numbers of examinees, the differences in the shapes of the distributions cannot be attributed entirely to chance. The smoothed density scaled to fit the frequencies is drawn by a solid line for each year. The smoothed density for the other year, scaled to make the two densities comparable, is drawn by thinner dashes. The densities differ most in the vicinity of their principal modes and at around 25 points. The differences in the examinee profiles are another reason why the two cohorts of examinees should not be regarded as equivalent. The administrator of the test stipulates that scores should be integers, even if they are adjusted by equating. We concur with the practice in other testing programs that the adjusted scores should be rounded less coarsely, for example, to one decimal place.

Figure 1. Histograms of the test scores in Years 1 and 2. The vertical dashes mark the mean and one standard deviation below and above the mean. The smoothed scaled density for the year is drawn by a solid line. The smoothed density for the other year, scaled to make it comparable, is drawn by dashes.
In the next section, we describe our approach to equating. It is based on a selection of pairs of examination papers, one from each year in a pair, that are matched on the background variables via a propensity score. The two groups of papers in these pairs are then treated as if they were equivalent. The approach is motivated by methods for missing data and related to the potential outcomes framework (Holland, 1986; Rubin, 1974, 2006), also known as Rubin’s causal model. The relevance of the missing data principle to test equating is discussed in the context of tests with anchors by Sinharay and Holland (2008). We argue that it is equally relevant when no anchor is available. Section 3 discusses inverse proportional weighting (IPW) as an alternative to the matched-pairs analysis. Section 4 deals with estimation of the sampling variation. The concluding section summarizes the strengths and weaknesses of the method and outlines some further alternatives.
We use the term “equating,” as opposed to “linking,” because the tests used in the two administrations we study are related to the same construct, are assembled by essentially the same process, and the same instructions (rules) are applied in their scoring. There is a substantial overlap of the staff used for test construction and scoring in the 2 years. The main difference is that a few new graders were appointed to deal with the increased workload. We are concerned with observed score equating. For its validity, it is essential that the two test forms have identical distributions of the measurement error. Dorans and Holland (2000) state as one of their guidelines that tests with different reliabilities should not be equated. The assumption of equal reliability is realistic in our case, because the two forms are based on identical specifications and use identical scoring schemes—they can be regarded as replicates.
Established approaches to equating make references to underlying populations. We depart from this convention by targeting the method to the specific papers in Year 2. Adjustment is intended only for these papers. We do not insist that the equating be reversible, because it is meant for no other purpose.
Measurement error in the background variables is also an important issue in general (see McCaffrey, Lockwood, & Setodji, 2013). In our case, measurement error is absent, with the possible exception of the variable GPA. We believe that its impact on the validity and quality of the procedures we propose is negligible. In any case, an error-prone covariate could be used without any reference to its latent (error-free) version. The structure and properties of the error, or misclassification, would have to be identical in the two administrations.
2. Equating on Matched Subsamples
The design of our equating problem is related to the design of two nonequivalent groups with an anchor test (called NEAT by von Davier, Holland, & Thayer, 2004). Instead of the anchor, we have a set of background variables. von Davier et al. propose chain equating (CE) and post-stratification equating (PSE) for NEAT. In CE, the scores in each year are related to the anchor, or the background variables in our case, obtaining functions $G_1$ and $G_2$ for respective Years 1 and 2. With an anchor, the equating formula is the composite function:
Both functions $G_1$ and $G_2$ are smoothed (made differentiable and increasing), for example, by fitting log-linear models or applying kernel smoothing to the scores, and the composition $G_{21}$ may also be smoothed. Holland and Thayer (1989) apply polynomial log-linear regression for this purpose and refer to it as continuization. This method cannot be adapted to our setting because the functions $G_1$ and $G_2$ are multivariate, and $G_2$ cannot be inverted.
In PSE, the conditional distributions of the scores in both years, given the anchor score, are estimated, smoothed, and the adjusted marginal distributions are obtained from them. Equating is based on these distributions. In our case, there are too many, 2 × 2 × 4 × 4 × 2 × 2 × 2 = 512, such conditional distributions, one for each combination of the seven background variables. Several of them are not defined for one year or the other, because there are no examinees with the particular configuration of categories of the background variables. Moreover, many of the estimated distributions are based on too few scores and are therefore very unstable; some details are given in Section 3. Therefore, PSE is not suitable either. A reduction of the set of background variables to a single (categorical) variable with a manageable number of categories is not acceptable because too much information would be discarded. The propensity score is a reduction of the background variables to a single dimension, but it serves a different purpose, for which it is well suited (Rosenbaum & Rubin, 1983).
Smoothing (continuization) is an important element of CE and PSE. In the established approach, it is applied to the scores in both years (e.g., to functions $G_1$ and $G_2$) and sometimes also to the composition $G_{21}$. We propose a method in which smoothing is applied only once, to the empirical version of $G_{21}$.
To equate the scores for the 2 years, we adjust the scores for Year 2. Our method of equating is best motivated as an application of the potential outcomes framework. In this framework, there are two sets of units, A and B, and they are subjected to respective treatments $T_A$ and $T_B$. The assignment of the units to the treatments is not under any experimental control; for example, the unit (a person) or its representative may choose the treatment to suit a particular agenda related to the hypothesized outcome. A set of background variables is recorded for all units. A variable is said to be background if its value for a unit is not affected by the treatment that was selected by or assigned to that or any other unit.
Our method comprises two steps. In the first, a set of matched pairs, with one unit from each treatment group in every pair, is selected. The pairing is based on estimated propensity scores. Its purpose is to form two groups of equal size that have nearly identical distributions of the background variables, so that they have the appearance of having arisen by randomization. In the second step, equating is applied to the two groups; the groups are treated as equivalent and any method for such groups can be applied. Owing to the first step, the analysis is simple, efficient, and entails no model-related caveats. Admittedly, modeling is applied in the first step, but it does not involve the test scores. It would not be used in a hypothetical allocation of units to treatments either.
In the second step, the background variables are not used at all. The need for them is eliminated by having balanced their distributions within the treatments in the first step. In a design with random allocation of units to treatments, the background variables would be redundant for a comparison of the treatments. Even though all the background variables are categorical, exact matching, in which each pair has the same set of values of these variables, is not feasible because they have too many (512) possible configurations of the values and a lot of them appear in the data very few times.
In test equating, the unit is not an examinee, but is his or her knowledge, abilities, and mental disposition during test-taking. We refer to such a unit as an (examination) paper, to allow for differences between the performances of an examinee on two occasions (years). The treatment is the year, 1 or 2, when the test was taken. The “year” represents the composition (cognitive content) of the single-form test administered in the year. We refer to the sets of units that were assigned to the treatments as cohorts.
It is essential to equate papers, not examinees, because a test-taker may be prepared very differently in one year than in another. The difference may be not only due to maturity and additional education but also due to an intent to perform better than last time, often in direct response to a perceived failure caused by a poor or disappointing score at the first attempt. The repeaters are a highly selective subsample, and their pairs of scores are likely to have different psychometric properties from the hypothetical pairs of scores of all the test-takers. They cannot be used for any inference about test reliability.
The variables listed in Table 1 are transparently background; that is, their values were assigned well before the issue of the treatment (taking the test) arose. Although a unit was assigned to a particular treatment, it could conceivably have been subjected to the other treatment. That is, the cognitive and mental state associated with a paper can be present at test-taking in either year.
The outcome variable Y is the test score, recorded for every unit in the two cohorts. Although it is commonly regarded as a single variable, or as two variables defined for different sets of units, it is useful to treat it as a mixture of two variables defined for the union of the two cohorts;
$$Y = I^{(1)}\, Y^{(1)} + \left(1 - I^{(1)}\right) Y^{(2)},$$
where $Y^{(1)}$ is the outcome following Treatment 1, $Y^{(2)}$ the outcome following Treatment 2, and $I^{(1)}$ is the indicator of Treatment 1; $I^{(1)} = 1$ if Treatment 1 was assigned and $I^{(1)} = 0$ otherwise. The unit-level effect of Treatment 2 over Treatment 1 is defined as the difference in the outcomes after the two treatments:
$$\delta = Y^{(2)} - Y^{(1)}.$$
It is a variable defined in the union of the two cohorts. We do not want to assume that its values are equal to a constant or have any particular pattern, as might be assumed in a (linear) regression model, possibly with interactions of the treatment with some background variables. The average treatment effect for Cohort 2 is defined as the mean of the values of $\delta$ in the cohort. In our context, nonzero (unit-level) treatment effects are an undesirable feature of the two tests, and the purpose of equating is to remove them by adjusting the scores in Year 2. If the value of the treatment effect $\delta_i$ for unit $i$ were known, we would adjust for it straightforwardly as
$$Y_i^{(2)} - \delta_i.$$
The fundamental difficulty is that the value of $\delta$ is never observed; the value of $Y^{(1)}$ is unavailable for every unit in Cohort 2, as is the value of $Y^{(2)}$ for every unit in Cohort 1. Therefore, we estimate its value for unit $i$, $\delta_i$, by a value common to a homogeneous group of units in Cohort 2. The obvious choice for this group is the set of all the papers that were awarded a given score. If the units were assigned to the treatments at random, as could be arranged in a hypothetical controlled experiment, the average treatment effect would be estimated without bias by the difference of the within-treatment means. In an observational study, such as test-taking with voluntary enrollment in forms (years), this estimator is in general biased.
When the distributions of the background variables for the two groups differ, the solution proposed originally by Rubin (1974) is implemented, in our context, in two steps. First, we match every unit in Cohort 2, or as many units as possible, with a unit in Cohort 1; then we estimate the average treatment effect, or another quantity of interest, from the outcomes for the matched pairs. Rubin (2008) relates such matching to implementing an experimental design. This is done by selecting a subset from each cohort. These two subcohorts have the same size and are in appearance as close as possible to a data set from a randomized experiment. Of course, larger size of the subcohorts is preferred. The importance of matching in observational studies is elaborated by Rosenbaum (2002).
Many observations are discarded in the process of forming matched pairs. Rosenbaum (2002) and Rubin (2006, 2008) justify this apparent waste by the priority to reduce or eliminate bias and by appealing to the simplicity of the subsequent second step. Also, the discarded observations are least relevant to the estimation of the treatment effect. IPW is an alternative to matched-pairs analysis. With IPW, applied in Section 3, Year-1 papers in a matching group are assigned weights for which their total of weights is equal to the number of Year-2 papers in the group. Equating is then applied using these weights.
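The weighting rule just described can be illustrated by a minimal sketch. It assumes that all Year-1 papers within a matching group receive the same weight, which the text does not state explicitly; the group counts below are hypothetical and are not taken from Table 2.

```python
def ipw_weights(k1, k2):
    """Equal weights for the k1 Year-1 papers in one group, summing to k2.

    Assumption: every Year-1 paper in the group gets the same weight k2 / k1.
    """
    if k1 <= 0:
        raise ValueError("the group has no Year-1 papers to weight")
    return [k2 / k1] * k1


# Hypothetical group counts (k1 Year-1 papers, k2 Year-2 papers).
groups = {"A": (400, 300), "B": (250, 500)}

for label, (k1, k2) in groups.items():
    weights = ipw_weights(k1, k2)
    print(label, "weight per paper:", weights[0], "total:", round(sum(weights), 6))
```

In contrast to matching, no paper is discarded: where Cohort 1 is in the majority its papers are down-weighted, and where it is in the minority they are up-weighted.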
When several background variables are available, arranging the matches is difficult. Rosenbaum and Rubin (1983) showed that matching on propensity scores yields sets of pairs of essentially the same quality as could be achieved by the (multivariate) matching on the background. In their terminology, the propensity score is a balancing score—the treatment assignment depends on the background only through the propensity score. The propensity of a unit i is defined as the conditional probability of a unit being assigned to Treatment 2, given the same background as this unit i:
where
Causal analysis is motivated by the question: How different would be the outcome of a unit that received Treatment 2 if it received Treatment 1?
The corresponding question in equating is as follows: How different would be the score of a paper in Cohort 2 if it were in Cohort 1?
We emphasize that the unit (paper) is not a test-taker, but the complex of his or her knowledge, abilities, and mental disposition while taking the test. The above-mentioned question can be restated in relation to a given test-taker in Cohort 2 as asking what his or her score would be if the test form used in reality in Year 1 had been administered in Year 2, and a different test form had been used in Year 1. The test form would have to be different in Year 1 because its inspection or collecting information about its specific items may be part of the test-taker’s preparation.
2.1. Propensity Score Analysis
We apply propensity score matching to find subgroups of papers from Years 1 and 2 that are matched as closely as can be arranged. For the propensity model, we use logistic regression with the background variables listed in Table 1 and some of their interactions, which we discuss in Section 2.3. The propensity model is selected so as to obtain a close match of the frequencies of each category of the background variables across the cohorts, so that the matched pairs would closely resemble a data set that might have been obtained by hypothetical random allocation (randomization) of papers to cohorts. There are no formal criteria for comparing models with respect to this goal and candidate models are searched by trial and error. The goodness-of-fit test and similar diagnostics are not relevant for this search. All the recorded background variables should be included in the model, especially when not many are available, and their interactions should be added if it results in a closer match. For continuous variables, their transformations (e.g., powers) should be considered. Propensity scoring is equally well suited to continuous and categorical covariates, or their mix, because so is logistic regression.
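Since candidate propensity models are judged by how closely the category frequencies match across the cohorts, a balance table is the natural working tool. The following sketch computes, for each background variable and category, the Year-2 share minus the Year-1 share. The record layout and the variable names (`sex`, `type`) are hypothetical and serve only as an illustration.

```python
from collections import Counter


def category_shares(records, variable):
    """Proportion of records in each category of one background variable."""
    counts = Counter(r[variable] for r in records)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}


def balance_table(cohort1, cohort2, variables):
    """Year-2 minus Year-1 difference of shares, per variable and category."""
    table = {}
    for var in variables:
        s1 = category_shares(cohort1, var)
        s2 = category_shares(cohort2, var)
        for cat in sorted(set(s1) | set(s2)):
            table[(var, cat)] = s2.get(cat, 0.0) - s1.get(cat, 0.0)
    return table


# Toy matched subsets with hypothetical variables; each dict is one paper.
year1 = [{"sex": "F", "type": "SS"}, {"sex": "M", "type": "D"},
         {"sex": "M", "type": "SS"}, {"sex": "F", "type": "D"}]
year2 = [{"sex": "F", "type": "SS"}, {"sex": "F", "type": "D"},
         {"sex": "M", "type": "SS"}, {"sex": "F", "type": "SS"}]

for key, diff in balance_table(year1, year2, ["sex", "type"]).items():
    print(key, f"{diff:+.2f}")
```

A model that brings all these differences close to zero in the matched subsets would be preferred, irrespective of its goodness of fit.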
At the design or planning stage, the background variables should be selected so that the assignment of papers to cohorts is ignorable. However, ignorability cannot be verified or tested. We have to rely on the variables that were recorded. Seamless collection of the background information is paramount, and an extensive background questionnaire would be unacceptable.
The issue of forming propensity groups can be related to the search for optimal stratification of a set of units. With too few strata, some strata are likely to be too heterogeneous. With too many strata, too many degrees of freedom are used up, but also the units in a small stratum may be much less variable than a superpopulation of units in the stratum would be, or if the data were much more extensive. Arguably, there is a stronger emphasis on bias reduction in matching and, in practice, no matching process is perfect. Rubin (2006) reports that as few as six propensity groups suffice in small or moderate-size studies. We select the method of grouping and the number of groups so that each group contains nearly the same variety of backgrounds as could be expected in a much larger data set. As an example, suppose an attribute is present in a propensity group in 5% of its units. If the propensity group has 20 units, then there is a substantial probability, equal to $0.95^{20} = 0.36$, that no unit in the group has that attribute. For a group with 100 units, the probability is sufficiently small, equal to $0.95^{100} = 0.006$. This suggests that a propensity group should have at least 100 units from each cohort. We give details for the stratification to 25 propensity groups, but the results of the subsequent second step (equating) are very similar to their counterparts for 10, 20, 50, 75, and 100 groups (see Longford, Nicodemo, Núñez, & Núñez, 2011, for another application of propensity analysis accompanied by sensitivity analysis).
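The group-size argument is easy to tabulate. The sketch below treats the units in a group as independent draws, each carrying the attribute with the stated probability, and evaluates the chance that the attribute is absent from the group altogether; the function name is illustrative.

```python
def prob_attribute_absent(share, group_size):
    """Probability that none of group_size independent units has the attribute."""
    return (1 - share) ** group_size


# Attribute present in 5% of units, for a range of group sizes.
for size in (20, 50, 100, 200):
    print(size, round(prob_attribute_absent(0.05, size), 3))
```

The same function can be used with a rarer attribute to see how quickly the minimum acceptable group size grows.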
Caliper matching is an alternative to stratification of the propensity scores. We chose stratification, because papers in Year 2 would have widely varying numbers of Year-1 papers within their calipers. Caliper matching is more effective when at least a few background variables are continuous and the propensity scores have a smooth density (see Dehejia & Wahba, 2002, for a detailed evaluation of methods of matching).
The fitted propensities are classified into 25 groups by the cut points set to the percentiles 4, 8, … , 96 of the fitted propensities. The cut points can be set in other ways, for example, to the percentiles of the fitted propensities for Cohort 2. Splitting the range of the fitted propensities, 0.38 to 0.76, or their logits, into 25 intervals of equal length is not useful because the propensities are not distributed uniformly.
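The grouping by percentiles can be sketched with the standard library alone. The fitted propensities below are simulated (uniform on the reported range and rounded to create ties), so the resulting group sizes have no bearing on Table 2; the helper names and the convention that a value equal to a cut point joins the lower group are assumptions of this sketch.

```python
import bisect
import random
import statistics
from collections import Counter


def percentile_cut_points(propensities, n_groups=25):
    """Cut points at the percentiles 100/n_groups, 200/n_groups, ... ."""
    return statistics.quantiles(propensities, n=n_groups, method="inclusive")


def assign_group(propensity, cut_points):
    """1-based group index; papers with equal propensities share a group."""
    return bisect.bisect_left(cut_points, propensity) + 1


# Simulated fitted propensities with many ties, mimicking categorical covariates.
rng = random.Random(1)
fitted = [round(rng.uniform(0.38, 0.76), 2) for _ in range(2000)]

cuts = percentile_cut_points(fitted)
sizes = Counter(assign_group(p, cuts) for p in fitted)
print(len(cuts), "cut points;", "group sizes:", [sizes[g] for g in sorted(sizes)])
```

Because tied values are never split between groups, the group sizes are unequal, as they are in the application.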
The papers with the same propensity score are always assigned to a single group. With the model we selected, we obtained the propensity groups summarized in Table 2 by the numbers of papers from the two cohorts. The limits that define the groups are given in the rows labeled “From” and “To.” The smallest group contains 272 units (Group 9), the next smallest 501 (Group 6), and the two largest groups have 1,064 (Group 5) and 1,362 units (Group 8). The average group size is 777.3 and the median size is 757. The sizes of the groups are unequal because some values of the fitted propensities occur many times. There are only a finite number of distinct fitted propensities because all the covariates in the logistic regression are categorical.
Table 2. The Propensity Groups and Numbers of Exam Papers From the Two Administrations (Cohorts)
Note. The ranges of the propensity scores are given by their lower and upper limits in respective rows labeled “From” and “To.”
In a propensity group with $k_1$ papers from Cohort 1 and $k_2$ from Cohort 2, we form $k_1$ pairs if $k_1 < k_2$ and $k_2$ pairs otherwise. In the first case, we discard a random sample of $k_2 - k_1$ papers from Cohort 2, and in the second case, we discard $k_1 - k_2$ randomly selected papers from Cohort 1. The pairing of the remaining $2k_1$ or $2k_2$ papers, to one from each cohort, is immaterial, because the treatment effect is estimated by the average (a linear function) of the contrasts of the test scores within the matched pairs. We obtain M = 415 + 372 + … + 212 = 8,147 pairs of papers. Cohort 1 is in a majority only in propensity groups 1, 2, and 3, by a total of 130 + 52 + 54 = 236 units. In the next section, we describe a method of equating based on these M pairs of scores. In Section 3, we describe a closely related method that uses all the 8,383 + 11,050 = 19,433 scores.
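The selection of matched subsets within propensity groups can be sketched as follows. The data layout (a mapping from group label to the two lists of papers) and the function names are hypothetical; the only substantive step is that the larger cohort in each group is reduced to the size of the smaller one by simple random sampling.

```python
import random


def match_within_group(papers1, papers2, rng):
    """Keep min(k1, k2) papers from each cohort, discarding the excess at random."""
    k = min(len(papers1), len(papers2))
    return rng.sample(papers1, k), rng.sample(papers2, k)


def match_all_groups(groups, seed=0):
    """groups maps a group label to (Year-1 papers, Year-2 papers)."""
    rng = random.Random(seed)
    kept1, kept2 = [], []
    for label in sorted(groups):
        s1, s2 = match_within_group(groups[label][0], groups[label][1], rng)
        kept1.extend(s1)
        kept2.extend(s2)
    return kept1, kept2


# Toy groups; each paper is represented by its test score only.
toy_groups = {
    1: ([40, 42, 55, 61], [38, 47, 52]),        # Cohort 1 in the majority
    2: ([33, 45], [29, 36, 44, 58, 63]),        # Cohort 2 in the majority
}
year1_kept, year2_kept = match_all_groups(toy_groups)
print(len(year1_kept), "pairs;", sorted(year1_kept), sorted(year2_kept))
```

No explicit pairing is stored, in keeping with the remark that the pairing within a group is immaterial for the subsequent analysis.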
Propensity score matching is usually applied to estimate a mean treatment effect, a single target, and balance of the background variables within the treatment groups is the principal diagnostic. We apply matching to estimate a large set of targets, one for each realized score in Year 2. This suggests that the balance should be checked within the strata defined by the Year-2 scores, and some of these are very small. However, smoothing of the estimates makes such a strict diagnostic check less relevant.
2.2. Equating Matched Groups
By a trivial method of equating, we would adjust the Year-2 scores by the difference of the within-cohort mean scores. The difference of the mean scores for all the papers (without matching) is 42.88 − 42.77 = 0.11, but the difference in the matched set is 43.85 − 42.63 = 1.22. Since the numbers of papers are substantial, we can equate the two cohorts in a more refined way. We now treat the selected subsets of papers from the two cohorts as equivalent. Kolen and Brennan (2004) and von Davier et al. (2004) describe several methods for equating scores from distinct test forms with equivalent groups of examinees. The methods apply smoothing to the estimated distribution functions of the scores within the test forms and then compose them by the identity in Equation 1. We describe a method similar to percentile equating in which smoothing is applied only once. Other methods can be applied, but some are not suitable because they do not provide the level of detail required.
We sort the test scores of the selected papers in the ascending order within the cohorts, to form the vectors
We illustrate the method by a small example. Suppose Forms A and B have six papers each, with sorted scores 2, 4, 5, 5, 6, and 8 for Form A and 1, 3, 5, 7, 7, and 10 for Form B. The equating of Form B to Form A that matches the distribution of scores in Form A exactly is given by:
$$\left(1,\ 3,\ 5,\ 7_1,\ 7_2,\ 10\right) \;\longmapsto\; \left(2,\ 4,\ 5_1,\ 5_2,\ 6,\ 8\right), \tag{2}$$
that is, score 1 in Form B is adjusted to 2, … , score 10 is adjusted to 8. The subscripts in Equation 2 refer to the first or second of a pair of papers with identical scores in a test form. The subscripts are assigned arbitrarily. Owing to the one-to-one matching, the adjusted scores for Form B have the same distribution as the scores for Form A. However, the adjustment is inequitable, as two papers with identical scores (7 each) are adjusted differently and two papers with different scores (5 and 7) are assessed identically (by score 5) after adjustment. Also, the adjustment given by Equation 2 is a mapping of papers that corresponds to a step function; the steps are integers. Both these problems are resolved by smoothing. Its trivial application adjusts both $7_1$ and $7_2$ to the average of the original adjustments, (5 + 6)/2 = 5.5, or 6 after rounding to an integer. A less trivial smoothing procedure takes into account also the pairs with Form-B scores further away from 7. The normal kernel smoothing (Simonoff, 1996; Wand & Jones, 1994), a form of nonparametric regression, yields a transformation of Form-B scores that is an increasing function. Details are given in the Appendix.
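The small example can be reproduced in a few lines. The first two functions follow the text directly: the sorted scores are paired, and the adjustments are averaged within each Form-B score, which gives 5.5 for the two papers with score 7 (6 after rounding to an integer). The third function is only an assumed form of normal kernel smoothing, a kernel-weighted mean of the paired Form-A scores; the procedure actually used is described in the Appendix and may differ in detail.

```python
import math
from collections import defaultdict


def sorted_matching(scores_a, scores_b):
    """Pair the sorted Form-B scores with the sorted Form-A scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("equivalent groups must have equal size")
    return list(zip(sorted(scores_b), sorted(scores_a)))


def average_adjustment(pairs):
    """Trivial smoothing: mean Form-A score for each distinct Form-B score."""
    buckets = defaultdict(list)
    for b, a in pairs:
        buckets[b].append(a)
    return {b: sum(v) / len(v) for b, v in buckets.items()}


def kernel_adjustment(pairs, x, sigma):
    """Assumed smoother: normal-kernel weighted mean of Form-A scores near x.

    With a very small sigma and x far from every Form-B score, all weights
    underflow to zero and the division fails; sigma should not be tiny.
    """
    weights = [math.exp(-0.5 * ((b - x) / sigma) ** 2) for b, _ in pairs]
    return sum(w * a for w, (_, a) in zip(weights, pairs)) / sum(weights)


form_a = [2, 4, 5, 5, 6, 8]
form_b = [1, 3, 5, 7, 7, 10]

pairs = sorted_matching(form_a, form_b)
print(pairs)
print(average_adjustment(pairs))
print([round(kernel_adjustment(pairs, x, sigma=2.0), 2) for x in sorted(set(form_b))])
```

As sigma approaches zero, the kernel-weighted mean at an observed Form-B score reverts to the within-score average, so the trivial smoothing is a limiting case of this sketch.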
The method respects the multiplicities that occur in the scores. For example, score 39 occurs in
The sorted pairs of scores

Smoothing of the sorted scores for the matched sets of papers.
The right-hand panel reproduces the plot after rotating the axes, by changing the vertical axis to $X^{(2)} - X^{(1)}$, which corresponds to the adjustment of the scores in Year 2 (see Figures 3.5–3.7 in Kolen & Brennan, 2004, for applications of this graphical device). Without smoothing, we obtain a function that zigzags from one value of $X^{(2)}$ to the next for the frequently occurring scores (thin line). The three smoothed equating lines are now clearly discerned. The black dots at the bottom of the plot represent the frequencies of the (11,050) scores in Year 2, and the gray dots at the top the frequencies of the (8,147) Year-2 scores in the matched pairs.

Differences of the score adjustments obtained using matched pairs and inverse proportional weighting, both with normal kernel smoothing.
The three curves are formed by smoothing the association of
There is no objective criterion for setting σ, and we have to rely on experts’ (or analyst’s) opinion as to how sudden changes in the equating formula (the adjustment) are realistic. For small σ, the adjustments for neighboring scores differ appreciably. The experts judged the changes in adjustments around scores 40 and 65 as too sudden and the smoothed curves as having too sharp twists and turns for σ < 1.5. The smoothing with σ = 2.0 was accepted by the experts after inspecting the results for σ = 1.0, 1.2, … , 2.4. A single value of σ might be preferred, but any criterion that would determine it may be problematic in a different context. The choice of σ is not as crucial as it might at first appear, because after rounding the adjustments based on σ = 1.0 and σ = 2.0 differ for only a small fraction of the papers. Details are given in Table 3, where the adjustments for the Year-2 scores 1–99 are listed, together with the numbers of papers affected ($m_s$). Thus, the two ways of equating differ by two points for scores 86, 87, and 90, but there are no papers with any of these scores in Cohort 2. Equating with σ = 2.0 would award one more point than equating with σ = 1.0 for Year-2 scores 1–3, 7, 9, 12–16, 21, 26, and 27, affecting 538 papers (4.9%), 169 of them with Score 27. Equating with σ = 2.0 would award one fewer point than with σ = 1.0 for Year-2 scores 57–64, 72–79, 81, 85, 88, 89, and 93, affecting 1,261 papers (11.4%). The most frequent of these scores is 57, awarded to 216 papers.
Table 3. Adjustment by Equating Based on Normal Kernels With Standard Deviations σ = 1.0 and 2.0
Note. m_s is the number of papers with the given score or range of scores in Cohort 2.
Arguably, the table presents the comparison of the two ways of equating in an unfavorable light because the values fitted by kernel smoothing differ by less than 0.5 for several scores for which they differ by a full point after rounding. The fits for the scores 57 to 64 (1,117 papers) differ by between 0.23 and 0.40. If half points were allowed, the same adjustment would be applied by the kernels for scores 57 to 59, 62, and 63, which account for 747 papers, more than 40% of the papers on which the rounded fits by the two kernels disagree. Thus, the choice of the standard deviation for the kernel is not crucial and its impact would be reduced by awarding fractional scores.
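The effect of the kernel standard deviation on the rounded adjustments can be explored with a minimal sketch. It assumes one common form of kernel post-smoothing, a frequency-weighted mean with normal-density weights; the smoother used in the analysis may differ in detail. The function name `kernel_smooth` and the toy data are hypothetical.

```python
import math

def kernel_smooth(scores, raw_adj, freq, sigma):
    """Smooth raw adjustments with a normal kernel of standard deviation sigma.

    Each smoothed value is a weighted mean of the raw adjustments; the weight
    of score t at target s is freq[t] times exp(-0.5 * ((s - t) / sigma)^2).
    """
    smoothed = []
    for s in scores:
        num = den = 0.0
        for t, a, f in zip(scores, raw_adj, freq):
            w = f * math.exp(-0.5 * ((s - t) / sigma) ** 2)
            num += w * a
            den += w
        smoothed.append(num / den)
    return smoothed

# Hypothetical toy data: a zigzagging raw adjustment on scores 40..49.
scores = list(range(40, 50))
raw_adj = [-1, -2, -1, -2, -1, -2, -1, -2, -1, -2]
freq = [10] * len(scores)

fit_narrow = kernel_smooth(scores, raw_adj, freq, sigma=1.0)
fit_wide = kernel_smooth(scores, raw_adj, freq, sigma=2.0)

# Scores for which the two kernels disagree after rounding to integers.
disagree = [s for s, a, b in zip(scores, fit_narrow, fit_wide)
            if round(a) != round(b)]
```

Comparing `fit_narrow` and `fit_wide` before rounding shows how small the underlying differences can be even where the rounded values disagree.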
Discarding observations cannot be judged as a good practice in general. However, observations that upset the balance may erode the quality of the analysis (see, e.g., Tarpey, Ogden, Petkova, & Christensen, 2014). Discarding purposefully selected observations in equating promotes bias reduction that modeling could achieve only in some stylized settings.
2.3. Diagnostics
An important property of a good smoothing procedure is that the moments of the distributions of the original and smoothed scores are closely matched. Table 4 displays the means, variances, and scaled central moments of the adjusted Year-2 scores with the four levels of smoothing and with and without rounding. The scaled central moment of order h is defined as
E{(X − μ)^h} / τ^h,
h = 3, 4, … , where X is the score, μ is its expectation, and τ its standard deviation. The moments, as well as μ and τ^2, are estimated for the adjusted scores of the papers involved in the matched pairs. Holland and Thayer (1989) proposed these statistics as a general diagnostic for the appropriateness of smoothing. The first column (σ = 0) gives the moments when no smoothing is applied; in that case, the moments match perfectly by construction. The following columns list the moments after smoothing with and without rounding. The table shows that the agreement of the moments is quite close. It deteriorates as the standard deviation σ is increased, but discernibly only for higher moments (not displayed). For a given value of σ, the agreement of the moments is better without rounding. The results with rounding to one decimal place (not displayed) are much closer to the results with no rounding. This confirms that the adjusted scores should not be rounded to integers.
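The moment diagnostic can be sketched as follows, assuming that the scaled central moment of order h is the h-th central moment divided by the h-th power of the standard deviation, and that all moments are estimated with divisor n. The function name and the example scores are hypothetical.

```python
import math

def scaled_central_moments(x, orders=(3, 4)):
    """Return the mean, the variance, and the scaled central moments of x."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    tau = math.sqrt(var)
    moments = {h: sum((v - mu) ** h for v in x) / n / tau ** h
               for h in orders}
    return mu, var, moments

# Hypothetical adjusted scores, before and after rounding to integers.
adjusted = [38.6, 41.2, 44.7, 47.5, 52.4, 55.9, 61.3]
rounded = [round(v) for v in adjusted]

print(scaled_central_moments(adjusted))
print(scaled_central_moments(rounded))
```

Placing the two sets of moments side by side mirrors the comparison of the columns with and without rounding.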
Table 4. The Means and Scaled Central Moments of the Adjusted Year-2 Scores for the Matched Cohort-2 Scores
In the logistic regression from which the fitted propensities are derived, we included the variables listed in Table 1 and the interactions of Age and Sex, Age and 2nd (taking the test for the second time), Sex and 2nd, 2nd and Mnr (being from an ethnic minority). The choice was made by trial and error, aiming to obtain a close agreement of the distributions of all the background variables in the matched pairs. The model fit is listed in Table 5. Note that some coefficients in the fit are nominally not significant. We retain them in the model because with large sample size (nearly 20,000 papers) elimination of bias has a greater priority than variance reduction. In any case, the model fit is secondary to our principal objective, to obtain a close agreement of the distributions of the background variables in the matched pairs.
Table 5. The Propensity Model Fit (Logistic Regression)
Note. The first category listed in Table 1 is the reference for each variable. All entries are multiplied by 1,000. GPA = grade point average; SE = standard error; Mnr = ethnic minority; Qsp = qualified for financial support.
Since all the covariates are categorical, it suffices to compare the proportions of the categories in the matched pairs. Table 6 lists the differences of the proportions, multiplied by 1,000 to remove many leading zeros. We refer to it as the balance table. The table confirms that the proportions of the categories in the matched subsets of papers are very close. In fact, we selected the model in Table 5 after inspecting the balance table for this and several other models. By adding an interaction to this model, the balance of the two subsets worsens slightly; by removing any one of the four interactions, the balance becomes appreciably worse. An informal criterion for what amounts to sufficient balance can be derived from the standard deviation of the balance in the related randomized design. This is equal to
Table 6. The Balance Table
Note. The contrasts of the proportions of the categories of background variables in all the papers and in the matched subgroups. All entries are multiplied by 1,000. GPA = grade point average; Mnr = ethnic minority; Qsp = qualified for financial support.
The balance can be studied for the matched pairs within the propensity groups, but the result is a set of 25 balance tables. Compared to the entire set of matched pairs, greater imbalance can be expected in them because they are based on much smaller samples. Thus, the balances for the variable Age within the propensity groups are in the range from −0.072 to 0.087; for the subsets included in the matched pairs, they are in the range from −0.055 to 0.068. The overall balances listed in Table 6, 0.042 and 0.001, are the respective weighted averages of these sets of 25 balances. Note that we require balance only for the entire set of matched pairs. The propensity groups are merely a device for their construction.
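One row block of a balance table can be computed as in the sketch below. The sign convention (Year 1 minus Year 2) and the function name `balance` are assumptions made for illustration; the category labels are hypothetical.

```python
from collections import Counter

def balance(cat_year1, cat_year2, scale=1000):
    """Differences of category proportions (Year 1 minus Year 2), times scale."""
    c1, c2 = Counter(cat_year1), Counter(cat_year2)
    n1, n2 = len(cat_year1), len(cat_year2)
    categories = sorted(set(c1) | set(c2))
    return {c: scale * (c1[c] / n1 - c2[c] / n2) for c in categories}

# Hypothetical values of one categorical background variable in two subsets.
year1 = ["a", "a", "b", "b"]
year2 = ["a", "b", "b", "b"]
print(balance(year1, year2))
```

Applying the function once to all papers and once to the matched subsets gives the two sets of contrasts to be compared.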
We do not study the association of the scores with the background variables. The background variables are used only for the selection of matched pairs. The explanatory power of the background variables is not a relevant indicator of success of equating. For completeness, the ordinary regression fit of a model selected by the established criteria for model selection is listed in Table 7.
Table 7. Ordinary Regression Fits to the Test Scores in the 2 Years
Note. SE = standard error; Mnr = ethnic minority; Qsp = qualified for financial support; Res. var. = residual variance.
3. Inverse Proportional Weighting
An apparent deficiency of the matched-pairs analysis is that a sizable fraction of the data is discarded. In our case, the 19,433 test scores from the 2 years are reduced to 2M = 16,294 scores (83.8%). By IPW (McCaffrey, Lockwood, & Setodji, 2013; Robins, Hernán, & Brumback, 2000), all papers are used but the scores are associated with (unequal) weights. All papers in Year 2 are assigned unit weight, and each paper in Year 1 and propensity group k is assigned the weight w_k for which the totals of weights in each propensity group coincide for the 2 years. For example, in propensity group 1, there are 545 and 415 papers in respective Years 1 and 2 (see Table 2), so w_1 = 545/415 = 1.313. Then, the total of the weights within Cohort 1 is equal to n_2 = 11,050. We apply the second step (equating) with these weights by the following method. We sort the scores within each cohort in the ascending order and associate the sorted scores of Cohort 1 with the cumulative total of the weights. Let
Presenting the results for the two cohorts by a diagram with the same layout as in Figure 2 is ineffective because the differences between the adjustments based on matched pairs and IPW would be difficult to discern. Figure 3 presents the pairwise differences of the adjustments for (post-)smoothing by the normal kernel with the same standard deviations σ as in Figure 2. The differences are in the range (−0.2, 0.2) for all four values of σ, except for a narrow range around the score of 90. Replications of the matching process yield different adjustments. The replicate adjustment curves have a variety of shapes, but do not deviate from zero by more than 0.25, except for σ = 0.5 and in narrow ranges in the proximity of zero and 100. In brief, the differences between the two methods, matched pairs and IPW, are small.
A perfect match of the two cohorts could be achieved by stratification on the entire set of configurations of the background variables. These variables have 2^5 × 4^2 = 512 unique configurations. In the data, 80 of these configurations are absent and a further 104 are present in only one cohort, 20 only in Cohort 1 and 84 only in Cohort 2. These configurations account for 66 and 283 papers from respective Cohorts 1 and 2. A further 190 configurations contain fewer than 10 papers from one or both years. They account for 904 + 1,587 = 2,491 papers. An IPW procedure that uses these configurations is likely to be very unstable and inefficient. On the other hand, too much information would be lost by discarding the papers in these configurations. Lunceford and Davidian (2004) conducted extensive simulations of IPW using 5 and 10 propensity groups. They found that estimators based on IPW are biased but the bias is small. In general, IPW yields more efficient estimators, but the gains are far smaller than what the number of observations discarded by matched-pairs analysis might suggest.
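The weighting with propensity groups described in this section can be illustrated by a minimal sketch. It assumes that the weight of a Year-1 paper is the number of Year-2 papers divided by the number of Year-1 papers in its propensity group, which is what equal group totals with unit Year-2 weights imply. The function names and the toy data are hypothetical.

```python
from collections import Counter

def ipw_weights(groups_y1, groups_y2):
    """Weight for Year-1 papers in each group; Year-2 papers have weight 1."""
    n1, n2 = Counter(groups_y1), Counter(groups_y2)
    return {k: n2[k] / n1[k] for k in n1}

def weighted_cumulative(scores_y1, groups_y1, weights):
    """Sorted Year-1 scores paired with the cumulative total of their weights."""
    pairs = sorted((s, weights[g]) for s, g in zip(scores_y1, groups_y1))
    out, total = [], 0.0
    for s, w in pairs:
        total += w
        out.append((s, total))
    return out

# Hypothetical toy data: propensity group labels and Year-1 scores.
groups_y1 = [1, 1, 2]
scores_y1 = [50, 40, 60]
groups_y2 = [1, 2, 2, 2]

weights = ipw_weights(groups_y1, groups_y2)
cumulative = weighted_cumulative(scores_y1, groups_y1, weights)
```

In this sketch, the final cumulative weight equals the number of Year-2 papers, so the sorted Year-1 scores can be aligned with the sorted Year-2 scores on a common scale.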
4. Sampling Variation
In the attempt to match the scores for the two cohorts, we are concerned solely with the adjustment of the 11,050 scores in Year 2. Unlike in standard applications of logistic (or another) regression, there is no uncertainty about the fit in our case. We make inferences about a specific set of realized papers, not any hypothetical superpopulation of papers. In replications of matching on coarsened propensities, we would obtain the same fit and the same set of propensities. The only source of variation in the propensity analysis is due to the selection of matched pairs. We can easily replicate this selection and thus estimate the sampling variation of the score adjustment. Apart from the replicate adjustments, we are also interested in the replicate versions of the balance table in Table 6. The replicate balances of the matched subsets are summarized by their means and standard deviations in Table 8. Although there is some imbalance on average, for categories 2 and 3 of GPA in particular, we regard it as very small. Note that when GPA is treated as an ordinal variable, the mean contrast is −0.1 × 10^−3.
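Replication of the selection of matched pairs can be sketched as below. The sketch assumes that, within each propensity group, equal numbers of papers are drawn at random from the two cohorts, as many as the smaller cohort has in that group; the matching used in the analysis may differ in detail. The replicated statistic (a difference of mean scores) is only a stand-in for the score adjustment, and all names and data are hypothetical.

```python
import random
from statistics import mean, stdev

def draw_matched(groups_y1, groups_y2, rng):
    """Draw equal numbers of papers per propensity group from both cohorts.

    Returns two lists of indices, into Cohort 1 and Cohort 2 respectively.
    """
    by1, by2 = {}, {}
    for i, g in enumerate(groups_y1):
        by1.setdefault(g, []).append(i)
    for i, g in enumerate(groups_y2):
        by2.setdefault(g, []).append(i)
    keep1, keep2 = [], []
    for g in sorted(by1.keys() & by2.keys()):
        m = min(len(by1[g]), len(by2[g]))
        keep1 += rng.sample(by1[g], m)
        keep2 += rng.sample(by2[g], m)
    return keep1, keep2

# Hypothetical toy data: propensity groups and scores in the two cohorts.
groups_y1 = [1, 1, 1, 2, 2, 3]
scores_y1 = [40, 45, 50, 55, 60, 70]
groups_y2 = [1, 1, 2, 2, 2, 3, 3]
scores_y2 = [38, 44, 50, 57, 62, 66, 72]

rng = random.Random(2024)
replicates = []
for _ in range(100):
    k1, k2 = draw_matched(groups_y1, groups_y2, rng)
    diff = mean(scores_y1[i] for i in k1) - mean(scores_y2[i] for i in k2)
    replicates.append(diff)

# The standard deviation over replicates estimates the sampling variation.
estimate, std_error = mean(replicates), stdev(replicates)
```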
Table 8. The Means and Standard Deviations of the Replicate Balance Tables
Note. Based on 100 replications. All entries are multiplied by 1,000. Mnr = ethnic minority; Qsp = qualified for financial support; GPA = grade point average; SD = standard deviation.
Figure 4 presents 100 replicate sets of adjustments, without rounding and with rounding (as applied in the operation) in its two panels. A small amount of random noise is added to the plotted values in the horizontal direction to prevent a lot of overprinting. The standard errors of the adjustments are estimated by the standard deviations of the replicate adjustments. They are indicated at the right-hand margins of both panels by the thickness of the overlapping gray dots. The numbers of Year-2 papers with each score are similarly indicated at the left-hand margins. The adjustment is coarse for the highest scores even without rounding because only a handful of papers make a nontrivial contribution to the estimates of their adjustments.

Figure 4. Replicate adjustments of Year-2 scores for equating to Year-1 scores, with no rounding (left) and rounding (right). The standard deviation of the replicates is indicated at the right-hand margin, and the numbers of papers at the left-hand margin of each panel. Based on smoothing with σ = 2.0.
Without rounding, the adjustment is quite precise in the region where scores are most frequent; for scores 33 to 58, which account for 7,402 papers in Cohort 2 (67.0%), the standard error is smaller than 0.10. It is also smaller than 0.10 for scores 27 to 30 and 61 to 63, which account for a further 1,474 papers (13.3%). The standard error is greater than 0.5 for scores 2 to 4 and 83 to 99, but only 34 papers have scores in these ranges.
When the adjustment is rounded, its standard error vanishes for scores 28 to 45, as well as 24 and 60 to 62 (5,607 papers, 50.7%). Among the more frequent scores, the standard errors are relatively large, in the range 0.4 to 0.5, for scores 47 to 50, which account for 1,201 papers (10.9%). For these scores, the standard errors without rounding are around 0.07, but the adjustments are close to −1.5, so rounding disperses them to −1 or −2 with nearly equal probabilities. Although the standard errors of the adjustment are extremely large for scores above 85 (greater than 1.0), they affect only 10 papers.
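The inflation of the standard error by rounding can be checked with a short calculation. It assumes, purely for illustration, that the unrounded replicate adjustment is normally distributed and that only the two neighboring integers are reached after rounding; the function names are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def sd_after_rounding(mean, se, lo=-2, hi=-1):
    """SD of an adjustment rounded to lo or hi.

    Assumes the unrounded adjustment is N(mean, se^2) and that values
    outside the interval (lo - 0.5, hi + 0.5) have negligible probability.
    """
    cut = (lo + hi) / 2
    p_hi = 1.0 - normal_cdf((cut - mean) / se)
    return (hi - lo) * math.sqrt(p_hi * (1.0 - p_hi))

# Unrounded standard error of 0.07 with the adjustment at or near -1.5.
print(sd_after_rounding(-1.50, 0.07))
print(sd_after_rounding(-1.45, 0.07))
```

When the unrounded adjustment is centered exactly at −1.5, the two rounded values are equally likely and the standard deviation reaches its maximum of 0.5, however small the unrounded standard error is.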
Our discussion of the precision of equating ignores the fact that certain values of the score are important borderlines adopted by the institutions that deal with students’ applications. However, at present, there is no consensus about such scores, except for some rounded values, such as 40, indicating average ability, and 60, assumed to be sufficient even for the most competitive university courses, given other credentials.
Some parameters in our equating procedure are set more or less arbitrarily. For example, the number of propensity (matching) groups was set to 25. In most applications, in which far fewer units are involved, fewer groups are formed. We repeated the procedure for 10, 20, 50, 75, and 100 groups and obtained very similar results. There are numerous alternatives to the normal kernel for smoothing, but no advantage in the quality of the smoothing of one of them over another can be identified. The computational intensity involved is a negligible factor, even with extensive exploration and many replications. We prefer the normal kernel because the normal distribution is familiar to many and the standard deviation of the kernel is easy to relate to the extent of smoothing.
5. Conclusion
The method of equating applied in this article consists of two steps. In the first, we find a set of matched pairs of papers, with one paper from each cohort, and in the second, we apply a method of equating designed for equivalent groups of equal size. Although the first step appears like data reduction, it can also be interpreted as data imputation—the matches found for the papers in Year 2 are in fact plausible values of the Year-1 potential outcomes (scores) of these papers. The matches are arranged by propensity score analysis, but any other method of constructing a set of matched pairs can be applied. Iacus, King, and Porro (2011) describe an alternative. In both methods, matched pairs are formed solely based on the background variables, without any involvement of the outcome variables.
In the second step, we apply an adaptation of percentile equating with kernel post-smoothing. For a given set of matched pairs, this analysis does not involve the background variables. Alternatives can be used for both steps; what matters is their interplay, by which the first step generates a problem that is much easier to deal with than the original problem. This theme is common to the generic methods for dealing with missing data, multiple imputation (Rubin, 2002) and the EM algorithm (Dempster, Laird, & Rubin, 1977). The advantage of these approaches in our setting is that a method can be constructed in (two) stages: first for a simpler problem, in which the cohorts are equivalent, and then adapted to nonequivalent groups in a principled way.
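For two matched subsets of equal size, the unsmoothed version of the second step can be sketched as follows. The sketch assumes that the sorted scores of the two subsets are paired by rank and that the adjustment for a Year-2 score is the mean difference (Year 1 minus Year 2) over the pairs with that score; this sign convention and the function name are assumptions, and the adaptation used in the analysis may differ.

```python
from collections import defaultdict

def raw_adjustments(scores_y1, scores_y2):
    """Unsmoothed adjustment for each Year-2 score in equal-sized subsets."""
    if len(scores_y1) != len(scores_y2):
        raise ValueError("the matched subsets must have equal size")
    diffs = defaultdict(list)
    # Pair the order statistics of the two subsets.
    for x1, x2 in zip(sorted(scores_y1), sorted(scores_y2)):
        diffs[x2].append(x1 - x2)
    return {x2: sum(d) / len(d) for x2, d in sorted(diffs.items())}

# Hypothetical matched subsets of four papers each.
print(raw_adjustments([10, 12, 14, 16], [11, 11, 15, 15]))
```

The resulting function of the Year-2 score is what kernel post-smoothing would then be applied to.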
The potential outcomes framework can be applied also when an anchor test is available. Simply, we would regard the score from the anchor test as a background variable and form pairs matched on its values. In fact, important conditions for a suitable anchor test are that it is well defined with both cohorts, and its potential versions with the two cohorts are identical. That is, a paper with a given anchor score in Year 1 (or 2) would achieve the same anchor score if it were taken in Year 2 (or 1). The anchor score could be used with (genuine) background variables. The superior status of the anchor score can be acknowledged by refined matching, that is, matching on both this score and the propensity group (Rosenbaum, Ross, & Silber, 2007).
Numerous arguments for both practical and theoretical advantages of matching over regression adjustment are given by Rubin (2008). They include freedom from distributional assumptions, from assumptions of any particular (parametric) form of the dependence of the outcomes on the covariates, no concern about influential observations, and a separation of the “science” (what we want to know—the departure of the two sets of scores from equivalence) from what we do to learn about it (recruit test-takers in the 2 years and select suitable subsets of their papers).
We have dealt only with equating of two administrations (years). For more administrations, there are the same options as for other methods. The first year may be designated as the reference and all subsequent years equated to it. As an alternative, each year after the first is equated to the previous year, and these equating formulae are composed. In hybrids of these two approaches, the reference year is changed from time to time to avoid long chains of equating.
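Chained equating across several administrations amounts to composing the year-to-year equating functions, as in the sketch below. The two linear equating functions are hypothetical placeholders, not results of the analysis.

```python
def compose(*steps):
    """Compose equating functions, applied in the order given.

    For example, compose(year3_to_year2, year2_to_year1) maps a Year-3
    score to the Year-1 scale.
    """
    def equate(score):
        for step in steps:
            score = step(score)
        return score
    return equate

# Hypothetical year-to-year equating functions (unrounded).
def year3_to_year2(score):
    return score + 0.5

def year2_to_year1(score):
    return score - 1.0

year3_to_year1 = compose(year3_to_year2, year2_to_year1)
print(year3_to_year1(50))
```

Keeping the intermediate scores unrounded in the chain avoids accumulating rounding error over a long sequence of links.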
Our approach has one distinct weakness: it requires a sufficiently rich set of background variables to warrant the assumption that the assignment mechanism (of papers to the cohorts) is unconfounded. That is, the score has to be conditionally independent of the treatment (cohort), given the examinee’s background. This is an untestable assumption, and all we can do is collect as rich a set of background variables as is feasible. Any regression approach entails a similar weakness, although the covariates play a different role in it. Since every matching is imperfect, it entails some bias, but this bias is of a smaller order of magnitude than that of its simple alternatives, and matching involves fewer and weaker caveats than approaches based on regression adjustment (Rubin, 2008).
The data analyzed in this article is proprietary and cannot be distributed, but the code, compiled in R, can be obtained from the author on request.
Acknowledgment
Permission to publish this article and disclose the information in it was granted to the author by the administrator of the test and an associated authority. They wish to remain unidentified. Their constructive comments on earlier drafts of this article are acknowledged.
Author’s Note
Opinions expressed in this article are to be attributed solely to the author; they do not reflect the policies or practices of the authority or the testing program. Referees contributed to improvements on the original submission by constructive criticism and insightful suggestions.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
