Abstract
This study illustrates the use of score equity assessment (SEA) for evaluating the fairness of reported test scores from assessments intended for test takers from diverse cultural, linguistic, and educational backgrounds, using a workplace English proficiency test. Subgroups were defined by test-taker background characteristics that research has shown to be associated with performance on language tests. The characteristics studied included gender, age, educational background, language exposure, and previous experience with the assessment. Overall, the empirical results indicated that the statistical and psychometric methods used in producing test scores were not strongly influenced by the subgroups of test takers from which the scores were derived. This result provides evidence in support of the comparability and meaning of test scores across the various test-taker groups studied. This example may encourage language testing programs to incorporate SEA analyses to provide evidence to inform the validity and fairness of reported scores for all groups of test takers.
Language proficiency assessments are typically designed for a target population defined by one or more overarching characteristics, but there is often considerable diversity within the target population, which may encompass test takers from diverse demographic, sociocultural, and educational backgrounds. The assumption is that the statistical and psychometric methods used in producing test scores are not influenced by the population of test takers from which the scores are derived. That is, the test is assumed to be fair and the scores earned by different subgroups of test takers are assumed to have the same meaning.
A number of statistical procedures exist to assess the fairness of scores earned by all test takers. These include: (a) differential item functioning (DIF) analysis to assess measurement bias at the item level; (b) differential prediction analysis to assess prediction invariance at the test level using an external criterion; and (c) multiple-group confirmatory factor analysis (MG-CFA) or structural equation modeling (SEM) to assess whether the test measures the construct it purports to measure in the same way across the different groups of intended test takers (Liu & Dorans, 2016). In addition, it is important that supplemental review is undertaken to interpret statistical results within the larger educational and sociological context. These procedures may be used at different points in the test life cycle to provide evidence in support of the validity and fairness of reported scores for all groups of test takers.
As a complement to these existing procedures, the score equity assessment (SEA) approach for empirically assessing fairness at the test level was introduced by Dorans (2004). In large-scale testing, as a first step to ensuring score comparability across alternate forms of a test, the forms are developed according to predefined content and statistical specifications. This step is followed by producing a linkage between the scores on the alternate forms to adjust for unintended differences in test form difficulty (Dorans & Holland, 2000; Kolen & Brennan, 2004). As noted by Holland (2007), “The term linking refers to the general class of transformations between the scores from one test and those of another. Linking methods can then be divided into three basic categories called predicting, scale aligning, and equating” (p. 5). The linking procedure employed in this study falls under the category of equating, the purpose of which is, as Holland further noted, to allow scores from alternate forms of the same test to be used interchangeably. These linking functions are usually derived using the total test-taker group. As a part of this process, it is important to evaluate the extent to which the linking functions are invariant across subgroups of test takers, as a lack of invariance indicates that there may be an interaction among score level, test difficulty, and group membership, thus raising test fairness concerns (Dorans, 2004; Dorans & Liu, 2009).
In the SEA approach, the linking is produced for each subgroup and compared to that derived from the total group as a quality control check as to whether the test assembly and linking practices produce test scores that are sufficiently interchangeable (Dorans, 2004; Dorans & Liu, 2009). Since its introduction by Dorans in 2004, the SEA approach has been widely used to assess the degree to which test scores have the same meaning and can be interpreted in the same way across subpopulations (Dorans & Liu, 2009; Kim & Kolen, 2010; Liu & Dorans, 2013; Yi, Harris, & Gao, 2008). The approach can also be used to evaluate the feasibility of maintaining the current report scale linking whenever changes are introduced to the test (Liu & Dorans, 2013).
In the field of English language testing, test fairness for various assessments has been extensively researched using two of the approaches listed above: DIF and MG-CFA/SEM (Ercikan & Oliveri, 2013; Ferne & Rupp, 2007; In’nami & Koizumi, 2012; Kim, 2001; Stricker & Rock, 2008; Yoo & Manna, 2017; Young, Morgan, Rybinski, Steinberg, & Wang, 2013; Zumbo, 2007). In contrast, there are few studies using the differential prediction analysis and SEA approaches. One recent study by Huo and Kim (2016) employed the SEA approach to evaluate population invariance using both simulation and empirical data for a worldwide English language proficiency testing program. SEA analysis was conducted for single and double linking to determine whether population invariance held under both linkage conditions. The study’s results showed that double linking moderated the differences of the linking functions computed from subgroups based on geographic region. However, beyond this study, additional SEA research is needed to examine population invariance across other diverse subgroups.
The test-taker background characteristics included in population invariance studies tend to be limited to those that are more readily collected during the test administration such as gender, age, ethnicity, or geographic region (Holland, 2003). However, in the field of language testing, a number of additional test-taker background characteristics have been shown to be related to language ability and performance on language tests (Bachman, 1990, 2000; Gradman & Hanania, 1991; Gu, 2014; Hill & Liu, 2012; Kunnan, 1998; Manna & Yoo, 2015; Shin, 2005; Yoo & Manna, 2017), and thus should be included in invariance studies. These additional characteristics include cognitive ability, native language, cultural background, and language learning background, including educational level, time spent studying English, academic major of currently enrolled test taker, or academic major of highest degree. In addition, background characteristics can have a differential impact on the modalities of listening and reading (Bae & Bachman, 1998; Kunnan, 1995; Wilson, 2000). For example, Wilson (2000) found that performance on the Listening section of the TOEIC test varied more based on English use/exposure than educational level. In contrast, there was a stronger association between educational level and Reading scores, such that test takers with less than university education levels tended to score lower, on average, than those with higher education levels. Another salient factor may be repeater status of the test taker. It is possible that repeat test takers represent a unique group within the testing population as a result of greater familiarity with test content and testing conditions (the practice effect). If the analysis sample includes a sufficiently large number of repeat test takers, there may be an impact on the linking relationship. It is, therefore, important to evaluate the influence of repeat test takers on the linking functions used to produce reported test scores (Kim & Kolen, 2010).
The purpose of this study is to extend the research on SEA in the context of language testing. Specifically, the study addresses the following research questions:
Using the SEA approach, what test-taker background characteristics, if any, may impact the invariance of linking functions and thus jeopardize the comparability and meaning of scores across subgroups?
Using the SEA results, is there evidence of differential impact on the linking functions across listening and reading test sections?
The test-taker background characteristics included in the study are gender, age, education level, years spent studying English, academic major, and repeater status of the test taker.
Method
Measurement instrument
The assessment instrument used in this study was the TOEIC® test, developed by Educational Testing Service for assessing English language skills in the workplace. This test is taken by a very diverse group of test takers (in 2013 there were seven million tests taken) and is used by nearly 14,000 companies, government agencies, and English language programs in 150 countries. There are two sets of TOEIC tests that are administered separately: the TOEIC Listening and Reading test, which is administered on paper, and the TOEIC Speaking and Writing test, which is administered online. Test takers can elect to take one or both of these tests.
This study uses only the TOEIC Listening and Reading test to maintain consistency within the study sample as pertains to delivery mode and administration timeframe. The Listening and Reading sections of the test each contain 100 multiple-choice items, which are dichotomously scored. For the Listening section, test takers listen to a variety of questions and short conversations recorded in English and then answer questions based on what they have heard. The Listening items are paced by an audio device and take 45 minutes to complete. For the Reading section, test takers respond at their own pace and the session lasts for 75 minutes. The TOEIC Listening and Reading test reports scale scores that range from 5 to 495 in increments of 5 for each section and a total scale score that is the sum of the scale scores on the two sections.
Prior to testing, TOEIC test takers filled out a self-reported questionnaire designed to collect the characteristics of the test taker’s background, such as education, work experience, English language study and exposure, and prior test-taking experience. The questionnaire responses were used to define the subgroups used in this study.
Data
The intended population of interest for the study is Japanese test takers. Specifically, the data used in this study were from three TOEIC test forms operationally administered in 2013 and 2014 in Japan. Operational linking of the 2014 form was accomplished with items shared (common) with the two 2013 forms. The form administered in 2014 was taken by 32,633 test takers, and the two test forms (2013A and 2013B) administered in 2013 were taken by 36,090 and 27,545 test takers, respectively. Given the study purpose to determine the test-taker background characteristics that may impact the invariance of linking functions, only those test takers with no missing data on the six test-taker characteristics studied were included in the sample used for analysis. Thus, the final analysis sample consists of 19,162 test takers from the 2014 test form, and 19,160 and 19,231 for the two 2013 test forms. Table 1 provides the summary statistics for the analysis sample and all Japanese test takers for the 2013 and 2014 test forms used in this study. A comparison of these summary statistics suggests that, on average, the analysis sample is similar in performance to all Japanese test takers for the corresponding test forms.
Table 1. Scale score summary statistics of study sample and all Japanese test takers.
As noted in the introduction, the background characteristics of interest in this study are gender, age, education level, years spent studying English, academic major, and repeater status of the test taker. With the exception of gender, questionnaire response options for these background characteristics were collapsed to ensure sufficient sample size and to facilitate the interpretation of results (see Appendix A for the original and collapsed response options to the background questions). It should be noted that for the variables age and time spent studying English, this study used a categorization similar to that used in Yoo and Manna (2017). Specifically, age was categorized in years as ‘younger than 22,’ ‘at least 22 but younger than 42,’ and ‘42 or older,’ as these ranges are reasonable approximations of the age ranges of test takers who have not completed a higher education degree, those who have completed one (and are newly entering the workforce or have already entered the workforce), and those who are likely to have been in the workforce for a long time. In terms of time spent studying English, the categorization in years of ‘fewer than 6,’ ‘at least 6 but fewer than 10,’ and ‘10 or more’ was chosen to broadly represent different developmental stages in acquisition of English as a foreign language and the corresponding proficiency. A summary of the recategorized test-taker responses is provided in Table 2 for the three test forms. As can be seen, all subgroups had more than 900 test takers and the percentages of test takers in each subgroup were similar across the three test forms. The largest differences observed were for age: the test form administered in 2014 had fewer test takers who were ‘younger than 22’ and more who were ‘42 or older’ compared to the test forms administered in 2013.
Table 2. Test-taker background information.
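The collapsing of the age and study-time variables into three categories each can be sketched as follows. This is only an illustration: it assumes numeric values in years as input, whereas the operational questionnaire used categorical response options (see Appendix A), and the function names are hypothetical.

```python
from collections import Counter

def age_group(age_years):
    """Collapse age in years into the three age categories used in the study."""
    if age_years < 22:
        return "younger than 22"
    if age_years < 42:
        return "at least 22 but younger than 42"
    return "42 or older"

def english_study_group(years):
    """Collapse years spent studying English into the three study-time categories."""
    if years < 6:
        return "fewer than 6"
    if years < 10:
        return "at least 6 but fewer than 10"
    return "10 or more"

def subgroup_counts(values, categorize):
    """Count test takers per collapsed category, e.g., to check subgroup sample sizes."""
    return Counter(categorize(v) for v in values)
```

A call such as `subgroup_counts(ages, age_group)` (with `ages` a hypothetical list of ages) would give the per-category counts of the kind summarized in Table 2.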
Linking procedure
In this study, a nonequivalent groups with common items design was used to link the test scores at the subgroup level as well as at the total group level (all subgroups combined). For the subgroup linking, only the test takers from the targeted subgroup on the new test form and reference forms were used in the linking process, whereas for the total group linking, all subgroups were included. Specifically, a set of common items was used to establish the linking relationship between the new test form (from the 2014 administration) and the two reference forms (from the 2013 administrations) using item response theory (IRT). For the Listening and Reading sections of the test, items were calibrated separately for the total group and each targeted subgroup using the two-parameter logistic model (Lord, 1952), as is currently used for the TOEIC test. For each test section, the 2014 items were then placed onto the reporting scale using the test characteristic curve procedure (Stocking & Lord, 1983). It should be noted that two links were produced, 2014 to 2013A and 2014 to 2013B. The scale scores assigned to the test takers were derived from equal weighting of these two linking functions. In this study, for ease of calculation, scale scores of 0 to 100 in increments of 1 were used, not the 5 to 495 scale that is used operationally for the TOEIC test.
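The idea behind the test characteristic curve procedure can be illustrated with a small sketch. It is not the operational TOEIC implementation: the item parameters are made up, the scale convention (theta_ref = A * theta_new + B) is an assumption, and a coarse grid search stands in for the numerical optimizer a real analysis would use. The last function illustrates the equal weighting of two conversions.

```python
import math

def p_2pl(theta, a, b):
    """Two-parameter logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: expected number-correct score at ability theta."""
    return sum(p_2pl(theta, a, b) for a, b in items)

def stocking_lord_loss(A, B, new_items, ref_items, thetas):
    """Sum of squared TCC differences over the common items.

    Assumed convention: theta_ref = A * theta_new + B, so a new-form item
    (a, b) becomes (a / A, A * b + B) on the reference scale.
    """
    rescaled = [(a / A, A * b + B) for a, b in new_items]
    return sum((tcc(t, ref_items) - tcc(t, rescaled)) ** 2 for t in thetas)

def fit_stocking_lord(new_items, ref_items, thetas):
    """Coarse grid search for (A, B); illustration only."""
    best = None
    for i in range(10, 41):        # A = 0.50, 0.55, ..., 2.00
        for j in range(-40, 41):   # B = -2.00, -1.95, ..., 2.00
            A, B = i / 20, j / 20
            loss = stocking_lord_loss(A, B, new_items, ref_items, thetas)
            if best is None or loss < best[0]:
                best = (loss, A, B)
    return best[1], best[2]

def average_conversions(conv_a, conv_b):
    """Equal weighting of two score conversions (e.g., 2014 to 2013A and 2014 to 2013B)."""
    return [(x + y) / 2 for x, y in zip(conv_a, conv_b)]

# Hypothetical common-item parameters (a, b) on the reference scale.
ref_items = [(1.0, -1.0), (1.2, 0.0), (0.8, 0.5), (1.5, 1.0)]
true_A, true_B = 1.1, 0.25
# The same items expressed on the new-form scale.
new_items = [(a * true_A, (b - true_B) / true_A) for a, b in ref_items]
thetas = [t / 4 for t in range(-16, 17)]  # ability grid from -4 to 4
A_hat, B_hat = fit_stocking_lord(new_items, ref_items, thetas)
```

In this constructed example the search recovers the slope and intercept used to generate the new-form parameters. In the SEA setting, the same fitting step would be repeated with parameters calibrated on each subgroup and on the total group.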
In order to determine the invariance of the linking functions and thus the comparability of test scores, the linking functions derived using each targeted subgroup were compared to the linking functions derived using all test takers, using the evaluation criteria described in the next section. In sum, a total of 72 linking functions (18 subgroups × 2 common item linking sets × 2 test sections, Listening and Reading) were produced.
Evaluation criteria
The following evaluation criteria were used for evaluating population invariance.
Difference plots of score conversions
This involves a comparison of the score conversion derived from subgroup linking against that derived from the total group linking by plotting the conversion differences (subgroup minus total) at each score level. As noted by Dorans and Liu (2009, p. 10), this is the “most direct means of assessing population invariance.”
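The comparison behind a difference plot can be sketched in a few lines: compute the subgroup-minus-total conversion difference at each score level, then list the score ranges where the absolute difference exceeds a threshold. The sketch assumes conversions stored as lists indexed by score and uses the .5 difference-that-matters threshold defined later in this section; the function names are hypothetical.

```python
DTM = 0.5  # threshold in scale score units

def conversion_differences(subgroup_conv, total_conv):
    """Subgroup minus total conversion at each score level."""
    return [s - t for s, t in zip(subgroup_conv, total_conv)]

def ranges_exceeding(diffs, threshold=DTM):
    """Inclusive (start, end) score ranges where |difference| > threshold."""
    ranges, start = [], None
    for score, d in enumerate(diffs):
        if abs(d) > threshold:
            if start is None:
                start = score          # a new range opens here
        elif start is not None:
            ranges.append((start, score - 1))
            start = None
    if start is not None:              # range runs to the top of the scale
        ranges.append((start, len(diffs) - 1))
    return ranges

# Made-up conversions for a six-point score scale.
total = [0, 1, 2, 3, 4, 5]
subgroup = [0.0, 1.6, 2.7, 3.2, 4.0, 5.9]
flagged = ranges_exceeding(conversion_differences(subgroup, total))
```

Here `flagged` holds the ranges 1 to 2 and 5 to 5, the analogue of the score ranges reported for the plots in the Results section.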
Statistical indices
To further facilitate the evaluation of the subgroup and total group linking functions, the following three statistics were also used: root mean squared difference (RMSD; Dorans & Holland, 2000, p. 288), root expected mean square difference (REMSD; Dorans & Holland, 2000, p. 288), and root expected square difference (RESD(g); Dorans & Liu, 2009, p. 12; Yang, 2004, p. 41). The RMSD describes the differences between the subgroup and total group linking functions at each score level, whereas the REMSD provides an overall summary of these differences across all score levels. In contrast, the RESD statistic summarizes the differences across the score levels separately for each subgroup. In addition to these indices, the difference between the average scale score obtained from the total group’s linking function and the average scale score obtained from a targeted subgroup’s linking function, mean diff(g), was used as an evaluation criterion. The computation of these statistics, as described in Liu and Dorans (2013, p. 17), is provided below.
Root mean squared difference (RMSD)
At each scale score point x, the RMSD is defined as
where
Root expected mean square difference (REMSD)
The REMSD is defined as
where
Root expected square difference (RESD)
At each subgroup,
where
Differences in averages
The difference in average scores based on the total group conversion
Difference that matters
To further facilitate the evaluation of the relative magnitude of a difference in score conversions, a difference of half a scaled score unit (.5) was considered a difference that matters (DTM; Dorans & Feigenbaum, 1994; Liu & Dorans, 2013) in terms of practical consequences for the reported scale score after rounding. Specifically, if the RMSD, REMSD, and RESD values and the absolute mean diff are less than the DTM, then the dependence of the linking function on the subgroup is small enough to be ignored, as the difference would not result in a change in the reported score.
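As a rough illustration of how these indices and the DTM check fit together, the sketch below computes them from score conversions stored as lists indexed by score. Several points are assumptions rather than the study's stated computation: the indices are left unstandardized (in scale score units), subgroup weights are proportions of the total group, the expectation in RESD and REMSD is taken over each subgroup's own relative score frequencies, and mean diff is signed as total minus subgroup. All names and data are hypothetical.

```python
import math

DTM = 0.5  # difference that matters, in scale score units

def rmsd_at(x, subgroup_convs, total_conv, weights):
    """Weighted root mean squared subgroup-vs-total difference at score x."""
    return math.sqrt(sum(w * (conv[x] - total_conv[x]) ** 2
                         for conv, w in zip(subgroup_convs, weights)))

def resd(sub_conv, total_conv, rel_freq):
    """Root expected squared difference for one subgroup (assumed subgroup frequencies)."""
    return math.sqrt(sum(f * (s - t) ** 2
                         for s, t, f in zip(sub_conv, total_conv, rel_freq)))

def remsd(subgroup_convs, total_conv, weights, rel_freqs):
    """Overall summary: weighted combination of the squared RESD values."""
    return math.sqrt(sum(w * resd(conv, total_conv, f) ** 2
                         for conv, w, f in zip(subgroup_convs, weights, rel_freqs)))

def mean_diff(sub_conv, total_conv, rel_freq):
    """Average total-group score minus average subgroup-linking score (assumed sign)."""
    return sum(f * (t - s) for s, t, f in zip(sub_conv, total_conv, rel_freq))

def within_dtm(value, dtm=DTM):
    """True when the index is smaller in magnitude than the DTM."""
    return abs(value) < dtm

# Made-up example: three score points, two equally sized subgroups.
total_conv = [0, 10, 20]
conv_a, conv_b = [0, 11, 20], [0, 9, 21]
weights = [0.5, 0.5]
freqs = [[0.25, 0.5, 0.25], [0.25, 0.5, 0.25]]

rmsd_1 = rmsd_at(1, [conv_a, conv_b], total_conv, weights)
resd_a = resd(conv_a, total_conv, freqs[0])
overall = remsd([conv_a, conv_b], total_conv, weights, freqs)
diff_b = mean_diff(conv_b, total_conv, freqs[1])
```

With these toy numbers every root index exceeds the .5 threshold, whereas the mean difference for the second subgroup (.25) does not. This shows why the signed mean difference and the root indices are examined together.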
Results
In this section, results are presented in the following manner: First, the summary statistics for the test forms are presented. The prerequisites for subgroup invariance are the same construct and equal reliability requirements (Liu & Dorans, 2013). The similar reliabilities and the relationships between total items and common items from each test form provide evidence in support of these prerequisites. This is followed by the evaluation of subgroup versus total group linking functions using scale score difference plots and evaluation statistics (RMSD, REMSD, RESD, and mean difference).
Summary statistics
Table 3 provides reliability coefficients for the new form and two reference forms, the correlations between performance on the total test and the common items, and the effect size of the difference in performance on the common items for the new and reference forms. The reliability coefficients for the three test forms were similar and ranged in value from .92 to .93 for Listening and from .93 to .95 for Reading. In addition, the correlation between performance on the total test and the set of common items used for linking ranged in value from .86 to .89 for Listening and from .89 to .91 for Reading. An examination of the performance of test takers on the common items shared between test forms indicates that the test takers in 2014 were slightly more able than those who took the test forms in 2013 (the effect size of the difference ranged from .10 to .17).
Table 3. Descriptive statistics of new form and reference forms 1 and 2.
Note: CI Set 1 and CI Set 2 refer to common items from Reference Forms 1 and 2, respectively.
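An effect size of the kind reported for common-item performance can be illustrated with a standardized mean difference. The text does not state which formula was used, so the pooled-standard-deviation form below is an assumption, and the scores are made up.

```python
import math
from statistics import mean, variance

def standardized_mean_difference(new_scores, ref_scores):
    """(mean of new group - mean of reference group) / pooled standard deviation."""
    n1, n2 = len(new_scores), len(ref_scores)
    pooled_var = ((n1 - 1) * variance(new_scores)
                  + (n2 - 1) * variance(ref_scores)) / (n1 + n2 - 2)
    return (mean(new_scores) - mean(ref_scores)) / math.sqrt(pooled_var)

# Made-up common-item scores for a new-form group and a reference-form group.
effect = standardized_mean_difference([12, 14, 16, 18], [10, 12, 14, 16])
```

A positive value indicates that the new-form group performed better on the common items, which is the direction reported for the 2014 test takers.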
Evaluation of subgroup linkings
Do the difference plots identify background characteristics that impact the invariance of the linking functions for Listening and Reading? The differences between scale scores derived from linking based on the total group and scale scores derived from linking based on each subgroup, for each of the six background characteristics (gender, age, education level, years spent studying English, academic major, repeater status), are shown in Figures 1 and 2 for Listening and Reading, respectively. Bolded horizontal lines at scale score difference values of .5 indicate DTM boundaries. In general, it can be observed that throughout most of the score scale there were no large differences between scores based on the total group linking and those from the subgroup linkings. Absolute differences that were larger than the DTM tended to be only marginally so, with most of them being less than 1 score point in magnitude. These differences across the subgroups are explained in more detail below.

Figure 1. Scale score differences across subgroups, Listening section.

Figure 2. Scale score differences across subgroups, Reading section.
Figures 1a and 2a show that there were no differences larger than the DTM for scores based on male-only and female-only groups in both Listening and Reading. With respect to age, differences larger than the DTM were observed in Listening in the middle of the score scale (scores 19–67) for the ‘42 or older’ group (Figure 1b) and in Reading at the upper end of the score range (scores 75–97) for the ‘younger than 22’ group (Figure 2b). Specifically, in comparison to the total group linking, the ‘42 or older’-only linking would yield higher mean Listening scores, whereas the ‘younger than 22’-only linking would yield slightly lower mean Reading scores.
In reviewing the difference plots for highest educational level (Figures 1c and 2c) and academic major (Figures 1d and 2d), very few differences larger than the DTM were observed. One notable exception was the ‘graduate’ category of the highest education level subgroup on the Listening section. In this subgroup, the linking that included only the ‘graduate’ test takers tended to yield higher Listening scores compared to the total group linking, in particular for scores in the middle of the score scale (scores of 14–47).
The results for the years spent studying English subgroups exhibited the largest degree of group dependence, in particular for Listening (Figure 1e) and to a lesser extent for Reading (Figure 2e). The scores derived from the linking that included only test takers who spent ‘fewer than 6’ years studying English tended to be higher than those derived from the total group linking at the lower to middle range of the Listening scale (scores of 2–54). Marginal differences greater than the DTM were also observed for the ‘10 or more’-only linking for scores between 15 and 35. In contrast, for Reading, the scores obtained from the total group linking were higher, by more than the DTM, than those obtained from the ‘at least 6 but fewer than 10’-only linking for scores in the 77–96 range. Marginal differences greater than the DTM were also observed for the ‘fewer than 6’-only linking at the upper end of the Reading scale (scores of 84–94).
The results for the subgroups defined by the number of times the test was taken before are shown in Figure 1f for Listening and Figure 2f for Reading. The Listening scores derived from the linking that included only nonrepeat test takers were slightly lower than those from the total group linking at the lower and upper ends of the score scale (scores of 15–28 and 66–87). In contrast, the Reading score differences were greater than the DTM at the lower end of the scale (scores of 1–23), and Reading scores were slightly lower at the upper end of the scale (scores of 74–91). However, although these differences were greater than the DTM, they were less than 1 score point and occurred in areas of the scale where there were few test takers.
Do the RMSD, REMSD, RESD, and mean difference statistics identify background characteristics that impact the invariance of the linking functions for Listening and Reading? Table 4 summarizes the differences in scale scores obtained from linkings that included test takers from the targeted subgroups and those that included all test takers in terms of the REMSD and RESD indices. Also provided are mean differences in the scores. As observed in Table 4, the REMSD, RESD, and mean differences between scores from the subgroup and total group linkings were below the DTM for Listening, with one exception: in the age subgroup, the score differences for those in the ‘42 or older’ category (RESD = .61 and mean difference = .57) marginally exceeded the .5 DTM criterion. The mean difference of .57 indicates that the scores obtained from the linking that included only the ‘42 or older’ test takers were lower than those obtained from the linking that included all test takers. For Reading, the results show that all values for REMSD, RESD, and mean differences between scores from the targeted subgroup and total group linkings were below the DTM.
Table 4. Statistical indices for Listening and Reading sections.
As an indication of score similarity across the score range, Figures 3 and 4 show the conditional RMSD (dotted curve) of the linking functions for Listening and Reading, respectively. For reference, the REMSD (dashed horizontal line) and the DTM (solid horizontal line) are also shown. The RMSD curves for the six background characteristics generally fell below the DTM line, an indication that scores from the subgroup linkings were similar to the scores from the total group linking over the entire score range. One exception is the years spent studying English category, for which the RMSD curves slightly exceeded the DTM for both the Listening (scores of 15–35) and Reading (scores of 83–94) sections.

Figure 3. Root mean squared difference (RMSD) and root expected mean square difference (REMSD) across subgroups, Listening section.

Figure 4. Root mean squared difference (RMSD) and root expected mean square difference (REMSD) across subgroups, Reading section.
Discussion
For security reasons, many testing programs use multiple test forms within and across test administrations. Usually these forms are constructed so they are parallel in content and statistical characteristics. Moreover, the forms are linked using appropriate statistical methodology to adjust for unintended differences in test difficulty and thereby achieve comparability of test scores regardless of the test form taken. From a test fairness perspective, it is also important that the test scores reflect comparable measurement across the various subgroups to whom the test was administered. Using the SEA approach (Dorans, 2004), the present study attempted to investigate the consistency of measurement reflected in test scores across the diverse group of Japanese test takers to whom the workplace English language assessment, TOEIC, was administered.
Overall, the results of the SEA analyses for the TOEIC Listening and Reading sections indicate no significant violation of population invariance in reported scores for the subgroups studied: gender, age, educational level, academic major, time spent studying English, and the number of times the test was taken before. In other words, test takers who belong to different subgroups and who have the same score on one test have the same expected test score on the linked test (Huggins & Penfield, 2012). However, as noted by Dorans and Liu (2009, p. 14), “no acceptable [linking] function can ever be completely subpopulation invariant, even in the best of circumstances.” Using the DTM criterion, marginal differences were observed for the Listening section between scores derived from the total group linking and those derived from the subgroup linking for the ‘42 or older’ category in the age subgroup, the ‘graduate’ category in the highest education level subgroup, and those who spent fewer than 6 years studying English; differences were observed to a lesser degree on the Reading section.
Differences (between total group linking scores and subgroup linking scores) that were slightly larger than the DTM were also observed at various points along the score scale for several of the subgroups. A potential factor contributing to the observed deviations from invariance may be limited sample size and the resultant misfit in the function used to link the tests (Dorans & Liu, 2009; Liu & Dorans, 2013). Also, the deviations from invariance observed tended to be at the lower or upper end of the score scale, where there are fewer test takers. For example, on Listening, for the years spent studying English subgroups, the RMSD values that exceeded the DTM were observed in the score range of 15–35 for the ‘fewer than 6’ group and the ‘10 or more’ group, both of which had higher scale scores compared to the total group (see Figure 3e). However, in this score range, there were only 149 test takers from the ‘fewer than 6’ group (less than 5% of 3459) and 32 test takers from the ‘10 or more’ group (less than 1% of 9663).
Other factors contributing to lack of subgroup invariance in the linking results may be differences in group sample sizes across test forms. For example, for Listening, the sample size for the ‘42 or older’ category in age subgroup was larger on the new test form compared to the reference forms, and the RESD and mean diff marginally exceeded the DTM criteria. Despite the marginal differences that were observed, the overall results of this study were similar across the Listening and Reading sections, suggesting that population invariance held across the sections examined in the current study.
The results of this study have practical implications. First, although SEA has been applied to several operational programs for research purposes (Dorans, Liu, & Hammond, 2008; Kim & Kolen, 2010; Yang & Gao, 2008; Yi, Harris, & Gao, 2008), the literature suggests that it is not employed as a routine quality control procedure to determine whether the test assembly and linking practices produce sufficiently equitable test scores across subgroups (Dorans & Liu, 2009). This is of particular importance in the context of English language assessments, as the testing population tends to be widely heterogeneous and its composition is likely to change over time given the increasing use of English as a medium of instruction and as a means of communication among non-native speakers of English globally.
This study illustrates empirically the use of SEA in the context of a workplace English language assessment. It is to be hoped that this illustration will encourage English language assessment programs to conduct SEA analyses more frequently, thus providing evidence to support the validity and fairness of reported scores for all groups of test takers. Note, however, that owing to the potential for variation across individual administrations, Dorans and Liu (2009, pp. 37–38) recommend that these analyses should not be conducted prior to reporting test scores “with the intent of ascertaining whether to use subgroup specific [linkings],” but rather across “multiple test administrations and forms” so as to focus on “systemic problems rather than idiosyncratic results.” If the linking differences from these planned analyses are large enough to have practical impact on reported test scores, then further investigation of test assembly, test administration, and statistical analysis processes is warranted (Liu & Dorans, 2013). Depending on the results of these investigations, modifications to test assembly, sampling and/or linking processes may be needed.
Second, this study speaks to the usefulness of collecting various test-taker background characteristics beyond gender, age, and ethnicity, such as educational level, degree of language exposure, and test familiarity gained through repeated test taking, as an aid to score interpretation. For example, in a related study on population invariance using a factor analytic approach, Yoo and Manna (2017) used the same test and test-taker background characteristics, but with a more expansive sample that included Korean test takers as well. Specifically, their study evaluated whether the prediction of observed test scores from unobservable latent variables was invariant across important subgroups. Although the present study is limited to Japanese test takers, the results are consistent with those of Yoo and Manna’s study, which found that the prediction of observed test scores from unobservable latent variables was invariant across the subgroups studied. Moreover, the latent constructs were measured with the same precision across the different subgroups.
These two types of studies, along with the DIF analyses that are carried out on a routine operational basis for the TOEIC program, illustrate how three of the four approaches described by Liu and Dorans (2016; DIF, differential prediction analysis, MG-CFA, and SEA) can be used to provide a cohesive argument in support of the fairness of an assessment in general, and of the TOEIC Listening and Reading test in particular. Studies on differential prediction are still needed, although these are relatively more difficult to design and execute owing to the data collection efforts associated with the use of an external criterion.
With respect to study limitations, Liu and Dorans (2013, p. 20) noted that SEA analysis “like any fairness procedure, is complicated because examinees can be a member of many groups” and “there are many ways of partitioning a total population into different subpopulations.” In addition, small sample sizes may limit subgroup categorization and subsequent inclusion in population invariance analyses (Dorans & Liu, 2009). Thus, care should be taken when categorizing the total population into subgroups for SEA analyses, with adequacy and consistency of sample sizes within and across test administrations, as well as the potential for test-taker multigroup membership, taken into consideration. This highlights a limitation of the current study: Although subgroup sample size was generally not an issue, the results of this study are limited to Japanese test takers and their background characteristics in the context of the TOEIC test. There are other test-taker characteristics that may be related to the construct measured by the test and thus should be included in future studies. These include, for example, ‘time spent attending a school, college, or university in which content classes were taught in English’ or ‘having lived in a country where English is the main language.’ Moreover, this study needs to be replicated using test takers from other countries.
An additional limitation is that the test-taker background characteristics were voluntarily self-reported, and as noted earlier, test takers with incomplete information were excluded from the analysis sample. Thus, there is a risk that inaccurate, inconsistent, or missing responses negatively affected the study results. Post hoc simulation research is needed to clarify the impact of such responses on the invariance of the linking process and, as applicable, to evaluate different methods of adjusting for missing responses. It is also important that future studies evaluate population invariance in the linkings across multiple test forms, administrations, and sample sizes, so that potentially atypical results from a single linking do not skew interpretation.
Given that this study was limited to the Listening and Reading sections, additional research is needed on the Speaking and Writing sections to provide a comprehensive analysis of invariance for the TOEIC test. In addition, future research may also be warranted to investigate other sources of score variance that may be a threat to test fairness such as native language, type and extent of exposure to the target language, and intended test use.
In summary, as noted by Camilli (2006), “Concerns about fairness arise from the intended and unintended consequences of testing … the central theme of test fairness concerns the match between a test’s measurement properties and the purposes and goals for which the test is used” (p. 251). As illustrated in this study, SEA may be used to complement existing invariance procedures to inform the validity and fairness of reported scores for all groups of test takers.
Appendix
Original and collapsed response options to the background questions.
| Background questions | Original response options | Collapsed response options |
|---|---|---|
| Gender | Male | N/A |
| | Female | |
| Age (in years) | Calculated as test date minus birthdate provided | Younger than 22 |
| | | At least 22 but younger than 42 |
| | | 42 or older |
| Currently enrolled or the highest education level | Undergraduate college or university (for bachelor’s degree) | Undergraduate |
| | Graduate or professional school (for master’s or doctoral degree) | Graduate |
| | Elementary school (primary school) | Other |
| | General secondary school (junior high school) | |
| | Secondary school for university entrance (high school) | |
| | Vocational/technical high school | |
| | Vocational/technical school after high school | |
| | Community/junior college (for associate degree) | |
| | Language institution | |
| Currently enrolled or the academic major of highest degree | Liberal arts | Liberal arts |
| | Social studies/law | Social studies / law / business |
| | Business | |
| | Sciences | Sciences / health / engineering / architecture |
| | Health | |
| | Engineering/architecture | |
| | Other / none | Other / none |
| How much time (in years) have you spent studying English? | Less than or equal to 4 years | Fewer than 6 |
| | More than 4 years but less than or equal to 6 years | |
| | More than 6 years but less than or equal to 10 years | At least 6 but fewer than 10 |
| | More than 10 years | 10 or more |
| How many times have you taken the TOEIC test before? | Never | Never |
| | Once | Once or more |
| | Twice | |
| | Three times or more | |
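The collapsing shown in the table can be expressed as a simple recoding step. The following Python sketch is illustrative only and covers three of the background questions (age, academic major, and prior TOEIC experience); the field names (`age_years`, `major`, `prior_toeic`) and the function names are hypothetical and do not come from the study's data files. Records with missing or unrecognized responses return `None`, mirroring the exclusion of test takers with incomplete information.

```python
from typing import Optional

# Original response option -> collapsed response option, as listed in the table.
MAJOR = {
    "Liberal arts": "Liberal arts",
    "Social studies/law": "Social studies / law / business",
    "Business": "Social studies / law / business",
    "Sciences": "Sciences / health / engineering / architecture",
    "Health": "Sciences / health / engineering / architecture",
    "Engineering/architecture": "Sciences / health / engineering / architecture",
    "Other / none": "Other / none",
}

PRIOR_TOEIC = {
    "Never": "Never",
    "Once": "Once or more",
    "Twice": "Once or more",
    "Three times or more": "Once or more",
}


def collapse_age(age_years: float) -> str:
    """Assign an age (test date minus birthdate, in years) to an age band."""
    if age_years < 22:
        return "Younger than 22"
    if age_years < 42:
        return "At least 22 but younger than 42"
    return "42 or older"


def collapse_record(record: dict) -> Optional[dict]:
    """Recode one test taker's responses; return None if any are unusable."""
    age = record.get("age_years")
    major = MAJOR.get(record.get("major"))
    prior = PRIOR_TOEIC.get(record.get("prior_toeic"))
    if age is None or major is None or prior is None:
        return None  # incomplete or unrecognized background information
    return {"age": collapse_age(age), "major": major, "prior_toeic": prior}


# Hypothetical example record.
example = collapse_record(
    {"age_years": 30.5, "major": "Business", "prior_toeic": "Twice"}
)
```

Keeping the mappings as explicit lookup tables makes the subgroup definitions easy to audit against the table and to apply consistently across test administrations.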
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
