Abstract
There has been an increased interest in psychometric tests that enable the comparison of test scores across different language versions. From a psychometric point of view, this endeavor requires empirical evidence on the full score equivalence of these measures. However, this aim is often rather difficult to achieve. This is particularly true for the assessment of verbal abilities. In the present article, we outline how automatic item generation can be used to overcome some of the problems researchers face in constructing psychometric tests that are valid in multilingual assessment settings. The feasibility of this approach is illustrated in the context of the construction and empirical evaluation of a German and English word fluency test. The results of the various studies reported in this article indicate that automatic item generation enables the generation of a sufficient number of word fluency items exhibiting a high psychometric quality in both languages. Furthermore, the item pools constructed for the two languages can be linked to each other using a common set of anchor items that are identical to each other with regard to their conceptual, linguistic, and psychometric characteristics, thereby facilitating cross-lingual comparisons of word fluency performance.
In response to the globalization of economies, there has been an increased interest in psychometric tests that enable valid test score comparisons across different languages and/or cultures (cf., Byrne et al., 2009; Casillas & Robbins, 2005; Manly, 2008; Myors, Schuler, & Frintrup, 2008). To accomplish this aim, researchers and test developers need to ensure that test scores have an identical meaning across languages and/or cultures. Therefore, observed score differences between and within languages need to be entirely attributable to individual differences in the latent trait measured. In the psychometric literature, this prerequisite is referred to as full score equivalence and constitutes the highest degree of equivalence within the hierarchy (cf., van de Vijver, 2002; van de Vijver & Poortinga, 1997, 2005; van de Vijver & Tanzer, 2004).
The full score equivalence of psychometric tests can be compromised by different sources of bias. Bias is a general term for all kinds of nuisance factors that affect the validity of cross-lingual and/or cross-cultural comparisons. In this article, we will focus on sources of bias that have been shown to affect the central components of construct validity: (a) dimensionality, (b) construct representation, and (c) nomothetic span. Dimensionality is concerned with empirical evidence that respondents’ test-taking behavior can be explained by test characteristics (e.g., difficulty of the items) and respondents’ standing on the latent trait to be measured (Borsboom, Mellenbergh, & van Heerden, 2004). Construct representation deals with outlining the cognitive processes and solution strategies used by respondents to solve particular cognitive tasks (Embretson, 1983; Kane, 2001; Messick, 1995). One way to evaluate the construct representation of a cognitive ability test is to analyze the contribution of item stimulus features assumed to affect processing to the item difficulty parameter estimates. This approach is commonly referred to as cognitive process modeling (Embretson, 1983, 2005). Nomothetic span builds on the two aforementioned facets of construct validity by examining the correlational patterns between construct-related and construct-unrelated measures to further test hypotheses deduced from the theoretical model.
Two different sources of bias are known to threaten the full score equivalence of psychometric tests: (a) method bias and (b) item bias (van de Vijver & Leung, 1997; van de Vijver & Poortinga, 2005; van de Vijver & Tanzer, 2004). Method bias subsumes different nuisance factors such as differential familiarity with the (a) stimulus material, (b) response procedure, and (c) general mode of test administration. These nuisance factors may either affect selected items of a psychometric test (cf., van de Vijver, 2002, 2008) or exert a uniform effect on all test items (e.g., Poortinga & van de Vijver, 1987). By contrast, item bias is concerned with item-specific problems that arise from a lack of conceptual and linguistic similarity of individual test items across languages. In the psychometric literature, item bias is more commonly referred to as differential item functioning (cf., Holland & Wainer, 1993; Sireci, Yang, Harter, & Ehrlich, 2006). Both kinds of bias call the validity of cross-lingual and/or cross-cultural comparisons of test scores into question because the observed score differences between languages reflect the joint influence of individual differences in the latent trait and the effect of specific sources of bias.
Dealing With Bias and Ensuring Full Score Equivalence
Different remedies have been proposed to deal with biases in cross-lingual assessments: (a) Method bias can be circumvented by ensuring that respondents with different mother tongues are equally familiar with the item format and general administration mode. This sometimes requires familiarity-driven and language-driven adaptations of the test items (cf., Malda et al., 2008). (b) Item bias can be reduced by ensuring a sufficient conceptual and linguistic similarity of test items across languages. Research has indicated that this aim is often hard to accomplish in the cross-lingual assessment of verbal abilities (cf., Allalouf, Hambleton, & Sireci, 1999; Allalouf, Rapp, & Stoller, 2009; Elosua & López-Jaúregui, 2007; Sireci & Allalouf, 2003). To ensure a sufficient conceptual and linguistic similarity of test items, a precise operational definition of the latent trait that outlines item stimulus features linked to the cognitive processes is required. Within the classic judgmental approach, the similarity is evaluated by a team of translators, psychometricians, and content matter experts. Although this approach has been shown to be effective, it is rather time-consuming and costly (e.g., Burke, 2009; Sireci et al., 2006). Furthermore, the classic judgmental approach relies on the expertise of the multidisciplinary team. By contrast, current approaches to automatic item generation (AIG: Arendasy & Sommer, 2011; Embretson, 2005; Irvine & Kyllonen, 2002) provide a more formal and precise framework to accomplish the same aim.
An AIG Approach to Test Adaptation
Within this framework, the test adaptation process starts with a definition of the latent trait. The test author then chooses an item format that has been judged to be equally familiar across the different target languages in which the test is intended to be administered. Up to this point, the AIG approach differs little from classic judgmental designs. In a next step, the test author outlines the cognitive processes that characterize the latent trait. This process is referred to as the construction of a cognitive model (Embretson, 2005). The focus should be on general cognitive processes shared among different languages to ensure a maximum of linguistic and cultural decentering (Tanzer, 2005). The cognitive model is afterward condensed into a more specific cognitive item model, which outlines item stimulus features of the chosen task that are hypothesized to affect respondents’ processing of the items (e.g., Arendasy, 2004; Arendasy & Sommer, 2005, 2007, 2011; Arendasy, Hergovich, & Sommer, 2008; Embretson, 1983, 2005). These item stimulus features are commonly referred to as radicals (Irvine, 2002). In addition, the test author should also outline item stimulus features assumed to be irrelevant for the processing of the items. These item stimulus features are called incidentals (Irvine, 2002) and can be regarded as interchangeable. Taken together, the main aim of outlining radicals and incidentals is to provide a more formal and unambiguous definition of the latent trait. In addition, the radicals can be used to maximize the construct-related variance in the item parameters (Arendasy, 2004; Arendasy & Sommer, 2005, 2007, 2011; Arendasy et al., 2008).
Arendasy and Sommer (2005, 2007, 2011) recently argued that the specification of radicals alone is not sufficient to ensure full score equivalence and construct representation equivalence. One also needs to define a set of item features that must be omitted in the item construction process to minimize interfering variance arising from non-construct-related cognitive processes. These item features have been referred to as functional constraints (Greeno, Moore, & Smith, 1993).
Thus, within the AIG approach, each test item can be characterized by a unique set of radicals, incidentals, and functional constraints. In sum, the AIG approach provides a formal and precise framework to define the conceptual and linguistic similarity of the test items and outline potential sources of method and item bias.
Statistical Methods to Evaluate Full Score and Construct Representation Equivalence
In cases in which the full score equivalence of unidimensional latent traits has to be examined, the 1PL Rasch model can be used (Rasch, 1980). Usually, the fit of this model within and across languages and/or cultures is evaluated by means of a series of model fit statistics that check for the presence of differential item functioning. One way to assess differential item functioning is the Likelihood Ratio Test (LRT: Andersen, 1973), which proved to be particularly sensitive in a recent simulation study (Suárez-Falcon & Glas, 2003). In essence, the LRT relates the likelihood of the data for the item parameters estimated in the total sample to the likelihoods of the data for the item parameters estimated in different groups (e.g., German- vs. English-speaking respondents). Unfortunately, the resulting goodness-of-fit statistic is only accurate given a sufficient sample size (cf., Glas & Verhelst, 1995; Muñiz, Hambleton, & Xing, 2001; Ponocny, 2003). In case of small sample sizes, the T10 statistic can be used as a nonparametric equivalent of the LRT that does not depend on the sample size but on the number of simulations carried out with a Monte Carlo shuffling algorithm that is used to calculate this test statistic (cf., Ponocny, 2003, p. 445). In both cases, nonsignificant p values indicate an absence of differential item functioning and therefore argue for the full score equivalence of the psychometric test.
These model fit statistics are first applied within each language and/or culture. If the goodness-of-fit statistics fail to reach the level of statistical significance within each language and/or culture, structural equivalence can be assumed. Subsequently, the data sets are merged to evaluate the fit of the 1PL Rasch model across the target languages using the partitioning criterion “language and/or culture.” A nonsignificant result regarding this particular model fit statistic indicates that full score equivalence across language and/or culture can be assumed.
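The logic of the LRT can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the studies: the function name and the log-likelihood values are hypothetical, and the sketch assumes that conditional log-likelihoods have already been obtained from a Rasch estimation routine. The statistic is twice the difference between the summed subgroup log-likelihoods and the total-sample log-likelihood, with (number of groups − 1) × (number of items − 1) degrees of freedom.

```python
def andersen_lrt(loglik_total, logliks_by_group, n_items):
    """Return (statistic, df) for a likelihood ratio test of subgroup invariance.

    loglik_total     -- conditional log-likelihood with item parameters
                        estimated in the total sample
    logliks_by_group -- conditional log-likelihoods with item parameters
                        estimated separately in each subgroup
    n_items          -- number of items in the test
    """
    n_groups = len(logliks_by_group)
    statistic = 2.0 * (sum(logliks_by_group) - loglik_total)
    df = (n_groups - 1) * (n_items - 1)
    return statistic, df


# Hypothetical log-likelihoods for a 10-item test split by language version.
stat, df = andersen_lrt(-1250.4, [-610.2, -633.1], n_items=10)
print(f"LR statistic = {stat:.2f}, df = {df}")
```

With two groups and ten items, the test has nine degrees of freedom; the statistic is then referred to a χ² distribution with that many degrees of freedom.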
Next, evidence on the construct representation of the cognitive measure can be obtained by evaluating the fit of the Linear Logistic Test Model (LLTM; De Boeck & Wilson, 2004; Fischer, 1995). This model constitutes a more restricted version of the 1PL Rasch model, which assumes that differences in the item difficulty parameter estimates can be explained by a weighted sum, with weights $q_{lk}$, of the basic parameter estimates ($\eta_k$) of the radicals (Irvine, 2002) implemented into the item generator and a normalization constant ($q_0$):
\[ \beta_l = \sum_{k} q_{lk}\,\eta_k + q_0 \]
Here, $\beta_l$ denotes the difficulty parameter of item l.
The fit of the LLTM is also evaluated by means of an LRT comparing the likelihood of the data according to the 1PL Rasch model and the LLTM. If the Likelihood Ratio statistic fails to reach the level of significance, the basic parameter estimates ($\eta_k$), which model the radical structure, explain the item difficulty parameter estimates sufficiently well. Furthermore, the statistical identity of the basic parameter estimates across different languages or cultures can be evaluated by means of an LRT. The psychometric model thus enables a formal statistical test of the construct representation equivalence across gender or language (for similar arguments, see Karmer & Smith, 2001).
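The decomposition assumed by the LLTM can be illustrated with a short Python sketch. The weights, basic parameters, and function names below are hypothetical and chosen only for illustration; they are not the values estimated in the studies. Each item difficulty is computed as the weighted sum of the basic parameters plus a normalization constant, and the Rasch model then maps the difference between ability and difficulty onto a solution probability.

```python
import math


def lltm_difficulties(q_matrix, eta, constant=0.0):
    """Item difficulties as weighted sums of basic parameters plus a constant."""
    return [sum(q * e for q, e in zip(row, eta)) + constant for row in q_matrix]


def rasch_probability(theta, beta):
    """Probability of a correct response under the 1PL Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


# Hypothetical example: two radicals, three items.
eta = [0.5, -1.0]                     # basic parameters of the radicals
q_matrix = [[1, 0], [2, 0], [2, 1]]   # radical levels (weights) per item
betas = lltm_difficulties(q_matrix, eta)

for item, beta in enumerate(betas, start=1):
    p = rasch_probability(0.0, beta)
    print(f"item {item}: difficulty = {beta:+.2f}, P(correct | theta = 0) = {p:.3f}")
```

In this toy example, the first radical makes items harder as its level increases, whereas the second radical makes the third item easier.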
Constructing a Bilingual Word Fluency Test
The next section outlines the procedural framework used in the simultaneous construction of a bilingual word fluency test.
Definition of the Latent Trait
According to current models of human intelligence in German- and English-speaking countries, word fluency can be defined as the speed and ease with which words are generated or retrieved from long-term memory, irrespective of their meaning (e.g., Carroll, 1993; Jäger, 1984). Since there is a general agreement across both languages in the scope of the definition of the latent trait, issues of construct bias across these two languages can be ruled out.
Choosing the Item Format
In general, word fluency can be measured by any task that requires the production of words that fit one or more structural orthographic restrictions. We chose anagram problems to measure word fluency since the item format is equally familiar in German- and English-speaking countries. The respondents are administered a jumbled sequence of letters arranged in a row (cf., Figure 1, jumbled letter sequence below the horizontal line). Their task is to rearrange the letters to form a noun by clicking them in the correct sequence.

Figure 1. Example of an Anagram Problem Generated
Respondents’ answers are scored as correct in cases in which the noun has been identified by the respondents. Otherwise, the answer entry is scored as incorrect.
Specification of the Cognitive Item Model
Our cognitive item model was based on the Adaptive Control of Thought–Rational model (ACT-R: Anderson, 2005). The cognitive item model hypothesizes that respondents first encode the letter string and attempt to retrieve a word from the mental lexicon (Harley, 2008). Prior research already indicated that the general structure of the mental lexicon is identical across languages (De Bleser et al., 2003). In general, the mental lexicon is a long-term memory store that holds the orthographic information on a word in schemata with position-referenced slots for the individual letters. Each word stored in the mental lexicon has a certain base-level activation that corresponds to its word frequency (van Rijn & Anderson, 2003). Upon a retrieval request, the activation of all word schemata is updated and the most active schema is selected for retrieval (van Rijn & Anderson, 2003). In consequence, the ease with which the target word can be retrieved depends on the activation level of the target word and the activation level of all other orthographically similar words (van Rijn & Anderson, 2003). Since the initial letter string of the anagram does not match any of the words stored in the mental lexicon, the letter string has to be rearranged. The complexity of the rearrangement process depends on the degree of fit between the letter string and the target word (cf., Schweizer, 2007; Stankov, 2000). We hypothesized that respondents guide their search through the problem space defining possible rearrangements of the letters by resorting to letter combinations that co-occur frequently in the respective language (Anderson, 1998). Once the letter string has been rearranged mentally, the respondents attempt to retrieve the target word from the mental lexicon. If the word has not been identified by now, the above-mentioned information-processing circuit continues until the respondent is either able to retrieve the target word or decides that the problem at hand cannot be solved.
Derivation of the Functional Constraints, Radicals, and Incidentals
Based on this cognitive model, a set of five radicals, three functional constraints, and one incidental was specified to automatically generate German and English anagram problems using the bilingual automatic item generator VfGen (Arendasy, 2006).
The radical “number of letters” is based on a structural feature of the mental lexicon, which contains position-referenced slots for the individual letters of a word. We hypothesized that anagram problems become harder to solve when a larger number of letters needs to be taken into consideration (van Rijn & Anderson, 2003). Similarly, we hypothesized that “word frequency class” affects respondents’ solution process since retrieval of the target word depends in part on its baseline activation (Baayen, Piepenbrock, & Gulikers, 1995; Hager & Hasselhorn, 1994; van Rijn & Anderson, 2003). Based on the hypothesis that respondents guide their search through the use of frequent letter combinations (e.g., Anderson, 1998), we hypothesized that the radical “cues based on frequent letter combinations” should also influence respondents’ solution probability (Baayen et al., 1995; Hager & Hasselhorn, 1994; van Rijn & Anderson, 2003). This radical has been assumed to facilitate the processing of the anagram problems. By contrast, we hypothesized that the radical “number of similar words” should increase the difficulty of anagram problems. This hypothesis is based on the assumption that the ease of retrieval also depends on the activation of orthographically similar words (van Rijn & Anderson, 2003). The radical “degree of fit between the letter sequence presented and the target word” is linked to the rearrangement process itself. This radical affects the number of mental steps (cf., Schweizer, 2007; Stankov, 2000) required to solve a given anagram. Therefore, we hypothesized that it influences respondents’ solution probability in solving anagram problems.
The first functional constraint specifies that the sequence of the letters presented should not be identical to the letter sequence of the target word. Two additional functional constraints deal with the unambiguity of the solution. Since the ease of retrieval depends on the activation level of the target word and orthographically similar words (van Rijn & Anderson, 2003), we hypothesized that letter sequences that can be arranged into two or more words exhibit an increased likelihood of differential item functioning when the target word and the alternative solution do not exhibit the same baseline activation.
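How such constraints might be checked during item generation can be sketched as follows. This is not the procedure implemented in VfGen, which is not described at this level of detail; the function names and the toy lexicon are hypothetical, and the sketch collapses the two unambiguity constraints into a single check for any alternative solution, without distinguishing relatively common from relatively uncommon nouns.

```python
def alternative_solutions(letters, target, lexicon):
    """Words in the lexicon, other than the target, formed by the same letters."""
    key = sorted(letters.lower())
    return [word for word in lexicon if word != target and sorted(word) == key]


def check_constraints(letters, target, lexicon):
    """Check a candidate anagram against the constraints sketched above."""
    return {
        # The presented sequence must not already spell the target word.
        "differs_from_target": letters.lower() != target,
        # No other word in the lexicon may be formed from the same letters.
        "unambiguous": not alternative_solutions(letters, target, lexicon),
    }


# Hypothetical toy lexicon of lowercase nouns.
lexicon = ["lemon", "melon", "table", "river"]

print(check_constraints("revir", "river", lexicon))
print(check_constraints("nomel", "lemon", lexicon))
print(alternative_solutions("nomel", "lemon", lexicon))
```

In this toy lexicon, the letter string for "river" passes both checks, whereas the string for "lemon" is flagged as ambiguous because "melon" is an alternative solution.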
Because word meaning has been defined to be beyond the scope of word fluency (e.g., Carroll, 1993; Jäger, 1984), we decided to treat the meaning of the target words as incidental.
Study 1
This study was conducted to evaluate the likelihood of differential item functioning caused by violations of the functional constraints “unambiguity: relatively uncommon nouns” and “unambiguity: relatively common nouns.” Additionally, we tested whether full score equivalence can be obtained when items honor all functional constraints and are matched according to their radical structure during the item generation phase.
Procedure
To test these hypotheses, we constructed a set of k = 10 German and English anagram problems each. The items of this set were matched to each other according to their radicals. In addition, we generated a language-specific set of k = 10 items that violated exactly one of the two functional constraints “unambiguity: relatively uncommon nouns” or “unambiguity: relatively common nouns.” This was done by forcing the item generator to create letter strings that can be rearranged into exactly two different target words. For scoring purposes, respondents obtained one point if either of these words was identified. These two different kinds of item sets were administered together. The order of item presentation was randomized and identical for all respondents within each of the four conditions. Respondents were randomly assigned to one of the two experimental conditions (violation of the functional constraint “unambiguity: relatively uncommon nouns” or “unambiguity: relatively common nouns”) within each language version. They were allowed to correct their answer until they pressed the “next” button to move on to the next item of the test. There was no time limit to complete the tasks.
Hypotheses
Since the retrieval of the target word depends on its baseline activation and the baseline activation of orthographically similar words, a relaxation of functional constraints should lead to an increased likelihood of differential item functioning, which is more pronounced in the case of the functional constraint “unambiguity: relatively common nouns.” In addition, we hypothesized that after controlling for items causing differential item functioning, the ones matched according to their radical structure should exhibit a good fit of the 1PL Rasch model within and across the two language versions.
Samples
The entire Austrian sample consisted of 93 (48.7%) male and 99 (51.3%) female respondents aged between 17 and 73 years (M = 31.11, SD = 11.58). The respondents were from diverse educational levels (ISCED Level 1: 8.8%, ISCED Level 2: 12.9%, ISCED Level 3: 48.2%, ISCED Level 4: 19.7%, and ISCED Level 5: 10.4%). Approximately half of the respondents were randomly assigned to each of the two experimental “constraint violation” conditions.
The two corresponding English samples consisted of 91 (48.1%) male and 98 (51.9%) female respondents aged between 16 and 71 years (M = 30.16, SD = 11.62). The respondents were from diverse educational levels (ISCED Level 1: 6.4%, ISCED Level 2: 16.4%, ISCED Level 3: 41.8%, ISCED Level 4: 20.6%, and ISCED Level 5: 14.8%). Again, approximately half of the respondents were randomly assigned to each of the two experimental “constraint violation” conditions.
Results
Differential item functioning within each language version caused by violations of the functional constraints was examined by means of T10 statistics. These analyses were carried out separately for the German and English samples. For each goodness-of-fit test, we simulated a total of 6,000 matrices. The results of these analyses are given in Table 1.
Table 1. Effect of Intentional Violations of the Functional Constraints on the Psychometric Quality of the Automatically Generated Items (1PL Rasch Model)
The results obtained for the functional constraint “unambiguity: relatively common nouns” were in line with our hypotheses in both language versions. However, the results obtained for the functional constraint “unambiguity: relatively uncommon nouns” only partially confirmed our hypothesis. While we observed significant model fit statistics in the English version, a narrowly nonsignificant model fit statistic was obtained in the German version. A closer examination of the constraint-violating items across the two languages revealed that the difference in word frequency between the more common target word and the less common alternative solution was more pronounced in the German version, which might have contributed to our finding.
Next, we evaluated whether the k = 10 anagrams that honored all functional constraints measure a unidimensional latent trait within and across both languages. The results are presented in Table 2.
Table 2. Goodness of Fit Statistics for the 1PL Rasch Model Within and Across Both Languages for the k = 10 Items of Study 1 That Honor All Functional Constraints
As can be seen in Table 2, the 1PL Rasch model fits the data within and across languages. Most importantly, the Andersen Likelihood Ratio test for the partitioning criterion “language version” failed to reach the level of statistical significance. This implies that matching anagram problems according to their radical levels in the actual item generation phase is effective in automatically generating a set of anagram problems that exhibits full score equivalence across the two languages examined.
Study 2
The second study was conducted to validate the results obtained in the first study and to evaluate the construct representation within and across both languages.
Procedure
To ensure a more representative distribution of the radical levels, we generated k = 90 new German anagram problems and k = 60 new English anagram problems. Based on the finding that constructing conceptually and linguistically identical verbal test items in multiple languages is close to impossible (cf., Allalouf et al., 1999; Allalouf et al., 2009; Elosua & López-Jaúregui, 2007; Sireci & Allalouf, 2003), we did not attempt to match these additional items according to their radicals across the two languages. Instead, we decided to link the resulting item pools through a common set of k = 10 anagram problems that already exhibited full score equivalence in the previous study. The new item pool within each language was split into anagram problem sets containing a total of k = 15 newly generated anagram problems and the k = 10 anchor items obtained in Study 1. The anchor items were administered at the 2nd, 4th, 6th, 8th, 10th, 12th, 14th, 16th, 18th, and 20th positions of the item sets. This resulted in a total of six German and four English item sets. Each respondent worked on one of the 10 item sets. The general administration conditions were identical to Study 1.
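The anchor design described above can be sketched in Python. The item identifiers and function names are placeholders, and the order in which the newly generated items fill the remaining positions is an assumption made for illustration; the text specifies only the positions of the anchor items.

```python
def split_pool(pool, set_size=15):
    """Split a pool of newly generated items into consecutive blocks."""
    return [pool[i:i + set_size] for i in range(0, len(pool), set_size)]


def assemble_item_set(new_items, anchor_items, anchor_positions=range(2, 21, 2)):
    """Interleave anchor items at fixed 1-based positions with the new items."""
    positions = set(anchor_positions)
    if len(anchor_items) != len(positions):
        raise ValueError("need exactly one anchor item per anchor position")
    total = len(new_items) + len(anchor_items)
    new_iter, anchor_iter = iter(new_items), iter(anchor_items)
    return [next(anchor_iter) if pos in positions else next(new_iter)
            for pos in range(1, total + 1)]


# Placeholder identifiers for the 90 German items and the 10 anchor items.
german_pool = [f"DE{i:02d}" for i in range(1, 91)]
anchors = [f"A{i:02d}" for i in range(1, 11)]

german_sets = [assemble_item_set(block, anchors) for block in split_pool(german_pool)]
print(len(german_sets), "item sets of", len(german_sets[0]), "items each")
print(german_sets[0])
```

Applied to the 60 English items, the same procedure yields four item sets, so that every respondent works on 15 new items and the 10 common anchor items.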
Samples
The German sample consisted of 721 (43.5%) males and 936 (56.5%) females aged between 12 and 89 years (M = 31.11, SD = 12.83). The respondents were from diverse educational levels (ISCED Level 0: 3.2%, ISCED Level 1: 7.7%, ISCED Level 2: 13.3%, ISCED Level 3: 52.0%, ISCED Level 4: 13.5%, ISCED Level 5: 7.7%, and ISCED Level 6: 2.6%).
The English sample, on the other hand, comprised 391 (49.0%) males and 407 (51.0%) females aged between 15 and 79 years (M = 31.48, SD = 13.07). All educational groups are represented (ISCED Level 1: 3.0%, ISCED Level 2: 19.5%, ISCED Level 3: 47.6%, ISCED Level 4: 20.3%, and ISCED Level 5: 9.5%).
Results
Full score and construct representation equivalence were examined in a series of model fit tests. We first evaluated whether the item parameters of the interlanguage anchor items (Allalouf et al., 2009) remained statistically identical to those obtained in Study 1. To do so, we combined the present sample with the one obtained in the first study and calculated an LRT. The results indicated that this is the case for the German, χ²(9) = 13.98, p = .123, and English, χ²(9) = 13.53, p = .140, samples. Furthermore, this was also the case when both samples had been combined, χ²(9) = 16.74, p = .053. This indicated that full score equivalence can be assumed for the k = 10 anchor items.
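The p values reported for the anchor items can be reproduced from the test statistics and their degrees of freedom. The following sketch uses only the Python standard library; the function name is ours, and the series expansion of the incomplete gamma function is merely one of several ways to evaluate the upper tail of the χ² distribution.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with df degrees of freedom.

    Uses the power series of the regularized lower incomplete gamma function.
    """
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return 1.0 - lower


# Test statistics reported for the k = 10 anchor items (df = 9 each).
for label, statistic in [("German", 13.98), ("English", 13.53), ("combined", 16.74)]:
    print(f"{label}: p = {chi2_sf(statistic, 9):.3f}")
```

Rounded to three decimals, the computed values agree with the reported p values of .123, .140, and .053.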
Next, we evaluated the fit of the 1PL Rasch model within and across both language versions. The results of these analyses are presented in Table 3.
Table 3. Goodness of Fit Statistics for the 1PL Rasch Model Within and Across Both Languages for the Newly Generated Anagram Problems of Study 2
The results indicated that the 1PL Rasch model fits the data well within and across both languages as indicated by the nonsignificant likelihood ratio statistics. In addition, the LLTM fitted the data no worse than the 1PL Rasch model at α = .01 in the German, χ²(94) = 120.52, p = .034, and English, χ²(64) = 87.32, p = .028, anagram problems. The empirically estimated item parameters and those predicted on the basis of the radicals were highly correlated in both languages (R = .93). The results further indicated that the German basic parameter estimates enable a good prediction of the English item difficulty parameters (R = .93). This was also true when using the English basic parameter estimates in order to predict the German item difficulty parameters (R = .92).
Next, we evaluated the fit of the LLTM to the combined data set. At α = .01, the model fitted the data no worse than the 1PL Rasch model, χ²(154) = 196.99, p = .011. Furthermore, the basic parameter estimates of the radicals were statistically identical across the two language versions, χ²(153) = 181.26, p = .059, and all of them contributed significantly to the prediction of the item difficulty parameters. In line with the cognitive item model, the radicals “number of letters” (η = .185, p < .001; odds change = −1.203), “word frequency class” (η = .189, p < .001; odds change = −1.208) and “number of similar words” (η = .287, p < .001; odds change = −1.332), which determine the ease with which the target word can be retrieved, significantly increase the item difficulty parameters of the anagram problems. This finding supports the claim that item features linked to the ease with which the target word can be retrieved are associated with individual differences in anagram problem solving. In contrast, the radical “cues based on frequent letter combinations” (η = −.715, p < .001; odds change = 2.044) was found to decrease the difficulty of anagram problems by narrowing down the problem space when rearranging the letter string (Anderson, 1998). Furthermore, the basic parameter estimate of the radical “degree of fit between the letter sequence presented and the target word” (η = − .199, p < .001; odds change = 3.317) also turned out to be statistically significant. Since this radical (Irvine, 2002) is closely linked to the goal module, the results indicate that facets of executive functioning contribute to individual differences in anagram problem solving.
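The degrees of freedom of the LRTs comparing the LLTM with the 1PL Rasch model follow from the number of parameters in each model. The sketch below assumes the usual parameter count: the Rasch model has one free difficulty parameter per item minus one for identification, and the LLTM replaces these with one basic parameter per radical. The function name is ours.

```python
def lltm_lrt_df(n_items, n_basic_parameters):
    """Degrees of freedom for the LRT of the LLTM against the 1PL Rasch model."""
    free_rasch_parameters = n_items - 1  # one item parameter is fixed
    return free_rasch_parameters - n_basic_parameters


N_RADICALS = 5  # radicals specified in the cognitive item model
N_ANCHORS = 10  # anchor items shared by all item sets

print("German:  ", lltm_lrt_df(90 + N_ANCHORS, N_RADICALS))
print("English: ", lltm_lrt_df(60 + N_ANCHORS, N_RADICALS))
print("Combined:", lltm_lrt_df(90 + 60 + N_ANCHORS, N_RADICALS))
```

Under this assumption, the item counts of Study 2 reproduce the 94, 64, and 154 degrees of freedom reported for the German, English, and combined analyses.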
Taken together, these results indicated that full score equivalence can be assumed and that there is empirical evidence that respondents indeed use identical cognitive processes to work the anagram problems in both languages. Additionally, both item pools exhibited satisfactory characteristics. The item difficulty parameters of the German anagram pool ranged from −2.61 to 2.72 (M = −0.06, SD = 1.06), while the item parameters of the English item pool ranged from −2.61 to 2.10 (M = −.170, SD = 1.06). The item difficulty parameters of both item pools were normally distributed (German: Z = .873, p = .432; English: Z = .797, p = .549). The German and English item pools did not differ from each other in terms of difficulty (Levene-test: F = .015, p = .903; t[169] = −.697, p = .487). Furthermore, a simulation study (N = 5,000) that assumed a normal distribution of the ability parameters (M = .00, SD = 1.00; range = −3.00 to 3.00) was conducted for the German and English item pools to evaluate the measurement precision that can be reached in a computerized adaptive test administering a fixed number of k = 25 anagram problems. The German and English item pools both yielded a mean measurement precision of α = .81, which corresponds to a standard error of estimation of .44.
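The correspondence between the reliability and the standard error of estimation can be verified with a short calculation. The sketch assumes the classical relation between the two quantities and an ability distribution with a standard deviation of 1.00, as in the simulation study; the function name is ours.

```python
import math


def standard_error_from_reliability(reliability, ability_sd=1.0):
    """Standard error of estimation implied by a given reliability."""
    return ability_sd * math.sqrt(1.0 - reliability)


se = standard_error_from_reliability(0.81)
print(f"standard error of estimation = {se:.2f}")
```

A reliability of .81 thus implies a standard error of about .44 on the ability scale.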
Study 3
Although the results obtained in the second study are quite favorable, the LLTM failed to fit the data at α = .05. The basic parameters of the radicals are thus unable to exactly reproduce the item difficulties in both language versions, although the empirically estimated and predicted item parameters were highly correlated. This is a common finding in empirical applications of the LLTM (cf., De Boeck & Wilson, 2004; Fischer, 1995; van de Vijver, 2002). In these cases, Fischer (1995) recommended cross-validating the results using an independent set of items and respondents. This was done in the third study.
Procedure
We generated a total of k = 20 German and k = 20 English anagram problems using VfGen (Arendasy, 2006). The newly generated items were not identical to already existing items in terms of their radical structure (Irvine, 2002); they were administered together with the k = 10 anchor items. Similar to the previous study, the anchor items were administered at the 2nd, 4th, 6th, 8th, 10th, 12th, 14th, 16th, 18th, and 20th positions. The remaining administration procedure was identical to Study 2.
Sample
The German sample consisted of 90 (43.5%) males and 117 (56.5%) females aged between 15 and 65 years (M = 30.37, SD = 10.92). The respondents were from diverse educational levels (ISCED Level 0: 1.4%, ISCED Level 1: 9.7%, ISCED Level 2: 15.5%, ISCED Level 3: 55.6%, ISCED Level 4: 8.2%, ISCED Level 5: 7.7%, and ISCED Level 6: 1.9%).
The English sample, on the other hand, comprised 84 (43.9%) males and 107 (56.1%) females aged between 16 and 72 years (M = 36.02, SD = 12.32). All educational groups are represented (ISCED Level 1: 10.5%, ISCED Level 2: 17.5%, ISCED Level 3: 40.4%, ISCED Level 4: 14.0%, ISCED Level 5: 13.5%, and ISCED Level 6: 4.1%).
Results
The fit of the 1PL Rasch model was evaluated within and across both languages. Additionally, we combined the data obtained in Studies 2 and 3 to evaluate the full score equivalence of the joint item pool. The results are presented in Table 4.
Table 4. Goodness-of-Fit Statistics for the 1PL Rasch Model Within and Across Both Languages for the Newly Generated Anagram Problems of Study 3
The 1PL Rasch model fitted the data well within and across language versions. Furthermore, the empirically estimated item parameters of the k = 20 new anagram problems in each language and those predicted on the basis of the basic parameter estimates of the radicals were highly correlated (German: R = .92; English: R = .91). This was also the case when the basic parameter estimates of one language were used to predict the item difficulty parameters in the other language (German → English: R = .92; English → German: R = .91). Additionally, there were no significant differences in the basic parameter estimates across both language versions, χ²(193) = 225.79, p = .053, and across both studies, χ²(193) = 225.20, p = .056, for the combined data set comprising a total of k = 200 anagram problems. Taken together, the results support the replicability of the findings obtained in Study 2.
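The logic of this cross-validation can be sketched in a few lines. In the LLTM, the difficulty of an item is modeled as a weighted sum of the basic parameters of its radicals, and the predicted difficulties are then correlated with the Rasch estimates. The design matrix, the basic parameters, and the estimated difficulties below are invented purely for illustration; the three unnamed radicals, the function names, and the zero normalization constant are likewise assumptions, and the estimation of the LLTM itself is not shown.

```python
from math import sqrt


def predict_difficulties(q_matrix, basic_params, intercept=0.0):
    """LLTM prediction: difficulty_i = intercept + sum_j q_ij * eta_j."""
    return [
        intercept + sum(q * eta for q, eta in zip(row, basic_params))
        for row in q_matrix
    ]


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical design matrix: rows are new items, columns are three radicals.
q_matrix = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
]

# Hypothetical basic parameters, e.g. as estimated in an earlier calibration study.
basic_params = [0.8, -0.5, 1.2]

# Hypothetical Rasch difficulty estimates for the same new items.
estimated = [0.7, -0.4, 0.5, 1.0, 2.1]

predicted = predict_difficulties(q_matrix, basic_params)
r = pearson_r(predicted, estimated)
```

For the cross-language check, the same function would be called with the basic parameters estimated in one language and the design matrix and Rasch estimates of the other.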
Discussion
Several researchers have argued that constructing items that are conceptually and linguistically similar across languages is one of the core issues in cross-lingual assessment. This endeavor has proven to be particularly difficult in the field of verbal abilities assessment due to cross-lingual differences in syntax, semantics, and morphology that could potentially lead to differential item functioning (cf., Allalouf et al., 1999; Allalouf et al., 2009; Elosua & López-Jaúregui, 2007; Sireci & Allalouf, 2003). In order to solve this problem, we suggested using the min-max approach of AIG (Arendasy & Sommer, 2011) to provide a formal definition of conceptual and linguistic similarity in terms of a specific set of radicals, incidentals, and functional constraints.
In order to test the feasibility of this approach, we simultaneously constructed a German and an English version of a word fluency test. The selection of anagram problems as a common item format and the derivation of the cognitive item model were based on an extensive literature review with an emphasis on studies conducted with English- and German-speaking respondents. The first study yielded two important results: (a) In general, violations of functional constraints seem to be associated with an increased likelihood of differential item functioning, and (b) defining the conceptual and linguistic similarity of items in terms of their radical structure turned out to be sufficient to yield an item set that exhibits full score equivalence across the two languages. The latter finding is particularly important since half of the items also varied with regard to their surface structure (incidentals; Irvine, 2002). The second study confirmed this latter finding by indicating that the item parameter estimates of the interlanguage anchor items remain statistically identical across different administration conditions. Furthermore, we were able to demonstrate that extended pools of German and English anagram problems that were linked to each other by means of the interlanguage anchor item set exhibit full score equivalence as well.
Using the LLTM (De Boeck & Wilson, 2004; Fischer, 1995), we were able to show that respondents indeed use the postulated cognitive processes to solve our anagram problems. Furthermore, we demonstrated that the basic parameter estimates obtained with the LLTM were statistically identical across the two languages. This indicates that solving the German and English anagram problems requires essentially the same cognitive processes and knowledge structures.
As with any empirical study, there are some limitations that need to be taken into consideration. Although current concepts of construct validity regard evidence on the construct representation as fundamental (cf., Embretson, 1983; Kane, 2001; Messick, 1995), this kind of evidence is insufficient by itself to support stronger claims regarding the construct validity of a test. Therefore, future studies should also evaluate the cross-lingual equivalence of the correlation patterns between our anagram problems and construct-related and unrelated cognitive ability tests. Although there is empirical evidence on this issue for the German version of this test (cf., Arendasy et al., 2008), data on the English version are still absent. Furthermore, the evidence on the construct representation of the anagram problems could be strengthened by conducting studies that analyze either eye-movement patterns or patterns of brain activation. Hypotheses for both kinds of studies can be directly deduced from our cognitive item model based on its grounding in the ACT-R framework (for examples, see Anderson, 2005). Last but not least, it has to be noted that the two languages we examined are rather similar to each other. It would be interesting to evaluate the feasibility of this approach when applied to less closely related languages that differ more fundamentally in their morphological characteristics. While it might be possible to extend our framework by incorporating other morphological characteristics (e.g., apostrophes) that can guide respondents’ search for common letter combinations, there seems to be a natural limit to this approach when languages with logographic writing systems such as Chinese are taken into consideration.
Despite these limitations, the results obtained indicate that the min-max approach of automatic item generation can be a viable way of defining conceptual and linguistic similarity in order to obtain full score and construct representation equivalence. In addition, this approach could also help to reduce development and test adaptation costs in the long run. However, these benefits are restricted to well-specified latent traits.
Footnotes
The authors declared that they had no conflicts of interest with respect to their authorship or the publication of this article.
The authors declared that they received no financial support for their research and/or authorship of this article.
