Abstract
This article demonstrates how the metaheuristic item selection algorithm ant colony optimization (ACO) can be used to develop short scales for cross-cultural surveys. Traditional item selection approaches typically select items based on expert-guided assessment of item-level information in the full scale, such as factor loadings or item correlations with relevant outcomes. ACO is an optimization procedure that instead selects items based on the properties of the resulting short models, such as model fit and reliability. Using a sample of 5,567 respondents from five countries, we selected a 15-item short form of the Big Five Inventory–2 with the goal of optimizing model fit and measurement invariance in exploratory structural equation modeling, as well as reliability, construct coverage, and criterion-related validity of the scale. We compared the psychometric properties of the new short scale with the Big Five Inventory–2 extra-short form developed with a traditional approach. Whereas both short scales maintained the construct coverage and criterion-related validity of the full scale, the ACO short scale achieved better model fit and measurement invariance across countries than the Big Five Inventory–2 extra-short form. As such, ACO can be a useful tool to identify items for cross-cultural comparisons of personality.
Short scales allow the assessment of psychological constructs such as the Big Five personality traits typically in 5 minutes or less (e.g., Donnellan et al., 2006; Gosling et al., 2003; Rammstedt & John, 2007). This high time-efficiency enables the inclusion of short scales in cross-cultural large-scale assessments such as the Programme for the International Assessment of Adult Competencies (PIAAC), the World Values Survey, or the International Social Survey Programme (ISSP). Cross-cultural large-scale assessments open up valuable research opportunities (Allik & McCrae, 2004), for example, the investigation of the association between personality traits and life outcomes within and across cultures (e.g., Danner et al., 2019) or of sociodemographic and cultural differences in the distribution and structure of personality (e.g., Schmitt et al., 2007). However, the quality of this research depends on the quality of the respective short scales. Previous research has demonstrated that even very abbreviated scales, such as the 10-item short version of the Big Five Inventory (BFI-10) or the 15-item extra-short form of the Big Five Inventory–2 (BFI-2-XS), can reliably capture interindividual differences of the respective full scales within countries (Rammstedt & John, 2007; Soto & John, 2017a). However, because of the way they are traditionally developed, the comparability of these short scales across countries may be lacking.
How Are Cross-Cultural Short Scales Traditionally Developed?
Short scales are typically a subset of items of the original full scale. In a first step, a subset of items is selected out of the original scale based on item-level criteria, such as factor loadings, correlations with criterion variables, or conceptual considerations (e.g., Stanton et al., 2002). Second, the psychometric properties of the resulting short scale—for example, its reliability, factor loadings, correlation with the full scale, or correlation with criterion variables—are evaluated. Third, the short scale—which is typically developed and validated in English-speaking samples—is translated into other languages (e.g., Harkness et al., 2004).
This approach is beset with several issues. First, multiple criteria such as factor loadings, correlation with criterion variables, and conceptual considerations have to be optimized simultaneously. This may not always be possible. Items with the highest factor loadings may not be items that show the highest correlation with criterion variables or capture the conceptual core of a construct (Smith et al., 2000). For example, the items for the BFI-10 (Rammstedt & John, 2007) were primarily selected based on factor loadings. Accordingly, the items show a clear factorial structure (e.g., Rammstedt et al., 2013), but they do not cover all relevant facets of the Big Five domains. The Extraversion items cover only the Sociability facet but not the facets Assertiveness and Energy, and the Conscientiousness items cover only the Productiveness facet but not the facets Organization and Responsibility. This inadequate facet coverage reduces the validity of traditionally developed short scales and of research based on them.
The second issue is that the relevant model-level psychometric criteria, such as model fit, reliability, and degree of measurement invariance of a scale, can be evaluated only after the items of the short scale have been selected. During the development process, it is uncertain whether the items selected based on factor loadings will also provide the best model fit and reliability for the resulting scale. This is even more critical for cross-cultural surveys, where considerable effort is made to translate the short scales according to state-of-the-art approaches (e.g., Harkness et al., 2004) without knowing whether these scales will be measurement invariant across languages or cultures. This can be rather frustrating. For example, Vazsonyi et al. (2015) investigated the measurement invariance of a 28-item short form of the BFI across six countries and found only configural (but not metric or scalar) measurement invariance. Likewise, Danner and Rammstedt (2015) investigated the measurement invariance of two Big Five short scales in the World Bank’s STEP Skills Measurement Household Survey 2012 and the ISSP 2005. In the STEP survey, the Big Five were assessed in 12 countries with the 15-item BFI-S (Gerlitz & Schupp, 2005). In the ISSP, the Big Five were assessed in 19 countries with the 10-item BFI-10 (Rammstedt & John, 2007). Both short scales were developed based on the BFI (John et al., 1991); items were selected based on factor loadings and expert ratings, and then translated from English to other languages. In both data sets, not even configural measurement invariance could be supported. One study investigating the measurement invariance of the 15-item BFI-2-XS (Rammstedt et al., 2020) suggests that the English and the German versions are only approximately invariant. Further evidence on the cross-cultural invariance of the BFI-2-XS (Soto & John, 2017a) is missing so far.
In sum, traditional approaches to developing short scales are beset with several issues. Multiple criteria have to be optimized simultaneously, and relevant criteria, such as model fit and the measurement invariance of the short scale, can be evaluated only after the items have been selected. In the following, we will present an approach that addresses these issues. We will describe how the heuristic algorithm ant colony optimization (ACO; e.g., Olaru et al., 2019) can be used to select items simultaneously based on conceptual considerations as well as psychometric criteria of the resulting model, such as model fit, measurement invariance, correlations with the original scale, and correlations with criterion variables.
The Ant Colony Approach
To address the aforementioned issues, the best item subset should be identified directly on the resulting short models based on a composite of the desired scale-level criteria. However, this comes at a cost of dramatically increased computational effort. For example, when selecting one out of four items per BFI-2 facet (15 facets in total), the potential number of models (i.e., combinations) is 4^15 = 1,073,741,824. In this case, estimating each of the models with several measurement invariance levels to identify the best item subset is not computationally feasible.
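The size of this search space can be checked with a few lines of Python. The snippet is only an illustrative sketch; in particular, the estimation time of one second per model is a hypothetical figure chosen for the arithmetic, not a value reported in this study.

```python
# One of four items is chosen for each of 15 facets; the choices are
# independent, so the number of possible short forms is a product.
ITEMS_PER_FACET = 4
N_FACETS = 15

n_short_forms = ITEMS_PER_FACET ** N_FACETS
print(f"{n_short_forms:,}")  # 1,073,741,824

# Hypothetical assumption: one second of estimation time per model.
SECONDS_PER_MODEL = 1.0
years = n_short_forms * SECONDS_PER_MODEL / (60 * 60 * 24 * 365)
print(f"about {years:.0f} years of sequential estimation")
```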
To address this problem, we used ACO (Dorigo & Stützle, 2010), a metaheuristic optimization procedure that is capable of solving complex combinatorial problems in an efficient way. ACO is inspired by the foraging behavior of ants. Despite the limited capabilities of single ants, the colony as a whole is capable of finding and following efficient routes from the nest to the food source by communicating through pheromones (Deneubourg et al., 1983, 1990). During their search for food, ants leave pheromone trails along their route, which attract more ants to the route. The higher the strength of the pheromone trail, the more ants are attracted to the route. At the beginning of the search, ants will randomly search for the food source and return on the same route to the nest. Ants that find a shorter route to the food source will travel more frequently between the nest and food source and thus accumulate more pheromones on their route. Pheromone levels on longer routes increase more slowly or evaporate over time. The higher pheromone levels will attract more ants, which in turn increase the levels even further, until (nearly) all ants follow the shortest route. This natural adaptation technique described by Deneubourg et al. (1983, 1990) was first implemented as an optimization algorithm by Dorigo et al. (1991) to solve the traveling salesman problem (i.e., finding the shortest route across several cities). Since then, several ant-based optimization algorithms have been developed to solve a wide variety of combinatorial problems (for an overview, see Dorigo et al., 2010), with ACO being one of the most flexible and successful adaptations. Many psychological assessment problems, such as the selection of items for a short scale, also represent a combinatorial optimization problem. More specifically, a set of objects (e.g., items of the long form) has to be selected in such a way that a criterion is optimized (e.g., model fit and reliability of the short form).
In psychological research (or psychometrics), ACO has so far been used to identify the underlying empirical model in a structural equation modeling context (Marcoulides & Drezner, 2003) and to derive item weights for ethnicity-fair cognitive ability measures (Allred, 2019). Most notably, several studies have demonstrated that ACO is a purposeful item selection tool for short-scale construction (Janssen et al., 2015; Leite et al., 2008; Olaru et al., 2015; Olaru et al., 2018; Schroeders et al., 2016; Schultze & Eid, 2018).
How can this heuristic be used to develop short scales? At the start of the search, the ACO algorithm randomly selects several short scales and evaluates them on a set of user-defined criteria (e.g., model fit and measurement invariance). Items belonging to the best solution found will then receive higher (virtual) pheromone values. These pheromones in turn increase the likelihood of the items being selected in subsequent iterations. This process of selection, evaluation, and the increase of pheromone levels is repeated across several iterations until the desired criteria cannot be further optimized or pheromone levels reach a certain threshold after which the selected solutions become too similar (for more details, see Olaru et al., 2019; Schultze, 2017; Schultze & Eid, 2018). The main advantage of ACO is that it evaluates and optimizes scale-level psychometric criteria directly on the final model, instead of using item-level information from the initial model. The probabilistic approach based on pheromones also reduces the number of models that have to be estimated to a fraction of the possible models.
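The selection, evaluation, and reinforcement cycle described above can be sketched in Python as follows. This is a simplified illustration and not the implementation used in this study: the function name `aco_select`, the toy item qualities, and the settings (20 ants, 30 iterations, a fixed number of iterations instead of a convergence rule) are assumptions made for the example, and the scoring function stands in for the estimation of a full measurement model.

```python
import random

def aco_select(facets, evaluate, n_ants=20, n_iterations=30, seed=1):
    """Pick one item per facet; `evaluate` maps a tuple of items to a score in [0, 1]."""
    rng = random.Random(seed)
    # Every item starts with the same pheromone level, so the first draws are uniform.
    pheromone = {item: 1.0 for facet in facets for item in facet}
    best_items, best_score = None, float("-inf")
    for _ in range(n_iterations):
        iter_items, iter_score = None, float("-inf")
        for _ in range(n_ants):
            # Each ant draws one item per facet with probability proportional to pheromone.
            items = tuple(
                rng.choices(facet, weights=[pheromone[i] for i in facet])[0]
                for facet in facets
            )
            score = evaluate(items)
            if score > iter_score:
                iter_items, iter_score = items, score
        # Reinforce the items of the best solution found in this iteration.
        for item in iter_items:
            pheromone[item] += iter_score
        if iter_score > best_score:
            best_items, best_score = iter_items, iter_score
    return best_items, best_score, pheromone

# Toy problem (hypothetical): 5 facets with 4 items each and made-up item qualities.
toy_rng = random.Random(0)
facets = [[f"f{f}_i{i}" for i in range(4)] for f in range(5)]
quality = {item: toy_rng.random() for facet in facets for item in facet}

def toy_score(items):
    # Stand-in for a scale-level criterion computed on the short model.
    return sum(quality[i] for i in items) / len(items)

best_items, best_score, pheromone = aco_select(facets, toy_score)
print(best_items, round(best_score, 3))
```

Because items are drawn in proportion to their pheromone levels, the search concentrates on promising items without ever enumerating all combinations. As with any heuristic, the sketch does not guarantee that the globally best subset is found.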
The Present Study
So far, several studies have shown how ACO can be used to compile short forms with good model fit and reliability in a confirmatory factor analysis (CFA) context (e.g., Janssen et al., 2015; Leite et al., 2008; Olaru et al., 2015) or measurement invariance across age, gender, and countries in multigroup CFAs (e.g., Olaru et al., 2018; Schroeders et al., 2016; Schultze & Eid, 2018). However, these studies often focused on a small set of optimization criteria or did not compare the derived solution with a short scale developed with traditional item selection approaches (e.g., Stanton et al., 2002). In this study, we will demonstrate how ACO can optimize all criteria used in traditional approaches (e.g., conceptual considerations, correlations with the original scale, correlations with criterion variables), as well as more complex criteria that require a combinatorial approach (e.g., model fit and measurement invariance). We will also show how ACO can be applied in exploratory structural equation modeling (ESEM; Asparouhov & Muthén, 2009). Using data of the 60-item BFI-2 (Soto & John, 2017b), which was administered to 5,567 respondents in five countries (the United States, Germany, France, Spain, and Poland), we will compare the psychometric properties of a short form of the BFI-2 developed using a traditional approach (Soto & John, 2017a) with the psychometric properties of a short form of the BFI-2 developed with ACO. All analysis scripts used in this study are available in an Open Science Framework (OSF) repository (https://osf.io/bpgnv/).
Method
We report how we determined our sample size, all data exclusions, all manipulations, and all measures in the study.
Sample
We used data from the PIAAC English Pilot Study on Non-Cognitive Skills (Organisation for Economic Co-operation and Development [OECD], 2018a) and the PIAAC International Pilot Study on Non-Cognitive Skills (OECD, 2018b). The merged data set contained 7,380 participants from the United States, France, Germany, Spain, and Poland who completed the full BFI-2. We excluded participants who failed at least one of eight quality checks included in the data set (e.g., agreement with the item “I fly to the International Space Station”; low response times; no correct answers on an ability test; same responses to at least four pairs of positively and negatively keyed items of the same factor). This reduced the sample size to N = 5,766. We also excluded participants whose native language did not correspond to that of the country of residence. This resulted in a remaining total sample size of N = 5,567. Sample characteristics are presented in Table 1.
Sample Characteristics.
Cross-Validation
As an optimization procedure, ACO tries to find the optimal solution given the predefined model and sample. As a consequence, it is possible that the selected items represent an optimal solution only for the underlying sample. This problem of lacking robustness or generalizability is commonly known as overfitting. To address this issue, we randomly split the total sample into two equally large subsamples (n = 2,783) with the same country distribution as the full sample. The first sample half (the training sample) was used to identify a solution with ACO. The second sample half (the validation sample) was then used to evaluate the final solution. Evaluating the final solution in a sample that was independent from the sample used to identify this solution enabled us to ensure that the ACO model does not represent a sample-specific overfitted solution.
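A split of this kind, with the country distribution preserved in both halves, can be sketched as follows. The function name `stratified_split`, the seed, and the toy country labels are assumptions for illustration; the sketch does not reproduce the study's own splitting script.

```python
import random
from collections import Counter, defaultdict

def stratified_split(countries, seed=42):
    """Split respondent indices into two halves within each country."""
    rng = random.Random(seed)
    by_country = defaultdict(list)
    for index, country in enumerate(countries):
        by_country[country].append(index)
    train, valid = [], []
    for indices in by_country.values():
        rng.shuffle(indices)          # random assignment within the country
        half = len(indices) // 2
        train.extend(indices[:half])  # training half: used for item selection
        valid.extend(indices[half:])  # validation half: used for evaluation only
    return sorted(train), sorted(valid)

# Toy data (hypothetical): one country label per respondent.
countries = ["US"] * 10 + ["DE"] * 8 + ["FR"] * 6
train, valid = stratified_split(countries)
print(Counter(countries[i] for i in train))
print(Counter(countries[i] for i in valid))
```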
Measures
Big Five Inventory–2
The BFI-2 (Soto & John, 2017b) is a 60-item measure of the Big Five domains Extraversion, Agreeableness, Conscientiousness, Negative Emotionality (Neuroticism), and Open-Mindedness (Openness). The BFI-2 additionally captures personality traits at the facet level, with three facet traits per Big Five factor (e.g., Extraversion: Sociability, Assertiveness, Energy). Each facet is measured with four items; two items are positively keyed (e.g., “I am outgoing, sociable”) and two items are negatively keyed (e.g., “I tend to be reserved”). Respondents are asked to indicate their level of agreement with the 60 items on a 5-point Likert-type scale ranging from 1 (strongly disagree) to 5 (strongly agree).
Big Five Inventory–2–XS
The BFI-2-XS is a 15-item short form of the BFI-2 developed by Soto and John (2017a) with the goal of maintaining the construct and predictive validity of the full scale. Items were selected based on several criteria, for example: (a) each item’s standardized loading on its facet factor in a bifactor CFA model of the facet scale; (b) each item’s correlation with its total facet scale in a U.S. sample; (c) the inclusion of both true-keyed and false-keyed items within each Big Five domain; and (d) the authors’ conceptual judgment of the extent to which each item’s content represented the overall meaning of its facet scale. One item was selected for each facet, thereby yielding a total of 15 items.
Criterion Variables
In addition to the BFI-2, a number of criterion variables were also assessed.
Education
The highest level of education completed was assessed on a 4-point scale comprising (a) primary school, (b) high school or equivalent, (c) some college or vocational school (2 years), and (d) tertiary education.
Income
Relative yearly income was assessed on a 6-point scale with the labels (a) less than 10% of the national income level, (b) 10% to less than 25%, (c) 25% to less than 50%, (d) 50% to less than 75%, (e) 75% to less than 90%, and (f) 90% or more.
Life satisfaction
Life satisfaction was assessed with three items, asking (a) how satisfied respondents were with their lives, (b) to what extent they felt that the things they did in their lives were worthwhile, and (c) how happy they had felt the day before the assessment. All items were answered on a 10-point scale ranging from not at all to completely. We computed the mean value across the three items as an indicator of life satisfaction. The internal consistency of the scale was α = .86 (ranging from α = .82 in the French sample to α = .90 in the Spanish sample).
Health
Subjective health was measured by asking respondents how they would describe their health on a 5-point scale ranging from excellent to poor. We reverse-coded this variable so that higher values indicate higher perceived health.
Statistical Analyses
Model Specification
The measurement model for the short scales was specified using ESEM (Asparouhov & Muthén, 2009) with target rotation. ESEM is a less restrictive type of CFA that allows for cross-loadings across factors, which are commonly found in models of personality self-report scales (e.g., Ashton et al., 2009). To address potential scale usage effects (and differences therein across countries), we also included an acquiescence factor (e.g., Aichholzer, 2014; Billiet & McClendon, 2000; Maydeu-Olivares & Coffman, 2006) loading on all positively keyed items with 1, and on all negatively keyed (and recoded) items with −1. In contrast to bifactor models (e.g., Morin et al., 2016), in which the factor loadings of the general factor are unconstrained, the acquiescence factor only requires the estimation of one additional model parameter (i.e., the factor variance). Because the loadings on recoded negatively keyed items are fixed to −1, this factor captures interindividual differences in the tendency to agree with statements regardless of their content or construct assessed. The model is illustrated in Figure 1. For each model, we estimated three levels of measurement invariance: configural invariance (same model structure across countries); metric invariance (equal factor loadings across countries); and scalar invariance (equal factor loadings and equal intercepts across countries). Models were estimated with maximum likelihood estimation using Mplus 8. ACO (e.g., item selection, model evaluation, pheromone levels) was implemented in R 3.4.4. Mplus outputs were read using the R package MplusAutomation (Hallquist & Wiley, 2018).

Big Five Inventory–2 short model.
Model Evaluation
We evaluated overall model fit with a combination of the comparative fit index (CFI), the root mean square error of approximation (RMSEA), and the standardized root mean square residual (SRMR) based on common standards (acceptable/good fit: CFI ≥.90/.95; RMSEA ≤.08/.06; SRMR ≤.08/.06; Bentler, 1990; Hu & Bentler, 1999). Measurement invariance was tested by comparing a model without parameter constraints (configural measurement invariance), a model with equal factor loadings across countries (metric measurement invariance), and a model with equal factor loadings and item intercepts across countries (scalar measurement invariance). Metric invariance is required for a comparison of factor variances and correlations across countries. Scalar invariance is necessary for an unbiased comparison of the factor means.
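The cut-offs named above can be written as a small classification helper. This is a sketch for illustration only; the function names are hypothetical, and the two sets of input values are the scalar-invariance fit indices reported for the validation sample in the Results section.

```python
def grade(value, acceptable, good, higher_is_better):
    """Return 'good', 'acceptable', or 'poor' for one fit index."""
    if not higher_is_better:
        # Flip the sign so that larger always means better.
        value, acceptable, good = -value, -acceptable, -good
    if value >= good:
        return "good"
    if value >= acceptable:
        return "acceptable"
    return "poor"

def classify_fit(cfi, rmsea, srmr):
    # Cut-offs as stated in the text: CFI >= .90/.95, RMSEA <= .08/.06, SRMR <= .08/.06.
    return {
        "CFI": grade(cfi, 0.90, 0.95, higher_is_better=True),
        "RMSEA": grade(rmsea, 0.08, 0.06, higher_is_better=False),
        "SRMR": grade(srmr, 0.08, 0.06, higher_is_better=False),
    }

aco_fit = classify_fit(cfi=0.927, rmsea=0.050, srmr=0.047)  # BFI-2-ACO, scalar model
xs_fit = classify_fit(cfi=0.837, rmsea=0.072, srmr=0.062)   # BFI-2-XS, scalar model
print(aco_fit)
print(xs_fit)
```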
Optimization Criteria for the Ant Colony Approach
The overall optimization function for selecting items for the newly developed short scale, BFI-2-ACO, contained three criteria:
Model fit and measurement invariance
To improve measurement invariance and absolute model fit, we optimized the CFI and RMSEA of the scalar measurement invariance model. Note that it is also possible to additionally optimize (i.e., reduce) the CFI and RMSEA difference between measurement invariance levels using ACO (e.g., Olaru et al., 2018). However, optimizing the absolute fit of the most restrictive model is equivalent to optimizing both the absolute fit of the model and the difference from a less restrictive model. Estimating several measurement invariance levels in each (virtual) ant is also computationally much more demanding than testing only the most restrictive model.
Construct coverage
To maintain construct coverage, we optimized the correlation between the BFI-2-ACO and the full BFI-2. In line with the development of the BFI-2-XS, we ensured that one item from each facet was selected for a total of three items per Big Five factor, and that at least one item was negatively keyed (to also account for scale usage effects). We did not include high factor loadings or reliability as optimization criteria, as applying these criteria to such extra-short scales generally results in a highly homogeneous item selection with narrowed construct coverage.
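The two structural constraints (one item per facet, and at least one negatively keyed item among the three items of each factor) can be expressed as a simple check on a candidate selection. The sketch below is illustrative: the function name, the data layout, and the keying of the example items are hypothetical, and the reading that the keying constraint applies per factor is an assumption based on the description above. Only the facet names of Extraversion are taken from the text.

```python
from collections import Counter

def satisfies_constraints(selection, facets_by_domain):
    """Check one item per facet and at least one negatively keyed item per domain."""
    facet_counts = Counter(item["facet"] for item in selection)
    for facets in facets_by_domain.values():
        # Exactly one selected item for every facet of the domain.
        if any(facet_counts[facet] != 1 for facet in facets):
            return False
        # At least one of the domain's selected items is negatively keyed.
        chosen = [item for item in selection if item["facet"] in facets]
        if not any(item["negatively_keyed"] for item in chosen):
            return False
    n_facets = sum(len(facets) for facets in facets_by_domain.values())
    return len(selection) == n_facets

# Toy example restricted to one domain; the item keying is made up.
facets_by_domain = {"EX": ["Sociability", "Assertiveness", "Energy"]}
balanced = [
    {"facet": "Sociability", "negatively_keyed": False},
    {"facet": "Assertiveness", "negatively_keyed": True},
    {"facet": "Energy", "negatively_keyed": False},
]
all_positive = [dict(item, negatively_keyed=False) for item in balanced]

print(satisfies_constraints(balanced, facets_by_domain))      # True
print(satisfies_constraints(all_positive, facets_by_domain))  # False
```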
The correlation with outcomes
To maintain the criterion-related validity of the scale, we also optimized the correlations with relevant outcomes (i.e., education, income, life satisfaction, and subjective health). More specifically, we computed the difference between the correlations of the selected ACO short scale and the full BFI-2 scale with outcomes within each country. Because Negative Emotionality correlated negatively with all outcomes, which makes a negative difference value desirable, we reversed the correlations for this factor, so that a positive difference indicates stronger negative correlations. The optimization goal was to achieve a positive average difference value (i.e., at least 0).
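The difference score described above, including the sign reversal for Negative Emotionality, can be sketched as follows. The function name and all correlation values are hypothetical and serve only to show the computation for a single country.

```python
def validity_difference(short_r, full_r, reversed_domains=("NE",)):
    """Mean difference (short minus full) in outcome correlations.

    For domains listed in `reversed_domains`, the sign is flipped so that a
    positive value means a stronger negative correlation in the short scale.
    """
    differences = []
    for domain, outcomes in short_r.items():
        sign = -1.0 if domain in reversed_domains else 1.0
        for outcome, r_short in outcomes.items():
            differences.append(sign * (r_short - full_r[domain][outcome]))
    return sum(differences) / len(differences)

# Made-up correlations for one country (not values from this study).
short_r = {"EX": {"income": 0.12}, "NE": {"income": -0.20}}
full_r = {"EX": {"income": 0.10}, "NE": {"income": -0.18}}

difference = validity_difference(short_r, full_r)
print(round(difference, 3))  # 0.02
print(difference >= 0.0)     # the optimization goal is a value of at least 0
```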
All criteria were logit-transformed to ensure that they were weighted equally (transformed values ranged from 0 to 1) and optimized most strongly across critical values (i.e., CFI ≥ .90; RMSEA ≤ .06; difference in the correlation with outcomes ≥.00; correlation with the long scale ≥.90; for more details on the logit transformation, see Olaru et al., 2019). The overall optimization criterion was the mean value across all criteria.
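A transformation of this kind can be sketched with a standard logistic function centered on each critical value. The cut-offs are those named above; the slope values, the function names, and the example inputs are assumptions for illustration, and the exact parameterization used in the study is documented in Olaru et al. (2019) rather than here.

```python
import math

def logistic(value, cutoff, slope):
    """Map a raw criterion to (0, 1); the result is .5 exactly at the cutoff."""
    return 1.0 / (1.0 + math.exp(-slope * (value - cutoff)))

# (cutoff, slope) per criterion; a negative slope means that lower is better.
# The slopes are hypothetical and control how sharply values near the cutoff are weighted.
CRITERIA = {
    "cfi": (0.90, 100.0),
    "rmsea": (0.06, -100.0),
    "delta_r": (0.00, 100.0),  # difference in outcome correlations
    "r_full": (0.90, 100.0),   # correlation with the full scale
}

def overall_criterion(values):
    """Unweighted mean of the transformed criteria."""
    transformed = [
        logistic(values[name], cutoff, slope)
        for name, (cutoff, slope) in CRITERIA.items()
    ]
    return sum(transformed) / len(transformed)

# Made-up raw values for one candidate short form.
example = {"cfi": 0.93, "rmsea": 0.05, "delta_r": 0.0, "r_full": 0.85}
score = overall_criterion(example)
print(round(score, 3))
```

Because the logistic curve is steepest around the cutoff, improvements near a critical value change the overall criterion more than improvements far above or below it.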
Ant Colony Optimization Parameters
We estimated 60 models per iteration. After each iteration, pheromone values for the items of the best solution found in the iteration were increased by the optimization criteria value (ranging from 0 to 1). The search was aborted if the currently best solution could not be improved over the course of 40 additional iterations. If a better solution was found, the iteration counter was reset. As ACO is a probabilistic procedure that may yield a different solution with each run, we started the item selection 10 times with different random number generator seeds, and used the overall best solution out of the 10 runs (based on the selection sample). The average computation time was around five hours per ACO run on a laptop with a 2.80 GHz i7-7700HQ processor. To avoid the estimation of the same model several times, which may occur as the search converges toward a subset of items and pheromone values on these items become much larger (for an illustration, see Olaru et al., 2019), we skipped the estimation of previously selected models. Instead, the optimization criterion information was retrieved from a list containing the results of all previously estimated models, reducing the number of model estimations by around one third (with the current settings). An alternative approach of reducing the length of the ACO search without compromising the quality of the final solution has been implemented by Schultze (2017; see also Schultze and Eid, 2018) based on the max-min ant system (Stützle & Hoos, 2000). With this procedure, the search is stopped when pheromone values reach a certain threshold, after which the selected models would become redundant.
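The stopping rule and the reuse of previously estimated models can be sketched as follows. The settings of 60 models per iteration and 40 iterations without improvement follow the text; everything else (function names, the random proposal mechanism without pheromones, the toy scoring function, and the cap on iterations) is a hypothetical simplification and not the study's implementation.

```python
import random

def search_with_cache(propose, evaluate, patience=40, max_iterations=1000):
    """Stop after `patience` iterations without improvement; never re-estimate a model."""
    cache = {}  # maps an item combination to its stored criterion value
    best, best_score = None, float("-inf")
    stale = 0
    iterations = 0
    while stale < patience and iterations < max_iterations:
        iterations += 1
        improved = False
        for items in propose():
            if items not in cache:
                cache[items] = evaluate(items)  # estimate only unseen models
            if cache[items] > best_score:
                best, best_score = items, cache[items]
                improved = True
        # The counter is reset whenever a better solution is found.
        stale = 0 if improved else stale + 1
    return best, best_score, len(cache), iterations

# Toy problem (hypothetical): 3 facets, 4 items each, so only 64 distinct models.
rng = random.Random(3)

def toy_propose():
    return [tuple(rng.randrange(4) for _ in range(3)) for _ in range(60)]

def toy_evaluate(items):
    return sum(items) / (3 * len(items))

best, best_score, n_estimated, n_iterations = search_with_cache(toy_propose, toy_evaluate)
print(n_estimated, "models estimated in", n_iterations, "iterations")
```

In this toy setting the saving is far larger than the one third reported above, because the search space is tiny and the same combinations are drawn again and again.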
Results
Item Selection Based on the Ant Colony Approach
The items were selected based on the training sample. The model fit for the scalar invariant model in the training sample was CFI = .939, RMSEA = .046, and SRMR = .047. The items selected by ACO are presented in Table 2. For each factor, one negatively keyed item was selected, with the exception of Negative Emotionality, for which two negatively keyed items were selected. As can be seen from Table 2, the BFI-2-ACO scale shares six items with the BFI-2-XS. The overlap is only slightly larger than the expected random overlap of approximately four to five items (given the constraint to balance items). All results presented in the following are based on the independent validation sample.
English-Language Items of the Big Five Short Scales BFI-2-XS and BFI-2-ACO.
Note. BFI-2-XS = Big Five Inventory–2 extra-short form; ACO = ant colony optimization; EX = Extraversion; AG = Agreeableness; CO = Conscientiousness; NE = Negative Emotionality; OP = Open-Mindedness. The original BFI-2 item number is given in parentheses. BFI-2 items copyright 2015 by Oliver P. John and Christopher J. Soto. Reprinted with permission.
Model Fit and Measurement Invariance
In the validation sample, the BFI-2-ACO model with scalar measurement invariance also yielded an acceptable to good model fit (CFI = .927; RMSEA = .050; SRMR = .047). Figure 2 shows the model fit of the BFI-2-ACO and BFI-2-XS across all measurement invariance levels. In particular, the BFI-2-ACO model with scalar measurement invariance yielded a better fit than the BFI-2-XS model (CFI = .837; RMSEA = .072; SRMR = .062). Most notably, the BFI-2-XS did not achieve acceptable model fit under scalar measurement invariance constraints, whereas the BFI-2-ACO did.

Model fit across measurement invariance levels.
However, not even the BFI-2-ACO achieved metric measurement invariance based on the ΔCFI≤.01 criterion (Cheung & Rensvold, 2002) or on overlapping RMSEA confidence intervals (MacCallum et al., 1996). But measurement invariance violations were much smaller for the BFI-2-ACO (ΔCFIs = .023 and .031) than for the BFI-2-XS (ΔCFIs = .039 and .104). Thus, the BFI-2-ACO arguably represents the most measurement invariant short form for the countries covered by the present study, as it provides the best model fit under the strongest equality constraints. When comparing the Big Five factors across countries, the BFI-2-ACO should therefore yield less biased comparisons than the BFI-2-XS. When using the BFI-2-ACO on a subset of the selected countries, or relaxing some constraints (e.g., partial measurement invariance; Byrne et al., 1989), these measurement invariance issues should also improve.
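The ΔCFI comparison can be made explicit with a short check against the .01 criterion. The values are the ones reported above; the function name is hypothetical, and treating the two values per scale as the consecutive steps between invariance levels is an assumption about their order.

```python
def invariance_supported(delta_cfi, threshold=0.01):
    """ΔCFI criterion (Cheung & Rensvold, 2002): the drop in CFI must not exceed the threshold."""
    return delta_cfi <= threshold

# Reported ΔCFI values per scale (assumed order: first step, second step).
reported = {"BFI-2-ACO": (0.023, 0.031), "BFI-2-XS": (0.039, 0.104)}

for scale, deltas in reported.items():
    print(scale, [invariance_supported(d) for d in deltas])
```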
Factor Loadings and Reliability
Table 3 shows the standardized factor loadings and reliability estimates (i.e., factor saturation) of the configural invariant measurement model. We report the averaged loadings and ranges across all countries. The factor loadings and reliability estimates for each country can be found on the project page of the present study in the OSF repository (see OSF Tables 1 and 2; https://osf.io/bpgnv/). As can be seen from Table 3 in the present article, loadings were of moderate size, which was to be expected for a model with items measuring a common higher order factor but different facets. Because of the low number and heterogeneity of the items, factor saturation was generally low for both the BFI-2-XS and the BFI-2-ACO (with the exception of Negative Emotionality). In all countries, BFI-2-ACO main loadings were significant—exceeding .30 (except for “Prefers to have others take charge” in the U.S. sample; λ = .22)—and higher than the largest cross-loading on the corresponding item. In contrast, the BFI-2-XS Conscientiousness item “I am someone who is reliable, can always be counted on” yielded a nonsignificant main loading in the French and Spanish sample (France: λ = .12; Spain: λ = .02). In these two countries, the item loaded on Agreeableness instead (France: λ = .44; Spain: λ = .50), indicating that it has a different meaning for participants in these countries.
BFI-2-XS and BFI-2-ACO Factor Loadings, Factor Saturation, and Manifest Correlations With the Full BFI-2.
Note. BFI-2-XS = Big Five Inventory–2 extra-short form; ACO = ant colony optimization; EX = Extraversion; AG = Agreeableness; CO = Conscientiousness; NE = Negative Emotionality; OP = Open-Mindedness; AQ = acquiescence factor; ω = McDonald’s omega. The average and range [in brackets] of the correlations with the BFI-2, standardized factor loadings, and factor saturation across countries are presented here (for all values, see OSF Tables 1 and 2 on the project page of the present study in the OSF repository, https://osf.io/bpgnv/?view_only=21543964489f41e18d410057e1850db0). Factor loadings were estimated on the configural measurement invariant model. Main loadings are printed in bold.
Correlation With the Full BFI-2 and Criterion Variables
Correlations between the BFI-2-ACO scales and the corresponding full BFI-2 scales were all above .80, and equivalent to the correlations between the BFI-2-XS and the full measure (see Figure 3). Despite the fact that the BFI-2-ACO and the BFI-2-XS did not correlate perfectly with the full BFI-2, the correlations with the relevant outcomes were identical between these three scales (see Figure 3; for correlations by country for all three scales, see OSF Table 3, https://osf.io/bpgnv/). Relevant interpersonal variance was thus arguably retained, despite the shortening of the scale. This applies to both the BFI-2-ACO and BFI-2-XS, for which items were also selected based on their correlations with a wide range of similar outcomes (Soto & John, 2017a). Differences between the magnitude of the correlations were much larger across countries than across the different scales (see error bars in Figure 3). Whereas the direction of the correlations was generally equivalent across countries, the absolute size of the relationships was often highest in Germany and lowest in Poland. In general, higher values of Open-Mindedness and Emotional Stability (i.e., the opposite pole of Negative Emotionality) were related to a higher level of educational attainment. Higher Extraversion, Conscientiousness, and Emotional Stability were associated with higher income. Life satisfaction and subjective health were most strongly related to Extraversion and Negative Emotionality. These correlational patterns were similar across countries; only the Polish sample stood out by showing the lowest absolute correlations throughout. Most notably, the Polish sample yielded only one relevant correlation (absolute r >.10) between the Big Five factors and education or income—namely r = −.18 between Negative Emotionality and income. 
Another noteworthy deviation from the norm was the relatively high correlation between Open-Mindedness and education in Germany (r = .23 to .26 across scales) compared with the other four countries (average r = .06 to .09, across scales).

Average manifest correlations with relevant outcomes across countries.
Discussion
The aim of the present study was to demonstrate how the heuristic item selection procedure ACO can be used to develop short scales. To develop a cross-culturally applicable 15-item short form of the BFI-2, we applied ACO in a sample of 5,567 respondents from five countries. We compared the psychometric properties of the newly developed scale, BFI-2-ACO, with those of the BFI-2-XS, a traditionally developed abbreviated form of the BFI-2 (Soto & John, 2017a). The results show that the novel item-selection approach can optimize a wide range of criteria simultaneously and achieve all the goals of the traditional approach (e.g., reliability, construct coverage, correlation with criterion variables) while also improving on additional psychometric properties (e.g., model fit and measurement invariance), which can only be adequately addressed using combinatorial item selection procedures. Whereas ACO has previously been applied to derive short models in a CFA (e.g., Janssen et al., 2015; Leite et al., 2008; Olaru et al., 2015) or multigroup CFA context (e.g., Olaru et al., 2018; Schroeders et al., 2016; Schultze & Eid, 2018), the present study shows the feasibility of applying this item selection method to more complex models (i.e., the full Big Five model with an acquiescence factor) and an alternative modeling framework (i.e., ESEM). This is also the first study to show that all criteria used in traditional short-scale development can easily be incorporated into ACO. This applies also to expert conceptual judgments of the items, which could be included if quantified in some form (e.g., relevance of the indicator as a measure of the underlying trait ranging from 0 = nonrelevant to 10 = highly relevant).
Advantages of ACO Over Traditional Approaches
Heuristic item selection algorithms such as ACO are particularly useful because the psychometric criteria (e.g., model fit, reliability) that are improved or maintained through item selection can only be computed based on item combinations. Item-level information, such as factor loadings, is merely a proxy for the scale-level criteria, which will change as soon as the first item is removed from the full scale. These issues are even more problematic when several criteria are simultaneously evaluated and items are not clearly supported by all criteria. By estimating all relevant criteria directly on the final short model, such issues can be overcome. Because the evaluation of every possible item combination would have been computationally too demanding, we used ACO as a heuristic item selection procedure to reduce the number of models that had to be estimated (see Olaru et al., 2019, for more details).
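To make the logic of combination-based selection concrete, the following is a minimal toy sketch of an ACO-style search in Python. It is not the R and Mplus implementation used in this study. The item "quality" values, the redundancy penalty, the evaporation rate, and the function names (aco_select, toy_score) are hypothetical and stand in for the estimation of a real short model. The point it illustrates is that each candidate set is scored as a whole, and that selection weights ("pheromones") are then shifted toward the items of the best set found so far.

```python
import random


def aco_select(n_items, k, score, n_ants=30, n_iter=50, evaporation=0.9, seed=0):
    """Toy ACO: choose k of n_items so that score(subset) is as large as possible."""
    rng = random.Random(seed)
    pheromone = [1.0] * n_items  # all items start with equal selection weight
    best_set, best_score = None, float("-inf")
    for _ in range(n_iter):
        for _ in range(n_ants):
            # Each ant draws k distinct items, proportional to the current weights.
            pool = list(range(n_items))
            chosen = []
            for _ in range(k):
                weights = [pheromone[i] for i in pool]
                pick = rng.choices(pool, weights=weights, k=1)[0]
                pool.remove(pick)
                chosen.append(pick)
            # The criterion is computed on the item combination, not per item.
            value = score(chosen)
            if value > best_score:
                best_set, best_score = sorted(chosen), value
        # Evaporate all weights, then reinforce the items of the best set so far.
        pheromone = [p * evaporation for p in pheromone]
        for i in best_set:
            pheromone[i] += 1.0
    return best_set, best_score


# Hypothetical toy criterion: item quality minus a penalty for redundant pairs.
# In a real application this would be replaced by fitting the short model.
quality = [0.3, 0.8, 0.5, 0.9, 0.4, 0.7, 0.6, 0.2]
redundant_pairs = {(1, 3), (5, 6)}


def toy_score(items):
    chosen = set(items)
    penalty = sum(0.5 for a, b in redundant_pairs if a in chosen and b in chosen)
    return sum(quality[i] for i in chosen) - penalty


best_items, best_value = aco_select(n_items=8, k=3, score=toy_score)
```

In this toy example, the two items with the highest individual quality (indices 1 and 3) form a redundant pair, so a purely item-level ranking would select a set that scores worse as a combination. This is the kind of situation in which scale-level evaluation pays off.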
These heuristic item selection procedures are particularly useful for complex psychometric criteria, such as model fit and measurement invariance, for which it is often difficult to identify the best items based on item-level information in the full model. For instance, we found that the item measuring the BFI-2-XS Conscientiousness facet Responsibility—“Is reliable, can always be counted on”—had a more interpersonal connotation (i.e., Agreeableness) for participants in France and Spain than for those in the other countries covered by this study. As short scales such as the BFI-2-XS are often used in international panel studies, it is important to evaluate the cross-cultural applicability of the items when developing these scales.
Whereas we were able to achieve better model fit and measurement invariance than the BFI-2-XS, the reliability and criterion-related validity (including the correlation with the long BFI-2) were equivalent across the two short forms. These criteria were optimized in the development of the BFI-2-XS by examining item-total correlations and main loadings (in exploratory factor analyses and CFAs) to select the most central item of each facet. By doing so, the BFI-2-XS retained most of the relevant BFI-2 variance, and the ACO solution was not able to further improve on this.
Recommendations for Using ACO
The R and Mplus scripts used in this study are available on the study’s project page in the OSF repository (https://osf.io/bpgnv/; for a CFA-based version of ACO running completely in R, see https://osf.io/yx4km/ or the stuart package [Schultze, 2019]). In the following, we will give some recommendations on using ACO (for an in-depth tutorial, see Olaru et al., 2019). When using heuristic item selection procedures such as ACO, it is important to first select a meaningful set of optimization criteria. To define a purposeful optimization function, it is necessary to first evaluate the full model and identify potential shortcomings that need to be addressed for the specific research question (e.g., a lack of model fit and measurement invariance for cross-cultural comparisons). Note that some criteria—such as reliability—will decrease as a result of the reduced item numbers. Consider also that some optimization criteria will reinforce each other (e.g., correlation with the full measure and reliability; CFI and reliability; Moshagen & Auerswald, 2017). Generally, the more optimization criteria are included, the more difficult it may be to find a solution that satisfies all criteria. This depends also on the number of items available for selection, as a larger initial item pool provides greater opportunity for improvement. A viable strategy might thus be to first run ACO only with the most necessary optimization goals (e.g., model fit and measurement invariance for cross-cultural research) and to rerun it with more criteria only if the first runs yielded adequate solutions.
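As an illustration of what an optimization function over several criteria might look like, the sketch below maps each criterion onto a common 0 to 1 scale with a logistic transformation and averages the results. This is one possible formulation, not necessarily the function used in this study. The cutoffs, slopes, criterion names, and the two sets of candidate values are hypothetical and chosen only for demonstration.

```python
import math


def logistic(value, cutoff, slope):
    """Map a raw criterion onto (0, 1); the result is 0.5 exactly at the cutoff."""
    return 1.0 / (1.0 + math.exp(-slope * (value - cutoff)))


def optimization_value(criteria, spec):
    """Average the transformed criteria; higher means a better short model."""
    parts = [
        logistic(criteria[name], cutoff, slope)
        for name, (cutoff, slope) in spec.items()
    ]
    return sum(parts) / len(parts)


# Hypothetical specification: (cutoff, slope) per criterion.
# A negative slope rewards smaller values (e.g., for RMSEA).
spec = {
    "cfi": (0.95, 100.0),
    "rmsea": (0.05, -100.0),
    "reliability": (0.70, 20.0),
}

# Hypothetical criterion values for two candidate short models.
candidate_a = {"cfi": 0.97, "rmsea": 0.04, "reliability": 0.72}
candidate_b = {"cfi": 0.91, "rmsea": 0.07, "reliability": 0.75}

value_a = optimization_value(candidate_a, spec)
value_b = optimization_value(candidate_b, spec)
```

Starting with only the most necessary goals, as suggested above, would correspond to a specification with fewer entries; further criteria can be added to the specification in later runs.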
The number of (virtual) ants and the maximum number of iterations should be adjusted to the number of possible short models (e.g., computed with the choose(n, k) function in R; but consider also potential selection constraints such as item balance or factor allocation). However, values between 30 and 70 for each are generally adequate for most applications. ACO should be run several times, as it might result in different short scales in each run due to its probabilistic approach. A good way to check whether a sufficient number of ants and iterations has been set is to compare the quality of the selected models across the different runs. If the criteria are not met, or if differences between runs are large, the number of ants and/or maximum iterations should be increased to ensure that the best possible item sets are identified. We advise against using ACO to select items on small samples (N < 500), as the resulting models may not be adequate on other samples. Cross-validation (i.e., selecting items on one half of the sample and subsequently evaluating the best item set found on the other half of the sample) is necessary to evaluate the robustness of the selected model.
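The size of the search space and a simple split-half partition can be sketched as follows. The example uses Python, where math.comb(n, k) is the counterpart of choose(n, k) in R. The full BFI-2 has 60 items in 15 facets of 4 items each. The one-item-per-facet constraint is used here only to illustrate how constraints shrink the number of possible models. The function name split_half and the random seed are hypothetical.

```python
import math
import random

n_items, k = 60, 15

# Number of 15-item sets that can be drawn from 60 items without constraints.
unconstrained = math.comb(n_items, k)

# Illustrative constraint: exactly one item from each of 15 facets
# with 4 items each, which gives 4 ** 15 possible sets.
one_per_facet = 4 ** 15


def split_half(n_respondents, seed=0):
    """Randomly split respondent indices into a selection and a validation half."""
    indices = list(range(n_respondents))
    random.Random(seed).shuffle(indices)
    half = n_respondents // 2
    return indices[:half], indices[half:]


# Items would be selected on the first half and evaluated on the second half.
selection_half, validation_half = split_half(5567)
```

Even under the one-item-per-facet constraint, more than a billion item sets remain, which shows why an exhaustive evaluation is impractical when every set requires a model to be estimated. In a multi-country sample, it may be preferable to perform the split within each country so that both halves keep the same country composition; this is a suggestion rather than a description of the procedure used here.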
Limitations of the Present Study
Although the present analyses were based on a well-established Big Five inventory and a heterogeneous sample of over 5,000 respondents from five countries, a number of limitations should be mentioned. First, the sample was limited to five rather culturally homogeneous countries. Even though we used samples from Germanic, Romance, and Slavic language groups, the applicability of the derived short scale outside the context of Western industrialized countries, or to other Western industrialized countries not covered by the present study, is open to question. Second, although participants were quota sampled by gender, age, and education, the samples were not randomly drawn, and were thus not representative of their populations. Third, even though the BFI-2 is a well-accepted personality inventory, it is comparatively short and—given the additional restrictions—provides only a small pool of items for selection. Fourth, the criterion variables in the present study were self-reported, and thus the correlation between the personality variables and the criterion variables may have been overestimated.
Conclusion
In a nutshell, the present study demonstrates that the heuristic item selection approach ACO improves on traditional approaches to developing short scales. Especially in cross-cultural settings, where properties such as the model fit or the measurement invariance of a scale are important, ACO can be a useful tool to identify items that allow for an unbiased comparison. Heuristic item selection procedures such as ACO are very flexible tools that can be applied to optimize a wide range of criteria (e.g., model fit, measurement invariance, scale composition) directly on the final short model within a wide variety of modeling contexts (e.g., CFA, ESEM, item response theory).
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
