Abstract
Following the suggestions of National Institute of Education and American Psychological Association Standards, this article addresses the issue of differential validity and differential prediction in the vocational interest domain as one major concern of test fairness. In order to investigate potential prediction bias in vocational interest measures, we first compare gender-specific validity coefficients for the prediction of person–environment fit and satisfaction to test for differential validity, and second, we examine gender differences in the slopes and intercepts of the regression model predicting person–environment fit to test for differential prediction. Results show evidence of differential validity and some indications of differential prediction in a standard Holland Interest Inventory. However, removing items showing large gender-specific differential item functioning, that is, controlling for measurement bias, slightly reduced prediction bias. Practical implications are discussed and further research objectives are suggested.
Keywords
Introduction
Considerable mean differences are consistently found in vocational interests of women and men (Lippa, 1998; Su, Rounds, & Armstrong, 2009). On average, whereas women prefer working with people, men are drawn to thing-oriented activities. Within the Realistic, Investigative, Artistic, Social, Enterprising, and Conventional (RIASEC) framework (J. L. Holland, 1997), men show stronger interest in the realistic and investigative domains, while women express stronger interest in the artistic, social, and conventional domains. In their meta-analysis, Su, Rounds, and Armstrong (2009) reported an effect size of d = .93 for the people–things dimension, one of the largest sex differences found in the psychological domain (Hyde, 2005). These large gender-related mean differences within the vocational interest domain might be interpreted as an indicator of unfairness in interest inventories. It is debated why women and men differ in their vocational interests: Social theories emphasize the influence of gender socialization and gender roles (Ruble, Martin, & Berenbaum, 2006), whereas biological approaches point to the importance of genes as well as pre- and postnatal hormones and their influence on brain structure and neuronal development (Berenbaum, Baxter, Seidenberg, & Hermann, 1997; Goy & McEwen, 2007). Current research on gender differences in diverse cultures (Lippa, 2010) and on the impact of prenatal and postnatal hormones on gender-specific interest patterns (Hell & Pässler, 2011) questions the assumption that the vocational interests of women and men are fundamentally equal and that empirically reported mean differences merely rest upon measurement errors or solely reflect differences in socialization history.
In their meta-analysis, Su et al. (2009) showed that item development (i.e., eliminating items with large gender-specific mean differences) moderated the size of sex differences in realistic, investigative, and enterprising interests. In comparison, the reduction in gender differences was minimal for the artistic, social, and conventional domains. This indicates that efforts to reduce gender differences in vocational interest inventories have been primarily successful with scales traditionally favoring males. In line with the opportunity approach of validation (Prediger & Cole, 1975), interest inventories applying this item development technique aim to encourage women to explore nontraditional educational and occupational choices, such as careers in engineering and science. Nevertheless, the question remains unsettled whether eliminating items with large gender differences might change the construct the interest inventory is intended to measure, thereby reducing its predictive validity (Gottfredson & Holland, 1978). Su et al. (2009) emphasize that the attempt to remove gender differences from interest inventories results in scales that do not mirror the original RIASEC dimensions as proposed by Holland (1959, 1997), but represent a narrower range of interests. Russell (2007) examined the agreement between four interest inventories and showed a mere 50% cross-classifying hit rate of RIASEC codes between the Self-Directed Search (SDS; J. L. Holland, Fritzsche, & Powell, 1994) and the Revised Unisex Edition of the American College Testing (UNIACT-R; American College Testing [ACT] Program, 1995). Thus, depending on the interest inventory applied and its construction strategy (i.e., whether items with gender-specific mean differences are eliminated), a person would likely receive different occupational suggestions. This suggests that item development strategies might indeed influence instrument validity.
However, both the Guidelines for the Assessment for Sex Bias and Sex Fairness in Career Interest Inventories (National Institute of Education [NIE], 1975) and the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education [AERA, APA, & NCME], 1999) strongly emphasize the suspension of gender differences in instrument validity and prediction as one crucial aspect of test fairness. As the Standards put it, “no bias exists if the regression relating the test and the criterion are indistinguishable for the groups in question” (AERA, APA, & NCME, 1999, p. 79). In recent decades, the Cleary (1968) approach of testing for differences in regression lines has been the central approach to evaluating test bias in prediction.
Moreover, in evaluating the Cleary approach, Meade and Tonidandel (2010) argue that both internal and external approaches must be considered when testing for bias. First, measurement bias must be ruled out by testing for differential item and test functioning. Second, the absence of bias in prediction must be demonstrated by testing for differential validity and differential prediction. While differential validity concerns whether the correlation between a predictor and a criterion is equal for women and men, differential prediction refers to differences in the standard errors of estimate or in the regression lines (i.e., differences in the slopes and/or intercepts) between females and males (Linn, 1978). Importantly, though differential validity and differential prediction are related, they can occur independently of each other. For example, a test may predict a criterion with the same accuracy for females and males, yet the use of a single regression line for both groups would lead to overprediction or underprediction for one group. Accordingly, as Linn (1978) argues, differential prediction is the more crucial aspect of instrument validity because it has a more direct influence on considerations of fairness.
Internal Approach of Test Fairness: Differential Item Functioning (DIF) in the Interest Domain
Item response theory (IRT) has provided new methods to address the issue of fairness in interest inventories. Within the framework of IRT, it is possible to determine whether women and men respond differently to the same item, given that they possess the same underlying trait level (P. W. Holland & Wainer, 1993; Osterlind & Everson, 2009). Multidimensional IRT has been developed as a theoretical framework to examine how DIF relates to item and test validity. Thus, the target trait, or primary dimension, is distinguished from other factors influencing test performances, that is, nuisance factors and secondary dimensions. DIF occurs because a factor apart from the construct targeted by the test is affecting the endorsement of one group (e.g., women) but not of the other group (e.g., men). Consequently, gender-specific DIF in interest items indicates that women and men do not show the same probability of endorsing an item (e.g., “teach children”) despite having the same underlying interest level (e.g., social interest).
Testing for DIF is relatively rare in the domain of vocational interests, even though it is a standard procedure in ability testing. Current research (Aros, Henly, & Curtis, 1998; Einarsdóttir & Rounds, 2009) on gender differences in interest tries to establish whether these gender-related differences are attributable to real and valid differences in the underlying trait or whether they additionally depend upon certain properties of the instrument being administered. Aros, Henly, and Curtis (1998) found gender-specific DIF on most of the Strong Interest Inventory (SII; Harmon, Hansen, Borgen, & Hammer, 1994) occupational title items, whereas Einarsdóttir and Rounds (2009) detected gender-specific DIF in two thirds of the SII items, especially on the realistic scale. Both Aros et al. (1998) and Einarsdóttir and Rounds (2009) argued that responses on the SII were influenced by an additional sex-type dimension leading to DIF. When biased items were removed, gender mean differences favoring men were reduced in the realistic and investigative scales, whereas gender differences favoring women remained on the artistic and social scales. Comparing different test construction strategies, Wetzel, Hell, and Pässler (2012) concluded that IRT-based DIF analyses are useful for test construction purposes.
External Approach of Test Fairness: Differential Validity and Differential Prediction in the Interest Domain
According to Holland (1997), individuals tend to gravitate to environments that are consistent with their interest profile. For example, people working in sciences (e.g., biology, physics, and chemistry) are expected to have predominantly investigative interests. Therefore, instrument validity is evaluated by comparing the predominant interest type (high-point code) with a criterion such as educational or occupational choice and calculating the percentage of agreement. Various studies analyzed the concurrent or predictive validity of interest inventories by examining hit rate differences for women and men. Most studies found comparable mean hit rates for females and males (ACT, 2009; J. I. Hansen & Swanson, 1983; J. I. Hansen & Tan, 1992; J. C. Hansen & Dik, 2005; Harrington, 2006). Other researchers report slightly better classification results for women (Gottfredson & Holland, 1975). Furthermore, although mean hit rates for women and men were shown to be comparable, larger differences are found when comparing gender-specific hit rates at the occupational group level (ACT, 2009). Another procedure to establish the level of agreement between a person and the environment is the comparison of Holland codes based on congruence indices. The C-Index (Brown & Gore, 1994) has been frequently recommended as a congruence index because it not only incorporates Holland’s circular order assumption but is also sensitive to the order of interest types in Holland codes assigned for persons and occupations (Dik, Hu, & Hansen, 2007). A high level of person–environment congruence will increase the likelihood of continuity in educational and occupational decisions as well as the satisfaction and success in this type of educational or occupational environment (Holland, 1997). Applying this procedure, Rottinghaus, Coon, Gaffey, and Zytowski (2007) found no indication of gender differences in the level of congruence.
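The hit-rate procedure described above can be sketched in a few lines. This is a minimal illustration rather than the scoring routine of any particular inventory: the function names, the tie-breaking rule (first type in RIASEC order), and the example scores are assumptions made here for illustration.

```python
RIASEC = "RIASEC"

def high_point_code(scores):
    """Return the RIASEC letter with the highest scale score.

    Ties are resolved in favor of the type that comes first in RIASEC
    order; this tie rule is an assumption of the sketch.
    """
    return max(RIASEC, key=lambda letter: scores[letter])

def hit_rate(person_codes, criterion_codes):
    """Proportion of persons whose high-point code equals the criterion code."""
    if len(person_codes) != len(criterion_codes) or not person_codes:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(p == c for p, c in zip(person_codes, criterion_codes))
    return hits / len(person_codes)

# Hypothetical scale scores for one person
example_scores = {"R": 20, "I": 41, "A": 30, "S": 25, "E": 18, "C": 22}
example_code = high_point_code(example_scores)
```

Computing `hit_rate` separately for women and men and comparing the two proportions corresponds to the gender-specific hit-rate comparisons cited above.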
Remarkably, whereas there is extensive literature investigating gender-related differential validity and differential prediction in the cognitive ability domain as well as studies on differential validity in the interest domain (ACT, 2009; J. I. Hansen & Swanson, 1983; J. I. Hansen & Tan, 1992; J. C. Hansen & Dik, 2005; Harrington, 2006; Rottinghaus, Coon, Gaffey, & Zytowski, 2007), the issue of differential prediction has so far been left unexamined. Since important educational and occupational decisions are also based upon the results of interest tests, studying this form of predictive bias in vocational interest inventories seems particularly important.
Aim of This Study and Hypotheses
Research on prediction bias in interest inventories has merely relied on comparing hit rates for females and males or evaluating gender differences in congruence indices. Nevertheless, both NIE and APA Standards clearly emphasize the suspension of gender differences by testing for differential prediction. Several meta-analyses have shown relationships between person–environment fit and occupational satisfaction (Assouline & Meir, 1987; Tranberg, Slane, & Ekeberg, 1993; Tsabari, Tziner, & Meir, 2005). Following the Standards, we first compared validity coefficients for the prediction of person–environment fit and satisfaction for females and males to test for differential validity, and second, we examined gender differences in the slopes and intercepts in a regression model predicting person–environment fit to test for differential prediction.
Furthermore, following Meade and Tonidandel’s (2010) suggestions, we combined an internal and an external approach for establishing test fairness. As discussed above, research (Aros et al., 1998; Einarsdóttir & Rounds, 2009) shows that by eliminating interest items showing large DIF, gender differences can be reduced. Both Aros et al. (1998) and Einarsdóttir and Rounds (2009) ascribed the observed gender-related DIF to a sex-type dimension influencing female and male responses. Nevertheless, neither Aros et al. (1998) nor Einarsdóttir and Rounds (2009) compared the instrument validity of the SII with that of its DIF-optimized version. Evidence from UNIACT research suggests that the accuracy of classifying persons into their occupational preference group was comparable for one version of the instrument where gender differences were eliminated on the item level and a traditional version of the same instrument (Prediger & Lamb, 1979). Thus, in a second step, we compared the validity of a traditionally constructed interest inventory with its DIF-optimized version, analyzed both instruments for differential validity and differential prediction, and compared the findings.
Hypotheses
According to Holland (1997), individuals tend to gravitate toward environments that are consistent with their interest profile. Furthermore, the level of person–environment congruence will increase the likelihood of subjective person–environment fit as well as satisfaction with choice. Thus, to demonstrate instrument fairness, neither the C-Index level (Hypothesis 1) nor the correlation between the C-Index and subjective person–environment fit (Hypothesis 2) or satisfaction (Hypothesis 3) should differ between females and males. Moreover, when predicting person–environment fit from an individual’s interest score, the slopes and intercepts of the regression model should not differ between females and males (Hypothesis 4).
Furthermore, we were interested in whether eliminating interest items showing large gender-specific DIF, and thereby reducing gender differences on the scale level, influences instrument validity. Therefore, we additionally examined a DIF-optimized version of an interest inventory for differential validity and differential prediction and compared these findings to those of the standard version.
Method
Participants
The sample comprises 797 students (62.8% women and 37.2% men). Due to unrealistic response patterns (total processing time of the questionnaire less than 20 min), a reduced sample size of N = 736 was included in further analyses. Participants were either enrolled in vocational training (50.8%) or university programs (49.2%). In order to guarantee sufficient heterogeneity of job-related interests, we investigated different fields of vocational training and fields of study. That is, the assigned three-letter Holland codes vary. Thus, we investigated three groups of trainees during their apprenticeship (vocational training): digital media designers (15.1%), hotel managers (18.7%), and design draftspeople (17.0%). Students were recruited from university, studying different majors, such as chemistry (15.5%), economics (17.5%), and linguistics (16.2%). Table 1 shows characteristics of the sample separated by fields of studies.
Characteristics of the Sample Separated by Field of Studies.
Note. GIST = General Interest Structure Test.
The GIST three-letter code is taken from the GIST Manual. N = 736.
We tried to balance the number of female and male trainees and students in the sample. However, depending on the field of study, the distribution of gender in the population is inhomogeneous, favoring females for subjects with a stronger focus on social interaction and males for more technical subjects. Participants’ ages ranged from 16 to 48 years (M = 21.42, SD = 2.89). Data were collected online during a 6-month period.
Instruments/Measures
Revised General Interest Structure Test (GIST-R)
Participants’ occupational interests were assessed using the GIST-R (“Allgemeiner Interessen-Struktur-Test;” Bergmann & Eder, 2005). The GIST-R is a widely used German interest inventory based on Holland’s (1959, 1997) RIASEC model. Participants were asked to rate their individual level of interest in the represented activity on a 5-point Likert-type scale, ranging from 1 (I am not interested in this at all; I do not enjoy doing this at all) to 5 (I am very interested in this; I enjoy doing this very much). Each participant’s scores on the 60 items (10 per dimension) were aggregated to the six RIASEC dimensions. Reliabilities (Cronbach’s α) for the GIST-R scales range between .82 and .87; the 1-month retest reliability ranges from r = .85 to r = .92 (Bergmann & Eder, 2005). The GIST-R is the best validated interest inventory in the German-speaking countries (i.e., Germany, Austria, and Switzerland). Scale score correlations between the GIST and matching scales from an adaptation of Holland’s SDS instrument (Jörin, Stoll, Bergmann, & Eder, 2004) range from r = .60 to r = .75. Furthermore, evidence for the structural validity of the GIST is provided by Nagy, Trautwein, and Lüdtke (2010).
Person–Environment Fit Score/Satisfaction
Students indicated their perceived person–environment fit with their chosen training course or field of study on a scale ranging from 1 (not at all) to 6 (perfectly) and furthermore rated their satisfaction with their chosen training course or field of study on a scale ranging from 1 (very dissatisfied) to 6 (very satisfied). Despite several psychometric shortcomings, assessing general job satisfaction with a single-item measure is well established in the literature. In their meta-analysis, Wanous, Reichers, and Hudy (1997) attest to the satisfactory reliability and construct validity of this approach. The correlation between perceived person–environment fit and satisfaction was .42.
Cognitive Ability
We assessed participants’ cognitive ability to control for possible moderator effects. Cognitive ability was chosen as a moderator since individuals’ evaluation of their perceived person–environment fit as well as their satisfaction with occupational choice might rely both on the perceived fit between their interests and the occupational requirements and on the perceived fit between their cognitive abilities and the occupational demands. To assess cognitive ability, we administered a short version of an ability test designed for vocational counseling purposes (see Hell, Pässler, & Schuler, 2009, for details) that measures verbal, numerical, and spatial abilities. It focuses on the following dimensions: verbal classifications (Cronbach’s α = .73; 7 items), verbal analogies (.62; 6), numerical sequences (.67; 4), rule of three (.57; 3), quantitative comparisons (.64; 4), surface development (.54; 4), and mental rotations (.66; 3). We then aggregated these dimensions to an overall cognitive ability score for each participant. Instrument development was conducted according to the Berlin Model of Intelligence Structure (Jäger, 1984; for details, see Pässler & Hell, 2012). There is positive evidence for the validity of the instrument in predicting content-specific high school and college grades as well as college major choice (Pässler & Hell, 2012).
Procedures
DIF-Optimized Version
Analyses of DIF were carried out to investigate whether GIST items assess the same underlying constructs for both women and men. Two approaches were pursued: nonparametric and parametric methods. First, a nonparametric method for testing polytomous items for DIF based on calculating the Liu–Agresti cumulative common log-odds ratio (L-A LOR; Liu & Agresti, 1996) was applied. This method is implemented in the Differential Item Functioning Analysis System (DIFAS) 5.0 (Penfield, 2009) and is based on contingency tables. Derived from the Mantel–Haenszel common odds ratio used for dichotomous items, the L-A LOR is seen as its generalization for polytomous items (Penfield & Algina, 2006). In order to evaluate the size of DIF, we followed the classification system of the Educational Testing Service (ETS; Zieky, 1993), which distinguishes three categories: A for items with negligible DIF (L-A LOR < .43), B for items with slight to moderate DIF (.43 ≤ L-A LOR < .64), and C for items with moderate to large DIF (L-A LOR ≥ .64). Second, parametric DIF analyses were conducted with ConQuest (Wu, Adams, Wilson, & Haldane, 2007). Afterward, the ConQuest DIF estimates were transformed to allow comparisons to L-A LOR values. Both methods are described in detail by Wetzel and Hell (2013). Although ETS recommends eliminating items showing slight to moderate (B) as well as moderate to large (C) DIF, we decided to eliminate only C items because otherwise an insufficient number of items would have been left in the Realistic domain. This procedure led to a reduced number of items in all but one RIASEC dimension in the DIF-optimized inventory (for details, see Wetzel & Hell, 2013): Realistic (5 items), Investigative (7), Artistic (8), Social (10), Enterprising (6), and Conventional (8).
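The ETS classification and the item-elimination rule described above can be sketched as follows. This is an illustrative sketch only: it assumes that the thresholds apply to the absolute value of the L-A LOR (the sign merely indicating which group is favored), and the item labels and L-A LOR values in the example are hypothetical, not taken from the GIST-R analyses.

```python
def classify_dif(la_lor):
    """Assign an ETS category (A, B, or C) to a Liu-Agresti log-odds ratio.

    Assumption of this sketch: the thresholds are applied to the
    absolute value of the statistic.
    """
    size = abs(la_lor)
    if size < 0.43:
        return "A"  # negligible DIF
    if size < 0.64:
        return "B"  # slight to moderate DIF
    return "C"      # moderate to large DIF

def dif_optimized_items(la_lor_by_item):
    """Keep every item that is not classified as C, mirroring the rule above."""
    return [item for item, value in la_lor_by_item.items()
            if classify_dif(value) != "C"]

# Hypothetical item labels and values, for illustration only
hypothetical_items = {"R01": 0.91, "R02": -0.12, "R03": 0.50, "R04": -0.77}
retained = dif_optimized_items(hypothetical_items)
```

Eliminating B items as well, as ETS recommends, would only require changing the condition in `dif_optimized_items` to keep category A alone.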
Congruence
Students were asked to indicate their current training course or field of study. Each educational choice was assigned a three-letter RIASEC code, according to the register of occupational codes in the GIST manual by Bergmann and Eder (2005). Congruence was then calculated between a participant’s three-letter code and the RIASEC code of the current training course or field of study using the C-Index (Brown & Gore, 1994). The C-Index ranges from 0 to 18, with higher scores indicating higher congruence between a participant’s vocational interests and educational choice. Separate C-Indices were calculated for the standard and the DIF-optimized GIST scales.
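For illustration, the C-Index can be sketched as below. The sketch assumes the commonly cited Brown and Gore (1994) scoring, in which the first, second, and third letter positions are weighted 3, 2, and 1, and each letter pair scores 3 (identical), 2 (adjacent on the hexagon), 1 (alternate), or 0 (opposite). The function names are introduced here and are not part of any published software.

```python
HEXAGON = "RIASEC"  # Holland's circular order

def hexagon_score(a, b):
    """Score a pair of types: 3 identical, 2 adjacent, 1 alternate, 0 opposite."""
    steps = abs(HEXAGON.index(a) - HEXAGON.index(b))
    distance = min(steps, 6 - steps)  # shortest way around the hexagon, 0 to 3
    return 3 - distance

def c_index(person_code, environment_code):
    """C = 3*X1 + 2*X2 + 1*X3 for two three-letter Holland codes."""
    if len(person_code) != 3 or len(environment_code) != 3:
        raise ValueError("both codes must have exactly three letters")
    weights = (3, 2, 1)
    return sum(w * hexagon_score(p, e)
               for w, p, e in zip(weights, person_code, environment_code))
```

Under this scoring, identical codes reach the maximum of 18 and codes whose letters are pairwise opposite on the hexagon score 0, which matches the range stated above.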
Differential Prediction
Differential prediction is generally assessed within a moderated multiple regression (MMR) framework. We used MMR as an inferential procedure to compare two different least squares regression equations (Aiken & West, 1991). MMR establishes whether a moderating variable (such as gender) influences the predictor–criterion relationship (Aguinis, 2004). The MMR model includes the first-order effects for predicting Y (quantitative dependent variable) from X (predictor) and Z (a second binary predictor) as well as the product term X × Z, which carries information regarding the moderating effect of Z.
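Written out, the model described above takes the following general form. The coefficient symbols b_0 to b_3 and the error term e are notation introduced here for illustration, and the 0/1 coding of Z (and which gender receives which code) is an assumption.

```latex
\begin{align}
Y &= b_0 + b_1 X + b_2 Z + b_3 (X \times Z) + e \\
Z = 0:\quad \hat{Y} &= b_0 + b_1 X \\
Z = 1:\quad \hat{Y} &= (b_0 + b_2) + (b_1 + b_3) X
\end{align}
```

In this form, b_3 is the difference between the two group slopes and b_2 is the difference between the two group intercepts, evaluated at X = 0 (the sample mean when X is mean centered).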
If the interaction term is significant, the predictive relationship with the criterion (i.e., the slope) differs across subgroups defined by the biasing variable (such as gender). To facilitate the interpretation and plotting of simple slopes, all continuous first-order effects including control variables were mean centered. Categorical variables were dummy coded. Interaction terms were computed using the mean-centered first-order terms. Unstandardized regression coefficients are reported since standardized solutions are problematic when a product term is involved in the model (Jaccard & Turrisi, 2003).
Furthermore, when applying moderated multiple regressions, concerns about their statistical power are often raised (Aguinis, Culpepper, & Pierce, 2010; Aguinis & Stone-Romero, 1997). Following the suggestions of Aguinis et al. (2010), we adopted a conservative significance level of p < .10 for the interpretation of the interaction effect. By adopting this conservative criterion, we decided that mistakenly concluding no bias when the data suggest bias is present (Type II error) would be more severe than concluding bias when the data suggest otherwise (Type I error).
For all MMRs, we entered the mean-centered interest variable (regarding the dominant letter specifically for each training course or field of study) and gender in Step 1 of the regression equation. The product term between interest and gender was additionally placed in Step 2. As Aguinis (2004) advises, the first-order effect was entered first and the product term second. Person–environment fit was used as dependent variable in all MMR models.
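The hierarchical procedure described above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the analysis code used in the study: gender is assumed to be coded 0/1, control variables are omitted, all function names are introduced here, and the F statistic for the change in R² is returned without a p value.

```python
from statistics import mean

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(columns, y):
    """Least squares fit with an intercept; returns (coefficients, R squared)."""
    design = [[1.0] + list(row) for row in zip(*columns)]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(design, y)) for i in range(k)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in design]
    y_bar = mean(y)
    ss_res = sum((t - f) ** 2 for t, f in zip(y, fitted))
    ss_tot = sum((t - y_bar) ** 2 for t in y)
    return coef, 1 - ss_res / ss_tot

def mmr(interest, gender, fit):
    """First-order model (X, Z), then the full model adding the product term X*Z."""
    x_bar = mean(interest)
    x = [v - x_bar for v in interest]        # mean-centered interest score
    xz = [a * b for a, b in zip(x, gender)]  # product of centered X and dummy Z
    _, r2_first = ols([x, gender], fit)
    coef, r2_full = ols([x, gender, xz], fit)
    n, predictors = len(fit), 3
    df_error = n - predictors - 1
    delta_r2 = r2_full - r2_first
    return {
        "b_interest": coef[1],
        "b_gender": coef[2],
        "b_interaction": coef[3],
        "delta_r2": delta_r2,
        "f_change": delta_r2 / ((1 - r2_full) / df_error),
        "df": (1, df_error),
    }
```

With a 0/1 moderator, `b_interaction` equals the difference between the slopes obtained from separate regressions in the two groups, which is why a significant product term is read as a slope difference.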
Results
DIF Optimization
The DIF-optimized version leads to a reduction of mean differences between women and men regarding the dimensions R, I, and A, but not for S and E (see Table 2). No significant gender differences were found for dimension C. For further details regarding the comparison between the DIF-optimized and standard tests, see Wetzel and Hell (2013).
Reduction of Gender-Specific Mean Differences in the DIF-Optimized Version Regarding the RIASEC Dimensions.
Note. DIF = differential item functioning; GIST = General Interest Structure Test; M = mean; RIASEC = Realistic, Investigative, Artistic, Social, Enterprising, and Conventional; SD = standard deviation.
Diff d* represents the differences between the effect sizes (absolute values) in the standard test and the DIF-optimized test version.
Differential Validity
C-Indices as congruence indicators rest upon Holland’s assumption of a circumplex structure of vocational interests. In order to examine the extent to which the hypothesized order relations hold, randomization tests were conducted using the RANDALL program (Tracey, 1997). Hubert and Arabie (1987) proposed the correspondence index (CI) to assess model fit. The CI can range from −1.0 (all order predictions violated) to 1.0 (perfect model fit). In their meta-analytical study, Tracey and Rounds (1993) established CI values above .65 as a benchmark for good model fit. In our study, CI values ranged from .53 to .86, indicating sufficient model fit (see Table 3). The fit to the male data produced slightly lower CI values (CIs of .61 and .53) than the fit to the female data (CIs of .86 and .75), in line with previous research (Darcy & Tracey, 2007).
Correspondence Indices for Total, Female, and Male Sample for the GIST-R and the DIF-GIST.
Note. CI = correspondence index; DIF = differential item functioning; GIST-R = Revised General Interest Structure Test.
CI = (predictions met − predictions violated) / total number of predictions.
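As an illustration of how the CI is obtained, the following sketch counts the order predictions of the circular model for a RIASEC correlation matrix: correlations between adjacent types should exceed those between alternate types, which in turn should exceed those between opposite types. The data layout (a dictionary keyed by pairs of types) and the handling of ties as neither met nor violated are assumptions of this sketch, and the randomization test carried out by RANDALL is not reproduced.

```python
from itertools import combinations

ORDER = "RIASEC"

def circular_distance(a, b):
    """Number of steps between two types on the hexagon (1, 2, or 3)."""
    steps = abs(ORDER.index(a) - ORDER.index(b))
    return min(steps, 6 - steps)

def correspondence_index(corr):
    """CI for a mapping frozenset({type1, type2}) -> correlation."""
    pairs = list(combinations(ORDER, 2))  # the 15 distinct pairs of types
    met = violated = total = 0
    for p in pairs:
        for q in pairs:
            # Prediction: the closer pair p correlates more highly than q
            if circular_distance(*p) < circular_distance(*q):
                total += 1
                r_p, r_q = corr[frozenset(p)], corr[frozenset(q)]
                if r_p > r_q:
                    met += 1
                elif r_p < r_q:
                    violated += 1
    return (met - violated) / total
```

For six types this yields 72 order predictions, so a CI of .53 would correspond to roughly three quarters of the predictions being met if there are no ties.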
C-Indices were calculated and then tested for significant gender differences to investigate Hypothesis 1. Results indicated significant differences between females (M = 11.38, SD = 3.32) and males (M = 11.94, SD = 3.99) in the standard version (GIST), with men showing a higher level of congruence, t(733) = 2.04, p = .04. The magnitude of the mean difference was rather small (d = 0.15). No significant differences were found for the DIF-optimized version (DIF-GIST).
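The effect size and test statistic for the standard version can be approximated from the reported means and standard deviations. The group sizes used below (462 women, 274 men) are an assumption: they are obtained by applying the gender percentages reported for the initial sample to N = 736, and the actual group sizes after exclusion are not given in the text.

```python
from math import sqrt

def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two independent groups."""
    return sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))

def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Standardized mean difference based on the pooled standard deviation."""
    return (m1 - m2) / pooled_sd(sd1, n1, sd2, n2)

def t_statistic(m1, sd1, n1, m2, sd2, n2):
    """Independent-samples t statistic assuming equal variances."""
    return (m1 - m2) / (pooled_sd(sd1, n1, sd2, n2) * sqrt(1 / n1 + 1 / n2))

# Reported means and SDs; group sizes are ASSUMED (see the paragraph above)
men = (11.94, 3.99, 274)
women = (11.38, 3.32, 462)
d = cohens_d(*men, *women)
t = t_statistic(*men, *women)
```

Under these assumed group sizes, both values come out close to the reported d = 0.15 and t = 2.04.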
Furthermore, analyses were repeated separately for each training course and field of study (see Table 4). For the GIST, significant gender differences were identified in two training courses (digital media design [MD] and engineering drawing [ED]) and two fields of study (chemistry [C] and linguistics [I]), with males showing higher levels of congruence than females. For the DIF-GIST, significant gender differences were found for two training courses (MD and ED) and one field of study (I). Thus, the DIF-optimized version led to a modest reduction of effect sizes.
Differential Effects Regarding the C-Index for the Standard Item-Set and the DIF-Optimized Item-Set.
Note. DIF = differential item functioning; GIST-R = Revised General Interest Structure Test; M = mean; SD = standard deviation.
Correlation coefficients between the C-Index and the criteria satisfaction and subjective person–environment fit for both versions of the inventory are shown in Table 5. Correlations between the C-Index and the criteria are slightly higher for females than for males. Furthermore, correlations between the C-Index and the criteria for the standard version appear slightly lower than those for the DIF-optimized version of the instrument. Nevertheless, when the differences between the correlations were tested using the Fisher r-to-z transformation, none of the comparisons reached significance (z = .77 to 1.21; p = .11 to .22).
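The comparison of two independent correlations via the Fisher r-to-z transformation can be sketched as follows. The function name and return format are introduced here for illustration, and any correlations or group sizes passed to it would have to be taken from Table 5, which is not reproduced in the text.

```python
from math import atanh, erfc, sqrt

def compare_correlations(r1, n1, r2, n2):
    """Test the difference between two independent correlations.

    Returns the z statistic together with the two-tailed and
    one-tailed p values under the standard normal distribution.
    """
    standard_error = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (atanh(r1) - atanh(r2)) / standard_error  # atanh is Fisher's r-to-z
    p_two_tailed = erfc(abs(z) / sqrt(2))
    return z, p_two_tailed, p_two_tailed / 2
```

The reported p values of .11 and .22 are numerically consistent with one-tailed probabilities for z = 1.21 and z = .77, although the text does not state which tail convention was used.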
Differential Validity of the C-Index Based on the Standard and on the DIF-Optimized Item Set.
Note. DIF = differential item functioning; GIST = General Interest Structure Test.
Women are shown beneath the diagonal. Men are shown above the diagonal.
*Correlation is significant at the .05 level (two-tailed).
**Correlation is significant at the .01 level (two-tailed).
Regression Models
Control Variables
We found a significant effect of cognitive ability for design draftspeople (b = .04, p = .002, ΔR² = .073; see Table 6).
Results From the MMRs Separately for Each Field of Study Regarding the GIST Standard Version and DIF-Optimized Version.
Note. DIF = differential item functioning; GIST = General Interest Structure Test; MMR = moderated multiple regression.
All quantitative predictors are mean centered.
MMRs of the GIST
Step 2 of the MMRs showed a significant first-order effect of interest for digital media designers (b = .05, p = .013) and design draftspeople (b = .05, p < .001) as well as for chemistry (b = .10, p < .001), economics (b = .04, p = .001), and linguistics students (b = .03, p = .014). In addition, cognitive ability was a significant predictor for design draftspeople (b = .03, p < .025). Overall, no significant effect of gender was found.
Step 3 shows the results after the interaction term was entered into the equation. Only for chemists was a marginal moderating effect found in the standard test version (interaction term: b = .08, p = .053). The interaction term accounts for an additional 2.7% of the variance in person–environment fit, F(1, 108) = 3.84, p = .053. Thus, the relationship between interest and person–environment fit is stronger for females than for males. This result suggests the presence of differential prediction.
MMRs of the DIF-GIST
In the DIF-optimized version, significant first-order effects of interest are shown for design draftspeople (b = .10, p < .001) as well as for chemistry (b = .14, p < .001), economics (b = .06, p = .003), and linguistics students (b = .04, p = .005). In sum, the relationship between interest and the criterion was weaker than that found for the GIST.
Compared to the marginally significant interaction in the subgroup of chemists in the GIST, the moderating effect decreases in the DIF-optimized version (b = .08, p = .093). This results in a reduction in ΔR²: the interaction term now explains 1.9% of the variance in person–environment fit. No moderating effects of gender occurred in any of the other subgroups in the DIF-optimized version (all p’s > .05).
Discussion
Summary
Both NIE and APA Standards (NIE, 1975) strongly emphasize the suspension of gender differences in instrument validity and prediction. Nevertheless, this study is the first in the vocational interest domain to address differential validity and differential prediction jointly with a DIF analysis.
In order to demonstrate the absence of differential validity, the correlation between an individual’s test result and the criterion must be equal for females and males. Thus, congruence indices indicating person–environment fit should be equal for females and males. Nevertheless, we established significant differences in the C-Indices for females and males when analyzing the total sample. Analyses on the group level found differential validity for media design and draftspeople trainees, as well as chemistry and linguistics students. Furthermore, we found a slightly higher correlation between the C-Index and both criteria (subjective person–environment fit and satisfaction) for females than for males. Thus, evidence for differential validity was found. The overall mean correlation between congruence and satisfaction is comparable to meta-analytical findings (Tranberg et al., 1993; Tsabari et al., 2005).
Current research (Aros et al., 1998; Einarsdóttir & Rounds, 2009) on DIF suggests that women and men respond differently to certain interest items in traditional interest inventories despite possessing the same underlying trait level. Eliminating those items showing large DIF led to a reduction of gender-specific mean differences in certain interest domains (namely, realistic and investigative interests). Furthermore, as Meade and Tonidandel (2010) highlight, removing items showing gender-specific DIF reduces measurement bias, thereby eliminating one source of prediction bias. When analyzing the DIF-optimized version of the instrument, we found no significant gender differences in the C-Index for the total sample. Nevertheless, when assessing differential validity on the group level, gender differences in C-Indices remained significant for media design and draftspeople trainees as well as for linguistics students. Interestingly, when comparing gender-specific correlation coefficients for the C-Index and both criteria (subjective person–environment fit and satisfaction), we found slightly higher correlations for the DIF-optimized version than for the standard version of the interest inventory. Thus, reducing measurement bias (i.e., removing those items showing large DIF) eliminated differential validity in the total sample and reduced differential validity on the group level.
Second, we focused on differential prediction, another source of prediction bias that has so far been uninvestigated in the domain of vocational interests. Differential prediction refers to differences in the regression lines (i.e., the slopes and intercepts of the regression model) for females and males. When examining the prediction of person–environment fit by an individual's interest score, we found a (marginal) moderating gender effect for the group of chemists. A significant interaction term indicated that the predictive relationship between interest and person–environment fit differed for females and males. When subsequently analyzing the DIF-optimized test version, no significant prediction bias was found. Thus, while there was some evidence for differential prediction in the standard version, prediction bias was eliminated in the DIF-optimized version.
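The logic of a moderated multiple regression (MMR) for differential prediction can be sketched as follows. This is a dependency-free illustration under stated assumptions, not the analysis code of this study: gender is assumed to be coded 0/1, the helper names (`solve`, `ols_r2`, `mmr_delta_r2`) are hypothetical, and the data are simulated. A nonzero gender coefficient corresponds to an intercept difference, a nonzero interaction coefficient to a slope difference, and ΔR² is the gain in explained variance from adding the interaction term.

```python
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes a is square and nonsingular (no check for a zero pivot).
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols_r2(X, y):
    """Least-squares fit of y on the columns of X; returns (coefficients, R^2)."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    b = solve(xtx, xty)
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - sum(bj * xj for bj, xj in zip(b, row))) ** 2
                 for row, yi in zip(X, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return b, 1.0 - ss_res / ss_tot


def mmr_delta_r2(interest, gender, fit):
    """Return (interaction coefficient, delta R^2) with gender as moderator."""
    # Step 1: main effects only (intercept, interest, gender).
    X1 = [[1.0, x, g] for x, g in zip(interest, gender)]
    # Step 2: add the interest x gender product term.
    X2 = [[1.0, x, g, x * g] for x, g in zip(interest, gender)]
    _, r2_main = ols_r2(X1, fit)
    b, r2_full = ols_r2(X2, fit)
    return b[3], r2_full - r2_main


# Simulated data (hypothetical): the slope differs by gender by 0.20.
rng = random.Random(0)
interest = [rng.gauss(0, 1) for _ in range(400)]
gender = [float(i % 2) for i in range(400)]
fit = [0.10 * x + 0.05 * g + 0.20 * x * g + rng.gauss(0, 0.5)
       for x, g in zip(interest, gender)]
b_int, d_r2 = mmr_delta_r2(interest, gender, fit)
print(f"interaction b = {b_int:.3f}, delta R^2 = {d_r2:.3f}")
```

The sketch omits the significance test of the interaction term (a t-test of the coefficient or an F-test of ΔR²), which an actual analysis would require.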
Overall, we found evidence of differential validity and some indication of differential prediction in a standard Holland Interest Inventory. However, evidence for differential prediction was found in only one of the six groups. Future studies should examine whether this result can be replicated with other samples and other instruments. Nevertheless, we found evidence that removing those items showing large gender-specific DIF is one strategy to diminish or even eliminate such prediction bias. Since testing for DIF is a rather straightforward procedure, it should be considered a standard in the interest domain.
Limitations
We are unable to determine whether our criteria (satisfaction and subjective person–environment fit) are themselves biased, especially since they rely on self-report measures. It is possible that those criteria are confounded with a variety of other constructs, such as overall satisfaction with life and evaluation by significant others, that are prone to stereotyping.
Furthermore, the sample sizes of females and males within each training course or field of study are relatively small. Moreover, the distributions of females and males within some groups are rather disproportionate. This uneven distribution of gender might result in a lack of power of the MMRs (Aguinis, 2004). In addition, although different fields of vocational training and fields of study were investigated, the social type seems to be slightly underrepresented in our sample. Thus, future research should rely on samples that are, first, more evenly balanced by gender and, second, more heterogeneous in terms of the dominant interest letter.
Implications
First, with respect to test construction principles, we recommend that interest inventories be evaluated by default for gender-specific DIF using IRT methodologies, since these directly assess the relationship between observed item responses and the latent traits being measured. Testing for DIF enables test developers to determine whether the test behaves differently for women and men and should be integrated into test analyses as a standard procedure. Second, as highlighted by Meade and Tonidandel (2010), both measurement bias and prediction bias must be inspected when examining test bias. Even though eliminating those items showing large DIF reduced prediction bias in the instrument, we still found differential validity in the DIF-optimized version on the group level. Thus, it remains unclear how the instrument can be further modified to guarantee test fairness. Further research concerning differential validity in interest inventories as well as its causes and possible moderators is needed.
Moreover, practitioners applying interest inventories should be aware that interest items, especially in the realistic domain, are interpreted differently by female and male test takers, which partly explains gender-specific mean differences on the RIASEC dimensions. In our study, removing those items showing large DIF led not only to a reduction in gender-specific mean differences in the realistic, investigative, and artistic domains but also to a fairer prediction of the criteria person–environment fit and satisfaction. However, these results need to be replicated with a larger and more heterogeneous sample as well as with other Holland-based interest instruments.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Our research was supported by a grant from the German Federal Ministry of Education and Research (grant agreement number: 01FP0930) and the European Social Fund of the European Union.
