Abstract
Three test construction strategies are described and illustrated in the development of the Verb Interest Test (VIT), an inventory that assesses vocational interests using verbs. Verbs might be a promising alternative to the descriptions of occupational activities used in most vocational interest inventories because they are context-independent, timesaving, and applicable across educational levels. Three test construction strategies are implemented and compared. The first construction method follows the rules of classical test theory (CTT), the second is within the framework of CTT as well but also takes gender differences in mean scores into account, and the third strategy is guided by item response theory (IRT) and controls for differential item functioning for men and women. The three VIT versions resulting from the different construction methods are compared regarding their construct and criterion validity. For practical use and career counseling, test development following the IRT approach seems most useful since it allows maximal occupational exploration and precise trait estimation.
Introduction
Vocational interests are consistently found to differ between men and women. Su, Rounds, and Armstrong (2009) showed that women prefer working with people and have stronger Social, Artistic, and Conventional interests compared to men, while men prefer working with things and have stronger Realistic and Investigative interests than women. There are two general approaches to dealing with the gender differences found in interest inventory scores. The first, called the socialization approach to validation (e.g., endorsed by Gottfredson & Holland, 1978), asserts that interest inventories only reflect real differences in underlying constructs that are due to the differing socialization experiences of men and women. According to this view, minimizing score differences “artificially” would impede the inventory’s validity. The opposing approach, referred to as the opportunity approach to validation (e.g., proposed by Prediger & Cole, 1975), argues that the purpose of applying interest inventories is to enable test-takers to explore career options. This goal can be achieved by constructing sex-balanced interest inventory scales or using same-sex norms to report interest inventory results (Betz, 1993). In the former case, gender differences are eliminated at the item level, whereas the latter case ensures that the comparison sample is similar to the test-taker with respect to socialization experiences, thus emphasizing interests that have developed in spite of sex-role restricted socialization experiences. Both encourage maximal exploration of career options, including the exploration of non-stereotypic occupations.
Depending on which approach is chosen, different strategies for test construction are applied. Vocational interest inventories that are constructed within the framework of the socialization approach to validation (e.g., Self-Directed Search; Holland, Fritzsche, & Powell, 1994; Strong Interest Inventory; Donnay, Morris, Schaubhut, & Thompson, 2005) try to optimize construct validity using classic criteria such as factor analysis and analyses of item discrimination and scale homogeneity. Vocational interest inventories constructed according to the opportunity approach (e.g., UNIACT Interest Inventory; American College Testing Program, 1995) attempt to control gender differences at the item level. In the case of the UNIACT, this is achieved during item selection by eliminating items that display an absolute difference of more than 15% in the proportion of like responses for males and females (American College Testing Program, 2009). UNIACT items assess basic interests and minimize the influence of sex-role stereotypes, thereby allowing the use of combined-sex norms to report results (Prediger & Swaney, 1995). Empirical evidence for the UNIACT’s validity in the prediction of academic and occupational outcomes as well as its validity for use in career counseling is presented in the Interest Inventory Technical Manual (American College Testing Program, 2009). Despite their differences, both construction strategies use classical test theory to guide test construction. A third strategy for test construction, which would allow construct validity as well as gender fairness to be optimized, is to apply item response theory (IRT).
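The UNIACT screening rule described above can be illustrated with a minimal sketch. It assumes that the 15% criterion refers to a difference of 15 percentage points between the male and female proportions of like responses; the function name, the item labels, and the proportions are hypothetical and are not taken from the UNIACT manual.

```python
def flag_sex_imbalanced_items(like_male, like_female, max_gap=0.15):
    """Return items whose male/female 'like' proportions differ by more than max_gap.

    like_male, like_female: dicts mapping item label -> proportion of 'like' responses.
    max_gap: assumed to be a difference in proportions (0.15 = 15 percentage points).
    """
    flagged = []
    for item in like_male:
        gap = abs(like_male[item] - like_female[item])
        if gap > max_gap:
            flagged.append(item)
    return flagged


# Hypothetical proportions, for illustration only.
male = {"repair engines": 0.55, "write reports": 0.40, "teach children": 0.30}
female = {"repair engines": 0.20, "write reports": 0.45, "teach children": 0.50}
print(flag_sex_imbalanced_items(male, female))
```

Items returned by the function would be candidates for elimination under such a rule; the remaining items would form the sex-balanced pool.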
CTT and IRT Approaches to Item Selection
Classical test theory (CTT) and IRT provide two different approaches to selecting items during test development. Both CTT and IRT aim at describing an individual’s position on an underlying latent trait. However, CTT and IRT differ in how they relate an individual’s responses to his or her trait values. While CTT focuses on the sum score, IRT focuses on the responses to individual items (Walker, Böhnke, Cerny, & Strasser, 2010). In CTT, the same model, which assumes that a person’s observed score is made up of the sum of true score and measurement errors, is always applied. Item selection within the CTT framework focuses on item discrimination with the goal of increasing a reliability coefficient such as Cronbach’s α. The reliability coefficient’s value reflects the strength of the relationship between the observed score and the true score and therefore how well the scale measures the latent construct (Reeve & Mâsse, 2004). Individual items are usually assessed by their discrimination value, which is computed by correlating the item score with the test score.
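The two CTT statistics mentioned above, Cronbach’s α and item discrimination, can be sketched in a few lines. The sketch uses the corrected item–total correlation (the item is excluded from the total it is correlated with) as the discrimination index; the function names and the small data set are illustrative assumptions, not part of the study.

```python
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def cronbach_alpha(rows):
    """rows: one list of item scores per respondent (same item order throughout)."""
    k = len(rows[0])
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def corrected_item_total(rows, j):
    """Correlate item j with the sum of all other items in the scale."""
    item = [r[j] for r in rows]
    rest = [sum(r) - r[j] for r in rows]
    return pearson(item, rest)


# Hypothetical 5-point ratings of five respondents on a three-item scale.
ratings = [[1, 2, 1], [2, 2, 3], [3, 4, 3], [4, 4, 5], [5, 5, 4]]
print(round(cronbach_alpha(ratings), 2))
print([round(corrected_item_total(ratings, j), 2) for j in range(3)])
```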
In IRT, a multitude of different models exists for relating item characteristics to a person’s latent trait value. The most common are the one-, two-, and three-parametric logistic models for dichotomous data and the graded response model (Samejima, 1969, 1996) and the partial credit model (PCM; Masters, 1982) for polytomous data. IRT allows the estimation of an individual measurement error and models test behavior at the item level instead of at the test score level (Embretson & Reise, 2000). Depending on the model chosen, items are selected, for example, with regard to their response category structure, their category probability curves, the ordering of their threshold parameters, and their functioning for different groups of respondents.
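To make the item-level modeling concrete, the following sketch computes the category probabilities of the PCM for a single item. It follows the standard formulation in which the probability of category x is proportional to the exponential of the summed differences between the person parameter and the first x threshold parameters; the function name and the threshold values are illustrative assumptions.

```python
import math


def pcm_probabilities(theta, thresholds):
    """Category probabilities under the partial credit model.

    theta: person parameter in logits.
    thresholds: threshold (step) parameters delta_1..delta_m in logits.
    Returns m + 1 probabilities, one per response category.
    """
    cumulative = [0.0]  # the sum for the lowest category is defined as 0
    for delta in thresholds:
        cumulative.append(cumulative[-1] + (theta - delta))
    weights = [math.exp(c) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical item with four ordered thresholds (five response categories).
probs = pcm_probabilities(theta=0.5, thresholds=[-1.5, -0.5, 0.5, 1.5])
print([round(p, 3) for p in probs])
```

When the person parameter equals a threshold, the two adjacent categories are equally likely, which is why ordered thresholds correspond to response categories that each become the most probable response somewhere along the trait continuum.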
Development of the Verb Interest Test (VIT)
In the following, CTT and IRT test construction strategies will be illustrated in the development of the VIT, an interest inventory that assesses vocational interests using verbs. Verbs characterize human behavior and are therefore suitable to indicate vocational activities. Widely used vocational interest inventories such as the Strong Interest Inventory (Donnay et al., 2005) or the UNIACT Interest Inventory (American College Testing Program, 1995) employ descriptions of vocational activities to assess a person’s interests. The test-taker indicates how much he or she would like or dislike carrying out the activity described. These descriptions of activities only apply to a very specific activity in a specific occupation. In contrast, verbs are independent of a specific context but nevertheless concrete in implying an activity. For example, with “to count,” the activity is clear, yet the context can differ widely (counting how many people are in front of me in line, counting the number of eggs in my fridge, etc.). Thus, they can be applied across occupations and also across educational levels. Another advantage of using verbs as items is that they are timesaving compared to the complete descriptions of activities with subject, verb, and object commonly used in vocational interest inventories. Thus, more items can be responded to in less time without putting a strain on the test-taker. Furthermore, interest items that consist of verbs might be less prone to gender-stereotypic responding, thus permitting advice to persons seeking career counseling that is based less on traditional gender-roles and maximizes occupational exploration. For example, “to glue” in itself is gender-neutral; adding a context is what might elicit gender-stereotypic responding (gluing photos in an album, gluing metal to wood while making a birdhouse, etc.).
Holland’s (1959, 1997) model of vocational interests was used as the theoretical framework for the VIT. Holland grouped vocational interests into six dimensions, Realistic, Investigative, Artistic, Social, Enterprising, and Conventional, which are referred to by their initials as RIASEC. There is ample empirical support for the RIASEC model (e.g., Armstrong, Hubert, & Rounds, 2003; Day & Rounds, 1998) and its applicability to different cultures (e.g., Nagy, Trautwein, & Lüdtke, 2010).
We investigated whether using different construction strategies would yield distinct item pools for the VIT. Thus, one version of the VIT was constructed strictly according to CTT. The second version of the VIT was constructed according to CTT as well but furthermore attempted to minimize gender differences in mean scores. The third version of the VIT was constructed within the IRT framework, also taking into account differential item functioning for men and women. The resulting item pools will be examined pertaining to their resemblance or difference as well as to their validity. In sum, the goals of this study are (a) to implement the three construction strategies, (b) to compare the three VIT versions with regard to their construct and criterion validity, and (c) to draw conclusions concerning which test construction strategy works best in developing a gender fair interest inventory.
Method
Lexical Approach and Pretest
The initial item pool was compiled using a lexical approach. First, we identified commonly used verbs that are associated with activities in a dictionary of German verbs (Wendt & Thurmair, 1999). The resulting list containing 2227 verbs was given to two experts in the field of vocational interests who had extensive experience with Holland’s RIASEC model. The experts assigned a single Holland dimension (Realistic, Investigative, Artistic, Social, Enterprising, Conventional; Holland, 1997) to each verb and marked ambiguous items using detailed descriptions of the dimensions. After removing ambiguous items (items that could be ascribed to more than one dimension), results showed that only 30% of all verbs could be clearly characterized as belonging to a certain Holland dimension. A total of 222 verbs were characterized identically by the two experts. As these verbs were unevenly distributed over the six interest dimensions, both experts were next asked to identify those 30 verbs best characterizing each interest dimension. Items with differing evaluations were discussed and were included in the pretest if a consensus could be reached. In total, 143 verbs were included in the pretest.
This initial item set was administered to 21 persons (university students and faculty members; 15 women and 6 men; M age = 27 years). Participants were asked to indicate how much they would like to engage in each activity on a 5-point scale between the poles “I am not interested in this at all; I do not enjoy doing this at all” and “I am very interested in this; I enjoy doing this very much.” Furthermore, they were asked to specify whether the item was unambiguous and were allowed to write comments to each item. A total of 31 verbs were marked as ambiguous; for example, the German word “ausdrücken” could be interpreted either as an artistic activity (“to express oneself”) or as a mechanical activity (“to squeeze something”). Item means, standard deviations, and response distributions were inspected. Items with floor effects (M < 2) or ceiling effects (M > 4) as well as ambiguous items were withdrawn from the item pool. After these analyses, 113 verbs constituted the new interest inventory.
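The pretest screening can be sketched as a simple filter that applies the floor and ceiling cutoffs stated above and drops items marked as ambiguous. The function name and the ratings are hypothetical; only the cutoffs (M < 2, M > 4) come from the text.

```python
from statistics import mean


def screen_pretest_items(responses, ambiguous=(), floor=2.0, ceiling=4.0):
    """Return the items that survive the pretest screening.

    responses: dict mapping item -> list of ratings on the 5-point scale.
    ambiguous: items that participants marked as ambiguous.
    """
    kept = []
    for item, ratings in responses.items():
        m = mean(ratings)
        if m < floor or m > ceiling or item in ambiguous:
            continue  # floor effect, ceiling effect, or ambiguous wording
        kept.append(item)
    return kept


# Hypothetical ratings, for illustration only.
pretest = {
    "item_a": [1, 2, 1],  # floor effect
    "item_b": [3, 3, 4],
    "item_c": [5, 5, 4],  # ceiling effect
    "item_d": [3, 3, 3],  # marked as ambiguous below
}
print(screen_pretest_items(pretest, ambiguous={"item_d"}))
```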
Instruments
The research version of the VIT consisting of 113 items was included in the German online test “was-studiere-ich.de” (Hell, Pässler, & Schuler, 2010) that assesses vocational interests and cognitive abilities. The self-assessment tool “was-studiere-ich.de” was designed to be an orientation help for young people exploring their career options. After completion participants receive a detailed ability and interest profile. The number of items and Cronbach’s α values for each RIASEC dimension are 18 (.91) for Realistic, 23 (.93) for Investigative, 11 (.84) for Artistic, 20 (.93) for Social, 23 (.94) for Enterprising, and 18 (.91) for Conventional.
For the analysis of the construct validity of the VIT, data from another sample that had completed the VIT research version as well as the General Interest Structure Test (“Allgemeiner Interessen-Struktur-Test;” AIST-R; Bergmann & Eder, 2005) were analyzed. The AIST-R (Bergmann & Eder, 2005) is a German interest inventory based on Holland’s RIASEC model (Holland, 1959, 1997). It consists of 60 items (10 per dimension) that describe occupational activities. Participants rate how much they are interested in the depicted activity on a 5-point Likert-type scale with the poles “I am not interested in this at all; I do not enjoy doing this at all” and “I am very interested in this; I enjoy doing this very much.” Cronbach’s α values for the six scales were .86 for Realistic, .89 for Investigative, .86 for Artistic, .83 for Social, .89 for Enterprising, and .81 for Conventional. The AIST-R’s validity is supported by its convergent and discriminant relationship with the German translation of the Self-Directed Search (Holland et al., 2004) called the Explorix (Jörin, Stoll, Bergmann, & Eder, 2003) as well as its relationship with different personality traits (Bergmann & Eder, 2005).
Additionally, participants were asked to report their satisfaction with the major they chose (All in all, how satisfied are you with your major?) as well as the fit between their interests and their major (How well does your major match your interests?). Both items were rated on a 6-point Likert-type scale with the poles very unsatisfied and very satisfied for satisfaction and not at all and perfect for match between major and interests.
Samples
A total of 1186 (849 women and 337 men) participants filled out the VIT. VIT participants were persons who visited the website “was-studiere-ich.de” to complete the ability and interest assessment and chose to additionally take the VIT. There was no extra incentive given for filling out the VIT. Their age was between 13 and 61 years with an average of 21 (SD = 6.51). Of the 1186 persons, 606 (51%) were high school students, 57 (5%) were enrolled in vocational training, 211 (18%) were college students, and 312 (26%) chose the response option “other” to indicate their current occupation.
The sample that completed VIT, AIST-R, and the measures on satisfaction with major as well as match between interests and major (“criterion sample” as opposed to the “online sample” described above) consisted of 158 German university students. Sixty-two (39%) were men and 96 (61%) were women. The participants were between 20 and 40 (M = 24.44, SD = 2.63) years old and had been studying for 3–15 terms (M = 6.74, SD = 2.22). Eight different majors were represented, the most frequent being business sciences (68, 43.0%), architecture (42, 26.6%), biology (25, 15.8%), and educational science (22, 13.9%).
In the following, first, pre-analyses conducted with the complete item pool will be described. Second, the successive reduction of this item pool to 8 items per scale using the three different construction methods will be depicted. Lastly, several approaches to validating the three VIT versions will be presented.
Pre-analyses
Item reduction
The complete data set containing 113 items was pre-analyzed. First, a principal component analysis (PCA) with varimax rotation was computed. PCA was preferred to a factor-analytic approach such as principal axis factoring since the purpose of PCA is to reduce the number of items, which was also the primary goal here, while the purpose of factor analysis is to “understand the latent factors or constructs that account for the shared variance among items” (Worthington & Whittaker, 2006, p. 818). The resulting factor loadings were examined. Items were eliminated if they had their highest factor loading on a RIASEC dimension different from the one defined a priori by the experts. Furthermore, items were eliminated if their factor loading was less than .40. This first PCA reduced the item pool to 101 items.
Next, a second PCA was performed on the remaining 101 items. Here, items were eliminated if they had substantial loadings on a second factor (>.30), except if this loading was on a neighboring RIASEC factor (e.g., R and I), since RIASEC neighbors are assumed to represent more similar interests than factors that are further apart (Holland, 1997). This resulted in a reduced item pool of 90 items.
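The elimination rules applied after the two PCAs can be written down as two small predicates. The sketch assumes that the rotated loadings of one item are available as a mapping from RIASEC dimension to loading, that absolute loadings are compared, and that neighboring dimensions are those adjacent in the circular R-I-A-S-E-C order; the function names and the loadings are hypothetical.

```python
RIASEC = "RIASEC"  # circular order of Holland's dimensions


def are_neighbors(a, b):
    """True if two dimensions are adjacent on the RIASEC hexagon."""
    d = abs(RIASEC.index(a) - RIASEC.index(b))
    return min(d, 6 - d) == 1


def keep_after_first_pca(loadings, expected, min_loading=0.40):
    """Keep an item if its highest loading is on the a priori dimension and is >= .40."""
    top = max(loadings, key=lambda dim: abs(loadings[dim]))
    return top == expected and abs(loadings[top]) >= min_loading


def keep_after_second_pca(loadings, expected, cross_loading=0.30):
    """Drop an item with a secondary loading > .30 unless it is on a neighboring dimension."""
    for dim, value in loadings.items():
        if dim == expected:
            continue
        if abs(value) > cross_loading and not are_neighbors(dim, expected):
            return False
    return True


# Hypothetical loadings of one Realistic item.
item = {"R": 0.62, "I": 0.35, "A": 0.05, "S": 0.10, "E": 0.02, "C": 0.12}
print(keep_after_first_pca(item, "R"), keep_after_second_pca(item, "R"))
```

In this example the secondary loading of .35 on Investigative is tolerated because R and I are neighbors; the same loading on Social would lead to elimination.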
Lastly, we eliminated a few more items from the item pool due to overlap in item content. For example, if several verbs had the same word stem (like “to think” and “to rethink”), only one—the one with the highest factor loading—was retained. This was the case for six verbs. Thus, the final item pool to be used for the development of the different VIT versions consisted of 84 items. Of these 84 items, the Realistic scale comprised 13 items, the Investigative scale 14 items, the Artistic scale 9 items, the Social scale 18 items, the Enterprising scale 18 items, and the Conventional scale 12 items.
Assessment of unidimensionality
The scales of the final item pool were tested for unidimensionality. Both CTT and IRT assume unidimensionality, though in CTT this assumption is rarely tested. PCAs were computed separately for the items in each scale. Parallel analysis (as described in O’Connor, 2000) was used for determining the number of components.
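The decision rule of parallel analysis can be sketched as follows. The sketch covers only the final comparison step: it assumes that the eigenvalues of the observed data and the corresponding reference eigenvalues (for example, means or upper percentiles from PCAs of random data sets with the same number of cases and items) have already been computed elsewhere. The function name and all eigenvalues are hypothetical.

```python
def n_components_parallel(observed, reference):
    """Count leading components whose observed eigenvalue exceeds the random-data reference.

    observed: eigenvalues of the real data, in descending order.
    reference: eigenvalues expected from random data, in the same order.
    Counting stops at the first component that fails the comparison.
    """
    count = 0
    for obs, ref in zip(observed, reference):
        if obs <= ref:
            break
        count += 1
    return count


# Hypothetical eigenvalues for one scale.
print(n_components_parallel([5.9, 0.8, 0.6], [1.3, 1.2, 1.1]))  # → 1
```

A result of 1 would be read as support for unidimensionality of the scale; a result of 2 or more would indicate additional components.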
The Three VIT Test Construction Strategies
Three versions of the VIT were constructed: one version according to CTT rules (VIT-Classic), another according to CTT as well but additionally selecting items that were balanced between men and women (VIT-Balanced), and a third version that was based on IRT (VIT-IRT) and for which differential item functioning (DIF) between men and women was also considered. Pre-analyses as well as analyses for the CTT versions were conducted with SPSS 18 (PASW Statistics) while WINSTEPS (Linacre, 2009a) and WINMIRA (von Davier, 2001) were used for the IRT version. All RIASEC scales were reduced to 8 items each since 8 items per scale would not make the inventory too long yet 8 items were deemed enough to cover the construct and yield scales of good reliability. An exception was made for the Artistic scale because for Artistic there were only 9 items left after pre-analyses. Thus, it was decided to remove 4 Artistic items during the construction of every VIT version, resulting in a 5-item Artistic scale. In the following, the development of each VIT version will be described.
VIT-Classic
The CTT approach was used in the development of the VIT-Classic. As is customary in CTT, the primary indices used for item selection were item difficulty and item discrimination (e.g., Ellis & Mead, 2002). After items with extreme means, item discriminations <.40, skewed distributions, and low factor loadings had been removed, there were still more than 8 items left for some scales. If this was the case, inter-item correlations were considered. For items that correlated higher than .70, only the one with the higher discrimination was retained. If a scale still contained too many items after this, as the last step, items were sorted by their discrimination value and items with the highest corrected item-total correlations were retained.
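The last two steps of this procedure, resolving highly correlated item pairs and then ranking by discrimination, can be sketched as below. The earlier screening steps (extreme means, discrimination below .40, skewness, low loadings) are assumed to have been applied already. The function name is hypothetical, as is the item “to paint” with its value; the values for “to rhyme” and “to write poetry” are those reported in the Results section.

```python
def select_classic(discrimination, inter_item_r, n_keep=8, max_r=0.70):
    """Select items by corrected item-total correlation.

    discrimination: dict mapping item -> corrected item-total correlation.
    inter_item_r: dict mapping (item_a, item_b) -> inter-item correlation.
    """
    dropped = set()
    for (a, b), r in inter_item_r.items():
        if r > max_r and a not in dropped and b not in dropped:
            # Of two highly correlated items, drop the less discriminating one.
            dropped.add(a if discrimination[a] < discrimination[b] else b)
    remaining = [item for item in discrimination if item not in dropped]
    remaining.sort(key=lambda item: discrimination[item], reverse=True)
    return remaining[:n_keep]


discrimination = {"to rhyme": 0.62, "to write poetry": 0.63, "to paint": 0.58}
inter_item_r = {("to rhyme", "to write poetry"): 0.80}
print(select_classic(discrimination, inter_item_r, n_keep=5))
# → ['to write poetry', 'to paint']
```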
VIT-Balanced
For the construction of the VIT-Balanced, items were selected as described above for the VIT-Classic. However, instead of selecting the remaining items according to their discrimination value, male and female mean values were considered. Specifically, items with the smallest mean differences between men and women were retained. Inter-item correlations were taken into account here, too. For items with a correlation higher than .70, only the one with the smaller gender mean difference was retained.
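The balancing step can be sketched analogously. The sketch computes the mean difference as the male mean minus the female mean, so that a positive value indicates stronger endorsement by men, and retains the items with the smallest absolute difference. The function names, item labels, and values are hypothetical, and the handling of highly correlated item pairs is omitted for brevity.

```python
from statistics import mean


def gender_mean_difference(male_ratings, female_ratings):
    """Male mean minus female mean; positive values mean men endorsed the item more."""
    return mean(male_ratings) - mean(female_ratings)


def select_balanced(mean_differences, n_keep=8):
    """Retain the n_keep items with the smallest absolute gender mean difference."""
    ranked = sorted(mean_differences, key=lambda item: abs(mean_differences[item]))
    return ranked[:n_keep]


# Hypothetical mean differences for four items of one scale.
differences = {"item_a": -0.57, "item_b": 0.08, "item_c": 0.21, "item_d": -0.03}
print(select_balanced(differences, n_keep=2))
# → ['item_d', 'item_b']
```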
VIT-IRT
The PCM (Masters, 1982) was chosen for item analysis in the IRT framework since the response format was polytomous. Though the PCM does not strictly assume it, with a 5-point Likert-type scale like the one employed here, only monotonically advancing response categories are reasonable. Thus, items whose threshold parameters were not ordered were removed. Furthermore, item selection was conducted by inspecting the category structure. According to Linacre (2009b), there should be at least 10 observations per category, the average measures should advance clearly, and outfit mean squares should be near 1 (and less than 2; Linacre, 1999). Next, the category probability curves were taken into consideration. Items that did not have a distinct peak for every category in the category probability curves were deleted. Additionally, differential item functioning (DIF) analyses were conducted for men and women. Here, the classification system used by Educational Testing Service (ETS) was applied, though logits were used as the measurement unit instead of Delta units. According to this classification, a DIF contrast of less than 0.43 logits is negligible, a DIF contrast between 0.43 and 0.64 logits is slight to moderate, and a DIF contrast above 0.64 is moderate to large (Linacre, 2009b). Therefore, items that displayed a DIF contrast above 0.43 were removed from the item pool. If there were still more than 8 items left in a scale (or more than 5 in the case of the Artistic scale) after these criteria were applied, threshold parameters were reexamined and items for which the thresholds had the smallest distance were removed.
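Two of the screening criteria above lend themselves to a short sketch: the check for ordered threshold parameters and the classification of DIF contrasts in logits. The cutoffs are those given in the text; how a contrast of exactly 0.43 or 0.64 is classified is an assumption of this sketch, as are the function names and example values.

```python
def thresholds_ordered(thresholds):
    """True if every threshold parameter is larger than the preceding one."""
    return all(a < b for a, b in zip(thresholds, thresholds[1:]))


def classify_dif(contrast):
    """Classify the absolute DIF contrast (in logits) into the three size classes."""
    size = abs(contrast)
    if size < 0.43:
        return "negligible"
    if size <= 0.64:
        return "slight to moderate"
    return "moderate to large"


def keep_item(thresholds, dif_contrast):
    """Keep an item only if its thresholds are ordered and its DIF is negligible."""
    return thresholds_ordered(thresholds) and classify_dif(dif_contrast) == "negligible"


# Hypothetical items: (threshold parameters, DIF contrast men vs. women).
print(keep_item([-1.2, -0.3, 0.4, 1.1], 0.12))   # → True
print(keep_item([-1.2, 0.4, -0.3, 1.1], 0.12))   # → False
print(keep_item([-1.2, -0.3, 0.4, 1.1], -0.51))  # → False
```

The remaining criteria (observations per category, advancing average measures, outfit mean squares, and distinct peaks in the category probability curves) depend on the output of the estimation software and are not covered here.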
Validation of the Three VIT Versions
Construct validity
To examine convergent and discriminant validity, AIST-R and VIT scores were correlated. Correlations between the same RIASEC factors measured by different instruments were expected to be high while correlations between distinct RIASEC factors measured by different instruments were expected to be low.
Criterion validity
For each VIT version, we computed Holland-codes for every participant using the first, second, and third highest scores on the RIASEC scales. For VIT-Classic and VIT-Balanced, these were based on the sum scores. For VIT-IRT, Holland-codes were derived from the Weighted Likelihood Estimates (WLE; Warm, 1989) for person parameters computed by WINMIRA (von Davier, 2001). With VIT-Classic and VIT-Balanced, the raw sum scores for the criterion sample were z-standardized using the mean and standard deviation of the online sample. Regarding VIT-IRT, the WLE person parameters were transformed into z scores using the mean and standard deviation of the WLE person parameters obtained in the online sample. (The expressions “Holland-code” and “three-letter-code” are used interchangeably throughout this article.) The AIST-R manual contains a register with three-letter-codes for a variety of occupations. This was used to obtain the Holland-codes for the four majors mainly represented in the criterion sample. The Holland-codes are E-C-I for business sciences, I-A-R for architecture, I-A-S for biology, and S-A-E for educational science.
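The derivation of a three-letter-code can be sketched as follows, assuming that the code is read off the z-standardized scale scores and that ties are resolved in R-I-A-S-E-C order (the text does not state a tie rule). The function names, the reference statistics, and the scores are hypothetical.

```python
def z_standardize(score, reference_mean, reference_sd):
    """Standardize a score against the mean and SD of a reference (norm) sample."""
    return (score - reference_mean) / reference_sd


def holland_code(z_scores):
    """Return the three-letter-code formed by the three highest scale scores.

    z_scores: dict mapping 'R', 'I', 'A', 'S', 'E', 'C' -> z score.
    Ties keep the order in which the scales appear in the dict.
    """
    ranked = sorted(z_scores, key=lambda dim: z_scores[dim], reverse=True)
    return "-".join(ranked[:3])


# Hypothetical z scores of one participant.
scores = {"R": -0.5, "I": 1.2, "A": 0.8, "S": 0.1, "E": -1.0, "C": 0.3}
print(holland_code(scores))  # → I-A-C
```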
Person–environment fit
The C-Index (Brown & Gore, 1994) was computed as a measure of the congruence between the test-takers’ three-letter code and their major’s (environmental) three-letter-code. Higher C-Index scores (range 0–18) indicate greater congruence. For each VIT version, a separate C-Index was calculated using the version-specific Holland-codes as well as the occupational Holland-codes from the AIST-R manual for the majors. The mean C-Indexes were then compared between the three VIT versions, with higher scores supporting the criterion validity of the instrument with which they were computed.
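A minimal sketch of the C-Index computation is given below. It uses the usual formulation C = 3·X1 + 2·X2 + 1·X3, where each Xi compares the letters in position i of the two codes and equals 3 for identical letters, 2 for adjacent, 1 for alternate, and 0 for opposite positions on the RIASEC hexagon. The function names are illustrative assumptions; the codes in the example are those listed above for the majors.

```python
RIASEC = "RIASEC"  # circular order of Holland's dimensions


def hexagon_similarity(a, b):
    """3 = identical, 2 = adjacent, 1 = alternate, 0 = opposite hexagon positions."""
    d = abs(RIASEC.index(a) - RIASEC.index(b))
    return 3 - min(d, 6 - d)


def c_index(person_code, environment_code):
    """C-Index for two three-letter-codes such as 'I-A-R' (hyphens are optional)."""
    person = person_code.replace("-", "")
    environment = environment_code.replace("-", "")
    weights = (3, 2, 1)  # the first letter carries the most weight
    return sum(
        w * hexagon_similarity(p, e)
        for w, p, e in zip(weights, person, environment)
    )


print(c_index("I-A-R", "I-A-R"))  # → 18 (perfect congruence)
print(c_index("S-E-C", "I-A-R"))  # → 7
```

The second call corresponds to a hypothetical participant with Social, Enterprising, and Conventional interests who studies architecture.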
Furthermore, the relationship between VIT scores and two criteria, satisfaction with major and match between major and interests, was investigated. The relationship between the different VIT versions and satisfaction with major was examined by correlating the satisfaction score with the RIASEC dimensions that comprised the Holland-code of a specific major for participants in that specific major for each of the VIT versions. For instance, architecture majors who have high scores on the three dimensions that make up architecture’s three-letter-code (I-A-R) should be more satisfied than participants whose interests are mainly in the S, E, and C domains. Lastly, the participants’ self-perceived match between their interests and their major was analyzed by correlating the participants’ rating of the interests-major match with their scores on the different VIT versions separately for the four different majors. As above with satisfaction, it was expected that correlations would be highest for interest dimensions that corresponded to the major’s Holland-code.
Results
Assessment of Unidimensionality
Parallel analysis was used to derive the number of components from the eigenvalues extracted in PCAs. For Investigative, Enterprising, and Conventional, parallel analyses resulted in one component, indicating that for these three scales the assumption of unidimensionality holds. For Realistic, Artistic, and Social two components were derived for each scale. Strictly speaking, this means that unidimensionality is violated in these scales. However, drawing upon Stout’s (1987, 1990) concept of essential unidimensionality, major latent dimensions should be distinguished from minor latent dimensions. Essential unidimensionality holds when exactly one major dimension exists. For Realistic and Social, the difference between the first eigenvalue and the second eigenvalue was large (ΔR = 5.11; ΔS = 6.94), indicating that these two scales can be considered essentially unidimensional. For Artistic, the difference ΔA was only 2.69, which cannot be considered supportive of essential unidimensionality. However, Embretson and Reise (2000) state that minor violations of unidimensionality do not affect parameter estimation.
In the following, each VIT version will be described. Table 1 contains the item pools of each version as well as the relevant statistics. For VIT-Classic, means and item discriminations are depicted, for VIT-Balanced, mean differences between men and women are depicted, and for VIT-IRT, item parameters are depicted. For items that are part of more than one VIT version, all the relevant values are listed while for items that are only included in one VIT version only the values that are relevant to that version are shown in Table 1.
Statistics for the Three VIT Versions
Note: IRT = item response theory; VIT = Verb Interest Test. For each scale, the items that are part of at least one VIT version are depicted. For items in the VIT-Classic, the mean and corrected item–total correlation are shown. For items in the VIT-Balanced, the mean difference between males and females is reported. For items in the VIT-IRT, the item location (interpretable as the item difficulty) is depicted.
VIT-Classic
During item selection, only one item (“to translate,” part of Artistic) was deleted due to its discrimination value being lower than .40 (.36). Several items were removed because of their extreme means: “to wallpaper” (M = 1.77), “to mow” (M = 1.60), “to reflect” (M = 4.23), and “to listen” (M = 3.90). “To decompose” was eliminated on account of its high factor loadings on Realistic (.57) as well as Investigative (.44). “To rhyme” and “to write poetry” correlated highly (r = .80), thus it was decided to retain only “to write poetry” since it had a higher item discrimination compared to “to rhyme” (.63 vs. .62).
All corrected item–total correlations were greater than .40 (the lowest being .48 for “to plant”), which indicates that all the selected items were able to discriminate between persons with high and low trait values (see Table 1). Means were highest for the Investigative scale (on average 3.60). As can be seen in Table 2, reliabilities for the six RIASEC scales in the VIT-Classic were consistently high (between .83 for Artistic and .91 for both Social and Enterprising).
Reliabilities of the RIASEC Scales for the Three VIT Versions
Note: IRT = item response theory; VIT = Verb Interest Test. For VIT-Classic and VIT-Balanced, Cronbach’s α values are reported and for VIT-IRT Andrich’s reliabilities are reported.
VIT-Balanced
Table 1 contains the VIT-Balanced item pool with the mean differences between women and men. A positive mean difference indicates that men endorsed an item more strongly than women did. Conversely, a negative mean difference shows that women on average chose a higher response category for an item than men. Items from the original VIT item pool were removed (after taking CTT statistics into account) according to the size of their mean difference between men and women. Large mean differences were, for example, observed for “to empathize” (ΔM = −0.57) and “to attend to someone” (ΔM = 0.56). Therefore, these items were eliminated.
For the items remaining in the VIT-Balanced item pool, mean differences were largest for two items on the Artistic scale, namely “to draw” (ΔM = −0.62) and “to create” (ΔM = −0.50). Across all items in the VIT-Balanced, mean gender differences were largest for Realistic (M = 0.23) and Artistic (M = 0.28) and medium for Social (M = 0.16) and Conventional (M = 0.08). The smallest mean differences were found for items on the Investigative (M = 0.05) and Enterprising scale (M = 0.06). Scale reliabilities decreased compared to VIT-Classic but were still around .85 for all scales (see Table 2).
VIT-IRT
Table 1 shows the item parameters for the PCM. The item location is the average of the four threshold parameters and can be interpreted as the item difficulty. Four items showed disordered threshold parameters: “to mow” (R), “to reflect” (I), “to sing” (A), and “to dance” (A) and were thus removed.
“To vet” was deleted because the average measure of Category 5 was smaller than the average measure of Category 4. The item “to dig over” was removed because its outfit mean square for Category 4 exceeded 2. No further items had to be removed due to the assessment of the category structure. Several items were eliminated because their category probability curves did not show a distinct peak for every category, for example, “to recognize” and “to cash.” Thus, these two items contained a response category that did not cover a unique section of the trait continuum and were removed from the item pool. As depicted in Table 2, Andrich’s reliabilities (which can be interpreted analogously to Cronbach’s α) for the six scales in the VIT-IRT were between .66 (Artistic) and .85 (Social and Enterprising).
As summarized in Table 3, 11 items showed slight to moderate DIF. The Enterprising and Conventional scales did not contain any items that showed more than negligible differential functioning (DIF contrast < .43). Five of the items with a DIF contrast above .43 favored men (Realistic and Investigative items) while six items favored women (Artistic and Social items). In the case of a polytomous rating scale as applied here, an item favoring men can be interpreted as one in which women need a higher trait measure to endorse the same response category as men with a lower trait level. In other words, it is easier for men to endorse a higher response category compared to women of the same trait level. All differentially functioning items were removed from the VIT-IRT item pool. Since DIF can be a cause for multidimensionality (Edelen & Reeve, 2007), dimensionality was reassessed after removal of DIF items. This time, parallel analyses resulted in one component for each RIASEC dimension, indicating that the multidimensionality found during pre-analyses in Artistic, Realistic, and Social was indeed due to differential item functioning for men and women.
Items Showing Differential Item Functioning for Gender
Note: DIF = differential item functioning. A positive DIF contrast means that an item favors men (is easier to endorse for men); a negative DIF contrast signals that an item favors women. Only DIF contrasts with an absolute value above .43 are depicted.
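The screening rule for differential item functioning can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used to produce Table 3: the item labels and contrast values are hypothetical, the function names are invented for the example, and treating a contrast of exactly .43 as negligible is an assumption, since the text only distinguishes values below and above the cutoff.

```python
DIF_THRESHOLD = 0.43  # cutoff for more-than-negligible DIF reported in the text


def classify_dif(contrast, threshold=DIF_THRESHOLD):
    """Classify one item by the size and sign of its DIF contrast."""
    if abs(contrast) <= threshold:  # assumption: exactly .43 counts as negligible
        return "negligible"
    # Positive contrasts favor men, negative contrasts favor women.
    return "favors men" if contrast > 0 else "favors women"


def screen_items(contrasts, threshold=DIF_THRESHOLD):
    """Split items into a retained list and a dict of removed items."""
    retained, removed = [], {}
    for item, contrast in contrasts.items():
        label = classify_dif(contrast, threshold)
        if label == "negligible":
            retained.append(item)
        else:
            removed[item] = label
    return retained, removed


# Hypothetical DIF contrasts in logits (not the values from Table 3).
contrasts = {"item_a": 0.61, "item_b": -0.52, "item_c": 0.10, "item_d": -0.43}
retained, removed = screen_items(contrasts)
print("retained:", retained)
print("removed:", removed)
```

The sign convention follows the table note: a positive contrast marks an item that is easier for men to endorse, a negative contrast one that is easier for women to endorse.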
In comparing the three VIT versions, note that there is a substantial amount of item overlap between them (see Table 1). For Realistic, five items (“to carve,” “to drill,” “to glue,” “to plant,” and “to repair”) belong to the Realistic scale of every VIT version. For Investigative and Artistic, two items overlap; for Social and Enterprising, one item overlaps; and for Conventional, four items overlap. Reliabilities are higher in the VIT-Classic than in the VIT-Balanced and the VIT-IRT, which is not surprising since item selection for the VIT-Classic focused on item discrimination values. Despite this focus on item discrimination in the VIT-Classic’s item selection process, construct coverage of Holland’s RIASEC types appears to have been achieved in all three VIT versions without noticeable differences between the versions. The same RIASEC dimensions correlated strongly between VIT versions (from r = .75 for Artistic for VIT-Classic with VIT-IRT to r = .95 for Realistic for VIT-Balanced with VIT-IRT), while correlations between different RIASEC dimensions were generally low (mostly between .10 and .20).
Validation of the Three VIT Versions
Construct validity
To examine the construct validity of the three VIT versions, correlations between the AIST-R and each VIT version were computed. These correlations are displayed in Table 4. Correlations between the same dimensions were highest for all three versions for Enterprising (Mr = .70) and Social (Mr = .70), and mean correlations were in the range of .65–.69 for Realistic, Investigative, Artistic, and Conventional, thus supporting convergent validity. Correlations between different dimensions were generally low and supportive of discriminant validity, with the exception of the correlation between Realistic in the VIT and Investigative in the AIST-R (r = .43 for VIT-Classic and VIT-Balanced, r = .42 for VIT-IRT, p < .01) as well as the correlation between Investigative in the VIT and Artistic in the AIST-R (r = .39 for VIT-Classic, r = .38 for VIT-Balanced, r = .32 for VIT-IRT, p < .01). Construct validity differed between men and women across the six dimensions. For Realistic and Enterprising, correlations between VIT and AIST-R were higher for men than for women in all three versions, whereas for Investigative and Social, correlations were consistently higher for women than for men (for Investigative and Social all p < .05). Concerning Artistic, construct validity was higher for women than for men in VIT-Classic and VIT-Balanced, while it was higher for men than for women in VIT-IRT. With regard to Conventional, construct validity was higher for women in VIT-Classic and higher for men in VIT-Balanced and VIT-IRT.
Correlations Between the AIST-R and the Three VIT Versions
Note: AIST = Allgemeiner Interessen-Struktur-Test; IRT = item response theory; VIT = Verb Interest Test. R = Realistic, I = Investigative, A = Artistic, S = Social, E = Enterprising, C = Conventional. Correlations supporting convergent validity are in boldface.
* p < .05.
** p < .01.
Criterion validity
To assess the criterion validity of the different VIT versions, three criteria were examined: congruence between interests and major, satisfaction with major, and self-rated match between interests and major. The corresponding correlations were computed using individual Holland codes based on participants’ scores in the three VIT versions.
Person-environment fit
Brown and Gore’s (1994) C-Index was computed for the different VIT versions. Observed C-Index scores ranged from 1 to 18 for all VIT versions. Means, standard deviations, and intercorrelations as well as correlations with self-rated satisfaction with major are reported in Table 5. Means and standard deviations were similar for the three VIT versions and not particularly high. None of the C-Indices correlated significantly with satisfaction.
Means, Standard Deviations, and Intercorrelations for C-Indices and Satisfaction With Major
Note: IRT = item response theory; VIT = Verb Interest Test. C-Index can reach values between 0 and 18. Satisfaction was rated on a Likert-type scale ranging from 1 to 6.
** p < .01.
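For readers unfamiliar with the congruence measure, the C-Index can be sketched as below. The sketch follows the published definition by Brown and Gore (1994), C = 3(X1) + 2(X2) + 1(X3), where each X compares the letters in the same position of the person's and the environment's three-letter Holland codes and scores 3 for identical, 2 for adjacent, 1 for alternate, and 0 for opposite positions on the RIASEC hexagon. It is not the authors' code; the function names and the person codes in the example are hypothetical.

```python
RIASEC = "RIASEC"  # order of the types around Holland's hexagon


def hexagon_agreement(a, b):
    """Score 3 (identical), 2 (adjacent), 1 (alternate), or 0 (opposite)."""
    steps = abs(RIASEC.index(a) - RIASEC.index(b))
    distance = min(steps, 6 - steps)  # shortest way around the hexagon, 0 to 3
    return 3 - distance


def c_index(person_code, environment_code):
    """C-Index for two three-letter Holland codes; ranges from 0 to 18."""
    if len(person_code) != 3 or len(environment_code) != 3:
        raise ValueError("Both Holland codes must have exactly three letters.")
    weights = (3, 2, 1)  # first letter counts most, third letter least
    return sum(
        weight * hexagon_agreement(p, e)
        for weight, p, e in zip(weights, person_code, environment_code)
    )


# I-A-S is the biology code given in the text; the person codes are hypothetical.
for person in ("IAS", "RIA", "ECR"):
    print(person, "vs IAS:", c_index(person, "IAS"))
```

A perfect match yields the maximum of 18, and a code whose three letters all lie opposite the environment's letters yields the minimum of 0, which is the range stated in the table note.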
Participants’ satisfaction with their major was correlated with their VIT scores in the dimensions that comprise their major’s Holland code. Table 6 shows these correlations separately for women and men for architecture and business sciences. For biology and educational science, correlations were computed with the whole subsample due to small sample sizes (N = 24 for biology and N = 22 for educational science). For business sciences, correlations were mainly low and not significant, with the exception of the correlation between Conventional in the VIT-Balanced and satisfaction for male participants (r = −.42, p = .04). For architecture majors, the correlation between VIT score and self-reported satisfaction was largest for Realistic in the male subsample for the VIT-Classic and for Investigative in the female subsample across all VIT versions (see Table 6). For biology majors (I-A-S), the highest correlations were obtained for Investigative consistently across VIT versions. For educational science (S-A-E), the highest correlation between VIT scores and satisfaction occurred on the Artistic scale in the VIT-Classic (r = .51, p = .02).
Correlations Between Satisfaction and VIT Scores for Four Different Majors
Note: IRT = item response theory; VIT = Verb Interest Test. R = Realistic, I = Investigative, A = Artistic, S = Social, E = Enterprising, C = Conventional. Nm = sample size men, Nf = sample size women. For biology and educational science, correlations are reported for the whole subsample.
* p < .05.
** p < .01.
Furthermore, the correlation between participants’ VIT scores and their self-rated match between their own interests and their major was examined to assess the criterion validity of the three VIT versions. Table 7 depicts these correlations separately for the three VIT versions and the four different majors. For business sciences majors, only the VIT-Classic and the VIT-IRT showed a significant correlation between self-perceived match and VIT score on the Enterprising scale, and only for the female subsample (r = .35, p = .02 for VIT-Classic and r = .33, p = .03 for VIT-IRT). For architecture majors, the only significant correlation between self-perceived match and VIT score was obtained for Investigative for the male subsample in the VIT-Balanced. Concerning biology (I-A-S), only the VIT-Classic yielded the expected high correlation between Investigative and self-perceived match (r = .57, p < .01) for men and women together (see Table 7).
Correlations Between Match Between Major and Interests and VIT Scores for Four Different Majors
Note: IRT = item response theory; VIT = Verb Interest Test. R = Realistic, I = Investigative, A = Artistic, S = Social, E = Enterprising, C = Conventional. Nm = sample size men, Nf = sample size women. For biology and educational science, correlations are reported for the whole subsample.
* p < .05.
** p < .01.
Discussion
Although different methods were used for item selection, the item pools of the three VIT versions show some overlap, most notably for Realistic, where five items are part of all versions, and Conventional (four items). For VIT-Classic and VIT-Balanced, this is expected since both were constructed within the CTT framework. The substantial overlap in these two dimensions with the VIT-IRT may be due to the nature of the verbs in these scales. Verbs describing Realistic and Conventional interests (e.g., “to glue” or “to file”) indicate an activity more specifically and concretely than verbs in the other scales, which appear to be more ambiguous (e.g., “to educate” or “to influence”).
Results concerning criterion validity were mixed. Across the four majors, the three VIT versions did not differ systematically in the correlations of VIT scores with satisfaction or with the match between interests and major. The clearest result supporting criterion validity was the consistently high correlation between satisfaction with major and VIT score for Investigative in the biology subsample and the female architecture subsample (for both majors, Investigative is the first letter of the Holland code).
Construct validity was supported for all three VIT versions by correlations between the VIT versions and the AIST-R (Bergmann & Eder, 2005), though there were differences in construct validity for men and women. On the dimensions Realistic and Enterprising, correlations between VIT and AIST-R were higher for men than for women, whereas the dimensions Investigative and Social showed the opposite pattern. Furthermore, all three versions showed unidimensionality in parallel analyses undertaken after item selection. Therefore, it can be assumed that VIT-Classic, VIT-Balanced, and VIT-IRT assess only one underlying construct in each scale.
In the framework of the opportunity approach to validation, the construction strategies implemented in VIT-Balanced and VIT-IRT seem most useful because both reduce gender differences at the item level: the VIT-IRT item pool contains no DIF items (according to the ETS classification system), and the VIT-Balanced contains those items with the smallest mean differences between men and women. Thus, assuming that eliminating gender differences at the item level facilitates occupational exploration, both VIT-IRT and VIT-Balanced permit test-takers to explore career options that are based less on traditional gender roles than is the case with interest inventories containing gender differences.
Limitations
The criterion sample was rather small (N = 158), especially when split into four majors. The low correlations found here could indicate variance restriction due to selection effects. A further valuable test of an instrument’s quality, as recommended by the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999), is to investigate whether differential prediction occurs for men and women. This could not be undertaken due to the small sample size.
Concerning the usefulness of verbs for the assessment of vocational interests, the advantage of being context-independent appears problematic with some verbs in the sense that they are too unspecific. That is, with some verbs, interpretations of the activity indicated may differ. For instance, “to observe” could be interpreted as meaning the observation of things, for example, during an experiment. Other test-takers might interpret it as observing people performing drama, in which case it would fit better on the Artistic scale than the Investigative scale where it is situated now. Thus, further research on using verbs in vocational interest inventories should address the issue of clarity of interpretation and allocation to a specific interest dimension.
Conclusion
Although the VIT versions constructed in this study did not differ substantially, developing an inventory strictly following CTT rules, as implemented here in the VIT-Classic, cannot be recommended since potentially existing gender bias might remain at the item level. Gender differences at the item level can be addressed by test construction in the CTT framework, as seen here in the VIT-Balanced. However, this strategy might lead to the elimination of items that are important in terms of construct coverage while still ignoring possible gender bias at the item level. Gender bias at the item level can be investigated using an IRT approach to test development that includes analyses of differential item functioning. On items that function differentially for gender, men and women show different probabilities of endorsing an item after they have been matched on the underlying construct (Holland & Wainer, 1993). Allowing DIF items to enter an interest inventory can counteract the goal of maximizing occupational exploration since items with gender-stereotypic connotations might differentially favor men or women, leading to higher endorsement rates in one group even though both groups have similarly high interests in the domain being assessed. Furthermore, the IRT approach yields more precise and item-independent trait estimates (including an individual instead of a group-based estimate of measurement error) compared to the sum scores used in CTT. In sum, of the three construction strategies, test development according to IRT is the most recommendable both from a counselor’s viewpoint (maximizing exploration) and from a psychometrician’s viewpoint (precise trait estimation).
Footnotes
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
The author(s) disclosed receipt of the following financial support for the research and/or authorship of this article: This research was supported by the German Federal Ministry of Education and Research and the European Social Fund of the European Union (grant agreement number: 01FP0930).
