Evaluating the Mathematics Interest Inventory Using Item Response Theory

Abstract

As the United States has begun to lag behind other developed countries in performance on mathematics and science, researchers have sought to explain this with theories of teaching, knowledge, and motivation. We expand this examination by further analyzing a measure of interest that has been linked to student performance in mathematics and intention to enroll in advanced mathematics classes. Using Item Response Theory (IRT), the purpose of this study was to evaluate the psychometric properties of the Mathematics Interest Inventory (MII). Six hundred sixty-six students (47.6% Hispanic, 48.2% Male) formed the sample. Results revealed minimal differential item functioning (DIF) between genders. Substantial DIF emerged between Hispanic and White students. Implications of the differential functioning and the measurement of interest in mathematics are discussed.

Keywords

mathematics interest Item Response Theory differential item functioning gender difference ethnic difference Hispanic

Over the last few decades, students’ performance in mathematics and science in the United States has begun to lag behind their international peers (Fleischman, Hopstock, Pelczar, & Shelley, 2010; Provasnik et al., 2012; Shorrocks-Taylor & Jenkins, 2002). While many may attribute this to poor teaching or a lack of knowledge and focus on pedagogical reform (e.g., Seymour, 2002; Smith, Sheppard, Johnson, & Johnson, 2005), others hold a motivational perspective. In his seminal work on social cognitive theory, Bandura (1986) stated, “People do not enact everything they learn. They may acquire and retain the capabilities to execute modeled activities adeptly but rarely or never perform them” (p. 68). The motivation to act on one’s knowledge is what distinguishes knowing and performing (Bandura, 1986). To facilitate student performance in areas of science, technology, engineering, and mathematics (STEM), programs have been designed to explicitly focus on STEM-related careers, and educational foci have sought to instill and capture students’ interest in STEM as early as elementary school (DeJarnette, 2012; Russell, Hancock, & McCullough, 2007). According to Lent, Brown, and Hackett (1994), interest plays a major role in connecting self-beliefs to performance and occupational decisions (see also Brown & Lent, 2006), including the decisions to enter a STEM profession (Lent et al., 2001; Lent, Lopez, Lopez, & Sheu, 2008). To see students’ self-beliefs and interests lead to successful endeavors in coursework and occupations in STEM fields, researchers need to be confident in the measurement devices and implications of the results that educators and policy makers base their decisions on.

Most researchers tend to rely on self-report instruments as a way to measure personal factors (e.g., self-efficacy, interest, affect). However, self-report instruments need to be used with caution as they tend to be biased. We need to be constantly aware that the implications we construct based on the results of potentially biased and faulty measures may be incorrect. Apart from the obvious issue of social desirability (Spector, 1994), respondents need to be adequately and accurately aware of their knowledge (i.e., metacognition) and abilities (Schunk, Pintrich, & Meece, 2008). The socialization of individuals may also have an influence on an individual’s self-beliefs (Bandura, 1986). For example, past research on gender and STEM fields have shown that there is a difference in the levels of interest between males and females in mathematics (Evans, Schweingruber, & Stevenson, 2002; Köller, Baumert, & Schnabel, 2001; Marsh, Trautwein, Lüdtke, Köller, & Baumert, 2005; Renninger, 1992). A prominent question would be, is this a function of true interest, social manipulation, or of bias in the measurements? In addition, the socialization argument extends to ethnically and linguistically diverse individuals. Because of differences in language use or level of communication, the wording within instruments may threaten its validity (e.g., Church, 2001; Ramirez, Teresi, Holmes, Gurland, & Lantigua, 2006). As important conclusions are drawn from the results of these measurement devices, it is important that researchers thoroughly test and modify their instruments to not only accommodate the diversity of respondents but also place restrictions on use so that researchers cannot make inaccurate conclusions about subgroups within a population.

There are multiple reasons why researchers may decide to develop and utilize new instruments, including a lack of measurement devices for a certain construct, inappropriate construct conceptualization in previous measurements, or an inaccurate measurement scheme. In response to the call for better measurements of mathematics interest, Stevens and Olivárez (2005) developed the Mathematics Interest Inventory (MII). The MII was developed with situational and individual interest in mind, focusing explicitly on emotion, knowledge, and value. In the construction of the MII, items were worded to facilitate young respondents. That is, the items were created to reflect the language and behavior relevant to fourth graders. The validation of the instrument occurred in three phases, utilizing three distinct samples. The first phase consisted of an exploratory factor analysis of the original 62 items on the first sample (n = 170), which resulted in a three-factor (i.e., Positive Valence [PV], Negative Valence [NV], Time) solution. The second phase consisted of fitting the second sample of 282 students to a confirmatory factor analysis based on the results of the first phase and resulted in an adequately fitting model after a few items were removed due to poor loadings. The final phase utilized the third sample of 272 students to evaluate the application of the model to older children. The confirmatory factor analyses (CFA) model retained adequate fit even with an older student sample. Significant, positive correlations with students’ current mathematics grades of the students across the phases suggested good concurrent validity. Subsequent studies examining the MII and achievement have demonstrated consistently high correlations with other motivational factors (e.g., self-efficacy, self-determination) and achievement (Stevens, Olivárez, & Hamman, 2006). In addition, these studies have also demonstrated the predictive ability of mathematics interest (via MII) as a full moderator of the relationship between mathematics self-efficacy and the intentions to enroll in more mathematics courses during high school (Stevens, Wang, Olivárez, & Hamman, 2007). Across all of these studies, the MII has shown to have good internal reliabilities (Cronbach’s α ≥ .70; Kline, 2000), with one exception that the Time subscale fell below this cut-off in one sample of the instrument validation study (α = .57; Stevens & Olivárez, 2005). Overall, the instrument appears to be a psychometrically sound measure of mathematics interest; however, with a focus on gender and ethnic differences in STEM fields, more information is needed.

Using the MII, Stevens and colleagues (Stevens et al., 2006; Stevens et al., 2007) found significant mean score differences between Hispanic and White students, yet a non-significant mean score difference between boys and girls. Although it might be of interest to know that mean score differences between boys and girls were not significant or that Hispanic students had significantly higher levels of mathematics interest, it still fails to explain where the discrepancies occurred or if these occurred as a function of true mathematics interest or some other barrier introduced via instrument. Further analysis is needed to determine if the non-significant difference between boys and girls is truly a function of similar mathematics interest or of compounded error in a composite score. In addition, further analysis is needed to determine whether the significant mean differences between Hispanic and White students is reflective of their mathematics interest or if some underlying bias influenced different responses.

The purpose of the current study was to evaluate the item-level characteristics of the MII using Item Response Theory (IRT; Lord & Novick, 1968). IRT has emerged to address the well-documented shortcomings of classical test theory (CTT), offering three primary advantages: (a) the estimate of an examinee’s latent trait is independent of the particular sample of items that are administered, (b) the item characteristics are independent of the particular sample of respondents, and (c) a statistic indicating the precision with which each respondent’s latent trait as estimated is provided, and this statistic is free to vary among respondents (Hambleton & Swaminathan, 1985). While a theory is understood as a scientifically acceptable set of principles advanced to explain and predict a phenomenon, IRT is in fact a system of mathematical models that defines one way of establishing the correspondence between latent constructs and their indicators (de Ayala, 2009). In the current study, we adopted Samejima’s (1969) graded response model among the available IRT models because it is appropriate for handling the situation where item responses are contained in two or more ordered categories.

The fairness of the MII is of particular interest to us given that certain subgroups (i.e., girls, Hispanic students) earned statistically equal or higher scores on the scale than their counterparts (Stevens et al., 2006; Stevens et al., 2007), despite the findings that women and ethnic minorities showed lower intention to take advanced mathematics courses or enter STEM careers (e.g., Beasley & Fischer, 2012; Deacon, 2011; Shapiro & Williams, 2012). An instrument is not fair when two groups of an equal amount of a latent trait earn different scores on the same item (Gall, Gall, & Borg, 2006); such measurement bias is also termed as differential item functioning (DIF; Holland & Wainer, 1993). Taken into the current context, DIF analyses are warranted because any group difference across gender or ethnicity captured by the MII may be a function of measurement bias, rather than true differences in the latent trait.

The following research questions were to be addressed:

Research Question 1: From the IRT perspective, does the MII measure children’s positive affect, negative affect, and time investment with respect to mathematics with sufficient precision and utility?

Research Question 2: Do items of each MII subscale demonstrate significant DIF across gender and ethnicities?

These research questions entailed a comprehensive review of the item-level IRT parameters (i.e., item discrimination and thresholds) and measurement bias (i.e., DIF). Answers to these questions could guide the future use of the MII by providing recommendations on the selection of a set of items that are most informative and least biased in the subscale level.

Method

Participants

The data for the current study were drawn from Stevens et al.’s (2006) study on the mathematics achievement gap between Hispanic and White students. A total of 666 participants were recruited from students attending schools in west and south Texas. The overall percentage of missing data was limited to 0.4% of all data cells, with percentages ranging from 0% (n = 0) to 1.0% (n = 7) by individual item. Table 1 summarizes the demographic information of the participants by grade level.

Table 1.

Demographic Information of Participants (N = 666).

	Hispanic		White		African American		Asian American		Native American
Grade level	Female	Male	Female	Male	Female	Male	Female	Male	Female	Male
4th	46	47	14	17	1	0	1	1	1	1
5th	1	0	0	0	0	2	0	0	0	0
7th	0	0	0	1	0	0	0	0	0	0
8th	108	87	82	77	14	11	2	0	3	3
9th	1	3	24	16	0	0	0	0	0	0
10th	6	10	24	25	2	2	0	0	1	0
11th	2	2	3	8	0	0	0	0	1	0
12th	2	0	0	1	0	0	0	0	0	0

Note. Participants’ ages ranged from 8 to 18 years with a mean of 13.31 years old (SD = 2.01). For the demographic information, missing data are limited to 0.5% on gender (n = 3), 1.5% on ethnicity (n = 10), and none on grade level.

Measures

The traditional approach to gauging interest by simply asking the question “are you interested?” has been criticized for overestimating people’s awareness of their own level and degree of interest. Particularly, a person hardly uses the term interest in its more general shared meaning beyond its common-language meaning in everyday life (Valsiner, 1992). The MII was developed to address this issue by assessing interest through simple behavioral indicators (Stevens & Olivárez, 2005). The MII consists of 27 items gauging three factors of mathematics interest: PV, NV, and Time. PV is assessed by 10 items and reflects the degree to which students report a positive attraction toward mathematics (e.g., “I like to answer questions in math class.” “I feel good when it comes to working on math.”). NV is assessed by 10 items and related to negative experiences associated with mathematics (e.g., “I am bored when working on math.” “I get mad easily when working on math.”). Time is assessed by seven items and reflects the amount of time and effort respondents commit to mathematics (e.g., “I work more math problems than what I have to.” “I spent many hours working on math.”). Respondents are asked to rate how well each item described them on a response scale (1 = not at all like me, 4 = very much like me). The average internal consistency reliabilities (Cronbach’s α) were reported to be .93 for PV, .80 for NV, and .69 for Time in the instrument validation study (Stevens & Olivárez, 2005). Although it seems counterintuitive that negative affect is indicative of interest, NV was shown to have significant, positive correlations with PV in this validation study. This finding actually has its theoretical basis that the presence of interest is not always accompanied with positive feelings, but sometimes negative experiences as well (Hidi & Anderson, 1992; Iran-Nejad, 1987; Renninger & Hidi, 2002). The total score of the MII is therefore computed by adding up item scores, thus ranging from 27 to 108. The MII has been administered in the English version only.

Analyses

A major assumption of IRT analyses is unidimensionality, which refers to the notion that a set of items measures a single latent construct (Lord & Novick, 1968). To test this assumption, CFAs were performed on the three-factor model in Mplus v. 6.0 (Muthén & Muthén, 2010) using weighted least square mean-and-variance adjusted (WLSMV) χ² test statistic estimation. Chi-square (χ²) tests of model fit were referred to along with approximate fit indices (AFIs) for determining acceptable model fit. Due to the well-known criticism that the χ² test is severely sensitive to sample size (Schmitt, 2011), the following criteria of AFIs were consulted on significant results of the χ² test: root mean square error of approximation (RMSEA) ≤ .06, comparative fit index (CFI) ≥ .95, and weighted root mean square residual (WRMR) ≤ 1.0 (Bentler, 2007; Hu & Bentler, 1999; Yu, 2002).

Samejima’s (1969) graded response model (GRM) was used for the IRT estimates. In the GRM, option characteristic curves (OCCs) are estimated for each response option in an item. The mathematical formula with an item of K = 4 ordered responses options (k = 0, 1, 2, 3) may be written as,

P_{k, i} (θ) = {\begin{matrix} 1 - \frac{e^{a_{i} (θ - b_{1, i})}}{1 + e^{a_{i} (θ - b_{1, i})}}, k = 0 \\ \frac{e^{a_{i} (θ - b_{1, i})}}{1 + e^{a_{i} (θ - b_{1, i})}} - \frac{e^{a_{i} (θ - b_{2, i})}}{1 + e^{a_{i} (θ - b_{2, i})}}, k = 1 \\ \frac{e^{a_{i} (θ - b_{2, i})}}{1 + e^{a_{i} (θ - b_{2, i})}} - \frac{e^{a_{i} (θ - b_{3, i})}}{1 + e^{a_{i} (θ - b_{3, i})}}, k = 2 \\ \frac{e^{a_{i} (θ - b_{3, i})}}{1 + e^{a_{i} (θ - b_{3, i})}}, k = 3 \end{matrix},

where P_k,i is the probability of a person’s endorsement of response option k for item i, θ is the latent trait, a_i is the discrimination parameter for item i, the mathematical constant e is the base of the natural logarithm, and b_k,i is the threshold parameter (de Ayala, 2009). The discrimination parameter (a) indicates the degree of slope at each point of inflection. According to Baker (1985, 2001), item discrimination is considered to be very low for a < 0.34, low for 0.35 ≤ a ≤ 0.64, moderate for 0.65 ≤ a ≤ 1.34, high for 1.35 ≤ a ≤ 1.69, and very high for a > 1.70. In the present study, the threshold parameters b_1,i and b_3,i represent the values of θ where the probability is .5 for endorsing the lowest (k = 0) and highest options (k = 3), respectively, while the modes of the other OCCs (k = 1, 2) are specified as (b_1,i + b_2,i)/2 and (b_2,i + b_3,i)/2, respectively. The intersection of any two OCCs otherwise indicates an equal probability of endorsing option k or higher versus endorsing lower order options (de Ayala, 2009). Figure 1 presents a graphical illustration of the OCCs for the MII Item 1 of PV as an example. The probability is .5 for an individual to endorse Option 0 (i.e., not at all like me) when his positive affect for mathematics is 1.43 SDs below population mean, and for Option 3 (i.e., very much like me) when 1.18 SDs above mean. For OCCs of Options 1 (i.e., not much like me) and 2 (i.e., sort of like me), the θ values are (−1.43 + −0.33)/2 = −0.88, where the mode of OCC for Option 1 is present and (−0.33 + 1.18)/2 = 0.43 for Option 2.

Figure 1.

Graded response model option characteristic curves (OCCs) for Item 1 of the Mathematics Interest Inventory with four response options (a₁ = 1.84, b_1,1 = −1.43, b_2,1 = −0.33, b_3,1 = 1.18).

A minimum sample size of 500 was recommended by Reise and Yu (1990) to achieve adequate calibration with the GRM, based on their findings of parameter recovery with 25 five-option items. We estimated item parameters by fitting three separate unidimensional GRMs to each of MII subscale in IRTPRO v. 2.1 (Cai, Thissen, & du Toit, 2011). In addition, IRTPRO provides test information function values to represent the total information available for estimation. The test information function is defined for a set of test items (e.g., a subscale) at each point on the continuum of the latent trait, and the amount of information is influenced by the quality and number of test items. The contribution of each item to the total information is addictive and depends on how highly each item correlates with other items in the set (Hambleton & Swaminathan, 1985).

Several approaches, either IRT-based or non-IRT-based, are available for DIF detection (de Ayala, 2009). A major distinction between the non-IRT-based Mantel–Haenszel (M-H) method and IRT-based approaches (e.g., likelihood ratio [TSW-ΔG2]; Thissen, Steinberg, & Wainer [1988, 1993]) is that M-H is nonparametric and based on observed rather than latent conditioning scores. According to Potenza and Dorans (1995), a problem associated with parametric approaches is that DIF detected may be an artifact of model misspecification, while nonparametric methods are relatively free of model misspecification and collinearity problems (see also Bolt, 2002), but require sufficient data. We chose M-H over IRT-based approaches given our relatively large sample size. Further, the M-H procedure also yields effect size indicators which are particularly helpful in interpreting the direction and magnitude of DIF.

The analysis of DIF was performed using DIFAS v. 5.0 (Penfield, 2012). The DIF analysis is normally conducted by comparing a focal group (usually the minority) against a reference group (usually the majority). The M-H chi-square (M-H χ²) procedure was used for detecting DIF in polytomous items (Penfield, 2012) with α = .05. The M-H χ² statistic is distributed with one degree of freedom, with a higher value of a particular item indicating a higher probability for this item to display DIF (Mantel, 1963). To reduce Type 1 error, we used the Benjamini and Hochberg false discovery rate (BH-FDR; Benjamini & Hochberg, 1995) to adjust each item’s p value of M-H χ². According to Kim and Oshima (2013), the BH-FDR is the most balanced adjustment in lowering the Type 1 error rate as compared with Bonferroni correction and Holm’s procedure. In terms of effect size of differences, DIFAS v. 5.0 provides L-A Cumulative Common LOR (Liu & Agresti, 1996) and the Cox’s Noncentrality Parameter Estimator (Cox’s B; Camilli & Congdon, 1999) statistics for estimating magnitude and direction of DIF. For both L-A LOR and Cox’s B, positive values indicate DIF in favor of the reference group, while negative values indicate DIF in favor of the focal group (Penfield, 2012). For detecting non-uniform DIF, DIFAS v. 5.0 provides the conditional differences or the difference between the difficulty indices found for each of the groups. The DIF is uniform if the difference values maintain one direction (i.e., positive or negative) across all the difficulty intervals (Padilla, Hidalgo, Benítez, & Gómez-Benito, 2012).

Finally, the impact of DIF at the subscale level was examined by Differential Test Functioning (DTF) analyses. DTF represents the aggregated DIF across the items of a test or subscale and is essentially the variance estimator of item-level DIF effects. Thus, it can provide information concerning the overall impact of DIF effects. With respect to polytomous items, the variance of the generalized DIF effect is denoted by Nu-square (ν²), and considered to be small for ν² < 0.07, medium for 0.07 ≤ ν² ≤ 0.14, and large for ν² > 0.14 (Penfield & Algina, 2006).

Results

To detect the group differences (i.e., boys vs. girls, White vs. Hispanic) as reflected by measures of the MII, a 2 × 2 analysis of variance (ANOVA) was conducted. The ANOVA indicated a significant main effect for ethnicity, F(1, 603) = 3.94, p < .05, partial η² = .006, but no significant main effect for gender, F(1, 603) = 0.15, p = .70, partial η² < .001, or interaction between ethnicity and gender, F(1, 603) = 1.98, p = .16, partial η² = .003. As assessed by the MII, Hispanic students demonstrated significantly higher mathematics interest (M = 64.88, SD = 15.69) than White students (M = 62.21, SD = 18.03) across gender, while no significant difference was found between boys (M = 63.33, SD = 16.56) and girls (M = 63.85, SD = 17.23) across ethnic groups.

Unidimensionality

The reliabilities of the three MII subscales were assessed for internal consistency as both Cronbach’s alpha (α) and Raykov’s rho (ρ). The results appeared identical across α and ρ, and indicated good internal consistencies: PV = .93, NV = .87, and Time = .73. In addition, internal consistencies were also shown to be similar within each pair of comparison groups for subsequent DIF analyses: PV = .93 (female), .92 (male), .92 (Hispanic), and .94 (White); NV = .88 (female), .86 (male), .86 (Hispanic), and .88 (White); Time = .74 (female), .71 (male), .72 (Hispanic), and .74 (White).

The three-factor CFA model did not appear to fit the data well, χ²(321) = 2428.13, p < .0001, CFI = 0.90, RMSEA = 0.10, WRMR = 2.19. To further examine the factor structure, we consulted the modification indices (MIs; Sörbom, 1989) and identified one item as having a substantial cross loading, producing Δχ² = 779.8. Removing this item would have improved the model to fit the data marginally well, χ²(296) = 1619.24, p < .0001, CFI = 0.93, RMSEA = 0.08, WRMR = 1.79. The decision was made to retain this item for the IRT analysis because (a) the MII has been cross-validated to multiple samples in its validation study and results consistently supported the three-factor simple structure (Stevens & Olivárez, 2005); (b) Hu and Bentler’s (1999) combinational rules are considered too stringent and more appropriate when evaluating statistical significance rather than goodness of model fit (Marsh, Hau, & Wen, 2004). Factor loadings (λ) of the CFA are summarized in Table 2. In terms of inter-factor correlations, PV was shown to have a positive correlation of .70 (p < .001) with NV and a positive correlation of .84 (p < .001) with Time, while NV positively correlated with Time at .61 (p < .001).

Table 2.

IRT Parameter Estimates (a, b₁, b₂, b₃) and DIF Results by Subscale and Item.

			IRT parameter estimates				DIF by gender		DIF by ethnicity
Item	λ	M (SD)	a_i (SE)	b_1,i (SE)	b_2,i (SE)	b_3,i (SE)	M-H χ²	L-A LOR	M-H χ²	L-A LOR
Positive valence
1. I like to answer questions in math class.	.71	2.64 (0.95)	1.84 (0.13)	−1.43 (0.10)	−0.33 (0.06)	1.18 (0.08)	1.26	−0.19	0.41	−0.11
4. I like math.	.83	2.55 (1.12)	2.43 (0.17)	−0.85 (0.06)	−0.14 (0.05)	0.80 (0.06)	1.08	−0.19	9.51**	0.59
7. I am interested in math.	.89	2.42 (1.09)	3.06 (0.21)	−0.77 (0.05)	0.05 (0.04)	0.89 (0.06)	0.10	−0.06	11.89**	0.69
10. Knowing a lot about math is helpful.	.60	3.09 (0.96)	1.44 (0.12)	−2.19 (0.16)	−0.98 (0.09)	0.28 (0.07)	2.44	−0.27	0.54	−0.13
13. I feel good when it comes to working on math.	.85	2.22 (1.04)	2.78 (0.19)	−0.59 (0.05)	0.32 (0.05)	1.27 (0.07)	0.33	0.11	5.00	0.43
16. I want to know all about how to do math problems.	.79	2.61 (1.12)	2.43 (0.17)	−0.99 (0.07)	−0.05 (0.05)	0.65 (0.06)	0.26	−0.09	14.44**	−0.73
19. I feel excited when a new math topic is announced.	.79	2.15 (1.02)	2.14 (0.15)	−0.60 (0.06)	0.51 (0.06)	1.43 (0.09)	0.29	−0.09	0.24	0.09
22. I want to learn more about math.	.87	2.60 (1.09)	3.31 (0.23)	−0.91 (0.06)	−0.12 (0.04)	0.68 (0.05)	3.63	−0.34	0.07	−0.05
24. I choose to work on math.	.79	2.31 (1.05)	2.18 (0.15)	−0.75 (0.06)	0.21 (0.05)	1.26 (0.08)	13.19**	0.64	1.43	−0.21
26. I want to know all about math.	.90	2.45 (1.13)	3.48 (0.25)	−0.70 (0.05)	0.07 (0.04)	0.75 (0.05)	9.47*	0.58	4.46	−0.42
Negative valence
2. I am wasting my time on math.	.64	3.09 (0.98)	1.45 (0.12)	−2.18 (0.18)	−0.87 (0.10)	0.20 (0.07)	3.63	−0.33	3.10	0.31
5. I am bored when working on math.	.68	2.60 (1.08)	1.51 (0.12)	−1.24 (0.11)	−0.13 (0.07)	0.94 (0.09)	5.86	−0.41	0.26	−0.09
8. I would rather be working on something else besides math.	.88	2.27 (1.15)	1.91 (0.15)	−0.52 (0.07)	0.30 (0.06)	1.09 (0.08)	0.39	0.10	23.45***	−0.86
11. I give up easily when working on math.	.66	2.86 (1.05)	1.81 (0.14)	−1.48 (0.11)	−0.51 (0.07)	0.50 (0.07)	7.79	0.47	4.26	0.36
14. When working on math, I want to stop and start working on something else.	.77	2.48 (1.11)	2.18 (0.17)	−0.83 (0.08)	0.00 (0.06)	0.93 (0.07)	0.26	0.09	2.69	−0.29
17. I am always thinking of other things when working on math.	.79	2.28 (1.06)	2.58 (0.20)	−0.65 (0.07)	0.25 (0.05)	1.20 (0.08)	0.73	−0.15	0.13	−0.06
20. I get mad easily when working on math.	.63	2.67 (1.12)	1.56 (0.13)	−1.17 (0.11)	−0.27 (0.07)	0.77 (0.08)	0.82	0.15	8.08*	−0.48
23. I have difficulty paying attention when working on math.	.67	2.46 (1.11)	1.87 (0.15)	−0.87 (0.08)	0.03 (0.06)	0.97 (0.08)	0.14	−0.06	9.00**	0.51
25. I spend as little time as possible working on math.	.59	2.46 (1.09)	1.31 (0.11)	−1.06 (0.11)	0.00 (0.08)	1.32 (0.11)	3.92	−0.33	2.59	−0.27
27. I struggle with math.	.57	2.48 (1.14)	1.35 (0.12)	−0.99 (0.11)	0.04 (0.07)	1.01 (0.10)	6.31	0.41	23.04***	0.80
Time
3. I work more math problems than what I have to.	.56	2.04 (1.00)	1.36 (0.13)	−0.49 (0.08)	0.82 (0.09)	1.97 (0.16)	0.14	−0.06	0.17	0.07
6. I spend many hours working on math.	.44	2.03 (1.05)	1.12 (0.12)	−0.41 (0.09)	0.94 (0.11)	2.06 (0.19)	0.15	−0.06	0.16	−0.07
9. I work on math in my spare time.	.73	1.72 (0.92)	2.20 (0.21)	0.13 (0.06)	1.11 (0.08)	1.91 (0.13)	1.43	−0.22	4.96	−0.42
12. I want to talk about math with my friends.	.64	1.58 (0.92)	2.03 (0.21)	0.46 (0.06)	1.29 (0.09)	1.90 (0.13)	0.01	−0.02	0.85	−0.19
15. I spend more time than most of my classmates working on math.	.36	1.95 (1.03)	1.01 (0.11)	−0.31 (0.10)	1.21 (0.14)	2.33 (0.23)	1.79	−0.22	0.56	−0.13
18. I prefer easy math over math that is hard.	.72	2.02 (1.08)	0.65 (0.09)	−0.46 (0.14)	1.33 (0.21)	3.06 (0.43)	5.00	0.35	10.23**	0.54
21. I am too involved in math.	.76	1.68 (0.87)	2.22 (0.22)	0.09 (0.06)	1.23 (0.09)	2.05 (0.14)	0.53	0.13	0.55	−0.14

Note. IRT = Item Response Theory; DIF = differential item functioning; L-A LOR = Liu-Agresti Cumulative Common Log-Odds Ratio; M-H = Mantel–Haenszel.

p < .05 (BH-FDR corrected). **p < .01 (BH-FDR corrected). ***p < .001 (BH-FDR corrected).

Item Response Theory Analyses

Table 2 summarizes the descriptive statistics and IRT parameter estimates for the items. Based on Baker’s (1985, 2001) guidelines, 17 items (63%) demonstrated very high discrimination, 6 items (22%) high discrimination, and 4 items (15%) moderate discrimination.

For PV, the average item thresholds were −0.98 for b₁ (SD = 0.50), −0.05 for b₂ (SD = 0.41), and 0.92 for b₃ (SD = 0.36). Item 10 demonstrated the lowest thresholds of the set (b_1,10 = −2.19, b_2,10 = −0.98, b_3,10 = 0.28), indicating that very low levels of positive affect were necessary for a student to endorse lower response options (i.e., lowest through k − 1) for the statement “knowing a lot about math is helpful,” versus endorse option k and higher. In contrast, Item 19 demonstrated the highest thresholds in the subscale (b_1,13 = −0.60, b_2,13 = 0.51, b_3,13 = 1.43), indicating that very high levels of positive affect were necessary for a student to endorse response lower order options for the statement “I feel good when it comes to working on math,” versus endorse response option k and higher. Overall, the item thresholds of PV ranged from −2.19 to 1.43, suggesting a sufficient distribution of item responses according to threshold.

For NV, the average item thresholds were −1.10 for b₁ (SD = 0.47), −0.12 for b₂ (SD = 0.35), and 0.89 for b₃ (SD = 0.33). These results seemed to suggest that the average thresholds of negative affect for mathematics required to endorse response option k and higher rather than lower order options were lower than the ones for positive affect. In other words, it was easier for respondents to endorse higher-order response options for NV items as compared with PV items. Item 2, “I am wasting my time on math,” appeared to be of the lowest thresholds of the set (b_1,2 = −2.18, b_2,2 = −0.87, b_3,2 = 0.20), whereas Item 8 “I would rather be working on something else besides math” and Item 25 “I spend as little time as possible working on math” (b_1,8 = −0.52, b_2,8 = 0.30, b_3,25 = 1.32) demonstrated the highest thresholds. Overall, the item thresholds of NV ranged from −2.18 to 1.32, also suggesting a sufficient distribution of item responses according to threshold.

For Time, the average item thresholds were much higher: −0.14 for b₁ (SD = 0.37), 1.13 for b₂ (SD = 0.19), and 2.18 for b₃ (SD = 0.41). These results indicated that this subscale might not have a sufficient distribution, particularly for thresholds below the mean (i.e., θ < 0). Item 12, “I want to talk about math with my friends,” showed the highest threshold in the subscale for k = 0 (b_1,12 = 0.46) as well as the lowest threshold for k = 2 (b_3,12 = 1.90). As such, Item 12 appeared to be an example of limited distribution. Overall, the item thresholds of time ranged from −0.49 to 3.06, suggesting that this subscale may underestimate the time and effort respondents spent on mathematics.

Figure 2 illustrates the test information curves across the subscales. As indicated by the area under the curve, the PV items altogether provided the most information for measurement of children’s positive affect for mathematics, while the Time items altogether provided the least information of measurement of time and effort invested on mathematics. For PV, the maximum of the total information is about 19.76 at θ = 0, with very close values between θ = −0.8 (18.56) and θ = 0.8 (18.57). In comparison, the NV curve contains less information that the maximum of the total information is about 10.58 also at θ = 0. This reflects a larger mean of the item discrimination (a) parameter of the PV items because the maximum total information value is directly proportional to the square of a (Hambleton & Swaminathan, 1985). Thus, Time appears to be particularly weak for estimating a respondent’s latent trait for θ < 0 because the test information function curve appears asymmetric and contains limited information in the lower half of the continuum of θ.

Figure 2.

Test information functions for three subscales of the Mathematics Interest Inventory.

Differential Item Functioning

Results of DIF analyses are summarized in Table 2. We first examined the DIF of the MII subscales across gender specifying male as the reference group and female as the focal group. Only two PV items demonstrated significant DIF: Item 24 (M-H χ² = 13.19, BH-FDR corrected p = .002) and Item 26 (p = .01). Both items showed DIF in favor of the reference group (i.e., male) given positive L-A LOR values. That is, even given the same level of positive affect for math, male students were more likely than female students to endorse higher-order options (e.g., sort of like me rather than not much like me) of the statements “I choose to work on math” and “I want to know all about math.” As indicated by the conditional differences by difficulty intervals, neither item was found to show non-uniform DIF. The DTF effect variance of the PV subscale was small, ν² = .07. No significant DIF across gender was found in NV or Time. Thus, the amount of DIF by each item for gender may be considered minimal. This result indicates that although there may be overall differences in mathematics interest according to gender, these differences are likely not due to substantially different interpretations of mathematics interest items by gender.

Next, DIF was computed between White students as the reference group and Hispanic as the focal group. For PV, three items showed significant DIF: Item 16 (p = .001), Item 7 (p = .003), and Item 4 (p = .007). Item 16, “I want to know all about how to do math problems,” was more readily endorsed by Hispanic students (L-A LOR = −.73), while Item 7, “I am interested in math,” and Item 4, “I like math,” were more readily endorsed by White students (L-A LOR = .69 and .59, respectively). The DTF effect variance of PV was large, ν² = .15. For NV, four items showed significant DIF: Item 8 (p < .001), Item 27 (p < .001), Item 23 (p = .009), and Item 20 (p = .011). Item 27, “I struggle with math,” and Item 23, “I have difficulty paying attention when working on math,” were more readily endorsed by White students (L-A LOR = .80 and .51, respectively), while Item 8, “I would rather be working on something else besides math,” and Item 20, “I get mad easily when working on math,” were more readily endorsed by Hispanic students (L-A LOR = −.86 and −.48, respectively). The DIF of all these items across ethnicity was found to be uniform. The DTF effect variance of NV was large, ν² = .20. Finally, only one item of Time (Item 18, “I prefer easy math over math that is hard”) demonstrated significant DIF (p < .001) and the DTF was small, ν² = .04.

Thus, there appears to be a substantial amount of variation in the interpretation of mathematics interest items according to group membership. Results indicated that, in terms of positive affect toward mathematics, students who identified as White appeared to more readily endorse items with more direct, affective statements such as, “I like math” and “I’m interested in math,” while students who identified as Hispanic tended to endorse higher options on a less direct, more goal-oriented and non-affective statement: “I want to know all about how to do math problems.” When it came to negative affect, however, Hispanic students were more likely to endorse higher options on affective statements, “I would rather be working on something else besides math” and “I get mad easily when working on math,” yet less likely than their White peers to do so.

Discussion

The current study provides an examination of mathematics interest according to both gender and ethnicity, specifically students who self-reported to be White or Hispanic. Results indicate that the presence of differences in mathematics interest according to gender may not be considered a function of measurement error or a lack of psychometrically sound items as the presence of differential item functioning was minimal (e.g., 20%, only 2 out of 10 PV subscale items). Teresi (2006) has indicated when less than 25% of items present DIF that this may be considered as minimal. From the results of the current study, we conclude that in examining differences in mathematics interest according to gender, MII scores may be considered as psychometrically sufficient with respect to its IRT properties. Thus, the results of the current study support previous research indicating the reliability and validity of the MII scores as evaluated by classical psychometric procedures (Stevens & Olivárez, 2005).

With respect to differences in mathematics interest between White and Hispanic students, the presence of DIF was more substantial in terms of both number of items and effect size. Approximately 30% of the PV items (e.g., Items 4, 7, and 16) and 40% of the NV items (e.g., Items 20, 23, 23, and 27) indicated the presence of significant DIF. The associated effect sizes with these items presenting DIF ranged from moderate to large (Penfield, 2007). Results indicated a pattern of response according to the indirect or direct nature of the items. Thus, the MII may have difficulty capturing mathematics interest between White and Hispanic students due to these differential response patterns according to the nature of the item as being affective versus non-affective regarding the positive and negative valence, respectively, of mathematics interest.

Although our observation is preliminary and certainly requires further exploration, cultural differences in how people experience, regulate, and express emotions are well-documented in the literature (e.g., Goetz, Spencer-Rodgers, & Peng, 2008; Scollon, Diener, Oishi, & Biswas-Diener, 2004, 2005). Thus, the translation of the role of affect in the expression of mathematics interests is not unsurprising. The mechanism behind this differential response to direct versus indirect items regarding mathematics interest would appear to be multi-faceted but may be the result of cultural differences and the associated social desirability to fit in with the culture and its subsequent values with which one identifies (Lange, Thalbourne, Houran, & Lester, 2002; Scott et al., 2007; van de Vijver & Phalet, 2004). As such, this DIF result may indicate that more direct, affective approaches to fostering mathematics interest may be more effective for students who are White while indirect approaches to identifying mathematics interest may be more effective with students who are Hispanic. For instance, results suggest that teachers directly asking, “Do you like math?” may work well in identifying mathematics interest among students who are White but not as well with students who are Hispanic. As a result of this direct, affective approach, students who are Hispanic may not be steered toward more advanced mathematics as teachers and other school personnel would consider them as lacking sufficient interest in it. Instead, students who are Hispanic may respond better to more indirect, goal-oriented approaches to discerning mathematics interest, such as by asking, “Do you want to know how to do this math problem?”

Future use of the MII must be cautioned about the presence of DIF in particular items. If any modification or adaptation is to be applied, Hambleton’s (2006) recommendations may be consulted to avoid potential causes of DIF such as complicated language, multiple ideas in individual statements (i.e., double-barrier), concepts and situations that do not easily translate to other cultures, and stereotyping. Our findings regarding the direct versus indirect/affective versus non-affective approaches may be referred to for modification purpose as well. However, users of the current MII version should not arbitrarily omit items that demonstrated DIF, because this may result in artificially higher or lower scores in a certain subgroup especially when the magnitude of DIF tends to cancel (i.e., some items favor the reference group while others favor the focal group; see Edelen, Thissen, Teresi, Kleinman, & Ocepek-Welikson, 2006). Given that this is exactly the case for DIF in the MII subscales, we recommend users retain all items for scale scores, but take into account the observed DIF when making comparisons at the item level.

Several limitations emerged as part of conducting the current study, which provide directions for future research. For instance, the MII should be examined as to its age neutrality in that the expression of mathematics interest may vary as a function of age. Insufficient sample size across age groups precluded this analysis as the examination of differential item functioning typically requires large sample sizes per group (e.g., approximately 250 or more for each group; Herrera & Gomez, 2008; Rogers & Swaminathan, 1993; Swaminathan & Rogers, 1990). Relatedly, future research examining DIF across race/ethnicity needs to collect detailed demographic information about the sample particularly with regard to their cultural and language background (e.g., which generation of immigrants, first and second languages, English proficiency), because such information would allow for investigations of the causes of DIF (Church, 2001; Ramirez et al., 2006). Lack of these pieces of information in the current study prevents us from conducting more in-depth investigations.

In conclusion, the IRT-based evaluation of the MII provides several implications for the development and use of self-report instruments to measure mathematics interest. By consulting the discrimination and threshold parameters, researchers may gain better understandings about the quality of individual items. Furthermore, the test information functions reflect how respondents interact with the instrument at the subscale level. For example, respondents tend to underrepresent their mathematics interest for the Time subscale in the MII, suggesting that children may conceptualize time in a way other than what is assumed. Most importantly, the DIF results may provide valuable information as to how items may be adapted for relatively unbiased measurement when multiple ethnic groups are included, as well as account for the variance in group differences that are due to measurement error.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Baker

F. B.

(1985). The basics of Item Response Theory. Portsmouth, NH: Heinemann.

Baker

F. B.

(2001). The basics of Item Response Theory (2nd ed.). College Park, MD: ERIC Clearinghouse on Assessment and Evaluation, University of Maryland.

Bandura

(1986). Social foundations of thought and action: A social cognitive theory. Englewood Cliffs, NJ: Prentice-Hall.

Beasley

M. A.

Fischer

M. J.

(2012). Why they leave: The impact of stereotype threat on the attrition of women and minorities from science, math and engineering majors. Social Psychology of Education: An International Journal, 15, 427-448.

Benjamini

Hochberg

(1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57, 289-300.

Bentler

P. M.

(2007). On tests and indices for evaluating structural models. Personality and Individual Differences, 42, 825-829.

Bolt

D. M.

(2002). A Monte Carlo comparison of parametric and nonparametric polytomous DIF detection methods. Applied Measurement in Education, 15, 113-141.

Brown

S. D.

Lent

R. W.

(2006). Preparing adolescents to make career decisions. In Pajares

Urdan

(Eds.), Adolescents and education: Self-efficacy beliefs of adolescents (pp. 201-223). Greenwich, CT: Information Age.

Cai

Thissen

du Toit

S. H.

(2011). IRTPRO: Flexible, multidimensional, multiple categorical IRT modeling [Computer software and manual]. Lincolnwood, IL: Scientific Software International.

10.

Camilli

Congdon

(1999). Application of a method of estimating DIF for polytomous test items. Journal of Educational and Behavioral Statistics, 24, 323-341.

11.

Church

A. T.

(2001). Personality measurement in cross-cultural perspective. Journal of Personality, 69, 979-1006.

12.

Deacon

M. M.

(2011). Classroom learning environment and gender: Do they explain math self-efficacy, math outcome expectations, and math interest during early adolescence? (Doctoral dissertation). University of Virginia, Charlottesville.

13.

de Ayala

R. J.

(2009). The theory and practice of Item Response Theory. New York, NY: Guilford Press.

14.

DeJarnette

N. K.

(2012). America’s children: Providing early exposure to STEM (Science, Technology, Engineering and Math) initiatives. Education, 133, 77-84.

15.

Edelen

M. O.

Thissen

Teresi

J. A.

Kleinman

Ocepek-Welikson

(2006). Identification of differential item functioning using Item Response Theory and the Likelihood-based model comparison approach: Application to the Mini-Mental State Examination. Medical Care, 44(11), S134-S142.

16.

Evans

E. M.

Schweingruber

Stevenson

H. W.

(2002). Gender differences in interest and knowledge acquisition: The United States, Taiwan, and Japan. Sex Roles, 37, 153-167.

17.

Fleischman

H. L.

Hopstock

P. J.

Pelczar

M. P.

Shelley

B. E.

(2010). Highlights from PISA 2009: Performance of U.S. 15-Year old students in reading, mathematics, and science literacy in an international context (NCES 2011-004). Washington, DC: National Center for Education Statistics, Institute of Education Sciences, U.S. Department of Education.

18.

Gall

M. D.

Gall

J. P.

Borg

W. R.

(2006). Educational research: An introduction (8th ed.). Boston, MA: Pearson.

19.

Goetz

J. L.

Spencer-Rodgers

Peng

(2008). Dialectical emotions: How cultural epistemologies influence the experience and regulation of emotional complexity. In Sorrentino

Yamaguchi

(Eds.), Handbook of motivation and cognition across cultures (pp. 517-539). Amsterdam, The Netherland: Elsevier.

20.

Hambleton

R. K.

(2006). Good practices for identifying differential item functioning. Medical Care, 44(11), S182-S188.

21.

Hambleton

R. K.

Swaminathan

(1985). Item response theory: Principles and applications. Norwell, MA: Kluwer Academic.

22.

Herrera

Gomez

(2008). Influence of equal or unequal comparison group sample sizes on the detection of differential item functioning using the Mantel-Haenszel and logistic regression techniques. Quality & Quantity, 42, 739-755.

23.

Hidi

Anderson

(1992). Situational interest and its impact on reading and expository writing. In Renninger

K. A.

Hidi

Krapp

(Eds.), The role of interest in learning and development (pp. 215-238). Hillsdale, NJ: Lawrence Erlbaum.

24.

Holland

P. W.

Wainer

(1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum.

25.

Bentler

P. M.

(1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1-55.

26.

Iran-Nejad

(1987). Cognitive and affective causes of interest and liking. Journal of Educational Psychology, 79, 120-130.

27.

Kim

Oshima

T. C.

(2013). Effect of multiple testing adjustment in differential item functioning detection. Educational and Psychological Measurement, 73, 458-470. doi:10.1177/0013164412467033

28.

Kline

(2000). The handbook of psychological testing (2nd ed.). New York, NY: Routledge.

29.

Köller

Baumert

Schnabel

(2001). Does interest matter? The relationship between academic interest and achievement in mathematics. Journal for Research in Mathematics Education, 32, 448-470.

30.

Lange

Thalbourne

M. A.

Houran

Lester

(2002). Depressive response sets due to gender and culture-based differential item functioning. Personality and Individual Differences, 33, 937-954.

31.

Lent

R. W.

Brown

S. D.

Brenner

Chopra

S. B.

Davis

Talleyrand

Suthakaran

(2001). The role of contextual supports and barriers in the choice of math/science educational options: A test of social cognitive hypotheses. Journal of Counseling Psychology, 48, 474-483.

32.

Lent

R. W.

Brown

S. D.

Hackett

(1994). Toward a unifying social cognitive theory of career and academic interest, choice, and performance. Journal of Vocational Behavior, 45, 79-122.

33.

Lent

R. W.

Lopez

A. M.

Lopez

F. G.

Sheu

(2008). Social cognitive career theory and the prediction of interests and choice goals in the computing disciplines. Journal of Vocational Behavior, 73, 52-62.

34.

Liu

I. M.

Agresti

(1996). Mantel-Haenszel-type inference for cumulative odds ratio with a stratified ordinal response. Biometrics, 52, 1223-1234.

35.

Lord

Novick

M. R.

(Eds.). (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

36.

Mantel

(1963). Chi-square tests with one degree of freedom; Extension of the Mantel-Haenszel procedure. Journal of the American Statistical Association, 58, 690-700.

37.

Marsh

H. W.

Hau

K. T.

Wen

(2004). In search of golden rules: Comment on hypothesis-testing approaches to setting cutoff values for fit indexes and dangers in overgeneralizing Hu and Bentler’s (1999) findings. Structural Equation Modeling, 11, 320-341.

38.

Marsh

H. W.

Trautwein

Lüdtke

Köller

Baumert

(2005). Academic self-concept interest grades and standardized test scores: Reciprocal effects models of causal ordering. Child Development, 76, 397-416.

39.

Muthén

L. K.

Muthén

B. O.

(2010). Mplus user’s guide (6th ed.). Los Angeles, CA: Author.

40.

Padilla

J. L.

Hidalgo

M. D.

Benítez

Gómez-Benito

(2012). Comparison of three software programs for evaluating DIF by means of the Mantel-Haenszel procedure: EASY-DIF, DIFAS and EZDIF. Psicológica, 33, 135-156.

41.

Penfield

R. D.

(2007). An approach for categorizing DIF in polytomous items. Applied Measurement in Education, 20, 335-355.

42.

Penfield

R. D.

(2012). DIFAS 5.0. Differential item functioning analysis systemml: User’s manual [Computer software and manual]. Retrieved from http://erm.uncg.edu/wp-content/uploads/2012/07/DIFASManual_V5.pdf.

43.

Penfield

R. D.

Algina

(2006). A generalized DIF effect variance estimator for measuring unsigned differential test functioning in mixed format tests. Journal of Educational Measurement, 43, 295-312.

44.

Potenza

Dorans

N. J.

(1995). DIF assessment of polytomously scored items: A framework for classification and evaluation. Applied Psychological Measures, 19, 23-37.

45.

Provasnik

Kastberg

Ferraro

Lemanski

Roey

Jenkins

(2012). Highlights from TIMSS 2011: Mathematics and science achievement of U.S. fourth- and eighth-grade students in an international context (NCES 2013-009). Washington, DC: National Center for Education Statistics, Institute of Education Sciences, U.S. Department of Education.

46.

Ramirez

Teresi

J. A.

Holmes

Gurland

Lantigua

(2006). Differential item functioning (DIF) and the Mini-Mental State Examination (MMSE): Overview, sample, and issues of translation. Medical Care, 44(Suppl. 3), S95-S106.

47.

Reise

S. P.

(1990). Parameter recovery in the graded response model using MULTILOG. Journal of Educational Measurement, 27, 133-144.

48.

Renninger

K. A.

(1992). Individual interest and development: Implications for theory and practice. In Renninger

K. A.

Hidi

Krapp

(Eds.), The role of interest in learning and development (pp. 361-395). Hillsdale, NJ: Lawrence Erlbaum.

49.

Renninger

K. A.

Hidi

(2002). Student interest and achievement: Developmental issues raised by a case study. In Wigfeld

Eccles

J. S.

(Eds.), Development of achievement motivation (pp. 173-195). New York, NY: Academic Press.

50.

Rogers

J. H.

Swaminathan

(1993). A comparison of the logistic regression and Mantel-Haenszel procedures for detecting differential item functioning. Applied Psychological Measurement, 17, 104-116.

51.

Russell

S. H.

Hancock

M. P.

McCullough

(2007). Benefits of undergraduate research experiences. Science, 316, 548-549.

52.

Samejima

(1969). Calibration of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 19, 86-100.

53.

Schmitt

T. A.

(2011). Current methodological considerations in exploratory and confirmatory factor analysis. Journal of Psychoeducational Assessment, 29, 304-321.

54.

Schunk

D. H.

Pintrich

P. R.

Meece

J. L.

(2008). Motivation in education (3rd ed.). Columbus, OH: Merrill.

55.

Scollon

C. N.

Diener

Oishi

Biswas-Diener

(2004). Emotions across cultures and methods. Journal of Cross-Cultural Psychology, 35, 304-326.

56.

Scollon

C. N.

Diener

Oishi

Biswas-Diener

(2005). An experience sampling and cross-cultural investigation of the relation between pleasant and unpleasant emotion. Cognition & Emotion, 19, 27-52.

57.

Scott

N. W.

Fayers

P. M.

Aaronson

N. K.

Bottomley

de Graeff

Groenvold

. . . Sprangers

M. A.

(2007). The use of differential item functioning analyses to identify cultural differences in responses to the EORTC QLQ-C30. Quality of Life Research: An International Journal of Quality of Life Aspects of Treatment, Care and Rehabilitation, 16, 115-129.

58.

Seymour

(2002). Tracking the processes of change in US undergraduate education in science, mathematics, engineering, and technology. Science Education, 86, 79-105.

59.

Shapiro

J. R.

Williams

A. M.

(2012). The role of stereotype threats in undermining girls’ and women’s performance and interest in STEM fields. Sex Roles, 66, 175-183.

60.

Shorrocks-Taylor

Jenkins

W. E.

(2002). Learning from others: International comparisons in education. Dordrecht, The Netherlands: Kluwer.

61.

Smith

K. A.

Sheppard

S. D.

Johnson

D. W.

Johnson

R. T.

(2005). Pedagogies of engagement: Classroom-based practices. Journal of Engineering Education: Special Issue on the State of the Art and Practice of Engineering Education Research, 94, 87-102.

62.

Sörbom

(1989). Model modification. Psychometrika, 54, 371-384.

63.

Spector

P. E.

(1994). Using self-report questionnaires in OB research: A comment on the use of a controversial method. Journal of Organizational Behavior, 15, 385-392.

64.

Stevens

Olivárez

J. A.

(2005). Development and evaluation of the Mathematics Interest Inventory. Measurement and Evaluation in Counseling and Development, 38, 141-152.

65.

Stevens

Olivárez

J. A.

Hamman

(2006). The role of cognition, motivation, and emotion in explaining the mathematics achievement gap between Hispanic and White students. Hispanic Journal of Behavioral Science, 28, 161-186.

66.

Stevens

Wang

Olivárez

J. A.

Hamman

(2007). Use of self-perspectives and their sources to predict the mathematics enrollment intentions of girls and boys. Sex Roles, 56, 351-363.

67.

Swaminathan

Rogers

J. A.

(1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361-370.

68.

Teresi

J. A.

(2006). Overview of quantitative measurement methods: Equivalence, invariance, and differential item functioning in health applications. Medical Care, 44(11), S39-S49.

69.

Thissen

D. J.

Steinberg

Wainer

(1988). Use of item response theory in the study of group differences in trace lines. In Wainer

Braun

H. I.

(Eds.), Test validity (pp. 147-169). Hillsdale, NJ: Lawrence Erlbaum.

70.

Thissen

D. J.

Steinberg

Wainer

(1993). Detection of differential item functioning using the parameters of item response models. In Wainer

Braun

H. I.

(Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Lawrence Erlbaum.

71.

Valsiner

(1992). Interest: A metatheoretical perspective. In Renninger

K. A.

Hidi

Krapp

(Eds.), The role of interest in learning and development (pp. 27-41). Hillsdale, NJ: Lawrence Erlbaum.

72.

van de Vijver

F. J.

Phalet

(2004). Assessment in multicultural groups: The Role of acculturation. Applied Psychology, 53, 215-236.

73.

(2002). Evaluating cutoff criteria of model fit indices for latent variable models with binary and continuous outcomes (Doctoral dissertation). University of California, Los Angeles.