Applying Bayesian Modeling and Receiver Operating Characteristic Methodologies for Test Utility Analysis

Abstract

This study integrated Bayesian hierarchical modeling and receiver operating characteristic analysis (BROCA) to evaluate how interest strength (IS) and interest differentiation (ID) predicted low–socioeconomic status (SES) youth’s interest-major congruence (IMC). Using large-scale Kuder Career Search online-assessment data, this study fit three models, the one-level BROCA, the two-level BROCA, and the ordinal Probit BROCA, to examine the moderating effects of gender and race/ethnicity. Both IS and ID displayed race/ethnicity differences in predicting low-SES females’ IMC. Gender difference was found only on IS in predicting low-SES youth’s IMC. Results suggested that low-SES White males and low-SES minority females may need help the most to develop stronger career interests and to differentiate their interests. This study illustrated that BROCA can be a powerful tool for test evaluation and utility analysis in the field because of its capacity of analyzing continuous, nominal, and ordinal data; its graphical nature of result presentation; multiple statistical test options; and its little requirement of Level 2 sample sizes.

Keywords

Bayesian ROC analysis classification analysis utility analysis

Introduction

Instrument evaluation has recently gone beyond the traditional reliability and validity analysis and has developed into a holistic system incorporating external evidence from nomological information, criterion-based utility, and practical impact-based decision making analysis (Lissitz & Samuelsen, 2007). This study merges Bayesian inference and receiver operating characteristic analysis into the practices of educational measurement (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999), test evaluation (Embretson, 2007), and utility analysis (Gorin, 2007; Lissitz & Samuelsen, 2007).

Receiver Operating Characteristic Curve

The receiver operating characteristic (ROC) curve graphically summarizes how accurately a diagnostic test distinguishes a characteristic group from a noncharacteristic group in a given population (Swets, Dawes, & Monahan, 2000). Consider a situation (McGee, 1979) where a large number of examinees from vocational-technical training programs have been tested on a spatial test battery. According to the administration records, their status of successfully passing the training is known. The two bell curves in Figure 1 indicate the distributions of the spatial test scores of the two groups, pass group and not-pass group. A given threshold constructs four proportions: True Positive (TP; the proportion of pass group members correctly identified), False Negative (FN; the proportion of pass group members falsely identified), True Negative (TN; the proportion of not-pass group members correctly identified), and False Positive (FP; the proportion of not-pass group members falsely identified). Using different thresholds will result in different pairs of (FP, TP).

Figure 1.

Receiver operating characteristic curve and (False Positive, True Positive) pairs given all diagnostic test score thresholds.

A pair of (FP, TP) obtained from a given threshold represents the classification accuracy of the diagnostic test for detecting program success. A perfect test should have a 100% TP and a zero FP, which is rarely possible to achieve in practice. An ROC curve reflects the trade-off between correct detection (TP) and false alarm (FP) when a diagnostic test has been used to classify examinees. This logic coincides with the usage of the statistical hypothesis test, in which the power (TP) will be optimized controlling for Type I error (FP) given the hypothetical value (the given threshold) on the statistic computed from sample data. Plotting FP values on the horizontal axis and TP on the vertical axis, one will obtain the trajectory, or an ROC curve (Swets et al., 2000) demonstrated in Figure 1. An ROC curve displays the diagnostic performance of a test on all legitimate cutoff points and can be used to compare tests and select the most powerful diagnostic instrument (Fawcett, 2003; Pepe, 2003).

A Short History of ROC Analysis

ROC analysis, originated in signal detection theory during World War II, has been applied into several fields and become a well-developed methodology. During the war, the ROC curve was used to evaluate how well a radar reader could detect enemy’s aircrafts on the radar screen with a high TP rate and a low FP rate (Green & Swets, 1966). In the 1950s, psychologists used ROC analysis to assess human cognitive processing ability (Tanner & Swets, 1954). Since the 1980s, ROC analysis has been widely used in the fields of computer engineering (Fawcett, 2003), radiologic imaging (Hanley, 1989; Metz, 1986), and medical testing (Pepe, 2000, 2003) for instrument evaluation and test utility analysis. The well-developed ROC methodology can analyze continuous measures (Pepe, 1998), ordinal data (Dorfman & Alf, 1968, 1969; Metz, 1986), and multilevel data (Gatsonis, 1995; Toledano & Gatsonis, 1995). It allows researchers to use categorical background variables (Pepe, 2000) to examine how covariates affect the ROC curve (Tosteson & Begg, 1988). Bayesian approach was successfully used in ROC analysis through the multivariate hierarchical transformation model (O’Malley & Zou, 2006).

Bayesian ROC analysis (BROCA), as illustrated in this article, has the potential to advance instrument evaluation and test utility analysis. Because of Bayesian approach’s hierarchical modeling nature (Gelman & Hill, 2007), prior information can be incorporated into lower level parameters and allow them to vary across higher level units to improve analysis accuracy.

Bayesian ROC Analysis and Test Utility Analysis

As one of the most active fields of educational measurement, career assessment research frequently examines the “congruence hypothesis” (Tracey & Robbins, 2006) and study interests’ predictive role in later educational/occupational satisfaction and performance (Swanson, 1993). Many methods, such as logistic regression and multivariate classification analysis techniques, have been used to study the prediction power of vocational interests (Hansen & Zytowski, 1979). For example, the ROC curve can be obtained through a logistic regression (Hosmer & Lemeshow, 1989) to study the moderating effect of demographic variables including gender and ethnicity.

The BROCA is different from the use of ROC curve in logistic regression. The ROC curve obtained in the logistic regression serves as an index of goodness-of-model-fit (Hosmer & Lemeshow, 1989) of the logit classification model. BROCA, as an independent methodology, can graphically depict each career interest variable’s prediction power rather than the performance of the entire model. Through the statistical significance test, BROCA evaluates whether the prediction power of a diagnostic measure varies across gender and/or ethnicity groups (Porter & Umbach, 2006).

In addition, BROCA allows researchers to include moderators, such as gender, to detect gender differences by testing if females and males have the same ROC curves. If the two gender groups have different ROC curves, it implies that the vocational interest instrument performs differently for men and women. BROCA’s hierarchical modeling nature can help researchers understand how vocational instruments perform differently across higher level demographic groupings such as racial groups. These forms of external evidence including gender and racial group differences can be incorporated into the test utility analysis corresponding with the emphasis on test impact evaluation (Lissitz & Samuelsen, 2007).

Theoretical Framework

The goal of using BROCA on career assessment data for test utility evaluation is to evaluate if the vocational interest can distinguish the interest-major-congruent (IMC) group (students whose vocational interests match their prospective major, or major they expect to later attain) and the interest-major-noncongruent (IMNC) group (students whose vocational interests do not match their prospective major) with high true positive (TP) rate and low false positive (FP) rate. Let D represent the IMC group and D* the IMNC group. The diagnostic measure Y is the interest strength, which is the largest score of the Kuder Career Search (KCS) vocational interest measures. Detailed operational definitions of the two groups and interest strength are discussed later in the Data Analysis section.

Statistical Inference for Test Utility Comparison

The statistical inference and comparison can be conducted through a Binormal ROC model (Green & Swets, 1966; Pepe, 2003). Suppose the interest strength scores of group D and group D* are normally distributed as $Y_{D} ~ N (μ_{D}, σ_{D}^{2})$ and $Y_{D *} ~ N (μ_{D *}, σ_{D *}^{2}),$ respectively. Let $Φ (.)$ be the cumulative distribution function of the normal distribution. The Binormal ROC curve is defined as

ROC (t) = Φ [a + b Φ^{- 1} (t)],

with $t \in [0, 1],$ $a = (μ_{D} - μ_{D *}) / σ_{D},$ and $b = σ_{D *} / σ_{D} .$

The area under curve (AUC) is used to evaluate the prediction power of an instrument and it is denoted in Binormal ROC model as

AUC = Φ (a / \sqrt{1 + b^{2}}) .

Previous research has suggested a rule of thumb for using AUC to evaluate the effectiveness of a classification model (Hosmer & Lemeshow, 1989). For example, AUC equal to or above .90 indicates an outstanding discrimination; AUC between 0.8 and 0.9 represents an excellent discrimination; AUC less than 0.8 but equal to or more than 0.7 represents an acceptable discrimination. In this study, a bigger area under the ROC curve indicates a better performance of interest strength in predicting interest-major congruence and distinguishing the IMC group and the IMNC group.

A chi-square test with two degrees of freedom on the estimated parameters of a and b of the binormal ROC model can be used for utility comparison between two tests (Metz & Kronman, 1980). The null hypothesis is: H₀: a₁ = a₂; b₁ = b₂, and the chi-square test statistic is $Δ' {\hat{Σ}}_{2 \times 2}^{- 1} Δ,$ with $Δ = ({\hat{a}}_{1} - {\hat{a}}_{2} {\hat{b}}_{1} - {\hat{b}}_{2}) .$ The estimated variance-covariance matrix of $Δ$ is ${\hat{Σ}}_{2 \times 2},$ which can be computed using bootstrap sampling (Pepe, 2003).

Using large-scale career interest assessment data collected online from high school juniors and seniors students living in poverty, the next section illustrates how to use BROCA to estimate the Binormal model parameters and quantities including $μ_{D},$ $μ_{D *},$ $σ_{D *},$ $σ_{D *},$ a, b, and AUC, through WinBUGS (Windows Baysian inference Using Gibbs Sampling; Lunn, Thomas, Best, & Spiegelhalter, 2000; Spiegelhalter, Thomas, Best, & Lunn, 2003) with the implementation of Markov Chain Monte Carlo (MCMC). MCMC sampling algorithms can numerically calculate multidimensional integrals to estimate ${\hat{Σ}}_{2 \times 2}$ without bootstrapping.

Data Analysis

Sample and Participants

A random sample of 1,930 high school juniors (954, 49.4%) and seniors (976, 50.6%) were drawn from a large-scale archival database for this study. These students took the KCS vocational interest measure online to better understand their vocational interests. Participants were selected from those who lived in ZIP codes where the median household income was below the federal poverty level (Diemer, Wang, & Smith, 2010). These participants belonged to five racial/ethnic groups: American Indian/Alaska Native (AIAN, 256, 13.3%), African American (AA, 455, 23.6%), Asian/Hawaiian/Pacific Islander (AHPI, 273, 14.1%), Latino/a (483, 25.0%), and White (463, 24%). Their age ranged from 15.21 to 19.49 years ( $M = 17.50,$ $SD = 0.71$ ). The number of males (1014, 52.5%) was slightly larger than females (916, 47.5%).

Variables

Interest-major congruence

The outcome variable, interest-major congruence (Diemer et al., 2010; Hansen & Zytowski, 1979; Ihle-Helledy et al., 2004; Kelly, 2002; Tracey & Robbins, 2006), is a 0-1 binary measure with 1 indicating that students’ vocational interests match their prospective college major, 0 otherwise. The binary measure also conceptualizes participants group membership, that is, 1 indicates the participant belongs to the IMC (interest-major congruent) group and 0 the IMNC (interest-major not congruent) group. The observed frequencies of IMC and IMNC groups across these five racial/ethnic groups are the following: AIAN (N₁ = 99, N₀ = 155), AHPI (N₁ = 135, N₀ = 129), AA (N₁ = 182, N₀ = 267), Latino/a (N₁ = 191, N₀ = 260), and White (N₁ = 208, N₀ = 245).

Interest strength and interest differentiation

This study compared interest strength (IS) with interest differentiation (ID) in predicting the interest-major congruence. IS was measured by the largest of the six career interest scores on the KCS online assessment (Diemer et al., 2010). The six Kuder career clusters were Outdoor/Mechanical, Science/Technical, Arts/Communication, Social/Personal Services, Sales/Management, and Business/Retail (Zytowski, 2001). The possible range of an interest cluster score is 0 to 99. The descriptive statistics of IS across the five racial/ethnic groups are the following: AIAN (M = 85.57, SD = 9.71), AHPI (M = 87.52, SD = 9.15), AA (M = 84.86, SD = 9.17), Latino/a (M = 86.37, SD = 9.06), and White (M = 88.80, SD = 8.82). ID was conceptualized as the difference between the sum of the upper three interest cluster scores and the sum of the lower three interest cluster scores. A larger ID score indicates a more differentiated gap between what the participants are most interested in and what they are least interested in. The descriptive statistics of ID across the five racial/ethnic groups are the following: AIAN (M = 138.67, SD = 40.70), AHPI (M = 143.44, SD = 38.84), AA (M = 132.97, SD = 39.02), Latino/a (M = 138.50, SD = 38.72), and White (M = 152.55, SD = 41.39).

WinBUGS with the implementation of MCMC (Lunn et al., 2000; Spiegelhalter et al., 2003) was used to analyze the data. The computation and estimation were run using R (R Development Core Team, 2008). WinBUGS code can be requested through e-mail from the first author of this study.

Analysis 1: Single-Level BROCA Model

Parameter and model specifications

The one-level Binormal model was specified as $Y_{i} = μ + δ D_{i} + ε_{ij},$ with $i = 1, \dots, n,$ $n = n_{D} + n_{D *},$ and $j = 1, 2 .$ Y_i was the IS or ID score of the i^th person. D_i was the binary group membership indicator of the i^th person, 1 for an IMC group member, 0 otherwise. $μ$ was the group mean of IS or ID score in IMNC group. $ε_{i 1}$ was the random error of IMNC group and followed a normal distribution, $N (0, σ_{D *}^{2}) .$ $ε_{i 2}$ was the random error of group IMC group and followed a normal distribution, $N (0, σ_{D}^{2}) .$ The IS or ID score was assumed to be normally distributed as $N (μ, σ_{D *}^{2})$ in the IMNC group and $N (μ + δ, σ_{D}^{2}),$ with $δ = μ_{D *} - μ_{D},$ in the IMC group.

The priors of $σ_{D}^{- 2}$ and $σ_{D *}^{- 2}$ were set as the uniform distribution unif(0, 10⁴) rather than the conjugate Inverse-Gamma distortion. This avoided software crash problem because of the problematic solution for near-zero variance during the estimation (Gelman & Hill, 2007). µ’s prior was set as N(0, 10⁴). δ’s prior was set as a normal distribution N(mu.b, tau.b), with mu.b and tau.b being the hyperparameters. mu.b was set to follow a normal distribution N(0, 10⁶) and tau.b a uniform distribution unif(1, 10⁴). One long chain was run with 10,000 iterations with the first 5,000 iterations discarded. WinBUGS run ended normally; no suspectable problems such as high auto correlation were found after the graphical estimation history was checked.

Results of Analysis 1

The posterior parameter estimates (standard deviations) of $a_{IS}$ and $b_{IS}$ for IS were .41 (.05) and 1.03 (.03), respectively. The quantities of a_ID and b_ID for ID were .38 (.05) and .92 (.03), respectively. WinBUGS estimated the value of $a / a \sqrt{1 + b^{2}} \sqrt{1 + b^{2}}$ in Equation (2) to compute AUC. The estimated AUC values were .65 and .64 for IS ROC curve and ID ROC curve, respectively. Both values were below the acceptable discrimination criterion .70 (Hosmer & Lemeshow, 1989). A chi-square test rejected the null hypothesis, $H_{0} : a_{IS} = a_{ID}, b_{IS} = b_{ID}$ ( $χ_{2}^{2} = 6.09;$ p < .05). This suggested that the ROC curve parameters of IS and ID were statistically different in distinguishing the IMC group and the IMNC group. In other words, IS and ID performed differently in predicting interest-major congruence. Analysis 2 further explored the detailed differences of the moderator variables, including gender and race/ethnicity, through a two-level BROCA model.

Analysis 2: Two-Level BROCA Model

Parameter and Model Specifications

A two-level model was implemented to examine the moderating effects of gender and race/ethnicity. In this model, racial groups were treated as Level 2 units. The two-level BROCA model was written as

Level 1 : {\begin{matrix} Y_{ijk} = μ_{jk} + d_{jk} D_{ki} + g_{jk} G_{ki} + c_{jk} D_{ki} G_{ki} + ε_{ijk}; \\ ε_{ijk} ~ N (0, σ_{jk 1}^{2}), for IMC group; \\ ε_{ijk} ~ N (0, σ_{jk 0}^{2}), for IMNC group; \\ σ_{jk D_{ki}}^{2} = {(θ_{jk 0} + θ_{jk 1} D_{ki})}^{2} . \end{matrix}

and

Level 2 : {\begin{matrix} μ_{jk} = μ_{j} + u_{1 jk}, u_{1 jk} ~ N (0, σ_{uj}^{2}); \\ d_{jk} = δ_{j} + u_{2 jk}, u_{2 jk} ~ N (0, σ_{dj}^{2}); \\ g_{jk} = γ_{j} + u_{3 jk}, u_{3 jk} ~ N (0, σ_{gj}^{2}); \\ \begin{matrix} c_{jk} = κ_{j} + u_{4 jk}, u_{4 jk} ~ N (0, σ_{cj}^{2}); \\ θ_{jk 0} = θ_{j 0} + u_{5 jk}, u_{5 jk} ~ N (0, σ_{θ j 0}^{2}); \\ θ_{jk 1} = θ_{j 1} + u_{6 jk}, u_{6 jk} ~ N (0, σ_{θ j 1}^{2}) . \end{matrix} \end{matrix}

The j^th career interest measure of the i^th individual in racial group k was denoted as Y_ijk, for i = 1, 2, . . . , n_k; j = 1, 2, . . . , J; and k = 1, 2, . . . , K. n_k was the k^th racial group size. K was equal to 5, representing the five racial groups. In this study, J was equal to 2, representing the two career interest measures IS and ID. $μ_{jk}$ was mean of the j^th interest measure in cluster k. G_ki stood for gender of the i^th individual in cluster k. G_ki was coded as 1 for males and 0 for females. D_ki took two values: 1 for the IMC group and 0 for the IMNC group.

The IMC group and the IMNC group had different residual variance parameters, which were indicated by the second and third equations of the Level 1 model in (3). $ε_{ijk}$ was the random residual associated the j^th interest measure of the i^th individual in racial group k. $ε_{ijk}$ identically and independently followed a normal distribution $N (0, σ_{jk 1}^{2}),$ if the i^th individual belonged to the IMC group, otherwise $N (0, σ_{jk 0}^{2}) .$ The residual variance components, $θ_{jk 0},$ and $θ_{jk 1},$ in the fourth equation of the Level 1 model were further specified by the fifth and sixth equations of the Level 2 model in (4). These specifications made the BROCA model different from the general two-level model, in which Level 1 residuals share equal variance.

The Level 2 equations demonstrated how Level 1 parameters varied across racial groups. The random errors, u_1jk, u_2jk, u_3jk, u_4jk, u_5jk, and u_6jk followed normal distribution $N (0, σ_{uj}^{2}),$ $N (0, σ_{dj}^{2}),$ $N (0, σ_{gj}^{2}),$ $N (0, σ_{cj}^{2}),$ $N (0, σ_{θ j 0}^{2}),$ and $N (0, σ_{θ j 1}^{2}),$ respectively. In this study, BROCA was run for IS and ID separately. This way, the covariance of $ε_{i 1 k}$ and $ε_{i 2 k}$ was zero.

Prior Distributions

The prior distribution of $μ_{j},$ $δ_{j},$ $γ_{j},$ and $κ_{j}$ was set as N(M, V), with M ~ N(0; 10⁻⁶) and V⁻² ~ unif(0, 10⁴). The priors for the residual variance parameters were more complicated. O’Malley and Zou (2006) set $θ_{jk 0}^{2}$ as an Inverse-Gamma distribution IG(10⁻³; 10⁻³), and $θ_{jk 0}^{2} + θ_{jk 1}^{2}$ an IG(10⁻³; 10⁻³). Because the use of the Inverse-Gamma priors had caused software crash problem, this study followed literature (Gelman & Hill, 2007) and set the uniform prior unif(0; 10⁴) to $θ_{jk 0}^{2}$ and $θ_{jk 1}^{2} .$

Gender-Specific ROC Curves Across Racial Groups

In the Level 1 model, the coefficient of the interaction term (D_ki G_ki) between gender and the interest-major congruence group membership variable resulted in two ROC curves: one for males and the other for females. The positive (negative) sign of the coefficient of the interaction term determined the positive (negative) gender effect. Visually, a positive gender effect pushes males’ ROC curve further to the left-above corner from females’ ROC curve. In turn, it increases the value of AUC and indicates a larger prediction power of the career interest measure for males. However, a negative gender effect indicates a higher prediction power for females and a larger AUC value of the ROC curve, which is closer to the left-above corner than the males’ ROC curve.

The Level 2 model allowed each racial group to have its own mean and residual variance. In turn, each racial group had two gender-specific ROC curves, reflecting the detailed micro differences of how IS or ID predicted interest-major congruence for males and females within the same racial group. The ROC curves for k^th racial group was obtained through $ROC (t)_{k} = Φ [a_{k} + b_{k} Φ^{- 1} (t)],$ with $t \in [0, 1],$ $b_{k} = σ_{0 k} / σ_{1 k},$ $a_{k} = (d_{k} + G \times c_{k}) / σ_{1 k},$ and G coded as 0 for females and 1 for males. Each racial group’s ROC curve had two trajectories as shown in Figure 2. The solid (dashed) line was for males (females).

Figure 2.

ROC curves of interest differentiation by gender and race

Results of Analysis 2

Gender and racial differences in career interest strength

Four of the five racial groups had negative coefficients of the interaction terms displayed in the second column of Table 1. The estimated gender effect of IS in each racial group was tested under the null hypothesis that each c_k was equal to zero for k = 1, . . . , 5. None of the five chi-square tests were statistically significant. The results suggested that, within each racial group, there was no significant gender difference on IS in predicting interest major congruence. The holistic test that all c_ks are 0 for k = 1, . . . , 5 was conducted. The result ( $χ_{5}^{2} = 4.45;$ p = .49) suggested that interest strength did not display significant gender difference in predicting interest major congruence in all five racial groups.

Table 1.

Gender Effect and AUC Estimation of Interest Strength in Five Racial Groups.

Race	Gender effect				AUC estimation
	c_k (SD)	95% PI	$χ_{1}^{2}$ ^a	p	Female	Male	$χ_{1}^{2}$ ^b	p
AIAN	−2.51 (1.62)	[−6.13, 3.41]	2.35	.13	.65	.55	2.35	.13
AHPI	0.26 (1.67)	[−2.62, 3.57]	0.03	.87	.70	.71	0.03	.87
AA	−1.73 (1.20)	[−4.19, 6.83]	2.06	.15	.66	.59	2.06	.15
Latino/a	−1.21 (1.19)	[−3.43, 1.24]	1.03	.31	.67	.62	1.03	.31
White	−1.77 (1.16)	[−4.12, 4.66]	2.30	.13	.69	.61	2.30	.13
$χ_{5}^{2}$	4.45^c	NA	NA	NA	12.91^d	2.10	NA	NA
p	.49	NA	NA	NA	.02*	.84	NA	NA

Note. AUC = area under the curve; AIAN = American Indian/Alaska Native; AHPI = Asian/Hawaiian/Pacific Islander; AA = African American.

Chi-square test on $c_{k} = 0$ for the specific racial group.

Chi-square test on equal AUCs of female and male.

Chi-square test on all the five $c_{k} s$ equal to 0.

Chi-square test on all the five female groups’ AUCs equal to .5.

p < .05.

For the purpose of cross-validation, a chi-square test was conducted for each of the five racial groups under the null hypothesis: $H_{0}, a_{0} / H_{0} : a_{0} \sqrt{1 + b_{0}^{2}} \sqrt{1 + b_{0}^{2}} = a_{1} / a_{1} \sqrt{1 + b_{1}^{2}} \sqrt{1 + b_{1}^{2}} .$ Equivalently, it tested that females and males for each racial group had the same AUC. The five chi-square tests failed to reject the null hypotheses—see the eighth and ninth columns of Table 1. This further suggested that gender difference was not statistically significant across the five racial groups. In other words, interest strength had equal power for females and males in predicting interest-major congruence. Furthermore, the overall prediction power of IS was low for both females and males because 8 of the 10 AUC values (Table 1) were smaller than the acceptable discrimination criterion .70.

For each gender group, racial differences were tested under the null hypothesis that all five AUCs were equal to 0.5. The last two rows of Table 1 summarized the results. For low-SES females, the chi-square test rejected the null hypothesis ( $χ_{5}^{2} = 12.91;$ p < .05). For low-SES males, the chi-square test failed to reject the null hypothesis ( $χ_{5}^{2} = 2.10;$ p = .84). This implied that there were no racial differences for low-SES males on the power of the interest strength in predicting interest-major congruence. However, for low-SES females, interest strength’s power in predicting interest-major congruence varied across at least two racial groups.

Gender and racial differences in career interest differentiation

All racial groups showed the negative gender effect indicated by the negative sign of the coefficient in the second column of Table 2. A holistic chi-square test under the null hypothesis that all gender effects were zero was conducted, namely H₀: c_k = 0, for k = 1, . . . , 5. The results ( $χ_{5}^{2} = 13.13;$ p < .05) suggested that ID’s power in predicting interest-major congruence was significantly different between low-SES males and females for at least one racial group.

Table 2.

Gender Effect and AUC Estimation of Interest Differentiation in Five Groups.

Race	Gender effect				AUC estimation
	c_k (SD)	95% PI	$χ_{1}^{2}$ ^a	p	Female	Male	$χ_{1}^{2}$ ^b	p
AIAN	−14.62 (6.07)	[−27.41, −3.03]	5.80	.02*	.66	.53	5.65	.02*
AHPI	−10.55 (5.84)	[−20.92, 2.03]	3.26	.07^†	.77	.67	3.22	.07^†
AA	−11.78 (5.20)	[−21.09, −0.72]	5.13	.02*	.67	.57	5.11	.02*
Latino/a	−12.58 (4.99)	[−21.67, −2.26]	6.37	.01*	.69	.58	6.29	.01*
White	−16.82 (5.76)	[−29.51, −6.57]	8.54	.004**	.70	.55	8.23	.004**
$χ_{5}^{2}$	13.13^c	NA	NA	NA	26.07^d	3.17	NA	NA
p	.02*	NA	NA	NA	.000***	.67	NA	NA

Note. AUC = area under the curve; AIAN = American Indian/Alaska Native; AHPI = Asian/Hawaiian/Pacific Islander; AA = African American.

Chi-square test on $c_{k} = 0$ for the specific racial group.

Chi-square test on equal AUCs of female and male.

Chi-square test on all the five c_ks equal to 0.

Chi-square test on all the five female groups’ AUCs equal to .5.

†

p < .10. *p < .05. **p < .01. ***p < .001.

The estimated negative gender effect of ID in predicting interest-major congruence was further tested for each racial group. The fourth column of Table 2 summarized the results of the five chi-square tests. Statistically significant negative gender differences were observed in four racial groups: the American Indian/Alaska Native group ( $χ_{1}^{2} = 5.80;$ p < .05), the African American group ( $χ_{1}^{2} = 5.13;$ p < .05), the Latino/a group ( $χ_{1}^{2} = 6.37;$ p < .05), and the White group ( $χ_{1}^{2} = 8.54;$ p < .01). The negative gender effect was marginally significant in the Asian/Hawaiian/Pacific Islander group ( $χ_{1}^{2} = 3.26;$ p < .10). The results implied that interest differentiation had significantly different power in predicting interest-major congruence between low-SES males and females. Visually, the female groups’ ROC curves (i.e., dashed lines in Figure 2) were closer to the left-above corner. It indicated that ID had stronger prediction power for low-SES females.

Similarly, the cross-validation chi-square test that females and males had the same AUC in a racial group indicated that interest differentiation had larger power in predicting low-SES females’ interest-major congruence—see the eighth and ninth columns of Table 2. The Asian/Hawaiian/Pacific Islander group had the highest values of AUC for both males (AUC = .67) and females (AUC = .77). In two groups, Asian/Hawaiian/Pacific Islander and White, the interest differentiation’s distinguishing power for females reached or surpassed the acceptable discrimination criterion .70 (Hosmer & Lemeshow, 1989).

For each gender group, racial differences were tested under the null hypothesis that the AUC values in the five racial groups were all equal to 0.5—see the last two rows of Table 2. The chi-square test rejected the null hypothesis ( $χ_{5}^{2} = 26.07;$ p< .001) for females. For males, the chi-square test failed to reject the null hypothesis ( $χ_{5}^{2} = 3.17;$ p = .67). This implied that there were no racial differences on the power of interest differentiation in predicting low-SES males’ interest-major congruence. However, interest differentiation had uneven power in predicting low-SES females’ interest-major congruence across the five racial groups.

Analysis 3: BROCA on Ordinal Data

Ordinal data

The large-scale online KCS system measured six interest clusters (Zytowski, 2001): Outdoor/Mechanical, Science/Technical, Arts/Communication, Social/Personal Services, Sales/Management, and Business/Retail. Six abbreviations, KOUT, KSCI, KART, KSOC, KSAL, and KBUS were used to represent the six Kuder interest clusters, respectively. Ordinal data were generated by ordering each student’s six interest scores of the six interest clusters. Higher rank indicated a student’s higher interest on a career cluster. For example, after sorting the six cluster scores in an ascending order, a student had the rank scores 2, 1, 5, 6, 3, and 4 for KOUT, KSCI, KART, KSOC, KSAL, and KBUS, respectively. This indicated that this student had the highest measured interests for Social/Personal Services careers and the lowest measured interests for Science/Technical careers.

Probit BROCA model

Let Y be the ordinal variable of a Kuder interest cluster. The observed ordinal responses of Y were hypothetically attributed to the latent variable (L) and corresponding thresholds ( $l_{1},$ $l_{2},$ $l_{3},$ $l_{4},$ $l_{5}$ ). The latent variable could be conceptually treated as a continuous psychological construct, namely, “career interest.” The ordinal response were associated with the latent variable as the following for a given Kuder interest cluster:

Y = {\begin{matrix} 1, L < l_{1} \\ 2, l_{1} \leq L < l_{2} \\ 3, l_{2} \leq L < l_{3} \\ 4, l_{3} \leq L < l_{4} \\ 5, l_{4} \leq L < l_{5} \\ 6, l_{5} \leq L \end{matrix}

This implied that, for example, Y took a value of 4 if the latent career interest score was between the third and the fourth thresholds (Metz, 1986; Pepe, 2000). The latent variable followed a standard normal distribution N(0, 1) for IMNC group and $N (a / b, b^{- 2})$ for IMC group (Pepe, 2003), with a and b were parameterized in Equation (1). The ordinal variable Y followed a multinomial distribution with a probability function defined in Equation (6).

p (Y = y) = {\begin{matrix} Φ (l_{y}) - Φ (l_{y - 1}), IMNC group \\ Φ (b l_{y} - a) - Φ (b l_{y - 1} - a), IMC group \end{matrix}

with y = 1, . . ., 6. Equations (5) and (6) specified the probit ordinal analysis model (Gelman & Hill, 2007).

Parameter estimation

This study used MCMC-based WinBUGS to estimate the parameters, rather than to solve the partial derivative joint functions of parameters (a, b, $l_{y}, \dots, l_{y - 1}$ ,) to obtain the maximum likelihood estimation (Dorfman & Alf, 1968, Green & Swets, 1966; Schonemann & Tucker, 1967). The ROC curve and the value of AUC were estimated by using Equations (1) and (2), respectively.

The prior of each of the five thresholds was set up as N(0, 1) and constraints were used to force their correct order ( $l_{1} < l_{2} < l_{3} < l_{4} < l_{5};$ Gajewski, Hart, Bergquist-Beringer, & Dunton, 2007). Because of the complexity of the likelihood of the data and the prior setting constraints, the joint posterior distribution of the parameters does not have a closed form. In this situation, WinBUGS uses MCMC sampling with the implementation of Metropolis-Hastings algorithm (Gelman & Hill, 2007). The parameter a’s prior was set as N(0, 10⁶). Following Gajewski et al. (2007), this study set b’s prior as the Log-Normal distribution with $μ_{logb}$ and $τ_{logb}$ being the mean and variance hyperparameters $μ_{logb}$ followed a normal distribution N(0, 10⁴) and $τ_{logb}^{- 2}$ a uniform distribution unif(0, 10⁶) (Gelman & Hill, 2007). The initial values of the threshold were set up to the maximum likelihood estimations from the non-Bayesian ordinal probit model.

Results of Analysis 3

The estimated values of AUC were .54, .63, .51, .45, .47, and .40 for KOUT, KSCI, KART, KSOC, KSAL, and KBUS, respectively. Science/Technical Cluster and Outdoor/Mechanical Cluster had the first and the second largest estimated AUC among the six interest clusters. For the Science/Technical Cluster, the estimations (standard deviations) of the seven parameters, $\hat{a},$ $\hat{b},$ ${\hat{l}}_{1},$ ${\hat{l}}_{2},$ ${\hat{l}}_{3},$ ${\hat{l}}_{4},$ and ${\hat{l}}_{5},$ were 0.42 (0.05), 0.84 (0.04), −0.98 (0.04), −0.41 (0.04), 0.03 (0.04), 0.49 (0.04), and 1.05 (0.05), respectively. For the Outdoor/Mechanical Cluster, these estimations (standard deviations) were 0.13 (0.05), 0.99 (0.04), −1.07 (0.05), −0.04 (0.04), 0.07 (0.04), 0.47 (0.04), and 1.14 (0.05). The chi-square test rejected the null hypothesis of equal ROC curve parameters, $H_{0} : a_{sci} = a_{out}$ , $b_{sci} = b_{out}$ ( $χ_{2}^{2} = 28.33;$ p < .001). It suggested that the two ROC curves were significantly different. In other words, the Science/Technical Cluster and Outdoor/Mechanical Cluster had different power in predicting interest-major congruence. It indicated that the prediction power of the Science/Technical Cluster (AUC = .63) was better than that of the Outdoor/Mechanical Cluster (AUC = .54). But the prediction power was low because the two values of AUC were below the acceptable discrimination criterion .70 (Hosmer & Lemeshow, 1989). Further analysis using gender as a covariate was conducted but failed to find any gender difference between the ordinal Science/Technical Cluster and Outdoor/Mechanical interest Cluster.

Discussion

This study used the BROCA approach to examine the prediction power of the interest measures collected from KCS system in predicting low-SES youth’s interest-major congruence. Findings from this study have significant implications to methodogy research and to educational practices.

Implications for Assessment Methodology

Three analyses in this study demonstrated that BROCA as an innovative approach has several strengths: (a) capability to handle continuous measures, nominal background variables, and ordinal rank measures; (b) graphical nature of presenting assessment results for test utility comparison; (c) availability of disparate statistical evidence including chi-square tests and nonparameter “rules of thumb”; (d) fast computation of incorporating priors on multilevel parameters; and (e) little requirement for higher level sample sizes. Because of the small number of higher level units, traditional hierarchal modeling approaches (e.g., HLM) cannot be used in systematic evaluation for educational assessment involing heriarchically structured data. Lissitz and Samuelsen (2007) suggested to examine the practical utility analysis through hierarchical liner model to answer questions such as “Are the outcomes from the test predictive of outcomes in a desired domain?” (Lissitz & Samuelsen, 2007, p. 444). The Bayesian approach, by nature, is a hierarchical modeling method, where the use of hyperparameters captures the characteristics of the lower level of parameters. Using noninformative priors did ease the model fit process and lead to appropriate results, which are “equivalents” to the maximum likelihood estimates. Bayesian model’s parameter estimates can be plugged in the chi-square test for the validation analysis to examine the predictive power of the test and to demonstrate its utility on that purpose.

The MCMC estimation based chi-square test empowered the ROC method in evaluating test utility and impact analysis. Traditionally, statistical testing on ROC analysis depends on the central limit theory and the normal distribution approximation of each parameter estimator (Pepe, 2003). BROCA involves the more powerful chi-square test (Metz & Kronman, 1980) to evaluate a composite hypothesis to provide multiple sources of evidences for assessment evaluation. The MCMC-based estimation avoided traditional heavy computation of bootstrapping in estimating variances. WinBUGS with the implementation of MCMC algorithm naturally estimated the parameters and their covariance matrix for chi-square tests in assessment evaluation and utility analysis.

The BROCA solves one of the challenges in the external investigation on the impact of testing. This challenge is that scale-specific indices are not directly comparable with one another in the practices of educational and psychological assessment involving multiple scales and multiple measures (e.g., Tracey & Robbins, 2006). BROCA handles different types of measures to provide a scale-free index (i.e., AUC) for test utility and impact analysis. The MCMC-based BROCA evaluates instruments through multiple-group comparisons. The multiple-group-multiple-instrument comparisons can be naturally conducted in BROCA by incorporating dummy coded Level 1 and Level 2 group membership variables as demonstrated in this career interest example.

Implications for Assessment Practices

In assessment practices, the strengths of BROCA can help researchers differentiate various measures and identify the most preditctive measure. For instance, in this particular study, our results show that BROCA helps researchers explain how a career assessment instrument functions differently across higher level clusters such as racial groups, nations, and time points. Detailed comparisons between interest differentiation and strength across gender and race/ethnicity groups suggested that interest differentiation was more powerful than interest strength in predicting low-SES youth’s interest-major congruence. The interaction between race/ethnicity and gender indicated a micro differential test utility functioning, which suggested that race/ethnicity differences vary across gender groups and, similarly, gender difference varies across racial groups.

Our results suggested that interest differentiation had a higher level of impact than interest strength in examining the utility of test outcomes. The practical importance of the finding is that simply having a clear sense of what a student really likes to do (i.e., interest strength) is less predictive of congruence than a more nuanced understanding of what she/he likes and does not like (i.e., interest differentiation). One implication is that career guidance practitioners should help young people explore their vocational “likes and dislikes” rather than just focusing on helping young people identify one vocational domain that they are interested in.

In addition, micro differences on the prediction powers of educational measures with respect to demographic characteristics demonstrate the differential test utility functioning. In our study, the test utility differential functioning in terms of gender and race/ethnicity suggested that low-SES males, specifically, White, African American, and American Indian/Alaska Native males, may need extra help in selecting college majors congruent with their interests. However, low-SES females in the groups of African American, Latino/a, and American Indian/Alaska Native displayed disadvantages and may need more help. One implication of this finding is that when working with low-SES students, practitioners should also pay attention to specific rece/ethinicity-gender groups (e.g., low-SES White males), and helping them develop a more nuanced and comprehensive understanding of their vocational interests rather than focusing on the one domain they express the strongest interest in.

Limitations and Future Research

Findings of this study favored the use of continuous measures of career interests, rather than ordinal measures, although the Bayesian receiver operating characteristic analysis approach can handle both types of data. The continuous measure of interest differentiation revealed informative moderating effects of gender and race/ethnicity in predicting interest-major congruence. Using ordinal data, this study did not find either gender or race/ethnicity differences. This study found that the Science/Technical career interest measure was statistically better than the Outdoor/Mechanical career interest measure in predicting interest-major congruence. However, the results may not be generalizable because of the low prediction power of ordinal measures of career-clusters observed in this study.

Conducting BROCA can be challenging for test evaluation and utility analysis in practice. For example, selecting appropriate priors for the hyperparameters can be difficult and programming in WinBUGS can be time consuming. For the highly skewed nonnormal measures, data transformation is needed to approximate normal distribution using Box-Cox method (O’Malley & Zou, 2006). However, composite measures such as the variance or the standard deviation of students’ career interest scores (Tracey & Robbins, 2006) require more complicated transformation methods to conduct BROCA. In-depth literature review and simulation studies on data transformation will be needed in future research. Alternative ROC estimation approaches such as semiparametric method (Cai, 2004) have been developed, but future research is needed to examine if and how these alternative approaches can be applied into test evaluation and utility analysis in educational measurement.

Footnotes

Acknowledgements

We would like to thank Catalina D’Achiardi-Ressler, Venessa Keesler, Yew Meng Thum, Alex von Eye, and Jing Lei for their help and feedback on the article. We gratefully acknowledge Jennifer (Johansen) Hutton for the permission to use the images that of this study was based on.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study is based on work supported by a Kuder, Inc. research grant awarded to the first and the second authors.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). The standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Cai

(2004). Semi parametric ROC regression analysis with placement values. Biostatistics, 5(1), 45-60.

Diemer

M. A.

Wang

Smith

(2010). Vocational interests and major selection among youth of color in poverty. Journal of Career Assessment, 18, 97-110.

Dorfman

D. D.

Alf

(1968). Maximum likelihood estimation of parameters of signal detection theory: A direct solution. Psychometrika, 33, 117-124.

Dorfman

D. D.

Alf

(1969). Maximum likelihood estimation of parameters of signal detection theory and determination of confidence intervals-rating method data. Journal of Mathematical Psychology, 6, 487-496.

Embretson

S. E.

(2007). Construct validity: A universal validity system or just another test evaluation procedure? Educational Researcher, 36, 449-455.

Fawcett

(2003). ROC graphs: Notes and practical considerations for data mining researchers (HP Laboratories Technical Report). Palo Alto, CA: Intelligent Enterprise Technologies Laboratory.

Gajewski

B. J.

Hart

Bergquist-Beringer

Dunton

(2007). Inter-rater reliability of pressure ulcer staging: Ordinal probit Bayesian hierarchical model that allows for uncertain rater response. Statistics in Medicine, 26, 4602-4618.

Gatsonis

C. A.

(1995). Random-effects models for diagnostic accuracy data. Academic Radiology, 2(1), 14-21.

10.

Gelman

Hill

(2007). Data analysis using regression and multilevel/hierarchical models. New York, NY: Cambridge University Press.

11.

Gorin

J. S.

(2007). Reconsidering issues in validity theory. Educational Researcher, 36, 456-462.

12.

Green

D. M.

Swets

J. A.

(1966). Signal detection theory and psychophysics. New York, NY: Wiley.

13.

Hanley

(1989). Receiver operating characteristic (ROC) methodology: The state of the art. Critical Reviews in Diagnostic Imaging, 29, 307-335.

14.

Hansen

C. J.

Zytowski

D. G.

(1979). The Kuder occupational interest inventory as a moderator of its predictive validity. Educational and Psychological Measurement, 39(1), 107-118.

15.

Hosmer

D. W.

Lemeshow

(1989). Applied logistic regression. New York, NY: John Wiley.

16.

Ihle-Helledy

Zytowski

D. G.

Fouada

N. A.

(2004). Kuder career search: Test-retest reliability and consequential validity. Journal of Career Assessment, 12, 285-297.

17.

Kelly

K. R.

(2002). Concurrent validity of the kuder career search activity preference scales and career clusters. Journal of Career Assessment, 10, 127-144.

18.

Lissitz

R. W.

Samuelsen

(2007). A suggested change in terminology and emphasis regarding validity and education. Educational Researcher, 36, 437-448.

19.

Lunn

D. J.

Thomas

Best

Spiegelhalter

(2000). Winbugs—A Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing, 10, 325-337.

20.

McGee

M. G.

(1979). Human spatial abilities: Psychometric studies and environmental, genetic, hormonal, and neurological influences. Psychological Bulletin, 86, 889-918.

21.

Metz

C. E.

(1986). ROC methodology in radiologic imaging. Investigative Radiology, 21, 720-733.

22.

Metz

C. E.

Kronman

H. B.

(1980). Statistical significance tests for binormal ROC curves. Journal of Mathematical Psychology, 22, 218-243.

23.

O’Malley

A. J.

Zou

K. H.

(2006). Bayesian multivariate hierarchical transformation models for ROC analysis. Statistics in Medicine, 25, 459-479.

24.

Pepe

M. S.

(1998). Three approaches to regression analysis of receiver operating characteristic curves for continuous test results. Biometrics, 54, 124-135.

25.

Pepe

M. S.

(2000). Receiver operating characteristic methodology. Journal of the American Statistical Association, 95, 308-311.

26.

Pepe

M. S.

(2003). The statistical evaluation of medical tests for classification and prediction. New York, NY: Oxford University Press.

27.

Porter

S. R.

Umbach

P. D.

(2006). College major choice: An analysis of person-environment fit. Research in Higher Education, 47, 429-449.

28.

R Development Core Team. (2008). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria: R Foundation for Statistical Computing.

29.

Schonemann

P. H.

Tucker

L. R.

(1967). A maximum likelihood solution for the method of successive intervals allowing for unequal stimulus dispersions. Psychometrika, 32, 403-417.

30.

Spiegelhalter

Thomas

Best

Lunn

(2003). WinBUGS Version 1.4 user manual. Cambridge, England: MRC Biostatistics Unit.

31.

Swanson

J. L.

(1993). Integrated assessment of vocational interests and self-rated skills and abilities. Journal of Career Assessment, 1(1), 50-57.

32.

Swets

J. A.

Dawes

R. M.

Monahan

(2000). Better decisions through science. Scientific American, 283, 82-87.

33.

Tanner

W. P. J.

Swets

J. A.

(1954). A decision-making theory of visual detection. Psychological Review, 61, 401-409.

34.

Toledano

Gatsonis

C. A.

(1995). Regression analysis of correlated receiver operating characteristic data. Academic Radiology, 2, 530-536.

35.

Tosteson

A. N. A.

Begg

C. B.

(1988). A general regression methodology for ROC curve estimation. Medical Decision Making, 8, 204-215.

36.

Tracey

T. J. G.

Robbins

S. B.

(2006). The interest-major congruence and college success relation: A longitudinal study. Journal of Vocational Behavior, 69(1), 64-89.

37.

Zytowski

D. G.

(2001). Kuder career search with person match: Career assessment for the 21st century. Journal of Career Assessment, 9, 229-241.