Abstract
When tests consist of multiple-choice and constructed-response items, researchers are confronted with the question of which item response theory (IRT) model combination will appropriately represent the data collected from these mixed-format tests. This simulation study examined the performance of six model selection criteria, including the likelihood ratio test, Akaike’s information criterion (AIC), corrected AIC, Bayesian information criterion, Hannon and Quinn’s information criterion, and consistent AIC, with respect to correct model selection among a set of three competing mixed-format IRT models (i.e., one-parameter logistic/partial credit [1PL/PC], two-parameter logistic/generalized partial credit [2PL/GPC], and three-parameter logistic/generalized partial credit [3PL/GPC]). The criteria were able to correctly select less parameterized IRT models, including the PC, 1PL, and 1PL/PC models. In contrast, the criteria were less able to correctly select more parameterized IRT models, including the GPC, 3PL, and 3PL/GPC models. Implications of the findings and recommendations are discussed.
Model selection is an important consideration among researchers who wish to appropriately represent the data with which they are working. Model selection methodology is not a novel concept and has been investigated within the context of multiple regression (McQuarrie & Tsai, 1998; Shimodaira, 1998), multilevel modeling (Gurka, 2006; Whittaker & Furlow, 2009), and structural equation modeling (De Gooijer & Koopman, 1988; Whittaker & Stapleton, 2006). More recently, the issue of model selection has gained attention within the item response theory (IRT) arena (Kang & Cohen, 2007; Kang, Cohen, & Sung, 2009). The selection of an incorrect IRT model not only leads to theoretically different interpretations of the data at hand (Lord, 1968), but it also leads to inappropriate conclusions with respect to other IRT utilities (e.g., parameter estimation, differential item functioning, person-fit assessment; DeMars, 2010).
Various IRT models exist and may be employed to help describe the performance of test takers on exams. Tests can be composed solely of multiple-choice items, solely of constructed-response items, or of a combination of the two (Rosa, Swygert, Nelson, & Thissen, 2001). While research investigating various model selection methods with solely dichotomous IRT models or solely polytomous IRT models has been conducted, no research, to the authors’ knowledge, has examined the use of model selection criteria with mixed-format IRT models (i.e., when a test includes multiple-choice items and constructed-response items). Consequently, the purpose of this article is to examine the use of model selection criteria when selecting among mixed-format IRT models.
In the following sections, a description of research that is relevant to the current study will be provided followed by a brief discussion of mixed-format tests and the implications associated with implementing mixed-format tests. The pertinent IRT models (dichotomous and polytomous) and pertinent model selection indices will subsequently be presented. A simulation study then is described, which examined these relevant model selection indices when selecting among the applicable IRT models described.
Relevant Research
There is a dearth of research in the area of relative model selection within the IRT arena (Embretson & Reise, 2000). The most relevant research in this area has examined model selection indices when selecting among exclusively dichotomous IRT models or polytomous IRT models. For instance, Kang and Cohen (2007) conducted a simulation study in which the performance of the likelihood ratio test (LRT), Akaike’s information criterion (AIC; Akaike, 1973), the Bayesian information criterion (BIC; Schwarz, 1978), the cross-validation log likelihood (CVLL; Bolt, Cohen, & Wollack, 2001; Geisser & Eddy, 1979; Gelfand & Dey, 1994), and the deviance information criterion (DIC; Spiegelhalter, Best, Carlin, & van der Linde, 2002) was examined with respect to correctly selecting among a set of estimated dichotomous IRT models when data were generated under a particular dichotomous IRT model. The dichotomous IRT models examined included the one-parameter logistic (1PL) or Rasch (1960) model, the two-parameter logistic (2PL; Birnbaum, 1968) model, and the three-parameter logistic (3PL; Birnbaum, 1968) model. Kang and Cohen also varied the test length (20 and 40 items), sample size (500 and 1,000), and ability distribution (normal, low, and high ability). All five of the model selection indices performed well with respect to selecting the correct model when data were generated using the 1PL and 2PL models. In contrast, the CVLL outperformed the remaining model selection criteria when correctly selecting the 3PL model, followed fairly closely only by the DIC with respect to accuracy. The LRT and AIC performed poorly when attempting to correctly select the 3PL model, and the BIC never correctly selected the 3PL model in any of the conditions examined. The authors concluded that the overall performance of the CVLL was the best with respect to accuracy.
In a subsequent simulation study, Kang et al. (2009) investigated the performance of the AIC, BIC, CVLL, and DIC when selecting among a set of estimated polytomous IRT models when data were generated under a particular polytomous IRT model. The polytomous IRT models employed in their study included the rating scale (RS; Andrich, 1978) model, the generalized partial credit (GPC; Muraki, 1992) model, the partial credit (PC; Masters, 1982) model, and the graded response (GR; Samejima, 1969) model. Other conditions varied in their study included test length (10 and 20 items), sample size (500 and 1,000), and number of categories per item (3 and 5). All four of the model selection criteria performed well when correctly selecting the RS, GPC, and PC models. The CVLL and the DIC, however, were outperformed by the AIC and the BIC when attempting to correctly select the GR model. The authors concluded that the BIC tended to generally outperform the remaining model selection criteria with respect to accuracy and consistency. Thus, the conclusion is counter to the results found for tests composed of dichotomously scored items.
It is recommended that the data fit the IRT model adequately prior to beginning the model comparison and selection process (de Ayala, 2009). Model-data fit in the IRT arena may be assessed at a variety of levels, including the model level, the item level, and the person level. Once model-data fit is deemed adequate, the comparison of a set of plausible IRT models and the selection of an appropriate IRT model may ensue.
It must be noted that there have been studies in which IRT model fit with respect to fit at the item level has been assessed. These item-fit assessments have been performed with single-format tests (e.g., Orlando & Thissen, 2000) as well as with mixed-item format tests (e.g., Chon, Lee, & Dunbar, 2010). While all aspects of model-data fit are important, the relative model comparison approach is an additional manner in which global IRT model fit may be assessed (Embretson & Reise, 2000) and is the focus of the current study with respect to mixed-format tests.
Mixed-Format Tests
The relative model comparison research conducted thus far in this area has focused on single-format tests. However, there are several item formats that one may choose from when creating a test to assess the skills and abilities of examinees. The most popular item formats include multiple-choice items, which are easy to score but difficult to write toward deeper levels of processing, and constructed-response items, which are difficult to score but easier to write toward deeper levels of processing among test takers. Given the advantages and disadvantages of multiple-choice items and constructed-response items, it seems reasonable to combine such item formats to psychometrically strengthen a test (Wainer & Thissen, 1993).
Several existing tests include mixed-item formats. Some of these tests include state assessments that have been administered in North Carolina and Wisconsin, most of the Advanced Placement (AP) subject examinations, as well as the National Assessment of Educational Progress (NAEP; Rosa et al., 2001). Mixed-item formats can necessitate parameter estimation using different IRT models within the same test. For example, the NAEP uses the 3PL and the GPC models simultaneously during parameter estimation. Using parameters from the 2000 NAEP in mathematics, data were generated in the current study according to combined dichotomous and polytomous models which were fitted to the data (correctly and incorrectly) to investigate the accuracy of model selection indices with mixed-format tests.
IRT Models
The appendix provides equations for the models used in the current study. The 3PL (Birnbaum, 1968) model may be used when items are scored dichotomously (i.e., correct or incorrect). Other dichotomous IRT models include the 2PL (Birnbaum, 1968) model and the 1PL or Rasch (1960) model. The GPC (Muraki, 1992) model may be employed when items are scored polytomously (e.g., math items in which partial credit is assigned). The PC (Masters, 1982) model may also be used with similar item types as with the GPC model.
Given that the NAEP data are calibrated using the 3PL/GPC combined model, the 3PL/GPC model is one of the models used to generate the data. Additional combined IRT models include the 2PL/GPC and the 1PL/PC models, which represent practical models of interest given the data types. The 3PL/GPC is the most parameterized model, followed by the 2PL/GPC, and subsequently by the 1PL/PC. Consequently, all of the models are hierarchically related in their order of parameterization. More specifically, the 1PL/PC model is nested within the 2PL/GPC model, which is nested within the 3PL/GPC model. Data were also generated using the less parameterized 2PL/GPC and 1PL/PC combined models to assess the impact of the number of parameters estimated in a model.
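The ordering of the three combined models by parameterization can be made concrete by counting item parameters. The sketch below is illustrative only: the per-item counts assume three-category polytomous items (two step difficulties each) and no additional common-slope or scale parameters, and the names `PARAMS_PER_ITEM` and `count_parameters` are hypothetical. Calibration software may count parameters differently.

```python
# Item parameters estimated per item under each combined model.
# Assumed for illustration: three-category polytomous items (two step
# difficulties each) and no extra common-slope or scale parameters.
PARAMS_PER_ITEM = {
    "1PL/PC": {"dichotomous": 1, "polytomous": 2},   # b       | b1, b2
    "2PL/GPC": {"dichotomous": 2, "polytomous": 3},  # a, b    | a, b1, b2
    "3PL/GPC": {"dichotomous": 3, "polytomous": 3},  # a, b, c | a, b1, b2
}


def count_parameters(model, n_dichotomous, n_polytomous):
    """Return the number of item parameters estimated under `model`."""
    per_item = PARAMS_PER_ITEM[model]
    return (per_item["dichotomous"] * n_dichotomous
            + per_item["polytomous"] * n_polytomous)


# Hypothetical test with 10 dichotomous and 8 three-category items (26 points).
counts = {model: count_parameters(model, 10, 8) for model in PARAMS_PER_ITEM}
print(counts)
```

Under these assumptions the 2PL/GPC and 3PL/GPC counts coincide when a test contains no dichotomous items, which is consistent with the polytomous-only baseline conditions comparing only the PC and GPC models.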
Model Selection Criteria
There exist a number of relative model comparison and selection methods that researchers may apply during the model selection process. When IRT models are nested, model selection may be based on an LRT (Thissen, Steinberg, & Wainer, 1988). The LRT, also referred to as $G^2$, may be used to compare the two models using the −2 log likelihood (−2LL), also called the deviance statistic (d), obtained from fitting each of the IRT models to the data. The LRT is the difference between the two nested models’ −2LLs, or deviance statistics (d), and is asymptotically distributed as a chi-square ($\chi^2$) statistic:
$$\mathrm{LRT} = d_{\text{restricted}} - d_{\text{unrestricted}},$$
where $d_{\text{restricted}}$ is the deviance statistic for the nested (restricted) model and $d_{\text{unrestricted}}$ is the deviance statistic for the more parameterized (unrestricted) model. The LRT has corresponding degrees of freedom equal to the difference in the number of parameters estimated (p) in each model. A significant LRT would indicate that the nested IRT model with fewer parameters has been oversimplified and the more parameterized IRT model should be selected. In contrast, a nonsignificant LRT would indicate that the two IRT models are comparable with respect to fit of the overall model and the nested model would most likely be selected in favor of parsimony.
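As a minimal sketch of this test, the Python code below takes the deviance and parameter count of two nested models, forms the deviance difference, and refers it to a chi-square distribution. The function names, the significance level of .05, and the example deviances are assumptions made for illustration; they are not taken from the study. The chi-square tail probability is computed with a standard series and continued-fraction routine so that only the standard library is needed; in practice a statistical library would normally be used instead.

```python
import math


def chi_square_sf(x, df):
    """Upper-tail probability of a chi-square variate with df > 0."""
    if x <= 0.0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    log_prefactor = a * math.log(z) - z - math.lgamma(a)
    if z < a + 1.0:
        # Series for the lower regularized gamma function P(a, z).
        term = 1.0 / a
        total = term
        n = 1
        while term > 1e-16 * total:
            term *= z / (a + n)
            total += term
            n += 1
        return max(0.0, 1.0 - total * math.exp(log_prefactor))
    # Continued fraction for the upper function Q(a, z) (modified Lentz).
    tiny = 1e-300
    b = z + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 1000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return math.exp(log_prefactor) * h


def likelihood_ratio_test(d_restricted, p_restricted,
                          d_unrestricted, p_unrestricted, alpha=0.05):
    """Return (statistic, df, p_value, significant) for two nested models."""
    statistic = d_restricted - d_unrestricted
    df = p_unrestricted - p_restricted
    if df <= 0:
        raise ValueError("the unrestricted model must estimate more parameters")
    p_value = chi_square_sf(statistic, df)
    return statistic, df, p_value, p_value < alpha


# Hypothetical deviances: a difference of 10 on 2 degrees of freedom.
# For df = 2 the upper-tail probability equals exp(-statistic / 2).
print(likelihood_ratio_test(10000.0, 20, 9990.0, 22))
```

A significant result favors the more parameterized model; a nonsignificant result favors the nested model.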
When IRT models are nonnested, however, the LRT is no longer an appropriate test. In this situation, then, researchers may employ information-based criteria during the model comparison and selection process. An advantage of information-based criteria is that they may be used when comparing and selecting among a set of nested and/or nonnested IRT models. While various model selection criteria exist, this article focuses on the information criteria that may be calculated using the deviance statistic (d) or −2LL, which is readily available in most IRT software.
Presumably, the most recognized information criterion is Akaike’s (1973) information criterion (AIC), which is calculated as follows:
$$\mathrm{AIC} = d + 2p,$$
where d is the deviance statistic or the −2LL (see Note 1) for a given model and p is the number of parameters estimated in the given model. When comparing two contending models, the model with the lowest AIC value would be selected as the model demonstrating better fit to the data. One criticism of the AIC is that it tends to overfit models, meaning that it tends to select more highly parameterized models (Shibata, 1976). Consequently, Hurvich and Tsai (1989) proposed a model selection criterion, called the finite sample corrected AIC (AICC), which incorporates a correction to the AIC to help adjust for its overfitting tendency and is calculated as follows:
$$\mathrm{AICC} = d + 2p + \frac{2p(p + 1)}{N - p - 1},$$
where N is the total sample size.
The AIC and the AICC are asymptotically efficient, meaning that they will select the model that best approximates the correct/true model when the correct/true model is not actually among the set of comparison models. Although the AIC and AICC are efficient model selection criteria, they are not consistent model selection criteria (Bozdogan, 1987; Hannon & Quinn, 1979; Hurvich & Tsai, 1989; Schwarz, 1978). Consistent selection criteria will select the correct/true model with probabilities close to or at 1.0 when the correct/true model is actually among the set of comparison models. Additional information-based criteria, which are classified as consistent model selection criteria, have extended the AIC to account for not only model complexity but also sample size. These include Schwarz’s (1978) Bayesian information criterion (BIC),
$$\mathrm{BIC} = d + p \ln(N),$$
Hannon and Quinn’s (1979) information criterion (HQIC),
$$\mathrm{HQIC} = d + 2p \ln(\ln(N)),$$
and Bozdogan’s (1987) consistent AIC (CAIC),
$$\mathrm{CAIC} = d + p\,(\ln(N) + 1),$$
where ln is the natural log.
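The five information criteria can be computed directly from a model's deviance (d), number of estimated parameters (p), and sample size (N). The following Python sketch uses the standard definitions of each criterion and then picks, for each criterion, the model with the smallest value. The function names and the deviances in the example are hypothetical and chosen only to show how the criteria can disagree.

```python
import math

CRITERIA = ("AIC", "AICC", "BIC", "HQIC", "CAIC")


def information_criteria(d, p, n):
    """Return the five criteria for deviance d, p parameters, sample size n."""
    if n - p - 1 <= 0:
        raise ValueError("AICC requires n > p + 1")
    return {
        "AIC": d + 2 * p,
        "AICC": d + 2 * p + 2 * p * (p + 1) / (n - p - 1),
        "BIC": d + p * math.log(n),
        "HQIC": d + 2 * p * math.log(math.log(n)),
        "CAIC": d + p * (math.log(n) + 1),
    }


def select_models(fits, n):
    """Map each criterion to the model with the smallest value.

    `fits` maps a model name to a (deviance, n_parameters) pair.
    """
    table = {name: information_criteria(d, p, n) for name, (d, p) in fits.items()}
    return {crit: min(table, key=lambda name: table[name][crit])
            for crit in CRITERIA}


# Hypothetical deviances and parameter counts for one replication, N = 500.
fits = {
    "1PL/PC": (20100.0, 26),
    "2PL/GPC": (20000.0, 44),
    "3PL/GPC": (19990.0, 54),
}
print(select_models(fits, 500))
```

With these made-up values the AIC, AICC, and HQIC select the 2PL/GPC model, whereas the BIC and CAIC, which penalize each parameter more heavily at this sample size, select the 1PL/PC model.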
There is little agreement regarding whether efficient or consistent selection criteria are best (McQuarrie & Tsai, 1998), and a debate concerning this issue is beyond the scope of this article. The performance of these five model selection criteria will be gauged with respect to selecting the correct IRT model from a set of competing incorrect models. While this benchmark is the definition of consistency, it is difficult to evaluate the efficacy of these information-based model selection criteria differently.
To summarize, it is important to select the correct IRT model to provide theoretically meaningful interpretations of the data. Previous research has shown that the CVLL performed more accurately when distinguishing between dichotomous IRT models whereas the BIC performed more accurately when differentiating between polytomous IRT models. Again, criteria that may be calculated using the deviance statistic (d) or the −2LL, which are readily available in traditional IRT software programs, were investigated. Thus, the CVLL, which requires Bayesian estimation of parameters in separate software, is not evaluated in this study. The AIC and BIC, which are included in the current study, did well differentiating the 1PL and 2PL models but did poorly when differentiating the 3PL model in Kang and Cohen’s (2007) study. Thus, one purpose of this study is to extend their study by including additional information-based model selection criteria (viz., AICC, HQIC, and CAIC) to determine if their findings will be replicated.
The inconsistent findings from the previous research also raise the question of which index would perform more accurately when tests include mixed-item formats. Given the increase in the administration of mixed-format tests, it is essential to understand how these information criteria, among others, operate with respect to relative mixed-format IRT model selection. The results of this study should interest researchers who wish to assess global model fit and select a mixed-format IRT model that most appropriately represents the data at hand.
Method
A simulation study was conducted to examine the performance of six model selection criteria, including the LRT, AIC, AICC, BIC, HQIC, and CAIC, for mixed-format IRT models. Conditions that were varied included sample size, proportion of dichotomous and polytomous items on the test, total score points, and generating mixed-format IRT model.
Sample Size
Sample size was varied to represent small and moderate sample sizes. A minimum of 500 participants has been recommended for reasonable parameter estimates for polytomous models (Reise & Yu, 1990) and, thus, was selected to represent a small sample size. A sample size of 1,000 was chosen to represent a moderate sample size in which parameter estimates would be reasonable.
Proportion of Item Type
Five combinations of dichotomous and polytomous items on each test were included to examine the performance of the model selection indices under various test combination scenarios. These combinations consisted of a particular number of dichotomously scored items and polytomously scored items to result in 100%, 60%, approximately 50%, 40%, and 0% of the total score points on the test attributed to the dichotomous items with 0%, 40%, approximately 50%, 60%, and 100% of the total score points on the test attributed to the polytomous items, respectively. The term approximately is used to describe conditions in which an equal amount of points on the test was to be attributed to dichotomously scored and polytomously scored items. This is because combinations of the mixed-item types could not result in an exactly equal amount of points to be attributed to the test’s total score points, which was also a manipulated condition and is described next.
Total Score Points
Test length was not directly manipulated given the mixture of dichotomously and polytomously scored items. Instead, the number of possible score points on the test was manipulated, resulting in different test lengths. The dichotomous items were scored as correct or incorrect, meaning that they are worth one point each. The polytomous items consisted of three categories and were scored as 0, 1, or 2, indicating that they are worth 2 points each. The combination of dichotomously and polytomously scored items would add up to either 26 or 50 score points. Fully crossing the levels of this condition with the five levels of proportion of item type resulted in 10 different tests (see Table 1), which varied in length from 13 items total to 50 items total.
Table 1. Number of Each Item Type on Each Test as a Function of Percentage of Item Type and Total Score Points
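The reason an exactly equal split of score points is unattainable follows from the scoring: with 1-point dichotomous items and 2-point polytomous items, the polytomous share of the total is always even, so half of 26 (13) or half of 50 (25) cannot be reached. The short Python sketch below enumerates every feasible split to illustrate this. It is an arithmetic illustration only, with a hypothetical function name, and does not reproduce the item counts the authors actually used.

```python
def feasible_splits(total_points, polytomous_points=2):
    """Enumerate (n_dichotomous, n_polytomous, pct_dichotomous) for a test.

    Dichotomous items are worth 1 point and polytomous items
    `polytomous_points` points; every split sums to `total_points`.
    """
    splits = []
    for n_poly in range(total_points // polytomous_points + 1):
        n_dich = total_points - polytomous_points * n_poly
        splits.append((n_dich, n_poly, 100.0 * n_dich / total_points))
    return splits


for total in (26, 50):
    exact_half = [s for s in feasible_splits(total) if s[2] == 50.0]
    print(total, "points, exact 50/50 splits:", exact_half)
```

The same enumeration shows the range of test lengths: 26 points from polytomous items alone requires 13 items, and 50 points from dichotomous items alone requires 50 items.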
Generating Mixed-Format IRT Models
Three mixed-format IRT model combinations were selected when generating the data. The NAEP calibrates using the 3PL/GPC and thus was a logical choice as one of the generating models. To examine the impact of the number of parameters estimated in a model, the 2PL/GPC and 1PL/PC models were also used as generating models.
Data Generation and Analysis
Parameter estimates used to generate the data were taken from the 2000 NAEP in mathematics for Grades 4, 8, and 12. The selection of item parameters was done randomly across all three grade levels in correspondence with the proportion of items for each of the five content areas (algebra, data analysis, geometry, measurement, and number sense) on the exam. This was done for each of the 10 (Proportion of Item Type [5] × Total Score Points [2]) test conditions. While the NAEP in mathematics consists of polytomous items with more than three categories, item parameters for three-category items were selected in the present study for simplicity. The criteria for item selection were that discrimination (a) and difficulty (b) parameters were not greater than an absolute value of 4.5 and that the guessing (c) parameter was not greater than .40 to represent acceptable item parameter estimates. In addition, a value of .40 was added to the original discrimination (a) parameters to resemble estimates more likely found on high-stakes tests (Pastor, Dodd, & Chang, 2002). Descriptive statistics (means and standard deviations) of the generating item parameters for each of the 10 (Proportion of Item Type [5] × Total Score Points [2]) test conditions under the 3PL/GPC model are reported in Table 2. The guessing parameters (c) were set equal to 0.0 for the 1PL and 2PL models, and the discrimination parameters (a) were set equal to 1.0 for the 1PL and PC models.
Table 2. Means (Standard Deviations) of Generating Item Parameters for Each of the 10 (Proportion of Item Type [5] × Total Score Points [2]) Test Conditions
Note: The discrimination parameters (a) and the guessing parameters (c) for the one-parameter logistic (1PL) and 1PL/partial credit (PC) generating models were set to values of 1.0 and 0.0, respectively. The guessing parameters (c) for the 2PL and 2PL/generalized partial credit (GPC) generating models were set to values of 0.0. The discrimination parameters (a) for the PC generating model were set to values of 1.0.
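The item screening and discrimination adjustment described for the generating parameters can be sketched as a simple filter. The Python code below is one possible reading, not the authors' procedure: it assumes the screening is applied to the original parameters before .40 is added to the discrimination, and the item dictionaries, the function name, and the example values are all hypothetical.

```python
def screen_and_adjust(items, max_abs=4.5, max_c=0.40, a_shift=0.40):
    """Keep items within the stated bounds, then add a_shift to a.

    Each item is a dict with 'a' (discrimination), 'b' (list of difficulty
    or step-difficulty values), and optionally 'c' (guessing).
    Assumption: bounds are checked on the original, unshifted parameters.
    """
    kept = []
    for item in items:
        a = item["a"]
        c = item.get("c", 0.0)
        out_of_bounds = (abs(a) > max_abs
                         or any(abs(b) > max_abs for b in item["b"])
                         or c > max_c)
        if out_of_bounds:
            continue
        adjusted = dict(item)
        adjusted["a"] = a + a_shift
        kept.append(adjusted)
    return kept


# Hypothetical parameter values, not NAEP estimates.
pool = [
    {"a": 1.0, "b": [0.3], "c": 0.20},   # kept
    {"a": 0.8, "b": [5.1], "c": 0.15},   # dropped: |b| > 4.5
    {"a": 1.1, "b": [-0.2], "c": 0.45},  # dropped: c > .40
    {"a": 0.9, "b": [-0.5, 0.7]},        # kept (polytomous, no c)
]
print(screen_and_adjust(pool))
```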
Item responses were simulated according to one of the three mixed-format IRT model combinations (i.e., 3PL/GPC, 2PL/GPC, and 1PL/PC) in each condition using the SAS IRTGEN program (Whittaker, Fitzpatrick, Williams, & Dodd, 2003). Once item responses were generated according to the mixed-format IRT model combination in each condition, PARSCALE version 4.1 (Muraki & Bock, 2003) was used to calibrate the data for each of the three mixed-format combination IRT models (i.e., 3PL/GPC, 2PL/GPC, and 1PL/PC) using marginal maximum likelihood (MML) estimation. A normal prior distribution of ability was used during the calibration process. The six model selection indices were then calculated for all of the estimated models using SAS (version 9.2; SAS Institute Inc., 2007). The model selected by the LRT and the model with the smallest value for each of the five information-based model selection indices were then documented within each replication. Fifty data sets per condition were generated.
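The study generated responses with SAS IRTGEN; the Python sketch below is not that program but a minimal illustration of the same idea for a mixed-format test. It assumes the common logistic forms of the 3PL and GPC models with a scaling constant of D = 1.7 and abilities drawn from a standard normal distribution; both are assumptions of this sketch, and the function names and item values are hypothetical. Setting c = 0 gives the 2PL model, and additionally setting a = 1 gives the 1PL and PC models.

```python
import math
import random

D = 1.7  # assumed scaling constant


def category_probabilities(theta, item):
    """Return category probabilities for one item at ability theta.

    An item with one 'b' value is treated as dichotomous (3PL); an item with
    several 'b' values is treated as polytomous (GPC step difficulties).
    """
    a, bs, c = item["a"], item["b"], item.get("c", 0.0)
    if len(bs) == 1:
        p_correct = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - bs[0])))
        return [1.0 - p_correct, p_correct]
    # GPC: cumulative sums of D * a * (theta - b_v); category 0 has sum 0.
    cumulative = [0.0]
    for b in bs:
        cumulative.append(cumulative[-1] + D * a * (theta - b))
    peak = max(cumulative)  # subtract the maximum for numerical stability
    weights = [math.exp(z - peak) for z in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def simulate_responses(items, n_persons, seed=0):
    """Simulate an n_persons x n_items matrix of scored responses."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)  # assumed standard normal ability
        row = []
        for item in items:
            probs = category_probabilities(theta, item)
            row.append(rng.choices(range(len(probs)), weights=probs)[0])
        data.append(row)
    return data


# Hypothetical items: two dichotomous, one three-category polytomous.
items = [
    {"a": 1.0, "b": [-2.0]},
    {"a": 1.0, "b": [2.0], "c": 0.0},
    {"a": 1.2, "b": [-0.5, 0.5]},
]
data = simulate_responses(items, n_persons=2000, seed=1)
```

Fixing the seed makes a replication reproducible; a full simulation would repeat this for each condition and replication before calibrating the competing models.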
Results
Parameter Estimate Recovery
Parameter estimate recovery was examined by calculating correlations between the generating item parameters and their corresponding estimates from PARSCALE within each condition for the 3PL/GPC, 2PL/GPC, and 1PL/PC generating models. The correlations between the generating and respective estimated discrimination parameters (a), difficulty parameters (b), and guessing parameters (c) for the 3PL model in the 100% dichotomously scored test and mixed-item format test conditions ranged from .57 to .83, from .92 to .98, and from .17 to .72, respectively. Lower correlations, particularly among generating and estimated guessing parameters, were observed mostly in the unequal percentage of dichotomously and polytomously scored test conditions (40/60 and 60/40) when the total score points were equal to 26. The correlations between the generating and the matching estimated discrimination parameters (a) and difficulty parameters (b) for the 2PL model in the 100% dichotomously scored test and mixed-format test conditions ranged from .84 to .96 and from .99 to 1.00, respectively. The correlations between the generating and corresponding estimated difficulty parameters (b) for the 1PL model in the 100% dichotomously scored test and mixed-item format test conditions ranged from .99 to 1.00. These correlations are fairly comparable with those found in Kang and Cohen’s (2007) study.
The correlations between the generating and respective estimated discrimination parameters (a) and step difficulty parameters (b1 and b2) for the GPC model in the 100% polytomously scored test and mixed-item format test conditions ranged from .32 to .93 and from .90 to .99, respectively. Lower correlations among generating and estimated discrimination parameters were observed mostly in the 60/40 percentage of dichotomously and polytomously scored test conditions when the total score points were equal to 26. The correlations between the generating and corresponding estimated step difficulty parameters (b1 and b2) for the PC model in the 100% polytomously scored test and mixed-item format test conditions ranged from .91 to 1.00.
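The recovery check reported here amounts to correlating each set of generating parameters with its estimates. The sketch below assumes the Pearson product-moment correlation, which the text does not state explicitly, and uses hypothetical parameter values in place of PARSCALE output.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical generating and estimated difficulty parameters.
generating_b = [-1.5, -0.5, 0.0, 0.8, 1.6]
estimated_b = [-1.4, -0.6, 0.1, 0.7, 1.8]
print(round(pearson_r(generating_b, estimated_b), 3))
```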
Model Selection Criteria Performance
The model selection results from the polytomous IRT model comparisons (GPC vs. PC) in the 0/100 baseline test conditions are presented first followed by the model selection results from the dichotomous IRT model comparisons (1PL vs. 2PL vs. 3PL) in the 100/0 baseline conditions. The model selection results from the mixed-format IRT model comparisons (1PL/PC vs. 2PL/GPC vs. 3PL/GPC) in the remaining mixed-item test conditions are subsequently presented.
0/100 Baseline Test Conditions
PC model selection
In the 100% polytomously scored (0/100) baseline test conditions, data were generated using the PC model to assess the accuracy of the six model selection indices when selecting among the correct polytomous PC model and the incorrect polytomous GPC model. Accordingly, the LRT should not be statistically significant when comparing these models, indicating that the PC model is correctly selected in these conditions. The LRT performed well, distinguishing between the GPC and PC models in all four (Total Score Points [2] × Sample Size [2]) of the 0/100 baseline conditions with correct model selection rates of 90% or greater. The remaining five information-based model selection criteria, including the AIC, the AICC, the BIC, the HQIC, and the CAIC, demonstrated proficient performance, correctly selecting the PC model with at least 98% accuracy in all four of these conditions.
GPC model selection
Data were also generated using the GPC model to examine the accuracy of the model selection criteria when selecting among the correct polytomous GPC model and the incorrect polytomous PC model. Consequently, the LRT comparing the GPC with the nested PC model should be statistically significant, indicating correct selection of the GPC model in these conditions. The results indicated that the LRT, the AIC, and the AICC performed well, correctly selecting the GPC model with 94% or greater accuracy in these four baseline scenarios (see Figure 1). In contrast, the BIC and the CAIC tended to incorrectly select the PC model more often, with selection rates reaching 100%. The HQIC performed adequately only when sample size was large (N = 1,000), with 100% accuracy rates (see Figure 1).

Figure 1. Percentage of times (out of 50 replications) each model was selected by the model selection criteria when data were generated using the GPC model for the 0% dichotomously scored/100% polytomously scored baseline polytomous tests as a function of total score points and sample size
100/0 Baseline Test Conditions
1PL model selection
In the 100% dichotomously scored (100/0) baseline test conditions, data were generated using the 1PL model to investigate the performance of the model selection indices when selecting among the correct 1PL dichotomous model and the incorrect 2PL and 3PL dichotomous models. Hence, the LRT should not be statistically significant when comparing the 1PL model with the 2PL model or with the 3PL model, which would indicate correct selection of the 1PL model (see Note 2). The correct model selection rates for the LRT in three of these four (Total Score Points [2] × Sample Size [2]) 100/0 baseline conditions were 90% or greater. When the test included 50 total score points and sample size was large (N = 1,000), however, the LRT had more difficulty differentiating between the 1PL and the 3PL models with only 74% accuracy. While this selection rate may be acceptable, it is markedly lower than the selection rates in the remaining conditions. The remaining five information-based model selection indices correctly selected the 1PL model with 100% accuracy rates in all four 100/0 baseline conditions.
2PL model selection
Data were generated using the 2PL model in the 100% dichotomously scored (100/0) baseline test conditions to examine the accuracy of the model selection criteria when selecting among the correct 2PL dichotomous model and the incorrect 1PL and 3PL models. Consequently, the LRT should be statistically significant when comparing the 2PL and 1PL models whereas it should not be statistically significant when comparing the 2PL and 3PL models, indicating correct selection of the 2PL (see Note 2). The results indicated that the LRT performed adequately in the small sample size scenarios (N = 500) with at least 80% accuracy (see Figure 2). In the large sample size scenarios (N = 1,000), however, the LRT was less able to distinguish between the 3PL and 2PL models with accuracy rates as low as 52%. In contrast, the AIC, AICC, and HQIC performed optimally in all four of the 100/0 baseline conditions with 92% or greater accuracy (see Figure 2). The BIC performed well in the large sample size scenarios (N = 1,000) and when the test included 50 total score points with a minimum accuracy rate of 82% in these three conditions. The CAIC only performed well in the large sample size (N = 1,000) and 50 total score points condition in which it reached 100% accuracy. When the BIC and CAIC did not correctly select the 2PL model, they would incorrectly select the less parameterized 1PL model.

Figure 2. Percentage of times (out of 50 replications) each model was selected by the model selection criteria when data were generated using the two-parameter logistic (2PL) model for the 100% dichotomously scored/0% polytomously scored baseline dichotomous tests as a function of total score points and sample size
3PL model selection
Data generated using the 3PL model in the 100% dichotomously scored (100/0) baseline test conditions were used to examine the performance of the model selection indices when selecting among the correct 3PL dichotomous model and the incorrect 1PL and 2PL dichotomous models. Hence, the LRT should be statistically significant when comparing the 3PL model with both the 1PL and 2PL models, indicating correct selection of the 3PL (see Note 2). The results indicated that the LRT performed adequately in only one of these four 100/0 baseline conditions in which the sample size was large (N = 1,000) with 50 total score points, reaching 82% accuracy (see Figure 3). It was more difficult for the LRT to differentiate between the 3PL and 2PL models, with accuracy rates as low as 30% in the remaining three 100/0 baseline conditions. The information-based criteria did not fare much better. For instance, the AIC was the only index to perform adequately in the same large sample size (N = 1,000) and 50 total score points condition with 74% accuracy (see Figure 3). The AIC, AICC, and HQIC tended to incorrectly select the 2PL model when they did not select the 3PL model in a replication. The BIC and the CAIC incorrectly selected the 2PL and 1PL models more frequently.

Figure 3. Percentage of times (out of 50 replications) each model was selected by the model selection criteria when data were generated using the three-parameter logistic (3PL) model for the 100% dichotomously scored/0% polytomously scored baseline dichotomous tests as a function of total score points and sample size
Mixed-Format Test Conditions
1PL/PC model selection
In the mixed-format conditions, the LRT should not be statistically significant when comparing the 1PL/PC model with either the 2PL/GPC model or the 3PL/GPC model to indicate correct selection of the 1PL/PC model (see Note 3). While the LRT was able to correctly select the 1PL/PC model with 80% or greater accuracy in the majority of the mixed-format conditions, it was less able to correctly select the 1PL/PC model in the 40/60 and ≈50/50 mixed-item conditions in which the test included 50 total score points under the large sample size scenarios (N = 1,000; see Table 3). Though not presented in Table 3, when the LRT did not correctly select the 1PL/PC model, it tended to incorrectly select the 3PL/GPC model. Model selection results of the 1PL/PC model for the five information-based model selection criteria are not presented graphically, nor are they tabled. This is because all of the information-based model selection criteria correctly selected the 1PL/PC model in 98% or more of the 50 replications in each manipulated condition.
Table 3. Percentage of Times (Out of 50 Replications) the Likelihood Ratio Test (LRT) Correctly Selected the 1PL/PC in Each Mixed-Format Condition.
Note: 1PL = one-parameter logistic; PC = partial credit.
2PL/GPC model selection
Correct 2PL/GPC model selection for the LRT was classified as occurring when the LRT comparing the 2PL/GPC and the 3PL/GPC models was not statistically significant, but the LRT comparing the 2PL/GPC and 1PL/PC models was statistically significant (see Note 3). When the total score points were attributed mostly to polytomously scored items in the 40/60 test conditions, the LRT performed well in all valid conditions with the exception of the large sample size (N = 1,000) with 50 total score points condition, in which it correctly selected the 2PL/GPC model in only 56% of the 50 replications (see Figure 4), demonstrating difficulty differentiating it from the 3PL/GPC. The AIC, AICC, and HQIC performed well in differentiating the 2PL/GPC from the 3PL/GPC and 1PL/PC models, with at least 90% accuracy in all four conditions (see Figure 4). The BIC and the CAIC incorrectly selected the 1PL/PC model more frequently in the small sample size scenarios (N = 500) but correctly selected the 2PL/GPC with at least 96% accuracy when the sample size was large (N = 1,000).

Figure 4. Percentage of times (out of 50 replications) each model was selected by the model selection criteria when data were generated using the two-parameter logistic/generalized partial credit (2PL/GPC) model for the 40% dichotomously scored/60% polytomously scored tests as a function of total score points and sample size.
When the total score points were attributed to approximately equal dichotomously scored and polytomously scored (≈50/50) items, the LRT performed with at least 82% accuracy in all valid conditions except when sample size was large (N = 1,000) with 50 total score points, correctly selecting the 2PL/GPC model in only 68% of the 50 replications (see Figure 5). Again, it had trouble differentiating between the 2PL/GPC model and the more parameterized 3PL/GPC model. The AIC, AICC, and HQIC correctly selected the 2PL/GPC model with 98% or greater accuracy in all four ≈50/50 test conditions. The BIC and CAIC did not perform with a high level of accuracy in the small sample size (N = 500) scenarios, but did reach at least 98% accuracy in the large sample size (N = 1,000) scenarios. Again, when the BIC and CAIC did not correctly select the 2PL/GPC model, they tended to incorrectly select the less parameterized 1PL/PC model.

Figure 5. Percentage of times (out of 50 replications) each model was selected by the model selection criteria when data were generated using the two-parameter logistic/generalized partial credit (2PL/GPC) model for the approximately 50% dichotomously scored/50% polytomously scored tests as a function of total score points and sample size.
When the total score points were attributed to mostly dichotomously scored (60/40) items, the LRT selected the 2PL/GPC and the 3PL/GPC models almost equally when sample size was large (N = 1,000) with 50 total score points (see Figure 6). The LRT did correctly select the 2PL/GPC model with at least 80% accuracy in the remaining three 60/40 conditions. The AIC, AICC, and HQIC correctly selected the 2PL/GPC model with 100% accuracy in all four of the 60/40 conditions. The BIC and CAIC correctly selected the 2PL/GPC model with 88% or greater accuracy in all of the 60/40 conditions with the exception of the small sample size scenario (N = 500) in which the total score points were equal to 26 (see Figure 6).

Figure 6. Percentage of times (out of 50 replications) each model was selected by the model selection criteria when data were generated using the two-parameter logistic/generalized partial credit (2PL/GPC) model for the 60% dichotomously scored/40% polytomously scored tests as a function of total score points and sample size.
3PL/GPC model selection
The LRT was tallied as correctly selecting the 3PL/GPC when the LRT comparing the 3PL/GPC with the 2PL/GPC model and with the 1PL/PC model was statistically significant (see Note 3). When the score points were attributed mostly to polytomously scored items in the 40/60 test conditions, the LRT was able to correctly select the 3PL/GPC model with 80% accuracy in the large sample size (N = 1,000) and 50 total score points condition (see Figure 7). In the remaining conditions, however, the LRT demonstrated difficulty differentiating the 3PL/GPC from the 2PL/GPC with correct selection rates as low as 32%. The AIC, AICC, and HQIC were unable to adequately distinguish between the 3PL/GPC and the 2PL/GPC models with correct selection rates ranging from 0% to 56% (see Figure 7). The BIC and the CAIC incorrectly selected the 1PL/PC model in at least 98% of the 50 replications in each of the 40/60 conditions.
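The three LRT classification rules described in this section (for the 1PL/PC, 2PL/GPC, and 3PL/GPC models) can be combined into a single decision function. The following is a minimal sketch, not the authors' code: the function names, the p-value inputs, and the significance level of .05 are illustrative assumptions, and patterns of significance not covered by the stated rules are returned as undetermined.

```python
from typing import Optional, Sequence


def lrt_selected_model(p_1_vs_2: float, p_1_vs_3: float, p_2_vs_3: float,
                       alpha: float = 0.05) -> Optional[str]:
    """Map three pairwise LRT p-values to the model the LRT is said to select.

    alpha = .05 is an assumed significance level (not stated in this section).
    """
    sig_12 = p_1_vs_2 < alpha  # 1PL/PC vs. 2PL/GPC
    sig_13 = p_1_vs_3 < alpha  # 1PL/PC vs. 3PL/GPC
    sig_23 = p_2_vs_3 < alpha  # 2PL/GPC vs. 3PL/GPC

    # 1PL/PC: not significant against either more parameterized model.
    if not sig_12 and not sig_13:
        return "1PL/PC"
    # 2PL/GPC: significant against 1PL/PC, not significant against 3PL/GPC.
    if sig_12 and not sig_23:
        return "2PL/GPC"
    # 3PL/GPC: significant against both less parameterized models.
    if sig_23 and sig_13:
        return "3PL/GPC"
    # Any other pattern is not covered by the rules in the text.
    return None


def percent_correct(selections: Sequence[Optional[str]], true_model: str) -> float:
    """Percentage of replications in which the generating model was selected."""
    hits = sum(1 for s in selections if s == true_model)
    return 100.0 * hits / len(selections)
```

With 50 replications per condition, every reported percentage is a multiple of 2%; for example, 28 correct selections out of 50 correspond to the 56% reported for one of the 2PL/GPC conditions.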

Figure 7. Percentage of times (out of 50 replications) each model was selected by the model selection criteria when data were generated using the three-parameter logistic/generalized partial credit (3PL/GPC) model for the 40% dichotomously scored/60% polytomously scored tests as a function of total score points and sample size.
When the total score points were attributed to approximately equal dichotomously scored and polytomously scored (≈50/50) items, the LRT performed with at least 90% accuracy in the large sample size conditions (N = 1,000; see Figure 8). Conversely, it was less able to differentiate between the 3PL/GPC and the 2PL/GPC model when the sample size was small (N = 500), yielding accuracy rates as low as 46%. The AIC and the AICC correctly selected the 3PL/GPC model in at least 98% of the 50 replications in the large sample size (N = 1,000) and 26 total score points condition; however, they were unable to effectively differentiate between the 3PL/GPC and the 2PL/GPC in the remaining conditions, with accuracy rates as low as 10%. The AIC and the AICC tended to incorrectly select the 2PL/GPC model when they did not correctly select the 3PL/GPC model in a replication. The HQIC performed poorly in these four ≈50/50 mixed-format conditions, obtaining correct selection rates ranging from 2% to 58% (see Figure 8). The HQIC had a tendency to incorrectly select the 2PL/GPC model more frequently. The BIC and CAIC never selected the 3PL/GPC in the four ≈50/50 mixed-format conditions with the exception of the large sample size (N = 1,000) and 26 total score points condition, in which the BIC correctly selected the 3PL/GPC in only 2% of the 50 replications. The BIC and CAIC incorrectly selected the 1PL/PC model more frequently when sample size was small (N = 500), whereas they incorrectly selected the 2PL/GPC model more frequently when sample size was large (N = 1,000; see Figure 8).

Figure 8. Percentage of times (out of 50 replications) each model was selected by the model selection criteria when data were generated using the three-parameter logistic/generalized partial credit (3PL/GPC) model for the approximately 50% dichotomously scored/50% polytomously scored tests as a function of total score points and sample size.
When the total score points were attributed to mostly dichotomously scored (60/40) items, the LRT correctly selected the 3PL/GPC model with at least 90% accuracy in all conditions with the exception of the small sample size (N = 500) with 26 total score points condition in which it only selected the 3PL/GPC model with 28% accuracy (see Figure 9). Again, it was more difficult for the LRT to distinguish between the 3PL/GPC and 2PL/GPC models. The AIC performed well in the 50 total score point scenarios with correct selection rates of 100% and 98% when sample size was small (N = 500) and when sample size was large (N = 1,000), respectively. The AIC did not perform as well in the 26 total score point scenarios, selecting the 3PL/GPC with 14% and 70% accuracy in the small (N = 500) and large (N = 1,000) sample size scenarios, respectively (see Figure 9). The AICC and the HQIC performed with at least 96% accuracy in the large sample size (N = 1,000) with 50 total score points condition; however, correct selection rates for the AICC and the HQIC in the remaining three 60/40 mixed-format conditions ranged from 4% to 64% and from 0% to 22%, respectively. The AIC and the AICC tended to incorrectly select the 2PL/GPC model when not correctly selecting the 3PL/GPC model. The HQIC had a tendency to incorrectly select the 2PL/GPC and 1PL/PC models more frequently. The BIC and CAIC never selected the 3PL/GPC in the four 60/40 mixed-format conditions with the exception of the large sample size (N = 1,000) with 50 total score points condition in which the BIC correctly selected the 3PL/GPC in only 4% of the 50 replications. The BIC and CAIC incorrectly selected the 1PL/PC and 2PL/GPC models more frequently (see Figure 9).

Figure 9. Percentage of times (out of 50 replications) each model was selected by the model selection criteria when data were generated using the three-parameter logistic/generalized partial credit (3PL/GPC) model for the 60% dichotomously scored/40% polytomously scored tests as a function of total score points and sample size.
Discussion
This article assessed the performance of six model selection methods, including the LRT, AIC, AICC, BIC, HQIC, and CAIC, used to select among a set of mixed-format IRT models. Unfortunately, there is no model selection index that would be appropriate in all of the IRT model situations examined in this study. Instead, it appears that the choice of which model selection index to use depends on several elements, such as sample size, proportion of score points attributable to dichotomously and polytomously scored items, and total score points.
All of the model selection indices examined were able to accurately select the less parameterized PC model in the 0/100 baseline test conditions, the less parameterized 1PL model in the 100/0 baseline test conditions, and the 1PL/PC model in the mixed-format (40/60, ≈50/50, and 60/40) test conditions. In contrast, the performance of the model selection indices was less consistent when discerning the more parameterized GPC model in the 0/100 baseline test conditions, the 2PL or 3PL models in the 100/0 baseline test conditions, and the 2PL/GPC and 3PL/GPC models in the mixed-format (40/60, ≈50/50, and 60/40) test conditions. Overall, it is fairly evident that the BIC and CAIC did not provide dependable results with respect to correctly selecting the more parameterized models in a majority of the conditions examined. They were typically inclined to select the less parameterized models (e.g., 1PL/PC). The remaining criteria (LRT, AIC, AICC, and HQIC) demonstrated more promise; still, each performed best under different conditions. For instance, the LRT, AIC, and AICC were accurate in selecting the GPC model in the 0/100 baseline test conditions, whereas these criteria did poorly in selecting the 3PL model in the 100/0 baseline test conditions. The AIC and AICC were the most accurate criteria in selecting the 2PL model in the 100/0 baseline test conditions.
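The inclination of the BIC and CAIC toward less parameterized models is consistent with how heavily each criterion penalizes additional parameters. The sketch below assumes the conventional textbook forms of the five information criteria, each written as −2 ln L plus a penalty; the exact forms used in this study are defined outside this section and may differ. The parameter counts (40 vs. 60) are hypothetical and serve only as an illustration. For two nested models, a criterion prefers the larger model only when the drop in −2 ln L exceeds the difference in penalties.

```python
import math


def penalty(criterion: str, p: int, n: int) -> float:
    """Penalty added to -2 ln L for a model with p parameters and n examinees.

    Conventional forms are assumed; they are not taken from this section.
    """
    if criterion == "AIC":
        return 2.0 * p
    if criterion == "AICC":
        # Equivalent to 2p + 2p(p + 1) / (n - p - 1).
        return 2.0 * p * n / (n - p - 1)
    if criterion == "BIC":
        return p * math.log(n)
    if criterion == "CAIC":
        return p * (math.log(n) + 1.0)
    if criterion == "HQIC":
        return 2.0 * p * math.log(math.log(n))
    raise ValueError(f"unknown criterion: {criterion}")


def required_deviance_drop(criterion: str, p_small: int, p_large: int, n: int) -> float:
    """Minimum decrease in -2 ln L needed for the larger model to be preferred."""
    return penalty(criterion, p_large, n) - penalty(criterion, p_small, n)


# Hypothetical parameter counts: 40 for the smaller model, 60 for the larger.
for n in (500, 1000):
    for name in ("AIC", "AICC", "HQIC", "BIC", "CAIC"):
        drop = required_deviance_drop(name, 40, 60, n)
        print(f"N = {n:>4}  {name:<4}  required drop in -2 ln L: {drop:7.2f}")
```

Under these assumptions, the required improvement in fit is smallest for the AIC and largest for the CAIC at both sample sizes, so the BIC and CAIC need a much larger gain in likelihood before they favor a model such as the 3PL/GPC.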
Mixed formats and model parameterization introduced complexity into the model selection process. For instance, the LRT, AIC, AICC, and HQIC correctly selected the 2PL/GPC mixed-format IRT model with high proficiency when the majority of score points were attributable to the polytomously scored items. When the score points were attributed to approximately equal dichotomously and polytomously scored items or when the majority of score points were attributed to dichotomously scored items, the performance of the LRT was adversely affected by sample size when the total score points were high (50), at which point it began to incorrectly select the 3PL/GPC model. However, when the data were generated using the 3PL/GPC model in the mixed-format conditions, regardless of the proportion of dichotomously and polytomously scored items, the performance of the LRT, AIC, AICC, and HQIC became deficient. Although the accuracy of these model selection criteria was generally poor across these 12 conditions, the LRT performed best, correctly selecting the 3PL/GPC model with 80% or greater accuracy in half of the conditions, mostly in the large sample size scenarios (N = 1,000). This is not unexpected given that statistical significance tests are sensitive to sample size: with a large sample, the test has enough power for small differences in fit to reach statistical significance. The AIC followed at a distance, correctly selecting the 3PL/GPC model with 70% or greater accuracy in a third of these conditions, most often in the 60/40 mixed-format test conditions. This is interesting given that the AIC tends to select more parameterized models, yet failed to do so with high proficiency in the current study.
Overall, the model selection indices were able to accurately distinguish between the models when the correct model was a less parameterized PC, 1PL, or 1PL/PC model. The distinction became more uncertain when the correct model was the most parameterized GPC, 3PL, or 3PL/GPC model. In particular, the majority of the model selection criteria examined were unable to effectively differentiate between the 2PL and 3PL models or between the 2PL/GPC and 3PL/GPC models when the correct model was the 3PL model or the 3PL/GPC model, respectively. Kang and Cohen (2007) also found that the model selection criteria examined in their study had difficulties correctly selecting the 3PL model as compared with the less parameterized 2PL and 1PL models. They attributed this to potential problems when estimating the guessing parameters. Accordingly, a random selection of 3PL/GPC model parameter estimates from conditions in which the 3PL/GPC model was not correctly selected was inspected. It was observed that there were problems with the estimation of some of the 3PL difficulty and guessing parameters for a particular item. For instance, the difficulty parameter estimate and its associated standard error for an item in one replication were observed to be as large as −4.70 and 10.08, respectively. The guessing parameter estimate and its associated standard error for the same item were 0.00 and 7.67, respectively (see Note 4).
Another potential explanation for the poor performance of the information-based criteria is that the NAEP exam, from which generating parameter estimates were taken, is a low-stakes test, which may reduce the motivation of examinees. Low-stakes tests are typically associated with low effort among examinees, resulting not only in underestimates of examinees’ ability but also in biased item parameters (Wise, 2006). Accordingly, model-data fit may have been compromised, yielding erroneous model comparison and selection outcomes.
Future research should examine a combined approach that pairs item-fit statistics (e.g., Chon et al., 2010) with model selection criteria for global IRT model comparison, which may result in the deletion of poor-performing items. This could allow for more appropriate model and item analysis/selection. Future research could also incorporate other model selection criteria, such as the CVLL and the DIC, which have shown some promise (Kang & Cohen, 2007).
In conclusion, model selection methods may be used to aid in the model selection process. The comparison of the model selection criteria examined in this study indicated that none perform with 100% accuracy in all possible scenarios. While the LRT, AIC, AICC, and HQIC demonstrated potential, the sole reliance on model selection criteria is not recommended. Furthermore, the plausibility of potential models also depends on the appropriateness of parameter estimates and their associated standard errors. It is hoped that this study provides applied researchers with more information about the use of model selection methods in IRT for mixed-format tests under various conditions.
Footnotes
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
The author(s) received no financial support for the research, authorship, and/or publication of this article.
