Abstract
Using the bifactor item response theory model to analyze data arising from educational and psychological studies has gained popularity over the years. Unfortunately, using this model in practice comes with challenges. One such challenge is an empirical identification issue that is seldom discussed in the literature, and its impact on the estimates of the bifactor model’s parameters has not been demonstrated. This issue occurs when an item’s discriminations on the general and specific dimensions are approximately equal (i.e., the within-item discriminations are similar in strength), leading to difficulties in obtaining unique estimates for those discriminations. We conducted three simulation studies to demonstrate that within-item discriminations being similar in strength creates problems in estimation stability. The results suggest that a large sample could alleviate but not resolve the problems, at least when considering sample sizes up to 4,000. When the discriminations within items were made clearly different, the estimates of these discriminations were more consistent across the data replicates than those observed when the discriminations within the items were similar. The results also show that the similarity of an item’s discriminatory magnitudes on different dimensions has direct implications for the sample size needed to consistently obtain accurate parameter estimates. Although our goal was to provide evidence of the empirical identification issue, the study further reveals that the extent of similarity of within-item discriminations, the magnitude of discriminations, and how well the items are targeted to the respondents also play a role in the estimation of the bifactor model’s parameters.
Educational and psychological instruments are often designed to measure latent traits having a bifactor structure (Holzinger & Swineford, 1937; Reise, 2012). That is, the instruments measure one general (or primary) dimension that is usually of substantive interest and one or more specific (or secondary) dimensions that are of less interest or altogether irrelevant to what the primary dimension represents. One example of an instrument eliciting a bifactor structure in item response data is the Rosenberg Self-Esteem Scale (RSES; Rosenberg, 1965). The RSES is designed to measure global self-worth (i.e., the general dimension) with five positively and five negatively phrased items. Including items written in different polarities could induce a wording method effect (Marsh et al., 2010; Michaelides et al., 2016), leading to additional dimensions being represented in data. RSES data, then, should reflect specific dimensions representing how similarly respondents treat positively and negatively phrased items along with the general dimension of global self-worth.
The bifactor item response theory (IRT) model (Gibbons et al., 2007) is ideal for confirming whether data like RSES data represent a bifactor structure. Such a model, then, can provide evidence for the internal structural aspect of validity as outlined in Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014). Additionally, the bifactor model can determine the extent to which the specific dimensions are a part of the response process, indicating how much of the item responses represent the general and specific dimensions. Such information could be useful in understanding to what degree the extraneous aspects of the measurement processes—such as positive and negative wording—are represented in the item responses.
Unfortunately, using the bifactor IRT model to analyze data comes with challenges compared with using unidimensional IRT models. The bifactor model is a multidimensional model and thus requires a larger sample size than unidimensional IRT models (De Ayala, 1994; Jiang et al., 2016). In addition to the sample size needs, a less frequently discussed issue exists with the bifactor IRT model. That is, an empirical identification issue arises when an item’s discriminatory magnitudes on the general and specific dimensions are similar (i.e., the within-item discriminations are similar in strength; Stone & Zhu, 2015). This issue could lead to unstable parameter estimates, which then could affect the quality of evidence the results provide for the internal structural aspect of validity and could mislead researchers about how strongly specific dimensions are represented in data.
Of these challenges, the sample size issue is straightforward to address because researchers can control the sample size. The possible empirical identification issue is harder to deal with because, unless the instrument of interest has been widely used, researchers cannot know whether the within-item discriminations are similar in strength before they have analyzed the data.
To our knowledge, there is no empirical evidence demonstrating this identification issue or how much this issue impacts the model’s parameter estimates. Thus, we performed simulations to provide evidence of such empirical non-identification and its impact on parameter estimates across a range of sample sizes. To get a better understanding of the conditions in which this potential issue arises, we provide the technical details of the bifactor IRT model next.
The Bifactor IRT Model
The bifactor IRT model we use for our discussion is based on the graded response model (GRM; Samejima, 1969). In the presentation of this model and for general discussion, we use the following notation. Let d represent a dimension (where d = 1, 2, …, D, with D representing the total number of dimensions). Let i represent an individual (where i = 1, 2, …, N, with N representing the total number of individuals). Let j represent an item (where j = 1, 2, …, J, with J representing the total number of items). Let k represent a category score (where k = 0, 1, 2, …, m, with m representing the highest score category).
Under a bifactor GRM, the conditional probability of individual i endorsing category k for item j is given as follows:
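As a sketch, assuming a logistic link and a slope-intercept parameterization, a bifactor GRM consistent with the notation above can be written as shown next. The symbols Y (the observed response), θ (the dimensional positions), δ (the category intercepts), and η (the systematic component) are assumptions introduced for illustration, and the sign convention for the intercepts may differ from the one used in the analyses.

```latex
\begin{align}
P(Y_{ij} = k \mid \boldsymbol{\theta}_i) &= P^{*}_{ijk} - P^{*}_{ij(k+1)}, \\
P^{*}_{ijk} = P(Y_{ij} \ge k \mid \boldsymbol{\theta}_i) &= \frac{\exp(\eta_{ijk})}{1 + \exp(\eta_{ijk})}, \\
\eta_{ijk} &= \sum_{d=1}^{D} \alpha_{j,d}\,\theta_{i,d} + \delta_{j,k},
\end{align}
% Boundary conventions: P^{*}_{ij0} = 1 and P^{*}_{ij(m+1)} = 0.
% Bifactor restriction: for each item j, \alpha_{j,d} is nonzero only for the
% general dimension (d = 1) and for at most one specific dimension.
```

Under this form, an item that discriminates on the general dimension and one specific dimension contributes exactly two slopes to the systematic component, which are the two within-item discriminations discussed in the remainder of the article.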
Regarding the specifics of the parameters that make up the systematic component,
An Empirical Identification Issue With the Bifactor Model
Of interest for this article is the empirical identification issue Stone and Zhu (2015) alluded to, an issue that occurs when an item’s discriminations on the general and specific dimensions are approximately equal. For instance, in a bifactor model, suppose Item 1 discriminates on the general dimension (i.e., Dimension 1) and the first specific dimension (i.e., Dimension 2). When this item’s two discriminations are approximately equal (i.e., α1,1 ≈ α1,2), an empirical non-identification could arise, although how similar the two discriminations must be is unclear.
The reason the within-item discriminations being similar in strength could create estimation problems is that “in the bifactor model, where multiple slope parameters are estimated for each item, the likelihood surface may have multiple equivalent modes when the slope parameters are similar in size” (Stone & Zhu, 2015, p. 165). Although they noted such an issue, Stone and Zhu (2015) did not provide empirical evidence of it. Our study fills this void. The issue we investigated for this article applies only to within-item discriminations and not to between-item discriminations. That is, how similar two different items’ discriminations are on the same dimension is not an issue, given that (in general) at least three items discriminate on a dimension.
The Aim of the Article
Currently, there is a lack of studies demonstrating the empirical identification issue related to when the discriminations within items have similar values. This article fills this void in that we report on the simulations we conducted to provide empirical evidence of this identification issue and show its impact on parameter estimates, with the simulation conditions motivated by real RSES data. The article, then, is relevant to researchers because it provides evidence of the empirical identification issue arising from within-item discriminations being similar, which in turn alerts researchers as to when they should be cautious in interpreting their findings from a bifactor analysis.
As for the organization of the remainder of this article, in the next three sections, we provide details of the simulation studies that were conducted to investigate the previously noted identification issue with the bifactor model. The first of these studies was motivated by RSES data. The second study is a further investigation into whether the estimation difficulties observed in the first simulation study resulted from the discriminations within items being similar in strength. The third simulation study examined whether the estimation difficulties observed in the first simulation study could be resolved when there were no problematic items (i.e., each item, if it discriminated on different dimensions, had distinctly different discriminations on the general and specific dimensions). The article ends with a discussion and concluding remarks.
Study 1
We conducted a simulation study to determine whether the discriminations within items being approximately equal created estimation problems and whether this issue persisted across a range of sample sizes (N = 500, 1,000, 2,000, and 4,000). The bifactor structure from which we generated data was motivated by real RSES data. Recall that the RSES consists of five positively and five negatively worded items. A complete bifactor structure for RSES data would consist of all items discriminating on the primary dimension (Dimension 1), the positively phrased items additionally discriminating on the dimension representing the positive aspect of the wording method effect (Dimension 2), and the negatively phrased items additionally discriminating on the dimension representing the negative aspect of the wording method effect (Dimension 3). However, when we analyzed RSES data with an IRT model based on a complete bifactor structure to obtain our data generation values, two of the positively worded items had small negative discrimination estimates on the positive-phrasing dimension, a finding consistent with other studies (e.g., Alessandri et al., 2015; Donnellan et al., 2016; Hyland et al., 2014; Salerno et al., 2017). Reise et al. (2016) suggested that these item discriminations should be set to 0, and we followed suit. Specifying an incomplete bifactor structure does not impact our goals because if the empirical identification issue exists, it should appear regardless of whether a complete or incomplete bifactor structure is specified.
Data Generation
One hundred data sets were generated for each sample size condition, with each data set resembling 4-point ratings to 10 items (similar to RSES data). The latent trait dimensional positions (
Values Used for the Item Discriminations and Category Intercepts to Generate the Data.
Note. An empty space indicates a value of 0. The category intercepts and the items’ discriminations on the primary dimension remained the same for all simulation studies.
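The data generation step can be sketched in Python as follows. This is a minimal illustration, not the procedure used for the study: it assumes a logistic graded response model with orthogonal standard normal dimensions, and every generating value below is hypothetical except Item 10's discriminations (3.57 and 2.47), which are the values reported for Study 1. The function name `simulate_bifactor_grm`, the loading pattern (Items 1 to 3 on Dimension 2, Items 6 to 10 on Dimension 3), and the intercepts are assumptions made for the example.

```python
import numpy as np

def simulate_bifactor_grm(alpha, delta, n, rng):
    """Simulate graded responses under a logistic bifactor GRM.

    alpha: (J, D) discriminations; delta: (J, m) category intercepts,
    decreasing within each item so the cumulative probabilities are ordered.
    Returns an (n, J) integer array of category scores 0..m.
    """
    n_items, n_dims = alpha.shape
    theta = rng.standard_normal((n, n_dims))          # orthogonal N(0, 1) positions
    eta = (theta @ alpha.T)[:, :, None] + delta[None, :, :]
    p_star = 1.0 / (1.0 + np.exp(-eta))               # P(Y >= k) for k = 1..m
    u = rng.random((n, n_items, 1))
    return (u < p_star).sum(axis=2)                   # number of cumulative probs above u

# Hypothetical generating values (NOT the values in Table 1), except Item 10.
alpha = np.zeros((10, 3))
alpha[:, 0] = [2.0, 2.2, 1.5, 1.8, 1.6, 1.7, 1.4, 1.9, 2.5, 3.57]  # general dimension
alpha[:3, 1] = [1.0, 1.8, 0.7]                  # assumed pattern for Dimension 2
alpha[5:, 2] = [0.6, 0.5, 0.7, 1.9, 2.47]       # assumed pattern for Dimension 3
delta = np.tile([3.0, 1.0, -1.0], (10, 1))      # hypothetical intercepts, 4 categories

rng = np.random.default_rng(2024)
y = simulate_bifactor_grm(alpha, delta, n=500, rng=rng)  # one replicate, N = 500
```

Repeating the last line 100 times per sample size, with a fresh draw each time, would mirror the replicate structure described above.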
Analytic Strategy
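We used the Mplus software (Muthén & Muthén, 2016) to perform all analyses. The bifactor model we fitted was that expressed in Equations (1) to (3), and the parameters of the model were estimated using full-information maximum likelihood (FIML), using 20 quadrature points and assuming that the dimensional positions were multivariate normally distributed, that is, $\boldsymbol{\theta}_i \sim \mathrm{MVN}(\mathbf{0}, \mathbf{I})$, where $\boldsymbol{\theta}_i$ denotes individual i’s vector of dimensional positions, $\mathbf{0}$ is a mean vector of 0s, and $\mathbf{I}$ is the identity matrix.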
The mean vector of 0s and the variances being set to 1 (as the identity matrix conveys) established the location and metric of the underlying scale, respectively. The identity matrix also indicates that all dimensions were set orthogonal to each other, as conventionally done with the bifactor model.
To determine the impact of within-item discriminations being similar on the estimation process, we examined the convergence rates and the recovery of the item discriminations with respect to estimation accuracy in each data replicate. For the convergence criterion, the default in Mplus was used. That is, the estimation process stopped when the increment of improvement in the minimization function reached 1E−6.
Regarding the estimation accuracy, we calculated the errors of the parameter estimates, with each error representing the difference between the value for item j’s discrimination on dimension d that was estimated during the analysis of the rth data replicate and the corresponding data generation value.
We used boxplots to provide a visualization of the errors across the data replicates. The lower and upper limits of the boxplots (i.e., the whiskers) were at most Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, respectively, where Q1 and Q3 are the first and third quartiles, respectively, and IQR is the interquartile range (i.e., IQR = Q3 − Q1).
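The whisker rule just described can be sketched as below. This is an illustration under assumptions: the helper name `boxplot_limits` is hypothetical, quartiles use NumPy's default linear interpolation (plotting software may use a slightly different quartile definition), and the whiskers are taken as the most extreme observed errors that fall inside the fences.

```python
import numpy as np

def boxplot_limits(errors):
    """Return quartiles, IQR, whisker limits, and outliers for a set of errors."""
    errors = np.asarray(errors, dtype=float)
    q1, q3 = np.percentile(errors, [25, 75])
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = errors[(errors >= low_fence) & (errors <= high_fence)]
    outliers = errors[(errors < low_fence) | (errors > high_fence)]
    return {
        "q1": q1, "q3": q3, "iqr": iqr,
        "lower": inside.min(),   # lower whisker: smallest error within the fences
        "upper": inside.max(),   # upper whisker: largest error within the fences
        "outliers": outliers,
    }

# Made-up errors for illustration; the value 2.0 falls beyond the upper fence.
limits = boxplot_limits([-0.2, -0.1, 0.0, 0.1, 0.2, 2.0])
full_range = limits["upper"] - limits["lower"]
```

In this toy input the whiskers stop at −0.2 and 0.2, and the single large error is flagged as an outlier rather than stretching the upper limit.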
Item discriminations suffering from estimation issues could be reflected in the boxplots in one of two ways. One way is for the median of the errors (i.e., the median bias) to be noticeably greater or less than 0. The other way is for the problematic parameters’ corresponding IQRs and full ranges (upper limit minus lower limit) to be larger, and for their errors to have more outliers, than those for the parameters estimated without any issues, regardless of whether the medians are 0. Even if the medians are 0, large IQRs and full ranges of the errors across the data replicates would indicate that severe over- or underestimation existed, representing inconsistency in the estimation of those parameters across the data replicates.
The full range and IQR of a distribution of errors demonstrate the estimation stability across the data replicates, similar information to the Monte Carlo standard deviation (MCSD), although the latter provides a formal index. Thus, for space reasons, we report the MCSD in an Online Supplemental document. This document also includes the mean absolute difference (MAD) for those who are interested in the magnitude of the estimation inaccuracy irrespective of whether the estimates were overestimated or underestimated, as noticeable estimation error could occur in both directions across the data replicates, leading to an average of the errors (i.e., the mean bias) close to 0 even when the MAD is large.
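The indices mentioned above can be sketched as follows. The helper name `recovery_summary` is hypothetical, the MCSD is taken here as the sample standard deviation (ddof = 1) of the estimates across replicates, and the input estimates are made up to show how a mean bias near 0 can coexist with a large MAD.

```python
import numpy as np

def recovery_summary(estimates, true_value):
    """Summarize recovery of one parameter across data replicates."""
    est = np.asarray(estimates, dtype=float)
    errors = est - true_value                  # estimate minus generating value
    return {
        "mean_bias": errors.mean(),
        "median_bias": np.median(errors),
        "mcsd": est.std(ddof=1),               # Monte Carlo SD (assumed ddof = 1)
        "mad": np.abs(errors).mean(),          # mean absolute difference
    }

# Made-up estimates scattered symmetrically around a generating value of 2.47:
# over- and underestimation cancel in the mean bias but not in the MAD.
true_value = 2.47
summary = recovery_summary([1.47, 3.47, 1.97, 2.97], true_value)
```

Here the mean bias is essentially 0 while the MAD is 0.75, which is the situation the MAD is meant to expose.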
Results
Convergence Rates
The convergence rates were perfect (CR = 1.00) for all sample sizes. In other words, Mplus did not report any errors during the analysis of any of the data replicates. However, as we review next, estimation instability was present in that the estimates were not consistently accurate across the data replicates for some items.
Parameter Recovery
The distribution of the errors related to the item discrimination estimates is summarized in Figure 1. In each plot, the item discrimination parameters are represented along the x-axis (where αj,d is item j’s discrimination on dimension d), and the y-axis represents error. If estimation difficulty does not exist, then the errors related to each item should be distributed similarly and thus the boxplots (i.e., medians, IQRs, and full ranges) should be similar across the parameters. Unfortunately, that is not the case.
Figure 1. Boxplots summarizing the errors of the item discrimination parameter estimates by sample size for Study 1.
In general, more noticeable outliers were observed for the discriminations related to Items 1, 2, 9, and 10, whose within-item discriminations were more similar in magnitude than those for the other items. Some outliers were also present for the other items, but they were near the upper limits of the boxplots for those items. The full range and IQR further showed the inconsistency in the accuracy of the estimates for the parameters related to Items 1, 2, 9, and 10 across the data replicates. For instance, when N = 500, even though the median biases across the parameters were all close to 0, the ranges of the errors across the data replicates for Items 1, 2, 9, and 10 were noticeably greater than the ranges for the other items. Among the discriminations for Items 1, 2, 9, and 10, the largest range of errors was 2.96, with a lower and an upper limit (LUL) of (−0.86, 2.10), which was for α2,1, and the smallest was 1.75, with LUL (−0.77, 0.98), which was for α1,2. In contrast, for the other items (i.e., the items other than Items 1, 2, 9, and 10), the largest range of errors was 1.35, with LUL (−0.65, 0.70), and the smallest was 0.49, with LUL (−0.21, 0.28), which were for α5,1 and α8,1, respectively. A similar pattern appeared for the IQR. That is, the IQRs for Items 1, 2, 9, and 10 were greater than those for the other items.
The trends in the errors observed when N = 500 were present in the larger sample sizes, although the estimation instability (with respect to full range and IQR) decreased as the sample size increased. In other words, the difference in the ranges and IQRs for Items 1, 2, 9, and 10 and for the other items became less pronounced, although it was still slightly noticeable at N = 4,000. The decrease in the difference between the two sets of items as the sample size increased demonstrates the empirical nature of the identification issue from within-item discriminations being similar in strength. As the sample size increased, more information was contained in the data, leading to greater estimation stability for the items whose within-item discriminations were similar.
Summary of Study 1
This simulation study demonstrated that within-item discriminations being similar in magnitude may lead to estimation issues for the bifactor model, as indicated by the greater range and IQR with respect to the errors in the discrimination estimates for Items 1, 2, 9, and 10 than those for the other items. Although having a large sample size could mitigate the effects of within-item discriminations being similar, it did not resolve the issue—at least it was not completely resolved at N = 4,000.
Another pattern that emerged from this study is that when the items had similar strength of discriminations on the primary dimension, the items having stronger discriminations on their respective secondary dimension displayed more severe estimation issues (i.e., items with ratios of discriminations on the primary dimension to discriminations on the secondary dimension that were closer to 1 displayed more problems than items with larger ratios), especially for the smallest sample size. Such a pattern explains, in part, the reason the estimates related to Item 1 were the least problematic among the four questionable items we identified, as the ratio for this item was the furthest from 1 among the flagged items.
Even though we focused on the four items whose within-item discriminations were similar and that showed estimation problems, the ratio for Item 3’s within-item discriminations was approximately 2, similar to Item 1’s ratio. However, Item 3 did not display any estimation difficulties. A possible reason could be that Item 1’s discriminations were greater than Item 3’s discriminations. This suggests that whether the within-item discriminations being similar in strength leads to estimation issues could depend on the magnitude of the item discriminations. Brief follow-up simulations support this notion (details available from the first author). We did not conduct a comprehensive follow-up analysis on the required magnitude for the discriminations because our goal for this study was to provide evidence of whether within-item discriminations being similar creates an empirical identification issue, and we conducted other, more comprehensive simulations related to this goal, which we report next. Specifically, the simulations we report next provide evidence that the patterns regarding the estimation inconsistency in this first simulation study were in large part because the within-item discriminations were similar in magnitude for Items 1, 2, 9, and 10.
Study 2
The design of our second simulation study was similar to that of Study 1 in all ways except that one of the problematic items in Study 1 was fixed so that the item’s discriminations on different dimensions were distinct, thereby no longer creating a problem for this item, in theory. This fixing entailed keeping Item 10’s discrimination on the primary dimension at 3.57 as in Study 1 (α10,1 = 3.57) while setting the item’s discrimination on the specific dimension to 1.13 (α10,3 = 1.13, where this parameter’s value was 2.47 in Study 1). The value of 1.13 used for Study 2 was obtained by randomly drawing a value from a uniform distribution over the closed interval [1, 1.5], or U[1, 1.5]. This range ensured that the value 2.47 was excluded and that the resulting value would be more distinctly different from the item’s discrimination on the primary dimension. All other values (i.e., the values for the other item parameters and latent trait dimensional positions) used in the previous study were used in this study. The data generation values for the item parameters used in this simulation study are also in Table 1.
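The effect of this change on Item 10’s ratio of primary to secondary discrimination can be checked with a few lines of Python. The discrimination values are those reported above; the variable names, the helper `discrimination_ratio`, and the seed are assumptions for illustration, and the random draw only shows how a value from U[1, 1.5] could be obtained (it will not reproduce 1.13).

```python
import random

def discrimination_ratio(primary, secondary):
    """Ratio of an item's primary to secondary discrimination."""
    return primary / secondary

alpha_10_1 = 3.57            # Item 10, general dimension (Studies 1 and 2)
alpha_10_3_study1 = 2.47     # Item 10, specific dimension in Study 1
alpha_10_3_study2 = 1.13     # Item 10, specific dimension in Study 2

ratio_study1 = discrimination_ratio(alpha_10_1, alpha_10_3_study1)
ratio_study2 = discrimination_ratio(alpha_10_1, alpha_10_3_study2)

# Illustrative draw from U[1, 1.5]; the seed is arbitrary.
draw = random.Random(0).uniform(1.0, 1.5)
```

The ratio moves from about 1.45 in Study 1 to about 3.16 in Study 2, that is, further from 1, which is the direction associated with fewer estimation problems in the summary of Study 1.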
The reason for this simulation study is that if within-item discriminations being similar in strength created the observed estimation issues for Items 1, 2, 9, and 10 in the previous study, then fixing one of these problematic items’ discriminations to be distinctly different should result in that item no longer having estimation problems, while the other items continue to have estimation issues. Such a pattern would provide additional evidence that within-item discriminations being similar creates estimation problems, as Stone and Zhu (2015) have noted. If within-item discriminations being similar does not create estimation problems, then making Item 10’s discriminations distinctly different should still lead to estimation problems for this item.
We chose Item 10 because, among the items that discriminated on two dimensions, it was the most discriminating item on the general dimension. However, the conclusions from this study are the same as when one of the other problematic items was fixed. The same analytic strategies used in the previous simulation study were used in this simulation study.
Results
Convergence Rates
Recall that the convergence rates represent the proportion of runs that terminated without any error messages from Mplus under the software’s default convergence settings. In this simulation study, the convergence rates were similar to those in Study 1 in that they were all high (at least .99). However, these high rates of convergence, again, are misleading because the estimates for a set of items were extremely inaccurate in some of the convergent data replicates, a pattern also observed in Study 1. The recovery results are provided next to show how making Item 10’s discriminations distinctly different from each other affected the estimation of these parameters.
Parameter Recovery
The errors related to the item discrimination estimates are summarized in Figure 2. Overall, the estimation of the discrimination parameters for Item 10 improved noticeably after differentiating Item 10’s data generation discrimination values. For instance, when N = 500, the errors for α10,1 (i.e., Item 10’s discrimination on Dimension 1) had a lower and an upper limit (LUL) of (−0.60, 0.85) and an interquartile interval (IQ) of (−0.21, 0.22), and α10,3 had LUL (−0.46, 0.55) and IQ (−0.17, 0.12), whereas in Study 1, α10,1 had LUL (−1.06, 1.83) and IQ (−0.44, 0.47), and α10,3 had LUL (−1.01, 1.67) and IQ (−0.41, 0.42).
Figure 2. Boxplots summarizing the errors of the item discrimination parameter estimates by sample size for Study 2.
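The reduction in spread for Item 10 at N = 500 can be made explicit by converting the reported limits into widths (upper limit minus lower limit). The sketch below uses only the limits reported above; the dictionary layout and the helper name `width` are assumptions for illustration.

```python
def width(limits):
    """Width of an interval given as (lower, upper)."""
    lower, upper = limits
    return round(upper - lower, 2)

# Reported limits for Item 10 at N = 500: (LUL, IQ) per study and parameter.
reported = {
    ("Study 1", "alpha_10_1"): ((-1.06, 1.83), (-0.44, 0.47)),
    ("Study 1", "alpha_10_3"): ((-1.01, 1.67), (-0.41, 0.42)),
    ("Study 2", "alpha_10_1"): ((-0.60, 0.85), (-0.21, 0.22)),
    ("Study 2", "alpha_10_3"): ((-0.46, 0.55), (-0.17, 0.12)),
}

# Full range and IQR implied by the reported limits.
widths = {key: (width(lul), width(iq)) for key, (lul, iq) in reported.items()}
```

For α10,1 the full range drops from 2.89 to 1.45 and the IQR from 0.91 to 0.43; for α10,3 the full range drops from 2.68 to 1.01 and the IQR from 0.83 to 0.29.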
Regarding the other problematic items, the errors in the parameter estimates related to Item 2 had smaller ranges and IQRs than in Study 1, most notably for N = 500. However, the ranges and IQRs were still larger than those for the non-problematic items, indicating that Item 2 still suffered from estimation difficulties as expected. The errors for Item 9 had greater ranges and IQRs than in Study 1, continuing to indicate that Item 9 had estimation problems. The pattern of the errors for Item 1 remained similar to that observed in Study 1.
Summary of Study 2
Study 2 demonstrated that when the discriminations within an item are clearly differentiated, those discriminations can be estimated more consistently. This study also demonstrated that the estimation difficulties for Item 10 observed in Study 1 are mainly because the item’s discriminations on the general and specific dimensions are similar in magnitude. We say mainly because, in each sample size condition, the full ranges and IQRs of the errors for Item 10 were never as small as those of the non-problematic items, even though the within-item discriminations were clearly distinct. The remaining estimation uncertainty could be because of the category intercepts used to generate the data. Research suggests that items that are not well targeted to the respondents (e.g., the higher categories having a greater representation in the data than the lower categories for the items) could lead to inaccurate estimates (Linacre, 2002; Xia & Yang, 2018). Nevertheless, the ranges and IQRs noticeably decreased for Item 10 when its within-item discriminations became distinctly different, suggesting that the majority of the estimation instability observed for this item in Study 1 was because this item’s within-item discriminations were similar in strength.
Although we only reported the results for when we fixed Item 10, we similarly fixed the other problematic items, such as making the discriminations within Item 1 distinctly different while Items 2, 9, and 10 remained problematic. Each time we fixed an item, the pattern of the errors for that item was similar to that reported for Item 10 in this simulation study. Next, we report a third simulation study that was conducted to investigate whether the estimation problems from within-item discriminations being similar could be resolved when there were no problematic items, thereby providing further evidence that within-item discriminations being similar is the main reason for the estimation problems observed in Studies 1 and 2.
Study 3
For this study, we fixed all the items found to be questionable in Study 1 (Items 1, 2, 9, and 10) such that the discriminations within these items were clearly different. More specifically, we set the data generation values for these items’ discriminations on the specific dimensions (i.e., α1,2, α2,2, α9,3, and α10,3) to values that differentiated them from their corresponding discriminations on the general dimension, values that were randomly drawn from U[1, 1.5]. The data generation values used for these simulations are also in Table 1.
The reason for this simulation study is that if the estimation difficulties observed in Studies 1 and 2 can be partly attributed to within-item discriminations being similar in magnitude, then after clearly differentiating the within-item discriminations for the problematic items, the estimation problem stemming from within-item discriminations being similar should be resolved and thus the boxplots (i.e., medians, IQRs, and full ranges) for the problematic items’ parameters should look closer to those for the other items’ parameters.
Results
Convergence Rates
Similar to the other two studies, the convergence rates were high (CR = 1.00) across all sample sizes. However, some instability still remained, as the following parameter recovery results verify.
Parameter Recovery
The errors related to the item discrimination estimates are summarized in Figure 3. Even though extreme estimates (i.e., outliers) still existed, the ranges and IQRs for the parameter estimates related to Items 1, 2, 9, and 10 became noticeably smaller than in Study 1, especially for N ≤ 1,000. For instance, when N = 1,000, the errors for α2,1 had a lower and an upper limit (LUL) of (−0.43, 0.76) and an IQ of (−0.18, 0.19), whereas in Study 1, α2,1 had LUL (−0.62, 1.14) and IQ (−0.24, 0.31). A similar pattern for the LULs and IQs appeared for the other parameters related to the problematic items.
Figure 3. Boxplots summarizing the errors of the item discrimination parameter estimates by sample size for Study 3.
As the sample size increased, the estimates of the parameters related to Items 1, 2, 9, and 10 became more stable. When N ≥ 2,000, the estimation uncertainty in these estimates became minimal and matched that in the estimates related to the other items.
Summary of Study 3
Study 3 showed that when the discriminations within Items 1, 2, 9, and 10 were clearly differentiated, their parameter estimates were more stable than when the discriminations within the items were similar, especially for the larger sample sizes. For the smallest sample size (i.e., N = 500), there could be various factors in addition to the sample size itself that impact the estimation of the item discrimination parameters, such as how well the items were targeted to the respondents (Linacre, 2002; Xia & Yang, 2018).
We confirmed the role of how well the items were targeted in brief follow-up simulations. These simulations showed that when Item 2’s intercepts were set to be more accurately targeted to the respondents (while all other values remained the same as those used in Study 3), the errors related to Item 2’s discriminations decreased, revealing that how well the items were targeted to the respondents played a part in the estimation process, especially for N ≤ 1,000. We only conducted a small simulation to explain that the remaining estimation inconsistency observed in Study 3 was because of item targetedness, as others have already demonstrated the general role item targetedness plays in accurate estimation of the parameters of IRT models (Linacre, 2002; Xia & Yang, 2018).
Discussion
The bifactor IRT model is useful for confirming dimensional structures in which general and specific dimensions are represented in data, thereby providing evidence for the internal structural aspect of validity. Moreover, such a model can indicate the extent to which the data reflect extraneous aspects of the measurement process that lead to additional dependencies in the item responses; these aspects (e.g., positively and negatively worded items) are often irrelevant to the general trait of interest and thus should be assessed. Unfortunately, using the bifactor model in practice could be challenging. An empirical identification issue exists with the bifactor IRT model when the discriminations within items are similar in magnitude (Stone & Zhu, 2015), although this issue is seldom discussed and, to our knowledge, the effect of this issue on item discrimination estimates has not been demonstrated. Our simulations provide evidence of this empirical identification issue—at least within the conditions we investigated, conditions that were based on RSES data. The evidence was presented through the full range and IQR of the errors. Some of the ranges and IQRs were noticeably large, indicating the inconsistency in the accuracy of the parameter estimates, representing an estimation issue regardless of whether the medians of the errors were 0.
As demonstrated in Study 1, when the items’ discriminations on the general and specific dimensions were similar in magnitude, the parameter estimates had noticeably greater fluctuation across the data replicates compared with the items that had distinctly different discriminations. The additional simulation studies verified that the estimation problems observed in Study 1 were mainly because the discriminations within the problematic items were similar in strength. The additional studies also showed that, after making the discriminations within the problematic items identified in Study 1 distinctly different, the estimation of the parameters corresponding to the corrected items became more stable. For instance, in Study 2, we clearly differentiated the discriminations for one of the problematic items in Study 1 (i.e., Item 10), and the estimates for this item’s discriminations had much less fluctuation across the data replicates than the other three problematic items’ discriminations. In Study 3, the discriminations within all four flagged items were made distinctly different, and the estimation of the parameters for these items improved. If the estimation issues observed for the four items in Study 1 were not, at least to some degree, because the within-item discriminations were similar, then making the discriminations within these items distinctly different should not have led to any improvements in the estimation of these parameters regardless of the sample size.
Although our main focus for this study was to provide evidence of the estimation issue arising from having similar discriminations within items, other aspects of our study contribute to highlighting estimation issues related to the bifactor IRT model. Earlier, we noted that the estimation problems observed in the simulation studies were mainly because the discriminations within some of the items were similar. However, the simulations also revealed the role other factors play in the accuracy of the item discrimination estimates, with these factors including sample size, how well the items were targeted to the respondents, the magnitude of the item discriminations, and the extent of similarity between an item’s discriminations.
The role these other factors played in the estimation of the bifactor model appeared in Study 3. In that study, the within-item discriminations for all the items were distinctly different. Thus, any observed estimation problems had to be because of one or more of these other factors. As the results revealed, noticeably wide ranges in the distribution of errors were observed when N ≤ 1,000. Such estimation instability could be attributed to a mix of factors. As noted when we summarized the findings of Study 3, we performed additional simulations showing that making the items well-targeted to the respondents resulted in greater estimation stability, controlling for sample size. These simulations also support other studies (e.g., Linacre, 2002; Xia & Yang, 2018) that demonstrated the role item targetedness plays in the estimation process.
Another secondary finding of our study is that Mplus does not produce an error message to warn users performing bifactor modeling when the within-item discriminations create estimation problems. Recall that the convergence rates were perfect in Studies 1 and 3 and nearly perfect (i.e., at least .99) in Study 2. This secondary finding is critical because, in practice, researchers commonly assume that parameter estimates are appropriate to interpret when the estimation process ends without error messages. However, our simulations demonstrate that the parameter estimates could be extremely inaccurate when performing bifactor modeling, even without any error messages from Mplus.
Our three simulation studies demonstrated that the similarity of the discriminations within items has an effect on parameter estimation. The only difference among the three studies was whether the four problematic items identified in Study 1 continued to have similar discriminations within items in the other studies. By manipulating only this aspect of the item characteristics, the differences in the estimation uncertainty observed in Studies 1 and 2 relative to Study 3 could be attributed to how similar the discriminations within items were. We did not investigate how similar the within-item discriminations had to be before estimation problems arose because our focus was to provide evidence that this issue existed, although we recognize that understanding how similar the discriminations have to be is crucial and should be investigated.
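The kind of manipulation described above can be sketched as a small data-generating routine. This is an illustrative Python sketch under stated assumptions, not the generating code or parameter values used in the studies: it assumes a logistic graded response form in which P(Y ≥ k) = logistic(a_G θ_G + a_S θ_S − t_k), independent standard normal general and specific traits, and four response categories. The function names, the `pos`/`neg` labels, and all numeric values are hypothetical.

```python
import math
import random


def simulate_item(theta_g, theta_s, a_g, a_s, thresholds, rng):
    """Draw one graded response in 0..len(thresholds).

    Assumed model: P(Y >= k) = logistic(a_g*theta_g + a_s*theta_s - thresholds[k-1]),
    with thresholds listed in increasing order.
    """
    eta = a_g * theta_g + a_s * theta_s
    u = rng.random()
    # The cumulative probabilities decrease with k, so counting the thresholds
    # whose cumulative probability exceeds u yields the response category.
    return sum(1 for t in thresholds if u < 1.0 / (1.0 + math.exp(-(eta - t))))


def simulate_responses(n_persons, items, seed=0):
    """Simulate a persons-by-items response matrix under a bifactor structure."""
    rng = random.Random(seed)
    specifics = sorted({item["specific"] for item in items})
    data = []
    for _ in range(n_persons):
        theta_g = rng.gauss(0.0, 1.0)  # general trait
        # One specific trait per specific dimension, independent of the general trait.
        theta_s = {s: rng.gauss(0.0, 1.0) for s in specifics}
        data.append([
            simulate_item(theta_g, theta_s[it["specific"]],
                          it["a_g"], it["a_s"], it["thresholds"], rng)
            for it in items
        ])
    return data


# Hypothetical items: the first has similar within-item discriminations,
# the second has distinctly different ones.
items = [
    {"specific": "pos", "a_g": 1.2, "a_s": 1.1, "thresholds": [-1.0, 0.0, 1.0]},
    {"specific": "neg", "a_g": 2.0, "a_s": 0.5, "thresholds": [-1.0, 0.0, 1.0]},
]
data = simulate_responses(n_persons=1000, items=items, seed=1)
print(data[:3])
```

Changing only `a_g` and `a_s` for selected items while holding everything else fixed mirrors the logic of comparing conditions that differ in a single item characteristic; fitting the model to each replicate would require separate estimation software.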
Another possible future investigation could involve a more thorough assessment of the sample size requirement for the bifactor IRT model. Given our simulation conditions, our findings revealed that a sample size of 1,000 was necessary when the discriminations within items were distinctly different (i.e., Study 3), whereas sample sizes of 2,000 and larger resulted in mixed findings for the conditions in which the discriminations within some items were similar in strength (i.e., Studies 1 and 2). In Study 1, the distribution of the errors was fairly acceptable for sample sizes of 2,000 and larger (relative to the smaller sample sizes), whereas in Study 2, the spread of the errors was questionable even when the sample size was 2,000. The findings across the studies show that the similarity in the discriminations within items could be a factor in the sample size needed to estimate the parameters of a bifactor model. We are not aware of studies that focused on the sample size requirement for the bifactor IRT model based on the graded response model, let alone studies that also took into account the similarity in the within-item discriminations when determining the necessary sample size.
We noted a limitation of our study already—that we did not thoroughly explore how similar the discriminations within items had to be before estimation issues arose. Some other limitations of our study should also be acknowledged. For space reasons, we did not investigate how setting the problematic items’ discriminations to be even more similar would impact the parameter estimation for these items. We anticipate that increasing the similarity of the within-item discriminations will aggravate the estimation difficulty.
Another limitation of this study is that we only examined a single bifactor structure based on 10 items. Our dimensional structure was motivated by RSES data because the bifactor structure has been suggested as an appropriate structure for such data (e.g., Alessandri et al., 2015; Donnellan et al., 2016; Hyland et al., 2014). Using only one set of items representing a bifactor structure could limit the generalizability of our findings, although Stone and Zhu (2015) suggested this was a general issue. We wanted to provide evidence of the estimation issue Stone and Zhu (2015) raised in a setting that reflects real applications. Thus, rather than investigating a wider range of item-by-dimensional-structure conditions, we explored our set of items and dimensional structure in greater depth to confirm that our findings were, to a certain degree, attributable to within-item discriminations being similar in strength. Further research should be conducted to determine whether such an estimation issue exists with the bifactor IRT model beyond our conditions.
A final limitation we note is that we only provided evidence of this issue and did not offer a solution. Because of the growing popularity of bifactor modeling, being able to perform such an analysis without having to be concerned about the estimates is critical; in practice, one cannot know when the estimates are inaccurate because the true values are unknown, unlike in simulations. A solution may lie in a Bayesian framework with informative priors, which could be an avenue for a future study.
Notwithstanding these limitations, our study provides empirical evidence demonstrating the estimation challenges that arise with the bifactor model when the discriminations within items are similar in magnitude. We also showed that increasing the sample size could alleviate this estimation problem to an extent. Our findings, then, demonstrate the importance of considering how similar the discriminations within items are when using the bifactor IRT model to analyze data, as this issue has direct implications for the sample size required to obtain parameter estimates accurate enough to interpret.
Supplemental Material
Supplemental Material for An Empirical Identification Issue of the Bifactor Item Response Theory Model by Wenya Chen and Ken A. Fujimoto in Applied Psychological Measurement
Footnotes
Acknowledgments
The authors would like to thank Dr. John R. Donoghue, Dr. Brian Habing and two anonymous reviewers for their valuable comments on earlier versions of the article. All errors remain the responsibility of the authors.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
References
