Abstract
We used multigroup confirmatory factor analysis to evaluate the five-factor measurement model underlying the 50-item Irrational Beliefs Inventory (IBI) in samples of university students in the United States (n=827) and Iceland (n=720). Global model fit was marginally acceptable in each sample. Further analyses identified several sources of model misfit that included weak factor loadings, several item pairs with correlated errors, and items with loadings on more than one factor. Cronbach’s alpha reliability estimates for the five factors were similar for the U.S. and Icelandic samples, and comparable to those reported by the developers of the IBI. Measurement invariance testing supported configural (same form) and metric invariance (equal loadings), but identified only 20 items that had invariant item intercepts across the U.S. and Icelandic groups. Given the finding of partial measurement invariance, we offer caution when using the IBI to make group comparisons for U.S. and Icelandic samples. Recommendations are proposed for ongoing psychometric evaluations of the IBI that would identify strengths of the IBI and items that, if revised or deleted, may improve the quality of the measure for research and clinical purposes.
Introduction
Irrational beliefs are illogical and/or dogmatic beliefs (Ellis, 2003) that have been recognized as contributing factors in the development of psychological problems such as obsessions and compulsions, depression, anxiety, and social phobias (Bridges & Harnish, 2010). The connection between irrational beliefs and psychological disorders has led to the development of a number of measures of irrational beliefs that have been used to inform therapeutic treatments and research related to irrational beliefs. In two comprehensive reviews of measures of irrational beliefs, Terjesen et al. (2009) and Bridges and Harnish (2010) were critical of the limited amount of independent, psychometric support for the measures that were being used in research and clinical practice. Terjesen et al. concluded that “most measures of irrational beliefs do not provide the evidence needed to adequately address the Standards for Educational and Psychological Testing” (p. 83). Similarly, Bridges and Harnish noted that “the authors of the tests were frequently the sole source of reliability and validity evidence which varied considerably among published work” (p. 873). Both groups of researchers called for greater attention to the evaluation of the psychometric properties of irrational beliefs measures that were being used to inform important clinical and research decisions.
In response to this call, the present study was conducted to evaluate the psychometric properties of the Irrational Beliefs Inventory (IBI), a widely used 50-item self-report measure of irrational beliefs that was developed in the Netherlands by Koopmans et al. (1994). See Table 1 in supplemental file for studies using the IBI. The IBI was developed by pooling and factor analyzing items from two older measures, each developed in the United States: the 37-item, 11-factor Rational Beliefs Inventory (RBI; Shorkey & Whiteman, 1977) and the 100-item, 10-factor Irrational Beliefs Test (IBT; Jones, 1968). Psychometric evaluations of the RBI and IBT found them to have low reliabilities and construct validity, measuring general negative affect, rather than irrational beliefs. Over the course of three studies, Koopmans et al. (1994) reduced the IBI from 137 items to 50 using principal components analysis with varimax rotation and identified five factors in Studies 2 and 3: Worrying (12 items, α=.85 and .84), Rigidity (14 items, α=.81 and .71), Need for Approval (7 items, α=.75 and .80), Problem Avoidance (10 items, α=.70 and .73) and Emotional Irresponsibility (7 items, α=.72 and α=.72).
Table 1. Psychometric results for the IBI in university student samples from the U.S. (n=827) and Iceland (n=720), with increasingly stringent criteria for standardized factor loadings.
Note. Values in parentheses are Cronbach’s alphas reported by Koopmans et al. (1994) from Studies 2 and 3, respectively. Standardized loadings for the present study are from the five-factor correlated CFA model for the IBI analyzed separately for the U.S. and Icelandic samples. See supplemental file for performance of specific items as the increasingly stringent criteria were applied.
The IBI has been translated into multiple languages, used with both clinical and non-clinical samples in several countries (e.g., Australia, Kuwait, Pakistan), and continues to be used at the time of this writing (see Table 1 in supplemental file). Scores from the IBI have served as outcome variables in intervention studies, predictors/mediators of behaviors, and as part of the construct validation of other questionnaires (e.g., Bridges & Harnish, 2010). Despite the IBI’s continued use, there have been limited psychometric analyses of the measure outside of those conducted by the original developers of the IBI. The factor structure of the IBI has not been evaluated using confirmatory factor analysis (CFA), a rigorous method for assessing how well the internal structure of the items conforms to the theoretical model underlying the measure, despite the fact that one of the early criticisms raised by Koopmans et al. (1994) about existing measures of irrational beliefs was that “little was known about the factorial validity of these instruments” and “factor replication studies showed that the factor structures of these instruments were only moderately congruent with the original patterns” (p. 15). Since the publication of the article detailing the original development of the IBI, there have been only two studies that have factor analyzed the IBI scores, and each used exploratory factor analysis. Woodward et al.’s (2001) study of 127 parents of children in primary schools in Australia identified five to seven factors, but only five factors were theoretically meaningful. Al-heeti et al. (2012) used principal components analysis with varimax rotation to evaluate the IBI in two samples of United Arab Emirates undergraduate students (ns=384 and 251). The researchers identified five components, but only 34 of the 50 items met the researchers’ criteria of loading .40 or greater on one component with low loadings on the remaining components (8 of 12 for Worrying, 7 of 10 for Problem Avoidance, 7 of 7 for Emotional Responsibility, 6 of 7 for Demand for Approval, and 6 of 14 for Rigidity).
The limited focus on the psychometric properties of the IBI also is reflected in the number of studies using the IBI (19 out of 37; see Table 1 in supplemental file) that did not report any measure of reliability for the IBI scores used in their studies. Studies that did not report psychometric results for their own data frequently referred to the results reported in the original IBI development study (Koopmans et al., 1994) to support the psychometric quality of the IBI (i.e., reliability induction). This inadequate reporting of reliability estimates for a measure is not unique to the IBI. Vacha-Haase and Thompson (2011) found that the majority (about 55%) of approximately 13,000 studies did not discuss reliability. Inadequate reporting of the measurement properties of the IBI is inconsistent with the guidelines provided by the American Educational Research Association (AERA; Duran et al., 2006), which call for transparency and completeness in reporting the psychometric properties of measures used in research.
Study aims
The present study had three purposes. In view of the limited analyses of the factor structure and psychometric properties of the IBI, the first purpose of the present study was to use confirmatory factor analysis (CFA) to evaluate the five-factor structure (Worrying, Rigidity, Problem Avoidance, Need for Approval, Emotional Irresponsibility) underlying the IBI in a sample of university students in the United States. The present study represents the first confirmatory factor analysis of the IBI. Our focus on the internal structure of the IBI was designed to add to the construct validity evidence of the IBI and is consistent with the Standards for Educational and Psychological Testing (American Educational Research Association et al., 2014), which encourages evaluation of measurement quality each time a measure is used. According to the Standards, measurement quality, such as the validity of a measurement model, cannot be assumed but must be tested.
In addition to examining the factor structure of the IBI when used with university students in the United States, the second purpose of this study was to evaluate the factor structure of a newly translated and adapted Icelandic version of the IBI. The impetus behind the translation of the IBI for use in Iceland was research interest in how a highly abstract construct used in clinical practice (i.e., irrational beliefs) might translate across cultures that are both Western and industrialized, yet different across several dimensions (e.g., demographics and public health infrastructure). Our hypothesis was that the five-factor structure of the IBI would hold in both the U.S. and Icelandic samples. As noted in the Standards for Educational and Psychological Testing (American Educational Research Association et al., 2014), “when a test is translated and adapted from one language to another, test developers and/or test users are responsible for describing the methods used in establishing the adequacy of the adaptation and documenting empirical or logical evidence for the validity of test score interpretations for intended use” (p. 68). Consistent with this standard, we describe in the Method section the details of the translation and adaptation process for the IBI, and then evaluate the factor structure of the IBI when used with students in Iceland to provide empirical support of the validity of the IBI scores.
Finally, because one intended use of the translated IBI was to facilitate cross-cultural research, the third purpose of this study was to evaluate the measurement invariance of the IBI (e.g., item factor loadings, item intercepts, item error variances) across samples in the United States and Iceland. If the IBI’s item factor loadings, item intercepts, and item error variances differ across groups, comparisons of mean scores on the five latent factors measured by the IBI would need to be viewed with caution.
Method
Participants
Student samples were obtained from the United States and Iceland. The total U.S. sample consisted of 849 university students from one large urban research university in the southeastern United States that had around 34,000 undergraduates and 8,500 graduate students. Students’ median age was 21 years (M = 23.0, SD = 6.6, range = 18–61) and 75% were female. Samples were obtained by contacting faculty face-to-face and via email. Twenty-one faculty members were contacted, 19 of whom were able to participate.
The total Icelandic sample consisted of 733 university students from one public and one private university in an urban area. The student population of the public university was around 10,000 undergraduates and 3,500 graduate students, whereas the private university had around 2,300 undergraduates and 600 graduate students. The students’ median age was 24 years (M = 26.0, SD = 6.2, range = 19–61) and 69% were female. For recruitment, the first author sent requests to faculty via email. Twelve faculty members in Iceland were contacted, all of whom participated.
To address potential biases due to language issues and diversity, respondents were asked to provide information on their native language and country of citizenship. Only responses from native speakers of English were used for analysis in the U.S. sample, and an analogous procedure was followed for the Icelandic sample. Participants who did not state their native language or citizenship were excluded. After these exclusions, the final U.S. sample consisted of 827 eligible participants (22 students, or 2.6%, were excluded), and the final Icelandic sample consisted of 720 eligible participants (13 students, or 1.8%, were excluded).
Instrument and procedures
The IBI consists of 50 items using terminology derived from Rational Emotive Behavior Therapy (REBT) theory, 37 of which are worded towards irrationality and 13 towards rationality; these latter 13 items were reverse-scored for analyses. Students responded on a five-point Likert scale (1 = Strongly disagree, 2 = Disagree, 3 = Neutral, 4 = Agree, and 5 = Strongly agree). The instrument was translated from English into Icelandic by three translators, one of whom was the first author. Once a preliminary version had been synthesized from the three translations, the IBI was back-translated into English by two additional translators who had not worked on the instrument before. Once the back-translation was complete, the first author convened a cultural review committee of three members in Iceland to gauge language, tenor, reading difficulty, cultural appropriateness, and other potential issues in wording. Feedback on both U.S. and Icelandic versions was also obtained using cognitive interviews (think-aloud and verbal probes) with four U.S. and four Icelandic university students. For more details on the translation/adaptation process, and a copy of the Icelandic version of the IBI, see Heimisson (2011).
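The reverse-scoring step described above can be illustrated with a minimal sketch. The item numbers in REVERSED_ITEMS are hypothetical placeholders (the 13 reverse-scored items are not listed in this section), and the function names are illustrative rather than taken from the study's analysis code.

```python
# Reverse-score rationally worded items on a 1-5 Likert scale.
# REVERSED_ITEMS is a hypothetical placeholder: the actual 13 item numbers
# are not given in this section.
REVERSED_ITEMS = {2, 7}
SCALE_MIN, SCALE_MAX = 1, 5


def reverse_score(response: int) -> int:
    """Map 1->5, 2->4, 3->3, 4->2, 5->1."""
    if not SCALE_MIN <= response <= SCALE_MAX:
        raise ValueError(f"response {response} is outside the 1-5 scale")
    return SCALE_MIN + SCALE_MAX - response


def score_responses(responses: dict) -> dict:
    """Return item scores with the reverse-keyed items flipped."""
    return {
        item: reverse_score(value) if item in REVERSED_ITEMS else value
        for item, value in responses.items()
    }


# Example: item 1 is kept as answered; items 2 and 7 are flipped.
scored = score_responses({1: 4, 2: 4, 7: 1})
```

After this step, higher scores on every item point in the same direction (greater irrationality), which is what the factor model assumes.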
The same administration procedures were used for the U.S. and Icelandic samples. The first author administered the IBI personally in university classrooms in a paper-based test booklet format, with an informed consent document that was approved by the University’s Institutional Review Board. The IBI took around 15–20 minutes to complete.
Statistical analyses
Confirmatory factor analysis was conducted using Mplus 8.4 (Muthén & Muthén, 1998–2017) to evaluate the five-factor model for the IBI separately for students in the United States and Iceland. Analyses were based on the variance-covariance matrix of the 50 items and used full information maximum likelihood (ML) estimation. Missing data were handled with full information ML using the EM algorithm. The amount of missing data was minimal in each country and ranged from 0% to 3.5% in the United States and 0% to 1.1% in Iceland.
Model fit was evaluated using χ2, the root mean square error of approximation (RMSEA), and the standardized root mean square residual (SRMR). The comparative fit index (CFI) was not used because the RMSEA values for the null models for the U.S. and Icelandic samples were .100 and .098, respectively, values less than the .158 value, which Kenny (2015) identified as a general rule of thumb for where an incremental measure of fit like the CFI “may not be that informative.” Hu and Bentler’s (1999) cutoff values of < .06 on the RMSEA and < .08 on the SRMR were used as general indicators of acceptable fit of the models. These global measures of fit can be affected by other factors, such as the magnitude of the standardized loadings and the number of items (Greiff & Heene, 2017; McNeish et al., 2018), and therefore, we examined potential local model misspecifications and parameter estimates (e.g., factor loadings). In line with this examination we used guidelines by Hair et al. (2014) for evaluating the standardized loadings. Hair et al. (2014, p. 618) suggested that “standardized loading estimates should be .5 or higher, and ideally .7 or higher.” The rationale for ideally wanting standardized loadings greater than .7 is that the square of these loadings can be viewed as the percentage of the variance in an item that can be explained by a factor. Thus, a loading of .7 would indicate that 49% of the variance in an item can be explained by a factor.
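The relationship between a standardized loading and the share of item variance explained by the factor can be made concrete with a short sketch. The function name is illustrative; the cut points are those discussed in the text.

```python
def variance_explained(loading: float) -> float:
    """Proportion of an item's variance explained by its factor,
    given a standardized loading from a model with no cross-loadings."""
    return loading ** 2


# Cut points considered in this study: .40 (common in EFA) through .70
# (Hair et al.'s ideal).
for cut in (0.40, 0.50, 0.60, 0.70):
    print(f"loading {cut:.2f} -> {variance_explained(cut):.0%} of item variance")
```

A loading of .50 therefore corresponds to only a quarter of an item's variance, which is why the lower bound of the guideline is a fairly lenient standard.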
Measurement invariance of the five-factor IBI model for the U.S. and Icelandic groups was tested using multigroup confirmatory factor analysis (MCFA). This analysis involves testing a series of hierarchically ordered models of increasing restrictiveness that evaluates the means and covariance structure of the observed variables (i.e., items on the IBI). Model identification was achieved using the effects coding method (Kline, 2016), which involves constraining the mean unstandardized item factor pattern coefficient to 1.0 within each factor and the average intercept to zero.
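As a sketch of the effects coding constraints described above, let factor k have n_k items with unstandardized loadings λ_jk and intercepts τ_jk. This notation is introduced here for illustration and is not taken from the original analysis.

```latex
\frac{1}{n_k}\sum_{j=1}^{n_k}\lambda_{jk} = 1,
\qquad
\frac{1}{n_k}\sum_{j=1}^{n_k}\tau_{jk} = 0,
\qquad k = 1,\dots,5 .
```

Under these constraints each latent variable is scaled in the average metric of its own items, so no single item has to serve as a reference indicator.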
The first model tested (Model 1) evaluated configural invariance, the least restrictive invariance model. For configural invariance, the same five factors of the IBI and their associated items had no equality constraints imposed on the factor loadings, item intercepts, item error variances, and factor variances and covariances across the U.S. and Icelandic groups. Model 2, which assumes configural invariance, was used to test metric invariance (also known as weak invariance), which evaluates the equality of the unstandardized factor loadings across the groups. If the factor loadings representing the relationships between the items and the latent variables (e.g., Worrying) are different across groups, this suggests that the items may have different meanings across groups. A lack of metric equivalence is often referred to as differential item functioning (DIF), specifically non-uniform DIF (i.e., item response differences between groups vary across the levels of the latent variable). Model 3, which assumes metric invariance, was used to test scalar invariance (also known as strong invariance), which evaluates the equality of the item intercepts (i.e., the extent to which students endorse an item). A lack of invariance in the intercepts is an indication of uniform DIF (i.e., after equating the groups on a latent variable, one group’s item responses differ in the same direction from the other group’s responses across all levels of the latent variable). Model 4 was a follow-up test to Model 3 that was used to evaluate partial scalar invariance. After conducting tests of configural, metric, scalar, and partial scalar invariance, we examined invariance of the item error variance parameters (referred to as strict invariance) in Model 5. Evaluating invariance of the error variances provides information about differences in the reliabilities of the item scores. In the literature involving applications of measurement invariance, there are many cases where strict invariance is not tested. However, as noted by Wu et al. (2007, p. 19), “inequality in the residuals may distort the loading/intercept metric equality,” and therefore strict invariance was examined as part of the evaluation of the equality of the measurement model (Models 1-5). Finally, we examined structural invariance in terms of the equality of the factor variances (Model 6) and factor covariances (Model 7).
The strategy used to examine the various levels of measurement invariance was to evaluate the chi-square change (Δχ2) for statistical significance relative to the change in degrees of freedom (Δdf) for the models being compared. These tests were supplemented by comparing the changes in the RMSEA and SRMR to the guidelines presented by Chen (2007). When the changes in these fit indices for the more restrictive model met Chen’s guidelines (ΔRMSEA < .015 and ΔSRMR < .03 for the factor loadings, and ΔRMSEA < .015 and ΔSRMR < .01 for the item intercepts), we concluded that the hypothesis of invariance was tenable (i.e., we did not reject the null hypothesis of equality).
Cognitive interviews
To gain insight into the quantitative results related to the IBI items, the first author conducted one-on-one cognitive interviews (Willis, 2004) with four students from the United States and four from Iceland. The four students from the United States were all female between the ages of 21-35 years; the Icelandic students were two males and two females between the ages of 24-36. Each interview consisted of a think-aloud session and follow-up probes. In the think-aloud sessions, participants were instructed to read each item aloud and to ask the first author if they had any questions or comments on the item. The first author also made notes if participants showed hesitation or other reactions indicating that an item might affect the participant in an unintended way. In the follow-up sessions, the first author asked about items that were flagged in the think-aloud session (e.g., “You asked if this survey had to do with fundamentalist religion. What made you think that?”; “On several items, you remarked that they were strange. Can you elaborate on that?”). Comments from the cognitive interviewing sessions were listed, analyzed for common themes by the first author, and independently confirmed by the second author.
Results
Descriptive statistics
For the U.S. sample, mean scores for the 50 IBI items ranged from 1.96 (SD = 0.84) for item 47 (“I dislike responsibility,” Problem Avoidance item) to 3.98 (SD = 0.78) for item 34 (“Helping others is the very basis of life,” Rigidity item). For the Icelandic sample, mean scores ranged from 1.73 (SD = 0.71) for item 44 (“One should rebel against doing unpleasant things, however necessary, if doing them is unpleasant,” Problem Avoidance item), to 4.03 (SD = 0.72) for item 34, the same Rigidity item that had the maximum score in the U.S. sample. Responses to the IBI items for the students in the United States and Iceland were approximately normally distributed (see Table 2 in the supplemental file).
Table 2. Correlations of Irrational Beliefs Inventory factors for students in the United States (n=827) and Iceland (n=720) compared with Bridges and Sanderman (2002) and Koopmans et al. (1994).
Note. Pearson product moment correlations between factors are listed in the following order: current U.S. sample/current Icelandic sample/U.S. 2002 sample/Dutch 1994 sample. U.S. 2002 sample = Bridges and Sanderman’s (2002) U.S. sample of 248 undergraduate students; Dutch 1994 sample = Koopmans et al.’s (1994) sample of 538 university students from the Netherlands. Intercorrelations between the factors for the current U.S. and Icelandic samples were derived from the five-factor confirmatory factor analysis model.
Confirmatory factor analysis
Confirmatory factor analysis (CFA) was used for three purposes in this study: 1) evaluate the five-factor model of the IBI in a U.S. sample, 2) do the same in an Icelandic sample for a newly translated Icelandic version of the IBI, and 3) evaluate the measurement invariance of the IBI across the U.S. and Icelandic samples. We report the results separately for each purpose.
U.S. sample
The five-factor model for the U.S. sample, which is currently used in the literature, showed a statistically significant lack of fit based on the chi-square, χ2 (1165, N = 827) = 3612.32, p < .001. Alternative measures of fit, less sensitive to sample size, suggested that the overall model was acceptable for the U.S. sample (RMSEA = .050, 90% confidence interval = .049 to .052; SRMR = .061). In interpreting these alternative measures of fit, it is important to keep in mind that the model fit guidelines offered by Hu and Bentler (1999) were based on a Monte Carlo simulation study in which several conditions were manipulated (e.g., sample size) but the magnitude of the standardized factor loadings was set to be between .70 and .80. As noted earlier in the Method section, McNeish et al. (2018) found that when the measurement quality of constructs is poor, as reflected by weak factor loadings, these alternative measures of fit have reduced sensitivity to identifying model misspecifications. In view of the relationship between weak factor loadings and the performance of these measures of model fit, we evaluated the magnitude of the standardized item factor loadings for the five factors underlying the IBI. Table 1 summarizes these standardized loadings by various cut points. We started by using a ≥.40 loading, as this is a common cut point for exploratory factor analyses (EFA; see Henson & Roberts, 2006) such as the ones that were used in the development of the IBI (Koopmans et al., 1994). We then tested the model with increasingly stringent cut points for the standardized loadings (.50, .60, and .70), as these cut points are more the norm when using confirmatory factor analysis (Hair et al., 2014).
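As a rough check on the RMSEA reported above, the point estimate can be approximately recovered from the chi-square, its degrees of freedom, and the sample size. The sketch below assumes the common N − 1 formulation; some software divides by N instead, which changes the result only in the fourth decimal place here. The function name is illustrative.

```python
import math


def rmsea(chi_square: float, df: int, n: int) -> float:
    """Single-group RMSEA point estimate, assuming the N - 1 formulation."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


# Values reported for the U.S. five-factor model.
us_rmsea = rmsea(chi_square=3612.32, df=1165, n=827)
print(f"U.S. RMSEA is approximately {us_rmsea:.3f}")
```

The sketch reproduces the reported value of .050 to three decimals; it does not reproduce the 90% confidence interval, which requires the noncentral chi-square distribution.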
Using the least stringent cut point for the standardized factor loadings, we found that 40 out of the 50 loadings met this criterion, with the percentage of items meeting this criterion ranging from 70% (7 of the 10) for the Problem Avoidance items to 100% of the seven Emotional Irresponsibility items. Although the .40 cut point is commonly used in EFA and may be appropriate for the measurement of highly abstract constructs such as irrational beliefs, a value of .40 represents a relatively weak relationship between an item and a factor because a .40 loading indicates that only 16% of the variance in an item’s scores can be explained by a factor. When we used the lower end of Hair et al.’s rule of thumb for standardized loading estimates (≥.50) from a CFA, 27 of the 50 IBI items (54%) met this criterion. The percentage of items that met this criterion within each factor varied from 28.6% of the Rigidity items (4 out of the 14 items) to 100% of the Emotional Irresponsibility items (7 out of the 7 items). As can be seen from Table 1, as the criterion increased to ≥.60 only nine items met this criterion, and when ≥.70 was used (Hair et al.’s ideal) only 4 of the 50 IBI items met this criterion. None of the items in the Rigidity, Problem Avoidance, and Emotional Irresponsibility scales met the ≥.70 standardized factor loading criterion. Overall, the standardized loadings for the IBI items were weak, varying from an average of .43 for the Rigidity scale to .63 for the Need for Approval scale.
The correlations between the five factors measured by the IBI varied from -.21 (Emotional Irresponsibility and Rigidity) to .53 (Need for Approval and Worrying). In general, the correlations in the present study were larger than those reported by Bridges and Sanderman (2002) in their U.S. sample of 248 undergraduate students and those reported by Koopmans et al. (1994) in their sample of 538 university students from the Netherlands (see Table 2).
In view of the statistically significant chi-square for the five-factor model, we evaluated the modification indices (MI) for the model to explore potential sources of misfit. Modification indices represent estimations of how much the chi-square would be reduced (improved fit) if a parameter that was set to zero was added to the model. Based on the modification indices, a major source of misfit involved correlations between measurement errors for pairs of items. A total of 85 item pairs with correlated errors were identified when the minimum modification index was set at the critical value of 10.83 (p < .001). The largest MI (121.68) was the correlation between the errors for two similarly worded Emotional Irresponsibility items (17. “Nothing is upsetting in itself - only in the way you interpret it,” and 30. “People are disturbed not by situations but by the view they take of them”). An additional source of model misfit involved items with loadings on one or more additional factors. The largest MI (72.19) was for the Need for Approval item 23 (“I hate to fail at anything”), which had a secondary loading on the Worrying factor. There were 31 MIs greater than the critical value of 10.83 (p < .001), which were reflective of items with relationships with one or more additional factors beyond the factor the item was intended to measure.
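The critical value of 10.83 used to screen the modification indices corresponds to a chi-square test with one degree of freedom at p = .001. A minimal sketch of that correspondence, using only the standard library, is given below; the function name is illustrative, and the identity used holds only for one degree of freedom.

```python
import math


def chi_square_1df_p_value(statistic: float) -> float:
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom.

    For 1 df, the chi-square variable is a squared standard normal, so
    P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(statistic / 2.0))


# A modification index of 10.83 sits just below the p = .001 threshold.
p_at_cutoff = chi_square_1df_p_value(10.83)
print(f"p at MI = 10.83: {p_at_cutoff:.5f}")
```

Because each modification index is a separate one-degree-of-freedom test, a strict threshold such as this limits the number of parameters flagged by chance when many fixed parameters are screened.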
Icelandic sample
The five-factor model for the Icelandic sample showed a significant lack of fit based on the chi-square, χ2 (1165, N = 720) = 3177.82, p < .001. The alternative measures of fit, however, suggested acceptable overall fit for the Icelandic sample (RMSEA = .049, 90% confidence interval = .047 to .051; SRMR = .060). Following the same procedures with the Icelandic sample as with the U.S. sample, we evaluated the standardized loadings from the five-factor model of the IBI using the criteria ≥.40, ≥.50, ≥.60, and ≥.70. A total of 38 items in the Icelandic sample met the ≥.40 loading criterion. Using Hair et al.’s rule of thumb for standardized loadings at ≥.50, a total of 27 items in the Icelandic sample met the criterion. At the ≥.60 cutoff point, only 12 items met the criterion, with no item left on the Rigidity factor. When we used the most stringent cutoff point proposed by Hair et al. (≥.70), only three out of 50 items on the IBI met the criterion, all on the Need for Approval factor. Overall, the standardized loadings for the IBI items in the Icelandic sample were weak, varying from an average of .36 for the Rigidity scale to .63 for the Need for Approval scale. The average standardized item factor loadings for each factor in the Icelandic sample were similar to those in the U.S. sample, with the largest difference occurring on the Rigidity factor (U.S. = .43 and Iceland = .36; see Table 1). Twenty-two items had standardized loadings that were greater than or equal to .50 for both the U.S. and Icelandic samples, and each sample had 5 unique items that had standardized loadings ≥.50 (see Table 3 for the item content of the 32 items).
Table 3. Irrational Beliefs Inventory items with standardized factor loadings ≥ .50 in the U.S. or Icelandic samples.
Note. Standardized loadings are from the five-factor correlated CFA model for the IBI, analyzed separately for the U.S. and Icelandic samples.
aStandardized factor loadings ≥ .50 in both U.S. and Icelandic samples. US = Standardized factor loadings ≥ .50 in the U.S. sample only. IS = Standardized factor loadings ≥ .50 in the Icelandic sample only.
The correlations between the five factors in the Icelandic sample ranged from -.23 (Emotional Irresponsibility and Rigidity) to .56 (Need for Approval and Worrying), and the overall pattern of the correlations in the Icelandic sample largely followed that of the U.S. sample (Table 2).
Following the same procedure as with the U.S. sample, we evaluated the modification indices (MI) for the model to explore potential sources of misfit. Mirroring the results from the U.S. sample, correlations between measurement errors for pairs of items were a major source of misfit; we identified a total of 68 item pairs with correlated errors when the minimum modification index was set at the critical value of 10.83 (p < .001). The largest MI (95.88) was a correlation between errors for two similarly worded items on the Need for Approval factor (37. “It is important to me that others approve of me” and 43. “What others think of you is most important”). Additionally, model misfit involved items with secondary loadings on one or more additional factors. The largest MI (101.95) was for the Need for Approval item 23 (“I hate to fail at anything”), which had a secondary loading on the Worrying factor. We identified 29 MIs greater than the critical value of 10.83 (p < .001), which were reflective of items with relationships with one or more additional factors beyond the intended factor in the five-factor model.
Reliability
Cronbach’s alpha reliability coefficients in the U.S. and Icelandic samples were greater than .70 for all factors except for Rigidity in the Icelandic sample (α = .67). Koopmans et al.’s (1994) Cronbach’s alpha reliability coefficients from Studies 2 and 3 were slightly larger than those found in the present study. The biggest exception was Need for Approval, which had a Cronbach’s alpha coefficient of .82 for both the U.S. and Icelandic samples in the current study (Koopmans et al. reported alphas of .75 and .80).
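For readers less familiar with the statistic, Cronbach’s alpha for a k-item scale can be computed from the item variances and the variance of the total score. The sketch below uses a small made-up data set (not IBI data) and an illustrative function name.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for complete data.

    rows: one list of item scores per respondent, all of equal length.
    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    """
    k = len(rows[0])
    if k < 2:
        raise ValueError("alpha requires at least two items")
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses from five people to a three-item scale.
toy_scores = [[1, 2, 1], [2, 2, 3], [3, 4, 3], [4, 5, 5], [5, 4, 4]]
toy_alpha = cronbach_alpha(toy_scores)
```

Alpha rises with both the number of items and the average inter-item covariance, so a 14-item scale such as Rigidity can reach a moderate alpha even when individual loadings are weak.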
Measurement invariance
Table 4 summarizes the results of the series of invariance tests for the five-factor IBI model. Results of the multigroup confirmatory factor analysis indicated that there was support for configural invariance (equal form). Although the chi-square for the configural model was statistically significant, χ2 (2330, N = 1547) = 6790.139, p < .001, alternative measures of fit indicated that the model had acceptable fit (RMSEA = .050, SRMR = .061). In testing for metric invariance, Model 2 (equal factor loadings) was compared to Model 1 (configural model). The change in chi-square was statistically significant (Δχ2 = 456.359, Δdf = 45, p < .001); however, the changes in the alternative measures of fit using Chen’s (2007) guidelines indicated that the hypothesis of equal loadings was tenable (ΔRMSEA = .001, ΔSRMR = .008).
Table 4. Invariance tests for the Irrational Beliefs Inventory for students in the United States (n = 827) and Iceland (n = 720)
Note. RMSEA = Root Mean Square Error of Approximation; SRMR = Standardized Root Mean Square Residual. Numbers in parentheses represent the 90% confidence interval for the RMSEA.
a. Item intercepts that were invariant: Worrying (1, 10, 16, 19, 26, 32); Rigidity (4, 29, 34); Problem Avoidance (11, 25, 31, 36, 40); Need for Approval (5, 37); and Emotional Irresponsibility (12, 30, 42, 46).
***p < .001. ns = not statistically significant (p > .05).
Next, we tested the equality of the item intercepts (scalar invariance) by comparing Model 3 with Model 2. The change in chi-square was statistically significant (Δχ2 = 4561.131, Δdf = 45, p < .001), and the changes in the alternative measures of fit also exceeded Chen’s guidelines (e.g., ΔRMSEA = .020, ΔSRMR = .019). Therefore, we concluded that there were significant differences in the item intercepts for the U.S. and Icelandic samples. We then tested each item’s intercept, one at a time, to evaluate whether there was evidence of partial scalar invariance. In view of the number of statistical tests, we evaluated the change in chi-square for each one-degree-of-freedom test at the .0001 level of significance. Results of these tests identified 20 items that did not show evidence of significant differences in the item intercepts. Invariant item intercepts included 6 out of the 12 Worrying items, 3 out of the 14 Rigidity items, 5 out of the 10 Problem Avoidance items, 2 out of the 7 Need for Approval items, and 4 out of the 7 Emotional Irresponsibility items. When these 20 item intercepts were constrained to be equal, the change in chi-square (Δχ2 = 86.909) was statistically significant; however, the changes in the alternative measures of fit using Chen’s (2007) guidelines indicated that the hypothesis of equal intercepts for this set of items was tenable (ΔRMSEA = .001, ΔSRMR = .000; see Table 4 comparing Model 4 vs. Model 2).
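The decision rule applied to the changes in fit can be sketched as a simple threshold check. The cutoffs below are values commonly attributed to Chen (2007) for samples larger than 300; they are not stated in the text above and should be treated as assumptions. Chen’s guidelines also involve the change in CFI, which is not reported here, so this sketch is a simplification that uses only ΔRMSEA and ΔSRMR.

```python
# Cutoffs commonly attributed to Chen (2007) for N > 300.
# Assumed for illustration; they are not given in this article.
CUTOFFS = {
    "loadings": {"rmsea": 0.015, "srmr": 0.030},
    "intercepts": {"rmsea": 0.015, "srmr": 0.010},
}


def invariance_tenable(level, d_rmsea, d_srmr):
    """Return True when both fit changes stay below the assumed cutoffs."""
    limits = CUTOFFS[level]
    return d_rmsea < limits["rmsea"] and d_srmr < limits["srmr"]


# Changes in fit as reported in the text.
print(invariance_tenable("loadings", 0.001, 0.008))    # metric (Model 2 vs. 1)
print(invariance_tenable("intercepts", 0.020, 0.019))  # full scalar (Model 3 vs. 2)
print(invariance_tenable("intercepts", 0.001, 0.000))  # partial scalar (Model 4 vs. 2)
```

With these assumed cutoffs, the check reproduces the pattern of conclusions in the text: metric invariance and partial scalar invariance are tenable, whereas full scalar invariance is not.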
Tests of the equality of the item error variances resulted in a significant change in chi-square, but the alternative measures of fit indicated that the hypothesis of equal error variances for this set of items was tenable (e.g., ΔRMSEA = .001, ΔSRMR = .005). Lastly, we evaluated structural equivalence by examining the equality of the five factor variances (Model 6) and 10 factor covariances (Model 7). The change in the chi-square for the test of equal factor variances was statistically significant, whereas the change in chi-square for the test of equal factor covariances was not statistically significant (Δχ2 = 13.922, Δdf = 10, p = .18). In both cases, the alternative measures of fit indicated that the hypotheses of equal factor variances and equal covariances between factors were tenable.
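The p value for the test of equal factor covariances can be checked by hand. As a minimal sketch in standard-library Python, the function below uses the closed-form upper-tail probability of a chi-square distribution that holds when the degrees of freedom are even, which covers the Δdf = 10 case. The function name is illustrative, and the closed form does not apply to the odd Δdf = 45 tests reported earlier.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variate (even df only).

    For even df, P(X > x) = exp(-x/2) * sum_{r=0}^{df/2 - 1} (x/2)^r / r!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for r in range(1, df // 2):
        term *= half / r  # builds (x/2)^r / r! incrementally
        total += term
    return math.exp(-half) * total


# Chi-square difference for equal factor covariances, as reported.
p = chi2_sf_even_df(13.922, 10)
print(round(p, 2))
```

The result rounds to .18, matching the reported p value for this comparison.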
Cognitive interviews
The eight students who participated in the cognitive interviews in the United States and Iceland commented that, as a whole, the tenor and language of the IBI seemed dated. For example, one participant asked if the IBI was written during the Cold War. Similarly, item 10 (“I hardly ever think of such things as death or atomic war”) gave pause to all the participants in the cognitive interviews. The fear of atomic war did not appear to be salient among participants, who instead suggested other events in the public discourse at the time, including but not limited to terrorism, the rise of hostile superpowers, and campus gun violence.
Students also commented about the tedium of responding to the 50 items, noting what they saw as over-repetition of items, or items that they felt were extremely similar to one another. Examples included item 36 (“It is difficult for me to do unpleasant chores”) and item 13 (“I usually avoid chores which I dislike doing”); item 7 (“I tend to become terribly upset when things are not the way I would like them to be”) and item 32 (“I get terribly upset and miserable when things are not the way I like them to be”); and item 30 (“People are disturbed not by situations but by the view they take of them”) and item 17 (“Nothing is upsetting in itself - only in the way you interpret it”).
Students also commented about the use of value-laden words such as evil, bad, and blame, and noted the apparent strong religious content of some of the items (e.g., “It is sinful to doubt the Bible”). Several students saw words such as wicked and severely punished in item 3 (“Certain people are bad or wicked and should be severely punished for their sins”) as archaic, with strong religious connotations. Some of the participants in Iceland commented that the IBI had an “American tone” or “religious feel”, and a couple of participants asked if the IBI was intended to measure religiosity.
Discussion
With the continued use of the Irrational Beliefs Inventory (IBI) in research and clinical practice in multiple countries, there is a need for ongoing evaluations of the psychometric quality of this measure. This applies to research within particular countries, such as the United States and Iceland, and across countries such as when versions of the IBI are translated and adapted for a new language and culture. Measurement plays a central role in research (including cross-cultural research) and clinical practice, and validity and reliability problems with measures such as the IBI can threaten the quality of the decisions made using such measures. For example, in research applications, weak validity and reliability can produce biased statistical estimates and attenuate relationships between variables (Vacha-Haase & Thompson, 2011).
To date, there have been few psychometric analyses of the IBI. Our focus on the internal factor structure, reliability, and measurement invariance of the IBI scores was designed to add to the validity evidence of the 50-item IBI. This focus is consistent with the Standards for Educational and Psychological Testing (American Educational Research Association et al., 2014), which calls for the ongoing collection of empirical and logical evidence that the scores from a measure are valid and reliable for their intended use. Our study represents the first confirmatory factor analysis (CFA) of the five-factor model of the IBI in the United States and in Iceland, and the first evaluation of measurement invariance for these two countries. The results of the present study provide new information about some of the psychometric limitations of the IBI, for both the U.S. and Icelandic samples. Global model fit was marginally acceptable in each sample. More fine-grained analyses identified several sources of model misfit that included weak factor loadings (only 54% of the items in both the U.S. and Icelandic samples met Hair et al.’s minimum guideline of ≥.50 for a standardized factor loading), and the presence of several pairs of correlated errors and items with secondary loadings on factors that the items were not intended to measure. Reliabilities of the scores from the five IBI factors, as measured by Cronbach’s alpha, were generally acceptable (≥.70) in both samples according to Nunnally’s (1978) guidelines for basic research, with the exception of the Rigidity factor in the Icelandic sample (α = .67); these reliability estimates, however, were not exceptionally high. The reliabilities observed in the present study would not be acceptable for individual clinical decision making, where values of .90 would be desired (Nunnally, 1978). The largest Cronbach’s alpha for the five factors for the U.S. and Icelandic samples was .83 for the Worrying factor in the Icelandic sample.
Although there have been no confirmatory factor analyses of the IBI that would allow us to compare our results with other researchers’ findings, results of psychometric analyses of other measures of irrational and rational beliefs are consistent with our findings. For example, Hyland et al. (2017) conducted a confirmatory factor analysis of the Attitudes and Belief Scale 2-Abbreviated Version, a self-report measure of rational and irrational beliefs, and found that the fit of the eight-factor model was unsatisfactory, with only 7 of the 25 (28%) standardized loadings ≥.70. Using Hair et al.’s lower guidelines for standardized loadings of ≥.50, 19 of their 25 loadings (76%) were ≥.50. This percentage of loadings (76%) meeting Hair et al.’s minimum guideline ≥.50 was greater than the percentage of IBI items that met the minimum guideline of ≥.50 in the U.S. and Icelandic samples (54% for both samples) but the results of both studies underscore the challenges of achieving factor loadings ≥.50 when measuring highly abstract and complex constructs like irrational beliefs. Researchers should aim to create items meeting Hair et al.’s ideal factor loading (≥.70), but to retain a sufficient number of items for adequate coverage and content validity of a factor, Hair et al.’s lower guideline of factor loadings ≥.50 may be reasonable for basic research. In the current analysis of the IBI, this lower guideline would result in the retention of 27 items (54% of the items) in both the U.S. and Icelandic samples (22 items were common across both samples and there were five items unique to each sample; see Table 3). Working from the ≥.50 criterion in the U.S. sample, the number of items per factor ranged from four items (Rigidity and Problem Avoidance) to seven items (Worrying and Emotional Irresponsibility). Using the same criterion in the Icelandic sample, the number of items per factor ranged from three items (Rigidity and Emotional Irresponsibility) to nine items (Worrying). 
These items may be viewed as an initial set of items to begin with as researchers consider revising the IBI.
Correlated errors may offer additional guidance when revising the IBI. In addition to identifying a number of weak factor loadings in both the U.S. and Icelandic samples, we found several instances of correlated errors for pairs of items in both samples that reduced model fit. Correlated errors are common in questionnaires like the IBI (Brown, 2015). Analyses of the correlated errors identified several items that were worded similarly, which can contribute to correlated errors. Participants in the in-depth cognitive interviews consistently commented about the redundancy of several items. Other method effects due to response styles (e.g., social desirability, acquiescence) and the use of reverse-worded items (37 items were worded towards irrationality, and 13 items towards rationality) may also have contributed to these correlated errors.
Another source of model misfit was the presence of secondary loadings in both the U.S. and Icelandic samples; that is, items that loaded on one or more unintended factors. These secondary loadings provide potential insights into the conceptual clarity of the items. For example, in both the U.S. and Icelandic samples the Need for Approval item 23 (“I hate to fail at anything”) had a secondary loading on the Worrying factor. When item 23 was allowed to load on both the Need for Approval and Worrying factors, the standardized loadings were .02 and .39, respectively, in the U.S. sample, and .12 and .48 in the Icelandic sample. These statistical results, showing an item’s stronger loading on the secondary factor compared to its intended factor, along with information obtained during cognitive interviews, raise questions about the clarity and relevance of this item as an indicator of Need for Approval.
Measurement Invariance
In the present study we were interested in evaluating the psychometric properties of the IBI separately among college students in the United States and Iceland. For those researchers interested in cross-cultural comparisons of the IBI between these two samples, we conducted tests of measurement invariance to determine if there were differences in these measurement properties (item factor loadings, item intercepts, and error variances) between the U.S. and Icelandic samples. As noted by Wu et al. (2007), many factors may threaten the comparability of scores across countries, including difficulties in translating and adapting the items, differences in individuals’ response style, and cultural differences. Because of these many factors, equivalence of measures across different groups cannot be assumed but must be tested. Our results indicated that although there were a number of weak factor loadings in each sample, there were not practical differences in the factor loadings between the two samples. Our results did show, however, that the majority of the item intercepts were different (noninvariant) across the U.S. and Icelandic samples with only 20 item intercepts found to be invariant across the U.S. and Icelandic samples. Of note, Emotional Irresponsibility had the greatest percentage (4 out of 7 or 57%) of invariant item intercepts (i.e., no differential item functioning) and Rigidity had the lowest percentage of invariant item intercepts (3 out of 14 or 21%). The finding of partial measurement invariance is consistent with many measurement invariance studies, which generally find that it is difficult to achieve full measurement invariance (i.e., all item measurement parameters are equal across groups; Putnick & Bornstein, 2016; Vandenberg & Lance, 2000), especially in cross-cultural studies. 
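The per-factor shares of invariant intercepts follow directly from the item lists in the note to Table 4 and the number of items per factor given in the results. The short sketch below tabulates them; the variable names are illustrative.

```python
# Items with invariant intercepts, taken from the note to Table 4.
invariant = {
    "Worrying": [1, 10, 16, 19, 26, 32],
    "Rigidity": [4, 29, 34],
    "Problem Avoidance": [11, 25, 31, 36, 40],
    "Need for Approval": [5, 37],
    "Emotional Irresponsibility": [12, 30, 42, 46],
}
# Number of items per factor, as reported in the results.
n_items = {
    "Worrying": 12,
    "Rigidity": 14,
    "Problem Avoidance": 10,
    "Need for Approval": 7,
    "Emotional Irresponsibility": 7,
}

# Percentage of each factor's items with invariant intercepts.
share = {f: round(100 * len(items) / n_items[f]) for f, items in invariant.items()}
for factor, pct in share.items():
    print(f"{factor}: {len(invariant[factor])}/{n_items[factor]} = {pct}%")
```

The tabulation makes the range explicit, from 21% for Rigidity to 57% for Emotional Irresponsibility, with 20 of the 50 items invariant overall.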
Currently, there is no consensus on what percentage of items needs to be invariant to make valid latent mean comparisons between groups, although Chen’s (2008) research showed that a greater percentage of noninvariant items resulted in more bias in latent mean comparisons. Chen has discussed a number of possible ways of handling noninvariant items (e.g., eliminating the noninvariant items) but ultimately recommended comparing the statistical results (e.g., latent mean comparisons between groups) for a partial measurement invariance model to those from a fully invariant model to determine the practical consequences of noninvariant items. Chen argued that “if the differences are small, it may be justifiable to make group comparisons” (p. 1015). Chen did warn, however, that “trivial differences in statistics do not imply that the construct is conceptually equivalent” across groups. Because this is the first invariance study of the IBI, we offer caution when using the IBI to make group comparisons for U.S. and Icelandic samples until additional research is conducted to evaluate the robustness of our findings. As part of this research, it is recommended that cognitive interviews and cross-cultural expert panel reviews be conducted to gain insight into psychometric differences. In view of the psychometric results for the IBI reported in this study, we now turn to a discussion of some of the limitations of the present study and present recommendations for conducting additional psychometric analyses of the IBI.
What are the Next Steps for the Analysis of the Irrational Beliefs Inventory?
The results of the present study have identified a number of limitations of the IBI. Decisions to revise a measure such as the IBI should not be based on a single study like ours, or on findings based on a single source of validity (e.g., internal structure) or reliability evidence (e.g., Cronbach’s alpha). Although our samples were large in both the United States (n = 827) and Iceland (n = 720), our data were collected from nonprobability samples from a small number of institutions: one from the United States and two from Iceland. Data were also collected at a single point in time and therefore we were not able to evaluate the temporal stability of students’ reporting. Finally, time constraints in administering the 50-item IBI during class sessions precluded the inclusion of additional psychometric measures that could be used to evaluate the relationships between the IBI factors and other theoretically meaningful constructs as a form of criterion-related validity.
The limitations noted above can be addressed by ongoing research conducted by multiple investigators focusing on the psychometric properties of the IBI. Measurement instruments like the IBI are best revised and strengthened through an iterative process using the results from multiple studies. When multiple studies provide consistent evidence of weak factor loadings (e.g., < .50), correlated errors for pairs of items, secondary factor loadings, or qualitative feedback about the limited clarity and relevance of the items, more informed decisions can be made about removing or modifying items. To accomplish this goal, researchers using the IBI need to conduct psychometric analyses of their own data and report details of these results either in published articles or in supplemental materials. By following this recommendation, it will be possible to build a cumulative knowledge base of good and bad items. This feedback can be used to create a more psychometrically sound measure of irrational beliefs. Space limitations in journals have often been cited as a factor in the limited reporting of psychometric information for measures such as the IBI, but many journals are now making it possible for researchers to submit supplemental materials online with their manuscripts that contain detailed psychometric information (e.g., factor loadings) about the measure. Additionally, data repositories such as those provided by the Open Science Framework (https://osf.io/) make it possible for researchers to upload their de-identified datasets so that other researchers can evaluate additional models or evaluate the original model using alternative methods (e.g., a different estimation method). One of the benefits of the availability of item-level data for measurement instruments like the IBI is that meta-analytic methods can be used to provide a more comprehensive evaluation of the quality of items used as part of scales. The study by Carpenter et al. (2016) is a good example of a rigorous analysis of the quality of item-level data from 27 independent studies to determine if the psychometric properties of the items that were established when the measure was first developed hold up in current applications.
Although at this point in time we do not want to make definitive statements about changes in the IBI, we believe that revisions to the IBI are warranted, based on quantitative results of the factor analyses and reliability analyses, and from feedback gained in the cognitive interviews. Comments by participants in both the United States and Iceland identified issues related to dated language and tone of several of the IBI items. An example of a dated item was item 10 in the Worrying factor, which used the word “atomic” (“I hardly ever think of such things as death or atomic war”). Students also commented on the redundancy of several items. For example, two Worrying items were perceived to be very similar: item 7 (“I tend to become terribly upset when things are not the way I would like them to be”) and item 32 (“I get terribly upset and miserable when things are not the way I like them to be”). Deleting redundant items and revising dated language would be one starting point for revising the IBI.
If the results of the present study that found 32 items (22 were the same in the U.S. and Icelandic samples) with standardized loadings ≥.50 in either the U.S. or Icelandic samples were replicated in other studies, these items could become the core of a shorter version of the IBI. Eliminating items with consistently weak psychometric properties, as was done by Al-heeti et al. (2012), who removed 16 weak items to create an Arabic version of the IBI, could enhance the validity and reliability of the IBI scores, while at the same time producing a shorter instrument that would make it more practical for research and clinical purposes. Concerns about the practicality of administering long measures have been raised by many researchers (e.g., Kosovich et al., 2015), who note that long questionnaires may produce fatigue, frustration, disengagement, and refusal to respond. Having a shorter instrument in a research study also makes it possible for researchers to include additional measures in their study.
A potentially shorter version of the IBI (see Table 3) would need further psychometric testing before it could be recommended for research and clinical applications. Evidence of validity consisting of the five major sources of validity evidence outlined in the Standards for Educational and Psychological Testing (American Educational Research Association et al., 2014) would be needed: content, internal structure, relations to other variables, response processes, and consequences of testing, as well as evidence of reliability (e.g., internal consistency, test-retest). This validity and reliability evidence would not need to come from one study, but rather could come from multiple studies that evaluate different aspects of the IBI. By publishing psychometric results from multiple studies, other researchers who use the IBI can identify strengths of the IBI and items that, if revised or removed, may improve the quality of the IBI for research and clinical purposes.
Supplemental Material
Supplemental material, sj-pdf-1-prx-10.1177_0033294120971773 for Factor Structure and Measurement Invariance of the Irrational Beliefs Inventory for University Students in the United States and Iceland by Gudmundur T. Heimisson and Robert F. Dedrick in Psychological Reports