Abstract
BACKGROUND:
The Critical Incident Inventory (CII) was developed to assess stressful exposures in firefighters and emergency service workers. The CII includes six subscales: trauma to self, victims known to fire-emergency worker, multiple casualties, incidents involving children, unusual or problematic tactical operations, and exposure to severe medical trauma.
OBJECTIVES:
To examine the construct validity of all subscales of the Critical Incident Inventory (CII) by assessing the unidimensionality of the scales, and the interval properties of CII subscales by examining fit to the Rasch model and ordering of item thresholds.
METHODS:
This was a secondary data analysis based on survey data collected from a sample of 390 firefighters.
RESULTS:
Item 4 and Item 20 were removed because of unacceptable fit residuals. The revised version of the CII showed satisfactory fit to the Rasch model, with a non-significant chi-square test and an acceptable level of item fit. We rescored the original CII so that all items had dichotomous response options, where 0 represented the original “no experience” response and 1 represented the combination of one, two, or three or more exposures.
CONCLUSION:
The re-appraisal of the revised version of the CII indicated a satisfactory level of Rasch model fit.
Introduction
Firefighters may be routinely exposed to stressful events as part of their occupation. The Critical Incident Inventory (CII) was developed to assess stressful exposures in firefighters and emergency service workers and has been used to understand the potentially negative impacts of such exposures in this population [1]. The CII (firefighter specific) includes six subscales: trauma to self, victims known to fire-emergency worker, multiple casualties, incidents involving children, unusual or problematic tactical operations, and exposure to severe medical trauma [1]. Each subscale is further divided into 2–6 items describing potential critical incident exposures, for a total of 24 items. The CII records the existence of an exposure (a dichotomous no/yes response) and the frequency of exposure (None, One time, Two times, Three or more times) [1]. For scoring, a value of 0, 1, 2, or 3 is assigned to each response, respectively. The CII scale score ranges from 0 to 72 and is generated by summing the item scores [1]. The trauma to self (5 items) subscale included incidents such as serious line of duty injury to self, threat of serious line of duty injury/threat of death to self, incidents necessitating search/rescue involving serious risk to yourself, and direct exposure to extremely hazardous materials or to blood and body fluids. The victims known to fire-emergency worker (5 items) subscale comprised exposures to line of duty death of a fellow emergency worker, serious line of duty injury to fellow emergency worker, threat of serious line of duty injury/threat of death to fellow emergency worker, experience of suicide or attempted suicide by fellow emergency worker, or exposure to incidents with the victim(s) known to you [1]. The multiple casualties (3 items) subscale included responses to incidents involving three or more deaths, one or two deaths, or multiple serious injuries.
The incidents involving children (2 items) subscale consisted of responses to exposures involving serious injury/death to children or severe threat to children. The unusual or problematic tactical operations (6 items) subscale included incidents requiring police protection while on duty, verbal/physical threat by public while on duty, failed mission after extensive effort, critical (negative) media interest, use of deadly force by police at an incident, or critical equipment failure or lack of equipment in any of the above situations. The exposure to severe medical trauma (3 items) subscale comprised incidents involving close contact with burned/mutilated victim, removing dead body, or prolonged extrication of trapped victim with life-threatening injuries.
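As an illustration of the summed scoring described above, the following minimal Python sketch totals 24 item responses coded 0 to 3. The dictionary keys, the helper name cii_total, and the example respondent are hypothetical and are not part of the CII itself; only the subscale sizes and the 0 to 3 coding are taken from the description above.

```python
# Item counts per subscale as described in the text (keys are illustrative labels).
SUBSCALE_ITEMS = {
    "trauma_to_self": 5,
    "victims_known": 5,
    "multiple_casualties": 3,
    "incidents_involving_children": 2,
    "tactical_operations": 6,
    "severe_medical_trauma": 3,
}

N_ITEMS = sum(SUBSCALE_ITEMS.values())  # 24 items in total
MAX_ITEM_SCORE = 3                      # code for "Three or more times"


def cii_total(responses):
    """Sum item scores (0-3 each) into a total score (hypothetical helper)."""
    if len(responses) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} responses, got {len(responses)}")
    if any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("each response must be coded 0, 1, 2, or 3")
    return sum(responses)


# Hypothetical respondent: exposed once to ten items and never to the other fourteen.
example_total = cii_total([1] * 10 + [0] * 14)
```

With six subscales of 5, 5, 3, 2, 6, and 3 items, the maximum possible total is 24 × 3 = 72, which matches the stated score range.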
The Rasch model treats the probability that a participant affirms a given item as a logistic function of the difference between the participant’s ability (θ), for instance upper limb function, and the level of work required (b) expressed by the given item [2–4].
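The logistic relationship described above can be sketched in a few lines of Python. This is a minimal illustration of the standard dichotomous Rasch formula, P = exp(θ − b) / (1 + exp(θ − b)); the function name and the example values of θ and b are hypothetical.

```python
import math


def rasch_probability(theta: float, b: float) -> float:
    """Probability of affirming an item under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# A person located exactly at the item's location has an even chance of affirming it.
p_equal = rasch_probability(theta=0.0, b=0.0)

# A person one logit above the item location is more likely to affirm it.
p_above = rasch_probability(theta=1.0, b=0.0)
```

The probability depends only on the difference θ − b, so shifting both values by the same amount leaves it unchanged.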
Further advances in research extended the Rasch model from its dichotomous form to the polytomous case through the rating scale and partial credit models. The rating scale model assumes that the distances between thresholds are the same across items, whereas the partial credit model does not impose this restriction.
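To make the polytomous extension concrete, the sketch below computes category probabilities under the partial credit model in its usual formulation, where the probability of category k is proportional to the exponential of the cumulative sum of (θ − τ_j) over the first k thresholds. The function name and the threshold values are hypothetical and are not estimates from this study.

```python
import math


def pcm_probabilities(theta, thresholds):
    """Category probabilities for one item under the partial credit model.

    thresholds holds the item's own threshold locations (tau_1..tau_m), so an
    item with m thresholds has m + 1 response categories.
    """
    # Cumulative sums of (theta - tau_j); category 0 corresponds to an empty sum.
    cumulative = [0.0]
    for tau in thresholds:
        cumulative.append(cumulative[-1] + (theta - tau))
    weights = [math.exp(c) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical four-category item (scored 0-3) with three thresholds.
probs = pcm_probabilities(theta=0.0, thresholds=[-1.0, 0.0, 1.0])
```

Because θ equals the second threshold in this example, categories 1 and 2 are equally probable. Under the rating scale model the same threshold spacing would be shared by every item, whereas here each item may carry its own list of thresholds.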
Rasch analysis was established based on item response theory (IRT) [5]. The Rasch model provides a valuable methodology to examine the validity properties of patient-reported outcome measures (PROMs). Rasch analysis is grounded on the assumption that the item scores on a given PROM are highly correlated with item difficulty and individual (respondents’) ability [5]. For example, participants with advanced skills and ability will perform better (get higher scores) than those with lower skills and capabilities. Rasch analysis provides researchers with the opportunity to assess the assumption of unidimensionality: that is, whether the items on any scale or subscale are assessing a single construct. It also evaluates the structure of rating scales and can propose recalibration to convert ordinal scaling into interval-level measurement, supporting the valid summing of individual items to create a total score [4]. The original Rasch model was developed for dichotomous items; however, it has been extended to polytomous cases, also referred to as the rating scale model. The core assumption of the rating scale model is that the distance between adjacent scoring options is approximately 0.5 probability point [4]. However, this assumption does not apply to the partial credit model. Therefore, Rasch analysis with the partial credit model can be applied to PROMs that capture ordinal response options. For example, in the Critical Incident Inventory (CII) questionnaire, respondents (e.g. firefighters) report whether a particular critical (traumatic) event had occurred throughout their careers by indicating “No” or “Yes”, and whether it had occurred “one time,” “two times,” or “three or more times” [1]. While the scoring adds a number value to these descriptors, conducting a Rasch analysis can interrogate whether this scoring structure can discriminate between persons with low or high risk of poor psychological outcomes after traumatic stressors.
Study purpose
We aimed to apply Rasch analysis 1) to test the construct validity of all subscales of the CII by assessing the unidimensionality of the scales, 2) to examine the interval properties of the six subscales of the CII by examining fit to the Rasch model and ordering of item thresholds, 3) to examine the potential of CII score bias based on age, sex and years of service of firefighters and then explore solutions for minimizing any bias by altering the scale.
Methods
Sample
This study was a secondary data analysis based on survey data collected from firefighters in 160 locations across Canada. Three hundred ninety firefighters (272 males and 118 females) completed the survey (response rate 100%) in the prospective study from which we obtained the data used here. Institutional Review Board (IRB) approval was obtained from the Hamilton Integrated Research Ethics Board in 2019, and participants provided informed written consent to have their data used in research. All data were fully anonymized before we accessed them. Previous research has indicated that a minimum sample of 250 subjects is needed for Rasch analysis to produce stable item estimates [5].
In RUMM2030, the partial credit model was selected based on the significant result of the likelihood ratio test [6]. The Rasch analysis involved testing unidimensionality, fit of residuals, ordering of item thresholds, the person separation index, differential item functioning, and local dependency [5]. All analyses were conducted with the RUMM2030 professional suite software. The significance level for the ANOVA and chi-square tests was set at 0.05, and a Bonferroni correction was applied for multiple comparisons. The class interval was initially left at the default setting and then tailored over iterations of the analysis. The specific procedures and their rationale are described as follows.
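The Bonferroni adjustment can be illustrated with a short Python sketch. It assumes that the number of comparisons equals the number of items tested, which is our reading rather than something stated explicitly; under that assumption both the 24-item and 22-item versions give an adjusted level that rounds to the 0.002 reported in the item fit table note. The function name is hypothetical.

```python
def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance level after a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Assumption: one test per item, so the number of comparisons is the item count.
alpha_original = bonferroni_alpha(0.05, 24)  # original 24-item CII
alpha_revised = bonferroni_alpha(0.05, 22)   # revised 22-item CII
```

An item-level chi-square p-value would then be flagged only if it fell below the adjusted level rather than below 0.05.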
Test of fit
The test of fit assesses the degree to which the items of the PROM meet the expectations of the Rasch model. Fit statistics can be examined at the overall and individual item levels. For overall fit, the p-value from the chi-square test of item-trait interaction must be non-significant after applying Bonferroni corrections [5], as a significant p-value indicates a lack of invariance and a concern about the item hierarchy. In contrast, the item–person interaction statistics are transformed to approximate a z score (logits) that follows a standardized normal distribution. Thus, the item mean is set at zero, and we expect a person mean of nearly zero and a standard deviation of approximately 1 to satisfy the assumption of a normal distribution. At the individual level, a fit residual within ±2.5 logits displays adequate fit to the model. The chi-square test is also presented at this level to estimate whether individual person abilities vary from what is predicted. The item characteristic curve (ICC) displays the observed respondent scores plotted against the expected model curve to allow for visual inspection of the model fit [5]. In the case of a good fit, nearly all scores (plots) would follow the expected curve. Steeper plots indicate the measure is likely over-discriminating, and vice versa. Furthermore, the graph facilitates the inspection and identification of outliers. Extreme outliers tend to influence the observed score, cause deviation from the expected model, and lead to misfit concerns. The ICC enables the identification of such outliers and allows the misfitting individuals to be removed from the analysis. From a practical and clinical point of view, respondents with low literacy, cognitive deficits and/or other co-morbid conditions may ultimately misunderstand specific questions and report extreme scores on the corresponding questions.
Threshold
The threshold demarcates the point between two response categories at which either of the two responses is equally probable [5]. A disordered threshold indicates that respondents are likely failing to discriminate between different response options. Threshold maps illustrate the relative distance between thresholds, and response plots allow for visual inspection of disordered progression of person abilities across the response options. This concern can be addressed by collapsing adjacent categories.
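A simple way to express the ordering check is shown below. This is a minimal sketch that assumes threshold locations have already been estimated (for example, exported from the analysis software); the function name and the threshold values are hypothetical.

```python
def thresholds_ordered(thresholds):
    """Return True when each threshold is strictly higher than the previous one."""
    return all(lower < upper for lower, upper in zip(thresholds, thresholds[1:]))


# Hypothetical threshold locations (in logits) for two four-category items.
ordered_item = [-1.2, 0.1, 1.4]
disordered_item = [0.3, -0.5, 1.1]  # the first two thresholds are reversed

needs_rescoring = not thresholds_ordered(disordered_item)
```

An item flagged in this way is a candidate for collapsing adjacent response categories, after which the thresholds would be re-estimated and checked again.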
Targeting
Targeting is also referred to as the scale-to-sample targeting [5]. Here, researchers assess to what degree items can quantify the range of individual abilities demonstrated in the sample. The person item threshold distribution indicates the relative difficulty (item locations) and relative ability (person location) on the same scale of logits. Precise person measurement depends on how well the ranges match each other: the better the ranges match, the higher the precision. Poor targeting often leads to floor or ceiling effects.
Differential item functioning
Examination of differential item functioning (DIF) is used in Rasch analysis to verify whether items of PROMs are stable across different subgroups of the sample [5]. For example, the probability of affirming an item must be equal for men and women in the study sample who have the same level of the trait. Uniform DIF takes place where divergences are consistent across individual groups (e.g. where the scoring progression for an item is different for men and women). Uniform DIF can be addressed by ‘splitting’ the item to create a unique scoring structure for these individual groups to ensure the equality of estimates (in our example, for men and women). It is important to note this does not mean there is an actual difference in the amount of the trait of interest between the groups, but rather that the groups may use different appraisal systems to give ratings of this trait. Non-uniform DIF arises from random error creating differences in item estimates between groups and has no solution other than item removal [4, 7]. The ANOVA statistic and visual inspection (ICC) were assessed and cross-referenced during the analysis.
Dimensionality
Unidimensionality (the basic assumption of the Rasch model) was assessed through principal component analysis (PCA) within item response theory (IRT) [5]. This verifies that all items within one subscale, and all subscales within the CII measure, are measuring the same latent construct. Person estimates derived from the positively and negatively loading items on the first factor were compared using independent t-tests. After item reduction and the rescoring of individual response options on the CII, the PCA was revisited to validate unidimensionality [8]. We treated a proportion of significant t-tests larger than 5% of the total comparisons as an indicator of multidimensionality.
Local independency
As an additional analysis provided by RUMM2030, residual correlations were displayed after the PCA [5]. Factors contributing to the appearance of local dependency (LD) ultimately indicate response dependency and multidimensionality. Residual correlations between any two items greater than 0.3 indicate LD. Item deletion has been suggested as the priority solution to address LD [3]; however, combining two or more locally dependent items into one ‘super item’ is another effective strategy [7].
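The 0.3 screening rule can be sketched as follows. This minimal Python example computes Pearson correlations between item residuals and flags pairs above the cutoff. The residual values, item labels, and function names are hypothetical, and applying the cutoff to the absolute value of the correlation is an assumption on our part.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def flag_local_dependency(residuals, cutoff=0.3):
    """Return item pairs whose residual correlation exceeds the cutoff in size."""
    flagged = []
    for item_i, item_j in combinations(residuals, 2):
        r = pearson(residuals[item_i], residuals[item_j])
        if abs(r) > cutoff:  # assumption: the cutoff applies to |r|
            flagged.append((item_i, item_j, round(r, 2)))
    return flagged


# Hypothetical standardized residuals for three items across six respondents.
residuals = {
    "item_a": [0.5, -0.2, 1.1, -0.9, 0.3, -0.6],
    "item_b": [0.4, -0.1, 0.9, -1.0, 0.2, -0.5],
    "item_c": [0.6, 0.5, -0.4, -0.3, -0.5, 0.1],
}
flagged_pairs = flag_local_dependency(residuals)
```

In this made-up example only the first two items move together strongly enough to be flagged; such a pair would then be considered for deletion or for combination into a ‘super item’.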
Reliability
The person separation index (PSI) displays the level of precision of the estimate for each person and is used as the statistic for internal consistency under the Rasch model [5]. The acceptable threshold was set at 0.7 for group-level use, a level at which the scale is sufficiently reliable to distinguish between at least two groups [4, 9]. Furthermore, 0.8 was used as the satisfactory threshold for the traditional Cronbach’s alpha (α) calculated by the RUMM2030 program [10, 11].
Results
Study participants
Rasch analysis requires a full set of data without any missing values. Data from a total of 390 full-time firefighters (272 males and 118 females) were available for analysis (Table 1). Of the 390 firefighters included, 376 (96.4%) reported exposure to some type of critical incident. The CII total score had a mean of 30 and a range of 0–72 for critical incidents experienced by the firefighters over the span of their entire careers. The total number of firefighters exposed, and the types and frequency of exposure for each CII item, are reported in Table 2. Of the 390 respondents, 351 (90%) endorsed “respond to incident involving one or two deaths”, 314 (81%) indicated “respond to incident involving multiple serious injuries”, 312 (80%) reported “direct exposure to blood and body fluids”, and 300 (77%) indicated being exposed to incidents involving “removal of dead body or bodies”. The most infrequent event exposures were “use of deadly force by police at an incident” (7%) and “serious line of duty injury to self” (16%).
Demographic characteristics
Number of firefighters exposed and times of exposure to critical incidents based on each item (Canada)
The partial credit model was selected for the current analysis since the likelihood ratio test was significant at both the overall questionnaire and subscale levels (p < 0.001).
The initial inspection of the overall questionnaire including 24 items revealed a lack of invariance across the trait, shown by a significant chi-square test for item-trait interaction (χ2 = 249.04, df = 144, p < 0.001). Of the individual subscales, only one, ‘Unusual or Problematic Tactical Operations’, demonstrated model fit on initial inspection (χ2 = 38.85, df = 36, p = 0.34) (Table 3). The overall item fit residual showed unacceptable fit to the Rasch model, as moderate deviation was found in the standardised item fit residual statistic (mean = 0.02, SD = 1.59). Similar misfit was revealed by the out-of-range mean and SD values in all six subscales (Table 3).
Overall fit statistic
Abbreviations: PSI, person separation index; SD, standard deviation; §Criteria for acceptable distribution of fit residual: Mean = 0, SD = 1. *Criteria for significant p-value of chi-square test for item-trait interaction: p < 0.05. ¶Criteria for acceptable level of PSI: PSI > 0.7. ¶¶Criteria for satisfactory level of Cronbach’s alpha: α > 0.8.
To identify the problematic items that may cause misfit, the next step was to check the individual item fit statistics. At the level of the overall CII, Item 4 (Direct exposure to extremely hazardous materials) and Item 20 (Serious injury to fellow without death) were removed because their fit residuals were unacceptable. At the subscale level, the fit residual was equal to 2.59 for item 14 (Victim/s known to you), revealing mild misfit within the ‘Victims Known to Fire-Emergency Worker’ subscale. However, this misfit was not confirmed by the chi-square test, whose p-value was non-significant. No other individual item misfit was noted among the remaining items at the subscale level. The revised version of the CII showed satisfactory fit to the Rasch model, with a non-significant chi-square test (χ2 = 127.01, df = 110, p = 0.13) and acceptable levels of item fit (mean = –0.21, SD = 1.05) and person fit (mean = –0.2, SD = 0.72). All resultant individual item fit statistics met the Rasch expectations (Table 4).
Individual item fit statistic in original CII with item reduction and rescoring strategy
§Fit residual within ±2.5. *Significant p value after Bonferroni correction at 0.002.
Initially, no items showed ordered thresholds on the threshold map in the original version of the CII or its subscales, necessitating rescoring. Ultimately, the rescoring supported treating all items as requiring only dichotomous response options, where 0 represented the original “no experience” response and 1 represented the combination of one, two, or three or more exposures. The new threshold map is presented in Fig. 1.
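The rescoring rule described above amounts to collapsing the three exposure categories into one. A minimal Python sketch follows; the function name and the example responses are hypothetical.

```python
def dichotomize(score: int) -> int:
    """Collapse the original 0-3 frequency coding into 0 (never) or 1 (ever)."""
    if score not in (0, 1, 2, 3):
        raise ValueError("original responses must be coded 0, 1, 2, or 3")
    return 0 if score == 0 else 1


# Hypothetical original responses to five items from one respondent.
original = [0, 1, 3, 2, 0]
rescored = [dichotomize(s) for s in original]
rescored_total = sum(rescored)
```

Under this coding each item contributes at most one point, so a 22-item revised scale would have a total score ranging from 0 to 22.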

Threshold map for revised CII.
Figure 2 shows the targeting of the revised CII scale. The mean of person logits is 0.13, indicating that the participants’ total CII scores are slightly higher than the target of the scale (Fig. 2).

Person-item threshold distribution.
In compliance with the Rasch model, all continuous variables, including age and years of service, were converted to categorical variables according to their 25th, 50th, and 75th percentiles. Therefore, the personal factors for DIF analysis were set as four groups of service years (0 to 4, 5 to 11, 12 to 18, and 19 to 34 years) and four groups of age (20 to 31, 32 to 39, 40 to 47, and 48 to 60 years old). The questionnaire data reported sex as two groups (male versus female). DIF was examined using both statistical critical values (after Bonferroni correction) and visual inspection (ICC curve). The visual inspection was facilitated by plotting the item characteristic curve along with the person trait for given person factors (see Fig. 3). Under the revised version of the CII, uniform DIF was found in item 2 (Threat of serious injury/death to self), item 19 (Incident involving serious risk to yourself), item 6 (Suicide or attempted suicide by fellow), and item 17 (Close contact with burned) across male and female groups. Uniform DIF was also detected in item 12 (Incident involving serious injury or death to children) for different levels of service years. Item splitting by sex and service year groupings resolved all uniform DIF issues (Table 5).
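The grouping of continuous person factors can be sketched as below, using the group boundaries reported above. The sketch assumes age and years of service are recorded as whole years; the constant and function names are hypothetical.

```python
from bisect import bisect_left

# Inclusive upper bounds of the first three groups, taken from the reported ranges.
SERVICE_UPPER_BOUNDS = [4, 11, 18]
SERVICE_LABELS = ["0-4", "5-11", "12-18", "19-34"]
AGE_UPPER_BOUNDS = [31, 39, 47]
AGE_LABELS = ["20-31", "32-39", "40-47", "48-60"]


def assign_group(value, upper_bounds, labels):
    """Map a whole-number value to its group label (hypothetical helper).

    Values above the last bound fall into the final group.
    """
    return labels[bisect_left(upper_bounds, value)]


# Hypothetical firefighter: 35 years old with 11 years of service.
age_group = assign_group(35, AGE_UPPER_BOUNDS, AGE_LABELS)
service_group = assign_group(11, SERVICE_UPPER_BOUNDS, SERVICE_LABELS)
```

Each respondent then carries a categorical label for age and service years, alongside sex, that can serve as a person factor in the DIF analysis.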

Uniform DIF presented in Item 2: Threat of serious line of duty injury or threat of death to self (that did not result in actual serious injury).
DIF summary based on the revised CII
The whole CII questionnaire failed to meet the unidimensionality criterion, as 12.83% of the independent t-tests were significant at the 5% level. When dimensionality was inspected at the subscale level, the percentages of significant t-tests for all subscales were lower than 5%, indicating the unidimensionality of each CII subscale. After item reduction and response rescoring, the overall CII demonstrated unidimensionality (Table 3).
Local independency
The assumption of local independence was mildly violated between item 8 (Responded to incident involving one or two deaths) and item 9 (Responded to incident involving multiple serious injuries) (r = 0.35) in the revised CII. Since the two items represented distinct conceptual meanings, we elected not to delete either item or create a ‘super item’ (Table 6). Further, as both items are on the same subscale, creating a summary score for that subscale has the same statistical effect as a ‘super item’.
Residual correlation based on the revised CII.
*residual correlation > 0.3 or < –0.3.
Both reliability statistics achieved satisfactory levels for the original (PSI = 0.88, α = 0.91) and revised (PSI = 0.86, α = 0.88) versions of the CII. Moreover, according to previous studies, a PSI value above 0.8 indicates the ability to discriminate between at least 3 groups [9]. The PSI was not applicable for each subscale due to the limited number of items.
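One common convention for translating a reliability coefficient into a number of distinguishable groups uses the separation ratio G = sqrt(R / (1 − R)) and the strata formula (4G + 1) / 3. Whether reference [9] relies on exactly this convention is an assumption on our part, but it reproduces the rules of thumb used here: 0.7 corresponds to a little more than two groups and 0.8 to three. The function names in this Python sketch are hypothetical.

```python
import math


def separation_ratio(reliability: float) -> float:
    """Separation ratio G = sqrt(R / (1 - R)) for a reliability coefficient R."""
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in [0, 1)")
    return math.sqrt(reliability / (1.0 - reliability))


def strata(reliability: float) -> float:
    """Number of statistically distinct strata, (4G + 1) / 3."""
    return (4.0 * separation_ratio(reliability) + 1.0) / 3.0


strata_at_070 = strata(0.70)      # acceptable PSI threshold
strata_at_080 = strata(0.80)      # threshold cited for three groups
strata_revised = strata(0.86)     # PSI reported for the revised CII
```

Under this convention the revised CII's PSI of 0.86 corresponds to more than three strata, consistent with the interpretation given above.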
Discussion
To our knowledge, no previous study has used the same analytic strategy on the Critical Incident Inventory to examine 1) the construct validity, including unidimensionality and internal consistency, 2) the thresholds of the response options, and 3) the impact of personal factors such as age, sex, and years of service. The findings of the current study contribute to the body of psychometric validation literature supporting the CII and enhance its application in both research and clinical settings.
The CII in its original format, using a 4-point rating scale for all 24 items summed to calculate a total score, failed to meet the expectations of scale measurement and demonstrated a significant misfit to the Rasch model. All six subscales individually achieved acceptable values for the unidimensionality test; however, the combination of all 24 items did not support this assumption. Since no studies had previously analyzed the CII under item response theory to examine whether the scale represents a single underlying factor, results from our study warrant confirmation by exploratory/confirmatory factor analysis.
With consideration of both the Rasch statistics (threshold map) and the occupational context of firefighting, we merged the adjacent response options of the original CII questionnaire. Firefighters may experience difficulty recalling the number of critical incidents encountered during the extreme working conditions that they are constantly involved in over the course of their career. Such long recall intervals may be particularly prone to response error or bias [12]. Furthermore, their primary focus revolves around the task at hand: the rescue of potential victims, extinguishing the fire, maintaining ventilation, hauling equipment, and forcible entry operations [1]. These physically and cognitively demanding tasks are performed in evolving, risk-intense environments. Other Rasch statistics, including overall fit, individual fit, and unidimensionality, all met the expectations. Therefore, the proposed revised version of the CII, which contains 22 items with dichotomous response options, fits the Rasch model.
Based on previous studies, the mean and SD for item-person interaction should approach 0 and 1, respectively, to maintain the normal distribution of residuals [13]. Throughout our analyses, we monitored these values after every iteration of item reduction and response rescoring. As a result, the distribution of residuals was improved for the revised CII, with an acceptable mean and SD.
Within RUMM2030, different configurations of the class interval result in different overall fit statistics. In addition, no unified rules exist in the literature to support the choice of class interval setting. According to the operation manual [5], the default class interval should be set as 10 and endorsement on each interval should be close to 50. However, we were unable to maintain the default setting throughout our analyses since endorsements on each class interval were not evenly distributed. Therefore, we decided to adjust the setting according to the PSI value (0.86), since it indicated that the revised CII has sufficient ability to discriminate between 3 groups.
Clinical implications
Items 8 and 9 should not be considered in isolation; testing of alternate forms in which item 9 does not always follow item 8 could be instructive on the nature and persistence of the dependency. Evidence indicates that the subjective experience of intense fear and helplessness in response to critical incidents has been considered a factor that contributes to posttraumatic stress disorder (PTSD) [14]. Our Rasch analysis, and subsequently the introduction of the new CII version, may further improve the predictive validity for PTSD and various mental health disorders by increasing the precision of the scores.
Research implications
Informed by the results of our study, future studies could perform field testing for the revised version of CII to establish the measurement properties (reliability) under classical test theory (CTT) and consider order dependency related to the order of answering individual items [15]. Cognitive debriefing of each individual item is also essential to examine the content validity of CII, as such evidence is lacking [5]. This would further be informative to understand potential threats to reliability in the new scoring scheme.
Conclusions
Our Rasch analysis indicated that the original CII with 24 items failed to meet the expectations of the Rasch model, showing multidimensionality, misfit statistics, and disordered thresholds. After removal of item 4 and item 20 and rescoring of the original response options into a dichotomous format, the revised version of the CII showed an acceptable level of fit to the Rasch model. To address the DIF issues, several items need to be split by sex (items 2, 19, 6, 17) and by service years (item 12). The re-appraisal of the revised version of the CII indicated a satisfactory level of Rasch model fit. Further field-testing is needed to establish the psychometric properties of the revised CII under classical test theory.
Conflict of interest
None to report.
Funding
This work was funded by the Ontario Ministry of Labour (FRN #13-R-027).
