Abstract
This article investigates the measurement and structural invariance of a newly developed self-report questionnaire, the Forensic Symptoms Inventory-Revised, aimed at measuring eight cognitive, emotional, and behavioral deficits (aggression, lack of social support, problematic substance use, lack of concentration, anger, poor self-regulation, impulsivity, and sexual problems) among adult forensic outpatients. The sample consisted of 716 outpatients (603 males, 113 females) with a mean age of 38.19 (SD = 12.47). Multi-Group Confirmatory Factor Analyses supported the measurement and structural invariance with respect to gender and age groups (18–23 years and ≥24 years). Between-group comparisons revealed that, compared to females, male outpatients reported more substance related problems, as well as incapacities to control verbal and/or physical aggression. Compared to adults, young adults displayed more inadequate self-regulation skills and reported more social support. These findings may promote the formulation of gender- and age-specific treatment goals.
Introduction
In recent years, many mental health services have been working toward the introduction of routine outcome measurement (ROM), the periodically repeated assessment of patients’ mental health problems. In the Netherlands, ROM in forensic mental health has a particular focus on clinician-rated, rather than client-rated, measures such as the Health of the Nation Outcome Scales (Wing et al., 1998) and a variety of risk assessment instruments, among which is the Historical Clinical Risk Management-20, Version 3 (Douglas, Hart, Webster, & Belfrage, 2013). Traditionally, clinician-rated measures are widely used in forensic settings to rule out clients’ tendency to give socially desirable responses (Bech & Mak, 1995). However, a recent meta-analytic study suggests that differences in treatment effect sizes and in outcome scores based on self-report and clinician-rated instruments might be explained by the fact that self-report ratings capture unique information that is not accessible through clinician ratings, and vice versa (Cuijpers, Li, Hofmann, & Andersson, 2010). For instance, self-report measures allow respondents to rate their performance in social and occupational activities. Despite these insights, little progress has been made in the development of a broad-spectrum self-report instrument to monitor the progress of forensic outpatients in the key areas of forensic problems (Shinkfield & Ogloff, 2014). To fill this gap, the Forensic Symptom Inventory-Revised (FSI-R adults) was developed. The present study reports on the measurement and structural invariance of this newly developed self-report instrument, which, combined with clinician-rated instruments, can be used to monitor changes in cognitive, emotional, and behavioral deficiencies in forensic outpatients.
From (meta-analytic) studies, it becomes clear that cognitive, emotional, and behavioral regulatory deficits (symptomatic of various clinical and personality disorders) are associated with offence-related behaviors such as substance misuse, impulsivity, inadequate problem solving, and poorly regulated anger (Davey, Day, & Howells, 2005; Haden & Shiva, 2009; Leenaars, 2005; Smallbone & Milne, 2000; Stinson, Robbins, & Crow, 2011). These inadequacies can be considered as important dynamic risk factors that should be the focus of any cognitive-behavioral forensic intervention to reduce the risk of re-offending (Andrews & Bonta, 2010; Day, 2009). In the FSI-R adults, relevant risk-related domains such as anger, aggression, substance misuse, and impulsivity are targeted.
From previous studies, it is known that dynamic risk factors differentiate between male and female offenders (e.g., Hollin & Palmer, 2006), as well as between young adults and adults (e.g., Corrado & Mathesius, 2014). In their review, Hollin and Palmer (2006) point out that male and female offenders do not differ in types of dynamic risk factors per se, but rather show different levels in the same risk factors, in particular regarding substance use and relationship problems. Corrado and Mathesius (2014) discussed the age-based psychosocial, cognitive, and neurological development of decision-making in relation to the legal principle of criminal responsibility. They concluded that, while cognitive decision-making is fully developed by the age of 16, psychosocial capacities continue to develop into young adulthood (18 to 23/24 years). Compared to adults, young adults are more susceptible to influences from peers and display more impulsivity and risk-taking behaviors such as criminality. Therefore, in the Netherlands, young adult offenders can be sentenced within the juvenile justice jurisdiction. In the current study, forensic outpatients aged 18 to 23 years were classified as young adults.
The development of the FSI
Understanding the developmental history of an instrument is important in assessing its validity, and therefore, a brief overview is presented. The initial draft of the FSI was a 109-item instrument covering eight key areas of forensic problems: aggression, lack of social support, problematic substance use, lack of concentration, anger, poor self-regulation, impulsivity, and sexual problems. These so-called forensic symptoms were derived from an extensive literature review and relevant instruments such as the Barratt Impulsiveness Scale (Spinella, 2007), Outcome Questionnaire 45 (Lambert et al., 2004), and the Symptom Checklist (Derogatis, 1994). To maximize face and content validity, four forensic experts (two clinical psychologists and two behavioral therapists), with at least 10 years of experience in the forensic field, provided input in different phases of the developmental process. Two open-ended questions were added to the FSI as an opportunity for clients as end-users to provide input concerning the reading level and relevance of the items.
Data from a sample of 760 forensic outpatients were subjected to exploratory (oblique rotation) and confirmatory factor analyses, as well as a content analysis based on input from outpatients on the open-ended questions (Bisschop, 2014). The analyses resulted in the removal of items with cross loadings, low factor loadings (in general <.50), or inter-item correlations outside the accepted range (<.20 or >.70), as well as the creation of new items and the rewording of existing ones. A minimum of four or five items is required for a subscale to adequately assess the domain of interest (Hinkin & Schriesheim, 1989). The subsequent refinement process resulted in a final revised version of the FSI adults (FSI-R) consisting of 32 items, each construct containing four items. In the Appendix, the translated FSI-R items and their corresponding subscales are presented.
To ensure the translation validity, the procedure of forward- and back-translation as described by Beaton, Bombardier, Guillemin, and Ferraz (2000) was conducted by two bilinguals. The forward-and-back-translation method is a procedure for investigating the conceptual equivalence (i.e., symmetry) of the original and translated versions, necessary for valid cross-cultural comparisons. First, a bilingual translator translated the Dutch version of the FSI-R adults into English. Second, the English version was translated back into Dutch by another bilingual translator. Finally, the original source and the back-translated items were compared for non-equivalence of meaning and any discrepancies were noted. This iterative process of translation and back-translation was continued until no semantic differences were noticed between both questionnaire forms.
Measurement and structural invariance
Testing for measurement and structural invariance of instruments is not common practice in the forensic field. Invariance at the measurement level, at the very least, is critically important when comparing subgroups. If invariance is absent, observed differences in means or other statistics might reflect differences across subgroups in systematic response biases or a different understanding of the concept, rather than substantive differences per se (Byrne, Shavelson, & Muthen, 1989; Baumgartner & Steenkamp, 1998). Stated differently, in addition to representing a powerful test of the generalizability of a measurement model across samples and subpopulations, measurement invariance represents an important prerequisite to meaningful and unbiased between-group comparisons.
Current study
The aims of the present study were threefold. First, the FSI-R seven-factor model (Sexual Problems subscale excluded) was tested for both measurement and structural invariance across gender and age groups (young adults and adults) using structural equation modeling. The Sexual Problems subscale was excluded since there were no female sex offenders in the sample, and only 16 young adults entered treatment for having committed a sex offence. Second, the FSI-R eight-factor model (Sexual Problems subscale included) was tested in a subsample of male sex offenders. Third, latent mean differences were investigated between gender and age groups in different treatment phases. Measurement and structural invariance were expected across gender and age groups.
Methods
The study was prospectively conducted at De Waag, a Dutch outpatient treatment facility. De Waag offers mainly cognitive-behavioral interventions to juvenile and adult offenders who, owing to their offending behavior, have come (or are prone to come) into contact with the police or judicial authorities. Patients enter treatment on a voluntary or mandatory basis. Voluntary treatment indicates that the patient enters treatment on his own initiative or on referral from a general practitioner or another mental health care institute. Mandatory treatment means that treatment is imposed by a judge. In these cases, a probation officer fulfils the supervisory role.
Study protocol
In the ROM infrastructure, all outpatients referred to De Waag are routinely assessed with a number of Internet-based instruments, among which the FSI-R adults, at baseline during intake and, if treatment is initiated, repeatedly every four months during treatment. The FSI-R adults were included in the ROM-procedure from April 2014. All clients with a registered email address received a secure link to the FSI-R.
At intake, patients are informed by the therapist about the data collection during their treatment, as well as how these data will be used for scientific purposes and to match their treatment to their individual problems. Patients also receive a flyer detailing the data collection procedure and are asked to sign an informed consent letter if they agree to the use of their data for scientific research. Participation in the study was voluntary. The procedure falls within the Dutch Data Protection Act (Dutch DPA) and other specific Dutch healthcare laws, which provide legal provisions on how to deal with the privacy of personal information within the context of, among others, mental health services. The Dutch DPA also stipulates that all patients have the right to withdraw their previous consent at any time during and after treatment.
Sample
The initial sample comprised a total group of 856 outpatients who filled out the FSI-R, which made up about 60% of the adult offenders in treatment during the inclusion period (April to October 2015). Patients who did not complete the FSI-R did not have a registered email address or did not complete the FSI-R within the time window of four weeks.
Demographic and treatment-related sample characteristics differentiated by gender and age groups.
Most patients who were born abroad were born in Surinam, Dutch Antilles or Morocco.
Clinical disorders in this category are autism spectrum disorders, substance related disorders.
This category consists of offences such as fire setting and possession of weapons.
No age differences were found between male and female offenders, t(714) = .768, p = .443. Male offenders were in treatment for a longer time than their female counterparts, t(288.32) = 3.437, p < .001. Differences in percentages were significant for country of birth, χ2(1) = 7.51, p = .006, clinical disorders, χ2(5) = 25.93, p < .001, index offence, χ2(4) = 83.17, p < .001, and treatment context, χ2(1) = 7.63, p = .006: compared to their male counterparts, female offenders were more often born abroad, diagnosed with an impulse control disorder, referred to treatment for having committed a property offence, and treated on a mandatory basis. Differences in personality disorders failed to reach statistical significance, χ2(3) = 2.12, p = .548. As for the differences between young adults and adults, young adults were in treatment for a significantly shorter period of time than adults, t(354.221) = −3.553, p < .001. Differences in percentages were significant for clinical disorders, χ2(5) = 17.85, p = .003, personality disorders, χ2(3) = 12.13, p = .007, index offence, χ2(4) = 25.726, p < .001, and treatment context, χ2(1) = 11.173, p = .001: young adults were diagnosed less often than adults with a clinical or personality disorder. They also more often committed a property offence, and a higher percentage was in mandatory treatment. Differences in country of birth were not significant, χ2(1) = .040, p = .842.
Diagnoses
In January 2017, the fifth edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) was introduced in the Netherlands as the standard classification system for mental disorders by the Ministry of Health, Welfare, and Sport. Given the timeframe of the inclusion period (April to October 2015) of the data used in the present study, the DSM-IV (DSM-IV-TR; American Psychiatric Association, 2000) was used by a psychologist or psychiatrist at intake. The intake session lasts approximately 60 minutes and consists of a screening of issues relevant to outpatient forensic care, among which are the criminal history and index offence, family (situation), education, and work. The clinically assessed primary diagnosis is discussed and determined by a multidisciplinary team of a psychiatrist, psychotherapists, and psychologists. The primary diagnosis underlies the patient’s offence behavior.
Instruments
The FSI-R adults (van Horn, Hendriks, & Kraanen, 2015) consists of 32 items measuring the following eight domains, each consisting of four items: Aggression, Lack of Social Support (SOC), Problematic Substance Use, Lack of Concentration, Anger, Poor Self-regulation (SEL), Impulsivity, and Sexual Problems. The Aggression subscale refers to the incapacity to control anger, which leads to verbal and/or physical aggression towards others. Items in the SOC subscale express the lack of social support as experienced by the patient. The Problematic Substance Use subscale contains items referring to the urge to use alcohol and/or drugs and the problems associated with the patient’s problematic substance use. The Lack of Concentration subscale reflects the patient’s inability to concentrate, which can also lead to being easily distracted. Items in the SEL subscale reflect the cognitive and behavioral inability to reduce stress or tension under pressure. The Impulsivity subscale refers to a general predisposition toward rapid, unplanned emotional (e.g., diminished ability to delay gratification) or behavioral (e.g., acting on the spur of the moment) reactions to internal or external stimuli without thinking about the consequences (cf. Patton, Stanford, & Barratt, 1995). The Sexual Problems subscale refers to uncontrollable sexual thoughts, feelings, and/or behaviors such as watching pornographic material. This subscale has to be completed if one or more of the following situations apply: the patient committed a sexual offence, had a negative sexual experience in the past, and/or experiences current sexual problems. This procedure was applied because clinical experience shows that patients without a history of sexual problems refuse to answer any questions related to their sexual feelings and/or behavior.
The clients rated, on a 5-point Likert type scale, the degree to which they thought, felt or acted on the statement over the last two weeks. Answering categories were: 1 “(almost) never,” 2 “sometimes,” 3 “occasionally,” 4 “often,” and 5 “(almost) always.” High scores on the subscales indicate increased deficits in cognitive, behavioral, and/or emotional functioning. In the Appendix, the FSI-R items are listed, as well as the subscales they represent.
Analysis
SPSS and AMOS Version 20.0 were used to analyze the data. Normality tests showed that the skewness of the FSI-R items ranged from −0.09 to 2.64. The kurtosis values for almost all FSI-R items ranged from −1.27 to 2.88. According to Newsom (2015), skewness and kurtosis values between ±2 and ±3 can be considered moderately non-normal, and values above 3 seriously non-normal. Four items exceeded the kurtosis threshold of 3: item 1 (kurtosis = 6.92), item 9 (kurtosis = 7.53), item 18 (kurtosis = 4.60), and item 25 (kurtosis = 3.61). Items 1 and 9 are part of the Aggression subscale and the other two items are part of the Substance subscale. See the Appendix for the content of these items. At subscale level, the skewness and kurtosis values were all at or below the threshold of 3, with values ranging from −0.82 to 3.03. All items were transformed using a log10 plus 1 transformation to move the lowest score up to a minimum of 1. After this transformation, skewness levels ranged from −.61 to 1.96 and kurtosis levels ranged from −1.32 to 3.59. At subscale level, the skewness and kurtosis values were all substantially below the threshold of 3, with values ranging from −0.79 to 1.53.
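The screening step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the item responses are made up, the moment-based formulas omit the small-sample corrections that SPSS applies, and "log10 plus 1" is read here as log10(x) + 1, which maps the lowest response (1) to exactly 1. The text does not spell out the exact form that was used.

```python
import math

def central_moment(values, order):
    """Average of (x - mean) ** order over all observations."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)

def skewness(values):
    """Moment-based skewness m3 / m2**1.5 (no small-sample correction)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based excess kurtosis m4 / m2**2 - 3 (0 for a normal curve)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

def log10_plus_one(values):
    """Assumed reading of the transformation: log10(x) + 1."""
    return [math.log10(v) + 1.0 for v in values]

# Hypothetical right-skewed responses on the 1-5 answering scale
item = [1] * 14 + [2] * 3 + [3] * 2 + [5]
transformed = log10_plus_one(item)

raw_skew = skewness(item)
new_skew = skewness(transformed)
print(f"skewness before: {raw_skew:.2f}, after: {new_skew:.2f}")
print(f"excess kurtosis before: {excess_kurtosis(item):.2f}, "
      f"after: {excess_kurtosis(transformed):.2f}")
```

Because the logarithm compresses the upper end of the scale, the long right tail typical of rarely endorsed items is pulled in, which is why the transformed values sit closer to the thresholds mentioned above.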
Single-group and multi-group confirmatory factor analyses
Preliminary separate single-group CFA analyses were conducted, followed by Multi-Group Confirmatory Factor Analyses (MGCFA) (AMOS 20.0) to test for measurement and structural invariance across gender and age groups. The hypothesized model is presented in Figure 1.
Eight-factor FSI-R adults’ model.
Given that the distribution of most of the item scores did not diverge strongly from univariate normality, the models were fitted via the maximum likelihood estimation. The measurement and structural invariance tests were performed on seven of the eight FSI-R subscales in the gender and age subgroups. The Sexual Problems subscale was excluded, since no female offenders entered treatment for a sex offence and only 16 young adults entered treatment for having committed a sex offence. In a separate single group CFA, the factor structure of the eight subscales was performed in the male sex offender subgroup.
MGCFA is the most widely used method to test for measurement invariance (Billiet, 2002). This method permits testing for invariance easily by setting cross-group constraints and comparing more restricted with less restricted models (e.g., Byrne et al., 1989). Little (1997) identified two types of invariance: type 1 invariance has to do with the psychometric properties of the measurement scales (measurement invariance) and type 2 invariance deals with between-group differences in latent variables (structural invariance). Following Brown (2006), measurement invariance is examined for the following four MGCFA models:
Configural invariance (model 1)
In this baseline model, it is tested whether the proposed factor structure would be equal across the two groups. In other terms, the items should exhibit the same configuration of salient and non-salient factor loadings across different groups. Failure to demonstrate configural invariance (also called pattern invariance) indicates that different constructs were measured across groups.
Metric (weak) invariance (model 2)
Measurement weights (factor loadings) are constrained to be equal across groups in order to test whether respondents across groups attribute the same meaning to the latent construct under study. Metric invariance (also called factorial invariance) allows a meaningful comparison of relationships (unstandardized regression coefficients, covariances) between the latent construct and other concepts across groups (Baumgartner & Steenkamp, 1998).
Scalar (strong) invariance (model 3)
Both configural and metric invariance are tested by using information on the covariances between the items. They are not sufficient if the goal of the analysis is to compare means across groups. To justify comparing means, scalar invariance is necessary. In the scalar invariance model, the item intercepts are constrained to equality to test whether the meaning of the levels of the underlying items (intercepts) are equal in both groups.
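Why equal intercepts matter for mean comparisons can be shown with a small numerical sketch. It assumes the usual linear factor model, in which the expected score on an item equals its intercept plus its loading times the latent mean. All parameter values below are hypothetical and chosen only for illustration.

```python
def expected_item_mean(intercept, loading, latent_mean):
    """Model-implied item mean: intercept + loading * latent mean."""
    return intercept + loading * latent_mean

# Hypothetical values for a single item in two groups
loading = 0.8                          # equal across groups (metric invariance)
latent_ref, latent_cmp = 0.0, 0.5      # latent means; reference group fixed at 0

# Scalar invariance holds: both groups share the same intercept
diff_invariant = (expected_item_mean(2.0, loading, latent_cmp)
                  - expected_item_mean(2.0, loading, latent_ref))

# Scalar invariance violated: the comparison group has a higher intercept
diff_biased = (expected_item_mean(2.3, loading, latent_cmp)
               - expected_item_mean(2.0, loading, latent_ref))

print(round(diff_invariant, 3))  # loading * latent difference only
print(round(diff_biased, 3))     # same latent difference plus the intercept shift
```

With equal intercepts, the observed difference is the loading times the latent difference. With unequal intercepts, the same latent difference produces a larger observed gap, so a comparison of observed means would overstate the group difference.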
Error variance invariance (model 4)
In this so-called strict factorial model, equality constraints across groups are specified for measurement residuals. Testing for the equality of between-group residual variance determines if the scale items measure the latent constructs with the same degree of measurement error. It was recommended by Vandenberg and Lance (2000) that the evaluation of error variance invariance be left to the researcher’s discretion because, if scalar invariance holds, group difference in the residual variances is indicative of only the difference in reliabilities of the observed scores. Hence, group difference is compensated if comparison is to be made on the latent variable level. Following this rationale, significant improvement in fit is interpreted as difference in measurement reliability (i.e., random noise) rather than evidence of bias. The error invariance test should only proceed if (at least partial) metric and scalar invariance have been established first.
Next, structural invariance of the latent variables was tested (i.e., variances, covariances, and latent means). It should be noted that these structural parameters describe characteristics of the population from which the sample was drawn. Nonequivalence in structural parameters therefore does not represent a critique of the measures themselves, but rather reflects differences in the distribution of the underlying construct between the two groups (i.e., “true” substantive differences; Adamsons & Buehler, 2007). The following sequence of structural invariance models, regularly used in the literature (Vandenberg & Lance, 2000), was applied.
Factor variance invariance (model 5)
In this model, the equality of the factor variances is measured. Invariance of factor variance indicates that the range of scores on a latent factor does not vary across groups.
Factor covariance invariance (model 6)
In this model, the equality of the covariances between latent constructs is measured. If the covariances are invariant, the correlations between latent constructs are invariant across groups.
Assuming measurement and structural invariance, latent mean differences across gender were estimated, fixing the latent mean values to zero in females (reference group).
The process of MGCFA model fitting from steps 1–6 yielded a nested hierarchy of models in which each model contained all the constraints of the prior model, and thus, each was nested within the previous models.
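The nesting of the six models can be made explicit with a small data structure. The sketch below only restates the sequence described in this section. The names `MODEL_STEPS` and `cumulative_constraints` are illustrative and are not part of AMOS or any other package.

```python
# Each step adds one set of cross-group equality constraints to the previous model.
MODEL_STEPS = [
    ("Model 1: configural", None),
    ("Model 2: metric", "factor loadings"),
    ("Model 3: scalar", "item intercepts"),
    ("Model 4: error variance", "residual variances"),
    ("Model 5: factor variance", "factor variances"),
    ("Model 6: factor covariance", "factor covariances"),
]

def cumulative_constraints(steps):
    """Return, per model, every parameter set constrained equal across groups."""
    constrained = []
    result = {}
    for name, added in steps:
        if added is not None:
            constrained.append(added)
        result[name] = tuple(constrained)
    return result

constraints = cumulative_constraints(MODEL_STEPS)
for name, sets in constraints.items():
    print(name, "->", ", ".join(sets) if sets else "same factor pattern only")
```

Because every model carries all constraints of its predecessor, each adjacent pair can be compared with a chi-square difference test.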
Fit Indices
Previous investigators have recommended the use of multiple indices when evaluating the fit of a structural model (Hu & Bentler, 1995; Marsh, Hau, & Wen, 2004). Based on these authors and the review by Hooper, Coughlan, and Mullen (2008), a number of absolute and incremental fit indices were included to examine measurement invariance of the FSI-R adults across gender and age groups.
The following absolute fit indices were used to evaluate how well the a priori model reproduced the sample data: Chi-square (χ2), the Normed Chi-square (or the chi-square to degrees of freedom ratio, χ2/df), the Root Mean Square Error of Approximation (RMSEA), and the Standardized Root Mean Square Residual (SRMR). The likelihood ratio test (also called chi-square or χ2 test) has traditionally been used as a goodness-of-fit statistic in structural equation modeling. However, its sensitivity to sample size and its underlying assumption that the model fits the sample data perfectly have long been recognized as problematic (e.g., Bentler & Bonett, 1980). It has therefore been recommended that this statistic be used as a measure of fit rather than a test statistic. Two incremental fit indices were also presented, the Comparative Fit Index (CFI) and the Non-Normed Fit Index (NNFI, also known as the Tucker-Lewis Index or TLI). These relative fit indices do not use chi-square in its raw form but compare the chi-square value to that of a baseline model. For this baseline model, the null hypothesis is that all variables are uncorrelated (McDonald & Ho, 2002), that is, there are no latent variables. The CFI is a “normed” index with values ranging from 0 to 1.
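How the two incremental indices relate the tested model to the baseline model can be sketched as follows, using the standard textbook definitions of the CFI and the NNFI/TLI. The chi-square values in the example are hypothetical round numbers, not results from this study, because the baseline chi-square is not reported in the text.

```python
def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative Fit Index: 1 - misfit of the model relative to the baseline."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, 0.0)
    denominator = max(d_model, d_base)
    if denominator == 0.0:
        return 1.0  # neither model shows misfit beyond its degrees of freedom
    return 1.0 - d_model / denominator

def tli(chi2_model, df_model, chi2_base, df_base):
    """Non-Normed Fit Index (Tucker-Lewis Index); can fall outside 0-1."""
    ratio_base = chi2_base / df_base
    ratio_model = chi2_model / df_model
    return (ratio_base - ratio_model) / (ratio_base - 1.0)

# Hypothetical example: tested model versus independence (baseline) model
example_cfi = cfi(150.0, 100, 1100.0, 120)
example_tli = tli(150.0, 100, 1100.0, 120)
print(f"CFI = {example_cfi:.3f}, NNFI/TLI = {example_tli:.3f}")
```

The `max(..., 0.0)` terms are what keep the CFI within the 0 to 1 range, whereas the NNFI has no such bounds.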
With each more restrictive model, chi-square values were monitored to detect significant changes (chi-square difference test) with each added set of constraints. If the chi-square difference is significant, the construct (at least one of its items) is non-invariant. However, as noted previously, with large sample sizes, chi-square values will be inflated (statistically significant), thus erroneously implying a poor data-to-model fit (Schumacker & Lomax, 2004). As a remedy, Cheung and Rensvold (2002) proposed to rely on the difference in the comparative fit index (Bentler, 1990), ΔCFI, to judge the adequacy of invariance assumptions. Based on their simulation study, which analyzed the behavior of several fit indexes, they found that the ΔCFI was the only fit index that was not correlated with the overall CFI value of the former model. They proposed to reject the invariance hypothesis when there is a decrease of .01 or larger in CFI (ΔCFI ≤ −.01). The ΔCFI was calculated by subtracting the CFI of the less constrained model from that of the more constrained model, so that a loss of fit yields a negative value.
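The two decision rules can be put side by side in a short sketch. It is not the AMOS implementation. `chi2_sf` is a hand-written closed-form upper-tail probability for integer degrees of freedom, used only to keep the example free of external dependencies. ΔCFI is taken as the CFI of the more constrained model minus that of the less constrained model, so that a loss of fit is negative, in line with the ΔCFI ≤ −.01 rule. The input values are the ones reported in the Results for the step from scalar to error variance invariance across gender.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2.0) / k
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: normal tail plus a finite series
    tail = math.erfc(math.sqrt(x / 2.0))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return tail + total

def invariance_decision(delta_chi2, delta_df, delta_cfi,
                        alpha=0.05, cfi_cutoff=-0.01):
    """Combine the chi-square difference test with the delta-CFI rule."""
    p_value = chi2_sf(delta_chi2, delta_df)
    return {
        "p_value": p_value,
        "chi2_significant": p_value < alpha,
        "invariance_retained": delta_cfi > cfi_cutoff,
    }

# Scalar -> error variance step across gender, as reported in the Results
step = invariance_decision(delta_chi2=111.407, delta_df=28, delta_cfi=-0.008)
print(step)
```

In this example the chi-square difference is significant, yet the drop in CFI stays smaller than .01, which is the situation in which the ΔCFI rule retains the invariance hypothesis despite a significant chi-square test.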
Thresholds for fit indices
A number of statistical cutoffs for the fit indices have been suggested to discriminate between poor and optimal model fit. Although there is no consensus regarding an acceptable χ2/df ratio (CMIN/DF), recommendations range from smaller than 5.0 (Wheaton, Muthen, Alwin, & Summers, 1977) to preferably around 2.0 (Tabachnick & Fidell, 2007). Optimal chi-square values are non-significant, but significant values do not necessarily indicate poor model fit. Chi-square provides an excellent index of model modification improvement. However, given the sensitivity of the chi-square statistic to sample size, its role in CFA testing for model data fit is more descriptive than inferential. Using chi-square as the sole measure of model acceptability may increase the risk of Type II errors (i.e., the probability of accepting the null hypothesis when it is false) (Schumacker & Lomax, 2004). For the additional absolute fit indices, a value ≤.06 is needed for the RMSEA index and for SRMR, a value close to .08 is considered best (Hu & Bentler, 1999). RMSEA values in the range of .08 to .10 indicate mediocre fit and above .10 poor fit (Browne & Cudeck, 1993). Kline (2010) suggests that for the incremental fit indices CFI and NNFI, values above .90 are adequate, although values above .95 are more desirable (Hu & Bentler, 1999).
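Two of these indices follow directly from the chi-square, the degrees of freedom, and the sample size, so they can be recomputed as a check. The sketch below uses the common single-group RMSEA formula with N − 1 in the denominator, which is an assumption about how the reported values were obtained. The inputs are the single-group values reported in the Results for male (N = 603) and female (N = 113) outpatients.

```python
import math

def normed_chi2(chi2, df):
    """Chi-square to degrees-of-freedom ratio (CMIN/DF)."""
    return chi2 / df

def rmsea(chi2, df, n):
    """Single-group RMSEA: sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def meets_rmsea_cutoff(value, cutoff=0.06):
    """True when RMSEA is at or below the Hu and Bentler (1999) cutoff."""
    return value <= cutoff

male_rmsea = rmsea(974.680, 329, 603)
female_rmsea = rmsea(605.526, 329, 113)
print(f"males:   chi2/df = {normed_chi2(974.680, 329):.3f}, RMSEA = {male_rmsea:.3f}")
print(f"females: chi2/df = {normed_chi2(605.526, 329):.3f}, RMSEA = {female_rmsea:.3f}")
```

Under these assumptions the formula reproduces the reported RMSEA values of .057 and .087 to three decimals. It also shows how strongly the index depends on sample size: the female model has the lower χ2/df ratio but the higher RMSEA.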
Cronbach’s alpha coefficients and Pearson correlation coefficients were calculated for the FSI-R subscales. Cronbach’s alpha coefficients >.70 can be considered acceptable and coefficients >.80 good (Nunnally & Bernstein, 1994). Following the guidelines provided by Cohen (1988), the strength of the correlation is interpreted as follows: r ≥ .10 = weak, r ≥ .30 = moderate, r ≥ .50 = strong. Although there is no unambiguous threshold above which it can be concluded that the subscales probably measure the same underlying dimension, in the present study, the threshold was set at 0.80 (Field, 2013), which corresponds with 64% of shared variance.
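The reliability and correlation rules above can be illustrated with a minimal sketch. The response data are hypothetical (six respondents, four items), the alpha formula is the standard one based on item and total-score variances, and the label for correlations below .10 is an addition made here for completeness.

```python
def sample_variance(values):
    """Unbiased sample variance (n - 1 in the denominator)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def cronbach_alpha(items):
    """items: one list of scores per item, all in the same respondent order."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(sample_variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / sample_variance(totals))

def correlation_strength(r):
    """Cohen (1988) labels as used in the text."""
    size = abs(r)
    if size >= 0.50:
        return "strong"
    if size >= 0.30:
        return "moderate"
    if size >= 0.10:
        return "weak"
    return "below the weak threshold"  # label assumed, not from the text

# Hypothetical four-item subscale answered by six respondents
subscale_items = [
    [1, 2, 3, 4, 5, 3],
    [2, 2, 3, 4, 5, 3],
    [1, 3, 3, 5, 4, 2],
    [1, 2, 4, 4, 5, 3],
]
alpha = cronbach_alpha(subscale_items)
shared_variance = 0.80 ** 2  # squared correlation at the overlap threshold

print(f"alpha = {alpha:.2f}")
print(f"shared variance at r = .80: {shared_variance:.0%}")
print(correlation_strength(0.55))
```

Squaring the correlation gives the proportion of variance two subscales share, which is where the 64% figure for the .80 threshold comes from.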
The Bollen-Stine bootstrap technique (95% CI) was used to analyze the latent mean differences between the gender and age subgroups in different treatment phases. This technique involves resampling from the empirical distribution, in which multiple subsamples of the same size are drawn at random with replacement (Good, 1999). The number of bootstrap samples was set at 1000, as recommended by Cheung and Lau (2008), to obtain stable probability estimates.
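The resampling idea can be illustrated with a plain percentile bootstrap for a difference between two observed group means. This is a deliberately simplified stand-in. It is not the Bollen-Stine procedure, which first transforms the data so that the model holds exactly, and it applies no bias correction. The scores and the function name `bootstrap_mean_difference` are hypothetical.

```python
import random

def bootstrap_mean_difference(group_a, group_b, n_boot=1000, seed=0):
    """Percentile 95% CI for mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    differences = []
    for _ in range(n_boot):
        # Draw samples of the original size, with replacement, within each group
        resample_a = [rng.choice(group_a) for _ in group_a]
        resample_b = [rng.choice(group_b) for _ in group_b]
        differences.append(sum(resample_a) / len(resample_a)
                           - sum(resample_b) / len(resample_b))
    differences.sort()
    lower = differences[int(0.025 * n_boot)]
    upper = differences[int(0.975 * n_boot) - 1]
    return lower, upper

# Hypothetical subscale scores for two groups
group_a = [2.8, 3.1, 3.4, 2.9, 3.3, 3.0, 3.2, 2.7]
group_b = [2.0, 2.3, 1.9, 2.2, 2.1, 1.8, 2.4, 2.0]
ci_lower, ci_upper = bootstrap_mean_difference(group_a, group_b)
print(f"95% CI for the mean difference: [{ci_lower:.2f}, {ci_upper:.2f}]")
```

An interval that excludes zero, as it does for these clearly separated toy groups, is read as a significant difference at the 5% level.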
Results
The seven-factor configural FSI-R model (the subscale Sexual Problems was excluded) was tested separately for males, females, young adults, and adults. Results showed a good model fit for male offenders (χ2 = 974.680; df = 329; χ2/df = 2.963; RMSEA = .057, 90%CI = .053 – .061; SRMR = .0523; CFI = .928; NNFI = .918; PNFI = .780) and an adequate fit for female offenders (χ2 = 605.526; df = 329; χ2/df = 1.841; RMSEA = .087, 90%CI = .076 – .10; SRMR = .0818; CFI = .854; NNFI = .832; PNFI = .638). As for age groups, the configural seven-factor model pointed towards a good model fit for adults (χ2 = 966.638; df = 329; χ2/df = 2.938; RMSEA = .056, 90%CI = .052 – .060; SRMR = .0519; CFI = .929; NNFI = .919; PNFI = .781) and an adequate fit for young adults (χ2 = 974.680; df = 329; χ2/df = 1.579; RMSEA = .076, 90%CI = .063 – .088; SRMR = .0755; CFI = .893; NNFI = .878; PNFI = .662). These configural models provided an overall support for the seven-factor model in the gender and age groups.
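The degrees of freedom reported for these models can be reproduced by counting parameters. The sketch assumes a simple-structure CFA fitted to the covariance matrix only, with one marker loading fixed per factor, freely correlated factors, and no correlated residuals. These identification choices are assumptions, since the text does not list them.

```python
def cfa_degrees_of_freedom(n_factors, items_per_factor):
    """df = unique variances/covariances minus freely estimated parameters."""
    n_items = n_factors * items_per_factor
    moments = n_items * (n_items + 1) // 2           # unique (co)variances
    free_loadings = n_items - n_factors              # one marker item per factor
    residual_variances = n_items
    factor_variances = n_factors
    factor_covariances = n_factors * (n_factors - 1) // 2
    free_parameters = (free_loadings + residual_variances
                       + factor_variances + factor_covariances)
    return moments - free_parameters

def metric_constraints(n_factors, items_per_factor):
    """Loadings set equal across two groups in the metric invariance step."""
    return n_factors * items_per_factor - n_factors

seven_factor_df = cfa_degrees_of_freedom(7, 4)
eight_factor_df = cfa_degrees_of_freedom(8, 4)
print(seven_factor_df, eight_factor_df, metric_constraints(7, 4))
```

Under these assumptions the count matches the df of 329 reported here for the seven-factor model, the df of 436 reported for the eight-factor model in the male sex offender subsample, and the 21 degrees of freedom of the metric invariance test across gender.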
The second step was to move from single-group CFAs to MGCFAs to cross-validate the seven-factor model in the gender and age group samples. A series of hierarchical measurement models nested within the baseline configural model 1 were used to test for gender and age group measurement invariance.
Measurement and structural invariance across gender
FSI-R adults: model fit indices for measurement invariance across gender (Nmale = 603, Nfemale = 113).
Model 1 (configural model) functioned as the baseline model on which further equality constraints were imposed. This model attained adequate fit to the data, with values well within the thresholds, indicating that males and females shared the same FSI-R adults underlying factor pattern and that corresponding subtests loaded on the same factors.
The metric invariance model (Model 2) tested the invariance of factor loadings across gender by placing equality constraints on these parameters. This measurement weights model was nested within the unconstrained configural model (Model 1). The chi-square did not increase significantly, Δχ2(21) = 31.734, p = .062, indicating an adequate fit of the metric invariance model. This was also supported by the ΔCFI: the difference in the comparative fit index did not exceed the cut-off value of .01. Additionally, the other fit indices (i.e., χ2/df, RMSEA, CFI, and NNFI) were well within the threshold limits. Consequently, the factor loadings can be considered equal across gender groups.
Model 3 (full scalar invariance) is further nested within the metric invariance model (Model 2), that is, the intercepts were also constrained to be equal among males and females. The chi-square increased significantly, Δχ2(28) = 51.756, p = .004, indicating a deterioration of model fit. However, as noted earlier, the chi-square value may be inflated due to the large sample size employed. Other fit indices were within the specified criteria. Hence, the full scalar model can be considered invariant across gender groups.
The step from model 3 (scalar invariance) to model 4 (error variance invariance) resulted in a significant chi-square difference, Δχ2(28) = 111.407, p < .001. However, based on the ΔCFI difference, which did not exceed the threshold of .01 (ΔCFI = −.008), it can be concluded that error variance invariance can be assumed across gender groups.
Next, the tenability of structural invariance models was examined, in which factor variances and factor covariances were constrained to equality across the two gender groups. Model 5 resulted in a significant improvement in almost all (changes in) fit indices as compared to those generated for Model 4. The addition of the equality of the covariances between latent constructs (Model 6) resulted in a slightly better fit compared to the previous model (Model 5). In Table 7, the standardized FSI-R adults item factor loadings, based on Model 4 (error variance invariance), are presented for the gender subgroups.
Latent means for males and females in different treatment phases
FSI-R adults: bias-corrected means in different treatment phases for males, with females as the reference group.
Measurement and structural invariance across age groups
FSI-R adults: model fit indices for measurement invariance across age groups (Nyoung adults = 102, Nadults = 614).
Latent means for young adults and adults in different treatment phases
FSI-R adults: bias-corrected means in different treatment phases for young adults, with adults as the reference group.
Latent correlations and reliabilities
FSI-R adults: latent correlations and reliabilities based on the total sample (N=716).
Note: AGR: aggression; SOC: lack of social support; SUB: problematic substance use; CON: concentration; ANG: anger; SEL: poor self-regulation; IMP: Impulsivity; SEX: sexual problems. Cronbach’s α reliability coefficients are reported in the diagonal; the correlation and reliability coefficients of the Sexual problems scale are based on the male sex offender subsample (n = 126).
FSI-R adults: standardized item factor loadings for the gender, age, and male sex offender groups based on model 4 (error variance invariance).
Note: AGR: aggression; SOC: lack of social support; SUB: problematic substance use; CON: concentration; ANG: anger; SEL: poor self-regulation; IMP: Impulsivity; SEX: sexual problems.
Most correlation coefficients between the FSI-R subscales were weak or moderate, except for those of the Impulsivity subscale, which showed strong relations with the subscales Aggression, Substance use, Concentration, and Anger. Even these coefficients were below the threshold of .80, above which considerable conceptual overlap can be assumed.
Single CFA in the male sample of sex offenders using eight latent factors
In a single CFA, the eight-factor structure of the FSI-R adults was tested in a sample of 113 men (13 cases were missing) who had committed a sexual offence. Overall, the results supported the fit of the model in the male sex offender subsample (χ2 = 652.383; df = 436; χ2/df = 1.496; RMSEA = .067, 90% CI = .056 – .077; SRMR = .079; CFI = .866; NNFI = .848). The incremental fit indices (CFI and NNFI), however, were just below the threshold of adequate fit. Table 7 presents the standardized FSI-R adults item factor loadings, based on model 4 (error variance invariance), for the male sex offender group.
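Two of the reported indices follow directly from the chi-square statistic and can be checked by hand. The sketch below assumes the common sample-based RMSEA point estimate with N − 1 in the denominator (some programs use N instead); the function names are illustrative, and the input values are those reported for the male sex offender subsample.

```python
import math


def normed_chi_square(chi2, df):
    """Chi-square divided by its degrees of freedom."""
    return chi2 / df


def rmsea(chi2, df, n):
    """RMSEA point estimate, assuming the (N - 1) denominator convention."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


# Values reported for the male sex offender subsample.
chi2, df, n = 652.383, 436, 113

print(round(normed_chi_square(chi2, df), 3))  # 1.496
print(round(rmsea(chi2, df, n), 3))           # 0.067
```

Both results agree with the reported figures (χ2/df = 1.496, RMSEA = .067), which is consistent with this convention having been used.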
Discussion
The present study tested gender and age measurement and structural invariance of the FSI-R adults through CFA methods. The FSI-R subscales (aggression, lack of social support, problematic substance use, lack of concentration, anger, poor self-regulation, impulsivity, and sexual problems) adequately represent forensic outpatients’ response patterns in the overall sample, as well as when the sample is broken down into gender and age subgroups. In addition, the fit of the FSI-R eight-factor model (subscale Sexual problems included) was supported in a male sex offender subsample. These findings are important because measurement and structural invariance are critical for unbiased between-group comparisons (Byrne et al., 1989). Results from the correlational subscale analysis (r < .80 threshold) indicated that the FSI-R subscales can be regarded as distinct but related aspects of the cognitive, emotional, and behavioral deficits seen in particular in forensic patients (Haden & Shiva, 2009; Smallbone & Milne, 2000; Stinson et al., 2011). The strong association of impulsivity with aggression, problematic substance use, lack of concentration, and anger found in the present study has been well established in previous research (e.g., Coccaro, Lee, & McCloskey, 2014; Davidson, Putnam, & Larson, 2000; Ramirez & Andreu, 2006; Verdejo-Garcia, Lawrence, & Clark, 2008). The reliability of the FSI-R subscales was supported by good internal consistency coefficients for six of the eight FSI-R scales (α > .80). The reliability of two scales (Self-regulation and Impulsivity) fell within a few points of the .80 threshold outlined by Nunnally and Bernstein (1994). These outcomes suggest homogeneity of the scale items.
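For reference, the internal consistency coefficient mentioned above (Cronbach's α) can be computed from item-level scores as follows. This is an illustrative sketch only: the function name and the toy responses are hypothetical and are not FSI-R data.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])                       # number of items in the scale
    items = list(zip(*rows))               # transpose: one tuple of scores per item
    item_var_sum = sum(variance(scores) for scores in items)
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical toy data: 3 respondents x 2 items (not taken from the study).
toy = [[1, 2], [2, 1], [3, 3]]
print(round(cronbach_alpha(toy), 3))  # 0.667
```

The coefficient rises as the items covary more strongly relative to their individual variances, which is why values above .80 are read as a sign of homogeneous scale items.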
Having established the measurement and structural invariance of the FSI-R in both gender and age subgroups, between-group comparisons were made. Results pointed towards differences on the FSI-R adults subscales between male and female offenders, both among those at the start of their treatment and among those already in treatment. A consistent difference in both treatment phases was found for problematic substance use, with male offenders showing more substance-related problems. This finding aligns with previous research in which males were found to have higher rates of alcohol and/or drug use and to be more likely to develop dependence or abuse (e.g., Fattore, Melis, Fadda, & Fratta, 2014). Among patients who were in treatment, higher aggression scores were found in males compared to females. These differences were not found in the group that had just started treatment. However, because different outpatient subgroups were analyzed in the two treatment phases, it cannot be concluded that male offenders did not profit from treatment as much as females did. The same applies to the results of the age group comparison (young adults vs. adults). At treatment start, findings pointed towards significantly poorer self-regulation among young adults, whereas adults reported less social support. These differences were not found between the age groups who were in treatment.
Study limitations and future directions
The results of our study should be interpreted in light of the following limitations. First, the findings are specific to a Dutch forensic outpatient sample, which raises questions about how well the sample represents all forensic outpatients, and thus about the generalizability of the results to other areas of the Netherlands and across its borders. Future studies should, therefore, use samples from other Dutch forensic outpatient facilities, as well as from other countries, to provide empirical support for the FSI-R adults and to support the generalizability of the current findings at national and international level. The same applies to testing the measurement and structural invariance of the FSI-R adults across ethno-racial subgroups. Given the relatively small percentages (<5% in most ethnic groups) of patients from different countries, the measurement and structural invariance could not be tested between ethnic subgroups. Comşa (2010) has stressed that it is important to at least demonstrate cross-cultural measurement invariance prior to making between-group comparisons.
Furthermore, analyses were conducted on adult offenders who consented to the use of their data for scientific purposes. This could have led to the loss of non-consenting patients with higher scores on specific forensic problems, resulting in non-significant differences between gender and age groups. Results from a very limited number of studies conducted on non-consenters within the forensic population led to the preliminary conclusions that patient diagnosis is unrelated to the ability to decide whether or not to participate in research (McDermott, Gerbasi, Quanbeck, & Scott, 2005) and that perceived control over the participation procedure plays a central role in the decision to consent to participation (Edens, Epstein, Stiles, & Poythress, 2011). The characteristics of these decision-making processes could be an interesting research topic, both from a ROM perspective in general and with regard to bias in FSI-R research in particular.
Since the FSI-R was developed to measure emotional, cognitive, and behavioral deficiencies in forensic outpatients, its convergent and discriminant validity (as part of construct validity) have to be confirmed by correlating its subscales with standardized and validated instruments, preferably by using the Multitrait-Multimethod Matrix developed by Campbell and Fiske (1959). In the forensic field, clinician-rated instruments such as risk assessments are eminently suitable for these validation procedures. These instruments have been the focus of empirical forensic research for quite some time (e.g., Singh & Fazel, 2010) and have demonstrated moderate to good predictive validity, with an overall better performance of actuarial risk assessment instruments (e.g., Luallen, Radakrishnan, & Rhodes, 2016). Finally, the FSI-R adults has to be studied with regard to its sensitivity to changes in emotional, cognitive, and behavioral deficiencies, as well as its predictive validity with regard to recidivism.
To sum up, the present study yielded strong support for the invariant factor structure of the FSI-R adults across gender and age groups. However, future research needs to focus on the relevance of the FSI-R from a clinical point of view. Stated differently, the question that needs to be answered is whether changes in the FSI-R subscale scores relate to changes in the risk of relapse into offending behavior. Prospective longitudinal research and the collection of judicial information are needed to answer this question.
General conclusion and implication for clinical practice
In accordance with the foregoing, the clinical value of the FSI-R has yet to be established. However, the results seem to indicate that differences in forensic complaints emerge across gender and age groups, particularly regarding substance use related problems (more present in male offenders) and inadequate self-regulation skills (more present in young adults). Substance use problems were higher in male offenders both at the start of treatment and during treatment, indicating that these problems might take longer to decrease than, for instance, the inadequate self-regulation problems reported by young adults (as opposed to adults) and the lack of social support experienced by adults (as opposed to young adults). The latter two types of problems were reported among offenders at the start of treatment and were not present in the group that was in treatment. Obviously, the comparison between two groups in different treatment phases limits the conclusions that can be drawn from these findings. Nevertheless, it can be of clinical use to take notice of possible sustained substance use problems during treatment, in particular among men.
Acknowledgments
The author would like to express her appreciation to the following persons who have contributed to the development of earlier versions of the FSI adults and/or have commented on earlier versions of this article: M. Bisschop, L. Hoogsteder, N. Boswinkel, P. Emmelkamp, H. Hendriks, and G. J. Stams.
