Abstract
Even though risk assessments are routinely conducted in the criminal justice system to inform sentencing and case management, their cross-cultural applicability remains contested. This study investigated the generalizability of the Youth Level of Service/Case Management Inventory (YLS/CMI), a widely implemented youth forensic risk assessment instrument, using an Item Response Theory framework, in a sample of Indigenous (n = 205) and non-Indigenous (n = 193) youth. Differential item functioning analyses demonstrated similar discrimination across groups. However, despite similar latent risk levels, non-Indigenous youth were more likely to have items from the Education domain endorsed, while Indigenous youth were more likely to have items from the Substance Abuse domain endorsed. Predictive accuracy analyses indicated that total YLS/CMI scores significantly predicted general recidivism (without administration of justice convictions) for non-Indigenous youth, but not for Indigenous youth. There is an urgent need for more research investigating the applicability of the YLS/CMI to diverse groups of Indigenous youth.
Introduction
Indigenous people are overrepresented in criminal justice systems around the world, including Australia (Krieg, 2006), Canada (La Prairie, 2002), and the United States (United States Sentencing Commission, 2013). This pattern of overrepresentation is mirrored in the youth justice system in Canada (Malakieh, 2018). Some of the causal factors have been identified as a history of colonization and its resulting fragmentation of the Indigenous family, community, economic, and political structures (Scrim, 2010). In Canada, the overrepresentation of Indigenous youth in the justice system is tied to the legacy of residential schools and the ensuing disconnect between generations of youth and their communities. Many Indigenous youth struggle with addiction, family violence, mental illness, and parental involvement in the justice system, which place them at a greater risk for criminal justice involvement (Truth and Reconciliation Commission of Canada, 2015).
The overrepresentation of Indigenous people in the criminal justice system and their associated complex risk factors rooted in a history of colonialism raise important questions about the cross-cultural applicability of forensic practices, such as the use of forensic risk assessment tools (Wormith et al., 2015). For example, are the risk assessment instruments capable of identifying the unique risks and needs of Indigenous individuals? Over the past few decades, structured risk assessments have increasingly been used in place of unstructured clinical judgments to predict justice-involved individuals’ risk of recidivism (Grove & Meehl, 1996; Hanson, 2005). Risk assessment results are used to inform individuals’ progression throughout the criminal justice system, from sentencing, to case management in community supervision, to treatment planning (Gutierrez et al., 2016). Given the repercussions of the risk assessment results, it is imperative that risk assessment instruments be empirically validated, including establishing their degree of measurement invariance (Wormith et al., 2015). In an instrument where measurement invariance across particular groups (e.g., gender) has been established, group differences in scores can be attributed to differences in the construct the instrument was designed to measure, rather than measurement bias (Osterlind & Everson, 2009). The importance of empirical validation is highlighted in the Ewert v. Canada (2015) decision, in which a Canadian court cautioned against the use of five risk assessment instruments (e.g., the Violence Risk Appraisal Guide) with Indigenous justice-involved individuals on the grounds that the research supporting the psychometric properties of these five instruments on Indigenous individuals was inadequate.
The Ewert v. Canada decision highlighted a pressing need to assess the cross-cultural applicability of risk assessment instruments. A widely implemented youth forensic risk assessment instrument in North America and beyond is the Youth Level of Service/Case Management Inventory (YLS/CMI; Hoge & Andrews, 2002, 2011) derived from the Risk–Need–Responsivity (RNR) framework (Andrews et al., 1990). The RNR framework outlines a systematic and empirically based approach for evaluating an individual’s risk of recidivism. Central to the framework are the risk principle which states that the intensity of intervention should increase with risk of recidivism, the need principle which states that interventions should target an individual’s criminogenic needs (i.e., risk factors that are strongly and directly related to criminal behavior), and the responsivity principle which states that services should be delivered using evidence-based programming (general responsivity) and tailored to individuals’ personal characteristics and circumstances (specific responsivity; Andrews & Bonta, 2010; Andrews et al., 1990). Considerable research supports the utility of targeting criminogenic needs in reducing reoffending (Andrews & Bonta, 2010). However, the research on the generalizability of the RNR-based risk assessment tools to Indigenous youth is limited.
Existing research on the YLS/CMI reveals that the total risk score significantly predicts recidivism for Indigenous and non-Indigenous youth (Jung & Rawana, 1999; Luong & Wormith, 2011; Olver et al., 2012; Shepherd et al., 2015; Thompson & McGrath, 2012). However, some studies have demonstrated lower AUC (area under the curve) values for Indigenous youth compared with non-Indigenous youth (Luong & Wormith, 2011; Thompson & McGrath, 2012), suggesting that total scores may be differentially predictive across groups. Group differences also exist on the YLS/CMI; compared with non-Indigenous youth, Indigenous youth are consistently rated at higher risk in terms of total scores (Jung & Rawana, 1999; Luong & Wormith, 2011; Olver et al., 2012; Shepherd et al., 2015; Thompson & McGrath, 2012), as well as in the domains of Criminal History, Education/Employment (Olver et al., 2012; Thompson & McGrath, 2012), Peer Relations, Leisure/Recreation (Lockwood et al., 2018; Olver et al., 2012), and Substance Abuse (Jung & Rawana, 1999; Shepherd et al., 2015). Although differences in risk scores do not preclude measurement invariance across groups, it remains unclear whether the heightened total and domain scores in Indigenous individuals represent an increased prevalence of certain risk factors or entirely different relationships between the risk factors and recidivism compared with non-Indigenous individuals. Some have argued that the heightened risk scores for Indigenous individuals reflect the justice system’s tendency to “responsibilize” individuals for systemic social disadvantages, instead of true differences in risk (Hannah-Moffat, 2016). Furthermore, given that the YLS/CMI was normed mainly on Caucasian youth (Hoge & Andrews, 2011), its use with Indigenous individuals needs to be thoroughly investigated.
The current body of literature on this issue is small and has focused mainly on examining the relationship of YLS/CMI total risk scores to recidivism. However, individual items provide important information about an individual’s profile and potential treatment targets, so examining the functioning of a test at the item level is a vital part of validation. Furthermore, understanding whether items function similarly across subgroups of a population is especially important in establishing an instrument’s degree of measurement invariance. To this end, differential item functioning (DIF) analyses are a necessary aspect of instrument validation. DIF occurs when an item functions differently across subgroups even though members of those subgroups have the same level of the latent construct of interest (Osterlind & Everson, 2009). This means that the groups may not be meaningfully compared on the item (Furr, 2013). Analysis of DIF provides valuable information about subgroups within the population of interest; unbiased items indicate commonalities, and DIF items indicate potential measurement bias or components of the latent trait that are unique to each subgroup (Osterlind & Everson, 2009). DIF analyses can be conducted within the Item Response Theory (IRT) framework, which focuses on examining individual items to arrive at a holistic evaluation of the instrument (Embretson & Reise, 2000; Furr, 2013; Nunnally & Bernstein, 1994). Unidimensional IRT models assume that there is a single latent variable affecting responses to the items in an instrument.
To our knowledge, one study has examined the measurement invariance of the YLS/CMI items using DIF analyses across gender (male and female youth) and race (White and Black youth; Huang, 2019). The findings indicated that several items from the Education, Family, and Personality domains differed between subgroups but they demonstrated comparable sensitivity to changes across latent risk levels regardless of group membership. To date, no study has examined DIF across Indigenous and non-Indigenous youth on the YLS/CMI. From the extant literature, it is unclear whether the YLS/CMI has equal predictive validity for Indigenous and non-Indigenous youth. If a predictive gap indeed exists, this may signal differential predictive accuracy, or test bias, across groups. Furthermore, it is unclear whether the heightened total risk and domain scores represent differential functioning of the risk factors across groups or merely differences in frequency of endorsement. Therefore, in this study, we examined whether DIF exists in YLS/CMI items across a sample of Indigenous and non-Indigenous justice-involved youth.
Method
Participants
The sample consisted of 398 justice-involved youths from Ontario, Canada, who underwent risk assessments associated with probation sentences between 2007 and 2014. The program effectiveness branch of the ministry responsible for the supervision of youth on probation generated a sample of 205 Indigenous youths and 193 non-Indigenous youths. The two groups were individually matched (using case control) on gender (170 male and 35 female Indigenous youth; 160 male and 35 female non-Indigenous youths), age at time of risk assessment (M = 15.09 years, SD = 1.37), and on category (of 26 possible codes) of most serious offense associated with the assessment. Index offense was subsequently collapsed into three categories: violent nonsexual (e.g., robbery, assault), nonviolent (e.g., theft, drug-related), and sexual (e.g., sexual assault, invitation to touching; see Table 1). By design, cases were not matched on risk.
Table 1. Demographic Information, Domain, and Total Risk Scores of Indigenous and Non-Indigenous Youth
Note. The numbers following the risk categories are the ranges of total RNA scores used to classify each category. The numbers following the total risk score and domains are the number of items within the instrument and the domains. Bolded t and chi-square values are significant at the .05 level. Phi (φ) value of .1 is considered a small effect, .3 a medium effect, and .5 a large effect. Cohen’s d value of .2 is considered a small effect, .5 a medium effect, and .8 a large effect (Cohen, 1988).
Measures
YLS/CMI
In Ontario, the RNR framework guides the assessment and case management of justice-involved youth (Ontario Ministry of Children and Youth Services, 2006). Each youth is assessed by probation officers, who are trained in RNR principles and in the administration, scoring, and interpretation of the YLS/CMI (Hoge & Andrews, 2002, 2011). The YLS/CMI comprises 42 binary items in eight need domains that reflect the major risk factors for reoffending identified in the RNR literature (Andrews & Bonta, 2010). These domains include Criminal History, Family Circumstances, Education/Employment, Peer Relations, Substance Abuse, Leisure/Recreation, Personality, and Attitudes. Items in each domain are summed into domain scores, and an overall risk score is determined by summing all items. In practice, total scores are also used to categorize youth as low, moderate, high, or very high risk to reoffend through the application of cutoff scores. The YLS/CMI possesses moderate to strong internal consistency for most subscales (Catchpole & Gretton, 2003; Schmidt et al., 2005), and moderate to strong concurrent validity with broad and narrow-band scores on the Child Behavior Checklist (Schmidt et al., 2005) and Structured Assessment of Violence Risk in Youth (Catchpole & Gretton, 2003). For copyright reasons, the Ontario ministry calls the YLS/CMI the Ministry Risk/Need Assessment (RNA) Form and it is referred to as such hereafter. For each youth, we used scores from the first risk assessment in the dataset in an effort to ensure that the two groups were as similar as possible with respect to the point in their justice system involvement trajectories.
Recidivism
Recidivism was defined as whether a youth was convicted of one or more offenses within a 3-year follow-up period from the date of their first RNA on file and was obtained from a national police criminal record database. The types of recidivism offenses included violent nonsexual (e.g., robbery, assault), nonviolent (e.g., theft, drug-related), sexual (e.g., sexual assault, invitation to touching), and administration of justice (e.g., failure to comply).
Analysis
Group Comparisons on Risk and Reoffense and Confirmatory Factor Analysis of the RNA
Prior to DIF analysis, total RNA and domain scores were compared across Indigenous and non-Indigenous youth and genders using independent sample t tests. The total and domain scores served as continuous dependent variables, and the grouping variable with two categories was used as an independent variable. We further examined the association between the groups and recidivism rates using chi-square analyses. In addition, receiver operating characteristic (ROC) analyses were conducted to test the RNA total score’s predictive accuracy for the total sample and the two groups across recidivism outcomes. ROC analyses generate an AUC statistic, which represents the likelihood that a randomly selected recidivist will have a higher score than a randomly selected nonrecidivist.
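The probabilistic reading of the AUC given above can be illustrated with a minimal sketch. The scores are invented for illustration and are not study data, the function name `auc_by_pairs` is hypothetical, and counting ties as one half is an assumed (though common) convention.

```python
from itertools import product


def auc_by_pairs(recidivist_scores, nonrecidivist_scores):
    """AUC as the share of (recidivist, nonrecidivist) pairs in which the
    recidivist has the higher score; ties count as half a win (assumption)."""
    wins = 0.0
    for r, n in product(recidivist_scores, nonrecidivist_scores):
        if r > n:
            wins += 1.0
        elif r == n:
            wins += 0.5
    return wins / (len(recidivist_scores) * len(nonrecidivist_scores))


# Invented total scores on a 0-42 scale, not data from this study.
recidivists = [25, 18, 30, 12]
nonrecidivists = [10, 18, 22]
auc = auc_by_pairs(recidivists, nonrecidivists)
print(round(auc, 3))
```

A value of 0.5 under this definition corresponds to chance-level discrimination, and a value of 1.0 corresponds to every recidivist scoring above every nonrecidivist.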
An assumption of unidimensional IRT is that the set of test items represents a single dimension (Embretson & Reise, 2000). Thus, a confirmatory factor analysis (CFA) was performed to empirically evaluate the dimensionality of the RNA using Mplus 8 (Muthén & Muthén, 2017). A second-order CFA model for the dichotomous RNA items was estimated using a robust weighted least squares approach (i.e., WLSMV in Mplus).
IRT Analysis: Model Fitting
In this IRT analysis, the latent construct, θ, is risk of recidivism. IRT models estimate item response probabilities as a function of person (i.e., θ) and item parameters (e.g., difficulty, discrimination, or guessing; Baker & Kim, 2017). Two of the most common unidimensional IRT models, the one-parameter and two-parameter logistic models, were fitted to the data to find the better fitting model for subsequent DIF analyses.
DIF
DIF exists when items exhibit differential probabilities of endorsement among subgroups matched on the same latent construct (i.e., the level of risk). It is critical to match the subgroups (focal and reference groups) on a latent scale that is free from DIF. Group comparison without matching tests group differences at the scale score level, which is similar to the t tests we performed as a preliminary step. This group difference is referred to as impact. In contrast, DIF estimates item-level differences in response probabilities conditional upon the latent trait. DIF can be uniform or nonuniform. Uniform DIF refers to items exhibiting a constant statistical relationship between item response and group across all levels of the latent construct. In IRT, the item characteristic curves of a uniform DIF item have the same shape for both groups and differ only in location (Osterlind & Everson, 2009); that is, the b parameters differ between matched subgroups. Nonuniform (or crossing) DIF items, by contrast, have different discrimination (a) parameters for the two groups, and their item characteristic curves cross at some point along the θ scale (Holland & Wainer, 1993).
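The distinction between uniform and nonuniform DIF can be made concrete with a short sketch that assumes a two-parameter logistic item characteristic curve. All a and b values below are hypothetical and chosen only to show the two patterns; they are not estimates from this study.

```python
import math


def icc(theta, a, b):
    """Two-parameter logistic item characteristic curve: P(endorsed | theta)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Uniform DIF (hypothetical values): same a, different b. The curve with the
# smaller b lies above the other at every theta, so the curves never cross.
uniform_gap = [icc(t, a=1.5, b=-0.5) - icc(t, a=1.5, b=0.5) for t in thetas]

# Nonuniform DIF (hypothetical values): different a, same b. The sign of the
# gap flips at theta = b, which is where the two curves cross.
nonuniform_gap = [icc(t, a=2.0, b=0.0) - icc(t, a=0.8, b=0.0) for t in thetas]

for t, u, n in zip(thetas, uniform_gap, nonuniform_gap):
    print(f"theta={t:+.1f}  uniform gap={u:+.3f}  nonuniform gap={n:+.3f}")
```

In the uniform case the gap in endorsement probability keeps the same sign across the θ range, whereas in the nonuniform case the group that is favored depends on where along θ the comparison is made.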
Differential Test Functioning
The presence of differential functioning on the item level may or may not have an impact at the test level. Differential test functioning (DTF) occurs when the total test score functions differently across groups such that the final scores do not represent the same latent construct levels across groups (Stark et al., 2004). For example, if DTF exists, then Indigenous and non-Indigenous youth with the same θ risk levels will have different expected observed total scores. Differential test functioning R (DTFR) is a test level statistic that measures the difference between groups in the relationship of θ to the expected observed scores (Kleinman & Teresi, 2016). A DTFR value represents the number of raw test points that are due to differential functioning alone (Kleinman & Teresi, 2016). The Expected Test Score Standardized Difference (ETSSD)—a standardized mean difference between two groups in terms of total score for the scale, on the same metric as Cohen’s d (Meade, 2010)—was also used to measure the effect size of DTF.
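The idea that item-level differences may or may not accumulate at the test level can be sketched by comparing expected total scores for two groups at the same θ. This is a conceptual illustration only: it does not reproduce the DTFR or ETSSD computations from the cited sources, and the three-item scale and its (a, b) values are hypothetical.

```python
import math


def icc(theta, a, b):
    """Two-parameter logistic probability of item endorsement."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def expected_total(theta, items):
    """Expected observed total score: sum of endorsement probabilities."""
    return sum(icc(theta, a, b) for a, b in items)


# Hypothetical (a, b) pairs; only the third item's b differs between groups.
reference_items = [(1.2, -0.5), (1.0, 0.0), (1.5, 1.0)]
focal_items = [(1.2, -0.5), (1.0, 0.0), (1.5, 0.4)]

theta = 0.0
gap = expected_total(theta, focal_items) - expected_total(theta, reference_items)
print(round(gap, 3))  # raw-score points attributable to the differing item
```

If a second item differed in the opposite direction, the two gaps could partly cancel, which is one way item-level DIF can leave little differential functioning at the test level.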
Results
Group Comparisons on Risk and Reoffense
As shown in Table 1, Indigenous youth were assessed at higher risk than non-Indigenous youth based on the RNA total score, as well as in the domains of Criminal History, Family, Peer Relations, Substance Abuse, Personality, and Attitudes; effect sizes for the differences ranged from small to moderate. There were no differences between males and females in the RNA total or domain scores. As shown in Table 2, the 3-year recidivism rate for the total sample was 70%, with a significantly higher reoffense rate for Indigenous youth (82%) than non-Indigenous youth (58%). There were no differences in the type of reoffenses between Indigenous and non-Indigenous youth. Overall recidivism rates were also examined within risk categories for each group. Within the low and moderate risk categories, Indigenous youth had significantly higher recidivism rates than non-Indigenous youth. No differences were observed in the high-risk category (see Table 2).
Table 2. Recidivism Information for Indigenous and Non-Indigenous Youth
Note. Bolded chi-square values are significant at the .05 level. Phi (φ) value of .1 is considered a small effect, .3 a medium effect, and .5 a large effect (Cohen, 1988).
ROC analyses were conducted to examine the extent to which RNA total scores could discriminate recidivists from nonrecidivists across types of recidivism outcomes in the total sample and across groups. Table 3 shows that the only outcome for which AUC values differed significantly across groups was general recidivism without administration of justice convictions. For this outcome, the AUC value was significantly lower for Indigenous youth than for non-Indigenous youth, for whom it represented a moderate effect. In the Indigenous group, the RNA scores predicted recidivism at chance levels across recidivism outcomes.
Table 3. Predictive Accuracy of the RNA for Indigenous and Non-Indigenous Youth (3-Year Fixed Follow-Up Period)
Note. Bolded AUCs are significant at the .05 level. An AUC value of 0.56 is considered a small effect, 0.64 a medium effect, and 0.71 a large effect (Rice & Harris, 2005). DeLong’s test was used to compare AUC values across Indigenous and non-Indigenous groups. AUC = area under the curve; CI = confidence interval.
Factor Analysis of the RNA Tool
To test the unidimensional IRT assumption, a CFA was conducted on the RNA. A second-order measurement model was tested; consistent with the RNR model and clinical practice, the 42 items were loaded onto their respective eight domains, which were in turn loaded onto a single latent factor of risk of recidivism. The residual terms for all items were fixed to be uncorrelated. One item, failure to comply, demonstrated a standardized factor loading above 1 and was removed from the factor analysis. Standardized factor loadings for the remaining 41 items ranged from 0.36 (poor relations with father) to 0.94 (previous probation); thus, all items met or exceeded the minimum threshold to be considered interpretable (0.30–0.40; Hair et al., 2006). Standardized factor loadings for the eight need domains ranged from 0.57 (Criminal History) to 0.95 (Attitudes). The model fit of the CFA solution was evaluated using the chi-square goodness-of-fit test (χ²[771] = 1,499.73, p < .001), root mean square error of approximation (RMSEA = 0.05, 95% CI = [0.04, 0.05]), comparative fit index (CFI = 0.84), and Tucker–Lewis fit index (TLI = 0.83). The CFI and TLI values demonstrated poor fit, as they were below the cutoff values of 0.95 and 0.90, respectively (Hu & Bentler, 1999). However, the RMSEA demonstrated a good fit, as it was below the cutoff value of 0.08 (Hu & Bentler, 1999). The RMSEA for the null model was then examined (null RMSEA = 0.12). When the null RMSEA value is below 0.16, incremental measures of fit such as TLI and CFI may not be particularly informative (Kenny, 2015). In this case, given the acceptable RMSEA and the limited informativeness of the TLI and CFI under these conditions, the model was considered to be sufficiently unidimensional to proceed with unidimensional IRT analysis.
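As a rough consistency check, the reported RMSEA can be approximated from the reported chi-square and degrees of freedom. The sketch below uses a common textbook point-estimate formula with N − 1 in the denominator and assumes N = 398; the software's WLSMV-based computation may differ in detail, so this is an approximation and not the authors' procedure.

```python
import math


def rmsea(chi_square, df, n):
    """Textbook RMSEA point estimate (N - 1 convention, assumed here)."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


# Reported values: chi-square = 1,499.73 on 771 df; N = 398 is assumed.
estimate = rmsea(1499.73, 771, 398)
print(round(estimate, 2))
```

Under these assumptions the estimate rounds to the reported value of 0.05, and the formula returns 0 whenever the chi-square does not exceed its degrees of freedom.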
IRT Analysis: Model Fitting
Two IRT models can be used to model responses to dichotomous items such as those in the RNA. A one-parameter logistic model assumes that only the threshold/difficulty index is required to represent the item response process (Baker & Kim, 2017; Embretson & Reise, 2000; Harvey & Hammer, 1999). The threshold/difficulty index, or the b parameter, is the score on the θ axis that is associated with a 50% likelihood of an endorsed item response (Harvey & Hammer, 1999). The further right it is located on the θ axis, the less likely the item will be endorsed (i.e., the item is less prevalent among the population examined). The further left the b parameter is located on the θ axis, the more likely the item will be endorsed (Embretson & Reise, 2000; Furr, 2013; Nunnally & Bernstein, 1994). A two-parameter logistic model adds a second item parameter, the a parameter, or discrimination parameter. The discrimination parameter is the slope at the point of inflection for an item (Hulin, 1987). The a parameter allows one to determine an item’s strength of relationship to the underlying construct (θ); a larger a value indicates a stronger relationship (Harvey & Hammer, 1999).
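In standard notation, the two models described above can be written as follows. The indices i (youth) and j (item) are introduced here for illustration, and a logistic metric without an additional scaling constant is assumed.

```latex
% Two-parameter logistic model: probability that item j is endorsed for youth i
\[
P(X_{ij} = 1 \mid \theta_i)
  = \frac{\exp\left[a_j(\theta_i - b_j)\right]}{1 + \exp\left[a_j(\theta_i - b_j)\right]}
\]

% One-parameter logistic model: same form with a common discrimination a
\[
P(X_{ij} = 1 \mid \theta_i)
  = \frac{\exp\left[a(\theta_i - b_j)\right]}{1 + \exp\left[a(\theta_i - b_j)\right]}
\]
```

Setting θ_i = b_j gives a probability of 0.5, which matches the definition of the b parameter above. Under this parameterization the slope of the curve at that point is a_j/4, so a larger a corresponds to a steeper curve.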
Using IRTPRO (Cai et al., 2011), the one-parameter and two-parameter logistic models (1PL and 2PL) were fitted to the RNA items to find the better fitting model. The 2PL model (−2 log likelihood = 16,629.13, Akaike information criterion [AIC] = 16,797.13, Bayesian information criterion [BIC] = 17,131.99; M₂ = 2,876.39) was determined to be the better fit compared with the 1PL model (−2 log likelihood = 16,766.68, AIC = 16,852.68, BIC = 17,024.10; M₂ = 2,991.83). Furthermore, in the 2PL model, 40 items demonstrated acceptable individual item fit on the S-X² statistic, a Pearson chi-square item fit statistic; for these items, the observed and expected response frequencies were not significantly different from each other, supporting the assumption of local independence of items. Thus, the results support proceeding with IRT analyses, and the 2PL model was used for item calibrations and subsequent DIF analyses.
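The relationship between the reported −2 log likelihood, AIC, and BIC values can be checked with a short sketch. The parameter counts are assumptions inferred from the reported figures (84 for the 2PL, i.e., one a and one b for each of 42 items; 43 for the 1PL, i.e., one b per item plus a common a), and N = 398 is taken from the sample description.

```python
import math


def aic(neg2ll, k):
    """Akaike information criterion from -2 log likelihood and k parameters."""
    return neg2ll + 2 * k


def bic(neg2ll, k, n):
    """Bayesian information criterion; penalty grows with log sample size."""
    return neg2ll + k * math.log(n)


N = 398          # sample size
K_2PL = 42 * 2   # assumed: one a and one b per item
K_1PL = 42 + 1   # assumed: one b per item plus a single common a

print("2PL", round(aic(16629.13, K_2PL), 2), round(bic(16629.13, K_2PL, N), 2))
print("1PL", round(aic(16766.68, K_1PL), 2), round(bic(16766.68, K_1PL, N), 2))
```

Under these assumptions the sketch reproduces the reported AIC and BIC values to rounding. Lower values indicate better fit on each index; the −2 log likelihood, AIC, and M₂ are lower for the 2PL model, whereas the BIC, which penalizes additional parameters more heavily, is lower for the 1PL model.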
DIF
One item, previous custody from the Criminal History domain, was removed from analysis due to lack of endorsement for non-Indigenous youth. The remaining 41 RNA items were assessed for DIF in IRTPRO (Cai et al., 2011) using a nested model comparison; non-Indigenous youth were set as the reference group and Indigenous youth as the focal group. First, items were calibrated simultaneously for both groups to identify a subset of items that were unbiased and could be used as a matching criterion to place the groups on the same latent scale. These “matching items” were identified using an iterative procedure in which each item was examined for potential DIF using all other items as temporary matching items. This was done by comparing a model where the parameters were constrained to be equal across the groups with a model in which the parameters were freely estimated across the groups. Model comparison chi-square values were then evaluated for significance, and items with p values less than .05 were flagged for potential DIF. Items with potential DIF were then removed from the matching item list and the process was repeated until a final set of anchor items containing no potential DIF items was reached. Then, each of the items flagged for potential DIF was compared with the final matching item set to evaluate whether it demonstrated true DIF. The items that displayed DIF were those with significant differences in item parameters (a and b parameters) across groups (Holland & Wainer, 1993). Effect size measures for these final DIF items were calculated using the VisualDF 1.3 program (Meade, 2010), which followed the same magnitude guidelines as those for Cohen’s d, such that an effect size of .20 represents a small effect, .50 represents a moderate effect, and .80 represents a large effect (Cohen, 1988).
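The control flow of the iterative matching-item procedure described above can be sketched as follows. This illustrates the loop only and is not IRTPRO's implementation: `purify_anchors` and `stub_p_value` are hypothetical names, and the stub stands in for the constrained-versus-free model comparison that would supply each p value in practice.

```python
def purify_anchors(items, dif_p_value, alpha=0.05):
    """Drop items flagged for potential DIF until the anchor set is stable,
    then retest the flagged items against the final anchor set.

    dif_p_value(item, anchors) is a hypothetical callable returning the
    p value of the nested model comparison for `item` given `anchors`.
    """
    anchors = list(items)
    while True:
        flagged = [
            item for item in anchors
            if dif_p_value(item, [x for x in anchors if x != item]) < alpha
        ]
        if not flagged:
            break
        anchors = [item for item in anchors if item not in flagged]
    candidates = [item for item in items if item not in anchors]
    confirmed = [item for item in candidates if dif_p_value(item, anchors) < alpha]
    return anchors, confirmed


def stub_p_value(item, anchors):
    """Stand-in for a real model comparison; values are invented."""
    return 0.01 if item in {"item_3", "item_7"} else 0.40


items = [f"item_{i}" for i in range(1, 9)]
anchors, dif_items = purify_anchors(items, stub_p_value)
print(anchors)
print(dif_items)
```

Separating the purification loop from the final retest mirrors the two stages in the text: flagged items are first excluded from matching, and only afterward are they evaluated against the purified anchor set.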
Of the remaining 41 RNA items, uniform DIF was found for seven items; no nonuniform DIF items were found (see Table 4). Recall that uniform DIF items show significant differences in the threshold/difficulty parameter (b) across groups. Smaller b parameters were observed for Indigenous youth on two items from the Substance Abuse domain: persistent alcohol use and substance use related to offense. The b parameters for these items were shifted to the left on the θ axis for Indigenous youth relative to non-Indigenous youth; at a given risk level, it was more likely that Indigenous youth had these items endorsed than non-Indigenous youth. In contrast, smaller b parameters were observed for non-Indigenous youth on four items from the Education domain (disruptive behavior in the classroom, disruptive behavior on school property, problems with peers, and problems with teachers) and on “poor frustration tolerance” from the Personality domain. Non-Indigenous youth were more likely to have these items endorsed than Indigenous youth, with the b parameters shifted more to the left on the θ axis for non-Indigenous youth. All seven uniform DIF items demonstrated moderate to large effect sizes.
Table 4. Differential Item Functioning Between Indigenous and Non-Indigenous Youth of the YLS/CMI Items
Note. ESSD is interpreted using the same guidelines as Cohen’s d: an ESSD of 0.20 represents a small effect, 0.50 represents a moderate effect, and 0.80 represents a large effect (Cohen, 1988; Meade, 2010). All quoted YLS/CMI items are from the Youth Level of Service/Case Management Inventory (Hoge & Andrews, 2002, 2011). a = discrimination parameter; b = difficulty parameter; ESSD = expected score standardized difference, a measure of effect size for DIF items; DIF = differential item functioning.
DTF
To better understand the group differences on overall test functioning, the DTFR and ETSSD values were calculated to provide a numerical index of DTF. A DTFR value represents the number of raw score points that were due to DTF, or measurement bias (Stark et al., 2014). In this study, the DTFR was −0.65, indicating that Indigenous youth scored 0.65 points higher than non-Indigenous youth due to DTF alone. For comparison, RNA scores range from 0 to 42. The effect size was small (ETSSD = −0.10).
Discussion
This study conducted IRT-based DIF analyses to examine differences in the item functioning of the RNA across Indigenous and non-Indigenous youth. Uniform DIF emerged for seven items and no nonuniform DIF was found. Recall that in uniform DIF items, the range of the latent risk levels needed for item endorsement differs across groups but the item has similar relevance for both groups, whereas nonuniform DIF items indicate differential relevance to the latent risk level across groups (Hulin, 1987). Thus, results indicated comparable sensitivity of the RNA items to changes across latent risk levels regardless of group membership. However, total RNA scores predicted general recidivism for non-Indigenous youth, but not for Indigenous youth, which raises questions about how the risk assessment instrument is functioning at the test level for Indigenous youth, despite similar functioning at the item level.
Predictive Accuracy of the RNA
Current findings were consistent with previous research, in which higher total risk scores and higher recidivism rates were observed for Indigenous youth compared with non-Indigenous youth (Jung & Rawana, 1999; Luong & Wormith, 2011; Olver et al., 2012). It is noteworthy that total RNA scores predicted recidivism at chance level for Indigenous youth across types of recidivism outcomes (general recidivism, general recidivism without administration of justice convictions, nonviolent recidivism, violent nonsexual recidivism). This contradicts previous findings in which the YLS/CMI total scores significantly predicted recidivism for Indigenous justice-involved youth (Jung & Rawana, 1999; Luong & Wormith, 2011; Olver et al., 2012; Shepherd et al., 2015; Thompson & McGrath, 2012). Because the RNAs were scored by probation officers, whose training may differ from that of clinicians or researchers, we considered whether rater type could explain the current findings by comparing previous studies that used officer-scored and clinician-scored risk assessments. Thompson and McGrath (2012) used scores on the Australian version of the YLS/CMI completed by Juvenile Justice Officers and found that the total score significantly predicted recidivism for both groups, and the AUC values were similar to values for non-Indigenous youth in this study (total sample = 0.65, non-Indigenous sample = 0.64). Olver and colleagues (2012) examined mental health professional scored YLS/CMI and found that the instrument predicted recidivism equally well and at times more strongly for Aboriginal youth than non-Aboriginal youth. Therefore, the fact that the RNAs were scored by probation officers does not appear to explain our finding of poor predictive accuracy for Indigenous youth. Another potential explanation may be the use of the youths’ first RNAs on file for risk prediction, which may be too distal from the time of reoffense to make accurate predictions. Previous studies have demonstrated that recidivism prediction improves when more proximal risk assessments are used rather than earlier ones (Clarke et al., 2017). However, even if RNAs administered closer to youths’ reoffenses are more predictive, this would not explain why the RNA total scores predicted recidivism differentially across groups, as both groups had the same follow-up period.
Within the Indigenous group, we found high recidivism rates even for those in the low (79%) and moderate (82%) risk categories. Given the high general recidivism rates observed in supposedly lower risk categories, it may not be surprising that the total scores were not predictive for Indigenous youth. Previous studies in Australia (Thompson & McGrath, 2012) and Canada (Wilson, 2016) also reported that the YLS/CMI underpredicted general recidivism for Indigenous youth in the low- and moderate-risk categories. This finding raises important concerns with the use of YLS/CMI total scores in the prediction of risk with Indigenous youth. One explanation is differences in conviction rates between Indigenous and non-Indigenous youth (Wilson, 2016). In general, Indigenous offenders have higher conviction rates and receive lengthier sentences as well as more stringent sentencing conditions compared with non-Indigenous offenders (Mann, 2009). This finding has been attributed to multiple factors, including over-policing, institutional racism embedded within an English-based justice system, and increased offending in Indigenous individuals resulting from the negative impacts of colonialism (Blagg, 2012). These factors are emblematic of the broader sociopolitical climate that fails to consider the historical and social context of Indigenous communities, contributing to continued marginalization and discrimination of Indigenous individuals. As a result, Indigenous individuals may be more likely to be charged and convicted of an offense even if their risk assessment results place them in the low-risk category, rendering the concept of individual criminogenic risk invalid. However, differences in conviction rates between groups were not examined in the study and this remains a hypothesis to be evaluated in future research.
Finally, if one accepts that the risk assessment enterprise is a valid one, another possible explanation is that the problem is with the tool. Given that Indigenous youth in the low- and moderate-risk categories were reconvicted at high rates, the RNA may be incorrectly classifying high-risk youth as lower risk. According to the RNR framework’s risk principle, the intensity of service should be matched to an individual’s risk level. Thus, if a high-risk youth is classified as low risk, the youth may not receive the appropriate services necessary for reducing risk to reoffend. However, the underprediction of recidivism in the low- and moderate-risk categories was restricted to Indigenous youth, raising concerns with the applicability of the criminogenic needs outlined in the RNR framework with Indigenous individuals. The applicability of the RNR framework—including the instruments derived from it—with Indigenous individuals is a critical line of investigation, especially in light of the Ewert v. Canada (2015) decision. To this end, potential item and test differential functioning of the RNA across groups are discussed below.
Item Level Functioning
The difference in b parameters across Indigenous and non-Indigenous youth varied across items. Indigenous youth were more likely than non-Indigenous youth to have items from the Substance Abuse domain endorsed. This may not be surprising as higher prevalence rates of alcohol and substance use among Indigenous compared with non-Indigenous youth have been consistently established (Canadian Centre on Substance Use, 2007). Indigenous youth are often rated higher than non-Indigenous youth in the domain of Substance Abuse (Jung & Rawana, 1999; Olver et al., 2012; Shepherd et al., 2015). Specifically, substance use related to offense required an extremely high latent risk level for it to be endorsed for non-Indigenous youth, whereas a substantially lower threshold was required for Indigenous youth, indicating a potentially biased item.
In contrast, non-Indigenous youth were more likely than Indigenous youth with the same latent risk levels to have the item “poor frustration tolerance” from the Personality domain endorsed. The importance of community harmony and cooperation is emphasized in many Indigenous cultures, in which expressions of anger and aggression are discouraged (Briggs, 1970). Within such communities, members may avoid frustrating each other and may be slower in interpreting someone else’s behavior as frustrating (Briggs, 1970), which may explain why Indigenous youth were less likely to have this item endorsed. In addition, despite similar latent risk levels, non-Indigenous youth were more likely to have items from the Education domain endorsed, including disruptive behaviors and problems with teachers and peers. This may be explained by the school environments faced by Indigenous youth, which many feel are difficult and alienating. Indigenous students repeatedly experience racism and discrimination in schools, and as a result, many respond with resistance and withdrawal from the education system (Ontario First Nations Young People’s Council of the Chiefs of Ontario, 2016). Some scholars argue that resistance to formal education is a reaction against perceived inequalities in the school system (Rahman, 2010). Thus, the experience of discrimination and alienation may be related to disruptive behaviors and interpersonal difficulties, which can lead to increased rates of suspensions and absenteeism. In particular, teachers have a significant impact on Indigenous youth school engagement. When Indigenous students have poor relationships with teachers, they may develop identities that are in opposition to those desired by the school (Groome & Hamilton, 1995). Given the unique school challenges faced by Indigenous youth, it may not be surprising that endorsement of items from the Education domain is related to higher latent risk for Indigenous youth than for non-Indigenous youth. Therefore, culturally responsive schooling that emphasizes safe school environments and teachers who understand the diverse backgrounds of their students is encouraged to improve school engagement and learning outcomes among Indigenous students (Rahman, 2010).
Understanding whether item endorsements represent different latent risk levels across groups is important for clinicians to gain an accurate understanding of their risk assessment results. For example, in the current sample, Indigenous and non-Indigenous youth had similar observed frequencies of item endorsement in the Education domain. Knowing that uniform DIF exists in some of the Education items helps to inform clinicians that the two groups may not be similar on latent risk levels despite similar item endorsements. However, it is important to note that despite differences in the range of the latent risk levels needed for item endorsement in the uniform DIF items, their relationships to latent risk were similar across groups. Thus, in general, the presence of only uniform DIF items indicates that the RNA items are functioning similarly across Indigenous and non-Indigenous youth.
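The notion of uniform DIF described above can be illustrated with a minimal sketch. It assumes a two-parameter logistic (2PL) item response function, and every parameter value in it is invented for illustration; none is an estimate from this study. The function and constant names are likewise hypothetical. Two groups share the same discrimination (a) for an item but differ in its threshold (b), so the two curves have the same shape and are simply shifted along the latent risk scale.

```python
import math


def p_endorse(theta, a, b):
    """2PL probability that an item is endorsed at latent risk theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical parameters (not study estimates): equal discrimination,
# different thresholds. This is what "uniform DIF" means in this sketch.
A_SHARED = 1.5
B_LOW_THRESHOLD_GROUP = 0.0   # item is endorsed at lower latent risk
B_HIGH_THRESHOLD_GROUP = 1.0  # item needs more latent risk to be endorsed


def uniform_dif_gap(theta):
    """Difference in endorsement probability at the same latent risk."""
    return (p_endorse(theta, A_SHARED, B_LOW_THRESHOLD_GROUP)
            - p_endorse(theta, A_SHARED, B_HIGH_THRESHOLD_GROUP))


if __name__ == "__main__":
    for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"theta={theta:+.1f}  gap={uniform_dif_gap(theta):.3f}")
```

In this sketch the gap is positive at every latent risk level, which is the signature of uniform DIF: an endorsed item corresponds to a different latent risk level in each group, even though the item relates to latent risk in the same way (same slope) for both.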
Test Level Functioning
Evaluation of differential functioning at the test level found negligible DTF effect sizes, indicating similar functioning of the total score across groups. Such results may appear paradoxical given the lack of predictive accuracy of the total score for Indigenous youth in predicting general recidivism. However, predictive accuracy and differential item/test functioning analyses address distinct issues. Predictive accuracy is concerned with the relationship between test scores and external criteria (e.g., recidivism), whereas DIF/DTF is concerned with the internal psychometric properties of the test (i.e., measurement equivalence; Stark et al., 2004). The two analyses are complementary, and both are necessary to establish the validity of an instrument. In this study, the RNA items and total scores functioned similarly across groups, indicating similar relationships between the items and latent risk levels for both Indigenous and non-Indigenous youth. The RNA’s poor predictive accuracy for Indigenous youth points to the possibility, discussed above, that Indigenous justice-involved individuals have unique risk factors and needs that were not captured in the RNA, and by extension, the RNR framework. However, also discussed above is the possibility that poor predictive accuracy reflects systemic issues that are not related to criminal riskiness but rather systems-level policies and practices that function to criminalize particular groups.
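As a purely hypothetical illustration of how item-level differences can coexist with negligible differences at the test level, the sketch below sums 2PL item response functions into an expected total score for each group. The 2PL form, the three-item test, and all parameter values are assumptions made for this example and are not taken from the study. Two items carry threshold shifts in opposite directions, and in this stylized case the shifts offset exactly in the total score. Whether such offsetting accounts for the present DTF results was not tested here.

```python
import math


def p_endorse(theta, a, b):
    """2PL probability that an item is endorsed at latent risk theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def expected_total(theta, items):
    """Expected total score: sum of endorsement probabilities over items."""
    return sum(p_endorse(theta, a, b) for a, b in items)


# Hypothetical (a, b) pairs, one per item; not study estimates.
# Item 0 is "easier" for group 1, item 1 is "easier" for group 2,
# and item 2 functions identically in both groups.
GROUP_1_ITEMS = [(1.2, -0.5), (1.2, 0.5), (1.0, 0.0)]
GROUP_2_ITEMS = [(1.2, 0.5), (1.2, -0.5), (1.0, 0.0)]

THETA_GRID = [x / 10.0 for x in range(-30, 31)]


def max_item_gap(item_index):
    """Largest absolute between-group gap for one item over the grid."""
    a1, b1 = GROUP_1_ITEMS[item_index]
    a2, b2 = GROUP_2_ITEMS[item_index]
    return max(abs(p_endorse(t, a1, b1) - p_endorse(t, a2, b2))
               for t in THETA_GRID)


def max_test_gap():
    """Largest absolute between-group gap in expected total score."""
    return max(abs(expected_total(t, GROUP_1_ITEMS)
                   - expected_total(t, GROUP_2_ITEMS))
               for t in THETA_GRID)


if __name__ == "__main__":
    print("max gap, item 0:", round(max_item_gap(0), 3))
    print("max gap, item 1:", round(max_item_gap(1), 3))
    print("max gap, total score:", max_test_gap())
```

Note that nothing in this sketch involves recidivism. Equivalence of expected total scores describes the internal behavior of the instrument and says nothing, by itself, about how well those scores predict an external outcome in either group.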
Limitations and Future Directions
In addition to the findings and implications discussed above, several methodological limitations of this study suggest important directions for future research. First, we analyzed the factor structure of the RNA using one model for the purpose of addressing the IRT assumption of unidimensionality. However, in future research, it will be important to more thoroughly examine the factor structure of the RNA (YLS/CMI), possibly extending beyond the one-factor model, as understanding its internal structure is an important component of instrument validation. Second, although we used the first RNA in our dataset for each youth, these were not necessarily the first RNAs ever conducted with these youth and we do not know the date of their first risk assessment. As such, we were unable to examine whether the two groups were at similar points in their youth justice system involvement. Therefore, we cannot rule out the possibility that group differences in DIF findings reflect systematic differences in familiarity with youth on the part of probation officers. To address this question, in future research, it will be important to match youth on the extent of their youth justice system involvement (e.g., examining only first offenders). Third, we included males and females together to meet the sample size requirements of IRT for DIF analysis (Embretson & Reise, 2000). As such, we were unable to address questions of gender differences in DIF. While preliminary analyses indicated no gender differences in total risk or domain scores on the RNA, this does not mean that there are no gender differences in item-level functioning. In future studies of item-level functioning, it will be important to examine whether the YLS/CMI functions similarly across male and female youth, both generally and within particular ethno-cultural and racial groups. Finally, the current findings emerged from a sample of Indigenous youth in Ontario, Canada. Given the wide range of diversity among Indigenous communities and the prevalence and impact of risk assessment tools, their predictive accuracy should be examined in large samples of Indigenous justice-involved youth across Canada and in other countries. Thus, a future direction for research is to replicate the current analyses and investigate the applicability of the RNA (YLS/CMI) in larger samples of Indigenous youth.
To fully understand and tackle the continuing and deeply problematic issues of high recidivism rates and overrepresentation of Indigenous people in the criminal justice system, we must not only thoroughly examine the psychometric properties of a risk assessment instrument, but also understand how it is applied and interpreted in the real world. An instrument that is measurement invariant may be applied differently for individuals in different subgroups, leading to differences in forensic and judicial outcomes. These are important future areas of investigation, but they are further complicated by the fact that they continue to be studied and theorized from different disciplinary perspectives (in parallel lines of research and scholarship, for the most part) rather than in collaborative, transdisciplinary approaches. While the latter is far more challenging to engage in, we argue that it is a necessary approach if we are to make progress in addressing these pressing problems.
Acknowledgements
The authors would like to express their deep appreciation to Lauren Freedman, Team Lead, Effective Programming and Evaluation Unit, Ontario Ministry of Children, Community and Social Services; Mike Kirk, Ontario Ministry of Community Safety and Correctional Services; the Honorable Mr. Justice Brian Weagant, Ontario Court of Justice; and Ilana Lockwood for their contributions to this study. This article is based on Shiming Huang’s PhD dissertation. The research was conducted at the Ontario Institute for Studies in Education, University of Toronto.
This research was supported by grant 435-2016-0152 from the Social Sciences and Humanities Research Council to the second and fourth authors.
