Abstract
The authors reviewed nine studies examining psychometric properties of the Juvenile Sex Offender Assessment Protocol–II (J-SOAP-II) and examined the psychometric properties of the J-SOAP-II when items were scored based on probation records obtained at or near disposition and prior to treatment. Data from 73 boys ages 12 to 17 who participated in a larger randomized clinical trial informed this study. Reliability (internal consistency and interrater agreement) and validity (concurrent, discriminant, and predictive) were examined. Scale 1, Sexual Drive/Preoccupation, was characterized by adequate reliability and concurrent validity but did not predict scores on a measure of concerning sexual behavior. This is consistent with seven studies that failed to find evidence of predictive validity using measures of sexual recidivism. Also consistent with the literature, Scale 2, Impulsive/Antisocial Behavior, performed well with respect to nearly all psychometric properties including predictive validity. Review of remaining scales and scores and clinical policy implications are discussed.
In recent years, formal recidivism risk assessments have become a standard element of assessments of juveniles who sexually offend. Results often are used to guide decisions regarding placement, supervision, treatment, and whether youthful offenders will be subjected to policies such as sex offender registration, notification, and civil commitment (Prescott, 2006a). Given the influence of juvenile recidivism risk estimates on decisions with such far-reaching effects, the lack of predictive evidence is a significant limitation of existing measures (Fanniff & Becker, 2006; Prescott, 2006a; Worling & Långström, 2003). Moreover, accurate recidivism risk prediction for juveniles is complicated by numerous complex factors as noted by scale developers and others (Prentky & Righthand, 2003; Prescott, 2006a; Vitacco, Caldwell, Ryba, Malesky, & Kurus, 2009; Worling, 2004). Foremost among these is the rapidly changing developmental status of youth. Adolescence is a period of tremendous cognitive, social, and sexual growth that does not lend itself to easy solutions for recidivism risk prediction. In addition, youth behavior is influenced by factors across the many ecological systems in which they are embedded (e.g., individual, family, peer, school, and community factors) and the relative influence of each system on youth behavior changes across childhood (Quinsey, Skilling, Lalumière, & Craig, 2004). Efforts also are complicated by low rates of sexual recidivism (e.g., Caldwell, 2002, 2010; Letourneau & Armstrong, 2008; Zimring, Jennings, Piquero, & Hays, 2009).
Despite the complexity of the task, several investigators have developed instruments designed to assist with predicting juvenile sexual recidivism risk. The current article begins with a review of the existing peer-reviewed evidence on the psychometric properties of the Juvenile Sex Offender Assessment Protocol–II (J-SOAP-II), the most extensively studied of the available instruments (see Prescott, 2006b). This study also contributes new information from a recently completed clinical effectiveness trial (Letourneau, Henggeler, et al., 2009).
Description of the J-SOAP-II
The J-SOAP-II is designed to assess recidivism risk in boys ages 12 to 17 with a history of sexually coercive behaviors (Prentky, Harris, Frizzell, & Righthand, 2000; Prentky & Righthand, 2003). The original version consisted of 23 items developed based on research pertaining to juveniles who sexually offend as well as research regarding general delinquents and adult sex offenders (Prentky et al., 2000). Preliminary research provided promising results regarding internal consistency, interrater reliability, construct validity, and factor structure (Prentky et al., 2000; Righthand et al., 2005). The J-SOAP was revised to remove ambiguous items and items with weak predictive validity, to add potentially relevant items, and to improve reliability of coding (Prentky & Righthand, 2003). The J-SOAP-II consists of 28 items comprising four scales: Sexual Drive/Preoccupation (Scale 1), Impulsive/Antisocial Behavior (Scale 2), Intervention (Scale 3), and Community Stability/Adjustment (Scale 4). Each scale is composed of five to eight items, and each item is scored on a 3-point scale reflecting severity or presence/applicability to the youth. Scales 1 through 4 are summed for a total score. Although not the focus of this article due to limited research, Static (sum of Scales 1 and 2) and Dynamic (sum of Scales 3 and 4) summary scores can also be calculated. The J-SOAP-II functions as a structured professional judgment measure that assesses operationalized risk factors derived from a review of the literature that should inform evaluators’ risk judgments (Heilbrun, Yasuhara, & Shah, 2010). This is in contrast to actuarial instruments that use formulas to combine empirically derived risk factors to estimate the probability of an outcome (Heilbrun et al., 2010).
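The summation scheme described above can be sketched in a few lines of Python. This is only an illustration, not the developers' scoring procedure: the function name, the dictionary keys, the 0/1/2 coding of the 3-point item scale, and the example ratings (including the number of items placed in each scale) are assumptions made for the sketch.

```python
from typing import Dict, List

# Assumption: the 3-point item scale is coded 0, 1, 2.
VALID_ITEM_SCORES = {0, 1, 2}


def summarize_jsoap(items: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum item ratings into scale, Static, Dynamic, and Total scores.

    `items` maps the hypothetical keys "scale1".."scale4" to item ratings.
    """
    for name, ratings in items.items():
        bad = [r for r in ratings if r not in VALID_ITEM_SCORES]
        if bad:
            raise ValueError(f"{name} has out-of-range ratings: {bad}")
    scores = {name: sum(ratings) for name, ratings in items.items()}
    scores["static"] = scores["scale1"] + scores["scale2"]   # Scales 1 + 2
    scores["dynamic"] = scores["scale3"] + scores["scale4"]  # Scales 3 + 4
    scores["total"] = scores["static"] + scores["dynamic"]   # Scales 1-4
    return scores


# Made-up ratings for one youth (28 items in total).
example = {
    "scale1": [0, 1, 2, 0, 0, 1, 0, 0],
    "scale2": [2, 2, 1, 1, 0, 2, 1, 0],
    "scale3": [1, 1, 0, 2, 1, 0, 1],
    "scale4": [0, 1, 1, 0, 2],
}
result = summarize_jsoap(example)
```

Because the J-SOAP-II is a structured professional judgment measure, sums like these would inform an evaluator's judgment rather than map onto a recidivism probability.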
Review of the Existing Literature
Nine published studies were identified that presented information on at least one psychometric property of individual J-SOAP-II scales and/or the Total Score. Details from these studies are reviewed below and summarized in Tables 1 and 2.
Table 1. Characteristics of and Reliability Estimates From Prior J-SOAP-II Studies
Note: N/A indicates the parameter of interest was not reported.
Standard deviation of mean age not provided in Powers-Sawyer and Miner (2009) or in Prentky et al. (2010). The participants’ ages ranged from 14 to 19 and 3 to 20, respectively.
In Prentky et al. (2010), interrater reliability was calculated in a nonstandard manner.
Table 2. Test-Criterion Validity Results From Prior J-SOAP-II Studies
Note: All parameters presented are statistically significant.
For ease of presentation, only results regarding any sexual reoffense are presented. The pattern of results was the same for violent sexual reoffense. In addition, the results of Cox proportional hazard analyses are not presented; the pattern of results was similar, although Scale 1 did not significantly predict sexual recidivism.
For youth with both child and peer/adult victims, higher scores were associated with lower recidivism rates.
Prentky et al. (2010) present results for the full sample and for a high-risk subsample, as well as for preadolescents and total adolescents. For ease of presentation, only results regarding the high-risk total adolescents subsample are presented here. In addition, only results of the Cox proportional hazard and ROC curve analyses are presented (results from logistic regression were similar).
Results of Kaplan-Meier survival curve analyses are not reported; no significant predictors were identified in these analyses. In addition, results regarding in-treatment behavior are not included. Scale 1 predicted sexual aggression during treatment (AUC = .65). Nonsexual aggression during treatment was predicted by Scale 2 (AUC = .63), Scale 3 (AUC = .61), Scale 4 (AUC = .67), and Total Score (AUC = .66).
Significant predictors of serious violent rearrest were Scale 2 (AUC = .67), Scale 4 (AUC = .63), and Total Score (AUC = .63).
Reliability
Reliability refers to the consistency of measurements obtained with repeated testing and the degree to which these measurements are free from error (Joint Committee on the Standards for Educational and Psychological Testing [SEPT], 1999). With respect to the J-SOAP-II, existing studies have focused on internal consistency and interrater agreement.
Internal Consistency
Internal consistency is typically measured by Cronbach’s alpha coefficient (Cronbach, 1951), with alpha values of .70 or higher and item–total correlations of .30 or higher indicating adequate internal consistency (Nunnally & Bernstein, 1994). Internal consistency was examined in three studies (Table 1). In all three studies, Cronbach’s alpha values for Scales 2 and 3 and the Total Score exceeded .70. Values for Scale 1 exceeded .70 in two of three studies and values for Scale 4 exceeded .70 in one of two studies. Thus, the data generally supported the internal consistency of the J-SOAP-II Scales 1 through 3 and Total Score, with less support for Scale 4.
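For readers less familiar with these indices, the sketch below shows one standard way to compute Cronbach's alpha and corrected item–total correlations (each item correlated with the sum of the remaining items). The function names and the small data set are hypothetical and are not drawn from any of the reviewed studies.

```python
from math import sqrt
from statistics import mean, variance
from typing import List, Sequence


def cronbach_alpha(rows: Sequence[Sequence[float]]) -> float:
    """Alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def corrected_item_total(rows: Sequence[Sequence[float]]) -> List[float]:
    """Correlate each item with the total of the *other* items."""
    k = len(rows[0])
    out = []
    for j in range(k):
        item = [row[j] for row in rows]
        rest = [sum(row) - row[j] for row in rows]
        out.append(pearson_r(item, rest))
    return out


# Made-up ratings: 5 youths (rows) by 3 items (columns), each coded 0-2.
ratings = [[0, 0, 0], [1, 1, 0], [1, 1, 1], [2, 1, 2], [2, 2, 2]]
alpha = cronbach_alpha(ratings)
item_total = corrected_item_total(ratings)
```

Under the thresholds cited above, a scale would be judged adequately consistent when `alpha` is at least .70 and every entry of `item_total` is at least .30.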
Interrater Agreement
Interrater agreement is measured by Pearson’s correlation or an intraclass correlation coefficient (ICC). ICC is the preferred measure when the variables of interest share method variance, as is the case with interrater agreement (McGraw & Wong, 1996). ICCs above .60 are generally considered good and values above .75 are considered excellent (Cicchetti et al., 2006; however, see Harvey & Hollander, 2004). Interrater agreement was examined in seven studies (Table 1). Interrater agreement exceeded .75 for Scale 1 in five of six studies, for Scale 2 in four of six studies, for Scale 3 in four of five studies, and for the Total Score in three of five studies. All values for Scale 1, Scale 3, and the Total Score exceeded .60. An unacceptably low ICC for Scale 2 was reported in one study and an unacceptably low ICC for Scale 4 was reported in two of the studies. Thus, the data supported the interrater agreement of Scales 1 through 3 and the Total Score, with insufficient evidence for Scale 4.
Validity
Validity refers generally to the degree to which interpretation of test scores for proposed uses is supported by an accumulation of evidence (SEPT, 1999). The J-SOAP-II manual explicitly states that the intended purpose of this instrument is to “facilitate risk assessment and risk management” (Prentky & Righthand, 2003, p. 9). Given this focus, the most relevant sources of validity for the J-SOAP-II would include comparison with future recidivism rates (test-criterion validity) and comparison with previously validated risk-assessment measures (convergent evidence).
Test-Criterion Evidence
The degree of accuracy with which J-SOAP-II scale and Total Scores predict sexual and other recidivism is critical to evaluating the effectiveness of this instrument. As will be seen, every study to examine the question has found some support for the measure’s association with recidivism; however, when looking at which scales were associated with which types of recidivism, the picture that emerges is far less consistent. Complicating the review of validity is the fact that the developers provide few specific hypotheses regarding the expected relationship between each scale and types of reoffense. We propose the following hypotheses:
Hypothesis 1: Scale 1 will predict sexual recidivism.
Hypothesis 2: Scale 2 will predict general recidivism.
The test developers do indicate the following:
Hypothesis 3: Scale 3 “may be useful” in predicting both sexual and nonsexual reoffense (Prentky & Righthand, 2003, p. 6).
We propose the following:
Hypothesis 4: Scale 4 will predict both sexual and nonsexual reoffense.
Hypothesis 5: The Total Score will predict both sexual and nonsexual reoffense.
Results relevant to these five hypotheses are reviewed in the next sections.
Scale 1
Scale 1 is composed of items that reflect offense and offender characteristics such as having male child victims or a history of child sexual victimization, as well as items reflecting sexual drive and preoccupation. The relationship between Scale 1 scores and sexual recidivism has been examined in nine studies; a significant positive relationship was reported in just two (Table 2). A significant negative relationship was reported in one study, with lower Scale 1 scores predicting sexual recidivism over a 1-year follow-up, and no significant relationship was reported in the remaining six studies. These results suggest that Scale 1 scores cannot be interpreted with confidence as indicating a youth’s propensity to commit future sexual offenses.
Scale 2
Scale 2 assesses impulsive and antisocial behavior, types of behavior that have been convincingly linked to general recidivism risk among delinquent juveniles (e.g., Lipsey & Derzon, 1999). Statistically significant relationships in the expected direction were found in seven of the eight studies that examined the relationship between Scale 2 scores and general and/or violent recidivism (Table 2). These results suggest that Scale 2 scores can be interpreted with confidence as indicating a youth’s propensity for committing future nonsexual offenses.
Scale 3
Scale 3 was designed to capture change as a function of treatment (Prentky et al., 2010) but as noted earlier might also predict sexual and general recidivism risk (Prentky & Righthand, 2003). The relationship between Scale 3 and sexual recidivism was examined in seven studies; a significant positive relationship between Scale 3 scores and sexual recidivism was reported in four of these studies. A significant positive relationship was identified in two of six studies that evaluated the relationship between Scale 3 and general recidivism. These results suggest that more research is required before drawing confident conclusions regarding the relationship between Scale 3 and sexual recidivism, but Scale 3 scores cannot be interpreted with confidence as indicating a youth’s propensity for committing future nonsexual offenses.
Scale 4
Scale 4 is designed to assess community adjustment and might be related to sexual and general recidivism. Five studies evaluated the relationship between Scale 4 and sexual recidivism; significant positive relationships were found in four studies. A significant positive relationship was found between Scale 4 and general recidivism in all four studies that examined the relationship. These results suggest that Scale 4 scores might indicate a youth’s propensity for committing future sexual and nonsexual offenses.
Total Score
The Total Score could possibly predict both sexual and general recidivism. Significant positive predictive relationships were reported in four out of seven studies that evaluated the relationship between Total Score and sexual recidivism (Table 2). A significant positive predictive relationship between Total Score and general recidivism was reported in three out of six studies. These results suggest that further study is needed before the Total Score can be interpreted with confidence as indicating a youth’s propensity for committing future sexual or nonsexual offenses.
Convergent Evidence
Convergent evidence pertains to the degree of relationship between scores across instruments designed to measure similar constructs (SEPT, 1999). This property of the J-SOAP-II has been examined in just four studies, only one of which provided information regarding the convergent evidence for Scale 4. Thus, discussion of Scale 4 convergent validity is omitted from the review below.
Scale 1
The clearest test of convergent evidence for Scale 1 was presented by Caldwell, Ziemke, and Vitacco (2008). These authors compared Scale 1 scores with scores from three other juvenile sexual risk prediction instruments. Convergent validity of Scale 1 was evidenced by significant and positive correlations between Scale 1 and two of these juvenile risk instruments. However, the construct assessed by Scale 1 as well as the comparison instruments does not appear to be functionally related to youth sexual recidivism risk, given the minimal predictive evidence for Scale 1 (reviewed above) and given findings that none of the recidivism risk instruments reviewed in Caldwell and colleagues’ study predicted sexual recidivism. Similarly, a significant correlation between Scale 1 and the Juvenile Sexual Offense Recidivism Risk Assessment Tool–II (J-SORRAT-II) could be interpreted as providing convergent evidence; however, the J-SORRAT-II has limited evidence of predictive validity as well (Viljoen et al., 2008).
Scale 2
The convergent validity of Scale 2 is supported by significant positive correlations between Scale 2 and several instruments that predict propensity for general delinquency, including the PCL:YV (Caldwell et al., 2008), the YLS/CMI (Caldwell & Dickinson, 2009), and the SAVRY (Viljoen et al., 2008).
Scale 3
Scale 3 focuses on treatment progress and has not been compared with other measures of treatment progress. As Scale 3 is hypothesized by the developers to be related to sexual and general recidivism, convergent validity might be demonstrated via positive correlations with sexual and general recidivism risk measures. Research is mixed regarding the relationship between Scale 3 and other measures intended to assess sexual risk (Caldwell et al., 2008; Viljoen et al., 2008), but as noted above the construct validity of these sexual risk measures is uncertain. In addition, Scale 3 was not related to PCL:YV scores (Caldwell et al., 2008) but was found to be significantly correlated with SAVRY scores (Viljoen et al., 2008). More research is needed regarding the convergent evidence for Scale 3.
Total Score
Given the inclusion of sexual and general risk factors in the J-SOAP-II, the Total Score is expected to be correlated with other measures of sexual risk as well as general recidivism risk measures. The convergent evidence for the Total Score has been examined in three studies; significant positive correlations were identified with both sexual and general risk-assessment measures (Caldwell et al., 2008; Rajlic & Gretton, 2010; Viljoen et al., 2008).
In conclusion, the existing literature provides important evidence regarding the reliability and validity of the J-SOAP-II. The available evidence supports the reliability of Scale 1, but this scale cannot yet be confidently interpreted as indicative of sexual recidivism risk. The evidence suggests that the same underlying construct has been identified in Scale 1 and other sexual recidivism risk measures, but the meaning of this construct is not clear. Existing research provides strong support for the reliability, construct validity, and predictive validity of Scale 2. Regarding Scale 3, the available evidence supports its reliability, but additional research is needed regarding its relationship to sexual and general recidivism. Scale 4 has inadequate evidence of reliability but has promising evidence for predictive validity regarding both general and sexual recidivism, although this scale suffers from a dearth of research. Finally, the available evidence supports the reliability and construct validity of the Total Score, but evidence regarding predictive validity is mixed.
Current Study
The J-SOAP-II is intended to be scored using a multitude of sources, and almost all existing studies used information derived from a combination of corrections and treatment records (see Table 1). However, a common risk-evaluation scenario involves assessing youth recently charged or adjudicated for sexual offenses who have not yet begun treatment. In such “disposition” cases, assessments are made in the absence of treatment records. Measurement error might increase under such less-than-ideal but common scoring procedures (Nunnally, 1978). It is important to determine whether the psychometric properties of instruments hold up under common but alternative scoring conditions. Indeed, studies conducted in different contexts can contribute valuable information—pieces of the puzzle if you will—toward a more complete picture of instrument reliability and validity. In the next section, we present an evaluation of the reliability and validity of the J-SOAP-II with scores based on archival juvenile justice records for a sample of youth recently charged or adjudicated for a sexual offense.
Method
The present study is based on data collected as part of a randomized trial comparing the effectiveness of Multisystemic Therapy (MST; Henggeler, Schoenwald, Borduin, Rowland, & Cunningham, 1998) to that of a specialized sex offender group therapy intervention (Letourneau, Henggeler, et al., 2009). The parent study sample and procedures have been described in detail elsewhere (Henggeler et al., 2009; Letourneau, Henggeler, et al., 2009) and are therefore reviewed only briefly in the following sections.
Participants
All youth and their primary caregivers were referred to the study by the local solicitor’s office and the recruitment rate of eligible referred youths was 74% (N = 127). Inclusion criteria were diversion (44%) or adjudication (56%) for a sexual offense with an order for community-based specialized treatment, ages 11 to 17 years (inclusive), and residence with a permanent caregiver who spoke either English or Spanish. The principal exclusion criteria were current psychosis or serious brain dysfunction.
Of the 127 participants in the parent study, data from 73 informed the present study. Excluded were female participants (n = 3) who cannot be evaluated with the J-SOAP-II and youth whose juvenile justice records were simply too limited (e.g., often including only the index arrest report) to support any coding of the J-SOAP-II (n = 51; most of these youth were diverted vs. adjudicated). The following sections summarize information for the 73 youths and their caregivers who contributed data to this study.
Youth Demographics
The mean age of youth at baseline was 15.11 years (SD = 1.54, range = 12-18). Most youth were Black (58%) or White (40%), and 26% indicated Hispanic ethnicity. This sample was representative of the demographic makeup of the urban area in which youth resided. Index sexual offenses included aggravated criminal sexual assault, criminal sexual assault, aggravated criminal sexual abuse, criminal sexual abuse, other sexual offenses, and sexual offenses that were pleaded to nonsexual offenses. Forty-four percent of youth had a prior history of nonsexual offenses, with an average of 1.60 prior arrests for those with such a history. Seven percent of the youth had one prior sexual offense arrest or adjudication.
Caregiver and Family-Level Demographics
The youth’s primary caregiver typically was his mother (56%), father (16%), or another female relative (18%). Primary caregivers were partly or fully employed outside the home (58%), unemployed (21%), or homemakers (21%). Many caregivers (38%) had not completed high school, but 26% were high school graduates and 36% had completed 1 or more years of college. At the time of assessment, primary caregivers were married (48%), divorced (19%), separated (6%), never married (26%), or widowed (1%). Family economic status varied, with 38% of families earning less than US$10,000 in the past year, 35% earning between US$10,000 and US$30,000/year, and 28% earning US$30,000 or more.
Procedure
Study referral, recruitment, consent/assent, and data-collection procedures were approved by institutional review boards at two universities, and data were further protected by a federal certificate of confidentiality. Following consent, a baseline assessment protocol was completed after which participants were randomly assigned to the MST or usual services treatment conditions. Subsequent research assessments were collected at 6-, 12-, 18-, and 24-months postbaseline. The present study uses data collected at baseline and at the final 24-month follow-up assessment. For 25 youth recruited into the study with less than 24 months remaining, their final follow-up assessments (occurring at 13 to 23 months postbaseline) were used. During the baseline assessment interview, caregivers and youths jointly completed a comprehensive survey of demographic characteristics. During the baseline and follow-up assessments, youth and caregivers independently completed assessment protocols assessing youth sexual behavior, delinquency, mental health, family functioning, peer relationships, and school performance.
Measures
The key measures for the current study included the J-SOAP-II and measures pertaining to convergent and test-criterion evidence. In addition, we examined discriminant evidence, which is supported when instruments designed to measure dissimilar constructs are not correlated.
J-SOAP-II
As described earlier, the J-SOAP-II is designed to provide information on risk factors for reoffending (Prentky & Righthand, 2003). The authors recommend accessing several sources of information when scoring J-SOAP-II items and note that, in the event of limited information, scoring should be conservative. The 28 items are scored using a 3-point scale as described previously. Items within each of the four scales are summed to provide scale scores and items are summed across scales for a Total Score. For this study, J-SOAP-II scores were based on information extracted from juvenile justice archival records, which included psychosexual evaluations completed by mental health professionals for the solicitor’s office near the time of disposition, probation officer reports (typically disposition recommendation reports), and criminal case reports including investigating officer notes and victim statements pertaining to the sexual offense(s). In 93% of cases, either a detailed report from a probation officer, a psychological evaluation, or both were available. Frequently, the initial arrest petition (64%), arrest report (58%), or incident report (41%) was available. Less common sources of information included other court documents (38%), the youth’s statement to police (18%), or individualized education plans (7%). In 8 of the 73 cases (11%), treatment reports from previous residential treatment placements were included in the files and used for scoring.
Two coders blind to study participants’ treatment condition and rearrest status reviewed baseline records to make J-SOAP-II item ratings. The primary coder (AMF) was trained by a J-SOAP-II developer. Training involved review of materials from a coding workshop and a phone review of practice ratings completed by the coder. Detailed feedback was provided regarding the reasons for the correct scores on each item. The primary coder then trained the secondary coder using a similar training protocol. The primary coder scored all 73 files, and the secondary coder scored 19% (14 cases) of the files for the purposes of calculating interrater reliability (see Results). In all other analyses, only scores coded by the primary coder were used.
Convergent/Discriminant Evidence
Several scales were used to examine convergent and discriminant evidence for Scales 1 and 2 and the Total Score. There were no obvious measures (from those available in the parent study) for examining convergent or discriminant validity for Scale 3. As discussed under Results, Scale 4 could not be reliably coded from the available data for this study, precluding validity analyses.
Concerning sexual behavior
Convergent evidence for Scale 1 and the Total Score was examined by comparing scale scores with scores on the youth- and caregiver-report versions of the Adolescent Clinical Sexual Behavior Inventory (ACSBI; Friedrich, Lysne, Sim, & Shamos, 2004; Wherry, Berres, Sim, & Friedrich, 2009). The ACSBI is a 45-item instrument that measures inappropriate or concerning sexual behaviors. The ACSBI has demonstrated good reliability (α = .84 for caregiver report and .86 for child report) and convergent validity with clinical samples of sexually abused and nonabused youth (Friedrich et al., 2004). For purposes of the present study, two of five ACSBI subscales were examined: Divergent Sexual Interests and Sexual Risk/Misuse. The youth and caregiver report versions of the Divergent Sexual Interest scale include nine and five items, respectively, with some items that appear to align with Scale 1 (e.g., “has been accused of sexually abusing another person”). The youth and caregiver report versions of the Sexual Risk/Misuse scale include 8 and 10 items, respectively, with some items that appear to align with Scale 1 (e.g., “pushes others into having sex”). Few youth or caregivers endorsed items and therefore scale scores were combined. It was hypothesized that Scale 1 and Total Score would be positively correlated with the composite ACSBI scale scores (the sum of Divergent Sexual Interests and Sexual Risk/Misuse).
Antisocial behavior
Convergent evidence for Scale 2 and the Total Score was examined by comparing scale scores with the Externalizing T-scores of the caregiver-reported Child Behavior Checklist and the parallel Youth Self-Report (CBCL and YSR; Achenbach & Rescorla, 2001). There is ample research evidence supporting the internal consistency (α ≥ .90 for the internalizing and externalizing scales; Achenbach et al., 2008), and test-criterion validity of these instruments (e.g., Achenbach, 2005; Ebesutani, Bernstein, Martinez, Chorpita, & Weisz, 2011; Hudziak, Copeland, Stanger, & Wadsworth, 2004). Convergent evidence for Scale 2 and the Total Score also was assessed by comparing scale scores with the General Delinquency scale of the Self-Report of Delinquency Scale (SRD; Elliott, Ageton, Huizinga, Knowles, & Canter, 1983). The SRD is a well-validated instrument (Thornberry & Krohn, 2000), and the mean coefficient alpha across assessment points was .67 in the full parent study sample (Letourneau, Henggeler, et al., 2009). It was hypothesized that Scale 2 and Total Score would be positively correlated with baseline Externalizing and General Delinquency scores.
Internalizing behavior problems
Discriminant evidence for Scale 2 was assessed by comparing scale scores and the Internalizing T-scores from the caregiver-reported CBCL and youth-reported YSR. The Internalizing scale provides a measure of problem behaviors such as depression and anxiety. It was hypothesized that Scale 2 scores would not be significantly associated with Internalizing T-scores obtained at baseline.
Criterion Variables
Test-criterion evidence was examined for Scales 1 through 3 and the Total Score. As mentioned earlier, Scale 4 could not be reliably coded for this study, precluding validity testing of this scale and limiting the Total Score to the sum of Scales 1 through 3.
Concerning sexual behaviors
For Scale 1, the most relevant criterion would be sexual recidivism. However, across an average follow-up period of 29.52 months (SD = 8.28 months) the sexual rearrest rate was less than 3% for the full study sample, precluding predictive validity testing on this outcome. Instead, follow-up ACSBI composite scores were used to indicate ongoing concerning sexual behaviors. These scores do not equate to sexual recidivism, nor has the relationship between the ACSBI scores and recidivism been empirically demonstrated. Nevertheless, it seems reasonable to posit that scores on a measure of sexual risk would correlate positively with scores on a measure of concerning sexual behaviors. It was hypothesized that Scale 1 baseline scores would predict youth- and caregiver-reported ACSBI composite scores at follow-up. The follow-up ACSBI scores also were used as a test criterion for Scale 3 and the Total Score.
General recidivism
For Scale 2, the most relevant criterion is nonsexual recidivism and there were sufficient events across the follow-up period to support this assessment. Recidivism events were operationalized as new charges (regardless of final disposition) occurring after the date of study recruitment. Charges were identified from criminal justice records that included information from city, state, and federal criminal history reports. Charge labels were used to categorize recidivism events as sexual or nonsexual. It was hypothesized that Scale 2 scores would predict general rearrests and felony rearrests (e.g., see Caldwell et al., 2008 regarding serious reoffense). In addition, it was hypothesized that Scale 2 scores would predict follow-up SRD general delinquency scores. It was also predicted that Scale 3 and the Total Score would predict general rearrests and follow-up SRD scores.
Treatment length and completion
Because Scale 3 purports to reflect treatment readiness, we also examined whether this scale predicted treatment length and successful treatment completion. At the end of treatment, therapists from the parent study completed a brief form that indicated length of treatment and whether treatment was successfully completed. Data on treatment completion (but not treatment length) were missing for four youth. Psychometric properties for these therapist-report variables are unknown.
Analytic Plan
Evidence of reliability was assessed by determining internal consistency and interrater agreement. Specifically, Cronbach’s alpha coefficients and corrected item–total correlations were used to evaluate internal consistency. ICCs computed using a two-way random effects model assessing degree of absolute agreement were used to determine interrater agreement (McGraw & Wong, 1996).
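The reliability indices named above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the function names and the small data sets are hypothetical, and the ICC shown is the single-measure, absolute-agreement form from a two-way layout, which is an assumption because the text does not say whether single or average measures were reported.

```python
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cronbach_alpha(rows):
    """Cronbach's alpha; rows holds one list of item scores per youth."""
    k = len(rows[0])
    item_var_sum = sum(variance(col) for col in zip(*rows))
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - item_var_sum / total_var)


def corrected_item_total(rows):
    """Correlate each item with the sum of the remaining items."""
    out = []
    for j, col in enumerate(zip(*rows)):
        rest = [sum(r) - r[j] for r in rows]
        out.append(pearson(col, rest))
    return out


def icc_a1(ratings):
    """Single-measure, absolute-agreement ICC (two-way layout).

    ratings holds one list per youth with one scale score per rater.
    """
    n, k = len(ratings), len(ratings[0])
    grand = mean([v for row in ratings for v in row])
    row_means = [mean(row) for row in ratings]
    col_means = [mean(col) for col in zip(*ratings)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)               # between-youth mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical item scores (0-2) for five youth on a three-item scale.
items = [[0, 1, 1], [1, 1, 2], [2, 2, 2], [0, 0, 1], [2, 1, 2]]
print(cronbach_alpha(items))
print(corrected_item_total(items))

# Hypothetical scale scores from two raters for five youth.
pairs = [[4, 5], [7, 7], [2, 3], [9, 8], [5, 5]]
print(icc_a1(pairs))
```

Because the absolute-agreement form counts systematic rater differences as disagreement, a rater who scores every youth one point higher than a colleague lowers the ICC even though the two rank the youth identically.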
Evidence of concurrent and discriminant validity was assessed via Pearson’s correlations between J-SOAP-II scores and variables of interest described earlier. To determine test-criterion validity, variables with highly positively skewed distributions (i.e., ACSBI composite scale, SRD, and number of rearrests) were analyzed using negative binomial regression (Walters, 2007). All analyses using negative binomial regression models also controlled for exposure time (i.e., length of follow-up). Length of time in treatment was analyzed using linear regression. Successful treatment completion was analyzed using logistic regression. Models predicting follow-up ACSBI and SRD scores controlled for baseline scores.
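One common way to control for exposure time in a count model is to enter the log of follow-up length as an offset. The text does not state whether exposure was handled as an offset or as an ordinary covariate, so the specification below is an illustrative assumption, with $x_i$ standing for a J-SOAP-II scale score, $t_i$ for length of follow-up, and $\alpha$ for the negative binomial dispersion parameter (baseline covariates are omitted for brevity).

```latex
\log \mathbb{E}[Y_i] = \beta_0 + \beta_1 x_i + \log t_i,
\qquad
\operatorname{Var}(Y_i) = \mu_i + \alpha \mu_i^{2},
\quad \mu_i = \mathbb{E}[Y_i]
```

Under this form, $\exp(\beta_1)$ is the multiplicative change in the expected count per unit of follow-up time for each one-point increase in the scale score.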
For Scales 2 and 3 as well as the Total Score, test-criterion validity also was examined using Cox regression (Cox, 1972) and area under the curve (AUC) of the receiver operating characteristic curve (Hanley & McNeil, 1982) analytic methods. The Cox regression method assesses the relationship between predictors (scale scores) and outcomes (general and felony recidivism) while simultaneously accounting for differences in length of follow-up. Cases were followed through first rearrest or censored at the end of follow-up with the censoring mechanism assumed to be noninformative (Klein & Moeschberger, 2003). The AUC provides an indication of the strength of a measure’s predictive effect and is less likely to be influenced by base rates than other indices used to evaluate predictive accuracy (e.g., correlations, percentage of correctly classified cases; see Barbaree, Seto, Langton, & Peacock, 2001; Rice & Harris, 1995). The AUC represents the probability that a randomly selected recidivist would score higher than a randomly selected nonrecidivist. AUC values of .50 or lower indicate chance (or worse) prediction.
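The probabilistic reading of the AUC given above can be computed directly by comparing every recidivist with every nonrecidivist. The sketch below is illustrative only: the function names and scores are hypothetical, ties are counted as one half, and the standard error uses the approximation from Hanley and McNeil (1982), which is an assumption because the text does not say how the reported confidence intervals were obtained.

```python
def auc_pairwise(recidivist_scores, nonrecidivist_scores):
    """Share of recidivist/nonrecidivist pairs in which the recidivist
    has the higher score; tied pairs count as one half."""
    wins = 0.0
    for r in recidivist_scores:
        for n in nonrecidivist_scores:
            if r > n:
                wins += 1.0
            elif r == n:
                wins += 0.5
    return wins / (len(recidivist_scores) * len(nonrecidivist_scores))


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Approximate standard error of an AUC (Hanley & McNeil, 1982)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return var ** 0.5


# Hypothetical scale scores, not data from the study.
rearrested = [9, 12, 7, 11, 6]
not_rearrested = [5, 8, 7, 4, 10, 3]

auc = auc_pairwise(rearrested, not_rearrested)
se = hanley_mcneil_se(auc, len(rearrested), len(not_rearrested))
print(auc, se)
```

An AUC computed this way equals .50 when scores carry no information about group membership, which is why values at or below .50 are read as chance-level or worse.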
Results
Summary scores for all measures are presented in Table 3. Of note, mean scores for behavioral (e.g., ACSBI, SRD) and clinical (e.g., CBCL) scales indicated low baseline rates of problem behaviors or clinical symptoms. Thus, this was not a particularly delinquent or disordered group of youth. Results are presented for reliability evidence, convergent/discriminant evidence, and test-criterion evidence. Within each category, results are presented separately by J-SOAP-II scale or Total Score.
Table 3. Descriptive Results for J-SOAP-II Scales and Variables Used in Validity Analyses
Note: J-SOAP-II = Juvenile Sex Offender Assessment Protocol-II; ACSBI = Adolescent Clinical Sexual Behavior Inventory; CBCL = Child Behavior Checklist; YSR = Youth Self Report; SRD = Self-Reported Delinquency.
Limited analyses to those with at least one postrecruitment arrest (64% of sample).
Limited analyses to those with at least one postrecruitment felony arrest (38% of sample).
Reliability Evidence
Scale 1
Reliability evidence is presented in Table 4. As can be seen, the Cronbach’s alpha for Scale 1 was .65, below the recommended minimum of .70. Five of the eight items comprising Scale 1 had item–total correlations below .30. The ICC was in the acceptable range.
Table 4. Reliability of the J-SOAP-II
*p < .05. **p < .01.
Scale 2
The Cronbach’s alpha for Scale 2 exceeded .70 and no items had item–total correlations below .30. The ICC was in the excellent range.
Scale 3
The Cronbach’s alpha for Scale 3 was below .70, indicating inadequate internal consistency. Three of the seven items comprising Scale 3 had item–total correlations below .30. The ICC was in the acceptable range.
Scale 4
The Cronbach’s alpha for Scale 4 was .43, indicating inadequate internal consistency. Four of five items comprising Scale 4 had item–total correlations below .30. The ICC was very low (i.e., ICC = .07).
Total Score
The Cronbach’s alpha for the Total Score calculated using all four scales was .81, with a value of .79 for the Total Score using Scales 1 through 3, both of which are acceptable. The ICC for the Total Score using all four scales can be considered good, with the ICC for the Total Score using only Scales 1 through 3 falling in the excellent range.
Convergent/Discriminant Evidence
Correlation coefficients assessing convergent and discriminant evidence of validity are presented in Table 5. Correlations are based on baseline assessment results. Given the low internal consistency and interrater reliability results for Scale 4, validity testing was not conducted with this scale.
Table 5. Convergent and Discriminant Validity Indicators
Note: Pearson’s correlation coefficients presented. ACSBI Composite Scale = Adolescent Clinical Sexual Behavior Inventory composite of sexual risk/misuse and divergent interest scales. CBCL = Child Behavior Checklist. YSR = Youth Self-Report. SRD = Self-Reported Delinquency.
*p < .05. **p < .01.
Scale 1
With respect to convergent evidence, Scale 1 was significantly positively correlated with the baseline caregiver-reported ACSBI composite scale but not with the youth-reported scale. There were also significant positive correlations between Scale 1 and parent- and youth-reported Externalizing T-scores. All significant correlations had medium effect sizes (Cohen, 1988).
Scale 2
With respect to convergent evidence, Scale 2 was significantly correlated with baseline CBCL and YSR Externalizing scores, and baseline SRD scores. With respect to discriminant evidence, Scale 2 was not significantly correlated with baseline Internalizing scores. There was an unexpected significant positive correlation with the caregiver-reported ACSBI score. All effect sizes were again in the medium range.
Scale 3
No specific hypotheses were generated with respect to convergent/discriminant evidence for Scale 3; however, the scale was included in the correlation analyses. Results indicated that Scale 3 was not significantly correlated with any baseline measure.
Total Score
The Total Score (Scales 1-3) was significantly correlated with the ACSBI composite scale based on caregiver (but not child) report, as well as all measures of externalizing (CBCL Externalizing, YSR Externalizing, SRD General Delinquency). Unexpectedly, it was also associated with parent-reported internalizing symptoms.
Test-Criterion Validity
Scale 1
It was hypothesized that baseline Scale 1 scores would predict follow-up ACSBI composite scale scores. Results indicated that baseline Scale 1 scores were not significantly predictive of parent- or youth-reported ACSBI composite scores, after accounting for baseline ACSBI scores (Table 6).
Table 6. Regression Models for Criterion Validity Measures
Note: Adjusted R² and χ² for the full model are reported for Cox regression and linear regression models. ΔR² and Δχ² represent the improvement over a model including only the covariate. Age and treatment group were included in the first step as covariates for all analyses. If either covariate was not marginally related to the outcome at p < .10, it was dropped from the analysis. HR = hazard ratio (exponentiated parameter coefficient).
Scale 2
It was hypothesized that Scale 2 scores would predict follow-up SRD scores and general and felony arrests. Results of negative binomial regression models indicated that Scale 2 was not significantly predictive of follow-up SRD scores but was significantly positively associated with number of rearrests (exp[b] = 1.11, p < .01; see Table 6). This indicates that each one-point increase in Scale 2 scores is associated with an 11% increase in the expected number of rearrests. Results of Cox regression analyses indicated that Scale 2 scores were predictive of general rearrest and felony rearrest events (Table 6). The AUC results indicated that Scale 2 performed significantly better than chance in predicting felony rearrest (AUC = .64, p = .041, 95% CI [.51, .78]) but not general rearrest (AUC = .61, p = .116, 95% CI [.48, .74]).
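The percentage interpretation of an exponentiated coefficient follows from simple arithmetic, sketched below. Only the value 1.11 comes from the results reported above; the helper name and the five-point comparison are illustrative assumptions, and the compounding shown holds only under the log-linear form of the model.

```python
import math


def percent_change(rate_ratio, points=1):
    """Percent change in the expected count for a `points`-point increase."""
    return (rate_ratio ** points - 1) * 100


rate_ratio = 1.11            # exp(b) for Scale 2, as reported
b = math.log(rate_ratio)     # coefficient on the log scale

print(round(percent_change(rate_ratio), 1))      # 11.0
print(round(percent_change(rate_ratio, 5), 1))   # 68.5
```

Because the effect is multiplicative, a five-point difference in Scale 2 corresponds to roughly a 69% higher expected count rather than five times 11%.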
Scale 3
It was hypothesized that Scale 3 scores would predict follow-up ACSBI scores, general rearrest, and treatment length and completion. Results indicated that Scale 3 did not predict treatment completion, length of time in treatment, child-reported ACSBI composite score, self-reported delinquency, or number of rearrests. Scale 3 did predict caregiver-reported ACSBI composite scores at follow-up (Table 6), such that for each point increase in Scale 3 scores, ACSBI composite scores increased by 28%. Results of Cox regression (see Table 6) and AUC analyses (AUC = .61, p = .138, 95% CI [.47, .74]) indicated that Scale 3 did not predict general rearrest.
Total Score
It was hypothesized that the Total Score would predict follow-up ACSBI scores, general rearrest, and felony rearrest. The Total Score had a marginal positive predictive relationship with follow-up caregiver- and child-reported ACSBI composite scores and number of rearrests in negative binomial regression models. In addition, the Total Score had a marginal relationship with general rearrest (but not felony rearrest) in Cox regression analyses. Finally, results of AUC analyses indicated that the Total Score was not significantly predictive of either general (AUC = .60, p = .153, 95% CI [.47, .73]) or felony rearrest (AUC = .58, p = .262, 95% CI [.44, .72]).
Discussion
This article presented a review of existing evidence of reliability and validity for J-SOAP-II and results from an examination of the psychometric properties of the J-SOAP-II when coded based on information available at disposition. Across most studies including our own, only Scale 2 (Impulsive/Antisocial Behavior) demonstrated acceptable reliability and validity properties. In contrast, the psychometric properties of Scales 1, 3 and 4 as well as the Total Score were characterized by significant limitations.
Consistent with the existing literature, results from the current study do not support the use of Scale 1, the Sexual Drive/Preoccupation scale, in the context of juvenile sexual risk assessments conducted at the time of disposition. With respect to reliability, the internal consistency of Scale 1 from our study (.65) fell within the range of previously reported coefficients which varied between .56 and .77 (Table 1). These relatively low coefficients suggest that the domain content for youth Sexual Drive/Preoccupation is inaccurately and/or insufficiently sampled (Nunnally & Bernstein, 1994), a concern supported by the minimal predictive validity evidence identified for this scale. Interrater agreement was acceptable in the present study, although lower than found in previous research, suggesting that Scale 1 items might depend on the availability of treatment information for accurate coding. The present study was consistent with the literature in providing some support for the convergent validity of Scale 1, which correlated significantly with caregiver-reported ACSBI scores. Despite convergence, there is little evidence supporting the test-criterion validity, given the lack of relationship between Scale 1 and follow-up concerning sexual behaviors in the current study as well as the lack of relationship between Scale 1 and sexual recidivism in seven out of nine previously published studies. There is insufficient evidence to suggest that Scale 1 scores, and perhaps youth static sex offense characteristics more generally (Caldwell & Dickinson, 2009), are indicative of future sexual recidivism risk.
As noted, Scale 2 (Impulsive/Antisocial Behavior) was found to have acceptable performance across nearly all psychometric properties examined in the current study and results generally were within the range of previous findings. The internal consistency and interrater reliability for Scale 2 in the present study were good to excellent and within the range of previously reported findings (Table 1). Scale 2 scores were correlated with related measures (e.g., externalizing problems and delinquency) in the present study and have previously been found to be correlated with general delinquency risk measures. Scale 2 was also significantly positively correlated with the caregiver-reported ACSBI composite scale, which likely reflects the association between delinquency and risky sexual behaviors (e.g., Aalsma, Tong, Tenkit, & Tu, 2008; Mason et al., 2010). The lack of relationship between Scale 2 and internalizing problems provides discriminant evidence. Consistent with prior research, Scale 2 also demonstrated strong evidence of test-criterion validity in the present study. In particular, the AUCs for the relationship between Scale 2 and general and felony rearrest were within the range found in the broader literature regarding general delinquency risk measures (Schwalbe, 2007), although only the AUC for the prediction of felony rearrest was significantly higher than chance.
Results from the current study provide minimal support for the use of Scale 3 (Intervention) in evaluations of juvenile sexual recidivism risk conducted at disposition, prior to treatment. The low internal consistency found in the present study contrasts with prior studies, although the ICC was within the range reported in previous studies. Scale 3 was not significantly associated with any baseline measures, which could indicate lack of concurrent validity or reflect a study limitation (i.e., a lack of relevant comparison measures). The significant association between Scale 3 and follow-up caregiver-reported ACSBI composite scores indicates that this scale is predictive of relevant behaviors, though we could not examine its relationship with sexual recidivism specifically. This predictive relationship is consistent with other studies reporting significant associations between Scale 3 and sexual recidivism (Caldwell et al., 2008; Martinez, Flores, & Rosenfeld, 2007; Prentky et al., 2010). However, Scale 3 was not associated with youth-reported ACSBI follow-up scores or indicators of general recidivism. Prentky and colleagues (2010) noted that Scale 3 is “typically coded when there has been exposure to sex offense-specific treatment” (p. 29). The dynamic items comprising Scale 3 may be especially sensitive to the availability of treatment records for scoring purposes. Given the significant prediction of problematic sexual behaviors and prior mixed findings regarding the test-criterion validity of Scale 3, more research on Scale 3 when coded prior to treatment is warranted.
In the present study, Scale 4 (Community Stability/Adjustment) performed particularly poorly. The scale had low internal consistency as well as interrater reliability, consistent with some of the available literature (Aebi, Plattner, Steinhausen, & Bessler, in press; Martinez et al., 2007). Accurate scoring of the dynamic items comprising Scale 4 may depend on the availability of treatment records.
Finally, the current study provided limited support for the Total Score, although it was based only on Scales 1 through 3. Although there was evidence for the reliability and convergent validity of the Total Score, this score was not significantly related to any test-criterion variables. These results might reflect a true lack of predictive validity or could be the result of study limitations: the exclusion of Scale 4 may limit the predictive power of the Total Score and the small sample may have resulted in marginal (rather than significant) findings. The available literature is quite mixed regarding the predictive validity of the Total Score, with three previous studies reporting no relationship with any type of recidivism and four reporting some evidence of predictive validity for sexual and/or general recidivism. Research directly comparing outcomes for the three versus four scale versions of this score might help clarify these discrepancies.
Study Limitations
Several limitations of the current study deserve mention. First, only one of the two raters whose scores informed analyses of interrater agreement completed formal scoring training by a developer of the J-SOAP-II. The informal training of the secondary rater could have influenced interrater reliability results, although the fact that ICC values for three of the scales were within the range of other studies would seem to argue against this hypothesis. Second, raters followed the recommendation of Prentky and Righthand (2003) and scored items conservatively when presented with limited information. This reasonable strategy could have had the effect of restricting the range of item and scale scores, which could negatively influence estimates of validity. Third, we were unable to use the most relevant test criterion (sexual recidivism) for Scale 1 and instead relied on a measure of concerning sexual behaviors. Nevertheless, the failure of Scale 1 to predict follow-up ACSBI composite scores is consistent with the majority of the existing literature that also failed to support the predictive validity of this scale. Fourth, the participants in the current study were relatively low risk (see Table 3), which could influence both reliability and validity results by restricting the range of scores and which limits generalizability to similar lower risk samples. Fifth, the sample was relatively small and was followed for a relatively brief length of time. Larger samples followed over longer periods would likely improve psychometric testing by including a wider variety of youth characterized by higher levels of risk and by including a higher number of recidivists due to the longer time at risk. Sixth, most evaluators relying on the J-SOAP-II to inform risk evaluations conducted predisposition have the added benefit of information derived from clinical interviews, which may improve reliability and validity of the scales.
Although research on the J-SOAP-II is commonly conducted based on file reviews (e.g., Parks & Bard, 2006; Viljoen et al., 2008), there is no research comparing scoring based on file reviews with scoring that also includes a clinical interview (as exists for some forensic assessment measures; e.g., Grann, Långström, Tengström, & Stålenheim, 1998; Leistico, Salekin, DeCoster, & Rogers, 2008). Research regarding the psychometric properties of the J-SOAP-II when coded prior to treatment based on both records and clinical interview would be a valuable addition to the literature. Finally, recent research highlights the importance of investigating the predictive validity of risk-assessment measures in different subgroups; specifically, Rajlic and Gretton (2010) found that the J-SOAP-II was significantly predictive of sexual recidivism in juveniles with no prior delinquent offenses but not in juveniles with a delinquent history. The current sample was too small to conduct analyses by subtype, but future research should include such analyses whenever possible.
Policy and Practice Implications and Future Directions
The available evidence fails to support the use of Scale 1 to predict sexual recidivism. Although strong support exists for Scale 2 to predict general and felony recidivism, there is mixed evidence regarding the construct and test-criterion validity of Scale 3 and the Total Score, and insufficient or mixed evidence with respect to the psychometric properties of Scale 4. Until the psychometric properties of the J-SOAP-II are more consistently supported by empirical evidence, evaluators should not base significant decisions, such as opinions regarding registration and notification or youth confinement status, on J-SOAP-II results. Moreover, mental health professionals conducting predisposition evaluations should proceed with great caution when interpreting J-SOAP-II scores as part of broader risk assessments. Even when the J-SOAP-II is only one source informing clinical judgment, evaluators have been unable to produce valid estimates of risk (Elkovitch, Viljoen, Scalora, & Ullman, 2008). Based on results of the present and previous studies, future research efforts should endeavor to clarify not only the psychometric properties of the J-SOAP-II but also how varying assessment conditions influence these properties. In particular, reliable coding of Scales 3 and 4 might depend on the availability of treatment reports or other detailed sources of information that typically are provided only after a youth has been in treatment, placement, or under supervision. Furthermore, although results from the current and previous studies support the predictive validity of Scale 2, evaluators have access to well-validated general delinquency risk instruments with equivalent (or better) predictive validity and larger empirical bases of support (e.g., Schwalbe, 2007).
We recognize that the recommendation to be very cautious in applying the J-SOAP-II to inform critical decisions places clinicians in a bind, given increased requests for juvenile sexual risk evaluations (Chaffin, 2008). In the absence of formal risk measures, such evaluations could result in judicial decisions based on prosecutor’s recommendations or inappropriate assessment tools (Vitacco et al., 2009). However, we concur with Vitacco and colleagues (2009) who recommend that evaluators responding to queries for sexual recidivism risk assessments focus on short-term risk, acknowledge the fluid nature of both risk and sexuality in juvenile populations, highlight the low base rate of sexual recidivism as well as the positive response to treatment demonstrated in the JSO literature, and focus on the juvenile’s social context in addition to individual risk factors. These recommendations are sound. Until existing or new instruments are better validated, evaluations in this context will remain a complex balancing act between the need to provide the courts and other stakeholders with useful information and the serious limitations in empirically based knowledge about sexual risk.
Footnotes
Acknowledgements
The authors sincerely thank the many families who participated in this project. The success of this study depended upon close collaboration with the State’s Attorney’s Office, the Circuit Court, and the Juvenile Probation and Court Services in the county in which this study was conducted. They also thank Jennifer Smith Powell and Kelly Bolger for their assistance.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This manuscript was supported by grant R01MH65414 from the National Institute of Mental Health.
