Applicability and Efficiency of a Computerized Adaptive Test for the Washington Assessment of the Risks and Needs of Students

Abstract

The Washington Assessment of the Risks and Needs of Students (WARNS) is a computer-based assessment created to help courts, schools, and youth service providers determine an adolescent’s risks and needs that may lead to truancy, dropping out of school, or delinquency. Users are advised to consider the WARNS total score when working with youth. A total score estimate based on fewer items than the full item set may reduce respondent burden, administration time, and fatigue without hindering accurate decisions. This simulation study examined the applicability and efficiency of a computerized adaptive test (CAT) to estimate a WARNS total score under a unidimensional item response theory model. The results demonstrate that the CAT provides an accurate estimate of students’ risks and needs and reduces the number of items administered for each examinee compared with the existing version. Future directions and limitations of CAT development with the WARNS are discussed.

Keywords

computerized adaptive test, item response theory, risk assessment, screening, WARNS

Prevention science addresses problems that individuals experience due to maladaptive behaviors (e.g., substance use, suicide, and delinquency) before the behaviors occur. This is typically accomplished by employing psychology principles (De Matos et al., 2019) to inform the systems (e.g., schools, organizations) in which an individual is engaged. As these maladaptive behaviors can result in severe tragedies for youth, early intervention is suggested (De Matos et al., 2019). To aid early intervention, the potential risk of maladaptive behaviors and social and emotional needs should be assessed. Many assessments exist for this purpose, including the (a) Beck Scale for Suicide Ideation (Beck et al., 1979), (b) Youth Level of Service/Case Management Inventory (Hoge & Andrews, 2002), (c) Washington Assessment of the Risks and Needs of Students (WARNS; George et al., 2015), and (d) Problem-Oriented Screening Instrument for Teenagers (POSIT; Rahdert, 1991), to name a few. Existing measures require individuals to respond to many items, requiring much time for the youth to complete the assessment. Shorter assessments may be preferred in settings such as schools, as time and personnel resources are limited. Computerized adaptive testing (CAT) can assist in creating shorter versions of assessments without sacrificing the accuracy of information (De Beurs et al., 2014; Flens et al., 2016; Hol et al., 2008).

CAT has been applied to several measures developed for assessing maladaptive behaviors (e.g., Butler et al., 2017; De Beurs et al., 2014; Latimer et al., 2014; Thimm, 2020), and with rating scales commonly used in personality assessments (Hol et al., 2005; Magis et al., 2017). The primary purpose of applying CAT to such measures has been to reduce the number of administered items to save administration time and decrease the respondent burden, without hindering accuracy of the measured trait, as well as improving standard error of estimate (SEE) for ability levels on both sides of the ability continuum where high SEE is seen under a traditional fixed-length test.

Applicability and efficiency of CAT can differ depending on the assessment (e.g., the characteristics of item pool) and participants (e.g., ability distribution). The same stopping rule for terminating the CAT for an individual for different assessments (e.g., anxiety, depression, motivation, panic disorder) does not result in the same average item reduction or correlations between CAT and full assessment scores (e.g., Gibbons et al., 2012; Hol et al., 2007; Sunderland et al., 2017; Walter et al., 2007). However, in a live CAT environment, it is impractical to continually adjust stopping rules to determine which performs best. Thus, simulation provides an opportunity to examine the effectiveness of different stopping rules for CAT administrations (Magis et al., 2017).

The WARNS is a self-report instrument used to assess student risks and needs linked to truancy, delinquency, and dropping out of school. Six subdomains assessed include the following: Aggression–Defiance, Depression–Anxiety, Substance Abuse, Peer Deviance, School Engagement, and Family Environment. The WARNS was developed in Washington State in response to court administrators’ need for information to guide decision making about youth who had been served with a court petition (George et al., 2015). The development of the WARNS was related to legislation known as the Becca bill. The bill was formed in response to Rebecca Hedman’s parents’ discovery of a lack of resources from the school system, court system, and community to assist their daughter, who was truant due to drug and prostitution problems, and eventually was murdered. The WARNS was intended to assist the truancy process and serve to facilitate conversations with the student and the education and court systems to aid support services and reduce negative outcomes. This context of use is shifting, given changes to policy in Washington, where assessments such as the WARNS are to be used at an earlier time in student development. Moreover, use of the WARNS is occurring beyond Washington, pushing its use beyond the boundaries of original intended uses (Gotch & French, 2020).

In the WARNS computer-based version, the risks and needs scores are automatically calculated by each subdomain and a general risks and needs domain (George et al., 2015). Users are advised to consider the WARNS total score with other information, as reliability is higher for the total score compared with the subdomain scores (Strand et al., 2019). Moreover, classification of risk is based on the total score. The subdomain scores may be used to understand the reasons for the total risks and needs scores in conversations with the student. For students who are not at risk, there is no need to examine subdomain scores for a deeper understanding of issues, as these students most likely have low scores across all six subdomains and typically are categorized into a low-risk profile (Iverson et al., 2018). Therefore, determining whether a youth has a low-risk score based on the total score may save time and decrease the respondent burden and save school resources. Thus, a CAT may quickly screen if a student has a risk level that deserves completion of the full WARNS for a complete understanding of the student’s risks and needs.

Present Study

Although other risk assessments exist, we focus on the WARNS because it (a) is perhaps the most comprehensive in terms of covering the domains most associated with negative outcomes for youth, (b) provides in-depth information across several domains, (c) has a well-articulated theoretical framework, and (d) has a comprehensive validity argument supported by claims and evidence compared with other existing measures (Gotch & French, 2020). Creating an operational CAT to examine if a CAT is applicable and efficient for a test under different conditions might be impractical, as real data do not support the manipulation of study conditions (Feinberg & Rubright, 2016), such as a stopping rule. Therefore, CAT simulations can assess the efficiency of different conditions to inform operation (e.g., Butler et al., 2017; Flens et al., 2016; Hol et al., 2008; McClarty, 2006; Waller & Reise, 1989).

The purpose of this simulation study was to investigate if the WARNS total scores could be estimated with fewer items using a CAT environment, compared with the current computer-based WARNS, without degrading score accuracy. This simulation study can (a) help determine the parameters (e.g., stopping rules) in a WARNS CAT that lead to optimal performance and (b) continue to inform the CAT personality literature of the appropriateness of CAT in another behavioral domain. Therefore, we performed a CAT simulation with a stopping rule criterion based on the commonly used SEEs. We also examined the most appropriate stopping rule for the study purpose: the best reduction rate without degrading score accuracy.

Method

Measure

The WARNS comprises 40 items, with five to nine items per subdomain, answered on a 4-point rating scale (0 = never or hardly ever, 1 = sometimes, 2 = often, and 3 = always or almost always). Once students complete the assessment, scores are used by the school personnel, student service providers, or court system to inform conversations about the risks and needs of the student and a course of action for the student. The WARNS has support for its bifactor structure (Strand et al., 2019), item and test-level invariance across groups (Alpizar et al., 2020; French & Vo, 2020), and reliability of scores (Gotch & French, 2020), with a focus on the general score for decisions.

Unidimensionality

To support CAT use in estimating a WARNS total score, evidence of essential unidimensionality is required. A sufficient unidimensional structure can be supported for a bifactor model when common pattern coefficients are greater than 0.3 (McDonald, 1999), as with WARNS (Strand et al., 2019). We inspected the scree plot from an exploratory factor analysis, which supported a dominant first factor, with a second factor accounting for less than 10% of variance. In addition, omega (.98) and omega_hierarchical (.83) estimates suggested that 83% of the variance of unit-weighted total scores can be attributed to the individual differences on a general factor, and there is a strong correlation (.91) between the general factor and the observed total score (Rodriguez et al., 2016). This indicates that about 15% of the reliable variance in total scores can be due to the multidimensionality related to the specific factors, and only about 2% of this variance is attributable to random error. Thus, recognizing few tests are truly unidimensional (Nandakumar, 1991), our evidence supports the essential unidimensionality assumption in item response theory (IRT; e.g., Stout, 1990) and that the minor dimensions outside the primary dimension of risks and needs likely have negligible consequences on parameter estimates (e.g., Anderson et al., 2017). As the WARNS is used in practice with a total score, it is reasonable to interpret the scores as an essentially unidimensional reflection of student risks and needs, even in the presence of multidimensionality as indicated by the minor specific factors.
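The variance decomposition reported above follows directly from the two omega coefficients. A minimal arithmetic sketch in Python is given here; the variable names are illustrative and are not part of the original analysis.

```python
import math

omega_total = 0.98         # reported omega
omega_hierarchical = 0.83  # reported omega hierarchical

# Share of total-score variance attributed to each source.
general_share = omega_hierarchical
specific_share = omega_total - omega_hierarchical  # reliable, but due to specific factors
error_share = 1.0 - omega_total                    # random error

# Correlation between the general factor and the observed total score.
general_total_corr = math.sqrt(omega_hierarchical)

print(f"general factor:   {general_share:.2f}")
print(f"specific factors: {specific_share:.2f}")
print(f"random error:     {error_share:.2f}")
print(f"corr(general, total score): {general_total_corr:.2f}")
```

The three shares sum to 1, and the square root of omega hierarchical reproduces the reported correlation of .91.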

Procedure

CAT Simulation

A one-factor simulation study with eight SEE stopping rules (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and no stopping) was conducted in catR (Magis & Raîche, 2012). The no stopping rule condition was applied to compare the WARNS CAT with the WARNS used in practice (i.e., all WARNS items administered). The catR package requires item and person parameters to conduct the CAT simulations. To obtain these parameters, the best-fitting IRT model was determined by examining model-data fit statistics in IRTPRO Version 4.2 (Cai et al., 2011) using live student data based on the most commonly used polytomous IRT models, including (a) the graded response model (GRM; Samejima, 1969), (b) the generalized partial credit model (Muraki, 1992), (c) the rating scale model (Andrich, 1978), and (d) the partial credit model (Masters, 1982). The live data consisted of 4,081 (female = 46.2%) high school students ages 13 to 18 years in the State of Washington. Students completed the WARNS as a part of the school counseling service. Data were stored anonymously to protect student information. The student response data were used to estimate item and person parameters to inform the simulation study.

To prevent bias caused by using the same sample both in estimating item parameters and in CAT simulations, the sample was randomly divided into two subsamples of 1,000 and 3,081 participants. Item parameters were estimated using the responses of the 1,000 participants. Based on the estimated item parameters, thetas for the remaining sample were estimated for use in the simulations. The thetas were approximately normally distributed (i.e., M = −0.03, SD = 1.01; skewness = −0.280, kurtosis = 0.303). Thus, 3,081 true theta values and their real responses to WARNS items were assigned to simulees. Several decisions were made to apply the CAT simulation procedures, including (a) a starting point, (b) methods for selecting the next items, (c) estimating simulees’ theta levels (θ), and (d) stopping criteria. The catR package (Magis & Raîche, 2012) was used for all these steps since the recent version (3.16) allows researchers to conduct CAT simulations with the given IRT models (Magis & Barrada, 2017).

The most informative item at the mean of the ability continuum (0) was selected as an initial item for all simulees in all conditions. After each item administration, simulees’ theta levels were estimated based on the most current response patterns using the expected a posteriori (EAP) estimation method (Bock & Mislevy, 1982). As EAP is efficient to compute scores for polytomous IRT models, does not need iteration, and can estimate trait levels even with initial responses to highest or lowest response categories (Embretson & Reise, 2013; Linden & Glas, 2007), it was chosen to estimate theta values. In selecting the next most informative items for the current theta estimates, maximum Fisher information (MFI) was used, which is widely used in CATs with polytomous items (e.g., De Beurs et al., 2014; Walter & Holling, 2008). Each item was chosen from any subdomain as long as it was not previously administered to the same simulee, and it was the most informative item in the currently estimated θ level. Any trait estimator and item selection criteria can be used together for a given item pool, but one of the most commonly used combinations is the EAP estimator with the MFI item selection method (e.g., Bulut & Kan, 2012; Lian et al., 2020; Tan et al., 2018). MFI is the only method currently available in catR for polytomous IRT models (Magis & Raîche, 2012). The item selection and ability estimation procedures continued until termination criteria were met or all WARNS items were administered. A fixed-length stopping rule was not considered, given it would not meet the purpose of the study. After item administration terminated, for each simulee, a final theta level was estimated using the EAP method.
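The administration cycle described above (start at θ = 0, select by MFI, re-estimate by EAP, stop on SEE) can be sketched in Python. This is an illustrative sketch only and not the catR implementation used in the study: the quadrature grid, the standard normal prior, the use of the posterior standard deviation as the SEE, the six-item pool taken from Table 2, and the response pattern are all assumptions made for the example.

```python
import numpy as np

# Quadrature grid and standard normal prior for EAP (both are assumptions).
GRID = np.linspace(-4.0, 4.0, 81)
PRIOR = np.exp(-0.5 * GRID**2)

def cumulative(theta, a, b):
    """P(X >= k) for k = 0..K under the GRM, padded with 1 and 0."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    star = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    ones = np.ones((theta.size, 1))
    return np.hstack([ones, star, 0.0 * ones])

def grm_probs(theta, a, b):
    """Category probabilities, shape (len(theta), K)."""
    cum = cumulative(theta, a, b)
    return np.clip(cum[:, :-1] - cum[:, 1:], 1e-12, None)

def grm_info(theta, a, b):
    """Fisher information of one GRM item at each theta."""
    cum = cumulative(theta, a, b)
    slope = a * cum * (1.0 - cum)          # derivative of each cumulative curve
    d_prob = slope[:, :-1] - slope[:, 1:]  # derivative of category probabilities
    return np.sum(d_prob**2 / grm_probs(theta, a, b), axis=1)

def eap(answers, items, a, b):
    """Posterior mean and posterior SD (used here as the SEE) on the grid."""
    log_post = np.log(PRIOR)
    for j, x in zip(items, answers):
        log_post = log_post + np.log(grm_probs(GRID, a[j], b[j])[:, x])
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    mean = float(np.sum(GRID * post))
    sd = float(np.sqrt(np.sum((GRID - mean) ** 2 * post)))
    return mean, sd

def run_cat(responses, a, b, see_stop=None):
    """Administer items by MFI until SEE < see_stop or the pool is exhausted."""
    theta, see, used, answers = 0.0, float("inf"), [], []
    while len(used) < len(a):
        remaining = [j for j in range(len(a)) if j not in used]
        info = [grm_info(theta, a[j], b[j])[0] for j in remaining]
        nxt = remaining[int(np.argmax(info))]
        used.append(nxt)
        answers.append(int(responses[nxt]))
        theta, see = eap(answers, used, a, b)
        if see_stop is not None and see < see_stop:
            break
    return theta, see, used

# Six items from Table 2 (Sad, Hopeless, Care, UseDrugs, Like School, Studied).
a = np.array([1.701, 1.750, 2.072, 2.135, 1.103, 1.079])
b = np.array([[-0.737, 0.555, 1.607], [-0.182, 1.034, 1.874],
              [0.322, 1.354, 2.161], [2.368, 3.019, 3.635],
              [-2.203, -0.301, 1.864], [-2.698, -0.847, 1.388]])
responses = [1, 0, 0, 0, 2, 1]  # hypothetical answers, one per item

theta, see, used = run_cat(responses, a, b, see_stop=0.6)
print(f"theta = {theta:.2f}, SEE = {see:.2f}, items used = {used}")
```

Passing `see_stop=None` mimics the no stopping condition, in which every item in the pool is administered.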

Outcome Measures

Precision

The WARNS CAT precision was assessed for each condition based on marginal reliability (MR; Equation 1) calculated using the mean SEE. The conditions with MR over .80 were considered sufficient for CAT use (e.g., Butler et al., 2017). In Equation 1, $\hat{\theta}$ represents the estimated theta, $\sigma^{2}_{\hat{\theta}}$ corresponds to the variance of the estimated thetas, $\bar{\rho}$ is the MR, and $\bar{\sigma}^{2}_{e}$ is the square of the mean standard error.

$$\bar{\rho} = \frac{\sigma^{2}_{\hat{\theta}} - \bar{\sigma}^{2}_{e}}{\sigma^{2}_{\hat{\theta}}}$$

(1)
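As an illustration of Equation 1, the following Python sketch computes MR from a vector of theta estimates and their SEEs. The function name and the simulated inputs are hypothetical; the sample variance (`ddof=1`) is an assumption, as the variance estimator used in the study is not stated.

```python
import numpy as np

def marginal_reliability(theta_hat, see):
    """Equation 1: (var of theta estimates - squared mean SEE) / var of theta estimates."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    see = np.asarray(see, dtype=float)
    var_hat = theta_hat.var(ddof=1)   # sample variance (assumption)
    mean_see_sq = see.mean() ** 2     # square of the mean standard error
    return float((var_hat - mean_see_sq) / var_hat)

# Hypothetical inputs: simulated estimates and a constant SEE of 0.39.
rng = np.random.default_rng(0)
theta_hat = rng.normal(0.0, 1.0, size=3081)
see = np.full(3081, 0.39)
print(f"MR = {marginal_reliability(theta_hat, see):.2f}")
```

With a variance of estimated thetas near 1, a mean SEE of about 0.45 or less keeps MR above the .80 criterion, since 1 − 0.45² ≈ .80.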

Efficiency

To assess which stopping rule was more efficient for WARNS CAT, (a) the average number of administered items, (b) the differences in the number of administered items across the theta levels, and (c) the proportion of simulees that satisfied the stopping criteria were examined for each simulation condition. Stopping conditions with fewer items administered on average, especially for the lower risks and needs groups, and higher stopping criteria satisfaction proportion were considered more efficient.

Accuracy

In the investigation of how accurately the WARNS CAT estimated simulees’ theta levels, Pearson’s correlation coefficient was calculated between theta estimates of WARNS CAT simulations and raw scores used in practice. A correlation over .85 can be considered sufficient for CAT use (e.g., Butler et al., 2017). Also, the correlation between true thetas and estimated thetas from each simulation condition was examined to support the accuracy of WARNS CAT true theta estimation. To investigate the accuracy of the estimated thetas from the WARNS CAT simulations, bias and root mean square error (RMSE) were examined for true thetas versus estimated thetas from CAT simulations for each condition, separately. As bias and RMSE values get closer to zero, score accuracy increases. True thetas were symbolized $\theta_{true}$, and the estimated thetas from the different CAT conditions were symbolized $\hat{\theta}_{i}$. Bias (Equation 2) is a measure of the average difference between true and estimated thetas for each stopping condition, whereas the RMSE (Equation 3) provides the square root of the average squared difference between true and estimated thetas for each stopping condition.

$$\mathrm{Bias}_{true} = \frac{\sum_{i=1}^{n} (\hat{\theta}_{i} - \theta_{true})}{n}$$

(2)

$$\mathrm{RMSE}_{true} = \sqrt{\frac{\sum_{i=1}^{n} (\hat{\theta}_{i} - \theta_{true})^{2}}{n}}$$

(3)
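Equations 2 and 3 can be expressed in a few lines of Python. The function names and the three-simulee example are hypothetical and serve only to show the computation.

```python
import numpy as np

def bias(theta_hat, theta_true):
    """Equation 2: mean signed difference between estimated and true thetas."""
    diff = np.asarray(theta_hat, dtype=float) - np.asarray(theta_true, dtype=float)
    return float(diff.mean())

def rmse(theta_hat, theta_true):
    """Equation 3: square root of the mean squared difference."""
    diff = np.asarray(theta_hat, dtype=float) - np.asarray(theta_true, dtype=float)
    return float(np.sqrt(np.mean(diff**2)))

# Hypothetical values for three simulees.
theta_true = [0.0, 1.0, -1.0]
theta_hat = [0.5, 1.0, -1.5]
print(f"bias = {bias(theta_hat, theta_true):.3f}")
print(f"RMSE = {rmse(theta_hat, theta_true):.3f}")
```

In this example the positive and negative errors cancel, so bias is zero while RMSE is not, which is why the two indices are reported together.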

For a final decision of which simulation condition performed best, precision, efficiency, and accuracy of each condition were considered in combination based on the given criteria.

Results

The GRM (Samejima, 1969) was the best fitting IRT model for the live data based on the statistics of −2 log likelihood, Akaike’s information criterion (Akaike, 1974), and Schwarz’s Bayesian information criterion (Schwarz, 1978; Table 1). Thus, item parameters (Table 2) and person parameters (thetas) were estimated based on the best fitting IRT model (i.e., GRM).

Table 1.

Model-Data Fit Statistics for the WARNS Data Set.

IRT models	AIC	BIC	−2LL
GRM	285127.57	286138.14	284807.57
GPCM	285528.16	286538.73	285208.16
PCM	317085.28	317849.52	316843.28
RSM	322548.44	322820.03	322462.44

Note. The GRM, with the lowest values on all three statistics, is the preferred model. GRM = graded response model (Samejima, 1969); GPCM = generalized partial credit model (Muraki, 1992); PCM = partial credit model (Masters, 1982); RSM = rating scale model (Andrich, 1978); WARNS = Washington Assessment of the Risks and Needs of Students; AIC = Akaike’s information criterion; BIC = Schwarz’s Bayesian information criterion; −2LL = −2 log likelihood.
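As a check on how the fit indices in Table 1 relate to one another, the sketch below assumes the standard definition AIC = −2LL + 2k, where k is the number of estimated parameters, and recovers k for each model. The dictionary layout and variable names are illustrative.

```python
# (AIC, -2LL) pairs copied from Table 1.
fit = {
    "GRM": (285127.57, 284807.57),
    "GPCM": (285528.16, 285208.16),
    "PCM": (317085.28, 316843.28),
    "RSM": (322548.44, 322462.44),
}

# Under AIC = -2LL + 2k, the implied parameter count is (AIC - (-2LL)) / 2.
n_params = {model: round((aic - m2ll) / 2) for model, (aic, m2ll) in fit.items()}

# The preferred model is the one with the smallest AIC.
best = min(fit, key=lambda model: fit[model][0])

for model, k in n_params.items():
    print(f"{model}: implied k = {k}")
print(f"lowest AIC: {best}")
```

By this arithmetic the GRM and GPCM imply the same number of parameters (160, or four per item), so their difference in −2LL is not an artifact of model size.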

Table 2.

Item Parameters for the WARNS Based on the Graded Response Model.

Item	Discrimination (a)	Step parameter b₁	Step parameter b₂	Step parameter b₃
1. Like School^a	1.103	−2.203	−0.301	1.864
2. Fights	1.343	1.713	3.177	4.218
3. Close^a	1.166	−0.749	0.502	2.060
4. Homework^a	1.103	−0.119	1.027	2.633
5. Temper	1.564	0.366	1.542	2.521
6. Supported^a	0.857	−1.390	0.467	2.691
7. Cheer	1.476	−0.495	1.037	2.142
8. FriendsDrunk	1.480	0.547	1.483	2.483
9. Talk^a	0.835	−1.743	−0.304	1.610
10. Sad	1.701	−0.737	0.555	1.607
11. Sick	1.554	2.217	3.110	3.715
12. Worried	1.277	−0.834	0.566	1.886
13. Learned^a	0.772	−2.389	−0.388	2.484
14. FriendsArr	1.547	0.557	1.593	2.684
15. Argue	1.593	−0.485	1.087	2.368
16. Studied^a	1.079	−2.698	−0.847	1.388
17. Drank	1.557	1.914	3.056	3.795
18. Threatened	1.826	1.408	2.545	3.678
19. Sleeping	1.461	−0.446	0.648	1.737
20. Dropout	1.615	0.429	1.494	2.302
21. Lied	1.571	−0.184	1.589	2.732
22. CouldTalk^a	1.264	−0.748	0.096	1.532
23. Hopeless	1.750	−0.182	1.034	1.874
24. FriendTrouble	1.080	0.375	2.259	3.730
25. Tense	1.448	−0.262	1.050	2.163
26. HWcomplete^a	1.071	−2.778	−0.791	1.352
27. UseDrugs	2.135	2.368	3.019	3.635
28. FriendSkip	1.330	0.067	1.636	3.090
29. PickedOn	1.280	2.684	4.443	4.897
30. Nervous	1.369	0.219	1.447	2.488
31. Missed	2.092	2.056	2.790	3.170
32. Care	2.072	0.322	1.354	2.161
33. Smoked	1.565	1.166	1.997	2.539
34. Angry	1.635	0.591	1.942	2.837
35. Teachers^a	0.860	−1.900	0.174	2.263
36. Wanted	1.274	1.505	2.902	4.282
37. Classes^a	1.024	−2.920	−0.971	1.417
38. Damaged	1.752	1.723	2.796	3.911
39. FrFights	1.157	0.901	2.691	4.043
40. ParentHelp^a	0.939	−1.061	0.069	1.540

Note. WARNS = Washington Assessment of the Risks and Needs of Students. ^a Reverse-coded item.

Precision

The stopping conditions of 0.2, 0.3, and 0.4 had acceptable MRs of .93, .91, and .85, respectively. MR for the no stopping condition, corresponding to the current computer-based WARNS, was also .93. All other stopping conditions had MRs below our criterion (i.e., > .80). Mean SEEs were 0.26, 0.31, and 0.39 for the 0.2, 0.3, and 0.4 stopping conditions, respectively, and the no stopping condition had the same mean SEE as the 0.2 stopping condition. See Table 3 for details about precision results.

Table 3.

Precision: Marginal Reliability (MR) and Mean Standard Error of Estimate (SEE) of Each Stopping Condition.

Stopping rule	MR	M SEE (SD)
No stopping	.93	0.26 (0.06)
SEE < 0.2	.93	0.26 (0.06)
SEE < 0.3	.91	0.31 (0.03)
SEE < 0.4	.85	0.39 (0.01)
SEE < 0.5	.76	0.48 (0.01)
SEE < 0.6	.62	0.56 (0.03)
SEE < 0.7	.28	0.65 (0.03)
SEE < 0.8	.05	0.68 (0.05)

Efficiency

The average number of administered items was 10, 21, and 40 for the stopping conditions of 0.4, 0.3, and 0.2, respectively, whereas it was 5, 3, 2, and 2 for the stopping conditions of 0.5, 0.6, 0.7, and 0.8, respectively. The CAT simulations required a different number of items to estimate a WARNS total score for the simulees at the different locations of the risks and needs scale for each condition. As the risks and needs levels of the simulees decreased on the scale, more items were needed to estimate total WARNS scores. For most of the simulees with relatively lower risks and needs levels compared with the mean point of the scale, around 14 items were administered on average in the stopping condition of 0.4. However, this average rapidly increased for stopping conditions of 0.3 and 0.2 with roughly 30 and 40 items administered, as seen in Figure 1. The stopping rule satisfaction proportion was 100%, 95.72%, and 82.83%, for stopping conditions of 0.5, 0.4 and 0.3, respectively. However, this proportion rapidly decreased to 1.14% for the stopping condition of 0.2. For stopping conditions of 0.6, 0.7, and 0.8, all simulees satisfied the stopping rule, so item administration ceased before all items were administered for these conditions. See Table 4 for details.

Figure 1.

Differences in the number of administered items across theta levels for each simulation condition.

Table 4.

Efficiency: Mean Number of Administered Items (MNAI) and Stopping Rule Satisfaction Proportion (SRSP) for Each Stopping Condition.

Stopping rule	MNAI (SD)	SRSP (%)
No stopping	40 (0)	0
SEE < 0.2	40 (0.52)	1.14
SEE < 0.3	21.40 (9.69)	82.83
SEE < 0.4	10.06 (7.40)	95.72
SEE < 0.5	5.41 (2.95)	100
SEE < 0.6	3.24 (1.45)	100
SEE < 0.7	1.94 (0.82)	100
SEE < 0.8	1.58 (0.49)	100

Note. SEE = standard error of estimate.
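The item reduction implied by Table 4 can be computed directly from the mean number of administered items relative to the 40-item full form. The short Python sketch below uses the MNAI values from the table; the variable names are illustrative.

```python
FULL_LENGTH = 40  # number of items on the full WARNS

# Mean number of administered items (MNAI) per stopping rule, from Table 4.
mnai = {
    "SEE < 0.2": 40.00,
    "SEE < 0.3": 21.40,
    "SEE < 0.4": 10.06,
    "SEE < 0.5": 5.41,
}

# Percentage reduction in test length relative to the full form.
reduction = {rule: 100.0 * (FULL_LENGTH - m) / FULL_LENGTH for rule, m in mnai.items()}

for rule, pct in reduction.items():
    print(f"{rule}: {pct:.1f}% fewer items on average")
```

For the 0.4 stopping condition this gives a reduction of roughly three quarters of the items, compared with just under half for the 0.3 condition.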

Accuracy

The correlations between theta estimates of WARNS CAT and raw scores were >.85 for the stopping conditions of 0.2, 0.3, and 0.4. The correlations between true thetas and estimated thetas were also >.85 for the same conditions and, additionally, for the 0.5 condition. The other stopping conditions did not satisfy the desired correlation criterion (i.e., their correlations were <.85). Bias was close to zero for each condition, and RMSE increased steadily from .01 to .68 as the stopping rule was relaxed from 0.2 to 0.8. Table 5 provides full results for accuracy.

Table 5.

Accuracy: Correlation, Biases, and Root Mean Square Error (RMSE) for WARNS CAT Stopping Conditions.

Stopping rule	r (with true score)	r (with raw score)	Bias	RMSE
No stopping	1.00	.96	.00	.00
SEE < 0.2	1.00	.96	.00	.01
SEE < 0.3	.98	.92	.00	.20
SEE < 0.4	.94	.87	−.04	.36
SEE < 0.5	.88	.81	−.02	.49
SEE < 0.6	.83	.77	−.01	.57
SEE < 0.7	.77	.73	.01	.64
SEE < 0.8	.74	.70	.01	.68

Note. SEE = standard error of estimate.

Discussion

The purpose of this study was to examine the applicability and efficiency of a CAT on estimating students’ WARNS total risks and needs scores with fewer items without degrading score accuracy. The WARNS CAT was simulated using the item parameters estimated from 1,000 live student responses based on the GRM (Samejima, 1969). Eight stopping conditions were investigated to determine which CAT simulation condition performed well compared with the current computer-based WARNS, using 3,081 live student responses. The WARNS CAT simulation results suggested the best performing stopping conditions of 0.3 and 0.4 based on the aggregation of MRs, the number of administered items, stopping rule satisfaction proportion, the correlation between raw scores and estimates from each simulation condition, bias, and RMSE.

The differences in the number of administered items across the ability scale for simulees were related to the total information and standard error curves of the WARNS across the scale, as seen in Figure 2. As the WARNS CAT provided more information at the higher risks and needs levels of the scale with smaller error, generally fewer items were administered for the simulees at this end of the ability continuum compared with the lower risks and needs levels. Using more items for simulees at the lower risk levels of the scale increased the mean number of administered items. Additional items targeting the lower risks and needs levels would be required to decrease the average number of items administered, which can also enhance reliability (Flens et al., 2016).

Figure 2.

Total test information and standard error curves of the WARNS items.

Although most CAT simulation studies did not report the differences in the number of items used across the scale, we thought that reporting these differences would be necessary to determine the best performing stopping condition. Determining a student who is not at risk with fewer items was more important than determining a student at risk, as the WARNS system suggests users administer all items for at-risk students to calculate subdomain scores for better understanding of the risks and needs, if risk is elevated. Based on this consideration, the stopping condition of 0.4 performed better than the stopping condition 0.3, by estimating total risks and needs scores with fewer items for simulees at low levels of risks and needs, with little loss of measurement precision. The WARNS item pool was effectively used for the condition of 0.4, using all the WARNS items for different simulees.

Sireci et al. (1991) stated that MR based on IRT was comparable to the internal consistency reliability based on classical test theory (CTT). The MR of the current WARNS was .93 based on no stopping rule in the CAT simulation, which is in accord with the CTT-based internal consistency reliability estimate of the WARNS (.93) for the general factor. This consistency between both indices is evidence for the applicability of the WARNS CAT simulations. As MR calculations are based on averaging unequal variances of measurement error, some information is lost with these calculations (Sireci et al., 1991). Nevertheless, consistency between IRT and CTT based reliability indices for the WARNS supported the use of MR indices for examining the consistency of the WARNS CAT simulations.

To determine whether the WARNS CAT scores are accurate for use in practice, we must ensure that the scores obtained from the WARNS CAT are comparable with the scores used currently. This is one of the most important indicators of applicability of the WARNS CAT in practice. A high correlation (i.e., >.85) between raw scores and estimated WARNS CAT scores was a substantial indicator of score accuracy. For the stopping condition of 0.4, the correlation between raw scores and estimated scores was .87, which meets the criterion for score accuracy (e.g., Butler et al., 2017).

A practical implication of this work is that it allows a cut score to be suggested. The existing WARNS cut score was treated as an expected observed score and mapped onto the theta scale by tracing the test characteristic curve. In doing so, we arrived at an approximate cut score of a theta estimate equal to −0.5. This places about 40% of the sample into a category of low risks and needs. This suggested cut score is in accord with profiles that have been advocated for the WARNS (Iverson et al., 2018), where approximately 40% of students were identified as being in the low-risk profile. The caution with this suggested cut score is that for full implementation additional validity evidence is required. A main limitation of advocating for this cut score is that the original cut score was developed based on the criterion variables of arrests and suspensions (George et al., 2015), whereas score use currently is more in line with truant behaviors. Thus, the use of the suggested cut score may result in a high rate of false positives until additional data can be gathered and adjustments considered (e.g., Beuk, 1984). This seems low risk, given it will likely result in more conversations with students about what is happening in their lives and maybe help to build more connections to the school systems for some students.
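The mapping of a raw cut score onto the theta scale through the test characteristic curve (TCC) can be sketched as follows. This is a hedged illustration rather than the procedure actually run for the study: it uses only four items from Table 2, a hypothetical raw cut score of 3 on that four-item subset, and a simple bisection search. Under the GRM with categories scored 0 to 3, the expected item score equals the sum of the cumulative probabilities P(X ≥ k).

```python
import numpy as np

def tcc(theta, a, b):
    """Expected raw score at theta: sum over items and thresholds of P(X >= k)."""
    return float(np.sum(1.0 / (1.0 + np.exp(-a[:, None] * (theta - b)))))

def theta_at_score(cut, a, b, lo=-6.0, hi=6.0, tol=1e-8):
    """Bisection search for the theta at which the TCC equals `cut`."""
    if not tcc(lo, a, b) < cut < tcc(hi, a, b):
        raise ValueError("cut score is outside the range reached on [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tcc(mid, a, b) < cut:
            lo = mid   # the TCC is increasing, so the root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Four items from Table 2 (Sad, Hopeless, Care, Temper).
a = np.array([1.701, 1.750, 2.072, 1.564])
b = np.array([[-0.737, 0.555, 1.607], [-0.182, 1.034, 1.874],
              [0.322, 1.354, 2.161], [0.366, 1.542, 2.521]])

raw_cut = 3.0  # hypothetical cut on this four-item subset (maximum score is 12)
theta_cut = theta_at_score(raw_cut, a, b)
print(f"theta at raw score {raw_cut}: {theta_cut:.2f}")
```

Bisection is sufficient here because the TCC is monotonically increasing in theta; applying the same idea to the operational cut score would require all 40 items and the actual WARNS cut value.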

There were some limitations to this study. The first limitation was the sample used for calibration, which was limited to Washington and to students who, in general, were identified as needing to complete this assessment for an undisclosed reason (e.g., truant, general school screening, referred for services). A more heterogeneous sample is suggested to further investigate the comparability to the current WARNS. This limitation resulted in the inability to generalize the results to a national audience. Second, the WARNS cut score used to derive the CAT-based cut score proposed here was based on a sample of referred youth and derived from sensitivity and specificity values for the outcomes of arrests and suspensions in that sample. The sample used in this study differs from that original sample, as students in this sample completed the WARNS for a variety of reasons, including general screening for risks and needs and, for some, reasons associated with truant behavior. We do not have access to truancy levels, or other outcome variables, for this sample. Thus, additional research is needed to confirm the proposed cut score and its accuracy before full implementation. The third and final limitation of the study was that content balance was not a component of item selection. Content balance warrants administering items from each subdomain for tests with a multidimensional structure. As (a) essential unidimensionality was supported, (b) the number of items in each subdomain was limited, and (c) item exposure was not a concern for the WARNS, we did not integrate content balance into the CAT simulation. However, exclusion of content balance may favor subdomains with more informative items (i.e., high discrimination parameters). This can have a negative influence on measurement accuracy (e.g., Zheng et al., 2013). As the WARNS items’ discrimination values are similar across subdomains, accuracy may not be degraded. However, research should examine accuracy differences between content-balanced and non-content-balanced WARNS CAT scores.

These results can make important contributions to the CAT personality literature and, more importantly, to the practice of the WARNS assessment system. This study was critical in continuing to demonstrate that personality assessments with many items could be implemented with fewer items with the use of a CAT. It continues to build evidence that CAT can work well with personality measures. With less administration time and less burden for both students and test users, the WARNS CAT may provide a quick screening assessment for schools and districts. In addition, this study demonstrated that a CAT could be used in a CTT-derived personality test (e.g., Butler et al., 2017), yielding equal reliability and precision compared with conventional test models. Also, results indicated that the WARNS CAT could provide a 75% reduction in items for estimating total scores based on the general risks and needs factor with minimal loss of measurement precision. Another strength of the WARNS CAT was that all WARNS items were used for simulees in different locations on the ability scale, showing that all WARNS items were useful in estimating the WARNS total scores.

Future work is needed to overcome some limitations of this study and to provide supporting evidence for developing a WARNS CAT for use in practice. First, the WARNS CAT should be inspected for items that function differently across groups, as French and Vo (2019) found that six WARNS items did so. They noted that the number of these items and the magnitude of their effects were probably not enough to change decisions made using total scores calculated based on CTT (French & Vo, 2019). Nevertheless, such items might lead to inappropriate item selection for individuals during administration of the WARNS CAT. Second, it is recommended to create more items for the WARNS item pool. Additional items at the lower risk score levels may provide more information for individuals who are not at risk. Thus, a greater item reduction for the WARNS CAT could be obtained with higher measurement precision for the same stopping condition. Third, the midpoint of the 0.3 and 0.4 SEE conditions (i.e., 0.35) can be investigated as a stopping rule that may offer a better combination of psychometric quality and item reduction than the 0.4 condition for simulees who are not at risk. A 0.35 SEE stopping rule may be best for a live WARNS CAT in practice. Fourth and finally, cut scores need to be determined based on the current intended uses of the WARNS, including general screening for risks and needs and, for some students, screening associated with truant behavior.
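To make the stopping-rule comparison concrete, the following sketch shows how an SEE threshold governs test length in a simulated CAT. It is an illustration under stated assumptions, not the simulation reported here. It uses a dichotomous two-parameter logistic model instead of the polytomous model fitted to the WARNS, a hypothetical 40-item pool with randomly drawn parameters, EAP scoring on a fixed grid with a standard normal prior, and maximum-information item selection. The names `run_cat`, `eap`, and `approx_reliability` are invented for this example, and the reliability approximation assumes a latent trait with unit variance.

```python
import numpy as np

GRID = np.linspace(-4.0, 4.0, 161)   # quadrature points for theta
PRIOR = np.exp(-0.5 * GRID**2)       # standard normal prior (unnormalized)

def p_endorse(theta, a, b):
    """2PL probability of endorsing an item (simplifying assumption)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def eap(a, b, responses):
    """EAP estimate and posterior SD (used as the SEE) for 0/1 responses."""
    post = PRIOR.copy()
    for ai, bi, u in zip(a, b, responses):
        p = p_endorse(GRID, ai, bi)
        post *= p if u == 1 else 1.0 - p
    post /= post.sum()
    est = float((GRID * post).sum())
    sd = float(np.sqrt(((GRID - est) ** 2 * post).sum()))
    return est, sd

def run_cat(a, b, full_responses, see_stop):
    """Administer items by maximum information until SEE <= see_stop
    or the pool is exhausted. Returns (theta, see, items_used)."""
    remaining = list(range(len(a)))
    used = []
    theta, see = 0.0, float("inf")
    while remaining and see > see_stop:
        p = p_endorse(theta, a[remaining], b[remaining])
        info = a[remaining] ** 2 * p * (1.0 - p)
        nxt = remaining[int(np.argmax(info))]
        remaining.remove(nxt)
        used.append(nxt)
        theta, see = eap(a[used], b[used], full_responses[used])
    return theta, see, len(used)

def approx_reliability(see):
    """Approximate reliability implied by an SEE, assuming unit trait variance."""
    return 1.0 - see**2

# Hypothetical pool and one simulee; every value here is invented.
rng = np.random.default_rng(1)
a = rng.uniform(1.0, 2.5, 40)
b = rng.uniform(-2.0, 2.0, 40)
responses = (rng.random(40) < p_endorse(0.5, a, b)).astype(int)

for stop in (0.40, 0.35, 0.30):
    theta, see, n_items = run_cat(a, b, responses, stop)
    print(stop, n_items, round(see, 3), round(approx_reliability(stop), 4))
```

Because the same response pattern is reused for each threshold, a stricter threshold can only lengthen the test in this sketch. Under the unit-variance assumption, a 0.35 threshold corresponds to an approximate reliability between those implied by 0.3 and 0.4, which is the compromise the third recommendation proposes to evaluate.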

In conclusion, practical youth risks and needs assessment can aid early intervention with high school students. An assessment format that places the least burden on the student and counselor will encourage use. The results support a WARNS CAT version that reduces assessment time without degrading measurement precision. More broadly, the results support the usefulness of CAT beyond achievement testing and add to the personality assessment literature.

Footnotes

Declaration of Conflicting Interest

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Cihan Demir

References

Akaike (1974). A new look at the statistical model identification. In Parzen, Tanabe, & Kitagawa (Eds.), Selected papers of Hirotugu Akaike: Springer series in statistics (perspectives in statistics) (pp. 215-222). Springer. https://doi.org/10.1007/978-1-4612-1694-0_16

Alpizar, French, B. F., & Vo, T. T. (2020). Equivalence testing of a youth risk and needs assessment. Journal of Psychoeducational Assessment, 38(8), 1046-1051. https://doi.org/10.1177/0734282920930892

Anderson, Kahn, J. D., & Tindal (2017). Exploring the robustness of a unidimensional item response theory model with empirically multidimensional data. Applied Measurement in Education, 30(3), 163-177. https://doi.org/10.1080/08957347.2017.1316277

Andrich (1978). A rating formulation for ordered response categories. Psychometrika, 43(4), 561-573. https://doi.org/10.1007/bf02293814

Beck, A. T., Kovacs, & Weissman (1979). Assessment of suicidal intention: Scale for suicide ideation [Database record]. APA PsycTests. https://doi.org/10.1037/t01299-000

Beuk, C. H. (1984). A method for reaching a compromise between absolute and relative standards in examinations. Journal of Educational Measurement, 21(2), 147-152. https://doi.org/10.1111/j.1745-3984.1984.tb00226.x

Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6(4), 431-444. https://doi.org/10.1177/014662168200600405

Bulut & Kan (2012). Application of computerized adaptive testing to entrance examination for graduate studies in Turkey. Eurasian Journal of Educational Research, 12(49), 61-80. https://eric.ed.gov/?id=EJ1059924

Butler, S. F., Black, R. A., McCaffrey, S. A., Ainscough, & Doucette, A. M. (2017). A computer adaptive testing version of the Addiction Severity Index-Multimedia Version (ASI–MV): The addiction severity CAT. Psychology of Addictive Behaviors, 31(3), 265-275. https://doi.org/10.1037/adb0000256

Cai, Thissen, & du Toit, S. H. C. (2011). IRTPRO for Windows [Computer software]. Scientific Software International.

De Beurs, D. P., De Vries, A. L., De Groot, M. H., De Keijser, & Kerkhof, A. J. (2014). Applying computer adaptive testing to optimize online assessment of suicidal behavior: A simulation study. Journal of Medical Internet Research, 16(9), e207. https://doi.org/10.2196/jmir.3511

De Matos, M. G., Wainwright, Brebels, Craciun, Gabrhelík, Schjodt, B. H., Plantade-Gipch, Poštuvan, Stojadinovic, & Richards (2019). Looking ahead. European Psychologist, 24(4), 337-348. https://doi.org/10.1027/1016-9040/a000362

Embretson, S. E., & Reise, S. P. (2013). Item response theory for psychologists. Psychology Press.

Feinberg, R. A., & Rubright, J. D. (2016). Conducting simulation studies in psychometrics. Educational Measurement: Issues and Practice, 35(2), 36-49. https://doi.org/10.1111/emip.12111

Flens, Smits, Carlier, Van Hemert, A. M., & De Beurs (2016). Simulating computer adaptive testing with the mood and anxiety symptom questionnaire. Psychological Assessment, 28(8), 953-962. https://doi.org/10.1037/pas0000240

French, B. F., & Vo, T. T. (2020). Differential item functioning of a truancy assessment. Journal of Psychoeducational Assessment, 38(5), 642-648. https://doi.org/10.1177/0734282919863215

George, Coker, French, Strand, McBride, & McCurley (2015). Washington assessment of the risks and needs of students: WARNS user manual. Center for Court Research, Administrative Office of the Courts.

Gibbons, R. D., Weiss, D. J., Pilkonis, P. A., Frank, Moore, Kim, J. B., & Kupfer, D. J. (2012). Development of a computerized adaptive test for depression. Archives of General Psychiatry, 69(11), 1104-1112. https://doi.org/10.1001/archgenpsychiatry.2012.14

Gotch, C. M., & French, B. F. (2020). A validation trajectory for the Washington assessment of risks and needs of students. Educational Assessment, 25(1), 65-82. https://doi.org/10.1080/10627197.2019.1702462

Hoge, R. D., & Andrews, D. A. (2002). Youth level of service/case management inventory: User’s manual. Multi-Health Systems.

Hol, A. M., Vorst, H. C., & Mellenbergh, G. J. (2005). A randomized experiment to compare conventional, computerized, and computerized adaptive administration of ordinal polytomous attitude items. Applied Psychological Measurement, 29(3), 159-183. https://doi.org/10.1177/0146621604271268

Hol, A. M., Vorst, H. C., & Mellenbergh, G. J. (2007). Computerized adaptive testing for polytomous motivation items: Administration mode effects and a comparison with short forms. Applied Psychological Measurement, 31(5), 412-429. https://doi.org/10.1177/0146621606297314

Hol, A. M., Vorst, H. C., & Mellenbergh, G. J. (2008). Computerized adaptive testing of personality traits. Zeitschrift für Psychologie/Journal of Psychology, 216(1), 12-21. https://doi.org/10.1027/0044-3409.216.1.12

Iverson, French, B. F., Strand, P. S., Gotch, C. M., & McCurley (2018). Understanding school truancy: Risk–need latent profiles of adolescents. Assessment, 25(8), 978-987. https://doi.org/10.1177/1073191116672329

Latimer, Meade, & Tennant (2014). Development of item bank to measure deliberate self-harm behaviours: Facilitating tailored scales and computer adaptive testing for specific research and clinical purposes. Psychiatry Research, 217(3), 240-247. https://doi.org/10.1016/j.psychres.2014.03.015

Lian & Cai (2020). Developing and validating an item bank for alcohol use disorder screening in the Chinese population by using the computerized adaptive testing. Frontiers in Psychology, 11, 1652. https://doi.org/10.3389/fpsyg.2020.01652

Linden, W. J., & Glas, C. A. (2007). Computerized adaptive testing: Theory and practice. Springer Science & Business Media.

Magis & Barrada, J. R. (2017). Computerized adaptive testing with R: Recent updates of the package catR. Journal of Statistical Software, 76(Code Snippet 1). https://doi.org/10.18637/jss.v076.c01

Magis & Raîche (2012). Random generation of response patterns under computerized adaptive testing with the R package catR. Journal of Statistical Software, 48(8). https://doi.org/10.18637/jss.v048.i08

Magis, Yan, & Davier, A. A. (2017). Computerized adaptive and multistage testing with R: Using packages catR and mstR. Springer.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149-174. https://doi.org/10.1007/bf02296272

McClarty, K. L. (2006). A feasibility study of a computerized adaptive test of the international personality item pool NEO [Doctoral dissertation, The University of Texas at Austin]. https://repositories.lib.utexas.edu/handle/2152/2576

McDonald, R. P. (1999). Test theory: A unified treatment. Psychology Press.

Muraki (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16(2), 159-176. https://doi.org/10.1177/014662169201600206

Nandakumar (1991). Traditional dimensionality versus essential dimensionality. Journal of Educational Measurement, 28(2), 99-117. https://doi.org/10.1111/j.1745-3984.1991.tb00347.x

Rahdert (1991). Problem-oriented screening instrument for teenagers: The adolescent assessment referral system manual. National Institute on Drug Abuse.

Rodriguez, Reise, S. P., & Haviland, M. G. (2016). Evaluating bifactor models: Calculating and interpreting statistical indices. Psychological Methods, 21(2), 137-150. https://doi.org/10.1037/met0000045

Samejima (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika, 34(1), 1-97. https://doi.org/10.1007/BF03372160

Schwarz (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461-464. https://doi.org/10.1214/aos/1176344136

Sireci, S. G., Thissen, & Wainer (1991). On the reliability of testlet-based tests. ETS Research Report Series, 1991(1), i-15. https://doi.org/10.1002/j.2333-8504.1991.tb01389.x

Stout, W. F. (1990). A new item response theory modeling approach with applications to unidimensionality assessment and ability estimation. Psychometrika, 55(2), 293-325. https://doi.org/10.1007/BF02295289

Strand, P. S., Gotch, C. M., French, B. F., & Beaver, J. L. (2019). Factor structure and invariance of an adolescent risks and needs assessment. Assessment, 26(6), 1105-1116. https://doi.org/10.1177/1073191117706021

Sunderland, Batterham, P. J., Calear, A. L., & Carragher (2017). The development and validation of static and adaptive screeners to measure the severity of panic disorder, social anxiety disorder, and obsessive compulsive disorder. International Journal of Methods in Psychiatric Research, 26(4), e1561. https://doi.org/10.1002/mpr.1561

Tan, Cai, & Zhang (2018). Development and validation of an item bank for depression screening in the Chinese population using computer adaptive testing: A simulation study. Frontiers in Psychology, 9. https://doi.org/10.3389/fpsyg.2018.01225

Thimm, J. C. (2020). The Norwegian computerized adaptive test of personality disorder–static form (CAT-PD-SF): Reliability, factor structure, and relationships with personality functioning. Assessment, 27(3), 585-595. https://doi.org/10.1177/1073191117749296

Waller, N. G., & Reise, S. P. (1989). Computerized adaptive personality assessment: An illustration with the absorption scale. Journal of Personality and Social Psychology, 57(6), 1051-1058. https://doi.org/10.1037//0022-3514.57.6.1051

Walter, O. B., Becker, Bjorner, J. B., Fliege, Klapp, B. F., & Rose (2007). Development and evaluation of a computer adaptive test for “Anxiety” (Anxiety-CAT). Quality of Life Research, 16(Suppl. 1), 143-155. https://doi.org/10.1007/s11136-007-9191-7

Walter, O. B., & Holling (2008). Transitioning from fixed-length questionnaires to computer-adaptive versions. Zeitschrift für Psychologie/Journal of Psychology, 216(1), 22-28. https://doi.org/10.1027/0044-3409.216.1.22

Zheng & Chang (2013). Content-balancing strategy in bifactor computerized adaptive patient-reported outcome measurement. Quality of Life Research, 22(3), 491-499. https://doi.org/10.1007/s11136-012-0179-6