Abstract
Child maltreatment and intimate partner abuse determinations often include judgments (e.g., severity) that go beyond whether or not the allegations are founded. Severity ratings inform multiple stakeholders (e.g., researchers, policymakers, clinicians, supervisors) and response pathways (e.g., “differential response” to child maltreatment). However, because severity guidelines typically only provide global direction for raters, these gradations are often of questionable reliability (and thus validity). Extending earlier work developing and implementing reliable and valid family maltreatment substantiation criteria (e.g., Heyman & Slep, 2006, 2009), a classification system for maltreatment severity was created, refined, and field-tested with a sample of clinicians from the largest maltreatment protection agency in the United States. The goal was to develop operationalized criteria delineating mild, moderate, and severe maltreatment that could be consistently applied across types of maltreatment, raters, and clinics. To facilitate proper use, a computerized clinical decision support tool for the criteria was created. First, the severity classification system was piloted and refined at four sites throughout the United States. Then, clinicians at these sites (N = 28) and a master reviewer independently rated de-identified cases as part of the clinicians’ routine assessments. Agreement between clinicians and the master reviewer was excellent for all types of maltreatment. Implications for practical dissemination are discussed.
The severity of abuse and neglect has been, implicitly or explicitly, a foundational notion of family maltreatment research and intervention since their inceptions. For instance, the first child maltreatment article (“The Battered Child Syndrome” by Kempe et al., 1962) and early work on intimate partner violence (e.g., Walker’s [1979] description of the “cycle of abuse”) focused public and research attention on their severe forms. Subsequent work on assessment (e.g., Conflict Tactics Scales, Straus, 1979; National Incidence Study of Child Abuse and Neglect [NIS], Sedlak et al., 2010), theory (e.g., Holtzworth-Munroe & Stuart, 1994; Johnson, 2010; Walker, 1979), and intervention (e.g., Pence & Paymar, 1993) involved a continuum of severity. However, severity has either been inferred via a priori assumptions about potential impact (e.g., Straus, 1979) or measured via complicated research operationalizations that are not viable for use by clinicians in the field (e.g., NIS-4; Sedlak et al., 2010). In essence, the edifices of family maltreatment theory, assessment, prevention, and treatment are built on foundational assumptions regarding severity (e.g., minor versus severe, “situational couple violence” versus “intimate terrorism,” appropriate child discipline versus abuse), and yet the foundation was never truly built, at least not empirically. This is the first study to construct and test a field-usable scoring system for incidents that meet the threshold for clinically significant child or partner abuse or neglect as operationalized by criteria now in use by the Diagnostic and Statistical Manual of Mental Disorders (DSM-5; American Psychiatric Association, 2013), the U.S. Department of Defense (2014), and the International Classification of Diseases, 11th edition (ICD-11; World Health Organization, 2020).
Whether allegations of abuse and neglect are founded and, if so, the severity of the incidents have implications for the individuals involved (e.g., removal of a child from the home, law enforcement action, legal involvement). Furthermore, the data regarding such incidents inform broader policy decisions (e.g., staffing, prevention, trend analysis, response). When decisions are not consistent across families, caseworkers, contexts, and time, efforts to protect families adequately, detect trends, and oversee responses are hampered (Heyman & Slep, 2019), as are fundamental notions of fairness. In both research and field decisions, two related decisions require establishing reliable thresholds: (a) Do the acts/omissions constitute maltreatment? (b) If so, what is the severity of the incident(s)? These decisions are operationalized in various ways depending on context (e.g., research, treatment), setting (e.g., child protective services, law enforcement), and form of maltreatment (e.g., partner vs. child, abuse [physical, emotional, sexual] or neglect). They also go by different names (e.g., differential response, diversion, mild/moderate/severe maltreatment). Nevertheless, the distinctions are fundamentally the same: Is this maltreatment or not? If so, how severe?
Despite the foundational nature of these decisions, their reliability 1
We will follow common usage and refer to “reliability” and “interrater agreement” interchangeably, especially because in our study, interrater agreement involves consistency or error influenced by several sources: (a) ratings of numerous field social workers from (b) four military services, each with its own training, traditions, and unique operational influences and pressures in (c) four different geographical areas with (d) diverse settings (urban, rural, desert) assigning ratings to (e) behavior of families from across the United States somewhat haphazardly assigned to that site. Thus, in our study “interrater agreement” includes many of the elements of “reliability” in Mitchell’s (1979) distinctions between the two terms:
[I]t should be repeated that reliability and observer agreement are not the same. The differences between agreement and reliability are based on the way the two indices are defined. Reliability coefficients partition the variance of a set of scores into a true score (individual differences) and an error component. The error component may include random fluctuations in the behavior of subjects, inconsistencies in the use of the scale, differences among observers, and so forth. Interobserver agreement [indices], on the other hand, carry no information at all about individual differences among subjects and contain information about only one of the possible sources of error—differences among observers. In other words, a reliability coefficient reflects the relative magnitude of all error with respect to true score variability, whereas an agreement percentage reflects the absolute magnitude of just one kind of error (p. 378).
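Mitchell's distinction can be restated compactly in classical test theory notation. The symbols below are introduced here for illustration and do not appear in Mitchell (1979): $\sigma^2_T$ is true-score variance, $\sigma^2_E$ is error variance from all sources, and $P_o$ is the observed proportion of agreement between two raters.

```latex
\rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E},
\qquad
P_o = \frac{\text{number of cases on which the two raters agree}}{\text{total number of cases rated}}
```

The reliability coefficient $\rho_{XX'}$ is relative to true-score variability, whereas $P_o$ reflects only rater disagreement and carries no information about differences among the cases rated.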
Regarding the maltreatment occurrence threshold (act and impact/potential impact), Heyman and Slep (2019; see also Heyman & Slep, 2006, 2009; Heyman et al., 2013; Slep & Heyman, 2006) established the reliability and validity (content and criterion) of criteria originally developed for the country’s largest maltreatment protection agency (the U.S. Department of Defense’s [DoD] Family Advocacy Program [FAP]). These criteria were adapted and included in the fifth edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5; American Psychiatric Association, 2013) and in the eleventh edition of the International Classification of Diseases (ICD-11; World Health Organization, 2020).
However, unlike maltreatment occurrence, severity does not have established, field-usable, and reliable thresholds. Research on child maltreatment severity has included investigations of (a) frequency of different degrees of child maltreatment based on risk factors (Kepple, 2017); (b) whether severity influences removal of a child from a home (e.g., Bartelink et al., 2018; Biehal et al., 2018); and (c) the extent to which severity influences future outcomes (e.g., O’Sullivan et al., 2018; Schwandt et al., 2013). Two rating systems designed for research contexts are of note. The National Incidence Study of Child Abuse and Neglect (NIS; Sedlak et al., 2010) classified severity based on the level of harm (i.e., fatalities, serious injuries or conditions, moderate injuries or conditions, probable impairment) or endangerment (e.g., less severe actual harm, potential for harm). Both levels are operationalized and can be coded reliably, but it can be argued that the definitions are not easily used by clinicians in the field (i.e., they are complicated) and are not sufficiently fine-grained. Barnett et al. (1993) used a point system to rank order severity by maltreatment type (i.e., physical, sexual, or emotional abuse; neglect). Raters are provided with descriptions and prototypical examples that, although thorough and allowing for consideration of different specific acts and constellations of acts (e.g., a wide range of ways in which a child could be hit), are global and do not include easily navigated subcriteria. Although Barnett et al. (1993) showed that the system could be used reliably, it is complex and requires intensive training that cannot be easily provided in a typical clinical or fieldwork context (especially to maintain reliability on an ongoing basis).
A third system was developed explicitly for use by child protective services caseworkers in Spain (Arruabarrena et al., 2013). Maltreatment operationalizations were created via a literature review, expert consultation, and a pilot study for five levels (i.e., no maltreatment and low, moderate, severe, very severe maltreatment). Similar to Barnett et al.’s (1993) system, guidance and examples were provided for each rating. In the pilot study, caseworkers read vignettes and rated the severity of each case. Caseworkers using the old definitions accurately rated severity 20% of the time. After a 10-hour training, accuracy was 45.5%, rising to 62% after a 20-hour training. Thus, although the criteria with training tripled accuracy, decisions were still wrong more than one-third of the time. Caseworkers particularly struggled with reliably rating emotional maltreatment cases and with “moderate” cases (Arruabarrena et al., 2013).
In contrast, intimate partner violence (IPV) severity ratings have typically involved a priori classifications of acts based on the inherent likelihood of injury (e.g., mild and severe physical and psychological IPV on the Conflict Tactics Scales [CTS2; Straus et al., 1996, 2003]). Although the CTS2’s self-reports are straightforward and easy to use, the measure cannot adequately incorporate the contextual information available in even a relatively brief verbal incident assessment (e.g., whether the push was onto a bed or down a flight of stairs). Furthermore, many IPV severity measures focus on only one type of maltreatment and often use acts and frequency to assess severity. The Severity of Violence Against Women Scale (Marshall, 1992) contains 46 items assessing the frequency of physical acts. The Sexual Experiences Survey (SES; Koss & Gidycz, 1985) measures sexual abuse based on act and frequency reported, with severity ranked from no act through rape. The Safe Dates—Psychological Abuse Perpetration (Foshee et al., 1996) scale asks for frequency ratings, and the scoring results in classifications of none, mild, moderate, or severe psychological abuse.
Two IPV assessments include impact beyond the frequency of specific acts. First, the Measure of Wife Abuse (Rodenburg & Fantuzzo, 1993) contains 60 items that first assess act, and then impact on a 4-point scale ranging from “This never hurt or upset me,” to “This often hurt or upset me.” Higher scores indicate higher levels of psychological abuse. Physical abuse items are also included. At 60 items, this is a lengthy assessment. Second, the Danger Assessment (Campbell, 1986) assigns a severity risk based on past IPV to predict risk of future maltreatment. Victims rate previous incidents on a scale of 1–5 and use a calendar to mark incident dates. The lowest anchor on the scale is “Slapping, pushing; no injuries and/or lasting pain;” the highest anchor is “Use of weapon; wound from weapon.” Victims are instructed to assign the higher rating if there is an act or impact that matches more than one point on the scale. Then, 15 items further assess acts, but not impacts.
In summary, previous attempts to classify severity (a) are often cumbersome and challenging to implement in the field, (b) are not comprehensive and focus on only (1) an incident with global items, or (2) act or impact/possible impact, but not both, and (c) require a significant amount of training that still does not yield sufficiently accurate assessment. Furthermore, the importance of reliable measurement in severity has implications for (a) policymakers interpreting trends, (b) researchers measuring severity, (c) interventionists preventing and treating, and (d) child welfare caseworkers triaging cases.
Creation of Field-usable Severity Rating Scales
As noted above, the most developed system for reliably substantiating partner and child maltreatment is the one used by the DoD, DSM-5, and ICD-11. Because (a) the program of research that resulted in those criteria (summarized in Heyman & Slep, 2019) provided the roadmap for the current research program on maltreatment severity ratings and (b) the substantiation determination process typically precedes the severity determination process, we will summarize the initial research here.
The development process involved a five-study research program including (a) a content validity study, (b) a mixed-method study with clinicians about clinical utility, (c) development of operationalized criteria, (d) evaluation of the criteria in typical usage in field settings (i.e., agreement between field decisions and those of master reviewers listening to the same case presentation), and (e) evaluation of the criteria in field settings while using a computerized clinical decision support tool. Baseline agreement between sites and master reviewers was 50%; in the final development field trial, agreement was 92% (Heyman & Slep, 2006) and was maintained at 91% when the criteria were disseminated to 41 sites worldwide (Heyman & Slep, 2009). Field users and other stakeholders involved in the decision-making process viewed the new criteria as fair to alleged perpetrators and victims (Heyman et al., 2010). Results implying the validity of the determinations are summarized in Heyman and Slep (2019). Thus, clinical decisions on above/below clinically significant thresholds can be made with high levels of reliability and validity (Heyman & Slep, 2019).
To summarize, the purpose of the current study is to extend this work by creating severity gradation distinctions for incidents exceeding the maltreatment substantiation threshold (i.e., distinguishing mild, moderate, and severe maltreatment) and testing their reliability. To accomplish this, a decade after the original clinical criteria studies, we conducted a program of research for severity following a similar development process. Phase 1 (initial development) involved six activities: (a) gather severity operationalizations used in research and child protective service settings; (b) survey users and agency leaders on the then-in-use severity measure; (c) revise the operationalizations of severity to support reliable implementation across users, sites, and services; (d) conduct focus groups with users, agency leaders, and civilian experts on the revised measure; (e) test and revise the measure; and (f) conduct a vignette study to obtain an estimate of the potential interrater reliability of the severity measure. Phase 2 involved two activities: refinement and field-testing of the severity scales.
Phase 1 Activities: Initial Development
Activity 1: Gather Severity Operationalizations Used in Research and Child Protection Services
We conducted literature searches of the psychological (PsycINFO) and medical (Medline) research databases. We also searched full-text resources (e.g., PsycARTICLES, Google Scholar) because severity does not always appear in either the keywords or the abstract. We also examined operationalizations in the National Incidence Study of Child Abuse and Neglect (NIS-3 and NIS-4; Sedlak et al., 2010). Finally, we contacted professional organizations via listservs (e.g., child maltreatment, partner abuse) and asked colleagues if they were aware of any other severity scales not identified in the previous steps.
Activity 2: Survey Users and Agency Leaders on the Then-current Severity Matrix
We received input from FAP clinicians in the field, as well as agency leaders, to ensure that the revised measure (a) retained the strengths of the then-current measures, (b) adequately addressed weaknesses, (c) was easy to use, and (d) fulfilled desired clinical and administrative functions. Finally, we believed that by involving stakeholders in the revision process, clinical utility/validity, and end-user acceptance of the measure would be enhanced.
Study 1: Usability survey on the then-current severity matrix.
Method.
Participants and procedures. Participating FAP clinicians (N = 66) volunteered to participate after receiving an invitation from FAP headquarters staff from their service. Participants received an emailed consent form, short measure on their satisfaction with the current severity scale, and an option to provide comments. Those who did not respond were contacted by phone and, if they consented, completed the measure by phone. It is important to note that limited demographic variables were collected in an attempt to increase participant comfort in providing honest feedback on an assessment that was currently in use.
Results. Clinicians rated the existing maltreatment severity scales as modest overall (mean item score = 3.27 [child] and 3.36 [partner] out of 5). The highest mean ratings were given for ease of use and the appropriateness of the measures to physical and sexual abuse, while the lowest mean ratings were given for reliability of ratings across clinicians, installations, and services.
Discussion. There were considerable interagency differences in the ratings. This variation within FAP agencies is similar to that found in civilian agencies (e.g., within a large county or city; among counties in a state). To achieve consistent ratings, developers must (a) create a usable criteria set, (b) train users, and (c) ensure that users match a gold standard rating.
Activity 3: Revise the Operationalizations of Severity to Support Reliable Implementation
Based on the results of Study 1, we revised the severity measure by (a) adopting the DoD maltreatment criteria as the basis for the severity scale, and (b) operationalizing each maltreatment criterion with severity criteria. There were few extant sources to borrow from that matched the maltreatment criteria; however, where appropriate, we tried to modify existing operationalizations (e.g., NIS-4; Sedlak et al., 2010).
Activity 4: Conduct Focus Groups on the Revised Measure
We conducted three focus groups with field clinicians. The first two focus groups asked attendees to describe cases that they had recently assessed and attempt to classify them using a paper version of the severity scale draft. It became clear in these discussions that (a) the task (i.e., determining severity for each criterion, retaining the highest score) was too complex for paper to be a viable mode of administration; and, critically, (b) elements necessary to assess severity were often not gathered during assessments.
Before the third focus group, we developed a computerized prototype. Branching through the severity scales was much easier for users to accomplish. We then turned this prototype into an online test site. Shortly after testing the computerized prototype with the focus group, we presented it to agency chiefs and got their input.
Activity 5: Test and Revise the Severity Scales
To account for interclinician variability, four clinicians (i.e., one per service) were nominated to participate in the initial testing of the revised severity measure via teleconference. Each group conducted one teleconference per week for four weeks with a doctoral-level researcher from the university development team. Each week, one clinician presented several assessments. All four members (plus the researcher) listened to each assessment, asked questions, made severity ratings, and discussed their ratings. These teleconferences served three purposes: (a) provide “gray area” cases that were used to refine the severity measure, (b) identify elements of assessment that needed to be unified or strengthened to support reliable determinations, and (c) identify process issues in the determination process that could impede consistent usage. We found that refinement in all three areas—scale operationalizations, clinical assessments, and ratings process—would be necessary to achieve high reliabilities. We conducted a total of four unique groups, resulting in a total of 16 sessions over an approximately six-week period, generating frequent alterations.
Activity 6: Conduct a Vignette Study to Obtain an Estimate of Potential Interrater Reliability
Study 2: Vignette study of potential interrater reliability.
Method.
Participants. Of the 643 FAP clinicians contacted, 251 (39%) replied within the data collection period. Participants averaged 11.50 years of experience (SD = 8.91, range 0–41), with 93% holding a master’s degree.
Procedure. Each participant rated one type of family maltreatment, for a total of seven vignettes. Vignette selection was done randomly for each of the seven vignettes. An assent form, cover letter with instructions, and vignettes were emailed to potential participants. Clinicians completed the vignettes using the web-based severity measure.
Results. Clinicians’ ratings of the standardized vignettes were compared with the reference (“correct”) rating assigned to each vignette. Agreement for most forms of maltreatment was poor to modest.
Discussion. Clearly, the limited agreement with the “correct” answer in standardized vignettes indicated that further development work was needed. Some of the difficulties may have been due to the contrast between (a) the rich information in a typical clinical assessment and (b) the brevity of the vignettes. As with substantiation decisions, “easy” cases with observable, uncontested facts (e.g., a broken bone) were relatively easy to judge. However, the majority of cases involve psychological impact or the potential for serious impacts; these cases, although challenging for yes/no substantiation decisions, are even harder for severity determinations. See Tables 1 and 2 for a brief overview of the severity conceptualization.
Table 1. Conceptualization of Severity Criteria Outlined in Phase 2 Activities: Refinement and Field Testing.
Note. (1) There are six types of child neglect. Abandonment is always severe, and there are no separate criteria to consider.
The remaining five types are abbreviated as follows: H = exposure to physical hazards, S = supervision: lack of/inappropriate/inadequate, M = medical, E = educational, D = deprivation of necessities.
Table 2. Conceptualization of Severity Ratings by Criterion Outlined in Phase 2 Activities: Refinement and Field Testing.
Note. These are broad categorizations to capture the general thinking behind rating each criterion, but many branches have specific wording based on previous responses. For example, there are 12 types of physical injury/impact, each with tailored wording (e.g., respondents indicate level of pressure required to stop bleeding from damage to skin). The broad physical functioning categories (did not or somewhat impacted, highly impacted, extremely impacted) therefore remain accurate.
Phase 2 Activities: Refinement and Field Testing
Activity 7: Refinement of the Revised Severity Scales
All clinicians (n = 28) who conducted family maltreatment assessments at four sites nominated by their respective services’ FAP headquarters participated in the project. Most clinicians (93%, n = 26) had a master’s degree. Overall, 79% (n = 22) of clinicians held a degree in social work. There was a great deal of variation (range 0–30) in the years worked in the program, with a median of 4 years.
To launch the activity in California, Nebraska, and Washington State, the research team conducted in-person group trainings and individual consultation sessions to teach clinicians the content and use of the severity classification system. Clinicians were able to ask questions and comment on the system while using the computerized decision tree support tool with de-identified cases. In turn, the team employed a cognitive interviewing procedure (e.g., Dillman, 2000), where clinicians talked through their thought processes and responses to each item. Refinements were made to the severity classification system and training materials as needed to enhance their consistent interpretation, application, and ease of use.
After all site visits, regular telephone calls were scheduled with each clinician (on a schedule determined by caseload) with at least one member of the research team, who served as the master reviewer(s). During these calls, the clinician presented de-identified cases and talked through responses to each item in the support tool ratings. The master reviewer was initially quite active, asking questions and providing guidance. Based on these calls, the research team refined the classification system, making small wording changes and other adjustments to support consistent application and creating training and support materials. The draft classification system was considered final after there were no revisions for several months of consistent field use.
Severity criteria.
The severity classification was made following a determination whether the allegation met criteria for maltreatment. (Incidents not meeting criteria for maltreatment were coded as “not applicable” for severity ratings.) The computerized severity system comprised decision algorithms to produce one of the three ratings: mild, moderate, or severe.
Clinicians provided information about the different actual and potential impacts of each incident of maltreatment. The severity rating for that incident was based on the most severe impact (or potential impact). For example, an act/omission might lead to a variety of impacts. If an act/omission was determined to have one “moderate” impact and no “severe” impacts, the final rating was “moderate.” If all impacts resulted in “mild” classifications, the final rating was “mild.” Clinical utility was maximized by having the algorithm reach a rating as quickly as possible to minimize clinicians’ time. As soon as the clinician indicated a “severe” answer, the rating process stopped, as “severe” was the only possible determination. Further, clinicians answered several binary screener questions, and follow-up questions were asked only for relevant elements.
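The "most severe impact wins" rule with early stopping can be sketched as follows. This is a minimal illustration under stated assumptions, not the decision support tool itself: the names `Severity` and `rate_incident` are hypothetical, and returning `None` when no impacts are supplied is an assumption standing in for the "not applicable" code.

```python
from enum import IntEnum
from typing import Iterable, Optional


class Severity(IntEnum):
    """Ordered severity levels (hypothetical names for illustration)."""
    MILD = 1
    MODERATE = 2
    SEVERE = 3


def rate_incident(impact_ratings: Iterable[Severity]) -> Optional[Severity]:
    """Return the most severe rating among an incident's impacts.

    Iteration stops at the first SEVERE rating, because no later answer
    could change the outcome. Returns None if no impacts are supplied
    (an assumption for this sketch).
    """
    highest: Optional[Severity] = None
    for rating in impact_ratings:
        if highest is None or rating > highest:
            highest = rating
        if highest >= Severity.SEVERE:
            break  # "severe" is the only possible determination
    return highest
```

For example, an incident with one "moderate" impact and otherwise "mild" impacts is rated "moderate", and an incident whose second impact is "severe" is rated without examining any remaining impacts.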
Activity 8: Field Testing the Revised Severity Scales
Study 3: Interrater agreement of caseworkers and master reviewer.
Method.
Procedures. After the scales were finalized at the end of Activity 7, no further changes were allowed. The team conducted intensive trainings via telephone with all clinicians at each site, using possible assessment items, multiple examples, screenshots from the web-based classification system, vignettes, brief quizzes, recommended assessment questions, and pictures.
In Activity 8, clinicians presented cases on the telephone as in Activity 7 except that rating decisions were made independently, and the clinician and master reviewer were blind to the decisions of the other. In a very small number of cases, after clinicians locked in their incident ratings, the master reviewer asked for facts that had not been included in the case presentations (e.g., “What was the age of the child?”). In making their decisions, clinicians were able to use reference materials from the training but were not able to ask the master reviewer questions.
Results. Analyses compared FAP severity ratings with master reviewer ratings using G (Holley & Guilford, 1964). Similar to the often-used Cohen’s (1960) kappa, G measures the extent of interrater agreement controlling for chance; however, of the many statistics used to measure interrater agreement, G is least biased by unequal distributions among the classes or by low sample sizes (Xu & Lorber, 2014). G can be interpreted in the same manner as Cohen’s kappa (e.g., .60 – .74 as good and .75 and above as excellent).
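To make the contrast between G and kappa concrete, the sketch below computes both for two raters. It rests on assumptions: for two categories, Holley and Guilford's G equals 2Po − 1, and the k-category form used here, (Po − 1/k) / (1 − 1/k), is the standard generalization with uniform chance agreement. The text does not state which form was applied to the three-level ratings, and the function names and example data are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def observed_agreement(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Proportion of cases on which the two raters gave the same rating."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("Both raters must rate the same non-empty set of cases.")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def g_index(a: Sequence[Hashable], b: Sequence[Hashable], n_categories: int) -> float:
    """G index: agreement corrected for uniform chance (1 / n_categories)."""
    if n_categories < 2:
        raise ValueError("At least two categories are required.")
    p_o = observed_agreement(a, b)
    chance = 1.0 / n_categories
    return (p_o - chance) / (1.0 - chance)


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement corrected for chance based on each rater's marginals."""
    p_o = observed_agreement(a, b)
    n = len(a)
    counts_a, counts_b = Counter(a), Counter(b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_e == 1.0:
        return float("nan")  # undefined when both raters use one identical category
    return (p_o - p_e) / (1.0 - p_e)


# Fabricated, highly skewed illustration (not study data): 9 of 10 ratings agree.
clinician = ["mild"] * 9 + ["severe"]
reviewer = ["mild"] * 10
print(g_index(clinician, reviewer, n_categories=2))
print(cohens_kappa(clinician, reviewer))
```

In this illustrative case the raters agree on 90% of incidents, yet kappa is 0 because the skewed marginals push its chance term to .90, while G is approximately .80. This is the sensitivity to unequal class distributions described above.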
A total of 929 incidents (65% partner maltreatment [n = 605]; 35% child maltreatment [n = 324]; see Table 3) were rated. The data collection window was determined to ensure a minimally sufficient number of incidents for most forms of maltreatment (i.e., child and partner physical and emotional abuse, and child neglect). Because child and partner sexual abuse comprise approximately 2% of allegations that meet criteria in these agencies, no practically feasible data collection period could have generated enough of these cases to result in stable estimates of interrater agreement as with the other forms of maltreatment. With that noted, we believe enough incidents were included in this study to provide a solid foundation to generalize the findings to other populations.
Table 4 displays results indicating that the classification system had excellent reliabilities and ranged from G = 1.00 (child sexual abuse) to .78 (partner physical abuse).
Table 3. Number of Incidents by Type of Maltreatment (Phase 2 Activities: Refinement and Field Testing).
Table 4. Comparison of Clinician Ratings and Master Reviewer Ratings by Type of Maltreatment (Phase 2 Activities: Refinement and Field Testing).
Discussion
Severity determinations, which are critical to families, caseworkers, agencies, researchers, and policymakers alike, are useful only if they can be made reliably (i.e., consistently) and validly (i.e., the ratings actually measure gradations in impact or potential impact). Given the maxim that reliability constrains validity (Grove, 1987), establishing reliability in a field-usable severity determination would pave the way toward improving studies that include maltreatment severity rated by assessors. Of equal (but less pithily emphasized) importance, in clinical contexts reliability constrains fairness; in policy contexts, reliability constrains interpretability. Encouragingly, results from this study indicate that a severity classification system can be used reliably in the routine work of social workers in the field. Furthermore, the system could be used reliably for all types of maltreatment involving both child and adult victims.
Of note is that the scales were tested in America’s largest child and partner protection agency (the DoD’s Family Advocacy Program), comprising four semiautonomous agencies that serve a racially and ethnically diverse population of families. The communities served are located in urban, suburban, exurban, and rural locations throughout the United States plus Europe and Asia. Depending on the location and whether the incident happened on or off the installation, Family Advocacy involvement may be in addition to, or instead of, civilian legal or state child protective services actions.
Research, Prevention, and Policy Implications
Reliable measurement of incident severity serves myriad purposes. First, stakeholders are frequently interested not only in trends in maltreatment but also in trends in the severity of maltreatment. Such information is not useful unless the signal (i.e., true score) is stronger than the noise (i.e., error variance; Lord & Novick, 1968); consistency in measurement is a prerequisite for a strong signal. Second, researchers are often interested in studying severity levels or trajectories; reliable severity measurement makes this possible. Third, given that past behavior is the best predictor of future behavior (Ouellette & Wood, 1998), a reliable measure of incident severity may be of use in planning interventions (Slep & Heyman, 2004). Finally, reliably indexing and tracking severity likely also facilitates understanding case acuity and worker caseload, which might also contribute to maintaining manageable caseloads and better outcomes for children and families (e.g., Kaye et al., 2012).
In addition, although risk factors for perpetration are already known and disseminated broadly (see CDC technical package, Niolon et al., 2017), understanding how risk factors differentially relate to various forms of maltreatment at different levels of severity would allow for more informative prevention and outreach presentations and actions.
Finally, a more reliable system is perceived as fairer, which typically results in better buy-in from stakeholders. As noted earlier, Heyman et al. (2010) found that the reliable case determination system that preceded the severity system was perceived as fairer; in the ensuing year, there was a 50% reduction in child and partner maltreatment recidivism (Snarr et al., 2011). Similarly, a more trustworthy system for rating severity may result in unintended positive consequences, with stakeholders paying particular attention to those meeting the moderate and severe thresholds.
Limitations
There is nothing military-specific about the severity scales; however, reliability in other systems or contexts would have to be established. That said, at the start of this initiative, there was concern about the differences among the four agencies. The systems within which workers in the different services operate vary substantially. These results demonstrate that a single severity system can be implemented consistently across different subpopulations and settings, implying the possibility for broader dissemination to civilian agencies in the United States and elsewhere.
Recommendations
First, it is strongly suggested that the assessing clinician make severity ratings after a substantiation determination is made. The assessing clinician is in the best position to answer the detailed questions within the severity support tool. Based on the results of the current project, it appears that clinicians make reliable severity ratings using this system (with the proper training and supervision). By making ratings following the substantiation determination, ratings will only be made on cases that require them. Furthermore, determinations will benefit from additional information that may come to light through the substantiation process.
Second, it appears that presenting the severity system item-by-item in a decision-tree format aided in the consistency of ratings. Without the computerized support tool, it is unclear whether reliable severity ratings would have been made. Furthermore, the development activities in this paper (Phase 1, Activity 4) support the need for structured, item-by-item administration (most easily facilitated by a computerized support tool). Numerous free or low-cost and user-friendly options are available to make such a tool (e.g., Google Forms, LimeSurvey, SurveyMonkey). Based on the needs of each organization and regulations regarding electronic records, professional survey platforms might be a better fit (e.g., Qualtrics), as would secure, custom-built tools. Simple skip patterns would result in the necessary branching.
Finally, the severity system was reliably implemented following a period of training and use with supervision and feedback to ensure that clinicians were interpreting the criteria as intended. This training was not intensive, but it is likely an essential part of any implementation.
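For organizations considering a custom-built tool, the item-by-item branching described in the second recommendation can be sketched in a few lines. This is a minimal illustration, not the study's tool: the item identifiers, placeholder prompts, and the branching order are hypothetical and do not reproduce the actual severity criteria. Only the three rating levels (mild, moderate, severe) come from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

RATINGS = {"mild", "moderate", "severe"}


@dataclass(frozen=True)
class Item:
    """One yes/no item; each answer leads to another item id or a final rating."""
    prompt: str
    if_yes: str
    if_no: str


# Hypothetical placeholder items (NOT the published criteria).
TREE: Dict[str, Item] = {
    "q1": Item("Placeholder item 1 (highest-threshold question)?", "severe", "q2"),
    "q2": Item("Placeholder item 2 (middle-threshold question)?", "moderate", "q3"),
    "q3": Item("Placeholder item 3 (remaining question)?", "moderate", "mild"),
}


def rate(answers: Dict[str, bool], start: str = "q1") -> Tuple[str, List[str]]:
    """Walk the tree from `start`, consulting only items reached by the skip pattern.

    Returns the final rating and the ordered list of items that were asked.
    Raises KeyError if a required answer is missing.
    """
    asked: List[str] = []
    node = start
    while node not in RATINGS:
        asked.append(node)
        item = TREE[node]
        node = item.if_yes if answers[node] else item.if_no
    return node, asked
```

Calling `rate({"q1": False, "q2": True})` walks two items and stops, so the rater is never shown the third. Recording the list of items asked alongside the rating is a design choice that would let a supervisor or master reviewer see where two raters diverged.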
Conclusion
In conclusion, this study supports the conclusion that this streamlined severity classification system, which assesses both act and impact for family maltreatment, can produce reliable ratings across cases, clinicians, sites, and systems with a reasonable level of training. Although further investigation during widespread use in this and other maltreatment response settings would be necessary to establish reliability and validity in everyday use, this initial study supports the prerequisite that any such severity scales be able to be used reliably by clinicians in everyday fieldwork.
Footnotes
Acknowledgments
We would like to thank our installation-level points of contact and their clinicians, who generously donated their time and clinical knowledge to help create a usable tool. We would also like to acknowledge Katherine Casillas, PhD, who served as project director on the precursor to this work, creating a solid foundation for this study, and David Lloyd, JD, who was our headquarters-level point of contact.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This project was funded by (a) contract FA8901-06-C-0027 from the U.S. Air Force and (b) a contract from the U.S. Air Force, via the USDA, administered by Kansas State University (2009-48353-06045).
