Abstract
Objective:
The Diagnostic Infant and Preschool Assessment was revised to include Likert ratings (DIPA-L) to give a broader range of severity ratings that may have greater utility for clinical and research purposes. In addition, the instrument was updated for Diagnostic and Statistical Manual of Mental Disorders, 5th ed. (DSM-5), and two types of Likert ratings—frequency versus problem intensity—were explored for posttraumatic stress disorder (PTSD) symptoms. Concurrent construct validation and test–retest reliability were examined for the five most common disorders seen in very young children in outpatient clinics: PTSD, attention-deficit/hyperactivity disorder, oppositional defiant disorder, separation anxiety disorder, and generalized anxiety disorder (GAD). A sixth disorder, disruptive mood dysregulation disorder (DMDD), which was created in DSM-5, was tested for the first time. Functional impairment was also examined.
Methods:
The caregivers of 58 two- through six-year-old children (57 mothers and 1 father) were recruited from an outpatient clinic. They were interviewed at Time 1, and 52 were reinterviewed at Time 2 by research assistants (children's age M 4.7 years, standard deviation 1.2).
Results:
Few differences were found between the ratings of frequency versus problem intensity for PTSD symptoms. Tests of concurrent criterion validation were acceptable for all disorders when compared against disorder-specific questionnaires; the range of Pearson correlation coefficients was 0.56–0.94. A trend for attenuation of diagnoses from Time 1 to Time 2 was evident, but not statistically significant. Test–retest reliabilities were strong when examined with continuous Likert scores, except for GAD (the range of intraclass correlation coefficient values was 0.29–0.91), but were less consistent for categorical disorder-level status (the range of Cohen's κs was 0.35–0.79). The range of internal consistencies was 0.78–0.95, excluding DMDD, for which internal consistency could not be calculated.
Conclusions:
The updated and revised DIPA-L demonstrated many acceptable features of a valid and reliable instrument for the assessment of very young children. While the findings are tentative given the small sample size, the DIPA-L is the only diagnostic instrument for young children that has a replication, has been tested in clinic populations, has been updated for DSM-5, has psychometrics for functional impairment, and has Likert ratings.
Introduction
The Diagnostic Infant and Preschool Assessment (DIPA) has its origins in a 2004 initiative to create a developmentally sensitive tool for assessing psychiatric syndromes in very young children that mapped closely onto the Diagnostic and Statistical Manual of Mental Disorders, 4th ed. (DSM-IV) (American Psychiatric Association 1994). At the time, there existed only one other structured diagnostic interview, the Preschool Age Psychiatric Assessment (PAPA) (Egger et al. 2006), which overlapped with normative behaviors and was not always organized into DSM-IV diagnostic modules. The DIPA was designed to overcome those limitations and be more in line with diagnostic assessment tools for older children, adolescents, and adults (Scheeringa and Haslett 2010).
The DIPA was designed to be administered either by trained research assistants or licensed clinicians so that the exact wording of initial probes was specified, but interviewers were required to follow up the initial affirmative answers of respondents to gather examples and justify to themselves that endorsements of symptoms were valid endorsements. Answers were coded as yes/no to replicate the binary nature of symptom endorsement in the DSM-IV. Caregivers were the only respondents because of the well-established fact that children younger than 7 years of age have not yet developed the cognitive capacities of self-reflection and understanding of normative behavior versus problematic symptoms (Martini et al. 1990; Scheeringa et al. 2001).
The original DIPA, based on DSM-IV criteria, was tested for construct validation and test–retest reliability on a clinic sample of 50 one- to six-year-old children (Scheeringa and Haslett 2010). The caregivers of outpatients were interviewed twice to examine concurrent criterion validity and test–retest reliability about eight disorders. These preliminary data supported the DIPA as a reliable and valid measure of symptoms in research and clinical work with very young children. Since the original study, the DIPA has found traction as a useful measure at numerous research sites. Parts, or all, of the DIPA have been used in over 20 peer-reviewed articles (list available from the author by request). However, the binary responses of the DIPA limited its value to measure severity and to capture change over time.
For example, a child could have five symptoms of depression and meet the full criteria for the diagnosis of major depressive disorder, and then receive treatment and show substantial reductions in all five symptoms. However, a posttreatment reassessment could theoretically show that all five symptoms were still present, although at much lower severity, and that the child still met full criteria for the diagnosis, despite major improvements in severity and the family feeling satisfied with the treatment outcome. To expand the capacities of the DIPA beyond diagnoses to a more sensitive metric of change, the DIPA was revised to replace binary yes/no responses with five-point Likert-style ratings. This broader range of severity ratings may have greater utility for determining prognosis, planning treatment, showing change with treatment, and predicting treatment response.
Another limitation was that the Diagnostic and Statistical Manual of Mental Disorders, 5th ed. (DSM-5; American Psychiatric Association 2013) was released in 2013 and the DIPA needed to be updated. One main change was that the DSM-5 created posttraumatic stress disorder (PTSD) criteria specifically for very young children, called PTSD for children 6 years and younger. The wordings of several symptoms were modified. Avoidance of thoughts, feelings, and conversations associated with the trauma was changed to avoidance of people, conversations, or interpersonal situations. Avoidance of activities, places, or people that arouse recollections was changed to avoidance of activities, places, or physical things. Aggression was added to the symptom of irritability and outbursts of anger. Two items from the DSM-IV were deleted: inability to recall an important aspect of the event, and sense of a foreshortened future. A new symptom, increased negative emotional state, was added. No updates were needed for attention-deficit/hyperactivity disorder (ADHD), oppositional defiant disorder (ODD), or the anxiety disorders.
In addition, a new disorder, disruptive mood dysregulation disorder (DMDD), was created in the DSM-5, and there were no existing test–retest reliability data with very young children. DMDD was created to provide a diagnostic home for children with severe temper outbursts who were being diagnosed with bipolar disorder at a frequency that many considered to be inappropriate (Carlson 2016). The DMDD criteria specify that children must be at least 6 years of age to receive the diagnosis, but this is not based on any known empirical data. Based on several studies, there is reason to believe that DMDD occurs in very young children (Axelson et al. 2012; Margulies et al. 2012; Copeland et al. 2013; Dougherty et al. 2014). These studies are limited, however, by either being based on retrospective analyses of data that were collected before DMDD was created (Copeland et al. 2013; Dougherty et al. 2014) or being based on samples that included both younger and older children, but did not provide enough details about the younger children (Axelson et al. 2012; Margulies et al. 2012).
When creating anchors for Likert ratings of symptoms, there is often a decision to make about whether to word the response choices in terms of the frequency of a symptom or in terms of how much of a problem the symptom creates. A high-frequency symptom that is not intense could cause relatively few problems; therefore, change in frequency would be a more sensitive metric than problem intensity. Conversely, a low-frequency symptom that is very intense could cause relatively more intense problems; therefore, problem intensity would be a more sensitive metric than frequency.
For example, the PTSD symptom of intrusive recollections about trauma events can happen frequently, but tends to be kept internalized and does not lead to observable problems; frequency may be the most reliable way to rate this symptom. On the other hand, the PTSD item of difficulty concentrating would be difficult to rate on frequency because if the child successfully avoids tasks that require concentration, it would be difficult to observe; so problem intensity seems the more reliable way to rate this symptom. PTSD symptoms are relatively more complicated in this regard because many of the symptoms are triggered by internal or external reminders. The decision of whether to rate these symptoms by frequency or problem intensity has lacked an empirical basis.
Functional impairment is a required criterion for nearly all DSM-5 disorders. The original DIPA study reported the presence of disorders with or without functional impairment, but separate psychometrics on impairment were not reported (Scheeringa and Haslett 2010). Haag et al. (2020) used the DIPA-Likert (DIPA-L) to assess PTSD in 1–6-year-old children who were exposed to accidental injury trauma and enrolled in a randomized trial of early intervention with a cognitive behavioral therapy-based intervention versus treatment as usual. Functional impairment, measured as a binary categorical variable related to PTSD symptoms, significantly decreased in the intervention group concurrent with a decrease in symptom severity (Haag et al. 2020). This study will be the first to use Likert-style ratings and assess test–retest reliability of functional impairment.
An additional issue in psychometric studies is that respondents commonly endorse fewer symptoms on the second administration compared to the first administration, and this accounts for a substantial amount of test–retest unreliability of measures (Piacentini et al. 1999). This phenomenon is known as attenuation, which has been demonstrated many times with caregivers of older children and adolescents.
There is, however, only one known study that has examined attenuation in very young children. Egger et al. (2006) administered a diagnostic interview to caregivers of 2–5-year-old community children, assessed 1 week apart, and found significant attenuation of the diagnosis by the second interview for two disorders (depression and generalized anxiety disorder [GAD]). Nonsignificant trends for attenuation were noted for separation anxiety disorder (SAD), social phobia, PTSD, ODD, conduct disorder, and ADHD. Scheeringa and Haslett (2010) did not statistically test for attenuation in the original DIPA study. Because the Egger et al. (2006) study was conducted in a community sample, different results may be found with a more severe help-seeking clinic population.
The purpose of this study is to conduct the first replication of psychometric indices in any diagnostic interview used with very young children, and extend the systematic development of the DIPA in several ways. The first research question was to explore the utility of frequency versus problem intensity Likert ratings for PTSD symptoms. Only PTSD symptoms were chosen for this question because they are relatively more complicated and difficult to explain due to the linkage to past events and highly internalized nature of many symptoms (Scheeringa 2011). Frequency versus problem intensity was not tested for other disorders because this would greatly lengthen the interview. A directional hypothesis was not formulated due to the lack of empirical research on this question.
The second research question was to explore how reliably the revised DIPA-L, with new Likert ratings and DSM-5 updates, captured the existence of five of the most common disorders seen in clinics (PTSD, ADHD, ODD, SAD, and GAD). Hypothesis 2: concurrent criterion validity will be acceptable when compared to relevant disorder-specific scales on continuous variables (correlations >0.50). In addition, functional impairment will be assessed for criterion validity. The first DIPA study did not include an impairment-specific criterion measure for comparison. To address this limitation, it is hypothesized that the severity of functional impairment as rated by the DIPA-L will strongly positively correlate with the severity of functional impairment as measured by a self-administered questionnaire. Concurrent criterion validity could not be tested for DMDD because no known instrument existed for comparison.
The third research question was to explore the reliability of the DIPA-L interview on successive administrations. Hypothesis 3: test–retest reliabilities between two independent interviewers will be acceptable for both continuous (intraclass correlation coefficient [ICC] >0.50) and categorical indices (Cohen's κ fair to good, >0.40). Because the original DIPA validation study compared research assistant interviewers to a clinician interviewer, this study was designed with only research assistant interviews to more fully standardize the test–retest reliabilities.
Methods
Participants
Participants included 58 caregivers of 2–6-year-old children, including 48 biological mothers, 4 adoptive mothers, 3 grandmothers with legal guardianship, 2 foster mothers, and 1 father. They were recruited from one private outpatient child and adolescent psychiatry clinic that specialized in very young children (Child Counseling Associates, LLC located in Metairie, LA, within the greater metropolitan area of New Orleans). The inclusion criterion was to be a consecutive intake at the clinic with a child who was between 9 months and 6 years 11 months of age. There were no exclusion criteria for the research study; however, the routine clinic procedure excluded children with a primary diagnosis of autism spectrum disorder from intakes.
Consecutive intakes were invited to participate in the research. Fifty-eight participants completed the first interview, and 52 completed the second interview. Six participants did not complete the second interview because they were unwilling to commit the time. Caregivers were compensated for their participation by being allowed to pick a toy for their children at both visits from a menu of choices that each cost less than 10 US dollars.
Procedure
The protocol was approved by the Tulane University Committee on Use of Human Subjects. The clinicians who conducted the routine clinic intake evaluations asked the caregivers at their initial meeting if they would agree to speak with a research assistant about participation in a research study. If they agreed, a research assistant met with them in a private room, described the study, and reviewed the informed consent form with them. One hundred thirty-six cases were eligible to participate from May 2015 through February 2019, and 72 agreed to speak with a research assistant. Fourteen caregivers declined to participate at this stage, leaving 58 who participated. If they signed the consent form, a time was arranged for the first interview to be conducted. Most participants completed the study while their children were being seen by therapists at their next clinic visit, but some participants came to the clinic separately without their children to complete the study. Children's participation was not required.
All interviews were conducted in person at the clinic and videotaped. Children often received treatment in between the first and second interviews. At the second interview, respondents were instructed to rate the intensity or frequency of symptoms as they were at the time of the first interview.
The interviewers were nonlicensed research assistants who were trained by the principal investigator (PI) on the DIPA-L. For training, they reviewed the DIPA-L and the administration rules on paper. Next, they watched three videotaped interviews conducted by experienced interviewers. Next, they practiced administering a DIPA-L to another research assistant. Next, when they interviewed research participants, every interview was videotaped and portions were reviewed together with the PI to ensure fidelity. There were 14 interviewers, and the number of interviews conducted by each ranged from 1 to 27. The research assistant interviewers were either undergraduate or graduate college students. Of the 110 interviews conducted, 49 were conducted by undergraduate assistants and 61 were conducted by graduate assistants. Interviewers performed first or second interviews based on scheduling availability, but first and second interviews for each participant were always conducted by different interviewers.
Measures
The DIPA-L version is an interview of caregivers about their children from late in the first year of life through 6 years (Scheeringa and Haslett 2010). It includes all symptoms for 14 DSM-5 disorders, but only 6 disorders were used in this study—ADHD, ODD, PTSD, SAD, GAD, and DMDD. These disorders were selected because they are the most common disorders seen in clinical practice in this age group, and it was believed that including all 14 disorders would make the procedure excessively long and discourage participation for the retest interview.
Each symptom question begins with a stem question, which the interviewers were trained to read verbatim. After a stem question, the interviewers were trained to use their judgment on whether follow-up probes were needed. Follow-up probes are provided, which are read verbatim unless case-specific adjustments are needed. Interviewers can continue to probe with nonscripted clarifications until they feel satisfied that a symptom is present or not.
Respondents were given a hard copy of the Likert responses to refer to throughout the interview to minimize the amount that interviewers had to verbally repeat the Likert choices. For items rated on problem intensity over the past month, the choices were 0 = Not at all; 1 = A little bit, mild distress, or little or no disruption of activities; 2 = Somewhat, moderate distress clearly present, but still manageable, or some disruption of activities; 3 = A lot, severe distress, or marked disruption of activities; and 4 = A whole lot, extreme distress, or severe problem/unable to continue activities. For items rated on frequency over the past month, the choices were 0 = None of the time; 1 = Little of the time, once or twice, <10%; 2 = Some of the time, once or twice a week, or ∼20%–30%; 3 = Much of the time, several times a week, or ∼50%–60%; and 4 = Most of the time, daily or almost every day, or more than 80%.
Symptoms were organized precisely by the DSM-5 organization within each disorder module. Several symptoms were present in more than one disorder (e.g., sleep difficulty and concentration) and were asked within each relevant disorder (i.e., asked more than once) so that each disorder module is self-contained. The scripted probes acknowledged the duplicative questioning to prevent respondents' annoyance at being asked something twice.
The PTSD module included 21 items that assessed for 16 symptoms (4 symptoms were assessed with more than one item). Sixteen of these items were rated on both frequency and problem intensity so that we could test which type of rating worked best. The other five items (intrusive recollections, play reenactment, nightmares about the trauma, flashbacks, and difficulty initiating sleep) were rated only on frequency because our experience indicated that problem intensity was not highly relevant for these items.
The DIPA-L assesses functional impairment in a disorder-specific manner by asking about impairment at the end of each disorder. Five areas of role functioning (with parents, with siblings, with peers, at school/day care, and in public) plus a sixth item of child distress (except for ADHD and ODD) were assessed. Because child distress appears qualitatively different than role functioning and intuitively seems to overlap with simply having many of these symptoms, child distress was not included in statistical analyses of impairment. Continuous variables of impairment were the sum of all five role functioning items. Categorical presence of impairment was counted if at least one of the five items was endorsed. For ADHD, at the end of the hyperactivity and inattentive subtype sections, it was asked whether symptoms are present in different settings to determine if the two-setting requirement was met, which is required by the DSM-5.
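The impairment scoring rule described above can be illustrated with a minimal Python sketch. The item key names are hypothetical labels, and treating "endorsed" as a rating of 1 or higher is an assumption borrowed from the symptom threshold used elsewhere in this study; the actual SAS algorithms are not reproduced here.

```python
from typing import Mapping, Tuple

# The five role-functioning areas named in the text; key names are illustrative.
ROLE_ITEMS = ("parents", "siblings", "peers", "school_daycare", "public")


def score_impairment(ratings: Mapping[str, int]) -> Tuple[int, bool]:
    """Return (continuous impairment score, categorical impairment present)."""
    role = [ratings[name] for name in ROLE_ITEMS]
    if any(not 0 <= r <= 4 for r in role):
        raise ValueError("each rating must be on the 0-4 Likert scale")
    total = sum(role)                    # continuous: sum of the five role items
    present = any(r >= 1 for r in role)  # assumption: "endorsed" means rating >= 1
    return total, present


# Hypothetical caregiver responses for one disorder module.
example = {"parents": 2, "siblings": 0, "peers": 1,
           "school_daycare": 0, "public": 0, "child_distress": 4}
print(score_impairment(example))  # the child_distress item is ignored by design
```

Any child distress rating is excluded from both scores, mirroring the decision to leave it out of the impairment analyses.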
For ADHD, there were 18 symptoms and the possible range of Likert severity scores was 0–72. For ODD, there were nine symptoms and the possible range of severity scores was 0–36. For PTSD, there were 21 symptoms and the possible range of severity scores was 0–84. For SAD, there were 10 symptoms and the possible range of severity scores was 0–40. For GAD, there were eight symptoms and the possible range of Likert severity scores was 0–32. For DMDD, there was only one symptom and calculating a severity score was not appropriate.
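The score ranges above follow from multiplying the number of items in each module by the maximum Likert rating of 4. A short Python sketch makes the arithmetic explicit; the function and dictionary names are illustrative and are not part of the DIPA-L materials.

```python
from typing import Sequence, Tuple

MAX_RATING = 4  # top of the 0-4 Likert scale

# Item counts per module as reported in the text.
ITEM_COUNTS = {"ADHD": 18, "ODD": 9, "PTSD": 21, "SAD": 10, "GAD": 8}


def severity_range(module: str) -> Tuple[int, int]:
    """Possible range of the summed Likert severity score for a module."""
    return 0, ITEM_COUNTS[module] * MAX_RATING


def severity_score(module: str, ratings: Sequence[int]) -> int:
    """Sum the item ratings after checking count and scale bounds."""
    if len(ratings) != ITEM_COUNTS[module]:
        raise ValueError(f"{module} expects {ITEM_COUNTS[module]} ratings")
    if any(not 0 <= r <= MAX_RATING for r in ratings):
        raise ValueError("ratings must be between 0 and 4")
    return sum(ratings)


for name in ITEM_COUNTS:
    print(name, severity_range(name))
```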
The time frame of the interview specified that a symptom or behavior be present within the last 4 weeks. If children were on medication for a disorder, caregivers were instructed to answer as if their children were not taking the medication. Diagnoses were generated from computerized algorithms in SAS 9.4 (SAS, Cary, NC). The DIPA-L is available at Dr. Scheeringa's laboratory website at Tulane University.
PTSD was measured with the Young Child PTSD Checklist (YCPC), a 42-item self-administered questionnaire (Scheeringa 2010). The first 13 items ask about traumatic events. The next 23 items map onto the DSM-5 PTSD symptoms rated on a 0–4 Likert scale. The last six items ask about functional impairment. There were no existing measures for PTSD in young children with established psychometric indices, so the YCPC was selected because it was the only known measure that mapped precisely onto the DSM-5 criteria, and one of the few that included instructions for respondents to endorse events only if they were truly life-threatening: the individual felt like he/she might die, or he/she had a serious injury or felt like he/she might get a serious injury, or he/she saw life-threatening events happen to another person. Most of the other existing measures either describe events as stressful or ascribe no valence to the events, without the requirement that events be perceived as life-threatening. Each event can be endorsed Yes, No, or Not Sure. For each endorsed event, the checklist asks for the earliest age it happened, the latest age it happened, and approximately how many times it happened. Items endorsed as Not Sure were counted as missing in data analyses because they could not be clearly attributed as either Yes or No.
Anxiety was measured with the Screen for Child Anxiety Related Disorders Parent Version (SCARED), a 41-item self-administered checklist for caregivers. Items are scored on 0–2 Likert scales. The range of possible total scores is 0–82. The SCARED has been validated in children 7 years and older (Birmaher et al. 1999). Despite the absence of psychometric data on the SCARED in younger children, it was selected because there is no validated anxiety measure in younger children, the SCARED is one of the most widely used instruments for youth, and the questions correspond well with the DSM-5 criteria.
ADHD and ODD were measured with the Swanson, Nolan, and Pelham scale (SNAP), a self-administered checklist for caregivers. This study used the 20 items that map onto ADHD and the 10 items that map onto ODD. Each item is rated on a 0–3 scale. There are no known formal psychometric data on children younger than 7 years; however, this measure was used in the multisite Preschool ADHD Treatment Study (Kollins et al. 2006).
Functional impairment was measured with the Brief Impairment Scale-modified (BIS). The full BIS is a 23-item self-administered questionnaire for caregivers, which measures the severity of children's functional impairment (Bird et al. 2005). The scale was modified for this study by retaining only the 13 items that can be rated in this age group; 10 items that rated academic achievement, sports, and hobbies were omitted. Each item is rated on a 0–4 Likert scale.
Data analysis
To address the first research question, frequency ratings were compared to problem intensity ratings for 16 PTSD items on three considerations. First, if Pearson correlation coefficients between frequency and problem intensity were 0.70 or higher, and statistically significant, this would suggest that frequency and problem intensity are measuring severity in equivalent manners and neither is superior to the other. Second, if mean Likert rating scores were higher for one rating (frequency or problem intensity) relative to the other, this would suggest that the symptom is more behaviorally observable to the caregivers, and the ratings with a higher mean may have greater practical utility. Third, if mean scores on one rating (frequency or problem intensity) changed less from Time 1 to Time 2 relative to the other, this would suggest that the construct is more reliable across closely repeated administrations. Mean frequency ratings and mean problem intensity ratings were compared using paired sample t-tests, with Bonferroni corrections for multiple comparisons. Internal consistency was calculated using Cronbach's α.
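For readers unfamiliar with Cronbach's α, the following is a minimal Python sketch of the standard formula, α = k/(k − 1) × (1 − Σ item variances / variance of total scores). The data are hypothetical, and this is not the SAS code used in the study.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(rows: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a respondents-by-items matrix of ratings."""
    k = len(rows[0])                                   # number of items
    if k < 2:
        raise ValueError("alpha needs at least two items")
    columns = list(zip(*rows))                         # one tuple per item
    item_var_sum = sum(variance(col) for col in columns)
    total_var = variance([sum(row) for row in rows])   # variance of summed scores
    if total_var == 0:
        raise ValueError("total scores have no variance")
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical 0-4 Likert ratings: 5 respondents by 4 items.
ratings = [
    [0, 1, 0, 1],
    [2, 2, 1, 2],
    [3, 4, 3, 3],
    [1, 1, 2, 1],
    [4, 3, 4, 4],
]
print(round(cronbach_alpha(ratings), 3))
```

The requirement of at least two items is also why α cannot be computed for a single-item module such as DMDD.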
For criterion validity (Hypothesis 2), Time 1 DIPA-L interviews were compared to Time 1 criterion measures, which were the self-administered questionnaires (YCPC, SNAP, SCARED, and BIS), and Time 2 DIPA-L interviews were compared to Time 2 questionnaires. Continuous scores of the sum of the DIPA-L Likert ratings for symptoms of each disorder were compared to the total scores of relevant questionnaires with Pearson correlations. SAD and GAD scores from the DIPA-L were combined to create an Anxiety score because the SCARED combines these constructs. Categorical determinations for the presence or absence of symptoms were based on scoring at least 1 or higher on frequency or problem intensity ratings. This low threshold was based on empirical findings from prior studies that caregivers tend to underestimate the presence of both internalizing and externalizing symptoms in children (Mai and Scheeringa 2019). To examine attenuation from Time 1 to Time 2, the proportion of children diagnosed at Time 1 was compared to the proportion at Time 2 with binomial distribution tests.
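The criterion validity comparison can be sketched in a few lines of Python. The scores below are hypothetical, and combining SAD and GAD by simple summation is an assumption, since the text states only that the two scores were combined.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical DIPA-L module totals for six children.
sad_totals = [4, 10, 0, 15, 7, 22]
gad_totals = [2, 8, 1, 12, 3, 18]
# Assumption: the Anxiety score is the sum of the SAD and GAD totals.
anxiety = [s + g for s, g in zip(sad_totals, gad_totals)]

# Hypothetical SCARED totals (possible range 0-82) for the same children.
scared = [9, 25, 4, 40, 15, 61]

r = pearson_r(anxiety, scared)
print(round(r, 2), "meets the >0.50 benchmark" if r > 0.50 else "below benchmark")
```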
For test–retest reliability (Hypothesis 3), benchmarks for assessing reliability about the presence of categorical disorders were based on the accepted ranges of Cohen's κ: Poor 0–0.4, Fair to Good 0.4–0.6, Substantial 0.6–0.8, and Excellent 0.8–1.0 (Landis and Koch 1977). Continuous scores of DIPA-L Likert ratings were tested for reliability with ICCs using the fixed set result from Hamer's SAS macro (Shrout and Fleiss 1979; Hamer 1990). The ICC r's were tested for significance of being greater than zero by F tests following McGraw and Wong's (1996) recommendation. Guidelines for interpreting the coefficients followed Cohen (1988) with Small r = 0.10, Medium r = 0.30, and Large r = 0.50. ICC was not computed for DMDD because it contained only one item. The length of time between administrations of the interview was explored as a possible confounder by dividing cases into those with 30 days or less between interviews and those with more than 30 days.
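Both reliability indices can be sketched in plain Python. Cohen's κ is computed from the four cells of a 2×2 agreement table, and the ICC is computed from two-way ANOVA mean squares as ICC(3,1) in the Shrout and Fleiss notation. Two assumptions are made here: that the "fixed set" result of the SAS macro corresponds to ICC(3,1), and that the lower bound of each κ benchmark band is inclusive. All data are hypothetical.

```python
from typing import Sequence


def cohen_kappa(a: int, b: int, c: int, d: int) -> float:
    """Kappa from a 2x2 table: a = +/+, b = +/-, c = -/+, d = -/- (T1/T2)."""
    n = a + b + c + d
    p_observed = (a + d) / n
    p_expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)


def kappa_benchmark(kappa: float) -> str:
    """Label kappa with the bands used in the text (lower bounds inclusive)."""
    if kappa >= 0.8:
        return "Excellent"
    if kappa >= 0.6:
        return "Substantial"
    if kappa >= 0.4:
        return "Fair to Good"
    return "Poor"


def icc_3_1(scores: Sequence[Sequence[float]]) -> float:
    """ICC(3,1): rows are subjects, columns are the fixed set of administrations."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    bms = ss_rows / (n - 1)                  # between-subjects mean square
    ems = ss_error / ((n - 1) * (k - 1))     # residual mean square
    denominator = bms + (k - 1) * ems
    if denominator == 0:
        raise ValueError("scores have no variance")
    return (bms - ems) / denominator


# Hypothetical agreement table and Time 1 / Time 2 severity totals.
kappa = cohen_kappa(a=18, b=4, c=3, d=27)
print(round(kappa, 2), kappa_benchmark(kappa))
print(round(icc_3_1([[30, 28], [12, 15], [45, 40], [8, 9], [22, 25]]), 2))
```

Because ICC(3,1) is a consistency index, a constant shift between administrations (for example, every Time 2 score being one point higher) does not lower it.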
Results
Preliminary analyses
The ages of the children of the 58 participants who completed the first interview were 2 (n = 6), 3 (n = 3), 4 (n = 14), 5 (n = 23), and 6 years (n = 12). The type of insurance was 55% private insurance, 38% Medicaid, and 7% self-pay. The durations between interviews were between 5 and 30 days for 38 cases, between 31 and 60 days for 8 cases, between 61 and 90 days for 5 cases, and 94 days for 1 case. The median was 17 days.
The 52 subjects who completed both interviews were compared to the 6 subjects who completed T1, but not T2 interviews. They did not differ on gender, race, age, mother age, mother education level, father age, whether father lived with them, type of insurance, and YCPC, SCARED, ADHD, ODD, or BIS scores (Table 1). The education level of fathers in the non-completer group (M = 19.1 and standard deviation [SD] = 1.0) was significantly higher than the level of fathers in the completer group (M = 15.2 and SD = 3.2), t = 2.41, p < 0.05.
Table 1. Sample Descriptive Statistics
p < 0.05.
ADHD, attention-deficit/hyperactivity disorder; BIS, Brief Impairment Scale; ODD, oppositional defiant disorder; SCARED, Screen for Child Anxiety Related Disorders; SNAP, Swanson, Nolan, and Pelham; YCPC, Young Child PTSD Checklist; SD, standard deviation; PTSD, posttraumatic stress disorder.
Hypothesis 1: Frequency compared to problem intensity ratings
For the 16 PTSD items that were rated on both frequency and problem intensity, the ratings were compared three ways. First, the correlations between frequency and problem intensity for all but one symptom were highly positive (T1 range r = 0.73–1.0, median 0.93 and T2 range r = 0.65–1.0, median 0.93) and significant (p < 0.0001), suggesting there was little difference between frequency and problem intensity. The one exception was the symptom of nonplay reenactment of the traumatic event; at Time 1, r = 0.13 (p = 0.56), and at Time 2, r = −0.12 (p = 0.65).
Second, only two comparisons of frequency with problem intensity were different at the p < 0.05 level. Problem intensity of symbolic nightmares at T1 was higher (M 1.21 and SD 1.56) compared to frequency (M 0.62, SD 1.13) (t = −2.50, p = 0.020), and problem intensity of psychological distress at reminders at T2 was higher (M 1.29 and SD 1.21) compared to frequency (M 0.78, SD 0.88) (t = −2.70, p = 0.0156). However, these two differences were no longer significant after Bonferroni corrections for multiple comparisons, again suggesting little difference between the two types of ratings.
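The effect of the Bonferroni correction on these two results can be illustrated with a short Python sketch. The family size of 16 comparisons (one per PTSD item rated both ways) is an assumption; the text does not state the exact number of comparisons used for the correction.

```python
from typing import List, Sequence


def bonferroni_significant(p_values: Sequence[float],
                           n_comparisons: int,
                           alpha: float = 0.05) -> List[bool]:
    """Flag p-values that stay below alpha divided by the number of comparisons."""
    threshold = alpha / n_comparisons
    return [p < threshold for p in p_values]


# The two uncorrected p-values reported in the text.
reported = [0.020, 0.0156]

# Assumption: 16 comparisons, one for each PTSD item rated on both scales.
print(0.05 / 16)                                # corrected per-test threshold
print(bonferroni_significant(reported, n_comparisons=16))
```

Under this assumption the per-test threshold is well below both reported p-values, which is consistent with neither difference surviving correction.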
Third, after corrections for multiple comparisons, there were no significant differences between means of frequencies at Time 1 compared to frequencies at Time 2, or for means of problem intensities at Time 1 compared to problem intensities at Time 2. Intensity of nightmares was higher at Time 1 compared to Time 2 (p < 0.05), and intensity of avoidance of activities, places, or things was higher at Time 1 compared to Time 2 (p < 0.05), but these were no longer significant after Bonferroni corrections.
These three considerations indicate that ratings of frequency and problem intensity were largely equivalent. For parsimony, it was decided to retain only problem intensity ratings for the remainder of analyses when calculating total scores and diagnoses because intensity scores were higher than frequency scores for 10 of the 16 items, even though these were not statistically significant differences. Higher ratings suggest that problem intensity was more observable and/or salient to caregivers than frequency, suggesting relatively greater utility for problem intensity.
Using only problem intensity ratings to determine the presence of these 16 PTSD items, internal consistency (measured as Cronbach's α) for all 21 items in the PTSD module was 0.95. The internal consistencies for the other four DIPA-L modules were also acceptable: ADHD 0.92, ODD 0.84, SAD 0.78, and GAD 0.89. Internal consistency for DMDD could not be calculated with one symptom item.
Hypothesis 2: Concurrent criterion validity
For the tests of concurrent criterion validity, Pearson r values between DIPA-L continuous scores of Likert ratings and disorder-specific self-administered checklists for PTSD, ADHD, ODD, anxiety, and functional impairment were all positive, large (r > 0.50), and significant at the p < 0.0001 level (Table 2). This was true for both Time 1 and Time 2. For Time 1 and Time 2 combined, the range of correlations for disorders was 0.56–0.94, and the median was 0.75. This supported Hypothesis 2.
Concurrent Criterion Validation: Means and Standard Deviations of Diagnostic Infant Preschool Assessment-Likert Modules and Criterion Measures
p < 0.0001.
Anxiety, separation anxiety disorder and generalized anxiety disorder scores were combined.
ADHD, attention-deficit/hyperactivity disorder; BIS, Brief Impairment Scale; DIPA-L, Diagnostic Infant Preschool Assessment-Likert version; M, mean; ODD, oppositional defiant disorder; PTSD, posttraumatic stress disorder; r, Pearson correlation coefficient; SCARED, Screen for Child Anxiety Related Disorders; SNAP, Swanson, Nolan, and Pelham; SD, standard deviation; YCPC, Young Child PTSD Checklist.
Children met criteria for a total of 60 disorders at Time 1 and 50 disorders at Time 2. The most common disorder was ODD, and the least common was DMDD (Table 3). Despite the decrease from 60 to 50 disorders overall, there was no statistically significant evidence for attenuation of specific diagnoses from Time 1 to Time 2. Binomial distribution tests that compared the proportion of diagnoses at Time 1 to the proportion of diagnoses at Time 2 were all nonsignificant. However, the binomial distribution test for functional impairment was significant (p = 0.025), indicating that more children were considered impaired by the DIPA-L at Time 1 (47 out of 51) than at Time 2 (42 out of 51). Omitting the 14 cases that had >30-day duration between the Time 1 and Time 2 interviews did not change these results.
Test–Retest Reliabilities Between Time 1 and Time 2 for Diagnostic Infant Preschool Assessment-Likert
* p < 0.05; ** p < 0.0001.
a = positive at both interviews; b = positive at first interview and negative at second interview; c = negative at first interview and positive at second interview; and d = negative at both interviews.
ADHD, attention-deficit/hyperactivity disorder; DIPA-L, Diagnostic Infant Preschool Assessment-Likert version; DMDD, disruptive mood dysregulation disorder; ICC, intraclass correlation from Shrout Fleiss reliability fixed set; GAD, generalized anxiety disorder; n/a, not appropriate; ODD, oppositional defiant disorder; PTSD, posttraumatic stress disorder; SAD, separation anxiety disorder.
Hypothesis 3: Test–retest reliability
For the examination of test–retest reliabilities for continuous scores, the ICCs were large for all constructs, except for GAD (Table 3). The range of ICC values for disorders was 0.29–0.91, and the median was 0.84. The ICC for impairment, 0.77, was also large. The low ICC for GAD (0.29) appeared to be due to five cases with large discrepancies for GAD ratings between Times 1 and 2. In three of these five cases, relatively high scores were endorsed at Time 1 (8, 11, and 28), but zero items were endorsed at Time 2; in the other two cases, zero items were endorsed at Time 1, but relatively high scores were endorsed at Time 2 (8 and 11). When these five cases were omitted, the ICC, 0.84, was comparable to the other constructs.
When the 14 cases that had >30-day duration between Time 1 and Time 2 interviews were omitted, ICC values stayed essentially the same for most disorders (ADHD, ODD, SAD, and impairment), but showed a slight increase for PTSD and a slight decrease for GAD (Table 3).
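The footnote to Table 3 describes the ICC as the Shrout Fleiss reliability for a fixed set. Assuming this corresponds to the two-way, consistency-type coefficient commonly labeled ICC(3,1), it can be sketched from the mean squares of a subjects-by-occasions table as follows. The function name and the Time 1/Time 2 scores are hypothetical; this is an illustration of the statistic under that assumption, not the analysis code used in the study.

```python
def icc_fixed_set(scores):
    """ICC(3,1) for a list of subjects, each a list of k occasion scores.

    ICC(3,1) = (BMS - EMS) / (BMS + (k - 1) * EMS), where BMS is the
    between-subjects mean square and EMS is the residual mean square
    from a two-way (subjects x occasions) decomposition.
    """
    n = len(scores)
    k = len(scores[0])
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_subjects = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_occasions = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_subjects - ss_occasions

    bms = ss_subjects / (n - 1)
    ems = ss_error / ((n - 1) * (k - 1))
    denominator = bms + (k - 1) * ems
    if denominator == 0:
        raise ValueError("scores have no variance")
    return (bms - ems) / denominator


# Hypothetical continuous module scores at Time 1 and Time 2 (not study data).
example_scores = [
    [12, 10],
    [30, 27],
    [5, 8],
    [22, 20],
    [0, 3],
    [17, 15],
]
example_icc = icc_fixed_set(example_scores)
print(round(example_icc, 2))
```

Because the between-subjects mean square sits in both the numerator and the denominator, a handful of cases that move from a high score to zero (or the reverse) inflates the residual mean square and can lower the coefficient sharply in a small sample, consistent with the pattern described for GAD.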
For the test–retest reliabilities of categorical status (i.e., diagnoses), the Cohen's κs were Substantial for ADHD (0.79), ODD (0.71), and DMDD (0.66); Fair to Good for SAD (0.56) and GAD (0.49); and Poor for PTSD (0.35). When the 14 cases that had >30-day duration between Time 1 and 2 interviews were omitted, ADHD and SAD improved from Substantial to Excellent, and PTSD improved from Poor to Fair to Good. ODD, GAD, and DMDD did not change categories (Table 3).
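Using the cell labels defined for Table 3 (a, b, c, and d), Cohen's κ for test–retest agreement on a categorical diagnosis can be computed as sketched below. The function name and the cell counts in the example are hypothetical and are not taken from Table 3.

```python
def cohen_kappa(a, b, c, d):
    """Cohen's kappa from a 2x2 test-retest table.

    a: positive at both interviews
    b: positive at first interview, negative at second
    c: negative at first interview, positive at second
    d: negative at both interviews
    """
    n = a + b + c + d
    p_observed = (a + d) / n
    p_pos_time1 = (a + b) / n
    p_pos_time2 = (a + c) / n
    # Agreement expected by chance from the marginal proportions.
    p_expected = (p_pos_time1 * p_pos_time2
                  + (1 - p_pos_time1) * (1 - p_pos_time2))
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical counts (not study data).
example_kappa = cohen_kappa(a=20, b=5, c=10, d=15)
print(round(example_kappa, 2))
```

Because κ corrects for chance agreement, a few discordant cases in cell b can lower it substantially when a diagnosis is uncommon, even if most children are classified the same way at both interviews.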
Discussion
The DIPA-L showed adequate concurrent criterion validity when compared to disorder-specific scales. All of the correlations between DIPA-L disorders and the criterion scales were large (r > 0.50) and highly significant. Because these data were based on correlations with disorder-specific measures, they represent more specific validation data compared to the original DIPA study that was based on correlations with the Child Behavior Checklist (CBCL), which is not as tightly linked to DSM disorders. It is not surprising then that the median correlation was 0.75 in this study and was 0.50 for the same disorders in the original study.
The DIPA-L also showed adequate test–retest reliabilities on continuous or categorical indices for ADHD, ODD, DMDD, and SAD, largely supporting the third hypothesis. For PTSD, the categorical index (κ) was Poor, but the continuous index (ICC) was Large (0.87). Given the relatively complicated nature of the PTSD diagnostic algorithm, which requires symptoms from three clusters, it may not be surprising that an overall severity score derived from the Likert ratings is a more stable index than the categorical diagnosis.
The relatively low test–retest reliability κ for the categorical PTSD diagnosis (0.35) was not expected. The low κ appears due to five cases that were diagnosed with PTSD at Time 1, but not Time 2. Our speculation is that the Time 1 PTSD diagnoses for these five children were false positives. Caregivers may have initially believed their children exhibited symptoms due to past stressful experiences, but through either the experiences of Time 1 interviews or discussions with their clinicians as part of the normal intake process realized that their children's symptoms were due to other causes. The original DIPA study showed good reliability κ for categorical agreements in a clinic sample (0.67), and the PAPA reported 0.73 in a nonclinical community sample (Egger et al. 2006).
There are few data from studies with older children and adolescents that could provide guidance to interpret this pattern. Despite the development of many diagnostic interviews for 7- to 17-year-old youths, the only known test–retest reliability of PTSD comes from a small study of twenty 7- to 17-year-old clinic patients using the Schedule for Affective Disorders and Schizophrenia for School-Age Children (K-SADS). With interviews conducted on average 17.9 days apart, they reported a κ of 0.56 (Kaufman et al. 1997). For comparison in adults, test–retest reliability for lifetime PTSD diagnosis using the Structured Clinical Interview for DSM-5 was 0.65 (but could not be computed for current PTSD diagnosis because of no cases at Time 2) (Shankman et al. 2018).
The test–retest reliability was unacceptable for GAD on both categorical and continuous metrics. It is noteworthy that test–retest reliability was also problematic using the PAPA in a community sample. In that study, the κ for GAD was the second lowest among 12 disorders tested, the ICC was the third lowest among 11 disorders tested, and it was 1 of 4 disorders that showed significant attenuation between Time 1 and Time 2 (Egger et al. 2006). The low reliability for GAD in this study was skewed by four cases diagnosed at Time 1, but not at Time 2. This suggests either that GAD symptoms have relatively high temporal instability or, as noted above in regard to PTSD, that caregivers may have initially believed their children had excessive worries, but later came to understand that the worries were not excessive.
It is not completely straightforward to compare continuous metrics of test–retest reliability from this study to those of the original DIPA study because continuous scores in the original study were sum scores of the number of symptoms, with a much smaller range of possible values compared to the continuous Likert scores in this study. Nevertheless, the results were similar between studies, with the median correlation 0.84 in this study, and 0.78 for the same disorders in the original study.
This is the first study with the DIPA-L and provides the first examination of how different types of Likert ratings perform when applied to diagnostic symptoms in this age group. After examining frequency ratings versus problem intensity ratings in several ways for 16 of the 21 PTSD items, it appeared that these two types of ratings were largely equivalent. The lone exception in which frequency ratings did not significantly correlate with problem intensity ratings was for the item of non-play reenactment. Speculating from the limited raw data, we noted that this item tended to be rated higher on problem intensity relative to frequency, suggesting that non-play reenactment occurs infrequently, but is perceived as problematic when it does occur. For parsimony, only problem intensity ratings were retained for the 16 tested PTSD items in the final version of the DIPA-L because problem intensity ratings were higher than frequency ratings for some items, indicating that problem intensity was more observable and/or salient to caregivers than frequency.
This study provided additional evidence that DMDD is identifiable in very young children, despite the DSM-5 criterion that it is not to be diagnosed until children are at least 6 years of age. DMDD was diagnosed in one 5-year-old child at both interviews, and in one 6-year-old child at Time 1, but not Time 2. In addition, DMDD was diagnosed in one 3-year-old child at Time 1, who did not return for Time 2. This is consistent with Copeland et al. (2013), who identified DMDD in a community sample of 2- through 5-year-old children. However, DMDD appears rare in this clinical sample, which contrasts with the finding by Copeland et al. (2013) that very young children showed higher rates of DMDD compared to 7- through 17-year-old youths.
Attenuation was not found for any specific disorder even though the total number of diagnoses decreased from 60 at Time 1 to 50 at Time 2. This finding was in contrast to prior studies that found attenuation (Piacentini et al. 1999; Egger et al. 2006). Those studies used larger, nonclinical samples, and it seems likely that attenuation would be significant in a larger sample of this population with more statistical power. Overall, this suggests that attenuation is a phenomenon that occurs in a subset of cases, but is a relatively weak effect that is not apparent in smaller, clinical samples.
This study addressed some limitations of the first DIPA study and extended the development of the DIPA in several ways. The first study recruited a relatively more disadvantaged population at a clinic that accepted only Medicaid clients, whereas this study was conducted at a private practice clinic that served a population in which the majority were non-Medicaid and had private insurance or could pay out of pocket. This helps to extend the generalizability of the DIPA-L to various sociodemographic populations.
The first study compared clinician interviews to research assistant interviews, whereas this study used only research assistants to conduct interviews to control for the variance in types of interviewers. The first study used the CBCL to test concurrent criterion validation, which was not designed to map onto DSM-5 disorders, whereas this study used separate, disorder-specific measures. Finally, this study tested DMDD, a new childhood disorder created with the publication of the DSM-5 in 2013, and reported psychometrics on functional impairment for the first time.
A strength of this study is the sampling from a help-seeking, clinic population. There are four other known diagnostic interviews that have been either developed for or used for very young children. The PAPA has strengths of being developmentally modified and tested for test–retest reliability on 12 disorders in a large sample (n = 307), but it was a nonclinic, community sample, and criterion validation was not tested (Egger et al. 2006). The K-SADS has strengths of being tested on 15 disorders in a large sample (n = 204), but it was a nonclinic, community sample, the questions were not developmentally modified on the paper interviews, criterion validity was tested against the CBCL rather than disorder-specific measures (Birmaher et al. 2009), and test–retest reliability has been examined in 20 subjects (Kaufman et al. 1997). The Diagnostic Interview of Children and Adolescents for Parents of Preschool and Young Children has strengths of being tested for test–retest reliability on 14 disorders in a large sample (n = 244), but it was a community sample, had no cases of PTSD, criterion validity was tested against the CBCL, and generalizability to the United States may be limited because the study was conducted in Spain (Ezpeleta et al. 2011). The Diagnostic Interview Schedule for Children, Fourth Edition (DISC-IV), has strengths of being developmentally modified and tested in medium-size samples (n = 128 in one study and n = 97 in another), but they were sampled from nonclinic, childcare centers, test–retest reliability was not examined, and the study was limited to only ADHD, ODD, and conduct disorder (Rolon-Arroyo et al. 2016). In addition, the DISC-IV ADHD and ODD modules have been used in a longitudinal study of very young children (Harvey et al. 2009), and a modified DISC-IV MDD module has been used in the series of studies conducted by Joan Luby on preschool depression (Luby et al. 2003). 
The DIPA-L is the only instrument that has a replication, has been tested in clinic populations, has been updated for DSM-5, has psychometric data on its functional impairment ratings, and includes Likert ratings.
A limitation of the study is its sample size. Despite this limitation, this study was larger than the initial studies for most of the major instruments developed for older children, which are still commonly used, including the DISC-R (n = 39) (Schwab-Stone et al. 1993), the DICA (n = 27) (Welner et al. 1987), and the K-SADS (n = 20) (Kaufman et al. 1997). In addition, the sample size constrains confidence in the results for the more infrequently found disorders (PTSD, SAD, GAD, and DMDD), and conclusions about these disorders should be considered tentative.
Another limitation is the limited racial diversity of the sample (87% was white/Caucasian), although there is no known evidence that race impacts construct validation or test–retest reliability of diagnostic interviews. Another possible limitation is that many children received treatment in between the first and second interviews. While respondents were instructed at the second interview to rate the intensity or frequency of symptoms at the time of the first interview, this retrospective rating may have led to overreporting or underreporting of symptom intensity or frequency at the second interview due to inaccuracy of retrospective memory.
Conclusions
Given the limitations of this study noted earlier, replication with larger and more diverse samples could be informative. Nevertheless, the updated and revised DIPA-L is a promising diagnostic instrument for both research and clinical work that can provide comprehensive coverage of symptoms and promote reliable assessment for the care of very young children.
Clinical Significance
Recognition of psychiatric problems in very young children and access to clinicians who feel competent working with this population still lag considerably behind the recognition and access available to older children and adolescents (Gleason 2017). This is due to multiple factors, but is likely due, in part, to limited exposure in training programs and fewer measures and interventions created for this age group. These data provide a firmer psychometric foundation for the DIPA-L and simultaneously support the growing evidence base that very young children can, unfortunately, experience psychiatric disturbances that can be validly and reliably assessed, which, in turn, may help expand clinical services and clinical research for this population.
Footnotes
Acknowledgments
The author wishes to thank the clinicians of Child Counseling Associates, LLC for their cooperation: Ruth Arnberger, Allison Staiger, and Marti Tidwell.
Disclosures
The author has received royalties from Guilford Press, Central Recovery Press, and Psychology Today.
