Abstract
The authors demonstrate the incremental clinical validity gained in early childhood assessment of physical impairment (PI), developmental delay (DD), and autism (AUT) by using multiple standardized developmental screening measures such as performance measures and parent and teacher rating scales. Hierarchical regression and sensitivity/specificity analyses were used to identify the differential impact of each domain the scales measure. Significant findings include (a) self-help domains in either parent or teacher questionnaires are more significant contributors than social-emotional domains to early detection, (b) performance measures are stronger predictors than parent or teacher questionnaires in detecting physical impairment or developmental delay, and (c) parent questionnaires measuring self-help skills are a stronger predictor of autism than performance measures. These results support the combined use of parent and teacher rating scales and have important implications for choosing instruments for different developmental disorders when time and resources are limited.
Over the past 30 years, increased federal and state involvement has focused on improving the development of preschool children through valid and reliable assessment and interventions. More specifically, the Individuals with Disabilities Education Improvement Act (IDEIA) of 2004 states that school districts must screen children who may be at risk for developmental disabilities using valid assessment tools (IDEIA, 2004). The utility of developmental screening and the value of early identification of children at risk have been demonstrated by a body of research (Bayoglu, Bakar, Kutlu, Karabulut, & Anlar, 2007; Shonkoff & Meisels, 2000).
A salient feature of the screening process is the endorsement of a meaningful partnership with parents by screening professionals (Macy, 2012; National Joint Committee on Learning Disabilities [NJCLD], 2010). Evidence from multiple settings and informants is then gathered over time (National Association for the Education of Young Children [NAEYC], 2009; National Association of School Psychologists [NASP], 2005). Obtaining information from multiple sources is supported by research as having higher diagnostic reliability (Kerr, Lunkenheimer, & Olson, 2007). Furthermore, the Standards for Educational and Psychological Testing state, “In educational settings, a decision or characterization that will have major impact on a student should not be made on the basis of a single test score. Other relevant information should be taken into account if it will enhance the overall validity of the decision” (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1999, p. 146).
Although a multi-informant assessment approach may be optimal, more assessments often mean higher costs, which are among the most frequently cited barriers to formal assessment and screening. Studies have shown that clinicians bear the brunt of costs and are often not adequately compensated for their time and efforts (Glascoe, Foster, & Wolraich, 1997). The majority of the costs pertain to administering, scoring, and interpreting the assessment (Dobrez et al., 2001). In addition, the U.S. Department of Education has spent a substantial amount of money on eligibility determination. Specifically, in the 1999-2000 school year alone, the amount was US$6.7 billion, according to the Special Education Expenditure Project (SEEP), a national study conducted by the American Institutes for Research (AIR) for the U.S. Department of Education, Office of Special Education Programs (OSEP; Chambers, Parrish, & Harr, 2004). Because of the high demand for and strong focus on either screening or in-depth evaluation, it is important to determine the type and quantity of assessment tools needed in order to be as time- and cost-efficient as possible while obtaining the most accurate information. Incremental validity (i.e., the increase in validity gained from adding extra measures to the current assessment) helps determine whether a particular additional instrument or method could provide a significant improvement in the evaluation outcome (Sackett & Lievens, 2008). The incremental validity of assessment systems for children has been examined primarily for rating scales and mental health assessments (Callahan, Gillis, & Romanczyk, 2011; Hill, Maskowitz, Danis, & Wakschlag, 2008; Johnston & Murray, 2003). Few studies have looked at performance measures and rating scales together, particularly with regard to early childhood psychoeducational measures.
Rating scales, such as parent surveys or questionnaires, are popular methods in assessing children because of their low cost. However, some controversy exists in how much we should rely on parental reports. Some researchers indicate that parental assessments are useful (Johnson et al., 2004; Salt et al., 2005; Tervo, 2005), and others argue that developmental screening procedures that use parental reports cannot be considered evidence based (Lagerberg, 2005). Some researchers suggest that the accuracy of parental reports may depend on the developmental domain being assessed. For example, parental concerns about speech, motor, and behavioral development yielded a high sensitivity to the final diagnosis of the same developmental domain; however, parental concerns about cognitive, global developmental delay, and behavior had limited sensitivity and might lead to lower criterion-related validity than costly assessments administered by clinicians (Chen, Lee, Yeh, Lai, & Chen, 2004; Chung, Liu, Chen, Tang, & Wong, 2011).
Only a handful of studies have compared the predictive validity of both parental and teacher rating scales, particularly with regard to self-help and social-emotional skills (e.g., Power et al., 1998). The incremental validity of adding the more costly performance measures to the prediction has not been investigated. Furthermore, the clinical utility of different assessment tools might vary in prediction accuracy for different disorders. Knowing the advantages and limitations of the rating tools available, clinicians could make evidence-based decisions on whether parental and/or teacher measures are sufficient to provide performance-based evidence in identifying disorders. The present study investigates the incremental validity of different measures for different disorders (e.g., using a motor scale to predict a physical impairment).
This study uses a predictive design to evaluate three common sources of information (i.e., parent ratings, teacher ratings, and performance measures) in the context of early childhood screening assessment. More specifically, the study examines the incremental validity of using multiple types of scales from the DIAL-4™ (Developmental Indicators for the Assessment of Learning, 4th ed.; Mardell & Goldenberg, 2011) in predicting three common developmental disorders: (a) physical impairment (PI), (b) developmental delay (DD), and (c) autism (AUT). For each disorder, comparative analyses help answer the following questions about early detection:
Does the Teacher Questionnaire contribute more incremental validity than the Parent Questionnaire?
Does a more costly Performance measure contribute more incremental validity than the Parent and Teacher Questionnaires?
What is the relative validity of the Parent Questionnaire, Teacher Questionnaire, and Performance measure in predicting each type of developmental disorder?
What is the relative validity of parents’ or teachers’ ratings on Self-Help and Social-Emotional Development in early detection?
Hierarchical regression models were used to examine the relationship between the predictor sets and predictive outcomes. Clinical utility was further evaluated using sensitivity/specificity analysis. It was hypothesized that the inclusion of content-relevant measures (e.g., using the DIAL-4 Motor Scale to predict PI) along with information from different sources (e.g., examiners, parents, and teachers) would improve the predictive validity in identifying different developmental disorders.
Method
Participants
Data collected for the standardization of the DIAL-4 were obtained from the test publisher (i.e., Pearson). The clinical groups used for analyses included 49 children diagnosed with PI, 63 children diagnosed with DD, and 50 children diagnosed with AUT. Each child was identified in medical and/or school records as having the particular disability as the primary disability or diagnosis. All children spoke English or Spanish as their primary language (the DIAL-4 has equivalent forms in English and Spanish) and were able to attempt all of the items following the standard administration procedure. Approximately half of the children were aged 2.6 to 4.5 years, and the other half were aged 4.6 to 4.11 years. There were about twice as many boys as girls in each sample. The majority of children were White (66% to 76%). The mother’s educational level was used as a measure of socioeconomic status (SES; see Livingston & Parker, 2011; Sirin, 2005). The SES for the DD sample was balanced, with about 50% having mothers with a high school diploma or below and 50% having mothers with some college education or above. The PI and the AUT samples contained more children from higher-SES families (71% and 84% having mothers with some college or above, respectively).
For each clinical sample, a matched sample of normally developing children (i.e., the control group) was selected from the standardization sample pool. The control group matched the clinical group on age, sex, race and ethnicity, and mother’s education level.
Predictor Measures
The DIAL-4 is an individually administered developmental screening test designed for children aged 2 years and 6 months to 5 years and 11 months. The battery includes three components: (a) Performance measures, (b) Parent Questionnaire, and (c) Teacher Questionnaire. The Performance measures assess children on typical developmental behaviors. They provide scores in three areas (i.e., motor, concepts, and language) and the DIAL-4 Total, a composite scale combining the three areas. The Parent and Teacher Questionnaires record observation-based ratings on Self-Help Development (PQSH for Parent Questionnaire; TQSH for Teacher Questionnaire) and Social-Emotional Development (PQSE for Parent Questionnaire; TQSE for Teacher Questionnaire).
The DIAL-4 was chosen for this study for several reasons: the instrument’s long history and popularity; the empirical support for its utility as a viable screening tool (Cizek, 2001; Emmons & Alfonso, 2005; Meisels & Atkins-Burnett, 2005) with strong predictive validity (Rosiak, 2007; Walk, 2005); and the fourth edition’s strong internal reliability (mean reliabilities range from the .80s to the .90s), content validity (comprehensive literature and expert reviews), construct validity (high correlations with other well-known screening and diagnostic measures), and high sensitivity, specificity, and diagnostic accuracy.
Statistical Analyses
Hierarchical regression and sensitivity/specificity analyses were used to examine the clinical validity for each sample with matched control. Based on the cost-benefit considerations discussed earlier, the order of entry for predictor variables was the following: (a) the Parent Questionnaire, (b) the Teacher Questionnaire, and (c) the Performance measure. Specifically, for each sample, the clinical membership was used as the predicted variable. At Step 1, the two Parent Questionnaire variables, PQSH and PQSE, were used as predictors; at Step 2, the two Teacher Questionnaire variables, TQSH and TQSE, were added; and at Step 3, the Performance measures were added. As recommended by the DIAL-4 manual, the Motor scale was used as the performance measure for predicting PI and the DIAL-4 Total was used for DD and AUT. Demographic variables (e.g., age, sex, race and ethnicity, and mother’s education level) were not used as predictors because they were matched between the clinical and control groups. At each subsequent step, the change in R2 indicated the unique variance accounted for by the additional predictors (i.e., incremental validity). The standardized regression coefficients showed the relative importance of each predictor.
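The stepwise entry described above can be illustrated with a minimal sketch. It assumes an ordinary least-squares fit with a 0/1 group indicator as the predicted variable, which the text does not specify in detail, and it uses synthetic data rather than the DIAL-4 standardization data. The function names `r_squared` and `hierarchical_r2` are hypothetical and chosen only for illustration.

```python
import numpy as np

def r_squared(X, y):
    """R^2 from an ordinary least-squares fit of y on X (intercept added)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def hierarchical_r2(blocks, y):
    """Enter predictor blocks cumulatively; return (R^2, delta R^2) per step."""
    results = []
    previous = 0.0
    entered = []
    for block in blocks:
        entered.append(block)
        r2 = r_squared(np.column_stack(entered), y)
        results.append((r2, r2 - previous))
        previous = r2
    return results

# Synthetic, illustrative data only (NOT the DIAL-4 standardization data).
rng = np.random.default_rng(0)
n = 100
group = np.repeat([0.0, 1.0], n // 2)                      # 0 = control, 1 = clinical
parent = rng.normal(size=(n, 2)) - 0.8 * group[:, None]    # stand-ins for PQSH, PQSE
teacher = rng.normal(size=(n, 2)) - 0.6 * group[:, None]   # stand-ins for TQSH, TQSE
performance = rng.normal(size=(n, 1)) - 1.0 * group[:, None]

# Order of entry follows the text: parent, then teacher, then performance.
steps = hierarchical_r2([parent, teacher, performance], group)
for i, (r2, delta) in enumerate(steps, start=1):
    print(f"Step {i}: R2 = {r2:.3f}, delta R2 = {delta:.3f}")
```

Because each step adds predictors to a nested model, the change in R2 cannot be negative, and the changes sum to the R2 of the final model.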
Sensitivity and specificity are informative statistics on predictive clinical validity. Sensitivity refers to the proportion of children who actually have a special need who are identified as potentially delayed. Specificity refers to the proportion of children who actually are on target who are identified as normally developing. There is a trade-off between sensitivity and specificity: A highly sensitive scale could have a higher risk of misclassifying normally developing children as delayed (lower specificity), whereas a highly specific scale could pose a higher risk of missing children who need special attention (lower sensitivity). Ideally, tests should be both as sensitive and as specific as possible.
In the sensitivity/specificity analyses, one standard deviation below the mean (i.e., standard score 85) was used as the cutoff to demonstrate the impact of additional predictors on decision making. A cutoff of one standard deviation below the mean on a standardized measure is commonly recommended for at-risk classification in early screening for developmental disorders in young children (e.g., Feil et al., 2005).
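As a minimal sketch of how a cutoff turns scores into a sensitivity and a specificity estimate, the example below flags every score below 85 as at risk and compares the flags with known group membership. The scores are invented for illustration, the function name is hypothetical, and treating a score of exactly 85 as not flagged is an assumption; the text does not state how boundary scores were handled.

```python
def sensitivity_specificity(scores, has_condition, cutoff=85):
    """Flag scores below `cutoff` as at risk and compare with true status."""
    tp = fn = tn = fp = 0
    for score, condition in zip(scores, has_condition):
        flagged = score < cutoff  # assumption: exactly 85 is not flagged
        if condition and flagged:
            tp += 1   # child with the condition, correctly flagged
        elif condition:
            fn += 1   # child with the condition, missed
        elif flagged:
            fp += 1   # typically developing child, wrongly flagged
        else:
            tn += 1   # typically developing child, correctly passed
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Invented standard scores: five clinical cases followed by five controls.
scores = [70, 82, 90, 78, 88, 95, 101, 84, 110, 92]
has_condition = [True] * 5 + [False] * 5

sens, spec = sensitivity_specificity(scores, has_condition)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
# prints: sensitivity = 0.60, specificity = 0.80
```

In this toy sample, three of the five clinical cases fall below the cutoff and four of the five controls fall at or above it. Raising the cutoff would flag more children, which tends to raise sensitivity and lower specificity.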
Results
Prior to the regression analyses, the independent variables were examined for collinearity. The low variance inflation factors (<2.5 for PI, <2.6 for DD, and <3.2 for AUT) and high collinearity tolerances (>.42 for PI, >.43 for DD, and >.34 for AUT) suggested that collinearity was not an issue.
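For readers less familiar with these diagnostics, the sketch below shows one standard way to compute them: each predictor is regressed on the remaining predictors, tolerance is 1 minus the resulting R2, and the variance inflation factor (VIF) is the reciprocal of tolerance. The data are synthetic, the function name is hypothetical, and the software actually used in the study is not stated in the text.

```python
import numpy as np

def vif_and_tolerance(X):
    """Return a (VIF, tolerance) pair for each column of X."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress predictor j on all other predictors (with intercept).
        design = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
        tolerance = 1.0 - r2
        out.append((1.0 / tolerance, tolerance))
    return out

# Synthetic predictors sharing a common factor (NOT the DIAL-4 data).
rng = np.random.default_rng(1)
common = rng.normal(size=(200, 1))
predictors = 0.7 * common + rng.normal(size=(200, 5))

diagnostics = vif_and_tolerance(predictors)
for j, (vif, tol) in enumerate(diagnostics):
    print(f"predictor {j}: VIF = {vif:.2f}, tolerance = {tol:.2f}")
```

Because the two statistics are reciprocals, a low VIF and a high tolerance convey the same information.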
Table 1 presents the standardized regression coefficients (β), R2, and R2 change (ΔR2) for hierarchical regression analyses for each sample at every step. Table 2 reports the sensitivity and specificity using the corresponding set of predictors with a cutoff at a standard score of 85.
Table 1. Hierarchical Regressions of DIAL-4 Measures for Predicting Developmental Disorders.
Note: DIAL-4 = Developmental Indicators for the Assessment of Learning (4th ed.); PI = physical impairment; DD = developmental delay; AUT = autism; PQSH = Parent Questionnaire on Self-Help Development; PQSE = Parent Questionnaire on Social-Emotional Development; TQSH = Teacher Questionnaire on Self-Help Development; TQSE = Teacher Questionnaire on Social-Emotional Development.
p < .05.
Table 2. Sensitivity and Specificity Estimates Associated With Different Models.
Note: PI = physical impairment; DD = developmental delay; AUT = autism.
Source: Analyses from the standardization data from the Developmental Indicators for the Assessment of Learning, 4th ed. (DIAL-4). Copyright © 2011 NCS Pearson, Inc. Used with permission. All rights reserved.
Table 1 shows some consistent patterns across models for all three disorders. First, at Step 1, the Parent Questionnaire predictors (PQSH and PQSE) accounted for a substantial and statistically significant amount of variance in PI (36.2%), DD (24.2%), and AUT (55.0%). PQSH was the statistically significant independent variable (β significant at the .05 level), whereas PQSE was not. Second, at Step 2, as indicated by the ΔR2 values, the Teacher Questionnaire predictors (TQSH and TQSE) accounted for a significant amount of additional variance in PI (11.0%), DD (11.3%), and AUT (9.4%) over the Step 1 models. The regression coefficients for the self-help measures in both the Teacher and Parent Questionnaires were significant in all models (p < .05), whereas the social-emotional measures were not statistically significant in either questionnaire. Third, at Step 3, the Performance measure added a substantial and significant amount of explained variance in PI (12.0%), DD (12.8%), and AUT (4.0%); the Performance predictor was significant at p < .05 in all models. Finally, in the sensitivity and specificity analyses, the Step 1 model appears to be more specific (specificities are .84 for PI, .77 for DD, and .86 for AUT) than sensitive (sensitivities are .65 for PI, .66 for DD, and .82 for AUT). With the Teacher Questionnaire and Performance measures added at Steps 2 and 3, sensitivity increased to the high .80s and low .90s, whereas specificity decreased to the mid .60s and low .70s.
A different finding is evident for the AUT sample at Step 3 where all five predictors were entered. In contrast to PI and DD where Performance was the only significant predictor, both PQSH and Performance were statistically significant for AUT with PQSH having higher predictive weight than Performance. This result is consistent with the much higher percentage of variance explained by the Parent Questionnaire at Step 1 for AUT than for PI and DD. Thus, the Parent Questionnaire seems to be a more important predictor for AUT whereas the Performance measure appears to be more important for the other two disabilities.
Discussion
Multi-informant Approach and Its Implications for Practice
Given the lower cost of parent questionnaires, comparing the validity of such rating scales with and without the more costly performance measures is of particular interest. The present regression analyses on the DIAL-4 data support the incremental validity achieved from combining Parent Questionnaires with Teacher Questionnaires or combining both questionnaires with Performance measures: Teacher Questionnaire and the Performance measures provide information not captured by the Parent Questionnaire alone. These findings support the recommendation that clinicians should utilize multi-informants and multi-traits in their assessment approach.
However, these three sets of predictors seem to vary in how much each improves the validity of the diagnosis of the clinical conditions studied: Performance measures are the strongest predictor for PI and DD, and the less costly Parent Questionnaire is most important for predicting AUT. Thus, questionnaires might be a more efficient assessment tool for screening certain disabilities. This finding could be particularly encouraging or discouraging for clinicians when cost and time are so constrained that using questionnaires is the only option.
The clinical utility of combining instruments is further evidenced in the sensitivity and specificity analyses. The single-informant Parent Questionnaire gave a relatively higher specificity than sensitivity for predicting the three disabilities. This pattern indicates that the Parent Questionnaire alone is effective in avoiding overreferral of normal children to special services, but it may cause underreferral of children who are at risk. Thus, using only the Parent Questionnaire may delay services for some children or deprive them of the services they need. Lower sensitivity could be particularly troublesome for early childhood screening given the increasing awareness of the importance of early identification and intervention for children with developmental disorders. For example, research shows that early intervention for AUT can remediate many symptoms (Autism Society, n.d.). This study shows that using multi-informant measures is a practical approach to optimizing the balance between sensitivity and specificity.
Scale-Specific Results and Implications for Research and Practice
The current findings also suggest that Self-Help Development scores from the questionnaires (PQSH or TQSH) were more predictive than the Social-Emotional Development scores (PQSE or TQSE) of PI, DD, and AUT. This pattern may be due to the type of behaviors measured on the two scales. The behaviors measured on the Self-Help Development scales are rather straightforward and innocuous (e.g., whether or not the child is toilet trained, can eat by himself or herself, or can get dressed by himself or herself). They are fairly easy to observe and rate accurately, and parents are more likely to rate such behaviors honestly. In contrast, behaviors assessed on the Social-Emotional Development scales may be more difficult to assess (e.g., how often a child has tantrums) and subject to the rater’s preference. For example, parents enforce a wide range of rules for appropriate behavior with their children and show great variation in what type of behaviors they consider as negative (O’Leary, 1995). In addition, some parents may be embarrassed about admitting their children’s emotional problems; therefore, they may not honestly rate behavior that could indicate such problems. This biased parent rating style is supported by Chen et al. (2004) and Chung et al. (2011), who reported that parents’ ratings of their children’s behavioral problems were not as accurate as their ratings of areas such as speech and motor delays. Ritter (1989) found a similar pattern in teacher ratings and reported great variation in teachers’ tolerance of behaviors and what they label as negative. All of these factors could lead to the Self-Help Development scales being more predictive than the Social-Emotional Development scales, although more in-depth research will be needed to fully understand the different predictive outcomes of these two scales.
As previously stated, items on the Self-Help scale measure tasks where solid motor skills, such as dressing oneself and eating with utensils, are needed. For children diagnosed with a PI, these items are more directly related to the impairment than those on the Social-Emotional scale; thus, they would be more powerful predictors. Similarly, the symptoms of DD are often associated with global deficits in which there are deficits across multiple domains of development. Therefore, the finding that the TQSH score was more predictive also seems logical. However, it is not as clear why the PQSH score was the most accurate predictor for AUT. Prior to this study and the development of the DIAL-4, it was hypothesized that either the Social-Emotional Development scores (from either the Parent or Teacher Questionnaire) or the DIAL-4 Total scores would most accurately predict AUT because the disorder’s most prevalent characteristics are social-emotional and cognitive in nature (i.e., the hallmarks of the disorder are poor social skills and cognitive delays). In addition, a body of research involving the Vineland Adaptive Behavior Scales (VABS; Sparrow, Cicchetti, & Balla, 2005) also has shown that individuals with AUT have a distinct profile, with highest scores in Motor and Daily Living, lowest scores in Socialization, and intermediate scores in Communication (Kraijer, 2000; Sparrow et al., 2005). However, more recent research suggests that this score profile may be more evident when age-equivalent scores are used than when standard scores are used (Perry, Flanagan, Dunn Geier, & Freeman, 2009). The VABS Daily Living domain is comparable to the Self-Help Development scale on the DIAL-4, and the Socialization domain is comparable to the Social-Emotional Development scale on the DIAL-4. The DIAL-4 Social-Emotional items, however, focus more on emotional and behavioral problems than on the social skills emphasized on the VABS.
As a result, it may be of interest to (a) compare the predictive validity for the special groups studied, particularly AUT, when only the social skills items or only the emotional and behavioral items on the DIAL-4 are used, and (b) examine the predictive power when age-equivalent or other types of scores are used.
The present results suggest that, in a clinical assessment setting, a simple questionnaire inquiring about self-help (e.g., dressing one’s self) and other easily observable and measurable behaviors could provide insightful information and be used as a first step in screening for children who might be at risk. This approach could be particularly helpful when time is limited, and it could provide more accurate and reliable outcomes than parent reports on domains related to the child’s emotions or behavior (as opposed to self-help skills).
Limitations
A potential limitation to the generalizability of the current results comes from the characteristics of the special group samples used. All children in the special groups participated in a daycare program. As a result, the findings might not be generalizable to young children who stay at home. In addition, the majority of the clinical sample was from Caucasian families with relatively high SES (measured by mother’s education level). Research suggests parents in these groups are more likely to seek help for their children’s developmental disorders (Zimmerman, 2005), and the accuracy of their reports of their children’s behavior may also differ from that of other demographic groups. Furthermore, racial and ethnic minority groups are generally considered to be underserved by the mental health services system, and a constellation of barriers deters ethnic and racial minority group members from seeking treatment (U.S. Department of Health and Human Services, n.d.). Although a low number of participants in each minority group prohibited an analysis of the data by subgroups, subgroup differences may be worth exploring in future research. Finally, the present study did not address gender differences. Given the research indicating different psychological, social, and behavioral patterns between boys and girls, especially with regard to DD and AUT, variability in the applicability of these findings to either gender group needs further investigation.
Conclusion
In conclusion, given the increasing need for early identification and intervention for children with developmental disorders, a single-measure test is not recommended for most clinical screenings. This study further shows that combining a parent questionnaire with a teacher questionnaire or performance measures in an assessment is a more balanced approach for identifying more children who truly need special help and preventing more children from getting special education services that they do not need. However, when time and money are short, the current results suggest that a simple questionnaire inquiring about self-help and other easily observable and measurable behaviors could be a sound first-step screening tool for getting insightful information about a child potentially at risk. Future research on the predictive validity of rating scales that measure self-help skills for other disorders might prove useful in determining how many additional disorders such scales can identify.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
