Abstract
This study reports an independent investigation of the psychometric properties of the Desired Results Developmental Profile (DRDP), a teacher-rated measure of school readiness for preschool-aged children. In a sample of 2,031 low-income, 3- to 5-year-old children attending Head Start, we tested three measurement models: a higher order one-factor model, a seven-factor model, and a five-factor model. To explore the appropriateness of the DRDP for use with diverse populations of young children, we used multiple group and differential item functioning (DIF) analyses to determine whether the DRDP works differently for dual language learners (DLL) and non-DLLs. The proposed five-factor structure fits the data best, with greater face and statistical validity. Using this conceptually driven factor structure, the multiple group analyses were robust for DLL and non-DLL preschool students. More than half of the items on the DRDP displayed little DIF. Items measuring emergent language and literacy exhibited DIF favoring non-DLL children.
At school entry, low-income children already lag behind their higher income peers in critical early academic and socioemotional skills (Duncan & Murnane, 2011). Dual language learners (DLLs) also lag behind their native English-speaking peers on these same skills (Quirk, Nylund-Gibson, & Furlong, 2012). Comprehensive early education programs, such as Head Start, are one effort to support these at-risk children, based on a robust body of research showing that preschool programs can address disparities in the learning opportunities for young children prior to school entry (Yoshikawa et al., 2013). Public preschool programs consistently improve children’s readiness for school in terms of early literacy, mathematics, and social-emotional development (Phillips et al., 2017), with low-income and DLL children benefiting the most from these programs (Duncan & Magnuson, 2013; Gormley, 2008; Magnuson, Lahaie, & Waldfogel, 2006). Over the past two decades, states and municipalities across the country have developed more public educational opportunities, such as voluntary prekindergarten programs, to better serve young children from diverse backgrounds in their preparation for kindergarten.
As preschool education continues to expand nationwide, state and local efforts to monitor and improve programs are rapidly expanding as well. A key component of such efforts is documenting children’s learning outcomes and assessing their readiness for school. For example, the recent Race to the Top–Early Learning Challenge initiative gave priority to applicants who focused on strengthening the use of assessments to understand individual children’s progress and improve program quality (Ackerman & Coley, 2012; Congressional Research Service, 2016; Connors-Tadros, 2014). In turn, there now exists an increased demand for psychometrically sound measures of children’s development and learning across multiple domains for diverse populations of 3- to 5-year-old children. Such assessment measures could yield information that not only informs ongoing decisions about teaching and children’s learning but is also predictive of longitudinal academic achievement after school entry. Assessments that are valid and reliable, meet high psychometric standards, and are appropriate for their intended purpose can also inform a continuous cycle of program improvement.
Desired Results Developmental Profile (DRDP)
One broadly used assessment for promoting and assessing school readiness is the DRDP–Preschool, which is implemented statewide in California and Missouri (California Department of Education [CDE], Early Education and Support Division, 2010; Missouri Department of Elementary and Secondary Education, 2013). Preschools with state funding are required to complete this assessment 3 times a year for all children served, with checkpoints in the fall, winter, and spring. Federally funded programs, such as Head Start, have followed, adopting the DRDP throughout these states. As such, the DRDP is administered to about half a million children each year attending publicly funded preschool programs (Friedman-Krauss et al., 2018).
The DRDP was developed by the Center for Child & Family Studies at WestEd and the Berkeley Evaluation and Assessment Research Center at the University of California, Berkeley, to measure the learning and development of children aged 3 to 5 years. The DRDP comprises 43 items, which fall within seven developmental subscales. According to the developers, it is designed as a process measure for assisting early childhood educators with curriculum planning for individual children and guiding continuous program improvement (CDE, Early Education and Support Division, 2010). However, with federal assessment requirements and the widespread use of the DRDP in California and Missouri, it is frequently used as a summative assessment at several points during the school year. That is, rather than using the scores of individual children to tailor programming to the target child, results are often aggregated across children, centers, and agencies; compared for change over time; and reported to others such as the Office of Head Start (e.g., Improving Head Start for School Readiness Act of 2007).
Despite the widespread use of this measure across Head Start and state-funded preschool centers in these two states, there is surprisingly limited research using this assessment and no research confirming its validity in measuring children’s development, appropriateness for non-native English speakers, or its reliability. A comprehensive search for prior empirical work on the DRDP yielded one published study on the unidimensionality of the assessment using a select number of items and domains (Sutter et al., 2017) and one published study on its cross-age validity (Karelitz, Parrish, Yamada, & Wilson, 2010). Sutter et al. (2017) found that a unidimensional factor of the DRDP provided the best fit to their data from a convenience sample of 34 children. Their analysis only looked at portions of the DRDP for its factor structure with a very small sample. Specifically, they examined three (cognitive, language, and social development) of the seven domains and tested whether they would fit together as one factor of school readiness. Karelitz et al. (2010), using a larger cross-sectional sample (n = 751), showed that the DRDP has valid properties as a screener for identifying relatively low- and high-achieving children from preschool to elementary school. However, they did not test the factor structure of the measure. One other published study has used the DRDP as an outcome but did not report any information on its reliability or validity (Mohler, Yun, Carter, & Kasak, 2009). In addition, we found no detailed technical documentation of the DRDP’s content validity.
Another key element missing from the DRDP’s psychometric evidence is validity for children who are considered DLL. This is particularly troubling for states currently implementing the DRDP, such as California and Missouri, where 45% and 8% of 3- and 4-year-old children, respectively, are DLLs (National Institute for Early Education Research, 2016). The number of young DLLs is also growing rapidly across the United States; Spanish-speaking DLLs now represent 40% of all Head Start participants, and DLLs are the fastest growing subpopulation of students in the United States. Incorporating high-quality assessments into the education of DLL children is essential because DLL students trail their monolingual English-speaking peers in important English language skills at kindergarten entry (Hoff, 2013; Paez, Tabors, & Lopez, 2007), and these gaps in achievement persist through elementary school (Mancilla-Martinez & Lesaux, 2011). Abedi and Gándara (2006) argue that the achievement gaps often observed between DLL and non-DLL children arise in part because measurement tools are often ill-equipped to assess DLL children's skills and abilities. Indeed, most assessments are not invariant across DLL groups (Immekus & McGee, 2016; Quirk, Mayworm, Edyburn, & Furlong, 2016). And although the CDE (2018) states on its website that the DRDP “includes specific measures for assessing the English language development of children who are learning English as a second language,” there exists no psychometric validation of this assessment for non-English-speaking children.
In summary, psychometric evidence on the DRDP is far too sparse relative to its widespread use among large populations of young children, including the children most in need of early childhood educational intervention, and relative to the frequency with which teachers are required to use it. This is concerning given that reliable and valid preschool screening and assessment tools are key for assessing children’s learning and development and providing the appropriate classroom experiences and supplemental services necessary to ensure that all children are successful at school entry (Kagan & Garcia, 2007; Shadish, Cook, & Campbell, 2001; Snow & Van Hemel, 2008). Therefore, a detailed psychometric analysis of the DRDP is urgently needed.
Current Study
Our study is an independent investigation of the psychometric properties of the DRDP. We tested the reliability and validity of the DRDP across the preschool year using two cohorts of 3- and 4-year-old children attending an urban Head Start program. Using data from 2,031 children collected in the fall, winter, and spring of 2014-2015, we (a) tested the fit of the seven developer-defined DRDP subscales as a higher order one-factor model and a seven-factor model, (b) conducted a face validity assessment of the 43 individual items and conceptually derived the subscales to which the items belong as a five-factor model, (c) conducted a confirmatory factor analysis (CFA) of these new subscales, (d) tested whether the dimensionality of this conceptually driven model differed for DLL and non-DLL students with multiple group analysis, and (e) conducted differential item functioning (DIF) analysis to examine whether the DRDP works differently at the item level between DLL and non-DLL children. Based on the limited prior research available, we hypothesized that the DRDP would fit the seven developer-defined subscales poorly and that a conceptually driven model would fit the data better. We also hypothesized that the dimensionality of the DRDP would differ between DLL and non-DLL children based on prior research on assessments of DLL students (Abedi, 2002).
Method
Study Context and Data
Our study drew from administrative data from a large, urban Head Start agency in California in 2014-2015. Our study sample includes 2,031 children, 157 teachers, and 25 centers. In the larger California context, Head Start enrolled more than 100,000 children using federal and state funding during the 2014-2015 program year (Barnett & Friedman-Krauss, 2016). The CDE requires every preschool program receiving state funding to complete the DRDP for each child enrolled (CDE, Early Education and Support Division, 2010). With federal assessment requirements and support from the CDE, this measure, in California, is typically aggregated as an outcome assessment throughout the school year within the Head Start agency and thousands of other preschool programs throughout the state. The organization of DRDP items into separate domains is based on the California Preschool Learning Foundations, which outline the key skills and knowledge a child can gain through a high-quality preschool program (CDE, 2018).
Using the DRDP, children were evaluated by their primary classroom teacher 3 times during the academic year. Once ratings were completed by the teachers, center personnel entered these data into a central information database housed at the agency, where DRDP data were linked with other child-level demographic variables (e.g., gender, race/ethnicity, age) and teacher-level demographic variables (e.g., education, experience). These data were then stripped of unique identifying information before being shared with the primary investigators for research purposes, per the requirements of the University Institutional Review Board.
Approximately half of the children in the sample were male, 8% were Asian, 2% were African American, 78% were Hispanic, and 7% were another race/ethnicity. About 50% of the sample was DLL. The average child age was 4.73 years (SD = 0.69). Almost the entire sample of teachers was female, 12% were Asian, 4% were African American, 64% were Hispanic, and 4% were another race/ethnicity. The demographic characteristics of our sample closely mirror those of the state, with the exception of language spoken by the teacher. The percentage of Head Start teachers who spoke another language was greater in our sample than the state average (76% in our sample compared with 64% statewide; Barnett & Friedman-Krauss, 2016).
Measures
DRDP
The DRDP is a teacher-reported 5-point school readiness rating scale consisting of 43 items organized into seven categories of development and school readiness: (a) Self and Social Development (12 items; for example, cooperative play with others), (b) Language and Literacy Development (10 items; for example, comprehension of age-appropriate text presented by adults), (c) English Language Development (four items; for example, understanding and responding to English literacy activities), (d) Cognitive Development (five items; for example, problem solving), (e) Mathematical Development (six items; for example, number sense of quantity and counting), (f) Physical Development (three items; for example, fine motor skills), and (g) Health (three items; for example, personal safety). The English Language Development items are only completed for children with DLL status. We provide a description of the items in each of the DRDP-derived domains in the first column of Table 1.
Table 1. DRDP Domain Items From the Original Seven-Factor Structure and Proposed Domains and Items for the Five-Factor Structure.
Note. Items dropped: 13, 17, 18, 23, 24, 25, 26, 38, 39, 40, 41, 42, 43. Items 23, 24, 25, and 26 in the English language development domain were only completed for children with DLL status. DRDP = Desired Results Developmental Profile; DLL = dual language learners.
Each item is presented as a continuum for teachers to rate children’s level of skill development, ranging from (1) Not Yet Exploring, (2) Exploring, (3) Developing, (4) Building, to (5) Integrating. After rating an initial level for the item on the developmental continuum, teachers can also rate the child as “emerging” if the child is beginning to show some skills from the next level. The “emerging” level is considered a half point on the measure in that children may show behaviors or skills associated with the next developmental level but do not demonstrate those behaviors or skills typically or consistently. Thus, the possible scores are 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, and 5. However, the teachers in our study did not use this half-point score for any of their ratings of children. Skill points are then averaged across all items within each category to create domain (i.e., subscale) scores and across all domains to create a total score. Teachers use the domains, items, and skill points to classify and rate their observations of children’s school readiness.
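The scoring rule described above (item ratings averaged within a domain, then domain scores averaged into a total) can be sketched as follows. This is a minimal illustration, not the developers' scoring software: the domain-to-item mapping and the child's ratings are hypothetical, and skipping unrated items is our assumption rather than a documented DRDP rule.

```python
from statistics import mean

# Hypothetical mapping of domain name -> item numbers (illustrative only;
# not the DRDP's actual item assignment).
DOMAINS = {"domain_a": [1, 2, 3], "domain_b": [4, 5]}


def domain_scores(ratings, domains=DOMAINS):
    """Average item ratings within each domain.

    Unrated items (None or absent) are skipped; this handling of missing
    ratings is an assumption made for the sketch.
    """
    scores = {}
    for name, items in domains.items():
        rated = [ratings[i] for i in items if ratings.get(i) is not None]
        scores[name] = mean(rated) if rated else None
    return scores


def total_score(scores):
    """Average the available domain scores to obtain a total score."""
    available = [s for s in scores.values() if s is not None]
    return mean(available) if available else None


# Hypothetical ratings on the 1-5 continuum; 3.5 is an "emerging" half point.
child = {1: 2, 2: 3, 3: 3.5, 4: 4, 5: None}
scores = domain_scores(child)
total = total_score(scores)
```

Note that averaging domain scores weights each domain equally regardless of how many items it contains, which differs from averaging all items directly.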
Child and teacher characteristics
Demographic information about children and teachers was collected from the administratively linked data. Child-level variables are age, gender, race/ethnicity, and their DLL status. Teacher-level variables include gender, race/ethnicity, highest degree earned, and whether they spoke another language in addition to English. Descriptive statistics for the sample are presented in Table 2.
Table 2. Descriptive Statistics of Child and Teacher Characteristics.
Analytic Plan
We conducted analyses for each of the three measurement time points separately. Descriptive statistics for the analysis sample, including means, standard deviations, and correlations, and missing data were examined using Stata 14 (StataCorp, 2015). Missingness occurred on the DRDP measure (0%-15% across all three time points) and on the child and teacher demographic variables (0%-5%). Most of the missingness occurred on teacher education level (18%). Logistic regressions indicated that children and teachers’ observable baseline characteristics were not predictive of missingness on the DRDP measure. All models were run using a maximum likelihood estimator, which estimates parameters by maximizing the likelihood of obtaining the observed values (Brown, 2006) and also addresses missing data. Specifically, with this method, all available data are included in the analyses, and the parameter values with the highest likelihood of generating the sample data are identified (Baraldi & Enders, 2010).
Testing the fit of the developer-defined factor structure
First, we performed a CFA of the DRDP domains specified by the measure’s authors using cross-sectional data from all three checkpoints (fall, winter, and spring). We tested two factor structures—a unidimensional model and the original seven domains (hereafter referred to as a seven-factor structure model)—using all of the items from the DRDP. Our study sample well exceeded the minimum sample size guideline of 400 to ensure stable correlations and a probable factor structure (Gorsuch, 2003). Model fit was evaluated based on several global goodness-of-fit indices: the comparative fit index (CFI), the root mean square error of approximation (RMSEA), the standardized root mean square residual (SRMR), and the Tucker–Lewis Index (TLI). Although we report the chi-square test, we did not include this in our analytic decisions because of its sensitivity to large sample sizes (Brown, 2006). We follow the goodness-of-fit recommendations made by Hu and Bentler (1999), with good fit characterized by CFI >.95, RMSEA <.06, SRMR <.08, and TLI >.95.
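The decision rule above can be expressed as a small check of each index against the Hu and Bentler (1999) cutoffs. The sketch below is illustrative only; the function name is our own, and the example values are the fall fit statistics reported in the Results for the unidimensional and five-factor models.

```python
# Cutoffs for good fit following Hu and Bentler (1999), as stated in the text.
CUTOFFS = {
    "CFI": (">", 0.95),
    "TLI": (">", 0.95),
    "RMSEA": ("<", 0.06),
    "SRMR": ("<", 0.08),
}


def evaluate_fit(indices):
    """Return, for each index, whether it meets its good-fit cutoff."""
    result = {}
    for name, (direction, cutoff) in CUTOFFS.items():
        value = indices[name]
        result[name] = value > cutoff if direction == ">" else value < cutoff
    return result


# Fall values reported in the Results section.
unidimensional_fall = evaluate_fit(
    {"CFI": 0.89, "RMSEA": 0.10, "SRMR": 0.09, "TLI": 0.90}
)
five_factor_fall = evaluate_fit(
    {"CFI": 0.99, "RMSEA": 0.05, "SRMR": 0.01, "TLI": 0.99}
)
```

Under these cutoffs the unidimensional fall model fails every criterion, whereas the five-factor fall model meets all four.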
Face validity assessment
To conceptually group all 43 items independently from the DRDP structure presented by the publishers, we then conducted a Q-sort (Brown, 1993). Twenty-two raters, all doctoral students and faculty in Education and Developmental Psychology, were asked to sort the 43 items according to which items they believed belonged together. They also provided a label for each grouping of items. We then used the constructs and items from the Q-sort exercise to assist us in conceptually deriving our own categories to test the factor structure with CFA. The first three authors examined all of the Q-sort responses and met periodically to conceptually derive five categories until consensus was reached. All conflicts were discussed and resolved among the authors.
CFA of conceptually derived subscales
To confirm the conceptually driven structure derived from the Q-sort, we conducted a CFA with the items and their newly assigned constructs. We investigated the goodness of fit of the different combinations of categories and constructs from the Q-sort and assessed the models with the same fit indices mentioned above. These factor descriptions are presented in the second column of Table 1.
Multiple group analysis for DLL and non-DLL students
Informed by these results, multiple group analyses were then used to investigate measurement invariance of the preferred model for non-DLL and DLL children. A sequence of increasingly restrictive models was considered in determining the best model fit (Vandenberg & Lance, 2000): (a) same form, (b) equal loadings, (c) equal loadings and errors, and (d) equal loadings, errors, and variances. We treated a change in CFI of .01 or greater between models as indicating a meaningful difference in model fit when testing measurement invariance (Cheung & Rensvold, 2002) because the chi-square difference test is extremely sensitive to large sample sizes such as ours.
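As a rough illustration of the change-in-CFI criterion, the sketch below walks through a sequence of nested models and flags any step at which CFI drops by .01 or more. The CFI values are hypothetical and are not taken from our results, and comparing each model with the immediately preceding one is an assumption of the sketch.

```python
DELTA_CFI_CUTOFF = 0.01  # Cheung and Rensvold (2002) criterion cited in the text


def invariance_steps(models, cutoff=DELTA_CFI_CUTOFF):
    """Compare each model's CFI with that of the preceding, less restrictive model.

    `models` is an ordered list of (label, cfi) pairs, least restrictive first.
    Returns a list of (label, cfi_drop, invariance_holds) tuples.
    """
    steps = []
    for (_, prev_cfi), (label, cfi) in zip(models, models[1:]):
        # Round to avoid floating-point noise around the cutoff.
        drop = round(prev_cfi - cfi, 3)
        steps.append((label, drop, drop < cutoff))
    return steps


# Hypothetical CFI values for illustration only.
example = [
    ("same form", 0.980),
    ("equal loadings", 0.978),
    ("equal loadings and errors", 0.960),
]
steps = invariance_steps(example)
```

In this hypothetical sequence, equal loadings would be retained (a drop of .002), whereas adding equal errors would not (a drop of .018).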
DIF
Finally, to determine whether the DRDP worked differently at the item level between DLL (focal group) and non-DLL children (reference group), we conducted DIF analysis with the data separately for each time point and for each subscale. The total score for each domain was used as the estimate of ability. We employed the commonly used method of ordinal logistic regression to examine both uniform and nonuniform DIF. Uniform DIF is detected when the item favors one group over another across all levels of development being measured. For example, non-DLL children may be systematically rated as higher on an item than DLL children, regardless of the overall score. Nonuniform DIF is detected when there is a significant group-by-ability interaction, suggesting that the probability of being rated higher on an item is not the same across ability levels for the two groups (Zumbo, 1999). The logistic regression method involves a series of nested models, where each item is regressed first onto the ability variable alone (Model 1), then onto the grouping variable in addition to the ability variable (Model 2), and then onto the interaction term of the ability variable by grouping variable in addition to their main effects (Model 3). DIF is detected when there is a significant difference in fit between Model 1 and Model 3, suggesting that group membership influences item-level ratings in addition to ability level. The type of DIF, if present, is determined by testing the difference in fit between Models 1 and 2 for uniform DIF and between Models 2 and 3 for nonuniform DIF.
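For concreteness, the three nested models can be written out as below. The notation is ours and is offered only as a sketch: we assume a proportional-odds (cumulative logit) form, with Y the rating on a given item, j indexing the cut points between the five rating categories, T the domain total score used as the ability estimate, and G a group indicator (0 = non-DLL reference group, 1 = DLL focal group).

```latex
\begin{align}
\text{Model 1:}\quad & \operatorname{logit} P(Y \le j) = \alpha_j + \beta_1 T \\
\text{Model 2:}\quad & \operatorname{logit} P(Y \le j) = \alpha_j + \beta_1 T + \beta_2 G \\
\text{Model 3:}\quad & \operatorname{logit} P(Y \le j) = \alpha_j + \beta_1 T + \beta_2 G + \beta_3 (T \times G)
\end{align}
```

In this notation, uniform DIF corresponds to a nonzero group effect \beta_2, and nonuniform DIF corresponds to a nonzero interaction effect \beta_3. The overall comparison of Model 1 with Model 3 tests both added terms jointly and therefore has 2 degrees of freedom.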
We determined differences in model fit using the chi-square difference test. Because of the tendency for the chi-square significance test to overidentify DIF items in large samples even if the effects are negligible, we attempted to reduce Type I error by also examining the magnitude of the effect, quantified with the change in the pseudo R2 statistic (Gelin & Zumbo, 2003; Zumbo, 1999). Thus, an item was classified as exhibiting nontrivial DIF if there was a significant chi-square difference test between Models 1 and 3 and if the change in R2 from Model 1 to Model 3 was .035 or greater, which represents at least a moderate effect (Jodoin & Gierl, 2001).
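The two-part classification rule can be sketched as follows. This is an illustration under stated assumptions rather than our analysis code: the significance level of .05 is assumed (it is not specified above), the function names are hypothetical, and the inputs (the chi-square difference and the pseudo R2 values for Models 1 and 3) would come from fitted ordinal logistic regressions. Because the Model 1 versus Model 3 comparison has 2 degrees of freedom, the chi-square p value has the closed form exp(-x/2).

```python
import math

R2_CUTOFF = 0.035  # minimum change in pseudo R2 for nontrivial DIF (from the text)


def chi2_pvalue_df2(chi2_diff):
    """Upper-tail p value of a chi-square statistic with 2 degrees of freedom."""
    return math.exp(-chi2_diff / 2.0)


def is_nontrivial_dif(chi2_diff, r2_model1, r2_model3, alpha=0.05):
    """Flag an item only if the Model 1 vs. Model 3 test is significant
    AND the change in pseudo R2 is at least R2_CUTOFF.

    alpha=0.05 is an assumed significance level for this sketch.
    """
    p_value = chi2_pvalue_df2(chi2_diff)
    # Round to avoid floating-point noise around the cutoff.
    delta_r2 = round(r2_model3 - r2_model1, 4)
    return p_value < alpha and delta_r2 >= R2_CUTOFF


# Hypothetical items for illustration only.
significant_and_sizable = is_nontrivial_dif(25.0, 0.36, 0.40)
significant_but_negligible = is_nontrivial_dif(25.0, 0.36, 0.37)
sizable_but_not_significant = is_nontrivial_dif(3.0, 0.30, 0.40)
```

The second hypothetical item shows why the effect size criterion matters in large samples: its chi-square difference is highly significant, yet the change in R2 of .01 falls below the cutoff, so it is not flagged.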
Results
Descriptive Analyses
Prior to conducting substantive analyses, descriptive statistics on the DRDP were computed for the analytic sample. We report the descriptive statistics for our preferred five-factor model, which we discuss in greater detail below. On average, children were rated as either “2: exploring” or “3: developing” on the items across all three assessment time points. Specifically, the means of the items ranged from 1.66 to 2.55 points (SD = 0.66-1.11) for fall, 2.19 to 3.11 points (SD = 0.73-1.00) for winter, and 2.64 to 3.43 points (SD = 0.74-0.96) for spring. The pattern of scores indicates that teachers rated children higher on the items as the school year progressed. Within each of the proposed domains, pairwise correlations between items were moderate to high, ranging from .35 to .75 for fall, .37 to .76 for winter, and .39 to .76 for spring. Table 3 displays the means and standard deviations of the DRDP items grouped by our reorganization of the domains for the winter time point. Correlations of all the items for the winter time point are presented in Table 4. Complete descriptive statistics and correlations for the fall and spring assessment time points are presented in the supplemental materials.
Table 3. Descriptive Statistics of Items in the Proposed Five-Factor Model.
Note. Descriptive statistics of items for the proposed five-factor model for the fall and spring time points are presented in the supplemental materials.
Table 4. Correlations of Items Included in the Proposed Five-Factor Model for the Winter Time Point.
Substantive Analyses
Factor structure of the DRDP
We first examined the unidimensional and seven-factor models that correspond to the designed structure of the DRDP. The unidimensional model fit the data poorly in the fall (CFI = .89, RMSEA = .10, SRMR = .09, and TLI = .90), winter (CFI = .90, RMSEA = .10, SRMR = .09, and TLI = .89), and spring (CFI = .90, RMSEA = .09, SRMR = .09, and TLI = .90). In addition, correlations between the seven domains were moderate (0.39-0.65). The seven-factor model also fit the data poorly, as evidenced by CFI = .94, RMSEA = .09, SRMR = .08, and TLI = .91 in the fall; CFI = .92, RMSEA = .09, SRMR = .09, and TLI = .93 in the winter; and CFI = .92, RMSEA = .08, SRMR = .09, and TLI = .94 in the spring.
Face validity assessment and CFA
We operationalized the 30 rating scale items into five domains based on our conceptually driven Q-sort exercise: self-awareness and identity (four items), mathematics (seven items), social skills (eight items), language and literacy (five items), and domain-general cognitive skills (six items). We confirmed this five-factor model using CFA. This model fit the data reasonably well for fall (CFI = .99, RMSEA = .05, SRMR = .01, and TLI = .99), winter (CFI = .98, RMSEA = .07, SRMR = .02, and TLI = .97), and spring (CFI = .99, RMSEA = .07, SRMR = .01, and TLI = .98) time points. Factor loadings were generally large (βs = 0.66-0.81), and all were statistically significant at p < .001. We considered this five-factor model to be our preferred model. The fit statistics for the unidimensional, five-factor, and seven-factor models are presented in Table 5.
Table 5. Confirmatory Factor Analysis Model Fit Statistics.
Note. The seven-factor model is the developer-defined model. We follow the goodness-of-fit recommendations made by Hu and Bentler (1999), with good fit characterized by CFI >.95, RMSEA <.06, SRMR <.08, and TLI >.95. All factor loadings are statistically significant, p < .001. Cells for the standardized factor loadings are ranges (minimum and maximum values). CFI = comparative fit index; RMSEA = root mean square error of approximation; SRMR = standardized root mean square residual; TLI = Tucker–Lewis index.
With regard to the specific changes made to arrive at the five-factor model, we dropped the items related to children’s physical development and health (six items), which are listed at the bottom of the first column of Table 1. Modification indices were also consulted in our analyses, but they did not improve the model fit. Although these items were grouped together in our Q-sort exercise, the likelihood ratio test on the difference between the five-factor model and the six-factor model in our series of CFAs suggested that the five-factor model was better and more parsimonious. In addition, we dropped all items in the English language development domain because teachers were instructed to only complete these items for children with DLL status. That is, all children considered non-DLL were missing these four items, preventing us from conducting tests of measurement invariance and making group comparisons between DLL and non-DLL children. Finally, we dropped three low loading items (<0.40; Items 13, 15, and 18) in the language and literacy domain to improve model fit, following the standard recommendation (Floyd & Widaman, 1995; Kline, 2005), even though these items were originally in our proposed five-factor model.
Multiple group analysis
Based on the data fit of the proposed five-factor model, we proceeded to assess the measurement invariance for DLL and non-DLL students. We first fit an unrestricted baseline CFA model to allow factor loadings, factor variances, covariances, and means to be freely estimated across groups. As shown in Table 6, the fit statistics for the measurement invariance CFA indicate that the CFI, RMSEA, SRMR, and TLI values showed excellent fit of these data, suggesting that the factorial pattern of the DRDP was similar across groups. After establishing the baseline model, we constrained the factor loadings to be equal across groups. The change in CFI between Model 1 and Model 2 was negligible, and the fit values remained acceptable, suggesting that the factor loadings for the items were invariant across groups. The next step was to constrain the intercepts to be equal across groups to evaluate scalar invariance. The CFI difference between Model 1 and Model 3 was the same as the cutoff criterion of 0.01, and the fit statistics were still acceptable, indicating that the intercepts for the items were invariant across groups. Finally, we constrained the residual variances among the items to be equal across groups. Although the change in the CFI value between Model 1 and Model 4 did not attain the desired cutoff value of 0.01, this most restrictive model did not indicate a significant reduction in fit compared with a less restricted model where the data fit the model reasonably well.
Table 6. Tests of Measurement Invariance for the Proposed Five-Factor Model.
Note. We follow the goodness-of-fit recommendations made by Hu and Bentler (1999), with good fit characterized by CFI >.95, RMSEA <.06, SRMR <.08, and TLI >.95. CFI = comparative fit index; RMSEA = root mean square error of approximation; SRMR = standardized root mean square residual; TLI = Tucker–Lewis index.
DIF results for DLLs and non-DLLs
Table 7 displays the results for the DIF analyses between DLL and non-DLL children. Most items indicated negligible DIF across the three time points: 77% in the fall and 87% in both the winter and the spring. The majority of items that exhibited DIF came from the language and literacy domain. For the fall assessment data, four items displayed intermediate DIF (Items 14 and 19 favored non-DLLs; Items 5 and 29 favored DLLs) and two items displayed large DIF favoring non-DLLs (Items 20 and 22). In the winter assessment data, three items displayed intermediate DIF (Items 14 and 20 favored non-DLLs; Item 29 favored DLLs) and one item displayed large DIF in favor of non-DLLs (Item 22). By the spring assessment, one item exhibited intermediate DIF (Item 19) and two items exhibited large DIF (Items 20 and 22), all in favor of non-DLLs. Looking across all three assessment time points, Items 19, 20, and 22, measuring children’s concepts about print, phonological awareness, and emergent writing, respectively, exhibited either intermediate or large DIF. All the DIF items showed uniform DIF, as indicated by the chi-square difference tests and R2 effect sizes displayed in the last four columns of Table 7. Therefore, not all of the items function equivalently for DLL and non-DLL children.
Table 7. Results of DIF Analyses Between DLL and Non-DLL Students.
Note. We follow the goodness-of-fit recommendations made by Hu and Bentler (1999), with good fit characterized by CFI >.95, RMSEA <.06, SRMR <.08, and TLI >.95. Cells for the chi-square difference and change in R2 are ranges (minimum and maximum values). DIF = differential item functioning; DLL = dual language learners; CFI = comparative fit index; RMSEA = root mean square error of approximation; SRMR = standardized root mean square residual; TLI = Tucker–Lewis Index.
Discussion
Increasingly, programs, districts, and states are requiring assessments of children’s school readiness prior to kindergarten entry. The DRDP is a measure of school readiness currently used in Head Start centers and state-funded preschools as an assessment tool for kindergarten readiness. California and Missouri have sought to standardize the assessment process by mandating the use of the DRDP—an instrument initially designed to help educators understand student progress and individualize instruction—for all preschool programs receiving state funding. However, until now, little has been known about the psychometric properties of this highly used instrument. Drawing on administrative data from one Head Start agency in California, our study focused on the reliability and validity of the DRDP and explored its appropriateness for use with diverse populations of young children. To the best of our knowledge, this study is the first to establish psychometric information on the DRDP. Below, we summarize our findings and discuss the implications and limitations of this study.
We first demonstrated that the purported seven-factor structure of the measure was not supported by data. In addition, we tested a higher order unidimensional model and found that it was also not a better fitting model. Next, we theoretically derived a five-factor model through a Q-sort exercise and an understanding of the constructs gleaned from the literature. We conducted a CFA and showed that our proposed five-factor structure is a better fit to the data, having greater face and statistical validity (Shadish et al., 2001). We then conducted multiple group analysis and verified that this factor structure was robust when used with DLL and non-DLL preschool students. Finally, we evaluated measurement equivalence at the item level by examining DIF for the same children at the three time points (fall, winter, spring). Encouragingly, more than half of the items on the DRDP displayed little DIF, with the number of items exhibiting large DIF ranging from one to two across time points. Items measuring a child’s language and literacy development tended to display DIF favoring non-DLL children. However, only a few items related to language were consistently identified as having DIF across all three time points, and some items displaying DIF at one time point did not indicate DIF at the other time points. No distinct pattern emerged to explain why particular items were only problematic at certain time points. In general, it is encouraging to see that DIF properties did not vary substantially over time, suggesting that the DRDP, when reorganized conceptually, works well throughout the course of the school year, even early in the year when teachers are less familiar with the children and their abilities.
Interpreting results for items identified as having highly significant DIF (“C” DIF) is complex and multidimensional. Across the three measurement time points of the DRDP, we found that Items 19 (“concepts about print”), 20 (“phonological awareness”), and 22 (“emergent writing”) consistently displayed DIF between DLLs and non-DLLs. In terms of children’s print concepts (e.g., functions of print, concept of letter and word, directionality of print), this DIF might be due to teachers’ perceptions of children having little prior experience with print, leading to biased ratings as a source of DIF. A number of studies have documented that non-English-speaking families have different print-related practices in the home (Dixon, Zhao, Quiroz, & Shin, 2012; Schick & Melzi, 2016). For example, Reese and colleagues (Reese, Arauz, & Bazán, 2012; Reese & Gallimore, 2000; Reese & Goldenberg, 2008) showed that Latinx families often focus on environmental print, such as words and letters on food labels and signs on the street. This suggests that children might have to adjust to more academically based print-related activities once they are in the classroom setting. In addition, research in bilingualism indicates that the transfer of phonological skills and emergent writing occurs when children have developed some proficiency in both languages (Cummins, 1991; Gillanders, Franco, Seidel, Castro, & Méndez, 2017; López, 2012; López & Greenfield, 2004; Quiroga, Lemos-Britton, Mostafapour, Abbott, & Berninger, 2002). It is possible that teachers might misinterpret this phenomenon as children having poor phonological and writing skills rather than as a common developmental pattern in second language acquisition. We recognize that DIF items should be examined carefully by a panel of experts, including specialists in the focal construct, in the assessment of English learners, and in multicultural issues, to identify the main causes of such differences between the focal and the reference groups. Although we were not able to explore this further in our study, this area should be a priority for future research.
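To make the A/B/C labels concrete, the following is a minimal sketch of how a Mantel-Haenszel DIF index is commonly mapped onto the ETS delta scale for a dichotomously scored item. It is an illustration under stated assumptions and not the procedure used in this study: DRDP items are rated on ordered developmental levels rather than scored right or wrong, the function names and counts are hypothetical, and the significance tests that normally accompany the classification are omitted.

```python
import math


def mh_d_dif(strata):
    """Mantel-Haenszel D-DIF on the ETS delta scale.

    strata: one tuple per matched total-score level, holding
    (ref_pass, ref_fail, focal_pass, focal_fail) counts.
    All counts here are hypothetical illustration data.
    """
    num = 0.0
    den = 0.0
    for ref_pass, ref_fail, focal_pass, focal_fail in strata:
        n = ref_pass + ref_fail + focal_pass + focal_fail
        if n == 0:
            continue  # an empty score level carries no information
        num += ref_pass * focal_fail / n
        den += ref_fail * focal_pass / n
    if num == 0 or den == 0:
        raise ValueError("common odds ratio is undefined for these counts")
    alpha = num / den  # common odds ratio across score levels
    return -2.35 * math.log(alpha)


def ets_category(delta):
    """Magnitude-only A/B/C label (significance tests omitted in this sketch)."""
    size = abs(delta)
    if size < 1.0:
        return "A"  # negligible DIF
    if size >= 1.5:
        return "C"  # large DIF
    return "B"      # moderate DIF


# Hypothetical item on which the reference group is favored at every score level.
example = [(30, 10, 20, 20), (20, 20, 10, 30)]
delta = mh_d_dif(example)
label = ets_category(delta)
```

Under this sign convention, a negative value indicates that the item favors the reference group (here, non-DLL children) after matching on total score. For ordered rating items, a polytomous extension would be needed in place of this dichotomous form.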
There are a number of differences between the purported seven-factor model and our final five-factor model that should be noted. One notable difference between the two models is that we dropped items in the physical development and health domains. These domains are certainly important for children’s development (Grissmer, Grimm, Aiyer, Murrah, & Steele, 2010), and this is reflected in the fact that a number of other teacher-reported performance-based assessments also include the physical development domain in their measure, such as the Child Observation Record (COR; High/Scope Educational Research Foundation, 1992), Teaching Strategies GOLD (TS GOLD; Heroman, Burts, Berke, & Bickart, 2010), and the Work Sampling System (WSS; Meisels, Jablon, Marsden, Dichtelmiller, & Dorfman, 1994). In our series of analyses, we originally had a six-factor model that included the physical domain but found that by dropping this domain, we had a more parsimonious model to describe the factor structure of the DRDP. It might also make sense to drop these items because there is a concern in the literature that teachers’ perceptions of children’s physical development, including their fine and gross motor skills, are also influenced by other factors salient to teachers, such as children’s ability to sit in their seat or pay attention (Cameron et al., 2012a, 2012b). In fact, some studies have found weak to moderate correlations between teacher reports of children’s physical development and their directly assessed motor skills (Lalor, Brown, & Murdolo, 2016; Soderberg et al., 2013). We also dropped the DRDP health domain items because they were not indicators of discrete skill mastery. As the DRDP is scored to reflect the beginning stages of skill acquisition up to full mastery, skills such as personal care, healthy routines, and personal safety did not seem to conceptually fit those scoring procedures. Commonly used teacher-reported measures (e.g., COR, TS GOLD, WSS) also do not include items related to these health domains, so this raises our confidence in the conceptual underpinnings of the final five-factor model.
Another difference between the two models is that we dropped some of the language and literacy items in the final five-factor model. Empirically, Items 13 (“comprehension of meaning”), 15 (“expression of self through language”), and 18 (“comprehension of age-appropriate text presented by adults”) were dropped because they had low loadings. As part of the CFA technique, we worked with the domains we proposed to adjust the five-factor structure and improve model fit. These literacy and language items were in our original conceptualization of the language and literacy domain, but removing them improved our model fit. Direct measures of these skills focusing on comprehension of text, word meaning, and self-expression have been shown to have cultural variation (Dixon et al., 2012; Harris & Schroeder, 2013), and measures that do not consider this variation perform differentially with different ethnic and racial groups (Argulewicz & Abel, 1984; Fernandez, Pearson, Umbel, Oller, & Molinet-Molina, 1992; Lee Webb, Cohen, & Schwanenflugel, 2008). Although the DRDP is based on teachers’ perceptions of children’s learning and development, it is worthwhile to consider whether there may be cultural bias in teacher ratings. These ratings might be influenced by subjective biases in the ways they observe children’s skills (Engelhard, 2002). To be sure, the skills that these three items represent have been shown to be important for children’s later literacy and language achievement (Lonigan, Allan, & Lerner, 2011; Sénéchal, Ouellette, & Rodney, 2006), and future studies on the DRDP should consider all of these items in their psychometric evaluation among diverse preschoolers. Although dropping these items improved the model fit and internal consistency of scores, it reduces the degree to which items might provide adequate coverage of children’s language and literacy skills. 
Conceptually, we moved Item 14 (“following increasingly complex instructions”) to the domain-general cognitive skills category because this is typically associated with working memory in the executive function literature (e.g., Best & Miller, 2010; Gioia & Isquith, 2004; Klingberg, 2010). Finally, we made the decision to not include Item 17 (“interest in literacy”) in our final model because prior work has suggested that it is difficult to accurately capture children’s interest in literacy activities with teacher reports (Baroody & Diamond, 2013). It might also be that this item represents a component of children’s academic motivation (Oldfather & Wigfield, 1996; Wigfield, Eccles, Schiefele, Roeser, & Davis-Kean, 2006), which would be salient in a teacher’s perceptions of a child’s interest in literacy.
Implications and Limitations
Our results suggest that the DRDP has some promise as an assessment measure of school readiness for use with children of differing language backgrounds, but not in the structure proposed by WestEd and the CDE. We were especially interested in examining measurement invariance for DLL children because of their increasing presence in early childhood programs, the associated challenges of fair and accurate assessment, and the research documenting the achievement gaps between DLL and non-DLL children (National Center for Education Statistics, 2011; Reardon & Galindo, 2009). The children in our sample were all low-income, by definition of attending a Head Start program, and were also majority Hispanic and DLL. Although this represents an important demographic profile for early childhood educators, researchers, and policy makers, this feature greatly limits the extent to which our study is externally valid for other states and preschool programs using the DRDP. Future research should extend our analytic approach to other populations being assessed with the DRDP to develop a comprehensive evidence base for its validity and generalizability. Given the recent interest in examining the validity and reliability of teacher-rated assessments in publicly funded preschool programs (e.g., Miller-Bains, Russo, Williford, DeCoster, & Cottone, 2017; Russo, Williford, Markowitz, Vitiello, & Bassok, 2019; Wakabayashi, Claxton, & Smith, 2019), it is important to emphasize that replicating our five-factor DRDP model with other diverse samples, settings, and policy contexts is a critical next step for this work. This replication would help us understand and improve the existing DRDP measure in terms of its psychometric properties so that it can be better utilized in large-scale implementation and as a tool for improving the quality of children’s early learning experiences.
Another limitation of this study is that we were not able to examine the concurrent or discriminant validity of the DRDP by comparing it with other validated measures, such as the Woodcock-Johnson Tests of Achievement (Mather, McGrew, & Woodcock, 2001). Thus, we are unable to assess whether these results might reflect substantial between-teacher differences in the way that the DRDP is used in the classroom or whether the DRDP is a valid representation of certain skills. We were also not able to assess the fidelity with which the DRDP was used or how teachers apply the information collected from the instrument. However, our assessment of the DRDP’s psychometric properties is within business-as-usual preschool practice, which is most relevant for applied research and state assessment policy guidance. It is important to note that more psychometrically rigorous measures, such as the Woodcock-Johnson Tests of Achievement, typically involve direct assessment of children individually, which is often not feasible in group-care settings with mandated ratios of teachers to children. For the DRDP, teachers make notes about children’s performance on DRDP items during care time, but the scoring typically occurs when children are not present, making the DRDP more feasible for providers to use. However, despite the popularity of teacher-reported measures for assessing children’s school readiness, the DRDP scores may be driven by assessor variance (Waterman, McDermott, Fantuzzo, & Gadsden, 2012). That is, a substantial share of the variability in children’s DRDP scores may be attributable to the teachers who completed the measure rather than to the children themselves. Interestingly, Waterman et al. (2012) found that about 28% of the variation in a teacher-reported measure was attributed to teachers, and not children. This issue of shared method variance is reflected in a number of other studies that rely on teacher-reported measures, and we encourage future investigations to make efforts to measure children’s school readiness with a variety of methods if feasible.
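As an illustration of what a teacher share of score variance means, the sketch below computes a one-way intraclass correlation, ICC(1), from ratings grouped by teacher. It is a simplified stand-in under stated assumptions and not the analysis reported by Waterman et al. (2012) or conducted in this study: it assumes a balanced design with the same number of children per teacher, and the function name and ratings are hypothetical.

```python
from statistics import mean


def icc_oneway(groups):
    """One-way random-effects ICC(1) for a balanced design.

    groups: list of lists, one inner list of child scores per teacher.
    Returns the estimated share of score variance lying between teachers.
    """
    n_groups = len(groups)
    if n_groups < 2:
        raise ValueError("need at least two teachers")
    k = len(groups[0])
    if k < 2 or any(len(g) != k for g in groups):
        raise ValueError("need a balanced design with 2+ children per teacher")

    grand = mean(x for g in groups for x in g)
    group_means = [mean(g) for g in groups]

    # Mean square between teachers and mean square within teachers.
    msb = k * sum((m - grand) ** 2 for m in group_means) / (n_groups - 1)
    msw = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    ) / (n_groups * (k - 1))

    denom = msb + (k - 1) * msw
    if denom == 0:
        raise ValueError("no variance in scores")
    return (msb - msw) / denom


# Hypothetical ratings: three teachers, four children each.
ratings = [[2, 3, 3, 2], [4, 4, 5, 4], [3, 3, 2, 3]]
teacher_share = icc_oneway(ratings)
```

A value near 0 would suggest that scores vary mainly among children within classrooms, whereas a value near 1 would suggest that scores depend mostly on who did the rating. With unbalanced classrooms or additional covariates, a multilevel model would be the more appropriate tool.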
Given the DRDP’s widespread use, it is key that evidence of its reliability and validity be developed with samples representative of the children to whom it is administered and that its appropriateness for use in diverse populations be established. This study and our proposed reorganization of the DRDP subscales can help teachers and administrators better assess and forecast children’s school readiness using these five domains. Evaluation of its psychometric properties, as well as a clear understanding of how teachers use the measure to support children’s learning and development, should continue as long as the DRDP is in use. We hope that future research will be able to make better use of this measure based on our proposed five-factor structure. With newer modifications of the DRDP to include eight domains, additional assessments of its factor structure will be needed, and this study can guide such an endeavor. Almost half a million children attending state-funded preschool in California and Missouri each year are being assessed with the DRDP (Friedman-Krauss et al., 2018), yet those data are not being used to inform educational research and there is little evidence of their use to shape educational practice. We hope this study provides insight into how to analyze and interpret large-scale samples of the DRDP more productively.
Supplemental Material
Supplemental material, DRDP_SupplementalMaterials for Psychometric Validation and Reorganization of the Desired Results Developmental Profile by Tutrang Nguyen, Stephanie M. Reich, Jade Marcus Jenkins and Jamal Abedi in Journal of Psychoeducational Assessment
Footnotes
Acknowledgements
The authors are grateful to the many teachers, administrators, and staff who made this study and the larger data collection effort possible.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Research reported in this publication was supported by the Haynes Foundation. Tutrang Nguyen is supported by the Institute of Education Sciences (IES) award #R305B170002 to the University of Virginia. The content is solely the responsibility of the authors and does not necessarily represent the official views of the Haynes Foundation, IES, or the U.S. Department of Education.
