Abstract
The Caregiver Reported Early Development Instruments (CREDI) are assessment tools for measuring the development of children under age three in global contexts. The present study describes the construction and psychometric properties of the motor, cognitive, language, and socio-emotional subscales from the CREDI’s long form. Multidimensional item factor analysis was employed, allowing indicators of child development to simultaneously load onto multiple factors representing distinct developmental domains. A total of 14,113 caregiver reports representing 17 low-, middle-, and high-income countries were analyzed. Criterion-related validity of the constructed subscales was tested in a subset of participants using data from previously established instruments, anthropometric data, and a measure of child stimulation. We also report internal-consistency reliability and test–retest reliability statistics. Results from our analysis suggest that the CREDI subscales display adequate reliability for population-level measurement, as well as evidence of validity.
A growing body of research shows that early childhood is a sensitive period of brain and skill development and has the largest individual and social returns to investments relative to other periods of human development (Grantham-McGregor et al., 2007; Heckman, 2006; Lu et al., 2016; Moffitt et al., 2011; Nores & Barnet, 2010; Peet et al., 2015). Reflecting this promise, the past several decades have seen a surge in global interest in promoting early childhood development (ECD), particularly during the first one thousand days of life (Black et al., 2017). A broad range of ECD intervention approaches (e.g., home visiting programs, early childhood care and education services, nutritional supports) have been developed to meet the needs of children living in diverse settings around the world and are increasingly being prioritized by governments and nongovernmental organizations for large-scale implementation (Richter et al., 2017). At a policy level, the United Nations’ (2015) recently ratified Sustainable Development Goals (SDGs) specifically address ECD under Target 4.2. In fact, Target 4.2 represents the first major global policy initiative to focus specifically on ECD.
Central to the success of ECD intervention and policy efforts is access to reliable, valid, and practically feasible methods for measuring young children’s outcomes. In particular, experts have highlighted the need for global instruments that can be used to capture multiple domains of development (e.g., motor skills, language skills, etc.) in large, culturally diverse samples (Richter et al., 2019). Such approaches are critical for a multitude of purposes, ranging from improving basic understanding of developmental processes globally to evaluating the impact of programs and policies on child outcomes to monitoring progress toward global policy targets.
The large-scale implementation of existing measures of motor, cognitive, language, and socio-emotional development in children younger than 3 years of age is likely not feasible in international contexts. Existing ECD instruments include the Denver Developmental Screening Test (Frankenburg & Dodds, 1967), the Bayley Scales of Infant and Toddler Development (BSID-III; Bayley, 2006), and the Ages & Stages Questionnaire (Squires & Bricker, 2009). These instruments provide information about ECD with enough precision to screen individual children for developmental disabilities or delays. However, these instruments were primarily constructed for U.S. populations and, with some exceptions (e.g., Kerstjens et al., 2009), there is limited evidence on their validity in international contexts (Peña, 2007). Furthermore, the costs and resources associated with purchase and implementation make these instruments difficult to implement in large samples, particularly in resource-limited low- and middle-income countries (LMICs).
In recent years, a number of instruments have been developed to address the need for cross-culturally comparable ECD measures. For example, Save the Children’s International Development and Early Learning Assessment has shown evidence for validity and easy implementation in international contexts, but it is intended to measure learning and development for children 3.5 to 6.5 years old (Halpin et al., 2019; Wolf et al., 2017). Similarly, the Inter-American Development Bank’s Regional Project on Child Development (PRIDI), a direct assessment tool, seeks to measure 2- to 4-year-olds’ motor, cognitive, language, and socio-emotional development using a brief set of indicators that are considered to be valid in culturally diverse contexts. A final example is the INTERGROWTH-21st Project Neurodevelopment Package (INTER-NDA), which was calibrated and tested in eight multiethnic sites across five continents but only targets 22- to 26-month-old children (Fernandes et al., 2014). Given the particularly high plasticity of development during the first 3 years of life (Walker et al., 2011), there is an urgent need for scalable, internationally validated instruments to monitor child development in this specific developmental period.
In response to the limitations of existing ECD measures, we developed the Caregiver Reported Early Development Instruments (CREDI; McCoy et al., 2016, 2018). The CREDI is a simple, caregiver-reported measure developed for large-scale assessment of ECD for children between the ages of zero and three years. The CREDI exists in both a short form and a long form. The short form aims to provide policymakers and NGOs with a single score of overall ECD, and these scores have demonstrated validity evidence in 17 low-, middle-, and high-income countries (McCoy et al., 2018). In contrast, the purpose of the long form is to provide finer-grain information regarding children’s development across multiple domains, including (a) motor skills (including fine and gross motor skills), (b) cognitive skills (including executive functioning, reasoning, problem solving, and pre-academic knowledge), (c) language skills (including expressive and receptive language skills), and (d) socio-emotional skills (including emotional and behavioral self-regulation, emotion knowledge, and social competence). A thorough discussion of the instrument’s construction (i.e., item construction, data collection procedures, etc.) as well as the psychometrics of the short form is provided by [citation redacted].
The aim of the present study is to report the validity evidence for the CREDI long form’s motor, cognitive, language, and socio-emotional subscale scores obtained from N = 14,113 caregiver reports in a multicultural, multinational sample. In assessing the evidence, we followed the recommendations set forth by the Standards for Educational and Psychological Testing (henceforth, Standards; American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 2014) to evaluate whether subscale scores obtained from the CREDI long form support inferences about the developmental status of children under 3 years of age. Complementing existing evidence regarding the CREDI’s test content and cognitive testing provided in McCoy et al. (2018), this study provides evidence of the long form’s (1) construct validity, that is, its internal structure, including the extent to which observed item response patterns are predicted by theory; (2) criterion validity, that is, the degree to which relations between the CREDI subscale scores and other variables match what would be expected by theory; (3) reliability, in that the subscale scores drawn from the CREDI long form are sufficiently precise for the intended purpose of the instrument (i.e., population measurement of young children’s developmental status in multiple developmental domains); and (4) fairness, in that scores do not result in biased conclusions about children’s developmental status. To provide this evidence, we begin by comparing a variety of potential model specifications, including several multidimensional models that allow CREDI items to load onto multiple subscales simultaneously.
We argue that such a multidimensional approach is more conceptually valid for capturing ECD during infancy and toddlerhood, when children’s observable behaviors often reflect multiple underlying skills or capacities (e.g., pointing as reflecting both expressive communication and motor skills).
Measures and Methods
Participants
We collected data from a sample of 14,113 primary caregivers of children aged 0–35 months from 21 sites across 17 high-income countries and LMICs. The mean age of children was 20.3 months (SD = 9.41). Approximately half of the children were male (50.2%). Geographically, 53.2% of respondents were from Africa (Ghana, Tanzania, and Zambia), 20.68% from Asia (Bangladesh, Cambodia, India, Jordan, Laos, Nepal, Pakistan, and the Philippines), 21.0% from Latin America (Brazil, Chile, Colombia, and Guatemala), and 6.4% from the United States (see Table 1 for details about the sample). The CREDI was translated (and back-translated) from English to local languages in all sites. Within each site, surveys were administered to caregivers of children participating in local research projects. Although samples were predominantly convenience-based, several sites (e.g., Brazil, Nepal, Cambodia) included samples that were representative of subnational units (e.g., districts or zones).
Table 1. CREDI Sample Description.
Note. Income per person and day computed by dividing purchasing-power-parity adjusted per capita income in each country by 365 days. Stunting data refer to children under age 5 and were retrieved from http://data.unicef.org/topic/nutrition/malnutrition/.
The study was reviewed by each site’s Institutional Review Board, and all data collection was conducted in accordance with local ethical standards. All caregivers gave informed consent.
CREDI
In administering the CREDI, we asked caregivers to report whether their child can or does exhibit a range of milestones, skills, and behaviors compiled to measure motor, cognitive, language, and socio-emotional development for children under 36 months of age. We developed and refined the wording of the items using a multiphase process that has been documented previously (see citation redacted). Overall, we field-tested 149 items. Caregivers responded to up to 103 dichotomous items that were identified as appropriate given the child’s age. Caregivers could answer all CREDI items with a “yes,” “no,” or “I don’t know” response. We treated all “I don’t know” responses as missing values in the analysis.
Of the 149 items tested, we excluded 39 from further analysis as these items: (1) showed >10% “don’t know” responses, (2) were understood by fewer than 80% of caregivers on cognitive interviews, (3) showed poor agreement levels (unadjusted for chance agreement) of Cohen’s
Construct Validity
Consistent with the recommendations from the Standards, we gathered construct validity evidence by demonstrating that there is a theoretical basis for explaining item response patterns (i.e., the internal structure). In developing ECD instruments, exploratory factor analysis (EFA) is often used as a starting point for evaluating the internal structure of the items, including ascertaining the dimensionality of the instrument and assessing internal structure using factor loadings (e.g., Fernandes et al., 2014; Ghandour et al., 2019). In contrast, in educational assessment and the item response theory literature, test developers often employ confirmatory approaches (e.g., confirmatory factor analysis [CFA]) in which the loading structure of items to constructs is pre-specified according to a panel of experts (Liu & Kang, 2019). Both EFA and CFA have advantages and disadvantages. On the one hand, with CFA, there is no guarantee that the theoretical loading structure specified by a panel of experts best explains item responses. On the other hand, traditional EFA models make strong distributional (i.e., normality of the underlying factors) and parametric (i.e., linearity) assumptions that likely do not hold perfectly in real-world data; consequently, solutions from EFA that differ from theoretical expectations may reflect the sensitivity of the parameter estimates to assumption violations rather than the true underlying structure of the instrument. Indeed, although a traditional EFA was conducted (contact first author for details), the EFA solution was determined to be inconsistent with theory because all but 6 items (2 motor items and 4 socio-emotional items) loaded onto two factors, and these two factors had no clear correspondence to the theorized ECD constructs.
Our approach for testing the internal structure of the CREDI long form utilized a hybrid between CFA and EFA. Consistent with CFA, we fit multidimensional item factor analysis (IFA) models to the data using a theoretically grounded factor loading structure developed by a team of 16 external expert advisors. However, we also use disagreements among the panel of experts to specify alternative models and evaluate the corresponding fit. Thus, our hybrid approach attempted to strike a balance between identifying a factor loading specification that maximizes model-data consistency (i.e., the goal of EFA) while ensuring that we remain tied to theory (i.e., the goal of CFA).
To begin this process, our panel of expert advisors analyzed each item and voted for all ECD domains that the item was hypothesized to measure (i.e., motor, cognitive, language, and/or socio-emotional; see Online Supplemental Appendix Table 2 for a full tally of all expert votes). These experts included developmental psychologists and pediatricians representing a range of countries. Items with potential implications for multiple areas of development could be flagged as representing more than one domain of development so as to allow for cross-loadings. In other words, unlike what has been done in traditional ECD instruments that provide subscores (cf. Bayley, 2006; Fernandes et al., 2014; Squires & Bricker, 2009), we specify cross-loadings and do not require that items are assigned to one and only one ECD domain. We hypothesized that specifying the presence of cross-loadings would better reflect the internal structure of the data because any given item may indicate children’s development across several domains. This is especially likely in the first 3 years of development when children’s observable behaviors often reflect multiple different skills and capacities. For example, most traditional measures of ECD claim infants’ use of gestures (e.g., pointing, grabbing) as a “pure” representation of their language abilities, whereas it is likely that these behaviors also reflect skills in motor development (Bowman et al., 2018).
We tested three alternative loading specifications by varying the minimum number of expert votes required to freely estimate a loading across the factors (i.e., domains). These four-factor IFA models are visualized in Figure 1. In the first model (Model A.1), loadings were freely estimated if at least eight (of the 16) experts agreed that the item loaded on a domain. The second model (Model A.2) and third model (Model A.3) reduced the required number of votes to free a loading to six and four, respectively. We did not test less restrictive specifications, as freeing loadings with fewer than four votes led to convergence issues.
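The vote-threshold rule can be illustrated with a short Python sketch. The item names and vote tallies below are invented for illustration and are not CREDI data; only the thresholds of eight, six, and four votes come from the description above.

```python
# Hypothetical expert vote tallies (out of 16 experts) per item and domain.
# Item names and counts are invented for illustration only.
DOMAINS = ("motor", "cognitive", "language", "socio_emotional")

votes = {
    "points_to_object": {"motor": 9, "cognitive": 3, "language": 7, "socio_emotional": 4},
    "stacks_blocks": {"motor": 14, "cognitive": 6, "language": 0, "socio_emotional": 1},
    "says_two_words": {"motor": 1, "cognitive": 5, "language": 16, "socio_emotional": 2},
}


def loading_pattern(vote_table, min_votes):
    """Return, per item, the domains whose loading would be freely estimated."""
    return {
        item: [d for d in DOMAINS if tally.get(d, 0) >= min_votes]
        for item, tally in vote_table.items()
    }


# Thresholds described in the text: eight, six, and four votes.
model_a1 = loading_pattern(votes, min_votes=8)
model_a2 = loading_pattern(votes, min_votes=6)
model_a3 = loading_pattern(votes, min_votes=4)

for name, pattern in [("A.1", model_a1), ("A.2", model_a2), ("A.3", model_a3)]:
    print(name, pattern)
```

Lowering the threshold can only add free loadings, so in this sketch the three specifications are nested by construction.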

Figure 1. (a) Four-Factor IFA Model That Specifies Cross-Loadings; (b) Four-Factor IFA Model Without Cross-Loadings; (c) Three-Factor IFA Model With the Factor Representing Cognitive and Socio-Emotional Skills Combined and With Cross-Loadings Specified; (d) Single-Factor IFA Model.
In all models, we relaxed the (unconditional) multidimensional normality assumption traditional in multidimensional IFA models because such an assumption is likely untenable. For example, we did not think it would be plausible to assume that motor subscale scores for all children aged 0–35 months would follow a symmetric distribution as would be implied by the traditional normality assumption. In other words, a symmetric assumption would imply that motor scores followed linear age gradients, whereas we expect nonlinear gradients with the fastest rate of change occurring early in development and then tapering with age. The result of a tapered age gradient would be a left-skewed marginal distribution of motor scores.
To accommodate nonlinear age gradients, all four factors were modeled with a linear, quadratic, and cubic function of age as covariates. In this way, subscale scores were assumed to be multivariate normally distributed for children of the same age, even if the marginal distribution is not normally distributed. Intercepts were fixed to zero and residual variances were fixed to one for model identification. In addition, we included site fixed effects (with the sample from Jordan as the reference group) to account for planned missingness, as not all items were administered in all sites. (Items exhibited an average missingness rate of 50.3% and ranged from 14.6% to 76.9% across sites.) We employed maximum likelihood estimation, which assumes that data are missing at random conditional on the observed responses, age, and between site differences in factor scores. Analyses were conducted in Mplus Version 8.3 (Muthén & Muthén, 2017).
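The distinction between conditional and marginal normality can be illustrated with a small simulation. This is a sketch under stated assumptions: the logarithmic age curve is a hypothetical stand-in for a tapering gradient (it is not the fitted cubic from the CREDI models), ages are drawn uniformly, and the residual variance is set to one as in the identification constraint described above.

```python
import math
import random

random.seed(0)


def skewness(xs):
    """Moment-based sample skewness (third central moment over SD cubed)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def expected_score(age_months):
    """Hypothetical tapering age gradient: fast early growth that slows with age."""
    return 3.0 * math.log1p(age_months)


# Ages drawn uniformly over 0-35 months (an assumption for illustration).
ages = [random.uniform(0, 35) for _ in range(5000)]
# Scores are normal around the age-specific mean, with residual variance one.
residuals = [random.gauss(0.0, 1.0) for _ in ages]
scores = [expected_score(a) + e for a, e in zip(ages, residuals)]

marginal_skew = skewness(scores)       # pooled over all ages
conditional_skew = skewness(residuals)  # after removing the age gradient

print(round(marginal_skew, 2), round(conditional_skew, 2))
```

In this sketch the pooled scores are left-skewed even though scores are symmetric among children of the same age, which is the pattern the age covariates are intended to accommodate.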
To minimize overfitting and maximize model-data consistency, we next pruned Models A.1–A.3, fitted using the loading specifications. In theory, after reverse-coding items (as appropriate), all items should be positively correlated with the specified developmental domain, implying that all loading estimates should result in positive values. In fitting the models, however, we encountered overfitting behaviors where negative loading estimates on one domain often accompanied unreasonably strong positive loading estimates on another domain. For a small subset of items, this undesirable compensating behavior was so severe when fitting Model A.3 that it led to convergence problems. We considered the instability induced by this compensation as an indication of overfitting to our sample because theory would suggest that all factor loadings in the population would be positive. Overfitting implies that the model is overly complex and does not optimize predictive fit to out-of-sample data compared to a more parsimonious model (cf. Hastie et al., 2009). Accordingly, a current focus in measurement and structural equation modeling is developing methods to minimize overfit by reducing model complexity to improve the generalizability of inferences (e.g., Jacobucci et al., 2016). Our approach to reduce model complexity was to specify linear inequality constraints that required loading estimates to take on nonnegative values only. Loadings with estimates at the boundary of the constraint (i.e., equal to zero) were removed from consideration. We subsequently fit a model without specifying any constraints and removed any nonsignificant loadings to arrive at our final solution.
Next, we conducted likelihood ratio tests and compared information criteria for the three pruned models (Models A.1–A.3) to select a final model. After selecting the best-fitting model, we assessed whether cross-loadings could be ignored by fitting a new four-factor IFA model (Model B, diagrammed in Panel B of Figure 1) in which we assigned each item to the factor corresponding to its most positive standardized loading in the best-fitting model among Models A.1–A.3.
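The quantities involved in these model comparisons can be sketched in a few lines of Python. The log-likelihoods and parameter counts below are made-up placeholders, not estimates from the CREDI models; only the sample size is taken from the text.

```python
import math


def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 log L (smaller is better)."""
    return 2 * n_params - 2 * log_lik


def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: k log(N) - 2 log L (smaller is better)."""
    return n_params * math.log(n_obs) - 2 * log_lik


def lr_test(log_lik_restricted, k_restricted, log_lik_full, k_full):
    """Likelihood ratio statistic and degrees of freedom for two nested models."""
    statistic = 2 * (log_lik_full - log_lik_restricted)
    df = k_full - k_restricted
    return statistic, df


N_OBS = 14113  # sample size reported in the text

# Hypothetical values for a restricted model and a less restricted model.
restricted = {"log_lik": -250000.0, "k": 300}
full = {"log_lik": -249000.0, "k": 318}

statistic, df = lr_test(restricted["log_lik"], restricted["k"], full["log_lik"], full["k"])
# The p-value would come from the upper tail of a chi-square distribution
# with `df` degrees of freedom (for example, scipy.stats.chi2.sf(statistic, df)).

aic_restricted = aic(restricted["log_lik"], restricted["k"])
aic_full = aic(full["log_lik"], full["k"])
bic_restricted = bic(restricted["log_lik"], restricted["k"], N_OBS)
bic_full = bic(full["log_lik"], full["k"], N_OBS)

print(statistic, df, aic_full < aic_restricted, bic_full < bic_restricted)
```

The likelihood ratio test applies only to nested models, whereas AIC and BIC can also rank models that are not nested.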
We evaluated the dimensionality of the data by comparing the fit of the best-fitting four-factor IFA model (i.e., Model A or Model B, which include separate factors for motor, cognitive, language, and socio-emotional skills) with that of IFA models with fewer dimensions (Model C and Model D). In fitting Model A and Model B, we consistently found strong, positive residual correlations between the factors representing cognitive and socio-emotional skills (approximately r = .80). Consequently, we tested a model that combined these factors (Model C) against a four-factor solution. Next, we tested whether a four-factor solution fits better than a unidimensional model (Model D) specified with a single factor representing one general ECD construct. If the data support a model specified with four factors, then we would expect that the best-fitting four-factor solution (Model A or Model B) would fit better than the three-factor solution (Model C) and the unidimensional model (Model D).
Criterion-Related Validity
Following the recommendations of the Standards, we assessed the criterion-related validity evidence by evaluating whether the relations of subscores with other measures and known correlates of children’s development are consistent with theory. We studied the correlations of CREDI scores with anthropometric data and household stimulation measures because these variables have been shown to predict children’s development (Sudfeld et al., 2015; Walker et al., 2011). Additionally, we studied associations between the CREDI subscores and scores from concurrent ECD measures obtained from a subsample of participants to investigate convergent and discriminant relations. Local collaborators within the data collection sites selected concurrent measures based on children’s age and cultural appropriateness. These measures included (1) the ASQ Social-Emotional (ASQ: SE; Squires et al., 2002), collected from 234 Chilean children; (2) the BSID-III, collected from 1,036 Tanzanian children; (3) the INTER-NDA, collected from 921 Zambian children; (4) the MacArthur-Bates CDI, collected from 180 Chilean children; and (5) the PRIDI, collected from 598 Brazilian children from 2 to 3 years old. Online Supplemental Appendix Table 1 presents a brief description of each instrument. In analyzing convergent and discriminant validity, we calculated partial correlations using polynomial regression to control for the strong confounding effect of age; we also controlled for between-site differences in scores by specifying fixed effects in the regression model.
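One common way to obtain such a partial correlation is to regress each score on the covariates and correlate the residuals. The sketch below follows that approach with simulated data; the cubic age polynomial, the single site indicator, and all coefficients are assumptions made for illustration and do not reproduce the authors' exact specification.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def residualize(y, design):
    """Residuals of y after ordinary least squares on the columns of design."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * y_i for row, y_i in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return [y_i - sum(b * x for b, x in zip(beta, row)) for row, y_i in zip(design, y)]


def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return cov / (sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v)) ** 0.5


# Simulated data: both scores grow with age and share a child-level trait.
random.seed(1)
design, credi, other = [], [], []
for _ in range(2000):
    age = random.uniform(0, 35) / 12.0  # age in years keeps the polynomial well scaled
    site = random.randint(0, 1)         # one hypothetical site indicator
    trait = random.gauss(0, 1)
    design.append([1.0, age, age ** 2, age ** 3, float(site)])
    credi.append(2.0 * age - 0.3 * age ** 2 + 0.5 * site + trait + random.gauss(0, 1))
    other.append(1.5 * age + 0.2 * site + 0.5 * trait + random.gauss(0, 1))

raw_r = pearson(credi, other)
partial_r = pearson(residualize(credi, design), residualize(other, design))
print(round(raw_r, 2), round(partial_r, 2))
```

In this simulation the raw correlation is inflated by the shared age trend, and the partial correlation is smaller because it reflects only the association among children of the same age and site.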
We collected anthropometric and household stimulation data for 8,925 children in seven countries. HAZ scores (height-for-age or length-for-age z-scores for children less than 24 months) were calculated using the WHO child growth standards (Onis, 2006). Child stimulation was measured following UNICEF guidelines (2014) by totaling the number of adult–child activities reported by the main caregiver, including reading, telling stories, singing songs, taking the child outside, playing, and naming, counting, or drawing objects.
Reliability
We tested two forms of reliability in this study. First, we examined the stability of scores (i.e., test–retest reliability) using data collected from 575 caregivers in Guatemala, Jordan, and Lebanon, who completed the CREDI twice over a 7- to 10-day administration period. We calculated intraclass correlation coefficients (ICCs) to measure the stability of scaled scores. We fit a one-way random effects analysis of variance to estimate the intraclass correlation coefficient, ICC(1). We chose the one-way random effects model over two-way alternatives because the one-way model measures the absolute agreement between scores across the two points in time by estimating the correlation between time points (McGraw & Wong, 1996).
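The ICC(1) computation from a one-way random effects analysis of variance can be sketched as follows. The four test–retest score pairs are invented for illustration, and the function assumes every child has the same number of measurements.

```python
def icc1(groups):
    """ICC(1) from a one-way random effects ANOVA.

    `groups` is a list of equal-length lists, one per child, holding that
    child's repeated scores (here, two administrations).
    """
    n = len(groups)
    k = len(groups[0])
    grand_mean = sum(sum(g) for g in groups) / (n * k)
    group_means = [sum(g) / k for g in groups]

    # Between-child and within-child mean squares.
    ms_between = k * sum((m - grand_mean) ** 2 for m in group_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    ) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical scaled scores at time 1 and time 2 for four children.
scores = [[1.0, 2.0], [3.0, 3.0], [5.0, 4.0], [7.0, 8.0]]
icc_value = icc1(scores)
print(round(icc_value, 3))
```

For these four pairs the between-child mean square is 13.125 and the within-child mean square is 0.375, giving an ICC(1) of 12.75 / 13.5, or about .94.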
Second, we analyzed internal-consistency reliability by studying pairwise tetrachoric correlations and by calculating Cronbach’s α values. We relied on Cronbach’s α statistics rather than coefficient omega statistics because the latter assumes unidimensionality (see Bandalos, 2018, p. 395) which is not amenable to the multidimensional measurement approach we adopted in this study. Specifically, for each domain, we evaluated separate Cronbach’s α values for children aged 0–11 months, 12–23 months, and 24–35 months. We note that reporting a single α value across all ages is not appropriate because item responses are so highly correlated with age. Consequently, a single value would suggest greater precision of the instrument than warranted when an important goal of the instrument is to discriminate among children of the same age.
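A minimal sketch of computing Cronbach’s α separately within age bands is shown below. The responses are invented, and the sketch assumes complete dichotomous data; the planned missingness in the actual CREDI data would require additional handling that is not shown here.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for complete item responses (one list of items per child)."""
    k = len(rows[0])
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def age_band(age_months):
    """Assign a child to one of the three age bands used in the text."""
    if age_months < 12:
        return "0-11"
    if age_months < 24:
        return "12-23"
    return "24-35"


# Hypothetical records: (age in months, yes/no responses to three items).
records = [
    (3, [1, 1, 0]), (6, [1, 0, 0]), (8, [0, 0, 0]), (11, [1, 1, 1]),
    (12, [1, 1, 1]), (14, [1, 1, 0]), (16, [1, 0, 0]),
    (18, [0, 0, 0]), (20, [1, 1, 1]), (23, [0, 0, 0]),
]

# Group responses by age band, then compute alpha within each band.
by_band = {}
for age, responses in records:
    by_band.setdefault(age_band(age), []).append(responses)

alpha_by_band = {band: cronbach_alpha(rows) for band, rows in by_band.items()}
print({band: round(a, 2) for band, a in alpha_by_band.items()})
```

Computing α within bands removes most of the age-driven covariance among items, which is the reason given above for not reporting a single pooled value.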
Fairness
We investigated measurement noninvariance by studying whether there is evidence of test-level bias in scores across (a) high, (b) middle-high, and (c) low country income groups, as indicated by differential test functioning. We used only data from the fourth and last round of pilot testing, when the administration of the CREDI most resembled its current form. Thus, the total sample size for assessing invariance was N = 6,545 caregivers.
In the present study, we conducted pairwise tests comparing differential test functioning across each income group, separately by domain (e.g., motor, cognition). We used the simulation procedure advanced by Chalmers et al. (2016) to form a sampling distribution for the unsigned differential test functioning (uDTF) statistic to conduct significance testing. The uDTF is interpreted as the average absolute difference in predicted total scores given children’s position on the scale for a particular domain (e.g., motor, cognition), where we used the maximum-a-posteriori factor scores to approximate a child’s position on the scale. As an absolute difference, the uDTF is a conservative statistic and represents an upper bound in measuring differential test functioning. If the estimated uDTF statistic is statistically significant, such evidence suggests that abilities differentially predict item response patterns and may indicate possible test-level bias. Relying on Stark et al.’s (2004) proposed Cohen’s d, we analyzed the substantive size of the uDTF to ascertain whether evidence of bias is practically important,
where
Results
Construct Validity
Of the 110 initial CREDI items, 108 items exhibited positive loadings on at least one domain across the three initial loading specifications discussed in the Measures and Methods section and outlined in Panel A of Figure 1 (Models A.1–A.3). The two items that did not exhibit a positive loading under any of the considered specifications were (1) “Does the child often cry for no reason (e.g., when he/she is not hungry or tired)?” (reverse coded), and (2) “Does the child cry or whine when he/she is made to wait for something he/she wants (e.g., toy or food)?” (reverse coded). These items were subsequently removed when fitting pruned versions of Models A.1–A.3. Likelihood ratio tests suggested that the more stringent eight-vote threshold (Model A.1) and six-vote threshold (Model A.2) for specifying cross-loadings resulted in a decrement in model fit relative to the less strict four-vote model (Model A.1 vs. Model A.3: χ²(18) = 3,999.64, p < .001; Model A.2 vs. Model A.3: χ²(7) = 1,425.28, p < .001).
Relative to Model A.3, likelihood ratio tests also identified a significant decrement in model fit if cross-loadings were not specified (Model B vs. Model A.3: χ²(26) = 2,106.30, p < .001), if a three-factor solution was employed by combining the factors representing cognitive and socio-emotional skills (Model C vs. Model A.3: χ²(81) = 16,388.35, p < .001), or if a unidimensional model (Model D) was utilized (Model D vs. Model A.3: χ²(100) = 15,106.78, p < .001). Combined with the fact that Model A.3 also minimized both the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) across all fitted models (see Table 2), the data suggest that a four-factor model with cross-loadings maximizes model-data consistency. Thus, we selected Model A.3 as the final model for the CREDI long form. Observed loading patterns are reported in Table 3, and Online Supplemental Appendix Table 2 reports standardized factor loading estimates for this final model (Model A.3); unstandardized factor loadings and threshold estimates are provided in Online Supplemental Table 1.
Table 2. Model Fit Across Fitted IFA Models.
Note. N = 14,113. All models fit to the same J = 108 items.
Table 3. Observed Loading Patterns From the Best-Fitting Model (Model A.3).
Note. ✓ indicates positive and significant loading estimate.
The correlations among the residuals of the motor, cognitive, language, and socio-emotional factors from the final model (Model A.3) suggest that the factors themselves displayed adequate discrimination to justify a four-factor solution. Except for the residual correlation between the factors representing cognitive and socio-emotional skills (r = .81, p < .001), these values ranged from r = .49 (p < .001) between language and socio-emotional skills to r = .67 (p < .001) between motor and cognitive skills. Children’s scores on one factor most often explained less than half the variance in scores on a separate factor, holding age constant and controlling for mean differences in scores between sites.
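The shared-variance statement follows from squaring the residual correlations. The short Python sketch below uses the three correlations reported in the paragraph above; the dictionary keys are labels chosen for illustration.

```python
# Residual correlations reported in the text for the final model.
residual_correlations = {
    "language_vs_socio_emotional": 0.49,
    "motor_vs_cognitive": 0.67,
    "cognitive_vs_socio_emotional": 0.81,
}

# The squared correlation is the proportion of variance two factors share.
shared_variance = {pair: round(r ** 2, 2) for pair, r in residual_correlations.items()}

for pair, r2 in shared_variance.items():
    label = "less than half" if r2 < 0.5 else "more than half"
    print(f"{pair}: r^2 = {r2} ({label})")
```

Only the cognitive and socio-emotional pair exceeds one half (about .66), consistent with the exception noted in the text.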
Criterion-Related Validity
For each of the four ECD domains, we found evidence of criterion-related validity. The partial correlation between HAZ and CREDI subscores ranged from r = .16 (p < .001) to r = .20 (p < .001). These partial correlations were similar to or larger than those observed between HAZ and scores from concurrent ECD instruments in this sample. Similarly, CREDI subscale scores were positively associated with child stimulation, with partial correlations ranging from r = .21 (p < .001) to r = .25 (p < .001). As observed with HAZ, CREDI subscale scores were more positively correlated with stimulation than scores from the previously established ECD measures (although we recognize that this may be in part a function of same-reporter bias). Finally, CREDI subscale scores were positively associated with the PRIDI scores (a composite measure of overall development) in Brazil, and partial correlations ranged from r = .37 (p < .001) to r = .47 (p < .001).
We also found that convergent and discriminant relations between the CREDI motor and language subscales and subscores from alternative ECD measures generally matched what theory would predict. CREDI motor scores were most strongly correlated with gross motor scores from the BSID-III (r = .26, p < .001) and from the INTER-NDA (r = .50, p < .001) but were also positively associated with fine motor scores (BSID-III: r = .22, p < .001; INTER-NDA: r = .18, p < .001). Partial correlations of CREDI motor scores with language, cognitive, and socio-emotional scores from these alternative measures ranged from r = .12 (p < .001) to r = .24 (p < .001) for the BSID-III scores and from r = .16 (p < .001) to r = .34 (p < .001) for the INTER-NDA scores.
CREDI language scores displayed similar convergent and discriminant validity evidence. Language scores exhibited strong, positive partial correlations with the MacArthur–Bates CDI (r = .60, p < .001), with expressive language scores from the BSID-III (r = .26, p < .001), and with expressive language scores from the INTER-NDA (r = .42, p < .001). In contrast, scores from other ECD domains were less positively correlated with CREDI language scores and were found to range from r = .12 (p < .001) with BSID-III’s socio-emotional scores to r = .40 (p < .001) with INTER-NDA’s gross motor skills. CREDI language scores also exhibited positive partial correlations with receptive language measures (BSID-III: r = .14, p < .001; INTER-NDA: r = .20, p < .001). In summary, the CREDI language subscale displayed evidence of both convergent validity and discriminant validity, especially as it relates to expressive language subscales from alternative instruments.
Moreover, positive partial correlations with concurrent cognitive and socio-emotional subscales provided evidence for convergent validity; however, there was less evidence for discriminant validity. As expected, CREDI cognitive and socio-emotional scores exhibited positive partial correlations with equivalent subscales from the BSID-III (cognitive: r = .17, p < .001; socio-emotional: r = .13, p < .001), the INTER-NDA (cognitive: r = .25, p < .001), and the ASQ: SE (socio-emotional: r = .31, p < .001). However, CREDI cognitive scores exhibited even more positive partial correlations with concurrent expressive language scores (BSID-III: r = .25, p < .001; INTER-NDA: r = .36, p < .001). Likewise, for children of the same age, CREDI socio-emotional scores were more positively correlated with language scores from the BSID-III (receptive: r = .15, p < .001; expressive: r = .24, p < .001), while ASQ: SE scores were most positively correlated with CREDI cognitive scores (r = .33, p < .001). In summary, although we found evidence that CREDI cognitive and socio-emotional scores were positively correlated with measures from alternative instruments, we did not find that these scores were most correlated with concurrent cognitive and socio-emotional measures. We provide a possible explanation for this in our Discussion. Online Supplemental Appendix Table 3 reports the partial correlations among all measures.
Reliability
Moderate-to-strong correlations between scores provided evidence of test–retest reliability. ICC(1) values ranged between .70 and .81 across the domains (Motor: ICC(1) = .81, 95% CI [.76, .85]; Cognitive: ICC(1) = .79, 95% CI [.74, .83]; Language: ICC(1) = .70, 95% CI [.63, .76]; Socio-emotional: ICC(1) = .78, 95% CI [.73, .83]). These ICC(1) values indicate moderate levels of stability over time for language scores and good levels of stability for the other domains (Koo & Li, 2016).
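As an illustration of how a test–retest ICC can be computed, the sketch below assumes that ICC(1) denotes the one-way random-effects, single-measurement form, calculated from the between-child and within-child mean squares. The function name `icc1` and the toy score matrix are hypothetical and are not drawn from the study.

```python
import numpy as np

def icc1(scores):
    """One-way random-effects ICC(1) for an (n_children, k_occasions) array."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand_mean = scores.mean()
    child_means = scores.mean(axis=1)
    # Between-child mean square: spread of child means around the grand mean.
    ms_between = k * np.sum((child_means - grand_mean) ** 2) / (n - 1)
    # Within-child mean square: spread of occasions around each child's mean.
    ms_within = np.sum((scores - child_means[:, None]) ** 2) / (n * (k - 1))
    return float((ms_between - ms_within) / (ms_between + (k - 1) * ms_within))

# Hypothetical test and retest scores for five children.
toy_scores = np.array([[9.0, 10.0], [6.0, 7.0], [8.0, 8.0], [7.0, 6.0], [10.0, 9.0]])
toy_icc = icc1(toy_scores)
```

Values near 1 indicate that most score variance lies between children rather than between occasions, which is the sense in which the ICC summarizes stability over time.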
Strong pairwise tetrachoric correlations and acceptable Cronbach’s α values provide evidence of internal-consistency reliability within each of the four domains. Tetrachoric correlations were all positive and averaged around .80 within each domain (Motor: M = .78, SD = .12; Cognitive: M = .80, SD = .12; Language: M = .78, SD = .12; Socio-emotional: M = .81, SD = .14).
We also found that the Cronbach’s α values ranged between .64 and .94 across the four domains and three age-groups (see Table 4). Internal consistency was slightly lower for the socio-emotional subscale relative to the other ECD domains, which is perhaps not surprising given the diversity of socio-emotional skills included (e.g., emotion knowledge, self-regulation, social competence).
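For readers less familiar with Cronbach’s α, a minimal sketch of the standard formula follows: α = k/(k − 1) × (1 − sum of item variances / variance of the total score). The function name `cronbach_alpha` and the small binary response matrices are illustrative assumptions, not CREDI items or data.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, k_items) response matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of the sum score
    return float(k / (k - 1) * (1.0 - item_variances.sum() / total_variance))

# Two hypothetical yes/no items that always agree: alpha is at its maximum.
agreeing = np.array([[0, 0], [1, 1], [0, 0], [1, 1]])
alpha_agreeing = cronbach_alpha(agreeing)

# Two hypothetical items that are uncorrelated in this sample: alpha is zero.
unrelated = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
alpha_unrelated = cronbach_alpha(unrelated)
```

Because α rises with the average inter-item covariance, a subscale that pools heterogeneous skills would be expected to yield a lower α than a more homogeneous one of the same length.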
Fairness
For the motor, language, and socio-emotional domains, we found evidence of statistically significant but substantively small levels of differential test functioning when comparing scores across country income groups (Table 5). uDTF effect sizes in the motor, language, and socio-emotional domains ranged from d = 0.04 (Language, high- vs. middle-high income groups: Est. = 0.33, p = .206) to d = 0.09 (Socio-emotional, high- vs. middle-high income groups: Est. = 0.39, p < .001). These effect sizes are conventionally regarded as small (cf. Cohen, 1988). The small magnitude of the observed differential test functioning suggests that the statistically significant findings are likely artifacts of the large sample size (N = 6,545) and that test-level bias is unlikely to threaten the validity of inferences regarding children’s development when comparing across country income groups.
Table 5. Unsigned Differential Test Functioning (uDTF) by Country Income Group Comparison Across ECD Domains.
Notably, cognitive scores demonstrated the strongest uDTF effect sizes, with the largest uDTF observed between middle-high- and low-income countries (d = 0.18; Est. = 1.06, p < .001). Although such a value arguably classifies the differential test functioning as moderate rather than small, we note that the uDTF statistic is a conservative statistic and likely overestimates the amount of differential functioning that would change conclusions when comparing scores across income groups.
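To illustrate why an unsigned statistic is conservative, the sketch below contrasts signed and unsigned differential test functioning. It assumes the common definition of each as the latent-density-weighted average of the (absolute) gap between two groups’ expected total score curves. It uses a unidimensional two-parameter logistic model with made-up item parameters, so it is a simplification of the multidimensional model analyzed here, and it does not reproduce the d effect sizes reported above. The function names and parameter values are hypothetical.

```python
import numpy as np

def expected_total_score(theta, slopes, intercepts):
    """Expected total score at each theta: sum of 2PL item probabilities."""
    logits = np.outer(theta, slopes) + np.asarray(intercepts, dtype=float)
    return (1.0 / (1.0 + np.exp(-logits))).sum(axis=1)

def dtf(ref_params, focal_params, n_points=201):
    """Return (signed, unsigned) DTF in raw-score units under a N(0, 1) density."""
    theta = np.linspace(-6.0, 6.0, n_points)
    weights = np.exp(-0.5 * theta ** 2)  # unnormalized standard normal density
    weights /= weights.sum()             # normalize so the grid weights sum to 1
    gap = (expected_total_score(theta, *ref_params)
           - expected_total_score(theta, *focal_params))
    return float(np.sum(weights * gap)), float(np.sum(weights * np.abs(gap)))

# Hypothetical two-item test: in the focal group one item is easier and the
# other is harder by the same amount, so the item-level differences offset.
reference = (np.array([1.0, 1.0]), np.array([0.0, 0.0]))
focal = (np.array([1.0, 1.0]), np.array([0.5, -0.5]))
signed, unsigned = dtf(reference, focal)
```

In this constructed case the signed statistic is essentially zero because the group differences cancel across the latent distribution, while the unsigned statistic remains positive. The unsigned value therefore bounds, rather than measures, the bias that would affect comparisons of mean scores.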
Discussion
In this article, we have used a large (N = 14,113), multicountry and multicultural sample to assess the validity evidence for the motor, language, cognitive, and socio-emotional subscales for the long form of the CREDI. We found sufficient evidence to justify a four-factor solution, as well as acceptable internal-consistency reliability and test–retest reliability. We also found evidence of concurrent validity, although the adjusted CREDI cognitive and socio-emotional scores were more strongly correlated with concurrent scores representing nonequivalent domains than concurrent scores representing the same domain. Regarding the cognitive domain, CREDI scores were more strongly correlated with concurrent expressive language scores than they were with concurrent cognition scores. Although it may seem that the factor representing cognition is more accurately a measure for a language construct, we believe this explanation is unlikely. If adjusted CREDI cognitive scores represented a language construct, then we would expect an unusually strong residual correlation between the cognitive and language domains from Model A.3. Although there was a moderate-to-strong residual correlation between the cognitive and language factors (r = .62, p < .001), this association was weaker than the corresponding residual correlation between the cognitive and motor factors (r = .67, p < .001).
An alternative explanation is perhaps that concurrent measures of cognitive and expressive language development in young children (e.g., the BSID-III, INTER-NDA) have not allowed for items to load on multiple domains. As a result, these measures may be confounding cognitive and language development in ways that inflate their expressive language subscales relative to the CREDI cognitive subscale. Conceptually, indicators of expressive language (in the CREDI and in the concurrent measures) often tap into children’s latent cognitive abilities through asking children to describe complex constructs or explain (i.e., make sense of) situations. In fact, of the 44 items that loaded on the factors representing cognition or language, greater than one third (15 items) loaded on both factors simultaneously. Moving forward, additional work is needed to better understand the relations between these complex constructs and to identify more precise ways to operationalize them in distinct ways.
The weak discriminant validity evidence for CREDI’s socio-emotional subscale is unsurprising. Socio-emotional development is an extremely broad construct encompassing a highly diverse set of skills ranging from getting along with others (social competence) to inhibiting impulsive behavior (self-regulation) to identifying and responding to emotions (emotion knowledge; Jones et al., 2016). Accordingly, it is no surprise that the socio-emotional measures from the BSID-III, ASQ: SE, and CREDI all focus on different facets of socio-emotional development, complicating comparisons of these scales. The BSID-III and ASQ: SE tend to emphasize adaptive behaviors (e.g., sleep, behavior during mealtimes), whereas the CREDI does not emphasize these behaviors. Further research extricating and incorporating the distinct constructs that comprise socio-emotional development will be needed.
The evidence suggesting unsubstantial levels of differential test functioning in the motor, language, and socio-emotional domains is encouraging, as it suggests that item-level measurement noninvariance is likely not acting systematically in one direction so as to bias conclusions when comparing mean differences in scores across country income groups. However, researchers using the CREDI scores should proceed cautiously in comparing scores across populations for several reasons. The present study only evaluated evidence of measurement invariance across country income groups. Therefore, we cannot establish whether measurement noninvariance would invalidate conclusions if comparing populations defined by some other set of criteria; future research should examine whether there is evidence of differential test functioning across alternatively defined populations. Meanwhile, we encourage users of the CREDI to acknowledge that conclusions may be dependent on the assumption of measurement invariance. This acknowledgment is especially important given the finding that the size of the uDTF for cognitive scores arguably does not designate it as substantively small. It is difficult to project the implications of this finding in practice because we remain unaware of guidance in the literature for when the substantive size of the uDTF designates it as concerning and jeopardizes the validity of conclusions. Future methodological research should focus on providing such guidance.
Although our findings suggest favorable evidence for construct validity of the subscales we have developed using CREDI’s long form, potential users should consider several limitations. Our culturally and linguistically diverse sample was obtained by convenience and is not necessarily representative of any stringently defined global population. Thus, next steps include defining a target population, then obtaining representative samples from this target.
More evidence is also needed to firmly establish criterion-related validity. Longitudinal data would provide predictive validity evidence using distal outcomes, including school readiness, academic performance, and later mental health and emotional well-being. Given that we found differential test functioning for the motor domain, researchers should be cautious when comparing mean differences in motor scores across countries. Future work should focus on identifying the sources of this differential functioning and investigate item-level measurement noninvariance. Lastly, the CREDI subscales measure aspects of ECD that are shared across cultures, but they are not designed to meaningfully capture important phenomena measured by culturally specific instruments. Moving forward, we strongly recommend that researchers pair the CREDI with direct assessments that can target culturally specific processes while mitigating bias (e.g., social desirability) associated with caregiver report.
We also found that items frequently measured multiple domains simultaneously and that specifying cross-loadings resulted in improved model-data consistency. To assist in scoring in the presence of cross-loadings, we provide users with a web-based scoring application. Users can access all resources at the CREDI website: https://sites.sph.harvard.edu/credi/. Cross-loadings are consistent with developmental theory in that children’s observable behavior often requires the recruitment of skills from multiple domains, especially early in life. Yet, to our knowledge, existing ECD instruments ignore item-level multidimensionality, as items are typically assigned to a single developmental domain during the calculation of subscores. Our findings indicate that such practices may result in a misspecified measurement model. Future research should examine whether conclusions about children’s development are sensitive to such misspecification, as would be hypothesized by previous simulation studies (cf. Curran, 1994).
In conclusion, we have shown that scores from the CREDI long form demonstrate evidence of construct and criterion-related validity and are sufficiently precise for population measurement purposes. The CREDI long form is designed to be globally relevant and applicable across cultures. As a caregiver-report measure, the CREDI long form is efficient to implement and can be used in public policies to monitor child development and to assess interventions, with the goal of improving outcomes of children around the world. Toward this end, recent research suggests that the simple act of interviewing caregivers to measure their children’s development may itself help caregivers become more aware of and attentive to their children’s milestone attainment and behaviors (Altafim et al., 2020).
Supplemental Material
Supplemental Material, sj-pdf-1-jbd-10.1177_01650254211005560 for Validation of motor, cognitive, language, and socio-emotional subscales using the Caregiver Reported Early Development Instruments: An application of multidimensional item factor analysis by Marcus Waldman, Dana Charles McCoy, Jonathan Seiden, Jorge Cuartas, CREDI Field Team and Günther Fink in International Journal of Behavioral Development
Supplemental Material, sj-pdf-2-jbd-10.1177_01650254211005560 for Validation of motor, cognitive, language, and socio-emotional subscales using the Caregiver Reported Early Development Instruments: An application of multidimensional item factor analysis by Marcus Waldman, Dana Charles McCoy, Jonathan Seiden, Jorge Cuartas, CREDI Field Team and Günther Fink in International Journal of Behavioral Development
Supplemental Material, sj-pdf-3-jbd-10.1177_01650254211005560 for Validation of motor, cognitive, language, and socio-emotional subscales using the Caregiver Reported Early Development Instruments: An application of multidimensional item factor analysis by Marcus Waldman, Dana Charles McCoy, Jonathan Seiden, Jorge Cuartas, CREDI Field Team and Günther Fink in International Journal of Behavioral Development
Supplemental Material, sj-pdf-4-jbd-10.1177_01650254211005560 for Validation of motor, cognitive, language, and socio-emotional subscales using the Caregiver Reported Early Development Instruments: An application of multidimensional item factor analysis by Marcus Waldman, Dana Charles McCoy, Jonathan Seiden, Jorge Cuartas, CREDI Field Team and Günther Fink in International Journal of Behavioral Development
Footnotes
Authors’ note
The CREDI Field Team comprises (in alphabetical order) Elisa Altafim, Alexandra Brentani, Andreana Castellanos, Alexandra Chen, Anne Marie Chomat, Wafaie Fawzi, Cristina Gutierrez de Piñeres, Jena Hamadani, Natalia Henao, Pamela Jervis, Codie Kane, Jeffrey Measelle, Patricia Medrano, Lauren Pisani, Muneera Rasheed, Peter C. Rockers, Jonathan Seiden, Christopher R. Sudfeld, Fahmida Tofail, Christine Wong, Dorianne Wright, and Aisha K. Yousafzai.
Acknowledgments
The authors would like to acknowledge the intellectual contributions of the CREDI Advisory Panel members and our data collection partners. The authors thank the thousands of children and caregivers who participated in this research. Finally, the authors are also grateful to Katherine Masyn for her insights and suggestions and for generously sharing her computational resources so that the authors could make progress in a timely manner.
Author contributions
Marcus Waldman conceptualized the validity study, designed the methodological approach, conducted the psychometric analysis, wrote the initial draft, and approved the final manuscript. Dana Charles McCoy assisted in the design of the study, conducted preliminary statistical analysis, developed the data collection instruments, reviewed and revised drafts, and approved the final manuscript. The CREDI Field Team assisted in the development of data collection instruments, conducted all data collection, reviewed and revised drafts, and approved the final manuscript. Günther Fink assisted in the design of the study, conducted preliminary statistical analysis, developed portions of the data collection instruments, led the study sampling, reviewed and revised drafts, and approved the final manuscript. Jonathan Seiden and Jorge Cuartas assisted with the statistical analysis, reviewed and revised drafts, and approved the final manuscript.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The authors would like to acknowledge funding and support provided by the Saving Brains Program from Grand Challenges Canada (Grant Number 0073-03).
Supplemental Material
Supplemental material for this article is available online.