Abstract
Social-emotional (SE) skills in the early developmental years of children influence outcomes in psychological, behavioral, and learning domains. The adult ratings of a child’s SE skills can be influenced by sex stereotypes. These rating differences could lead to differential conclusions about developmental progress or risk. To ensure that differences between boys and girls in SE skills are accurate, validity evidence should support that differences are not based on such issues as stereotypes influencing ratings. Differential item functioning (DIF) analysis allows for the assessment of group differences in item responses while controlling for ability. This study utilized a new multilevel Mantel–Haenszel (MMH) DIF procedure to examine sex differences in item responses for examiner ratings of children’s SE skills on the Brigance Inventory of Early Development III SE scale. Of 50 items examined, 4 were identified as large DIF items. The scores do not appear to be influenced by item-level rating distortions based on sex stereotypes.
The development of social-emotional (SE) skills during childhood includes emotional expression and regulation in socially acceptable manners, and the formation of secure relationships (Yates et al., 2008). Such skills are considered essential to healthy development, school readiness, and long-term life success. Theory suggests that development of SE skills is rooted in early interactions between children and their caregivers (Thompson, 2006). Indeed, SE skills such as peer interactions and shared play are important for development of social competence and emotional and cognitive skills (Bierman et al., 2008). In school, where children spend the majority of their formative years, teachers view SE skills and regulatory behaviors as significant indicators of school adjustment (Rimm-Kaufman, Pianta, & Cox, 2000). Intervention programs and curricula exist to target the development of such skills (e.g., Domitrovich, Cortes, & Greenberg, 2007), as do centers (e.g., Center on the Social Emotional Foundations of Early Learning [CSEFEL]) devoted to studying SE skills in early childhood.
Assessment of SE skills for children at risk of developmental delays is particularly important. Generally, a caregiver, a teacher, or both evaluate SE skills through ratings of statements about the child. These ratings produce a score that reflects the child’s development. A sex stereotype in these ratings may exist in expectations about children’s development or achievement in these areas. In the physical domain, for example, boys and girls have been rated differentially (Chalabaev, Sarrazin, Fontayne, Boiche, & Clement-Guillotin, 2013; Chalabaev, Sarrazin, Trouilloud, & Jussim, 2009; French & Mantzicopoulos, 2007), where girls may be rated higher at jumping rope and boys better at climbing trees. Differential ratings based on expected sex differences may also exist for SE domains focused on play with peers.
Sex differences in developmental domains are assumed to account for a low percentage of rating variance (Ardila, Rosselli, Matute, & Inozemtseva, 2011). However, evidence is needed to show that measures of SE skills are as free as possible of irrelevant differential ratings, which could be based on sex stereotypes (Chalabaev et al., 2013; Chalabaev et al., 2009; Eccles, Wigfield, Harold, & Blumenfeld, 1993; French & Mantzicopoulos, 2007; Wigfield et al., 1997). Thus, validity evidence is needed to ascertain whether observed sex differences in the SE domain are a result of actual differences or an artifact of a lack of measurement invariance (Thissen, Steinberg, & Gerrard, 1986), starting at the item level.
Differential item functioning (DIF) analysis allows for assessment of group differences in item responses, given an equivalent level of the trait being measured. DIF occurs when individuals from different groups have different probabilities of item endorsement after being matched on the measured ability (Zumbo, 1999). Within many contexts of DIF assessment, persons are sampled at a group level (e.g., schools) and not at an individual level. For example, children (Level 1) may be clustered within schools or educational sites (Level 2). DIF analyses that do not account for such multilevel data can result in Type I error inflation (French & Finch, 2010). Multilevel DIF methods can control for such inflation. In this study, data were collected from children at various educational sites across the United States. Thus, there was a need to account for the multilevel structure of the data. The goal of this study was to examine sex-based DIF in examiner ratings of SE skills on the Brigance Inventory of Early Development III (IED III; French, 2013) using the new multilevel extension of the Mantel–Haenszel (MMH) DIF procedure. The results provide additional validity evidence for the SE scores on the IED III.
Method
Participants
Data were obtained from a U.S. standardization sample (N = 684; 49% female; age range = 3 years 0 months to 7 years 11 months) where sample demographics closely match U.S. student demographics (Midwest 30%, Northeast 12%, South 36%, and West 22%) with 31%, 45%, and 17% residing in a city, suburban, or rural setting, respectively. Sample racial and ethnic distributions were as follows: 58% White, 10% African American, 1% American Indian, 6% Asian, 1% Hawaiian, and 16% Multiracial. Children were sampled from child care or educational sites (e.g., Head Start, Pre-K programs, schools). See the IED III technical manual for additional information.
Instrument
The Brigance IED III (French, 2013) is an individually administered measure designed for children aged birth through 7 years 11 months to help determine school readiness and eligibility for special services, to allow comparisons of skills across multiple domains, and to inform instruction. The IED III consists of five domains (e.g., physical development, language development), and its scores have reliability evidence (e.g., internal consistency range = .80–.97, test–retest = .92–.99, inter-rater reliability range = .82–.99). Confirmatory factor analysis supports the hypothesized internal structure of the instrument, and its scores demonstrate strong associations with relevant measures (e.g., Battelle Developmental Inventory, 0.64; Vineland Adaptive Behavior Scales, 0.69; Wechsler Intelligence Scale for Children, 0.68; and Vineland-II, 0.70; see the technical manual for additional information).
This study focused on the SE Development scale, which is composed of the subdomains interpersonal skills and self-regulatory skills. The SE domain consists of 50 items (three response options) rated by a caregiver or an examiner; the children rated were clustered within child care centers or schools. For scoring purposes, according to the technical manual, response options are scored dichotomously. An item score of 1 indicates persistence of the behavior, whereas 0 indicates that the behavior occurs infrequently. Four unidimensional scales were evaluated: Prosocial and Regulation skills (16 items), Motivation and Confidence skills (13 items), Peers and Play skills (9 items), and Adult Relationship skills (12 items).
Analysis
The popular Mantel–Haenszel (MH) test for DIF detection was employed to test the 50 items for DIF. MH is effective for detecting uniform DIF for dichotomous items and can be thought of as a chi-square test comparing item responses between groups at every score level for every item (de Ayala, 2009). We specifically used a modified MH statistic that accounts for multilevel data (French & Finch, 2013). To classify items, dual criteria were used. For an item to be identified with DIF, MMH had to be statistically significant (p < .01), and the associated effect size had to be large. The effect size was based on the delta scale and associated ETS (Educational Testing Service) guidelines (i.e., large effect = |D| > 1.5; de Ayala, 2009).
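To make the dual criteria concrete, the following is a minimal sketch of the standard, single-level MH computation for one dichotomous item: the common odds ratio across matched score levels, its conversion to the ETS delta scale (D = -2.35 ln(alpha)), the continuity-corrected MH chi-square, and a flag that applies the p < .01 and |D| > 1.5 rule described above. It does not implement the multilevel adjustment of French and Finch (2013), whose details are not given here. The function names, the "ref"/"focal" labels, and the use of a total score as the matching variable are illustrative assumptions, not the authors' code.

```python
import math
from collections import defaultdict


def mantel_haenszel_dif(responses, groups, matching_scores):
    """Single-level MH statistics for one dichotomous item (illustrative sketch).

    responses: 0/1 item scores; groups: "ref" or "focal" labels (assumed names);
    matching_scores: matching variable, e.g., the total scale score.
    Returns (alpha, delta, chi2, p).
    """
    # One 2x2 table per score level:
    # [ref endorsed, ref not endorsed, focal endorsed, focal not endorsed]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for y, g, s in zip(responses, groups, matching_scores):
        tables[s][(0 if g == "ref" else 2) + (0 if y == 1 else 1)] += 1

    num = den = sum_a = sum_expected = sum_var = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        n_ref, n_foc, m1, m0 = a + b, c + d, a + c, b + d
        if n < 2 or 0 in (n_ref, n_foc, m1, m0):
            continue  # this score level carries no information about DIF
        num += a * d / n
        den += b * c / n
        sum_a += a
        sum_expected += n_ref * m1 / n
        sum_var += n_ref * n_foc * m1 * m0 / (n * n * (n - 1))

    if den == 0 or sum_var == 0:
        raise ValueError("MH odds ratio is undefined for these data")

    alpha = num / den                      # common odds ratio (>1 favors "ref")
    delta = -2.35 * math.log(alpha)        # ETS delta scale (<0 favors "ref")
    # Continuity-corrected MH chi-square (1 df), floored at zero
    chi2 = max(0.0, abs(sum_a - sum_expected) - 0.5) ** 2 / sum_var
    p = math.erfc(math.sqrt(chi2 / 2.0))   # upper tail of chi-square with 1 df
    return alpha, delta, chi2, p


def meets_dual_criteria(delta, p, alpha_level=0.01, large_effect=1.5):
    """Flag an item only if it is significant AND has a large effect size."""
    return p < alpha_level and abs(delta) > large_effect
```

Under these assumptions, an item with a large |D| but p = .03 would not be flagged by meets_dual_criteria, because both conditions must hold at once.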
Results and Discussion
The intraclass correlations ranged from 7% to 11%, indicating that a relatively small portion of variance in item responses was accounted for by sampled sites. Across the prosocial and regulation subdomain, 19% (3) of the items met the dual DIF criteria. These included “Does ____ exercise control and constraint so others will not be hurt during play?” (D = 1.91); “If supervised by an adult, does ____ take turns without undue objection?” (D = 1.65); and “Does ____ react to a disappointment or failure in an acceptable manner by being a good sport and refraining from shouting or getting upset?” (D = 1.51). As seen in Figure 1, the first item favored girls (Panel A), whereas the other two favored boys (Panels B and C). In the adult relationships subdomain, the item “Does ____ enjoy sharing information with you about himself/herself, such as things he/she likes, names of his/her family members or pets, or what he/she did over the weekend?” was not statistically significant at the .01 criterion (p = .03) but had a large effect size (D = 1.61), which is worth noting. As seen in Panel D, the item favored girls. No items met the DIF criteria or had large effect sizes in the motivation and confidence or peers and play subdomains.

Figure 1. Item characteristic curves for the four large DIF items.
This study was designed to examine sex differences in the SE domain of the IED III. Large DIF was observed in 8% of the items (4 of 50), with 75% of these (3 items) located in the prosocial and regulation subdomain. Of the four items, two favored males and two favored females. With these findings, it appears that raters do exhibit some differences in rating SE skills that follow sex role expectations (boys favored more on play and regulation items, girls favored on an adult relationship item). However, given the small number of DIF items across domains, these findings suggest that there is no major concern of DIF influencing IED III SE scores. Thus, comparisons of scores for these groups are unlikely to be influenced by a lack of measurement invariance at the item level. Practitioners can have confidence that sex comparisons are not distorted at the item level.
This study provides evidence to support the use of the IED III SE scores in comprehensive evaluation systems to assist with the identification of students who may be at risk for social or emotional developmental delays. That said, regardless of how little DIF the IED III exhibits, practitioners should be aware of sex role stereotypes when rating children on SE skills. Directions to calibrate raters appropriately should be explored.
Footnotes
Authors’ Note
The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The multilevel differential item functioning (DIF) example was a component of the research supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305D110014 to Washington State University.
