Abstract
As part of efforts throughout China to improve the outcomes of individuals with disabilities, the Shanghai government has launched a campaign to screen at least 95 percent of newborns. To assist in meeting this goal, the Ages & Stages Questionnaires (ASQ), Third Edition, was translated into Chinese and the feasibility of a screening system using the ASQ-Chinese translation (ASQ-C) was investigated in Shanghai. Twenty-nine primary children’s healthcare clinics and several district-wide children’s healthcare institutes participated. Validity and reliability of the ASQ-C were studied as well as utility in pediatric clinics as part of well child visits. Using a sample of more than 8000 caregivers and children from 3 to 66 months of age, screening cutoff scores for each of the 19 ASQ-C intervals were determined, based on two standard deviations below the mean domain score. Inter-rater agreement between ASQ-C completed by 519 parents and a professional assessor was .89. Test-retest reliability for 651 caregivers who completed two ASQ-C at a 1–4 week interval was .91. Cronbach’s coefficient alpha measuring internal consistency ranged from .37 to .79. Convergent validity, measuring agreement between Bayley Scales of Infant Development, 2nd edition, and Denver II outcome categories (i.e. risk, typical) and ASQ-C outcomes (i.e. risk, typical), ranged from .57 to .94. Results from this pilot study suggest the ASQ-C is a promising screening instrument for identification of developmental problems in the Shanghai region. Implementation of a universal screening system in pediatric clinics has the potential to assist in early identification of developmental delays, referral to rehabilitative services, and improvement of developmental outcomes for young children and their families.
While China has recently concentrated on improving public health services, other human services, such as those for individuals with disabilities, have lagged behind economic and technological advances. A new national emphasis has focused on improving the outcomes of people with disabilities across the age span, including preschool children, who have previously received few rehabilitation services other than health monitoring and treatment of childhood diseases.
Medical providers in Shanghai, a financial center of nearly 17 million people, have targeted prevention and early identification of young children with disabilities as part of this national emphasis. Two main efforts include a public campaign to improve prenatal care and a massive effort to improve developmental screening, with a goal of conducting comprehensive screening on at least 95 percent of all newborns. For more than 20 years, pediatricians relied primarily on the Chinese version of the Denver Developmental Screening Test (DDST) (Song and Zhu, 1981). Because the DDST has modest sensitivity and specificity, and questionable results have been interpreted inconsistently, thousands of children have gone undiagnosed.
The Shanghai Children’s Health Care Institute identified parent-completed developmental screening tests as one potential solution for early identification of disabilities, due to cost-effectiveness and limited required professional time (Hix-Small et al., 2007). Furthermore, parents can obtain information about their children’s development by completing screening questionnaires, and development can be monitored over time, thus maximizing developmental outcomes for children including those with disabilities. Using screening test results, pediatricians can make early referrals for those children and families in need of specialized intervention such as physical therapy and special education. Children will then receive rehabilitative services, and their families will receive needed supports, in a timely manner.
Positive results with parent-completed developmental screening have been reported in the US, such as by Glascoe (2007), Knobloch et al. (1979), Sonnander (1987), Squires et al. (1990), and Bricker et al. (1988). The philosophical underpinnings of most social services and education programs and new legislation in China support the importance of parent participation and decision-making in the process of screening and developmental monitoring.
The Ages and Stages Questionnaire: A Parent-completed, Child-Monitoring System-Third Edition (ASQ-3; Squires and Bricker, 2009) consists of 21 questionnaires for children 1 to 66 months and is used for early identification of developmental delays. The ASQ-3 has been empirically studied using more than 15,000 children, with robust sensitivity and specificity results (Squires and Bricker, 2009), and was chosen by Shanghai practitioners as a potentially valid tool to pilot.
The study purpose was to evaluate a newly created Chinese translation of the ASQ-3, the Ages and Stages Questionnaire-Chinese (ASQ-C; Bian et al., in press) in Shanghai, China, and to study its validity and reliability in terms of identification of developmental delays in preschool children. Four research questions were asked. First, what are optimal screening cutoff scores for the ASQ-C and how do these compare to US cutoff scores? Second, is the ASQ-C reliable for use with children in Shanghai, in terms of internal consistency, test-retest and inter-rater reliability? Third, is the ASQ-C a valid instrument for screening children and early identification of delays in children in the Shanghai region? Finally, what is the utility of the ASQ-C according to parents/caregivers and pediatricians in Shanghai?
Patients and methods
Participants and setting
In a six-month period, 8472 children were recruited from 18 districts, health care clinics, and institutes in greater Shanghai. Research assistants who received training in screening procedures and use of the ASQ recruited children aged 3 months to 5 years, using a stratified sampling method based on the most recent Shanghai census reports. Parents/caregivers of possible participants were contacted by phone. A random sample was asked to participate in convergent validity and reliability studies; 483 children participated in the convergent validity study; 1135 in the reliability study; 651 in the test-retest and 519 in the inter-rater reliability study, also completed at Shanghai health care clinics and institutes.
Measures
Demographic form
The demographic form consisted of caregiver educational attainment and family income questions, as well as child ethnicity, birth weight, gender, date of birth, and disability status.
Ages and Stages Questionnaires-Chinese translation (ASQ-C)
The ASQ-C (Bian et al., in press) is a Chinese version of the ASQ-3 that was translated to be culturally-relevant. (A pre-publication version of the ASQ-3 was used as a basis for the translation; minor changes were made to the translation when the final version of the ASQ-3 was available.) On each ASQ-3 interval there are six questions in each of five developmental areas: a) communication, b) gross motor, c) fine motor, d) problem solving, and e) personal-social. Parents score 0 for ‘Not Yet’, if the child has not developed the skill; 5 for ‘Sometimes’, if the skill is emerging; 10 for ‘Yes’, if the child demonstrates the skill frequently or most of the time. Domain scores are used to classify children as ‘development on schedule’ (i.e. scores above the cutoff score in all domains), or ‘at risk’, refer for further assessment (i.e. below the cutoff score in any domain).
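The scoring rule described above can be sketched in a few lines of Python. This is only an illustration, not the published scoring procedure: the function names, the cutoff values, and the example responses are hypothetical, and treating a score exactly equal to the cutoff as 'development on schedule' is an assumption, since the text specifies only 'above' and 'below'.

```python
from typing import Dict, List

# Point values for the three response options, as described in the text.
RESPONSE_POINTS = {"yes": 10, "sometimes": 5, "not yet": 0}

# The five developmental areas covered by each questionnaire interval.
DOMAINS = [
    "communication",
    "gross motor",
    "fine motor",
    "problem solving",
    "personal-social",
]


def domain_score(responses: List[str]) -> int:
    """Sum the points for the six items of one domain."""
    if len(responses) != 6:
        raise ValueError("each domain has six items")
    return sum(RESPONSE_POINTS[r.lower()] for r in responses)


def classify(scores: Dict[str, int], cutoffs: Dict[str, float]) -> str:
    """Return 'at risk' if any domain score falls below its cutoff."""
    # Assumption: a score equal to the cutoff counts as on schedule.
    below = [d for d in DOMAINS if scores[d] < cutoffs[d]]
    return "at risk" if below else "development on schedule"


# Hypothetical cutoffs and responses, for illustration only.
example_cutoffs = {d: 30.0 for d in DOMAINS}
example_scores = {d: 50 for d in DOMAINS}
example_scores["fine motor"] = domain_score(
    ["yes", "sometimes", "not yet", "not yet", "sometimes", "not yet"]
)
print(classify(example_scores, example_cutoffs))
```

In this made-up case a single domain below its cutoff is enough to flag the child for further assessment, which mirrors the 'any domain' rule stated above.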
The ASQ-3 has been empirically studied in the US with more than 15,000 children. Test-retest reliability between two parent-completed questionnaires was 92 percent (N = 175); inter-observer reliability between parents and professionals was 93 percent; and convergent validity varied from 76 percent to 88 percent across intervals. Sensitivity (i.e. ability to detect children with delays) was 86 percent and specificity (i.e. ability to detect children developing typically) was also 86 percent (Squires et al., 2009).
The ASQ-3 was translated into standard Simplified Chinese by developmental specialists and language experts, back translated into English, revised after field-tests, and reviewed again by Chinese child development experts. Forty-eight research assistants received training on early screening and identification of developmental delays, administering the ASQ, and study procedures. Initial work and subsequent translation efforts followed international guidelines on assessment translation (Bertram and Pascal, 2001; Hambleton et al., 2005; Tsai et al., 2006; Squires et al., 2009). Adaptations for language (e.g. pronouns, ‘ed’ verb endings) and culture (e.g. use of chopsticks) were made while retaining the meaning of the US English version. Nineteen ASQ-3 age intervals were translated and tested; the two- and nine-month questionnaires were still under development in the US.
Bayley Scales of Infant Development-second edition (BSID II)
BSID II (Bayley, 1992) is a diagnostic tool for children 1–42 months and served as the gold standard for infant and toddler development. The mental scale assesses cognitive, language, and personal-social development, and has been used successfully in Shanghai (Ding et al., 2007). A preliminary study suggested medium to high validity (.54–.76), good reliability (>95%) and utility when compared to the Gesell Developmental Schedules (Knobloch et al., 1980), a widely used diagnostic tool in China until recently.
Denver Development Screening Test-second edition (Denver II)
The Denver II (Frankenburg and Dodds, 1992) is a 125-item standardized screening measure that was translated and studied in Shanghai with 2826 children, with acceptable results (Chen et al., 2007). The Denver II served as a second measure of convergent validity; the ages of 25%, 50%, 75% and 90% of passing Denver II items were adjusted based on Shanghai norms. While the limitations of the Denver II were recognized, it was selected as a second measure of convergent validity because there were no other standardized developmental instruments translated into Chinese and validated with a Chinese population for 4- and 5-year-old participants.
Utility interview
To compile descriptive information on utility, parents were interviewed at the end of their pediatric visit and asked how they felt about completing the ASQ-C. Pediatricians were interviewed by the developmental assessors about their experiences with the ASQ-C during both research phases (i.e. initial ASQ completion and subsequent convergent validity/reliability).
Procedures
ASQ-C completion
Shanghai parents brought their children to health care clinics during a six-month time period. They gave written consent to participate, including procedures for assuring confidentiality, before completing study measures. Pediatricians assisted in completing the questionnaire and demographic form, offering assistance with reading, interpreting, and providing testing materials such as toys and books. The ASQ-C took approximately 15 minutes to complete and was scored by the pediatrician. Results were shared with parents, based on US norms (Chen et al., 2007). At the end of the visit, parents were given a written set of age-appropriate developmental games and activities from the ASQ-3 User’s Guide (Squires et al., 2009).
Validity
Optimal cutoff scores
Optimal cutoff points were first examined using procedures similar to those outlined by Squires and Bricker (2009) for the ASQ-3. Conditional probabilities were calculated and compared for: 1) sensitivity, 2) specificity, 3) true positive proportion, 4) false positive proportion, 5) over-referral, and 6) under-referral, based on convergent validity agreement. See Figure 1.

Figure 1. Contingency table comparing ASQ and concurrent measures.
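Because the contingency table itself is not reproduced here, the following Python sketch shows one common way such conditional probabilities are computed from a 2 x 2 table. The class and field names are hypothetical, the example counts are made up, and the formulas are assumptions: in particular, over-referral and under-referral are taken as false positives and false negatives divided by the total sample, which may differ from the equations in Figure 1.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ContingencyTable:
    """2 x 2 table of screening result versus criterion measure."""

    true_pos: int   # screen 'risk', criterion 'risk'
    false_pos: int  # screen 'risk', criterion 'typical'
    false_neg: int  # screen 'typical', criterion 'risk'
    true_neg: int   # screen 'typical', criterion 'typical'

    def rates(self) -> Dict[str, Optional[float]]:
        total = self.true_pos + self.false_pos + self.false_neg + self.true_neg

        def ratio(num: int, den: int) -> Optional[float]:
            # An empty denominator means the statistic cannot be calculated.
            return num / den if den else None

        return {
            "sensitivity": ratio(self.true_pos, self.true_pos + self.false_neg),
            "specificity": ratio(self.true_neg, self.true_neg + self.false_pos),
            # Assumed definitions: shares of the whole sample.
            "over_referral": ratio(self.false_pos, total),
            "under_referral": ratio(self.false_neg, total),
        }


# Made-up counts, for illustration only.
table = ContingencyTable(true_pos=8, false_pos=10, false_neg=2, true_neg=80)
for name, value in table.rates().items():
    print(name, value)
```

Returning `None` for an empty denominator corresponds to intervals where no child was classified as at risk on the criterion measure, in which case sensitivity cannot be computed.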
The second strategy entailed determining the percent of children identified as ‘at risk’ using potential cutoff scores of 2, 1.5, and 1 standard deviation below the mean domain scores. The target of 12–16 percent of children identified in one developmental area (i.e. one area below the cutoff score) and 2–7 percent identified in two or more areas was adopted as the desired percentage of children to be identified for further assessment at each interval. Relative Operating Characteristic (ROC) analyses and percent of children identified in one and two areas were also calculated. After analyzing these comparisons, a cutoff point of two standard deviations below domain mean scores appeared to be the most balanced cutoff point for the Shanghai sample in terms of true positive and false positive proportions across the 19 intervals.
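The second strategy can be illustrated with a short Python sketch that derives candidate cutoffs at 2, 1.5, and 1 standard deviations below the mean and reports the share of children who would fall below each. The function names and the score list are hypothetical, and the use of the sample standard deviation is an assumption; the study's own analyses were not run this way as far as the text states.

```python
import statistics
from typing import List


def cutoff(scores: List[float], k: float) -> float:
    """Candidate cutoff: mean domain score minus k standard deviations."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # assumption: sample standard deviation
    return mean - k * sd


def percent_below(scores: List[float], threshold: float) -> float:
    """Percentage of children whose domain score is below the threshold."""
    return 100.0 * sum(s < threshold for s in scores) / len(scores)


# Made-up domain scores for one interval, for illustration only.
domain_scores = [60, 55, 50, 60, 45, 55, 60, 50, 20, 55, 40, 60]

for k in (2.0, 1.5, 1.0):
    c = cutoff(domain_scores, k)
    print(f"k={k}: cutoff={c:.1f}, identified={percent_below(domain_scores, c):.1f}%")
```

Comparing the identified percentages across the three values of k against a target range is the kind of check described above; a lower k always identifies at least as many children as a higher one.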
Convergent validity
The BSID-II and the Denver II were used for determining convergent validity agreement. The Mental Development Index (MDI) of the BSID-II was used generally for children ages 3 to 31 months, and the Denver II for children older than 31 months (Bayley, 1992; Ding et al., 2007). Within six days of the first ASQ-C completion, parents brought their child in to the clinic for the BSID-II or Denver II, administered by one of two trained developmental pediatricians who had established inter-rater reliability of >90 percent for classifications (delayed/typical). Eight pediatric research assistants with previous screening experience assisted with Denver II administration and were trained to above 90 percent reliability for classifications (delayed/typical) (Chen et al., 2007).
Convergent validity analyses depended first upon the determination of ASQ-C cutoff scores using relative operating characteristic (ROC) analysis and descriptive statistics, as described above. Once a 2 standard deviation cutoff by domain on the ASQ-C was determined, classifications on the BSID-II and Denver II (risk/typical) were compared to classifications on the ASQ-C (risk/typical). Sensitivity, specificity, over-referral and under-referral rates were calculated.
Reliability
Internal consistency, the extent to which items on the assessment tool measure the same underlying construct, was calculated by examining the relation among items with domain areas, and the relation of domain scores with the overall score. Pearson product moment correlation coefficients and Cronbach’s coefficient alpha were calculated (Cronbach, 1951) as well as Item Response Theory (IRT) modeling (Embretson and Reise, 2000; Ferrando and Lorenzo-Seva, 2005; Fraley et al., 2000).
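For readers unfamiliar with the statistic, Cronbach's coefficient alpha for a domain with k items is k/(k - 1) multiplied by one minus the ratio of the summed item variances to the variance of the total score. A minimal Python sketch follows; the function name and the response matrix are hypothetical, and this is not the software used in the study.

```python
import statistics
from typing import List


def cronbach_alpha(item_scores: List[List[float]]) -> float:
    """Cronbach's alpha for rows of respondents by columns of items."""
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across respondents.
    item_vars = [
        statistics.variance([row[i] for row in item_scores]) for i in range(k)
    ]
    # Variance of each respondent's total score.
    total_var = statistics.variance([sum(row) for row in item_scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Made-up responses (0, 5, or 10 points) for six items in one domain.
responses = [
    [10, 10, 10, 5, 10, 5],
    [10, 5, 10, 5, 5, 0],
    [5, 5, 0, 0, 5, 0],
    [10, 10, 10, 10, 10, 10],
    [5, 0, 5, 0, 0, 0],
]
print(round(cronbach_alpha(responses), 2))
```

With only three possible responses per item and many children scoring near the ceiling, item variances are small, which is one reason alpha values for screening questionnaires of this kind can be modest.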
To assess inter-rater reliability, test administrators completed one ASQ-C on the child after BSID-II or Denver II completion, within six days of the first ASQ-C administration by parents. To investigate test-retest reliability, parents completed a second ASQ-C before the administration of the BSID-II or Denver II, within 30 days (an average of 15 days). Similar to the first administration, medical providers assisted parents/caregivers in questionnaire completion. Inter-rater reliability was examined by comparing the classifications based on the parent-completed ASQ-C (typical, risk) with the classifications based on the ASQ-C (typical, risk) on the same child completed by an assessor.
Test-retest reliability was measured as percentage agreement between classifications of one ASQ-C completed by parents and a second ASQ-C completed by parents within 30 days.
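Percentage agreement between two sets of classifications, as used for both the inter-rater and test-retest analyses, can be sketched as follows. The function name and the example classifications are hypothetical.

```python
from typing import List


def percent_agreement(first: List[str], second: List[str]) -> float:
    """Percentage of children given the same classification both times."""
    if not first or len(first) != len(second):
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(a == b for a, b in zip(first, second))
    return 100.0 * matches / len(first)


# Made-up classifications for ten children, for illustration only.
time_1 = ["typical"] * 8 + ["risk"] * 2
time_2 = ["typical"] * 7 + ["risk"] * 3
print(percent_agreement(time_1, time_2))
```

Simple percentage agreement does not correct for agreement expected by chance, so it tends to be high when most children fall in the 'typical' category.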
Comparison of ASQ scores between samples of China and US
Mean domain scores of the Chinese sample were compared with mean domain scores of the US sample to investigate differences in performance on the two ASQ versions. A multivariate analysis of variance (MANOVA) in SPSS was conducted comparing domain scores across intervals between the samples (Howell, 2007).
Results
Demographic information
Demographic results are presented in Table 1. Approximately 90 percent of parents who were contacted by telephone and asked to participate did so. No data were gathered on the characteristics of the 10 percent who chose not to participate.
Table 1. Demographic characteristics of subjects
The gender ratio, family yearly disposable income per person, and region of residence were similar to the Shanghai 2007 population census. Eighty percent had completed a high school degree or above and had a yearly income higher than ¥20,000 (US$2927), and more than 95 percent of the sample resided in the urban area.
Descriptive statistics
Means and standard deviations (SD) were calculated for all 19 ASQ-C intervals from 4 to 60 months. Results for selected ASQ-C intervals are summarized in Table 2.
Table 2. Comparison of ASQ scores for the US normative sample and Shanghai sample for selected intervals
CM: communication; GM: gross motor; FM: fine motor; CG: problem solving; PS: personal-social.
a Raw score difference >5 (‘sometimes’ score); b p < .01; c p < .05.
Validity
Convergent validity
Convergent validity was examined by comparing children’s classifications based on the selected optimal cutoff scores with the classifications based on either the BSID-II or Denver II results. Cutoff scores that indicated developmental problems were determined for the BSID II based on its test manual (i.e. 1.5 SD below the mean, or standard score of 75) (Bayley, 1992). For the Denver II, the cutoff score was chosen based on cutoff scores determined for Shanghai (Chen et al., 2007).
ASQ-C classification agreement was compared with children’s classifications on convergent measures (BSID II or Denver II), and sensitivity, specificity, true positive, false positive, over-referral and under-referral were computed, as shown in Table 3. Sensitivity ranged from .50 at 24 months to 1.00 at 18 months; specificity ranged from .80 to .94. BSID II agreement was generally higher than Denver II agreement across age intervals.
Table 3. ASQ classification statistics (a) by age interval
a Equations for calculating are located in Figure 1.
b Based on Bayley Scales of Infant Development-Second Edition (BSID II).
c Based on Denver Developmental Screening Test-Second Edition (Denver II).
d No cases in this cell; could not be calculated.
Reliability
Internal consistency
Pearson product moment correlation coefficients ranged from .52 to .80 (M = .66) for the communication domain; .44 to .80 (M = .70) for gross motor; .64 to .86 (M = .72) for fine motor; .60 to .84 (M = .78) for problem solving; and .63 to .83 (M = .77) for the personal-social domain. All correlations were significant at p < .01. Cronbach’s coefficient alpha for domain scores ranged from .46 to .73 (M = .58) in the communication domain; .37 to .79 (M = .62) in gross motor; .39 to .73 (M = .59) in fine motor, .35 to .70 (M = .54) in problem solving, and .38 to .65 (M = .53) in the personal-social domain.
The internal structure of the ASQ-C was further investigated with IRT modeling (Embretson and Reise, 2000; Ferrando and Lorenzo-Seva, 2005; Fraley et al., 2000). Rasch analyses were conducted to examine item functioning for translated items, and to provide additional statistics pertaining to the integrity of the ASQ-C. Ideally, mean square fit statistics, estimated with Winsteps v3.58, should range between 0.50 and 1.50 to show good item fit; with the exception of very few items, all items fit the Rasch model very well. Average values per domain (n = 5) by level (n = 19) were all approximately 1.00 and served to reinforce interpretation of internal consistency. These fit statistics confirmed that ASQ-C response patterns were consistently related to domain total scores.
Inter-rater reliability
The agreement between ASQ-C questionnaire classifications (i.e. risk/typical; monitor range was included in typical category, as suggested by Squires et al., 2009) on the same child completed by 519 parents and by one of two assessors was 89 percent. Protocols that had two or more uncompleted items in one or more test areas were eliminated from the analysis; these omissions were often due to test administrators having little opportunity to observe children engaged in activities such as eating and toileting.
Test-retest reliability
Six hundred and fifty-one parents completed two questionnaires at a 10–30-day interval. Agreement of test classifications between time 1 and time 2 was 91 percent.
Comparison of Shanghai and US ASQ scores
The mean domain scores on 19 age intervals were compared with those of the US sample to investigate differences. Means and standard deviations for selected intervals (i.e. 12, 24, 36, 48, 60 months) are presented in Table 2. Out of 95 comparisons across all 19 intervals, 78 were found to be significantly different using MANOVA (69 at p < .01 and 9 at p < .05). As a comparison, differences in raw scores greater than 5, corresponding to the smallest incremental difference on the ASQ, were identified in the two samples (Heo et al., 2008; Janson and Squires, 2004). Five raw score points represented a ‘sometimes’ response, the smallest increment in ASQ scoring (yes = 10; sometimes = 5; not yet = 0). Twenty-three differences out of 95 were larger than five raw score points (i.e. a sometimes response). Thus, mean ASQ domain scores varied between US and Shanghai samples.
Utility of parent-completed screening measures
Parents and pediatricians were interviewed after ASQ-C completion and asked one general open-ended question: ‘How did you feel about completing the ASQ-C?’ Detailed notes were taken during these interviews, which were then transcribed and entered into a database. Parent and pediatrician interviews were analyzed separately for recurring themes. Approximately 80 percent of parents/caregivers responded; of these, over 90 percent said they enjoyed completing the ASQ-C and could understand most items when completed with the pediatrician. They felt the ASQ-C was worthwhile to complete, and provided useful information. Parents appreciated getting the accompanying ASQ activities and were eager to practice skills with children who had low scores in developmental areas. Parents also felt that the completion of the ASQ-C was valuable and they were excited to do ASQ-related play activities with their children.
All pediatricians gave utility feedback during interviews. They remarked that caregivers could easily complete the ASQ-C questions with their help but might have trouble answering the ASQ-C on their own, due to limited literacy skills. Pediatricians also felt that the time involved was reasonable (15 minutes) and provided worthwhile information. Pediatricians indicated that they learned new developmental information about children, as their primary source of developmental information had previously been Denver II completion. Based on interview results, pediatricians, who do not receive the same in-depth academic preparation as pediatricians in the US (i.e. an average of six years training completed after high school graduation in China compared to nine to ten years for US pediatricians), felt they learned about typical development when they helped parents/caregivers complete the ASQ-C during well child visits. Pediatricians also felt that the model of completing the ASQ-C together with families worked well and assisted them in diagnosing children with delays and referring them to special services before these delays became more serious. Additionally, pediatricians appeared willing to incorporate the ASQ as a developmental screening tool in well child visits and had adequate time to do so.
Discussion
The present study investigated the translation and testing of the ASQ in Shanghai as part of a larger initiative in China, aimed at improving screening and early identification for young children with disabilities so that they can receive appropriate interventions in a timely manner. Results indicated that the newly translated ASQ-C had high reliability and validity; internal consistency was high, with robust overall correlations indicating strong agreement between the ASQ-C, BSID-II, and Denver II. Differences between Shanghai and US mean scores appeared most frequently in communication and personal-social areas – domains that would logically be affected by different cultural settings and translation into a language with a structure very different from English.
When comparing US and Shanghai mean domain scores, results indicated some significant differences between the developmental rates of children in Shanghai and those of children in the US, similar to the findings of Janson and Squires (2004), who examined differences between US and Norwegian samples. Overall, children in the US had higher scores than children in Shanghai on the 12- to 36-month ASQ intervals – perhaps due to cultural and educational differences. For example, most of the children in the Shanghai sample (i.e. > 95%) were ‘only’ children, perhaps overprotected by parents and grandparents at early ages and not encouraged to explore their external environment. However, after age 3, mean domain scores became more similar between the two samples. In Shanghai, 90 percent of children enter preschool at 27–36 months of age and continue early childhood education until six years. In preschool settings, they are encouraged to acquire developmental and pre-academic skills, perhaps explaining why these children gradually caught up and even surpassed the US sample after the age of 3. As with Janson and Squires in Norway (2004) and Heo et al. in Korea (2008), language and personal-social domains displayed the most differences in scores, again being influenced by cultural practices. For example, in the US, parents are encouraged to let their children eat finger foods and eat independently beginning at six months. In China and Korea, parents most often feed their children until after age 2, at which time they begin using chopsticks. Fine motor, personal-social, and adaptive skills in both Korean and Shanghai samples were lower than those of children in the US from four to 27 months; at the 30-month interval, Shanghai and Korean scores started to increase and were higher than those of the US sample up to the 60-month interval.
When examining the Shanghai data on its own it was found that children born in the summer months appeared to have slightly higher gross motor scores than those born in the winter months. The hypothesis for this difference was that children born in the summer start crawling and achieving pre-walking milestones before the age of one when they go outside and play and walk during the spring and summer. Conversely, children born in the winter have less outdoor experience time. In addition, they wear bulkier and more inhibiting clothes in winter that may slightly delay achieving gross motor milestones.
Caregivers in this study included both parents and grandparents who cared for the children during the day. One interesting result of this study was the willingness of grandparents to participate in completion of the developmental questionnaires, and their eagerness to receive ideas for games and activities to play at home with their grandchildren. These grandparents often lived in the same household as the target child and appeared to be very familiar with the child’s current skill repertoire and able to complete the ASQ-C, with some assistance from providers. They especially enjoyed the ASQ activities and felt that these activities gave them ideas of things to do during the day, were good for guiding play, and would help them establish a warm and positive relationship with their grandchildren. This familiarity and acceptance by grandparents may not be present in other cultures but appeared to be widespread in this Shanghai sample.
Limitations
Three major limitations existed. First, this was not a random sample from throughout China, even though it was representative of families visiting well child clinics in Shanghai. Additional information should be gathered with a larger sample from other regions in China in order to make an overall comparison between the US and Chinese samples.
A second limitation is the use of a screening test for convergent validity analyses. A better ‘gold standard’ assessment is needed to investigate the agreement between ASQ-C screening classifications and the developmental status of the child, especially for children older than 3 years. The Denver II is known to have shortcomings. The Bayley III (Bayley, 2005), an in-depth diagnostic test with a more solid psychometric base, has not been translated or tested in China, and thus was not available for this study. Even though there are no true ‘gold standards’ in child development assessments, those with fewer errors should be used (Aylward, 2009; Salvia and Ysseldyke, 2008). Further research studies should include in-depth developmental assessments of Chinese children using tests with established validity and reliability for the Chinese population. These measures may be available in the future.
A third limitation relates to the limited numbers of children who were found to be at risk or delayed in this sample, with classification statistics being determined by one or two children at several intervals (i.e. 18, 24, 30 months). A larger and more diverse sample from other cities and rural areas in China is needed to confirm the result of this pilot study. In addition, beyond screening, the next steps in the process – developmental evaluation and referral to specialized early intervention services – need to be studied systematically. Early identification is dependent upon these next steps if developmental outcomes are to be improved.
Despite these limitations, research findings indicate the ASQ-C met minimal feasibility and psychometric standards for screening tests (Marks et al., forthcoming). Further study in Shanghai and throughout China is needed to confirm these results.
Early mass screening and identification will improve developmental outcomes of diverse children in international settings. Parent-completed questionnaires such as the ASQ and Parents’ Evaluation of Developmental Status (PEDS) (Glascoe, 2007) have been successfully translated and adapted, and are in use in several countries (Campos et al., 2010; Dionne et al., 2004; Heo et al., 2008; Schonhaut et al., 2009; Yu et al., 2007). Whereas young children throughout the world share common developmental phenomena, cultural and language differences need to be carefully considered so that these measures are accurate and acceptable in diverse settings.
As with the ASQ-C, translation and back translation with cultural adaptations aimed at parenting populations are the first steps to successful test adaptation. Determining standard scores and cutoff points for the target population is a second step, followed by studies of psychometric properties. Improving developmental outcomes requires parents, native speakers, researchers, and clinicians working together to translate and adapt a test that will be effective, culturally relevant, and acceptable to families with young children. Early identification of delays and effective evidence-based interventions will combine to improve the quality of life for young children and their families.
Footnotes
Funding
Support for this project came from a three-year grant for Renewing the Public Health System in Shanghai (2007–2009).
Conflict of interest
Co-author Jane Squires reports a conflict of interest as co-author of the Ages and Stages Questionnaires, and receives royalties from their publication.
