Abstract
This article synthesized evidence for the validity and reliability of the Strengths and Difficulties Questionnaire in children aged 3–5 years. A systematic review using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses statement guidelines was carried out. Study quality was rated using the Consensus-based Standards for the Selection of Health Measurement Instruments. In total, 41 studies were included (56 manuscripts). Two studies examined content and cultural validity, revealing issues with some questions. Six studies discussed language validations with changes to some wording recommended. There was good evidence for discriminative validity (Area Under the Curve ≥ 0.80), convergent validity (weighted average correlation coefficients ≥ 0.50, except for the Prosocial scale), and the 5-factor structural validity. There was limited support for discriminant validity. Sensitivity was below 70% and specificity above 70% in most studies that examined this. Internal consistency of the total difficulty scale was good (weighted average Cronbach’s alpha for the parents’ and teachers’ versions 0.79 and 0.82, respectively) but weaker for other subscales (weighted average ranges for the parents’ and teachers’ versions 0.49–0.69 and 0.69–0.83, respectively). Inter-rater reliability between parents was moderate (correlation coefficients range 0.42–0.64) and between teachers strong (range 0.59–0.81). Cross-informant consistency was weak to moderate (weighted average correlation coefficients range 0.25–0.45). Test-retest reliability was mostly inadequate. In conclusion, the lack of evidence for cultural validity, criterion validity and test-retest reliability should be addressed given widespread implementation of the tool in routine clinical practice. The moderate level of consistency between different informants indicates that an assessment of a pre-schooler should not rely on a single informant.
Behavioral and emotional problems in pre-schoolers can impact upon their transition into primary school (Eivers, Brendgen, & Borge, 2010; White, Connelly, Thompson, & Wilson, 2013), lead to on-going problems in middle-childhood (Kim-Cohen et al., 2009) and adulthood (Kim-Cohen et al., 2003), and affect educational achievement (Bierman et al., 2013). Behavioral problems in children as young as three have been shown to be predictive of problems later in life, including depression and anti-social personality disorders (Caspi, Moffitt, Newman, & Silva, 1996; Kessler et al., 2005). A key preventative strategy is therefore to enhance identification of children with behavioral problems from a pre-school age, so that support programs can be put in place (Doughty, 2005).
Many countries use the Strengths and Difficulties Questionnaire for parents (SDQ-P) and for teachers (SDQ-T) to screen children (R. Goodman, 1997; R. Goodman, Meltzer, & Bailey, 1998). The SDQ is a 25-item questionnaire for assessing children’s psychosocial attributes (positive and negative behaviors), made up of five subscales: Emotional Symptoms, Conduct Problems, Hyperactivity, Peer Problems, and Prosocial Behaviour (R. Goodman, 1997; R. Goodman et al., 1998). Higher scores on the four subscales that report on difficulties reflect more significant problems, whereas higher scores on the Prosocial subscale denote better social behavior. Scores from the first four subscales are summed to give an overall Difficulty score ranging from 0 to 40. Score distributions in large populations have been used to derive score thresholds for each subscale, as well as the total Difficulties score. These are used to classify children’s difficulties as “normal,” “borderline” and “abnormal.” The SDQ also includes a page asking whether the reported difficulties cause the child distress (1 item) or impairment in their daily life (4 items for parents and youth, and 2 items for teachers) (R. Goodman, 1999). Whilst answers to these questions are useful for clinicians, they are not included in the scoring of the SDQ.
A recent review of 48 studies using the SDQ in 4–12-year-olds concluded the SDQ was a good screening instrument but that further evidence for predictive validity in longitudinal studies was required (Stone, Otten, Engels, Vermulst, & Janssens, 2010). Our scoping indicated this review did not capture all relevant psychometric studies of the SDQ. In addition, whilst their review synthesized data, they did not provide critical appraisal of the methodological quality of the included studies. Hence, we undertook a systematic review to identify and critically appraise evidence for the validity and reliability of the SDQ in pre-school children (aged 3–5 years).
Methods
The review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement guidelines (Liberati et al., 2009) and captured published studies reporting on reliability and/or validity of the SDQ-P/SDQ-T and that had included data on pre-school children (aged 3–5 years). Wildcards and truncation were used as specified by different databases:
(“SDQ*” OR “strength* and difficult* questionnaire*”) AND (psychometric* OR validat* OR validit* OR reliab* OR rasch* OR “factor* analysis*” OR “factor* structur*”)
No date or language restrictions were set. The search included studies published up to 31 March 2014. Hand searches of reference lists of relevant articles were conducted. Studies that only included data on older children (aged 6 years and above) were excluded. All references were downloaded into EndNote X4 (Thomson Reuters, 2010). Systematic review registers, such as Cochrane (http://www.cochrane.org/) or PROSPERO (http://www.crd.york.ac.uk/prospero/) do not include reviews of outcome measures, hence this review protocol is not registered. Box 1 presents established definitions for psychometric properties, which we applied to all papers.
Box 1. Definitions used of measurement properties (R. Goodman, 1997; Mokkink et al., 2010; Streiner & Norman, 2008, pp. 261–263)
Content validity: The degree to which the content of an instrument is an adequate reflection of the construct to be measured
Discriminative validity: Ability of a tool to discriminate between two extreme groups
Convergent validity: The degree to which the scores of the (new) scale relate to scores on other measures to which it should be related
Discriminant/divergent validity: The degree to which the scores of the (new) scale do not relate to scores on another scale that measures dissimilar constructs
Structural validity: The degree to which the scores of an instrument are an adequate reflection of the dimensionality of the construct to be measured
Cross-cultural validity: The degree to which the performance of the items on a translated or culturally adapted instrument are an adequate reflection of the performance of the items of the original version of the instrument
Concurrent validity: The correlation of the instrument with a “gold standard” criterion administered at the same time
Predictive validity: The correlation of the instrument with a “gold standard” criterion that will be available in the future
Internal consistency: The degree of the interrelatedness among the items
Intra-rater reliability: The extent to which scores for people who have not changed are the same for repeated measurement by the same rater
Inter-rater reliability: The extent to which scores for people who have not changed are the same for repeated measurement by different raters (of the same type) on the same occasion
Cross-informant consistency: The extent to which scores for people who have not changed are the same for repeated measurement by different types of raters on the same occasion
Test-retest reliability: The extent to which scores on the same version of questionnaire for people who have not changed are the same for repeated measurement over time
Measurement error: The systematic and random error of a person’s score that is not attributed to true changes in the construct to be measured
Responsiveness: The ability of an instrument to detect change over time in the construct to be measured
Two reviewers (KC, PK) screened all titles and abstracts and if needed the full article, to determine whether the article was eligible; any discrepancies were discussed until a consensus was reached. All studies were critically appraised by two reviewers (KC, PK). Psychometric properties reported in the studies were rated using the COSMIN (COnsensus-based Standards for the selection of health Measurement INstruments) quality score (excellent, good, fair, poor) by obtaining the lowest rating of any item in a box (i.e., “worst score counts”) (Mokkink et al., 2010; Terwee et al., 2012). For this review we considered inter-rater reliability as having been assessed when the study had evaluated consistency of scores between the same type of informants (e.g., two teachers). Studies that examined consistency between different types of informants (e.g., parent and teacher) who would be using different information on the child to derive their scores were considered to have examined cross-informant consistency (R. Goodman, 1997). If the paper did not explicitly state or report what had been done, we rated this as not having been done.
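The “worst score counts” rule described above can be illustrated with a minimal Python sketch. The function name and the example item ratings are hypothetical and are not taken from the COSMIN materials or from any included study; only the four rating levels and the rule itself come from the text.

```python
# The four COSMIN rating levels named in the text, ordered from worst to best.
RATING_ORDER = ["poor", "fair", "good", "excellent"]


def box_quality(item_ratings):
    """Return the overall rating for one COSMIN box.

    Under "worst score counts", the box rating is the lowest rating
    given to any item in that box.
    """
    if not item_ratings:
        raise ValueError("a box needs at least one rated item")
    # list.index raises ValueError for an unknown rating label.
    return min(item_ratings, key=RATING_ORDER.index)


# Hypothetical item ratings for a single box (illustration only).
example_box = ["excellent", "good", "fair", "good"]
print(box_quality(example_box))  # fair
```

A single weakly reported item therefore caps the rating of the whole box, which is one reason most properties in this review were rated as fair.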
Quantitative data extraction followed COSMIN guidance and included data on study procedures, participants, assessments (if relevant), key findings from the appraisal, and reviewers’ comments. As with the critical appraisal, KC independently extracted data with a random sample of studies audited by the second reviewer (PK). Any uncertainties or discrepancies were discussed between the two reviewers and resolved by that discussion. Results from individual studies reported in multiple papers were combined to avoid risk of bias across studies. For the purpose of rating whether reported results for each psychometric property were acceptable, we used the following criteria (Dancey & Reidy, 2007; Hu & Bentler, 1999; Streiner & Norman, 2008; Terwee et al., 2007): Cronbach’s alpha ≥ 0.70 (for group use) and ≥ 0.85 (for use with individuals); Intraclass Correlation Coefficient (ICC) ≥ 0.70; Correlation coefficients for reliability ≥ 0.80; Correlation coefficients for convergent validity ≥ 0.50; Receiver Operating Characteristic (ROC) curves: Area Under the Curve (AUC) ≥ 0.80; Kappa coefficient ≥ 0.70; Sensitivity ≥ 80% and Specificity ≥ 60%; Confirmatory Factor Analysis (CFA): Root Mean Squared Error of Approximation (RMSEA) < 0.06 good fit, ≤ 0.08 acceptable fit; Comparative Fit Index (CFI), Goodness of Fit Index (GFI), (Non-)Normed Fit Index (NFI) and Tucker-Lewis Index (TLI) > 0.90 good fit, 0.80–0.90 acceptable fit; (Standardized) Root Mean Squared Residual (SRMR) < 0.08 good fit. Principal component or exploratory factor analyses were not included in the review since we were particularly interested in evaluating evidence for the established Goodman factor structure.
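To make the use of these cut-offs concrete, the following Python sketch applies three of them (Cronbach’s alpha, RMSEA, and the CFI-type fit indices). It is an illustration under stated assumptions rather than the procedure used in the review: the function names are hypothetical, and the labels “below threshold” and “poor” for values that miss the criteria are our own shorthand.

```python
def rate_alpha(alpha):
    """Classify Cronbach's alpha against the 0.70 (group) and 0.85 (individual) cut-offs."""
    if alpha >= 0.85:
        return "acceptable for individual use"
    if alpha >= 0.70:
        return "acceptable for group use"
    return "below threshold"  # label is an assumption, not from the source


def rate_rmsea(rmsea):
    """RMSEA: < 0.06 good fit, <= 0.08 acceptable fit."""
    if rmsea < 0.06:
        return "good"
    if rmsea <= 0.08:
        return "acceptable"
    return "poor"  # label is an assumption, not from the source


def rate_incremental_fit(index):
    """CFI, GFI, NFI and TLI: > 0.90 good fit, 0.80-0.90 acceptable fit."""
    if index > 0.90:
        return "good"
    if index >= 0.80:
        return "acceptable"
    return "poor"  # label is an assumption, not from the source


# 0.79 and 0.82 are the weighted average alphas reported for the total scale.
print(rate_alpha(0.79))  # acceptable for group use
print(rate_alpha(0.82))  # acceptable for group use
# Hypothetical fit statistics (illustration only).
print(rate_rmsea(0.07))  # acceptable
print(rate_incremental_fit(0.93))  # good
```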
Two qualitative papers were identified that explored content and cultural validity. They were critiqued using the Critical Appraisal Skills Programme tool for qualitative studies (Critical Appraisal Skills Programme [CASP], 2003) and summarized narratively.
Data synthesis
Data from papers were extracted for each psychometric property. Sample size-weighted averages and standard deviations were calculated for internal consistency (Cronbach’s alpha), cross-informant consistency (correlation coefficients) and convergent validity (correlation coefficients). Weighted standard deviations account only for between-study variability, as within-study standard errors were often not reported or not obtainable, notably in the case of Cronbach’s alpha. All types of correlation coefficient were considered equivalent for the purpose of computing summary statistics. Due to the possible heterogeneity of the groups and the different types of sample correlations involved, weighted summaries should be taken as indicative only.
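A sample size-weighted average and a between-study weighted standard deviation can be sketched as follows. The exact formula for the weighted standard deviation is not stated in the text, so the version below (weights proportional to sample size, no small-sample correction) is an assumption, and the input values are hypothetical rather than extracted data.

```python
import math


def weighted_mean_sd(values, sample_sizes):
    """Return the sample size-weighted mean and between-study weighted SD.

    Assumption: the weighted variance is sum(n_i * (x_i - mean)^2) / sum(n_i),
    i.e. no bias correction is applied.
    """
    if len(values) != len(sample_sizes) or not values:
        raise ValueError("values and sample_sizes must be non-empty and equal in length")
    total_n = sum(sample_sizes)
    mean = sum(n * v for v, n in zip(values, sample_sizes)) / total_n
    variance = sum(n * (v - mean) ** 2 for v, n in zip(values, sample_sizes)) / total_n
    return mean, math.sqrt(variance)


# Hypothetical Cronbach's alphas from three studies (illustration only).
alphas = [0.70, 0.80, 0.90]
sizes = [100, 200, 100]
mean, sd = weighted_mean_sd(alphas, sizes)
print(f"weighted mean = {mean:.2f}, weighted SD = {sd:.3f}")
# weighted mean = 0.80, weighted SD = 0.071
```

Because larger studies dominate the average, a single very large sample can pull the summary towards its own estimate, which is a further reason to treat the weighted summaries as indicative only.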
Results
In the review, 56 manuscripts were included reporting on 41 studies from 28 countries (Figure 1) with data from general or clinical populations (34 versus 13 manuscripts respectively; eight including both general and clinical samples) and one paper reviewing the translation of SDQ without any reference to a specific population. As noted above, definitions of psychometric terms are provided in Box 1.

Figure 1. Literature search results.
Content validity
Studies were considered to have addressed content validity if they explicitly examined the degree to which the content of an instrument is an adequate reflection of the construct to be measured (Mokkink et al., 2010). This entailed questions from the COSMIN checklist such as relevance of the questions to the construct, to the study population and for the purpose of the measurement instrument. Two studies were included. Williamson et al. (2010) carried out a study in Aboriginal community-controlled health services. Participants included Aboriginal parents, research assistants, youth workers, medical services staff, and education officers. The study’s limitations included a lack of detail around sampling, data saturation and an absence of participants’ quotes to substantiate interpretation. Participants considered the use of a questionnaire, as opposed to a general conversation or interview, culturally inappropriate and problematic for those with literacy issues. Inter-relationships with peers were considered of less importance than relationships with family and participants felt that many important aspects of children’s behavior and emotions were not covered by the SDQ. They also reported that the SDQ might not be completed honestly for fear of use of the data by other services or answers reflecting badly on their parenting skills. Participants recommended the re-wording of several questions and response scales, for example to enhance cultural clarity, although no questions were considered offensive.
White et al. (2013) included child development officers with direct responsibility for groups of children and head teachers from pre-school establishments in Scotland. This study also did not describe data saturation but otherwise provided a complete description of its methods. Despite the age of the children with whom staff worked in this study (3–5 years), the SDQ used was the 4–16-year-old version rather than the 3–4-year-old version. Participants reported that using the SDQ provided a valuable opportunity to reflect on the emotional and social development of the children and to be able to share this with parents and the primary school. However, teachers reported that in most cases the SDQ did not reveal anything that they were not already aware of. Whilst most of the items were considered straightforward, participants reported that two items caused unease (often lies and cheats; steals from home, school or elsewhere). Staff expressed some concerns about parent reactions to the teacher-completed SDQ and about the SDQ leading to labelling of children. They also reported that the answers would be dependent on who completes the tool, which is why some completed it as a team. In addition, the use of the tool was perceived as a paperwork burden.
Construct validity
Discriminative validity
Discriminative validity (mostly between clinical and community groups) was evaluated in 12 studies (Table 1). Overall, the quality of the studies was fair. Nine studies reported ROC curves and AUC values (Table 1, median sample size 338, range 94–845). These were acceptable in eight studies for SDQ-P and in four studies for SDQ-T (Table 1). The remaining three studies used different analytical approaches (distributional statistics, chi-squared test, kappa statistics and discriminant analysis) and supported the discriminative validity of SDQ (De Giacomo et al., 2012; R. Goodman, 1999; Petermann, Petermann, & Schreyer, 2010).
Table 1. Area Under the Curve (AUC) values from studies examining discriminative validity.
* Assessed with the COSMIN tool.
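For readers less familiar with the AUC, it can be read as the probability that a randomly chosen child from the clinical group scores higher on the SDQ than a randomly chosen child from the community group, with ties counted as one half. The Python sketch below computes it this way; the function name and the scores are hypothetical and are not data from the included studies.

```python
def auc(clinical_scores, community_scores):
    """AUC as the share of clinical/community pairs ranked correctly (ties count 0.5)."""
    if not clinical_scores or not community_scores:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for clinical in clinical_scores:
        for community in community_scores:
            if clinical > community:
                wins += 1.0
            elif clinical == community:
                wins += 0.5
    return wins / (len(clinical_scores) * len(community_scores))


# Hypothetical Total Difficulties scores (illustration only).
clinical = [18, 22, 15, 25]
community = [8, 12, 15, 10]
print(f"AUC = {auc(clinical, community):.2f}")  # AUC = 0.97
```

An AUC of 0.50 corresponds to chance-level separation of the two groups, which is why the review set its threshold for acceptability well above that value, at 0.80.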
Convergent validity
In total, 21 studies were identified as having examined convergent validity with 17 being included in this review. Two (R. Goodman, 1999; Holtmann, Becker, Banaschewski, Rothenberger, & Roessner, 2011) were excluded as they did not evaluate the original 5-factor structure; the sample size for one was too small according to the COSMIN criteria (n = 48) (Gruenert, Ratnam, & Tsantefski, 2006); and one did not meet our pre-defined criteria (Hill & Hughes, 2007) for convergent validity set out in Box 1 (i.e., they looked at convergence of the scores between parent and teacher versions of SDQ, rather than between SDQ and another measure). Sixteen studies used correlation coefficients and were included in data synthesis (Table 2). The one remaining study (Bourdon, Goodman, Rae, Simpson, & Koretz, 2005) used logistic regression and reported a significant association (p < .0001) between children’s service use and parent-rated Total Difficulty score, that is, 45% of children with high Total difficulties score (above the 90th percentile) used at least one type of mental health services. However, we cannot be certain if non-use of services is due to the validity of the SDQ or unavailability or non-uptake of services.
Table 2. Summary findings from studies examining convergent validity.
* Full data extracted are provided in Supplementary file A (Becker et al., 2004; Birkás et al., 2008; Downs, Strand, Heinrichs, & Cerna, 2012; Du et al., 2008; Ezpeleta et al., 2012; R. Goodman, 1997; R. Goodman & Scott, 1999; Hawes & Dadds, 2004; Janssens, 2009; Klasen et al., 2000; Mathai, Anderson, & Bourne, 2002; C. Mieloo et al., 2012; C. L. Mieloo et al., 2014; Petermann et al., 2010; Theunissen, Vogels, De Wolff, & Reijneveld, 2013; Van Leeuwen, Meerschaert, Bosmans, De Medts, & Braet, 2006.)
The methodological quality of the majority of the 16 included studies (median sample size n = 182, range 21–1940) was fair according to the COSMIN rating (Supplementary file A). Apart from three exceptions (R. Goodman, 1997; Hawes & Dadds, 2004; C. L. Mieloo et al., 2014), all studies reported moderate or strong correlation coefficients. However, the coefficients reported on the Prosocial subscale were low in magnitude in 7 out of 8 studies. Weighted average correlation coefficients indicate that convergent validity of SDQ is acceptable for the Total Difficulties (Parent 0.67, Teacher 0.78), Emotional, Conduct, and Hyperactivity subscales (Parent 0.55–0.63, Teacher 0.54–0.80) but unacceptable for the Peer Problems (0.49 for both SDQ versions) and Prosocial (Parent 0.18, Teacher 0.35) scales (Table 2).
Discriminant validity
Eight studies reported having examined discriminant validity. However, six evaluated the ability of the SDQ to differentiate between extreme groups of respondents, that is, discriminative validity rather than discriminant validity (see definitions in Box 1), and they are reported under the discriminative validity section. The two included studies used a multitrait-multimethod (MTMM) analysis (A. Goodman, Lamping, & Ploubidis, 2010; Hill & Hughes, 2007) comparing scores between dissimilar subscales of the SDQ. Both studies were of fair quality according to COSMIN criteria and reported limited support for the discriminant validity of the SDQ scales.
Structural validity
A total of 27 studies evaluated the structural validity of the SDQ as specified by Goodman, of which 17 used a CFA and were included (Table 3 and Supplementary file B, median sample size 1068, range 129–56864). One study carried out a CFA on each of the SDQ scales and is therefore not included in Table 3 (Thabet, Stretch, & Vostanis, 2000). Most studies were of fair quality, the most common weakness being a failure to describe how missing data were handled. One study dichotomized the SDQ scores (i.e., grouping categories 1 “somewhat true” and 2 “certainly true”). All of the 13 studies with parents and the study with custodial grandparents demonstrated acceptable to good evidence for the 5-factor structure. Of the nine studies that examined structural validity of the teachers’ version of the SDQ, eight reported it as acceptable to good.
Table 3. Summary findings from studies examining structural validity.
* Assessed with the COSMIN tool.
** All fit statistics reported in the studies are provided in Supplementary file B. CFA = Confirmatory Factor Analysis; SEM = Structural Equation Modelling.
Cultural validity
Six studies discussed the translation process utilized for the SDQ into Arabic, Maltese, Bangla, Urdu, and Chinese (quality ratings fair to excellent). Four of these mentioned that forward and backward translations were used, but insufficient detail of the process or of changes made to translations during the process was provided (Alyahri & Goodman, 2006; Cefai, Camilleri, Cooper, & Said, 2011; R. Goodman, Renfrew, & Mullick, 2000; Thabet et al., 2000). Parents in the studies in Pakistan and the Gaza Strip reported difficulties with literacy and required support to complete the Urdu and Arabic SDQ (Samad, Hollis, Prince, & Goodman, 2005; Thabet et al., 2000). Samad et al. (2005) provided specific examples that outlined the need for cross-cultural adaptations to some questions. For example, they translated words such as “steals” and “lies” more “subtly” but did not specify how the wording was changed.
Toh, Chow, Ting, and Sewell (2008) raised concerns about the Chinese version of the SDQ and undertook an independent back-translation as recommended in the literature (Beaton, Bombardier, Guillemin, & Ferraz, 2000). They identified problems with the Chinese version in use (Du, Kou, & Coghill, 2008), including: flow and grammar; wrongly written Chinese characters; some deviation in translation from the original meaning; problems with translation of the response category “true;” additions of the verbs “will” and “can” that may change the meaning of the statement; and use of the same questionnaire for all age groups. However, as yet these changes have not been included in a revised version.
One study examined measurement invariance with respect to ethnicity between British Indian and British white children using data from the 1999 and 2004 British Child and Adolescent Mental Health Surveys (A. Goodman, Patel, & Leon, 2010). All parents completed the English version of the SDQ and the multi-group confirmatory factor analyses provided evidence of acceptable fit to the parent and teacher SDQ across ethnicity. One qualitative study addressed cultural validity and content validity of the SDQ for Aboriginals in Australia (Williamson et al., 2010) as previously discussed.
Criterion (concurrent and predictive) validity
Six studies (median sample size of 500, range 86–7984) examined criterion validity by comparing scores from the SDQ total difficulties and/or subscales to a “gold standard” clinical diagnostic interview with clinical samples (Bekker, Bruck, & Sciberras, 2013; R. Goodman et al., 2000), community samples (Ezpeleta, Granero, la Osa, Penelo, & Domènech, 2012; R. Goodman, 2001; Mathai, Anderson, & Bourne, 2004) and children in care (R. Goodman, Ford, Corbin, & Meltzer, 2004). Methodological quality was fair in most cases.
Four studies reported sensitivity that was considered inadequate by our criteria (< 70%) (Bekker et al., 2013; Ezpeleta et al., 2012; R. Goodman, 2001; Mathai et al., 2004). One study reported sensitivity of 63% for “private household children” as rated by their parents, but 85% for “looked-after children” (i.e., children at foster homes or residential homes) as rated by their carers (R. Goodman et al., 2004). R. Goodman et al. (2000) reported high sensitivity (> 80%) of three SDQ subscale scores (Conduct, Emotional, Hyperactivity) in identifying children who were clinically diagnosed with a disorder. This study was carried out with children referred to a multidisciplinary child mental health clinic rather than a general population.
Most studies reported adequate specificity (> 70%). One study showed inadequate specificity (47%) of the Conduct subscale for its London sample, although a large number of the false positives were children with possible conduct problems (R. Goodman et al., 2000). Another study (Bekker et al., 2013) showed specificity was below 50% for both the Emotional and Conduct subscales, resulting in relatively large numbers of children incorrectly being identified as having problems in these areas.
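As a reminder of how these two quantities are derived from a comparison of SDQ classifications with a “gold standard” diagnosis, the Python sketch below computes them from the four cells of a 2 × 2 table. The function name and the counts are hypothetical and do not correspond to any included study.

```python
def sensitivity_specificity(true_pos, false_neg, true_neg, false_pos):
    """Return (sensitivity, specificity) from the cells of a 2 x 2 table.

    Sensitivity: share of diagnosed children that the SDQ flags.
    Specificity: share of non-diagnosed children that the SDQ does not flag.
    """
    if true_pos + false_neg == 0 or true_neg + false_pos == 0:
        raise ValueError("need at least one diagnosed and one non-diagnosed child")
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Hypothetical screening results (illustration only):
# 50 children with a diagnosis, 200 without.
sens, spec = sensitivity_specificity(true_pos=30, false_neg=20, true_neg=170, false_pos=30)
print(f"sensitivity = {sens:.0%}, specificity = {spec:.0%}")
# sensitivity = 60%, specificity = 85%
```

In this hypothetical example the screen misses 20 of the 50 diagnosed children, the pattern of low sensitivity with higher specificity that most studies in this section reported.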
Two studies used coefficients of determination (R²) to quantify the proportion of variance explained. Goodman and Goodman (2011) reported R² = 0.95 and R² = 0.91 for the parent and teacher versions, respectively. A. Goodman et al. (2012) analysed population SDQ data from seven countries and compared SDQ “caseness” (prevalence based on the mean total difficulty scores, adjusted for the population’s age and sex composition) against the measured prevalence of disorder using the Development and Well-being Assessment (DAWBA) tool. They reported average R² = 0.29 and R² = 0.56 for the parent and teacher versions, respectively. The authors concluded that SDQ scores cannot be compared cross-nationally without population-specific norms.
Two studies used odds ratios to estimate the likelihood of receiving a diagnosis at baseline (concurrent validity) and 3 years later (predictive validity) (A. Goodman & Goodman, 2009; A. Goodman, Lamping, et al., 2010). Their findings generally supported criterion validity.
Internal consistency
Among the manuscripts, 34 examined the internal consistency of the SDQ (median sample size of 739, range 48–22108). Five were not included in data synthesis: two did not look at the original 5-factor structure (Holtmann et al., 2011; Stringaris & Goodman, 2013); one reported scores combined across subscales (Niclasen, Teasdale, et al., 2012), and one across samples (Thabet et al., 2000); with sample size for one being too small (n = 48) (Gruenert et al., 2006).
We extracted 282 Cronbach’s alphas from 26 studies (Table 4 and Supplementary file C). Of these, 150 (53.1%) fell above the acceptable threshold of 0.70, but only 16 (5.6%) were ≥ 0.85. The weighted average Cronbach’s alpha for the SDQ-P (the parent version of the scale) total score was 0.79 and for the subscales it ranged between 0.49 and 0.69. The weighted average Cronbach’s alpha for the SDQ-T (the teacher version of the scale) total score was 0.82, and for the subscales it ranged between 0.69 and 0.83. All subscales of the SDQ-P fell below the 0.70 threshold, which could be seen as an indication of inadequate internal consistency of those subscales. In general, the SDQ-T appears to have a higher internal consistency than the SDQ-P, and no single subscale presented Cronbach’s alpha values acceptable for individual use, that is, ≥ 0.85.
Table 4. Summary findings from studies examining internal consistency (Cronbach’s alpha).
Three studies used other statistics, specifically omega coefficients (Ezpeleta et al., 2012), model-based reliabilities (Gómez-Beneyto et al., 2013), and composite reliability (CR) and average variance extracted (AVE) (Niclasen, Skovgaard, Andersen, Sømhovd, & Obel, 2012). Their findings supported the internal consistency of SDQ-P and SDQ-T.
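For reference, Cronbach’s alpha for a scale of k items is k/(k − 1) multiplied by one minus the ratio of the summed item variances to the variance of the total score. The Python sketch below implements this standard formula; the function name and the response data are hypothetical, with items scored 0–2 as on the SDQ.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    n_items = len(responses[0])
    if n_items < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_variances = [variance([row[i] for row in responses]) for i in range(n_items)]
    # Sample variance of the summed scale score.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses of six parents to a five-item subscale (illustration only).
subscale = [
    [0, 1, 0, 1, 0],
    [1, 1, 2, 1, 1],
    [2, 2, 1, 2, 2],
    [0, 0, 0, 1, 0],
    [1, 2, 1, 1, 2],
    [2, 1, 2, 2, 1],
]
print(f"alpha = {cronbach_alpha(subscale):.2f}")
```

Alpha rises with the number of items as well as with their intercorrelation, so a lower alpha for the five-item subscales than for the 20-item total difficulties scale is partly to be expected.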
Reliability
Inter-rater reliability
One study examined inter-rater reliability between two parents and between two teachers, using Spearman rank correlation coefficients (Borg, Pälvi, Raili, Matti, & Tuula, 2012). These ranged between 0.42 and 0.64 when parents’ scores were compared, and between 0.59 and 0.81 when teachers’ scores were compared.
Cross-informant consistency
Thirteen studies assessed consistency of scores between different types of informants. Of these, two reported only the range of correlation coefficients ([Birkás, Lakatos, Tóth, & Gervai, 2008]: 0.31–0.65; [Cefai et al., 2011]: 0.14–0.37). Eleven studies were included in data synthesis (median sample size 512, range 99–7313) (Table 5). The quality of the studies was mostly fair. Correlation coefficients were weak to moderate, with weighted averages ranging between 0.25 and 0.45.
Table 5. Summary findings from studies examining cross-informant consistency.
* Assessed with the COSMIN tool.
Test-retest reliability
Six studies assessed test-retest reliability of the SDQ (sample size median 592, range 34–2091, Table 6). In most cases, the methodological quality of the studies was fair. Most of the reported correlation coefficients indicate inadequate test-retest reliability. Only one study reported adequate test-retest reliability of the SDQ-P Total Difficulties score (ICC = .85) (R. Goodman, 1999). As for SDQ-T, stability of Total Difficulties and Hyperactivity-Inattention scores was adequate in the one included study (.80 and .82, respectively) (R. Goodman, 2001).
Table 6. Summary findings from studies examining test-retest reliability.
Responsiveness
One study examined the responsiveness of the SDQ following care from a community child and adolescent mental health service (Mathai, Anderson, & Bourne, 2003). Quality of the study was fair, with the main limitations being a lack of clarity on what intervention occurred during the 6-month period, the large loss to follow-up (66%), and a lack of a priori hypotheses. Improvements were observed on the SDQ total difficulty scale (effect size [ES] 0.45), emotional scale (ES 0.47) and conduct scale (ES 0.35), which concurred with changes on the clinician-administered Health of the Nation Outcome Scales for Children and Adolescents (HoNOSCA). However, it is not known if the findings would be similar for those who did not return to the service.
Discussion
This systematic review is the first to use a standardized critical appraisal tool for the evaluation of psychometric properties of the SDQ. In addition to supporting evidence for a number of these psychometric properties, we found a lack of evidence for the test-retest reliability, cultural validity and criterion validity of the tool, as well as poor convergent validity of the Peer Problems and Prosocial scales. The lack of evidence is of concern given the tool is used across the world to identify which children need support to manage their behavioral and emotional problems. Given that such problems can impact upon their transition into primary school and affect educational achievement (Bierman et al., 2013; Eivers et al., 2010; White et al., 2013), reliance on SDQ scores is inadequate to identify such children.
Evidence for the discriminative validity of the SDQ was good; in other words, the SDQ is able to separate out populations hypothesized to have markedly different scores. The 5-factor structural validity of the SDQ was also good, providing confidence in the Goodman structure (R. Goodman, 2001) that tends to be employed in clinical practice, including in New Zealand (Ministry of Health, 2008). Most studies demonstrated good evidence for the scale’s convergent validity when compared with other scales measuring similar constructs, except for the Peer Problems and Prosocial scales, which require further investigation.
Given the widespread use of the SDQ worldwide, we were surprised to find only one study that examined the content and cultural validity from the parents’ perspectives (Williamson et al., 2010). This study identified concerns about the SDQ (e.g., fear of use of the data) and made recommendations for the re-wording of several questions and the response scales that would improve cultural clarity. However, there have as yet been no changes in the Australian version of the tool. Similarly, only one study was found that explicitly examined content validity as perceived by teachers (White et al., 2013). Further work on cultural validity therefore seems warranted.
At the time of writing there are 79 different language versions available from the Youth in Mind website (http://www.sdqinfo.org/). Translations and adaptations are not permitted without the involvement of that study team, which provides confidence in the robustness of translations. Their procedures include forward and backward translations by teams of people and the final version is signed off by Youth in Mind (personal communication, www.youthinmind.info). However, these data are not publicly available, and hence we identified few studies explicitly describing the language translation or cross-cultural adaptation procedures. It was of concern that significant problems have been identified with the Chinese version (Toh et al., 2008), although the Youth in Mind group reported this version is currently being revised (personal communication, www.youthinmind.info).
Evidence for criterion validity of the SDQ was stronger in clinical than in general population samples. Whilst this is perhaps unsurprising, it is of concern given that the tool is specifically used to screen children in order to identify those who would benefit from further assessment or services. In addition, one study that pooled data from seven countries showed that the prevalence estimators derived from the SDQ scores varied widely across the countries (A. Goodman et al., 2012). Consequently, SDQ scores cannot be compared cross-nationally without population-specific norms.
Findings regarding the internal consistency of the SDQ Total Difficulty scale indicate that it is acceptable for comparing groups, but not adequate for clinical decision-making. In addition, adequate internal consistency of the five subscales was not borne out by the review.
Goodman argued in his 1997 paper that parents and teachers make SDQ ratings based on different sources of information and that comparing their scores is therefore an investigation of cross-informant consistency as opposed to inter-rater reliability (R. Goodman, 1997). This view was echoed by Stone and colleagues (Stone et al., 2010), drawing on research by others (Achenbach, McConaughy, & Howell, 1987), who emphasize that informants such as parents or teachers see the children in different contexts and interact with the children in different ways. For this reason, Goodman recommends the use of correlation coefficients rather than intra-class correlation coefficients, and this has been followed by researchers examining the SDQ (R. Goodman, 1997). Our weighted averages of coefficients between different informants ranged from 0.24 to 0.45 and were similar to those found by Stone et al. (2010) (range 0.26–0.47). These authors claim that a meta-analytical correlation coefficient of 0.27 between parents and teachers can be used as a benchmark of agreement or data quality and that the weighted average in their review therefore demonstrates good inter-rater agreement. We contend that the coefficient values found in Stone et al. (2010) and our review are actually rather low and indicate that at best 22% of variance can be explained by scores from different informants.
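The 22% figure follows from squaring the highest coefficient quoted above (0.47), since the square of a correlation coefficient gives the proportion of variance shared by two sets of scores. The short Python sketch below is illustrative only: the function name `shared_variance` and the dictionary labels are our own, and the input values are simply the coefficients reported in this paragraph, not new data.

```python
def shared_variance(r: float) -> float:
    """Return the proportion of variance shared by two sets of scores (r squared)."""
    return r ** 2


# Cross-informant correlation coefficients quoted in the paragraph above.
coefficients = {
    "this review, lowest weighted average": 0.24,
    "this review, highest weighted average": 0.45,
    "Stone et al. (2010), lowest": 0.26,
    "Stone et al. (2010), highest": 0.47,
    "Stone et al. (2010), parent-teacher benchmark": 0.27,
}

for label, r in coefficients.items():
    # Report each coefficient alongside the percentage of shared variance.
    print(f"{label}: r = {r:.2f}, shared variance = {shared_variance(r) * 100:.1f}%")
```

Even the highest coefficient leaves more than three quarters of the variance in one informant's scores unaccounted for by the other's, which is the basis for describing the agreement as rather low.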
Our review has a number of strengths, including the use of a systematic and replicable search strategy, use of the PRISMA guidelines, two reviewers and a validated critical appraisal tool. In addition, we explicitly identified a priori the criteria by which we would judge if statistical findings were acceptable for each psychometric property. It is possible that we identified a larger number of study limitations than others have (Stone et al., 2010) as the COSMIN tool is relatively new (Mokkink et al., 2010; Terwee et al., 2012). In addition, many included studies presented data on children with a wider age range than the one we were particularly interested in. Comparing the findings from the smaller number of studies that included only younger children with those from the remaining studies suggests similar findings. However, further validation work in the younger age group seems warranted based on this review.
Conclusion
The systematic review has shown that the evidence for the discriminative and structural validity of the SDQ is strong, as is the evidence for convergent validity (apart from the Peer problems and Prosocial scales) and the internal consistency of the SDQ Total Difficulty scale. The lack of evidence for other psychometric properties, in particular test-retest reliability, cultural validity and criterion validity, should be addressed given the wide-spread implementation of the tool in routine clinical practice. Furthermore, the moderate level of consistency between different informants indicates that an assessment of a pre-schooler should not rely on a single informant. Further work is required to examine these psychometric properties in parallel with qualitative work that can explore acceptability and validity of the SDQ in more depth. Whilst such evidence is gathered, it remains critical not to rely solely on SDQ scores but also to consider parents’ and teachers’ reports before determining the needs of pre-school children.