Abstract
The self-report version of the Strengths and Difficulties Questionnaire is widely used in clinical and research settings. However, the measure’s suitability for younger adolescents has recently been called into question by readability analysis. To provide further insight into the age-appropriateness of the self-report Strengths and Difficulties Questionnaire, readability was assessed at the item level alongside consideration of item quality criteria, its factor structure was analyzed, and measurement invariance between adolescents in Year 7 (age 11-12 years) versus Year 9 (age 13-15 years) was tested. The measure showed a wide range of reading ages, and the theorized factor structure was unacceptable. Measurement invariance was therefore considered for a flexible exploratory structural equation model, and no evidence of differences between age groups was found. Suggestions are made for the measure’s revision based on these findings.
The self-report version of the Strengths and Difficulties Questionnaire (SDQ) is a popular measure of mental health in 11- to 16-year-olds (R. Goodman et al., 1998; Johnston & Gowers, 2005) that has been extensively used in epidemiological research (e.g., Hafekost et al., 2016; NHS Digital, 2018; Polanczyk et al., 2015). Self-report measures are generally attractive in research, particularly in longitudinal and large-scale studies. This is partly because young people can be easier to recruit than parents, and data burden is reduced compared with teacher report methods (Humphrey & Wigelsworth, 2016). Moreover, such measures allow direct assessment of the young person’s perspective in accordance with policy recommendations (Deighton et al., 2014). Despite these advantages, scale- and subscale-level analyses suggest the SDQ may be unsuitable for those with reading ages below 13 to 14 years (Patalay et al., 2018). Not only is this higher than the youngest intended age of 11 years, it also exceeds general scale development recommendations, which suggest that measures should never exceed the reading level of a 12-year-old (Terwee et al., 2007). There is also evidence to suggest that the reading age of individuals can be up to five grades lower than their reported education grade, especially for those experiencing mental health difficulties (Jackson et al., 1991; Jensen et al., 2006). There is, therefore, a need for better understanding of the age appropriateness of this measure.
Though the self-report SDQ has been consistently employed in large national studies (e.g., Hafekost et al., 2016; NHS Digital, 2018), and has been recommended for research and clinical settings (Vostanis, 2006; Wolpert et al., 2015), robust evidence of its factor structure is also scant. Two review articles have broadly advocated for the use of the self-report SDQ as a well-validated measure (Vostanis, 2006; Wolpert et al., 2015). However, the psychometric evidence underpinning their recommendations often related to translated versions, even though psychometric characteristics are likely version dependent (Flake et al., 2017). Indeed, the self-report SDQ has shown only partial measurement invariance across different language versions (Ortuño-Sierra, Fonseca-Pedrero, et al., 2015). Furthermore, the studies of the English version on which recommendations were based failed to report model fit (R. Goodman, 2001; R. Goodman et al., 1998). In addition, though exploratory factor analysis was used in the original study, a five-factor solution was retained despite substantial cross-loadings for seven items (R. Goodman, 2001), suggesting potential problems with the structure. Where confirmatory factor analysis (CFA) techniques were employed to analyze the self-report English version, the proposed structure was also shown to be problematic, with inconsistent fit when judged against recommended guidelines. These guidelines suggest that values of around .95 for the comparative fit index (CFI) and around .06 for the root mean square error of approximation (RMSEA) can be judged acceptable (Hu & Bentler, 1999). A. Goodman et al. (2010) found CFI = .837 and RMSEA = .063 via weighted least squares mean and variance adjusted (WLSMV) estimation, while Percy et al. (2008) reported CFI = .817 and RMSEA = .047 via robust maximum likelihood (MLR) estimation.
The consistently low CFI may be due to problems with the pattern of covariances specified in the model, consistent with the known substantial cross-loadings (R. Goodman, 2001; Percy et al., 2008), though discrepancies between RMSEA and CFI can occur for many different reasons (see Lai & Green, 2016, for more details). The fact that both studies include adolescents as young as 11 to 12 years may also have contributed to model misfit.
This lack of clear support for the self-report SDQ’s factor structure suggests a need for more detailed examination of its psychometric qualities, as has been explicitly called for in a recent systematic review (Bentley et al., 2019). This is particularly necessary given the centrality of the measure in adolescent mental health research (e.g., Deighton et al., 2019; Dray et al., 2016; Hafekost et al., 2016; NHS Digital, 2018; Polanczyk et al., 2015; Wigelsworth et al., 2012). Although evidence based on the SDQ suggests an increase in mental health difficulties in mid-adolescence, around ages 14 to 15 years (Deighton et al., 2019; Dray et al., 2016), it is not clear whether differences between early adolescents (around 11 to 12 years) and the 14- to 15-year age group are due to differences in measurement properties or to the SDQ’s high reading age (Patalay et al., 2018). Indeed, measurement invariance between different age groups is yet to be examined, a gap we sought to address in the current study. The age groups in the current study were chosen for pragmatic reasons, since we conducted secondary data analysis. Nevertheless, the use of this data set enabled examination of the key transition to mid-adolescence. It also allowed comparison between the SDQ’s youngest intended age (11 years old), as per its original validation (R. Goodman et al., 1998), and the recommended minimum age (13 years old) based on recent readability findings (Patalay et al., 2018).
While the analysis of readability by Patalay et al. (2018) provided valuable insight into the age appropriateness of the measure, readability was only considered for whole subscales, meaning three issues remain unexplored. First, while considering items together as subscales or whole measures allows the use of texts of more appropriate length for readability formulas, information is lost about individual items (Oakland & Lane, 2004). Second, the presentation of items in accordance with psychometric best practice, including factor structure, should also be considered. For instance, items should have appropriate response formats and consist of single statements to avoid confusion (Saris, 2014; Terwee et al., 2007). Finally, while age invariance of the proxy version has been considered (He et al., 2013), measurement invariance of the self-report English instrument has not been tested, to our knowledge. Based on these identified gaps, we aimed to explore the following for the self-report SDQ: (1) item-level readability, (2) item quality, (3) the factor structure, and (4) age measurement invariance between English secondary school students in Year 7 (age 11-12 years) and Year 9 (age 13-15 years). We hypothesized that the reading age would be higher than that of the intended population, consistent with Patalay et al. (2018), and that item quality would vary according to psychometric criteria (this has not been evaluated previously and was therefore exploratory). Given that findings on the structure of the SDQ have been conflicting, we were unable to hypothesize which structure would be the most appropriate; thus, the third aim of our study was also necessarily exploratory. Finally, we hypothesized measurement noninvariance between the two age groups, as we expected the Year 9 group to have a better understanding of the items, based on previous readability evidence (Patalay et al., 2018).
Method
Secondary analysis was conducted of data from a large project aimed at promoting resilience in six areas of England, chosen on the basis of need. The original data set consisted of 30,842 students, though 552 cases (1.8%) were excluded from the current analyses because they had missing data for all SDQ items. Students were in Years 7 (51.4%, aged 11-12 years, M = 12.21, SD = 0.29) and 9 (48.6%, aged 13-15 years, M = 14.20, SD = 0.29) from 114 schools (52.4% female). The ethnicity of our sample was very similar to national figures (Department for Education, 2017b) with 74.1% White, 9.5% Asian, 5.7% Black, 3.9% Mixed, 0.2% Chinese, 1.5% any other ethnic background, and 1.2% unclassified. The proportion of pupils with a special educational need was 11.6%, compared with the national figure of 14.4% (Department for Education, 2017c). Rates of low income were above average in this community sample, given the focus of the project: The percentage of students who had ever been eligible for free school meals was 36.4%, which was above the national average of 29.1% for those eligible in the previous 6 years (Department for Education, 2017a).
Total difficulties scores for the SDQ were also above rates expected in community samples, based on the measure’s 20-year-old bandings (R. Goodman et al., 1998): 62.2% scored in the “normal” range compared with 80% in the validation sample, 18.4% scored in the “borderline” range compared with 10% in the validation sample, and 19.6% scored in the “abnormal” range compared with 10% in the validation sample. However, self-reported psychological well-being in the current sample (M = 23.88, SD = 5.33) was similar to the average found in a nationally representative sample of 16- to 24-year-olds (M = 23.57, SD = 3.61; Ng Fat et al., 2017). Reading ability was also below average based on end of primary school test results, with 63% of the Year 7 cohort reaching the expected grade compared with the national result of 66% (Department for Education, 2016), and 72.2% of the Year 9 cohort compared with the national result of 78% (Department for Education, 2014).
Following approval by the UCL Research Ethics Committee (UCL Ref: 8097/003), survey data were collected via a secure online portal during the normal school day from students whose parents had not opted out. The SDQ was completed as part of a battery of measures, all of which had explanations for items found to raise issues during piloting. These were constructed to help pupils without altering items, and because researchers did not administer the survey face-to-face, they could not respond to queries. Pupils were instructed that the explanations could be obtained by hovering their mouse over certain words. For example, if pupils hovered over the word “restless,” they were given the explanation “unable to stay still.” All items which had explanations are indicated in Table 2.
Students responded to the 25-item SDQ using a 3-point Likert-type scale (not true, somewhat true, certainly true; R. Goodman et al., 1998). These 25 items form five subscales of five items each (more detail on the content of items can be found in the Results section, see Table 2). Internal consistency coefficients are presented in several formats to reflect both the typically reported standard (Cronbach’s alpha) and formulas that account for violations likely present in the data (see Table 1). Ordinal alpha accounts for the ordinal nature of Likert-type items since it is based on the polychoric correlation matrix (Gadermann et al., 2012), while McDonald’s omega is a model-based reliability coefficient that does not assume tau-equivalence (Raykov & Marcoulides, 2016). In line with other assessments of the SDQ (Bøe et al., 2016; Panayiotou et al., 2019), ordinal alpha and omega were higher than Cronbach’s alpha in the current sample (see Table 1). Exploratory structural equation model (ESEM) factor loadings can also be found in Supplemental Table S1 (available online).
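The conventional coefficient mentioned above can be illustrated with a short sketch. This shows only the standard Cronbach's alpha formula and is not the authors' analysis code: the function name and the toy responses are hypothetical, and ordinal alpha or omega would additionally require a polychoric correlation matrix or a fitted factor model, which are not shown.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of item-score columns (one list per item)."""
    k = len(items)
    if k < 2:
        raise ValueError("at least two items are needed")
    # Total score per respondent: sum across items, respondent by respondent.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(column) for column in items)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


# Hypothetical responses from six pupils to three items scored 0, 1, or 2.
toy_items = [
    [0, 1, 2, 1, 0, 2],
    [0, 1, 2, 2, 0, 1],
    [1, 1, 2, 1, 0, 2],
]
alpha = cronbach_alpha(toy_items)
print(round(alpha, 2))
```

Because alpha is computed from Pearson-type variances, it treats the three response options as equally spaced, which is the assumption that ordinal alpha relaxes.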
Table 1. SDQ Subscale Reliability Coefficients.
Note. SDQ = Strengths and Difficulties Questionnaire.
Hierarchical omega coefficient.
Analysis
Readability Testing
Calculating multiple readability estimates is recommended given the lack of a gold standard readability formula and the variability in their focus (Janan & Wray, 2012). The current study applied four widely used and established readability assessments, each of which incorporates different text components. The Dale–Chall Readability Formula (DC; Chall & Dale, 1995; Dale & Chall, 1948) considers the percentage of difficult words and the average sentence length. Difficult words are those that do not appear on the Dale–Chall Readability word list:
$$\mathrm{DC} = 0.1579 \left( \frac{\mathrm{DW}}{\mathrm{TW}} \times 100 \right) + 0.0496\,\mathrm{AWS} + 3.6365$$
where DW = total number of difficult words, TW = total number of words, and AWS = average number of words per sentence.
The Flesch–Kincaid Reading Grade (FK; Kincaid et al., 1975) considers the average number of syllables per word and the average sentence length:
$$\mathrm{FK} = 0.39\,\mathrm{AWS} + 11.8\,\mathrm{ASW} - 15.59$$
where AWS = average number of words per sentence and ASW = average number of syllables per word.
The Gunning Fog index (GFI; Gunning, 1952) considers the number of words, sentences, and hard words (those with three syllables or more):
$$\mathrm{GFI} = 0.4 \left( \mathrm{AWS} + 100 \times \frac{\mathrm{HW}}{\mathrm{TW}} \right)$$
where AWS = average number of words per sentence, HW = total number of hard words, and TW = total number of words.
Finally, the Coleman–Liau index (CLI; Coleman & Liau, 1975) incorporates the number of letters instead of syllables:
$$\mathrm{CLI} = 0.0588\,\mathrm{LW} - 0.296\,\mathrm{SW} - 15.8$$
where LW = average number of letters per 100 words and SW = average number of sentences per 100 words.
All indices provide readability as a U.S. grade level. The readability of SDQ items and subscales was then calculated by averaging the U.S. grade-level scores of the four indices and adding six to obtain the average reading age. The age appropriateness of SDQ items was judged against the original minimum recommended age of 11 years (R. Goodman et al., 1998).
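The averaging procedure described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' tooling: the constants are the commonly published ones for the four indices, applying the Dale–Chall constant of 3.6365 to every text is an assumption, and the `TextCounts` container and its hand-tallied counts are hypothetical. The difficult-word count of zero assumes that all four words of the example item appear on the Dale–Chall list.

```python
from dataclasses import dataclass


@dataclass
class TextCounts:
    """Hand-tallied counts for one piece of text (hypothetical container)."""
    words: int
    sentences: int
    syllables: int
    letters: int
    difficult_words: int  # words absent from the Dale-Chall word list
    hard_words: int       # words with three or more syllables


def grade_levels(c: TextCounts) -> dict:
    """U.S. grade-level estimates from the four readability indices."""
    aws = c.words / c.sentences        # average words per sentence
    asw = c.syllables / c.words        # average syllables per word
    lw = c.letters / c.words * 100     # letters per 100 words
    sw = c.sentences / c.words * 100   # sentences per 100 words
    return {
        "DC": 0.1579 * (c.difficult_words / c.words * 100) + 0.0496 * aws + 3.6365,
        "FK": 0.39 * aws + 11.8 * asw - 15.59,
        "GFI": 0.4 * (aws + 100 * c.hard_words / c.words),
        "CLI": 0.0588 * lw - 0.296 * sw - 15.8,
    }


def reading_age(c: TextCounts) -> float:
    """Mean grade level across the four indices plus six, as described in the text."""
    grades = grade_levels(c)
    return sum(grades.values()) / len(grades) + 6


# "I worry a lot": 4 words, 1 sentence, 5 syllables, 10 letters (assumed tallies).
worry = TextCounts(words=4, sentences=1, syllables=5, letters=10,
                   difficult_words=0, hard_words=0)
print(round(reading_age(worry), 2))  # 5.41
```

Under these assumptions the sketch returns 5.41 for "I worry a lot," which agrees with the reading age reported for that item in the Discussion. The strongly negative Coleman–Liau value for such a short text (about -8.5) also illustrates why formulas designed for longer passages behave erratically at the item level.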
Item Quality Criteria
Consistent with readability indices, psychometric guidance suggests scale items should be simple in language and grammar, regardless of the age of the target population (Irwing & Hughes, 2018; Terwee et al., 2007). Beyond this, other important aspects of the content and structure of items must be considered alongside readability tests, for a more comprehensive assessment (Oakland & Lane, 2004). Additional item quality criteria deemed relevant to age-appropriateness and mental health were therefore identified to supplement readability analyses. First, items should ideally consist of single statements (Irwing & Hughes, 2018; Saris, 2014; Terwee et al., 2007), and avoid reverse wording to reduce confusion (Irwing & Hughes, 2018; van Sonderen et al., 2013). Floor and ceiling effects (endorsement of the lowest or highest response at >15%) should not be present. Absence of these is an indication that measures reliably distinguish individuals across the range of symptoms (Terwee et al., 2007). Items should also be presented with a clear and appropriate reference period to the concept under study (Irwing & Hughes, 2018; Saris, 2014). Since all items had the same reference period, we used the first three criteria to assess items and considered those that satisfied two out of three to be of higher quality.
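The scoring rule described above can be expressed as a small sketch. The `Item` fields and the example response shares are hypothetical, "two out of three" is read here as at least two criteria, and treating floor and ceiling effects jointly as a single criterion is an interpretation of the text rather than a documented procedure.

```python
from dataclasses import dataclass

THRESHOLD = 0.15  # floor/ceiling cutoff stated in the text (>15%)


@dataclass
class Item:
    single_statement: bool
    reversed_wording: bool
    lowest_share: float   # proportion endorsing the lowest response option
    highest_share: float  # proportion endorsing the highest response option


def criteria_met(item: Item) -> int:
    """Count how many of the three item quality criteria are satisfied."""
    no_floor_or_ceiling = (item.lowest_share <= THRESHOLD
                           and item.highest_share <= THRESHOLD)
    return sum([item.single_statement,
                not item.reversed_wording,
                no_floor_or_ceiling])


def is_higher_quality(item: Item) -> bool:
    """Assumed reading of the rule: at least two of three criteria met."""
    return criteria_met(item) >= 2


# Hypothetical shares: a single, non-reversed statement with a floor effect.
simple_item = Item(single_statement=True, reversed_wording=False,
                   lowest_share=0.30, highest_share=0.10)
# Hypothetical shares: a reversed, double-barreled item with a floor effect.
complex_item = Item(single_statement=False, reversed_wording=True,
                    lowest_share=0.40, highest_share=0.05)
print(is_higher_quality(simple_item), is_higher_quality(complex_item))
```

Under this rule an item can be rated as higher quality despite a floor effect, provided it is a single, non-reversed statement.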
Factor Structure and Measurement Invariance
Given the poor factor structure of the self-report SDQ in other samples (A. Goodman et al., 2010; R. Goodman, 2001; Percy et al., 2008), we considered both CFA and ESEM with geomin rotation (see Figure 1). We estimated three CFA models, the first of which was a correlated structure based on the original theoretical structure of the measure representing the five subscales typically used (R. Goodman, 2001). Second, we included a correlated two-factor higher order structure in which emotional problems and peer problems loaded onto a second-order internalizing factor, and conduct problems and hyperactivity loaded onto a second-order externalizing factor as suggested elsewhere (A. Goodman et al., 2010). Third, we estimated a bifactor model (Chen & Zhang, 2018) with a general difficulties factor, and four residual difficulty subdomain factors. This model has shown some promise in other language versions (e.g., Ortuño-Sierra, Chocarro, et al., 2015) and allows the total difficulties subscale to be represented as a general factor after accounting for specific variance captured by each of the four problem domains. The prosocial factor was excluded from both the bifactor and higher-order models since these were used to examine the hypothesized four-factor total difficulties score (R. Goodman, 2001). We finally tested a five-factor ESEM model, which was used to explore age measurement invariance.

Figure 1. Models tested.
Where measures lack support for their proposed dimensionality, as is the case with the self-report SDQ (A. Goodman et al., 2010; R. Goodman, 2001; Percy et al., 2008), and invariance testing is warranted, given recent claims about age (Deighton et al., 2019; Dray et al., 2016; NHS Digital, 2018), ESEM techniques can be used (Marsh, Nagengast, et al., 2013). As others have pointed out, though ESEM structures should not be used to conceal problems with a measure, they can provide a more realistic framework for measurement invariance analysis where CFA models do not fit sufficiently well (Tóth-Király et al., 2017). Furthermore, given the substantial cross-loadings and shared variance in the SDQ (R. Goodman, 2001; Percy et al., 2008), ESEM can provide a more robust approach than post-hoc addition of parameters (e.g., cross-loadings) following modification indices (Chiorri et al., 2016). We therefore opted to extract five factors in line with the original theoretical model; because ESEM permits every item to load onto every factor, shared variance in the data is not misspecified.
When accounting for the fact that data were sampled from pupils clustered in schools (using type = complex), the ESEM models required greater numbers of parameters to be estimated than there were schools in the sample (165 > 114), thus resulting in a warning about the trustworthiness of standard errors. Given that the implications of this in model estimation are not well understood (Muthén & Muthén, 2016), and parameter estimates would not be directly affected, clustering effects were not controlled for. This decision was guided by the small intracluster correlations for the SDQ variables (<.05) and the fact that controlling for clustering made little difference to the standard errors and therefore conclusions (results can be provided on request). For consistency we therefore did not account for clustering in any model.
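One way to arrive at the figure of 165 is the usual parameter count for an exploratory factor structure with continuous indicators in a single group. The sketch below is an assumption about how the count decomposes, not output from the authors' software, and the function name is hypothetical.

```python
def esem_free_parameters(n_items: int, n_factors: int) -> int:
    """Free parameters in a single-group exploratory factor model
    with continuous indicators (assumed counting scheme)."""
    loadings = n_items * n_factors
    factor_variances_covariances = n_factors * (n_factors + 1) // 2
    # Rotational indeterminacy removes n_factors squared parameters.
    identification_constraints = n_factors ** 2
    residual_variances = n_items
    intercepts = n_items
    return (loadings + factor_variances_covariances
            - identification_constraints
            + residual_variances + intercepts)


n_parameters = esem_free_parameters(n_items=25, n_factors=5)
n_schools = 114
print(n_parameters, n_parameters > n_schools)  # 165 True
```

With 25 items and five factors this gives 125 + 15 - 25 + 25 + 25 = 165, which exceeds the 114 clusters available for estimating cluster-robust standard errors.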
Model fit was judged in line with published recommendations. Chi-square statistics are reported, but not interpreted as indicating fit given their known sensitivity to sample size. The CFI and the Tucker–Lewis index (TLI) were considered to be acceptable at around .95, and RMSEA around .06 (Hu & Bentler, 1999). The standardized root mean squared residual (SRMR) was considered to be acceptable <.08 in the absence of any large residuals (Asparouhov & Muthén, 2018). In addition to these standardized indices, the Akaike information criterion (AIC) and Bayesian information criterion (BIC) are also reported to compare models with the same outcome variables, with lower values indicating better model fit.
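The cutoffs above can be applied mechanically, as in the following sketch. Treating the "around" values as hard thresholds is a simplification made for illustration, the function and dictionary names are hypothetical, and the example input is the pair of values reported for Percy et al. (2008) in the Introduction.

```python
# Approximate cutoffs named in the text (Hu & Bentler, 1999; Asparouhov & Muthen, 2018).
CUTOFFS = {"CFI": 0.95, "TLI": 0.95, "RMSEA": 0.06, "SRMR": 0.08}
HIGHER_IS_BETTER = {"CFI", "TLI"}


def judge_fit(indices: dict) -> dict:
    """Return True per index when it meets its cutoff (hard-threshold simplification)."""
    verdict = {}
    for name, value in indices.items():
        cutoff = CUTOFFS[name]
        if name in HIGHER_IS_BETTER:
            verdict[name] = value >= cutoff
        else:
            verdict[name] = value <= cutoff
    return verdict


percy_2008 = judge_fit({"CFI": 0.817, "RMSEA": 0.047})
print(percy_2008)  # {'CFI': False, 'RMSEA': True}
```

The example shows how the two indices can disagree for the same model, which is why several indices were considered jointly rather than any single one.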
Chi-square difference testing is typically used to compare the fit of measurement invariance models. However, its sensitivity to sample size made this inappropriate for our study, suggesting approximate fit indices should be used. Since the majority of measurement invariance simulations focusing on performance of fit indices have treated items as continuous (Chen, 2007; Cheung & Rensvold, 2002; Meade et al., 2008), the degree to which common fit indices are appropriate for comparing models using polychoric matrices and WLSMV is unclear. For instance, given that the chi-square of WLSMV is not comparable in the same way as for maximum likelihood, CFI comparisons might not be appropriate in these cases (Sass et al., 2014). Analyses were therefore conducted in Mplus 8.3 using MLR and treating items as continuous. This also allowed us to account for the nonnormality of the data and enabled missing data to be handled via full information maximum likelihood under the assumption of missing at random (Muthén & Muthén, 1998-2017). All cases with data for at least one SDQ item were therefore included in our analysis. Though items were treated as continuous, floor effects were likely in a screening measure, so sensitivity tests for the CFA and ESEM models were conducted, in which items were treated as ordinal using WLSMV (Brown, 2015; Li, 2016).
Measurement invariance testing was conducted by estimating baseline models for each age group separately, followed by a configural model in which parameters were freely estimated in each group, a metric model with loadings constrained to be equal across groups, and finally a scalar model in which intercepts were also held equal (Muthén & Muthén, 1998-2017). Given the large sample size, CFI difference (ΔCFI) was used to judge approximate invariance (Sass et al., 2014). In line with wider ESEM literature (Marsh, Nagengast, et al., 2013; Marsh, Vallerand, et al., 2013; Tóth-Király et al., 2017), and specific invariance analysis of the SDQ (Chiorri et al., 2016), we adopted a threshold of .01 for ΔCFI. This cutoff has been shown to perform well with the Mplus calculation of CFI and under different conditions of invariance and noninvariance (Chen, 2007).
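The decision rule for the nested sequence can be sketched as below. The CFI values are hypothetical placeholders chosen only to show the mechanics and are not the study's results; the function name is likewise an assumption.

```python
def invariance_steps(cfi_by_model: dict, threshold: float = 0.01) -> list:
    """Compare each model with the preceding, less constrained one.

    Returns (model, drop in CFI, supported) for the metric and scalar steps,
    where supported means the drop in CFI is below the threshold.
    """
    order = ["configural", "metric", "scalar"]
    results = []
    for previous, current in zip(order, order[1:]):
        drop = cfi_by_model[previous] - cfi_by_model[current]
        results.append((current, round(drop, 3), drop < threshold))
    return results


# Hypothetical CFI values, for illustration only.
hypothetical_cfi = {"configural": 0.960, "metric": 0.957, "scalar": 0.951}
steps = invariance_steps(hypothetical_cfi)
for model, drop, supported in steps:
    print(model, drop, supported)
```

Each step is judged against the preceding model rather than against the configural model alone, so a small loss of fit at the metric step does not mask a larger loss at the scalar step.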
Results
Readability Estimates
Table 2 presents the four readability estimates by U.S. grade level, the average across the four indices, and the reading age in years. Estimates were calculated for the introductory text, individual items, subscales, and total scale. The introductory text was found to have a reading age considerably greater than 11 years. Similarly, Items 3, 13, 16 (emotional), 4, 20 (prosocial), 10, 15 (hyperactivity), and 14 (peer problems) were calculated as having reading ages greater than 12 years. Of the five subscales, emotional problems and hyperactivity were calculated as having the highest reading ages (>12 years). However, despite appropriate estimates for the remaining subscales and total scale, conduct problems was the only subscale not to include any items with a reading age greater than 12 years. Items 10, 13, 15, 16, and 20 were of particular concern, with reading ages greater than 15 years.
Table 2. SDQ Items Floor/Ceiling Effects, Item Quality Scores, Readability Estimates by U.S. Grade Level, Average Grade-Level Estimate Across Indices, and Reading Age.
Note. In bold are the estimates for the subscales and total scale. Underlined words are those for which additional explanations were provided when the mouse was hovered over them during online administration. + = high quality; − = low quality; (R) = Reversed items.
Item Quality Criteria
The measure’s items, floor/ceiling effects, and quality scores can be found in Table 2. While we expected varied quality, results were not favorable, with 17 items (68%) showing poor quality (see Table 2). Specifically, of the SDQ’s 25 items, 14 (four emotional problems, four conduct problems, three hyperactivity–inattention, two prosocial, and one peer problems) clearly include more than one statement, and therefore request a response about more than one experience. The measure also has five reversed items across the conduct problems, hyperactivity, and peer problems scales. All 20 difficulties items showed substantial floor effects, ranging from 21% to 85%, and a further eight also had ceiling effects, ranging from 15% to 34%. The prosocial items showed ceiling effects, ranging from 29% to 69%, and one also had a floor effect at 16%.
Readability Versus Item Quality
Though our readability methodology suffers from applying formulas to short texts (Oakland & Lane, 2004), this was considered alongside item quality criteria so that items could be evaluated more comprehensively. For instance, the item with the lowest reading age, “I worry a lot,” also performed well in terms of item quality since it is not reversed and consists of a single statement. Conversely, the item “I fight a lot. I can make other people do what I want” has a low reading age, but introduces confusion since respondents must affirm two independent behaviors. The item “I am easily distracted, I find it difficult to concentrate” has the highest reading age because it contains several multiple-syllable words. Another consideration is that the measure is often deployed in schools (e.g., Wigelsworth et al., 2012), as was the case for our sample. On the one hand, young people in schools may regularly be talked to about concentration and therefore be more readily primed to recognize these words than readability formulas would suggest. On the other hand, item quality criteria confirm this item is unnecessarily complex, since it contains two statements. Readability and item quality analyses therefore complement one another, and the age-appropriateness of measures is more complex than any one type of analysis might suggest.
Factor Structure and Measurement Invariance
School year group was available for all but one participant, and missingness for SDQ responses ranged from 0.5% to 1.5%. Variance and covariance coverage were high (>.97) for SDQ items, suggesting that estimates were likely to be trustworthy (Muthén et al., 2017). Since data were not missing completely at random, χ²(13,289) = 17509.62, p < .0001, we explored missingness at the subscale level, using gender, age, ethnicity, self-reported well-being, special educational needs, and free school meal eligibility as predictors. Special educational needs (odds ratio [OR] = 0.23-0.37) predicted less missing data for all subscales. Unclassified ethnicity predicted less missing data for all but the conduct problems subscale (OR = 0.01-0.22). Asian ethnicity predicted less missing data for peer problems, prosocial behavior, and hyperactivity (OR = 0.28-0.35). Higher well-being predicted less missing data for peer problems and prosocial behavior (OR = 0.92-0.93), while girls (OR = 0.33) and those from Black ethnic backgrounds (OR = 0.28) were less likely to have missing data for prosocial behavior.
Fit of all models estimated is provided in Table 3. The original correlated five-factor structure showed poor fit, as did the higher order model. The bifactor structure of the four difficulties subscales similarly indicated a total difficulties score to be problematic, even though bifactor structures are highly parameterized with a tendency to overfit (Murray & Johnson, 2013). As expected, given the flexibility of such models, the ESEM solution provided a much better fit to the data. Nevertheless, primary ESEM loadings were strongly related to their corresponding parameters in the CFA model. This was established via a correlation between loadings from the ESEM and CFA models (r = .65) following the example by Marsh, Vallerand, et al. (2013).
Table 3. Model Fit for Main and Sensitivity Analysis Models.
Note. df = degrees of freedom; ESEM = exploratory structural equation modeling; MLR = robust maximum likelihood; WLSMV = weighted least square mean and variance adjusted; AIC = Akaike information criterion; BIC = Bayesian information criterion; RMSEA = root mean square error of approximation; CFI = comparative fit index; TLI = Tucker–Lewis index; SRMR = standardized root mean squared residual; λ = factor loadings; h2 = item communalities.
p < .01.
The ESEM solution (see Supplemental Table S1) revealed nine items to cross-load with a discrepancy of <.30 between the highest and second highest loadings, which is indicative of problems with the item (Matsunaga, 2010). Each of the five reversed items also loaded higher than .34 on the prosocial factor and less strongly on their theorized difficulties factors. The prosocial factor was not correlated with the emotional problems and peer problems factors at a significant level. Similarly, the hyperactivity factor was not significantly associated with the peer problems factor. Factor correlations beyond this were in expected directions, with the largest associations seen between hyperactivity and conduct problems (r = .49), and emotional problems and peer problems (r = .38). Sensitivity analysis also revealed that accounting for the categorical nature of items via WLSMV had little impact on results. No changes in fit or loadings were seen in terms of recommended cutoffs, supporting confidence in the main results reported based on MLR.
Acceptable model fit was found for the two age groups separately. Consistent with findings for the parent version with middle and older adolescents (He et al., 2013), but counter to our hypothesis based on previous readability evidence, approximate age measurement invariance was supported, as the ΔCFI was found to be less than .01 in all comparisons (see Table 4).
Table 4. ESEM Age Measurement Invariance Findings.
Note. Robust maximum likelihood was used. df = degrees of freedom; ESEM = exploratory structural equation modeling; AIC = Akaike information criterion; BIC = Bayesian information criterion; RMSEA = root mean square error of approximation; CFI = comparative fit index; TLI = Tucker–Lewis index; SRMR = standardized root mean squared residual; Δχ2 = chi-square difference test; ΔCFI = CFI difference; h2 = item communalities.
p < .01.
Discussion
Though the self-report SDQ is widely used, including to study age differences (Deighton et al., 2019; Hafekost et al., 2016; Johnston & Gowers, 2005), evidence of its age appropriateness has been limited. Building on existing evidence (He et al., 2013; Patalay et al., 2018), we addressed this gap by considering the measure’s item-level readability, item quality, factor structure, and age measurement invariance. Items showed a wide range of reading ages, which was more varied than previous subscale-level analysis had indicated (Patalay et al., 2018). Many items also appeared to be too difficult for the intended age group. Beyond this, a substantial proportion of the measure was found to be problematic in terms of item quality, and the proposed factor structure was a poor fit to the data. ESEM allowed approximate measurement invariance to be tested between students in Year 7 versus Year 9, which suggested that this flexible structure was invariant across these groups.
While Patalay et al. (2018) had already demonstrated the measure may not be suitable for adolescents under 13 years, their analysis was unable to clarify which items might be problematic. In fact, our results suggest scale- and subscale-level reading scores could be misleading since these suggested levels around age 11 years. Counter to our first hypothesis, item-level readability was much more varied than that found previously at the subscale level. We found some items to be much more difficult and others much easier. For instance, while the emotional problems subscale had an average reading age of 12.68 years, the item “I worry a lot” performed much better with an average reading age of 5.41 years. This item is therefore an example of optimal simplicity.
Beyond the item-level analysis, the instructions did not meet recommendations published elsewhere that even adult scales should have reading ages of no more than 12 years (Terwee et al., 2007). This suggests there may have been problems even for higher quality items. In fact, special attention to instructions has been recommended for surveys with young people since clearer and more detailed instructions can be associated with greater reliability (Omrani et al., 2018). Similarly, though the stated reference period in the SDQ instructions is clear (that is, not subjective, such as “often,” but finite: “over the past 6 months”), this may not be appropriate to the assessment of symptoms in adolescents. Younger adolescents, in particular, tend to find long reference periods challenging, and guidelines suggest very recent or current reference periods may lead to more valid responses in this age group (Bell, 2007; de Leeuw, 2011).
As well as clarifying readability analysis, consideration of item quality criteria revealed the measure to have certain other problems. Alongside the fact that over half of items contain multiple statements, the SDQ also contains five reversed items. While such items are common, it is generally advised that these be avoided since they tend not to factor well or to act as the opposite indicators their developers intend (Ebesutani et al., 2012; Suárez-Alvarez et al., 2018; van Sonderen et al., 2013). In the current study, it was clear the reversed items were not measuring the subscale constructs cleanly, as ESEM results revealed all these items to have substantial cross-loadings. This is also consistent with findings in other language versions of the SDQ (Garrido et al., 2020; van de Looij-Jansen et al., 2011). Specifically, we found each of the reversed items loaded more strongly on the prosocial factor than on their respective theorized factors. Some shared variance could reasonably be anticipated. However, the magnitude of these cross-loadings (particularly on the prosocial factor) suggests that, beyond age-appropriateness, these items may also face wider validity problems. Reversed items can affect instrument structure through misresponse since their content may not be perceived as opposite to positively worded statements (Weijters & Baumgartner, 2012). Though we did not explicitly examine common method effects, our ESEM results suggest reversed items could have introduced noise into the structure through similarity to prosocial items, as they all relate to positive behaviors.
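The pattern described above, in which an item loads more strongly on a non-target factor than on its theorized factor, can be checked mechanically once a loading matrix is available. The sketch below is illustrative only: the item labels and loading values are hypothetical and are not results from this study.

```python
# Hypothetical standardized ESEM loadings (item -> factor -> loading); not study results.
loadings = {
    "reversed_item_a": {"conduct": 0.21, "hyperactivity": 0.05, "prosocial": 0.38},
    "reversed_item_b": {"conduct": 0.02, "hyperactivity": 0.24, "prosocial": 0.41},
    "regular_item_c": {"conduct": 0.55, "hyperactivity": 0.12, "prosocial": -0.08},
}

# Hypothetical mapping of each item to its theorized (target) factor.
target_factor = {
    "reversed_item_a": "conduct",
    "reversed_item_b": "hyperactivity",
    "regular_item_c": "conduct",
}

def stronger_cross_loadings(loadings, target_factor):
    """Return (item, factor) pairs where a non-target loading exceeds the target loading."""
    flagged = []
    for item, row in loadings.items():
        target = abs(row[target_factor[item]])
        others = {f: abs(v) for f, v in row.items() if f != target_factor[item]}
        strongest = max(others, key=others.get)
        if others[strongest] > target:
            flagged.append((item, strongest))
    return flagged

print(stronger_cross_loadings(loadings, target_factor))
```

Absolute values are compared so that a strong negative cross-loading is flagged as readily as a positive one.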
Item quality criteria also provided insight into the measure’s applicability across the range of symptoms. In our community sample, which showed above average levels of mental health difficulties, high levels of floor or ceiling effects were seen for every item. While this is a common feature of clinical measures used in samples with predominantly healthy individuals, the measure’s use may be somewhat limited, particularly if recommended dimensional approaches to understanding symptoms are adopted (Krueger et al., 2018). This is because measures with high floor and ceiling effects tend to have less discriminatory ability and responsiveness; in other words, they may be less able to detect change and discriminate between individuals with different levels of problems (e.g., high versus borderline; de Vet et al., 2011). The 3-point response format may contribute to the skewed nature of the data since having more categories can be associated with higher reliability and validity (Lozano et al., 2008). While there is relatively little research on number of response categories with young people, available evidence suggests around four options may provide a good balance in terms of memory, reading, reliability, and stability (Bell, 2007; Omrani et al., 2018).
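Floor and ceiling effects of the kind discussed above are usually summarized as the proportion of respondents choosing the lowest or highest response option. The sketch below shows one way this might be computed for a single 3-point item. The response data are hypothetical, the 0 to 2 scoring is an assumption about how the item is coded, and the 15% cutoff is an assumed criterion rather than one stated in this section.

```python
from collections import Counter

def floor_ceiling(responses, minimum=0, maximum=2):
    """Return the proportions of responses at the lowest and highest option."""
    counts = Counter(responses)
    n = len(responses)
    return counts[minimum] / n, counts[maximum] / n

# Hypothetical responses to one item, assumed to be scored 0, 1, or 2.
item_responses = [0] * 62 + [1] * 27 + [2] * 11

THRESHOLD = 0.15  # assumed cutoff for flagging a floor or ceiling effect

floor, ceiling = floor_ceiling(item_responses)
print(f"floor = {floor:.2f} (flagged: {floor > THRESHOLD})")
print(f"ceiling = {ceiling:.2f} (flagged: {ceiling > THRESHOLD})")
```

In this made-up example most respondents sit at the lowest option, so the item would separate individuals poorly at the low end of the symptom range.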
Beyond the issues already identified, further elements have also been suggested as indicators of psychometric quality. Of particular relevance to the current study is that measures should ideally be developed in consultation with the target population (Irwing & Hughes, 2018; Terwee et al., 2007), since this allows assessment of acceptability and bias of items. It is possible that some of the psychometric problems identified in the SDQ are compounded by such issues, as to the authors’ knowledge, such consultation did not take place in the development of the SDQ.
Regarding the SDQ’s structure, we found the five correlated subdomains to be a poor fit to the data and uncovered substantial shared variance across factors in the ESEM solution. The higher order internalizing/externalizing model and the bifactor difficulties model also failed to show good fit. These results indicate that using the SDQ to calculate subdomain scores is questionable (Raykov & Marcoulides, 2011). Our ESEM results further suggest the hypothesized structure may be problematic since several items loaded onto more than one factor.
The instrument’s poor fit may also be explained by satisficing theory, which is considered to be of particular relevance to adolescents (Krosnick, 1991; Omrani et al., 2018). This holds that the greater the cognitive demand on participants, the lower the reliability of their responses, as steps involved in providing appropriate responses are skipped (Krosnick, 1991; Omrani et al., 2018). The following results in this study could support such an account: (1) subscales showed mixed reliability, as measured through internal consistency; (2) the instructions had a higher reading age than the lowest limit of the intended population; (3) many items did not have appropriate reading ages, with some at very high levels; (4) the reference period of 6 months is often considered to be inappropriate for younger adolescents (Bell, 2007; de Leeuw, 2011); (5) several items, particularly those with reverse wording, were found to tap into more than one construct; and (6) many items contained multiple statements which tend to increase cognitive load (Oakland & Lane, 2004).
Since we found the hypothesized CFA structures to be inadequate, we proceeded to invariance testing with the ESEM model, which as expected showed excellent fit. We found no evidence of differences in how 11- to 12-year-olds versus 13- to 15-year-olds responded using this flexible model. Since we used ΔCFI to establish approximate invariance, we interpret our findings as suggesting that any differences between groups are likely insubstantial. Though we anticipated older students might respond markedly differently, as previous research suggested the SDQ may be more appropriate to their reading ability (Patalay et al., 2018), our results suggest that both groups responded to it with the same level of ease and/or difficulty. Still, our readability evidence suggests that items with a reading age above 14 years may have been too difficult for both groups. In fact, our sample had below average ability in reading which could also support the idea that approximate invariance was caused in part by (high) reading age items being equally difficult for both groups. Further work is needed (e.g., cognitive interviews with young people) to consolidate our findings.
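The ΔCFI logic described above can be sketched as a comparison of successively constrained models, where each step is retained if CFI does not drop by more than a chosen cutoff. The CFI values below are hypothetical, the model labels are generic, and the cutoff of .01 is a commonly used convention assumed here because this section does not state the value applied.

```python
def delta_cfi_steps(cfi_by_model, cutoff=0.01):
    """Compare each model with the previous, less constrained one.

    Returns (model, CFI drop, supported) tuples, where supported means the
    drop in CFI does not exceed the cutoff.
    """
    names = list(cfi_by_model)
    results = []
    for previous, current in zip(names, names[1:]):
        drop = cfi_by_model[previous] - cfi_by_model[current]
        results.append((current, round(drop, 3), drop <= cutoff))
    return results

# Hypothetical fit values, ordered from least to most constrained.
fits = {"configural": 0.985, "metric": 0.984, "scalar": 0.981}

for model, drop, supported in delta_cfi_steps(fits):
    print(f"{model}: CFI drop = {drop}, approximate invariance supported = {supported}")
```

Because the decision rests on a descriptive change in fit rather than a significance test, a supported step indicates only that any group differences are small enough not to degrade fit appreciably.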
Taken together, our findings indicate a large proportion of self-report SDQ items are less appropriate for use with younger populations. The current study is the first to provide a detailed item-level readability analysis, thus uncovering specific issues with the self-report SDQ. While previous evidence suggested four of the five subscales had reading ages higher than the recommended minimum age (Patalay et al., 2018), the current study indicates this may not be the case for all items. Still, our findings call for caution when using the self-report SDQ with younger adolescents or populations with mental health difficulties, since these groups may have below average reading ability (Jensen et al., 2006; Moilanen et al., 2010). It should also be noted that self-report adolescent mental health measures have generally been found to be poor in terms of psychometric quality (Bentley et al., 2019). It is therefore important that researchers and clinicians consider carefully the psychometric quality and reading age of their chosen instrument in relation to their sample (Jensen et al., 2006).
Our study brought together robust and complementary methodological approaches to comprehensively assess age-appropriateness of a widely used measure for the first time. Indeed, our findings highlight the importance of conducting supplementary analysis such as readability and item quality alongside invariance testing, since these can provide additional insight. Together, assessment of item quality and readability with factor analysis suggested that the scale contains several difficult statements and psychometrically poor items with a response scale that likely prevents it from capturing the full spectrum of symptoms experienced in the general population (Terwee et al., 2007).
Despite these methodological strengths, a number of limitations must be acknowledged. First, though we attempted to overcome the problem of losing information about items when applying readability formulas to subscales, our item-level readability results should be interpreted carefully. These formulas were not designed for this purpose and therefore may not be as reliable as when used with longer passages (Oakland & Lane, 2004). However, we are confident that high-scoring items are likely inappropriate for younger audiences since they also showed poor item quality. It has also been suggested that assessment of readability at the item level is vital since this reflects how respondents actually perceive scale texts, particularly since individual items may be skipped or invalid responses provided when demands are too great (Calderón et al., 2006). In addition, although readability results were considered alongside other well-established indicators of item quality, these were not based on a standardized measure.
We also treated items as continuous so we could employ the more robust ΔCFI for invariance testing, though our data were ordinal. The skewness in our data was controlled for by using MLR, and sensitivity analysis using WLSMV supported these findings. Also, though our large sample size was likely an asset for assessing the generalizability of floor and ceiling effects, and the factor structure of the measure, it is not currently clear how approximate difference testing using ΔCFI is affected by samples of the magnitude reported here. It is also possible that the explanations provided via the online portal affected measurement invariance by masking the differences in ability between the older and younger cohorts. However, in any large-scale research with young people it is likely that support would be provided in some form (e.g., by a teacher or researcher). It is therefore likely very difficult to provide measurement invariance analysis across age groups without some kind of confound for ability.
Results must also be interpreted only for the ESEM model, which is less restrictive, with cross-loadings freely estimated. The theorized CFA model by Goodman et al. (1998) was not suitable for measurement invariance testing, and we therefore stress that invariance of this model could not be determined. Though lack of control over a priori structure in ESEM is therefore a limitation (Marsh et al., 2011), five factors corresponding to the original theoretical model were extracted allowing us to accommodate issues such as cross-loadings without resorting to post hoc model modification. Similarly, though the large number of parameters in ESEM is a limitation, our large sample size was likely able to handle this with a ratio of 163.7 cases per parameter. Finally, though our sample was large, it was not representative of the general population since deprivation was seen at higher levels, given the focus of the project from which data were drawn.
Conclusion and Future Directions
While the self-report SDQ has been used extensively, our study suggests the measure would benefit from revisions two decades on from its original development. It is perhaps surprising that such a widely used measure suffers from issues such as those described here, although as our findings suggest, this is possibly due to the lack of attention to robust scale development practices (e.g., omission of cognitive interviews with young people). Items should be simplified, with reversed wording and multiple statements replaced with simpler alternatives, and more straightforward language used for items with high reading ages. We also recommend that such amendments be made in consultation with young people in line with policy and psychometric best practice (Deighton et al., 2014; Irwing & Hughes, 2018; Terwee et al., 2007).
Supplemental Material
Supplemental material, supplement_material for Age Appropriateness of the Self-Report Strengths and Difficulties Questionnaire by Louise Black, Rosie Mansfield and Margarita Panayiotou in Assessment
Footnotes
Acknowledgements
The data used in this study were collected as part of the HeadStart learning program. The authors are therefore grateful for the work of the wider research teams at the Anna Freud Centre and the University of Manchester for their role in coordinating the evaluation, as well as collecting and managing the data. In particular we would like to thank Prof. Jessica Deighton for her feedback on an earlier draft. The authors also acknowledge the National Pupil Database from which demographic data were obtained. Finally, we are extremely grateful to all students who took part in this study as well as the local authorities and schools for their help in recruiting them.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The data used in this study were collected as part of the HeadStart learning program and supported by funding from the National Lottery Community Fund, grant R118420. The funders did not seek to influence any aspect of the secondary analysis reported in this study. The content is solely the responsibility of the authors and it does not reflect the views of the National Lottery Community Fund. Louise Black is funded by the National Institute for Health Research and Rosie Mansfield is funded by the Department for Education.
Supplemental Material
Supplemental material for this article is available online.
