Abstract
The Working Memory Rating Scale (WMRS) was designed as a behavioral rating tool to assist teachers in identifying students at risk of working memory difficulties. The instrument was originally normed on 417 monolingual English-speaking children from the United Kingdom. The purpose of this study was to test the reliability and validity of the WMRS on 459 first- to third-grade Spanish-speaking English Language Learners. Results indicated the one-factor model demonstrated adequate fit to the data. High convergent validity emerged with the Conners’ rating scale, but weak correlations occurred with achievement, vocabulary, phonological, short-term memory, and working memory measures. Implications and directions for future research are discussed.
The Working Memory Rating Scale (WMRS; Alloway, Gathercole, & Kirkwood, 2008) is a behavioral-based rating scale designed to measure working memory (WM) difficulties in school-aged children. WM is theorized as a multi-component structure in the central executive system, consisting of the phonological loop, visual-spatial sketchpad, and the episodic buffer. The WM system is thought to temporarily store and manipulate auditory or visual stimuli (e.g., Alloway & Alloway, 2010; Baddeley, 1996; Baddeley & Hitch, 1992, 1974; Gathercole & Pickering, 2001). The limited capacity of WM has been found to be related to learning difficulties in special student populations, including bilingual learners, and learners with cognitive disabilities (Alloway & Alloway, 2010; Gathercole & Pickering, 2001; St Clair-Thompson, 2014; Swanson, Orosco, Lussier, Gerber, & Guzman-Orth, 2011).
Identifying a cognitive component related to learning difficulties is especially important for bilingual, or English Language Learner (ELL), students; most traditional cognitive or psychoeducational assessment results are confounded by ongoing threats to validity from the high linguistic demands embedded in directions or assessment tasks. In other words, the assessments often cannot determine whether ELLs are having difficulty learning because of the content or because of their language acquisition. When WM is used as an indicator of academic risk in the classroom, at-risk students may have difficulty following multi-step instructions, remembering details, following through with task completion, and keeping track of progress in multi-step tasks (Alloway, Gathercole, Kirkwood, & Elliott, 2009). As a result, a screening tool to identify children with WM difficulties can help teachers support their students’ learning. Teachers can help mitigate difficulties that may affect student progress in mastering both content and language by providing targeted interventions, such as presenting instructions orally and visually, increasing the frequency of prompting, or providing memory aids to address confusion or forgetting in multi-step processes.
Providing empirical evidence for the validity (e.g., Messick, 1989) and reliability of a test is often a primary research objective for high-stakes assessments (Young, 2009; Young et al., 2008) and should be a primary objective for low-stakes, classroom-based assessments as well. A concern motivating this validation process for classroom-based assessments is the ongoing practice of administering the same assessment to all children within diverse elementary-level classrooms. However, practitioners still lack research-based assessments to screen for and diagnose the difficulties children are experiencing in the classroom, especially when the students are ELLs. Currently, there are more than 11 million ELLs in the United States, and a vast majority of these students are Hispanic/Latino, speak Spanish, and perform disproportionately below their peers (Aud et al., 2011; Hemphill & Vanneman, 2011). Therefore, it makes practical sense to develop and validate classroom-based assessments for diagnostic purposes, to identify why some ELLs are experiencing learning difficulties.
Although WM may be linked to learning difficulties, it is critical to note that the WMRS measures overt behaviors used to rate WM, that is, a simultaneous storage and retrieval system. For students who are ELLs enrolled in K-12 classrooms in the United States, manipulating this simultaneous storage and retrieval system can be an arduous task. ELLs in K-12 classrooms are already tasked with the dual responsibility of learning content knowledge taught in English, while they are also learning the English language. Taking this dual task into account for ELLs in today’s classrooms, the question remains, “Is the WMRS score a measure of WM or is it a measure of the already demanding experience of learning content in English while also learning English?”
The purpose of this study is to establish reliability and validity evidence for the WMRS teacher ratings of their Spanish-speaking ELL students. To accomplish this, we investigated the following questions: (a) Is the WMRS a reliable screener for Spanish-speaking ELLs with WM difficulties? (b) What is the internal structure of the WMRS? (c) What does the WMRS measure when compared with a bilingual battery of cognitive (i.e., WM, short-term memory [STM]), language, literacy, and behavioral assessments?
Method
Participants
In 2009-2010, a total of 500 students identified by their district as Spanish-speaking ELLs across 6 schools were tested in English and Spanish. A battery of assessments composed of general achievement, language, literacy, cognitive, and behavioral measures was administered. All assessments were counterbalanced for language and presentation order. The 500 students were distributed almost evenly by gender and across grade levels (first through third grade). All students identified their ethnicity as Hispanic/Latino, and most reported speaking Spanish at home. A total sample of 459 students was retained for this analysis; the demographic characteristics of the sample are shown in Table 1.
Student Demographic Characteristics in 2009-2010.
Note. PPVT = Peabody Picture Vocabulary Test; TVIP = Test de vocabulario en imágenes de Peabody.
A total of 75 teachers completed the WMRS on students within their classroom. Each teacher completed one WMRS per student, for a total of 1 to 21 ratings, depending on the number of students in their class. Almost all of these teachers taught the children exclusively in English, with few teaching in a combination of English and Spanish.
Instruments
The measures administered to the students were delivered in English and Spanish. When a commercially available Spanish version of an assessment existed, it was administered, as noted in the instrument description. For measures unavailable in Spanish, a research version of the assessment was adapted into Spanish by fluent and native Spanish speakers. Iterative review and revisions with native speakers were conducted, resulting in the final measure used for the research study.
Teacher Rating Scales
WMRS
The WMRS (Alloway et al., 2008) is a commercially available instrument designed to measure the WM abilities of students in the classroom, as rated by classroom teachers. The WMRS consists of 20 items describing characteristics of students with low WM abilities. Teachers respond on a 4-point Likert-type scale, 0 = not typical at all, 1 = occasionally, 2 = fairly typical, and 3 = very typical, with a higher score indicating greater risk of WM difficulties.
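The response format described above can be illustrated with a minimal sketch. It assumes, for illustration only, that the 20 item ratings are summed into a raw total (range 0 to 60); the function name `wmrs_total` is hypothetical, and the sketch does not reproduce the published scoring or norming procedure.

```python
# Hypothetical helper: summing the items is an assumption made for
# illustration, not the publisher's scoring procedure.
WMRS_ITEM_COUNT = 20          # number of items stated in the text
VALID_RATINGS = (0, 1, 2, 3)  # 0 = not typical at all ... 3 = very typical


def wmrs_total(ratings):
    """Sum one teacher's 20 item ratings into a raw total (0-60)."""
    if len(ratings) != WMRS_ITEM_COUNT:
        raise ValueError("expected 20 item ratings")
    for rating in ratings:
        if rating not in VALID_RATINGS:
            raise ValueError(f"rating out of range: {rating!r}")
    return sum(ratings)


# One made-up student: mostly "not typical", a few higher ratings.
example = [0] * 10 + [1] * 5 + [2] * 3 + [3] * 2
print(wmrs_total(example))  # 0*10 + 1*5 + 2*3 + 3*2 = 17
```

Under this assumption, a higher total corresponds to more frequently observed problem behaviors and therefore to greater rated risk.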
The WMRS was normed with a representative sample of primary school students from England, who ranged in age from 5.1 to 11.5 years. All students spoke English as their first language and were a representative sample of the ethnic and gender diversity in the United Kingdom. Split-test reliability was established at 0.97; construct validity was established with the WMRS total score and the Automated Working Memory Assessment resulting in moderate to weak correlations. Diagnostic validity was also established with the Wechsler Intelligence Scale for Children–Fourth U.K. Edition (WISC-IVUK); students at risk with the WMRS were generally found to be at risk as measured by the WISC-IVUK, indicating validity as a diagnostic indicator for children in the United Kingdom.
Conners’ Teacher Rating Scale–Revised: Short Form (CTRS-R:S)
The CTRS-R:S (Conners, 2001) is a teacher rating scale commonly used to assess problematic student behavior in the classroom. Teachers respond on a 4-point Likert-type scale ranging from 0 = not true at all, 1 = just a little true, 2 = pretty much true, and 3 = very much true.
Phonological Processing Measures
Segmenting and blending
Two 20-item tasks that measured segmentation and blending were administered in English and Spanish. The segmentation task required students to separate and say a word in individual phonemes (Leafstedt & Gerber, 2005), while blending required students to combine individual sounds together to say a word. The blending task was adapted into Spanish from the Comprehensive Test of Phonological Processing (Wagner, Torgesen, & Rashotte, 2000).
Pseudoword reading task
This measure was developed from the word attack subtest of the Woodcock Reading Mastery Test (WRMT; Woodcock, 1998). The measure required the child to orally read the list of pseudowords arranged in increasingly difficult order. The Woodcock technical manual reports internal reliability of the subtest at .88. A Spanish version of the task was adapted and administered.
Oral (Expressive) Language
Expressive One Word Picture Vocabulary Test (EOWPVT)
The EOWPVT (Brownell, 2001) assesses English and Spanish productive vocabulary. The pictures were arranged in order of hierarchical difficulty and the child was then asked to name each picture in each language until a ceiling was achieved.
Syntax
The Morphological Closure subtest from the Illinois Test of Psycholinguistic Ability III (Hammill, Mathers, & Roberts, 2001) was administered to test the children’s oral grammar competence. This oral cloze task measures metalinguistic knowledge at a syntactic level (Gottardo, Yan, Siegel, & Wade-Wooley, 2001). The measure was adapted to Spanish and administered similarly.
Naming Speed: Rapid Naming of Digits and Letters
The examiner presented the participant with an array of letters or digits (Wagner et al., 2000). Participants were asked to name the items as quickly as possible for each timed stimulus set. If they hesitated on an item for 10 s, they were directed to the next item.
Random Generation
Each participant was asked to write numbers as quickly as possible, first in sequential order and then in a random order (e.g., “out of order”) for 30-s intervals. Participants were also requested to perform the task with letters.
Memory: STM
Four measures of STM were administered in Spanish and English: Forward Digit Span, Backward Digit Span, Word Span, and Pseudoword Span. The Forward Digit Span and Backward Digit Span tasks were from the Wechsler Intelligence Scale for Children–Third Edition (WISC-III; Wechsler, 1991). The Word Span and Pseudoword Span Tasks (Swanson & Beebe-Frankenberger, 2004) were presented in the same manner as the Digit Span measures. Parallel versions were developed and administered in Spanish for two tasks.
Memory: Executive WM
The Conceptual subtest (Swanson, 2008) was administered in English and Spanish and used as an indicator of WM processing that involves the ability to organize sequences of words into abstract categories.
The children’s adaptation (Swanson, 1992) of Daneman and Carpenter’s (1980) Listening Sentence Span Task was administered in English and Spanish. This task required the presentation of groups of sentences, read aloud, for which children tried simultaneously to understand the sentence content and to remember the last word of each sentence. After each group of sentences was presented, the participant answered a question about a sentence; the participant was then asked to recall the last word of each sentence.
The Rhyming Span Test (Swanson & Beebe-Frankenberger, 2004; Swanson, Sáez, & Gerber, 2006) was administered in English and Spanish and used as an indicator of WM processing of acoustically similar words; as the test progressed, the lists gradually increased in length.
To assess controlled attention, often referred to as updating (e.g., Miyake, Friedman, Emerson, Witzki, & Howerter, 2000), an experimental task adapted from Swanson and Beebe-Frankenberger (2004) was also administered. A series of one-digit numbers was presented in sets that varied in length (nine, seven, five, and three digits). No digit appeared twice in the same set. Participants were told that they should recall only the last three numbers presented. After the last digit was presented, the participant was asked to name the last three digits, in order.
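The structure of the updating task can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the study's administration protocol: the digit pool (0 through 9), the all-or-nothing scoring rule, and the names `make_trial` and `is_correct` are assumptions introduced here.

```python
import random

SET_LENGTHS = (9, 7, 5, 3)  # set lengths named in the text


def make_trial(length, rng):
    """Draw `length` distinct one-digit numbers, so no digit repeats in a set.

    Assumption: digits are drawn from 0-9.
    """
    return rng.sample(range(10), length)


def is_correct(presented, response):
    """Score a response as correct only if it matches the last three
    presented digits, in order (assumed all-or-nothing scoring)."""
    return list(response) == list(presented[-3:])


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
trials = [make_trial(n, rng) for n in SET_LENGTHS]
for digits in trials:
    print(digits, "-> target:", digits[-3:])
```

Because the participant does not know when a set will end, the last three digits must be continuously updated as each new digit arrives, which is what distinguishes the task from a simple span measure.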
General Achievement Measures
Fluid (nonverbal) intelligence
The Raven Colored Progressive Matrices (CPM; Raven, 1976) was used as an indicator of nonverbal or fluid intelligence. Directions were given in English and Spanish. The matrices progressively increased in difficulty.
Arithmetic
The arithmetic subtest from the Wide Range Achievement Test–III (WRAT-III; Wilkinson, 1993) was administered, with directions given in English and Spanish. The dependent measure was the number of problems correct, which yielded a standard score (M = 100, SD = 15).
Reading
The Woodcock-Muñoz Language Survey–Revised (WMLS-R) establishes a norm-referenced reading level in English and Spanish (Woodcock, Muñoz-Sandoval, & Alverado, 2005). Word Identification and Passage Comprehension subtests were administered.
Receptive Vocabulary
The Peabody Picture Vocabulary Test (PPVT; Dunn & Dunn, 1981) was administered in English. This discrete receptive vocabulary test requires students to look at a grid of four pictures while the administrator says a word aloud. The student is required to point to, or say the number of, the picture that matches the target word.
The Test de vocabulario en imágenes de Peabody (TVIP; Dunn, Lugo, Padilla, & Dunn, 1986; i.e., the Peabody Picture Vocabulary Test in Spanish) is similar to the PPVT in presentation and administration. Children were presented with four pictures and asked to identify the picture for a word read aloud in Spanish.
Results
Descriptive Statistics
Table 2 presents the descriptive statistics of the WMRS. On average, teachers rated students low across all items. However, there was also substantial variation across all items, indicating some students were rated as exhibiting increased amounts of the problematic behaviors described by the WMRS.
Inter-Item and Item-Total Pearson Correlation Matrix for the WMRS Responses From Teachers of ELLs.
Note. WMRS = Working Memory Rating Scale; ELL = English language learners. All values significant at p < .001.
The obtained correlation coefficients indicate that all 20 items were positively correlated (all ps < .001). Correlations ranged from .27 to .79, and the majority were moderate to strong (i.e., between .40 and .79).
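For readers who wish to reproduce this kind of summary on their own ratings, the following sketch computes Pearson inter-item and item-total correlations. It assumes uncorrected item-total correlations (each item is included in the total), which may differ from the procedure used for the table; the data are made up, and the function names are illustrative.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_total_correlations(rows):
    """Correlate each item (column) with the summed total of all items.

    Assumption: uncorrected totals, i.e., the item itself is included.
    """
    totals = [sum(row) for row in rows]
    n_items = len(rows[0])
    return [pearson([row[j] for row in rows], totals) for j in range(n_items)]


# Made-up ratings: 5 students (rows) by 3 items (columns), each rated 0-3.
toy = [[0, 0, 1], [1, 1, 1], [2, 1, 2], [3, 3, 2], [0, 1, 0]]
print([round(r, 2) for r in item_total_correlations(toy)])
```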
Reliability
Cronbach’s alpha for the 20 items on the WMRS was .98. No item deletion would have increased this value by more than .003. However, while performing assumption checks, we found that Tukey’s test for additivity was significant (p < .001). To correct this problem, we raised the scores to the powers of .469 and .711 (Tukey’s estimates of power) and also applied a square-root transformation in an attempt to achieve additivity. The assumption was still not met. That is, there appeared to be an interaction among the items of the WMRS with this sample of Spanish-speaking ELLs. Simply put, the items were not linearly related, indicating that the variances associated with each item were not homogeneous across the items and, by extension, covariances across all pairs of items cannot be equal. Specific pairs of items with low correlations may reflect the nonlinearity, and, given the large range of correlation coefficients, it is not surprising that pairwise covariances are not homogeneous. Moreover, the WMRS’ use of 20 items to measure a single construct may be problematic with this sample of ELLs. That is, the large number of items may be contributing to greater heterogeneity in the covariances and, because the items depend on teacher observation of student behaviors, there is the potential for language acquisition behaviors to be conflated with WM behaviors. If so, this can lead to weaker or stronger covariances depending on the specific pair of items. Furthermore, an interaction may indicate correlated error scores, which can inflate coefficient alpha (Zimmerman, Zumbo, & Lalonde, 1993). This situation may also violate the assumption that the items are essentially tau-equivalent, in which case alpha may underestimate reliability (Zimmerman et al., 1993). Although such violations are not unusual in actual test data, it is problematic that the violation of this assumption could not be corrected by transforming the data.
Thus, obtaining a significant value (p < .001) for Tukey’s test of additivity indicates a need to interpret the reliability coefficient with caution.
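The two quantities discussed in this section can be sketched in a few lines. The code below is a minimal illustration, assuming a complete students-by-items matrix of ratings; it uses the standard formula for coefficient alpha and the standard one-degree-of-freedom form of Tukey's test for nonadditivity, and it is not the software procedure used in the study. The data are made up, the function names are illustrative, and the p value is omitted because it requires an F distribution that is not in the Python standard library.

```python
def sample_variance(values):
    """Unbiased sample variance (n - 1 in the denominator)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def cronbach_alpha(rows):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(rows[0])
    item_vars = [sample_variance([row[j] for row in rows]) for j in range(k)]
    total_var = sample_variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def tukey_nonadditivity(rows):
    """Return (SS_nonadditivity, SS_remainder, df_remainder).

    Rows are persons, columns are items. Under additivity,
    F = SS_nonadditivity / (SS_remainder / df_remainder) follows an
    F distribution with 1 and df_remainder degrees of freedom.
    """
    n, k = len(rows), len(rows[0])
    grand = sum(sum(row) for row in rows) / (n * k)
    a = [sum(row) / k - grand for row in rows]                         # person effects
    b = [sum(row[j] for row in rows) / n - grand for j in range(k)]    # item effects
    ss_a = sum(x * x for x in a)
    ss_b = sum(x * x for x in b)
    cross = sum(a[i] * b[j] * rows[i][j] for i in range(n) for j in range(k))
    ss_nonadd = cross ** 2 / (ss_a * ss_b)
    ss_total = sum((rows[i][j] - grand) ** 2 for i in range(n) for j in range(k))
    ss_residual = ss_total - k * ss_a - n * ss_b  # person-by-item residual
    return ss_nonadd, ss_residual - ss_nonadd, (n - 1) * (k - 1) - 1


# Made-up ratings: 5 students by 3 items.
toy = [[0, 1, 1], [1, 1, 2], [2, 3, 3], [3, 2, 3], [0, 0, 1]]
ss_nonadd, ss_rem, df_rem = tukey_nonadditivity(toy)
print("alpha:", round(cronbach_alpha(toy), 3))
print("F(1, %d):" % df_rem, round(ss_nonadd / (ss_rem / df_rem), 3))
```

A large F relative to its reference distribution signals a multiplicative person-by-item interaction, which is the condition the power transformations described above were intended to remove.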
Confirmatory Factor Analysis (CFA)
A CFA was performed to test the viability of the a priori specified one-factor model with our sample of Spanish-speaking ELLs. As shown in Table 3, the one-factor model resulted in a marginally adequate fit to the data.
Fit Statistics of the CFAs.
Note. CFA = confirmatory factor analysis; CFI = comparative fit index; TLI = Tucker–Lewis index; RMSEA = root mean square error of approximation; SRMR = standardized root mean square residual.
p < .001.
However, the obtained root mean square error of approximation (RMSEA) value exceeded the range of acceptable values. As this fit statistic is a measure of model parsimony, the unacceptable value may be a result of the use of 20 items to measure a single construct. Although substantively relevant model modifications (e.g., specifying residual covariances) are routinely considered to improve model fit, we chose not to pursue them in this context. Adding residual correlations based on high modification indices would not improve the RMSEA value because specifying these parameters would make the model less parsimonious. We used a minimum cutoff value of 0.50 to evaluate the strength of the standardized factor loadings. All items met this criterion. Factor loadings for this model can be seen in Table 4.
Standardized Loadings of the 20 WMRS Items.
Note. WMRS = Working Memory Rating Scale. All loadings significant at p < .001.
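The role of parsimony in the RMSEA can be made concrete with a short sketch of one common formula for the index, RMSEA = sqrt(max(chi-square - df, 0) / (df × (N - 1))). Some software uses N rather than N - 1, so the study's software may differ slightly. The chi-square value below is hypothetical, because the fit table is not reproduced here; the degrees of freedom (170) assume a standard one-factor specification with 20 continuous indicators (210 unique variances and covariances minus 40 free parameters), and N = 459 is the sample size reported above.

```python
from math import sqrt


def rmsea(chi_square, df, n):
    """Root mean square error of approximation (N - 1 variant)."""
    return sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


N = 459                      # sample size reported in the text
DF_ONE_FACTOR = 170          # assumed: 20*21/2 moments - (20 loadings + 20 residuals)
HYPOTHETICAL_CHI_SQUARE = 948.6  # made-up value, NOT the study's result

print(round(rmsea(HYPOTHETICAL_CHI_SQUARE, DF_ONE_FACTOR, N), 3))
```

Because the misfit (chi-square minus df) is divided by the degrees of freedom, freeing additional parameters lowers both the numerator and the denominator, so the index does not necessarily improve when a model is made less parsimonious.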
Next, we tested a two-factor model because teachers may have been conflating behaviors intended to measure WM with behaviors that reflect language acquisition. We selected items whose wording included a linguistic component and specified a second factor consisting of these items. This model resulted in significantly worse fit to the data, as shown by the difference in the obtained chi-square values. In addition, all of the other fit indices resulted in unacceptable values. Thus, this model was rejected.
Validity
For this analysis, correlations were computed between the WMRS latent factor and the student cognitive and achievement measures, as well as the CTRS-R:S subscales (Conners, 2001), to establish convergent validity for our sample.
Latent variables for each of the cognitive and academic constructs were derived from the weighted z scores for each subtest (for more information, see Appendix A in Swanson et al., 2011). Correlations between the WMRS and the target constructs are reported in Table 5, as well as the reliability of each subtest.
Correlations Between the WMRS, Latent Cognitive Factors, and Academic Achievement Measures and Reliability Coefficients of Subtests.
Note. WMRS = Working Memory Rating Scale; E = English; S = Spanish; WRAT = Wide Range Achievement Test; PPVT = Peabody Picture Vocabulary Test; TVIP = Test de vocabulario en imágenes de Peabody.
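The general idea of deriving a construct score from weighted z scores can be sketched as follows. The actual weights are documented in Appendix A of Swanson et al. (2011) and are not reproduced here; the weights, subtest names, and raw scores below are hypothetical and serve only to show the arithmetic.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize raw scores to mean 0 and (sample) SD 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def weighted_composite(subtests, weights):
    """Weighted sum of z scores across subtests, one value per student.

    subtests: dict of subtest name -> raw scores (same student order)
    weights:  dict of subtest name -> weight (hypothetical values here)
    """
    standardized = {name: z_scores(raw) for name, raw in subtests.items()}
    n_students = len(next(iter(standardized.values())))
    return [
        sum(weights[name] * standardized[name][i] for name in standardized)
        for i in range(n_students)
    ]


# Made-up raw scores for four students on two STM subtests.
raw = {"forward_digit_span": [4, 5, 6, 7], "word_span": [2, 4, 3, 5]}
hypothetical_weights = {"forward_digit_span": 0.6, "word_span": 0.4}
print([round(c, 2) for c in weighted_composite(raw, hypothetical_weights)])
```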
Most of the correlations were statistically significant, but the point estimates revealed generally weak correlations between the WMRS latent factor and cognitive and academic latent factors. The negative correlations can be interpreted as follows: as children’s WM difficulties increased, their performance on measures of phonological processing in English decreased. Of the cognitive measures, the weakest correlations were with the latent factors measuring Spanish oral expressive vocabulary, English random generation, and Spanish random generation. Moreover, the relationship between Spanish random generation and the WMRS was nonsignificant. Although we would not refer to the following correlations as strong, the cognitive constructs that correlated most highly with the WMRS were English phonological processing, Spanish speed, and classroom inattention. It may be noted the classroom inattention factor was created from the CTRS-R:S, which also relies on teacher observation of student behavior. Critically, the WMRS was weakly correlated with the latent measures of English and Spanish STM and WM.
Of the academic measures, the WMRS was most weakly correlated with tests of fluid intelligence (i.e., Raven), math, and receptive vocabulary in both English and Spanish. Similar to the results of the cognitive measures, the nonsignificant correlation occurred with a test administered in Spanish, namely, the TVIP. Although all correlations with academic measures were weak, the tests with the strongest (relative to the other) correlation coefficients were passage comprehension in both English and Spanish and word identification in English.
Discussion
The purpose of this study was to determine the psychometric properties of the WMRS for young Spanish-speaking ELLs in the United States. The CFA suggested that a one-factor model provided an acceptable fit to the data. Convergent validity for the WMRS, as related to actual ELL task performance, was not adequately established. While convergent validity was established with the behavioral rating measure (CTRS-R:S; Conners, 2001), only a weak relationship was found between the latent WMRS factor and all memory constructs. A weak relationship was also found between the WMRS and general achievement, vocabulary, phonological processing, and fluid intelligence measures. Relationships between the WMRS and Spanish Random Generation and Spanish Receptive Vocabulary (TVIP) lacked statistical significance. The majority of the evidence suggests the WMRS lacks sufficient discriminant validity to measure WM abilities for children who are ELLs.
Subsequent score interpretation on the WMRS for ELLs suggests administration and use of the scale may be under threat from construct irrelevant variance. Although the purpose of the scale shows considerable potential in the K-12 context, given the diverse and inclusive populations in today’s K-12 classrooms, the scores resulting from this scale may be confounded by other student characteristics. In other words, there is not sufficient evidence to claim the resulting score an ELL receives on the WMRS is due to the observation of the WM abilities, but instead may be related to overt behaviors (e.g., inattention), speed (e.g., automaticity), or academic reading skill (e.g., phonological processing, letter word identification, and passage comprehension).
The correlations between the WMRS factor and the achievement and cognitive measures may also reflect a methodological artifact: halo effects. Although strong inter-item correlations may reflect high reliability, as seen in the item correlations within the WMRS, the high correlations may also point to halo effects (Feeley, 2002). Also, the possibility of interactions between items, as evidenced by the significant additivity test, suggests scores on one or more WMRS items are influencing scores on one or more of the other WMRS items. Because Cronbach’s alpha shares the same assumptions as the ANOVA, or the general linear model family, linear distribution of error is assumed. However, our results suggest the random error associated with the data was not linearly distributed.
When calculating the alpha coefficient for the WMRS, we expect the distribution of error to be random. However, our results suggest measurement error in the WMRS ratings was not randomly distributed within the sample. This violation of the homogeneity of variance assumption indicates the error associated with measuring the latent WM construct is no longer random but systematic; that is, the variation in the error is associated with the items, the raters, or the subjects being rated. Multiplicative interaction among the items may be responsible (e.g., content from one item could be influencing the response to other items). This is problematic: content from one item should not influence how content for another item is interpreted; otherwise, the error associated with each item can no longer be consistently estimated. Also, raters are assumed to be interchangeable; however, if raters rate their subjects subjectively rather than objectively, the homogeneity of variance assumption would be violated. In other words, rater differences could create between-group or within-group differences.
When rating students’ WM abilities in the classroom, the tool should present items and answers that reliably portray construct-relevant diagnostic characteristics. It is possible that the items on the WMRS portray diagnostic characteristics that are not consistently interpreted for ELLs. Our data suggest the items describe behaviors that do not adequately distinguish the WM construct from linguistic characteristics or overall academic performance for young ELLs and future research is necessary.
Directions for Future Research
Further research is warranted with the WMRS and other ELL populations, as these results suggest the internal consistency of the scale may be in question. In addition, the behaviors measured on the WMRS are moderately to strongly correlated with inattention, whereas ELLs’ reading skill (or lack thereof, as suggested by the negative correlation) is weakly to moderately associated with the overt behaviors measured on the WMRS. Related areas of research include the appropriate use of teacher report as a valid and reliable indicator of WM ability for children who are ELLs.
Clear directions for future research include extending the validity research agenda to classroom-based tools for diverse student populations, as well as the impact of nesting when these classroom screeners are used for the entire classroom evaluation. Although there is ongoing validity research for the high-stakes assessments, these low-stakes assessments have potential to be added to teacher and educational teams’ decision-making process for educational programs (i.e., general education, special education). Due to WM’s increasing popularity in the educational system as a diagnostic factor for learning difficulties, extending validity research in this direction is critical to eliminate potential bias against ELLs while minimizing threats to construct validity.
Footnotes
Authors’ Note
The views expressed in this article are those of the authors and do not reflect the views of the funding agency.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study is funded by United States Department of Education, Cognition and Student Learning Program, Institute of Education Sciences Grant R324A090092 awarded to H. Lee Swanson (PI).
