Abstract
Tutoring is commonly employed to prevent early reading failure, and evidence suggests that it can have a positive effect. This article presents findings from a large-scale (n = 734) randomized controlled trial evaluation of the effect of Time to Read, a volunteer tutoring program aimed at children aged 8 to 9 years, on reading comprehension, self-esteem, locus of control, enjoyment of learning, and future aspirations. The study found that the program had only a relatively small effect on children’s aspirations (effect size +0.17, 95% confidence interval [0.015, 0.328]) and no effect on the other outcomes. It is suggested that this lack of evidence may be due to misspecification of the program logic model and of the outcomes identified, as well as to program-related factors, particularly the low dosage of the program.
More recently, a meta-analysis by Ritter, Barnett, Denny, and Albin (2009), which included only trials conducted in the United States, reported that volunteer tutoring programs improved overall reading by approximately one third of a standard deviation. Like the Elbaum et al. (2000) meta-analysis, this review found that specific reading subskills, including oral fluency and writing, also improved significantly, with effect sizes ranging between 0.26 and 0.45. Similarly, Slavin, Lake, Davis, and Madden (2009) looked at the effectiveness of a number of alternative approaches, including volunteer tutoring, on reading outcomes for struggling readers. This review incorporated 96 studies from around the world, including the United States, New Zealand, and the United Kingdom, and included both randomized trials and well-matched controlled evaluations. It concluded that tutoring can be an effective way of improving reading outcomes for young children, including (among others) passage comprehension, phonemic decoding, and oral fluency.
Within this overall picture, some evidence exists that effects are mediated by tutor characteristics and the nature of the program. In relation to the former, the qualification level of the tutor has been shown to be significantly associated with variations in effect sizes. Teachers and college students tend to be more effective as tutors than paraprofessionals or volunteers (Elbaum et al., 2000; Slavin et al., 2009), and tutors tend to work best under the supervision of qualified teachers or reading specialists (Elbaum et al., 2000). In relation to program characteristics, there is considerable variability in terms of scope, design, and effectiveness between different programs, and many programs fail to articulate the theory of reading (or change) that underlies their program (Shanahan, 1998; Wasik & Slavin, 1993). Based on available evidence, the most effective tutoring programs tend to be those that are underpinned by a comprehensive theory of reading; address several aspects of the reading process through comprehensive approaches to teaching; have structured content; are delivered sufficiently frequently; and have a strong emphasis on phonics (Slavin et al., 2009; Wasik & Slavin, 1993).
One of the potential limitations of this current evidence base, however, is the small-scale nature of most of the studies conducted in this area. Of the tutoring trials included in the reviews cited above, only four had a sample size greater than 250 (Compton, 1992; Curry, Griffith, & Williams, 1995; Morrow-Howell, Jonson-Reid, McCrary, Lee, & Spitznagel, 2009; Ritter, 2000). This, in turn, raises the potential for publication bias whereby trials with small samples and large, positive effect sizes are overrepresented in the literature (Slavin & Smith, 2009). As Slavin and Smith (2009) contended, a small effect size of, for example, +0.15 is unlikely to emerge as statistically significant in a trial with a small sample. The small size of the sample will mean that the trial is likely to be underpowered and thus the reader cannot have confidence that the effect observed is a “true” effect of the intervention. However, the same effect size observed in a large trial is likely to be statistically significant. In this scenario, the study is not underpowered and so the reader can be confident that this effect is likely to be a real effect of the intervention, and as such, the study is more likely to be published. This means that trials of interventions that achieve only small effect sizes, or that show no evidence of effects at all, are only going to make their way into the literature if the sample sizes are large. Given that the majority of sample sizes in evaluations of educational programs are small, it is therefore unsurprising that trials with small or negative effect sizes tend to be underrepresented and only trials achieving large positive effect sizes are reported, published, and included in systematic reviews.
In developing this argument further, Slavin and Smith (2009) suggested that the tendency for small trials, which are often pilot studies, to achieve large effect sizes may be the result of three factors. The first factor is the use of less robust methodologies that are at high risk of bias and may lead to artificially inflated effects. For example, non-randomized controlled designs may result in the most motivated teachers volunteering to deliver the program or perhaps children most in need being allocated to the intervention group, and children perceived to be less in need allocated to the control group. The second factor relates to “super-realization,” which occurs when the experimenters monitor and modify the quality of program implementation, making the conditions of the study unrealistic and thus not replicable on a large scale. The third factor is the use of treatment inherent measures that closely match the intervention being evaluated. Using data from two systematic reviews, Slavin and Smith correlated sample size with pre-test adjusted effect sizes. A significant negative correlation emerged with small sample sizes (n < 50) having an average effect size of +0.44 and large samples (n > 2000) an average effect size of just +0.09. The results also suggest that effect sizes are more reliable and replicable in larger sample sizes. Slavin and Smith (2009) concluded that although randomization is undoubtedly an important feature of robust evaluative methodologies, sample sizes are also important and indeed large well-controlled trials provide more conclusive evidence than small, randomized trials and should be given greater weight in program reviews.
It is with these arguments in mind that this article reports the findings of a large randomized controlled trial (n = 734) of the Time to Read volunteer tutoring program. The aim of this evaluation was to test the logic model for the program and the associated hypotheses that the Time to Read program improves children’s self-esteem, locus of control, enjoyment of education, reading ability, and aspirations for the future. The article begins with an outline of the Time to Read program and the logic model that was suggested by the program developers to explain how it was hypothesized to work. The methodology for the present trial and the results are then reported before the article concludes with a discussion of the findings and their implications for understanding the role that volunteer tutoring programs can play in helping struggling young readers.
The Program
The Time to Read program was developed in 1999 by the charitable organization Business in the Community, a unique movement of more than 700 member companies in the United Kingdom and Ireland. It helps companies to address their responsibilities to society in the environment, the workplace, and the community, and it also assists small firms to boost the local economy. To help companies demonstrate their commitment to having a positive effect on society, Business in the Community has developed a number of campaigns and programs that support and engage companies across Northern Ireland in addressing these responsibilities.
Time to Read is one of a suite of tutoring programs that the organization has developed, and it is currently delivered in more than 130 primary schools in Northern Ireland, supported by close to 120 companies providing up to 500 business volunteers.
The Time to Read program is aimed at children who are aged 8 to 9 years old and are considered to be struggling readers. Volunteers are recruited through local companies and matched with primary schools taking part in the program. Tutors undergo a half-day training course in paired reading strategies designed to improve reading fluency, word recognition, meaning, and comprehension. These strategies are designed to provide additional opportunities for practice in reading simple, familiar, high interest materials and include the following:
Repetition: the tutor reads a passage out loud and the child reads the same passage again (and vice versa);
Alternate reading: the tutor reads the first passage, the child reads the following passage, and they both read the next passage together;
Word recognition: if the child stumbles over a word, the tutor stops, sounds out the word, and asks the child to repeat it a number of times;
Word meaning: exploring the meaning of new words by looking them up in the dictionary and putting each word into sentences to demonstrate how it can be used in different contexts; and
Comprehension: the tutor talks to the child about the reading material, asking questions to explore the child’s understanding, stops at various points to ask the child what might happen next, and reviews the book once it has been finished to see if the child enjoyed it and whether he or she would recommend it to a friend.
Following the training, each tutor is paired with two children who have been identified as struggling readers and spends 30 minutes each week working one-to-one with each child. The tutoring sessions take place outside the classroom setting in a separate room. Each school taking part in the program is supplied with books from which the tutor and child can choose for their session, although children are also free to choose their own books if they so wish. Tutors are encouraged to take their tutees on a “workplace visit” to their company as part of the program.
The program itself was informed by a wider logic model that was developed by Business in the Community with external professional support. The logic model is outlined in Figure 1. It hypothesizes that weekly one-to-one 30-minute tutoring sessions, together with the relationship that children build with their tutor, lead to raised self-esteem and greater enjoyment of learning, which in turn promote improved reading skills (comprehension) and raised aspirations for the future. It is further suggested that improvements in these outcomes lead to improved economic viability in adulthood. This logic model, in turn, provided the terms of reference for the present evaluation, and the present authors were commissioned as an independent research team to evaluate the effectiveness of the Time to Read program against this hypothesized logic model.

Figure 1. The logic model underpinning the Time to Read program.
Method
The evaluation of the Time to Read program consisted of a randomized controlled trial that ran between September 2006 and June 2008. Ethical approval for the study was granted by the Research Ethics Committee of the School of Education at Queen’s University Belfast.
Participants
Primary schools that met the following criteria were invited to participate in the trial:
Schools that had never taken part in Time to Read previously
Schools that were located in geographical areas in which there would be a sufficient concentration of business volunteers to recruit as tutors
Schools that were sufficiently large to have one teacher per class
An equal number of schools from each of the Education and Library Boards (school districts) in Northern Ireland
The application of these criteria to all primary schools in Northern Ireland resulted in a list of 200 schools. These 200 schools were written to and the first 50 schools to respond to the invitation to participate were recruited into the study. To determine which children were eligible to participate in the evaluation, all consenting children aged 8 to 9 years in all 50 schools were tested at baseline (pre-intervention) on the outcomes listed in Table 1. These outcomes will be described in more detail later in the article. For this first sweep of baseline data collection, an opt-out system was used to secure parental consent whereby parents were sent a letter explaining the purpose of the survey and were asked to complete and return a tear-off slip if they did not wish their child to take part in the survey. For those children who remained, their direct informed consent was also sought prior to them completing the survey. However, none of the children declined to participate in the survey.
Table 1. Outcome Variables and Measures
The following eligibility criteria were applied to all children in this initial cohort for inclusion in the evaluation: Children were considered eligible if they scored below average (namely, between the 10th and 50th percentile) on the NFER Group Reading Test II. Children were excluded if they had a statement of special educational need.
Randomization
Based on the reading scores obtained in the baseline data sweep, a list of children who met the eligibility criteria was compiled for each participating school. There was capacity to put approximately four tutors into each school. Therefore, approximately eight of the eligible children in each school (the exact number depended on the actual number of available tutors) were randomly allocated to the intervention group, and the remaining eligible children formed the control group. The research team used the random selection function in the Statistical Package for the Social Sciences (IBM SPSS) to randomly and blindly assign the required number of children (approximately eight) to the intervention group, for each school separately. Because of the number of data sweeps required, a parental opt-in system was used at this stage, with children being included in the evaluation only once their parents had signed and returned a form providing permission for this. As before, the children’s informed consent was then obtained and, in this case, none of the children in either the control or intervention groups declined to participate in the evaluation.
Unfortunately, and because of the pressure of time, it was necessary to undertake the random allocation process prior to parental consent being obtained for the children to take part in the full evaluation. This raises the possibility that parents of children allocated to the control group may have been less inclined to take part in the trial because they perceived no benefit to being in the control group. The potential effect of this on subsequent dropout rates is discussed below.
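The per-school allocation described above can be illustrated with a short sketch. The trial itself used the random selection function in SPSS; the Python below is a stand-in for illustration only, and the function name, child identifiers, and seed are hypothetical. It assumes two places per available tutor, in line with the pairing of each tutor with two children.

```python
import random


def allocate_school(eligible_ids, n_tutors, seed=None):
    """Randomly pick children for the intervention group within one school.

    Each tutor works with two children, so 2 * n_tutors places are filled.
    The remaining eligible children form the control group.
    """
    rng = random.Random(seed)
    n_places = min(2 * n_tutors, len(eligible_ids))
    intervention = set(rng.sample(list(eligible_ids), n_places))
    control = [c for c in eligible_ids if c not in intervention]
    return sorted(intervention), control


# Hypothetical school: 17 eligible children and 4 available tutors.
eligible = [f"child_{i:02d}" for i in range(17)]
intervention, control = allocate_school(eligible, n_tutors=4, seed=2006)
print(len(intervention), len(control))  # 8 9
```

Running the allocation separately for each school, as the trial did, keeps the number of intervention places tied to the number of tutors available in that school.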
Children who were allocated to the intervention group received the Time to Read program for 2 academic years. Children allocated to the control group did not receive the program and continued to receive usual classroom instruction. Some efforts were made by the program developers to standardize program implementation; all tutors received the same training, participating schools were supplied with the same set of Time to Read books that could be used during the tutoring sessions, and tutors were required to keep a log of each tutoring session with each child. However, in reality, variation in delivery did arise: Tutors and children could choose their own reading material (it did not have to be a book from the Time to Read set), the number of tutoring sessions received by each child on the intervention varied considerably, and some children received a workplace visit whereas others did not.
Outcomes and Measures
In accordance with the immediate and short-term outcomes depicted in the logic model above, the outcome variables measured for each child taking part in the trial included self-esteem, locus of control, enjoyment of learning, reading ability, and aspirations for the future. The long-term outcome “economic viability” is something that could be addressed in the future but was not within the remit of this trial.
Self-esteem was measured using the Global Self-Worth Scale of Harter’s Self-Perception Profile for Children (Harter, 1985), which consists of six items that are scored on a 4-point Likert-type scale. For each item, children consider two contrasting statements and choose the statement they think describes them the best. Then they indicate whether this statement is “sort of true for me” or “really true for me.” Locus of control refers to the extent to which people believe that they are in control of the things that happen to them. It was measured alongside self-esteem in order to provide an additional measure of self-concept that might be able to differentiate children better than self-esteem sometimes can for this age group. Rotter’s Locus of Control Scale (Rotter, 1966) was used for this purpose and consists of six items that are scored on a 4-point Likert-type scale ranging from strongly agree to strongly disagree.
Enjoyment of education was measured using the “liking school” subscale of Pell and Jarvis’s (2001) attitudinal scale. It is made up of nine items, which assess how much the child enjoys different aspects of school, and is rated on a 5-point Likert-type scale from really don’t like it to like it a lot. The NFER Group Reading Test II (Cornwall, France, & Hagues, 1997) is a group-administered measure of reading comprehension that was used to assess reading skills. It consists of 48 sentence completion items in which children have to select the appropriate word from a choice of five alternatives to complete the sentence. Finally, aspirations for the future was measured using Loeber, Stouthamer-Loeber, Van Kammen, and Farrington’s (1991) scale, which consists of seven questions aimed at assessing how important the child considers (on a scale from not important at all to very important) such matters as having a well-paid job in the future, working hard to get ahead, and having a good reputation in the community. The scale also asks the children what job they would like to do when they leave school.
Table 1 provides details of the alpha coefficients that have been reported in other studies for these measures, alongside the alpha coefficients achieved in the current sample. It can be seen that the measures of self-esteem and enjoyment of education achieved better internal reliability in the current sample compared to other studies, whereas the measures of locus of control and aspirations for the future did not perform as well compared to other studies.
Data Collection
The outcome measures were group administered in a classroom setting by a trained researcher who read each question aloud as the children recorded their responses in individual answer booklets. In addition to these core outcome measures, the following background data were collected on each of the children: age, gender, religion, and postcode (a proxy measure for socioeconomic status). The following background data were also collected on the tutor: age, gender, religion, educational attainment, occupation, and postcode. Although it was not possible to blind participants to group allocation, the outcomes assessors were not privy to the allocation of the children.
All children (in both the intervention and control groups) were tested on the outcome measures six times over the period of the evaluation (September 2006, February 2007, June 2007, September 2007, February 2008, and June 2008). Locus of control was the exception to this as it was introduced as an additional measure in the third data sweep and was included in every subsequent data sweep. Due to the availability of tutors, children in the intervention group started on the program between September 2006 and April 2007 and continued with the program until June 2008 when the evaluation ended. All 734 children were included in at least two data sweeps, however, and the use of hierarchical linear modeling can deal with missing data at this level without the need to impute scores.
Tutors were also asked to complete a log of their activities for the duration of the tutoring program, and a record was kept of each child’s duration on the program and the number of tutoring sessions received. This provided information to help monitor the fidelity of program implementation.
Power
Taking into account any design effect arising from the clustered nature of the data, the study had 80% power to detect an effect size of at least .25 at the .05 level of significance for all outcomes.
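To give a sense of how such a power statement can be checked, the sketch below computes the minimum detectable effect size for a two-arm trial using a normal approximation. The text here does not state the design effect that was applied, so the design effect values used below are hypothetical, and the function name is illustrative rather than the authors' procedure.

```python
from statistics import NormalDist


def minimum_detectable_effect(n_per_group, design_effect=1.0,
                              alpha=0.05, power=0.80):
    """Smallest standardized mean difference detectable in a two-arm trial.

    Uses a two-sided test, equal group sizes, and a normal approximation.
    Clustering is handled by dividing the sample size by the design effect.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    effective_n = n_per_group / design_effect
    return (z_alpha + z_power) * (2 / effective_n) ** 0.5


# 734 children in total, so roughly 367 per arm.
for deff in (1.0, 1.4):  # hypothetical design effects
    print(deff, round(minimum_detectable_effect(367, deff), 3))
```

Under these assumptions, a sample of this size remains able to detect an effect of .25 so long as the design effect stays below roughly 1.46.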
Statistical Methods
The main analysis used three-level hierarchical linear regression models with repeated measures nested within children within schools. Including the baseline data sweep, there were six sweeps of data collection in total. However, not all children were included in all six data sweeps because of their different starting dates and also because some were absent from school when specific data sweeps were undertaken. The models are described in Table 4.
Results
Participant Flow Through the Study
Overall, 2,115 children in the 50 participating schools were assessed for eligibility to take part in the evaluation. As can be seen from Figure 2, 1,272 children were excluded for not meeting the inclusion criteria, which resulted in 843 children being randomized. Parental consent was secured for 87.1% of these children to take part in the full evaluation. This, in turn, reflected a 6.5% dropout rate for those children allocated to the intervention group and an 18.3% dropout rate for those allocated to the control group. The final numbers of children participating in the evaluation and included in the analysis were therefore 360 in the intervention group and 374 in the control group (total n = 734).

Figure 2. Flow diagram of the selection and allocation of participants.
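The participant flow figures can be cross-checked with simple arithmetic, as sketched below. The numbers originally allocated to each arm are not stated in the text; the values computed here are inferred from the reported final group sizes and dropout rates, and the variable names are illustrative.

```python
assessed = 2115
excluded = 1272
randomized = assessed - excluded  # children entering randomization

analysed = {"intervention": 360, "control": 374}
dropout_rate = {"intervention": 0.065, "control": 0.183}

total_analysed = sum(analysed.values())
consent_pct = 100 * total_analysed / randomized

# Inferred allocation per arm: final n divided by the retained proportion.
implied_allocated = {
    arm: round(n / (1 - dropout_rate[arm])) for arm, n in analysed.items()
}

print(randomized, total_analysed, round(consent_pct, 1))
print(implied_allocated, sum(implied_allocated.values()))
```

The inferred allocations sum to the number randomized, which suggests that the reported dropout rates and final group sizes are mutually consistent.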
There is always the possibility that these dropout rates following random allocation may have introduced an element of bias into the research design. However, through the use of pre-test scores and a repeated measures design that included up to six sweeps of data collection for each child, any differences between the intervention and control groups that may have arisen through these dropout rates will have been largely accounted for in the analysis. Furthermore, an additional analysis, using a variation on the “intention to treat” method, was undertaken to assess whether these dropouts may have had any notable effects on the main findings. Although the findings from this additional analysis need to be treated with some caution, they did tend to corroborate the findings from the main analysis described below.
The “intention to treat” analysis involved keeping all 843 children who were originally randomized in the analysis. The problem with including the 109 children who dropped out of the study was that it was not possible to collect any follow-up data on them. It was therefore necessary to impute scores for these children for the follow-up data sweeps, and this was done simply on a “last value carried forward” basis. Such imputation introduces its own problems, as it results in these 109 children showing no change at all from their original baseline scores on any of the outcome measures, whereas it can be assumed that such scores will have changed naturally over the intervention period due to maturation and the general effects of attending school. The results of the “intention to treat” analysis should therefore be treated with some caution and have been used here simply to check that no notable differences were evident between these and the findings from the main analysis.
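The “last value carried forward” rule can be illustrated with a minimal sketch. The function name and the example scores are hypothetical; the code only shows the general technique and is not the authors' implementation.

```python
def last_value_carried_forward(scores):
    """Fill missing sweeps (None) with the most recent observed score.

    Sweeps that are missing before any score has been observed stay None.
    """
    filled = []
    last = None
    for score in scores:
        if score is not None:
            last = score
        filled.append(last)
    return filled


# Hypothetical child tested at baseline only, across six sweeps.
print(last_value_carried_forward([21, None, None, None, None, None]))
# [21, 21, 21, 21, 21, 21]
```

The example makes the limitation noted above concrete: a child with baseline data only is treated as showing no change across all later sweeps.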
Dosage
Dosage was defined as the number of tutoring sessions received by children in the intervention group. Of those children in the intervention group for whom data were available (70.3%), the mean number of tutoring sessions received was 25.0 (SD = 12.9). This equated to a child on average receiving 12.5 hours of direct tutoring in total during the child’s involvement in the program. The average duration of the program for each child was 14.13 months (SD = 3.49).
Sample Characteristics
The mean age of the children involved was 8 years 9 months. Fifty-seven percent were male and 43% were female. The religious breakdown of the sample and of the Northern Ireland primary school population in 2006–2007 (the academic year in which the sample was selected) is shown in Table 2. As can be seen, the religious breakdown of the sample is similar to that of the Northern Ireland primary school population; however, in the sample there is an overrepresentation of Catholic children and an underrepresentation of children for whom religion is not recorded.
Table 2. Breakdown of the Sample by Religion (child level)
Note. Since percentages are rounded to the nearest integer, they may not always sum to 100.
Table 3 illustrates the key characteristics of the sample of 50 schools taking part in the trial and how they matched the overall population of schools in Northern Ireland in 2006. There were no statistically significant differences between the sample and the population in terms of percentage Free School Meal Entitlement (%FSM) or type of school (for %FSM: p = .160, chi-square = 3.670, df = 2; for school type: p = .475, chi-square = 1.490, df = 2). The sample did, however, differ significantly from the population in terms of the proportions of schools from each of the education and library board areas (p < .0005, chi-square = 21.315, df = 4). As can be seen, there was an overrepresentation of schools in the sample from the Belfast and South Eastern Boards.
Table 3. Breakdown of the Sample by Percentage Eligible for Free School Meals, Type of School, and Education and Library Board Area (school level)
Note. Since percentages are rounded to the nearest integer, they may not always sum to 100.
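The p values reported for the school-level chi-square tests can be reproduced from the test statistics and degrees of freedom. The sketch below uses the closed-form upper-tail probability of the chi-square distribution, which holds only for even degrees of freedom (sufficient here, since df is 2 or 4). The function name is illustrative.

```python
from math import exp, factorial


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square statistic for even df.

    For df = 2k: P(X > x) = exp(-x/2) * sum_{i<k} (x/2)^i / i!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive, even df")
    half = x / 2
    return exp(-half) * sum(half ** i / factorial(i) for i in range(df // 2))


print(round(chi2_sf_even_df(3.670, 2), 3))   # %FSM
print(round(chi2_sf_even_df(1.490, 2), 3))   # school type
print(chi2_sf_even_df(21.315, 4) < 0.0005)   # board area
```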
Main Analysis
As can be seen from Table 4, the trial found evidence that Time to Read had a positive effect in terms of increasing children’s overall future aspirations (equivalent to d = +.17 over a 2-year period). However, the trial found no evidence of any effects associated with Time to Read in relation to the four remaining outcomes identified through the logic model (the children’s general levels of self-esteem, locus of control, enjoyment of education, and reading skills). The details of the hierarchical linear models estimated in the main analysis are provided in Table 5.
Table 4. Summary of the Findings of the Main Analysis of the Effects of Time to Read
At pre-test, there were no significant differences between the two groups for four of the five outcomes, the one exception being reading ability (t = 2.49, df = 731, p < .05). However, since the pre-test scores were included in the analysis, any such differences have been controlled for.
For the full, 2-year program, there were six data collection time points and thus five periods between these. The effect sizes estimated here (Cohen’s d) are thus calculated by multiplying the change by sweep by five and dividing by the pooled standard deviation for the sample as a whole at the first data collection point (Sweep 1). The effect size for the “future aspirations” outcome is therefore (.08 × 5) / 2.38 = +.17.
Locus of control was measured for the first time at Sweep 3.
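The effect size calculation described in the notes above can be written out as a short worked example. The figures are those given in the text for the “future aspirations” outcome; the function and parameter names are illustrative.

```python
def effect_size(change_per_sweep, n_periods, pooled_sd):
    """Cohen's d over the full program.

    Multiplies the per-sweep change by the number of periods between
    sweeps and divides by the pooled standard deviation at Sweep 1.
    """
    return change_per_sweep * n_periods / pooled_sd


# Six data collection points give five periods between them.
d = effect_size(change_per_sweep=0.08, n_periods=5, pooled_sd=2.38)
print(round(d, 2))  # 0.17
```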
Table 5. Details of the Multilevel Models Used in the Main Analysis
p = .032.
p = .144.
p = .542.
p = .700.
p = .983.
In Table 5, and by way of interpretation, the coefficient β1 represents the average overall increase in the outcome measure as a child passes from one sweep to the next, not including any increase specifically due to the intervention. The coefficient β2 represents the overall difference between the intervention and control group, not including any increase specifically due to the intervention. Finally, the coefficient β3 is the one of primary interest and represents the average difference in the outcome measure “score” between the children in the intervention and the control group as they pass from one sweep to the next, controlling for any initial differences between the intervention and control groups.
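The text describes the coefficients but does not print the model equation. As a reading aid, one specification consistent with that description is sketched below. The random-intercept structure and the symbols $v_j$, $u_{ij}$, and $e_{tij}$ are assumptions rather than the authors' stated model, and the way in which pre-test scores enter the model is not shown.

```latex
% y_{tij}: outcome score at sweep t for child i in school j
% Group_{ij} = 1 for the intervention group, 0 for the control group
y_{tij} = \beta_0
        + \beta_1\,\mathrm{Sweep}_{tij}
        + \beta_2\,\mathrm{Group}_{ij}
        + \beta_3\,(\mathrm{Sweep}_{tij} \times \mathrm{Group}_{ij})
        + v_j + u_{ij} + e_{tij}
% v_j: school-level random effect (assumed)
% u_{ij}: child-level random effect (assumed)
% e_{tij}: residual at the repeated-measures level
```

In this form, $\beta_3$ is the additional change per sweep for children in the intervention group over and above the change per sweep ($\beta_1$) seen in the control group.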
In addition to these main models and the primary analysis reported above, further exploratory sub-analyses were undertaken to examine whether the tutoring program had differential effects for particular subgroups of children. These sub-analyses were pre-specified prior to data analysis and consisted of testing for any evidence of differential effects between boys and girls; children with high and low self-esteem; children with different initial reading levels; and children from areas of high and low deprivation. No significant findings emerged from this exploratory analysis. The data were also analyzed to assess whether the outcomes achieved were significantly associated with either the dosage received or the characteristics of the tutors and their tutees. No evidence was found of any associations in relation to these two factors for any of the five outcomes.
Discussion
It will be recalled that the program’s logic model hypothesizes that through regular meetings with a tutor, the child’s self-esteem is enhanced, they start to enjoy learning, and this in turn leads to improved reading skills and raised aspirations for the future. The evidence from this randomized controlled trial suggests that Time to Read has had a small positive effect on the children’s future aspirations (effect size of +0.17) but no effect in relation to the remaining four outcomes identified. In particular, there is no evidence of any effect on comprehension, which was the core reading outcome measured. Overall, therefore, the findings do not provide any support for the logic model depicted in Figure 1; indeed, the one independent effect (increased aspirations) undermines the model. In considering these findings, there are two main points to draw out in relation to flawed program theory and inadequate program implementation, and these are considered in turn.
Program Theory
We consider flawed program theory to be the primary reason that the trial was unable to show an effect of Time to Read. This conclusion is substantiated by the lack of any association found between dosage and the effectiveness of the program in relation to the five outcomes specified through the program theory. In other words, however much exposure to the program a child received, there was no evidence to suggest that he or she was more likely to show improvements in relation to the outcomes identified through the existing program theory.
To better understand what effects Time to Read is likely to have on children’s reading skills, it is worth locating the program within the six stages of reading development proposed by Jeanne Chall (1983), who led the “great debate” over the benefits and value of code-based (phonics) versus meaning-based (whole word) reading instruction (Chall, 1967, 1989; Foorman, Francis, Novy, & Liberman, 1991; Xue & Meisels, 2004). Progress through Chall’s stages of reading development is not necessarily linear, nor does it occur at the same rate for everyone. However, the model provides a useful framework for understanding how reading skills might develop and how the Time to Read program might work in terms of providing support for those who are struggling to transition through the developmental stages.
Stage 0 is considered to be the pre-reading stage, which tends to occur between birth and the start of formal schooling. It represents the period of time during which children’s language develops. It is during this time that the experiences of preschool children and their exposure to books and print are deemed to be extremely important in the development of beginning reading and progression to Stage 1. Stage 1 is also known as the decoding stage and tends to occur when children are aged 6 to 7 years old. They start to match the printed word to the spoken word and begin to understand what the letters “do.” At this stage, children are very focused on the printed text rather than the meaning of the text.
When children are aged around 7 to 8 years, they tend to progress to Stage 2, during which they consolidate what they have previously learned rather than learn new information. Through reading familiar material and books, readers have the opportunity to focus on the printed word and practice using their newfound decoding skills. At this stage, the more reading practice the individual gets, the more fluent a reader he or she will become. During Stage 3, usually from ages 9 to 14 years, individuals acquire new knowledge through their reading and begin to “read to learn.” It is the time during which reading leads to new thoughts and ideas and becomes an additional source of information alongside listening and watching. According to Chall (1983), Stage 3 readers are “learning how to learn from reading, but essentially from only one point of view” (p. 22). As readers in Stage 3 progress, they are able to consider different points of view and become more analytical about what they are reading.
The knowledge and skills learned in Stage 3 enable the reader to move on to Stage 4 and help to develop the ability to grasp new ideas and theories. Stage 4 readers (ages 14 to 18) are able to process increasingly complex and nuanced materials, which contain multiple perspectives, concepts, and facts. The final and most mature stage is Stage 5. Readers at this level are capable of being selective in what they read, taking what they need from it, and forming their own ideas and knowledge as a consequence. It is thought, however, that not everyone reaches this stage. Chall also hypothesizes that there is interaction and movement between the stages; for example, an individual might read at Stage 5 for academic work and at Stage 2 for leisure reading.
The children who received the Time to Read program are struggling readers aged 8 to 9 years. As such, and according to Chall’s theory outlined above, they should be starting to make the transition from Stage 2 to Stage 3 reading. Successful transition to Stage 3 requires a lot of practice in reading familiar materials, and children who do not have enough opportunities to read at home or at school may get insufficient practice to make a successful transition. Chall suggests that such children would benefit a great deal from additional practice opportunities in school so that they are supported and helped to reach Stage 3 reading. This is precisely the focus of the Time to Read program and its associated activities as described earlier.
With this in mind, the program theory specified for Time to Read, as set out in the logic model, should have been more specific and have included additional reading outcomes, including fluency and decoding, rather than just reading comprehension. For the children who are participating in the Time to Read program, and who are thus struggling to make the transition from Stage 2 to Stage 3, Chall’s theory would suggest that the reading outcomes that are most likely to improve as children approach the end of Stage 2 are decoding and reading fluency. Unfortunately, neither of these reading skills was identified as an outcome to be measured in this trial; thus, it is not possible to know whether there was any improvement in these areas.
Interestingly, and by way of comparison, SMART (Start Making A Reader Today) is a similar low-intensity volunteer tutoring program, aimed at children aged 6 to 8 years and delivered for 30 minutes twice a week. A small evaluation of this program (n = 84; Baker, Gersten, & Keating, 2000) found that there were indeed improvements in reading fluency (Glass’s Δ = +.53), supporting the premise that additional practice in reading promotes progress through Stage 2, resulting in improved reading fluency. However, the authors also found statistically significant, if smaller, effects for reading comprehension (Glass’s Δ = +.32). This is consistent with findings from studies reviewed in the introduction, which reported medium to large effects in decoding and fluency but smaller effects in relation to reading comprehension. The findings from the SMART evaluation should be treated with some degree of caution, however, given that the sample size was small and the clustered nature of the data was not taken into account in the analysis, which in itself may have led to deflated standard errors and potentially misleading inferences.
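To illustrate why ignoring clustering can deflate standard errors, the standard design-effect approximation is sketched below. The symbols are assumptions introduced here for illustration and are not taken from the SMART evaluation: m is the average number of children per cluster (for example, per school or class), ρ is the intraclass correlation, and SE_srs is the standard error computed as if the children were independent.

```latex
% Design effect for equal-sized clusters (illustrative symbols, not from the source)
\mathrm{DEFF} = 1 + (m - 1)\,\rho,
\qquad
\mathrm{SE}_{\text{clustered}} \approx \mathrm{SE}_{\text{srs}} \sqrt{\mathrm{DEFF}}
```

With purely hypothetical values of m = 10 and ρ = 0.1, DEFF = 1.9, so the naive standard error would need to be multiplied by roughly 1.38. An effect that appears statistically significant under the naive analysis may therefore no longer be so once clustering is taken into account.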
Beyond the problems associated with the misspecification of the reading outcomes, a further issue in relation to the program theory for Time to Read is associated with the attitudinal outcomes that were also specified as part of the logic model, which perhaps should also be reconsidered. As outlined above, Time to Read is principally a reading enhancement program and, thus, there is no reason to expect to see significant changes in self-esteem, locus of control, and enjoyment of learning given that the tutoring sessions do not contain activities specifically aimed at improving these outcomes. Although there is some evidence to suggest that tutoring can lead to improved social outcomes such as interpersonal relationships and increased feelings of social status in some situations (Shanahan, 1998), there appears to be little direct association between such global, wider outcomes (as specified in the logic model) and reading (Marsh & O’Mara, 2008; Pullmann & Allik, 2008; Swann, Chang-Schneider, & McClarty, 2007; Valentine, DuBois, & Cooper, 2004). If it is to be hypothesized that the program has any effect on more social and attitudinal outcomes, then the evidence would suggest that this is more likely to be in a more limited and domain-specific way, for example, affecting reading or academic self-esteem as opposed to global self-esteem (Marsh & O’Mara, 2008; Swann et al., 2007).
Overall, therefore, in relation to the current program theory as depicted by the existing logic model (Figure 1), we would suggest that it is misspecified and is unlikely to capture any effects that the Time to Read program is likely to have had. Rather, the logic model needs to be revised to locate the program within a more specific theory of reading development along the lines proposed by Chall (1983). In doing this, the model would therefore need to focus on more specific and intermediary reading outcomes—most notably, decoding, reading rate, and reading fluency—alongside comprehension and also reassess the appropriateness of including secondary, attitudinal outcomes. In relation to the latter point, if secondary outcomes are to be specified, these will need to be theoretically grounded and more closely related to improvements in reading outcomes.
Program Implementation
Alongside the problems associated with program theory, the other key issue to arise from this present evaluation is in relation to program implementation. Although it is not possible to know for certain whether increasing the dose of Time to Read would result in improvements in reading outcomes, it is reasonable to speculate that the reason the SMART evaluation (Baker et al., 2000) found an effect on comprehension, whereas the Time to Read evaluation did not, may be that the dose of the SMART program was double that of Time to Read.
Furthermore, a research report by Ritter and Maynard (2008) examined why the randomized controlled trial of a similar tutoring program that they conducted in the late 1990s showed no measurable benefits. They concluded that, among other reasons, the dosage of the tutoring may have been too low (1 hour per week) and that school holidays and other absences further reduced the amount of the program to which the children were exposed. Given this, the intensity of the Time to Read tutoring should be reconsidered: the current average total contact time is just 12.5 hours per child over a typical period of 14 months.
Conclusion
Overall, the trial reported here is one of the largest ever conducted of a volunteer tutoring program. It has highlighted the importance of both program theory and implementation not just in relation to service design but also, importantly, in relation to the design and conduct of robust evaluations. Perhaps the key lesson to draw from this present evaluation is the need for tutoring programs to be grounded properly in a theory of change that not only defines more specifically the reading outcomes likely to be affected by the program but is also realistic in relation to the potentially wider effects that the program is likely to have on other social outcomes. If this had happened in relation to the present program, then the developers might have been more likely to appreciate that the primary outcomes for Time to Read should have been decoding and reading rate and fluency. This, in turn, might well also have increased their likelihood of recognizing the need for a higher level of dosage in the program to ensure that children gain sufficient experience and practice in these areas.
Beyond this, although it is clear that the effectiveness of volunteer tutoring programs is dependent on their quality and implementation, the magnitude of their effect may well also be more modest than previously thought. Slavin and Smith (2009) argued that larger studies (that also employ a robust methodology) are likely to provide a more accurate representation of the true effects of volunteer tutoring programs than the smaller, underpowered trials that until now have been the basis of the body of evidence supporting the effectiveness of volunteer tutoring. In fact, the only other trial conducted to date that is larger than this present evaluation of Time to Read was an evaluation of a similar volunteer tutoring program, Experience Corps (Morrow-Howell et al., 2009), which found an average effect size of +0.11 on reading achievement. These findings do at least raise the possibility of publication bias, which needs to be tested further. Time to Read, however, had a positive impact on children’s aspirations for the future (ES = +0.17), which may well be related to the optional workplace visit that is part of the program. This suggests that tutors may be able to change children’s future aspirations more readily than their reading comprehension. This is not to suggest that volunteer tutoring is ineffective in improving reading outcomes, just that the effect may not be as large as previously believed.
Footnotes
Acknowledgements
The authors would like to thank Ben Styles, formerly of the National Foundation for Educational Research (NFER), for his contribution to the analysis of the data.
Notes
Authors
SARAH MILLER is Deputy Director of the Centre for Effective Education, Queen’s University Belfast, 69-71 University Street, Belfast, Northern Ireland, BT7 1HL.
PAUL CONNOLLY is Professor of Education at the School of Education, Queen’s University Belfast, 69-71 University Street, Belfast, Northern Ireland, BT7 1HL.
