Abstract
This article argues that high-stakes educational testing, along with the attendant questions of power, education access, education management and social selection, cannot be considered in isolation from society at large. Thus, high-stakes testing practices bear numerous implications for democratic conditions in society. For decades, advocates of high-stakes educational testing have argued that testing would result in meritocracy, ensuring that everyone would be afforded the same opportunities to find success in adulthood. Examined from a critical perspective, however, testing also proves extremely well designed as a bulwark against opposing or alternative outlooks and opinions, because it is a complex tool requiring highly specialised knowledge to administer effectively. This article sets out to investigate the relation between high-stakes educational testing and democracy, drawing on the experiences of 20th-century high-stakes educational testing practices in the Danish history of education. Among other things, the article concludes that a combination of different evaluation technologies – some formative and some summative – might be the safest way to go from a democratic perspective.
Introduction: high-stakes testing in society
Testing, along with the attendant questions of power, education access, education management and social selection, is a tool for evaluating individuals and educational system performances that cannot be treated in isolation from society at large.
Indeed, it can be argued that societies to a certain extent are dependent on some form of a testing system to establish criteria of human worth. In his grand, classic survey Prestige, Class, and Mobility, sociologist Kaare Svalastoga (1959) has written that achievement factors rooted in the passing of some formal or informal test of achievement or ability are predominant markers among a society’s status determinants (p. 16ff). His work testifies that a dimension of educational testing serves a gatekeeping function for society. This dimension has sparked the term high-stakes testing, which can be defined as test results being directly linked to important rewards or sanctions for students, teachers or institutions (Madaus, 1988: 29f).
High-stakes testing, by its very definition, is the most extreme form of testing, for it results in the most direct, far-reaching set of consequences for the test taker. Thus, high-stakes testing bears great significance for human achievements, individual lives and educational practices alike.
In this respect, it is important to note that the term high-stakes testing covers examinations, evaluations and educational tests in many different incarnations. The common denominator is that an important result of some kind is at stake – either actual or perceived.
Svalastoga’s point suggests a close link between high-stakes educational testing and society at large. This makes it relevant to examine the nature of democracy connected with high-stakes testing in order to establish some central focus areas for a historical analysis of this connection. While it is clear that the purposes and impacts of high-stakes testing change diachronically, conducting such an analysis may help identify central characteristics of the relation between high-stakes testing and democracy, offering continued relevance for contemporary policies and debates.
Following the work of French sociologist Emile Durkheim (1938), one might posit that the study of the history of education should anticipate the future and understand the present (p. 13). While education access, as a purpose and impact of high-stakes testing, generally speaking was at the forefront in the early days of high-stakes testing, today high-stakes testing practices are intimately connected with accountability purposes. The point is, however, that historical high-stakes testing practices still served accountability purposes to some extent (Ydesen, 2013; Dorn, 2014 (forthcoming)) and contemporary high-stakes testing practices still bear an impact on education access, as demonstrated by, among others, Bourdieu (1996; Bourdieu and Passeron, 1970/1977) and Bernstein (1996; Au, 2008; Andreasen et al., 2013). Thus, I argue that it is only the configuration of purposes and impacts connected with high-stakes testing that changes diachronically; the palette of purposes and impacts remains stable, and this serves as an argument for exploring the relation between high-stakes testing and democracy using a historical approach.
High-stakes testing and democracy
Since democracy can hardly be defined unambiguously, and because research on educational testing – regardless of field or approach – is a political and scientific minefield, identifying clear-cut and generally agreed-on focus areas for understanding the interaction between high-stakes testing and democracy seems a lost cause. One reason is that educational testing and democracy hold an unavoidable political dimension, igniting disagreement on testing practices and their implications for society at large (see, for example, Garsdal and Ydesen, 2009). However, different perspectives and research in the field can provide and substantiate a number of key components central in the interplay between high-stakes testing and democratic conditions.
In 1903, John Dewey wrote about the importance of ‘freedom of mind’ in education (p. 194). In this respect, Dewey (1903) advocated the creation and growth of democratic citizens by educating them, and he frowned upon externally and centrally administered initiatives seeking to govern and control the education process: ‘To subject mind to an outside and ready-made material is a denial of the ideal of democracy, which roots itself ultimately in the principle of moral, self-directing individuality’ (p. 199). In most cases, contemporary high-stakes testing is precisely the notion against which Dewey railed. It becomes a centrally devised and administered evaluation tool that is counter-productive to advancing democracy in education, because it tends to stifle pupil individuality and delimit learning with its focus on educational goals dictated from the outside. Dewey’s critique has been echoed ever since (see, for example, Onosko, 2011).
Dewey’s argument was not limited to the effect on pupils’ learning processes, however; it was equally rooted in teachers’ professional and working conditions. Dewey (1903) held teachers’ influence and their share in educational power to be a keystone in securing democracy in education: ‘What does democracy mean save that the individual is to have a share in determining the conditions and the aims of his own work’ (p. 197).
Dewey’s perspective on democracy in education thus also raises the question of whose voice should be heard in educational management. Furthermore, since high-stakes testing functions as a central tool in school management, it is relevant to ask who is heard and who is able to participate in designing and implementing high-stakes educational testing practices.
Dewey’s ideas about democracy in education find staunch support in the writings of German philosopher Jürgen Habermas (1981/1984). One of Habermas’ main points from a normative democratic perspective is that everyone affected by a measure or an initiative should be heard, that is, be free to speak in a space without power structures and asymmetries. To expand on Dewey’s argument, this means that parents and even the public at large should be heard when it comes to designing and implementing high-stakes testing practices in education.
American educator and senior scholar Deborah Meier’s work on the American No Child Left Behind Act of 2001 also supports this line of thinking, based on a notion of democratic ideals. In her advocacy, Meier (2004) emphasises the importance of public engagement and parental involvement in democratic education (p. 67). This again touches upon the central issue in democracy–testing relations raised by Dewey: Who should be heard in the domain of education management? According to associate professor Carlo Ricci (2004), central authorities implement directions for education via testing (p. 342). This phenomenon has sparked the term teaching-to-the-test, a practice that makes it difficult for stakeholders to exercise direct influence on the educational practices in their community. Professor Elana Shohamy (2004) even argues that ‘[…] powerful uses of tests violate fundamental democratic values in that the tests are in the hands of central organisations that are capable of controlling and defining knowledge on their own terms and in accordance with their own agendas’ (p. 74). This highlights a focus on the distributions of power connected with testing in general and school autonomy in particular. According to Meier (2004), ‘Schools need to be governed in ways that honor the same intellectual and social skills we expect our children to master, and – ideally – in ways the young can see, hear, and respect’ (p. 78). Meier thus links the issues of school autonomy with the democratic teaching of pupils.
While the perspectives listed above clearly demonstrate critical voices raised against high-stakes testing and supportive of democratic conditions surrounding public education, advocates of high-stakes testing have argued that the transparency and comparability of school results generated by testing provide a democratic opportunity for parents to make informed decisions about school choice. This is at the heart of the counter-argument that high-stakes testing creates democratic accountability benefitting parents and thus ultimately their children (see, for example, Madaus et al., 2009: 2).
Another – and very classic – argument in favour of testing vis-a-vis democracy is that what we now call high-stakes testing is less unfair than evaluation types where teachers or societal elites play the role of arbitrary gatekeepers – at least standardised testing ensures that everyone is treated the same and given the same possibilities of success. This is the argument about the meritocratic and legal rights potential of standardised high-stakes testing (Boyd, 1930: 270; Brehony, 2004: 749).
A third argument in the positive camp says that high-stakes testing helps the establishment of educational standards, which will spark high-quality education and transparency to the benefit of democracy (see, for example, Levinson, 2011: 130). A variant of this type of argument is that the restriction of access improves the quality of the educational institution.
While these arguments reflect the positive view on democracy–testing relations, describing them as more or less symbiotic, they treat testing with a sole focus on the test results and ignore the complex endogenous processes leading up to the results. Another line of division in the debate pertains to the sense of democracy. Should democracy be institutionalised as an elite democracy or should democracy be practised as a participatory democracy or maybe even as a deliberative democracy (Held, 1987/1996; Hanberger, 2006)? At least to some extent, this might explain some of the roots of the intense disagreement on democracy–testing relations.
According to professor Sherman Dorn (2007), this equivocality of democracy–testing relations is in fact rooted in contradictions of a democratic society itself that issues calls for a system that both rewards merit and generates equality (p. xiv). Linking with Svalastoga’s findings mentioned above, that societies are dependent on some form of a testing system to establish criteria of human worth, Dorn’s insight leads to a focus on education access and the democratic credo of creating the same opportunities for all, as also emphasised by American essayist Peter Sacks (1997) in his treatment of meritocracy and standardised testing. The question closely related to democracy in this respect is this: which members of society will advance and which will not?
To clearly identify focus areas for the historical analysis of the implications of high-stakes testing on democratic conditions, it is fruitful to distinguish between macro, meso and micro perspectives, each associated with a type of democracy.
Society at large is associated with the macro level. Here belong questions related to educational in-groups and out-groups. This level naturally is associated with hegemonic notions of normality and deviance permeating an educational system. Looked at from a democratic perspective, this question is relevant, because high-stakes testing generates unequal outcomes, thus serving as a gatekeeper of opportunities in life. In contrast, one might argue that securing an equal starting point suffices to satisfy democratic principles. To cast light on whether the equal starting point presupposed by this argument actually exists, the analysis must focus on the skills required to do well in a given testing practice. But, taking into consideration the argument concerning the positive impact of standards, this focus should be combined with attention to educational quality linked with accountability and transparency through clear and established standards generated by the high-stakes testing practice.
This level can be associated with the notion of an elite democracy, meaning that citizens can control decision-makers by choosing among competing elites (Schumpeter, 1942). The connection between the defined macro level and the elite democracy is the question of who is able to enter the elite. Another key issue in this type of democracy is the presence of good conditions for accountability (Hanberger, 2006: 22). This aspect pertains to the issue of transparency, accountability and educational standards.
Next, the meso level aims to illuminate the democratic conditions of schools, as well as the roles teachers and parents play and how these relate to high-stakes testing. The main focus at this level is whether pupils, teachers and parents will be heard in the process of implementing and sustaining a high-stakes testing practice. The key components are the levels of school and teacher autonomy, the magnitude and strength of school and central authority influence on the testing practice used and the quality of information provided to these stakeholders – including parents – for making informed decisions.
The meso level can be associated with a participatory democracy where people’s participation is the most important quality. Citizens are empowered or delegated freedom of choice to decide what is feasible for them (Hanberger, 2006). In this context, schools, teachers and parents constitute the democratic agents.
Finally, the micro level centres on intimate pupil–test relations in general and room for pupil individuality and the possibilities of democratic teaching in particular. This includes a focus on the relations of testing with other evaluation technologies, such as teacher evaluations and the power of the stakes associated with the test practice used.
The type of democracy associated with this level is deliberative democracy, highlighting the importance of discussions among free and equal citizens (Hanberger, 2006). Although most educational and testing situations are characterised by power asymmetries, the main link between this type of democracy and the defined micro level is pupil individuality. Much in accordance with Deborah Meier’s argument above, pupil individuality in combination with democratic teaching provides the outset for the development of democratic spaces of free and equal citizens, so vital for the existence of deliberative democracy.
Having established central focus areas to conduct a historical analysis of democracy–testing relations associated thematically with macro, meso and micro levels, each associated with a type of democracy, I will briefly present the three historical case studies used to conduct the analysis.
The historical case studies
The first case study is the high-stakes sorting of children into remedial education at the municipality of Frederiksberg in the 1930s, based on standardised intelligence testing.
The second case study is the comprehensive post-war high-stakes testing programme conducted at the Copenhagen experimental school of Emdrupborg.
The third case study is the post-colonial high-stakes testing practice of Greenlandic children during the course of the preparation scheme in the Greenlandic educational system from 1961 until 1976.
Common to all three cases is that they contain an original embedding of high-stakes tests into a local educational field. Each case is mobilised by its own version of high-stakes testing; as such, each constitutes a different narrative regarding children and education, and the way in which their relationship is connected with society.
The Frederiksberg case is rooted in the rise of educational psychology from the 1920s onwards, and it is the very first case of high-stakes standardised intelligence tests being institutionalised and systematically applied in the Danish public school system. A clear high-stakes element can be discerned in this case. Intelligence testing was a significant component in the sorting and documenting of children transferred into remedial education. The institutionalised practice of intelligence testing at Frederiksberg culminated in 1934 with the employment of the first educational psychologist in Scandinavia, Henning Emil Meyer (1885–1967).
Covering the founding of a new and progressive experimental school in 1948, the Emdrupborg case takes a look at the high-stakes testing practices surrounding the school. Emdrupborg was the first experimental school in Denmark, and it served as an educational flagship, not least due to its hosting of numerous international and national visits as well as producing school reports that were distributed widely throughout Denmark. In its heyday, the experimental school employed no fewer than four educational psychologists who conducted a comprehensive testing programme of both low- and high-stakes tests on an annual basis. The high-stakes element is most vividly clear in the use of the Uppsala school readiness test that was used right from the start with the prognostic ambition of determining a child’s school readiness level (Ydesen, 2011: 17). These test results were also used to distribute children throughout the first grades, so that each class would consist of an equal number of high-scoring and low-scoring children. Another high-stakes element, although of a more indirect nature, was that all new children were given the intelligence test, and these test results would naturally affect teacher expectations, as well as be computed in the overall evaluation system at Emdrupborg.
Taking place in a neo-colonial setting, the Greenlandic case differs to some extent from the other two cases. Attached as it was to the Danish educational system, Greenland belonged to the same entity as the other cases. The Danish state functioned as both a significant source of influence and as a gatekeeping and mediating nerve centre of wider international aspirations for Greenland, while local particularities continued to play vital roles. The high-stakes element in the Greenlandic case is closely tied to the neo-colonial setting, since Greenlandic Inuit children had to learn Danish and often were sent to Denmark so that they could qualify for the lower secondary education; this was known as the preparation scheme. The preparation scheme proved to be of pivotal importance for deciding which children would receive a secondary education in the 1960s and 1970s. The overall selection process can best be characterised as a tripartite process, consisting of a centrally devised standardised testing battery, a teacher evaluation of each individual child accompanied by a prioritised school recommendation list and a final decision on the future of the child by the school director who took numerous aspects into account (including the state’s financial situation and the number of Danish foster homes available). The ambition was to handpick the Inuit children with the highest academic potential as well as the required physical and mental stability.
The macro level: education access
Treating the question of education – and thereby elite – access, it is necessary to focus on children’s need for knowledge and their ability to show mastery of a number of skills in order to perform well on a test. This observation validates the question of democratic education access in connection with high-stakes testing practices, because knowledge and skills depend on how well a child has been socialised.
For instance, the 1930 Danish Binet–Simon intelligence test used in the Frederiksberg case presupposed reading ability, knowledge about numbers and a certain level of linguistic development and vocabulary for many test items. Moreover, the test required that the child know certain objects, understand certain concepts, find superordinate and associated concepts and even master the skill of rhyming. These requirements advanced with an increasing level of complexity, following a preconceived notion of a ‘normal’ child’s progressive development. This meant that the level of difficulty would steadily increase concurrently with the age groups growing older and that the room for deviations gradually diminished, calling for increasingly specific knowledge and skills (Ydesen, 2011: 102).
Furthermore, the test taker was required to relate to the particular way of thinking employed in the particular test item. Failure to do so would most likely result in poor performance. The child needed to interact not only with the test items themselves, including their paraphernalia, but also with the very logic used in the test’s design. The ability to do so would require experience with similar problems and tasks, an experience unevenly distributed among children due to their differing social backgrounds.
Looking at the 1930 Danish Binet–Simon intelligence test, it becomes clear that it contained a critical lopsidedness in aspects of both social and gender perspectives rooted in certain acquired skills being in demand if a child were to perform well on the test. The test items primarily drew on examples generally found in a male-oriented world, but both boys and girls needed to navigate a narrow path of middle-class moral codes as well as draw on a certain frame of reference. Failure to do so would result in a lower intelligence quotient (IQ) score. Concepts of time, money, abstract thinking and logical reasoning were key elements in the test. The test grouped certain values from both the working class and the middle class under a unifying umbrella of industrialised society. As such, the test’s inherent lopsidedness was sympathetic to norms and values common to a mid-20th-century industrial mindset.
In the Emdrupborg case, an updated version of the Binet–Simon intelligence test, published in 1943, was used. This version opened a window for including environmental factors when determining IQ levels, with the Emdrupborg educational psychologists even listing a vast array of possible biases in testing (CPH 1). Still, although the Emdrupborg experimental school psychologists possessed substantial knowledge about educational testing biases, quantification’s attraction seems to have overruled reservations in the common test assessment. This was so, in spite of a nascent modification in the design and use of intelligence testing, along with the emergence of a competing concept of intelligence, one that permitted a wider margin for environmental factors. Although the new version reflected the change in the concept of intelligence, it offered no revolution for Danish intelligence testing, and the 1943 test – like the 1930 version – continued to reproduce the sorting of children according to criteria closely associated with an industrialised society (Ydesen, 2011: 169).
The Uppsala school readiness test also used at Emdrupborg similarly required skills that, to a significant extent, would depend on the child’s social background and its gender. The test placed substantial emphasis on analytical skills, initiative, familiarity with figures and numbers and a relatively close-knit notion of normality, for example, what a house looks like, which objects belong in certain situations and which objects stand out from other objects (Ydesen, 2011: 155ff).
Thus, again, the all-comprehensive evaluation process at Emdrupborg was based on values inherent to certain character traits and work ethics, such as precision, order, stability and diligence. In this connection, it is striking that the two high-stakes testing components contained numerous factors evolving out of an atmosphere consisting of personal judgements of the test administrators; problems of fatigue, excitement and anxiety; assumptions about the neutral impact of time limits; and the test takers’ familiarity with test items and testing situations.
Because of the skills required to do well on the tests and the socio-material issues mentioned, the Uppsala school readiness test and the 1943 Binet–Simon intelligence test produced social and gender differences. The high-stakes testing practice at the Emdrupborg experimental school was – as was the case at Frederiksberg – not a neutral and objective evaluation tool. Following the close-knit notions of normality and deviance inherent in the tests, social and gender biases would have a significant impact on the distribution of life chances.
In the Greenlandic case, it is important to note that from a democratic perspective, advancement in society was to be achieved on Danish terms and conditions. The Greenlandic child was faced with the schizophrenic challenge of coping with, and adapting to, Danish middle-class norms and values, because the Greenlandic system was seen as less developed and thus in need of knowledge transfer from Danes towards the Greenlanders for the latter’s assimilation. This discourse fired the imaginations of decision-makers in the Greenlandic educational field who called for selection technologies, such as high-stakes educational testing, to be implemented (Egede, 1971).
From an analytical point of view, the norms and values of the test battery used centred on the following abilities: abstract thinking, logical reasoning, grammatical understanding, creativity, consistency and comprehension – abilities that, as described above, are in no way culturally neutral but are in good harmony with the values of an industrialised society needing citizens who could perform such tasks one day in their jobs as adults. The Danish values of modernisation, industrialisation, economic growth and prosperity thus permeated the test battery.
The cultural imbalance of educational tests in the Greenlandic context is reflected in the failure of the Uppsala school readiness test, which was terminated in 1964. During the course of the work with the test, it proved impossible to obtain the abstract concepts required from the test takers.
Focusing on the democratic question of education access through the lens of the historical case studies shows that testing is intimately connected with hegemonic societal values. This means that the material conditions of a given society in general and gender, social conditions and culture in particular permeate test designs with clear and present consequences for the distribution of life chances among test takers. Thus, the historical experiences suggest that when anyone employs high-stakes test batteries, close attention must be paid to the constructed notions of normality and deviance inherent in the test. To some extent, one might even say that testing is prestructured to incorporate a scale of values – a field of tension between genius (good) and idiocy (bad) – but only in certain forms. The test designer would need to anticipate all manifestations of genius in order for genius answers to be credited. Because this is impossible, only limited room for deviation exists, and there is none at all for spontaneity and surprise; thus, this type of testing tests for only one kind of genius and one kind of idiocy.
The macro level: accountability, transparency and education standards
Educational standards vary across different historical settings and practices. However, the overall picture that emerges from these three case studies is that implementing and practising high-stakes testing served as a way of securing and – in some cases – lifting education standards.
At Frederiksberg, the existing system of ability grouping and attendant selection procedures had caused a number of problems in the first decades of the 20th century. Thus, it was believed that a system that would allow for sorting those children who were perceived as lagging behind the group would be beneficial. As such, conditions of the Frederiksberg educational system circa 1930 were conducive to inviting proponents of high-stakes educational testing, which seemed to offer a solution to fairly and scientifically sort children into different levels for learning. Testing presented itself as a tool able to counter precisely the growing educational problems connected with lagging children, backwardness and class teaching. As such, testing held the potential for reasserting effectiveness and order in Frederiksberg’s educational system (Ydesen, 2011: 53).
Amid an atmosphere of international competition, post–World War I, the business world and certain conservative politicians had launched a severe critique of the educational system, claiming that children did not learn enough to be productive labourers upon leaving school. This critique sparked a desire – both outside and inside the educational system – for measuring the results produced by the educational system, fostering the opportunity to introduce a tool that seemed capable of meeting accountability demands and creating transparency by quantifying and accurately measuring the educational system’s professed standards (Ydesen, 2011: 109).
The desire to measure educational standards is also visible in the Emdrupborg case. The Emdrupborg experimental school was founded on the knowledge that other countries – primarily the United States – undertook comprehensive educational experiments. This sense of international competition in the realm of educational experimentation played into a competitive or at least comparative international mindset prevalent in the spheres of post-war Danish educational and political policy making. Furthermore, here were found the buzzwords of optimisation and effectiveness, in which a sense of international competition and a linkage between economic growth and education were important prerequisites (Ydesen, 2011: 167).
To this end, testing was generally acknowledged in Danish post-war education circles as a prerequisite for both conducting school experiments and documenting results and standards. It is notable that testing proved to be such a powerful tool because, to a great extent, it confirmed the existing practice. The Uppsala school readiness test revealed that some children were ready for school, whereas others were not; and the Binet–Simon intelligence test indicated significant differences in children’s intellectual capacities. Both tests were surrounded with an aura of credibility. Testing confirmed what had been conventional knowledge and provided scientific evidence for the righteousness of the educational practices, not least at the Emdrupborg experimental school (Ydesen, 2011: 169).
In Greenland, testing functioned as a transparency tool used to compare the standards of Greenlandic and Danish education. The Danish Ministry viewed testing as an attractive tool for use in Greenland. Still, Greenlandic politicians desired that Greenlanders themselves should be afforded the same opportunities to achieve a status equal to that of Danes in the educational sphere, enabling native participation in Greenland’s administration and development. Achieving such status required similar educational standards, and to this end, testing served the purpose well in securing a rise in Greenlandic education standards (Ydesen, 2011: 177).
Thus, the historical cases show that outside accountability demands and international comparisons facilitate the implementation and promotion of high-stakes testing practices, which are very apt for documenting comparable education results and standards. This serves a positive role for democracy in education, provided the education results and standards are made available for the public to see. However, the historical cases also reveal that high-stakes testing practices tend to confirm existing knowledge and practices, making it a preserving, conservative element in education. This might pose a problem, because policy changes may be more difficult to implement due to the scientific status of standardised testing requiring highly specialised knowledge. Hence, the other side of the democratic coin is that high-stakes testing makes education practices less susceptible to democratic arguments, agendas or perhaps even majorities.
The meso level: schools, teachers and parents versus central authority governance
At this thematic level, the degree of autonomy and influence for schools, teachers and parents alike concerning the implementation and use of high-stakes testing practices takes centre stage. In particular, the focus will be on the democratic room in which schools, teachers and parents can exert their respective influence on testing practices vis-a-vis the authority calling for the high-stakes testing practices to be implemented and sustained.
At Frederiksberg, the rise of high-stakes educational practices must partly be understood as a result of bottom-up processes. A small group of entrepreneurial teachers and psychologists, who used their extensive knowledge regarding international research on testing to promote intelligence testing as a new practice in the Danish educational field, undertook the pioneering task. In these bottom-up processes, Frederiksberg excelled as a site of the confluence of important agents. Meyer and a number of benefactors of educational testing received the required support and freedom to install and develop intelligence testing as a practice (Ydesen, 2011: 112).
However, many Frederiksberg teachers – at least initially – took a critical attitude towards testing and pedagogical experiments. There might be several reasons for what was apparently a widespread stance among teachers. First, educational psychologists were seen by many teachers as representatives of the reform pedagogy movement that advocated for a free upbringing and freedom of the child and thus brought revolution into the educational system. Second, teachers were also often sceptical regarding new exams and tests, because they feared such initiatives would encourage lockstep conformance and inhibit the teacher’s freedom and practice. Third, these initiatives apparently drained power from the teachers, who had lost their influence in the process of determining whether a child would be transferred into remedial education. This resulted in the whole enterprise of educational psychology and high-stakes testing being perceived as somewhat of an unknown factor in the everyday life of the school, one that might interfere with the teachers’ own daily practice, freedom and influence in the classroom (Ydesen, 2011: 73f).
But Meyer gradually managed to overcome this scepticism levelled at him from the ranks of many of his teacher colleagues. The support of the organisations, the teachers’ union and the Frederiksberg educational leadership undoubtedly helped him in this endeavour. Moreover, the ability of educational psychology to sometimes rid a teacher and a classroom of a troublesome child would often generate an attitude that could counter the above-mentioned scepticism and create a community of allied interests between teacher and educational psychologist.
Perhaps more importantly, however, the Frederiksberg educational leadership, in their promulgation of educational psychology guidelines, retained teacher initiative in transferring children to remedial education. The role of the Frederiksberg educational psychology office remained a consultative one in relation to transferring children into remedial classes throughout the interwar years. Thus, a transfer presupposed a recommendation from the teacher, the educational psychologist and the headmasters of both sending and receiving schools, along with a medical consultation with the school doctor, and finally the transfer’s approval by the school director. Based on the records of children transferred to the remedial classes, though, a clear picture emerges that the educational psychologist’s recommendations were nearly always followed.
From a parental perspective, it is noteworthy that parents had to authorise the testing for their child. The initiative for conducting an educational psychological examination of a child emanated from the teacher or more rarely from the parents. If a teacher wanted the educational psychologist to examine a child, they needed parental consent. The mindset was that the educational psychological practice would benefit if parents were included as co-players in the process. The only exception was in the event of a child’s pending transfer to the remedial school, in which case an educational psychology examination and intelligence testing of the child would take place without parental consent. Although Meyer stated in 1943 that parents had never been overruled in connection with the educational psychology examination, the exception is still particularly noteworthy, since parents’ opposition towards transferring their child to a remedial class was often fierce (Rifbjerg, 1963: 259; Persson, 1985: 53).
At Emdrupborg, it is noteworthy that the evaluation system did not stop with the child. Instead, the school launched an all-inclusive evaluation, which included both the home and the parents. The child was seen as reflective of a social context in need of help and control. This is in line with the spirit of comprehensive parent–teacher cooperation, a key area of focus at the school. But even more interestingly, annual log cards were completed, which included a broad evaluation of the home environment of each individual child as well. The aspects evaluated by the Emdrupborg teachers were school–parent relations; the parents’ school interest; the parents’ help for school homework; the home’s economic conditions, including whether the home consisted of a sole breadwinner; the general attitude of the parents; the child’s appearance; the upbringing of the child; and the house order.
The immediate purpose of these log cards was to ensure that the teachers would evaluate each child holistically and empathetically, stimulate teacher attention towards problems, identify children in need of help, facilitate the overview of the children and give a description of immeasurable factors (CPH 2). But this rather strictly designed log file system was not always well received among the teachers. Some of them felt that the basis of the evaluation was uncertain, and they lacked the possibility of discussing the evaluations with colleagues (Ydesen, 2011: 163).
The Emdrupborg case shows that teachers were integral cogs in a meticulously planned evaluation regime imposed by the school leadership consisting of a clergy of educational psychologists. On the one hand, some evidence exists that this evaluation regime was perceived as a straitjacket among teachers, but on the other hand, the label as an experimental school called for a thorough registration of all pupils. What is interesting in this respect as well is that parents, though school–parent relations were an area the school actively cultivated, were themselves objects of evaluation. To some extent, this bears witness to an asymmetry in democratic relations and shows that ‘the system’ or the central authorities were the central arbiter of democratic influence; they held the prerogative of deciding who should be heard and who fell into the hegemonic notion of normality.
Turning attention to the role and latitude of parents in the Greenlandic case, it is clear that parents were generally positively disposed towards sending their children to Denmark. In fact, parents did not simply accept test results that could have a detrimental effect on their child’s potential future. The rejections called for thorough justification from the Greenlandic educational system’s highest echelon – justification that would often underline the objectivity of educational testing. However, it should be noted that although the majority of parents were enthusiastic about the preparation scheme when it was introduced in 1961, often they did not know what the scheme actually involved (KIIIN 1).
In order to understand the democratic conditions of Greenlandic schools and teachers in relation to high-stakes testing, it is important to note that the Greenlandic educational system was a state school ultimately controlled by the Ministry for Greenland from Copenhagen. But the school directorate in Nuuk enjoyed a wide degree of autonomy and freedom. Even so, it is fair to state that the school director, to a significant extent, functioned as a mediator between the Ministry and local school authorities. This was a position that, nevertheless, could often trap the school director between a rock and a hard place – particularly in relation to the preparation scheme. One example is found in the sheer number of school and teacher complaints in relation to the preparation scheme. Often, the final selection of pupils was not well received among schools and evaluating teachers, and over the years, the school director received numerous complaints. The main issue was that the number of children being recommended differed from the number of available places in Denmark (most often dictated by financial aspects and the number of available foster homes). Generally, the schools recommended far more children than could be accommodated. It fell upon the school director to address and resolve such dilemmas, revealing the delicate position of the directorate within the organisation and outlining the contours of a power struggle between it and the ministry. The ministry did not want the authorities in Greenland to be too autonomous, and the authorities in Greenland did not want the ministry to interfere too much in their affairs.
The school directorate generally proved itself to be a rather pragmatic institution, as a result of its high-profile organisational position mediating between the ministry and the 18 Greenlandic school districts. It is noteworthy that the school directorate did not refrain from opposing test results when they would threaten to undermine its authority and autonomy. This is particularly clear in the case of the 1963 ministerial research experiment aimed at identifying feeble-mindedness in Greenland. This indicates a rather pragmatic and perhaps even ambiguous approach to testing on the part of the school directorate. It also shows that its leading officials did not enforce a notion of fixed intelligence measurable through arithmetic tests, as had been the case with the Danish researchers conducting the experiment. Considerations of power, influence and freedom were the key concerns of the school directorate.
Without a doubt, however, the school directorate offered its full support in employing high-stakes educational tests in the preparation scheme until 1971/1972, and it maintained a significant, positive rhetorical stance regarding educational testing throughout the period. This discourse revolved around meritocratic ideals that upheld the effective spotting of talent as a key value.
Thus, testing was largely perceived as an objective evaluation tool able to create an allegedly objective reference point that would ease the administration of opposing interests and thus preserve the school directorate’s autonomy. The school directorate found testing very useful as a seemingly objective bulwark against the arguments of local interests threatening to warp the delicate balance of the highly heterogeneous Greenlandic educational system. Such interests were frequently expressed by both teachers and parents who challenged the decisions and the authority the school directorate exerted in the preparation scheme. In other words, testing solved a problem for the school directorate in the governing of the Greenlandic educational system (Ydesen, 2011: 214ff).
Examining the meso level clearly indicates the importance of the origins of high-stakes testing practices. At Frederiksberg, high-stakes testing emerged largely as a bottom-up process but found critical support from Frederiksberg’s educational leadership. The high-stakes testing practice largely conceived by Meyer helped overcome teacher scepticism. At Emdrupborg, high-stakes testing practices emerged more as a result of a top-down process with the educational psychologist leadership at centre stage. Teachers, to some extent, felt trumped, although working at an experimental school called for comprehensive evaluation practices. In the Greenlandic case, high-stakes testing practices were unambiguously the result of a top-down process co-instigated by the ministry and the school directorate. Teacher and school priorities were often overruled with a reference to test results. Democratic debate was suspended, due to the highly specialised knowledge required to mount an effective critique of test results. This must be understood as a problem of governance rooted in the organisational structure of the Greenlandic educational system.
All three cases reveal a serious lack of parental influence concerning high-stakes testing practice. At Frederiksberg, the use of intelligence testing with a high-stakes purpose was listed as precisely the one exception in which parents exercised no influence. At Emdrupborg, while parents experienced the possibility of being heard, the school retained its prerogative regarding who would be heard, with parents themselves being subject to evaluation. In Greenland, parents could do little to object to either test results or decisions as to whether their child should receive secondary education via a school stay in Denmark. Some parents, however, could exert enough economic muscle to pay for a school stay themselves, thus overriding decisions made by the established educational system.
Overall, the meso level indicates that high-stakes testing practices tend to overrule and suspend democratic objections from stakeholders, because they are held to be scientifically objective and beyond reproach.
The micro level: test power and pupil individuality
At the micro level, three key concerns take centre stage: pupil individuality, democratic teaching and the relations of testing with other evaluation technologies.
Although intelligence testing by its definition is aimed at shedding light on an individual’s capacities and abilities, at the same time, it is limited to an understanding of the individual in relation to a preconceived scale of normality based on standardisation. Frederiksberg was no exception, and it is clear that intelligence and IQ both played vital parts in the evaluation process. In the majority of his endorsements, Meyer mentioned IQ test results as the sole determining factor. Moreover, a child’s IQ level was always indicated at the beginning of the educational psychologist’s endorsement. This shows how the notion of intelligence was a key determining factor permeating the Frederiksberg educational system. It seems as if the educational psychologist’s primary role was merely to specify the IQ of the child in question.
This observation offers a glimpse into a practice that appears contrary to descriptions made about a comprehensive educational psychology evaluation process in which a holistic and individual perspective was given pride of place, and an IQ score never would function as the sole factor determining a child’s transfer into a remedial class. Put another way, it appears that thoroughness and a comprehensive view turned into a pragmatic attraction of quantification to the detriment of pupil individuality (Ydesen, 2011: 104ff).
While there was no specific focus on democratic teaching at Frederiksberg in the interwar years, the picture is very different at post-war Emdrupborg. After the German occupation of Denmark had ended in May 1945, the climate in the Danish educational field was generally one of reconciliation and openness. This new openness was particularly true in relation to experiments regarding educating children democratically. The leading political parties in the decades after the German occupation saw the educational system as a hotbed for promoting democratic dispositions in the population (Juul, 2006: 80). Democracy was seen as the key bulwark against the atrocities committed during World War II, with education the key to creating a better world.
This clear-cut focus on democracy permeated the very comprehensive pupil evaluation programme. Apart from the tests, the evaluation system at the experimental school had two additional dimensions. First, the teacher submitted an evaluation report twice a year. The first report centred on the work in the classroom, and the second report emphasised the work and development of each individual child summarised in the teacher’s own words. Second, the teacher maintained an annual log describing an extensive list of factors pertaining to each individual child, such as teaching time spent on the child and the child’s school interest, ability to concentrate, individual work pace, sense of order, self-confidence, ability to cooperate, stability, verbal ability of expression, creative abilities, dexterity, manual skills, motor development, role in commotion incidents, lack of precision, managerial skills, spare-time activities and initiative in the classroom apart from the more traditional ratings of the child’s work in different academic subjects (Nørvig, 1955: 153).
The teacher evaluation reports, in which teachers were allowed to evaluate children using their own words, show – among other notions – that IQ scores often exerted considerable influence over teachers’ own expectations towards the individual child. Table 1 presents selected excerpts from teacher evaluations concerning different children from different grades to give impressions of the values and notions prevalent in the Emdrupborg experimental school evaluation regime.
Table 1. Excerpts from teacher evaluations concerning different children from different grades giving impressions of the values and notions prevalent in the Emdrupborg experimental school evaluation regime.
The teachers aimed at a holistic evaluation of each individual child, and they certainly did not refrain from evaluating the child’s home or even making causal explanations about the child and its parents, based on hereditary assumptions. This supports the argument about teacher evaluations being based on arbitrary criteria, making standardised high-stakes testing more apt for creating a democratic environment.
In the ‘body and appearance’ and ‘traits’ categories, it is striking how the teachers subscribed to all kinds of stereotypes such as ‘a typical boy’, ‘a wimp’, ‘a boy scout type’, ‘a Don Juan’ and ‘Ferdinand the bull’, to mention but a few. These stereotypes – contrary to pupil individuality – to a large extent seem to be ordered according to gender conceptions, and they clearly show gender issues at play in how the teachers perceived, and what they expected from, the individual children. Thus, the teacher evaluations reveal that gender concepts would have an impact on both the schooling and learning environment of the child. This was probably not unique to the Emdrupborg experimental school, but the content of these evaluations is very striking inasmuch as they also reveal a very strong focus on appearance and looks among many Emdrupborg teachers.
On a more positive note, the evaluations also indicate the clear presence of democratic ideals. As such, much emphasis is put on a child’s relation to classmates, participation in common activities and positions of trust. Based on the negative remarks about children who are believed to possess certain character traits and work ethics, it is precision, order, stability and diligence that again stand out as the experimental school’s canonised values. This indicates that preconceived notions of normality permeated the evaluation regime. The taxonomic logic of the log cards bears strong witness to the experimental school’s role in preparing children for their imminent adult life in the labour force. The evaluative categories pertaining to the child all more or less centre on the child’s ability to cope with society and a life in the labour force upon leaving the school. This is evident in the focus on spare-time activities. The values inherent in the log files stand out as a mixture of working- and middle-class values inculcated by an industrialised society in which precision, order, stability and diligence assumed pride of place.
The evaluations of giftedness are particularly interesting in relation to the 1943 Binet–Simon intelligence test scores. The remarks ‘seems brighter than she is’ and ‘seems sensible and more mature than indicated by her IQ’ are indeed very telling about this relation. Looking through the available Emdrupborg teacher evaluations in the semi-annual log files makes it apparent that the notion of IQ was a recurring element against which teachers would position, and argue for, their own evaluations. Particularly, children’s IQ results functioned as points of orientation for teacher expectations. What is interesting is that teachers would often estimate an IQ even in cases where the child had not been IQ tested. This is evident in the frequent use of the term ‘the coefficient of utilization’. One teacher wrote, ‘His IQ had subsequently been measured to 119, whereas I had estimated him to be around 100, so I guess something will present itself in the time to come’.
Another interesting observation is that the IQ seems to have been hit by inflation. In several teacher evaluations, children are denoted as average if their measured IQ appeared to be around 110–115. Sometimes, though, teachers noted the discrepancy between the IQ test results and customary and observed work habits: ‘His IQ was measured with the Binet test to be 134, but today one would say that that is too high, since his academic achievement does not reflect this level’ (CPH 3).
In the Greenlandic case, the interaction of high-stakes testing with teacher evaluations created numerous problems. In fact, there often seems to have been a marked schism at the practice level between teacher evaluations and test results. This points to a well-known problem in the relationship between high-stakes educational tests and teacher evaluations: ‘[…] the tool of high-stakes testing diminishes teacher judgment and decreases their responsibility, and instead routinises instruction, deskills many teachers, and places them under closer supervision’ (Madaus et al., 2009: 102). Occasionally, the school director, having to defend his decision against that of disgruntled teachers, openly used test results as an argument for rejection or acceptance (Ydesen, 2011: 212).
Teachers often saw testing as an external tool that would potentially undermine their professional expertise. In contrast, the school director saw testing as a useful tool to counteract teacher evaluation inflation and as a means to strengthen his own power position.
The micro level thus demonstrates that testing indeed is an extremely powerful tool permeating and often overruling alternative evaluation technologies. At the same time, ambitions of taking pupil individuality into consideration seem to have been thwarted by the strong notion of normality associated with high-stakes testing – a notion closely connected with the norms and values of the society in which the high-stakes testing practice is conducted.
The inadvertent launch of close-knit notions of normality, inherent in high-stakes testing practices, affects the conditions under which democratic teaching can be realised. This is, however, more of an indirect influence, and the adoption of high-stakes testing practices and democratic teaching cannot be held to be mutually exclusive.
Conclusion: democratic testing?
Following the historical analysis of the case studies, questions arise concerning the lessons that can be drawn from historical experiences with relevance for contemporary democracy–testing relations. What are the implications of high-stakes testing on democratic conditions? At which levels does high-stakes testing interact with democracy?
As evidenced by the historical case studies, high-stakes testing is closely interwoven with societal needs for governing education access, education management and social selection. These societal needs form the core links in democracy–testing relations, which probably remain true today.
Accountability issues cut across democracy–testing relations, because all types of democracy call for transparency, and testing can provide a certain layer of transparency pertaining to the issues of education access, education management and social selection. In this respect, democracy–testing relations might be denoted as symbiotic. In fact, the historical case-study analysis demonstrated that accountability demands and international comparisons facilitate both implementing and promoting high-stakes testing practices.
However, this apparently symbiotic relationship is based on the ability of testing to function objectively, fairly and justifiably. The analysis showed that testing practices are not purged of constructed notions of normality and deviance, which suggests that when working with high-stakes test batteries, educators should pay close attention to such constructions inherent in the tests, because they are important for the education access, education management and social selection that are critical to any type of democracy. However, this critique also proved to hold for teacher evaluations, which is why a combination of different evaluation technologies – some formative and some summative – might be the safest way to go from a general democratic perspective.
A second focus area that would benefit from close observation employing a participatory and deliberative democratic perspective is the openness for democratic debate in general and parental influence in particular. The historical analysis indicated that both democratic debate and parental complaints are in danger of being suspended due to the highly specialised knowledge required in critiquing test results or promoting alternatives to testing. An important factor in this respect is the tendency of testing to overrule and outmanoeuvre other evaluation technologies. This tendency is strengthened by testing’s ability to generate comparable results, which are then held to be scientifically objective and beyond reproach. The historical analysis, however, also implied that testing tends to be a conservative element in education, confirming existing knowledge and practices.
A third focus area for democracy–testing relations in a participatory or deliberative perspective would be to study the origins of high-stakes testing practices. Are they implemented and sustained as a result of bottom-up or top-down processes? Based on the historical cases, the latter form of implementing and sustaining high-stakes testing practices tends to be an expression of outside governance that can stifle and perhaps even undermine local and more democratic bottom-up initiatives for achieving best practice. In this respect, it might be prudent to also ask the ancient Ciceronian question, Cui bono? (Who benefits?) (e.g. Cicero, Pro Roscio Amerino, Section 84).
Footnotes
Funding
This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.
Unpublished sources
CPH 1: Copenhagen City Archives, Emdrupborg School Archive, Wilhelm Marckmann’s papers II 1952–1964, Undated Note.
CPH 2: Copenhagen City Archives, Emdrupborg School Archive, The Psychologist’s Room: Note dated April 1952, p. 2.
CPH 3: Copenhagen City Archives, Emdrupborg School Archive, Child Descriptions 1948/49–1964/65.
KIIIN 1: Kultureqarnermut, Ilinniartitaanermut, Ilisimatusarnermut, Ilageeqarnermullu Naalakkersuisoqarfik [Department for Culture, Education, Research, and Church] Archive, j.nr. 949.3, 1961: Letter from the headmaster in Egedesminde to the school director dated 8 June 1961.
