Abstract
In the current article, we consider the influential position of the Programme for International Student Assessment (PISA) and discuss several methodological areas that demonstrate the need for caution when using and interpreting PISA results. We motivate our argument by briefly describing the program’s increased influence in educational policy over time. Subsequently, we describe the methodological areas of interest, including sampling participants, the achievement estimation model, and measuring trends. We also offer our perspectives on how the Organisation for Economic Co-operation and Development might productively and more clearly communicate PISA’s limitations.
Over a 60-year history, modern international large-scale assessments have become influential educational policy tools, moving beyond their historical role as descriptive “snapshots” of educational systems (Purves, 1987). A prime example of this evolution is the Programme for International Student Assessment (PISA), the flagship educational study of the Organisation for Economic Co-operation and Development (OECD). According to the OECD (2012), “[PISA] is a collaborative effort among OECD member countries to measure how well 15-year-old students approaching the end of compulsory schooling are prepared to meet the challenges of today’s knowledge societies” (p. 22). Beginning with its first cycle in 2000, PISA results have stimulated considerable changes in a number of participating educational systems. For example, the results from PISA 2000 gave rise to a national “PISA shock” in Germany, leading to massive and swift educational reforms (Ertl, 2006). In this regard, Germany is not alone, as similar educational impacts were felt in Japan (Takayama, 2008), Denmark (Egelund, 2008), Finland (Dobbins & Martens, 2012), and a number of other European countries (Grek, 2009). As Hopmann, Brinek, and Retzl (2007) wrote, “every time a new PISA wave rolls in, or an additional analysis appears, governments fear the results, newspapers fill column after column, and the public demands answers to the claimed failings in the country’s school system” (p. 1). Nine years after this statement, the response to PISA results in the United States is similar. For example, following the 2012 PISA results, then–secretary of education Arne Duncan called for higher educational standards, using PISA performance as justification, asserting that U.S. results were a picture of educational stagnation (Duncan, 2013).
Media and policymakers are often most interested in one particular aspect of PISA results: the achievement rankings, or “league tables.” These rankings have given rise to what effectively amounts to an international “horse race” that identifies the educational winners and losers, with winners placed in an international spotlight and losers placed under a figurative microscope. A prime example of both positions is represented by Finland, which topped the rankings in the first three PISA cycles. As a result, this small Nordic country was the recipient of a steady wave of educational tourism from scholars and policymakers who sought to understand the ingredients that pushed Finland to educational success. Consequently, movies (Wagner & Compton, 2011), books (Sahlberg, 2011), and touring speakers with keynote slots at international conferences (e.g., Sahlberg, 2014) popularized the Finnish approach to education. This trend continued until 2009, when Shanghai topped the rankings in all three content areas, far surpassing Finland. A repeat performance by Shanghai in 2012 firmly positioned this Chinese city as the PISA darling, with a level of scholarly and popular attention (Cheng, 2010) similar to that of Finland. And while Shanghai was basking in the limelight of the 2012 results, national media in Finland concluded that due to decreases in PISA scores since 2009, its educational system had collapsed (Sahlberg, 2013), marking a stark shift in the rhetoric around the Finnish system and changing the global “reference society” from Finland to Shanghai (Sellar & Lingard, 2013a).
Given the influential position that PISA holds in education policy and research discussions, it is important that limitations associated with using and interpreting PISA data are clearly understood. To that end, we explain three methodological aspects, their limitations, and how these issues can weaken conclusions and associated policy prescriptions that stem from PISA results. Our choice of methodological issues is intended to go beyond standard user-guide warnings and caveats aimed at data analysts (e.g., the importance of including sampling weights and correctly treating plausible values; OECD, 2009). In particular, we outline and describe for a nonpsychometric audience the following: sampling participants, the achievement estimation model, and measuring trends. Although our areas of focus are not exhaustive, they illustrate less-discussed issues that necessitate a measured approach to drawing conclusions from PISA results. Two of the issues we raise are common across other international assessments; the third, the trends limitation, is unique to PISA.
Select Methodological Limitations of PISA
Sampling Participants
Two methodological limitations relevant to sampling participants are (a) poorly reported exclusion rates that exceed international standards and (b) a misalignment between PISA’s stated aims and population coverage. As background, sampling procedures in PISA follow a carefully developed protocol and are in accordance with strict technical standards (OECD, 2014a). Further, the entire process and resultant databases are subject to an adjudication process to ensure that standards are met. The PISA target population is 15-year-old students attending educational institutions with Grade 7 and higher, located within the educational system (OECD, 2012, p. 62). To select students for participation, PISA generally uses a two-stage stratified sample design where the first stage is a sample of at least 150 schools, with probability of selection proportional to size. Within each selected school, a random sample of 35 15-year-olds is then chosen for participation. In each educational system, a minimum of 4,500 students are drawn. PISA samples are thus intended to be representative of the target population (15-year-olds in school) but not necessarily of 15-year-olds in general. To ensure that the sample of students is generally representative of the target population, several criteria are in place regarding school and student exclusion rates. Candidates for exclusion include, but are not limited to, students with severe intellectual or functional disabilities, students with insufficient language skills, and schools attended by students with these characteristics. In general, overall exclusions should be kept below 5% (OECD, 2014b, p. 67).
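To make the two-stage logic concrete, the following is a minimal sketch of a probability-proportional-to-size (PPS) school draw followed by a simple random draw of students within schools. It is an illustration under simplifying assumptions, not the operational PISA procedure: the school frame is simulated, stratification and nonresponse adjustments are omitted, and the function names (`pps_systematic`, `two_stage_sample`) are hypothetical.

```python
import random

def pps_systematic(sizes, n_schools, rng):
    """Pick n_schools indices with probability proportional to size,
    using systematic selection along the cumulated enrollment."""
    total = sum(sizes)
    step = total / n_schools
    start = rng.random() * step          # random start in [0, step)
    targets = [start + k * step for k in range(n_schools)]
    selected, cumulated, t = [], 0, 0
    for idx, size in enumerate(sizes):
        cumulated += size
        # A school is hit when a target falls inside its enrollment interval.
        while t < n_schools and targets[t] < cumulated:
            selected.append(idx)
            t += 1
    return selected

def two_stage_sample(sizes, n_schools=150, n_students=35, seed=0):
    """Return (school, student, weight) triples for a two-stage sample."""
    rng = random.Random(seed)
    total = sum(sizes)
    sample = []
    for school in pps_systematic(sizes, n_schools, rng):
        take = min(n_students, sizes[school])
        students = rng.sample(range(sizes[school]), take)
        p_school = n_schools * sizes[school] / total   # stage 1 probability
        p_student = take / sizes[school]               # stage 2 probability
        weight = 1.0 / (p_school * p_student)          # inverse inclusion prob.
        sample.extend((school, student, weight) for student in students)
    return sample

# Hypothetical frame: 1,000 schools with 40 to 400 enrolled 15-year-olds each.
frame_rng = random.Random(1)
school_sizes = [frame_rng.randint(40, 400) for _ in range(1000)]
sample = two_stage_sample(school_sizes)
```

In this simplified setting every sampled student carries the same weight, and the weights sum to the number of enrolled 15-year-olds in the frame. That is the sense in which the sample represents 15-year-olds in school, and only those.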
In spite of the care with which samples are drawn, this sort of process will necessarily suffer from deficiencies, given the scale and scope of PISA. For example, in eight educational systems, overall exclusion rates exceeded the maximum of 5% (OECD, 2014a, p. 265), with the highest exclusion rate of 8.4% in Luxembourg. Importantly, the agreed-upon maximum exclusion rates ensure that any distortions in national mean scores due to omitted schools or students would be no more than ±5 score points on the PISA scale (or about two standard errors; OECD, 2014a, p. 27). Higher exclusion rates can cause distortions that are larger than expected, leading to possibly incorrect inferences and conclusions. As such, failure to meet sampling standards should be clearly communicated in all tables of results using typographical devices referring to footnotes with relevant information (as an example, see Mullis, Martin, Foy, & Arora, 2012, p. 40). Nonetheless, these exceptions are documented only in Annex A2 of Volume 1 of the results; no qualifications are noted in the main tables of results, and only a general mention of exclusion rates is included in the introduction of the same document (OECD, 2014a, p. 27). Additional information about exclusion rates is also available in the technical report (OECD, 2014b); however, this document was unavailable until December 2014, 1 year after the results from the 2012 cycle were published. We see this as an important issue with respect to scientific inquiry into PISA methods and implementation, particularly given that the PISA cycle is only 3 years long and the 2015 administration was only months away when the 2012 technical report was released.
A second sampling issue concerns population coverage. That is, how well does the sample represent the population that PISA intends to measure in each educational system? In terms of the target population (15-year-olds in school), PISA samples are highly representative. In contrast, less than 80% of all 15-year-olds were captured in 16 out of 65 participating educational systems in 2012, including top-performing Shanghai (OECD, 2014a, p. 268). At the extreme, Costa Rica covered just 50% of all 15-year-olds, and Albania and Vietnam covered 55% and 56%, respectively. This suggests that, in these systems, nearly half of all 15-year-olds were not included in the sampling frame. And although this is not an inherent sampling problem (as indicated by a well-covered target population), it certainly precludes any generalization of PISA results to the entire population of 15-year-olds (who eventually enter the workforce). Such low coverage weakens OECD claims that the average level of skills measured by PISA is an “important indicator of human capital, which in turn has an impact on the prosperity and well-being of society as a whole” (OECD, 2013a, p. 169). Clearly, as an overall indicator of human capital, PISA will necessarily be limited by the fact that 15-year-olds not enrolled in schools are outside of the target population.
Achievement Estimation Model
In what follows, we discuss two issues with respect to the achievement estimation model: (a) the assumption that item parameters are equal across measured populations and (b) missing and error-prone background data, which can have an impact on achievement estimates. As of the 2012 cycle, the PISA approach to estimating achievement assumed that item responses adhered to a generalized Rasch item response theory (IRT) model (Adams & Wu, 2007; Adams, Wu, & Carstensen, 2007; Rasch, 1980). Under IRT models, meaningful cross-cultural comparisons depend on the fundamental assumption of item parameter equivalence (Hambleton & Rogers, 1989; Mellenbergh, 1982; Meredith, 1993; Millsap, 2011). Given the specification of the Rasch model, this implies that test items are assumed to be equally difficult across the populations under consideration. That is, an item should be equally difficult for children of the same proficiency in the United States, Kazakhstan, and Shanghai. As operational procedures in PISA rely on this assumption, it is notable that in empirical investigations, the assumption does not hold (e.g., Kreiner & Christensen, 2014; Oliveri & von Davier, 2011; Rutkowski, Rutkowski, & Zhou, 2016). And in a limited investigation (Rutkowski et al., 2016), violations were found to have consequences for achievement rankings, especially those of middle-performing educational systems. Further, the same study showed that achievement can be meaningfully biased, in several cases resulting in achievement outside of the original 95% or 99% confidence interval. In other words, a consequence of violating this assumption is that achievement rankings may not be accurate and system-level comparisons can lead to incorrect conclusions regarding achievement differences, particularly when those differences are small, albeit statistically significant. It is important to note that this error source is not currently captured in reported measures of uncertainty, including standard errors and confidence intervals. As such, true between–educational system differences in achievement can be obscured while false differences can be revealed.
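For readers who prefer notation, the equivalence assumption can be sketched with the simple dichotomous Rasch model. The operational PISA model is a more general member of this family, and the symbols here are introduced only for illustration: $\theta_i$ is the proficiency of student $i$, $\delta_j$ is the difficulty of item $j$, and $g$ indexes educational systems.

```latex
P(X_{ij} = 1 \mid \theta_i, \delta_j)
  = \frac{\exp(\theta_i - \delta_j)}{1 + \exp(\theta_i - \delta_j)},
\qquad
\delta_j^{(g)} = \delta_j \quad \text{for every educational system } g .
```

When the second condition fails for some items, two students with the same $\theta_i$ in different systems have different probabilities of a correct response, and differences in item functioning are absorbed into the estimated achievement differences.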
Balancing a desire to assess content domains broadly against an effort to minimize the testing burden for each examinee, PISA uses a sophisticated method of test administration called multiple-matrix sampling (Shoemaker, 1973). Under multiple-matrix sampling, test material is divided up into nonoverlapping item clusters that are assembled into partially overlapping test booklets so that 10 or more hours of testable materials is packaged into 120-min booklets, one of which is administered to each examinee. As only a subsample of the total test material is administered to any examinee, this essentially creates a missing data problem for achievement estimation. As such, a modification of multiple imputation (Rubin, 1987) treats achievement as if it were a missing value to be “filled in” for examinees (Mislevy, 1991; Mislevy, Beaton, Kaplan, & Sheehan, 1992; Mislevy, Johnson, & Muraki, 1992).
As in multiple-imputation methods, an imputation model (called a “conditioning model”) is developed to predict population and subpopulation achievement distributions. This model uses all available student data (cognitive as well as background information) to generate a conditional proficiency distribution from which to draw a number of plausible values (usually five) for each student on each latent trait (e.g., mathematics, science, reading, and associated subdomains). Effectively, this method borrows information from the subpopulation to which individual students belong to predict their scores. For example, if achievement values are predicted based on sex, then the values assigned to boys and girls will be predicted based on the achievement distribution for the sex to which they belong. That is, if girls constitute the higher-performing group, individual girls’ scores are predicted based on this difference, enabling a better approximation of the true achievement of each sex within the overall population. This method produces sufficiently precise, unbiased estimates of population and subpopulation achievement when the underlying assumptions of this model are met (Mislevy, 1991; Mislevy, Beaton, et al., 1992; von Davier, Gonzalez, & Mislevy, 2009); however, missing and error-prone data in the conditioning model can have consequences on subpopulation achievement estimates.
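As a rough illustration of how plausible values are typically used once they have been drawn, the sketch below combines a statistic computed separately on each plausible value using the standard multiple-imputation combining rules (Rubin, 1987). The function name and the numbers are hypothetical, and the sampling variances are taken as given rather than computed from replicate weights.

```python
from statistics import mean, variance

def combine_plausible_values(estimates, sampling_variances):
    """Combine per-plausible-value results with Rubin's (1987) rules.

    estimates          -- the statistic (e.g., a mean) computed on each PV
    sampling_variances -- the sampling variance of that statistic for each PV
    Returns the pooled estimate and its standard error.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(sampling_variances)      # average sampling variance
    between = variance(estimates)          # variance across PVs (divisor m - 1)
    total_variance = within + (1 + 1 / m) * between
    return pooled, total_variance ** 0.5

# Hypothetical subgroup means on five plausible values, each with
# a sampling variance of 9.0 (i.e., a sampling standard error of 3.0).
pv_means = [498.0, 500.0, 502.0, 499.0, 501.0]
pv_variances = [9.0, 9.0, 9.0, 9.0, 9.0]
pooled_mean, pooled_se = combine_plausible_values(pv_means, pv_variances)
```

The between-imputation term is what carries the uncertainty from the conditioning model into the reported standard error. It does not, however, reflect bias arising from missing or error-prone conditioning variables.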
A simple abstraction of the conditioning model, described above, is a linear regression model where the dependent variable is unobserved achievement for each examinee, which in turn is a function of student background characteristics (approximately 300 to 400 variables, depending on the educational system) along with student responses to the assessment. That is, unobservable student achievement is partly a function of background characteristics, such as gender, socioeconomic status, and attitudes toward learning. Simply put, in schematic form,

unobserved achievement = f(student attributes, responses to the assessment) + error.
Notably for the current discussion, student attributes, on the right side of the equation, are assumed to be measured perfectly (no error) and are not missing. But this assumes that students are willing and able to select an appropriate category to describe them. That is, setting aside gender complexities, boys tick the boy box and girls tick the girl box—a fairly straightforward exercise. With more complex questions, however, research on an international assessment similar to PISA has shown that there are discrepancies between parent and student reports of home possessions and that the disparities are higher in less economically developed educational systems (Rutkowski & Rutkowski, 2010). At the extreme, correlations between parent and student reports are as low as .17. Similarly, reliability, as measured by Cronbach’s alpha, is as low as .41 on a commonly used scale of home possessions in PISA, with 43 out of 65 educational system estimates below .60 (Rutkowski & Rutkowski, 2013). Further, PISA 2012 documentation reports fairly large rates of missing data on policy-relevant variables. For example, up to 12% of data was missing on a question that asked which language was spoken most frequently at home (OECD, 2013b). Although we note these examples here, as with all social science data, every variable suffers from some missing data and the possibility of measurement error, the severity of which depends on the educational system in question.
As with ordinary least squares regression, research has shown similar effects on parameter estimates when independent variables are missing (Rubin, 1976) or error prone (Buonaccorsi, 2010). In other words, when student background information is missing, subgroup achievement differences are incorrectly estimated (Rutkowski, 2011), particularly with respect to comparisons across educational systems (e.g., comparing performance of German girls to Norwegian girls). Further, a recent study showed that measurement error in student background information was responsible for meaningfully biased subgroup achievement differences and that the degree of bias was directly related to the amount of measurement error (Rutkowski, 2014). In both cases, misleading inferences are likely when comparing performance across categories of a variable that are either prone to missing data or measurement error.
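The regression analogy can be illustrated with a small simulation. The sketch below is a plain simple-regression example, not the PISA conditioning model: it generates an outcome that depends on a background variable, then re-estimates the slope after adding random measurement error to that variable. All names and parameter values are hypothetical.

```python
import random

def ols_slope(x, y):
    """Ordinary least squares slope of y on a single predictor x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def simulate_attenuation(n=20000, true_slope=0.5, error_sd=1.0, seed=42):
    """Compare slopes using an error-free and an error-prone predictor."""
    rng = random.Random(seed)
    x_true = [rng.gauss(0, 1) for _ in range(n)]
    y = [true_slope * x + rng.gauss(0, 1) for x in x_true]
    # Observed predictor = true value + independent measurement error.
    x_observed = [x + rng.gauss(0, error_sd) for x in x_true]
    return ols_slope(x_true, y), ols_slope(x_observed, y)

slope_clean, slope_noisy = simulate_attenuation()
```

With an error variance equal to the variance of the true predictor, the reliability of the observed variable is .50, so the estimated slope is expected to shrink to roughly half of its true value. Background scales with reliabilities near .41 would imply even stronger attenuation under this simplified setup.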
Measuring Trends
The final methodological issue concerns insufficiently reported limits to measuring trends. As we discuss subsequently, some trend measures, although unsupported by the data, are inconsistently documented in PISA reports. In each cycle of PISA, one content area is a major domain and the other two are minor domains. In 2009, for example, the major domain (reading) was allotted 60% of the testing time, whereas the other 40% was evenly split between math and science (OECD, 2012, p. 28). Further, one of the primary stated outcomes for PISA is trend indicators that show how results change over time (OECD, 2013a, p. 13). To achieve this outcome, a selection of items is held in reserve, rather than released to the public, and is repeated in the next test administration. These items serve as the bridge across cycles, allowing for statistical linking and measuring trends. For example, of 110 math items administered in 2012, 35 were common to 2009, 48 were common to 2006, and 84 were common to the 2003 cycle (OECD, 2014a, p. 280). Because math was a major domain in both 2012 and 2003, the share of linking items is considerably larger between these two administrations. The process for creating the link across cycles focuses on linking the most recent assessment cycle to the adjacent cycle (e.g., 2012 math linked to 2009 math). In this way, 2012 performance is linked to 2009 and 2009 performance is linked to 2006, and so on, as in a chain. And given that the entire assessment is not common across cycles, there is inherent error in these links. These errors, termed link error, are statistically accounted for and documented in the PISA technical reports (e.g., OECD, 2012). Link error estimates provide information as to how trends were estimated in PISA. Further, they are useful for independent researchers who are interested in understanding trends in PISA. Along with sampling error, link errors manifest themselves in the measures of uncertainty that surround achievement estimates across time.
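To show how link error enters a trend comparison, the sketch below computes the standard error of a change between two cycles as the square root of the sum of the two squared standard errors and the squared link error, which is our reading of the approach described in the PISA technical documentation. The function name and all numbers are hypothetical and chosen only to illustrate the mechanics.

```python
import math

def trend_change(mean_new, se_new, mean_old, se_old, link_error, z_crit=1.96):
    """Return the change between cycles, its standard error,
    and whether the change exceeds the critical value."""
    difference = mean_new - mean_old
    se_difference = math.sqrt(se_new ** 2 + se_old ** 2 + link_error ** 2)
    significant = abs(difference) > z_crit * se_difference
    return difference, se_difference, significant

# Hypothetical system: mean 501 (SE 2.0) in the earlier cycle,
# mean 493 (SE 3.0) in the later cycle.
ignoring_link = trend_change(493.0, 3.0, 501.0, 2.0, link_error=0.0)
with_link = trend_change(493.0, 3.0, 501.0, 2.0, link_error=2.0)
```

In this made-up case, an 8-point decline appears statistically significant when link error is ignored but not once a link error of 2 points is included, which is why link errors matter for independent researchers who analyze trends.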
Relevant to the major/minor domain distinction is the fact that the original math and science frameworks were not well developed until the point at which they served as the major domain (2003 for math and 2006 for science). Consequently, mathematics achievement is comparable only back to 2003, and science achievement is directly comparable across cycles back to 2006 (OECD, 2012). This can also be seen in the relatively poor content overlap in these areas. For example, only five items are common between the 2000 and 2009 science assessments. And only two (of four total) content areas are common between the 2000 and 2003 math assessments, leading to just eight common math items between 2000 and 2009. Although much of the current technical documentation is clear on the limitations of measuring change over time on the math and science scale (e.g., OECD, 2012, 2014a), there remain some inconsistencies and confusing points in the documentation. For example, characteristics of the 2000-to-2003 mathematics link are described in the PISA 2009 technical report (OECD, 2012, pp. 213, 220); however, there is no reported link error for this scale in Table 12.36 (OECD, 2012, p. 230). Although someone intimately acquainted with PISA methods would likely find no trouble understanding the issue at hand, only a close and careful inspection of the technical information makes it clear that, although the link is described in some detail, the 2000-to-2003 math link is for two math subdomains only (space and shape, and change and relationships) and that the two scales are, in fact, not generally comparable.
Summary and Conclusion
PISA has served to build capacity and technical savvy in countries that do not have their own national assessments or the infrastructure to develop one. It has also served as an important yardstick to help countries understand aspects of their educational system. In these and other regards, PISA is a valuable instrument. Nevertheless, PISA is not error free; partly as a result, many scholars have taken a critical perspective toward PISA (Brown, Micklewright, Schnepf, & Waldmann, 2007; Goldstein, 2004; Rizvi & Lingard, 2009; Sellar & Lingard, 2013b), some of whom patently reject international comparisons (e.g., Bracey, 2008). PISA’s position of prominence and the associated criticisms make it all the more important, then, to exercise careful use of PISA data, which depends on a clear understanding of the limitations inherent in the study. First and foremost, it is the OECD’s duty to use the data within its means, to set and exemplify high scientific standards, and to take special care in transparently discussing the limitations of the PISA study, especially for non-psychometricians. A reasonable share of the onus also lies with the research community that consumes PISA data and results. The intent of the current paper is to contribute in this regard.
As a way to underscore the importance of interpreting and using PISA results with caution, we presented three interrelated methodological issues: sampling participants, the achievement estimation model, and measuring trends. In each case, we can conceptualize the issue at hand as one of error or inaccuracy, which is part and parcel of inferential statistics. Sampling participants gives rise to sampling error, issues with the achievement estimation model can be attributed to measurement error, and measuring trends gives rise to linking error. Although not exhaustive, these three error sources are responsible for the greatest inaccuracies in large-scale assessment results (Wu, 2010). The impact of sampling error is critically important, particularly as this error pertains to the stated aims of PISA (an indicator of an educational system’s stock of human capital and a yield study that aims to measure cumulative years of learning) relative to the target population (15-year-olds in school). The problem of measurement error, ubiquitous in social science data, is compounded in the large-scale assessment context, particularly when estimating subgroup differences where conditioning variables are partially missing or error prone or when the model assumptions do not hold. Finally, connecting multiple test forms over time is made more challenging by the major/minor domain distinction and by the fact that some content domains did not have fully developed frameworks until they featured as a major domain. Importantly, insufficiently or unclearly communicating these issues can lead to misinterpretation, overinterpretation, and in the worst case, unsupported policy decisions (Wu, 2010, p. 24). As PISA results are only estimates, the degree to which incorrect inferences are being made in practice is unknowable; however, the methodological evidence reviewed in previous sections suggests that such an outcome is likely.
As a call from education researchers to the OECD, we offer recommendations for more completely and transparently communicating limitations when reporting PISA results. Our first recommendation regards clearly published caveats. Although PISA technical documentation and reports typically warn readers of many of the limitations of the data, these cautions, for lack of a better-implemented solution, are often buried among a myriad of other details and information. Anyone who has read PISA reports and manuals understands that, in the voluminous available information, crucial limitations are easily overlooked amid tables, figures, information boxes, and text. We recommend a dedicated limitations chapter or section in every PISA report. This sort of addition could include links to more in-depth information along with standard statements that would be suitable for reporters, policymakers, and other stakeholders to use in their writing or media coverage. This standardized chapter or subsection would allow readers to more efficiently and effectively understand the key limitations associated with PISA methods and results.
A second suggestion is to be consistent in documenting and reporting only those analyses that are methodologically sensible. We point to the above example regarding technical documentation that alluded to links across PISA studies that were not supported by the data (e.g., linking the 2000 math scale to the 2003 math scale). Consistency and clarity in these regards would go a long way toward avoiding mistakes by analysts who are conducting independent research with PISA data. Although we offer only the trend example here, there are other similar areas that would benefit from a consistency review of PISA technical reports.
We also strongly urge the OECD to publish the technical documentation for a given test cycle with the same expediency associated with publishing the initial results. As mentioned previously, the publicly available technical report for PISA 2012 was not published until December 2014 (a full year after the initial release of the results). In contrast, the International Association for the Evaluation of Educational Achievement makes the technical reports for its international achievement studies available within a month after the results are released (D. Hastedt, personal communication, February 26, 2016). Given the level of innovation that accompanies each new PISA cycle, it is an insurmountable challenge for applied researchers and methodologists to conjecture about the technical details that underpin the publicly available data, reports, and policy prescriptions.
Even under the very best circumstances, PISA, by definition, is limited in that it can provide information regarding what a representative sample of 15-year-olds enrolled in school knew on a particular day about select content areas as they are defined by the PISA consortia (OECD, 2013b). As such, inferences can be made only about a narrowly defined population regarding its performance on a narrowly defined set of topics. Based on rigorous technical standards, the information provided by PISA is acceptably precise and reliable; however, it is not perfect. And any interpretation of PISA results should be made in light of the test’s limitations. Finally, we see clear value in international tests, such as PISA; however, we also believe that restraint commensurate with the level of influence should be exercised by all parties when interpreting results and making policy recommendations.
