Abstract
International large-scale assessments of student achievement such as International Association for the Evaluation of Educational Achievement’s Trends in International Mathematics and Science Study (TIMSS) and Progress in International Reading Literacy Study and Organization for Economic Cooperation and Development’s Program for International Student Assessment that have come to prominence over the past 25 years owe a great deal in methodological terms to pioneering work by National Assessment of Educational Progress (NAEP). Using TIMSS as an example, this article describes how a number of core techniques, such as matrix sampling, student population sampling, item response theory scaling with population modeling, and resampling methods for variance estimation, have been adapted and implemented in an international context and are fundamental to the international assessment effort. In addition to the methodological contributions of NAEP, this article illustrates how the large-scale international assessments go beyond measuring student achievement by representing important aspects of community, home, school, and classroom contexts in ways that can be used to address issues of importance to researchers and policymakers.
Introduction
Large-scale international assessments have emerged on the assessment landscape since the 1992 Special Issue of the Journal of Educational Statistics on National Assessment of Educational Progress (NAEP). Among the best known of these are Trends in International Mathematics and Science Study (TIMSS) in 1995, Program for International Student Assessment (PISA) in 2000, and Progress in International Reading Literacy Study (PIRLS) in 2001. Each of these assessment programs is designed to measure trends in international student achievement for a substantial number of countries around the world—TIMSS in mathematics and science; PISA in mathematics, science, and reading; and PIRLS in reading. TIMSS and PIRLS are programs of the International Association for the Evaluation of Educational Achievement (IEA), directed by the TIMSS & PIRLS International Study Center at Boston College. PISA is coordinated by the Organization for Economic Cooperation and Development (OECD).
The initial designs of these three well-known international assessments relied heavily on various aspects of the NAEP methods described in the 1992 Special Issue on NAEP. More recent descriptions of these methods may be found in von Davier, Sinharay, Oranje, and Beaton (2006) and Rutkowski, von Davier, and Rutkowski (2014). In particular, these international assessments adapted NAEP methods for population sampling, matrix sampling for administering the achievement items to students, item response theory (IRT) scaling with latent regression and plausible values for estimating trends in student achievement, and repeated replication methods such as the jackknife for estimating sampling error.
While recognizing the importance of the contribution of NAEP methods to the emergence of large-scale international assessments, it is important to realize that these assessments differ from NAEP in two major respects that have shaped their growth and development: Most obviously, international assessments are conducted across many countries around the world and not just in the United States as in NAEP, necessitating procedures for development, analysis, and reporting that are applicable in a wide variety of cultural, educational, and language contexts.
Large-scale international assessments need to produce valid, reliable results that are comparable across the 50 to 70 participating countries and across time. Most importantly, these assessments have an emphasis on providing policy-relevant results to improve education, which involves ambitious data collection of background information about community, home, school, and classroom contexts for learning.
It would be impossible in this short article to cover a broad variety of international assessments, or even the major aspects of TIMSS, PISA, and PIRLS, so TIMSS will be used to illustrate advancements in large-scale international assessment since 1992. The 20 years of TIMSS trends from 1995 to 2015 coincide nicely with the time frame covered in this special issue. Also, because TIMSS has just completed reporting for the TIMSS 2015 cycle, the 2015 methods and procedures are fully documented and available for reference beyond the information that can be provided here; see Methods and Procedures TIMSS 2015 (Martin, Mullis, & Hooper, 2016).
Overview of TIMSS
TIMSS is an international assessment of mathematics and science at the fourth and eighth grades, conducted every 4 years. Having first collected data in 1995, TIMSS is now entering its seventh assessment cycle and 24th year of data collection. TIMSS 2019, the current assessment in the series, is undergoing development and a transition to eTIMSS. As previously noted, TIMSS 2015 is completed, providing the sixth in a series of trend measures collected over 20 years from 1995 to 2015. TIMSS sometimes includes extensions to its core program. For example, TIMSS Advanced for students taking special mathematics and physics courses in their final year of secondary school was assessed in 1995, 2008, and 2015, and TIMSS Numeracy (a less difficult version of the fourth-grade mathematics assessment) was assessed in 2015. However, these special initiatives will not be covered here.
TIMSS is a project of IEA, an independent international cooperative of national research institutions and government agencies with centers in Amsterdam and Hamburg. TIMSS continues the long history of international studies in mathematics and science conducted by IEA since 1959. IEA pioneered large-scale international comparative studies of educational achievement in the late 1950s and early 1960s to gain a deeper understanding of the effects of policies across countries’ different systems of education (Husen, 1967). TIMSS is directed by the TIMSS & PIRLS International Study Center at Boston College.
Each TIMSS cycle is based on updated assessment frameworks describing in some detail the content and cognitive domains in mathematics and science to be assessed at the fourth and eighth grades. More specifically, in mathematics, the TIMSS 2015 Assessment Frameworks (Mullis & Martin, 2013) defined three content domains at the fourth grade (number, geometric shapes and measures, and data display) and four at the eighth grade (number, algebra, geometry, and data and chance). Science also has three content domains at the fourth grade (life science, physical science, and earth science) and four at the eighth grade (biology, chemistry, physics, and earth science). The same three cognitive domains (knowing, applying, and reasoning) are assessed in mathematics and science at both grades. TIMSS produces achievement scales for each content and cognitive domain, as well as overall achievement scales for both mathematics and science. Each cycle also is guided by an updated framework for background data collection that describes the types of learning situations and factors associated with students’ achievement in mathematics and science that will be investigated via the questionnaire data.
The TIMSS assessments include hundreds of achievement items in various formats that are developed and reviewed through a collegial and collaborative process among representatives from the participating countries. Once the assessment items are agreed upon and field tested across countries, TIMSS uses a matrix-sampling approach to place the assessment items into blocks and then booklets (Beaton, 1997; Martin, Mullis, & Foy, 2013). The achievement booklets are given to nationally representative probability samples of students in each participating country or benchmarking entity (typically regional jurisdictions of countries). Countries participating in TIMSS aim for a sample of about 4,500 students from 150 schools to ensure population coverage and enough respondents for each item.
The data are collected by the National Centers responsible for TIMSS in each country, following procedures detailed in a series of proprietary manuals prepared by the TIMSS & PIRLS International Study Center. IEA Amsterdam and the TIMSS & PIRLS International Study Center work jointly on implementing an extensive quality assurance program that involves observational site visits, extensive checking of the consistency and accuracy of the data by IEA Hamburg, and complex analyses of response patterns by the TIMSS & PIRLS International Study Center.
TIMSS uses IRT scaling methods to summarize student achievement from the data collected in each country (Mislevy, Beaton, Kaplan, & Sheehan, 1992; von Davier, Sinharay, Oranje, & Beaton, 2006). Using methods originally developed by NAEP, TIMSS has shown that matrix sampling together with IRT scaling and population modeling can provide robust estimates of achievement for the participating countries without overburdening individual students.
Nationally representative samples of students in 57 countries and 7 benchmarking entities participated in TIMSS 2015, for a total of 580,000 students. The results are reported in two online companion volumes: TIMSS 2015 International Results in Mathematics (Mullis, Martin, Foy, & Hooper, 2016) and TIMSS 2015 International Results in Science (Martin, Mullis, Foy, & Hooper, 2016). The reports summarize trends in fourth- and eighth-grade students’ achievement overall, at the TIMSS International Benchmarks, and within content and cognitive domains. The reports also present a rich array of information about the students’ attitudes toward mathematics and science, their home and school experience in learning mathematics and science, teachers’ education and training, classroom characteristics and activities, and school resources and climates.
TIMSS 2015 Assessment Design
As TIMSS evolved from cycle to cycle, the participating countries asked for more reporting categories for the achievement results, and it became the expected practice to report results not only overall but for the content and cognitive domains within mathematics and science. More recently, TIMSS began reporting trends for the content and cognitive domains. These ambitious reporting goals require each TIMSS assessment to have a substantial number of achievement items. For example, TIMSS 2015 had 345 items at the fourth grade (169 in mathematics and 176 in science) and 432 items at the eighth grade (212 in mathematics and 220 in science; Mullis, Cotter, Fishbein, & Centurino, 2016).
According to the TIMSS matrix-sampling design, the entire assessment pool of mathematics and science items at each grade is packaged into a set of 14 student achievement booklets, with each student completing just one booklet (Martin et al., 2013). To facilitate the process of creating the student achievement booklets, TIMSS groups the assessment items into a series of item blocks, with approximately 10 to 14 items in each block at the fourth grade and 12 to 18 at the eighth grade. As far as possible, within each block, the distribution of items across the content and cognitive domains matches the distribution across the item pool overall. TIMSS 2015 had a total of 28 blocks at each grade, 14 containing mathematics items and 14 containing science items.
TIMSS uses a systematic design for rotating some blocks of items out of the assessment after each cycle and replacing them with blocks of newly developed items. This approach enables each TIMSS assessment to reflect the most recent developments in the field and to present content in ways consistent with students’ instructional and everyday experiences but still retain a large number of items from assessment to assessment for stability in measuring trends. The trend design provides for each assessment to include blocks of items from three cycles: the current cycle and the two cycles preceding it. Blocks are not used for more than three cycles. For example, after the 2011 assessment, 8 of the 14 mathematics blocks and 8 of the 14 science blocks at each grade were secured for use in measuring trends in 2015. The remaining 12 blocks were made available for use in research and teaching and were replaced by blocks of newly developed items for the TIMSS 2015 assessment. Accordingly, the 28 blocks in the 2015 assessment consisted of 16 blocks of trend items (8 mathematics and 8 science) and 12 blocks of items newly developed for 2015.
In deciding how to distribute assessment blocks across student achievement booklets, the major goal is to maximize coverage of the framework while ensuring that every student responds to sufficient items to provide reliable measurement of trends in mathematics and science. A further goal is to ensure that trends in each of the mathematics and science content and cognitive domains can be measured reliably. To enable linking among booklets while keeping the number of booklets to a minimum, each block appears in two booklets.
Each of the 14 student booklets in TIMSS contains two blocks of mathematics items and two blocks of science items. In half the booklets, two mathematics blocks are presented first, and in the other half, two science blocks are presented first. Additionally, in most booklets, two blocks (one mathematics and one science) contain trend items and two contain newly developed items. Each student completes one booklet in two parts. Fourth-grade students are given 72 minutes to complete their booklets, 36 minutes for the first part, then a break and 36 minutes for the second part. Eighth-grade students are given 90 minutes—45 minutes for each part. Booklets are distributed among students in participating classrooms according to a random assignment determined by the TIMSS within-school sampling software, so that the groups of students completing each booklet are approximately equivalent in terms of student ability.
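The constraints described here (14 booklets, four blocks per booklet, each block in exactly two booklets, subject order alternating) can be satisfied by a simple cyclic arrangement. The following Python sketch is illustrative only: the block labels and the specific pairing rule are assumptions made for the example and are not claimed to reproduce the published TIMSS booklet map.

```python
from collections import Counter


def build_booklets(n_booklets=14):
    """Build a hypothetical cyclic booklet map.

    Booklet b receives mathematics blocks b and b+1 and science blocks
    b and b+1 (wrapping around), so every block lands in two booklets.
    """
    booklets = []
    for b in range(n_booklets):
        nxt = (b + 1) % n_booklets
        math_part = [f"M{b + 1:02d}", f"M{nxt + 1:02d}"]
        science_part = [f"S{b + 1:02d}", f"S{nxt + 1:02d}"]
        # Alternate the subject order so half the booklets start with each subject.
        if b % 2 == 0:
            booklets.append(math_part + science_part)
        else:
            booklets.append(science_part + math_part)
    return booklets


booklets = build_booklets()
# Count how many booklets each block appears in.
block_counts = Counter(block for booklet in booklets for block in booklet)

for number, blocks in enumerate(booklets, start=1):
    print(f"Booklet {number:2d}: {' '.join(blocks)}")
```

Because adjacent booklets share a block in each subject, every booklet is linked to its neighbors, which is the property the linking design relies on.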
Sample Design in TIMSS 2015
TIMSS employs a two-stage random sampling design, with a sample of schools drawn as a first stage and one or more intact classes of students selected from each of the sampled schools as a second stage. Intact classes of students are sampled rather than individuals from across the grade level or of a certain age because TIMSS collects important information about students’ curricular and instructional experiences, and these typically are organized on a classroom basis. Sampling intact classes also has the operational advantage of causing less disruption to the school’s day-to-day business than individual student sampling.
Each country participating in TIMSS develops a national sampling plan for defining its national target population and applying the TIMSS sampling methods to achieve a nationally representative sample of schools and students. The development and implementation of the national sampling plan is a collaborative exercise involving the country’s National Research Coordinator (NRC) and TIMSS sampling experts. Statistics Canada, with support from sampling staff at IEA Hamburg, is responsible for advising the NRC on all sampling matters and for ensuring that the national sampling plan conforms to the TIMSS standards (LaRoche, Joncas, & Foy, 2016).
Target Population and Exclusions
As an international study of the effects of education on student achievement in mathematics and science, TIMSS defines its international target populations in terms of the amount of schooling students have received, with the number of years of formal schooling as the basis of comparison among participating countries.
TIMSS uses United Nations Educational, Scientific, and Cultural Organization (UNESCO)’s International Standard Classification of Education (ISCED) 2011 as an internationally accepted classification scheme for describing levels of schooling across countries. ISCED Level 1 corresponds to primary education or the first stage of basic education. The first year of Level 1 “coincides with the transition point in an education system where systematic teaching and learning in reading, writing and mathematics begins” (UNESCO, 2012, p. 30). Four years after the beginning of the first year is the target grade for fourth-grade TIMSS and is the fourth grade in most countries. Similarly, 8 years after the beginning of the first year of ISCED Level 1 is the target grade for eighth-grade TIMSS and is the eighth grade in most countries. However, given the cognitive demands of the assessments, TIMSS wants to avoid assessing very young students. Thus, TIMSS recommends assessing the next higher grade (i.e., fifth grade for fourth-grade TIMSS and ninth grade for eighth-grade TIMSS) if, for fourth-grade students, the average age at the time of testing would be less than 9.5 years and, for eighth-grade students, less than 13.5 years.
Accordingly, the fourth-grade student target population is defined as all students enrolled in the grade that represents 4 years of schooling counting from the first year of ISCED Level 1, providing the mean age at the time of testing is at least 9.5 years. Similarly, the eighth-grade target population is all students enrolled in the grade that represents 8 years of schooling counting from the first year of ISCED Level 1, providing the mean age at the time of testing is at least 13.5 years. All students enrolled in the target grade, regardless of their age, belong to the international target population and should be eligible to participate in TIMSS.
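The grade-selection rule can be written as a small decision function. This is a minimal sketch: the function name and interface are hypothetical, and "grade" here means years of schooling counted from the first year of ISCED Level 1.

```python
# Minimum mean age at testing for each TIMSS target grade, as stated in the text.
MIN_MEAN_AGE = {4: 9.5, 8: 13.5}


def recommended_grade(nominal_grade, mean_age_at_testing):
    """Return the grade TIMSS would recommend assessing.

    If students in the nominal grade are too young on average,
    the next higher grade is recommended instead.
    """
    if nominal_grade not in MIN_MEAN_AGE:
        raise ValueError("TIMSS target grades are 4 and 8")
    if mean_age_at_testing < MIN_MEAN_AGE[nominal_grade]:
        return nominal_grade + 1
    return nominal_grade


print(recommended_grade(4, 10.2))  # mean age meets the threshold
print(recommended_grade(4, 9.3))   # mean age below 9.5
```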
Sampling Precision and Sample Size
Because TIMSS is fundamentally a study of student achievement, the precision of estimates of student achievement is of primary importance. To meet the TIMSS standards for sampling precision, national student samples are expected to provide for a standard error no greater than 0.035 standard deviation units for the country’s mean achievement. With a standard deviation of 100 on the original TIMSS achievement scales, this target standard error corresponds to a 95% confidence interval of ±7 score points for the achievement mean and of ±10 score points for the difference between achievement means from successive cycles (e.g., the difference between a country’s achievement mean on TIMSS 2011 and TIMSS 2015). Any student-level percentage estimate (e.g., for a student background characteristic) should have a confidence interval of ±3.5 percentage points.
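These targets follow from simple arithmetic. The worked values below assume a normal approximation with the usual 1.96 multiplier and statistically independent samples in the two cycles; both are assumptions of this sketch, not statements taken from the TIMSS documentation.

```latex
\begin{align*}
SE_{\max} &= 0.035 \times 100 = 3.5 \\
\text{mean: } \quad 1.96 \times 3.5 &\approx 6.9 \approx 7 \\
\text{difference between cycles: } \quad 1.96 \times \sqrt{3.5^{2} + 3.5^{2}} &\approx 9.7 \approx 10 \\
\text{percentage: } \quad 3.5 / 1.96 &\approx 1.8 \text{ percentage points of standard error}
\end{align*}
```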
Taking into account the clustering effects of sampling schools and classes, TIMSS has found that, for most countries, the precision requirements are met with a school sample of 150 schools and a student sample of 4,000 students for each target grade (Foy & LaRoche, 2016). Depending on the average class size in the country, one class from each sampled school is often sufficient to achieve the desired student sample size. For example, if the average class size in a country were 27 students, a single class from each of 150 schools would provide a sample of 4,050 students (assuming full participation by schools and students). Some countries choose to sample more than one class per school, either to increase the size of the student sample or to provide a better estimate of school-level effects. Statistics Canada works with each country to adapt the TIMSS sampling design to the organization of the country’s education system in order to meet the sampling precision requirements.
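To see why roughly 4,000 students in 150 schools can satisfy a standard error target of 3.5 points, it helps to separate the nominal sample size from the effective sample size under clustering. The sketch below uses the standard design-effect approximation for equal-sized clusters, 1 + (m - 1) * rho. The intraclass correlation of 0.15 is purely an illustrative assumption, not a TIMSS figure, and the sketch ignores stratification and weighting.

```python
import math


def srs_equivalent_n(sd, target_se):
    """Simple-random-sample size that would give the target standard error."""
    return (sd / target_se) ** 2


def design_effect(cluster_size, icc):
    """Approximate design effect for equal-sized clusters."""
    return 1 + (cluster_size - 1) * icc


def clustered_se(sd, n_students, cluster_size, icc):
    """Standard error of a mean after deflating n by the design effect."""
    n_effective = n_students / design_effect(cluster_size, icc)
    return sd / math.sqrt(n_effective)


n_srs = srs_equivalent_n(sd=100, target_se=3.5)  # about 816 students
n_students = 150 * 27                            # one class of 27 in each of 150 schools
se = clustered_se(sd=100, n_students=n_students, cluster_size=27, icc=0.15)

print(f"SRS-equivalent n: {n_srs:.0f}")
print(f"Nominal n: {n_students}, approximate SE: {se:.2f}")
```

Under these assumed values, the clustered sample of 4,050 students behaves like a simple random sample of a little over 800, which is about what the 3.5-point target requires. A higher intraclass correlation would call for more schools rather than more students per school.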
Scaling the TIMSS 2015 Achievement Data
Given the complexities of the TIMSS matrix-sampling design and the need to have measures of student proficiency on the entirety of each assessment for analysis and reporting purposes, TIMSS uses a combination of IRT scaling and population modeling to describe student achievement and to provide accurate measures of trends. As each student responded to only a small part of the assessment item pool (2 of the 14 mathematics item blocks and 2 of the 14 science item blocks), the TIMSS scaling approach uses multiple imputation—or plausible values—methodology to obtain proficiency estimates in mathematics and science for all students. The scaling and modeling procedures, and the software systems to implement these procedures, were developed at Educational Testing Service (ETS) for NAEP in the 1990s and continue to be used by TIMSS to this day (von Davier & Sinharay, 2014). The following description is based on Foy and Yin (2016).
The application of IRT scaling and plausible values methodology to the data from the TIMSS assessments involves four major tasks: calibrating the achievement items (estimating IRT parameters for each item), creating conditioning variables from the student questionnaire data for population modeling, generating proficiency estimates for mathematics and science, and transforming these proficiency estimates to the achievement scales used to report trend results from previous assessments. TIMSS has separate scales for mathematics and science at both fourth and eighth grades. In addition to these overall achievement scales, TIMSS generates proficiency estimates for each content domain (e.g., algebra, geometry in mathematics; physics, chemistry in science) and cognitive domain (knowing, applying, and reasoning).
Linking Assessment Cycles With Concurrent Calibration
The TIMSS reporting scales for overall mathematics and science at each grade level were graduated originally in TIMSS 1995 by setting the mean score across all countries that participated in TIMSS 1995 to 500 and the standard deviation to 100. To enable measurement of trends over time, achievement data from successive TIMSS assessments have been transformed to these same scales. This is done by concurrently scaling the data from each assessment together with the data from the previous assessment—a process known as concurrent calibration—and applying linear transformations to place the results from the assessment on the same scale as the results from the previous assessment. This procedure enables TIMSS to measure trends across all six assessment cycles: 1995, 1999, 2003, 2007, 2011, and 2015.
In concurrent calibration, item parameters for the current assessment are estimated based on the data from both the current and previous assessments, recognizing that a number of items (the trend items) are common to both. It is then possible to estimate the latent ability distributions of students in both assessments using the item parameters from the concurrent calibration. The difference between these two distributions is the change in achievement from one assessment to the next.
The stability of the concurrent calibration linking is dependent to a considerable extent on having a substantial number of trend items, items that are retained from one assessment to the next. TIMSS achieves this by having eight blocks of items in common from one assessment to the next for each subject and grade. For example, the TIMSS 2011 and 2015 concurrent scaling of fourth-grade mathematics involved 14 item blocks from 2011 and 14 item blocks from 2015, of which 8 blocks containing 102 items were common to both assessments.
Figure 1 illustrates how the concurrent calibration approach is applied in the context of TIMSS trend scaling. This is essentially the same approach used by NAEP (Mazzeo & von Davier, 2014). The gap between the distributions of the previous assessment data under the previous calibration and under the concurrent calibration (Figure 1, second panel) is typically small and is the result of slight differences in the item parameter estimates from the two calibrations. The linear transformation removes this gap by shifting the two distributions from the concurrent calibration, such that the distribution of the previous assessment data from the concurrent calibration aligns with the distribution of the previous assessment data from the previous calibration, while preserving the gap between the previous and current assessment data under the concurrent calibration. This latter gap represents the change in achievement between the previous and current assessments that TIMSS sets out to measure as trend.

Figure 1. Trends in International Mathematics and Science Study (TIMSS) concurrent calibration model.
Calibrating the TIMSS 2015 Assessment Data
Item calibration for TIMSS 2015 was conducted by the TIMSS & PIRLS International Study Center using the commercially available PARSCALE software (Muraki & Bock, 1991) developed originally for NAEP. TIMSS uses 2-parameter and 3-parameter IRT models to describe the behavior of dichotomously scored (right/wrong) constructed response and multiple-choice items, respectively (Birnbaum, 1968), and generalized partial credit models for polytomous items (scored 0, 1, or 2 in TIMSS; Muraki, 1992). In the calibration process, “not reached” items are considered as “not administered” and are omitted from the estimation procedure. However, “not reached” items are treated as incorrect for the purposes of generating proficiency scores.
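The three model families named here can be written compactly. The Python sketch below uses the common logistic forms; the scaling constant D = 1.7 and the step parameterization a(θ − b + d) are assumptions based on widespread convention and should be checked against the technical documentation before being relied on. All parameter values in the example are made up.

```python
import math

D = 1.7  # logistic scaling constant (assumed convention)


def p_correct(theta, a, b, c=0.0):
    """3-parameter logistic model; with c = 0 it reduces to the 2-parameter model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def gpcm_probabilities(theta, a, b, steps):
    """Generalized partial credit model.

    Returns the probabilities of scores 0..m, where `steps` holds the m
    step parameters d_1..d_m. Score 0 corresponds to an empty sum.
    """
    cumulative = [0.0]
    for d in steps:
        cumulative.append(cumulative[-1] + D * a * (theta - b + d))
    largest = max(cumulative)  # subtract the maximum for numerical stability
    weights = [math.exp(z - largest) for z in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical items: a constructed response item (2PL), a multiple-choice
# item with guessing (3PL), and a 0/1/2 partial-credit item (GPCM).
print(p_correct(theta=0.0, a=1.0, b=0.0))
print(p_correct(theta=0.0, a=1.0, b=0.0, c=0.2))
print(gpcm_probabilities(theta=0.3, a=0.9, b=0.1, steps=[0.5, -0.5]))
```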
The 2015 item calibration included data from the TIMSS 2011 assessment and the 2015 assessment for countries that participated in both assessment cycles. For mathematics and science at fourth and eighth grade, the calibration used all available item response data from each country’s student samples from both 2015 and 2011 assessments, with student samples weighted so that each country contributed equally to the item calibration. A total of 41 countries from TIMSS 2015 contributed to the concurrent calibration at the fourth grade and 34 countries at the eighth grade.
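Weighting "so that each country contributed equally" amounts to rescaling each student's sampling weight so that the weights sum to the same constant within every country. A minimal sketch follows; the constant of 500 and the record layout are assumptions chosen for illustration.

```python
from collections import defaultdict


def equalize_country_weights(records, target_total=500.0):
    """Rescale sampling weights so each country's weights sum to target_total.

    `records` is a list of (country, sampling_weight) pairs; the returned
    list holds the rescaled weights in the same order.
    """
    totals = defaultdict(float)
    for country, weight in records:
        totals[country] += weight
    return [weight * target_total / totals[country] for country, weight in records]


# Made-up data: country A has far larger sampling weights than country B.
records = [("A", 120.0), ("A", 80.0), ("B", 3.0), ("B", 1.0), ("B", 1.0)]
rescaled = equalize_country_weights(records)
print(rescaled)
```

Within a country the relative weights are unchanged, so the national sample still represents its population, while a large country no longer dominates the item parameter estimates.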
The item parameters estimated from these concurrent calibrations, based on the countries that participated in both the 2011 and 2015 assessments, were then used to estimate student proficiency for all countries and benchmarking entities participating in the TIMSS 2015 assessments. The item parameters from the concurrent calibration also were used to estimate student proficiency in the mathematics and science content and cognitive domains. Using the same item parameters for both the overall and content/cognitive domain scaling consolidates the scaling process while treating the subscales as just shorter scales that vary based on the items included in the content or cognitive domain. Student proficiency for mathematics and science overall and each of the content and cognitive domains was estimated for a total of 47 countries and 7 benchmarking participants at fourth grade and for 39 countries and 7 benchmarking participants at eighth grade.
Evaluating Fit of IRT Models to the TIMSS Assessment Data
After completing item calibration, checks are performed to verify that the item parameters obtained from PARSCALE adequately reproduce the observed distribution of student responses across the proficiency continuum. The fit of the IRT models to the TIMSS assessment data is examined by comparing the item response function curves (item characteristic curves) generated using the item parameters estimated from the data with the item response functions calculated from the latent abilities estimated for each student who responded to the item. When the results for an item fall near the fitted curves, the IRT model fits the data well and provides an accurate and reliable measurement of the underlying proficiency scale.
Because comparable measurement across countries depends on achievement items functioning the same in all countries, TIMSS pays particular attention to differential item functioning by country. Although countries are expected to exhibit some variation in performance across items, in general, countries with high average performance on the assessment should perform relatively well on each of the items, and low-scoring countries should do less well on each of the items. When this does not occur (e.g., when a high-performing country has low performance on an item on which other countries are doing well), there is said to be an item-by-country interaction. Although rare in TIMSS, a large item-by-country interaction may be a sign that an item is flawed in some way (e.g., faulty translation or printing error) and that steps should be taken to address the problem. To detect item-by-country interactions, TIMSS derives for each item a preliminary indicator of item difficulty (based on a Rasch model) for each country and compares this to the international average item difficulty across all countries (Foy et al., 2016).
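The logic of an item-by-country check can be sketched as follows. This is a deliberately simplified stand-in: instead of fitting a Rasch model, it uses the logit of the proportion correct as a rough difficulty indicator, centers difficulties within each country to remove overall performance differences, and compares each country's centered difficulty with the international average. The data and the flagging threshold of 0.5 logits are invented for illustration.

```python
import math
from statistics import mean


def logit_difficulty(p):
    """Rough difficulty indicator: higher values mean a harder item."""
    return -math.log(p / (1.0 - p))


def item_by_country_residuals(p_correct):
    """Return residuals[country][item] = country difficulty - international average."""
    centered = {}
    for country, items in p_correct.items():
        difficulty = {item: logit_difficulty(p) for item, p in items.items()}
        country_mean = mean(difficulty.values())
        # Centering removes the country's overall performance level.
        centered[country] = {i: d - country_mean for i, d in difficulty.items()}
    item_names = list(next(iter(p_correct.values())))
    international = {i: mean(centered[c][i] for c in centered) for i in item_names}
    return {c: {i: centered[c][i] - international[i] for i in item_names}
            for c in centered}


# Made-up proportions correct; item "i2" is unexpectedly hard in country C.
p_correct = {
    "A": {"i1": 0.80, "i2": 0.60, "i3": 0.40},
    "B": {"i1": 0.70, "i2": 0.50, "i3": 0.30},
    "C": {"i1": 0.85, "i2": 0.30, "i3": 0.50},
}
residuals = item_by_country_residuals(p_correct)
THRESHOLD = 0.5  # hypothetical flagging threshold, in logits
flagged = [(c, i) for c in residuals for i in residuals[c]
           if abs(residuals[c][i]) > THRESHOLD]
print(flagged)
```

A flagged pair is only a prompt for review of the translation, printing, or scoring of that item in that country; it does not by itself show that the item is flawed.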
Conditioning Variables for the Latent Regression Analysis
After the item parameters have been estimated during the item calibration phase, the next step is to conduct a latent regression analysis in which the items with their item parameters are treated as indicators of the latent ability and the student background variables are treated as covariates. The plausible values generated by TIMSS are multiple imputations from this latent regression model based on the students’ responses to the items they were given, the item parameters estimated in the calibration stage, and the students’ background characteristics. Because the plausible values generated from the latent regression analysis are conditional on the student background data, the background variables collectively are known as the conditioning model.
Ideally, all background data would be included in the conditioning model, but because TIMSS has so many student background variables that could be used in conditioning, the TIMSS & PIRLS International Study Center follows the practice established by NAEP of using principal components analysis to reduce the number of variables and the collinearity among them while explaining most of their common variance. In TIMSS, principal components are computed separately for each country based on all student background variables (including parent background variables at the fourth grade). Those principal components accounting for 90% of the variance of the background variables are retained for use as conditioning variables.
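The retention rule (keep the leading components that together account for 90% of the variance) can be expressed independently of how the components are computed. The sketch below assumes the eigenvalues of the background-variable correlation or covariance matrix have already been obtained elsewhere; the eigenvalues shown are made up.

```python
def n_components_for_variance(eigenvalues, threshold=0.90):
    """Smallest number of leading components whose eigenvalues reach the threshold share."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(ordered)


# Hypothetical eigenvalues for one country's background variables.
eigenvalues = [4.0, 2.5, 1.5, 1.2, 0.5, 0.3]
n_keep = n_components_for_variance(eigenvalues)
print(n_keep)
```

Because the components are computed separately for each country, the number retained will generally differ from one country to another.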
In addition to the principal components, student gender (dummy coded), the language of the test (dummy coded), an indicator of the classroom in the school to which a student belongs (criterion scaled), and an optional country-specific variable (dummy coded) are included as primary conditioning variables, thereby accounting for most of the variance between students in the background variables and preserving the between-classroom and within-classroom variance structure in the scaling model.
Generating IRT Proficiency Scores for the TIMSS Assessment Data
TIMSS uses ETS’s DGROUP program (Rogers, Tang, Lin, & Kandathil, 2006; Thomas, 1993) to generate the IRT proficiency estimates. This program takes as input the students’ responses to the items they were given, the item parameters estimated at the calibration stage, and the conditioning variables and generates as output the plausible values that represent student proficiency. To estimate the uncertainty due to the item sampling process, and following the practice first established by NAEP, TIMSS generated five plausible values for each student on each of the TIMSS 2015 scales.
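Once five plausible values are available, a statistic is computed once per plausible value and the results are combined. The sketch below applies the standard multiple-imputation combining rules (Rubin's rules), which are assumed here as the usual way of analyzing plausible values. The five means and their jackknife sampling variances are made-up inputs, and the exact variance formula used operationally should be taken from the technical documentation.

```python
from statistics import mean, variance


def combine_plausible_values(estimates, sampling_variances):
    """Combine per-plausible-value estimates into one estimate and standard error."""
    m = len(estimates)
    point_estimate = mean(estimates)
    within = mean(sampling_variances)  # average sampling variance
    between = variance(estimates)      # imputation variance (divisor m - 1)
    total_variance = within + (1 + 1 / m) * between
    return point_estimate, total_variance ** 0.5


# Hypothetical country means, one per plausible value, with jackknife variances.
pv_means = [512.0, 514.0, 511.0, 513.0, 515.0]
jackknife_variances = [9.0, 9.5, 8.5, 9.0, 9.0]
estimate, standard_error = combine_plausible_values(pv_means, jackknife_variances)
print(f"{estimate:.1f} ({standard_error:.2f})")
```

Averaging the five plausible values for each student before analysis would understate the variance, which is why the statistic is computed separately for each plausible value first.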
A useful feature of DGROUP is its ability to perform multidimensional latent regression using the responses to all items across the proficiency scales and the correlations among the scales to improve the reliability of each individual scale. TIMSS capitalizes on this feature to simultaneously estimate overall mathematics and overall science proficiency using a two-dimensional DGROUP run (Rubin & Thomas, 2000; Thomas, 1993).
The multidimensional scaling feature of DGROUP also is used to generate proficiency scores for the TIMSS content and cognitive domains. For these applications, the same item parameters estimated for the overall mathematics and science scales are used, as well as the same conditioning variables. At the fourth grade, for mathematics, a three-dimensional model is used to estimate proficiency in the content domains of number, geometric shapes and measures, and data display. Similarly, for science, a three-dimensional model is used to estimate proficiency in life science, physical science, and earth science. At the eighth grade, four-dimensional models are used for the content domains of number, algebra, geometry, and data and chance in mathematics and for biology, chemistry, physics, and earth science in science. The cognitive domain scaling uses three-dimensional models to estimate the three cognitive domains (knowing, applying, and reasoning) in mathematics and science at both fourth and eighth grades.
Transforming the Overall Scores to Measure Trends
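To provide results for the TIMSS 2015 assessments on the existing TIMSS achievement scales, the 2015 plausible values for overall mathematics and overall science generated by DGROUP had to be transformed to the TIMSS reporting metric. This was accomplished through a set of linear transformations as part of the concurrent calibration procedure. These linear transformations were given by
$$PV^{*}_{k,i} = A_{k,i} + B_{k,i} \cdot PV_{k,i}$$
where $PV_{k,i}$ is the TIMSS 2015 plausible value $i$ of scale $k$ prior to transformation; $PV^{*}_{k,i}$ is the plausible value $i$ of scale $k$ after transformation; and $A_{k,i}$ and $B_{k,i}$ are the linear transformation constants.
The linear transformation constants were obtained by first computing the international means and standard deviations of the proficiency scores for the overall mathematics and science scales using the plausible values produced in 2011 based on the 2011 item calibrations for the trend countries. These were the plausible values published in 2011. Next, the same calculations were done using the plausible values from the rescaled TIMSS 2011 assessment data based on the 2015 concurrent item calibrations for the same set of countries. From these calculations, the linear transformation constants were defined as
$$B_{k,i} = \frac{\sigma_{k,i}}{\sigma^{*}_{k,i}}, \qquad A_{k,i} = \mu_{k,i} - B_{k,i} \cdot \mu^{*}_{k,i}$$
where $\mu_{k,i}$ and $\sigma_{k,i}$ are the international mean and standard deviation of scale $k$ based on plausible value $i$ as published in 2011, and $\mu^{*}_{k,i}$ and $\sigma^{*}_{k,i}$ are the international mean and standard deviation of scale $k$ based on plausible value $i$ from the rescaled TIMSS 2011 data under the 2015 concurrent calibration.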
There are five sets of transformation constants for each scale, one for each plausible value. These linear transformation constants were applied to the overall proficiency scores—mathematics and science—at both grades and for all participating countries and benchmarking participants. This provided student achievement scores for the TIMSS 2015 assessments that are directly comparable to the scores from previous TIMSS assessments.
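As a rough illustration of this step, the sketch below derives one pair of constants by matching the weighted mean and standard deviation of the rescaled 2011 plausible values to those of the published 2011 values, and then applies the pair to a set of scores. The function names, the simulated data, and the use of simple weighted moments for the international mean and standard deviation are assumptions made for illustration only; as described above, the operational procedure uses a separate pair of constants for each plausible value on each scale.

```python
import numpy as np

def weighted_mean_sd(values, weights):
    """Weighted mean and (population-style) weighted standard deviation."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mean = np.average(values, weights=weights)
    var = np.average((values - mean) ** 2, weights=weights)
    return mean, np.sqrt(var)

def linking_constants(published_pv, rescaled_pv, weights):
    """Constants (a, b) such that a + b * rescaled_pv reproduces the
    mean and standard deviation of published_pv."""
    mu_pub, sd_pub = weighted_mean_sd(published_pv, weights)
    mu_res, sd_res = weighted_mean_sd(rescaled_pv, weights)
    b = sd_pub / sd_res
    a = mu_pub - b * mu_res
    return a, b

def transform(pv, a, b):
    """Apply the linear transformation to plausible values."""
    return a + b * np.asarray(pv, dtype=float)

# Simulated stand-ins (hypothetical data, not TIMSS results).
rng = np.random.default_rng(0)
weights = rng.uniform(0.5, 2.0, size=500)
published = rng.normal(500.0, 100.0, size=500)  # 2011 values on the reporting metric
rescaled = rng.normal(0.1, 1.1, size=500)       # same data on the new calibration metric

a, b = linking_constants(published, rescaled, weights)
linked = transform(rescaled, a, b)
```

After the transformation, the rescaled values have the same weighted mean and standard deviation as the published values, which is what allows the same constants to carry the 2015 scores onto the existing metric.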
The linear transformation constants for the overall scales also were applied to the scales for the content and cognitive domains, with the transformations for overall mathematics applied to the mathematics content domains and cognitive domains, and the transformations for science applied to the science content domains and cognitive domains. In this approach to measuring trends in content and cognitive domains, achievement changes over time are established in the context of achievement in each subject overall. Trends are not established separately for each content or cognitive domain; rather, differential changes in performance in the domains are considered in the light of trends in the subject overall.
Estimating Standard Errors in TIMSS 2015
The TIMSS approach to estimating student proficiency in mathematics and science combines probability sampling techniques for student sampling with matrix-sampling designs for targeting individual students with a subset of the assessment item pool. This approach makes efficient use of resources, in particular keeping student response burden to a minimum, but at a cost of some variance or uncertainty in the reported statistics, such as the means and percentages computed to estimate population parameters.
Each statistic in the TIMSS 2015 international reports is accompanied by an estimate of its standard error. For statistics reporting student achievement, which are based on plausible values, standard errors have two components: sampling variance due to generalizing from student samples to the entire fourth- or eighth-grade student populations and imputation variance due to inferring students’ performance on the entire assessment from their performance on the subset of items that they took. For parameter estimates of variables that are not plausible values, including context questionnaire scales, standard errors are based entirely on sampling variance and contain no correction for unreliability (Foy & LaRoche, 2016).
Estimating Sampling Variance
Because of its complex multistage cluster sampling design, TIMSS uses a resampling method known as jackknife repeated replication (JRR) to estimate sampling variances. JRR was chosen originally by NAEP because it is computationally straightforward and provides approximately unbiased estimates of the sampling variances and sampling errors of means, totals, and percentages (Johnson & Rust, 1992).
In the TIMSS application of the JRR, the schools in each national sample are assigned to a series of jackknife sampling zones, two schools to a zone. Since most national samples consist of 150 schools, a total of 75 zones are created. The JRR procedure creates two replicate samples for each sampling zone: one in which the first school in the pair is omitted and the weights for the second school doubled to compensate for the omission and another in which the second school is omitted and the weights for the first school doubled. Both replicate samples also include all students in the other sampling zones. With this process applied to each of the 75 sampling zones, the JRR procedure yields a total of 150 replicate samples of the total sample, each with its own set of replicate sampling weights reflecting the removal of one school and the doubling of the other.
When the replicate samples have been created, they can be used to estimate the sampling variance of any statistic based on the TIMSS data. This is done by computing the statistic 150 times, once for each set of replicate weights. The variation across these 150 jackknife estimates (the sum of the squared errors) determines the sampling variance of the statistic.
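The replicate-weight construction and the variance computation described above can be sketched as follows. This is a minimal illustration, not the operational TIMSS code: the function names and the toy data are hypothetical, and the factor of one half applied to the sum of squared deviations (which accounts for each zone contributing two replicates) is an assumption that the text above does not state explicitly.

```python
import numpy as np

def jrr_replicate_weights(weights, zone, school_in_pair):
    """Build two replicate weight columns per sampling zone.

    weights: full-sample student weights, shape (n,)
    zone: sampling zone of each student's school, shape (n,)
    school_in_pair: 0 or 1, which school of the zone pair the student attends
    """
    weights = np.asarray(weights, dtype=float)
    zone = np.asarray(zone)
    school_in_pair = np.asarray(school_in_pair)
    zones = np.unique(zone)
    reps = np.tile(weights[:, None], (1, 2 * len(zones)))
    for j, z in enumerate(zones):
        in_zone = zone == z
        for dropped in (0, 1):
            col = 2 * j + dropped
            # Omit one school of the pair and double the other;
            # students in all other zones keep their original weights.
            reps[in_zone & (school_in_pair == dropped), col] = 0.0
            reps[in_zone & (school_in_pair != dropped), col] *= 2.0
    return reps

def weighted_mean(values, weights):
    return np.sum(weights * values) / np.sum(weights)

def jrr_sampling_variance(values, weights, rep_weights, statistic=weighted_mean):
    """Sum of squared deviations of the replicate estimates from the
    full-sample estimate, halved (assumed convention, see lead-in)."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    full = statistic(values, weights)
    reps = np.array([statistic(values, rep_weights[:, r])
                     for r in range(rep_weights.shape[1])])
    return 0.5 * np.sum((reps - full) ** 2)

# Toy example: two zones, two schools per zone, two students per school.
w = np.ones(8)
zone = [0, 0, 0, 0, 1, 1, 1, 1]
pair = [0, 0, 1, 1, 0, 0, 1, 1]
y = [1, 2, 3, 4, 5, 6, 7, 8]
rep_w = jrr_replicate_weights(w, zone, pair)
sampling_var = jrr_sampling_variance(y, w, rep_w)
```

Because the statistic is passed in as a function, the same replicate weights can be reused for percentages or other weighted statistics.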
For any given statistic, the sampling variance estimated with the TIMSS JRR method quantifies the variation arising from sampling students using the multistage stratified cluster sample design. For many variables in TIMSS, with the notable exception of proficiency scores based on plausible values, the standard error of a statistic reflects only sampling variation and is simply the square root of its sampling variance. Examples of such variables include the age of students, the percentage of students with at least one parent with a university degree, and scale score on context questionnaire scales such as the TIMSS Students Like Learning Mathematics Scale. Being based entirely on sampling variation, the standard error for such variables makes no provision for measurement error, unlike proficiency scores based on plausible values.
Estimating Imputation Variance
As described earlier, the plausible values used to represent student proficiency are derived through a multiple imputation process, in which each proficiency estimate incorporates a random element. TIMSS follows the procedure initiated by NAEP of using the variability among the five plausible values as a measure of the imputation uncertainty (Little & Rubin, 2002; Mislevy et al., 1992). For example, in estimating the imputation variance for mean mathematics achievement in a country, TIMSS computes the mathematics mean 5 times—once for each set of plausible values—and averages the five means to provide an estimate of mean mathematics achievement for the country. The variance of the five plausible value means is an estimate of the imputation variance of the mean.
Although NAEP now generates 20 plausible values as a matter of course, analyses of the TIMSS data show that only a very marginal increase in the precision of estimates could be expected from increasing the number of imputations, and at a cost of greatly increased computational burden for the TIMSS countries.
Total Variance
For any statistic based on plausible values, the total variance is the sum of the sampling variance and the imputation variance. For this purpose, the sampling variance of the statistic is computed separately for each of the five plausible values using the JRR technique and then averaged to give an overall figure. The imputation variance is the variance among the five plausible value estimates, as described above. The square root of the total variance is the standard error for any statistic based on plausible values such as the average TIMSS mathematics achievement for girls and the percentage of students who reach the TIMSS advanced international benchmark of mathematics achievement. 7
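To make the combination concrete, the sketch below takes the five plausible-value estimates of a statistic and their five JRR sampling variances and returns the point estimate, total variance, and standard error. The function name and the example numbers are hypothetical. The optional `rubin_factor` argument applies the (1 + 1/M) multiplier from Rubin's general multiple-imputation rules to the imputation variance; whether and how that multiplier is applied here is an assumption, since this section describes the imputation variance simply as the variance among the plausible-value estimates.

```python
import numpy as np

def combine_plausible_values(pv_estimates, pv_sampling_variances, rubin_factor=False):
    """Combine per-plausible-value estimates into a point estimate,
    total variance, and standard error."""
    est = np.asarray(pv_estimates, dtype=float)
    samp = np.asarray(pv_sampling_variances, dtype=float)
    m = len(est)
    point = est.mean()                 # average of the M estimates
    sampling_var = samp.mean()         # JRR variance averaged over the M values
    imputation_var = est.var(ddof=1)   # variance among the M estimates
    if rubin_factor:
        imputation_var *= 1.0 + 1.0 / m  # assumed optional correction
    total_var = sampling_var + imputation_var
    return point, total_var, np.sqrt(total_var)

# Hypothetical country means for five plausible values and their JRR variances.
means = [500.0, 502.0, 498.0, 501.0, 499.0]
jrr_vars = [9.0, 10.0, 11.0, 10.0, 10.0]
point, total_var, se = combine_plausible_values(means, jrr_vars)
```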
The TIMSS International Benchmarks of Student Achievement
As previously described, the TIMSS achievement results are summarized using scaling and reported on achievement scales with a range of 0 to 1,000, with most proficiency estimates in the interval from 300 to 700. Countries’ average scores provide users of the data with information about how achievement compares among countries and whether achievement is improving or declining over time. To provide as much information as possible for policy and curriculum reform, however, it is important to understand the mathematics and science competencies associated with different locations along the achievement scales. For example, in terms of levels of student understanding, what does it mean for a country to have average achievement of 426 or 513, and how are these scores different?
To address this issue, the TIMSS International Benchmarks provide information about what students know and can do at various points along the achievement scales. More specifically, TIMSS has identified four points along the achievement scales to use as international benchmarks of achievement—Advanced International Benchmark (625), High International Benchmark (550), Intermediate International Benchmark (475), and Low International Benchmark (400). The percentages of students in each country reaching these International Benchmarks are reported for each TIMSS assessment, as well as changes in the percentages from one assessment cycle to the next.
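A simple way to picture the benchmark percentages is sketched below: for each plausible value, compute the weighted percentage of students scoring at or above a benchmark, then average across plausible values. The function name and the tiny data set are hypothetical, and treating "reaching" a benchmark as scoring at or above its scale point is an assumption of this sketch.

```python
import numpy as np

BENCHMARKS = {"Advanced": 625, "High": 550, "Intermediate": 475, "Low": 400}

def percent_reaching(pvs, weights, cut):
    """Weighted percentage at or above `cut`, averaged over plausible values.

    pvs: array of shape (n_students, n_plausible_values)
    weights: student sampling weights, shape (n_students,)
    """
    pvs = np.asarray(pvs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    reached = pvs >= cut
    per_pv = 100.0 * (weights[:, None] * reached).sum(axis=0) / weights.sum()
    return per_pv.mean()

# Four hypothetical students with two plausible values each.
pvs = [[630, 620], [560, 540], [480, 470], [390, 410]]
weights = [1.0, 1.0, 1.0, 1.0]
percentages = {name: percent_reaching(pvs, weights, cut)
               for name, cut in BENCHMARKS.items()}
```

Because the benchmarks are cumulative, a student who reaches the Advanced benchmark is also counted as reaching the three lower ones.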
For each assessment cycle, the TIMSS & PIRLS International Study Center works with the expert international committee, the Science and Mathematics Item Review Committee, to conduct a scale anchoring analysis to describe student competencies at the benchmarks. Consistent with the procedure used in prior assessments, the TIMSS 2015 scale anchoring analysis was conducted separately for mathematics and for science at fourth and eighth grades.
In brief, scale anchoring involves identifying items that students scoring at the international benchmarks answer correctly 8 and then having the mathematics and science experts examine the content of each item to determine the kind of knowledge, skill, or reasoning demonstrated by students who respond correctly to the item. The experts then summarize the detailed list of item competencies in a brief description of achievement at each international benchmark. Thus, the scale anchoring procedure yields a content-referenced interpretation of the achievement results that can be considered in light of the TIMSS 2015 frameworks for assessing mathematics and science (Mullis, Cotter, Centurino, Fishbein, & Liu, 2016).
Students’ Contexts for Learning Mathematics and Science
As a program of IEA, TIMSS has been firmly rooted in the goal of examining how variations in educational achievement relate to differences in educational systems. In the words of IEA’s first Chair, Professor Torsten Husen (1967), the naturally occurring variation among the education systems of the world represented a laboratory allowing “comparisons to be made with means more powerful and more sure than artificially set up and costly experimental situations within one country or culture” (pp. 27–28).
In the IEA tradition begun more than 40 years earlier, TIMSS 1995 burgeoned into the most ambitious international assessment of student achievement conducted until then, with an extensive array of questionnaires addressing system, school, classroom, and student explanatory factors. Also, TIMSS produces an encyclopedia with each assessment cycle that serves as a qualitative companion to the quantitative achievement and questionnaire data.
Each participating country contributes a chapter to the TIMSS encyclopedia, such that the TIMSS 2015 Encyclopedia: Education Policy and Curriculum in Mathematics and Science (Mullis, Martin, Goh, & Cotter, 2016) is a comprehensive compendium of how mathematics and science are taught around the world. The chapters describe the structure of each education system, the mathematics and science curricula in the primary and lower secondary grades, and overall policies related to mathematics and science instruction. The chapters also explain the routes and requirements for teacher education and professional development, how countries monitor student progress in mathematics and science, and any special initiatives in mathematics and science education. To support the chapters, each country also completed a curriculum questionnaire about national education policies and the national contexts that shape the content and implementation of the mathematics and science curricula. This included questions on promotion and retention policies, the local or national examination system, and goals and standards for mathematics and science instruction. Taken together, the data from the curriculum questionnaire and the information in the chapters present a concise yet rich portrait of mathematics and science education globally and make the TIMSS 2015 Encyclopedia a valuable resource for policy and research in comparative education.
TIMSS Questionnaires
TIMSS routinely administers background questionnaires to students, their parents, their teachers, and their school principals. Similar to the frameworks defining the content and cognitive domains to be included in the achievement measures, TIMSS has a context questionnaire framework that is updated with each assessment cycle. The TIMSS 2015 Context Questionnaire Framework (Hooper, Mullis, & Martin, 2013) established the foundation for the background information collected in TIMSS 2015. In 2015, for the first time, the fourth-grade TIMSS assessment included a home questionnaire for students’ parents and caregivers to collect information about students’ home backgrounds and early learning experiences.
Student Questionnaire: A questionnaire is completed by each student who takes the TIMSS assessment. This questionnaire asks about aspects of students’ home and school lives including basic demographic information, their home environment, school climate for learning, and self-perception and attitudes toward mathematics and science.
Home Questionnaire (fourth grade only): The parents or caregivers of each student taking the TIMSS fourth-grade assessment are asked to complete a questionnaire. This questionnaire asks about home resources for literacy and numeracy; early childhood activities in literacy, numeracy, and science; the child’s reading and quantitative readiness when beginning school; parents’ attitudes toward reading and mathematics; and parental education levels and occupation.
Teacher Questionnaires: For the students sampled to take part in TIMSS, their mathematics and science teachers complete a teacher questionnaire. This questionnaire is designed to gather information on teacher characteristics as well as the classroom contexts for teaching and learning mathematics and science and the topics taught in these subjects. In particular, the teacher questionnaire asks about teachers’ backgrounds, their views on opportunities for collaboration with other teachers, their job satisfaction, and their education and training as well as professional development. The questionnaire also collects information on characteristics of the classes tested in TIMSS, instructional time, materials, and activities for teaching mathematics and science and promoting students’ interest in the subjects, use of computers, assessment practices, and homework.
School Questionnaire: The principal of each school participating in TIMSS is asked to respond to this questionnaire. It asks about school characteristics, instructional time, resources and technology, parental involvement, school climate for learning, teaching staff, the role of the principal, and students’ school readiness.
Developing Context Questionnaire Scales
Establishing a systematic approach to summarizing topics covered in the context questionnaires became a necessity in 2011, when the trend cycles of TIMSS (every 4 years) and PIRLS (every 5 years) coincided. Countries took advantage of having both TIMSS and PIRLS in 2011 to assess the same fourth-grade students in reading, mathematics, and science and to relate achievement in these three key curriculum areas to extensive context questionnaire data. Context questionnaire development concentrated on measuring constructs related to fostering achievement across countries through context questionnaire scales, that is, sets of items analyzed through IRT methodology (Mullis, Martin, & Hooper, 2017), with each construct modeled by a unidimensional IRT scale.
The TIMSS & PIRLS International Study Center guided the process of developing valid and reliable scales to measure those constructs. The process required considerable prioritizing to minimize student burden because each construct needed to be measured by at least 5 to 8 Likert-type items for satisfactory reliability and validity. 9 The task included (1) identifying constructs that were important to all three curricular areas for which scales could be developed to measure trends in future assessments and (2) minimizing response burden to an acceptable level. The questionnaires were developed and field tested, and the scales evaluated for unidimensionality, reliability, item fit with the Rasch partial-credit model, and relationship with student achievement. The 2011 questionnaire development effort yielded nearly 20 context questionnaire scales measuring aspects of student learning and teaching. Examples of scales from the TIMSS 2011 home, school, teacher, and student questionnaires include Early Literacy and Numeracy Activities, School Emphasis on Academic Success, Student Engagement, and Student Bullying. TIMSS also has three student attitude scales for mathematics and science that have been refined through the assessment cycles: Students Like Learning Mathematics/Science, Students Confident in Learning Mathematics/Science, and Students Value Mathematics/Science.
To update the context questionnaire scales for TIMSS 2015, the 2011 data were used to identify scales that needed more construct-relevant items to improve reliability and validity or, in a few instances, had items that did not contribute to measuring the construct and could be deleted (Hooper, 2016). A number of scales had new items added to improve measurement, and several new scales were developed. After the process of NRC review, field testing, revisions by the TIMSS & PIRLS International Study Center and by the questionnaire advisory group, and final review by the TIMSS 2015 NRCs, the TIMSS 2015 Home, Student, Teacher, and School Questionnaires included about 30 scales at the fourth grade and 25 scales at the eighth grade.
Reporting the Results of the Context Questionnaire Scales
For reporting in both TIMSS 2011 and 2015, the context questionnaire scales were constructed using IRT scaling methods, specifically the Rasch partial credit model (Masters, 1982). This model was chosen in preference to the generalized partial credit model used for the TIMSS achievement scaling because of its simplicity and ease of interpretation. Construction of the TIMSS context questionnaire scales is described in Martin, Mullis, Hooper, Yin, et al. (2016), which also provides extensive data on the reliability and validity of each scale for each country. The TIMSS 2015 context questionnaire scaling was conducted by the TIMSS & PIRLS International Study Center using the ConQuest 2.0 software (Wu, Adams, Wilson, & Haldane, 2007).
The primary purpose of the context questionnaire scaling was to provide scale scores that could be used in analyses of relationships with achievement, and for this purpose, the Rasch scale scores were very suitable. However, it also was important to provide a way to interpret the meaning of a score on each context questionnaire scale. As a parallel to the TIMSS International Benchmarks of achievement, which describe performance on the achievement scales at particular points on the scale (400, 475, 550, 625), a procedure was developed to classify students into regions of each questionnaire scale corresponding to high, middle, and low values on the underlying construct. The scale score cut points delimiting the regions were defined in terms of combinations of response categories—essentially “raw scores” based on student responses. Because each possible raw score corresponds to one and only one Rasch scale score, it is possible to use this procedure to determine the scale score equivalent of a cut point defined in raw score terms.
The following example illustrates the TIMSS approach to reporting context questionnaire data using the TIMSS 2015 Students’ Sense of School Belonging Scale at the eighth grade. As the name suggests, the Students’ Sense of School Belonging Scale seeks to measure students’ feelings toward their school and connectedness with the school community. For each of the seven statements shown in Figure 2, students were asked to indicate the degree of their agreement with the statement: agree a lot, agree a little, disagree a little, or disagree a lot. Using IRT partial credit scaling, the data from student responses were placed on a scale constructed so that the scale centerpoint of 10 was located at the mean logit score across all TIMSS countries. The units of the scale were chosen so that 2 scale score points corresponded to the standard deviation of the logit scores across all countries.

Figure 2. Items in the TIMSS 2015 Students’ Sense of School Belonging Scale, eighth grade.
To facilitate reporting and interpreting results, the scale was divided into three regions: High Sense of School Belonging, Some Sense of School Belonging, and Little Sense of School Belonging. With this approach, countries may be ordered for ease of comparison by the percentage of students with a high sense of school belonging. It was a priority that the meaning of the scale regions be easily understood, and so the boundaries of the regions were defined in terms of identifiable combinations of response categories. These boundaries were defined in terms of raw score equivalents of response combinations and then transformed into cut points on the questionnaire scale.
For example, from a consideration of the questions making up the Students’ Sense of School Belonging Scale, it was determined that in order to be in the high region of the scale and labeled High Sense of School Belonging, a student would have to agree a lot, on average, with at least four of the seven statements and agree a little with the other three. Similarly, it was determined that a student who, on average, at most agreed a little with three of the statements and disagreed a little with the other four would be labeled as having Little Sense of School Belonging. The particular response combinations that defined the region boundaries, or cut points, were based initially on a judgment by TIMSS staff of what constituted a high or low region on each individual scale and were subsequently reviewed and agreed upon by the TIMSS NRCs.
To determine their scale score equivalents, the cut points were first quantified in raw score terms by assigning a numeric value to each response category. Assigning 0 to disagree a lot, 1 to disagree a little, 2 to agree a little, and 3 to agree a lot results in raw scores for the Students’ Sense of School Belonging Scale ranging from 0 (disagree a lot with all seven statements) to 21 (agree a lot with all seven). A student who agreed a lot with four statements and agreed a little with the other three would have a raw score of 18 (4 × 3 + 3 × 2). For the School Belonging Scale, this raw score corresponds to a scale score of 10.3, which then became the cut point for the high region. Following this approach, a student with a scale score of 10.3 or more would be in the High Sense of School Belonging region of the scale. Similarly, agreeing a little with three statements and disagreeing a little with four statements would result in a raw score of 10 (3 × 2 + 4 × 1), which corresponds to a scale score cut point of 7.5, so that a student with a scale score less than or equal to 7.5 would be in the Little Sense of School Belonging region.
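The arithmetic in this example can be checked with a short sketch. The response coding, the two response combinations, and the cut points of 10.3 and 7.5 come from the text; the function and variable names are hypothetical. The sketch does not reproduce the Rasch conversion from raw score to scale score, and simply takes the two published cut points as given.

```python
# Numeric value assigned to each response category.
RESPONSE_POINTS = {
    "disagree a lot": 0,
    "disagree a little": 1,
    "agree a little": 2,
    "agree a lot": 3,
}

def raw_score(responses):
    """Sum of response values across the seven statements."""
    return sum(RESPONSE_POINTS[r] for r in responses)

# Response combinations that define the two region boundaries.
high_cut_raw = raw_score(["agree a lot"] * 4 + ["agree a little"] * 3)
low_cut_raw = raw_score(["agree a little"] * 3 + ["disagree a little"] * 4)

# Scale score equivalents of those raw scores, as reported in the text.
HIGH_CUT = 10.3
LOW_CUT = 7.5

def classify(scale_score):
    """Assign a scale score to one of the three reporting regions."""
    if scale_score >= HIGH_CUT:
        return "High Sense of School Belonging"
    if scale_score <= LOW_CUT:
        return "Little Sense of School Belonging"
    return "Some Sense of School Belonging"
```

Here `high_cut_raw` evaluates to 18 and `low_cut_raw` to 10, matching the raw scores worked out above.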
Trends in the Results for Context Questionnaire Scales
Although a number of the TIMSS 2015 scales had been revised too extensively since 2011 to be appropriate for measuring trends, progress was made toward the goal of developing a stable set of educational context factors. For the scales that did support trend measurement, linking procedures were implemented to place the data from the two TIMSS cycles (2011 and 2015) on a common metric. This section describes the procedures for measuring trends, that is, placing data for the TIMSS 2015 context questionnaire scales onto the TIMSS 2011 metric and validating this process.
As an example, Figure 3 shows the TIMSS 2015 Students Confident in Mathematics Scale for fourth-grade students—one of the scales where trend measurement was reported. This scale measures how confident students feel about their ability in mathematics, in terms of their level of agreement with nine statements about mathematics. Statements expressing negative sentiment were reverse coded during the scaling. Seven of the nine statements were common to the TIMSS 2011 and TIMSS 2015 versions of this scale, with “T” for trend identifying these items to the left of their variable name. Two new statements were added to the seven common items to improve the measure of Students Confident in Mathematics for TIMSS 2015.

Figure 3. Items in the TIMSS 2015 Students Confident in Mathematics trend scale, fourth grade.
The IRT calibration and scoring methods for trend context questionnaire scales were the same as those used for the new context scales. The data for these nine items were calibrated across all TIMSS 2015 countries using the Rasch partial credit model, and, through this calibration, item parameters were estimated on a logit scale that was unique to the 2015 cycle. Following calibration, weighted maximum likelihood estimation was used to derive Rasch logit scale scores based on these estimated item parameters for all countries and benchmarking participants, and as such, student scores were placed on this 2015 logit metric. Although similar, the TIMSS 2015 logit metric is not identical to the TIMSS 2011 logit metric, and thus the TIMSS 2015 scores needed to be transformed to the 2011 metric to allow for trend reporting.
This linking was achieved through a two-step transformation process. The first transformation, with linear constants A1 and B1, placed the TIMSS 2015 logit scale scores on the TIMSS 2011 logit metric, and the second transformation, with linear constants A2 and B2, transformed the TIMSS 2011 logit metric to the TIMSS scale metric, which uses the (10, 2) metric described earlier. To increase the efficiency of this transformation process and reduce rounding errors, both transformations were combined into one calculation using the equations below to create a set of final scale transformation constants, A and B:
$$A = A_2 + B_2 \cdot A_1, \qquad B = B_2 \cdot B_1$$
so that the reported scale score is $A + B \cdot \theta_{15}$, where $\theta_{15}$ is the TIMSS 2015 logit scale score.
The first set of transformation parameters, A1 and B1, were obtained by applying the mean/sigma method (Kolen & Brennan, 2004) to the two sets of common item parameters: one from the current calibration of TIMSS 2015 data and the other from the previous calibration of TIMSS 2011 data. The mean and standard deviation of the item parameters were first found over all common items for each calibration. The transformation parameters A1 and B1 were calculated based on these two sets of means and standard deviations:
$$B_1 = \frac{SD_{c11}}{SD_{c15}}, \qquad A_1 = MN_{c11} - B_1 \cdot MN_{c15}$$
where MNc15 and SDc15 are the mean and standard deviation of the item parameter estimates for all common items from the current calibration on TIMSS 2015 data; MNc11 and SDc11 are the mean and standard deviation of the parameter estimates for all common items from the previous calibration on TIMSS 2011 data. The second set of transformation parameters, A2 and B2, were retrieved from the scale transformations that were established in 2011 for reporting. This transformation placed the resulting Rasch scores on the TIMSS (10, 2) trend reporting metric.
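The two-step linking can be sketched in a few lines, under the assumption that each transformation has the form intercept plus slope times score. All function names and numbers below are hypothetical. In particular, `reporting_constants` only illustrates how a (10, 2) metric translates into an intercept and slope; in the procedure described above, A2 and B2 were retrieved from the 2011 scaling rather than recomputed.

```python
import numpy as np

def mean_sigma_constants(common_2015, common_2011):
    """Mean/sigma link from the 2015 logit metric to the 2011 logit metric,
    based on the parameter estimates of the common items."""
    p15 = np.asarray(common_2015, dtype=float)
    p11 = np.asarray(common_2011, dtype=float)
    b1 = p11.std() / p15.std()
    a1 = p11.mean() - b1 * p15.mean()
    return a1, b1

def reporting_constants(logit_mean, logit_sd, center=10.0, unit=2.0):
    """Intercept and slope that put logits on a (center, unit) metric."""
    b2 = unit / logit_sd
    a2 = center - b2 * logit_mean
    return a2, b2

def combine(a1, b1, a2, b2):
    """Collapse the two linear steps into a single intercept and slope."""
    return a2 + b2 * a1, b2 * b1

# Hypothetical common-item parameter estimates from the two calibrations.
a1, b1 = mean_sigma_constants([-1.0, 0.0, 1.0], [-1.5, 0.5, 2.5])
# Hypothetical 2011 logit mean and standard deviation.
a2, b2 = reporting_constants(logit_mean=0.5, logit_sd=1.25)
A, B = combine(a1, b1, a2, b2)

theta_2015 = 0.3                              # a 2015 logit scale score
one_step = A + B * theta_2015                 # combined transformation
two_step = a2 + b2 * (a1 + b1 * theta_2015)   # the same result in two steps
```

Applying the combined constants once gives the same result as applying the two transformations in sequence, while avoiding intermediate rounding.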
Using the procedure described above, the TIMSS & PIRLS International Study Center reported trends for TIMSS 2015 at both the fourth and eighth grades on scales measuring Instruction Affected by Mathematics Resource Shortages, Instruction Affected by Science Resource Shortages, Safe and Orderly School, School Discipline Problems, Students Like Learning Mathematics, Students Like Learning Science, Students Confident in Mathematics, Students Confident in Science, and Home Resources.
Conclusion
The purpose of this article was to illustrate how 20 years of TIMSS trend measurements are a testament to the robustness of the NAEP methods presented in the 1992 special issue. The article shows how TIMSS has benefited from using the powerful combination of matrix sampling, IRT scaling and population modeling with plausible values, and robust resampling methods for estimating standard errors as a firm methodological foundation. Although the discussion has focused on TIMSS, other large-scale international assessments such as PIRLS and PISA also have benefited from these methodological contributions. Adapting these methods to accommodate a wide variety of national contexts has enabled large-scale international assessments to produce comparable achievement scores across countries and across time.
While the international assessments have their methodological roots in the work of NAEP, they also have made notable adaptations and extensions, particularly in characterizing the context for education in participating countries. Through questionnaires administered to students and their parents, teachers, and school principals, and through encyclopedia chapters and curriculum and system data provided by NRCs, the international large-scale assessments routinely collect a rich array of data that can be combined with high-quality trend data on student achievement to address issues of importance to researchers and policymakers. A further extension is the application of IRT scaling techniques to summarize the context questionnaire data and construct valid and reliable measures for trend analysis.
In brief, TIMSS and the other well-known large-scale international assessment programs have built on the NAEP approach and methods to create enduring trend assessments that regularly provide important information to policymakers and educators across 60 to 80 countries, while extending their reach to include broad coverage of policy-relevant information about community, home, school, and classroom contexts.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
