Toward Education Quality Improvement in China: A Brief Overview of the National Assessment of Education Quality

Abstract

This article is an overview of the National Assessment of Education Quality (NAEQ) of China in reading, mathematics, sciences, arts, physical education, and moral education at Grades 4 and 8. After a review of the background and history of NAEQ, we present the assessment framework with students’ holistic development at the core and the design for each subject used in the 2015–2017 assessment cycle. Technical details including item response modeling and the standard setting procedure are presented. We conclude with a discussion of the social impact, current issues, and future directions of national educational assessment.

Keywords

accountability, item response theory, performance assessment, standard setting

The National Assessment of Education Quality (NAEQ) is a national monitoring system for the quality of Grades 4 and 8 school education in China. The mission of NAEQ is to improve the quality of school education. Specifically, NAEQ evaluates education quality of fourth- and eighth-grade students who are in the critical development periods for children or adolescents. For each grade, six subjects including mathematics, sciences, Chinese language, physical education, arts, and moral education are assessed. Data-based evidence related to the current state and possible influencing factors of education quality are reported to the Ministry of Education (MOE), local governments, and the general public.

The system of NAEQ has been under development since 2007. The first round of operational assessment took place in 2015–2017. There have been numerous challenges in both the developmental phase and the operational assessments. Some of them remain unresolved as NAEQ, still a young program, continues to evolve. This article, as a brief account of this arduous journey, is organized as follows. First, the background and history of NAEQ are reviewed. Next, the fundamental framework and subject designs are introduced, followed by some technical details of NAEQ. Finally, we discuss the social impact and current issues of the national educational assessment in China.

Background and History

The development of NAEQ reflects the current priorities of Chinese public education. The Chinese education system consists of 6 years of primary school (6- to 12-year-olds, Grades 1–6), 3 years of middle school (13- to 15-year-olds, Grades 7–9), 3 years of high school (16- to 18-year-olds), 4 years of undergraduate education, and postgraduate education. Over the past three decades, China has greatly improved its 9-year compulsory education, which covers the primary and middle school levels. In the 1990s, the priority of Chinese education was to make sure that all children who reached the eligible school age were able to go to school. By 2015, the enrollment rate of primary school had reached 99.88% (MOE, 2016). With this quantitative goal achieved, the quality and equity of education have become the major concerns of society. Improving the quality of education and promoting the development of children and adolescents in multiple subject areas have become the current priorities. As a result, policymakers and the general public have a strong need for empirical evidence to monitor education quality.

Large-scale assessment of student academic achievement, nationally or internationally, has been used as an important strategy to inform policy reforms. The Program for International Student Assessment (PISA) is an example of an international large-scale assessment, created in 1997 by the Organization for Economic Co-operation and Development (OECD, 2016) to evaluate and monitor the outcomes of education systems in OECD and non-OECD countries and economies. As early as 1969, the National Assessment of Educational Progress (NAEP), a national educational assessment program of the United States, was launched to provide educators, policymakers, and the general public with information about educational achievement and progress over time (e.g., Beaton & Zwick, 1992; Johnson, 1992). Many other countries, including Australia (Ministerial Council on Education, Employment, Training, and Youth Affairs, 2006), Japan (Ministry of Education, Culture, Sports, Science & Technology & National Institute for Educational Policy Research, 2015), and Korea (Korea Institute of Curriculum and Evaluation, 2008), have also carried out their own national educational assessment programs.

Driven by the new education priorities and the international trends of large-scale assessments, the program of NAEQ was launched in 2007 as one of the national education strategies to collect information on student outcomes and their influencing factors. After a decade of exploration, NAEQ has officially become a systematic national assessment program acting under the authorization of the MOE. The MOE authorized the National Assessment Center of Education Quality (henceforth referred to as “the Center”), which is affiliated with Beijing Normal University, to develop the assessment system for NAEQ. The Center is a nongovernmental organization and thus is able to take an unbiased position in assessing education quality.

Tasked with exploring the possibility of a comprehensive assessment of the complicated construct of education quality for a vast country, NAEQ has brought together nearly 300 experts in the areas of education, measurement, and subject matters from both home and abroad. Since 2007, the designs and instruments of NAEQ have been developed and validated drawing from the experience of other advanced large-scale assessment programs like NAEP and PISA. The evolution of NAEQ over time also reflects the most recent curriculum reforms in China (OECD, 2016).

Before launching the formal assessment, NAEQ conducted a series of pilot and field tests, collecting student achievement and performance test data, as well as background information, which lasted for 8 years from 2007 to 2014. The field tests included eight assessments spanning the 8 years and five of them were conducted at the national level. Over the years, participants of the field tests added up to more than 525,000 students and more than 125,000 teachers and principals from 772 counties in 32 provinces or regions. After sufficient exploration and validation, NAEQ has been recognized by the MOE as a national assessment program. On April 15, 2015, the MOE (2015) published the National Compulsory Education Quality Assessment System, in which NAEQ is officially included in the national education inspection system.

Framework and Designs

Aspects of Education Quality Evaluation

As implied in its name, NAEQ sets out to measure education quality, which is oft cited in the literature but seldom defined explicitly. There is no simple definition of education quality in NAEQ. Instead, NAEQ identifies several aspects of education quality evaluation. First, the evaluation should be based on the assessments of cognitive and noncognitive student outcomes of multiple subjects. Second, the assessments should be contextualized. Third, the assessments should be aligned with the curriculum standards.

Student outcomes are undoubtedly at the core of NAEQ's evaluation. NAEQ is built upon the belief that education quality evaluation should be based on student outcomes in the context of their influencing factors. Student outcomes are believed to be affected not only by the material conditions of the education system but also by school climate and school culture, which are critical in creating a favorable environment for student development (Wang, Berry, & Swearer, 2013).

Another aspect of education quality emphasizes the holistic development of students. The holistic perspective is concerned with the development of every student’s intellectual, emotional, social, physical, artistic, creative, and moral potentials. China has a long tradition of holistic-development-oriented education, which dates back about 3,000 years to the golden age of the Zhou Dynasty, a time that Confucius highly valued. At that time, young aristocrats were required to master six subjects, also called “Liuyi” or six arts, which are rites, music, archery, charioteering, literature, and mathematics. The six arts, representing a holistic perspective of education, were advocated by Confucius and have influenced Chinese education and measurement ever since.

The evaluation of education quality hinges on the measurement of student outcomes and influencing factors in multiple subjects. Specifically, six subject areas are assessed in NAEQ to provide a full picture of holistic-development-oriented education quality. The three major subjects (mathematics, sciences, and language) are assessed in most existing large-scale assessments. However, large-scale assessments of arts, physical education, and moral education are not commonly seen. For each subject, the intellectual outcomes of students are tested as in the well-known large-scale assessments; noncognitive student outcomes associated with each subject, including emotions, attitudes, and values, are also measured. The inclusion of the six subjects reflects the goal to promote every student’s holistic development. Whether the education system promotes student development in each of the six subjects is a major indicator of education quality.

The quality of school education also depends on the degree to which the national curriculum standards are successfully implemented. NAEQ is firmly based on the fourth- and eighth-grade national curriculum for each subject. On one hand, the assessments of student cognitive and noncognitive outcomes cover all the contents and cognitive abilities that the curriculum requires for each subject; on the other hand, every item or task in the assessments is deemed appropriate for the grade-level curriculum. Textbooks and other instructional materials, as potentially implemented curriculum, provide additional resources for the development of the assessment instrument.

One feature of the curriculum-based assessment is the emphasis on active and creative problem-solving, as well as interactive and co-operative learning (OECD, 2016). For the assessment of sciences and mathematics, the test blueprint includes items assessing problem-solving abilities. Such items should be set in a meaningful real-world context and require the application of multiple knowledge points and skills. A constructed-response science item is presented as an example, which assesses the ability to apply physics knowledge in a complex real-world situation.

A researcher bought a “10-A 250-V” socket from the store and conducted the following experiment:

plug a microphone into the socket and turn the microphone on;

use a digital thermometer to measure the plug wire’s temperature;

record the data every 40 seconds; and

repeat the experiment with an electric cup, an induction cooker, and an electric kettle.

The data are shown below. According to the records, what conclusions will you draw? And why?

Time (s) / Temperature	Microphone (60 W)	Electric Cup (400 W)	Induction Cooker (1,000 W)	Electric Kettle (1,500 W)
0	23.92	23.71	23.69	23.49
40	23.95	24.02	26.84	28.92
80	23.97	24.48	32.34	39.12
120	23.99	24.93	37.79	50.05

The assessment of quality is inevitably entwined with evaluation based on a set of standards, which implies that criterion-referenced interpretations need to be made. In NAEQ, performance levels were derived from the curriculum standards. They describe what students are expected to know when they finish the fourth or the eighth grade. The curriculum-based assessment of NAEQ is mainly concerned with what students know. However, the scaled score alone does not show what students know and what they do not. This is where standard setting comes in, giving meaning to the scaled score. Based on the standard setting results, the percentage of students at each performance level is an important indicator of education quality for a region. Focusing on criterion-referenced interpretations of the assessment results, NAEQ is distinguished from many large-scale assessment programs designed to mainly differentiate or rank students or groups of students.

Assessment Framework

The context, input, process, and product (CIPP) evaluation model, developed as a program evaluation model, has been widely used in educational evaluations (Stufflebeam, 1983). NAEQ developed a multicomponent assessment framework based on the CIPP evaluation model with the student outcome assessment at the core, which is shown in Figure 1. Student outcomes consist of cognitive outcomes (or academic achievement) and noncognitive outcomes. Student academic achievement is measured by paper-and-pencil tests for mathematics, sciences, and Chinese language. Noncognitive outcome measures, including student emotions, attitudes, and values associated with each subject, are assessed in the student background questionnaire. The performances in physical education, arts, and moral education are also part of the noncognitive outcomes.

Figure 1.

Framework of National Assessment of Education Quality.

The assessment of education process collects data on individual involvement (e.g., study time per week, homework hours), teaching process (e.g., teaching time per week, use of technologies in classrooms), and the environment and organization of schools (e.g., school environment and course offering). Opportunities to learn, measured by classroom coverage of content topics, will be included in future assessment cycles. Indicators for education input include general student background (e.g., gender, ethnicity, socioeconomic background, and maternal education level) and school input of financial and human resources (e.g., classroom equipment, pupil–teacher ratio, and average class size per school). Context data including demographic information, gross domestic product (GDP), and educational development index are collected from government records and used in developing the sampling scheme.

Subject Design

Each subject has its own design. Table 1 summarizes the instruments and the contents of assessment for each subject. The instruments include the paper-and-pencil achievement tests, performance assessments, and student/teacher/principal questionnaires, which were developed and validated in the pilot and field tests for each subject.

Table 1.

Design of National Assessment of Education Quality for Each Subject

Test Years	Subject	Paper-and-Pencil Test: Content Domains^a	Paper-and-Pencil Test: Cognitive Domains^b	Performance Assessment	Student Questionnaire: Noncognitive Outcome	Student Questionnaire: Contextual Information	Teacher/Principal Questionnaire
2015, 2018,…	Mathematics	Numbers and algebra, shapes and geometry, and statistics and probability	Computation, spatial imagination, data analysis, reasoning, and problem-solving	—	Interest, confidence, habits, and so on	Student background (e.g., socioeconomic background) and individual involvement (e.g., homework hours)	Classroom-level and school-level education process (e.g., use of technologies in classrooms, course offering) and school input (e.g., class size)
2015, 2018,…	Physical education	—	—	Fitness test: Height, body mass index, eyesight, vital capacity, speed, strength, and stamina	Interest, attitudes, health habits, sleep, physical training habit, and so on
2016, 2019,…	Chinese language	Basic Chinese, reading, writing	Inference, synthesis and explanation, and literature appreciation	—	Interest, confidence, habits, and so on
2016, 2019,…	Arts	Knowledge about forms of artistic expressions, features of traditional national artwork	—	The music performance test and the creative drawing test	Interest, participation in art activities, and so on
2017, 2020,…	Sciences	Life science, physical science, earth, and universe	Remembering, understanding, applying, exploring, interpreting, and problem-solving	—	Interest, confidence, habits, and so on
2017, 2020,…	Moral education	Legal literacy, Chinese traditional culture, and knowledge about China	—	—	Beliefs and values

^a The content domains are aligned with the content standards of the national curriculum and are part of the test blueprint. ^b The cognitive domains define the important cognitive abilities that are required in the national curriculum. The multidimensional item response theory model treats the cognitive domains as dimensions.

Administration and Reporting

The assessment of the subjects is spiraled in 3-year assessment cycles with two subjects assessed each year. Taking the 2015–2017 cycle as an example, mathematics and physical education were assessed in 2015, Chinese language and arts were assessed in 2016, and sciences and moral education were assessed in 2017.
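The rotation can be written down compactly. The following is a minimal sketch, assuming the 3-year pattern listed in Table 1 simply repeats from 2015 onward; the function name `subjects_assessed` is hypothetical and not part of any NAEQ system.

```python
# Subject pairs in the order reported for the 2015-2017 cycle.
CYCLE = [
    ("mathematics", "physical education"),  # 2015, 2018, ...
    ("Chinese language", "arts"),           # 2016, 2019, ...
    ("sciences", "moral education"),        # 2017, 2020, ...
]
FIRST_YEAR = 2015  # first operational assessment year


def subjects_assessed(year: int) -> tuple:
    """Return the two subjects assessed in a given year.

    Assumes the 3-year spiral continues unchanged in later cycles.
    """
    if year < FIRST_YEAR:
        raise ValueError("no operational assessment before 2015")
    return CYCLE[(year - FIRST_YEAR) % len(CYCLE)]


print(subjects_assessed(2019))  # ('Chinese language', 'arts')
```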

The survey sample for the national assessment is selected using a three-stage design, which is described in detail in the next section. Once a sample of counties is obtained, NAEQ contacts the local education departments and organizes training sessions for local education administrators. The county education departments make plans for the local assessment process regarding staffing and deployment. After schools are sampled from each county, the local education departments train the school principals, proctors, inspectors, and other staff and verify the information from the schools to be reported to NAEQ. The sampling of students within the selected schools is conducted by NAEQ, and a list of students is returned to each sampled school for test administration. During the preparation and the test administration, NAEQ opens a hotline for local administrators, principals, and other administrative staff in case any questions arise.

The assessment takes place on the last Thursday in May, about half a month before the end of the semester. This time was chosen to ensure the completeness of a school year’s teaching and learning, also considering convenience in assessment administration. All the assessments are finished within a single day following the standard directions. About 200,000 students take the assessment on that day.

The reporting system consists of the main report and reports on specific topics. The main report shows student results at the national level and the influential factors and is available to the general public. In addition, there are reports on specific topics such as teacher training and quality, gender differences, and rural versus urban communities. For the scaled subject areas (i.e., Chinese language, mathematics, and sciences), both the average scaled scores and the achievement-level data (obtained from the standard setting) are reported. The subscores are also reported in some reports. For the subject areas that are not scaled, some of the results are reported on the item level in terms of the average score or the percentage in each response category.

Quantitative Methods and Technologies

Student Sampling

To obtain the national student sample for an assessment year, a three-stage design is adopted that combines stratified probability proportionate to size (PPS) sampling and systematic PPS sampling. Stratification is employed at each stage for sampling efficiency and representativeness. The first-stage sample consists of a sample of counties from 32 provinces. Counties are stratified by a cluster analysis of the GDP per capita, urbanization level, and educational development index of every county in each province. PPS sampling is used: The number of counties selected in each stratum is determined by the ratio of the number of students in the stratum to the total number of students in the province, with at least two counties selected from each stratum and at least six counties sampled from each province. The resulting number of selected counties is about one tenth of the total number of counties in China.
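The allocation rule for counties can be illustrated with a short sketch. It assumes a provincial target number of counties, simple rounding of the proportional shares, and a top-up of the largest strata when the provincial minimum is not met; none of these details are specified in the text, and the function name, stratum labels, and student counts are hypothetical.

```python
def allocate_counties(students_by_stratum, target,
                      min_per_stratum=2, min_per_province=6):
    """Allocate a province's county sample across strata.

    Each stratum receives a share of `target` proportional to its student
    count, but never fewer than `min_per_stratum` counties. If the total is
    still below `min_per_province`, extra counties go to the largest strata
    first (an assumed tie-breaking rule).
    """
    total = sum(students_by_stratum.values())
    alloc = {
        stratum: max(min_per_stratum, round(target * n / total))
        for stratum, n in students_by_stratum.items()
    }
    by_size = sorted(students_by_stratum, key=students_by_stratum.get,
                     reverse=True)
    i = 0
    while sum(alloc.values()) < min_per_province:
        alloc[by_size[i % len(by_size)]] += 1
        i += 1
    return alloc


# Hypothetical province with three strata from the cluster analysis.
province = {"stratum_1": 500_000, "stratum_2": 300_000, "stratum_3": 200_000}
print(allocate_counties(province, target=10))
# {'stratum_1': 5, 'stratum_2': 3, 'stratum_3': 2}
```

The sketch does not cap an allocation at the number of counties that actually exist in a stratum; an operational version would need that check.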

In the second stage of sampling, 12 elementary schools and 8 middle schools are sampled from each selected county using PPS sampling. In each county, the schools are stratified based on their locations (i.e., city, town, and rural area). The principals of the sampled schools complete the principal questionnaire.
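For readers unfamiliar with PPS selection, the following sketch shows the textbook systematic PPS procedure applied to schools within one county. Whether NAEQ uses exactly this variant at the second stage is an assumption; the enrollment figures and the function name are hypothetical, and stratification by location is omitted for brevity.

```python
import random


def systematic_pps(sizes, n, rng):
    """Select n units with probability proportionate to size.

    Units are laid end to end on a line of length sum(sizes). A random
    start is drawn in [0, step), and every step-th point along the line
    selects the unit it falls in.
    """
    total = sum(sizes)
    step = total / n
    start = rng.random() * step
    picks = []
    cum = 0.0  # cumulative size of all units before position idx
    idx = 0
    for j in range(n):
        point = start + j * step
        # Advance to the unit whose interval contains this point.
        while idx < len(sizes) - 1 and cum + sizes[idx] <= point:
            cum += sizes[idx]
            idx += 1
        picks.append(idx)
    return picks


# 40 hypothetical elementary schools with enrollments from 100 to 490.
enrollments = list(range(100, 500, 10))
schools = systematic_pps(enrollments, n=12, rng=random.Random(2015))
print(schools)
```

A unit larger than the sampling interval can be hit more than once under this scheme; in practice such units are usually taken with certainty and removed before the systematic pass.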

The third stage selects 30 fourth-grade students and 30 eighth-grade students from each sampled school. The number of sampled students in each province should not be smaller than 3,600. Each selected student is required to participate in paper-and-pencil tests (and the fitness test if physical education is assessed) of both subjects assessed in that year and respond to the student questionnaire. For each grade, five students are selected from the student sample of a school to take the music performance test, and another five students are selected to take the creative drawing test. A teacher sample is selected from the sampled schools to take the teacher questionnaire, which is independent of student sampling within the school.
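The stated sample sizes fit together arithmetically, as the short check below shows. It assumes that each sampled elementary school contributes fourth graders only and each sampled middle school contributes eighth graders only, which the text implies but does not state.

```python
STUDENTS_PER_SCHOOL = 30          # per sampled school, third stage
ELEMENTARY_PER_COUNTY = 12        # Grade 4 schools, second stage
MIDDLE_PER_COUNTY = 8             # Grade 8 schools, second stage
MIN_COUNTIES_PER_PROVINCE = 6     # first stage

grade4_per_county = ELEMENTARY_PER_COUNTY * STUDENTS_PER_SCHOOL
grade8_per_county = MIDDLE_PER_COUNTY * STUDENTS_PER_SCHOOL
students_per_county = grade4_per_county + grade8_per_county
province_minimum = MIN_COUNTIES_PER_PROVINCE * students_per_county

print(grade4_per_county, grade8_per_county)  # 360 240
print(students_per_county)                   # 600
print(province_minimum)                      # 3600
```

Under this reading, the provincial floor of 3,600 students is what six counties yield, so the two minimums describe the same constraint.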

Within a 3-year assessment cycle, each year a different sample of counties is selected to cover as many counties as possible. To allow for the monitoring of progress within a subject area, it is planned that in the next assessment cycle of 2018–2020, the single-group design (Kolen & Brennan, 2004) will be adopted and concurrent calibration will be used to link the tests in the second assessment cycle to the tests of the same subject in the first assessment cycle. That is, each year in the second assessment cycle, a linking sample that is independent of the national sample of that year will be selected only for the purpose of longitudinal comparison and will not be included in that year’s assessment results. The national sample of each year will still be obtained by the three-stage design described earlier.

Item Development and Item Sampling

The test blueprint of the paper-and-pencil test for each subject was developed based on the national curriculum of each subject and content analyses of the fourth- and eighth-grade textbooks. All the items included in the tests are considered to be appropriate for Grades 4 or 8 according to the curriculum standards. Item formats include multiple-choice items and free-response items.

Due to the time constraint, it is unrealistic to administer all the items or tasks to every student for mathematics, sciences, and Chinese language. Therefore, items are organized into blocks, each with the same number of items. The partial balanced incomplete block (BIB) design is employed to construct booklets (Giesbrecht & Gumpertz, 2004). The booklets are linked through blocks that occur in more than one booklet, which is necessary for concurrent calibration of item parameters (described in the next section). For example, the assessment of mathematics assembles multiple-choice items into blocks of 5 items while each constructed-response item constitutes a block. The partial BIB design is applied to each item type independently, and as a result each booklet consists of two multiple-choice blocks and four constructed-response blocks. Table 2 presents an example of the mathematics booklet structure. Booklet designs are not used in the assessments of arts, physical education, or moral education, in which every sampled student responds to all the items or tasks.

Table 2.

Example Booklet Structure of Grade 4 Mathematics Assessment

Type of Item	Number of Items	Booklet 1	Booklet 2	Booklet 3	Booklet 4	Booklet 5	Booklet 6
Multiple choice	5	M01	M02	M03	M04	M05	M01
Multiple choice	5	M02	M03	M04	M05	M06	M06
Constructed response	1	C01	C03	C03	C05	C05	C01
Constructed response	1	C02	C02	C04	C04	C06	C06
Constructed response	1	C07	C09	C09	C11	C11	C12
Constructed response	1	C08	C08	C10	C10	C12	C07

Note. Each multiple-choice block contains five multiple-choice items. Each constructed-response block contains one constructed-response item. As a result, each booklet consists of 10 multiple-choice items and 4 constructed-response items. M = multiple-choice block; C = constructed-response block.
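The linking property of the design in Table 2 can be verified directly. The sketch below transcribes the six booklets from the table and checks that every block appears in exactly two booklets, that the booklets form one connected chain through shared blocks, and that each booklet carries 14 items. The helper names are illustrative only.

```python
from collections import Counter

# Booklet contents transcribed column by column from Table 2.
BOOKLETS = {
    1: ["M01", "M02", "C01", "C02", "C07", "C08"],
    2: ["M02", "M03", "C03", "C02", "C09", "C08"],
    3: ["M03", "M04", "C03", "C04", "C09", "C10"],
    4: ["M04", "M05", "C05", "C04", "C11", "C10"],
    5: ["M05", "M06", "C05", "C06", "C11", "C12"],
    6: ["M01", "M06", "C01", "C06", "C12", "C07"],
}


def block_counts(booklets):
    """Count how many booklets each block appears in."""
    return Counter(block for blocks in booklets.values() for block in blocks)


def items_in(blocks):
    """Multiple-choice blocks hold 5 items; constructed-response blocks 1."""
    return sum(5 if block.startswith("M") else 1 for block in blocks)


def is_linked(booklets):
    """True if every booklet is reachable from any other via shared blocks."""
    start = next(iter(booklets))
    seen, frontier = {start}, [start]
    while frontier:
        current = frontier.pop()
        for other, blocks in booklets.items():
            if other not in seen and set(blocks) & set(booklets[current]):
                seen.add(other)
                frontier.append(other)
    return len(seen) == len(booklets)


print(set(block_counts(BOOKLETS).values()))      # {2}
print(is_linked(BOOKLETS))                       # True
print({items_in(b) for b in BOOKLETS.values()})  # {14}
```

Connectivity is what allows all items to be placed on a common scale in a single concurrent calibration.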

Scaling and Scoring

Unidimensional item response theory (IRT) models are used to calibrate items and create the score scales for mathematics, sciences, and Chinese language. The Rasch model for dichotomously scored multiple-choice items and the partial credit model (PCM) for polytomously scored items were chosen according to the analyses conducted during the pilot and field tests. The probability of a correct answer given $θ$ defined by the Rasch model is given in Equation 1. The item characteristic function of an item with $K_{i} + 1$ response categories, $k = 0, 1, \dots, K_{i}$, defined by the PCM is expressed in Equation 2. The software program ConQuest 2.0 (Wu, Adams, & Wilson, 1997) is used for the item calibration and estimation of overall ability. The overall ability estimator for each student is the expected a posteriori (EAP) estimator. The estimated item parameters were used in the subscore estimation and standard setting:

$$P(X_{i} = 1; δ_{i} \mid θ) = \frac{\exp(θ - δ_{i})}{1 + \exp(θ - δ_{i})}, \tag{1}$$

$$P(X_{ik} = 1; δ_{ik} \mid θ) = \frac{\exp\left[\sum_{x=1}^{k} (θ - δ_{ix})\right]}{1 + \sum_{r=1}^{K_{i}} \exp\left[\sum_{x=1}^{r} (θ - δ_{ix})\right]}. \tag{2}$$
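Equations 1 and 2 translate directly into code. The sketch below is illustrative only and is not how ConQuest is implemented; the function names and the parameter values are hypothetical. Category 0 of the PCM has an empty sum in the numerator, which gives the 1 in the denominator of Equation 2.

```python
import math


def rasch_prob(theta, delta):
    """Equation 1: probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def pcm_probs(theta, deltas):
    """Equation 2: PCM probabilities for categories 0, 1, ..., K.

    `deltas` holds the K step parameters of one item. The exponent for
    category k is the running sum of (theta - delta_x) for x = 1..k, and
    the empty sum for category 0 is zero.
    """
    exponents = [0.0]
    for d in deltas:
        exponents.append(exponents[-1] + (theta - d))
    weights = [math.exp(e) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


theta = 0.3
print(rasch_prob(theta, delta=-0.2))
print(pcm_probs(theta, deltas=[-0.8, 0.1, 1.2]))
```

With a single step parameter, the PCM probability of category 1 equals the Rasch probability, which is why one model family can cover both item types.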

NAEQ, like most other large-scale assessments, initially uses a unidimensional IRT model to construct the scale. In response to the great need for diagnostic information, some assessment programs report subscores on content domains or cognitive domains in addition to the overall score. In the case of NAEQ, in addition to estimating the overall ability using all available items, it is desirable to report subscores to provide some diagnostic information, for example, subscores on the five cognitive domains for mathematics (computation, spatial imagination, data analysis, reasoning, and problem-solving as shown in Table 1).

To estimate subscores, the multidimensional random coefficients multinomial logit model (MRCMLM; Adams, Wilson, & Wang, 1997) was fit to the assessment data of mathematics, sciences, and Chinese language. MRCMLM is a multidimensional extension of the unidimensional random coefficients multinomial logit model (Adams & Wilson, 1996) that subsumes the Rasch model and the PCM. Assume that I items are indexed $i = 1, \dots, I$, with each item consisting of $K_{i} + 1$ response categories, $k = 0, 1, \dots, K_{i}$. If the response to item i is in the kth response category, $X_{ik} = 1$, and $X_{ik} = 0$ otherwise. Therefore, the item response of an examinee to item $i$ is expressed as a vector $X_{i} = [X_{i1}, X_{i2}, \dots, X_{iK_{i}}]'$. If the response to item i is in category 0, $X_{i}$ is a vector of 0s, which makes Category 0 a reference category.

The probability of a response in category k of item i defined by MRCMLM is

$$P(X_{ik} = 1; a_{ik}, b_{ik}, ξ \mid θ) = \frac{\exp(b_{ik}' θ + a_{ik}' ξ)}{1 + \sum_{u=1}^{K_{i}} \exp(b_{iu}' θ + a_{iu}' ξ)}.$$

The 1 in the denominator is the contribution of the reference category 0, whose exponent is zero.

The D latent traits are denoted by a vector $θ = [θ_{1}, θ_{2}, \dots, θ_{D}]'$. To describe the factor structure, the notion of scoring function $b_{ikd}$ is introduced, indicating the relationship between dimension $d$ $(d = 1, \dots, D)$ and category k of item i. The response scores across D dimensions are collected into a scoring vector $b_{ik} = [b_{ik1}, \dots, b_{ikD}]'$ of length D. The scoring vectors of I items form a design matrix $B = [b_{11}', \dots, b_{1K_{1}}', \dots, b_{I1}', \dots, b_{IK_{I}}']'$ of $\sum_{i=1}^{I} K_{i}$ rows and D columns. Because a simple structure is adopted, the vector $b_{ik}$ has only one nonzero entry. For example, for an item i loading on the first dimension under a three-dimensional model, $b_{ik} = [k, 0, 0]'$ for $k = 1, \dots, K_{i}$, which reduces to $[1, 0, 0]'$ for a dichotomous item.

The vector $ξ = [ξ_{1}, ξ_{2}, \dots, ξ_{p}]'$ contains p item parameters of item i. The number of parameters p is decided by the complexity of the model and the number of response categories. Linear combinations of the item parameters describe the empirical characteristics of the response category k of item i. These linear combinations are defined by the design vector $a_{ik}$ $(i = 1, \dots, I$ and $k = 1, \dots, K_{i})$ of length p. The design vectors can be collected to form a design matrix $A = [a_{11}', \dots, a_{1K_{1}}', \dots, a_{I1}', \dots, a_{IK_{I}}']'$ of $\sum_{i=1}^{I} K_{i}$ rows and p columns. Taking an item with four response categories as an example, the following design matrix A defines a model equivalent to a PCM when $D = 1$. In this case, the number of item parameters p equals $K_{i}$.

$$A = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix}.$$
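The claimed equivalence can be checked numerically for one four-category item. The sketch below rests on three assumptions that are not spelled out in the text: category k is scored k on the item's single dimension (the usual partial credit scoring), the parameters in $ξ$ are the negated PCM step parameters, and the reference category 0 contributes $\exp(0) = 1$ to the denominator. All names and numbers are illustrative.

```python
import math

# Design matrix A for one item with four response categories (rows k = 1, 2, 3).
A = [[1, 0, 0],
     [1, 1, 0],
     [1, 1, 1]]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mrcmlm_probs(theta, B, A, xi):
    """Category probabilities for one item, categories 0..K.

    theta: list of D latent traits; B: scoring vectors b_ik (rows k = 1..K);
    A: design vectors a_ik (rows k = 1..K); xi: item parameters.
    Category 0 is the reference category with exponent 0.
    """
    exponents = [0.0] + [dot(b, theta) + dot(a, xi) for b, a in zip(B, A)]
    weights = [math.exp(e) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def pcm_probs(theta, deltas):
    """PCM probabilities for categories 0..K from step parameters."""
    exponents = [0.0]
    for d in deltas:
        exponents.append(exponents[-1] + (theta - d))
    weights = [math.exp(e) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


deltas = [-0.8, 0.1, 1.2]      # hypothetical PCM step parameters
xi = [-d for d in deltas]      # assumed mapping: xi_x = -delta_x
B = [[1], [2], [3]]            # assumed scoring: category k scores k (D = 1)
theta = [0.4]

via_design = mrcmlm_probs(theta, B, A, xi)
direct = pcm_probs(theta[0], deltas)
max_gap = max(abs(p - q) for p, q in zip(via_design, direct))
print(max_gap < 1e-12)  # True
```

Row k of A accumulates the first k parameters, which reproduces the cumulative sum of step parameters in the PCM numerator.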

Subscore reporting is not new to many testing programs in which separate unidimensional IRT models are fit to subsets of data or a multidimensional IRT model is fit to the whole data set. In both cases, the subscores obtained are hardly comparable with the overall scaled score. NAEQ addresses this issue by fixing the item parameters in the MRCMLM to be equal to the item parameter estimates from the previous unidimensional calibration. The parameter estimates from the unidimensional calibration are applicable to the multidimensional analysis because the test has a simple structure, and within each dimension, the item response model is equivalent to a Rasch model or a PCM as demonstrated earlier. The relationship between the items for the same latent ability remains the same in a multidimensional model. This approach is also supported by model fit statistics. By fixing the item parameters of the multidimensional model, the subscores obtained are on scales comparable with the overall score scale. Multidimensional scoring is conducted in ConQuest (Wu et al., 1997) with EAP estimation.
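To make the idea of scoring with fixed item parameters concrete, here is a minimal one-dimensional EAP sketch for Rasch items. It assumes a standard normal prior and an evenly spaced grid of quadrature nodes; the population model and quadrature settings actually used in ConQuest are not described here, so this shows the general technique rather than the operational procedure.

```python
import math


def rasch_prob(theta, delta):
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def eap(responses, deltas, n_nodes=81, lo=-6.0, hi=6.0):
    """Posterior mean and standard deviation of theta.

    responses: 0/1 scores; deltas: fixed item difficulties (not re-estimated).
    The posterior is evaluated on a grid under an assumed N(0, 1) prior.
    """
    nodes = [lo + (hi - lo) * q / (n_nodes - 1) for q in range(n_nodes)]
    posterior = []
    for t in nodes:
        weight = math.exp(-0.5 * t * t)  # unnormalized N(0, 1) density
        for x, d in zip(responses, deltas):
            p = rasch_prob(t, d)
            weight *= p if x == 1 else 1.0 - p
        posterior.append(weight)
    norm = sum(posterior)
    mean = sum(t * w for t, w in zip(nodes, posterior)) / norm
    var = sum((t - mean) ** 2 * w for t, w in zip(nodes, posterior)) / norm
    return mean, math.sqrt(var)


deltas = [-1.0, 0.0, 1.0]  # hypothetical fixed difficulties
print(eap([1, 1, 0], deltas))
```

Because the difficulties are held fixed, estimates computed from different subsets of items stay on the scale defined by the original calibration.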

Standard Setting for Achievement Tests

Standard setting is currently conducted for mathematics, sciences, and Chinese language. Performance standards for each subject were specified according to the national curriculum. Four performance levels classify students into Levels I, II, III, and IV, with Level I as the lowest and Level IV as the highest.
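Once three cut scores are fixed, assigning levels and computing the percentage of students at each level is mechanical. The sketch below uses invented cut scores and scaled scores purely for illustration, and it assumes that a score exactly at a cut falls in the higher level.

```python
from bisect import bisect_right
from collections import Counter

LEVELS = ["I", "II", "III", "IV"]


def classify(score, cuts):
    """Map a scaled score to a level given three ascending cut scores."""
    return LEVELS[bisect_right(cuts, score)]


def percent_at_levels(scores, cuts):
    """Percentage of students at each performance level."""
    counts = Counter(classify(s, cuts) for s in scores)
    return {level: 100.0 * counts[level] / len(scores) for level in LEVELS}


cuts = [400, 500, 600]                              # hypothetical cut scores
scores = [352, 418, 467, 505, 530, 588, 612, 640]   # hypothetical students
print(percent_at_levels(scores, cuts))
# {'I': 12.5, 'II': 25.0, 'III': 37.5, 'IV': 25.0}
```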

A hybrid standard setting method combining the Angoff (1971) and the bookmark (Mitzel, Lewis, Patz, & Green, 2001) methods is used to set the cut scores on the overall ability scale. We illustrate the process of standard setting with fourth-grade mathematics as an example. A panel of 30 people is convened consisting of education and measurement experts, fourth-grade mathematics teachers, administrators, and parents, from western, central, and eastern China. The panel is divided into five to six small groups when group discussion is needed.

The whole process is computerized using a self-developed standard setting system. The standard setting goes through four rounds and lasts for 2 days. There are large-group and small-group discussion sessions before each round of standard setting. Each panelist is given two booklets, the Angoff booklet and the Bookmark booklet, which contain the same subset of 35 items sampled from all the items in the paper-and-pencil test of mathematics. Items are ordered by their item difficulty parameters in the Bookmark booklet and organized by item format in the Angoff booklet. The panelists complete the Angoff task using the Angoff booklet and the Bookmark task using the Bookmark booklet.

The first activity is the introduction of the goal and process of standard setting. Then, the fourth-grade mathematics curriculum standards and textbooks are reviewed. The panel is broken into small groups to discuss the performance standards. After all the panelists reach a consensus on performance standards as a large group, the first round of standard setting begins. Each panelist completes both the Angoff task and the Bookmark task: They independently judge, for each item in the Angoff booklet, the probability that students at each level respond correctly, and then set the bookmarks in the Bookmark booklet. The panelists can review their own cut scores as well as the averaged cut scores across panelists from the system. The subsequent rounds repeat the tasks in the first round on the basis of the feedback and group discussion. The discussion session between two rounds focuses on the items that show great variation among panelists, and statistics are provided as reference, including the average score and standard deviation of each item and the percentage of students at each level according to the standard setting results for selected cities or counties. After the fourth round of standard setting is completed, the final cut points on the total scaled score are determined by integrating the results from the Angoff and the Bookmark methods.
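The two tasks lead to cut scores in different ways, which the following sketch illustrates for dichotomous Rasch items. It makes several assumptions: an Angoff cut is the panel mean of each panelist's summed probabilities, mapped to the ability scale by inverting the test characteristic curve; a Bookmark cut is the ability at which the bookmarked item is answered correctly with a response probability of 0.67, a common convention that is not confirmed for NAEQ. The rule used to integrate the two results is not specified in the text and is not modeled. All ratings and difficulties are hypothetical.

```python
import math


def rasch_prob(theta, delta):
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def angoff_cut(ratings, deltas):
    """Return (raw-score cut, theta cut) from Angoff ratings.

    ratings: one list per panelist of judged success probabilities per item.
    The raw cut is the mean over panelists of the summed probabilities; the
    theta cut is where the test characteristic curve equals that raw cut,
    found by bisection.
    """
    raw = sum(sum(r) for r in ratings) / len(ratings)
    lo, hi = -8.0, 8.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        expected = sum(rasch_prob(mid, d) for d in deltas)
        if expected < raw:
            lo = mid
        else:
            hi = mid
    return raw, (lo + hi) / 2.0


def bookmark_cut(bookmarked_delta, rp=0.67):
    """Theta at which the bookmarked item is answered correctly with
    probability rp (assumed response-probability criterion)."""
    return bookmarked_delta + math.log(rp / (1.0 - rp))


deltas = [-1.2, -0.4, 0.3, 0.9, 1.6]          # hypothetical item difficulties
ratings = [
    [0.85, 0.70, 0.55, 0.40, 0.25],           # panelist 1
    [0.80, 0.75, 0.50, 0.35, 0.20],           # panelist 2
    [0.90, 0.65, 0.60, 0.45, 0.30],           # panelist 3
]
print(angoff_cut(ratings, deltas))
print(bookmark_cut(bookmarked_delta=0.3))
```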

Technologies in the Performance Assessment of Arts

The music performance test in the assessment of arts cannot be implemented with traditional paper-and-pencil tests. The music performance test consists of two performance tasks. The first task asks the student to sing the required song. In the second task, students sing a song of their own choice from a list of songs. There are 22 songs in the lists for primary school students and middle school students. Technologies play an important role in test administration, response recording, and rating.

Special software is installed on a computer with a camera before the administration of the music performance test. The student performances are video-recorded by the software. Internet access is not required. Student performances are evaluated in five dimensions: (1) lyrics, (2) rhythm, (3) intonation, (4) artistic expression, and (5) the integrity and fluency of the overall performance. The evaluation involves the verbal and musical aspects of the performances. In 2016, the music performance test was administered to 46,858 students from 2,900 primary schools and 3,600 middle schools. As it is not realistic to grade all the performances manually within a short time, artificial intelligence techniques were applied to analyze student performance data, and a rating algorithm was trained to grade the performances on a four-point scale for each dimension. For each song, expert ratings of 100 students were used to train the rating algorithm. For the purpose of cross-validation, a sample of 50 students for each song was used to compare the expert ratings with the computer grading in terms of correlation and mean absolute error in each dimension. The correlation for each dimension was between .84 and .91, and the mean absolute error was between .12 and .27.
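The two agreement statistics reported above can be computed as in the following Python sketch. The ratings are invented for illustration and cover a single dimension with eight students rather than the 50 used per song; the function names are hypothetical and this is not the NAEQ rating software.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long lists of ratings."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def mean_absolute_error(x, y):
    """Average absolute difference between paired ratings."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

# Hypothetical four-point ratings on one dimension (e.g., rhythm).
expert = [4, 3, 2, 4, 1, 3, 2, 4]
machine = [4, 3, 2, 3, 1, 3, 3, 4]

r = pearson(expert, machine)
mae = mean_absolute_error(expert, machine)
print(round(r, 2), round(mae, 2))
```

A high correlation with a small mean absolute error indicates that the algorithm both ranks students similarly to the experts and assigns scores close to theirs on the four-point scale.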

Discussion

Social Impact

By highlighting education quality, NAEQ aims to support the goals of the MOE, local governments, and other stakeholders who seek to ensure that all Chinese students are receiving quality education that facilitates their development in a healthy environment. NAEQ is also disseminating its education perspective to the general public, with the aim of reversing the unhealthy trend of test-oriented education. Specifically, by incorporating the subjects of physical education, moral education, and arts, NAEQ delivers an important message to the general public that the holistic perspective of education would benefit students’ lifelong development.

In the past few decades, there have been concerns that the National Higher Education Entrance Examination, commonly known as “gao kao,” has had excessive influence on school education. One negative outcome of this influence is test-oriented education, even in primary and middle schools. Subjects that are not covered in gao kao, including arts, physical education, and moral education, have not received adequate attention as a result of test-oriented education (Dello-Iacovo, 2009). The uneven distribution of emphasis over different subjects in primary and middle schools is holding back the efforts to improve education quality characterized by students’ holistic development. NAEQ is hoping to reverse this situation.

The Fifth Plenary Session of the 18th CPC Central Committee (2015) called on the education system to enhance public supervision, that is, to encourage citizens to participate in decision-making in public education, to discuss education affairs, and to supervise local education departments’ activities. The national report of NAEQ, which is available to the whole of society, is expected to encourage supervision by the general public. More detailed reports provided by NAEQ would help the national and local governments to monitor education quality, student development, and influential factors. The rich information in the reports enables policymakers and administrators to identify potential problems.

Current Issues and Future Directions

NAEQ is now in the last year of its first operational assessment cycle after 8 years of development and validation. Debates continue in several areas, including the setting of the performance levels, the use of background variables, the statistical modeling of student responses, and the accountability purpose.

The evaluation of quality hinges on standard setting in each subject area. It remains controversial whether the same set of performance standards should be used in regions with different demographics. As a vast country that is under fast development, China has seen great regional differences between eastern and western China. Currently, only one set of standards is applied to the whole country, which leads to ceiling effects in some provinces and floor effects in others. It then becomes more difficult to identify potential problems in the provinces with ceiling or floor effects. It is worth exploring how to customize the performance-level system for each province so that it may be used effectively to improve local education quality.

Currently, a mixed approach of unidimensional and multidimensional statistical modeling is used to report the overall ability score and subscores for mathematics, sciences, and Chinese language. The dimensionality of the tests, as in most large-scale assessments, remains a problem. Educational measurement borrows from the latent variable modeling approach of psychological testing that dates back to the study of intelligence (Baker, Chung, & Cai, 2016). However, most of the educational tests have cognitive domains as well as content domains. It is worth exploring whether the current modeling approach reflects the characteristics of educational tests that are apparently distinct from psychological tests. In addition, the student sample is obtained with a three-stage sampling design, and sample weights are involved. The next step could be to explore IRT modeling that makes use of the sample weights and also considers the hierarchical structure of the data (Zheng & Yang, 2016).
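To make the role of sample weights concrete, the short Python sketch below derives a base weight for each student as the inverse of the product of the selection probabilities at the three stages and compares a weighted mean with an unweighted one. The stage definitions (county, school, student), the probabilities, and the scores are assumptions for illustration only; the sketch does not implement weighted IRT estimation and ignores adjustments such as nonresponse.

```python
def base_weight(p_stage1, p_stage2, p_stage3):
    """Inverse of the overall selection probability in a three-stage design."""
    return 1.0 / (p_stage1 * p_stage2 * p_stage3)

def weighted_mean(values, weights):
    """Weighted average of values using the given sample weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical records: (score, P(county), P(school | county), P(student | school)).
students = [
    (520.0, 0.10, 0.20, 0.50),
    (480.0, 0.10, 0.20, 0.25),
    (450.0, 0.05, 0.10, 0.50),
]

scores = [s[0] for s in students]
weights = [base_weight(*s[1:]) for s in students]

unweighted = sum(scores) / len(scores)
weighted = weighted_mean(scores, weights)
print(round(unweighted, 2), round(weighted, 2))
```

Students selected with lower probability represent more of the population and therefore pull the weighted estimate toward their scores, which is why ignoring the weights can bias population-level results.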

It remains a challenge to make use of the background variables in estimating student outcomes. The relationships between the background variables and student outcomes are part of the contextualized evaluation of education quality. Background information can also be used to better estimate student outcomes. However, the field test results suggest that some background variables commonly used in other large-scale assessments show different relationships with student outcomes in China. The relationships between some variables and the student outcomes were even reversed. In order to make full use of the background variables, NAEQ calls for more research on the relationships between the background variables and student outcomes in China. In the future, we anticipate more opportunities for collaboration with the research community.

The potential accountability purposes of NAEQ have greatly complicated the assessment system. Three issues need special attention. First, more studies should be done on the procedures and consequences of using standard setting results for accountability purposes. Second, NAEQ needs to consider that the responses of students and teachers could be affected once they realize that their assessment results could lead to policy changes that would have an impact on themselves. Last, the pressure of accountability could turn NAEQ into a high-stakes assessment, and as a result the mission of NAEQ would be compromised.

Summary

NAEQ aims to assess education quality and its influencing factors. The fundamental design of NAEQ reflects the current educational priorities. It is the product of the Chinese philosophy of holistic development combined with modern quantitative methods and technologies. After 8 years’ development and validation, NAEQ has recently become an official national assessment endorsed by the MOE and has been included in the national education inspection system. With promises and challenges, NAEQ has a long way to go.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Major Project for National Assessment of Education Quality, which is funded by Ministry of Education of China.

ORCID iD

Jiahui Zhang

References

Adams, R. J., & Wilson, M. R. (1996). Formulating the Rasch model as a mixed coefficients multinomial logit. In Engelhard & Wilson (Eds.), Objective measurement: Theory into practice (Vol. III, pp. 143–166). Norwood, NJ: Ablex.

Adams, R. J., Wilson, M. R., & Wang, W. C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1–23.

Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508–600). Washington, DC: American Council on Education.

Baker, E. L., Chung, G. K. W. K., & Cai (2016). Assessment gaze, refraction, and blur: The course of achievement testing in the past 100 years. Review of Research in Education, 40, 94–142.

Beaton, A. E., & Zwick (1992). Overview of the national assessment of educational progress. Journal of Educational and Behavioral Statistics, 17, 93–94.

Dello-Iacovo (2009). Curriculum reform and “quality education” in China: An overview. International Journal of Educational Development, 29, 241–249.

Giesbrecht, F. G., & Gumpertz, M. L. (2004). Planning, construction, and statistical analysis of comparative experiments. Hoboken, NJ: John Wiley.

Johnson, E. G. (1992). The design of the national assessment of educational progress. Journal of Educational Measurement, 29, 95–110.

Korea Institute of Curriculum and Evaluation. (2008). Internet homepage on the National Assessment of Educational Achievement. Retrieved October 3, 2008, from http://www.kice.re.kr/kice/eng/info/info_3.jsp

Kolen, M. J., & Brennan, R. L. (2004). Test equating, linking, and scaling: Methods and practices (2nd ed.). New York, NY: Springer-Verlag.

Ministerial Council on Education, Employment, Training, and Youth Affairs. (2006). National Assessment Program—Civics and citizenship, year 6 and year 10 report 2004. Carlton, Australia: Author.

Ministry of Education. (2015). The scheme of national compulsory education quality assessment system. Retrieved from https://www-moe-gov-cn.web.bisu.edu.cn/s78/A11/s8393/s8397/

Ministry of Education. (2016). General statistics of national education development in 2015. Retrieved from https://www-moe-gov-cn.web.bisu.edu.cn/srcsite/A03/s180/moe_633/201607/t20160706_270976.html

Ministry of Education, Culture, Sports, Science & Technology, & National Institute for Educational Policy Research. (2015). Heisei 27 nendozenkokugakuryokugakushuujoukyouchousahoukokusho (2015 Academic Year Report from the National Academic Ability and Situation Assessment). Retrieved from http://www.nier.go.jp/15chousakekkahoukoku/

Mitzel, H. C., Lewis, D. M., Patz, R. J., & Green, D. R. (2001). The bookmark procedure: Psychological perspectives. In Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 249–282). Mahwah, NJ: Lawrence Erlbaum.

Organization for Economic Co-operation and Development. (2016a). Education in China: A snapshot. Paris: Author. Retrieved December 11, 2016, from http://www.oecd.org/china/Education-in-China-a-snapshot.pdf

Organization for Economic Co-operation and Development. (2016b). PISA 2015 results (volume I). Excellence and equity in education. Paris, France: Author. Retrieved December 11, 2016, from https://dx-doi-org.web.bisu.edu.cn/10.1787/9789264266490-en

Stufflebeam, D. L. (1983). The CIPP model for program evaluation. In Evaluation Models. Evaluation in Education and Human Services (Vol. 6, pp. 117–141). Dordrecht: Springer.

The Fifth Plenary Session of the 18th CPC Central Committee. (2015). Communique of the Fifth Plenary Session of the 18th CPC Central Committee. Retrieved from http://news.xinhuanet.com/politics/2015-10/29/c_1116983078.htm

Wang, Berry, & Swearer, S. M. (2013). The critical role of school climate in effective bullying prevention. Theory into Practice, 52, 296–302.

Wu, M. L., Adams, R. J., & Wilson, M. R. (1997). ConQuest: Multi-aspect test software [computer program]. Camberwell, England: Australian Council for Educational Research.

Zheng, & Yang, J. S. (2016). Using sample weights in item response data analysis under complex sample designs. In L. A. van der Ark, D. M. Bolt, W. C. Wang, J. A. Douglas, & Wiberg (Eds.), Quantitative psychology research (pp. 123–137). New York, NY: Springer.