Online Parameter Estimation for Student Evaluation of Teaching

Abstract

Student evaluation of teaching (SET) assesses students’ experiences in a class to evaluate teachers’ performance in class. SET essentially comprises three facets: teaching proficiency, student rating harshness, and item properties. The computerized adaptive testing form of SET with an established item pool has been used in educational environments. However, conventional scoring methods ignore the harshness of students toward teachers and, therefore, are unable to provide a valid assessment. In addition, simultaneously estimating teachers’ teaching proficiency and students’ harshness remains an unaddressed issue in the context of online SET. In the current study, we develop and compare three novel methods—marginal, iterative once, and hybrid approaches—to improve the precision of parameter estimations. A simulation study is conducted to demonstrate that the hybrid method is a promising technique that can substantially outperform traditional methods.

Keywords

item response theory, parameter estimation, student evaluation of teaching

Student evaluation of teaching (SET) is a teaching assessment format that is used for measuring teachers’ teaching proficiency in educational environments. Students participate in a SET assessment to evaluate their teacher’s teaching performance; SET is usually conducted toward the end of a semester. Estimating teacher performance is important because teacher performance evaluation is critical for schools to understand the quality of teaching. Good teaching can make students’ learning more productive (Kember & Wong, 2000). Therefore, it is important to monitor and estimate the quality of teaching. Teachers can be either promoted or penalized based on their teaching performance, especially in private schools in Asia. Thus, obtaining reliable estimates is critical for teachers.

Conventionally, SET is administered in a paper-and-pencil format. Currently, a growing number of SETs have moved to online formats. For example, online SET has been used at the University of Oslo, where students are asked to provide their feedback on and evaluations of teachers regarding the teaching effectiveness based on the student’s learning experience in the classroom. Using online SET has numerous advantages, such as the ease of recording item responses, the reduction of recording errors, and the avoidance of logistical problems. Additionally, students do not have to fill out the SET questionnaire in front of their teacher, which avoids situations where the students might not truthfully fill out the SET. Moreover, online SET is easy to administer, and the items can be customized for each teacher. In addition, Gamliel and Davidovitz (2005) have further shown that online SET has an internal reliability of .89, which is higher than that of the paper-and-pencil format. Despite these striking advantages, online SET is prone to low response rates (Dommeyer et al., 2004), possibly due to students’ low levels of motivation and interest. For example, in 4-year medical student curriculum evaluations at Kansas Medical Center, the response rate was only 24% because students had to answer 62 items (Anderson et al., 2005; Paolo et al., 2000).

To address the low response rate concern of online SET, computerized adaptive testing (CAT) could be a viable alternative. CAT not only reduces the questionnaire length but also precisely estimates the student’s latent traits. The advantages of CAT have been well documented in the literature (for more information, see Wainer et al., 2000). However, it remains unclear whether CAT can be directly applied to online SET. Because a teacher’s teaching proficiency is evaluated by students, the students’ rating harshness could affect the scores. A critical student may give lower scores than a more lenient student when rating the same teacher, even though that teacher would receive the same score from both if all students were equally harsh. Three facets are therefore conceptualized: items, students’ harshness toward the teacher, and teachers’ teaching proficiency. Here, there is an association between SET and conventional multifaceted measurements. For the conventional multifaceted measurements, raters are asked to evaluate ratees’ performances on, for example, essay writing in paper-and-pencil format. Similarly, online SET requires students (raters) to evaluate teachers’ teaching performance (ratees). Thus, the measurement model for the conventional multifaceted measurement, namely, the multifaceted rating scale model (MFRSM; Linacre, 1993), can be the fundamental psychometric model to develop CAT algorithms for the online SET scenario.

Before using online SET (or SET-CAT), the item parameters must be estimated from a SET dataset. These item estimates constitute an item bank for subsequent use in CAT. Usually, the item estimates are assumed to be fixed and known in CAT. However, in the SET scenario, the student’s trait (γ: student harshness) and teacher’s trait (θ: teaching proficiency) are two unknown parameters that must be estimated during CAT. In contrast, a conventional CAT only addresses one unknown parameter (i.e., the student’s trait). As a result, estimating γ and θ simultaneously is a challenging task for online SET. This difficulty seriously impedes the empirical application of SET-CAT. If γ is ignored while estimating θ, the estimation accuracy of θ is expected to suffer. In the current study, we develop new algorithms for estimating γ and θ simultaneously in SET-CAT. To show the detriment of ignoring γ and the benefit of estimating γ and θ together, a Monte Carlo simulation study is conducted.

The rest of the current paper is organized as follows. First, we introduce the MFRSM (Linacre, 1993) for the SET assessment. Second, three online estimation methods for γ and θ in the SET-CAT context are proposed. Third, a simulation study is conducted to compare the performance of the three new methods with a naïve method ignoring γ. Finally, concluding remarks are provided to summarize our findings and outline further directions for SET-CAT.

Multifaceted Rating Scale Model

The MFRSM is often used to deal with the rater effect when raters score examinees’ responses. Three facets are involved in the MFRSM—item, examinee, and rater. In the context of SET, the teacher’s teaching proficiency is evaluated by the student. Hence, the three facets are item, teacher, and student. Thus, the student is considered the “rater,” whereas the teacher is regarded as the “ratee.”

Ignoring the rater effect has been found to be detrimental to parameter estimates, leading to unreliable scores (Hoyt, 2000; Wolfe, 2004); some raters have confined the range of ratings (Holzbach, 1978), and the halo effect of the rater can inflate the correlation between latent traits (Hoyt, 2000). The rater effect could be attributed to harshness versus leniency, centrality versus extremity, and accuracy versus inaccuracy (Acuña, 2017; Wolfe, 2004). Although the types of rater effects can be numerous, a more crucial concern is how to eliminate rater effects and attain reliable scores. In the current study, a rater effect of harshness versus leniency that would bias the parameter estimates (e.g., Boone et al., 2016; Wesolowski et al., 2016; Wind & Jones, 2019) is considered. Accounting for the various rater effects, such as halo effects, centrality/extremity, or inaccuracy, is possible; however, such accounting introduces model identification problems to distinguish them in the SET-CAT with sparse item responses. Therefore, we focus on eliminating the harshness effect, which can be achieved by using the MFRSM.

The MFRSM is formulated in terms of the log-odds of adjacent score categories as follows

\log (\Pr_{i j r k} / \Pr_{i j r (k - 1)}) = θ_{r} - δ_{j} - γ_{i} - τ_{k},

(1)

where Pr_ijrk denotes the probability that rater i (student) gives score k to ratee r (teacher) on item j, γ_i denotes the rater harshness for rater i, θ_r is the teaching proficiency for ratee r, δ_j is the item location for item j, and τ_k is the item threshold in category k, where τ₀ ≡ 0. Note that other, more complex models can also be used (e.g., adding a slope parameter on the latent trait). The purpose of the present study, however, is to demonstrate how to deal with γ and θ simultaneously in the CAT context and to consider the detriment of ignoring γ. Thus, using the MFRSM is sufficient for illustration purposes.
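Equation (1) specifies only the ratio of adjacent category probabilities. Chaining these ratios upward from category 0 and normalizing gives the category probabilities in closed form. In the expression below, K denotes the highest score category (a symbol introduced here for illustration), and an empty sum is taken to be zero.

```latex
\Pr_{i j r k} = \frac{\exp \left[ k (θ_{r} - δ_{j} - γ_{i}) - \sum_{s = 1}^{k} τ_{s} \right]}{\sum_{m = 0}^{K} \exp \left[ m (θ_{r} - δ_{j} - γ_{i}) - \sum_{s = 1}^{m} τ_{s} \right]}, \quad k = 0, 1, \ldots, K.
```

Dividing the expression for category k by that for category k − 1 and taking logarithms recovers θ_r − δ_j − γ_i − τ_k, as in Equation (1).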

The marginal maximum likelihood with the expectation-maximization method (Dempster et al., 1977; Muraki, 1992) is one method often used to calibrate parameters in item response theory models. γ is considered a random-effect variable and is marginalized in the joint likelihood function of all the parameters, given observed responses. θ is regarded as a fixed effect, and no prior distribution is included. The prior distribution of γ is usually assumed to be a normal distribution with a zero mean (fixed for identification) and an estimable variance parameter, σ². Additionally, a constraint must be imposed on θ or δ, for example, ∑θ_r = 0. In the item bank building stage, the SET is administered to a group of students to rate the teachers’ teaching proficiency. δ, τ, θ, and σ are estimated first, followed by γ, which is estimated for all students, given the fixed estimates of δ, τ, θ, and σ. The estimates of δ and τ are then kept in an item bank for subsequent use in SET-CAT, but the estimates of γ, θ, and σ are not used. SET-CAT needs to estimate θ (teachers’ teaching proficiencies) and select the next items from the item bank that are the most adaptive (i.e., provide the most information) for the teachers.

Latent Trait Estimation in SET-CAT

The difference between a conventional CAT and SET-CAT is that the former only has a single trait (e.g., θ) to estimate, while the latter has γ (student) and θ (teacher). The target of SET-CAT is to estimate the teacher’s θ precisely while accounting for the student’s γ. The two unknown parameters must be estimated during SET-CAT. However, an identification issue arises because only the difference between θ and γ enters Equation (1): θ − γ = (θ + c) − (γ + c) = θ^* − γ^*, where c is an arbitrary constant. Note that in some respects, SET-CAT is similar to conventional CAT with two-dimensional IRT models. The specific difference is that SET-CAT is concerned with rater data (teachers evaluated by students) instead of rating data (students responding to items). Additionally, the model used in SET-CAT is MFRSM rather than other regular models, such as multidimensional generalized partial credit models (Reckase, 2009). In the following, we propose a marginal method (MM), an iterative once method (IOM), and a hybrid method for latent trait estimation.

Marginal Method

Consider the log-likelihood function with respect to γ_i and θ_r for student i and teacher r

l (γ_{i}, θ_{r}; y, δ, τ) = \sum_{j = 1}^{J} \log \Pr (y_{j k} | γ_{i}, θ_{r}, δ_{j}, τ_{k})

(2)

where J is the number of administered items in CAT. Estimating γ_i and θ_r simultaneously could be difficult because of the identification issue previously mentioned. Considering γ as a nuisance parameter and θ as the target parameter, we propose utilizing the empirical prior information of γ to marginalize γ out of the log-likelihood function. Specifically, the variance estimate ${\hat{σ}}^{2}$ of the prior distribution of γ is employed in Equation (2). The log-likelihood function is thus given by the following

l (θ_{r}; y, δ, τ, σ^{2}) = \sum_{j = 1}^{J} \log \int \Pr (y_{j k} | γ_{i}, θ_{r}, δ_{j}, τ_{k}) \Pr (γ | σ^{2}) d γ .

(3)

Notably, the log-likelihood function in Equation (3) depends only on θ_r (teacher), δ, τ, σ², and y; the individual γ_i no longer appears. As in conventional CAT with a unidimensional latent variable, the item parameters δ and τ are known, and y is the observed responses. Consequently, the expectation a posteriori (EAP) can be used to obtain ${\hat{θ}}_{r}$. The MM updates ${\hat{θ}}_{r}$ after each item based on the responses from the current student rating teacher r and from the other students who have already rated teacher r. In other words, the MM updates ${\hat{θ}}_{r}$ after each item by using all responses related to teacher r regardless of which students gave them. Equation (3) can also be used for selecting the next item. We select the next item by the Fisher information $I (\hat{θ}) = - E [\frac{\partial^{2}}{\partial θ^{2}} \log L (y | \hat{θ})]$ (Birnbaum, 1968), where $\log L (y | \hat{θ})$ is the log-likelihood function (i.e., Equation (3)). The integral can be approximated by numerical quadrature. In programming software such as R or MATLAB, vectorized operations often make this computation efficient.
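The marginalization in Equation (3) and the EAP update can be illustrated with a minimal Python sketch. It is not the MATLAB implementation used in the study. The equally spaced quadrature grid, the N(0, 1) prior on θ used for the EAP, the function names, and the example responses are all assumptions made for illustration.

```python
import math

def mfrsm_probs(theta, gamma, delta, taus):
    """Category probabilities for scores 0..K implied by Equation (1)."""
    log_w = [0.0]  # category 0 is the reference category
    for tau in taus:
        log_w.append(log_w[-1] + theta - delta - gamma - tau)
    m = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(v - m) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def normal_grid(sd, n=41, width=4.0):
    """Equally spaced nodes with normalised N(0, sd^2) weights (simple quadrature)."""
    nodes = [sd * width * (2.0 * q / (n - 1) - 1.0) for q in range(n)]
    dens = [math.exp(-0.5 * (x / sd) ** 2) for x in nodes]
    total = sum(dens)
    return nodes, [d / total for d in dens]

def marginal_loglik(theta, responses, sigma2, taus):
    """Equation (3): responses is a list of (score, delta_j) pairs for one teacher."""
    nodes, weights = normal_grid(math.sqrt(sigma2))
    ll = 0.0
    for y, delta in responses:
        # Integrate gamma out of each response probability.
        p = sum(w * mfrsm_probs(theta, g, delta, taus)[y]
                for g, w in zip(nodes, weights))
        ll += math.log(p)
    return ll

def eap_theta(responses, sigma2, taus, prior_sd=1.0):
    """Posterior mean and SD of theta; the N(0, prior_sd^2) prior is an assumption."""
    nodes, weights = normal_grid(prior_sd, n=61)
    log_post = [math.log(w) + marginal_loglik(t, responses, sigma2, taus)
                for t, w in zip(nodes, weights)]
    m = max(log_post)  # work on the log scale to avoid underflow
    post = [math.exp(v - m) for v in log_post]
    z = sum(post)
    mean = sum(t * p for t, p in zip(nodes, post)) / z
    var = sum((t - mean) ** 2 * p for t, p in zip(nodes, post)) / z
    return mean, math.sqrt(var)

TAUS = [-2.0, -1.07, 3.07]               # thresholds from the simulation design
high = [(3, 0.0), (3, 0.5), (2, -0.5)]   # hypothetical high ratings
low = [(0, 0.0), (0, 0.5), (1, -0.5)]    # hypothetical low ratings
theta_high, se_high = eap_theta(high, sigma2=1.0, taus=TAUS)
theta_low, se_low = eap_theta(low, sigma2=1.0, taus=TAUS)
```

In this sketch the posterior standard deviation returned by `eap_theta` plays the role of the standard error of the provisional estimate. A Gauss-Hermite rule could replace the equally spaced grid.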

Iterative Once Method (IOM)

The iterative once method (IOM) estimates γ and θ. Specifically, for student i giving rating w to teacher r, the first step fixes the teaching proficiency θ_r to the provisional estimated value ${\hat{θ}}_{r}$ and updates only student i’s harshness γ_i after each item response. The log-likelihood function is given by the following

l (γ_{i}; {\hat{θ}}_{r}, y, δ, τ) = \sum_{w = 1}^{W} \log \Pr (y_{w k} | γ_{i}, θ_{r} = {\hat{θ}}_{r}, δ_{w}, τ_{k})

(4)

where item response w = 1, …, W are the rating responses for teacher r and ${\hat{θ}}_{r}$ is the provisional estimate for teacher r. The initial value of ${\hat{θ}}_{r}$, for example, could be set to zero when there are no ratings for teacher r from student i. CAT selects the informative items according to the likelihood function of Equation (4) and continues until student i finishes the ratings for teacher r.

The first step continues until at least one student has rated teacher r, and the next step estimates teaching proficiency θ_r given the estimates of γ₁, …, γ_l, where l denotes the number of students thus far. With the known values of item parameters δ and τ, the log-likelihood function in this step is as follows

l ({\hat{θ}}_{r}; \hat{γ}, y, δ, τ) = \sum_{h = 1}^{H} \log \Pr (y_{h k} | θ_{r}, γ = \hat{γ}, δ_{h}, τ_{k})

(5)

where the item responses for h = 1, …, H are the students’ ratings for teacher r. The second step updates θ_r. The same method applies to other teachers to be rated by students. Hence, a teacher’s teaching proficiency will be updated whenever a new student has finished rating that teacher. To summarize, the IOM updates ${\hat{γ}}_{i}$ after each item by using all responses from student i, regardless of which teachers are rated, and updates ${\hat{θ}}_{r}$ after each student has finished the evaluation for teacher r by using all the responses related to teacher r, regardless of the students.
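The two IOM steps can be sketched in Python as follows. This is an illustration under stated assumptions and not the authors' implementation: the grid-search MAP estimator, the N(0, 1) priors, the function names, and the example ratings are hypothetical choices.

```python
import math

def mfrsm_logp(y, theta, gamma, delta, taus):
    """Log-probability of score y under Equation (1)."""
    log_w = [0.0]
    for tau in taus:
        log_w.append(log_w[-1] + theta - delta - gamma - tau)
    m = max(log_w)
    log_norm = m + math.log(sum(math.exp(v - m) for v in log_w))
    return log_w[y] - log_norm

def grid_map(loglik, prior_sd=1.0, lo=-4.0, hi=4.0, n=161):
    """Grid-search MAP estimate under a N(0, prior_sd^2) prior (an assumption)."""
    best_x, best_v = lo, -math.inf
    for q in range(n):
        x = lo + (hi - lo) * q / (n - 1)
        v = loglik(x) - 0.5 * (x / prior_sd) ** 2
        if v > best_v:
            best_x, best_v = x, v
    return best_x

def update_gamma(theta_hat, items, taus):
    """Step 1, Equation (4): items = [(score, delta_w)] from student i for teacher r."""
    return grid_map(
        lambda g: sum(mfrsm_logp(y, theta_hat, g, d, taus) for y, d in items))

def update_theta(ratings, taus):
    """Step 2, Equation (5): ratings = [(score, delta_h, gamma_hat)] for teacher r."""
    return grid_map(
        lambda t: sum(mfrsm_logp(y, t, g, d, taus) for y, d, g in ratings))

TAUS = [-2.0, -1.07, 3.07]
theta_hat = 0.0                              # initial provisional value
items_i = [(0, 0.0), (1, 0.3), (0, -0.2)]    # hypothetical low ratings from student i
gamma_hat = update_gamma(theta_hat, items_i, TAUS)   # updated after the items
ratings_r = [(y, d, gamma_hat) for y, d in items_i]
theta_hat = update_theta(ratings_r, TAUS)    # updated once student i has finished
```

With θ̂_r held at zero, the low ratings are attributed to the student, so `gamma_hat` comes out positive (a harsh rater). The second step then re-estimates θ_r with that harshness held fixed.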

Hybrid Method

A hybrid method is proposed that takes advantage of the MM and the IOM. The MM does not require estimating γ, which substantially simplifies the estimation problem. The IOM simultaneously estimates θ and γ, but it requires an estimated value for θ and could be unstable at the very early stage of SET (when only a few students are taking the SET assessment). An effective approach is to utilize the MM at an early stage and then switch to the IOM once the estimation is stable. We propose that the hybrid method continues with the MM when the standard error of ${\hat{θ}}_{r}$ is greater than 0.3; otherwise, it executes the IOM. The 0.3 criterion can be relaxed (e.g., to 0.5) or tightened (e.g., to 0.1); we use 0.3 for illustration purposes.

The schema of the hybrid method is summarized as follows:

Step 1. If the standard error of the current ${\hat{θ}}_{r}$ is greater than 0.3, go to Step 2-MM; otherwise, go to Step 2-IOM.

Step 2-MM. Execute the MM (Equation (3)) to obtain ${\hat{θ}}_{r}$ and then go to Step 3.

Step 2-IOM. Execute the IOM to obtain ${\hat{γ}}_{i}$ and ${\hat{θ}}_{r}$ and then go to Step 3.

Step 3. If student i is required to rate more teachers, return to Step 1 for the teachers.

Step 4. The above steps are carried out for each student.

By using the MM while the standard error of ${\hat{θ}}_{r}$ is larger than 0.3, the hybrid method could improve the precision of ${\hat{θ}}_{r}$ in the early stages of SET-CAT, because the MM substitutes the prior information of γ for the unknown γ. Step 2-IOM is executed during SET-CAT only once the standard error of ${\hat{θ}}_{r}$ no longer exceeds 0.3.
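The switching rule in Steps 1 and 2 amounts to a single comparison, sketched below in Python. The callables `run_mm` and `run_iom` are hypothetical placeholders for whatever MM and IOM routines are available; only the 0.3 threshold comes from the text.

```python
def hybrid_step(se_theta, run_mm, run_iom, threshold=0.3):
    """Dispatch one update following Steps 1 and 2 of the hybrid schema.

    run_mm and run_iom are hypothetical caller-supplied callables that
    perform the MM update and the IOM update, respectively.
    """
    if se_theta > threshold:
        return "MM", run_mm()    # Step 2-MM: estimate is still imprecise
    return "IOM", run_iom()      # Step 2-IOM: estimate is stable enough

# Usage with stub routines that only return a label.
early = hybrid_step(0.45, lambda: "theta only", lambda: "gamma and theta")
late = hybrid_step(0.20, lambda: "theta only", lambda: "gamma and theta")
```

Keeping the threshold as a parameter makes it straightforward to try the more lenient (0.5) or stricter (0.1) settings mentioned above.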

Gamma Ignored Method (GI)

The GI method ignores every student’s harshness and only estimates the teacher’s proficiency θ. Therefore, the log-likelihood function for θ_r is as follows

l (θ_{r}; y, δ, τ) = \sum_{j = 1}^{J} \log \Pr (y_{j k} | θ_{r}, δ_{j}, τ_{k})

(6)

where J is the number of administered items in CAT. Only ${\hat{θ}}_{r}$ is estimated. The infinite estimate arising due to the 1’s or 0’s in the responses was avoided by using the maximum a posteriori (MAP) with the normal prior for ${\hat{θ}}_{r}$. Otherwise, we used the maximum likelihood estimate. The GI method is adopted for comparison purposes.

Item Selection

In the current study, the Fisher information, $- E [\frac{\partial^{2}}{\partial θ^{2}} \log L (y | θ, γ)]$ , was used to adaptively select items for each student (Fisher, 1922; Lord, 1980; Thissen & Mislevy, 2000) even though other information-based item selection rules in CAT, such as the Kullback–Leibler information criterion (Chang & Ying, 1996) and mutual information criterion (Weissman, 2007), could be adopted in practice. Because the purpose of the current study is to deal with the hurdle of estimating γ and θ simultaneously in SET-CAT, using the primary maximum Fisher information criterion is sufficient for our purposes.
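For a rating scale model with unit slope, such as the MFRSM, the Fisher information about θ for one item equals the conditional variance of the item score. A minimal Python sketch of maximum-information selection based on that identity follows. The function names, the three-item pool, and the set of used items are hypothetical. For the MM and the GI, which carry no individual γ estimate, the sketch would need a different plug-in for γ.

```python
import math

def mfrsm_probs(theta, gamma, delta, taus):
    """Category probabilities for scores 0..K implied by Equation (1)."""
    log_w = [0.0]
    for tau in taus:
        log_w.append(log_w[-1] + theta - delta - gamma - tau)
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def expected_score(theta, gamma, delta, taus):
    """Expected item score given the latent traits."""
    return sum(k * p for k, p in enumerate(mfrsm_probs(theta, gamma, delta, taus)))

def item_information(theta, gamma, delta, taus):
    """Fisher information about theta, computed as the variance of the item score."""
    probs = mfrsm_probs(theta, gamma, delta, taus)
    mean = sum(k * p for k, p in enumerate(probs))
    return sum(k * k * p for k, p in enumerate(probs)) - mean * mean

def select_item(theta_hat, gamma_hat, pool, used, taus):
    """Index of the unused item with maximum information at the current estimates."""
    candidates = [j for j in range(len(pool)) if j not in used]
    return max(candidates,
               key=lambda j: item_information(theta_hat, gamma_hat, pool[j], taus))

TAUS = [-2.0, -1.07, 3.07]
pool = [-1.0, 0.0, 1.0]      # hypothetical item locations
next_item = select_item(theta_hat=0.0, gamma_hat=0.0, pool=pool, used={1}, taus=TAUS)
```

Because θ and γ enter the model only through θ − γ − δ, a change in the provisional γ̂_i shifts which item locations are most informative, which is one way the IOM and hybrid methods can end up selecting different items from the MM and the GI.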

Simulation Study

The simulation study aims to investigate the accuracy and precision of the θ and γ estimations for the MM, IOM, and hybrid methods when compared with the GI method. In the following simulation study, we examine the impact of the four methods and the test length on the parameter estimates. The computer software MATLAB was used to conduct the simulation studies. The execution details of the three proposed methods are summarized in the Appendix.

Design

A simulated SET scenario resembling the real conditions of the 2008 National Dong Hwa University evaluations in Taiwan was carried out. National Dong Hwa University had a total of 173 teachers rated by 6111 students. The course load for each teacher ranged from one to eight classes, and each teacher was rated by 12–146 students. For a reasonable simulation execution time, we set 1000 students rating 50 teachers, where every teacher taught four classes and each class had 20 students on average. Every teacher was rated by 80 students in total from the four different classes, and every student rated four teachers in the four different classes. For generalizability of the study, we also simulated conditions with small class sizes. The small class size conditions set class sizes equal to 3 and 5, where teachers were evaluated by students from four classes of 3 or 5 students; hence, each teacher was rated by 12 or 20 students, respectively. The small class size conditions reflected the empirical situation of Master’s or PhD program classes in universities where fewer than 10 students were enrolled in a class, and teachers were rated by approximately 10 or 20 students in total. A more extreme condition was considered where the class size equaled one. Every teacher taught four classes, which means each ratee was rated by four raters. In summary, four levels of class size were simulated with 20, 5, 3, or 1 student in a class, which represented 80, 20, 12, and 4 students evaluating each teacher, respectively.

The item pool contains 100 items, with the difficulty parameters generated from a normal distribution with zero mean and unit variance. The 100 items were sufficient to yield a high level of precision for the parameter estimation in CAT (Rudner, 2009). The three threshold parameters were set to [−2.0, −1.07, 3.07] from an empirical analysis of SET (Setari et al., 2016). The item responses were simulated by the MFRSM.
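Data generation under this design can be sketched in Python as below. The study itself used MATLAB; the seed, the variable names, and the inverse-CDF sampler are assumptions for illustration. The sketch draws the parameters for the largest design (50 teachers, 1000 students) and a single response.

```python
import math
import random

def mfrsm_probs(theta, gamma, delta, taus):
    """Category probabilities for scores 0..K implied by Equation (1)."""
    log_w = [0.0]
    for tau in taus:
        log_w.append(log_w[-1] + theta - delta - gamma - tau)
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def simulate_response(rng, theta, gamma, delta, taus):
    """Draw one score from the MFRSM category probabilities by inverse CDF."""
    probs = mfrsm_probs(theta, gamma, delta, taus)
    u = rng.random()
    cum = 0.0
    for k, p in enumerate(probs):
        cum += p
        if u < cum:
            return k
    return len(probs) - 1  # guard against floating-point round-off

rng = random.Random(2024)                             # arbitrary seed
TAUS = [-2.0, -1.07, 3.07]                            # thresholds from the design
pool = [rng.gauss(0.0, 1.0) for _ in range(100)]      # item locations from N(0, 1)
thetas = [rng.gauss(0.0, 1.0) for _ in range(50)]     # teacher proficiencies
gammas = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # student harshness
score = simulate_response(rng, thetas[0], gammas[0], pool[0], TAUS)
```

With three thresholds, the simulated scores fall in the four categories 0 to 3.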

Test lengths of 5, 10, and 20 items were chosen. The 20-item test length resembles the SET used at National Dong Hwa University in Taiwan. The 10-item condition was included to examine the effect of fewer items (Stocking, 1994; Wang & Kolen, 2001). The 5-item design aimed to show the impact of a short test length.

The three manipulated variables are test length, parameter estimation method, and class size. For the estimation method, the six conditions are the MM with a prior on γ with a variance of one, the MM with a prior on γ with a variance of two, the MM with a prior on γ with a variance of three, the IOM, the hybrid method, and the GI. For class size, the four conditions were 20, 5, 3, and 1 student in a class.

Two conditions for the teacher population were considered. In the first condition, a total of 50 teaching proficiencies θ were generated from N(0, 1). The purpose is to explore the overall precision of the estimation at the population level. The second condition assumes that θ was generated at levels −3, −2, −1, 0, 1, 2, and 3. That is, all 50 teaching proficiencies were fixed at one of the seven levels. The purpose of this design is to examine the estimation precision when θ is conditional on a specific level from low to high. Regardless of how the teacher’s θ was simulated, the students’ harshness, γ, was always generated from N(0, 1).

Each condition received six parameter estimation methods, with each governed by either the MM, IOM, hybrid, or GI. We considered the MM in three conditions: MM[σ² = 1], MM[σ² = 2], and MM[σ² = 3], where σ² denotes the variance of the prior distribution of γ. The condition of σ² = 1 represents a more informative prior distribution, while σ² = 3 represents a less informative prior distribution. For illustration purposes, the hybrid method used the prior distribution of γ with the variance equal to 1. The maximum Fisher information criterion was used for item selection. The initial items were randomly chosen to vary the selected items in the very early stage of CAT.

The provisional estimates $\hat{θ}$ (for the GI, MM, IOM, and hybrid) or $\hat{γ}$ (only for the IOM and the hybrid) were obtained using the MAP estimation. The final estimates of $\hat{θ}$ and $\hat{γ}$ for all methods were obtained by concurrent calibration using the joint maximum likelihood estimator, which was done by using the full response matrix after all students completed all teaching evaluations.

The conditions of three test lengths (5, 10, and 20) and the four class sizes (20, 5, 3, and 1) yield a total of 12 conditions in the simulations. The six approaches (MM[σ² = 3], MM[σ² = 2], MM[σ² = 1], IOM, hybrid, and GI) were compared for each condition. Every condition was implemented for 25 replications, which is sufficient to demonstrate the detriment of ignoring the γ parameter (see also Harwell et al., 1996). In addition, we added one more simulation condition where the item bank for SET-CAT contained only 20 items. The purpose was to inspect whether such a small item pool is sufficient in practice. This reflected the empirical situation that SET tests commonly contain few items. The 20 items were set for illustration purposes.

For $\hat{θ}$ and $\hat{γ}$, we calculated the root mean square error (RMSE), bias, and reliability as RMSE = $\sqrt{\sum_{t = 1}^{25} {({\hat{ξ}}_{t} - ξ_{t})}^{2} / 25}$, bias = $\sum_{t = 1}^{25} ({\hat{ξ}}_{t} - ξ_{t}) / 25$, and reliability = $\mathrm{Corr} {(ξ, \hat{ξ})}^{2}$, respectively, where t is the index of replications, ξ is the true value of the parameter, and $\hat{ξ}$ is the estimator of ξ. We calculated the RMSE and bias based on the collected item responses when all students completed the SET-CAT. We also evaluated the methods in terms of the characteristics of the selected items. The similarity of the selected items was quantified for each student by computing the overlap rate in the selected items between the methods, $\frac{\# (i (\text{method } Y) \cap i (\text{method } X))}{\text{test length}}$. For example, the overlap rate between the MM and the IOM is 4/10 = 0.4 for a test-taker when four common items overlap between the MM and the IOM. The distribution of the overlap rate for the 1000 students was evaluated for every pair of methods.
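The evaluation criteria can be written as short Python functions. This is a sketch with hypothetical function names. The functions operate on any paired lists of estimates and true values, so the averaging over the 25 replications described above is left to the caller.

```python
import math

def rmse(estimates, truths):
    """Root mean square error over paired estimates and true values."""
    n = len(truths)
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truths)) / n)

def bias(estimates, truths):
    """Mean signed difference between estimates and true values."""
    return sum(e - t for e, t in zip(estimates, truths)) / len(truths)

def reliability(estimates, truths):
    """Squared Pearson correlation between true and estimated values."""
    n = len(truths)
    mean_e, mean_t = sum(estimates) / n, sum(truths) / n
    cov = sum((e - mean_e) * (t - mean_t) for e, t in zip(estimates, truths))
    var_e = sum((e - mean_e) ** 2 for e in estimates)
    var_t = sum((t - mean_t) ** 2 for t in truths)
    return cov * cov / (var_e * var_t)

def overlap_rate(items_x, items_y, test_length):
    """Share of the test length made up of items selected under both methods."""
    return len(set(items_x) & set(items_y)) / test_length

# The worked example from the text: four common items on a 10-item test.
rate = overlap_rate([1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                    [1, 2, 3, 4, 11, 12, 13, 14, 15, 16], test_length=10)
```

Here `rate` reproduces the 4/10 = 0.4 example given above.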

Then, we explored the effects of methods, test lengths, and class sizes on the RMSE, bias, and reliability of teaching proficiency. We applied a three-way analysis of variance (ANOVA) where the RMSE, bias, and reliability of θ estimation were the outcome variables, and the methods, test lengths, and class sizes were the independent variables. The three-way interactions between methods, test lengths, and class sizes were examined.

Several results were expected: (a) for all methods, the average RMSE of θ and γ would decrease as more items were used, and the average bias of θ and γ would be close to zero; (b) for the MM and the GI, the RMSE of γ would be larger than for the IOM and hybrid approaches because neither the MM nor the GI considers the individual’s γ in the provisional estimation and item selection; (c) improving the precision (RMSE) of the γ estimate could improve the precision of the θ estimate; (d) the hybrid method would yield a lower RMSE for the θ estimate than the other methods.

Results

Class Size = 20

Table 1 shows the average bias, RMSE, and reliability of the teacher and student parameters on the six approaches with θ from N(0, 1). The average bias of $\hat{θ}$ was close to zero across methods and test lengths, whereas the average bias for $\hat{γ}$ was equal to zero because the γs were constrained to have a mean of zero. Note that the average bias across 1000 students would be zero when the mean of γ was constrained to zero because, within an iteration, the average true value of γ and average $\hat{γ}$ equal zero (i.e., $\sum_{n = 1}^{1000} γ_{n} = 0$ and $\sum_{n = 1}^{1000} {\hat{γ}}_{n} = 0$). Across the 25 iterations, the calculation of the “average” bias is

\sum_{n = 1}^{1000} \sum_{t = 1}^{25} ({\hat{γ}}_{n t} - γ_{n t}) / (25 \times 1000) = [\sum_{n = 1}^{1000} \sum_{t = 1}^{25} {\hat{γ}}_{n t} - \sum_{n = 1}^{1000} \sum_{t = 1}^{25} γ_{n t}] / (25 \times 1000) = [0 - 0] / (25 \times 1000) = 0

Table 1.

Average Bias, RMSE, and Reliability of Latent Trait Estimation in the Ignoring γ, Marginal, Iterative, and Hybrid Methods for a Class Size of 20.

	5 items		10 items		20 items
	θ	γ	θ	γ	θ	γ
Bias
GI	0.002	0.000	0.003	0.000	−0.004	0.000
MM σ² = 3	0.001	0.000	0.006	0.000	−0.009	0.000
MM σ² = 2	0.006	0.000	0.010	0.000	0.000	0.000
MM σ² = 1	0.003	0.000	−0.007	0.000	−0.009	0.000
IOM	−0.008	0.000	0.002	0.000	0.004	0.000
Hybrid	−0.008	0.000	0.009	0.000	−0.006	0.000
RMSE
GI	0.157	0.623	0.131	0.305	0.079	0.214
MM σ² = 3	0.139	0.615	0.092	0.296	0.070	0.195
MM σ² = 2	0.138	0.687	0.098	0.289	0.072	0.198
MM σ² = 1	0.134	0.598	0.095	0.277	0.070	0.194
IOM	0.134	0.373	0.095	0.244	0.055	0.173
Hybrid	0.132	0.366	0.083	0.246	0.067	0.184
Reliability
GI	0.953	0.756	0.989	0.917	0.996	0.956
MM σ² = 3	0.989	0.761	0.995	0.926	0.997	0.964
MM σ² = 2	0.989	0.732	0.994	0.927	0.997	0.964
MM σ² = 1	0.990	0.809	0.995	0.932	0.997	0.964
IOM	0.988	0.887	0.995	0.942	0.998	0.971
Hybrid	0.990	0.899	0.996	0.938	0.997	0.969

Note. RMSE is the root mean square error, GI is the gamma ignored method, MM is the marginal method, IOM is the iterative once method, and σ² indicates the prior variance of γ in the marginal method.

The longer test lengths (i.e., 20 items) had a lower RMSE and higher reliability for all the methods. For θ and γ, the RMSE of $\hat{θ}$ was lower than that of $\hat{γ}$ because every teacher was rated by 80 students and every student rated four teachers. For item information, we had 80 × J observations for estimating the teacher’s $\hat{θ}$ (where J is the test length), whereas only 4 × J observations were used to estimate the student’s $\hat{γ}$ . Thus, $\hat{θ}$ was better estimated than $\hat{γ}$ .

Among the four methods in the 5-item condition, the RMSE of $\hat{θ}$ for the hybrid method was the lowest, which suggests that the hybrid method yields the most precise estimates of $\hat{θ}$ among the approaches for the short test length condition. This may be attributed to the fact that the hybrid method includes the prior information of γ (e.g., the MM) at the early stage and the individual $\hat{γ}$ estimates during item selection (e.g., the IOM) at the later stage. The results for the MM and the IOM were close to that of the hybrid method in terms of the RMSE of θ (MM[σ² = 3] = 0.139; MM[σ² = 2] = 0.138; MM[σ² = 1] = 0.134; IOM = 0.134; hybrid = 0.132).

For the RMSE of $\hat{γ}$ in the 5-item condition, the MM had a higher RMSE for $\hat{γ}$ than the IOM and hybrid methods, which might be attributed to the influence of the prior information of σ². The MM approach took the prior distribution of γ into account, but the prior would be influential due to the small test length. The GI considered neither $\hat{γ}$ estimates nor the prior distribution of γ, so it tended to have a higher RMSE for $\hat{γ}$ .

In the 10-item condition, for the hybrid method the RMSE of $\hat{θ}$ was 0.083. The RMSE for the hybrid method was lower than those of the other methods. The three MMs had similar RMSE values for $\hat{θ}$ close to that of the IOM (RMSE_θ: MM[σ² = 3] = 0.092; MM[σ² = 2] = 0.098; MM[σ² = 1] = 0.095; IOM = 0.095). For the RMSE of $\hat{γ}$ , the IOM and hybrid methods had a similar performance (RMSE_γ: IOM = 0.244; hybrid = 0.246) and were better than the MM and the GI (RMSE_γ: MM[σ² = 3] = 0.296; MM[σ² = 2] = 0.289; MM[σ² = 1] = 0.277; GI = 0.305). This might be because neither the MM nor the GI considered the individual γ estimates, while the hybrid method and the IOM did.

In the 20-item condition, the IOM had the lowest RMSE, which is due to the fact that γ can be stably estimated when the test length is longer, even though it may be unstable at the early stage. The hybrid method with 20 items had a marginally smaller RMSE than the MM. The IOM, hybrid method, and MM performed better than the GI, but even the latter had a lower RMSE as test length increased. The reason that the IOM performed slightly better than the hybrid in the 20-item condition can be attributed to the IOM having a sufficiently long test length to update $\hat{γ}$ and $\hat{θ}$, which was more effective than just updating $\hat{θ}$. In the 30-item condition, the RMSEs of $\hat{θ}$ for the IOM and hybrid were 0.052 and 0.060, respectively; in the 40-item condition, they were 0.050 and 0.056, respectively. This shows that the IOM performed slightly better than the hybrid when the test length was longer. We attribute this to the fact that the hybrid only updated the γ estimates once the standard error fell below 0.3, and the point at which this happened differed across students. For some students, the standard error remained larger than 0.3 even in longer tests; thus, the MM was used throughout, and the γ estimates were never updated.

The bottom of Table 1 shows that the MM, IOM, and hybrid methods had higher reliability than GI, especially in 5-item and 10-item situations. In the 20-item case, all methods had high reliability for the teaching proficiency estimates.

Figure 1 shows the boxplot of the overlap rates among the four methods for the 5-, 10-, and 20-item test lengths. The MM with a prior N(0, 1) is presented in Figure 1 for illustration purposes. The overlap rates for the MM with priors N(0, 2) and N(0, 3) were similar to those for the MM with the prior N(0, 1) and are thus omitted from Figure 1. In the 5-item condition (top plot in Figure 1), the median of the overlap rate (the circle in the box plot) between the IOM and the hybrid method was equal to zero. This means that at least 50% of the test-takers under the IOM took 5 items that were totally different from the items that they took under the hybrid method. The maximum overlap rate of 0.4 between the IOM and hybrid method indicates that, at most, 2 of the 5 items a test-taker received under the IOM were the same as those received under the hybrid method. The overlap rates for the pairs of the MM and the hybrid, the MM and the IOM, the GI and the hybrid, and the GI and the IOM were equal to zero for all samples. This means that the MM and GI groups selected 5 items that were different from the IOM and hybrid groups for every person. However, the overlap rate between the GI and the MM had a third quartile of 0.8 and a maximum value of 1.0. This means that overlap in items between the GI and MM methods was 4 out of 5 items for approximately 25% of the students, on average, and that the overlap was 5/5 items for the other 25% of students. The overlap rate in the 5-item condition implied that the selected items in the GI and MM groups were highly similar to each other; the IOM and hybrid groups’ selected items were moderately similar to each other. The selected items for the GI and MM groups were different from those for the IOM and hybrid groups. The same conclusion about the overlap rate can be applied to the 10- and 20-item conditions (the middle and bottom plots in Figure 1).

For example, in the 20-item condition, the median overlap rate between the IOM and hybrid groups was approximately 0.3, whereas the median overlap rate between the GI and MM groups was high at 0.9. The medians of the overlap rate between the MM and the hybrid, the MM and the IOM, the GI and the hybrid, and the GI and the IOM groups were close to or equal to zero. The observation from Figure 1 can be used to explain the results in Table 1. The selected items for the MM were close to those for the GI, so they both had worse precision for $\hat{γ}$, as shown in Table 1. The items selected by the IOM and the hybrid overlapped moderately with each other; therefore, they performed similarly on the RMSE for $\hat{θ}$ and $\hat{γ}$, as shown in Table 1.

Figure 1.

Overlap rates among pairs of the MM (σ² = 1), hybrid method, GI, and IOM for the 5-, 10-, and 20-item test length conditions. The circles in the boxplots indicate the median across students.

The overlap rates in Figure 1 partially explain the average RMSE and reliability results in Table 1. The moderate overlap between the hybrid method and the IOM in Figure 1 matches the result in Table 1 that the RMSE and reliability values of $\hat{θ}$ and $\hat{γ}$ for the hybrid method and the IOM were close to each other and far from those for the GI and the MM. The items selected by the MM and the GI overlapped highly, with the rate reaching 100% for approximately 25% of students. The hybrid method and the IOM had low overlap rates with the MM and the GI, approximately 0% in the 5-item situation. This suggests that the MM and the GI selected items simply to improve the precision of θ estimation, whereas the hybrid method and the IOM selected items that could improve the estimation precision of both $\hat{θ}$ and $\hat{γ}$.
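The overlap rate discussed above can be illustrated with a short sketch. It assumes that the overlap rate for one student is the number of items administered under both methods divided by the test length; the function name and the item identifiers are hypothetical and serve only as an illustration, not as the code used in the simulation.

```python
from statistics import median


def overlap_rate(items_a, items_b):
    """Proportion of administered items shared by two methods for one student."""
    a, b = set(items_a), set(items_b)
    if len(a) != len(b):
        raise ValueError("both tests must have the same length")
    return len(a & b) / len(a)


# Hypothetical item IDs administered to three students under two methods
# (5-item tests); these are made-up values, not simulation output.
iom = [[1, 2, 3, 4, 5], [10, 11, 12, 13, 14], [20, 21, 22, 23, 24]]
hybrid = [[4, 5, 6, 7, 8], [30, 31, 32, 33, 34], [20, 40, 41, 42, 43]]

rates = [overlap_rate(a, b) for a, b in zip(iom, hybrid)]
print(rates)          # per-student overlap rates
print(median(rates))  # the summary marked by the circle in each boxplot
```

An overlap rate of 0.4 for a 5-item test corresponds to 2 shared items, which is the reading applied to the IOM and hybrid pair in the text.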

Class Size = 5, 3 or 1

The RMSEs for the conditions with class sizes of 5, 3, and 1 can be found in Table 2. The average bias and reliability are omitted from Table 2 because their patterns were similar to those in Table 1. A smaller class size tended to lead to a higher RMSE for $\hat{θ}$, whereas the effect of class size on the RMSE for $\hat{γ}$ was relatively small. The results suggest that with a larger class size, more students evaluated each teacher, which increased the precision of the θ estimation.

Table 2.

Average RMSE of Latent Trait Estimation Ignoring the γ, Marginal, Iterative, and Hybrid Methods for Small and Extremely Small Class Size Conditions.

	5 items		10 items		20 items
	θ	γ	θ	γ	θ	γ
Class size = 5
GI	0.550	0.643	0.421	0.583	0.163	0.507
MM σ² = 3	0.310	0.616	0.252	0.574	0.146	0.489
MM σ² = 2	0.289	0.628	0.240	0.578	0.142	0.480
MM σ² = 1	0.287	0.612	0.236	0.582	0.133	0.466
IOM	0.276	0.594	0.231	0.574	0.132	0.473
Hybrid	0.278	0.557	0.231	0.541	0.135	0.455
Class size = 3
GI	0.764	1.063	0.598	0.592	0.261	0.518
MM σ² = 3	0.591	0.648	0.427	0.570	0.220	0.484
MM σ² = 2	0.590	0.649	0.415	0.568	0.219	0.481
MM σ² = 1	0.585	0.642	0.413	0.570	0.215	0.476
IOM	0.540	0.598	0.392	0.571	0.215	0.472
Hybrid	0.530	0.575	0.394	0.548	0.218	0.461
Class size = 1
GI	1.831	0.773	1.391	0.599	0.519	0.559
MM σ² = 3	1.010	0.683	0.852	0.574	0.446	0.556
MM σ² = 2	0.989	0.681	0.840	0.572	0.442	0.554
MM σ² = 1	0.988	0.676	0.807	0.553	0.446	0.547
IOM	1.000	0.605	0.816	0.573	0.448	0.509
Hybrid	0.933	0.585	0.773	0.557	0.451	0.501

Under the condition of class size = 5 (the top six rows in Table 2), the IOM and hybrid methods had a slightly lower RMSE for $\hat{θ}$ than the MM when only 5 items were administered in the test. When the test length increased to 10 and 20 items, the MM, IOM, and hybrid methods performed similarly. Additionally, the different variances σ² in the prior distribution of the MM had little influence on the RMSEs across the conditions. For $\hat{γ}$, the GI and MM had larger RMSEs than the IOM and hybrid methods across test lengths.

In the class size = 1 condition (the bottom six rows in Table 2), the IOM had a higher RMSE for $\hat{θ}$ than the MM and the hybrid method when only 5 or 10 items were administered. When 20 items were administered, the three methods (MM, IOM, and hybrid) had similar RMSEs. For $\hat{γ}$ in the class size = 1 condition, the IOM and hybrid methods had lower RMSEs than the GI and MM across the test lengths. The GI always had the worst precision. All the estimators had large standard errors because little item information was available.
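For readers who want to reproduce summaries of this kind, the following sketch computes bias and RMSE for a set of estimates of one true parameter value. It assumes the usual definitions (mean error and root mean squared error over replications); the function names and the numeric values are hypothetical and are not taken from the simulation.

```python
import math


def bias(estimates, true_value):
    """Mean signed error of the estimates around the true value."""
    return sum(e - true_value for e in estimates) / len(estimates)


def rmse(estimates, true_value):
    """Root mean squared error of the estimates around the true value."""
    squared_errors = [(e - true_value) ** 2 for e in estimates]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical estimates of one teacher's proficiency over four replications.
theta_true = 1.0
theta_hat = [0.8, 1.1, 1.3, 0.9]

print(round(bias(theta_hat, theta_true), 3))
print(round(rmse(theta_hat, theta_true), 3))
```

Averaging such RMSE values over teachers (for θ) or students (for γ) would give condition-level summaries comparable in form to the entries of Table 2, under the stated assumptions.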

ANOVA Result

To understand the effects of the methods, test lengths, and class sizes on the precision of $\hat{θ}$ in the simulation, an ANOVA was conducted in which the dependent variable was the RMSE of $\hat{θ}$ and the independent variables were the method (four levels: GI, MM, IOM, and hybrid), test length (three levels: 5, 10, and 20 items), and class size (four levels: class size = 1, 3, 5, and 20). The results in Table 3 show that the three factors had a significant three-way interaction effect. This means that the differences in RMSE among the methods depended on the test length and the class size. Judging from the mean RMSEs of $\hat{θ}$ in each condition shown in Tables 1 and 2, the three-way interaction implies that no method always performed best (i.e., had the lowest RMSE); the performance of the methods also depended on the test length and class size.

Table 3.

ANOVA Summary Table for the Outcome Variable RMSE of θ and the Independent Variables Method, Test Length, and Class Size.

	Df	Sum square	Mean square	F value	Pr(>F)
Method	3	43.71	14.57	434.01	.000***
Test length	1	78.34	78.34	2333.49	.000***
Class size	3	282.08	94.03	2800.62	.000***
Method × test length	3	17.95	5.98	178.25	.000***
Method × class size	9	17.97	2	59.48	.000***
Test length × class size	3	34.78	11.59	345.28	.000***
Method × test length × class size	9	7.69	0.85	25.43	.000***
Residuals	2368	79.5	0.03

Note. Df = Degrees of freedom.
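As a check on how the entries of Table 3 relate to one another, the sketch below recomputes two F values from the reported sums of squares and degrees of freedom, using the standard relation F = (SS / df) / (SS_residual / df_residual). The function name is hypothetical; the inputs are the rounded values printed in Table 3, so small discrepancies from the reported F values are expected.

```python
def f_value(sum_sq, df, resid_sum_sq, resid_df):
    """F statistic as the ratio of an effect mean square to the residual mean square."""
    mean_sq = sum_sq / df
    resid_mean_sq = resid_sum_sq / resid_df
    return mean_sq / resid_mean_sq


# Rounded entries copied from Table 3 (residuals: SS = 79.5, df = 2368).
f_method = f_value(43.71, 3, 79.5, 2368)       # reported as 434.01
f_test_length = f_value(78.34, 1, 79.5, 2368)  # reported as 2333.49

print(round(f_method, 1), round(f_test_length, 1))
```

Both recomputed values agree with the tabled F values to within rounding error.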

For the ANOVA results for $\hat{γ}$, Table 4 shows that the main effects of method, test length, and class size were significant. There was an interaction between test length and class size. This indicates that a smaller class size would lead to a reduction in the precision of γ when the test length was shorter.

Table 4.

ANOVA Summary Table for the Outcome Variable RMSE of γ and the Independent Variables Method, Test Length, and Class Size.

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
Method	3	2	0.75	2.712	.043*
Test length	1	35	35.31	128.205	.000***
Class size	3	388	129.37	469.779	.000***
Method × test length	3	0	0.11	0.383	.77
Method × class size	9	2	0.2	0.72	.69
Test length × class size	3	4	1.45	5.167	.001**
Method × test length × class size	9	1	0.12	0.435	.92
Residuals	47,968	13,210	0.28

Note. Df = Degrees of freedom.

Results Conditional on θ Levels

For the conditional θ levels of [−3, −2, −1, 0, 1, 2, 3], Figures 2, 3, and 4 show the results of the 5-item, 10-item, and 20-item conditions with MM[σ² = 1].

Figure 2.

Bias and RMSE of θ and γ conditional on the different θ levels for the test length of 5 items. GI is the gamma ignored method, MM is the marginal method, σ² indicates the prior variance of γ in the marginal method, and the IOM is the iterative method.

Figure 3.

Bias and RMSE of θ and γ conditional on the different θ levels for the test length of 10 items. GI is the gamma ignored method, MM is the marginal method, σ² indicates the prior variance of γ in the marginal method, and the IOM is the iterative method.

Figure 4.

Bias and RMSE of θ and γ conditional on the different θ levels for the test length of 20 items. GI is the gamma ignored method, MM is the marginal method, σ² indicates the prior variance of γ in the marginal method, and the IOM is the iterative method.

In Figure 2, the upper two plots show the bias of teaching proficiency (left, Figure 2(a)) and student harshness (right, Figure 2(b)), and the lower two plots show the RMSE of teaching proficiency (left, Figure 2(c)) and student harshness (right, Figure 2(d)). As shown in Figure 2(a), the biases of teaching proficiency under the four methods were larger at the extreme proficiency levels than at the middle levels: high proficiency had a positive bias, and low proficiency had a negative bias. Biased estimates at the extreme levels are anticipated under maximum likelihood estimation (Lord, 1986). Lord (1986) demonstrated that the bias of the maximum likelihood estimate is a function of the inverse of the information function (see equation (6) of Lord, 1986), so estimates at the extreme levels, where information is typically lower, are more biased than those at the middle levels. The hybrid method had less biased estimates of θ than the other methods. Generally, all methods had biases close to zero at the middle θ levels. In Figure 2(b), the bias of student harshness remained at zero across the levels of teaching proficiency. The average bias of student harshness was close to zero because the mean of student harshness was constrained to zero for model identification.

For the RMSE of teacher proficiency (see Figure 2(c)), the hybrid method performed similarly to the MM and the IOM from the −1 to the 1 level of teaching proficiency but had a lower RMSE than the MM and the IOM at the −3 and 3 levels. That is, the hybrid method performed better than the MM and the IOM at the extreme θ levels. The GI had the highest RMSE for θ across all levels of teaching proficiency. This suggests that the three proposed methods improved the RMSE at all teacher proficiency levels. For the RMSE of student harshness γ (see Figure 2(d)), the GI and the MM had similar RMSEs across all levels of teaching proficiency. This is because neither the GI nor the MM updated γ or selected items by considering γ estimates. Although the MM considered prior information for γ, the prior was constant across all students. The items selected by the MM provided the maximum information given the updated $\hat{θ}$ and the prior information on γ. In other words, the MM individualized the test items only in line with the updated $\hat{θ}$. In contrast, the hybrid method and the IOM performed better on the γ estimates across all θ levels because they updated $\hat{γ}$ for individuals. The hybrid method showed the lowest RMSE across all θ levels. The IOM had an RMSE similar to that of the hybrid method at the middle θ levels but a higher RMSE at the extreme θ levels. This indicates that the hybrid method, which combines the strengths of the MM and the IOM, can improve the precision of the estimation at the extreme θ levels.

Figure 3 shows the bias and RMSE conditional on the θ levels for the 10-item situation. Figure 3(a) shows that the hybrid method had a smaller average bias on θ estimation at the −3 and +3 levels than the other three methods. The average bias for the γ estimates was zero across methods (Figure 3(b)). Figure 3(c) shows that the RMSEs ranged from 0.087 to 0.203 for the IOM, from 0.087 to 0.200 for the MM, and from 0.088 to 0.201 for the GI. The hybrid method performed better than the MM and the IOM for the RMSE of $\hat{θ}$ and the RMSE of $\hat{γ}$, especially when θ was at the −3 and +3 levels. These trends were similar to those for the 5-item situation in Figure 2.

Figure 4 shows the bias and RMSE conditional on the θ levels for the 20-item condition. Figure 4(a) shows that the GI method had a larger average bias at the −3 and +3 θ levels than the other methods. The average bias for the γ estimates was zero across methods (Figure 4(b)). Figure 4(c) shows that the MM, the IOM, and the hybrid method had a lower RMSE of $\hat{θ}$ than the GI across all θ levels, especially at the −3 and +3 levels. The hybrid method performed better than the MM and the IOM regarding the RMSEs of $\hat{θ}$ and $\hat{γ}$ at the +2 and +3 θ levels. However, the hybrid method did not perform better than the IOM and the MM for either the RMSE of $\hat{θ}$ or the RMSE of $\hat{γ}$ at the −3 θ level. This result is consistent with Table 1, where the IOM had RMSEs slightly lower than those of the hybrid method in the 20-item condition. A possible reason is that the long test length helped the IOM estimates $\hat{θ}$ and $\hat{γ}$ converge quickly to the true values, so the MM stage in the hybrid method may not be helpful at the extreme θ level.

Only 20 Items in the Item Bank

To reflect the realistic situation in which a school has developed only a few items for SET, we simulated an additional condition with only 20 items in the item bank. The RMSEs of $\hat{θ}$ and $\hat{γ}$ are shown in Table 5. For $\hat{θ}$, the IOM and the hybrid method performed similarly to each other and better than the MM in the 5-item and 10-item conditions with class sizes of 20 and 5. All methods performed identically in the 20-item condition because all items in the item bank were exhausted. When the class size equaled 1, the MM performed better than the IOM and the hybrid method, whereas the GI performed worst. The RMSE of $\hat{γ}$ showed a similar trend across the methods.

Table 5.

Root Mean Square Error (RMSE) of $\hat{θ}$ and $\hat{γ}$ When the Item Bank Contained Only 20 Items.

	5 items		10 items		20 items
	$\hat{θ}$	$\hat{γ}$	$\hat{θ}$	$\hat{γ}$	$\hat{θ}$	$\hat{γ}$
Class size = 20
GI	0.171	0.666	0.134	0.452	0.076	0.224
MM σ² = 3	0.138	0.632	0.113	0.340	0.076	0.224
MM σ² = 2	0.138	0.616	0.114	0.331	0.076	0.224
MM σ² = 1	0.137	0.633	0.113	0.338	0.076	0.224
IOM	0.135	0.450	0.101	0.335	0.076	0.224
Hybrid	0.132	0.467	0.103	0.356	0.076	0.224
Class size = 5
GI	0.568	0.654	0.496	0.600	0.164	0.506
MM σ² = 3	0.320	0.638	0.266	0.598	0.164	0.506
MM σ² = 2	0.318	0.637	0.269	0.594	0.164	0.506
MM σ² = 1	0.316	0.642	0.265	0.597	0.164	0.506
IOM	0.292	0.647	0.250	0.599	0.164	0.506
Hybrid	0.299	0.655	0.254	0.604	0.164	0.506
Class size = 1
GI	1.892	0.688	1.476	0.628	0.583	0.546
MM σ² = 3	1.100	0.679	0.884	0.620	0.583	0.546
MM σ² = 2	1.065	0.671	0.859	0.620	0.583	0.546
MM σ² = 1	0.979	0.676	0.847	0.616	0.583	0.546
IOM	1.050	0.595	0.891	0.615	0.583	0.546
Hybrid	1.046	0.593	0.888	0.583	0.583	0.546

Conclusion and Discussion

The current study proposes three estimation methods for SET-CAT that take teacher proficiency and student harshness into account. The GI is the baseline method, which fixes all student harshness values at zero and updates only the teacher’s provisional proficiency estimates. The MM marginalizes student harshness using prior distributions in the likelihood function, whereas the IOM updates student harshness based on the response given to each item and updates teacher proficiency after each student’s completed evaluation. The hybrid method builds on the MM and the IOM: it uses the prior information on the rater’s predisposition in the early stages of CAT and, in the later stages, updates student harshness and teacher proficiency iteratively once the item information is sufficient for teacher proficiency (e.g., a standard error smaller than 0.3).
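The switching idea behind the hybrid method can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that the standard error of the proficiency estimate is approximated by the inverse square root of the accumulated item information, and that the method moves from the marginal stage to the iterative stage once that standard error falls below the example threshold of 0.3. The function names and the item information values are hypothetical.

```python
import math

SE_THRESHOLD = 0.3  # example threshold mentioned in the text


def standard_error(item_informations):
    """Approximate SE of the proficiency estimate from accumulated item information."""
    total = sum(item_informations)
    if total <= 0:
        return math.inf  # no information yet, so the SE is unbounded
    return 1.0 / math.sqrt(total)


def hybrid_stage(item_informations, threshold=SE_THRESHOLD):
    """Return which stage the hybrid method would be in (assumed switching rule)."""
    if standard_error(item_informations) < threshold:
        return "iterative"  # update harshness and proficiency iteratively
    return "marginal"       # rely on the prior for student harshness


# Hypothetical information contributed by the items administered so far.
print(hybrid_stage([1.0, 1.0]))       # early stage, little information
print(hybrid_stage([4.0, 4.0, 4.0]))  # later stage, SE below the threshold
```

Under these assumptions, a short test with little accumulated information keeps the method in the prior-based stage, which is in line with the observation that the prior matters most for short tests and small classes.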

The simulation results show that the hybrid method, the MM, and the IOM can reduce the RMSE and increase reliability for both teacher proficiency and student harshness, mostly for 5-item and 10-item tests in the class size = 20 condition. This is especially the case when a rated teacher has teaching proficiency at an extremely low or high level; here, the hybrid method and the IOM improved the precision of teacher proficiency as well as the precision of student harshness. Among the three methods, the hybrid method yielded higher precision for student harshness and teaching proficiency than the MM and the IOM at the extreme levels (+3 and −3) of teaching proficiency, where the maximum difference was approximately 0.07. For more moderate levels of teaching proficiency, the difference in precision between the hybrid method and the IOM was not evident. In the 20-item condition, the IOM was more precise than the hybrid method at the extreme level of teaching proficiency, by approximately 0.04. Thus, the hybrid and IOM methods could be promising methods for online SET.

In the five-student class condition, the MM, IOM, and hybrid methods had similar precision for the θ estimates when the test length equaled 10 or 20 items, but when only 5 items were administered, the MM had a slightly higher RMSE. This finding suggests that the prior influenced the short tests considerably. In the one-student class condition, the IOM performed worse than the MM and the hybrid method when 5 or 10 items were administered. With so few students, the prior helped the estimation precision of the MM and the hybrid method. The GI performed worst among the methods and thus is not recommended for use in practice.

In the situation of the item bank with 20 items, the results showed that the IOM and the hybrid method had lower RMSEs than the MM, whereas the MM had lower RMSEs when only one student was in a class. This suggests that the IOM and hybrid methods are recommended when the class size is equal to or larger than 5, and that the MM is recommended when only one student evaluates the teacher in a class.

The contribution of the current study is that it considers the uncertainty of both rater harshness and ratee ability and shows how ignoring students’ rating harshness harms parameter estimates. The methods may also apply to other rater-mediated assessments. For example, in physical therapy, clinicians evaluate the stroke patient’s balance function by using instruments such as the Berg balance scale, the balance evaluation systems test, or the dynamic gait index. The item pool (41 items) for assessing the patient’s balance function has been well established (Hsueh et al., 2010). If the methods in the current study were used, the clinician’s harshness and the patient’s balance could be iteratively updated, so an improvement in the precision of both measures (clinician harshness and patient ability) could be expected, especially when the patient’s balance ability is very low or high. Our simulation condition of class size = 1 gives insights into the clinical evaluation situation. The MM would be recommended in such a situation, especially when the size of the item bank is limited.

When student identities cannot be collected in the SET assessment (i.e., anonymous students), the individual student’s γ cannot be identified. In this case, the MM is recommended because it employs the prior distribution for students’ predispositions, which was shown to perform better than the GI in terms of the RMSE for θ.

Several future improvements to the hybrid method can be made. Content balance strategies such as the modified multinomial method (Chen et al., 1999), a constrained CAT (Kingsbury & Zara, 1989), the modification of a constrained CAT (Leung et al., 2000), the maximum priority index method (Cheng & Chang, 2009), and the shadow tests approach (van der Linden, 2005) are valuable applications for SET-CAT in the future because they can make the content areas meet the required number of administered items while maintaining the content validity of CAT. In addition, a variable-length CAT for SET using the GI, MM, IOM, and hybrid methods can be further explored in future studies. Stopping rules for terminating CAT under the MM, IOM, and hybrid methods will be needed for a variable-length CAT. In summary, the three proposed methods for SET-CAT successfully improved the measurement precision of teacher proficiency and provided ways of administering CAT with multifaceted models.

Supplemental Material

Supplemental Material for Online Parameter Estimation for Student Evaluation of Teaching by Chia-Wen Chen, and Chen-Wei Liu in Applied Psychological Measurement

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

ORCID iD

Chen-Wei Liu

References

Acuña, E. A. V. (2017). Response styles in student evaluation of teaching. Doctoral dissertation. University of Toronto.

Anderson, H. M., Cain, & Bird (2005). Online student course evaluations: Review of literature and a pilot study. American Journal of Pharmaceutical Education, 69(1), 5. https://doi.org/10.5688/aj690105

Birnbaum (1968). Some latent trait models and their use in inferring an examinee’s ability. In Lord, F. M., Novick, M. R., & Birnbaum (Eds.), Statistical theories of mental test scores (pp. 397–479). Addison-Wesley.

Boone, W. J., Townsend, J. S., & Staver, J. R. (2016). Utilizing multifaceted Rasch measurement through FACETS to evaluate science education data sets composed of judges, respondents, and rating scale items: An exemplar utilizing the elementary science teaching analysis matrix instrument. Science Education, 100(7), 221–238. https://doi.org/10.1002/sce.21210

Chang, H.-H., & Ying (1996). A global information approach to computerized adaptive testing. Applied Psychological Measurement, 20(3), 213–229. https://doi.org/10.1177/014662169602000303

Chen, S.-Y., Ankenmann, R. D., & Spray, J. A. (1999). Exploring the relationship between item exposure rate and test overlap rate in computerized adaptive testing (ACT Research Report Series, 99, Issue 5). ACT.

Cheng, & Chang, H. H. (2009). The maximum priority index method for severely constrained item selection in computerized adaptive testing. British Journal of Mathematical and Statistical Psychology, 62(Pt 2), 369–383. https://doi.org/10.1348/000711008X304376

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1–22. https://doi.org/10.1111/j.2517-6161.1977.tb01600.x

Dommeyer, C. J., Baum, Hanna, R. W., & Chapman, K. S. (2004). Gathering faculty teaching evaluations by in-class and online surveys: Their effects on response rates and evaluations. Assessment & Evaluation in Higher Education, 29(5), 611–623. https://doi.org/10.1080/02602930410001689171

Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 222(594–604), 309–368. https://doi.org/10.1098/rsta.1922.0009

Gamliel, & Davidovitz (2005). Online versus traditional teaching evaluation: Mode can matter. Assessment & Evaluation in Higher Education, 30(6), 581–592. https://doi.org/10.1080/02602930500260647

Harwell, Stone, C. A., Hsu, T.-C., & Kirisci (1996). Monte Carlo studies in item response theory. Applied Psychological Measurement, 20(2), 101–125. https://doi.org/10.1177/014662169602000201

Holzbach, R. L. (1978). Rater bias in performance ratings: Superior, self-and peer ratings. Journal of Applied Psychology, 63(5), 579–588. https://doi.org/10.1037/0021-9010.63.5.579

Hoyt, W. T. (2000). Rater bias in psychological research: When is it a problem and what can we do about it? Psychological Methods, 5(1), 64–86. https://doi.org/10.1037/1082-989x.5.1.64

Hsueh, I.-P., Chen, J.-H., Wang, C.-H., Chen, C.-T., Sheu, C.-F., Wang, W.-C., Hou, W.-H., & Hsieh, C.-L. (2010). Development of a computerized adaptive test for assessing balance function in patients with stroke. Physical Therapy, 90(9), 1336–1344. https://doi.org/10.2522/ptj.20090395

Kember, & Wong (2000). Implications for evaluation from a study of students' perceptions of good and poor teaching. Higher Education, 40(1), 69–97. https://doi.org/10.1023/A:1004068500314

Kingsbury, G. G., & Zara, A. R. (1989). Procedures for selecting items for computerized adaptive tests. Applied Measurement in Education, 2(4), 359–375. https://doi.org/10.1207/s15324818ame0204_6

Leung, C.-K., Chang, H.-H., & Hau, K.-T. (2000). Content balancing in stratified computerized adaptive testing designs. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Linacre, J. M. (1993). Generalizability theory and many-facet Rasch measurement [Paper presentation]. In The Annual Meeting of the American Educational Research Association, Atlanta, GA, USA, April 12-16 1993. https://eric.ed.gov/?id=ED364573

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Lawrence Erlbaum Associates. https://doi.org/10.4324/9780203056615-7

Lord, F. M. (1986). Maximum likelihood and Bayesian parameter estimation in item response theory. Journal of Educational Measurement, 23(2), 157–162. https://doi.org/10.1111/j.1745-3984.1986.tb00241.x

Muraki (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16(2), 159–176. https://doi.org/10.1177/014662169201600206

Paolo, A. M., Bonaminio, G. A., Gibson, Partridge, & Kallail (2000). Response rate comparisons of e-mail-and mail-distributed student evaluations. Teaching and Learning in Medicine, 12(2), 81–84. https://doi.org/10.1207/S15328015TLM1202_4

Reckase, M. D. (2009). Multidimensional item response theory models. In Reckase, M. D. (Ed.), Multidimensional item response theory (pp. 79–112). Springer. https://doi.org/10.1007/978-0-387-89976-3_4

Rudner, L. M. (2009). Implementing the graduate management admission test computerized adaptive test. In van der Linden, W. J., & Glas, A. W. C. (Eds.), Elements of adaptive testing (pp. 151–165). Springer. https://doi.org/10.1007/978-0-387-85461-8_8

Setari, A. P., Lee, & Bradley, K. D. (2016). A psychometric approach to the validation of a student evaluation of teaching instrument. Studies in Educational Evaluation, 51, 77–87. https://doi.org/10.1016/j.stueduc.2016.09.006

Stocking, M. L. (1994). Three practical issues for modern adaptive testing item pools. ETS Research Report Series, 1994(1), 1–34. https://doi.org/10.1002/j.2333-8504.1994.tb01578.x

Thissen, & Mislevy, R. J. (2000). Testing algorithms. In Wainer, Dorans, N. J., Eignor, Flaugher, Green, B. F., Mislevy, R. J., Steinberg, & Thissen (Eds.), Computerized adaptive testing: A primer (2nd ed., pp. 125–158). Lawrence Erlbaum Associates. https://doi.org/10.4324/9781410605931-13

van der Linden, W. J. (2005). A comparison of item-selection methods for adaptive tests with content constraints. Journal of Educational Measurement, 42(3), 283–302. https://doi.org/10.1111/j.1745-3984.2005.00015.x

Wainer, Dorans, N. J., Flaugher, Green, B. F., & Mislevy, R. J. (2000). Computerized adaptive testing: A primer. Lawrence Erlbaum Associates. https://doi.org/10.4324/9781410605931-13

Wang, & Kolen, M. J. (2001). Evaluating comparability in computerized adaptive testing: Issues, criteria and an example. Journal of Educational Measurement, 38(1), 19–49. https://doi.org/10.1111/j.1745-3984.2001.tb01115.x

Weissman (2007). Mutual information item selection in adaptive classification testing. Educational and Psychological Measurement, 67(1), 41–58. https://doi.org/10.1177/0013164406288164

Wesolowski, B. C., Wind, S. A., & Engelhard Jr (2016). Examining rater precision in music performance assessment: An analysis of rating scale structure using the multifaceted Rasch partial credit model. Music Perception: An Interdisciplinary Journal, 33(5), 662–678. https://doi.org/10.1525/mp.2016.33.5.662

Wind, S. A., & Jones (2019). The effects of incomplete rating designs in combination with rater effects. Journal of Educational Measurement, 56(1), 76–100. https://doi.org/10.1111/jedm.12201

Wolfe, E. W. (2004). Identifying rater effects using latent trait models. Psychology Science, 46(1), 35–51.
