Confirming the Factor Structure of a Research-Based Mid-Semester Evaluation of College Teaching

Abstract

End-of-semester evaluations provide scalable data for university administrators, but typically do not provide instructors with timely feedback to inform their teaching practices. Midsemester evaluations have the potential to provide instructors with beneficial formative feedback that can contribute to improved teaching practices and student engagement. However, existing research on the construction of valid, reliable midsemester tools is rare, and there are no existing midsemester evaluation scales that were constructed using education research and psychometric analysis. To address this gap, we designed and piloted a midsemester evaluation of teaching with 29 instructors and 1,350 undergraduate students. We found evidence that our Mid-Semester Evaluation of College Teaching (MSECT) is a valid and reliable measure of four constructs of effective teaching: classroom climate, content, teaching practices, and assessment. Furthermore, our factor structure remained consistent across instructor genders, providing evidence that the MSECT may be less susceptible to gender bias than prior student evaluation measures.

Keywords

student evaluations of teaching, college teaching, psychometrics

Introduction

Many university instructors collect feedback from their students at two points: midway through the semester and at the end of the semester. The end-of-semester student evaluations of teaching (SETs) are often mandated by the school’s administration or accreditation body. Midsemester evaluations (MSEs) are often voluntary and provide feedback to instructors to improve their teaching within the current term instead of waiting until the next semester. Although end-of-semester SETs have been the subject of decades of critical research (Spooren et al., 2017), much less is known in the literature about MSEs of college teaching.

Formative feedback, which takes place during the process and is typically low-stakes, is essential for improving performance (Hattie & Timperley, 2007). MSE feedback is typically only reported to the instructor to improve their practice, and not to department chairs or deans (Overall & Marsh, 1979). Therefore, MSEs can be thought of as formative evaluation, and may relate to improvements in teaching. However, designing and implementing a formative teaching evaluation instrument requires both instructors’ time and expertise in measurement and pedagogy, resources many instructors do not have. Therefore, the field would benefit from a standardized, validated instrument because instructors across disciplines could be confident that the feedback they collect is useful without spending time on scale construction.

It is rare for college teaching evaluations to be rigorously evaluated for reliability and validity. In our search, we found no published MSE instruments that are grounded in the education literature, designed for use at midsemester, and that have held up to psychometric scrutiny. In light of this gap, we constructed and assessed a brief, multidimensional evaluation measure designed to be used for MSEs of teaching in higher education contexts. It is particularly important that MSEs’ validity and reliability are assessed, as there is a growing body of literature identifying these problems with end-of-semester evaluations (Spooren et al., 2017).

Theoretical Framework

We draw on theory that identifies the importance of high-quality feedback for improving performance to ground our discussion on the importance of SETs. Hattie and Timperley (2007) note that to improve the quality of accomplishing a complex task, feedback is most effective when it provides information specifically related to the task, and is situated within the context (e.g., within the course an instructor is teaching) so that the feedback can be immediately applied. Feedback on a task is most powerful in improving performance when it is provided during task acquisition. We apply these theories to the practice of teaching in higher education. Although this area of research has focused on the importance of students’ receiving feedback from instructors, it implies that instructors would also benefit from formative feedback from knowledgeable others, including their students.

The content of the feedback is an important determinant of its effectiveness (Hattie & Timperley, 2007; Kember et al., 2002). In this case, we want to generate student feedback on aspects of teaching that are within the instructors’ control, that are strongly related to student learning, and that remain relevant across course disciplines and contexts. When determining what aspects of teaching to assess, we turn to education frameworks such as the Self-System Model of Motivational Development (Skinner et al., 2008), Expectancy-Value Theory perspective (Wigfield & Eccles, 2000), and the Zone of Proximal Development (Vygotsky, 1978). There is a vast literature base to draw from when defining effective college teaching (e.g., Alcott, 2017; Alderman, 2013; Atkinson & Siew Leng, 2013; Bain, 2012; Chickering & Gamson, 1987; Corkin et al., 2017; Murphy & Alexander, 2000).

We designed our measure of effective teaching to align with the Fearless Teaching Framework (FTF), which is a theoretical frame that defines effective teaching as that which promotes students’ motivation and engagement, leading to academic achievement (Donlan et al., 2019). The FTF distills decades of education research and theory into broad categories of effective teaching. Taken together, the foundational education literature indicates that teaching is most effective when it takes place in warm climates that are supportive for learning, includes content that students value and are able to understand, incorporates evidence-based teaching practices such as the inclusion of active learning, and relies on valid assessment strategies that provide guiding feedback to students (Donlan et al., 2019). The FTF, therefore, defines effective teaching based on four essential pieces: climate, content, practice, and assessment (Donlan et al., 2019). It is important to note that the fundamental elements of effective teaching (e.g., the course climate) may be enacted differently across disciplines and classroom contexts, but their overall importance remains (e.g., Corkin et al., 2014; Zubrunn et al., 2014).

SETs

The most common SETs are conducted at the end of each semester. Institutionally standardized end-of-semester SETs enable universities to take a quick, low-cost “quantitative snapshot” of students’ perceptions of teaching and course design (Surgenor, 2013, p. 364). SETs traditionally consist of both Likert-type and open-ended survey items designed to collect quantitative and qualitative feedback about teaching at a scale that might not be possible or economical with other research methods (Hammonds et al., 2017; Richardson, 2005). The feedback gained by end-of-semester SETs can be used by individual instructors to improve their teaching and by administrators for promotion and tenure decisions (Richardson, 2005).

Despite the nearly ubiquitous use of end-of-semester SETs at universities internationally, there are well-documented issues with their validity, reliability, bias, and usefulness (Richardson, 2005; Spooren et al., 2013, 2017; Zabaleta, 2007). Although a few scales have emerged in the literature as being valid and reliable instruments (e.g., the Students’ Evaluations of Educational Quality [SEEQ]; Marsh, 1982), most universities still use SETs that were constructed in-house, without deeply assessing the instrument’s psychometric properties (Spooren et al., 2017).

The construct validity of end-of-semester SET data is widely debated, and researchers often focus on the observation that despite the field’s agreement that SET instruments must be founded in teaching and learning literature (Richardson, 2005), there is not a research-based conceptual framework that grounds SET items (Spooren et al., 2017). Instead, the typical approach to designing SETs is to choose items based on assessments of student satisfaction and a review of survey items used by other institutions, regardless of whether the item is well-constructed, or connects to current education research (Kember et al., 2002; Marsh, 1987). For example, in spite of the growing evidence that active learning teaching strategies are related to improvements in student learning, SETs typically do not ask students to report on active learning (Kember et al., 2002).

Validity concerns also emerge from the significant evidence that homegrown SET data are biased by a number of factors unrelated to instructional practices. For example, research has demonstrated bias related to the gender of the instructor, grade inflation, and whether the course is required or an elective (Marsh & Roche, 2000; Spooren, 2010, 2017). Echoing this issue of validity, Oon et al. (2017) and Socha (2013) found that most homegrown end-of-semester SETs are not assessed psychometrically. This is a problem because without psychometric analysis it is impossible to know the extent to which SET data provide a valid, reliable assessment of teaching quality or effectiveness.

In addition, the short, summative surveys most commonly administered as SETs do not reliably assess the multidimensionality of teaching (Knol et al., 2016). Although many SETs load onto only one dimension (i.e., teaching effectiveness or student satisfaction), Marsh (1987) and Apodaca and Grad (2005), among other scholars, argue that a multidimensional understanding of teaching quality is more valid and accurate because teaching is not a unidimensional practice (Ambrose et al., 2010; Donlan et al., 2019).

Another weakness in the current practice of administering end-of-semester SETs is that instructors often receive feedback about the Fall semester after the Spring semester has begun, meaning they are unable to use the Fall SETs to revise their course for the Spring. For SETs to provide formative feedback to instructors, summaries of the results would need to be given to instructors well before the start of the next semester.

To address the weaknesses of end-of-semester SETs, researchers (e.g., Cohen, 1980) recommend also using MSEs of teaching. Using an MSE designed to collect actionable information, instructors can gain timely, formative feedback from students while they still have time in the semester to implement changes (Costello et al., 2002). Formative feedback has been shown to positively predict the use of more effective teaching practices (e.g., Hampton & Reiser, 2004) and positively relate to the student experience (Costello et al., 2002; Overall & Marsh, 1979). MSEs are often used solely for formative feedback to improve teaching and not for promotion or tenure (Overall & Marsh, 1979). Therefore, instructors can react to feedback in a low-stakes environment initially, and then administer end-of-semester SETs which are often high-stakes.

Purpose of the Study

To assist instructors in their use of MSEs, we constructed a brief survey scale for their use in college classrooms to collect feedback from their students. The scale has four subscales that align with the FTF: climate, content, practice, and assessment. The purpose of this study is to assess the psychometric properties of the scale, including the extent to which it assesses multiple dimensions of teaching (see Figure 1), and any potential form-invariance by instructor gender. We assess invariance by gender to investigate the scale’s susceptibility to potential bias by instructor gender, as is often present in other SETs.

Figure 1.

Conceptual model.

Research Question 1: What underlying factor structure approximately fits the data? We first explored the factor structure of the scale with one half of the sample and selected a solution using a model selection approach (Preacher et al., 2013).

Research Question 2: In alignment with our theoretical frame, to what extent do latent climate, content, practice, and assessment factors fit the underlying structure of the data? After selecting our model in research question 1 based on fit indices and theory, we confirmed the factor structure with the second half of the sample.

Research Question 3: To what extent is the model’s factor structure invariant across instructor gender? In response to the research on bias in course evaluations by instructor gender, we assessed the form invariance of the scale by instructor gender across the full sample. We predicted there would be no major differences in factor structure across genders.

Method

Instrument Development

We designed the Mid-Semester Evaluation of College Teaching (MSECT) to align with the FTF (Donlan et al., 2019), which defines effective teaching through four dimensions: classroom climate, course content, teaching practices, and assessment strategies. These four dimensions were chosen because there is education theory and research that indicates they are strong predictors of student motivation, engagement, and academic success (e.g., Lave & Wenger, 1991; Skinner et al., 2008; Vygotsky, 1978; Wigfield & Eccles, 2000). Furthermore, they are aspects of the classroom experience that the instructor has direct control over, in contrast to class size, physical environment of the room, or demographic characteristics.

The instrument was written with the goal of effective midsemester SETs in mind. The authors drafted each item to correspond with one of the dimensions of the FTF. The items were written to be actionable—meaning that instructors would be able to make changes to their courses based on the feedback they receive. The items were also written to be useful in a wide range of academic disciplines and classroom settings, and the overall scale was designed to be short so that data could be collected during class sessions. The original instrument contained two items designed only for online courses. These were removed from these analyses because none of the courses in this pilot study were taught online.

After the first iteration of the items was written, we assessed the content validity by soliciting feedback from a panel of expert faculty developers and education researchers. These experts were selected because of their knowledge and experience on topics such as active learning pedagogy, learning assessments, and inclusive climate. Each of these experts served as a professional faculty developer or education researcher at their university’s teaching and learning center. The experts suggested minor wording changes, which were incorporated into the measure. The experts then confirmed the content validity of the final instrument.

Measures

MSECT

The instrument has four subscales: climate, content, practices, and assessment. All questions on the MSECT used the same response set (1 = strongly disagree; 6 = strongly agree). All items are positively worded. All items are intended to measure the students’ perceptions of the teaching and the course. The full items can be found in Table 4.

Climate subscale

A positive classroom climate is supportive, inclusive, accessible, and equitable for diverse groups of students (Morin et al., 2014). The climate subscale consisted of three items designed to assess the multiple dimensions of classroom climate, such as support, accessibility, and inclusion (α = .84).

Content subscale

Course content is more engaging for students when it is relevant to their lives, builds upon the students’ existing knowledge, is developmentally appropriate, and is in line with the learning objectives (Howard, 2001; Lave & Wenger, 1991). The content subscale included four items and was designed to assess the extent to which students perceived that instructors selected relevant, engaging, developmentally appropriate content (α = .76).

Practice subscale

Teaching practices that positively influence learning and motivation are accessible, well-organized (Wentzel & Brophy, 2014), and aligned with the active learning research (Alexander & Winne, 2006). Effective teachers communicate clear and high expectations for their students, and explain why the course goals are relevant to the students’ lives. The practice subscale included three items about evidence-based teaching practices, such as active learning and scaffolding (α = .73).

Assessment subscale

Assessments (e.g., assignments, tests, projects) encourage student engagement when they are aligned with the course objectives, clearly explained, and do not cause unnecessary stress (Wass et al., 2018).

The assessment subscale consisted of three items that were included to measure the extent to which students perceived that the graded elements were clear and fair (α = .80).
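The α values reported for each subscale are Cronbach's alpha coefficients. As a minimal sketch of how such a coefficient is computed, the following uses a small set of hypothetical responses (not study data); the function name `cronbach_alpha` and the data layout (one list of scores per item) are assumptions made for illustration.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a subscale.

    `items` holds one list of scores per item; position i in every list
    must belong to the same respondent.
    """
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Sum of the sample variances of the individual items.
    item_variance_sum = sum(variance(scores) for scores in items)
    # Sample variance of each respondent's total subscale score.
    totals = [sum(responses) for responses in zip(*items)]
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


# Hypothetical responses from five students to a three-item subscale
# on the 1-6 agreement scale (illustrative only, not MSECT data).
demo_items = [
    [5, 6, 4, 3, 5],
    [5, 5, 4, 2, 6],
    [6, 6, 3, 3, 5],
]
alpha = cronbach_alpha(demo_items)
print(round(alpha, 2))
```

Alpha rises as the items covary more strongly relative to their individual variances, which is why it is read as an index of internal consistency.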

The final questions of the survey address demographics such as gender, race, and year in school. Gender and race were measured with multiple choice questions in which students could check all responses that applied. To identify their gender, students could check multiple options from male, female, and other. Similarly, students identified their race by selecting among multiple options (e.g., American Indian, White).

Procedure

In Fall 2017 and Spring 2018, we recruited instructors teaching undergraduate courses with at least 30 students. Instructors were informed that their participation would not impact their employment at the University. In addition, they were informed that the evaluation results would be shared only with the research team and the instructors themselves.

Instructors were asked to share an online survey link with their students during class after students had received feedback on at least one major graded assignment (approximately 8 weeks into a 16-week semester). In an announcement before the students took the online survey, students were encouraged to participate because it would inform the instructor’s teaching awareness and practice. Students were informed that their responses would be confidential and would only be reported to the instructor as anonymized aggregates.

Sample

We collected MSE data from 1,350 undergraduate students in the 2017 to 2018 academic school year ($\bar{x}_{\text{age}} = 20.73$; 56.74% women, 39.56% men, 1.26% other; 44.67% White, 15.26% Asian or Pacific Islander, 12.15% Black or African American, 5.41% Latinx, 8.22% Multiracial, 1.48% Other, 3.85% prefer not to respond) at a large research university in the Mid-Atlantic (see Table 1 for more demographic information). Data were obtained from 27 courses during the Fall and Spring semesters.

Table 1.

Student Demographic Information.

Variable	Group	Frequency	Sample (%)
Term	Fall	419	31.0
Term	Spring	931	69.0
Student class year	First year	138	10.2
	Sophomore	256	19.0
	Junior	431	31.9
	Senior	482	35.7
	Other	16	1.2
Instructor gender	Woman	483	35.8
Instructor gender	Man	867	64.2

Student responses were deleted from the final data set if they failed to identify their course or their instructor, or if they completed the MSE for a course or instructor that was not participating in the study. In addition, we only included student responses for the first semester in which the instructor participated in the study.

Data Analysis

To begin, we randomly split our sample in SPSS so that we could perform exploratory factor analysis (EFA) on one half (n = 652), and confirmatory factor analysis (CFA) on the other (n = 698). There were no significant differences across groups. The distribution of our items contained some negative skew, and so we used a robust maximum likelihood estimator (MLE) in MPlus 8.0 to analyze our data. In total, 81.55% of cases had complete data. All analyses addressed missing data using Full Information Maximum Likelihood (Peugh & Enders, 2004).
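The random split into an exploratory half and a confirmatory half was performed in SPSS. As a rough sketch of the same idea, the following shuffles case identifiers with a fixed seed and cuts the list in two. The function name `split_half`, the seed, and the use of integer case IDs are assumptions for illustration. Note that this sketch produces exactly equal halves, whereas the halves reported above (652 and 698) were not exactly equal.

```python
import random


def split_half(case_ids, seed=2018):
    """Shuffle case IDs reproducibly and return two non-overlapping halves."""
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    shuffled = list(case_ids)
    rng.shuffle(shuffled)
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]


# Hypothetical case IDs 0..1349 standing in for the 1,350 student responses.
efa_ids, cfa_ids = split_half(range(1350))
print(len(efa_ids), len(cfa_ids))  # 675 675
```

Keeping the two halves disjoint is what allows the second half to serve as an independent check on the structure selected in the first.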

First, we ran an EFA model with the first half of the sample that generated solutions with one to five factors to address research question 1. Our goal was to select a solution that retains approximately the correct number of factors. We did this by assessing the model fit indices, model comparison statistics, eigenvalues, and factor loadings generated by MPlus (Henson & Roberts, 2006). Our final selection was also driven by theory and the need for the scale structure to be interpretable by practitioners who plan to use it (Preacher et al., 2013).

Next, to address research question 2, we performed CFA on the second half of the sample to assess the model fit of the selected four-factor solution. We used a sandwich adjustment of the standard errors to account for the complex nested structure of the data, because students’ responses are nested within instructors (Muthén & Satorra, 1995). We did not use multilevel modeling because we do not have Level-2 variables other than a grouping variable, and the nested structure of the data is not central to our research questions (McNeish et al., 2017).
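One common way to judge whether clustering is large enough to matter is the design effect, 1 + (average cluster size − 1) × ICC. The sketch below is illustrative only: it pairs three item ICCs from Table 2 with the average cluster size of 23.93 reported for the CFA model, and the threshold of about 2 is a commonly cited rule of thumb rather than a criterion stated in this study. The function name `design_effect` is an assumption.

```python
def design_effect(icc, avg_cluster_size):
    """Approximate variance inflation caused by clustering."""
    return 1 + (avg_cluster_size - 1) * icc


# Average cluster size reported for the CFA model; ICCs taken from Table 2.
avg_cluster_size = 23.93
item_iccs = {"Climate 1": 0.18, "Content 1": 0.08, "Practice 1": 0.25}

for item, icc in item_iccs.items():
    # Values well above ~2 (rule of thumb, assumed here) suggest that
    # standard errors ignoring the nesting would be too small.
    print(item, round(design_effect(icc, avg_cluster_size), 2))
```

Under these assumptions even the smallest of the three ICCs yields a design effect above 2, which is consistent with adjusting the standard errors for nesting.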

Finally, to address research question 3, we assessed the scale for the gender bias commonly found in other SET measures. To do this, we assessed the form invariance of our final model by instructor gender. Specifically, we ran a model that allowed the structure of the model to vary so that separate models were calculated for men and women (there were no nonbinary instructors in the sample). This procedure allowed us to assess the extent to which the underlying structure of the data we collected differed by gender, and did not assume that men and women would have equal scores if there is no bias.

To evaluate the goodness of fit for our model, we considered factor loadings, factor covariances, and several commonly used model fit indices: the chi-square test, the root mean square error of approximation (RMSEA), the comparative fit index (CFI), the Tucker–Lewis index (TLI), and the standardized root mean square residual (SRMR). Our criteria for good fit are a nonstatistically significant chi-square, RMSEA values below .06, SRMR values below .08, and CFI and TLI values greater than .95 (Hu & Bentler, 1999; Thompson, 2004). We also used two comparative measures of fit, the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), to determine which model was superior, with smaller values indicating better fit. Finally, we calculated the construct factor replicability, as measured by the H index with a criterion of H > .80 (Hammer, 2016; Hancock & Mueller, 2001), for each factor to determine how well a group of items represents a latent construct.
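The H index can be computed directly from a factor's standardized loadings as H = S / (1 + S), where S is the sum of λ² / (1 − λ²) across the factor's items. The sketch below applies that formula to the standardized CFA loadings listed in Table 5; the function name `coefficient_h` and the dictionary layout are assumptions for illustration.

```python
def coefficient_h(loadings):
    """Coefficient H (construct replicability) from standardized loadings."""
    if any(abs(loading) >= 1 for loading in loadings):
        raise ValueError("standardized loadings must lie strictly between -1 and 1")
    # Each item contributes its explained-to-unexplained variance ratio.
    ratio_sum = sum(loading**2 / (1 - loading**2) for loading in loadings)
    return ratio_sum / (1 + ratio_sum)


# Standardized CFA loadings as listed in Table 5.
table5_loadings = {
    "climate": [0.83, 0.77, 0.79],
    "content": [0.51, 0.80, 0.81, 0.47],
    "practice": [0.50, 0.84, 0.80],
    "assessment": [0.78, 0.68, 0.80],
}

for factor, loadings in table5_loadings.items():
    print(factor, round(coefficient_h(loadings), 2))
# climate 0.84, content 0.81, practice 0.82, assessment 0.81
```

Because each item's contribution grows sharply as its loading approaches 1, a factor with a few strong indicators can reach H > .80 even when other indicators load only moderately.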

Results

We assessed the normality of the data and the extent to which there were outliers. Because there was some negative skew we used a robust estimator in our models. We calculated mean values, standard deviations, and correlations of the scale items for the full sample (see Table 2).

Table 2.

Summary of Correlations, Mean Values, Standard Deviations, and Intraclass Correlations for Survey Items.

Item	1	2	3	4	5	6	7	8	9	10	11	12	13	M	SD	ICC
1. Climate 1	—													4.97	1.01	.18
2. Climate 2	.63**	—												4.96	0.99	.12
3. Climate 3	.64**	.66**	—											5.27	0.81	.09
4. Content 1	.33**	.27**	.30**	—										4.95	1.12	.08
5. Content 2	.65**	.53**	.52**	.44**	—									4.61	1.22	.13
6. Content 3	.60**	.56**	.55**	.38**	.66**	—								4.84	1.18	.18
7. Content 4	.33**	.32**	.29**	.40**	.41**	.38**	—							4.40	1.34	.09
8. Practice 1	.34**	.25**	.27**	.24**	.37**	.33**	.22**	—						4.56	1.42	.25
9. Practice 2	.64**	.57**	.55**	.35**	.66**	.62**	.41**	.37**	—					4.65	1.16	.20
10. Practice 3	.61**	.51**	.47**	.39**	.64**	.59**	.33**	.41**	.70**	—				4.59	1.19	.16
11. Assessment 1	.48**	.41**	.43**	.31**	.42**	.51**	.32**	.31**	.52**	.45**	—			4.70	1.25	.21
12. Assessment 2	.41**	.39**	.39**	.23**	.36**	.44**	.25**	.27**	.43**	.45**	.57**	—		4.74	1.13	.20
13. Assessment 3	.47**	.43**	.41**	.29**	.44**	.56**	.33**	.32**	.56**	.51**	.62**	.53**	—	4.77	1.18	.21

Note. M = mean; SD = standard deviation; ICC = intraclass correlation coefficient.

**p < .01.

Research Question 1: What Underlying Factor Structure Approximately Fits the Data?

We conducted an EFA with half of the sample to answer this question. Using a geomin oblique rotation, we calculated solutions based on the correlation matrix that extracted one to five factors, and inspected the eigenvalues, model fit indices, and factor loadings before selecting a final model.

In the unrotated solution, there were two factors with eigenvalues greater than one. The rotated model fit indices are presented in Table 3. We found that the model fit indices stabilized at strong fit with four factors. The five-factor model did not converge. Therefore, we closely inspected the two-factor solution because of the eigenvalues, and the four-factor solution because of model fit indices. The loadings above .4 for the two- and four-factor solutions are presented in Table 4. In the two-factor solution, the factors were weakly correlated (r = .06, p > .05). In the four-factor solution, the rotated factor correlations ranged from .51 to .67, and all were statistically significant (p < .05).

Table 3.

Exploratory Factor Analysis Model Fit Indices.

Model	No. of parameters	AIC	BIC	RMSEA	χ² (df)	CFI/TLI	SRMR
Factor 1	39	21,665.19	21,839.49	.08	352.31 (65)	.90/.87	.05
Factor 2	51	21,508.55	21,736.48	.08	280.78 (53)	.92/.88	.04
Factor 3	62	21,312.14	21,590.23	.05	114.05 (42)	.97/.95	.02
Factor 4	72	21,265.71	21,587.50	.05	79.14 (32)	.98/.96	.02

Note. AIC = Akaike Information Criterion; BIC = Bayesian Information Criterion; RMSEA = Root Mean Square Error of Approximation; CFI = Comparative Fit Index; TLI = Tucker Lewis Index; SRMR = Standardized Root Mean Square Residual.
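The degrees of freedom and RMSEA values in Table 3 can be approximated from standard formulas. For an EFA with p items and m factors, df = ((p − m)² − (p + m)) / 2, and a common RMSEA point estimate is the square root of max(χ² − df, 0) / (df × (N − 1)). The sketch below is an approximation under stated assumptions: it uses N = 652 (the EFA half) and the unadjusted formula, whereas software may divide by N instead of N − 1 or apply a robust scaling correction. The function names are hypothetical.

```python
from math import sqrt


def efa_degrees_of_freedom(n_items, n_factors):
    """Degrees of freedom for an EFA model fit to a covariance structure."""
    return ((n_items - n_factors) ** 2 - (n_items + n_factors)) // 2


def rmsea(chi_square, df, n):
    """RMSEA point estimate (unadjusted formula using N - 1)."""
    return sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


# Chi-square values from Table 3, keyed by the number of factors extracted.
table3_chi_square = {1: 352.31, 2: 280.78, 3: 114.05, 4: 79.14}

for n_factors, chi_square in table3_chi_square.items():
    df = efa_degrees_of_freedom(13, n_factors)
    print(n_factors, df, round(rmsea(chi_square, df, 652), 2))
# 1 65 0.08
# 2 53 0.08
# 3 42 0.05
# 4 32 0.05
```

The drop in RMSEA between the two- and three-factor solutions reflects a chi-square that falls much faster than the degrees of freedom, which is the pattern the model comparison relies on.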

Table 4.

Exploratory Factor Analysis Factor Loading for Two- and Four-Factor Solutions.

Item	Two-factor		Four-factor
Item	Factor 1	Factor 2	Factor 1	Factor 2	Factor 3	Factor 4
Cl1. My instructor creates an in-person classroom that is supportive for learning.	.80		.62
Cl2. My instructor makes the class accessible to students with many different needs.	.68		.81
Cl3. My instructor creates an inclusive learning environment where everyone is welcome.	.67		.82
Co1. The content in this course is relevant to me academically, personally, and/or professionally.	.51			.61
Co2. The instructor helps make the content of this course interesting.	.81		.53	.53
Co3. The instructor communicated clear learning outcomes for this course.	.78		.40	.44
Co4. I have the prior knowledge necessary to be successful in this course.	.54			.65
P1. During in-person classes, this course includes in-class activities other than lecture.	.44
P2. My instructor helps me understand new content by connecting it to things I already understand.	.84		.41
P3. My instructor motivates me to put effort into the course.	.80					.96
A1. The assessments (e.g., quizzes, exams, papers) in this course are graded fairly.	.64				.85
A2. My instructor provides me with timely feedback on my work.	.60				.56
A3. The expectations for the assignments are clear.	.70				.55

Note. Only loadings above .4 are presented. Cl = climate; Co = content; P = practice; A = assessment.

Upon inspection of the two-factor and four-factor solutions, we found that the two-factor solution was largely a single-factor solution. That is, there were no items with factor loadings above .4 on the second factor. In contrast, in the four-factor solution, the subscales were somewhat represented in the structure. The climate items were loaded onto Factor 1, the content items were loaded onto Factor 2 (with cross loadings on Factor 1), and the assessment items were loaded onto Factor 3. One practice item was strongly loaded onto Factor 4, though the other two items did not meet the threshold.

Our goal was to select a number of factors that approximated the verisimilitude of the scale structure (Preacher et al., 2013). To do this, it is important to recognize that we approached the EFA solutions within a frame of prior research and theory—that is, we were not agnostic to the result. Instead, we were looking for a factor structure with strong quantitative standing, as well as one that is interpretable within the field. Although the eigenvalues indicated a two-factor solution may approximate the data well, the four-factor solution had stronger model fit indices, and broke the scale along its intended subscales. Therefore, we selected the four-factor model because it would allow instructors using the instrument to understand four dimensions of their teaching—climate, content, practice, and assessment—as opposed to a unidimensional model of teaching. We assessed the strength of this choice in our second research question.

Research Question 2: In Alignment With Our Theoretical Frame, to What Extent Do Latent Climate, Content, Practice, and Assessment Factors Fit the Underlying Structure of the Data?

To address this research question, we ran CFA on the second half of the sample to confirm the factor structure we chose in research question 1. First, we calculated the intraclass correlations and found that they were high enough to warrant a sandwich adjustment of the standard errors (see Table 2). We ran a CFA for a model with four factors, with each item loading onto its corresponding factor only (e.g., the climate items loading onto the climate factor). The model converged normally. There were 29 clusters and the average cluster size was 23.93. The fit indices indicated strong model fit, n = 694; χ²(59) = 157.84, p < .01; AIC = 23,369.40; BIC = 23,573.81; RMSEA = .05; CFI = .97; TLI = .96; SRMR = .03. The chi-square value was statistically significant; however, that is typical with large samples even in the case of good fit. All factor loadings met our .4 cut point (see Table 5).
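The fit criteria stated in the Data Analysis section can be applied mechanically to the reported CFA values. The following is a minimal sketch: the function name `evaluate_fit` is hypothetical, the .05 significance level for the chi-square test is an assumption, and because the chi-square result is reported only as p < .01, the value 0.01 is used as an upper-bound stand-in.

```python
def evaluate_fit(rmsea, srmr, cfi, tli, chi_square_p, alpha=0.05):
    """Check each index against the cutoffs described in Data Analysis."""
    return {
        "chi_square_nonsignificant": chi_square_p >= alpha,  # alpha is assumed
        "rmsea_below_.06": rmsea < 0.06,
        "srmr_below_.08": srmr < 0.08,
        "cfi_above_.95": cfi > 0.95,
        "tli_above_.95": tli > 0.95,
    }


# Reported CFA fit values; chi_square_p=0.01 is a stand-in for "p < .01".
cfa_fit = evaluate_fit(rmsea=0.05, srmr=0.03, cfi=0.97, tli=0.96, chi_square_p=0.01)
for criterion, met in cfa_fit.items():
    print(criterion, met)
```

With these inputs every criterion is met except the nonsignificant chi-square, matching the interpretation given in the text.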

Table 5.

CFA Model Standardized Factor Loadings and Factor Replicability Scores (H).

Item	CFA factor loadings (SE)
Item	Climate factor H = .84	Content factor H = .81	Practice factor H = .82	Assess factor H = .81
Cl1. My instructor creates an in-person classroom that is supportive for learning.	.83 (.03)
Cl2. My instructor makes the class accessible to students with many different needs.	.77 (.03)
Cl3. My instructor creates an inclusive learning environment where everyone is welcome.	.79 (.03)
Co1. The content in this course is relevant to me academically, personally, and/or professionally.		.51 (.06)
Co2. The instructor helps make the content of this course interesting.		.80 (.02)
Co3. The instructor communicated clear learning outcomes for this course.		.81 (.02)
Co4. I have the prior knowledge necessary to be successful in this course.		.47 (.05)
P1. During in-person classes, this course includes in-class activities other than lecture.			.50 (.05)
P2. My instructor helps me understand new content by connecting it to things I already understand.			.84 (.02)
P3. My instructor motivates me to put effort into the course.			.80 (.04)
A1. The assessments (e.g., quizzes, exams, papers) in this course are graded fairly.				.78 (.04)
A2. My instructor provides me with timely feedback on my work.				.68 (.04)
A3. The expectations for the assignments are clear.				.80 (.04)

Note. CFA = confirmatory factor analysis; H = factor replicability scores; Cl = climate; Co = content; P = practice; A = assessment.

Because we were interested in assessing related but separate dimensions of teaching we also examined the factor covariance matrix to assess evidence of the model’s validity. The four-factor model’s factor covariance matrix is provided in Table 6, and shows that the content factor and climate factor (r = .88, p < .01) and the content factor and practice factor (r = .91, p < .01) were very strongly positively related to one another. Factor replicability was acceptable across all factors (i.e., >.8).

Table 6.

Standardized Factor Covariance Matrix for the Four-Factor Model.

	Standardized factor covariances (SE)
	F1	F2	F3	F4
F1 Climate	1
F2 Content	0.89 (.02)	1
F3 Practice	0.81 (.04)	0.91 (.03)	1
F4 Assessment	0.65 (.06)	0.72 (.05)	0.76 (.05)	1

Research Question 3: To What Extent Is the Model’s Factor Structure Invariant Across Instructor Gender?

We assessed equal-form invariance to determine the extent to which the model structure was invariant across instructor gender (Kline, 2011). The data were divided into two independent groups: men (n = 761) and women instructors (n = 455). We specified the same model across these two groups, and allowed all parameters to be freely estimated. The model loaded each of the 13 items onto its corresponding factor (climate, content, practice, and assessment), and allowed all of the factors to covary. Although the model converged normally, there was an error with identification because the number of parameters exceeded the number of clusters. Therefore, we reran the model without nesting students into courses, which resolved the problem with identification. This model fit the data well: n = 1,216; χ²(136) = 360.11, p < .01; AIC = 40,491.15; BIC = 40,858.59; RMSEA = .05; CFI = .95; TLI = .94; SRMR = .05. The loadings were similar across the gender groups (see Table 7), providing evidence of model invariance. We calculated H values for all of the factors by gender, and found that some H values were slightly worse for women, though still marginally acceptable (>.7; see Table 7).
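The reported fit statistics can be cross-checked with simple arithmetic. The sketch below is a minimal illustration, assuming that AIC and BIC were computed from the same log-likelihood with the standard penalties (2k and k ln n, with n the total sample size); the function name is hypothetical and the calculation is not part of the original analysis. Under those assumptions, the gap between BIC and AIC recovers the number of free parameters k.

```python
import math


def implied_free_parameters(aic, bic, n):
    """Solve BIC - AIC = k * (ln(n) - 2) for k.

    Assumes AIC = -2LL + 2k and BIC = -2LL + k * ln(n) with a shared
    log-likelihood, which is the standard definition of both indices.
    """
    return (bic - aic) / (math.log(n) - 2.0)


# Group sizes and information criteria as reported in the text.
n_men, n_women = 761, 455
n_total = n_men + n_women

k = implied_free_parameters(aic=40491.15, bic=40858.59, n=n_total)

print(f"total n = {n_total}")
print(f"implied number of free parameters = {k:.1f}")
```

The group sizes sum to the reported n of 1,216. A parameter count of this size would exceed the number of clusters whenever only a few dozen courses are sampled, which is in line with the identification warning described above.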

Table 7.

Factor Loadings by Gender.

	Climate (SE)		Content (SE)		Practice (SE)		Assessment (SE)
	Men H = .85	Women H = .84	Men H = .84	Women H = .79	Men H = .84	Women H = .82	Men H = .81	Women H = .77
Cl1	.85 (.02)	.87 (.02)
Cl2	.79 (.03)	.70 (.04)
Cl3	.79 (.03)	.74 (.03)
Co1			.52 (.06)	.43 (.06)
Co2			.85 (.02)	.81 (.04)
Co3			.81 (.02)	.78 (.03)
Co4			.51 (.04)	.43 (.03)
P1					.49 (.05)	.45 (.06)
P2					.86 (.02)	.84 (.02)
P3					.81 (.02)	.80 (.04)
A1							.81 (.05)	.71 (.04)
A2							.70 (.06)	.66 (.05)
A3							.79 (.04)	.78 (.03)

Note. H = factor replicability scores; Cl = climate; Co = content; P = practice; A = assessment.
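The H values in Table 7 can be approximately reproduced from the standardized loadings. The sketch below assumes the usual closed form of the Hancock and Mueller (2001) index, H = S / (1 + S) with S equal to the sum over items of λ² / (1 − λ²). It uses the two-decimal loadings printed in Table 7, so small rounding differences from the reported values are expected. The function and variable names are hypothetical, and this is not the calculator the authors used.

```python
def coefficient_h(loadings):
    """Coefficient H (construct replicability) from standardized loadings."""
    ratio_sum = sum(l ** 2 / (1.0 - l ** 2) for l in loadings)
    return ratio_sum / (1.0 + ratio_sum)


# Standardized loadings copied from Table 7, keyed by (factor, group).
TABLE_7_LOADINGS = {
    ("Climate", "Men"): [0.85, 0.79, 0.79],
    ("Climate", "Women"): [0.87, 0.70, 0.74],
    ("Content", "Men"): [0.52, 0.85, 0.81, 0.51],
    ("Content", "Women"): [0.43, 0.81, 0.78, 0.43],
    ("Practice", "Men"): [0.49, 0.86, 0.81],
    ("Practice", "Women"): [0.45, 0.84, 0.80],
    ("Assessment", "Men"): [0.81, 0.70, 0.79],
    ("Assessment", "Women"): [0.71, 0.66, 0.78],
}

# H values as reported in the header row of Table 7.
REPORTED_H = {
    ("Climate", "Men"): 0.85, ("Climate", "Women"): 0.84,
    ("Content", "Men"): 0.84, ("Content", "Women"): 0.79,
    ("Practice", "Men"): 0.84, ("Practice", "Women"): 0.82,
    ("Assessment", "Men"): 0.81, ("Assessment", "Women"): 0.77,
}

computed_h = {key: coefficient_h(vals) for key, vals in TABLE_7_LOADINGS.items()}

for key, h in computed_h.items():
    factor, group = key
    print(f"{factor:<10} {group:<5} computed H = {h:.3f}  reported H = {REPORTED_H[key]:.2f}")
```

Because each item contributes λ² / (1 − λ²), a single strong loading raises H sharply, while weaker items such as Co1, Co4, and P1 add comparatively little.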

Discussion

MSEs can provide formative feedback to instructors on their teaching. However, there are currently no validated scales for MSEs of college teaching. To address this gap, we designed and implemented a short multidimensional tool, the MSECT. We gathered data across two semesters, and split the sample into two random halves. We performed EFA on the first half, and found evidence that a two-factor or four-factor solution would fit the data well. Based on model fit indices, theory, and the interpretability of the solutions, we selected a four-factor model. Next, we conducted CFA with the second half and found the four-factor model fit the data well. Finally, prior literature indicates that many end-of-semester course evaluation tools are biased, and work differently for men and women instructors (Spooren et al., 2017). In light of those findings, we assessed model fit for both men and women instructors, and found the model fit well and similarly for both groups.

The content of the MSECT was chosen based on decades of education theory and empirical research, underscoring the importance of classroom climate, course content, teaching practices, and assessment strategies in promoting student engagement and motivation (e.g., Atkinson et al., 2013; Corkin et al., 2014; Frymier & Shulman, 1995). Furthermore, all items were designed to be actionable, and the scale was written to be short enough to administer during class without much disruption. Our findings provide evidence that instructors could use the scale to quickly assess four dimensions of their teaching, which previously would have required multiple long-form assessments or observations that are difficult to conduct at a large scale.

Researchers have documented bias in SETs, and it is clear that the field needs new, psychometrically assessed measures (Spooren et al., 2017). Biased teaching evaluations affect high-stakes decisions such as hiring, promotion, and tenure selections. When universities rely on measures that do not provide instructors with clear, actionable feedback, instructors are less likely to thrive in their teaching and support their students’ achievement. Furthermore, biased measures that favor instructors who are men may reinforce gender gaps in higher education. This study offers a potential pathway that may address both of these problems: a short, actionable survey that provides instructors with the formative feedback they need to grow and that is designed to function with invariance across groups.

There are limitations to this study. First, we were unable to randomly select the classrooms, courses, course delivery method (i.e., online vs. in-person), or instructors. Instructors who volunteered to participate may have been more comfortable with their teaching and more open to constructive feedback than the typical college instructor. Second, these results only analyze the quantitative findings from the scales, and do not assess the qualitative comments students made. These comments are essential to understanding the feedback that students provide, but it was beyond the current analysis to include them. Third, these analyses only rely on the reports of students, and do not include any information from instructors about their own teaching. Fourth, we did not conduct cognitive interviews with students in our initial development stages because of resource constraints. We see great value in conducting cognitive interviews with students to further determine the validity of the items and instrument. Future research can extend our work with assessments of criterion validity using preexisting scales and qualitative reflections on the scales from both students and instructors.

It is also important to note that we did not assess the MSECT for invariance by instructor race. Although we believe it is vital to assess instruments by all dimensions of documented bias (e.g., gender, race, conventional attractiveness), we did not have access to the races of the instructors who participated. We believe future research should purposefully recruit samples of instructors with balanced racial distributions and conduct these analyses.

Notably, we did not assess the MSECT by testing its relationship to university-wide or homemade end-of-semester SET scales (e.g., Hampton & Reiser, 2004), and instead assessed the underlying structure of the scales. Although future research should conduct assessments of criterion validity, we believe that the criterion needs to be well-designed and psychometrically validated, which most end-of-semester course evaluations are not.

In sum, we find evidence that the MSECT is a robust measure of college teaching that can provide formative feedback to instructors. With that feedback, instructors can reflect upon and improve their teaching, which should lead to better learning experiences for students, and increased student motivation, engagement, and learning.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Alice E. Donlan

References

Alcott (2017). Does teacher encouragement influence students’ educational progress? A propensity-score matching analysis. Research in Higher Education, 58(7), 773–804. https://doi.org/10.1007/s11162-017-9446-2

Alderman, M. K. (2013). Motivation for achievement: Possibilities for teaching and learning. Lawrence Erlbaum.

Alexander, P. A., & Winne, P. H. (2006). Handbook of educational psychology. Lawrence Erlbaum.

Ambrose, Bridges, DiPietro, Lovett, & Norman (2010). How learning works: 7 research-based principles for smart teaching. Jossey-Bass.

Apodaca & Grad (2005). The dimensionality of student ratings of teaching: Integration of uni- and multidimensional models. Studies in Higher Education, 30(6), 723–748. https://doi.org/10.1080/03075070500340101

Atkinson & Siew Leng (2013). Improving assessment processes in higher education: Student and teacher perceptions of the effectiveness of a rubric embedded in a LMS. Australasian Journal of Educational Technology, 29(5), 651–666.

Bain (2012). What the best college students do. The Belknap Press of Harvard University Press.

Chickering, A. W., & Gamson, Z. F. (1987). Seven principles for good practice in undergraduate education. AAHE Bulletin, 39(7), 3–7.

Cohen, P. A. (1980). Effectiveness of student-rating feedback for improving college instruction: A meta-analysis of findings. Research in Higher Education, 13(4), 321–341. https://doi.org/10.1007/BF00976252

Corkin, D. M., Horn, & Pattison (2017). The effects of an active learning intervention in biology on college students’ classroom motivational climate perceptions, motivation, and achievement. Educational Psychology, 37(9), 1106–1124. https://doi.org/10.1080/01443410.2017.1324128

Corkin, D. M., Yu, S. L., Wolters, C. A., & Wiesner (2014). The role of college classroom climate on academic procrastination. Learning and Individual Differences, 32, 294–303.

Costello, M. L., Weldon, & Brunner (2002). Reaction cards as a formative evaluation tool: Students’ perceptions of how their use impacted classes. Assessment and Evaluation in Higher Education, 27(1), 23–33. https://doi.org/10.1080/02602930120105036

Donlan, A. E., Loughlin, S. M., & Byrne, V. L. (2019). The Fearless Teaching Framework: A model to synthesize foundational education research for university instructors. To Improve the Academy: A Journal of Educational Development, 38(1), 33–49. https://doi.org/10.1002/tia2.20087

Frymier, A. B., & Shulman, G. M. (1995). “What’s in it for me?” Increasing content relevance to enhance students’ motivation. Communication Education, 44(1), 40–50.

Hammer (2016). Construct replicability calculator: A Microsoft Excel-based tool to calculate the Hancock and Mueller (2001) H index. http://DrJosephHammer.com/

Hammonds, Mariano, G. J., Ammons, & Chambers (2017). Student evaluations of teaching: Improving teaching quality in higher education. Perspectives: Policy and Practice in Higher Education, 21(1), 26–33. https://doi.org/10.1080/13603108.2016.1227388

Hampton, S. E., & Reiser, R. A. (2004). Effects of a theory-based feedback and consultation process on instruction and learning in college classrooms. Research in Higher Education, 45(5), 497–527.

Hancock, G. R., & Mueller, R. O. (2001). Rethinking construct reliability within latent variable systems. In R. Cudeck, S. du Toit, & D. Sörbom (Eds.), Structural equation modeling: Present and future (pp. 195–216). Chicago: Scientific Software International.

Hattie & Timperley (2007). The power of feedback. Review of Educational Research, 77(1), 81–112. https://doi.org/10.3102/003465430298487

Henson, R. K., & Roberts, J. K. (2006). Use of exploratory factor analysis in published research: Common errors and some comment on improved practice. Educational and Psychological Measurement, 66(3), 393–416.

Howard, T. C. (2001). Telling their side of the story: African-American students’ perceptions of culturally relevant teaching. The Urban Review: Issues and Ideas in Public Education, 33(2), 131–149. https://doi.org/10.1023/A:1010393224120

Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55.

Kember, Leung, D. Y. P., & Kwan, K. P. (2002). Does the use of student feedback questionnaires improve the overall quality of teaching? Assessment & Evaluation in Higher Education, 27(5), 411–425.

Kline, R. B. (2011). Principles and practice of structural equation modeling. Guilford Press.

Knol, M. H., Dolan, C. V., Mellenbergh, G. J., & van der Maas, H. L. J. (2016). Measuring the quality of university lectures: Development and validation of the Instructional Skills Questionnaire (ISQ). PLoS ONE, 11(2), 1–21.

Lave & Wenger (1991). Situated learning: Legitimate peripheral participation. Cambridge University Press.

Marsh, H. W. (1982). SEEQ: A reliable, valid, and useful instrument for collecting students’ evaluations of university teaching. British Journal of Educational Psychology, 52(1), 77–95.

Marsh, H. W. (1987). Students’ evaluations of university teaching: Research findings, methodological issues, and directions for future research. International Journal of Educational Research, 11(3), 253–388. https://doi.org/10.1016/0883-0355(87)90001-2

Marsh, H. W., & Roche, L. A. (2000). Effects of grading leniency and low workload on students’ evaluations of teaching: Popular myth, bias, validity, or innocent bystanders? Journal of Educational Psychology, 92(1), 202–228. https://doi.org/10.1037/0022-0663.92.1.202

McNeish, Stapleton, L. M., & Silverman, R. D. (2017). On the unnecessary ubiquity of hierarchical linear modeling. Psychological Methods, 22(1), 114–140.

Morin, A. J. S., Marsh, H. W., Nagengast, & Scalas, L. F. (2014). Doubly latent multilevel analyses of classroom climate: An illustration. The Journal of Experimental Education, 82(2), 143–167. https://doi.org/10.1080/00220973.2013.769412

Murphy, P. K., & Alexander, P. A. (2000). A motivated look at motivational terminology [Special issue]. Contemporary Educational Psychology, 25, 3–53.

Muthén & Satorra (1995). Complex sample data in structural equation modeling. Sociological Methodology, 25, 267–316.

Oon, Benson, & Kam, C. C. S. (2017). Psychometric quality of a student evaluation of teaching survey in higher education. Assessment & Evaluation in Higher Education, 42(5), 788–800. https://doi.org/10.1080/02602938.2016.1193119

Overall, J. U., & Marsh, H. W. (1979). Midterm feedback from students: Its relationship to instructional improvement and students’ cognitive and affective outcomes. Journal of Educational Psychology, 71(6), 856–865. https://doi.org/10.1037/0022-0663.71.6.856

Peugh, J. L., & Enders, C. K. (2004). Missing data in educational research: A review of reporting practices and suggestions for improvement. Review of Educational Research, 74, 525–556.

Preacher, K. J., Zhang, Kim, & Mels (2013). Choosing the optimal number of factors in exploratory factor analysis: A model selection perspective. Multivariate Behavioral Research, 48, 28–56. https://doi.org/10.1080/00273171.2012.710386

Richardson, J. T. E. (2005). Instruments for obtaining student feedback: A review of the literature. Assessment & Evaluation in Higher Education, 30(4), 387–415.

Skinner, Furrer, Marchand, & Kindermann (2008). Engagement and disaffection in the classroom: Part of a larger motivational dynamic? Journal of Educational Psychology, 100(4), 765–781. https://doi.org/10.1037/a0012840

Socha (2013). A hierarchical approach to students’ assessments of instruction. Assessment & Evaluation in Higher Education, 38(1), 94–113. https://doi.org/10.1080/02602938.2011.604713

Spooren (2010). On the credibility of the judge. Studies in Educational Evaluation, 36(4), 121–131. https://doi.org/10.1016/j.stueduc.2011.02.001

Spooren, Brockx, & Mortelmans (2013). On the validity of student evaluation of teaching: The state of the art. Review of Educational Research, 83(4), 598–642. https://doi.org/10.3102/0034654313496870

Spooren, Vandermoere, Vanderstraeten, & Pepermans (2017). Exploring high impact scholarship in research on Student’s Evaluation of Teaching (SET). Educational Research Review, 22, 129–141. https://doi.org/10.1016/j.edurev.2017.09.001

Surgenor, P. W. G. (2013). Obstacles and opportunities: Addressing the growing pains of summative student evaluation of teaching. Assessment & Evaluation in Higher Education, 38(3), 363–376. https://doi.org/10.1080/02602938.2011.635247

Thompson (2004). Exploratory and confirmatory factor analysis: Understanding concepts and applications. American Psychological Association.

Vygotsky, L. S. (1978). Mind in society: The development of higher psychological processes. Harvard University Press.

Wass, Timmermans, Harland, & McLean (2018). Annoyance and frustration: Emotional responses to being assessed in higher education. Active Learning in Higher Education. Advance online publication. https://doi.org/10.1177/1469787418762462

Wentzel, K. R., & Brophy, J. E. (2014). Motivating students to learn. Routledge.

Wigfield & Eccles, J. S. (2000). Expectancy-value theory of achievement motivation. Contemporary Educational Psychology, 25(1), 68–81. https://doi.org/10.1006/ceps.1999.1015

Zabaleta (2007). The use and misuse of student evaluations of teaching. Teaching in Higher Education, 12(1), 55–76.

Zumbrunn, McKim, Buhs, & Hawley, L. R. (2014). Support, belonging, motivation, and engagement in the college classroom: A mixed method study. Instructional Science, 42, 661–684.