Abstract
The online administration of student evaluations has its shortcomings, including low participation, or low response rates, and bias. This study examines nonresponse bias in online student evaluations of instruction, that is, the differences between those students who complete online evaluations and those who decide not to complete them. It builds on the work of Estelami that revealed a response bias based on the timing in which the evaluations were completed, that is, differences in early evaluations versus later evaluations. In contrast, this study examines the demographic variables that have contributed to nonresponse bias in online student evaluations, namely gender, grade point average, and ethnicity. It also examines multiple psychographic variables that may contribute to nonresponse bias: time poverty, complaining behavior, and technology savviness. The study utilized t tests and logistic regression (logit analysis) to analyze the data collected from undergraduate business students. This study found that there are significant differences between those who complete online student evaluations and those who do not. Implications for academic administrators are discussed.
The use of instruments to assess teaching effectiveness and research on student ratings began in the 1920s (Mau & Opengart, 2012). Although it is recommended to use more than one measure of teaching effectiveness in assessing improvement in teaching (Pritchard, Saccucci, & Potter, 2010), student course evaluations in particular have generated a substantial amount of interest in higher education (Donovan, Mader, & Shinsky, 2007). They are generally the primary measure of teaching effectiveness and are used by many institutions for promotion, tenure, and merit pay increase decisions (Avery, Bryant, Mathios, Kang, & Bell, 2006; Mau & Opengart, 2012), and to evaluate faculty members in deciding whether to change their courses or even keep them in the curriculum (Adams & Umbach, 2011). Furthermore, faculty can use evaluation results to improve their instruction and apply for grants and awards (Adams & Umbach, 2011).
Moreover, interest in evaluations has included issues of validity and reliability (Donovan, Mader, & Shinsky, 2007; Pritchard & Potter, 2011) and concerns about the “leniency hypothesis,” which postulates that better evaluations result if expectations of student performance are lowered. Unethical faculty behavior to favorably influence these ratings may result in “a destruction of educational objectives” (Pritchard & Potter, 2011, p. 2).
This study attempts to build on the research conducted by Estelami (2015) that revealed a response bias between students who completed evaluations early and students who completed evaluations later. In contrast, this study explores nonresponse bias in online student evaluations. Low participation, or nonresponse, in online student evaluation of teaching (SET) is a major shortcoming of the technology (Guder & Malliaris, 2013). Nonresponse bias is defined as “error that results from a systematic difference between those who do and those who do not respond to a measurement instrument” (McDaniel & Gates, 2012, p. 156). Nonresponse bias occurs when there are systematic differences in important variables between respondents and nonrespondents (Reio, 2007). Similarly, nonresponse bias “refers to the bias that exists when respondents to a survey are different from those who did not respond in terms of demographic or attitudinal variables” (Sax, Gilmartin, & Bryant, 2003, p. 411). Surveyed populations, especially students, are responding at lower rates than in previous decades, which may have a biasing effect (Sax, Gilmartin, & Bryant, 2003).
Nonresponse bias in survey results is present in a variety of fields of research. Lyness and Kropf (2007) point out that nonresponse bias is evident in cross-national surveys. Cross-national mail surveys without telephone contact are typically coupled with low response rates, which increase the probability of nonresponse bias. Rudig (2010) cautions that researchers administering political activism surveys should be aware that nonresponse bias may exist.
There are multiple examples of the issue of nonresponse bias in health care. Stormark, Heiervang, Heimann, Lundervold, and Gillberg (2008) used binary logistic regression to predict nonresponse bias from teacher ratings of mental health problems in primary school children. The results revealed that children whose teachers identified them as having mental health problems were less likely to participate. Finally, even though web-based surveys are a fast and inexpensive way to conduct research, they have been characterized by lower response rates than traditional surveys (Reio, 2007).
As nonresponse increases, the likelihood that nonrespondents’ opinions differ from respondents’ opinions also increases (Adams & Umbach, 2011). From a research perspective, if those who did not respond to the survey differ systematically from those who responded, then the accuracy of the results may be questionable (McDaniel & Gates, 2012). The issue of nonresponse bias of online student evaluations is a persistent one (Adams & Umbach, 2012; Nevo, McClean, & Nevo, 2010).
This study examines the role of demographics which have been shown to contribute to nonresponse bias in online student evaluations: gender, grade point average (GPA), and ethnicity. The study will also examine particular constructs, including time poverty, complaining behavior, and technology savviness. The data will then be compared to assess the differences between students who have completed online evaluations and those who have opted not to complete them. The results may provide additional insight for academic administrators as they continue to evaluate the accuracy of student online evaluations of instruction.
Literature Review
Student Evaluation of Teaching: A Continuing Issue in Academia
Many studies have examined the issues surrounding SET. In fact, “few issues within academics have been as well researched, documented, and long lasting as the debate about the student evaluation of teaching (SET)” (Clayson, 2009, p. 16). For example, Madden, Dillon, and Leak (2010) explore the halo effect in student evaluation of their professors. Students may base their ratings on an overall impression of their instructor based on grade leniency or any other factor that may bias the ratings. Their results showed that the halo effect is present and it does bias the interpretation of teaching effectiveness, especially when comparing across professors.
Clayson (2009) examined the validity of the process, that is, are the evaluations related to what is learned? The meta-analysis showed that the small relationship between learning and the evaluations is situational and that the more objectively learning is measured, the less likely it is to be related to the evaluations.
Most recently, Braga, Paccagnella, and Pellizzari (2014) compared various measures of instructor effectiveness with student evaluations for those instructors. They found that teaching quality matters and the measure of teaching effectiveness was negatively correlated with student evaluations.
Online Evaluations and Response Rates
Incentives to improve response rates have included earlier registration times for the subsequent term for those completing the surveys and allowing students to complete them earlier in the term. Moreover, incentives include encouragement for completion of the survey in more than one of the students’ courses, withholding early access to grades, and grade enhancement (Avery et al., 2006).
Adams and Umbach (2011) also note that student personality types and majors as well as institutional characteristics influence online survey participation. Students who invest the time to do well academically were more likely to respond. The likelihood also increased for courses in the student’s major. Institutional characteristics include availability of technology, incentives, and communication to students regarding their input. There is mixed evidence concerning which evaluation method, in-class or online, yields higher faculty evaluations. Mau and Opengart (2012) found strong evidence to indicate that faculty receive higher evaluations using an in-class or paper-based assessment. However, Linse (2010) and Avery et al. (2006) reported that even though response rates were lower with online evaluations, average online scores were similar to the average scores using in-class evaluations.
Hypotheses
The following section presents each of the study variables and relevant research. Each variable discussion concludes with an appropriate hypothesis.
Time Poverty
All individuals experience various degrees of time poverty, and students are no exception. The literature has documented that lack of time has been a contributing factor to many academic issues: lack of information literacy among medical students (Baro, Endouware, & Ubogu, 2011), delayed understanding of learning material (Scheja, 2006), difficulty implementing computer-assisted language learning (Park & Son, 2009), a challenge to successfully implement distance education programs (Hamzaee, 2005), and withdrawal from e-learning courses (Packham, Jones, Miller, & Brychan, 2004).
Time poverty has also affected noncurriculum aspects of students’ lives: low usage of fitness and recreation centers on college campuses (Smith, 2011), low participation of female college students in intramural sports (Stoll, 2010), convenience orientation as a motivation for food consumption among residence hall students (Marquis, 2005), and a lack of community involvement (Anonymous, 2001). Based on this review, students with time challenges may find online evaluations problematic to complete. Therefore, the following hypothesis is proposed:
Complaining Behavior
Consumer complaints may be directed toward the service provider, acquaintances (negative word-of-mouth), or third parties (Kim & Lehto, 2011). If the consumer believes that the service provider is going to be more responsive or more likely to address the problem, then the consumer is more likely to communicate with the service provider versus taking private action or complaining to a third party (Meng, Wang, Peters, & Lawson, 2010). Furthermore, consumers with relatively high levels of dissatisfaction are more likely to complain than those with neutral or positive experiences (Kim & Lehto, 2011). Similarly, students’ propensity to respond to an SET may be stronger for those students who negatively view their instructor (Groves, 2006). Furthermore, students who have a propensity to complain may make a concerted effort to complete an online evaluation. Thus, the following hypothesis is offered:
Gender
High or low online ratings have been correlated with gender, cultural background, student performance, and academic status (Adams & Umbach, 2011). For example, females are more likely to respond than males (Avery et al., 2006; Dey, 1997; Fidelman, 2007; Porter & Umbach, 2006; Sax, Gilmartin, Lee, & Hagedorn, 2008). However, since this is a highly context-specific phenomenon, some studies show the opposite effect, that is, that males are likely to be more expressive than females. Thus, the following hypothesis is proposed:
GPA
GPA has been used as a variable in a plethora of education-related studies. More recently, Cheng, Ickes, and Verhofstadt (2012) attempted to clarify how two types of family support, social support and economic support, affect college students’ academic performance. They concluded that family social support is more important to women’s success in college, regardless of the level of family economic support. Curs and Harper (2012) studied the behavioral effects of financial aid on student performance. Their findings support that financial aid fosters academic success, rather than simply assisting students to afford institutions. Shippee and Owens (2011) explore gender differences among GPA, depression, and drinking. GPA negatively affected drinking without consistent gender differences. GPA negatively affected depression, and this was more pronounced with females. Deviney, Mills, Gerlich, and Santander (2011) examined several behavioral factors and how they affected performance for academically talented students. Results showed that three behavioral factors, Analysis of Data, Organized Workplace, and Frequent Change, had mean scores that were significantly different among the three GPA groups. Gupta, Walker, and Swanson (2011) sought to determine whether age, gender, GPA, and work experience were factors that influenced ethical choices made by graduate business students. Their results showed that “graduate business students with high GPAs are just as likely to report an inflated grade as a graduate business student who is struggling academically” (Gupta et al., 2011, p. 148). Fairfield-Sonn, Kolluri, Singamsetti, and Wahab (2010) found that undergraduate GPA and gender were reliable in predicting differences in MBA graduation GPA.
High academic performance, measured by course grade, GPA, and/or SAT score, is positively related to the likelihood of responding (Avery et al., 2006; Fidelman, 2007; Porter & Umbach, 2006). Thus, the following hypothesis is presented:
Race
It is well-documented that race can be a variable in terms of survey participation rates. For example, race, ethnicity, and linguistic isolation were determinants in participation rates in public health surveillance surveys in a study by Link, Mokdad, Stackhouse, and Flowers (2006). Participation rates were lower in counties with higher percentages of Blacks. Rates decreased in counties where a larger percentage of the population spoke only Spanish or another Indo-European language. Furthermore, Hersh and Ansolabehere (2013) used a data set of two million voter registration records and showed that participation rates by gender, race, and age variables were different compared with previous research. Blacks participate at higher rates than Whites and, moreover, the relationship between age and participation is not linear and varies by race and gender. In contrast, participation in a smoking trajectory study revealed that female and White students were more likely to participate than male and non-White students (Diviak, Wahl, O’Keefe, Mermelstein, & Flay, 2006).
Generally, White students respond at higher rates online than students of color (Avery et al., 2006; Clarksberg, Robertson, & Einarson, 2008; Fidelman, 2007; Porter & Umbach, 2006). Even though females and White students are more likely to respond online than males and non-Whites, respectively, Porter and Umbach (2006, p. 233) note that there is “little theoretical discussion as to why this should be the case.” The following hypothesis is presented:
Technology Savviness
Generation Y, or synonymously the Net Generation or Millennials, born between 1977 and 1997, are intelligent, creative, achievement-oriented, and technology-savvy (Maroun, 2012). They seem to be the most comfortable with technology and have a natural fit with e-learning (Tyler, 2008). However, some members of this group are more comfortable with technology than others. One study found that technology-savvy, male students reported a greater preference for increasing the use of classroom technology compared with other demographic groups (DiVall, Hayney, Marsh, Neville, & O’Barr, 2013). Intuitively, those students who consider themselves more technology-savvy would be more likely to complete an online student evaluation. Thus, the following hypothesis is presented:
Research Method
Sampling Method
The convenience sample used in this study consisted of undergraduate students (n = 373) from a medium-sized university of approximately 10,000 students in the Southeastern United States. See Table 1 for a summary of the sample characteristics, based on the most represented response or average. At the time of administration of the instrument, the university had an online SET exclusively (no pencil-and-paper, in-class, or paper-based option) for at least the past three years. The respondents were from a variety of business disciplines. They completed a paper-and-pencil questionnaire in multiple business classes during the main summer session. The response rate was near or at 100% since no one refused to complete the questionnaire during the days of administration and no questionnaires needed to be discarded.
Descriptive Information on the Sample.
Questionnaire
The questionnaire was divided into three sections. The first group of questions included basic demographics, such as gender, age, and so on. For purposes of this student-based study, demographics also included self-reported GPA, academic year, academic status (full-time vs. part-time), and major. There was also a question that asked about online purchases.
The next section consisted of various construct measures using existing scales, beginning with a time poverty scale, which was adopted from a study by Reynolds and Beatty (1999; see Bruner, Hensel, & James, 2005). This scale contains five items on a 7-point Likert-type scale. Next, a six-item, 7-point strongly disagree/strongly agree Likert-type scale measuring complaining behavior was included. Its design was based on Wei (2010), who utilized eight statements from Yuksel, Kilinc, and Yuksel (2006). The six items were split evenly in terms of their direction: three were oriented toward complaining behavior and three were oriented against complaining behavior. The two items omitted for this study were not strongly oriented in either direction. The next scale is derived from one of the four factors in a factor analysis of a set of 28 items from the technology readiness index by Parasuraman (2000; see Bearden, Netemeyer, & Haws, 2011).
The third and final section of the questionnaire asked respondents about online versus in-class SET and was placed at the end of the instrument to avoid any possible bias in the responses for the previous questions. The last question separated respondents into three groups: those who have never completed an online student evaluation, those who almost always or always complete an online student evaluation, and those who complete an online evaluation only if the instructor was very good or very bad. An “other” option was included for those who did not identify with any of the other choices. Most of the relatively small group who chose “other” indicated that they completed the evaluation for extra credit (n = 20).
A pretest was conducted with senior capstone marketing students (n = 26). They were asked to write any comments at the bottom of the first page of the questionnaire and, additionally, were asked to openly discuss any issues after the completed questionnaires were collected. Minor changes were made to two of the demographic questions and the last question, regarding participation in completion of online student evaluations, which was restructured for clarity and modified to include the choice of completing the evaluation only if the instructor was very good or very bad.
The questionnaire was administered during class and instructors and/or those administering the questionnaire were asked not to share the research topic with respondents so that the responses to the questions would not be biased.
Analyses
t-Test Analysis
After the data were collected, responses were divided into three groups: responses from students who never complete online evaluations of instruction, responses from students who always or almost always complete online evaluations of instruction, and responses from students who only complete online evaluations of instruction when the instructor was very good or very bad. Each of the latter two groups, which are more likely to respond online, was then compared with the group that never responds online.
The data were entered and a frequency analysis was run on all variables to detect any inaccuracies in data entry. Inaccuracies were checked using the actual questionnaires. Next, several of the variables were reverse-scored. Table 2 shows the correlation matrix for all independent variables.
Correlation Matrix.
Note. GPA = grade point average. Phi coefficient shown for the female and non-White correlation.
*Correlation is significant at the .05 level (two-tailed). **Correlation is significant at the .01 level (two-tailed).
Then, a reliability analysis (Cronbach’s alpha) was run on the multiple-item scales. The results are shown in Table 3. All scales exhibit acceptable levels of reliability, according to Nunnally and Bernstein (1994), with the exception of the complaining scale.
Reliability Coefficients.
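The reverse-scoring and reliability steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' procedure or software; the function names and the response values are hypothetical, and the 1-to-7 range is assumed from the 7-point Likert-type scales described in the Questionnaire section.

```python
from statistics import variance

SCALE_MIN, SCALE_MAX = 1, 7  # assumed range of the 7-point Likert-type items


def reverse_score(value, scale_min=SCALE_MIN, scale_max=SCALE_MAX):
    """Flip a Likert response so that 1 becomes 7, 2 becomes 6, and so on."""
    return scale_min + scale_max - value


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])                           # number of items in the scale
    items = list(zip(*rows))                   # transpose: one tuple per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


# Hypothetical responses (4 respondents x 3 items), not the study's data.
# The third item is negatively worded, so it is reverse-scored first.
raw = [
    [7, 6, 1],
    [5, 5, 3],
    [2, 3, 6],
    [4, 4, 4],
]
scored = [[a, b, reverse_score(c)] for a, b, c in raw]
alpha = cronbach_alpha(scored)
print(f"Cronbach's alpha = {alpha:.3f}")
```

Skipping the reverse-scoring step for negatively worded items typically depresses alpha, because those items then correlate negatively with the rest of the scale.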
The initial statistical analysis included independent sample t tests for the metric-level variables: time poverty, complaining, GPA, and technology-savvy. Pearson Chi-square analysis was used for the nonmetric-level variables: gender and race.
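As an illustration of the two tests named above, the sketch below computes a pooled-variance independent-samples t statistic and a Pearson chi-square statistic by hand. The pooled-variance form is an assumption, chosen because the degrees of freedom reported in this article equal the combined group size minus two. All data values and function names are hypothetical, and p values are left to a statistics package or a reference table.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(group_a, group_b):
    """Independent-samples t statistic with pooled variance; returns (t, df)."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / df
    std_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / std_error, df


def pearson_chi_square(table):
    """Pearson chi-square for a contingency table (list of rows); returns (chi2, df)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Hypothetical GPA values for two response groups (not the study's data).
never = [2.4, 2.8, 3.2]
always = [3.0, 3.4, 3.8]
t_stat, t_df = pooled_t(always, never)

# Hypothetical 2 x 2 counts: rows = gender, columns = never / always respond.
table = [[20, 10],
         [10, 20]]
chi2, chi_df = pearson_chi_square(table)

print(f"t({t_df}) = {t_stat:.2f}; chi-square({chi_df}) = {chi2:.2f}")
```

Note that the sign of the t statistic depends only on which group is passed first, so a negative value does not by itself indicate an unexpected direction.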
The group that never completes online evaluations was the smallest group (n = 34); however, it is noted that a sample size of “at least 20 can be expected to provide very good results even if the populations are not normal” (Anderson, Sweeney, & Williams, 2009, p. 390). The first t tests were conducted using the group that never completes online evaluations and the group that always completes online evaluations. For Hypothesis 1, regarding time poverty, although the means were in the appropriate direction, the differences were not significant (t[261] = −0.73, p = .47). Hypothesis 2, regarding the predisposition to complain, was not supported (t[258] = −0.42, p = .68). Hypothesis 3, regarding gender, was supported with a Pearson Chi-square (χ²[1, n = 263] = 6.00, p = .01): the response rate for males and females differs. It appears that males are more likely than females to never respond. The mean GPA was higher for those completing online evaluations (Hypothesis 4), and the difference between that group and the group that never completes online evaluations was significant (t[250] = 2.40, p = .02). A dummy variable was created so that White students and students of color could be compared in Hypothesis 5. Although White students responded more than students of color, the difference was not significant (χ²[4, n = 260] = 5.62, p = .23). Finally, for Hypothesis 6, those who always complete online evaluations are more technology-savvy, but the difference was not significant (t[261] = 1.20, p = .23).
The next t-test analyses compared the group that never completes online evaluations with the group that only completes online evaluations if the instructor was very good or very bad (outlier group). For Hypothesis 1, time poverty, the mean differences were not significant (t[82] = −0.58, p = .56). Similarly, the means for Hypothesis 2, the propensity to complain, were almost identical, and thus the difference was also not significant (t[81] = 0.04, p = .96). The difference regarding gender, Hypothesis 3, was also not significant (χ²[1, n = 84] = 0.07, p = .80). GPA was higher for those who complete online evaluations if the instructor was very good or very bad, and for Hypothesis 4 the difference was significant only at the .10 level (t[72] = −1.96, p = .06). The difference between White students and students of color was not significant in terms of the two groups in Hypothesis 5 (χ²[6, n = 82] = 5.36, p = .50). Finally, the outlier group was more technology-savvy, but the difference was not significant when compared with the group that never completes online evaluations (t[82] = −1.05, p = .29), not supporting Hypothesis 6. Effect size was calculated for each of the metric variables. With the exception of GPA, which suggested a moderate practical significance, the results suggested a low practical significance (Cohen, 1988). These results are summarized in Tables 4 through 7.
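The effect-size step can be illustrated with Cohen's d, the standardized mean difference. The article cites Cohen (1988) but does not state which effect-size statistic was computed, so the use of d here is an assumption; the cutoffs of 0.2, 0.5, and 0.8 are Cohen's conventional benchmarks for small, medium, and large effects. The data and function names below are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_sd = sqrt(((n_a - 1) * variance(group_a)
                      + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2))
    return (mean(group_a) - mean(group_b)) / pooled_sd


def effect_label(d):
    """Map |d| onto Cohen's conventional benchmarks."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# Hypothetical GPA values for two groups (not the study's data).
completers = [3.0, 3.5, 4.0]
never = [2.7, 3.2, 3.7]
d = cohens_d(completers, never)
print(f"d = {d:.2f} ({effect_label(d)})")
```

Reporting an effect size alongside each test is useful here because the group sizes differ widely, and a p value alone says little about how large a difference is in practical terms.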
t Tests: Never Complete Online Evaluations Group and Always Complete Online Evaluations Group.
*Significant at p < .05.
Pearson Chi-Square: Never Complete Online Evaluations Group and Always Complete Online Evaluations Group.
Chi-square (df = 1) = 6.00, p = .01.
t Tests: Never Complete Online Evaluations Group and Complete Online Evaluations for Good or Bad Instructors.
*Significant at p < .10.
Pearson Chi-Square: Never Complete Online Evaluations Group and Complete Online Evaluations for Good or Bad Instructors.
Note. No differences were significant at p ≤ .10.
Confirmation of t-Test Analyses Using Logit
Logistic regression (logit analysis) in its basic form is limited to a dependent variable with two groups. In this study, the first model contrasts the group that never completes online evaluations (dummy variable code = 0) with the group that always or almost always completes online evaluations (dummy variable code = 1). The second model contrasts the group that never completes online evaluations (dummy variable code = 0) with the group that only completes online evaluations if the instructor is very good or very bad (dummy variable code = 1).
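For reference, the binary logit model described above can be written in its standard textbook form. The notation is assumed for illustration and does not appear in the original analysis: Y_i is the 0/1 group code for student i, x_{ik} is that student's value on predictor k, and K is the number of predictors entered.

```latex
\ln\!\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{k=1}^{K} \beta_k x_{ik},
\qquad
p_i = P(Y_i = 1 \mid x_{i1}, \dots, x_{iK})
    = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{k=1}^{K} \beta_k x_{ik}\right)}}
```

Under this form, each coefficient is a change in log odds per unit increase in its predictor, so \(e^{\beta_k}\) is the corresponding odds ratio.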
The validation/confirmation test using logit analysis began with the Hosmer and Lemeshow test to check for model fit. The results showed nonsignificant Chi-square results for both the never/always groups and the never/instructor good or bad groups. A nonsignificant Chi-square indicates that the data fit the model well (Wuensch, 2014). Next, to check for the presence of multicollinearity, interactions were run among the covariates (time poverty, complaining, and technology savviness) for both the never/always and never/instructor good or bad dependent variables. All six interactions were nonsignificant.
Additionally, the Box-Tidwell test was run to test the assumption of logistic regression that the relationship between each continuous predictor and the logit (log odds) is linear. A significant Chi-square indicates that the assumption has been violated. Chi-square analyses for the continuous variables in both the never/always and the never/instructor good or bad groups were nonsignificant. The results of the logit analysis matched those of the t-test analyses, with the exception that race was significant for the never complete online evaluations group and complete online evaluations for good or bad instructors group, with β = 0.34 (1), p < .10 (see Tables 8 and 9).
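The Box-Tidwell check mentioned above is commonly implemented by adding, for each continuous predictor, the product of that predictor and its natural logarithm to the logit model. The following is a sketch of that standard form under assumed notation (C denotes the set of continuous predictors, each of which must be strictly positive); it is not taken from the original analysis.

```latex
\ln\!\left(\frac{p_i}{1 - p_i}\right)
  = \beta_0 + \sum_{k=1}^{K} \beta_k x_{ik}
  + \sum_{j \in C} \gamma_j \, x_{ij} \ln x_{ij},
\qquad
H_0:\ \gamma_j = 0 \ \text{for each } j \in C
```

A significant \(\gamma_j\) suggests that predictor j is not linearly related to the log odds, which corresponds to the violated assumption described in the text.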
Logit Analysis: Never Complete Online Evaluations Group and Always Complete Online Evaluations Group.
Note. Dependent variable coded as 0 = never complete and 1 = always complete.
*Significant at p < .05.
Logit Analysis: Never Complete Online Evaluations Group and Complete Online Evaluations for Good or Bad Instructors Group.
Note. Dependent variable coded as 0 = never complete and 1 = complete for good or bad instructors.
*Significant at p < .10. **Significant at p < .05.
Results and Discussion
Today, more and more colleges and universities have adopted the online approach to student course evaluations. There is very little lag time between administration and reporting of results. However, in most cases, students are allowed to complete the evaluations voluntarily anytime during an announced timeframe. As a result, one study noted that the average online response rate is 23% less than the in-class collection response rate (Guder & Malliaris, 2013). Unfortunately, academicians have noted that biases exist between those who complete the evaluations and those who do not.
This study revealed that there are persistent significant differences regarding gender, race, and GPA between students who participate in online student evaluations and those who do not. Those students who refuse to participate in online student evaluations of instruction may jeopardize the representativeness of the sample consisting of students who participate in online student evaluations (Feinberg, Kinnear, & Taylor, 2013).
Analogous to the challenges of online student evaluations, low response rates in mail surveys can also yield nonresponse biases (Groves, 2006). In both cases, there may be multiple characteristics distinguishing respondents from nonrespondents, thus increasing nonresponse bias.
It has been shown that appropriate response inducement for mail surveys can increase a low response rate, typically less than 15% with no incentive, to about 80% (Malhotra, 2010). However, instructor incentives for completion of online student evaluations, such as awarding extra points (extra credit), will bias the results among classes, since these instructor-specific incentives are applied unequally. Equal incentives applied across classes by academic administrators may more closely emulate the mail survey incentive by increasing response rates and reducing nonresponse bias.
Moreover, one way to reduce, if not eliminate, the disadvantages of both administration methods of SET, that is, the traditional paper-and-pencil method and the online method, is to use the positive characteristics of each in a hybrid approach. Evaluations could be administered during class, online in a computer lab setting, with a relatively short window of time allowed to instructors for their administration. Moreover, due to the relatively short timeframe of administration, the response bias noted in the Estelami (2015) study, which occurs in both methods of administration, would also be minimized. Additionally, this hybrid method may be more easily implemented as the presence of technology becomes more pervasive in the classroom.
Conclusion
This study successfully shows that nonresponse bias continues to exist with online SET. Limitations in this study include the relatively small group sizes of the group that never completes online student evaluations and the group that only completes online evaluations if the instructor was very good or very bad. Furthermore, these group sizes are not reflective of the low participation rate that most colleges and universities have experienced with online evaluations. Students may simply complete an evaluation occasionally for no other reason than convenience. However, these students could have selected the “other” option in the instrument, versus “never,” “almost always or always,” “only when the instructor was very good,” or “only when the instructor was very bad.” Furthermore, a relatively small nonresponse group (never completes online evaluations) does not necessarily mean that nonresponse bias is negligible. Groves and Peytcheva (2008, p. 183) note that, “some surveys with low nonresponse rates have estimates with high relative nonresponse bias.”
Ideally, future research should attempt to solicit responses from those students who did not complete their online evaluations and compare these responses with those of students who completed them. This would document the actual differences in evaluations between those who completed them and those who did not, differences that are suggested by the nonresponse bias found in this study. Estelami’s (2015) results suggest that there would be a mean difference, if late responders are less likely to respond in general. Additionally, other variables may be considered for analysis in terms of their effect on nonresponse bias. For example, Bladon (2010) found that an increasing level of nonresponse bias in telephone surveys is related to respondent education, that is, those with more education become more heavily weighted in evaluation results. Moreover, the sample was taken from a medium-sized regional university in the United States of largely traditional students (18-23 years of age). Samples from other types of colleges and universities could be examined.
In conclusion, the findings of the study indicate that nonresponse bias persists in the online administration of SET. Therefore, colleges and universities should be careful when making administrative decisions based on these evaluations. Future research is needed to address some of the limitations outlined here.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was supported by a Rea and Lillian Steele Summer Grant.
