Statistical Significance and Program Effect

Abstract

In their essay, “Why Assessment Will Never Work . . .,” Bacon and Stewart (2016) argue that, instead of carrying out the expensive process of experimenting themselves, many business schools would get a bigger bang for their buck if they used “published pedagogical studies that use direct measures of learning with sufficient statistical power” (p. 181) to improve instructional programs and student learning. This is because, according to Bacon and Stewart (2016), a school's own experiments are doomed to fail the inferential test of significance due to lack of statistical power, a consequence of small enrollments in programs and the unreliability of faculty-made measures of student achievement.

In this Rejoinder, I do not take issue with Bacon and Stewart’s recommendation to use published studies as collateral information about “what might work” to improve student learning for a particular program. Actually, regardless of program size, this is a good recommendation: We should interpret experimental evidence not only in light of a program’s own outcomes but also in light of what has been shown to be the case in related studies. Moreover, I do not take issue with their power analyses and related conclusions that many programs are too small to realistically expect statistical significance even under optimistic treatment effect-size assumptions.

I do, however, take issue with two points in their essay: (1) that reliance on statistical significance is a necessary condition to learn and justify something about program impact on student learning and (2) that the biggest problem programs face is “measuring postchange [sic] outcomes to determine if the implemented changes produced the desired effect” (p. 181). I discuss these issues in greater depth below.

Statistical Significance

Statistical significance is but one indicator of program effect. It depends not only on sample size but also, importantly, on effect size: the mean difference between control and experimental group performance expressed in standard deviation units.¹ When programs are small, even big treatment effects may not be statistically significant, but they may very well be practically significant. Moreover, when programs grapple with tough decisions about what needs to change to improve, how to implement the change experimentally, how to measure the impact of the change, and how to interpret and generalize the observed impact, the process itself may be very beneficial and contribute to program improvement and enhanced student learning, even in the absence of statistical significance (e.g., Shavelson, 2010). When even small-scale studies are interpreted with collateral information from the literature, as recommended, inferences are strengthened. Even though programs share similarities, each is unique in itself; carrying out experiments on small programs can shed light on the extent to which generalized findings in the literature apply to the particular. This approach was recognized early on in the medical literature as well as the education literature (e.g., Sackett, Rosenberg, Gray, Haynes, & Richardson, 1996; Shavelson, Fu, Kurpius, & Wiley, 2015).
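The dependence of significance on both sample size and effect size can be illustrated with a minimal sketch. It is not taken from Bacon and Stewart's power analyses. It assumes a two-group comparison with equal group sizes, a two-sided test at the .05 level, and a normal approximation to the test statistic, which slightly overstates power for very small groups compared with a t test. The scores, the function names, and the group sizes are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist


def cohens_d(mean_treatment, mean_control, pooled_sd):
    """Effect size: mean difference expressed in standard deviation units."""
    return (mean_treatment - mean_control) / pooled_sd


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group z test with equal group sizes."""
    z = NormalDist()  # standard normal distribution
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the true effect size is d.
    shift = d * (n_per_group / 2) ** 0.5
    # Probability of falling in either rejection region under that alternative.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical exam scores: an 8-point gain with a standard deviation of 10.
d = cohens_d(mean_treatment=78.0, mean_control=70.0, pooled_sd=10.0)
for n in (10, 20, 50, 100):
    print(f"d = {d:.2f}, n per group = {n:3d}, approx. power = {approx_power(d, n):.2f}")
```

Under these assumptions, an effect that is large by conventional standards (d = 0.8) has less than an even chance of reaching significance with 10 students per group, although the same 8-point gain could matter a great deal in practice.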

Improvement in Teaching–Learning–Assessment

The second issue, ostensibly about statistical power, is actually about institutional change. It is about administrator and faculty trust, along with the will and capacity to assess learning responsibly in order to use assessment results to “engineer” change in teaching and learning and to get widespread buy-in to do the hard work of initiating and maintaining that change (Shavelson, 2010). While Bacon and Stewart recognize these elements by noting that many programs are experiencing a substantial cost in setting learning goals, aligning curricula with the goals, designing measures of student learning related to those goals, collecting, analyzing, and disseminating results and revising curricula based on faculty discussion and findings, they think that statistical power is the central problem. I think not. Organizational change is hard work; it is especially hard work in higher education where so often these issues are contentious and authority is distributed. Thus, the remainder of this Rejoinder focuses on issues of trust, will, and capacity.

Consider trust. Higher education is buffeted by three forces: political, consumer, and academic. Consumers (students, parents, government, and business) purchase education and expect colleges and universities to be accountable for their “products.” Consumers pressure policy makers to hold universities accountable. In recent years, the trust that once bound higher education and government has eroded with increasing tuition and time to graduation, questionable student outcome claims, and the like. Colleges and universities have imposed an accountability mechanism, accreditation, on themselves to head off government intervention.² Hence outside pressure, while having moved business schools to assess student learning as Bacon and Stewart (2016) note, is also often viewed by administrators and especially faculty as a nuisance: a force to comply with at least symbolically if not substantively.

Within higher education, administrators are caught between external and internal forces: accreditation and faculty academic freedom. Perhaps most important, faculty and administrators do not trust one another. Faculty are concerned about how assessment information will be used, for example, in promotion and tenure decisions and as a constraint on academic freedom (e.g., their independence to determine the curriculum and assess student classroom performance). Moreover, faculty know that “the coin of the realm” and the passport to opportunity lie in publishing, not teaching.

All these forces make organizational change in higher education challenging and slow moving. I’ve found that accreditation contributes positively by getting institutions to consider gathering evidence and improving student learning, but that is not enough (Shavelson, 2010). Campus leaders must champion the assessment of learning and experimentation to engage faculty substantively and to sustain change in practice. Incentives (e.g., recognition of teaching as an important part of promotion and tenure; considerations of teaching load) need to be in place. Indeed, assessment and experimentation are often an unrecognized (ignored) “overload” to faculty responsibilities. For these reasons, I see assessment of learning and program change as bigger issues than statistical power and significance. Focusing on statistics detracts from substantive sustained change toward the improvement of student outcomes.

The will to change, as the preceding discussion of trust suggests, is often hard to come by. Put enough pressure on faculty and they will comply; relieve the pressure and they will revert to business as usual. And when faculty do respond, often their response is symbolic rather than substantive: do the minimum to comply with external pressures and maintain business as usual as best as possible. This is, of course, an overgeneralization. I have met many faculty crusaders who have taken their responsibility to experiment with and change the way they teach seriously and effectively. But they are unusual and often exhausted; change is hard to maintain.

I have, however, observed the will to change in certain institutions. What characterizes these institutions is that they have managed to align actors up and down the “chain of command” (from president to provost to dean to department chair to faculty and to student) to make change; information flows up and down; information is used for improvement, not punishment.

Capacity to assess, experiment, and make change is, as Bacon and Stewart (2016) note, often lacking. Most faculty and administrators are not experts in assessment development, psychometrics, experimental design, statistical analysis of behavioral experiments, and using evidence from experiments to improve programs. Consequently, the lack of such capacity is a barrier to improving teaching and learning, even when efforts are well intentioned. Those institutions that have overcome the capacity gap have often employed an “assessment guru,” a charismatic technical expert who blends into the faculty and works collaboratively to create capacity and motivate (Shavelson, 2010).

Concluding Observation

I agree with Bacon and Stewart that statistical power is an issue when “treatment” effects and sample sizes are small. I also agree that collateral research bearing on the proposed teaching–learning–assessment changes is salutary and should be used regardless of program size in making decisions about improvement. One caveat, however, is that each program is unique. A program’s experience with change is essential both in program development and in evaluating and using external research literature. Their article, however, detracts from the bigger challenge of figuring out how to assess student outcomes, how to build effective programs to enhance student learning, how to design interventions and interpret and generalize findings, and most important, how to overcome the barriers inherent in higher education that work against sustained, substantive change.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Notes

References

Bacon, D. R., & Stewart, K. A. (2016). Why assessment will never work at many business schools: A call for better utilization of pedagogical research. Journal of Management Education. Advance online publication. doi:10.1177/1052562916645837

Sackett, D. L., Rosenberg, W. M. C., Gray, J. A. M., Haynes, R. B., & Richardson, W. S. (1996). Evidence based medicine: What it is and what it isn’t. BMJ, 312, 71-72.

Shavelson, R. J. (2010). Measuring college learning responsibly: Accountability in a new era. Stanford, CA: Stanford University Press.

Shavelson, R. J., Fu, Kurpius, & Wiley (2015). Evidence-based practice in science education. In Gunstone (Ed.), Encyclopedia of science education (pp. 407-410). New York, NY: Springer.