Abstract

David Thissen’s essay, Bad Questions: An Essay Involving Item Response Theory (2016), is an excellent contribution to the genre of commentaries on the field. It joins the likes of the piece by Thissen’s frequent collaborator, Howard Wainer (2010), who published “14 Conversations About Three Things” in this journal 6 years ago. Thissen asks and answers, dismissively, five of his titular “bad questions.” He concludes that what makes them bad “is the framing of the question that demands a yes-or-no, black and white, cut and dried response” (p. 10). He argues for a statistical education that values continua over dichotomies and categories, and I agree. However, I think Thissen, like scholars and students of statistics and measurement generally, underappreciates the utility and necessity of dichotomies for decision makers. Ultimately, I believe we can and should inform their dichotomies with meaningful scales and defensible procedures more often than we do. Conveniently, Item Response Theory (IRT) can be quite useful for this purpose.
In this brief response, I distinguish between Thissen’s (2016) first three bad questions and his last two. His first three questions concern statistical and psychometric criteria, for IRT model fit, unidimensionality, and cardinality (interval scale properties), respectively. These are judgments about models and scores. His last two questions concern policy criteria, for student proficiency and teacher effectiveness, respectively. These are judgments about people.
Thissen’s (2016) first three questions are bad in the way that statistical p values are bad. These questions yield incomplete information about practical significance when samples are small, and they are a needless distraction when samples are large. I agree with Thissen that we should shift our attention away from these questions and toward questions of practical significance. In contrast, I argue that the last two questions are bad in part because statisticians and psychometricians have done too little to help answer them. I suggest how we might help, by employing scale anchoring methods and investigating properties of “Frankenstein” score composites that are in common use.
Bad Questions Are Those Irrelevant to the Ultimate Use of Procedures
Thissen (2016) rightly dismisses his first three bad questions as uninteresting because most of their answers lack any sense of practical consequence for a particular use. He answers each question—Does the model fit? Is the test unidimensional? and Is the scale ordinal or interval?—with the same answer, “No.” I would respond to each question the same way, too, but I would also respond with another question, “So What?” Answering the so-what question requires, to my mind, demonstrating the consequence of model misfit, multidimensionality, or ordinality on some use of the procedure or statistic, over and above some counterfactual baseline. Thissen illustrates this well with his suggested experiments that determine the impact of these violations on actual uses of IRT statistics and scores.
Thissen’s (2016) proposals align with a growing movement toward effect sizes, confidence intervals, and meta-analyses, a movement that recognizes conventional statistical tests and their associated p values as incomplete at best (American Psychological Association, 2010). Cumming (2014) has described this movement as “The New Statistics” (p. 7), and he further advocates dispensing with p values altogether in favor of effect sizes and confidence intervals. Efforts like these require us to establish a relevant scale and then evaluate magnitudes along it. A good example of this is Dorans and Feigenbaum’s (1994) Difference That Matters, a criterion for the difference between equating relationships based on whether the difference results in a score discrepancy after scores are rounded to their final reporting scale.
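To make the rounding criterion concrete, here is a minimal sketch in Python of a check in the spirit of the Difference That Matters. It is an illustration under stated assumptions, not Dorans and Feigenbaum’s (1994) procedure: the two linear equating functions, the raw score range, the one-point reporting unit, and the round-half-up rule are all hypothetical choices made for the example.

```python
import math


def round_to_scale(score, unit=1.0):
    """Round half up to the nearest reporting unit (assumed rounding rule)."""
    return math.floor(score / unit + 0.5) * unit


def differences_that_matter(raw_scores, equate_a, equate_b, unit=1.0):
    """Return (raw, reported_a, reported_b) wherever the reported scores differ."""
    flagged = []
    for x in raw_scores:
        reported_a = round_to_scale(equate_a(x), unit)
        reported_b = round_to_scale(equate_b(x), unit)
        if reported_a != reported_b:
            flagged.append((x, reported_a, reported_b))
    return flagged


# Two hypothetical linear equating functions (illustrative values only).
def equate_a(x):
    return 1.00 * x + 0.20


def equate_b(x):
    return 1.02 * x + 0.25


flagged = differences_that_matter(range(0, 41), equate_a, equate_b)
for raw, rep_a, rep_b in flagged[:3]:
    print(f"raw={raw}: equating A reports {rep_a:.0f}, equating B reports {rep_b:.0f}")
```

In this made-up example, the unrounded equated scores differ at every raw score, but the reported scores first disagree at a raw score of 13. The question shifts from whether the two equatings differ to where, and for whom, the difference changes a reported score.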
These efforts are also consistent with modern theories of validation (Kane, 2013). Here, the target of validation is not a model or a test, but the use or interpretation of a test-based statistic (e.g., scores) for a purpose. From this perspective, model misfit, multidimensionality, and ordinality are not in and of themselves threats to validity unless we can demonstrate the threats they pose to uses of test-based statistics for particular purposes.
Meaningful Scales Can Support Defensible Dichotomies
Thissen’s (2016) last two questions, about student proficiency and teacher effectiveness, are like the first three in that they require dichotomization of a continuous scale. However, I do not believe we should dismiss these last two questions for the crime of dichotomization alone. Thissen framed the first three questions statistically, where the underlying continuous scale is not the magnitude of misfit, multidimensionality, or ordinality but rather the probability of the magnitude under the null hypothesis. The cutoff, say, p < .05, is arbitrary, yes, but, worse, the scale on which we set this cutoff is not remotely as interesting as the implications of the magnitude itself. I see the crime as choosing an irrelevant scale, not setting an arbitrary dichotomy.
In contrast, I think the continuous scale underlying student proficiency and teacher effectiveness could be quite relevant. The more relevant it is, the more defensible the dichotomization can be. Conveniently, IRT can help. Thissen (2016) notes, correctly, I think, that IRT has two primary purposes: item selection and linking. I think a third, underappreciated use of IRT deserves mention: scaling. By scaling, I mean not only establishing a score scale but also helping to improve interpretation of scores on that scale. An example of such a scaling procedure, and one that benefits from IRT, is scale anchoring (Beaton & Allen, 1992). It results in what is known as item maps or, in particular cases, Wright Maps (Wilson, 2011), where we can associate a score with an item on the basis of a model-implied percentage chance of answering that item correctly. These anchors can improve score interpretation and cut score selection. Unlike the probability scales implied by Thissen’s first three questions, student proficiency and teacher effectiveness scales can be anchored to more relevant, interpretable criteria.
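The idea of associating a score with an item can be sketched in a few lines of Python. The sketch assumes a two-parameter logistic (2PL) model and a response probability criterion of .67; both are assumptions made for illustration, as are the item names and parameter values. Operational scale anchoring (Beaton & Allen, 1992) involves additional criteria that are not shown here.

```python
import math


def p_correct(theta, a, b):
    """Model-implied probability of a correct response under an assumed 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def anchor_location(a, b, rp=0.67):
    """Scale location at which the item is answered correctly with probability rp.

    Solves p_correct(theta, a, b) = rp for theta: theta = b + logit(rp) / a.
    """
    return b + math.log(rp / (1.0 - rp)) / a


# Hypothetical items: name -> (discrimination a, difficulty b).
items = {
    "item_1": (1.2, -0.5),
    "item_2": (0.8, 0.3),
    "item_3": (1.5, 1.0),
}

# An item map: items ordered by the scale location where they "anchor".
item_map = sorted((anchor_location(a, b), name) for name, (a, b) in items.items())

for location, name in item_map:
    print(f"{name} anchors at theta = {location:.2f}")
```

A candidate cut score can then be read against the items that anchor just below and just above it, which gives a standard-setting panel something substantive to argue about.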
I prefer scale anchoring over Thissen’s (2016) proposal to overlay existing score scales with probabilities of exceeding the cut score. I agree with Thissen that current standard setting procedures are overwrought, misunderstood, and arbitrary, but his proposal has at least three shortcomings. First, Thissen’s proposal does not replace cut scores; it only layers an additional (probability) scale over the arbitrary cut score. Second, I am unconvinced that the “probability of exceeding the cut score” scale is any more interpretable than the original scale itself. Third, if a cut score is set based on this probability, it is practically no different from adjusting an already arbitrary cut score arbitrarily in some direction. The proposal is subversive. It is a method to set cut scores designed to convince people never to set cut scores. This leads me to my penultimate section heading.
You Got to Give the People What They Want
Reading Thissen’s (2016) closing arguments, I kept hearing the 1975 R&B line sung by the O’Jays, You got to give the people what they want. Thissen is against this mantra, when what the people want are “cut scores, critical values, and decisions” (p. 10). I am certainly sympathetic, and a great deal of my own work is an attempt at remedying the statistical and interpretive damage done by arbitrary dichotomies (Ho, 2008; Ho & Reardon, 2012). However, I do not believe that all dichotomization is necessarily thoughtless or harmful. Even when cut scores are arbitrary, setting goals can have positive effects, provided that the full scale is understood and progress along it is achievable. Twisting the carpenter’s adage again, “cut once, measure everywhere” (Ho, 2008, p. 358).
I agree with Thissen (2016) that educational statisticians should never forget the continua underlying categorizations. However, it is precisely our deep knowledge of these continua that poises us to support wiser categorizations and anticipate their limitations. Rather than lament or subvert these processes, I think we should advise and engage with them. I even think we should contribute more dichotomies ourselves, adding to our portfolio of “differences that matter,” as long as they are on scales that are relevant, that we anchor and elucidate. I believe that doing so would cohere the field and increase its usefulness.
There is no more compelling opportunity for this engagement than the one Thissen (2016) raises: value-added models (VAMs) and teacher effectiveness. Thissen rightly notes that VAMs are best served as one of multiple measures, and indeed this is standard practice in state and district policy (Wesson, Potts, & Hill, 2015). Thissen calls VAMs “Rube Goldberg contrivances” (2016, p. 85), and I like to call them “Frankenstein metrics,” particularly when they are composited with other “contrivances” like teacher observation scores. Statisticians and psychometricians are trained with exactly the tools necessary to clarify the properties of these derived composites (Haertel & Ho, in press; Mihaly, McCaffrey, Staiger, & Lockwood, 2013). Rather than withdraw from the process, I think we have a responsibility to engage, critique, and improve it.
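As one small illustration of the kind of property that can be clarified, the following Python sketch computes the reliability of a weighted composite from the reliabilities, standard deviations, and correlations of its components. It uses the standard classical test theory result for composites and assumes that measurement errors are uncorrelated across components. The weights and component values are hypothetical, and the sketch is not the method used in the sources cited above.

```python
def composite_reliability(weights, sds, reliabilities, corr):
    """Reliability of a weighted sum of components under classical test theory.

    Assumes measurement errors are uncorrelated across components.
    corr is the observed-score correlation matrix (1.0 on the diagonal).
    """
    n = len(weights)
    # Observed variance of the weighted composite.
    total_var = sum(
        weights[i] * weights[j] * sds[i] * sds[j] * corr[i][j]
        for i in range(n)
        for j in range(n)
    )
    # Error variance: each component's weighted error variance, summed.
    error_var = sum(
        (weights[i] * sds[i]) ** 2 * (1.0 - reliabilities[i]) for i in range(n)
    )
    return 1.0 - error_var / total_var


# Hypothetical composite of a value-added score and an observation score.
weights = [0.5, 0.5]
sds = [1.0, 1.0]              # both components standardized
reliabilities = [0.4, 0.6]    # illustrative values only
corr = [[1.0, 0.3],
        [0.3, 1.0]]

rel = composite_reliability(weights, sds, reliabilities, corr)
print(f"composite reliability = {rel:.3f}")
```

Rerunning the function over a grid of weights shows how a policy choice about weighting changes the precision of the resulting metric, which is the kind of evidence decision makers rarely see before the weights are fixed.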
Improve Statistical Training by Increasing Student Exposure to Score Uses
Thissen (2016) closes by asking how to improve statistical education. The implications of “the new statistics” for training students go beyond broader perspectives on model fit, dimensionality, and cardinality analyses. They also require student knowledge of and experience with the contexts in which differences could matter. How are the products of IRT analyses used in practice, in admissions testing, in workplace testing and credentialing, in program evaluation, and as tools of policy? Just as we teach students to pair every simulation study with a real data analysis, we should teach them to pair statistical tests with a discussion of the magnitude of practical impact in a context.
In the context of VAMs, Thissen notes that setting cut scores can “take us far from the realm of educational measurement toward an integrated overview of educational research and policy” (2016, pp. 85–86). But isn’t this a trip we’ve had to take, as measurement scholars, countless times? Can’t we predict, by now, that scores we create will be used this way (Ho, 2013)? Most importantly, isn’t considering uses like these the best inoculation against asking bad questions? If so, perhaps we should start training students toward exactly the integrated overview of educational research and policy that Thissen considers a stretch. As Haertel (2013) has argued, we need not undertake this journey alone—we have much to learn from other disciplines.
This training must include but also extend beyond the conventional classroom and lab. I am glad that organizations continue to support internship programs and sponsor graduate assistants, and I hope that this support expands. We can be creative with other opportunities. A testing organization approached me recently about joining their Technical Advisory Committee (TAC). I responded enthusiastically and with an additional proposal, that the organization support graduate student attendance at TAC meetings, with a commitment to facilitate research of joint interest. The benefits are multilateral. In every TAC meeting that I’ve ever attended, there have been answerable research questions without the research bandwidth to answer them. Under this proposal, the organization gains research support, added faculty consultation, and a possible recruitment pathway. Faculty gain research opportunities. And students gain exposure to expert discussion of relevant topics and a possible dissertation topic. I am delighted that the organization agreed. I consider this a nascent pilot, but I am happy to report that, so far, we have not yet asked any of Thissen’s (2016) bad questions.
Footnotes
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
