Abstract
Endorsing a priori the conviction that any science worthy of the name must measure the attributes it investigates, psychometricians adopted a metaphysical paradigm (without acknowledging it as such) to secure their claim that mental tests measure psychological attributes, a claim that was threatened by the inadequacy of test data to secure it. The fundamental axiom of this paradigm was Thorndike’s Credo (“All that exists, exists in some amount and can be measured”; 1918, p. 16), which entails its central lemma, the psychometrician’s fallacy (“All ordered attributes are quantitative”; Michell, 2009, p. 41), which, in turn, supplies psychometrics’ primary methodological principle (“interval scales can be derived from ordinal data”). Logically, this framework is flawed at every level: Thorndike’s Credo is metaphysical overreach; the psychometrician’s fallacy is just that, a logical fallacy; and the primary methodological principle is a prioristic thinking.
The raison d’être of the new psychology, as Dewey (1884) called the research programme initiated by Fechner, Wundt, and their followers, was that of applying the methods of physical science (viz., experiment and measurement) to mental phenomena. Experimental, quantitative psychologists distinguished themselves sharply from earlier, mental philosophers with their “armchair” methods. In particular, they rebuffed metaphysics. This repudiation was assisted by the attitude towards metaphysics in philosophy itself during the 20th century. Metaphysics became an intellectual pariah, untouchable even in its home discipline, let alone psychology. The irony is that metaphysics has always infused science, psychology included, simply because characterizing phenomena and proposing explanations always presuppose answers to metaphysical questions. 1 Psychology’s repudiation of metaphysics was not so much a case of divorce as one of repression, and one problem with repression is that the repressed sometimes exerts an influence. My argument is that psychometrics has been unconsciously shaped as much by metaphysics as by the character of the phenomena it investigates. In psychometrics, there is a gap between what is observed and what is claimed and that gap has always been filled by metaphysics.
A classic of psychometrics states, “The level of measurement most often specified in mental test theory is interval measurement, which yields an interval scale” (Lord & Novick, 1968, p. 21). However, Cliff and Keats (2003) note:
Most of the basic data consists of dichotomous responses, and much of the rest is made of responses on short scales. Neither type of data furnishes information that can inherently be considered more than ordinal. The dominant contemporary treatment of this data is to derive from it scores on interval-scale latent variables through the application of one or another “model” that is presumed to explain the data. We feel that, in the majority of instances, the application of these models is inappropriate. (p. ix)
This gap between “basic data” and “level of measurement most often specified” is anomalous. In mental testing ordinal relations are observed, but interval scale measurements are claimed. This is anomalous because, in science, gaps between observations and reports must be bridged by evidence, in this case, evidence of the structure interval scales require, and such evidence is missing.
To illustrate, consider tests where answers to items are assessed correct or incorrect. Each person’s performance is a sequence of correct or incorrect responses. For any person, X, this sequence is X’s response pattern. Given a test of exactly n items, because assessment is binary, X’s response pattern is one of $2^n$ logically possible patterns. Each response pattern defines a class and any two people, X and Y, inhabit the same class if and only if their response patterns are identical. Hence, tests classify. But more structure is discernable. These $2^n$ classes are partially ordered and the people tested likewise: any person X performs better on the test than another person Z if X gets correct every item Z gets correct and more. Not every pair of people tested stands in this relation and, so, as Michael Kane (2008) once noted, tests deliver “at best, a partial ordering [emphasis added]” (p. 104). A partial order is a structure falling between a classification and a strict simple order. 2 A strict simple order lines things up, one after the other. That is, the order relation is transitive, asymmetric, and connected. A partial order, however, is not connected (e.g., in a family tree, some pairs of family members, say two siblings, are not connected by the order relation of being an ancestor of, which is merely transitive and asymmetric).
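The partial ordering just described can be made concrete with a minimal sketch. It assumes a hypothetical three-item test, codes correct as 1 and incorrect as 0, and uses illustrative names (`performs_better`, `unordered_pairs`) that are not drawn from the literature cited here.

```python
from itertools import product


def performs_better(x, z):
    """True if pattern x gets correct every item z gets correct, and more."""
    return x != z and all(a >= b for a, b in zip(x, z))


n_items = 3  # illustrative test length (an assumption, not from the text)
patterns = list(product((0, 1), repeat=n_items))  # all 2**n response patterns

# Distinct pairs that the relation leaves unordered in either direction.
unordered_pairs = [
    (x, z)
    for idx, x in enumerate(patterns)
    for z in patterns[idx + 1:]
    if not performs_better(x, z) and not performs_better(z, x)
]

print(len(patterns), len(unordered_pairs))
```

With three items there are 8 response patterns and 28 distinct pairs, of which 9 are left unordered (for example, correct only on the first item versus correct only on the second). The relation is therefore transitive and asymmetric but not connected.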
The cognitive attribute assessed by a test is reflected in the structure of the response patterns. If, for each test item i, the set of cognitive resources (i.e., the knowledge, skills, strategies, etc.) a correct response requires is specified (call it I), each response pattern corresponds to the super set consisting of the union of the sets of cognitive resources required to obtain it. That is, for example, in a three-item test, with items i, j, and k, and corresponding sets of cognitive resources I, J, and K, the response pattern {correct on i, correct on j, correct on k} corresponds to the super set of cognitive resources, {I, J, K}, and the response pattern {correct on i, correct on j, incorrect on k} corresponds to {I, J}, and so on. The attribute assessed by the test is then the partial ordering of these super sets of sets of cognitive resources. 3 Now, an interval scale requires not only that degrees of the relevant attribute constitute a strict simple order but also that the differences between degrees are quantitative (i.e., measurable on a ratio scale). Claiming that test performances possess quantitative structure, when only partially ordered structure is observed, is anomalous without evidential support. Thus, Cliff and Keats are correct in their assessment.
However, this anomaly remains invisible to most psychologists and, consequently, the criticism offered by Cliff and Keats (2003) is not accepted as valid within mainstream psychometrics. 4 Why not? One reason is that a presupposition embedded within the psychometric paradigm prevents its recognition. This presupposition is a metaphysical axiom still steering research and practice to this day and is responsible for the unacknowledged evidential hiatus besetting psychometrics. Furthermore, while this presupposition is metaphysical, it is not generally recognized as being such.
The fundamental axiom: Thorndike’s Credo
A century ago, Edward Lee Thorndike 5 coined what he called psychometrics’ “general Credo”: “Whatever exists at all exists in some amount. To know it thoroughly involves knowing its quantity as well as its quality” (E. L. Thorndike, 1918, p. 16). Rephrased, often misquoted, and occasionally misattributed, it is one of psychology’s most enduring precepts. According to his biographer, one version of it, namely, ‘“All that exists, exists in some amount and can be measured,’ is the epigram that first comes to mind whenever Thorndike is mentioned” (Clifford, 1984, p. 283). It was beloved of textbook writers (e.g., McCall, 1922) and appeared in every edition, 1949 to 1990, of Lee J. Cronbach’s Essentials of Psychological Testing, rephrased as: “If a thing exists, it exists in some amount. If it exists in some amount, it can be measured” (e.g., Cronbach, 1949, p. 12, 1990, p. 34). It is found as far afield as the Nordic Journal of Studies in Educational Policy (Pettersson et al., 2017, p. 31) and quoted by Chinese psychologists, who aligned it with teachings of their ancient philosopher, Mencius (Xu & Li, 2014, p. 884). It is still repeated (e.g., Price, 2017, p. vii), always without support, always as if axiomatic.
The prevailing view is epitomized by Saint-Mont (2012), who said, “it would be a scientific sensation if we encountered some phenomenon that existed, but did not do so in some amount” (p. 469). While Saint-Mont’s term, “some phenomenon,” exploits an ambiguity in Thorndike’s wording, E. L. Thorndike’s (1918) original context shows that by “whatever [emphasis added] exists,” he meant attributes, like height or weight or mathematical ability, and not objects, like elephants or rocks or people, which obviously always exist in both amounts and number. Our immediate experience is that not all attributes exist in amounts. Some are experienced as purely qualitative: for example, one rose does not differ from another by how much roseness it possesses. Of course, things are not always as they seem and attributes initially appearing purely qualitative, such as hot and cold, later proved to be quantitative. But that was discovered, not taken as a first principle. Thorndike, however, accepted that all attributes are quantitative as a first principle. He did this, he said, to counter “the general fear that science and measurement, if applied to human affairs . . . will deface the beauty of life, and corrode its nobility into a sordid materialism” (E. L. Thorndike, 1921, p. 371). He elaborated:
I have no time to present evidence, but I beg you to believe that the fear is groundless, based on a radically false psychology. Whatever exists, exists in some amount. To measure it is simply to know its varying amounts. . . . It does not dignify man to make a mystery of him. (E. L. Thorndike, 1921, p. 371)
The only alternative to measuring, he thought, was making a “mystery” of things. In presuming this false dichotomy, he was a product of his age.
For example, at the close of the 18th century, the German physicist Franz Achard said, “the physicist who does not measure only plays and differs from a child only in the nature of his game” (as quoted in Heilbron, 1979, p. 74). It was an increasingly prevalent attitude through the 19th century. In Ian Hacking’s (1983) words, “The world was now conceived in a more quantitative way than ever before. The world is seen as constituted by numerical magnitudes” (p. 242). Lord Kelvin’s famous dictum expressed it too:
In physical science, the first essential step in the direction of learning any subject is to find principles of numerical reckoning and practical methods for measuring some quality connected with it. I often say that when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely in your thoughts advanced to the stage of science whatever the matter may be. (Thomson, 1891, pp. 80–81) 6
Kelvin confined this “first essential step” to physics, but psychologists set it free. For Francis Galton (1879), founder of psychometrics, “until the phenomena of any branch of knowledge have been subjected to measurement and number, it cannot assume the status and dignity of a science” (p. 149) and James McKeen Cattell, Galton’s acolyte and Thorndike’s mentor, echoed, “Psychology cannot attain the certainty and exactness of the physical sciences unless it rests on a foundation of . . . measurement” (Cattell, 1890, p. 373). Thorndike’s Credo encapsulated it in a nutshell.
In his Credo, Thorndike expressed his faith in a “Pythagorean power by which number holds sway above the flux,” as Bertrand Russell (1967, p. 13) poeticized it. A religious sect and philosophical movement 7 originating in the 6th century BC, the Pythagoreans taught that reality is constituted numerically. 8 According to Thorndike’s brother, Lynn (an historian of science), the Pythagoreans “held that the whole world is, and that the life of man ought to be, harmoniously ordered in accordance with mathematical principles; nay more, that such principles are living things and that numbers are the essence of the universe” (L. Thorndike, 1905, p. 59). 9 It was a metaphysical doctrine and was embraced a millennium later by Saint Augustine who conjoined it with the biblical teaching that God “ordered all things by measure, number, and weight” (Wisdom, 11: 21), inferring, “that God is the measureless measure, the incalculable number, and the imponderable weight, in accordance with which all things are made” (Roche, 1941, p. 350), giving the idea theological currency. 10 Consequently, after the fall of Rome, Pythagoreanism survived in monasteries, entering mainstream thought in the 17th century (I. B. Cohen, 2005). The triumphs of quantitative physics cemented its status and it became the ideological lubricant easing the birth of psychometrics. Pythagoreanism is one of the West’s most enduring metaphysical tropes (Riedweg, 2005), influencing not only science but also almost the entire spectrum of Western cultural traditions.
Thorndike’s genius was to capture Pythagoreanism in a maxim so simple it was absorbed seamlessly into psychology. And alignment with Pythagoreanism reaped secondary gains: psychometrics presented itself as a “quantitative rational science” 11 to a culture already enjoying the fruits of quantitative physical science and also marketed tests as instruments of scientific measurement (e.g., McCall, 1922; Terman, 1916), helping convert institutions, like schools and the military, to their use.
Given Thorndike’s “profound disinterest in—even distaste for—philosophy” (Clifford, 1984, p. 3), propounding metaphysics sounds unlikely. However, even though he professed little interest in philosophy, he committed himself, albeit unwittingly, to certain philosophical presuppositions. As his mentor William James (1907) had taught, everyone has a philosophy, acknowledged or not. 12 His friend, Robert Woodworth (1932), characterized Thorndike’s as “sane positivism” (p. 366) and, indeed, Thorndike embraced features of positivism, such as its romantic elevation of science over other cultural traditions, and its antipathy to metaphysics. Both his Credo and antipathy stemmed from his attitude to science: his antipathy, because metaphysics is not science, and his Credo cradled his romantic image of science, according to which “We conquer the facts of nature when we observe and experiment upon them. When we measure them we have made them our servants” (E. L. Thorndike, 1903, p. 164). His biographer noted, “like certain other wayward sons of clergy, he (guiltily perhaps) infused his work with messianic fervor, so that science itself took on a crusadelike character” (Jonçich, 1968, p. 437). 13
Neither did those endorsing his Credo necessarily think of it as a metaphysical doctrine. In line with prevailing attitudes, they thought, “we hardly recognize a subject as scientific if measurement is not one of its tools” (Boring, 1929, p. 286) and observations like “the path of science is paved with achievements of the allegedly unachievable” (Spearman, 1937, p. 89) fed their optimism regarding the future of psychological measurement. Enthused by this quantitative imperative, sometimes ascribed to Galileo, “we must measure what is measurable and make measurable what cannot be measured” (Suen, 2008, p. 641), 14 they reasoned, does it matter much if some continue to believe, with Malebranche, Leibniz and Kant, that our data contain nothing “that can properly be called measurement”? . . . After all there is a sense in which logical and mathematical proofs are what the psychology of advertising has called “rationalization copy.” (Bartlett, 1940, p. 441)
Thorndike’s Credo was “rationalization copy” for the “scientific insight” that tests measure psychological attributes. That is, it was a post hoc rationalization—an ideological prop—for conforming to the quantitative imperative, and its metaphysical status went unrecognized.
However, this imperative overlooks the fact that there is no logical necessity that psychological attributes must be quantitative and whether they are is always an empirical issue. Investigating this issue requires answers to prior questions. Foremost is “what is the character of quantitative structure?” and, next, “how do merely classificatory and ordered attributes differ from quantitative attributes?” By the early 20th century, scholars (e.g., Hölder, 1901; 15 Mill, 1843; 16 Russell, 1901 17) provided answers to these questions. Presuming attribute structure top-down is nonempirical; it must be discovered bottom-up by investigating the fine grain of relevant attributes. The fine grain of the attributes assessed by tests is that of order relations. Had psychometricians stood upon the firmer ground of experience rather than skating upon the thin ice of speculative metaphysics, they would have had no reason to conclude that the attributes assessed by tests possess quantitative structure and, consequently, no reason to attempt to measure them. 18
Presupposing Thorndike’s Credo, psychometricians could not accept that the attributes they assessed are mere partial orders. If all attributes are quantitative, all attributes involve strict simple orders, hence, within any observed, partially ordered set of response patterns, there must be a subset constituting a strict simple order, reflecting the true order on the relevant quantitative attribute and response patterns deviating from this do so because they contain erroneous responses. In this context, a response to an item is erroneous if, given the person’s cognitive resources, it is inadvertent. Inadvertent responses may occur in two ways: first, a person possessing cognitive resources sufficient for a correct response to an item may nonetheless inadvertently get that item incorrect and second, a person not possessing cognitive resources sufficient for a correct response to an item may nonetheless inadvertently get that item correct. Doubtless, inadvertent responses occur, although this is an empirical issue not much investigated. 19 Thorndike’s Credo, if true, entails that such responses must be present whenever the response patterns observed are a partial order. In this way, metaphysics presumes to answer empirical questions and dictates the interpretation of data, propelling research single-mindedly in an exclusively quantitative direction.
Thorndike’s intention was to nip any criticism of this single-mindedness in the bud. Typically, scientists do not admit to such intentions, the received view being, “Scientific method pursues the road of systematic doubt” (Cohen & Nagel, 1934, p. 394), but E. L. Thorndike (1918) was so sure of measurement’s necessity that he asserted, “What is needed in educational measurement is not the utterance by onlookers of criticisms” (p. 24). By his lights, criticism threatened attainment of “the status and dignity of science” (as cited in Galton, 1879, p. 149). However, because criticism is necessary in science, stifling it actually threatens science’s status and dignity, and when the mainstream colludes, what started simply as an error becomes something pathological because it is entrenched (Michell, 2000, 2008).
The central lemma: Credo-lite (or the psychometricians’ fallacy)
Unfolding his Credo’s implications, E. L. Thorndike (1918) delivered what I call Credo-lite: “We have faith that whatever people now measure crudely by mere descriptive words, helped out by the comparative and superlative forms, can be measured more precisely and conveniently if ingenuity and labor are set to the task” (p. 16). To the extent that the comparative or superlative forms fit psychological attributes, Credo-lite licences those striving for interval scale measurement to proceed without further ado. Without his Credo, order relations implicit in the comparative and superlative forms do not entail quantity: quantitative structure involves both order on degrees and additive relations between them, and the former never implies the latter. Not only is it impossible to deduce quantitative structure from mere order, that an attribute is quantitative cannot be reasonably inferred inductively or abductively either. Thinking that such an inference is valid or reasonable is the psychometricians’ fallacy (Michell, 2009, 2012b).
But given Thorndike’s Credo, quantity follows from order and comments like the following by David Andrich (1988) are typical in psychology and appear justified:
Similarly, those of Thurstone (1928/1959), 21 inaugurating attitude measurement, likewise seemed secure:
We may say about a man, for example, that he is more in favour of prohibition than some other, and the judgment conveys its meaning very well with the implication of a linear scale along which people or opinions might be allocated. (p. 219)
Conjoined with Thorndike’s Credo, order validly entails quantity; detached from his Credo, the inference from order to quantity is fallacious.
The tendency to commit this fallacy is aided by a compelling cognitive illusion noted by David Hume (1740/1888): “any great difference in the degrees of any quality is call’d a distance by a common metaphor. . . . The ideas of distance and difference are, therefore, connected together. Connected ideas are readily taken for each other” (p. 393). When one degree of a quality is greater than another, there is a tendency to interpret the difference via the metaphor of distance. For example, Henri Bergson claimed, “as soon as a thing is acknowledged to be capable of increase or decrease, it seems natural to ask by how much [emphasis added] it decreases or by how much [emphasis added] it increases” (1889/1913, p. 72). Many psychologists would agree, but those thinking this way overlook the fact that in the context of qualitative orderings, “how much?” is a loaded question, presuming unobserved quantity and blind to the invalidity of inferring quantity from mere order.
This cognitive illusion is reinforced by the idea that ordered attributes are measurable on ordinal scales. In ordinal scales, numbers are assigned to different degrees so that order between degrees is reflected by order on the numbers assigned. However, whenever the number assigned to any degree x (call it n(x)) is greater than the number assigned to another, y (call it n(y)), by the laws of arithmetic, it follows that n(x) – n(y) > 0. This inequality is invariant under all admissible scale transformations (i.e., increasing monotonic transformations) and, so, implies that for ordinal scales, numerical differences are empirically meaningful, 22 which appears to imply that there are quantitative differences between degrees of ordered attributes. This is why ordinal scales are often seen as interval scales in waiting.
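One way to see why this appearance misleads is to separate the sign of a numerical difference, which any increasing transformation preserves, from comparisons between the sizes of differences, which it need not preserve. The sketch below uses a hypothetical three-degree assignment and the exponential function as one admissible transformation; the labels and values are assumptions chosen only for illustration.

```python
import math

# Hypothetical ordinal assignment n(x) for three ordered degrees (an assumption).
n = {"low": 1.0, "mid": 3.0, "high": 4.0}

# exp is strictly increasing, so it is an admissible ordinal transformation.
m = {degree: math.exp(value) for degree, value in n.items()}


def ranking(assignment):
    """Degrees sorted from least to greatest assigned number."""
    return sorted(assignment, key=assignment.get)


def lower_gap_is_larger(assignment):
    """Is (mid - low) numerically larger than (high - mid)?"""
    return (assignment["mid"] - assignment["low"]) > (
        assignment["high"] - assignment["mid"]
    )


print(ranking(n) == ranking(m))  # the order survives the transformation
print(lower_gap_is_larger(n), lower_gap_is_larger(m))  # the gap comparison does not
```

Under the first assignment the lower gap is the larger one; under the transformed assignment it is the smaller one, although both assignments represent the same order equally well. Only the sign of each difference, which merely restates the order, is invariant.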
This cognitive illusion is so commonplace that Helmholtz’s (1887/1971) observation, “attributes of objects which can be compared in order to determine which is greater and which is smaller, or whether they are equal, we call quantities” (p. 453), characterizes psychometrics notwithstanding that as Hume’s contemporary, Thomas Reid (1748/1849), warned, “Those who have defined quantity to be whatever is capable of more or less, have given too wide a notion of it” (p. 715). In a quantitative ordering, the degrees differ from one another quantitatively and because any degree is always composed of smaller degrees of the same attribute, they cannot differ from one another qualitatively. This means that if degrees in an ordering differ qualitatively, the ordered attribute cannot be quantitative (Michell, 2012a, 2012b, 2019).
Consider response patterns again. As noted, person X performs better than person Z whenever X gets correct all items Z gets correct plus at least one more. Now, suppose that some person, A, performs better than another, B, and B better than C, and suppose for argument’s sake that in each case, the difference is with respect to performance on only one item. That is, there is one item (call it i) that A answers correctly and B answers incorrectly and another item (call it j) that B answers correctly and C answers incorrectly. Furthermore, suppose that i is more difficult than j (in the sense of requiring greater knowledge, requiring greater skill, or involving more complex strategies). 23 Then, the difference between A and B on the attribute assessed must be some component of the cognitive resources required for a correct response to i, which A possesses and B lacks; and the difference between B and C on the attribute assessed must likewise be some component of the cognitive resources required for a correct response to j, which B possesses and C lacks. But, because these differences involve different particulars of knowledge, different skills, or different strategies, and so forth, the differences between A and B and between B and C on the relevant attribute will be qualitative. Hence, in cognitive tests, an ordering on response patterns typically precludes even the possibility that the cognitive attribute assessed is quantitative. 24 Had they paid sufficient attention to test data’s manifest form and to the general character of quantitative attributes, psychometricians would never have proposed quantitative theories.
However, Credo-lite deflects attention from test data’s manifest form, and more efficiently than Thorndike’s Credo because it claims less. And because of Hume’s cognitive illusion, Credo-lite appears logically necessary: if one degree of a quality is greater than another, it seems only logical to ask “how much greater?” Despite that, Credo-lite is just as much a metaphysical principle as Thorndike’s Credo because it legislates a priori restrictions on existence, it being logically equivalent to the claim that merely ordinal attributes cannot exist. Furthermore, these two metaphysical presuppositions are mutually reinforcing: Thorndike’s Credo entails Credo-lite and the apparent logical necessity of Credo-lite appears to confirm Thorndike’s Credo where it counts most in psychology, that is, in relation to ordered attributes. Through the metaphysical lenses of Thorndike’s Credo and Credo-lite, it must have seemed “that by some new trick of method” psychometrics could “readily be put on a par with the physical sciences” (M. R. Cohen, 1931, p. 350). 25
Primary methodological principle: Interval scales always underlie ordinal data
The value of Thorndike’s Credo and Credo-lite to psychometrics is that, with respect to partially ordered attributes, these metaphysical principles not only entail the existence of a hidden true order (and, hence, of the possibility of an ordinal scale) but also of a quantitative attribute, measurable on an interval scale, at least. Ordinal scales reflect order between degrees, while interval scales reflect quantitative relations between them. S. S. Stevens (1951), who coined these terms, said, “With the interval scale we come to a form that is ‘quantitative’ in the ordinary sense of the word” (p. 27). This is because interval scales, being ratio scales on intervals, entail quantitative structure just as much as do ratio scales.
Initially, Stevens (1946) judged: “the scales used widely and effectively by psychologists are ordinal scales” (p. 679), failing to see that mental test data falls short of the structure necessary for ordinal scales. Psychometricians, however, believed they had transcended ordinal scales by postulating that psychological attributes are distributed “normally” 26 and reasoned that transforming test scores to approximate a normal distribution would deliver interval scale measurements. Francis Galton (1875) had taken the “first halting steps” (Cowan, 1972, p. 509) along this path and once mental testing became established it was routinely applied (e.g., Otis, 1917). So, without evidence of additional structure and thus contradicting the logic of his theory of scales, Stevens (1951) reclassified “standard scores” on achievement tests as interval scales (p. 25), lamely explaining, “the assumption of normality has the advocacy of a certain pragmatic usefulness in the measurement of many human traits” (p. 28). Fifty years earlier, Karl Pearson (1901) had warned, “I can only recognize the occurrence of the normal curve—the Laplacian curve of errors—as a very abnormal phenomenon” (p. 111); and 20 years after that, Stevens’ mentor, E. G. Boring (1920), reviewing psychometric practice, sagely concluded, “it is senseless to seek in the logical process of mathematical elaboration a psychologically significant precision that was not present in the psychological setting of the problem” (p. 33). These cautions fell on deaf ears and, by midcentury, the normal curve was as if deified in the eyes of psychologists and had become the “trick of method” by which it was fancied interval scales emerged from ordinal data.
Francis Edgeworth (1888) introduced the idea that each person’s score contains a random error component, supposedly drawn in each instance from a hypothetical normal distribution of possible errors, which resembled, he said, a “gend’armes hat” (p. 600). Charles Spearman (1904) developed this idea into classical test theory, which became psychometricians’ staple diet. However, “the method of ‘postulating’ what we want has many advantages; they are the same as the advantages of theft over honest toil,” as Stevens liked to quote (1951, p. 14; original in Russell, 1919, p. 71).
A variant of the same idea occurs in item response theories (IRTs), the psychometricians’ new staple. For instance, the Rasch model (e.g., Andrich, 1988) is based on the idea that when people attempt a test of some cognitive ability, (a) each person has a level of the relevant ability, (b) each item has a level of difficulty, (c) the probability of a person getting an item correct increases with their ability and decreases with the item’s difficulty, and (d) the precise value of this probability is determined via a logistic error distribution (empirically indistinguishable from the normal distribution). Specifically, this theory says person, i, attempting item j on occasion k, gets j correct if and only if $\delta_i + \varepsilon_k \geq \delta_j$ (where $\delta_i$ is i’s ability, $\delta_j$ is j’s difficulty and $\varepsilon_k$ is the error on k, randomly sampled on occasion k from a hypothetical logistic distribution of possible errors having a mean of zero). 27 Given a set of response patterns, and assuming the model is true, interval scale measures of $\delta_i$ and $\delta_j$ may be estimated. However, under error-free conditions, when $\varepsilon_k$ is zero, only ordinal scales could result, 28 which means the theoretical ingredient enabling interval scale estimates is the postulated form of the distribution of errors. For each item, this distribution entails the relationship between the relevant ability and the probability of getting the item correct (its so-called “response curve”). Since the true shape of this distribution is unknown, the keystone of the alleged bridge from ordinal data to interval scale measurement in the Rasch model is speculation and even if the relevant attribute is quantitative, as presumed, the Rasch speculation is merely one amongst an indefinitely large field of logical possibilities.
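The role played by the error distribution can be illustrated with a minimal sketch. It assumes a standard logistic error (mean zero, unit scale; the unit scale is an assumption, since the text specifies only the mean), and the ability and difficulty values are hypothetical. Under that assumption the threshold rule yields the familiar logistic response curve, whereas with the error fixed at zero the responses depend only on the order of abilities and difficulties.

```python
import math


def p_correct(ability, difficulty):
    """P(ability + error >= difficulty) for a standard logistic error
    (mean zero, unit scale; the unit scale is an assumption here)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def correct_without_error(ability, difficulty):
    """The same threshold rule with the error term fixed at zero."""
    return ability >= difficulty


abilities = [-1.0, 0.5, 2.0]  # hypothetical values
difficulties = [0.0, 1.0]     # hypothetical values


def cube(x):
    """A strictly increasing relabelling of the scale."""
    return x ** 3


original = [[correct_without_error(a, d) for d in difficulties] for a in abilities]
relabelled = [
    [correct_without_error(cube(a), cube(d)) for d in difficulties]
    for a in abilities
]

print(original == relabelled)  # error-free responses depend only on order
print(p_correct(0.5, 0.0), p_correct(cube(0.5), cube(0.0)))  # probabilities differ
```

Any strictly increasing relabelling leaves the error-free responses untouched, so such data could fix the scale only up to order. The probabilities do change under relabelling, which is why the interval scale estimates rest on the postulated error distribution rather than on the response patterns themselves.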
This situation was recently characterized by Rod McDonald (2013):
The attribute that we first aim to model, and then aim to measure, must be quantifiable in principle. By this I mean that it must have ordinal properties, admitting of “more” or “less.” However, its metric—not only the origin and unit of measurement, but its entire calibration—is not given by data and generally must be imposed by the model. (p. 123)
Test data are partial orders and the theory applied treats these as evidence for interval scale structure, often leaning upon statistical goodness of fit tests for support. However, if there is no evidence independent of these statistical tests that the relevant psychological attribute is quantitative, the hypothesis that it is actually a partial order is always supported more strongly by the data than any quantitative hypothesis ever could be. It is only because metaphysics (Thorndike’s Credo or Credo-lite) rules out, a priori, ordinal hypotheses that quantitative hypotheses are preferred. Attempts to measure using IRTs are driven not primarily by data, but by theories, and metaphysics motivates the theories.
Now, either the relevant attribute is quantitative or it is not. If not, it lacks quantitative structure, in which case the IRT applied is false, and interval scale estimates derived are mock measurements. On the other hand, if the relevant attribute is quantitative, the IRT used is just one amongst an infinite array of possible theories and it is well known that “a given data set that is fit by an IRT model can be fit equally well by another model for which the form of the response curves is a monotonic increasing function of the form specified in the first model” (Jones & Appelbaum, 1989, pp. 25–26). In the absence of independent evidence of quantitative structure, such models at best only support ordinal scales (Heene et al., 2016).
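The point quoted from Jones and Appelbaum can be made concrete with a small Python sketch. The transformation g(θ) = exp(θ) is an illustrative assumption, not taken from the source, and the function names are hypothetical. Two models are compared: one with a logistic response curve on the original scale, and one with a different curve on the monotonically transformed scale. They assign the same probability to every person-item pair, so no set of responses can favour one over the other, yet they disagree about which ability differences are equal.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def model_a(theta, b):
    """Response curve on the original ability scale (Rasch-style)."""
    return logistic(theta - b)


def g(theta):
    """A strictly increasing rescaling of ability (hypothetical choice)."""
    return math.exp(theta)


def g_inverse(phi):
    return math.log(phi)


def model_b(phi, b):
    """Response curve on the rescaled ability phi = g(theta); a different functional form."""
    return logistic(g_inverse(phi) - b)


if __name__ == "__main__":
    thetas = [-1.0, 0.0, 1.0]          # equally spaced on the original scale
    phis = [g(t) for t in thetas]      # same order, unequal spacing
    for t, p in zip(thetas, phis):
        print(t, round(p, 3), model_a(t, 0.0), model_b(p, 0.0))
```

The order of the three persons is the same under both models, but the intervals between them are equal on one scale and unequal on the other. Only the ordinal information is common to all the empirically equivalent models.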
This fact, if not apparent after Boring’s (1920) general analysis, has certainly been apparent since Goldstein and Wood (1989) reviewed the use of IRTs, but their arguments proved no deterrent. Many of those using or constructing mental tests remain oblivious because introductory texts, if they mention controversial aspects of psychometrics, use the fact of controversy as an excuse to proceed as usual in the interim, as recently noted, for example, in the case of rating scales: “Regardless of ambiguities and disagreement, researchers generally treat Likert-type scales . . . as an interval level of measurement” (Furr, 2011, p. 15). Another text recommends: The best procedure would seem to be to treat ordinal measurements as though they were interval measurements. . . . Most behavioural and social science data are ordinal. However, through certain scaling methods and assumptions, it can be considered as interval scaled data. (Kerlinger & Lee, 2000, pp. 638–639)
Some psychometricians attempt to argue around this difficulty. For example, Jum Nunnally (1970) states: a good argument can be made that there are no “real” or “correct” intervals for any measurement scale, but rather that the intervals are established as a matter of convention. . . . The issue is one of which calibration of intervals will prove most useful in the long run. (p. 21)
Nunnally’s “good argument” is that because test scores are ordinal they are only a monotonic transformation away from being interval scale measures and since monotonic transformations make little real difference to outcomes in psychological research (e.g., to product moment correlation coefficients with other variables or to order relations between means), treating test scores as interval scale measures is the most useful way to go. 29 However, if one believed there were no “real” or “correct” intervals, there would be no scientific reason to seek interval scales. In this instance, appealing to usefulness as a criterion puts pragmatism ahead of realism. 30
Most would agree that psychological tests are sometimes useful in serving the interests of researchers, institutions, and individuals. However, since most applications of psychological tests exploit only ordinal relations in test data (as often noted before, e.g., by Cliff & Keats, 2003, and Sijtsma & Molenaar, 2002), the issue of “which calibration of intervals will prove most useful in the long run” (Nunnally, 1970, p. 21) rarely arises. Since applied psychometrics is possible without claiming measurement of any kind, simply by treating test scores as the frequencies they are (as Cronbach & Gleser, 1957, noted), there is no scientific reason to think in terms of measurement.
Susan Embretson (2006) argues slightly differently: Because constructs are theoretical constructions, there is no natural metric. If a person’s position on a construct is to be represented numerically, it is important to justify estimates of the latent construct on the basis of a measurement theory. (p. 51)
By “a measurement theory” she means, amongst others, the Rasch model, which entails a number of scientific virtues flowing from its property of “specific objectivity” (Rasch, 1960). But if “there is no natural metric” (Embretson, 2006, p. 51; i.e., if the relevant attribute is not quantitative), the Rasch model is false and its virtues not entailed. In claiming interval scale measurements and then denying that there are real intervals or natural metrics in psychological attributes, psychometricians, like Wittgenstein (1922), “throw away the ladder after having climbed up on it” (p. 108), rendering their claims groundless.
Conclusion
The gap between data observed and measures claimed is evident when viewed via the Guttman scale concept (Guttman, 1944). For any test, the response patterns are a Guttman scale when and only when for every pair of different response patterns, A and B, either A is correct on every item B is correct on and at least one more, or vice versa. In a Guttman scale, response patterns form a strict simple order and enable construction of an ordinal scale. According to Cliff (1983), “the Guttman scale is one of the very clearest examples of a good idea in all of psychological measurement” (p. 284), but mainstream psychometricians say, “In spite of the intuitive appeal of the Guttman scale it is highly impractical” (Nunnally, 1967, p. 64) because with most tests, the set of response patterns typically contains pairs unordered by the above relation 31 and, so, the set of observed response patterns is a partial order, not a strict simple one.
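The Guttman scale condition stated above can be checked mechanically. The following Python sketch is a minimal illustration: response patterns are represented as tuples of 0 (incorrect) and 1 (correct), a representation and set of function names assumed here for convenience. It reports whether a set of patterns forms a strict simple order and, if not, which pairs are left unordered.

```python
from itertools import combinations


def dominates(a, b):
    """True when pattern a is correct on every item b is correct on, and at least one more."""
    at_least = all(x >= y for x, y in zip(a, b))
    strictly_more = any(x > y for x, y in zip(a, b))
    return at_least and strictly_more


def unordered_pairs(patterns):
    """Pairs of distinct patterns for which neither dominates the other."""
    distinct = sorted(set(patterns))
    return [
        (a, b)
        for a, b in combinations(distinct, 2)
        if not dominates(a, b) and not dominates(b, a)
    ]


def is_guttman_scale(patterns):
    """A Guttman scale leaves no pair of distinct patterns unordered."""
    return not unordered_pairs(patterns)


if __name__ == "__main__":
    scale = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]
    partial = [(1, 0, 0), (0, 1, 0), (1, 1, 0)]
    print(is_guttman_scale(scale))
    print(is_guttman_scale(partial), unordered_pairs(partial))
```

In the second set, the patterns (1, 0, 0) and (0, 1, 0) each contain a correct item the other lacks, so neither can be placed above the other. This is the kind of pair that makes observed test data a partial order.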
What follows from the perceived “impracticality” of the Guttman scale concept, however, is the following conclusion: the structure of observed response patterns on most tests fails the requirements even for an ordinal scale. Now, if the structure of test data fails the requirements for ordinal scales, why would anyone think it could sustain interval scales? It might be half plausible to suppose that, but for errors, the observed partial order masks an underlying strict simple order (only half plausible because without knowing the incidence of erroneous responses that supposition is speculation); but it is highly implausible to conclude that the observed partial order masks an underlying quantitative structure because not only is the incidence of error unknown, but also there is no good reason to believe that quantitative attributes underlie test data. The gap between partial orders and interval scales is significant, but psychometricians fancy it is bridged by metaphysics, when what is required is science. Apparently, intoning Thorndike’s Credo induces the delusion that quantitative structures underlie partial orders and slipping a “gend’armes hat” over one’s eyes conjures visions of interval scales where to the unobstructed view only partial order is evident.
While all inquiry rests on metaphysical presuppositions and there is nothing anomalous per se about accepting metaphysical input, such input is counterproductive when it presumes structure not evident in data. Metaphysics overreaches when it goes beyond formal features of situations and seeks to impose material content. It might not be true, as Lord Kelvin said, that “mathematics is the only true metaphysics” 32 (Thompson, 1910, p. 1124), but it is true that those who specified the mathematical structures that attributes may possibly display, such as the structure of classifications, orders, and quantitative structures, were doing metaphysics, for these specifications are of the formal character of attributes. On the other hand, the issues of which attributes possess which structure are material (i.e., empirical) issues. When a metaphysical doctrine, such as Pythagoreanism, answers empirical questions, it presumes upon nature. Metaphysical realism (Armstrong, 2010; Michell, 2005) describes the different structural forms attributes may have and leaves discovery of which is present in any empirical context to observation. This accords well with Galileo’s actual advice: “We must not ask nature to accommodate herself to what might seem to us the best description and order, but must adapt our intellect to what she has made, certain that such is best and not something else” (as quoted in Crombie, 1994, p. 45).
However, it may be objected that where hypotheses come from is irrelevant. If they come from metaphysics, even from Pythagorean metaphysics, so what? The issue of importance in science is “are hypotheses true?” That is so, but hypotheses must be discovered to be true, not presumed true a priori. This takes us to the heart of the method common to all science. Percy Bridgman (1955), psychology’s erstwhile methodological guru, 33 once said, “scientific method, as far as it is a method, is nothing more than doing one’s damnedest [sic] with one’s mind, no holds barred” (p. 535) and the dominant contemporary view agrees that there is no one, exclusively scientific, method (Woodcock, 2014). However, there is a universal method applying to all intellectual pursuits, science included, namely, logical criticism (Cohen & Nagel, 1934), and this method restrains the impulse to do “one’s damnedest with one’s mind, no holds barred” (Bridgman, 1955, p. 535), for in science some “holds” are barred (e.g., presuming hypotheses true a priori). Logical criticism, applied to the problem of explaining the partial orders constituting performances on ability tests, does not, in the first instance, lead to quantitative hypotheses. Instead, proposing a causal role for partially ordered cognitive structures composed of sets of cognitive resources will always be sufficient. 34
Thirty years ago, Snow and Lohman (1989) observed: The evidence from cognitive psychology suggests that test performances are comprised of complex assemblies of component information-processing actions that are adapted to task requirements during performance. The implication is that sign–trait interpretations of test scores and their intercorrelations are superficial summaries at best. At worst, they have misled scientists, and the public, into thinking of fundamental, fixed entities, measured in amounts. (p. 317)
The quantitative attributes conjured by psychometricians are incommensurate with the concepts of modern cognitive psychology because, while cognitive scientists abduced theories from data, bottom-up, through considering the knowledge required to solve problems of an intellectual kind, psychometricians, ignoring their data’s fine grain, imposed theories top-down in accordance with the metaphysical conceit that all explanatory concepts fit the procrustean mould of quantitative attributes found in physics.
Footnotes
Acknowledgements
This paper is based on a lecture presented at the University of Minnesota, Minneapolis, USA, December 11, 2018 and a contribution to a Workshop on Measurement in Cognitive Science at Ruhr-University Bochum, Germany, June 6, 2019. My thanks to Niels Waller, Leslie Yonce, and Insa Lawler.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
