Abstract
This response to Humphry (2013) and Sijtsma (2012) is confined to three issues: the nature of the Rasch paradox, the relevance of the theory of conjoint measurement to psychometrics, and the relationship between test items on the one hand and the character of the attributes they assess on the other. First, contrary to Sijtsma’s view, it is argued that the Rasch model does involve a genuine paradox. Second, typical of psychometricians generally, both Humphry and Sijtsma misunderstand the role the theory of conjoint measurement is able to play in psychometrics. It is argued that in conjunction with item response theory models, it has a significant role. Finally, complementary to Humphry’s and Sijtsma’s insistence upon the importance of theories of the attribute, I argue that features of attribute structure can be inferred from the character of test items and briefly sketch an argument that in the first instance at least, the attributes tests assess contain a feature logically incompatible with quantitative structure.
Humphry (2013) and Sijtsma (2012) commented upon aspects of my critique of psychometrics. I thank them for providing an occasion to reflect again upon some of these issues. Beneath our disagreements over the Rasch paradox and the relevance of conjoint measurement, there are lines of convergence in our thought. One such is discernable in Sijtsma’s conviction that “meaningful measurement is possible only if enough is known about the attribute so as to justify its logical operationalization into prescriptions from which a measurement instrument can be developed” (2012, p. 787). In a similar vein, Humphry emphasises the importance of developing theories of the attribute to be measured, bolstering this contention with examples of the role it played in physics. That psychometricians need, as a starting point, theories of the relevant attributes capable of entailing principles for the construction of test items has been a theme of my critique of psychometrics for a quarter of a century (Michell, 1990). As I will show later, this theme has important consequences.
The Rasch paradox
However, first, consider the Rasch paradox (Michell, 2008a, 2008b). It is as follows: the Rasch model implies that eliminating all error factors from observational conditions dramatically decreases the precision of our observations in psychological testing. This is paradoxical because it is normally thought that eliminating error factors improves the precision of observations. However, Sijtsma maintains that there is no paradox here.
Perhaps this is no more than a semantic disagreement. Literally, a paradox arises when something opposed to generally accepted opinion proves true. The term derives from the Greek paradoxos meaning opposed to existing notions. The English term, paradox, also has a strict, technical meaning in logic (a paradox there existing when an apparently true proposition appears to entail a contradiction). Alongside that meaning there is in common parlance a spectrum of colloquial meanings ranging from apparent contradictions to claims that are merely surprising in the light of accepted wisdom. In this respect, the term paradox is like measurement, which also has a strict scientific meaning (estimation of the ratio between a magnitude and a unit) as well as colloquial meanings (such as in Stevens, 1946). Those who use colloquial senses of measurement can hardly object when colloquial senses of paradox are also used.
The Rasch paradox is clearly visible if the psychometric concept of error is explicated. In the context of ability testing, where a person’s response to any item may be correct or incorrect, a response is an error if and only if either the person’s cognitive resources are sufficient to sustain a correct response but the response actually made is incorrect or the person’s cognitive resources are not sufficient to sustain a correct response but the response actually made is correct. In either case the response is erroneous in the sense that it gives a false indication of the person’s cognitive state. Errors occur in testing because extraneous, unknown, or uncontrolled causes sometimes affect the response given.
The Rasch model, like other item response theories (IRTs), is based upon the premise that for any one-dimensional test, the range of cognitive resources underlying performance is a quantitative attribute. It is then natural to suppose that each person possesses a specific cognitive level and, in a context where no error factors are present, each test item requires some definite cognitive level or better for a non-erroneous correct response. However, it is believed that error factors are always present and, so, parametric models may be seen as proposing that on any occasion, the probability of a person succeeding on an item is a function of the person’s cognitive level augmented or diminished by an error component randomly sampled from a distribution of possible errors, which is (approximately) normal with a mean of zero. A person responds correctly whenever this sum at least matches the item’s difficulty and incorrectly otherwise. Consequently, the probability of a person succeeding on any item is an index of the magnitude of the quantitative difference between the person’s cognitive level and the item’s difficulty. Given the truth of this hypothesis, measures of a person’s cognitive level and item’s difficulty can be estimated from a body of appropriate data. However, if error factors were not present, measures could not be estimated and the only relations entailed would be order relations, the test then being a Guttman scale (i.e., in relation to any item, a person’s response would tell whether the item’s difficulty exceeded the person’s cognitive level). The paradox then is that were it possible to improve conditions of observation by removing all error factors, this would adversely affect the resulting observations, transforming them from quantitative to merely ordinal. This is a paradox because it seems natural to suppose that improving conditions of observation would always lead to improved observations.
I have never concluded from this paradox that “the Rasch model … does not have a metric scale” (Sijtsma, 2012, p. 797), only that without errors in the data, metric estimates would not be possible.
Humphry states that removal of all error factors is merely a sufficient but not a necessary condition for obtaining a Guttman scale, but this is not strictly relevant to the paradox itself. The paradox remains a theoretical possibility, as he admits, and that is all I claim. Sijtsma offers a different objection, viz., that it is a mistake to see the Rasch model (or other IRT models) as adding random error to the Guttman data structure. However, this is not my interpretation. I interpret the Rasch model as adding random error to the Guttman model, not to a Guttman data structure.
A Guttman data structure has this form. Suppose there are n items in the relevant test (where n is a finite integer) and for purposes of recording each person’s responses to the n items, let the items be listed in order from least to most difficult and let correct responses be recorded as 1 and incorrect responses as 0. The Guttman model requires that each person’s response pattern consist of an ordered sequence of m 1s (where m is an integer such that 0 ≤ m ≤ n) followed by n − m 0s. A Guttman data structure occurs whenever this requirement is satisfied for all people doing the test. Adding random error to any such data structure would simply lead to another data structure and a data structure is never a model. A data structure is a description of the responses given by a set of people; a model or theory is an account of the processes producing data structures.
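The requirement just stated can be checked mechanically. The following is a minimal sketch in Python, assuming that responses are coded 0 and 1 and that items are already listed from least to most difficult; the function names are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def is_guttman_pattern(responses: Sequence[int]) -> bool:
    """Return True if the pattern is m ones followed by n - m zeros.

    Assumes the items are ordered from least to most difficult and that
    1 records a correct response and 0 an incorrect one.
    """
    m = sum(responses)  # number of correct responses
    n = len(responses)
    return list(responses) == [1] * m + [0] * (n - m)


def is_guttman_structure(data: Sequence[Sequence[int]]) -> bool:
    """Return True if every person's pattern satisfies the Guttman requirement."""
    return all(is_guttman_pattern(row) for row in data)
```

For example, the pattern 1, 1, 0, 0 satisfies the requirement, whereas 1, 0, 1, 0 does not. Note that such a check describes a data structure only; it says nothing about the process that produced the responses.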
The idea of seeing the Rasch model as the outcome of adding random error to the Guttman model was suggested to me by Sutcliffe (1986), who was following earlier suggestions of Lord (1980) and Lord and Novick (1968) regarding IRT models. As Sutcliffe notes, IRT models are just one variation upon the theme of treating error as a metric index, a theme going back to Fechner (1860), and first made explicit in the Anglophone psychological literature by Urban (1910). He also notes that Boring (1920) and Stevens (1960) both criticised this approach to psychological measurement, so neither his criticisms nor mine are original.
Mathematically, the relationship between the models may be expressed as follows. Suppose that a person, p, whose level of the relevant cognitive resources is θp, attempts a test item, i, of difficulty level, δi, on some occasion, j. The Guttman model states that p will respond correctly to i on occasion j if and only if θp ≥ δi.
Random error may be added to the Guttman model by further supposing that on each occasion, p’s cognitive resources are augmented or diminished by an amount, ej, drawn at random from a distribution of possible errors, with a mean of zero, in which case p will respond correctly to i on occasion j if and only if θp + ej ≥ δi.
Sutcliffe notes that if the probability density function from which ej is sampled is the logistic density (a squared hyperbolic secant function), then the Rasch model results. If, on the other hand, it is the normal probability density function, then Lord’s IRT model ensues. These two are practically indistinguishable relative to test data, but they have different theoretical virtues and it is after considering these that many psychometricians form a preference for the Rasch model. Other density functions are theoretically possible but not usually considered.
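The step from an error distribution to a response function can be written out explicitly. What follows is a sketch only, assuming that the errors are independent draws from a continuous distribution that is symmetric about zero; the symbol F for its distribution function and the choice of unit scale are assumptions introduced here for illustration. The standard logistic distribution function can be written with the hyperbolic tangent, which is where that function enters.

```latex
\begin{align*}
P(\text{correct})
  &= P(\theta_p + e_j \geq \delta_i)
   = P(e_j \geq \delta_i - \theta_p)
   = 1 - F(\delta_i - \theta_p)
   = F(\theta_p - \delta_i), \\[4pt]
\text{logistic error: } F(x) &= \frac{1}{1 + e^{-x}}
   = \tfrac{1}{2}\bigl(1 + \tanh(x/2)\bigr)
   \;\Longrightarrow\;
   P(\text{correct}) = \frac{e^{\theta_p - \delta_i}}{1 + e^{\theta_p - \delta_i}}
   \quad \text{(Rasch form)}, \\[4pt]
\text{normal error: } F(x) &= \Phi(x)
   \;\Longrightarrow\;
   P(\text{correct}) = \Phi(\theta_p - \delta_i)
   \quad \text{(normal ogive form)}, \\[4pt]
\text{no error } (e_j = 0): \quad
   P(\text{correct}) &=
   \begin{cases}
     1 & \text{if } \theta_p \geq \delta_i, \\
     0 & \text{otherwise}
   \end{cases}
   \quad \text{(Guttman form)}.
\end{align*}
```

The fourth equality in the first line uses the assumed symmetry of the error distribution about zero.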
This is how the Rasch model results from adding random error to Guttman’s model and it is clear that if ej = 0 on every occasion, Guttman’s model results from subtracting random error from Rasch’s. This subtraction provides the locus of the Rasch paradox. However, this paradox was never a matter of interest in its own right. Attention was drawn to it because in a context where little is known about the incidence or distribution of error and where there is no independent evidence that the relevant attribute is quantitative, attempting to estimate measures from errors indicates a reckless credulity. In over a century of psychometric research, there have been few serious investigations into the character of error in testing and, yet, psychometricians are ready to accept models that explicitly stipulate the precise mathematical form error takes in relation to the other great psychometric unknown, viz., the quantitative structure of the attribute underlying performance on the test. Boring’s admonition that “It is senseless to seek in the logical process of mathematical elaboration a psychologically significant precision that was not present in the psychological setting of the problem” (Boring, 1920, p. 33) was an early warning wasted.
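The subtraction of error can be illustrated by simulation. The following Python sketch assumes a logistic error distribution with a mean of zero, sampled by the inverse distribution function; the function names and the `error_scale` parameter are hypothetical devices for switching error on and off, not part of any published procedure.

```python
import math
import random


def rasch_probability(theta: float, delta: float) -> float:
    """Rasch success probability for cognitive level theta and difficulty delta."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def simulate_response(theta: float, delta: float, rng: random.Random,
                      error_scale: float = 1.0) -> int:
    """Return 1 if theta + error >= delta, else 0.

    The error is a draw from a logistic distribution with mean zero (an
    assumption). Setting error_scale to 0 removes error entirely, which
    leaves the Guttman rule theta >= delta.
    """
    if error_scale == 0.0:
        error = 0.0
    else:
        # Inverse-CDF sampling; u is clamped away from 0 and 1 to keep log finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        error = error_scale * math.log(u / (1.0 - u))
    return 1 if theta + error >= delta else 0


def success_rate(theta: float, delta: float, trials: int, rng: random.Random,
                 error_scale: float = 1.0) -> float:
    """Proportion of correct responses over repeated simulated occasions."""
    hits = sum(simulate_response(theta, delta, rng, error_scale)
               for _ in range(trials))
    return hits / trials
```

With error present, the simulated success rate varies smoothly with the difference between theta and delta and approximates `rasch_probability(theta, delta)`, so that, under the model's assumptions, the magnitude of the difference is recoverable. With `error_scale=0.0` every rate is exactly 0 or 1, and the responses reveal only whether theta is at least delta, that is, order alone.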
I am not against using error as a metric index. Where the relevant attribute is known to be quantitative and its relationship to errors also known, error can be validly employed as such an index. However, that use requires knowing facts not known in psychometrics. The Rasch model is promoted as a method of scientific measurement, but those who use it do not look outside the model to see whether it matches reality. In fact, they are actively discouraged from doing so. The fit of data to the Rasch model is promoted as providing a sufficient condition for measurement. The model is used as a measurement criterion because its formal properties are thought to be such that when data fits the model, the model must deliver measurements. Its users seem unaware that this is only true when the relevant attribute is quantitative. Statistical fit of data to the model is alone insufficient to establish this. As Roberts and Pashler (2000) warn, even a good statistical fit to a model is poor evidence for the model’s truth. Half a century ago, Skinner derided psychologists besotted with mathematical models for only wanting a “paper doll” to call their own instead of a “fickle-minded real live girl” (Skinner, 1959, p. 251), a relationship now instantiated by psychometricians married to IRT models. Attachment to their “paper doll” deflects attention from the “fickle-minded” reality underlying test performance. In the case of the Rasch model, if the relevant attributes are not quantitative, the paper doll “measures” are counterfeit currency.
This is not a complaint against the use of IRT models. It is a recommendation for their critical as opposed to credulous use. As Sijtsma emphasises, “an IRT model (as any formal model) is an idealization; hence, the assumptions are never entirely true for the attribute of interest” (2012, p. 796), from which it follows that critical psychometricians will be more interested in the “fickle-minded” phenomena that their models are tested against than in the models themselves. They will see any model as providing at best mere glimpses through a glass darkly. In investigating whether psychological attributes are quantitative, IRT models are just one tool at our disposal and because all tools are fallible, the optimal strategy is to use as many as possible. It is for this reason that I am dismayed by Sijtsma’s and Humphry’s cheap dismissals of the theory of conjoint measurement.
The relevance of conjoint measurement to psychometrics
The theory of conjoint measurement was first brought to light in psychology by Luce and Tukey (1964) and given a more thorough exposition in Krantz, Luce, Suppes, and Tversky (1971). These authors promoted a sophisticated version of the representational theory of measurement (as opposed to Stevens’, 1951, cruder, operational version) and presented conjoint measurement within this framework. Subsequently, I gave an introduction to conjoint measurement, showing its relevance to some simple psychological measurement models (Michell, 1990). There and elsewhere (e.g., Michell, 2007a, 2007b), I made no secret of my objections to the representational theory of measurement and preference for a realist alternative. Because I accept conjoint measurement and reject the representational theory within which Luce and others had couched it, Humphry claims to find a “tension” in my thought. Let me say emphatically that there is no tension here. The theory of conjoint measurement is a set of mathematical theorems about relations between mathematical structures and as with any such theorems, they may be validly applied in any context where structures of the relevant kind are present. Philosophical theories are irrelevant to the application of mathematical theorems.
However, more dismaying than this is the fact that both Humphry and Sijtsma dismiss the theory of conjoint measurement as a viable research tool. It is dismaying because this theory and IRT are not opposed and no one is forced to choose either one or the other. As research tools, they provide independent glasses through which the “fickle-minded” phenomena of testing might be glimpsed. It is true that they are often presented as alternative models for measurement, but this is a misunderstanding.
To begin with, the theory of conjoint measurement is not a model for measurement. It is part of the body of knowledge known as measurement theory, but this ambiguous label has been a source of confusion. Measurement theory does not tell us how to go about measuring. Measurement theory might more aptly be called measurability theory, for it tells what it is for attributes to be measurable. Hölder (1901) used the expression, “die Lehre vom Mass,” literally, the theory of measure, which I translated as theory of measurement conforming to established usage (Michell & Ernst, 1996, 1997). But Hölder was not telling us how to go about quantifying attributes. He specified conditions upon an attribute under which measures exist. He described a kind of structure sufficient for ratios to possess the structure of the positive real numbers. He did not prescribe how we might come to know that an attribute possesses this kind of structure. Henceforth, I use the expression, logic of measurement, to describe this enterprise.
The theory of conjoint measurement is part of the logic of measurement. It describes conditions upon an ordering of an attribute indicative of underlying quantitative structure when that attribute is a non-interactive function of just two others. However, the ways in which this theory might be applied in relation to particular attributes are details for the relevant scientists to determine, given their existing knowledge of the relevant attributes and their capacity to control extraneous variables. In itself, the theory of conjoint measurement contains no prescriptions for its application. However, whenever any substantive theory of an attribute displays the kind of relationship between attributes portrayed in the theory of conjoint measurement, certain consequences (e.g., satisfaction of the cancellation conditions) are entailed and it may be possible in certain circumstances to test these against data.
In relation to this claim, there is a further point to clarify. The theory of conjoint measurement, as a theory about the structure of attributes, is not a theory about data. Data, because it derives from observations and because all observations are fallible, may contain error. The theory of conjoint measurement says nothing about error and, so, is not a theory about observations. It is a theory about the relevant attributes, as they are or might be, independent of observation. It is a misunderstanding to criticise it for not taking account of error in data. It does not do this because it is not a theory about data.
Nevertheless, this misunderstanding has been a cause of its marginalisation within psychometrics since 1964, which says more about psychometricians than it says about the theory, as Cliff (1992) observed. Psychometricians characteristically prefer theories about data and especially probabilistic theories incorporating “random error.” They see “deterministic” theories, as conjoint measurement is sometimes mistakenly described, as falling short. However, the contrast between probabilistic and deterministic theories is one between different theories about causal processes and the theory of conjoint measurement is not a theory about causal processes. Long ago, John Stuart Mill (1843) distinguished uniformities of coexistence from uniformities of succession, the former being generalisations concerning structures and the latter generalisations about processes. Uniformities of coexistence say nothing about how things work causally. They describe how things are structured independently of causal processes, including our observations. For example, the axioms of quantity are uniformities of coexistence (Michell, 1990). It was Campbell’s (1920) error to mistake them for uniformities of succession, a mistake endlessly repeated since (e.g., Nagel, 1932). Similarly, the axioms of conjoint measurement are uniformities of coexistence. In dismissing the theory of conjoint measurement, psychometricians have overlooked this fundamental distinction and assumed that conjoint measurement is, like IRT, a theory about processes. This is the source of their mistake, but it is not all there is to it, for even theories about processes need not be probabilistic or even about data. In science, theories come in more than one form and a nuanced appreciation of theories is sensitive to the full range of differences.
Physics, for example, contains many “deterministic” laws, which, incidentally, explain much better than any probabilistic psychological counterpart. It is sometimes said this is because data in physics is not contaminated by error to the extent it is in psychology. However, this was not true in earlier centuries when many physical laws were first formulated. Laws such as Galileo’s law of free fall, Mersenne’s laws of vibrating strings, Boyle’s law, Hooke’s law, and Newton’s quantitative laws are idealisations, describing relationships between quantitative attributes, not as measured, but as they were thought to be independent of observation (Roche, 1998). The difference between probabilistic and deterministic laws concerns not the relative magnitude of error, but what the laws are intended to describe. These physical laws were never intended to chart the relationships between quantitative attributes as measured by scientists. They were intended to display the relationships between attributes as they exist in nature. Errors of measurement were not seen as something to be incorporated into the laws themselves. They were seen as something to be addressed outside their scope and to be explained by the presence of factors extraneous to the law and beyond the scientist’s control. It was only later, with the advent of Mach’s positivism, that scientific laws came to be seen by some philosophers as summaries of experience (i.e., measurements) rather than descriptions of nature.
There is nothing intrinsically non-realist about uniformities of coexistence or deterministic theories and they have advantages that theories of data lack. When the theory of conjoint measurement is applied to any psychometric hypothesis that is thought to instantiate the relevant kind of relationship between attributes, its implications in the first instance concern order relations upon one of those attributes and only secondarily and always only in conjunction with other assumptions or theories does it entail explicit predictions about data. Of course, every scientist knows that data may contain error and different approaches to this fact are possible when it comes to testing predictions. The simplest approach when testing a hypothesis is to consider which of two possibilities is the more plausible in the observational circumstances under which data was collected: to regard the relevant hypothesis as true and departures from predictions as errors or to regard the relevant hypothesis as false. If neither seems more plausible further research is necessary. This approach has the advantage of not implicating further assumptions in the evaluation of the relevant psychometric hypothesis, but it has the disadvantage of treating errors informally and outside of the scope of the hypothesis tested. On the other hand, if it is plausible to make assumptions about the form that errors take under the assumption that the relevant hypothesis is true, then appropriate statistical tests of conjoint measurement cancellation conditions may be carried out (e.g., Domingue, in press; Karabatsos, 2006). This has the advantage of enabling statistical tests of predictions, but the disadvantage of yoking the relevant hypothesis to particular assumptions about the form of error.
Any scientist attempting to look through the data via the dark glass of theory to the underlying “fickle-minded” attributes while ignorant of the frequency and distribution of error could profitably use both approaches.
As I have noted, many psychometricians do not seem interested in investigating whether the attributes they aspire to measure are really quantitative (Michell, 2000). Instead, they are primarily interested in claiming that they can already measure such attributes. This is why they see the contrast between the Rasch model and the theory of conjoint measurement as a choice between competing models. On the other hand, were they interested in investigating whether the relevant attribute is quantitative, these theories would be seen as complementary.
According to the Rasch model, the probability of a person responding correctly to a test item is a non-interactive function of the person’s cognitive level and the item’s level of difficulty. This is the kind of relationship to which the theory of conjoint measurement applies. Keats (1967) was the first to note this and a small number of psychometricians have applied it (e.g., Perline, Wright, & Wainer, 1979). However, until recently, such applications were rarely accompanied by a sensible rationale. Part of the problem is a lack of interest in developing and investigating theories of the attribute underlying performance. Were that of interest, discovering the structure of such attributes (e.g., are they merely ordinal or quantitative?) would be paramount. Testing the statistical fit of data to the Rasch model is an insensitive way of investigating this. Sijtsma lists the assumptions of the Rasch model, one of which is unidimensionality, and it is clear that these cannot be tested easily in isolation. Even when data fits the model, it is far from clear that the relevant attribute is quantitative, for, as noted, good statistical fit does not entail the truth of the model. Such results are never more than a guide and a critical approach necessitates investigating matters further. Testing the cancellation conditions of conjoint measurement means looking at conditions that are separately sensitive to ordinal (single cancellation) and additive structure (double cancellation). Testing these provides additional information and allows the investigator to begin to build a picture of the structural conditions satisfied by the attributes involved (e.g., Kyngdon, 2011).
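The two cancellation conditions can be stated as checks upon a persons-by-items matrix of response probabilities. The following Python sketch is illustrative only: it treats the matrix entries as error-free values (it is not a statistical test and makes no allowance for sampling error), it assumes a rectangular matrix with rows for person levels and columns for items, and all function names are hypothetical.

```python
import math
from itertools import permutations
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def rasch_matrix(thetas: Sequence[float], deltas: Sequence[float]) -> List[List[float]]:
    """Rasch probabilities for each person level (row) and item difficulty (column)."""
    return [[1.0 / (1.0 + math.exp(-(t - d))) for d in deltas] for t in thetas]


def single_cancellation(m: Matrix) -> bool:
    """Independence: the order of rows is the same in every column, and
    the order of columns is the same in every row."""
    rows, cols = range(len(m)), range(len(m[0]))
    rows_ok = all(
        all(m[a][y] >= m[b][y] for y in cols)
        for a in rows for b in rows
        if any(m[a][x] >= m[b][x] for x in cols)
    )
    cols_ok = all(
        all(m[a][x] >= m[a][y] for a in rows)
        for x in cols for y in cols
        if any(m[a][x] >= m[a][y] for a in rows)
    )
    return rows_ok and cols_ok


def double_cancellation(m: Matrix) -> bool:
    """For all distinct rows a, b, c and distinct columns x, y, z:
    if m[a][y] >= m[b][x] and m[b][z] >= m[c][y], then m[a][z] >= m[c][x]."""
    rows, cols = range(len(m)), range(len(m[0]))
    for a, b, c in permutations(rows, 3):
        for x, y, z in permutations(cols, 3):
            premises = m[a][y] >= m[b][x] and m[b][z] >= m[c][y]
            if premises and not m[a][z] >= m[c][x]:
                return False
    return True
```

A matrix generated by `rasch_matrix` passes both checks, as an additive structure must. By contrast, the matrix `[[0.1, 0.4, 0.5], [0.3, 0.7, 0.9], [0.6, 0.8, 0.95]]` passes single cancellation (its rows and columns are consistently ordered) yet fails double cancellation, which illustrates how the two conditions are separately sensitive to ordinal and additive structure.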
It should not be thought, however, that such tests exhaust the available possibilities. In general, of course, there are only three ways to consider the truth of any proposition: by direct observation, by making inferences from truths already known, and by conjoining the proposition with others and testing predictions entailed by the joint set. This last is often called the hypothetico-deductive method and it currently takes pride of place in psychological research. Testing fit to the Rasch model and testing predictions from conjoint measurement theory both fall into this methodological category. Since the psychological attributes under investigation in psychometrics are theoretical in character, they are not generally open to direct observation, so that approach is unavailable. However, as Haig (2013) has recently reminded us, the second approach mentioned above (that of making inferences from truths already known) includes the practice of making inferences directly from the phenomena and this has been largely overlooked in psychology. Given all the difficulties in arriving at worthwhile theories of the attributes that psychometricians aspire to measure, it is self-defeating to pass up any reasonable avenue of investigation. It may be that if the phenomena involved in the psychological setting of testing are closely analysed, inferences can be made about the structure of the attributes tests assess.
The psychological setting of psychometrics
In his final section, Sijtsma casts his approach to psychometrics as an alternative to any based exclusively on conjoint measurement theory or item response theory (Sijtsma, 2012). As already stressed, these are not alternatives and seeing them as such is counterproductive. This is not to criticise Sijtsma’s interest in investigating the character of the attributes assessed by tests. It is a plea for a more comprehensive approach. In fact, Sijtsma’s case studies are highly instructive in this regard. They display the intimate relation between test items and the character of the attribute assessed and they illuminate the tenuous place of measurement in psychological testing.
Sijtsma illustrates how test items may be entailed by a well-articulated theory of the relevant attribute. However, this relationship holds in both directions. Just as a detailed theory of the character of the attribute entails the character of items required for its assessment, so the set of items constituting a test imply something of the character of the attribute assessed. Here, there is not space to unfold this point in detail and the following explication is in general terms and confined to the case of cognitive tests containing only binary items.
In such a case, the phenomena of testing are these: people attempt a set of items and a person’s response to any item, correct or otherwise, is either a veridical or erroneous indicator of the person’s actual cognitive state. Relative to any item, the class of cognitive resources sufficient for a veridical correct response can be inferred by specifying the cognitive demands of the item. In order to respond correctly veridically, a person needs to know certain specific things, to have certain specific skills, and be able to use certain specific problem-solving strategies. In relation to any test, in principle, the cognitive resources required can be specified for each item. In the first instance, it is the extent of the person’s mastery of the totality of these resources that is the attribute assessed by the test. I will call it the primary attribute assessed.
Beyond this primary attribute, it is often supposed that a test also assesses the person’s level on one or more higher-level theoretical constructs, such as the various mental abilities commonly invoked. Despite the logical defects of the construct concept (Michell, 2013), this supposition is widely assumed true. However, two caveats are relevant. First, after more than a century of invocations, psychologists have failed to specify the intrinsic character of any ability and remain fixated at the level of merely dispositional characterisations, such as, for example, abstract reasoning ability, verbal ability. Dispositional characterisations only specify what abilities do, not what they are. If their intrinsic character is unknown, whether they are quantitative is likewise not known. Hence, claims to measure them are fatuous. Second, the minimal kind of structure required of abilities is only that necessary to account for people’s levels of mastery of the cognitive resources involved in what I have called the primary attributes. As a general principle, in order to account for individual differences on any primary attribute, the construct invoked does not require a structure more complex than that of the primary attribute itself. Thus, if the structure of the primary attribute is not quantitative, quantitative structure is not required in the construct, that is, in this case, in the putative mental abilities.
So, what then is the structure of primary attributes? This will vary from test to test, as Sijtsma’s case studies illustrate. In general terms, the structure of the primary attribute can be inferred from the content of the cognitive resources sufficient for veridical correct responses to the items constituting the relevant test. For any test items, j and k, the cognitive resources, Cj, sufficient for a veridical correct response to j indicate a higher degree of the relevant attribute than those, Ck, sufficient for a veridical correct response to k if and only if Cj includes Ck and more besides. Such a relation, because it is a special case of class inclusion, is transitive and asymmetric and so will constitute a strict partial order on any primary attribute, although in extreme cases it defines two special structures. At one extreme is the case where the cognitive resources required by any pair of items within a given test never stand in this relation. Then the primary attribute is no more than a mere classification. If the test contains n items, because each response is either correct or incorrect, the primary attribute will involve in principle 2^n classes into which people doing the test may be classified.
At the other extreme is the kind of case where for the relevant primary attribute, the relation is not just transitive and asymmetric but is also connected (i.e., for each pair of items, j and k, in the test, either Cj indicates a higher degree than Ck or vice versa). Then the complete set of cognitive resources involved in all items on the test constitutes a strict simple order and if there are n items and no responses are errors, the people doing the test fall into a hierarchy of n+1 ordered classes. This would be a Guttman scale, but the presence of erroneous responses would entail a partial order approximating this strict simple order, which could mean a data structure that an IRT model might easily fit.
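The two extremes, and the intermediate case, can be distinguished by representing the cognitive resources sufficient for each item as a set and testing proper inclusion between sets. The following Python sketch rests on that representation, which is an assumption made for illustration; the item labels, resource labels, and function names are all hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, Mapping, Optional


def classify_structure(resources: Mapping[str, FrozenSet[str]]) -> str:
    """Classify the ordering that proper inclusion induces on the items.

    `resources` maps each item to the set of cognitive resources taken to
    be sufficient for a veridical correct response to it. Cj indicates a
    higher degree than Ck when Cj includes Ck and more besides, i.e. when
    Cj is a proper superset of Ck.
    """
    pairs = list(combinations(resources.values(), 2))
    # For Python sets, '>' tests for a proper superset.
    comparable = [cj > ck or ck > cj for cj, ck in pairs]
    if not any(comparable):
        return "classification"       # no pair of items stands in the relation
    if all(comparable):
        return "strict simple order"  # the relation is connected
    return "strict partial order"


def error_free_classes(resources: Mapping[str, FrozenSet[str]]) -> Optional[int]:
    """Number of classes of people at the two extremes described in the text.

    Returns 2**n for a mere classification and n + 1 for a strict simple
    order; returns None for intermediate partial orders, which this
    sketch does not attempt to count.
    """
    n = len(resources)
    kind = classify_structure(resources)
    if kind == "classification":
        return 2 ** n
    if kind == "strict simple order":
        return n + 1
    return None
```

For instance, three items whose resource sets are nested one inside another form a strict simple order with four ordered classes, whereas three items with pairwise non-nested resource sets yield a classification with eight classes in principle.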
Such an ordered structure on the class of cognitive resources is the nearest that a primary attribute can come to a quantitative structure. It is not quantitative. It is merely ordinal, while quantitative structure is both ordinal and additive. Furthermore, the structure of the primary attribute contains a feature incompatible with quantitative structure. This can be explained as follows. Given any three magnitudes, x, y, and z, of a quantitative attribute, where x > y and y > z, the difference between x and y must be greater than, less than, or equal to the difference between y and z. This is a necessary condition for quantitative structure (Hölder, 1901). However, it is a condition that cannot obtain with the sorts of primary attributes assessed in testing. Consider any three items, h, j, and k, in such a test, where Ch includes Cj and Cj includes Ck, so that these three degrees are ordered. Since the difference between the cognitive resources involved in Ch and Cj cannot be the same as the difference between Cj and Ck (otherwise Ch and Cj would not differ) and is also qualitatively different to the difference between Cj and Ck (because it involves an extra item of knowledge, an extra skill or extra strategy), it follows that the one difference cannot also be either intrinsically greater than or less than the other. Hence, not only are such attributes not quantitative, the structure that they possess is logically incompatible with quantitative structure. Hence, it is not that primary attributes might be quantitative and we simply do not know it. It is stronger than that: the primary attributes that cognitive tests assess cannot be quantitative.
The problem is, first, that quantitative structure requires the differences between pairs of magnitudes of any quantitative attribute to be in all respects quantitatively homogeneous and in no respect qualitatively heterogeneous and, second, that the differences between degrees of any primary attribute assessed in testing are qualitatively heterogeneous. Hence, such attributes cannot be quantitative. Psychometricians have spent a century vainly believing they are measuring the attributes tests assess, but these attributes cannot be measured because their structure excludes quantitative structure.
Of course, this argument shows only that primary attributes cannot be quantitative, not that underlying constructs—abilities—cannot be. However, as already indicated, if the primary attributes are not quantitative, there is no a priori reason for supposing that underlying constructs are quantitative either. Hence, the default position in psychometrics is that the attributes tests assess are at best ordinal. If beyond that it can be shown, via say the use of conjoint measurement, that some underlying constructs show signs of additional quantitative structure, there would be some justification for constructing quantitative theories and even, perhaps, eventually claiming measurement. However, psychometrics is at present far from that situation and nothing illustrates this more eloquently than Sijtsma’s failure to present amongst his case studies, one involving the measurement of a psychological attribute possessing quantitative structure (2012).
Sijtsma’s examples involve only classification and ordering and, as far as primary attributes are concerned, if my argument is correct, that is no accident; it is a logical necessity. Should any psychometricians disagree, and doubtless they will, the challenge for them is to satisfy Sijtsma’s requirement of presenting a theory of the attribute assessed that links identifiable features of the test items with not only ordinal but additive characteristics of the resulting scale (2012). That is, such a theory must be able to say for any pair of items, j and k, that not only is j more difficult than k because of these and these identifiable features, but also that for any quadruple of items, g, h, j, and k, the difference in difficulty between g and h is r (where r is a real number) times the difference in difficulty between j and k because of these, these, and these identifiable item features. Only when this is achieved will the theory of the attribute justify any claim to measure the relevant attribute.
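Stated symbolically, the two parts of this challenge might be written as follows. The difficulty symbol δ is an assumption introduced here for illustration only; no such notation is used elsewhere in the argument.

```latex
\delta_j > \delta_k
\quad \text{(ordinal claim)},
\qquad
\delta_g - \delta_h = r\,(\delta_j - \delta_k),
\quad r \in \mathbb{R}
\quad \text{(additive claim)}
```

The ordinal claim needs only the item features that make j harder than k. The additive claim needs, in addition, item features that fix the value of r for every quadruple of items.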
Not that I expect this challenge to be taken up, nor even that Sijtsma’s example of developing detailed theories of attributes capable of sustaining principles for item construction will be widely followed. I attempted to implement a similar approach 25 years ago in the area of attitude assessment (Michell, 1990, 1994) and apart from a small number of critically minded researchers (e.g., Johnson, 2001, 2007; Kyngdon & Richards, 2007; Sherman, 1994), little came of it. Injunctions for improvement of psychometrics as a science will only gain traction amongst those who think there is room for improvement. Those who think that psychometric methods already deliver measurement will not be motivated to invest resources in labour-intensive alternatives, especially ones that put the very prospect of measurement at risk.
Markus and Borsboom, responding to Trendler’s (2009) arresting argument noted, “measurement in psychology demonstrates a perplexing resilience to attempts to part ways with it, reminiscent of Mr. Johnson’s cat in Harry S. Miller’s song” (2012, p. 454), “The Cat Came Back.” This allusion presumes the illusion that psychometrics already possesses measurement. Where there is no measurement, there is no parting of the ways with it and, hence, no question of the “cat”—measurement—coming back. The “cat” was never there to either leave or persistently return. What was and is and, it seems, ever shall be there is the ghost of the “cat,” viz., the illusion that measurement already exists in psychometrics. About this, Markus and Borsboom are right: try to exorcise this ghost as often as you like, it will not stay away!
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
