Abstract
Trendler (2009) argued that psychological attributes cannot be measured, as the experimental manipulation and control necessary for the application of measurement theory cannot be achieved. It is argued that Trendler’s conclusion ignored deeper issues. The scientific measurement of psychological attributes depends not only upon adequate stimulus control, but also upon descriptive theories of psychological systems and the demonstration of pure differences in degree (magnitude) within attributes hypothesized to be quantitative. For some classes of stimuli, where descriptive theories of the response process exist and where quantitative features in the stimuli themselves can be empirically manipulated, the demonstration of pure differences in degree is plausible and the scientific measurement of the relevant attributes credible. Where attribute differences between stimuli have identified qualitative causes, stimuli cannot be engineered to produce equivalent magnitudes of the relevant attribute. Here is where Trendler’s Millean Quantity Objection has force.
Psychologists understand that their field does not possess a system of measurements like physics. Statements like “Penny has 5.37 Ne of neuroticism,” where Ne is a well-defined unit quantity of the attribute, have no literal meaning. Nonetheless, psychologists firmly believe that anxiety, attitudes, cognitive abilities, utility, and other attributes are measurable.
Joel Michell (1997, 1999, 2008a) has repeatedly questioned this belief. He has argued that psychologists have failed to understand exactly what measurement means scientifically (Michell, 1999). In metrology, the field of physics that studies measurement and quantity, a measurement is the product of a real number and a unit (Emerson, 2008). The term real number is meant in the strict mathematical sense of a member of the real number line, and unit is a well-defined unit magnitude of a continuous quantity, such as the metre, kilogram, or second. But rather than advance the definition of measurement understood in physics, the paradigm of quantitative science, psychologists have, as Michell (1997, 1999) has persuasively shown, taken as definitive the non-scientific theory of measurement espoused by Harvard psychologist Stanley Smith Stevens (1906–1973), who claimed that measurement is “the assignment of numerals to objects and events according to rule” (Stevens, 1946, p. 677). Psychologists have also failed to produce evidence that psychological attributes are genuine quantities (Michell, 1997), and may have mistaken the merely ordinal relations typically observed between the levels of psychological attributes for evidence that such attributes are quantitative (Michell, 2009a). Psychological measurement therefore is a hypothesis, and because psychometricians in particular have failed to scientifically investigate this hypothesis, psychometrics is a pathological science (Michell, 2000).
But for Michell all is not lost. If psychologists were educated to understand exactly what real scientific measurement is (Michell, 1999), and significant changes were made to the university syllabus (Michell, 2001), the question “Is this attribute measurable?” might well be investigated in a genuinely scientific manner. In the absence of such changes, psychologists’ attempts at measurement are akin to attempting medicine without knowledge of anatomy (Michell, 2008b). The theory of conjoint measurement (Luce & Tukey, 1964), which can reveal the additivity of quantities not capable of unrestricted side-by-side combination or concatenation, has the potential to quantify psychological attributes, and hence should form part of a psychologist’s education (Michell, 1990). However, Michell has recently expressed pessimism about the needed changes to university curricula and training, remarking that “I am dismayed especially by the failure of psychologists to understand the character of quantitative structure, and the failure of undergraduate courses in psychology to rectify this deficit” (Michell, 2008b, pp. 126–127).
Trendler (2009) added his voice to Michell’s. Whilst he agreed that psychologists have laboured under a flawed conception of measurement, he asserted that even Michell’s rather muted optimism is misplaced. The theory of conjoint measurement may well have been a scientific revolution in the understanding of quantity (Cliff, 1992; Michell, 1999), but this was not the revolution psychology needed. To truly quantify and measure psychological attributes, Trendler (2009) called for a Galilean Revolution: the development of experimental apparatus which can enable psychologists to “actively intervene in the course of nature and deliberately manipulate the phenomena of interest” (Trendler, 2009, pp. 587–588). Citing Ohm’s (1826) quantification of electric current as a case in point, Trendler argued that it is through the creation and refinement of experiment and apparatus, not developments in measurement theory, that attributes have been quantified in the history of science.
Trendler (2009) argued that an intensive quantity can be experimentally manipulated such that equal magnitudes of the attribute can be determined. Indeed, this is how a quantity must behave (Hölder, 1901). The term intensive quantity refers to those quantities where additive relations between magnitudes cannot be ascertained through the direct observation of the objects that possess quantities of this kind. Temperature is one physical example. Additivity can be directly observed for extensive quantities, such as length and mass, as these quantities are capable of concatenation. Natural concatenation operations have never been discovered for any psychological attribute. Greater magnitudes of intelligence, for example, cannot be obtained by combining people’s heads. Psychological quantities, if they exist, must be intensive.
Intensive attributes, according to Trendler (2009), must be quantified via an extant quantity of some kind (or “observable,” as he called it). As an example, he cited Ohm’s (1826) use of plane angle to determine equivalent magnitudes of electrical potential difference. Trendler nominated reaction time as a candidate “observable” for motivation in responding to ability test items. He argued that if there was a causal relation between motivation and test item reaction time, then the same amount of an unspecified reward of some kind (presumably administered after a correct response) should produce, within the limits of systematic and random error, equal magnitudes of reaction time. But he dismissed such a situation as “utopian” (p. 591), as the causal complexity in responding to test items is too great to be controlled by experimental apparatus. Furthermore, he argued that the object of experimental manipulation needs to be the biological substrate of the human brain, and asked “how should we ‘slice and dice’ the brain of a test subject in such a way that only motivation influences reaction time and that all other factors which might additionally influence behaviour are under control?” (p. 592). As the biology of the human brain cannot be experimentally manipulated, psychological phenomena cannot be made to depend upon empirical conditions that are conducive to experimental manipulation, and the Galilean Revolution is impossible in psychology. Hence “psychological phenomena are not sufficiently manageable. That is, they are neither manipulable nor are they controllable to the extent necessary for an empirically meaningful application of measurement theory. Hence they are not measurable” (p. 592). This conclusion Trendler named the Millean Quantity Objection.
It is argued in this paper that Trendler’s conclusion was premature and his analysis ignored deeper issues. It is not the lack of experimental manipulation and apparatus per se that determines the measurability of natural attributes. Rather, it depends on genuine continuous quantities actually existing and upon the creation of descriptive theories of the natural systems in which such quantities behave. This is the case with measurement in physics. For example, the relationship between temperature, pressure, and volume, described by Boyle’s Law (Boyle, 1662), was understood long before the development of accurate thermometry in the 19th century (Middleton, 1966). The International System of Units (SI) (Bureau International des Poids et Mesures, BIPM, 2006) currently defines the second as “the duration of 9 192 631 770 periods of the radiation corresponding to the transition between the two hyperfine levels of the ground state of the caesium 133 atom” (p. 113). This definition of the second would simply be impossible without descriptive theories of atomic physics. The definition of other physical unit quantities, such as the ampere (cf. BIPM, 2006), also depends upon physical theory.
Where psychological measurement is most strongly argued to have been achieved, such as psychophysics, utility theory, and psychometrics, empirical study is based upon the presentation of stimuli to humans (such as test items) and inferring something of the relevant, unobservable attribute (such as a cognitive ability of some kind) from the observable response made to the stimulus (such as a correct answer). If sensations, utility, and cognitive abilities are indeed measurable, descriptive theories of the response process which connect the hypothesized psychological quantity to identifiable features of the stimuli must be developed (Michell, 2008b). Otherwise, no scientific basis exists for the identification of any causal, lawful relationship between the psychological attribute, the stimulus, and the response to the stimulus. To use Trendler’s (2009) example, how could a causal relationship be established between motivation and reaction time without a theory as to why the former influences magnitudes of the latter? Demonstrating in an experiment that the former can predict the latter does not suffice, as without a theory, there is no scientific basis to hypothesize that the observed prediction will be seen in other empirical situations. Contrary to Trendler (2009), the fundamental problem facing psychological measurement is not lack of experimental control, manipulation, and apparatus. It is the lack of extant quantities and the lack of descriptive theories of psychological systems.
In this paper it is argued that psychological measurement may only be possible for stimulus domains where: (a) there exist substantive theories which attempt to describe the cognitive processes involved in responding to such stimuli; (b) the stimuli at least in part consist of extant quantities, either discrete or continuous, whose relation to the relevant, psychological quantity is described by the kind of theories mentioned in point (a); and (c) these stimulus quantities are capable of being empirically manipulated so as to produce, in conjunction with the theories of point (a), homogeneous differences between degrees and pairs of degrees of the relevant, psychological quantity (Michell, 2009b).
With respect to point (a), this paper presents an overview of cumulative prospect theory (Kahneman & Tversky, 1979; Tversky & Kahneman, 1992) and the Lexile Framework for Reading (Stenner, Burdick, Sanford, & Burdick, 2006). Prospect theory attempts to describe the utility human beings have for incremental gains and losses under conditions of risk or uncertainty, whilst the Lexile Framework attempts a descriptive account of individual differences in the ability to read continuous prose text.
The stimuli relevant to prospect theory and the Lexile Framework are simple gambles and embedded sentence cloze reading test items, respectively. With respect to point (b), both these kinds of stimuli possess quantitative features which the relevant theory relates to the hypothesized psychological quantity. With regard to point (c), it is shown that certain quantitative features of gambles and embedded sentence cloze reading items can be manipulated, in conjunction with the relevant theory, to produce other stimuli which are of greater, lesser, or equal magnitude with respect to the relevant attribute.
For certain classes of stimuli, experimental manipulation is possible, yet such manipulation reveals only impure or heterogeneous differences in degree and between degrees of the relevant attribute. Heterogeneous differences are differences in kind, not amount, and hence heterogeneous attributes are not measurable. To illustrate this, mathematics items from the 2003 Trends in International Mathematics and Science Study (TIMSS) 4th grade mathematics test (International Association for the Evaluation of Educational Achievement, IEA) and attitude items towards abortion (Roberts, Donoghue, & Laughlin, 2000) and capital punishment (Andrich, 1995) are presented. These stimuli do not possess features which can be manipulated to produce equivalent magnitudes of mathematical ability and attitude. Qualitative differences, however, can be readily identified as plausibly causing differences in the difficulty of mathematics items and in the intrinsic favourability of attitude statements.
Theory-based, putative psychological measurement
Cumulative prospect theory
The utility of incremental gains and losses under conditions of risk was first studied by 18th-century Swiss mathematician Daniel Bernoulli (1738). Bernoulli rejected the prevailing view of the time in which the expected value of a gamble was considered synonymous with the gamble’s utility. For example, a simple lottery consisting of a 75% chance of winning $1,000 and a 25% chance of winning nothing has an expected value of (.75 × $1,000) + (.25 × $0) = $750. Bernoulli argued that expected value and utility are not identical. As decision makers are risk averse, utility is a concave function of risky event outcomes. The utility of a gain or loss was also a function of a person’s total asset position. Two centuries later, mathematical treatments of Bernoulli’s ideas by von Neumann and Morgenstern (1944) created what is known as expected utility theory, which is the received view of utility under risk in economics (Levy, 2008).
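The distinction between expected value and utility can be illustrated with a minimal Python sketch. The function names, the representation of a lottery as (probability, outcome) pairs, and the square-root utility function are illustrative assumptions only; the square root is simply one convenient concave function and is not the form proposed in any of the sources cited here.

```python
import math

def expected_value(lottery):
    """Expected value of a lottery given as (probability, outcome) pairs."""
    return sum(p * x for p, x in lottery)

def certainty_equivalent(lottery, u, u_inverse):
    """Sure amount whose utility equals the lottery's expected utility."""
    expected_utility = sum(p * u(x) for p, x in lottery)
    return u_inverse(expected_utility)

# The lottery from the text: 75% chance of $1,000, 25% chance of nothing.
lottery = [(0.75, 1000.0), (0.25, 0.0)]

ev = expected_value(lottery)

# Hypothetical concave utility u(x) = sqrt(x), with inverse u^-1(y) = y**2.
ce = certainty_equivalent(lottery, math.sqrt, lambda y: y ** 2)

print(ev)            # 750.0
print(round(ce, 2))  # sure amount judged as good as the gamble under sqrt utility
```

Under any strictly concave utility function the certainty equivalent falls below the expected value, which is the sense in which a risk-averse decision maker values the gamble at less than $750.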
Allais (1953) discovered that the von Neumann and Morgenstern (1944) independence axiom of expected utility theory was robustly violated by human choice behaviour. This axiom asserts that if Gamble A is preferred to Gamble B, then this preference must be maintained when both gambles are subjected to the same transformation. Suppose that A was a simple lottery consisting of an 80% chance of winning $4,000 and B was a sure consequence (a 100% chance of receiving $3,000). Kahneman and Tversky (1979) found that 80% of their test participants preferred B to A. They then presented Lottery C, a 20% chance of winning $4,000, and Lottery D, a 25% chance of winning $3,000. Lotteries C and D are in fact probability mixtures of Lotteries A and B, such that C = .25(A) and D = .25(B). As the majority of test participants chose B over A, they should have chosen D over C. However, 65% chose Lottery C over Lottery D.
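Why the independence axiom forbids this pattern can be checked numerically. The following Python sketch is illustrative only: the tuple representation of a lottery and the four candidate utility functions are assumptions chosen for the example, each with a utility of zero for winning nothing.

```python
import math

# Lotteries as (probability of winning, prize); the alternative outcome is $0.
A = (0.80, 4000.0)
B = (1.00, 3000.0)
C = (0.25 * A[0], A[1])  # C = .25(A): a 20% chance of $4,000
D = (0.25 * B[0], B[1])  # D = .25(B): a 25% chance of $3,000

def expected_utility(lottery, u):
    """Expected utility of a win-or-nothing lottery under utility function u."""
    p, prize = lottery
    return p * u(prize) + (1.0 - p) * u(0.0)

def prefers_first(first, second, u):
    """True if the first lottery has the higher expected utility."""
    return expected_utility(first, u) > expected_utility(second, u)

# Hypothetical utility functions, all with u(0) = 0.
candidates = {
    "linear": lambda x: x,
    "square root": math.sqrt,
    "power 0.88": lambda x: x ** 0.88,
    "log(1 + x)": math.log1p,
}

# Under expected utility, the A-versus-B choice always matches the C-versus-D choice.
consistent = all(
    prefers_first(A, B, u) == prefers_first(C, D, u)
    for u in candidates.values()
)
print(consistent)  # True
```

Because scaling both win probabilities by .25 scales both expected utilities by the same factor, no utility function of this kind can yield a preference for B over A together with a preference for C over D.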
This reversal of preferences was highly replicable and became known as the Allais Paradox. Attempts at discrediting it failed (e.g., Grether & Plott, 1979). The paradox is now considered the most powerful evidence against expected utility theory (Levy, 2008). Alternative theories have been proposed, such as cumulative prospect theory (Kahneman & Tversky, 1979; Tversky & Kahneman, 1992), for which Kahneman shared the 2002 Nobel Economics Prize. The defining feature of cumulative prospect theory is the “fourfold pattern of risk attitude”. Human beings are:
risk averse for gains that have a moderate to high probability of occurring (e.g., preference for certain gains over probable gains that may be larger);
risk inclined for losses that have a moderate to high probability of occurring (e.g., choice of medical treatments which avoid certain deaths for merely probable ones [Tversky & Kahneman, 1981]);
risk averse for small probabilities of large losses (e.g., purchasing of renter’s insurance); and
risk inclined for small probabilities of large gains (e.g., purchasing of lottery tickets).
These behaviours are modelled in cumulative prospect theory by the probability weighting function. This function creates decision weights from the gamble’s outcome probabilities; and it is these which enable cumulative prospect theory to predict the Allais Paradox. For example, Lottery B’s utility of 3,000 exceeds that of Lottery A (2,270). Hence cumulative prospect theory predicts the certainty effect (i.e., decision makers intrinsically prefer certain outcomes over probable ones). Lottery C’s utility of 868 is greater than Lottery D’s utility of 737, hence cumulative prospect theory predicts the Allais Paradox. Appendix A is a brief technical summary of cumulative prospect theory.
The Lexile Framework for Reading
Psychometricians and educators have expended considerable effort in attempting to measure the ability to read continuous prose text. At one stage, over 90 different psychometric tests for reading existed (Mitchell, 1985). Such tests aim to measure individual differences in reading ability. Given that all mental abilities are cognitive attributes, the measurement of individual differences in reading test performance must be based upon theories which describe those cognitive attributes which cause these differences (Michell, 2008b). The Lexile Framework for Reading (Stenner et al., 2006) is one such account.
The stimulus central to the Lexile Framework is a reading test item type known as an embedded sentence cloze item. The “stem” of such a reading item is a piece of professionally edited continuous prose text, often taken from a published monograph. An item writer composes a sentence that requires the respondent to select the missing word in order to “cloze” the sentence (Figure 1).

Figure 1. An “embedded sentence cloze” reading-item type created from text in Homer’s Iliad.
In the Lexile Framework, the difficulty of such items varies as a function of the demand the prose text of the item stem places upon an individual’s verbal working memory (VWM) capacity and vocabulary. The longer the sentences, the greater the demand made upon the reader’s VWM. The less frequently the words used in the stem appear in written prose, the greater the demand placed upon the reader’s vocabulary. Hence individual differences in performance upon tests consisting of this kind of reading item are caused by individual differences in VWM capacity and vocabulary. The larger a person’s vocabulary and VWM capacity, the more able a reader that person is.
The difficulty of an embedded sentence cloze reading item is calculated from the mean sentence length, a proxy variable for VWM capacity demand, and the mean log word frequency, a proxy variable for text vocabulary demand (Stenner et al., 2006). Mean sentence length is simply the ratio of the number of words in the item stem to the number of sentence endings. For example, the stem of the item depicted in Figure 1 contains 90 words and four sentence endings (four full stops). Hence the mean sentence length is 90:4 or simply 22.5. Calculation of the mean log word frequency is more involved. Each word in the item stem is assessed as to how frequently it appears in a text corpus of some kind, such as the Carroll corpus (Carroll, Davies, & Richman, 1971). The common logarithm of each word frequency is calculated, then summed, and an arithmetic mean of these logs calculated. For the stem of the item depicted in Figure 1, the mean log word frequency is 3.68, which is calculated from a 500 million word corpus created by MetaMetrics, Inc.
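The two proxy variables can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the tokenisation rule (runs of letters count as words; runs of ".", "!", or "?" count as sentence endings), the two-sentence stem, and the word counts in `TOY_COUNTS` are all hypothetical and do not reproduce the MetaMetrics corpus or software.

```python
import math
import re

def mean_sentence_length(text):
    """Number of words in the stem divided by the number of sentence endings."""
    n_words = len(re.findall(r"[A-Za-z']+", text))
    n_endings = len(re.findall(r"[.!?]+", text))
    return n_words / n_endings

def mean_log_word_frequency(text, corpus_counts):
    """Arithmetic mean of the common logarithms of each word's corpus frequency."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    logs = [math.log10(corpus_counts[w]) for w in words]
    return sum(logs) / len(logs)

# Hypothetical corpus frequencies, invented for this example.
TOY_COUNTS = {
    "the": 1_000_000, "we": 100_000, "car": 10_000,
    "walked": 10_000, "home": 10_000, "stopped": 1_000,
}

stem = "The car stopped. We walked home."
msl = mean_sentence_length(stem)                  # 6 words over 2 sentence endings
mlwf = mean_log_word_frequency(stem, TOY_COUNTS)

# The Figure 1 stem described in the text: 90 words and four sentence endings.
figure_1_msl = 90 / 4
print(msl, round(mlwf, 2), figure_1_msl)
```

Rarer words lower the mean log word frequency, and longer sentences raise the mean sentence length; both changes are taken in the Lexile Framework to increase the demand a text makes upon the reader.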
A simple regression equation transforms the mean sentence length and the mean log word frequency values into a difficulty value for the item. This difficulty is then transformed into a putative measure on the Lexile scale. The Lexile Framework is unique in the behavioural sciences as it is the only psychometric system in which a unit of measurement is explicitly defined (Kyngdon, 2011b). The unit of the Lexile scale, represented by an “L,” is defined as 1/1000th of the difference in difficulty between a sample of basal primer texts and Grolier’s Electronic Encyclopedia (1986; Stenner et al., 2006). The Lexile scale is therefore an “interval” scale. The Lexile difficulty measure of the item depicted in Figure 1 is 1,210 L. Appendix B is a technical summary of the Lexile Framework.
Homogeneity, stimulus manipulation, and measurement
The concept of homogeneity
That continuous quantities are homogeneous was recognized by Hölder (1901) and has been recently emphasised by Michell (2009b). By homogeneous, it is meant that the degrees (magnitudes) of a continuous quantity must vary only by amount, not kind. For example, let a and b be two 10 metre lengths of rigid steel rods. Let c be a 13 metre-long steel rod. Length is a homogeneous quantity as it can be empirically determined that a is equal to b (i.e., 10 m = 10 m), a is less than c (i.e., 10 m < 13 m), and b is less than c (i.e., 10 m < 13 m). Length is homogeneous because c is of greater magnitude than both a and b for the sole reason that the steel rod c possesses exactly 3 m more of the same attribute (i.e., length). This difference is not caused by a difference in kind, such as differences in the kind of steel alloy which were used to make the rods. For example, steel rod a might be made of stainless steel whilst rods b and c might be made of tungsten carbide steel. In general terms, length is homogeneous because for each and every pair of length magnitudes, x and y, either x is equal to y, x is greater than y, or x is less than y (Hölder, 1901).
The differences between the magnitudes of a continuous quantity must also be homogeneous. The differences between lengths, for example, must also be lengths. Consider another steel rod d of length 17 m. The difference between rods d and c, for example, is 17 m − 13 m = 4 m. The difference between rods c and a is 13 m − 10 m = 3 m. Therefore the difference between rods c and d is greater than the difference between rods a and c by a magnitude of 1 m. As the difference between c and d and the difference between a and c are differences only in length, length must be homogeneous. No other kind of difference is involved.
Homogeneity and utility
Utility is hypothesized to be a continuous, quantitative psychological attribute (e.g., Tversky, Sattath, & Slovic, 1988). If this hypothesis is true, then the outcomes and outcome probabilities of simple gambles must be open to the kind of empirical manipulation which enables, for any pair of gambles, the determination that the utility of one gamble is greater than, less than, or equal to the utility of the other.
Simple gambles appear to be open to such manipulation. Consider the following two lotteries:
A. a zero percent chance of winning $100 and a 100% chance of winning $0; and
B. a zero percent chance of winning $1,000 and a 100% chance of winning $0.
That both of these lotteries have a utility value of zero in cumulative prospect theory is psychologically plausible, as they present no possibility of a gain or loss for the decision maker (Luce, 2000). Lottery B is simply Lottery A with its monetary outcomes multiplied by 10. Manipulation of the outcomes of a lottery such that another lottery of equal utility is produced is evidence consistent with utility being a continuous quantity. Further evidence can be obtained as follows. Consider the following lottery:
C. a 66% chance of winning $500 and a 34% chance of winning $0.
According to cumulative prospect theory, Lottery C has a utility value of 120.65 and hence exceeds the utility of both Lotteries A and B. Consider the following lottery:
D. a 15% chance to win $39 and an 85% chance to win $23.10.
The utility of Lottery D is 17.95, and it therefore has less utility than Lottery C but more than either Lottery A or Lottery B.
Simple lotteries can also be manipulated such that evidence of homogeneous differences in utility between lotteries is produced. The difference in utility between Lottery D and Lottery A is simply 17.95 − 0 = 17.95. If utility is a homogeneous, continuous quantity, it must be possible to create another lottery whose utility differs from that of Lottery D by precisely 17.95. Consider Lottery E:
E. a 25% chance to win $80 and a 75% chance to win $50.
The utility of Lottery E is 35.9. Hence the difference in utility values between Lottery D and Lottery E is 35.9 − 17.95 = 17.95. This is equal to the difference in utility between Lottery D and Lottery A. Moreover, the utility difference between Lottery C and Lottery E is 120.65 − 35.9 = 84.75. This difference is greater than the difference between Lottery D and Lottery E (17.95), but less than the difference between Lottery C and Lottery A (120.65).
Hence, simple lotteries can be empirically manipulated such that the homogeneity of utility is revealed. That this is possible suggests that the utility of gains and losses under conditions of risk or uncertainty is a psychological attribute capable of scientific measurement.
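The utility values quoted for Lotteries A to E can be approximately reproduced with the short Python sketch below. It assumes the power value function and the probability weighting function for gains given by Tversky and Kahneman (1992), with their median parameter estimates of .88 and .61; the function names and the restriction to two-outcome lotteries with non-negative outcomes are simplifications made for this illustration and are not a restatement of Appendix A.

```python
# Assumed parameters for gains (Tversky & Kahneman, 1992).
ALPHA = 0.88  # curvature of the value function
GAMMA = 0.61  # curvature of the probability weighting function

def value(x, alpha=ALPHA):
    """Value of a non-negative monetary outcome x."""
    return x ** alpha

def weight(p, gamma=GAMMA):
    """Decision weight attached to probability p of the better outcome."""
    num = p ** gamma
    return num / (num + (1.0 - p) ** gamma) ** (1.0 / gamma)

def cpt_utility(p_high, x_high, x_low=0.0):
    """Utility of a lottery paying x_high with probability p_high, else x_low.

    Assumes x_high >= x_low >= 0, so only the weighting function for gains is needed.
    """
    w = weight(p_high)
    return w * value(x_high) + (1.0 - w) * value(x_low)

# (probability of the better outcome, better outcome, worse outcome)
lotteries = {
    "A": (0.00, 100.0, 0.0),
    "B": (0.00, 1000.0, 0.0),
    "C": (0.66, 500.0, 0.0),
    "D": (0.15, 39.0, 23.10),
    "E": (0.25, 80.0, 50.0),
}
utilities = {name: cpt_utility(*args) for name, args in lotteries.items()}

diff_D_A = utilities["D"] - utilities["A"]
diff_E_D = utilities["E"] - utilities["D"]

for name, u in utilities.items():
    print(name, round(u, 2))
print(round(diff_D_A, 2), round(diff_E_D, 2))
```

Under these assumed parameters the sketch gives values close to those quoted in the text (roughly 120.65, 17.95, and 35.9 for Lotteries C, D, and E), and the two differences agree to within a few hundredths of a utility unit.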
Homogeneity and individual differences in reading ability
If sentence length and word frequency truly are causally relevant to individual differences in reading ability, it must be possible to engineer embedded sentence cloze reading items of differing topics and semantic content that are nonetheless of equal Lexile difficulty. Sentence length and word frequency must also be open to manipulation such that for any pair of embedded sentence cloze items, the difference in Lexile difficulty between them is either greater than, less than, or equal to the difference in difficulty between any other item pair.
Figure 2 is an embedded sentence cloze item created by MetaMetrics, Inc. for use in a reading test.

Figure 2. An embedded sentence cloze reading-item type created from text in the children’s book Arthur and the Scare Your Pants Off Club (Krensky, 1998).
Figure 3 displays an embedded sentence cloze reading item created by the author about an automobile breaking down.

Figure 3. An embedded sentence cloze reading item of 350 L created by the author.
The Lexile difficulty of the items in both Figure 2 and Figure 3 is 350 L. Holding the number of words constant at 119, while increasing the sentence length and using slightly rarer words, the author created another item conveying the same story as the item in Figure 3 (Figure 4).

Figure 4. An embedded sentence cloze reading item of 660 L created by the author.
The difficulty of the item in Figure 4 was 660 L. Hence the difference between this item and the item in Figure 3 was 310 L. If reading ability, as conceptualized in the Lexile Framework, is quantitative, then the sentence lengths and word frequencies of the text in the item stem should be able to be manipulated such that another item is produced which is approximately 310 L greater than the item of Figure 4.
The difficulty of the item in Figure 5 was 970 L, which meant that there was a 310 L difference between it and the item of Figure 4. Similarly to gambles in utility theory, it would seem that the manipulation of the quantitative components of continuous prose text leads to the creation of reading items whose difficulties are either less than, greater than, or equal to other reading items. Moreover, it would seem that embedded sentence cloze reading items can be engineered whilst holding the content and number of words the same, such that differences between the Lexile difficulties of the texts are approximately equal. This is evidence in support of reading ability, as conceptualized in the Lexile Framework, being homogeneous, quantitative, and measurable.

Figure 5. An embedded sentence cloze reading item of 970 L created by the author.
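The comparison of Lexile difficulties across the three author-created items can be laid out as a short Python sketch. The dictionary labels and the helper name `successive_differences` are illustrative assumptions; the difficulty values are those reported in the text.

```python
# Lexile difficulties of the three items conveying the same story.
lexile = {"Figure 3": 350, "Figure 4": 660, "Figure 5": 970}

def successive_differences(values):
    """Differences between each value and the one preceding it."""
    return [later - earlier for earlier, later in zip(values, values[1:])]

steps = successive_differences(list(lexile.values()))
print(steps)  # [310, 310]
```

Equal successive differences of this kind are what would be expected if the manipulated text features act upon a single homogeneous quantity.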
Heterogeneous, non-measurable psychological attributes
Cognitive ability in mathematics and the TIMSS assessment
Since the pioneering research of Spearman (1904) and Binet (1903), it has been consistently argued that cognitive abilities are measurable, continuous quantities. Yet psychological theory describing the cognitive processes of responding to ability test items is almost completely absent. Rather ironically, the most important reference work in educational testing and assessment is titled Educational Measurement (Brennan, 2006), despite its complete lack of any discussion of the extant literature on formal measurement theory (Borsboom, 2009). Yen and Fitzpatrick’s (2006) chapter in this tome presents the class of psychometric models known as Item Response Theory (IRT) models. Although IRT originated in the work of Lawley (1943), Lord (1952), and Rasch (1960), it was not until the 1970s, when computers powerful enough to run the complex parameter estimation algorithms became available, that IRT models became widely used (Mislevy, 1987). By the end of the 20th century, the psychometrician Embretson (1999) proclaimed that IRT had become “state of the art for psychometric methods” (p. 407).
Large-scale assessments in education, such as the Trends in International Mathematics and Science Study (TIMSS), employ IRT methods to obtain putative measurements of cognitive abilities. The TIMSS assesses cognitive ability in mathematics and science. An IRT model known as the three-parameter model (Birnbaum, 1968) is used to analyse responses to multiple-choice items (Gonzalez, Galia, & Li, 2004; Appendix C is a technical summary of this model). But if mathematical ability is genuinely measurable, the differences in difficulty between items must be homogeneous and quantitative.
Some of the items used in the 2003 TIMSS mathematics assessment of 4th grade school children were made publicly available after the test was administered (IEA, 2007). Such items are called released items and were published with the relevant psychometric analyses, including estimates of item difficulty.
The three-parameter model scale values for TIMSS item M031341 (Figure 6) and TIMSS item M011015 (Figure 7) were -.755 and -.084, respectively. These values are putative measures of how hard the item is to solve (the item’s difficulty). Items with large negative values are very easy and items with large positive values are very difficult. Accordingly, the decimals item was the harder item of the above pair of items.

Figure 6. TIMSS item number M031341: content domain “Number”; main topic “Whole Numbers”; and cognitive domain “Reasoning”.

Figure 7. TIMSS item number M011015: content domain “Number”; main topic “Fractions & Decimals”; and cognitive domain “Knowing Facts and Procedures”.
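How a difficulty value enters the three-parameter model can be sketched in Python using the standard logistic form of the model. Only the two difficulty values are taken from the text; the discrimination `a`, the guessing parameter `c`, and the scaling constant `D` are hypothetical values chosen for illustration and are not the estimates reported for these items.

```python
import math

def p_correct(theta, b, a=1.0, c=0.2, D=1.7):
    """Three-parameter logistic probability of a correct response.

    theta: examinee ability; b: item difficulty;
    a, c, D: discrimination, guessing, and scaling values (assumed here).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Difficulty values quoted in the text.
b_whole_numbers = -0.755  # item M031341
b_decimals = -0.084       # item M011015

theta = 0.0  # a hypothetical examinee of average ability
p_whole = p_correct(theta, b_whole_numbers)
p_dec = p_correct(theta, b_decimals)
print(round(p_whole, 3), round(p_dec, 3))
```

With the other parameters held equal, the item with the larger difficulty value has the lower probability of a correct response at every ability level, which is the sense in which the decimals item is the harder of the pair.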
Another two of the released TIMSS items are presented in Figures 8 and 9. These items are more difficult than the previous two, with scale values of .777 and 1.556. As these scale values are considered to be putative “interval scale” (Stevens, 1946) measurements, differences in scale values between pairs of these items are hypothesized to be caused by a homogeneous and quantitative attribute of mathematical ability. Evidence in support of this hypothesis, however, is not discernible from the content and structure of the above items.

Figure 8. TIMSS item number M031178: content domain “Measurement”; main topic “Tools, Techniques and Formulas”; and cognitive domain “Solving Routine Problems”.

Figure 9. TIMSS item number M031249: content domain “Algebra”; main topic “Equations and Formulas”; and cognitive domain “Using Concepts”.
The decimals item (Figure 7) requires knowledge of decimal numbers and skill in applying arithmetic operations to decimal numbers. TIMSS item M031249 (Figure 9) is the most difficult of the four mathematics items. It is an item which tests an examinee’s knowledge of algebra and his or her skill in using equations and formulae. It is unlike the other items in two respects. Firstly, it is not a multiple-choice item, so there is no chance of randomly selecting the correct answer from a list. Secondly, it contains a red herring designed to distract the examinee, in that it is not necessary at all to calculate the value represented by “■” to solve the problem. All an examinee would have to do is add 6 to the 703 figure, which is given in the first line of the item. Hence the item is really an easy arithmetic item, but the red herring has effectively increased its difficulty. Red herrings are often used by constructors of mathematics tests to discriminate between those examinees who are capable of ignoring irrelevant information to solve items and those who are not.
What makes the algebra item (Figure 9) more difficult than the decimals item (Figure 7) is not obviously homogeneous and quantitative. Both items do require skill in applying arithmetic operations to numbers, but the algebra item requires knowledge of algebra and the ability to recognize a “red herring” whilst the decimals item does not. These are qualitative differences in knowledge and skill, not quantitative differences in an amount of something.
Several qualitative differences exist between the measurement item (Figure 8) and the whole numbers item (Figure 6). The measurement item requires knowledge of units for the measurement of time and of unit conversions. It requires skill in applying arithmetic to different units of time. It also requires the examinee to be familiar with soccer and the concept that one practises for sporting competitions. The whole numbers item also requires skill in applying arithmetic operations to positive whole numbers, but it makes no demands upon the examinee’s knowledge of physical quantities and their units of measurement. This item also requires that the examinee is knowledgeable as to what a calculator is and assumes that the examinee has some skill and previous experience in using one. The item is not requesting a simple calculation as a response, but is asking the examinee to select the correct strategy for rectifying the calculation error.
The scale difference in difficulty of 1.64 between the algebra and decimals items is close to the scale difference of 1.532 between the measurement and whole numbers items. If these scale differences were genuine measurements of the differences in difficulty between the pairs of items, then such differences would be due to a nearly equal amount of the same homogeneous quantity. That is, what would make the algebra item harder than the decimals item is an amount of a continuous quantity that is nearly identical to another amount of that same quantity between the measurement item and the whole numbers item. But no evidence of such a continuous quantity can be inferred from the content and structure of the items themselves. What makes the algebra item harder than the decimals item (knowledge of algebra and “test wiseness” in dealing with the red herring) is qualitatively different to what makes the measurement item harder than the whole numbers item (knowledge of units of measurement of time and correct unit conversions). Hence these differences in difficulty between item pairs are heterogeneous and therefore not quantitative.
Attitudes towards the social issues of abortion and capital punishment
Thurstone (1928) declared that “attitudes can be measured” in the title of his well-known paper. But such confidence must be sustained by evidence that attitudes are continuous quantities. Statements expressing attitudes must themselves be homogeneous and must also exhibit homogeneous attitudinal differences.
Psychometricians have attempted to study attitudes using variants of the IRT models they have used to analyse data from tests of cognitive abilities. An IRT model for the measurement of attitudes called the “generalized graded unfolding model” was proposed by Roberts et al. (2000; Appendix D contains a technical summary of this model). This model is a complex probabilistic analogue to the non-stochastic theory of unidimensional unfolding proposed by Coombs (1964). If the model fits data obtained from an attitude survey, it is claimed that attitudes are measured on an “interval scale” (Stevens, 1946). Such scales are used with the assumption that differences between the degrees of the relevant attribute are homogeneous and quantitative. Hence attitudinal differences must also be of this kind.
Roberts et al. (2000) applied the generalized graded unfolding model to data from a 50-item survey of attitudes towards abortion. Culling those attitude items which did not “fit” their model, they created a 20-item survey to assess attitudes towards abortion. Each item within the survey was reported with a putative interval scale measurement of its intrinsic favourability. Two of the items comprising this scale were as follows:
1. Abortion should not be made readily available to everyone.
2. Abortion is basically immoral except when the woman’s physical health is in danger.
The scale values of these items were −1.6 and −1.1, respectively. As is permissible with interval scales, which have no fixed zero point, these measurements are negative. In this particular case, negative measurements are indicative of items which express attitudes opposing abortion. The scale difference between these items is (−1.6) − (−1.1) = −1.6 + 1.1 = −0.5.
Unlike the case with utility differences between gambles, scale differences between these attitude items seem to be caused not by differences in amount, but rather by differences in kind. The first item makes a blanket statement to the effect that abortion should not be made readily available to everyone. It makes no provision for exemptions, nor does it state a reason why abortion should be restricted in this manner. The measurement of the second item suggests the item has greater intrinsic favourability than the first item (i.e., it is more “pro-abortion”). However, this purported quantitative difference appears to be caused by a difference in kind, not a difference in amount. Item 2 is different from item 1 in that it (a) specifies an exemption for women whose physical health is in danger and (b) expresses a judgement concerning the morality of abortion. These are two explicitly qualitative features of the second item. Hence item 2 cannot be created from item 1 by manipulating some readily identifiable quantitative components of the first item. The difference is apparently caused by different qualitative aspects of the abortion issue. Such differences are heterogeneous and attitudes towards abortion are non-quantitative.
That the differences between attitudes towards abortion are qualitative is borne out by comparison to the differences between the other attitude items developed by Roberts et al. (2000). Another two of their attitude items were as follows:
3. Abortion should generally be legal, but should never be used as a conventional method of birth control.
4. Although abortion on demand seems quite extreme, I generally favour a woman’s right to choose.
Item 3’s measurement was 0.6 and Item 4’s was 1.1. Hence the scale difference between these items (0.5) was equal to the absolute value of the difference between Items 1 and 2. This equal scale difference, however, is illusory. Item 3 expresses an attitude which concerns the legality of abortion and abortion as a method of birth control. These are qualitative aspects of the abortion issue that are not raised in Items 1, 2, or 4. Item 4 expresses an attitude which is against a laissez-faire approach to abortion, but it also raises another social issue altogether, that of women’s rights. So whilst the 0.5 scale difference between Items 3 and 4 is seemingly caused by the qualitative differences of the legality of abortion, birth control, and women’s rights, the 0.5 absolute scale difference between Items 1 and 2 appears to arise from differences in moral judgement and exemptions. Such heterogeneity is firm evidence against the hypothesis that attitudes towards abortion are measurable.
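The arithmetic behind the two scale differences can be reproduced in a few lines. The following is only an illustrative sketch: the four scale values are those quoted above, whilst the dictionary and function names are hypothetical conveniences. Exact decimal arithmetic is used so that the comparison is not disturbed by binary floating-point rounding.

```python
from decimal import Decimal

# Scale values as quoted in the text for the four abortion items.
scale_values = {
    1: Decimal("-1.6"),
    2: Decimal("-1.1"),
    3: Decimal("0.6"),
    4: Decimal("1.1"),
}


def scale_difference(i, j):
    """Signed scale difference: value of item i minus value of item j."""
    return scale_values[i] - scale_values[j]


diff_12 = scale_difference(1, 2)  # Item 1 minus Item 2
diff_34 = scale_difference(4, 3)  # Item 4 minus Item 3

# The two differences have the same absolute size.
print(diff_12, diff_34, abs(diff_12) == abs(diff_34))
```

The calculation shows only that the two numerical differences coincide. It is silent on whether they reflect equal amounts of a single attribute, which is the point at issue.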
Heterogeneous differences between attitude statements are not unique to the social issue of abortion. Andrich (1995) investigated attitudes towards the issue of capital punishment, using a set of eight items created by Wohlwill (1963). Like Roberts et al. (2000), he created an IRT analogue to Coombs’ (1964) theory, which he called the “simple hyperbolic cosine model for direct responses” (Appendix D contains a technical summary of this model). Andrich (1995) administered these items to 41 students undertaking a course in educational assessment. He found that his IRT model fitted the data from all eight items. Three of the items were as follows:
5. The state cannot teach the sacredness of human life by destroying it.
6. I don’t believe in capital punishment but I am not sure it isn’t necessary.
7. I think capital punishment is necessary but I wish it were not.
The scale values for Items 5, 6, and 7 were -7.83, -2.26, and 2.29, respectively. The absolute value of the scale difference between the Items 5 and 6 was 5.57, which is almost equal to the difference between Items 6 and 7 (2.29 – (-2.26) = 5.55). Yet there are no explicitly quantitative features which have been manipulated to produce these nearly equal IRT scale value differences. As with the abortion items, these scale differences appear to be caused by qualitative differences in the content of the items themselves. Item 5 raises the issue of the state educating its citizens through its actions. It also makes a strong value judgement concerning human life by describing it as sacred and by describing capital punishment as destroying life. It is these which render Item 5 as more “anti” capital punishment than Item 6, which simply states a non-belief in capital punishment and expresses an uncertainty with respect to its necessity as a form of punishment. Item 7 is different again. It expresses the unambiguous attitude that capital punishment is necessary, but expresses a personal wish that this was not the case. As with attitudes towards abortion, heterogeneous qualitative differences in kind exist between items which express attitudes towards capital punishment. Such differences belie IRT scale values as putative scientific measurements of attitude.
Discussion
Unlike simple lotteries or Lexile reading items, it appears that in no way can an attitude statement or a TIMSS mathematics item be manipulated so as to produce another statement or item of equal magnitude of intrinsic favourability or difficulty. But this is not to say that such stimuli cannot be manipulated at all. The TIMSS items do indeed possess quantitative components, such as the magnitude of values in the algebra item (Figure 9). But there is no extant theory to guide such manipulation, and therefore there is no scientific way of predicting how difficult a mathematics item created through such manipulation would be. It is clearly not enough for stimuli to contain manipulable, quantitative features. As Michell (2008b) argued, how such stimulus features relate to the relevant attribute must be made explicit by a theory of the response process.
That attitude statements lack quantitative features does not mean that such items cannot be manipulated. Adding or removing predicates from an attitude item can lead to the creation of other attitude items. Indeed, Michell (1994) developed what he called the theory of the ordinal determinable. This theory creates a set of attitude statements through the coherent bifurcation of predicates with other predicates and their logical opposites. An explicit order is induced upon the statements, with respect to their intrinsic favourability, without the necessity of analysing response data. Moreover, the theory can predict the magnitude of the distances between attitude items and the order upon them (the so-called “ordered metric scale” of unidimensional unfolding; Coombs, 1964). These distances, however, are ephemeral, as, in Michell’s (1994) theory, attitude statements differ only in kind, not amount, by the addition or removal of qualitatively different predicates.
Michell’s (1994) theory demonstrates that stimuli with no obvious quantitative features are amenable to experimental control. Hence it would seem that lack of experimental control is not the obstacle preventing psychological measurement (Trendler, 2009). Several studies, however, found that the theory of conjoint measurement (Luce & Tukey, 1964) was supported by response data elicited by statements created using Michell’s theory (T. Johnson, 2001; Kyngdon, 2006; Michell, 1994; Sherman, 1994). Do these studies not suggest that latent psychological quantities can be revealed by the manipulation of qualitative stimulus features?
No. Each of these four studies used only six items. In the context of Coombs’ (1964) theory of unfolding, the probability that the data will satisfy the conjoint measurement cancellation axioms at random is .5874 (Michell, 1994). Hence these studies were biased towards a positive result and as such they cannot be interpreted as providing evidence of quantity. When this bias has been controlled by the use of eight attitude statements, failure of the cancellation axioms has been observed (Kyngdon & Richards, 2007). Application of the theory of conjoint measurement has therefore revealed no compelling evidence of quantitative attitudes, even when the relevant stimuli have been created and manipulated via theory.
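To make concrete what a cancellation axiom demands of data, the double cancellation condition of Luce and Tukey (1964) can be sketched as a deterministic check on a two-way array. This is a generic illustration only. It is not the unfolding-specific test used in the studies cited above, the function name is hypothetical, and both example arrays are invented. For rows a, b, c and columns x, y, z, the condition states that if cell (a, y) is at least cell (b, x), and cell (b, z) is at least cell (c, y), then cell (a, z) must be at least cell (c, x).

```python
from itertools import permutations


def double_cancellation_violations(array):
    """Return every (rows, cols) pair of triples that violates double cancellation.

    For rows (a, b, c) and columns (x, y, z) the condition is:
    if array[a][y] >= array[b][x] and array[b][z] >= array[c][y],
    then array[a][z] >= array[c][x].
    """
    n_rows, n_cols = len(array), len(array[0])
    violations = []
    for a, b, c in permutations(range(n_rows), 3):
        for x, y, z in permutations(range(n_cols), 3):
            antecedent = (array[a][y] >= array[b][x]
                          and array[b][z] >= array[c][y])
            if antecedent and not array[a][z] >= array[c][x]:
                violations.append(((a, b, c), (x, y, z)))
    return violations


# Invented additive array: each cell is a row effect plus a column effect,
# so the condition must hold for every triple.
additive = [[r + c for c in (0, 1, 3)] for r in (0, 2, 5)]

# Invented non-additive array constructed to break the condition.
non_additive = [[0, 5, 1],
                [4, 0, 5],
                [2, 4, 0]]

print(len(double_cancellation_violations(additive)))
print(len(double_cancellation_violations(non_additive)) > 0)
```

With few stimuli there are few triples to test, so the antecedent is seldom met and the axiom can pass trivially. That is the sense in which a small item set biases such tests towards a positive result.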
Caveats must also be made for prospect theory and the Lexile Framework. The theory of conjoint measurement was used as the formal proof for prospect theory (Kahneman & Tversky, 1979), but it has never been applied empirically to test the hypothesis of quantitative utility. Furthermore, no explicit unit of measurement has ever been defined for utility. This needs to be remedied before it can be plausibly argued that scientific measures of utility are possible. However, this is not to gainsay cumulative prospect theory. Not only has it explained the Allais Paradox and the equity premium puzzle (Benartzi & Thaler, 1995), it predicts a variety of phenomena, from the behaviour of options traders (Fox, Rogers, & Tversky, 1996) to asset pricing (Barberis, Huang, & Santos, 2001) and insurance policy choice (E. J. Johnson, Hershey, Meszaros, & Kunreuther, 1993). Trendler’s (2009) failure to discuss utility theory was itself a significant weakness of his analysis.
Elsewhere (Kyngdon, 2011a), I have empirically tested the Lexile Framework using the theory of conjoint measurement within Karabatsos’ (2001) probabilistic framework. I found that the single cancellation axiom was rejected. Only through permutation of the columns of the conjoint array were all cancellation axioms supported. Given that the columns were ordered according to the magnitude of the Lexile item difficulty measures, I concluded that the item difficulty systematic measurement error of 170 L (Stenner et al., 2006) was the plausible cause of axiom violation in the original array. This implies that the Lexile Framework does not fully describe the cognitive processes which cause individual differences in reading ability. Recent research by Luce and Steingrimsson (2011), however, suggests a different manner by which the conjoint axioms could be tested upon the Lexile Framework.
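How a column permutation can restore an order-based axiom may be sketched as follows. The sketch makes strong assumptions. It reads single cancellation, relative to a hypothesized ordering of rows and columns, as the requirement that cell values be non-decreasing along every row and down every column. It is deterministic rather than probabilistic, so it is not Karabatsos’ (2001) procedure. The array of response proportions and both function names are invented for illustration.

```python
def is_monotone_in_presented_order(array):
    """True if values never decrease along any row or down any column."""
    rows_ok = all(row[j] <= row[j + 1]
                  for row in array
                  for j in range(len(row) - 1))
    cols_ok = all(array[i][j] <= array[i + 1][j]
                  for i in range(len(array) - 1)
                  for j in range(len(array[0])))
    return rows_ok and cols_ok


def permute_columns(array, order):
    """Return a copy of the array with its columns rearranged into `order`."""
    return [[row[j] for j in order] for row in array]


# Invented proportions correct. Rows: ability groups, lowest to highest.
# Columns: items in their nominal order, hardest to easiest, with the
# second and third items deliberately misplaced.
observed = [[0.20, 0.45, 0.35, 0.60],
            [0.30, 0.55, 0.50, 0.75],
            [0.50, 0.70, 0.65, 0.90]]

reordered = permute_columns(observed, [0, 2, 1, 3])

print(is_monotone_in_presented_order(observed))
print(is_monotone_in_presented_order(reordered))
```

In this invented case the array fails in its nominal column order and passes once two adjacent columns are swapped. This is the pattern one would expect if error in the item difficulty measures had misordered items of similar difficulty.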
Conclusion
Presenting his Millean Quantity Objection, Trendler (2009) contended that psychological measurement was impossible. Stressing the importance of Hölder’s (1901) first axiom, which stipulates that a magnitude of a quantity must be either greater than, less than, or equal to any other magnitude, Trendler argued that mental attributes are neither controllable nor manipulable to the extent demanded. The requisite experimental apparatus does not exist and never will. Hence neither will psychological measurement.
It was argued that Trendler’s analysis ignored deeper issues. It is not the lack of experimental apparatus and control per se which determines the possibility of psychological measurement, but the actual existence of psychological quantities and descriptive theories of the systems in which such quantities behave. Descriptive theory in addition to experimental apparatus and control has been the story of measurement in physics. Whilst temperature today is measured by such sophisticated apparatus as electronic thermometers, such devices are only possible owing to the advances in physical theories of temperature which have occurred from at least Boyle (1662) onwards. Scientific measurement cannot be conducted in isolation from theory, however sketchy and incomplete such theory may be initially.
Recognizing the importance of theory for psychological measurement, Michell (2008b) argued that descriptive theories of the response process were needed, in which the hypothesized psychological quantity is explicitly connected to identifiable features of the relevant stimuli. Expanding Michell’s thesis, it was argued that measurement may be restricted to those classes of stimuli that possess extant quantities, either discrete or continuous. Furthermore, these stimulus quantities must be capable of being empirically manipulated so as to produce, in conjunction with descriptive theory, plausibly homogeneous differences between degrees and pairs of degrees of the relevant, hypothesized psychological quantity.
Cumulative prospect theory (Tversky & Kahneman, 1992) and the Lexile Framework for Reading (Stenner et al., 2006) were presented as example theories. The former connects the outcomes and outcome probabilities of simple gambles to utility and the latter connects sentence length and word frequency of reading test items to individual differences in reading ability. It was shown that for relevant stimuli, manipulation of stimulus quantities yielded putative homogeneous differences between degrees and pairs of degrees of utility and reading-item difficulty. By way of contrast, whilst attitude and mathematics items are capable of experimental manipulation, only heterogeneous causes for differences between degrees and pairs of degrees of mathematical ability and attitude are identifiable. Hence despite the sophistication of modern psychometric IRT models, attitudes and mathematical ability are not measurable. Trendler’s (2009) Millean Quantity Objection, therefore, appears to have force for psychological attributes where heterogeneous, qualitative differences between degrees can be deduced from the relevant stimuli. Attempts at genuinely scientific, psychological measurement may best commence with the identification of classes of stimuli which contain extant quantities.
Footnotes
Appendix A: Cumulative prospect theory
Appendix B: The Lexile Framework for Reading
Appendix C: The three-parameter logistic model
Appendix D: Item Response Theory unfolding models
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
