Abstract
Chalmers recently published a critique of the use of ordinal
Keywords
Introduction
This note is a response to the recent article by Chalmers (2018) concerning alleged misconceptions around ordinal
Like Chalmers, we will mainly refer to coefficient
Assumptions, Consequences, and Definitions
Continuous Random Variables Versus Discretizations of Continuous Random Variables
We agree with Chalmers that the calculation of coefficient
Chalmers is correct that both the CTT model and the general notion of reliability do not require continuous variables for direct and coherent interpretations but he is mistaken that coefficient
When computing coefficient
This attempt to recover the continuous structure of latent response processes is also the motivation for the use of underlying variable approaches such as probit regression models, as well as polychoric correlation (e.g., Jöreskog, 1994; Quiroga, 1992), or hybrid approaches in the categorical variable methodology estimator of Muthén (1984) in structural equation modeling and factor analysis in general. Therefore, Chalmers’ allegations of the limited usefulness of such an approach would apply equally to this long tradition of widely used underlying variable approaches and the long list of developments that have emerged from this framework.
Mathematically, of course, coefficient
Information Content of Related Measurements
It is false that “ordinal
The fact that these correlations are similar is simply a consequence of the discretization procedure; all measurements (discretized or not) here are quantifying the same phenomenon. The information content in any particular measurement is free to vary depending on the exact measurement procedure, but naturally the information contents of similar measurements will be similar. They need not be identical, and nowhere do Zumbo et al. (2007) claim otherwise. The fact that ordinal
The Definition of the Classical Test Theory Model
Next, it must be noted that the CTT model of measurement error does not assume that
where
The two properties that Chalmers claims are assumptions of the model are easily seen to be consequences of the model under this correct specification. To wit,
Applying double expectation, it follows that
The measurement error model that Chalmers proposes is actually what is known as an errors-in-variables model, which is common in the econometrics literature (e.g., see Hausman, 2001). This is a much weaker measurement error model than that of CTT and lacks the rich structure induced by the CTT model’s individual-level exchangeability of errors condition. This condition is really the key, novel structure of the CTT model; without it, we would not have the defining property that the expectation of the observed score should equal the true score for every individual (see Kroc & Zumbo, 2019, for more discussion).
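As an informal illustration of how population-level properties can follow from an individual-level condition, the sketch below simulates data in which each person's error has mean zero given that person. The distributions, sample size, and variable names are assumptions chosen for illustration only, and we assume that zero mean error and zero covariance between true scores and errors are the kinds of properties at issue. The error spread is deliberately made to depend on the true score, so errors are not independent of true scores, yet both properties still hold approximately in the sample.

```python
import random
import statistics

rng = random.Random(0)
n_people = 20_000

# Hypothetical population: true scores T vary across people.
true_scores = [rng.gauss(50.0, 10.0) for _ in range(n_people)]

# Individual-level condition: each person's error has mean zero given that
# person. The error spread depends on T, so E is not independent of T.
errors = [rng.gauss(0.0, 1.0 + abs(t - 50.0) / 5.0) for t in true_scores]
observed = [t + e for t, e in zip(true_scores, errors)]


def covariance(xs, ys):
    """Population-style covariance of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return statistics.fmean([(x - mx) * (y - my) for x, y in zip(xs, ys)])


mean_error = statistics.fmean(errors)           # should be close to 0
cov_true_error = covariance(true_scores, errors)  # should be close to 0
var_true = covariance(true_scores, true_scores)   # close to 100 by design

print("mean error:", mean_error)
print("Cov(T, E):", cov_true_error)
print("Var(T):", var_true)
```

The sample covariance is small relative to the true score variance, which is what one would expect if the covariance is zero in the population.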
The Definition of Reliability Versus Its Quantification
Researchers and test users often associate the concept of reliability with terms such as “dependability,” “precision,” “repeatability,” and so on, assuming these things are consistent with the mathematical definition in the CTT. In that context, reliability is defined as the ratio of the true score variance to the total variance or, equivalently, as the squared correlation between observed scores and true scores (Gulliksen, 1950; Lord & Novick, 1968; Novick, 1966). Although reliability has been defined in many different ways in test theory, and for most purposes it is immaterial which algebraic expression is taken as a definition and which expressions are regarded as theorems, one advantage of taking the ratio of true score variance and observed score variance as the definition is that it encompasses all observed scores with nonzero variance, whereas the squared correlation is not defined if true score variance happens to be zero.
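The equivalence of the two expressions can be checked numerically. The following is a minimal sketch under assumed normal distributions with a true score variance of 4 and an error variance of 1, so that the population reliability is 0.8. All names and parameter values are illustrative assumptions, and true scores are of course only available here because the data are simulated.

```python
import math
import random

rng = random.Random(1)
n = 50_000

# Assumed setup: Var(T) = 4 and Var(E) = 1, so reliability = 4 / (4 + 1) = 0.8.
true_scores = [rng.gauss(0.0, 2.0) for _ in range(n)]
errors = [rng.gauss(0.0, 1.0) for _ in range(n)]
observed = [t + e for t, e in zip(true_scores, errors)]


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Definition 1: ratio of true score variance to observed score variance.
variance_ratio = covariance(true_scores, true_scores) / covariance(observed, observed)

# Definition 2: squared correlation between observed scores and true scores.
squared_correlation = correlation(observed, true_scores) ** 2

print("variance ratio:", variance_ratio)
print("squared correlation:", squared_correlation)
```

The two sample quantities differ only through the sample covariance between true scores and errors, which vanishes in the population.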
More generally, Zimmerman and Zumbo (2001) introduced an operator theory formulation of CTT. They described the measurement process as a collection of linear operators acting on a Hilbert space of true score vectors. In this way, the concepts of true score and error score can be naturally associated with projection operators on this Hilbert space. Once this identification is made, metric concepts of distance, length, angle, and orthogonality have immediate implications for test theory. They went on to show, exploiting their operator formalism, that one can consider reliability as a mathematical object that can be defined as another type of projection.
It is this mathematical object, the conventional CTT reliability, that Zumbo and his colleagues call the “theoretical reliability.” The qualifier ‘theoretical’ is appropriate here because this object emerges from the abstract mathematical structure of the underlying (Hilbert) space, and this object is not formally estimated in day-to-day psychometric work. Instead, quantifiers like coefficient
Although it is commonplace to see the phrase “estimate the reliability” in the psychometric literature, the term “estimate” is deceptive. From a purely formal perspective, one could say that quantifiers like coefficient
Because the mathematical object of (theoretical) reliability is defined as a ratio of two components of variance with respect to a population, a given numerical value of “reliability” can be associated with many different combinations of values of true-score variance and error-score variance. To resolve this, one may choose to bound the error term in this quotient and therefore define a quantifier of the reliability by the design of a measurement experiment. Thus, there are choices one must make, from (1) choosing the manner in which to bound the error, such as internal consistency of a single test administration, interrater variation, or measurement variation over time, to (2) designing the experiment to actually measure the quantifier of interest, the estimand, to (3) choosing the estimator. Different estimators naturally yield different properties of their resultant sample estimates. Which of these properties are most desirable depends on the objective of the psychometric analysis.
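A small numerical illustration of the first point: the variance pairs below are hypothetical values chosen only to show that quite different combinations of true-score and error-score variance produce the same reliability.

```python
import math


def reliability(true_var, error_var):
    """Ratio of true-score variance to total (true + error) variance."""
    return true_var / (true_var + error_var)


# Hypothetical (true-score variance, error-score variance) pairs.
variance_pairs = [(4.0, 1.0), (8.0, 2.0), (0.4, 0.1)]
values = [reliability(t, e) for t, e in variance_pairs]

for (t, e), r in zip(variance_pairs, values):
    print(f"Var(T)={t}, Var(E)={e}, reliability={r:.3f}")
```

Each pair has the same ratio of true-score to error-score variance (4 to 1), so the reliability alone cannot distinguish among them.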
Confusion about estimators (mathematical expressions), estimates (sample values of an estimator), and estimands (target quantities that estimators are defined to quantify) plagues most discussions of reliability and coefficient
Stevens’ “Scales of Measurement” and Continuity
The Frivolity of Stevens’ Scales of Measurement
It is frustrating that so many quantitative social scientists continue to rely on Stevens’ (1946) proposed “scales of measurement” as a coherent way to distinguish and categorize measurements. Consideration of these scales is nearly absent in the mathematical and statistical literature of at least the past 30 years, and with good reason: they do not categorize actual measurements in a statistically useful way. Nevertheless, many quantitative social scientists continue to appeal to these scales to try to justify usage or criticism of all manner of methodological choices, often contributing little more than confusion. It is high time to stop this.
Quantitative social scientists often cite Stevens’ scales of measurement as a reason why it is or is not appropriate to apply a particular statistical procedure to certain kinds of data. Classically, Stevens proposed that one should only consider count and proportion-based statistics for nominal data, additionally allowing rank-based statistics for ordinal data, mean-based statistics (including covariances and Pearson correlations) for interval data, and making no restrictions at all on ratio data. What seems to have been lost in the many decades since Stevens’ original proposal is the criterion by which he judged the appropriateness of these statistics for these different conceptual types of data. This criterion was invariance of the statistic under a particular group structure (Stevens, 1946, p. 678); for example, Stevens argued that statistics deemed appropriate for nominal data should be invariant under permutations of the arbitrary labels one assigns to the nominal categories. However, this particular criterion that Stevens proposed is only one of many different criteria one could imagine. As the decades since Stevens’ original proposal have borne out, many statistics have enjoyed immensely successful usage in a variety of contexts that would be deemed strictly “inappropriate” according to Stevens’ criterion.
The most obvious example of this is the fact that no one seems to have any problem with including nominal or ordinal variables as predictors in a regression model. Consider the following simple model:
where
When
Applying double expectation, we know that
Plugging this expression into the equations in Equation (2), we recover the natural interpretations of
A simpler example of the unhelpfulness of Stevens’ scales of measurement is that, for binary random variables, proportions are mathematically equivalent to means. Indeed, suppose
A final example of the unhelpfulness of Stevens’ scales of measurement is supplied by a statement in Chalmers (2018) that characterizes what he considers to be “Misconception 1” surrounding the use of ordinal Coefficient Additionally, interval data do not inherently require an infinite number of subdivisions in the measured variables (i.e., do not need to be coded with decimal places or fractions). This measurement scale only requires that the distances between commensurate values represent the same quantity. (p. 1061)
It is this final line that explains Chalmers’ confusion. The confusion is understandable though given that this (incorrect) characterization of interval data has become the industry norm in quantitative social science.
Granting this interpretation for the moment, Chalmers is primarily concerned with binary variables arising within the context of dichotomous item responses on a test. Thus, these variables encode whether a respondent answers each item on the test correctly or not. If we consider a single-item test where
Continuous Data
Reexamining the above quote from Chalmers uncovers yet another important flaw in Stevens’ scales of measurement. So far, we have blithely accepted Stevens’ criterion of invariance of a statistic under the group transformation appropriate to the scale of measurement. However, it is not at all obvious what invariance under a group transformation meant for Stevens.
Certainly, it did not mean that the value of a statistic would remain unchanged after a group transformation since, for example, the mean, median, and mode will all change values when scaling a random variable by anything other than unity. It seems what he envisioned was that if one applied the group operation appropriate to the proposed scale of measurement to a set of sample data, then the particular sample data point (realized or hypothetical) that corresponded to the sample statistic would remain the same after the transformation. This is what Stevens seems to imply with the following: Thus, the case that stands at the median (mid-point) of a distribution maintains its position under all transformations which preserve order (isotonic group), but an item located at the mean remains at the mean only under transformations as restricted as those of the linear group. (p. 678)
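This reading of Stevens’ criterion can be made concrete with a toy data set. The sketch below uses five hypothetical values chosen so that one case sits exactly at the mean; the helper names are illustrative assumptions. It tracks which case (by index) sits at the median and at the mean before and after an order-preserving but nonlinear transformation (cubing) and a linear one.

```python
import statistics

# Hypothetical sample chosen so that one case sits exactly at the mean (4)
# and another sits at the median (3).
data = [1, 2, 3, 4, 10]


def case_at(values, target):
    """Index of the first case whose value equals target, or None if no case does."""
    for index, value in enumerate(values):
        if value == target:
            return index
    return None


def median_case(values):
    return case_at(values, statistics.median(values))


def mean_case(values):
    return case_at(values, statistics.mean(values))


cubed = [x ** 3 for x in data]       # preserves order, but is not linear
linear = [2 * x + 5 for x in data]   # linear transformation

print("median case:", median_case(data), median_case(cubed), median_case(linear))
print("mean case:  ", mean_case(data), mean_case(cubed), mean_case(linear))
```

The same case stays at the median under both transformations, whereas the case at the mean keeps its position only under the linear one; after cubing, no case sits at the mean at all.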
From this quotation, it is clear that Stevens’ proposed categories of interval and ratio data can only apply to continuous quantities. If not, then there is not necessarily any case/item/data point “at the mean.” For the dichotomous
The only apparent way out of this quandary is to recognize that alleged interval or ratio data must arise from only continuous random variables (or certain kinds of discrete-continuous mixtures). Within the context of Chalmers’ original criticisms then, requiring interval data to make sense of coefficient
We are hardly the first authors to point out some of the many flaws with Stevens’ proposed scales of measurement (see, e.g., Mosteller & Tukey, 1977; Velleman & Wilkinson, 1993; or Chrisman, 1998). Yet Stevens’ original proposal still clings stubbornly to life in quantitative social science circles. While we recognize that Stevens’ work was novel and quite promising in its time, we have learned more than enough in the intervening 70+ years to lay its modern usefulness to rest.
As for the question of determining which statistics are most appropriate for which kind of data, we repeat the advice of Zimmerman (1995) from more than 20 years ago: Current evidence . . . suggests that the probability distribution of a random variable, not the level of measurement, is paramount in determining which statistical test is appropriate. (p. 93)
To this, we generalize that it is the probability distribution of a random variable, not any purported level of measurement, that should determine which statistic is most appropriate. This statement is the generic justification for preferring the use of ordinal
A Measurement Is a Choice; a True Quantity Is Indifferent
Different Measurements Can Quantify the Same Phenomenon
Perhaps the most distressing misconception in Chalmers (2018) is the failure to recognize that we may choose to measure a true quantity encoded in a random variable in many different ways. Chalmers quibbles with the definition of ordinal substituting polychoric correlations into the required matrix to compute coefficient
This statement seems to fail to recognize that we use observed data to study unobserved quantities all the time. Indeed, the entire domain of measurement error is concerned with precisely this enterprise, where a random variable of interest cannot be measured directly, and so can only be studied by some observable proxy. It is hardly necessary to point out how productive this enterprise has been, and there exist vast arrays of resources summarizing the possibilities (e.g., see Gustafson, 2003, or Kroc & Zumbo, 2019).
Chalmers prefers to use a test composed of dichotomous items to bound the reliability of this test as a measure of a discretized version of the latent continuous process. Zumbo et al. (2007) prefer to use the same test to bound the reliability of the test as a measure of the latent continuous process itself. Here, we see that the two proposals want to use the same measurement process (a test of dichotomous items) to study two intimately related random variables: a latent continuous process or one of its possible discretizations. Both of these propositions are perfectly acceptable from a statistical point of view depending on one’s research goals. There is no mathematical, statistical, or conceptual problem with studying some underlying latent phenomenon even if one cannot measure precise realizations of that phenomenon directly. This is what measurement error modelling is for.
However, just because we can compute things, does not mean that those quantities are inherently meaningful, or that they accurately capture the phenomenon we are trying to quantify. In the case of coefficient
Changing Measurements Does Not Change True Scores
In this same vein of measurement as a choice, we point out that Chalmers’ claim that “applying data transformations to the continuous
This of course aligns with reality, since each measurement process generates its own error process. For example, one could measure a person’s height via any one of three measurements: (1) use a tape measure and record height to the nearest centimeter, (2) use a yardstick and record height to the nearest yard, or (3) if the person is taller than you are, record 6 feet; otherwise, record 3 feet. All three of these measurements quantify the same phenomenon, and all come equipped with their own error processes. Of course, some of these measurements are better than others at capturing the true quantity of interest.
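The height example can be sketched in a few lines. The population of true heights and the observer's height below are hypothetical assumptions; the three functions mirror the three measurement rules just described, with all results expressed in centimeters so that the error of each rule can be compared on one scale.

```python
import math
import random

rng = random.Random(2)

CM_PER_YARD = 91.44          # 1 yard = 3 feet = 91.44 cm
OBSERVER_HEIGHT_CM = 170.0   # hypothetical height of the person measuring

# Hypothetical true heights; these are never altered by any measurement rule.
true_heights = [rng.gauss(170.0, 10.0) for _ in range(10_000)]


def tape_measure(height_cm):
    """Rule (1): record height to the nearest centimeter."""
    return float(round(height_cm))


def yardstick(height_cm):
    """Rule (2): record height to the nearest yard (expressed in cm)."""
    return round(height_cm / CM_PER_YARD) * CM_PER_YARD


def taller_than_me(height_cm):
    """Rule (3): record 6 feet if taller than the observer, else 3 feet."""
    return 2 * CM_PER_YARD if height_cm > OBSERVER_HEIGHT_CM else CM_PER_YARD


def rmse(measure):
    """Root mean squared error of a measurement rule against the true heights."""
    squared = [(measure(h) - h) ** 2 for h in true_heights]
    return math.sqrt(sum(squared) / len(squared))


rmse_tape = rmse(tape_measure)
rmse_yard = rmse(yardstick)
rmse_binary = rmse(taller_than_me)

print("RMSE tape measure:", rmse_tape)
print("RMSE yardstick:   ", rmse_yard)
print("RMSE binary rule: ", rmse_binary)
```

The list of true heights is the same in all three cases; only the recorded value, and hence the error, changes with the choice of measurement.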
The same holds in the context of studying a latent continuous phenomenon by means of a discretized (e.g., Likert-type) measurement proxy. If
This general algebraic phenomenon has been exploited in the past, notably by Ekström (2009), to show that the statistical information captured by the phi coefficient is equivalent to that captured by the tetrachoric correlation, and that, under mild conditions, the statistical information captured by Spearman’s rank correlation is equivalent to that captured by the polychoric correlation. These results reflect the general reality that the way we choose to quantify (i.e., measure) a particular phenomenon will not change the true underlying value of that phenomenon (observer effects and quantum entanglement aside). The discretization that occurs when measuring a latent continuous trait by a Likert-response to a certain scale simply changes the measurement X, and corresponding error E; it does not affect the true score T.
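A simple special case gives a feel for this kind of equivalence. The sketch below is not Ekström’s general result; it assumes a bivariate normal latent pair with correlation 0.6, with both variables dichotomized at their medians. In that special case, the phi coefficient and the latent correlation are linked one-to-one by the classical relation phi = (2/π) arcsin(ρ), so the latent correlation can be recovered from phi as sin(π phi / 2). All names and parameter values are illustrative assumptions.

```python
import math
import random

rng = random.Random(3)
rho = 0.6        # assumed latent correlation
n = 100_000

x_binary, y_binary = [], []
for _ in range(n):
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    x = z1
    y = rho * z1 + math.sqrt(1.0 - rho ** 2) * z2  # Corr(x, y) = rho
    # Dichotomize each latent variable at its median (zero).
    x_binary.append(1 if x > 0 else 0)
    y_binary.append(1 if y > 0 else 0)


def pearson(a, b):
    """Pearson correlation; applied to two 0/1 variables this is the phi coefficient."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


phi = pearson(x_binary, y_binary)
rho_recovered = math.sin(math.pi * phi / 2.0)

print("phi coefficient:", phi)
print("latent correlation recovered from phi:", rho_recovered)
```

Phi is smaller than the latent correlation, but because the mapping between them is invertible in this setting, the dichotomized data carry the same information about the latent correlation.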
Final Thoughts
Chalmers (2018) proposes four misconceptions surrounding the justification for and use of ordinal
We have seen that Claim (1) is incoherent because of its reliance on an argument from Stevens’ scales of measurement. Claim (2) is actually correct and reflects neither a misconception nor a problem of any kind (see section Different Measurements Can Quantify the Same Phenomenon). Claim (3) is empirically refuted in Zumbo et al. (2007), while Chalmers’ (2018) theoretical justification for it hinges on an algebra mistake. Both Claim (3) and Claim (4) expose a failure to recognize that one can use many different measurements to quantify the same phenomenon.
Finally, it should be noted that Chalmers (2018, pp. 1067-1068) himself concedes that ordinal
Acknowledgements
We acknowledge the support of The University of British Columbia—Paragon Research Agreement. The authors would like to thank Dr. Oscar L. Olvera Astivia for comments and feedback on an earlier version of this article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
