Abstract
I first argue that there are three major currents in the contemporary debate on operationalism. These are: (a) methodologists who criticize operationalism qua philosophy, (b) psychologists who practice operationalization that is allegedly philosophically noncommittal, and (c) critics who claim that psychologists’ validation practices commit them to stronger operationalism than psychologists realize. I then outline respectful operationalism and argue that it avoids the main criticisms of operationalism while incorporating some of the epistemic benefits of operational definitions. I show how respectful operationalism aligns with other operationalism-friendly theories, such as generalizability theory and Michael T. Kane’s argument-based approach to validity.
There are, broadly speaking, two angles to the story of operationalism 1 in psychology. On the one hand, there is the story of operationalism as a philosophical thesis—a story that goes something like this: Operationalism sprouted from the work of Percy W. Bridgman, an American who went on to win the Nobel Prize for his work on the physics of high pressures. Bridgman’s observations of conceptual confusion in physics led him to think that it is useful to tie the meaning of a concept to the empirical operation used to capture that concept. Operationalism soon enticed some of the most famous psychologists, economists, and philosophers of the first half of the 20th century, from S. S. Stevens and B. F. Skinner to Carl Hempel and Paul Samuelson. Several philosophically inclined bright minds proposed and debated different versions of operationalism for two decades, but whatever way they spun it, operationalism seemed fatally flawed. By the mid-1950s, Bridgman felt that he had “created a Frankenstein” (quoted in Chang, 2009; cited in Frank, 1956) and many enthusiasts gave up on operationalism. Many contemporary commentators call operationalism an erroneous, unworkable, and dead philosophy (Lovett & Hood, 2011; Maul, 2017; Meehl, 1995).
The other side of the story is that of psychological practice, that is, the story of operationalism as it guides the way psychologists build, validate, and use measures of psychological attributes. The exact nature of the operationalism contemporary psychologists practice is hard to pin down, mostly because it is rare for people to declare themselves operationalist. Rather than debating an ism, psychologists focus on an activity: operationalization, the process of defining their target concepts in measurable terms. Textbooks such as Coolican’s Research Methods and Statistics in Psychology (2014), Haslam and McGarty’s Research Methods and Statistics in Psychology (2003), as well as Evans and Rooney’s Methods in Psychological Research (2011) tell the student of psychological measurement that operationalization is a crucial step on the road to high-quality research (esp. Coolican, 2014, p. 37). Although there is arguably no consensus on how operationalization should be carried out in psychology (Slife et al., 2016), it is no doubt an activity that permeates all of psychological research.
Without the details, it looks like we might have two separate stories of scientific (or philosophical) progress here: on the one hand, a story of a philosophical theory (operationalism) that was rejected after careful analysis, and on the other hand, a story of an experimental activity (operationalization) that lives on because it has stood the test of time and scientific scrutiny. But the situation is complicated by the fact that contemporary psychologists are accused of unwarranted, epistemically suspect operationalism (Borsboom, 2006; Maul, 2017; McGrane, 2015; Michell, 2008). For example, Michell (2008, esp. p. 10) argues that operationalism is a way for psychologists to cover up their ignorance about the nature of their target attribute. What is going on? Are psychologists practicing an unworkable operationalism? Or are the critics misdiagnosing current psychology? What is the relationship between operationalization in contemporary psychology and the supposedly dead operationalist philosophy?
My first aim in this article is to clarify the status of current debates on operationalism. I argue that there are three major currents in the contemporary debate. First, there are methodologists who criticize the above-described operationalist philosophy. Second, there are psychologists who practice the above-described operationalization assuming that this activity is philosophically noncommittal. Third, and most importantly, there are those who criticize psychologists for sliding into an operationalism that is more demanding and less defensible than psychologists realize. The juxtaposition of these three views shows that the debate on operationalism is not resolved.
In the second argumentative part of this article I define and defend a new type of operationalism, one that I call respectful operationalism. Roughly put, respectful operationalism is the view that psychologists may define a target concept in terms of a test (e.g., “intelligence is what the test tests”), as long as that test is validated to incorporate common connotations of the target concept and the usability of the measure. I argue that respectful operationalism incorporates some of the main benefits of operationalist thinking while guarding against a slide to the kind of operationalism that critics have warned against.
The article proceeds as follows. The next section provides a brief history of operationalism. Then, the paper diagnoses contemporary debates on operationalism. After this, the article defines and defends respectful operationalism. The section following this responds to objections and the final section concludes.
Brief history of operationalism
Operationalism stands for multiple things. For some, it is a theory of meaning: it tells us what it means to say that “Bill is depressed.” For others, it is a research heuristic: it tells us how to study whether or not Bill is depressed. For yet others, operationalism is a metaphysical thesis: it tells us what depression is, for example, that depression is a bodily condition (cf. Flanagan, 1980). Because the term has so many meanings, it is easy to mislabel authors’ ideas on operationalism. 2 Thus, in this brief historical overview I will refrain from classifying ideas that have been denoted by the term operationalism. After the historical overview, I will zoom in on more exact definitions.
Percy W. Bridgman is typically credited as the father of operationalism (Chang, 2009). His most famous, and most often criticized, declaration of operationalism appeared in his book The Logic of Modern Physics (1927): “In general, we mean by any concept nothing more than a set of operations; the concept is synonymous with the corresponding set of operations” (p. 5).
This statement has been read as a radically reductionist theory of meaning: there is nothing more to the meaning of concepts such as length than the operations used to measure those concepts. But some commentators believe that this is a misinterpretation, and that actually Bridgman defended a less radical operationalism (see, e.g., Garner et al., 1956). According to Chang (2017b), Bridgman’s less radical operationalism stands for the view that scientists should be careful not to assume that different operations measure the same thing, even when the same name or term is applied to the two operations.
One of the main reasons Bridgman opted for operationalism was his desire to make science “safe.” Bridgman thought that before Einstein’s revolution, physics was plagued by conceptual confusion that had resulted from unreflective extension of old concepts to new realms of research (Chang, 2009). According to Bridgman, operationalism would guard against such unreflective extensions, thus making scientists less vulnerable to conceptual confusion.
However, Bridgman’s notion of safety is not the only one associated with operationalism. Many psychologists thought that operationalism would secure intersubjective agreement and repeatability of inferences. On this view, defining concepts in terms of operations is meant to eliminate the kind of error and disagreement that tends to accompany inferences to nonoperational concepts. Hull (1968) summarizes the desire for safety thus:
If a scientific concept is synonymous with a set of operations, and if these operations are such that they can be performed publicly by any qualified person, then the intersubjectivity and repeatability so important to the objectivity of science are guaranteed. (pp. 438–439)
Carl Hempel, Herbert Feigl, and other logical empiricists were intrigued by Bridgman’s operationalism. The operationalist focus on observability and testability resonated with logical empiricists, who were trying to put the language of science on a firm, verifiable, or confirmable basis. Rudolf Carnap dreamed of a language of science in which all statements are either testable by uncontroversial observational means or reducible to statements that are testable in such a manner. However, operationalism differed from logical empiricism with respect to the unit of analysis: where operationalism focused on the meaning of concepts, logical empiricism typically looked at the empirical meaning of statements (Hempel, 1954). Nonetheless, proponents of the two approaches debated and exchanged ideas.
The seeds of operationalist thinking had been planted well before Bridgman formulated his operationalism and before logical empiricists came across it. In his paper “How to Make Our Ideas Clear” (1878), C. S. Peirce expressed ideas that resonate with later operationalism, arguing that our conception of an object is constituted by the effects that object has. Inclinations toward operationalism are also implicit in what came to be known as behaviorism. In his 1913 article, John B. Watson introduced behaviorism in reaction to the introspective psychology of his day, which focused on consciousness and other unobservable attributes. 3 Watson (1913) thought psychologists should stop referring to consciousness and related unobservables, because there is no agreement on what those concepts stand for and how to study them: “The time seems to have come when psychology must discard all reference to consciousness; when it need no longer delude itself into thinking that it is making mental states the object of observation” (p. 163).
Instead, according to Watson, psychologists should focus on observable changes in behavior. More specifically, Watson’s behaviorist psychology consists of reports of reactions in experimental settings: how people respond to visual stimuli or to words in a memory test, for example. The operationalist flavor is clear here, because memory, vision, and other psychological capacities are studied and described in terms of behavior in a test setting, which is a type of operation. 4
Watson’s focus in his 1913 description of behaviorism is not that inner states do not exist, but that inferences to internal states are more controversial than reports of behavior. Two people can easily agree on the answers a patient gives to a personality test but disagree on what those answers mean in terms of the patient’s personality and mental health. Focus on behavior will make psychology more transparent and accessible, and thereby more useful to other human endeavors, according to Watson: “If psychology would follow the plan I suggest, the educator, the physician, the jurist and the business man could utilize our data in a practical way, as soon as we are able, experimentally, to obtain them” (1913, p. 168).
Operationalism proper, however, was brought to psychology some 20 years later, in the mid-1930s. 5 One of its most influential early proponents was Harvard psychologist Stanley Smith Stevens. Stevens, who worked within the psychophysical tradition studying, for example, subjective auditory experience, cited Carnap and Bridgman in his papers on operationalism. Stevens studied how different properties of a tone depend on each other, for example, how the subjective experience of volume of a tone depends on the intensity and frequency of the tone. In one of Stevens’ experiments, the participants would hear two tones of different frequencies and adjust the intensity of one of the tones until they experienced the volumes of the two tones as equal (Stevens, 1934). Stevens could then define subjective experience of volume in terms of this operation, that is, the subjective experience of volume could be defined in terms of the judgments of difference and equality, which participants make when placed in the experimental setting just described (cf. Stevens, 1935). Stevens’ (1935) motivation for such an operationalist approach was akin to Watson’s: to render psychology and other sciences public and repeatable, and thereby objective (p. 522).
The work of Stevens started discussions on the place of operationalism in psychological methodology. The journal Psychological Review offered a platform for many a paper on operationalism in the late 1930s and the early 1940s, and these discussions culminated in a symposium in 1945. Contributors to the symposium included behaviorist psychologists (e.g., B. F. Skinner), logical empiricists (e.g., Herbert Feigl) and Bridgman himself, thereby bringing together the three main strands of operationalist thinking. The contributors appear to have disagreed over what operationalism is and what its role should be in psychology and science more broadly (Green, 1992). Despite failure to agree on fundamentals, interest in operationalism persisted for another decade or so. In 1953, another operationalism symposium was organized at the meeting of the American Association for the Advancement of Science.
By 1959, however, many previously operationally minded psychologists had come to doubt the adequacy of operationalism (Green, 1992). As logical empiricists modified their own position(s) in light of now-well-known attacks and critiques, their philosophy started to diverge from the early operationalism as well (Hempel, 1950, 1954). In psychology too, operationalism gradually faded into the background, although the discussion reemerged a couple of times: in the early 1980s in the pages of the Journal of Mind and Behavior, in the early 1990s in the pages of the journal Theory & Psychology, and again in the early 2000s in Theory & Psychology (Feest, 2005).
Although operationalism is no longer a trendy topic in philosophical and methodological debates, several commentators argue that operationalism lives on in various forms in contemporary psychological practice (e.g., Rogers, 1989). In the next section, I will zoom in on the way operationalism is defined, defended, and critiqued in contemporary psychology.
The contemporary debate
Strong, weak, and sliding
In this section, I argue that we can distinguish at least three broad strands of arguments from the contemporary debate on operationalism in psychology. This list is not meant to exhaust debates around operationalism, but to outline some major currents: (a) arguments against strong operationalism, where strong operationalism is the view that psychological concepts denote 6 test-dependent qualities; (b) practitioners of weak operationalism, where weak operationalism is the view that researchers must study test outcomes no matter what the target concept denotes; and (c) arguments criticizing weak operationalism for sliding into strong operationalism. I will now discuss these three currents one at a time.
Against strong operationalism
As examples of arguments against what I call strong operationalism, consider the following statements:
Although positivism, behaviorism, and operationalism have all since been almost uniformly rejected as philosophically unworkable, their legacies remain influential throughout the social-scientific literature. (Maul, 2017, p. 60)
Our reference to “contemporary operationism” is not meant to suggest that operationism is a live position in contemporary philosophy of science, only that it informs a particular contemporary approach to the study of psychopathology. (Lovett & Hood, 2011, p. 219, Footnote 2)
Accepting operationism (an erroneous philosophy of science) and the pseudomedical model (definition by syndrome only) engenders a wrongheaded research approach, unlikely to pay off in the long run. (Meehl, 1995, p. 267)
What is the operationalism these critics are calling philosophically unworkable, erroneous, and dead? According to Benjamin Lovett and Brian Hood (2011), operationalism manifests when “investigators treat observable behavior as constitutive of mental disorders” (p. 209). Paul Meehl (1995) states that “some superoperational psychologists talk as though inferring theoretical entities were somehow methodologically sinful” (p. 268). Finally, Andrew Maul (2017) writes that in operationalism, “[t]heorizing about a psychological attribute . . . cannot take place independently of the construction of a specific measure of that attribute” (p. 60).
From the authors’ remarks it looks like Lovett, Hood, Maul, and Meehl regard operationalism as an approach in which targets of psychological investigation are conceptualized (and theorized about) in terms of observable behavior in a test setting, rather than in terms of test-independent qualities that drive that test behavior. I suggest we capture this kind of operationalism with the label strong operationalism:
To explicate D1, I propose the following characterizations of test-dependence and test-independence:
Why is strong operationalism dead, erroneous, and unworkable? One often-repeated charge against operationalism is that it leads to harmful proliferation of concepts (e.g., Hempel, 1966; Hull, 1968; Leahey, 1980). Strong operationalism could lead to such proliferation, if it is complemented with the specification that all psychological measures (of which there are plenty) define a unique psychological concept. Another common objection is that operationalism is arbitrary (cf. Meehl, 1995), in the sense that it is not appropriately responsive to empirical and theoretical knowledge about psychological phenomena and meanings attached to psychological concepts. For example, the often-quoted operationalist slogan “intelligence is what the tests test” (Boring, 1923) seems to illegitimately ignore the fact that to most people, intelligence connotes much more than performance on an IQ test.
The most pervasive worry, however, is that test performance is simply the wrong kind of thing for psychologists to focus on. Many critics of operationalism are realists about psychological measurement (Borsboom et al., 2003; Hood, 2008; Meehl, 1995). 8 They argue that psychologists should focus on the measurement of test-independent qualities—that is, qualities that may bring about a certain kind of observable behavior in a test setting but that are conceptually or ontologically independent of tests.
In order for realism to count as more than a statement of research preferences, its proponents need to argue that operationalism fails to reap the epistemic fruits realism offers. Many realists argue that zooming in on the test-independent causes of observed behavior improves understanding as well as researchers’ and policy-makers’ capacity to successfully implement an intervention (for literature on the functions of causes see, e.g., Cartwright & Hardie, 2017; Howick et al., 2013; Machamer et al., 2000). For example, when it comes to treating depression and distinguishing it from anxiety and other related diseases, it would arguably be more useful to zoom in on the biological and psychological causes of depressed mood than to stare at equivocal test scores on the Hamilton Depression Rating Scale (HDRS).
To recap, strong operationalism is the view that target concepts denote observable behavior in response to a test rather than test-independent determinants or causes of that behavior. Strong operationalism has been heavily criticized by researchers who typically identify as realists. But does the critique of strong operationalism apply to practicing psychologists?
Weak operationalism
In 1956, psychologists Wendell Garner, Harold Hake, and Charles Eriksen wrote in the Psychological Review that some of their contemporaries tended to define perception (and other concepts) in terms of reactions to stimuli in a test. Garner, Hake, and Eriksen argued that this kind of operationalism constituted “a perversion of the fundamentals of [operationalism] as stated by its originators” (p. 149). They thought that in a defensible form of operationalism, participants’ responses would be used not to define the target concept, but to infer the unobservable properties of the target concept (p. 150). The operationalism they outline and defend is compatible with the idea that psychological tests are means to investigate psychological qualities that are ontologically or conceptually independent of the test—that is, a type of realism. Indeed, Rozeboom (2005) refers to Garner, Hake, and Eriksen as psychologists who thought of operationalism as a “version of empirical realism” (p. 1320). If this is the operationalism that contemporary psychologists subscribe to, then the critique presented in the previous section does not apply to contemporary psychology.
It is hard to tell whether contemporary psychologists are operationalists—strong or otherwise—because they do not use the term operationalism much. Rather, textbooks regularly emphasize the need to operationalize target concepts in order to make them measurable (Slife et al., 2016). In their textbook, Evans and Rooney (2011) describe operationalization as an activity that turns a conceptual hypothesis into a research hypothesis, where conceptual hypothesis is an intangible claim such as “Outgoing people are happier than people who keep to themselves” and a research hypothesis is a claim about test outcomes such as “People who score high on standard test E of extraversion give higher ratings of happiness on test H than do people who score low on E.” This move is visually represented in Figure 1.

Figure 1. Operationalization.
If the ubiquitous activity of operationalizing is thought of as a form of operationalism, then operationalism is not a philosophical stance but at best a research heuristic. This kind of operationalism consists merely of the recommendation to state one’s hypotheses in measurable terms, whether or not one is a realist about the target concept. For example, it recommends that researchers study happiness differences in terms of differences in ratings on a questionnaire that is appropriately linked to their target concept, whether or not that concept is defined in realistic terms. Let us call this weak operationalism:
D4 is lax, because this kind of research heuristic is rarely explicitly stated in textbooks or elsewhere as an ism. Rather, this definition is meant to capture an arguably commonsensical motivation that looms behind research psychologists’ ubiquitous operationalizations: researchers need numbers to do statistics, so they construct number-generating operations that seemingly connect with the target concept. To ensure that the connection is not merely apparent, psychologists validate their measures—a process that textbooks typically cover much more extensively than they do operationalization (e.g., Coolican, 2014, p. 37).
Weak operationalism is ostensibly compatible with realism about target concepts. 10 A weakly operationalist researcher might conceptualize depression as a test-independent quality and use measure validation to establish a connection between that quality and its operationalization. In such cases, the previously described criticisms of strong operationalism do not translate into criticisms of weak operationalism. However, the same openness that lets weak operationalists be realists allows strong operationalism to return to psychological research.
Gaps and slides
Weak operationalism is a very unconstrained framework, which means that its practical application might end up looking a lot like strong operationalism, whether or not that is what the researcher intends. More concretely, it is possible that practitioners end up studying operationally defined concepts, because the validation of their measures fails to establish a sufficient connection between the measurement operation and the test-independent target. For example, even if one thinks that happiness is a test-independent concept, 11 one may fail to investigate that concept if the link between a happiness questionnaire and the target concept is not adequately established. The claims one makes on the basis of such a “gappy” happiness questionnaire are merely claims about questionnaire outcomes, not the concept of interest.
This claim about the possibility of a measure–concept gap is not merely a hypothetical worry; it is a general characterization of criticisms that are regularly levelled against psychologists. Psychologists are often accused of reifying their target concepts, that is, they are criticized for illegitimately inferring from an observed test performance to a test-independent “thing” that brings about that performance (Kane, 2006, 2013). As examples of reification, consider debates about mental disorders or intelligence: Hyman (2010) argues that DSM-classified mental disorders are frequently treated illegitimately as natural, test-independent entities, while Gould (1981) famously criticized the inference that differences in IQ test performance point to differences in the brain (but see Hood, 2008, for a critique of Gould). When psychologists reify, they make claims about test-independent entities or qualities even though they only have epistemic warrant for claims about observed behavior in a test.
It may be tempting to think that the awkward slide from realist inclinations to (strong) operationalist research practice is an exception rather than the rule. But several scholars have argued that (strong) operationalism may be ingrained in much of psychological research, and that this is worrying. In a vein similar to Michell (1997, 2008), McGrane (2015) argues that S. S. Stevens’ operationalism helped popularize a research approach where measurement operations create—rather than allow inferences to—psychological concepts:
The psychological sciences have adopted practices where psychological quantities may be invented at the will of the researcher and attention is then focused upon ever more creative and technical means to impose “real number” mathematics upon psychological attributes with little to no theoretical justification for doing so. (p. 6)
In other words, even if there were realist inclinations in the background, the operationalizations are rarely adequately linked back to test-independent qualities, according to McGrane. What that adequate or justifiable “linking back” requires is debated, but many think it would require more theorizing and more rigorous modeling of the connection between test outcomes and causal variables (Borsboom, 2006; Maul, 2017; McClimans et al., 2017). The fact that there is debate about the required improvements does not take away the gist of the argument against contemporary psychological practice: it (arguably) fails to move from claims about test behavior (i.e., operationally defined target concepts) to test-independent qualities.
Contemporary operationalism
We have now identified three positions in contemporary debates on operationalism: (a) arguments against strong operationalism, where strong operationalism is the view that psychological concepts denote test behavior; (b) practitioners of weak operationalism, where weak operationalism is the view that researchers must study test behavior no matter how they conceptualize the targets of psychological research; and (c) critics of psychological research, who accuse psychologists of failing to close the gap between a target concept denoting a test-independent quality and a measure intended to capture that quality.
Whatever position one sympathizes with, it should be apparent that there is currently no shared approach to thinking about and practicing operationalism. In the rest of this article, I am going to defend respectful operationalism (RO), a new approach that navigates between the benefits and problems of earlier forms of operationalism. On the one hand, RO is mindful of the fact that test-independent qualities are hard to understand and measure, and that sometimes the best we can do is study test behavior—a pragmatic attitude that, as I argued above, motivates weak operationalism. On the other hand, RO tries to prevent unjustified inferential slides from test behavior to test-independent qualities—that is, the kinds of slides realists are worried about.
Respectful operationalism
Extra-operational meanings and usability
Roughly put, RO is the view that we may define a target concept in terms of a test, as long as that test is validated to incorporate common connotations of the target concept and the usability of the measure. 12 More technically:
As we will see, respectful operationalists do very similar things to current test-developing psychologists. But there are subtle and consequential differences that move us forward in debates about operationalism. I will explain this intellectual import better in the following section. This section focuses on describing RO.
Respectful operationalists start from the widely recognized, extra-operational connotations 13 of the target concept. For example, intelligence connotes problem-solving ability, which is why it makes sense for intelligence tests to involve problem-solving tasks (Yoakum & Yerkes, 1920). Depression, in turn, connotes low mood and sadness, which is why depression measures should probe these aspects in one way or another. So far, so good. It requires more effort to be respectful of the details of extra-operational connotations, for example, the idea that intelligence tends to be associated with scholastic achievement, or the thought that depression does not encompass anxiety although the two are often coextensive. To incorporate such meanings, respectful operationalists employ techniques that are often called content validation and criterion validation. As an example of criterion validation, we can think of intelligence researchers’ studies on the correlation of intelligence test scores and school grades (Gygi et al., 2017). As for content validation, consider for example an often-cited article by Bagby et al. (2004), who argue that the widely used HDRS is not content valid, because it includes questions about anxiety despite the fact that anxiety is not part of the current conceptualization of depression.
In addition to validating for extra-operational meanings, the respectful operationalist validates her measure for usability. One part of usability is consistency of results. For example, when validating his depression measure, Max Hamilton (1960) studied how consistently two raters assess patients by studying the correlation between the two sets of ratings. Consistency indicates usability, because it is efficient for different people to be able to administer the test and have results that are immediately somewhat comparable. Other methods of reliability estimation indicate how well one test administration represents or coheres with repeated administrations of the test (Cronbach et al., 1963). Such representativeness is useful, because it shows how stable we can expect our operational diagnosis of, say, major depression to be across independent test administrations. Another aspect of usability is the predictive capacities of the test, that is, its predictive validity. It is useful, for example, if patterns of scores on a depression test allow us to predict what treatment the patient is most likely to respond to. Hamilton originally wanted to use his depression measure for this purpose: to classify types of depression and prescribe treatments accordingly (Hamilton & White, 1959; Worboys, 2013).
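The inter-rater consistency check described above can be illustrated with a short computation. The following is a minimal sketch, assuming that consistency is summarized by a Pearson correlation between two raters' total scores for the same patients; the function name `pearson_r` and the ratings are hypothetical and are not taken from Hamilton's (1960) data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of ratings."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two rating lists of equal length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("ratings must vary for the correlation to be defined")
    return cov / sqrt(var_x * var_y)


# Hypothetical total scores given by two raters to the same six patients.
rater_a = [22, 9, 15, 30, 4, 18]
rater_b = [20, 11, 14, 28, 6, 19]

r = pearson_r(rater_a, rater_b)
print(f"inter-rater correlation: {r:.2f}")
```

A high correlation of this kind speaks to usability only: it shows that the two raters order patients similarly, not what drives the patients' responses.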
The respectful operationalist develops her measure by balancing the demands of usability and respect for extra-operational meaning. She drops an item here and there to improve consistency, she then adds an item to incorporate some left-out meanings, then changes the wording on yet another item to make it easier for participants to understand, and so on and so forth. So far this sounds like a validation process realists can get behind, but that is about to change. When the respectful operationalist is satisfied with the result of her balancing act, she declares that the test defines her target concept. She does this knowing that some of the unruly and ever-changing meanings of the concept have been left out. She grants that her operational concept is one among a plurality of concepts that are legitimately designated by the same term.
Why does the respectful operationalist make this move? Because she has not validated the test-independent denotation of her test, that is, she does not know what kinds of attributes or entities her measure tracks. The fact that two interviewers give consistent ratings does not tell us what drives the way interviewees respond. Do their expressions of depressive mood reflect a stable state or merely the feeling that dominated their experience during the test? Are there biological causes to the way patients respond to test items? Are the respondents prone to exaggeration or other bias? Do they appear sad when they are in fact anxious about the test situation? Similarly, correlations between intelligence tests and school grades do not tell us about the determinants of intelligence test scores. Are they determined by an innate ability? Is education a better explanation of score differences? What role does test-wiseness play in intelligence test outcomes? In short, the operationalist validation does not warrant inferences from test score to that which brings it about.
The operationalist leaves these questions unanswered, because the test she has devised has merits of its own. First, operational definitions help ensure comparability of conclusions across studies. For example, if different studies use an HDRS score of seven as the defining threshold of remission from depression, it is easy to compare conclusions across those studies (all other things, such as compliance with interview guidelines, being equal; e.g., Thase et al., 2001). By contrast, if each research group makes its own inferences about what HDRS scores mean, the groups will likely end up talking past each other. Second, the tests that define operational concepts tend to have useful predictive functions. For example, a depression test that predicts the treatment a patient is likely to respond to is useful even if we do not know why the test has this predictive function.
In sum, the respectful operationalist balances two kinds of considerations when validating a test: usability and extra-operational meanings. She then defines her target concept in terms of the test. She uses the test and the concept it defines to serve predictive functions and to compare conclusions across studies.
The intellectual import of respectful operationalism
Respectful operationalists use the same methods that are already psychologists’ bread and butter. What does RO bring to contemporary debates? And more broadly, how does RO relate to recent scholarship on measure and concept development? In this section I explicate the intellectual and pragmatic import of RO.
Although RO looks in many respects like validation-business-as-usual, it moves us forward from several conundrums. First of all, RO does not fall prey to the common objections against strong operationalism. The way RO validates measures for extra-operational meanings guards against arbitrary definitions, such as intelligence defined in terms of political allegiance or well-being defined in terms of high school grade point average (GPA). There is also nothing about RO that prevents new, better tests from replacing old ones. Hence, RO does not lead to uncontrolled proliferation of operational concepts. And finally, although RO admittedly falls short of the explanatory and intervention-aiding capacities of knowledge about causal mechanisms, RO serves other useful aims when causes are not available. More specifically, RO serves comparison across studies and predictive functions.
RO improves upon weak operationalism by explicitly and clearly defining target concepts in terms of tests. By legitimizing operation-specific definitions, RO discourages unwarranted inferential slides to claims about test-independent qualities. But note that nothing about RO commands the use of operational definitions in all of psychology. RO can be pursued side-by-side with the realist project of trying to justify inferences from test scores to test-independent causes—as long as one is clear about what kind of validation, realist or RO, has been carried out on each measure. Sometimes an RO project might pave the way for a measure that allows inferences to test-independent qualities. For example, even if a depression test is originally constructed and used to summarize test responses, further validation could lead one to identify a physical or psychological property (or a cluster of them) that brings about the test outcomes.
This vision of the development of scientific concepts (that is, the idea that scientific concepts may start off as denoting something easily observable, test-specific, or behavioral, and then develop to denote something not directly observable, test-independent, and causally efficacious) coheres with recent historico-philosophical work on scientific concept formation (e.g., Chang, 2004; Nersessian, 2008). Consider, for example, the cluster of diseases we now call diabetes mellitus. Ancient physicians used diabetes to denote conditions involving continuous urination and thirst; in the 19th century it captured conditions where glucose is excreted in the urine; and in the 20th century the concept divided to denote two different conditions involving different dysfunctions in the production and usage of insulin (Zajac et al., 2017). The improvement of tests (e.g., of glucose levels) went hand-in-hand with the narrowing of the concept from one that denotes a large number of superficially similar conditions to one that captures a smaller, more homogeneous group of conditions that share a common biological mechanism.
The development of a psychological measure may also start off as a project aiming to categorize different kinds of test outcomes and gradually develop into a project inferring test-independent qualities from those outcomes. Although RO projects can turn into realist projects in this way, the conceptual distinction between RO and realism reminds us to be clear about the exact nature of the inferences a psychological measure has been validated for. When reporting on the validation of a psychological measure, one should be careful to explicate whether the measure may only be used to make modest claims about test behavior or whether there is a reasonable degree of support for ambitious claims about causes and test-independent qualities. This is particularly important in the psychological sciences, where target concepts such as depression, happiness, and intelligence have rich meanings in ordinary language and therefore easily invite unwarranted inferences.
In emphasizing the need to flag the inferences that a measure has been validated for, the above-described outlook on validation resembles Michael T. Kane’s argument-based approach to validity. Kane’s (2016) approach states that there are, broadly speaking, two steps to measure validation: (a) specifying the intended use of the test scores and (b) evaluating the plausibility of the claims underlying that use in light of appropriate evidence. According to Kane (2006), the argument-based approach “allows for a variety of possible interpretations and uses for test scores, with the caveat that any proposed interpretation or use be justified by appropriate evidence. So, operationally defined variables are fine as long as we recognize them for what they are, and do not slide any theoretical claims in under the radar” (p. 443).
RO builds on an important premise of the argument-based approach, which is that the validity of a psychometric measure depends on its ability to meet the needs of the usage the measure is intended for. However, unlike the argument-based approach, RO is not a general theory of (psychometric) validity, but rather a new approach specifically intended to guide the formation of useful operational concepts so as to avoid the problems that have long plagued operationally inclined psychological research. In other words, although accepting the argument-based approach to validity is a minimal condition to accepting RO, there is much more at stake in my argument for RO, namely its ability to (partially) redeem operationalism. The fact that the argument-based approach accommodates operational definitions is no argument in defense of the usefulness and tenability of (a form of) operationalism.
RO also bears resemblance to generalizability theory, an integrated approach to psychometric reliability and validity that received a famous explication in 1963 in a paper by Lee Cronbach, Nageswari Rajaratnam, and Goldine Gleser. Roughly put, generalizability theory says that a measurement procedure is valid to the extent that it provides accurate estimates of the universe of observations that define the target concept (Cronbach et al., 1963; cf. Kane, 1982, p. 134). In this framework, target concepts are defined in terms of some universe of observations, such as observations of responses to anxiety test A in hospital H, under the supervision of interviewer I, and so on for other conditions of interest. In generalizability theory, validation involves justifying an inference from actual observations to a universe of potential observations. The operationalist flavor comes from the fact that the focus is on observations and potential observations rather than on an unobservable, causally efficacious entity or quality.
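To make the inference from actual observations to a universe of potential observations more concrete, here is a rough sketch of a one-facet generalizability study. It assumes a fully crossed persons-by-raters design with one score per cell and uses the standard ANOVA-based estimates of variance components. The source describes a richer universe (test, hospital, interviewer), so the single rater facet, the function names `g_study` and `g_coefficient`, and the scores are all simplifying assumptions made for illustration.

```python
def g_study(scores):
    """Estimate variance components for a crossed persons x raters design.

    scores[p][r] is the score person p received from rater r.
    """
    n_p, n_r = len(scores), len(scores[0])
    if n_p < 2 or n_r < 2 or any(len(row) != n_r for row in scores):
        raise ValueError("need a complete table with >= 2 persons and >= 2 raters")

    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    person_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(row[r] for row in scores) / n_p for r in range(n_r)]

    # Sums of squares for persons, raters, and the residual (interaction + error).
    ss_p = n_r * sum((m - grand) ** 2 for m in person_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_r

    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = max(ss_res / ((n_p - 1) * (n_r - 1)), 0.0)

    # Negative estimates are truncated at zero, a common convention.
    return {
        "person": max((ms_p - ms_res) / n_r, 0.0),
        "rater": max((ms_r - ms_res) / n_p, 0.0),
        "residual": ms_res,
    }


def g_coefficient(components, n_raters):
    """Relative generalizability coefficient for a mean over n_raters raters."""
    relative_error = components["residual"] / n_raters
    denominator = components["person"] + relative_error
    if denominator == 0:
        raise ValueError("coefficient is undefined when there is no score variance")
    return components["person"] / denominator


# Hypothetical scores: five persons (rows) each rated by three raters (columns).
scores = [
    [12, 14, 13],
    [20, 22, 21],
    [8, 9, 10],
    [16, 15, 17],
    [25, 27, 26],
]

components = g_study(scores)
g1 = g_coefficient(components, n_raters=1)
g3 = g_coefficient(components, n_raters=3)
print(components, g1, g3)
```

The coefficient estimates how well scores from a given number of raters represent scores averaged over the whole universe of admissible raters. It is a claim about observations and potential observations only, not about whatever attribute produces them.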
I believe that considerations over generalizability are a crucial part of the respectfully operationalist validation. A measure that is validated in the RO way is meant to warrant test-specific claims such as “Mei received a higher score on depression test D than Kai” and “Patients’ scores on anxiety test A are, on average, 50% lower after treatment with drug D.” In order to determine the exact meaning of such claims, the respectful operationalist must be able to assert the boundaries of the test, that is, to say what is and what is not a necessary component of the test. For example, is it part of a depression test that the interviewer is a clinician, or can it be anyone? Is it part of an anxiety test that the interviewer follows a specific interview guideline? The theory of generalizability can help respectful operationalists determine what testing conditions observed scores generalize to, which in turn helps them articulate more precisely the meaning of test-specific claims.
But there is more to RO than that. For one, there is the requirement to ensure that the operationally defined target concept resonates with broader connotations of the concept. RO also covers considerations of usability that go beyond generalizability, such as the length of the test, the wording of the items, and so on. More broadly, RO and generalizability theory serve different functions: the former is meant as a coherent, workable alternative to critiqued forms of operationalism, while the latter provides technical and interpretative tools for generalization.
In this section I have described, in broad brush strokes, how RO moves us forward in the current debate on operationalism and how it relates to developments in the larger measurement literature. I argued that RO does not fall prey to the severest problems with strong operationalism and that, unlike weak operationalism, RO guards against unwarranted slips from test-dependent claims to claims about test-independent qualities. I also put forward the case that RO is compatible with recent scholarship on the joint development of measures and concepts. Finally, I argued that although RO as a whole is a new framework for operational concept formation in psychology, it does share elements with Kane’s argument-based approach to validation and with generalizability theory.
Objections
This article has focused on the place of operationalism in contemporary psychology and its potential in future research. There are three reasons why I have not delved into long-standing foundational issues so far. First, a lot has already been written on the philosophical problems surrounding operationalism, while its potential in contemporary psychological research has received less attention (see the second section of this article). Second, as the large amount of ink spilled on operationalism testifies, these issues are too vast to be settled in one article. And third, I have addressed some of the long-standing philosophical objections elsewhere (Vessonen, 2019a). However, to assure the reader that RO does not inherit the issues that plague other forms of operationalism, I now outline brief responses to well-known objections.
The first objection states that operationalism is incompatible with measure validation, because operationalist measures are valid by stipulation (“intelligence is whatever IQ tests measure”). For someone who buys this view of operationalism, it might be incomprehensible that RO involves validation for extra-operational meanings (e.g., content and criterion validation as explained in the third section of this article). The objection can be outlined as an argument:
P1. Operationalism never involves validation.
P2. RO involves validation.
C. Therefore RO cannot be operationalism.
I believe that P1 is too strict a definition of operationalism (see also Chang, 2009). I dare suggest that in practice, nobody is that strict about operationalism. Otherwise we would see hardcore operationalists defining depression in terms of thermometers and temperature in terms of questionnaires. We do not see such claims thrown around in the literature, because almost everyone accepts that we should ensure that scientific concepts of depression and temperature bear some resemblance to the everyday (scientific) usage of those concepts (see Carnap, 1950b, Chapter 1). The first objection fails because P1 is false: operationalism does involve validation of extra-operational meanings.
The second, related worry is that no amount of empirical investigation convinces a true operationalist that their definition is wrong. In other words, if one defines intelligence solely in terms of an IQ test (Boring, 1923), even if that IQ test does not correlate with any other measures of success, this does not convince the operationalist to change the definition—after all, they are already measuring intelligence by definition. This move is unintuitive to realists: surely a definition of intelligence that has nothing to do with academic or other forms of success is incorrect and should be modified.
As ought to be clear from my description of RO, I think operationalists can and should change their concepts in light of empirical evidence. The difference between realists and operationalists is that while realists evaluate the correctness of concepts, operationalists evaluate their usefulness. If an IQ test fails to predict academic success, the respectful operationalist might revise the concept because they deem it useful to preserve the pretheoretic intuition that intelligence relates to academic achievement. However, there is nothing incorrect about a definition of intelligence that is not related to academic achievement—indeed, one might also argue that the demand for such a relation is elitist, outmoded, or otherwise lacking in value.
A third objection follows immediately: if there is no shared standard, such as truth, to arbitrate between definitions, what stops researchers from cherry-picking whatever definitions suit their purposes? Are we not back to drowning in arbitrary concepts? My response is two-fold. First of all, there are many (epistemic) values besides truth that (most) scientists buy into, such as consistency, simplicity, exactness, applicability, as well as predictive and explanatory capacity (on values in science see Douglas, 2015; on values in concept formation see Dutilh Novaes, 2020). Concepts that embody these shared values are likely to attract wider acceptance and outlive arbitrary concepts. Second, although the definition of a concept is not a matter of truth (see Carnap, 1950a, 1950b; I defend this view in Vessonen, 2019b), our capacity to define more useful concepts depends on uncovering truths about the denotation of that concept. For example, if we establish that a particular IQ test does not correlate with real-world problem-solving skills, we can use that fact to evaluate the usefulness of the concept the IQ test defines. In other words, usefulness is not independent of truths.
The final objection I have the space to consider concerns the curious relation between operationalism and accuracy. Donald Gillies (1972) argued that strict operationalists cannot make sense of claims such as “Type A thermometers measure temperature more accurately than Type B thermometers.” If every measure defines its own target concept, there is no way we can compare two measures on their ability to track common target values. The objection, then, is that an approach to measurement that cannot make sense of accuracy is simply not science (Michell, 1990).
My response to this objection is again two-fold. First, while I think it is true that operationalists cannot evaluate the accuracy of their own measures, it is not true that operationalists cannot make sense of accuracy as a standard of measurement in general. To explain this, I need to first clarify that, in my view, operationalists need not be antirealists (Vessonen, 2019b). In other words, operationalists do not have to deny (a) the possibility of epistemic access to test-independent entities or (b) the existence of such entities (see Lovett & Hood, 2011). Rather, operationalists refrain from claims about test-independent entities, because they do not currently have epistemic access to those entities. Hence, even though they cannot state how well (i.e., accurately) their own measure tracks a test-independent entity or quality, they need not deny that evaluating the accuracy of a measure is in principle possible.
Second, although operationalists cannot evaluate accuracy, they can evaluate other merits of their measures. For example, they can evaluate precision—that is, the consistency of results when numerous measurements are taken under similar conditions. This again highlights that operationalists do have standards for improving their measures, even though those standards are not directly related to test-independent qualities. Whether this kind of instrument development should be called “measurement” or simply “testing” is up for terminological debate.
Conclusion
I have argued that there are three main strands in the contemporary discussion on operationalism: (a) arguments against strong operationalism, (b) practices aligned with weak operationalism, and (c) criticisms of practitioners who slide toward stronger forms of operationalism than they realize. I then defined and defended RO, an approach that incorporates elements from already existing theory and practice to form a new, coherent, and philosophically defensible operationalism. I argued that RO realizes the benefits of operationalism but leaves room for other, more realist measurement practices within psychology.
Since this article is the first presentation of RO, a number of open questions remain up for debate. What are the best ways to incorporate RO into research practice? What is sufficient warrant for claims about test-independent qualities? How should we handle the plurality of related concepts that likely ensues when we allow respectfully operationalist and realist validation to co-exist? To answer these and other open questions, I believe we need a combination of literature on concept formation and measure validation (e.g., Alexandrova & Haybron, 2016; Angner, 2013; McClimans, 2013; Peterson, 2018), literature on psychometric validation (e.g., Borsboom, 2005; Cronbach & Meehl, 1955; Fiske, 1971; Loevinger, 1947), and studies of the history of operationalism (e.g., Chang, 2017b; Feest, 2005; Rogers, 1989).
Acknowledgements
The author would like to thank Anna Alexandrova, Denny Borsboom, Hasok Chang, M. Colombo, Juha Haaja, Tim Lewens, Andrew Maul, and the anonymous reviewers for their comments on drafts of this article.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This article grew out of the author’s doctoral research at University of Cambridge, funded by the UK Arts and Humanities Research Council; Newnham College; the British Society for the Philosophy of Science; and Cambridge Commonwealth, European and International Trust.
