Abstract
The term “depth cue” is fundamental to and widely used in vision science. However, despite the prevalence and importance of that concept, there is virtually no study on its theoretical foundations and coherence. This article aims at filling that gap by investigating both its historical development and its current use within the predominant computational approach to vision. Against the backdrop of Wittgenstein’s therapeutic approach to philosophy, it is shown that both traditional and current characterizations of depth cues suffer from a serious logical flaw known as “homunculus” or “mereological fallacy.” It is suggested that the problem of homuncular language impedes critical thinking and theorizing in vision science since it obscures the matters at issue by disguising explanatorily empty expressions as explanatory hypotheses. Furthermore, it is argued that homuncular language is not confined to the concept of depth cues but typical of current cognitive science in general since it is linked to its most fundamental assumption of the brain being an information processing system. In conclusion, resulting implications for cognitive science and cognitive scientists are considered.
Sooner or later, every vision scientist will come across the term “cue” in his or her studies—virtually any standard textbook on visual perception contains this term and it is prevalent in research articles as well. For example, a short overview reveals that it is used in more than half of the approximately 350 articles published in 2014 in the Journal of Vision and in roughly half of all the articles’ abstracts in Vision Research and Perception. One may therefore presume that “cue” is a widely established and thus well-explicated concept in vision science. Surprisingly, the fact is that there are remarkably few to no publications explicitly addressing that concept and its theoretical foundations. 1
Although used throughout the broad field of vision science, the cue concept plays a critical role in theoretical approaches to depth perception. Additionally, looking at the historical development of theories of vision, it becomes apparent that its origin lies in that province. While theories of depth perception can be traced back (at least) to the Ionian period of ancient Greece (Howard, 2012), the theoretical interest in depth perception intensified noticeably in the wake of Kepler’s theory that light is projected onto the retina as an inverted, reversed, and distorted “picture” (pictura) of the outside world’s objects. 2 Kepler’s proposition seems to point towards a problem that is still widely considered to be the fundamental problem of depth perception—namely, how perceiving depth and distance is possible at all whilst everything that is at our visual system’s disposal is a “two-dimensional image” of the outside world. According to the cue-approach, it is because of the retinae’s planarity that depth and distance are not “immediately” perceived but only by means of so-called depth cues. Frequently, pictures like Figure 1 are used for further illustration.

Figure 1. Exemplary visual illustration of the concept of depth cues. The top figure (a) does not “contain” depth cues; the bottom figure (b) “contains” several depth cues. See text for further details.
It is immediately seen that Figure 1(b) is associated with a much more compelling impression of depth than Figure 1(a), which looks rather like two-dimensional geometric forms were attached to a plane. Moreover, judgments regarding the spatial relations between the depicted objects are far easier considering Figure 1(b) than Figure 1(a). One might judge the three objects depicted in Figure 1(a) as either being equal in size but of varying distance or as being equally far away but of different size, whereas in Figure 1(b) it is most likely that they will be judged as being equal in size but differing in distance. In terms of the cue-approach to depth perception, these striking differences are explained by the presence of various depth cues in Figure 1(b) and the absence of such depth cues in Figure 1(a). For example, the shadow falling from the object in the middle of Figure 1(b) on the object right beside it is considered to be such a depth cue the visual system can “interpret” as a hint, clue, or sign of the fact that the former is located closer to the viewer than the latter. Usually, the exposition of the cue-approach to depth perception is confined to an ordered listing of all the well-known depth cues like accommodation, occlusion, linear perspective, motion parallax, binocular disparity, relative size, etc. (see Howard, 2012, for a detailed example of such a listing). Note that these items seem to derive from quite different descriptive levels: some seem to refer to mere physiological processes (e.g., the changing of lens curvature in the case of accommodation), others to certain aspects of our perceptions (e.g., occlusion or shadows), and regarding still others (e.g., linear perspective or “known size”), it is difficult to say which descriptive level they are supposed to correspond to. 
The rationale for subsuming these items under the concept of depth cues remains unclear except for the general remark that they are all “information about depth” the visual system can use to “infer,” “estimate,” or “calculate” depth. Yet, without further clarification, these characterizations are not intelligible. Taken literally, these expressions seem odd at first glance since we usually do not ascribe the ability to make deductions or perform arithmetic operations to parts of our body. Hence, the question arises whether vision scientists use these expressions in a different but meaningful way and whether these theoretical claims help to facilitate our understanding of visual perception.
The concept of depth cues is investigated in this article by first briefly summarizing its historical development by means of influential theoretical approaches to (depth) perception (for a more detailed account, see Pagel, 2016). Second, following the general thrust of Wittgenstein’s linguistic-therapeutic approach to philosophy, and drawing both on Hacker’s conceptual analyses of psychological concepts (e.g., Hacker, 2013) and Searle’s remarks on artificial intelligence (e.g., Searle, 1982), current characterizations of depth cues within the widely accepted information processing framework are analyzed with the aim of probing their coherence. Finally, resultant implications for vision science are discussed.
A (very brief) history of the concept of depth cues
The concept of depth cues has a long history. The idea that neither depth nor distance are “immediately given” by vision (or cannot be perceived “directly”) but have to be “inferred” or “estimated” by other means can be traced back to the sophisticated theoretical approach to vision by Ibn al-Haytham (Alhacen, c.965–c.1040) in his Book of Optics. 3 Foreshadowing surprisingly numerous aspects of Helmholtz’s highly influential theory of perceptions as “unconscious inferences,” Alhacen believed that perceiving certain visible characteristics, including the distance of an object, involves inferential processes which, due to frequent reiteration, are happening in an imperceptible amount of time and are thus unbeknown to the perceiving subject (Smith, 2001). According to Alhacen, a (“correct”) perception of an object’s distance is only possible if “its distance is spanned by a range of continuous, ordered bodies, and sight perceives those bodies as well as their measures” (Smith, 2001, p. 454)—a remark that can be related to (at least) two depth cues known to current vision science, namely “familiar size” and “relative size.”
Roughly half a millennium later, René Descartes (1596–1650) revitalized the discussion on depth and distance perception by combining geometrical considerations with physiological knowledge emerging from systematic anatomical studies of the eye. Leaving aside a detailed analysis of his often heterogeneous views, Descartes generally considered vision as a multi-stage process (see Atherton, 2002), whereby the last stage consists of interpretations or inferences of the “raw material of sense” by the mind, resulting in the perception of an object as part of the exterior world with a specific position in space, distance, form, and size (see Descartes, trans. 2001). Concerning depth perception, Descartes holds that the light patterns projected onto the retinae are, in essence, two-dimensional and that the thus mechanically induced “movements of nerves” are therefore not sufficient for perceiving depth and distance. Only by means of certain additional “mechanisms,” which form a “rather heterogeneous and ill-assorted group” (Wolf-Devine, 2000, p. 511) and are all known as depth cues in current vision science, are depth and distance perceived. Besides what is nowadays known as accommodation and image blur as cues to depth, Descartes emphasizes the role of vergence movements in depth perception. Drawing on the suggestive metaphor of a blind man using two sticks in his hands in order to “calculate as if by natural geometry” the distance of objects around him by applying trigonometry, Descartes claims that “similarly, when our two eyes A and B are turned towards point X, the length of the line AB and the size of the two angles XAB and XBA enable us to know where the point X is” (Descartes, trans. 2001, p. 106). Furthermore, drawing on observations carried out on the removed eye of an ox, Descartes claims that the distance of objects is perceived by “comparing the size of the images that they imprint on the back of the eye” (Descartes, trans. 2001, p. 107) with our knowledge of the objects’ “real sizes,” the so-called cue of apparent size. In the course of the next three centuries, the number of depth cues proposed grew steadily. For example, in the parts of his Inquiry and his Essays dedicated to vision, Thomas Reid (1710–1796), who introduced the sensation/perception distinction into vision science, lists five “signs of distance,” all considered depth cues by current vision science (e.g., Reid, 1764/1997).
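Descartes’ appeal to the line AB and the angles XAB and XBA corresponds to ordinary triangulation. The following minimal sketch spells out that calculation as a person with pencil and paper (or a programmer) would carry it out. The function name `locate_point`, the coordinate frame with A at the origin and B on the x-axis, and the example values are assumptions introduced for illustration only; none of them stem from Descartes’ text.

```python
import math

def locate_point(baseline, angle_a, angle_b):
    """Locate X from the length of AB and the angles XAB and XBA (radians).

    Assumed frame: A at (0, 0), B at (baseline, 0), X above the line AB.
    """
    apex = math.pi - angle_a - angle_b  # angle AXB, from the angle sum of a triangle
    if baseline <= 0 or angle_a <= 0 or angle_b <= 0 or apex <= 0:
        raise ValueError("the two lines of sight must converge in front of AB")
    # Law of sines: AX / sin(XBA) = AB / sin(AXB)
    dist_ax = baseline * math.sin(angle_b) / math.sin(apex)
    return dist_ax * math.cos(angle_a), dist_ax * math.sin(angle_a)

# Hypothetical values: A and B 6.5 cm apart, both lines of sight turned inward by 2 degrees
x, y = locate_point(0.065, math.radians(88), math.radians(88))
print(f"X lies {y:.2f} m in front of AB, {x:.4f} m along AB from A")
```

Note that the calculation requires the baseline and both angles as explicitly given numbers, which is precisely what the geometer drawing the diagram has at hand.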
Until its replacement (or supplementation) by the computational approach in the last quarter of the 20th century, one of the most dominant and influential theories of vision emerging from the growing physiological orientation of psychology in the late 19th century was put forth by Hermann von Helmholtz (1821–1894). Although Helmholtz’s thought-provoking, Kant-oriented metaphysical stance of sensations as “mere signs” or symbols for the affairs of an ultimately, and, in principle, unknowable world (see Helmholtz, 1878/1977) is largely neglected in current vision science, his ubiquitously cited doctrine of perceptions as “unconscious inferences” still seems to lie more or less at the heart of the currently prevailing theoretical framework. Strongly influenced by Lotze’s (1886) theory of “local signs” (see also Koenderink, 1984), Müller’s (1843) “law of specific nerve energies,” and Mill’s (1882) conception of inductive reasoning, Helmholtz’s theoretical views on vision underwent various significant changes and are, taken as a whole, nothing short of complex and (unfortunately, all too frequently) ambiguous; for example, his distinction between sensation, perception, “idea,” “apperception,” and “immediate perception,” or his use of Kant’s law of causality (for a further discussion, see Hatfield, 1990b; Pagel, 2016).
In spite of running the risk of oversimplifying the matter in hand, one could roughly sketch Helmholtz’s general view on vision as follows (see Helmholtz, 1925, §26): the stimulation of nerve fibers in the eye results in “visual sensations,” a concept that remains hazy and ambiguous throughout all of Helmholtz’s writings. Since these visual sensations are effects worked on the nervous system by external causes (usually light), the respective effects depend on the nature of both the thing causing the effect and the thing on which the effect was exerted, sometimes referred to by Helmholtz as the nervous system, other times as “consciousness” (Bewusstsein). Consequently, a visual sensation is not an image or copy of the particular thing that caused it, it rather acts as a symbol or sign that “need not have any similarity at all with what it is the sign of” (Helmholtz, 1878/1977, p. 122). Yet, visual sensations are not useless either, epistemologically speaking, since they “give us a report of what is peculiar to the external influence.” They are “signs of something, be it something existing or happening, and … they can form for us an image of the law of this thing happening” (Helmholtz, 1878/1977, p. 122). Visual perception in the usual sense, i.e., perceiving spatially extended objects as belonging to an exterior world, requires the interpretation of these signs via what Helmholtz calls “unconscious inferences” or “inductive conclusions.” According to Helmholtz, every visual sensation can be conceived of as or is at least analogous to the minor premise of a syllogism, whereas perceptions are analogous to the respective conclusions (see Hacker, 1995, for a critical discussion of the general notion of “perceptions as conclusions”). 
Stressing the importance of intentional experimentation on the part of the perceiving subject, Helmholtz claims that perceptions, i.e., conclusions “reached by analogy without reflection” (Helmholtz, 1925, p. 26), are formed on the basis of constant experience and learning, whereby the subject learns what specific sensations signify.
Regarding depth perception, Helmholtz argues as well that depth and distance cannot be perceived “directly” because “the eye gives only perspective surface-images” (1925, p. 23) or “flat perspective images upon the retina … representing only two dimensions” (Helmholtz, 1885, p. 287). Depth and distance perception are deemed possible only by means of specific “sources of information” (Hilfsmittel), again all considered as depth cues by current vision science. Helmholtz distinguishes between two such sources: one depending heavily on experience and enabling us to merely form “an idea of the distance,” while regarding the other, “sensation is involved, and we have an actual perception of the distance” (Helmholtz, 1925, p. 282), although this distinction remains obscure in light of Helmholtz’s own conceptual framework and the logic of his sign theory. The former Hilfsmittel include what are sometimes called “pictorial cues” (e.g., Goldstein, 1999) like occlusion, relative size, apparent size, familiar size, and atmospheric perspective, whereas the latter include accommodation, convergence, motion parallax, and binocular disparity. With regard to the purpose of this article, it is worthwhile to quote some of Helmholtz’s descriptions at full length. For example, concerning apparent size as a cue to depth, Helmholtz writes, thus echoing Descartes’ characterization: The same object seen at different distances will be depicted on the retina by images of different sizes and will subtend different visual angles. The farther it is away, the less its apparent size will be. Thus, just as astronomers can compute the variations of the distances of the sun and moon from the changes in the apparent sizes of these bodies, so, knowing the size of an object … we can estimate the distance from us by means of the visual angle subtended or, what amounts to the same thing, by means of the size of the image on the retina. (1925, p. 236)
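The astronomer’s computation Helmholtz alludes to can be made explicit. For an object of size s viewed frontally at distance d, the subtended visual angle is θ = 2·arctan(s / 2d), so that, with s known, d = s / (2·tan(θ/2)). The sketch below is a minimal illustration under that standard geometric assumption; the function names and the example values are hypothetical and not taken from Helmholtz.

```python
import math

def visual_angle(size, distance):
    """Angle (radians) subtended by an object of the given size, viewed frontally."""
    return 2.0 * math.atan(size / (2.0 * distance))

def distance_from_angle(known_size, angle):
    """Invert visual_angle: distance of an object whose size is known."""
    return known_size / (2.0 * math.tan(angle / 2.0))

# Hypothetical example: a 1.8 m tall object seen at 5 m and at 20 m
for d in (5.0, 20.0):
    theta = visual_angle(1.8, d)
    print(f"{d:5.1f} m -> {math.degrees(theta):5.2f} deg -> "
          f"{distance_from_angle(1.8, theta):5.1f} m recovered")
```

The farther object subtends the smaller angle, and the distance can be recovered only because the object’s size is supplied as a separate, known quantity.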
Regarding binocular disparity, considered as essential for depth and distance perception by Helmholtz, he writes: “A direct image of a portion of space of three dimensions is not afforded either by the eye or by the hand. It is only by comparing the images in the two eyes … that the idea of solid bodies is obtained” (1925, p. 23). What is striking about these passages is Helmholtz’s use of vocabulary loaded with cognitive terms like “estimating,” “knowing,” “computing,” “comparing,” etc. Exceptionally conspicuous in this regard is a passage from Helmholtz’s talk Ueber das Sehen des Menschen, again describing binocular disparity as an essential cue to depth: We have two eyes which, as often as they perceive a spatially extended body, continually view the world from two different viewpoints, thereby continually presenting two different perspective views to our consciousness for examination. … In this way, we continually construct the spatial relations of surrounding objects from two different perspective views our eyes provide to us. (1896/2002, p. 104, author’s translation)
As we shall see in the following section, the remarkable tendency to describe the depth cues’ functioning in cognitive vocabulary is not only detectable in the works of precursors of the modern cue-approach to depth perception like Alhacen, Descartes, and Helmholtz, but, with a new livery, still persistent in current vision science.
The concept of depth cues in current vision science
The general framework of current vision science
The general theoretical framework of current (mainstream) vision science, although rarely explicated in full detail, might be best described as a mixture of Helmholtz’s inferential paradigm, the general impetus of the computational approach to visual perception by David Marr (e.g., 1982), and a strong neurobiological focus. Embedded within realism (mostly naïve realism, as a matter of fact), this framework assumes that light reflected from independently existing objects of the physical world is projected onto the retinae and acts as the “visual input” of information processing procedures carried out by the visual system, a term routinely used to designate certain areas of the brain. This “input” is considered to be underdetermined in principle relative to the “output” of the process, i.e., our perceptions of spatially extended objects of specific form, color, and position in space. 4 Metaphorically speaking, it is assumed that there exists a wide “informational gap” between the projected light pattern or “visual input” and our respective perceptions. The bridging of that gap is considered an abstract problem-solving task the brain has to “manage,” which is referred to as “information processing,” “interpretation,” or “inference”; quite often, these descriptions are used synonymously.
As already mentioned, the two-dimensionality of the light pattern projected onto the retina, predominantly referred to as the “retinal image,” is still considered to be the principal problem of depth perception: The fundamental problem in depth perception is due to the geometry of perspective projection, which reduces the three-dimensional coordinates of the visual scene to the 2D coordinates of the retinal images. The third dimension of space has to be inferred from the 2D images. The visual system uses several sources of information—depth cues such as disparity, perspective, and motion parallax—to estimate the layout of the 3D scene. (Hillis, Watt, Landy, & Banks, 2004, p. 967)
It is interesting to note the different types of descriptions of what is accomplished by the use of depth cues. They include the notion of the brain using depth cues to “estimate” depth (e.g., MacKenzie, Murray, & Wilcox, 2008; Muller, Brenner, & Smeets, 2009), “infer” or “derive” depth (e.g., Knill, 2007a, 2007b) or “translate retinal stimulation into perceptions of depth” (Levine, 2000). Equally noteworthy are the usually short descriptions of what depth cues in general are. They are characterized as “information about the third dimension of visual space” (Wolfe, Kluender, & Levi, 2009, p. 136), “depth information” (Palmer, 1999, p. 202), “a source of information regarding depth” (Levine, 2000, p. 297), or as something “which provides information about depth” (Yantis, 2014, p. 190).
Moreover, it is assumed that multiple depth cues are “available” to the brain in normal viewing situations, that these cues can “interact” in the course of information processing, and that they are finally “combined” by the brain according to specific rules. Typically, this process of so-called cue combination or cue integration is described within a mathematical model and overall, the respective descriptions suggest that the resulting “depth estimates” or “depth values” are quantitative parameters (i.e., numbers), constituting what some authors (e.g., Howard, 2012; Landy, Maloney, Johnston, & Young, 1995) call a “depth map,” which is characterized as an “internal representation” of the spatial layout of the viewed scene. Again, the brain is claimed to be the agent of these activities and a vocabulary highly loaded with intentional and cognitive terms is used, including, for example, the notion that the brain “makes assumptions” (Knill, 2007b), “resolves ambiguity” and “interprets” (Seydell, Knill, & Trommershäuser, 2010), “uses different strategies” (Wismeijer, Erkelens, van Ee, & Wexler, 2010), or is “accepting or rejecting evidence” (Gregory, 1998). Some authors even speak of the brain as “understanding the structure in depth of a complex natural scene” (Cutting & Vishton, 1995), as having “a large expanse of visual experience” (Pylyshyn, 2003), or as something that actually “sees” and is “building a model of the world” (Churchland & Sejnowski, 1992).
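The text above does not specify which combination rule is meant. One form that is widely used in the cue-combination literature is a weighted linear average in which each cue’s weight is proportional to its reliability, i.e., the inverse of its variance. The following sketch illustrates that rule only; the function name `combine_cues` and all numerical values are assumptions made for the sake of the example and are not taken from any of the cited studies.

```python
def combine_cues(estimates, variances):
    """Reliability-weighted average of single-cue depth estimates.

    Each weight is proportional to the inverse variance of its cue,
    and the weights sum to one. Returns (combined, combined_variance, weights).
    """
    if len(estimates) != len(variances) or not estimates:
        raise ValueError("need exactly one variance per estimate")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    reliabilities = [1.0 / v for v in variances]
    total = sum(reliabilities)
    weights = [r / total for r in reliabilities]
    combined = sum(w * e for w, e in zip(weights, estimates))
    return combined, 1.0 / total, weights

# Hypothetical single-cue "depth values" in cm, with hypothetical variances
combined, variance, weights = combine_cues([100.0, 120.0], [4.0, 12.0])
print(combined, variance, weights)
```

In such a model, the “depth estimates” are plain numbers and the “combination” is an arithmetic operation on them, which is why the resulting “depth map” is described as a set of quantitative parameters.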
As mentioned previously, these characterizations loaded with intentional and cognitive vocabulary prompt a lot of questions. Do vision scientists use them literally or as some sort of façon de parler or analogy? And if there is a way to make sense of these expressions, is such a manner of speaking useful for fostering our thinking about visual perception? I will try to shed a little more light upon these matters in the following sections.
The characterization of depth cues as “information about depth”
The almost solely used characterization of depth cues as “information about depth” seems to be a clear-cut concession to the aforementioned predominant theoretical framework of cognitive psychology, i.e., the idea of the brain being an information processing system, in essence comparable to a digital computer (e.g., Palmer, 1999). However, the term “information” and especially its use in the sciences is polymorphically tricky and nothing short of a conceptual maze since there is a vast number of heterogeneous characterizations, depending both on the conceptual framework of the respective discipline and the phenomena studied. 5 Hence, a variety of authors (e.g., Bogdan, 1994; Dennett, 1989) are skeptical whether there is even a rudimentary, mutually recognized understanding of the concept of information underlying these heterogeneous characterizations.
Probably the most controversial issue pertains to the question of whether the concept of information refers to a phenomenon presupposing certain cognitive faculties like intentionality and understanding or rather to something that is “in the world” and independent of cognitive agents, i.e., a naturalistic and non-intentional notion of information. The first position, occasionally labeled “information as semantic content,” bears strong resemblance to our ordinary use of that term, for example, when we speak of “getting information by reading a newspaper” or “being in need of more information in order to find one’s way in unfamiliar territory.” Information in the sense of semantic content is necessarily conditional on the usage of some sort of symbolic system and a mutually shared knowledge of or familiarity with both the (conventional) rules regarding its usage (syntax) and the meaning of used symbols or their respective combinations (semantics). 6 For example, having neither knowledge of syntax nor semantics of the Russian language, we would hardly speak of getting information about anything by looking at a Russian newspaper. Information in this sense is both addressed to and received by subjects alone and cannot be reduced to physical states of affairs. An analysis of the physical properties of the Russian newspaper (the type of paper and ink used, the dimensions of the ink blots, etc.), however detailed, will not lead to an understanding of what it is about—a subject “interpreting” relevant physical properties as information about something (i.e., someone able to read and understand Russian) is mandatory.
In the mathematical theory of communication (MTC), a branch of probability theory sometimes called “information theory,” the term “information” is used in a very different, technical, and well-defined sense. Very broadly speaking, information in MTC is a quantity, dependent on the statistical rarity or probability of occurrence of symbols (or strings of symbols) an abstract device can produce. 7 Without going into technical details, the crucial point for the purpose of this article is that in MTC the meaning of symbols or strings of symbols produced, i.e., what the message is about, is completely irrelevant in regard to its amount of information. In their seminal work, Shannon and Weaver (1962) explicitly point out that “information [sensu MTC] must not be confused with meaning” (p. 8). Contrary to information as semantic content, no subject interpreting certain physical occurrences as meaningful symbols, as being about something, is required—in MTC, information is an “objective commodity” (Dretske, 1982).
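The independence of the MTC quantity from meaning can be shown with a small calculation. In MTC, an outcome of probability p carries log2(1/p) bits, and the average over all symbols is the entropy of the source. The sketch below estimates the symbol probabilities from the relative frequencies within a single message, which is a simplifying assumption made here for illustration; the function names and the example strings are likewise hypothetical.

```python
import math
from collections import Counter

def self_information(p):
    """Information, in bits, of an outcome with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return math.log2(1.0 / p)

def entropy(message):
    """Average information per symbol, using the symbol frequencies of the message."""
    counts = Counter(message)
    n = len(message)
    return sum((c / n) * self_information(c / n) for c in counts.values())

# Two strings with the same statistics but different symbols yield the same value
print(entropy("abab"), entropy("xyxy"))
# A rarer outcome carries more bits than a frequent one
print(self_information(0.25), self_information(0.5))
```

Relabeling every symbol leaves the result unchanged, since only the frequencies enter the calculation and what the symbols stand for plays no role.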
Are there any plain indications in what sense proponents of the cue-approach want to use the term “information”? Unfortunately not; it seems to be used in a more or less vague way and without further explanation. Nevertheless, the manner in which vision scientists employ the term seems to point towards the idea of information as semantic content rather than the MTC conception of information. Since depth cues are characterized as “information about depth,” the concept seems to imply a semantic content. 8 Although usually conceptualized as quantities or “depth values,” the information provided by depth cues is supposed to have a certain meaning for the brain; it supposedly specifies, for example, the distance of objects. Consequently, this conception implies that the brain is able to “use” a symbolic system with syntax and semantics in order to “interpret” the neuronal activity triggered by the light projected onto the retina as symbols specifying (amongst other things) the spatial layout of the environment.
Another indication of this fundamental assumption underlying the general theoretical framework of the current cue-approach (and closely connected to the concept of information as well), is the notion of “internal representations” of depth. The outcome of the computational process attributed to the brain is characterized as an “internal representation” of the information about the spatial layout provided by depth cues. Therefore, a discussion of this notion might be helpful to get a clearer view of the concept of depth cues.
The notion of “internal representations of depth”
Unfortunately, similar problems are encountered here. Just like the term “information,” the notion of “internal representations” is omnipresent in current vision science, but rarely discussed. Customarily, and advocates of the information processing paradigm concur (e.g., Palmer, 1978), we speak of a representation when something is meant “to stand for something else,” which is, of course, no less in need of explanation than the term “representation” itself. We do not speak of something representing itself; thus, a representation X is always representing something differing from X, what one might call the content or meaning of X (Haugeland, 1991), or in short, Y. For example, on a map, circles may represent cities and the color of these circles the respective population figure. Note that X’s property of being a representation of Y is not determined by the intrinsic characteristics of X. In principle, anything may represent anything else. On a map, cities might be represented equally well by squares and the respective population figure by the size of these squares. The crucial point is that X is used as a representation for Y within the context of an implicit or explicit social act in which the rules of using X as a representation for Y are either defined by convention or simply given by the usage of X as a representation for Y. In the case of maps, these rules are explicated by the respective key. Consequently, X is only a representation of Y for someone knowing that X is used as a representation of Y, i.e., for someone knowing the rules of X’s usage as a representation of Y. 9 Colored circles on a piece of paper as such do not represent anything at all for me if I am unaware that these circles are used as a representation of, for example, cities and their respective population figures.
Do vision scientists use the term “representation” in that sense? Explicit remarks by David Marr, innovator of the “computational approach to vision,” suggest so. Marr (1982) claims that the brain is “having access to systems of internal representations” (p. 6) and that vision “is a process that produces from images of the external world a description that is useful to the viewer” (p. 31). The terms “representation” and “description” are characterized as follows: “a representation is a formal system for making explicit certain entities or types of information … I shall call the result of using a representation to describe a given entity a description of the entity in that representation” (p. 20). According to Marr, both the binary and decimal systems are examples of formal systems representing numbers. For example, the symbol “37” is a description of the number thirty-seven in the decimal system and the symbol “100101,” a description of the same number in the binary system. Both systems are governed by certain rules, specifying the relations between representation and represented, i.e., in what precise way numbers are represented or described by symbols. These rules, as Marr’s own example clearly shows, are purely conventional, they are made by someone intending to use something as a representation. More importantly, these representations are supposed to explicate or describe something, i.e., they have semantic content. But they only describe or are about something for someone with knowledge of the respective rules of usage. For most people, the symbol “100101” does not represent the number thirty-seven but the number one hundred thousand one hundred and one, since the majority has no working knowledge of the binary system. For someone with absolutely no knowledge of the Hindu-Arabic numeral system, the symbol “100101” does not represent anything at all.
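Marr’s example can be reproduced in a few lines. The sketch below treats a numeral system as a rule for producing and reading “descriptions” of numbers; the function names `describe` and `read` are hypothetical and chosen only to mirror Marr’s vocabulary.

```python
def describe(number, base):
    """Return the numeral that describes a non-negative integer in a base from 2 to 10."""
    if number < 0 or not 2 <= base <= 10:
        raise ValueError("expects a non-negative integer and a base between 2 and 10")
    digits = ""
    while True:
        number, remainder = divmod(number, base)
        digits = str(remainder) + digits
        if number == 0:
            return digits

def read(numeral, base):
    """Interpret a numeral according to the rules of the given base."""
    return int(numeral, base)

print(describe(37, 10), describe(37, 2))      # two descriptions of the same number
print(read("100101", 2), read("100101", 10))  # one symbol string read under two rules
```

The same string of marks yields thirty-seven under one rule and one hundred thousand one hundred and one under the other; which number it describes depends on the rule that is applied to it, not on the marks themselves.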
Yet, vision scientists hold that the “internal representations” the brain is supposed to “operate” on are physically realized or embodied in the brain as patterns of neuronal activity triggered by light projected onto the retina. These “representations” supposedly have a semantic content, resulting in an explicit symbolic description of the surrounding world’s objects and their spatial layout. In summary, the current cue-approach to depth perception seems to imply that the brain “interprets” its own physical properties as symbols with semantic content, as information about something, just as a subject with knowledge of Morse code interprets a message in Morse as being about something. The pressing question is whether or not that is, taken literally, a coherent conception. I believe it is not.
Homuncular language in vision science
Different general theoretical embeddings of the cue-approach to depth perception notwithstanding, the descriptions of how depth cues function, from Alhacen to the current account, are linked to a central conceptual problem, namely, a form of what is known as the homunculus fallacy, or, as Bennett and Hacker (2003) call it, the mereological fallacy. The term “homunculus fallacy” was first introduced by Kenny (1981) to characterize the “reckless application of human-being predicates to insufficiently human-like objects … since its most naïve form is tantamount to the postulation of a little man within a man to explain experience and behavior” (p. 155). Since it is not a formal fallacy in terms of propositional logic, I will use the broader term “homuncular language” for the remainder of this article. Surely, no proponent of the cue-approach would explicitly postulate a “little human within a human,” since this would be both an absurd notion and the start of an explanatory regress: the homunculus’ cognitive abilities, which are supposed to explain depth perception, would themselves need to be explained. Rather, the problem is that, taken literally, the characterizations of depth cues implicitly presuppose such a homunculus to be meaningful in the first place.
Homuncular language in classical theories of vision
In classical accounts of the cue-approach, the problem of homuncular language is often easy to spot. For example, when Helmholtz (1925) claims that we estimate the distance of objects “by means of the size of the image on the retina” (p. 236), he is using homuncular language since there are no images of objects on the retina but only a distribution of light. If we, as subjects equipped with the ability to see, were to look at our retinae, we would see images of objects. But we do not and cannot look at our own retinae and consequently have no access to such “image sizes.” Therefore, speaking of “using the retinal image’s size” is only meaningful if one assumes a homunculus, equipped with the same perceptual powers that are in need of explanation, looking at the pattern of light projected onto the retina and thus seeing it as an image. 10 Although more subtle, the same holds for Helmholtz’s assertion pertaining to binocular disparity as a depth cue, viz. that “two different perspective views” of a scene are presented “to our consciousness for examination” (Helmholtz, 1896/2002, p. 104, author’s translation). Following the same line of argument, there are no two different “perspective views” of a scene projected onto the retinae and consciousness neither does nor can see anything (we do) and therefore does not examine such “perspective views.” Again, that phrase is only meaningful if one assumes a homunculus looking at the different light patterns on the retinae, thereby having two perspective views of a scene which she could examine.
Likewise, Descartes is using homuncular language by claiming that the distance of an object is perceived by “comparing” its size in the “retinal image” with its real size or “calculated” by the mind “using” trigonometry. The mind can neither calculate nor use trigonometric functions and it certainly has no access to lengths of certain lines or sizes of angles. We, not the mind, can draw a geometric abstraction of a specific viewing situation, measure the length of lines and the sizes of angles, and then use trigonometry to calculate the distance between points, but the mind does not and cannot do such things—and neither does it refrain from doing them.
Since these classical theories are at least 120 years old, could it be that such loose language had to be used because, compared with today, little if anything was known about the brain’s anatomy and its complex functioning? If that were the case, one would expect a considerable reduction of homuncular language in current vision science due to the immense developments of neurobiological methods and knowledge.
Homuncular language in current vision science
However, as the numerous quotations above strongly suggest, the problem of homuncular language is still prevalent in the current cue-approach to depth perception, although it is harder to identify since it is embedded in a vocabulary borrowed from our everyday language concerning the functioning of digital computers. It is held that the brain is, in essence, a (biological) computer, and just as digital computers are occasionally described as “calculating,” “estimating,” “interpreting,” “using representations,” “processing information,” “making assumptions” or “decisions,” vision scientists, in order to explain perception, accordingly attribute these activities to the brain. But these expressions are highly misleading and I believe that there is vast theoretical danger in such talk.
As the previous considerations indicate and Searle (1980, 1982, 1990) has convincingly shown, a computer taken by itself does not do any of these things and neither does it refrain from doing so. We use computers to calculate, estimate, and process information (which is only information about something to us, not the computer), for instance by writing software whose input and output represent something for us. The diverse physical states of a computer and their transitions as such do not represent anything; they are not about anything except for the programmer or user using them as representations of or information about something. This is the core of Searle’s (1980, 1982) “Chinese Room Argument”: syntax is not sufficient for semantics. Furthermore, following Searle (1990), even the description of a physical system as a computer, i.e., a “rule-following” symbol-manipulating system, is not determined by its intrinsic physical features. Nothing is intrinsically a “syntactic machine.” In principle, any object with sufficiently varying states may be given a syntactical description. The characterization of a physical system as symbol-manipulating is relative to someone outside the system, an observer interpreting the respective physical events and their transitions as syntactic. In short, “syntax and symbols are not defined in terms of physics” (Searle, 1990, p. 35); both are observer-relative characterizations. Consequently, describing a computer or the brain as “calculating,” “processing information,” or as “symbol-manipulating” is only meaningful in reference to a subject interpreting their respective physical features as such. 11 But, as Searle (1990) points out emphatically, concerning the matter in hand there is a crucial difference between a digital computer and our brains. Regarding the former, it is we, as programmers or simple users, who interpret the variety of different physical events semantically and syntactically and use them as such.
Speaking of a computer as “calculating” or “using information” is misleading but, in a sense, harmless because it can be given a meaningful sense by adding that it is an abridged way of saying that we use computers to calculate or process information. However, the situation is fundamentally different in the case of our brains. We cannot “use” our brains in the same way we use a computer. But if a subject interpreting the physical states of a system as syntactic and semantic is a sine qua non for meaningfully describing that system as “information-processing,” then the problem of homuncular language is inevitable, due to current vision science’s most fundamental assumption.
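Searle’s point that the physical states of a computer represent something only relative to an interpreter can be illustrated with a small sketch. The four byte values and the function name `interpretations` are hypothetical and chosen purely for illustration; the sketch merely shows that one and the same stored pattern is read as an integer, a floating-point number, or a piece of text depending on the convention the user decides to apply.

```python
import struct


def interpretations(raw: bytes) -> dict:
    """Read the same four bytes under three different conventions.

    Nothing in the bytes themselves selects one reading over another;
    the choice of convention is made by whoever calls this function.
    """
    if len(raw) != 4:
        raise ValueError("this illustration expects exactly four bytes")
    return {
        # Convention 1: a big-endian unsigned 32-bit integer.
        "integer": struct.unpack(">I", raw)[0],
        # Convention 2: a big-endian IEEE 754 single-precision float.
        "float": struct.unpack(">f", raw)[0],
        # Convention 3: four ASCII characters.
        "text": raw.decode("ascii"),
    }


# Hypothetical example pattern (an assumption, not taken from the article).
pattern = bytes([0x42, 0x48, 0x49, 0x21])
readings = interpretations(pattern)
```

For this pattern, the integer reading is 1112033569, the text reading is "BHI!", and the floating-point reading is roughly 50.07. Which of these the pattern “is” depends entirely on the convention the user brings to it.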
In his diverse remarks on the philosophy of psychology, Wittgenstein makes a similar point, albeit resulting from a much more complex and radical argumentation. Following the extensive work of Hacker and colleagues on that matter (see, e.g., Baker & Hacker, 1982; Bennett & Hacker, 2001, 2003; Hacker, 1990; Smit & Hacker, 2014), one can say that the assertions put forth by current proponents of the cue-approach are (from a Wittgensteinian standpoint) neither right nor wrong; they simply have no meaning, since criteria for their use neither exist nor have been laid down by the vision scientists using them. “Calculating,” “interpreting,” “making decisions,” “inferring,” “understanding,” etc. are psychological predicates and, according to §281 of Wittgenstein’s (1958) Philosophical Investigations (I), applicable to human beings only, not parts of them. It is well beyond the scope of this article to analyze and evaluate Wittgenstein’s argumentation concerning that remark in full detail since it is embedded in a complex nexus of his views on understanding and meaning, rule-following, first- and third-person psychological sentences, etc. 12 It is, however, of paramount importance that it is a remark neither justified nor justifiable by empirical evidence but a conceptual remark on the logical grammar of our language (Smit & Hacker, 2014). There is no way to determine empirically or suddenly “discover” that brains or computers make decisions and assumptions, interpret, infer, or understand something because, in marked contrast to human beings, it is not clear what is to count as a brain’s or computer’s doing so. These expressions (as well as their negations) transgress the logical grammar of our language; they are the result of what Hacker and colleagues call a conceptual confusion regarding psychological predicates.
It is like asking whether or not a tree calculates, interprets, or is making decisions, or, to take one of Wittgenstein’s most famous examples, whether or not the number three is colored.
The question remains why homuncular language is so blatantly used within the cue-approach to depth perception and current vision science in general. Are vision scientists simply not aware of the fact that they are using it, is it inevitable, or is it perhaps supposed to serve a specific purpose? Could it be somehow or other theoretically fruitful to propose that depth cues are “information about depth” the brain can use in order to “infer” the spatial layout of a scene?
Is homuncular language fruitful in vision science?
Certainly, there is no a priori reason not to extend existing concepts in novel ways and introduce new ways of speaking, especially in scientific theory. 13 For example, one could, for whatever theoretical reasons, establish a new way of speaking by saying that one will speak of (natural) numbers as “colored” if they are even and as “not colored” if they are odd. The question whether or not the number three is colored is meaningful now, in that new technical sense, and can be answered (negatively), though this new use of “colored” runs the risk of causing confusion because of our various customary uses of the term. But that is not the case in the matter at hand. Vision scientists using homuncular language have not laid down new conditions of application for these psychological predicates; they have not stated how they want to use them or how they want them to be interpreted. Rather, they seem to imply that the meaning of these expressions is entirely clear by reference to their customary use, i.e., that since we know what the word “brain” means and what it means to say that a human being calculates, makes decisions, etc., we also know what it means to say that the brain is doing so. That inference is logically untenable.
Vision scientists do not use these expressions metonymically either, as mere and harmless ways of speaking. As I have tried to demonstrate by various quotations, the whole explanatory force of the current cue-approach to depth perception rests on those assertions: they are supposed to explain (depth) perception. By saying that the brain “interprets” depth cues as “information about depth” by means of “calculating” or “inferring,” vision scientists clearly do not mean that the perceiving subject is doing these things, as we sometimes mean that a person is a quick thinker by saying that his brain is working fast. On the contrary, it is regularly and explicitly stated that the perceiving subject is not aware of those processes, which is not surprising either. Claiming that we perceive depth because we use information, infer, calculate, or estimate would obviously be at odds with our own experience and would amount to an explanatory regress as well.
Visual perception is undoubtedly one of the most intriguing, complex, and perplexing aspects of our lives and the answer to the question of why we see what we do will most certainly not be a simple one. Perhaps the use of homuncular language in vision science is just a manifestation of both our still-limited knowledge of vision and the effort to handle its immense complexity on a theoretical and linguistic level. Clearly, the concept of depth cues is intuitively appealing, since anybody looking at Figure 1 will confirm that there is an actual difference in perceived depth when looking at the two pictures. Furthermore, without a brain we are not able to see at all and current mathematical models of cue-integration are quite suitable for fitting quantitative data derived from experiments on depth perception. However, while the fact that such data can be adequately modeled by cue-integration models is highly interesting, it does no more to support the claim that the brain is doing any of the “calculations” specified in those models than the fact that the planetary movements can accurately be described in mathematical terms supports the claim that the planets are somehow “calculating” the way they have to “travel.” Such an argument would simply confuse the mathematical description of a process with the process described.
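To make concrete what a cue-integration model of this kind typically specifies, the following is a minimal sketch of reliability-weighted averaging, a widely used form of such models. The article does not commit to a particular model, so the choice of this form, the function name `combine_cues`, and the example numbers are illustrative assumptions. Note that the calculation is carried out by the modeler running the code; nothing in the sketch bears on the question of whether the brain performs it.

```python
def combine_cues(estimates, variances):
    """Reliability-weighted average of single-cue depth estimates.

    estimates: depth values suggested by each cue (arbitrary units).
    variances: variance attributed to each cue's estimate (must be > 0).
    Returns (combined_estimate, combined_variance, weights).
    """
    if not estimates or len(estimates) != len(variances):
        raise ValueError("need one positive variance per estimate")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")

    # Reliability is defined as the inverse of the variance.
    reliabilities = [1.0 / v for v in variances]
    total = sum(reliabilities)

    # Each cue's weight is its share of the total reliability.
    weights = [r / total for r in reliabilities]
    combined = sum(w * e for w, e in zip(weights, estimates))

    # The model's predicted variance for the combined estimate.
    combined_variance = 1.0 / total
    return combined, combined_variance, weights


# Hypothetical numbers: two cues suggesting depths of 10 and 14 units,
# with variances 1 and 3 respectively.
depth, variance, weights = combine_cues([10.0, 14.0], [1.0, 3.0])
```

With these hypothetical inputs the weights come out as 0.75 and 0.25, the combined estimate as 11, and the predicted variance as 0.75. A good fit of such predictions to experimental data shows that the description is adequate, which is a separate matter from what the described system is doing.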
But the problem of homuncular language bites deeper than that since there is great danger in such loose talk for critical thinking and theorizing about visual perception and cognition in general. Homuncular language can give rise to the illusion that we are closer to accounting for mental phenomena than we really are. If the brain can be characterized in psychological terms like “interpreting,” “inferring,” or even “understanding,” it looks as though we are in the right conceptual sphere when trying to explain mental phenomena. Yet, as has already been said, these assertions are neither justified nor justifiable by empirical evidence and if we strip our theoretical language concerning the brain of these misleading predicates and restrict it to the actual results of respective experiments, for example that certain neurons fire more frequently when a certain stimulus configuration is presented to a subject, it becomes obvious how far we actually are from accounting for the phenomena we wish to explain. There is no explanatory gain in saying that certain neurons have the ability to “detect” lines or that the brain “interprets” depth cues. Homuncular language has the air of explanation but it is ultimately explanatorily empty. From a pedagogical stance, homuncular language is especially harmful for students relatively new to vision science. Textbooks claiming that we perceive depth because the brain “infers” or “understands” the spatial layout from “information about depth” nurture the mental attitude that visual perception is not puzzling at all, that we have, in principle, figured out the big picture and only have to fill in the details with further experiments, thereby preventing the important development of sensitivity for both the manifoldness of phenomena and the related problems at hand.
Concluding remarks
The aim of this article was to show that the concept of depth cues, one of the most prominent and frequently used concepts in classical and current vision science, suffers from a serious logical flaw, namely that the characterizations of depth cues, in order to be meaningful, presuppose a homunculus within human beings equipped with our cognitive abilities, which renders them explanatorily empty. Furthermore, even if these assertions were meaningful (or if such a homunculus were actually postulated), there would remain an even more troubling problem. The cue concept, as well as the current approach to vision in general, is supposed to explain our perceptions, i.e., what we see. The connection, however, between the brain’s (or homunculus’) “interpreting,” “comparing,” “calculating,” “estimating,” etc. and our seeing something remains totally obscure. This issue is not at all elucidated by claiming that the brain, by means of depth cues, is simply “generating a stable three-dimensional percept from two-dimensional retinal images” (Troscianko, Montagnon, Le Clerk, Malbert, & Chanteau, 1991, p. 1923) or “translating a two-dimensional retinal surface … into perceptions of depth” (Levine, 2000, p. 297) since it is not at all clear what these assertions are supposed to mean. They certainly do not help our understanding of why we see what we do.
Of course, the general thrust of the arguments presented here is not at all new. Various authors, generally of philosophical provenance (e.g., Bennett & Hacker, 2003; Dennett, 1981; Heil, 1981; Kenny, 1981; Ryle, 1949/2009; Searle, 1990; Wittgenstein, 1958), have emphatically cautioned against the prevalence and associated logical inconsistencies of homuncular language in cognitive science. Speaking as a vision scientist, it is, however, both surprising and frustrating that there seems to be little to no discussion about that matter within the cognitive/vision science community. To my knowledge, the only exception worth remarking on is what is known as the ecological approach to visual perception (Gibson, 1966, 1979/1986). Boldly challenging the traditional inferential/computational conceptualization of perception in its entirety (e.g., by refuting the sensation/perception distinction, the notion of the “meaningless and impoverished stimulus,” stimulus-response theory, and the concept of cues), Gibson introduced a fresh and radically different general theoretical framework of perception with its own idiosyncratic terminology. 14 Yet, despite a substantial body of theoretical and empirical work by advocates of the ecological approach (e.g., Fajen, Riley, & Turvey, 2008; Freeman, 1965; Mark, 1987; Orlandi, 2013; Oudejans, Michaels, Bakker, & Dolné, 1996; Reed, 1982; Stoffregen, Gorday, Sheng, & Flynn, 1999; Turvey, 1992; Warren, 1984; Withagen & Michaels, 2005; Wraga, 1999), its impact on current (mainstream) vision science has been marginal, at best. Worse still, as Costall and Morris (2015) have shown, there is a noticeable tendency in textbooks on visual perception to assimilate Gibson’s ideas with the very theoretical positions he fiercely attacked, thereby effectively neutralizing his critique and further facilitating the impression that the current general framework is without dispute. 15
Research articles and textbooks loaded with numerous and often unequivocal examples of homuncular language continue to be published. Vision scientists need to scrutinize such talk with care and try to avoid it. Or, if we feel that its use might serve a specific purpose, e.g., as a picturesque or “as-if” description of processes we do not fully understand, we have to make that explicit to the reader in an unambiguous fashion in order to prevent the impression that we know more than we actually do. Drawing on the concept of depth cues heuristically may yield relevant and interesting empirical results (see, e.g., Koenderink, van Doorn, Pinna, & Pepperell, 2016; Pagel, 2017; Todorović, 2008; Vishwanath, Girshick, & Banks, 2005), but these results remain ultimately incomprehensible due to the inherent theoretical problems of that concept and the absence of a logically consistent general theoretical framework. Furthermore, it would facilitate critical thinking about visual perception and psychological theories in general if textbooks emphasized that the current theoretical framework is not without dispute and that other, radically different theoretical approaches, such as Gibson’s theory or the phenomenological approach (e.g., Husserl, 1997; Merleau-Ponty, 1945/2012), are serious alternatives which need to be discussed.
Needless to say, I strongly disagree with the opinion that a discussion about the “adequacy of words” will turn out to be endless, ultimately fruitless, and should be left, if at all, to philosophers only; that cognitive scientists should instead concentrate on collecting empirical data in order to “unravel important facts” about human beings (e.g., Zeki, 1993). Experiments carried out by cognitive scientists are strongly shaped by the current theoretical framework, the interpretation of respective empirical results even more. Therefore, cognitive scientists should be aware of their conceptual framework and need to subject it to critical scrutiny because if the arguments presented here are coherent, this would have far-reaching implications for current cognitive science as a whole—to paraphrase a remark by Wittgenstein (1980): one should not forget to go down to the foundations, to put the question marks deep down enough.
Footnotes
Acknowledgements
I thank Dieter Heyer, the reviewers, and the editors for helpful discussions and/or comments on earlier drafts of this article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
