Abstract
Sensory substitution devices (SSDs) can convey visuospatial information through spatialised auditory or tactile stimulation using wearable technology. However, the level of information loss associated with this transformation is unknown. In this study, novice users discriminated the location of two objects at 1.2 m using devices that transformed a 16 × 8-depth map into spatially distributed patterns of light, sound, or touch on the abdomen. Results showed that through active sensing, participants could discriminate the vertical position of objects to a visual angle of 1°, 14°, and 21°, and their distance to 2 cm, 8 cm, and 29 cm using these visual, auditory, and haptic SSDs, respectively. Visual SSDs significantly outperformed auditory and tactile SSDs on vertical localisation, whereas for depth perception, all devices significantly differed from one another (visual > auditory > haptic). Our findings highlight the high level of acuity possible for SSDs even with low spatial resolutions (e.g., 16 × 8) and quantify the level of information loss attributable to this transformation for the SSD user. Finally, we discuss ways of closing this “modality gap” found in SSDs and conclude that this process is best benchmarked against performance with SSDs that return to their primary modality (e.g., visuospatial into visual).
Introduction
Vision provides an abundance of spatial information, helping to localise individual objects, facilitate orientation, and assist navigation within the environment (Epstein, 2008; Kelly, McNamara, Bodenheimer, Carr, & Rieser, 2008; Michel & Henaff, 2004; Urbanski et al., 2008). Relative to those with visual experience (the sighted, late-blind), congenitally blind individuals who have never experienced rich visual information showcase a range of subtle shifts in spatial processing, including a lack of automatic integration between tactile and auditory spatial reference frames (Hötting & Röder, 2004; Hötting, Rösler, & Röder, 2004); distorted representations of space, compromising the ability to make judgements based on the Euclidean distance between auditory targets (Gori, Sandini, Martinoli, & Burr, 2013); contraction of space resulting in an overestimation of distance (Cappagli, Cocchi, & Gori, 2017; Kolarik, Pardhan, Cirstea, & Moore, 2017); increased difficulty in replicating movement in space (Gori, Cappagli, Baud-Bovy, & Finocchietti, 2017); and finally, show a preference for, and effective use of, egocentric (self-oriented) representations of external space but a slower processing of allocentric (environment-oriented) representations, which for some but not all situations can result in increased errors (Corazzini, Tinti, Schmidt, Mirandola, & Cornoldi, 2010; Heller & Kennedy, 1990; Iachini, Ruggiero, & Ruotolo, 2014; Pasqualotto & Proulx, 2012; Ruggiero, Ruotolo, & Iachini, 2012, 2018; Vercillo, Tonelli, & Gori, 2018), which in turn, can result in an impaired spatial memory for the congenitally blind (Pasqualotto, Spiller, Jansari, & Proulx, 2013). Overall, the experience of vision continues to influence our representation of space even if no visual input is currently available. 
As such, providing structured spatial information to the fully blind not only serves as an intervention to assist in daily tasks (Elli, Benetti, & Collignon, 2014; Hamilton-Fletcher, Obrist, Watten, Mengucci, & Ward, 2016; Kristjánsson et al., 2016; Maidenbaum, Abboud, & Amedi, 2014), but it allows users to interpret their remaining senses more accurately (Cappagli, Finocchietti, Baud-Bovy, Cocchi, & Gori, 2017) and explore how our metamodal representation of space can develop through additional sensory stimulation (Pasqualotto & Esenkaya, 2016).
One modern approach to extending the sensory world of fully blind individuals is to use “sensory substitution devices” (SSDs) that translate the sensory properties of one sense (e.g., vision) into another (e.g., hearing or touch). The substituted modality is defined by the specificity of the sensory signal (e.g., only vision experiences colour) and the sensorimotor contingencies involved (e.g., only vision experiences “visual occlusion”; O'Regan & Noë, 2001). If only a subset of these visual signals is conveyed (e.g., only spatial information), and the sensorimotor contingencies involved are closer to vision than to veridical hearing or touch, then the information would be described as “visuospatial.” SSDs use consistent mappings between the senses to allow the user to decode the pattern of received signals back into a coherent mental image of the world upon which to act. For SSDs converting vision into sound or touch, these received signals can indicate the position, size, shape, colour, and motion of visual objects (Gori, Cappagli, Tonelli, Baud-Bovy, & Finocchietti, 2016; Hamilton-Fletcher & Ward, 2013). Over time, these can even evoke a qualitative shift in user perception so that sound or touch can allow the user to perceive spatialised patterns of light (Ortiz et al., 2011; Ward & Meijer, 2010), a form of synthetically acquired synaesthesia comparable to vision (Ward & Wright, 2014). SSDs not only have the potential to provide access to new spatial information and shift metamodal representations of space, but interviews with users also suggest a range of practical benefits outside of the lab, including enhanced safety, independence, and curiosity about the environment, further encouraging spatial exploration (Hamilton-Fletcher, Obrist, et al., 2016).
Visuospatial SSDs tend to fall under two main types of feedback, haptic (the active use of touch), or auditory (the sense of listening), and while we explore specific examples of these SSDs later, wider overviews of SSDs are available (Gori et al., 2016; Hamilton-Fletcher & Ward, 2013).
Haptic (also known as tactile) devices were first popularised in 1969 with the Tactile Vision Sensory Substitution device (Bach-y-Rita, Collins, Saunders, White, & Scadden, 1969). This device used a television studio camera (which users could manipulate) connected to vibrating solenoids, positioned in the back of a dentist’s chair, arranged in a matrix that resembled the camera’s pixels. Bach-y-Rita and his research group continued to improve their design throughout the following decades, experimenting with electrotactile stimulation, and, most notably, leading to the creation of the Tongue Display Unit (TDU; Sampaio, Maris, & Bach-y-Rita, 2001); an SSD that converts 2D greyscale images into a spatialised pattern of electrotactile stimulation on the tongue in real time. The output is constrained to a 12 × 12 matrix with users able to reach a visual acuity of up to 20/430 for single objects (for comparison, 20/20 is optimal, and 20/200 is the threshold for blindness). More recently, the BrainPort has utilised a 20 × 20 matrix; however, there is no recorded increase in visual acuity from this (Nau, Bach, & Fisher, 2013; Stronks, Mitchell, Nau, & Barnes, 2016).
Since Bach-y-Rita’s work, vibrotactile technology has progressed with the use of 3D cameras and mobile computer processing. What once required a film crew, many hands, and a somewhat unwieldy dentistry chair can now fit into a single-unit wearable vest (Ertan, Lee, Willets, Tan, & Pentland, 1998; Wacker et al., 2016). Many different haptic feedback vests have been developed over the past few years (Ertan et al., 1998; Jones, Nakamura, & Lockyer, 2004; Rochlis, 1998; Spanlang, Normand, Giannopoulos, & Slater, 2010), with the VibroVision among the most recent (Wacker et al., 2016). Like most haptic vests, the VibroVision contains three components: an input source, in this case a 3D sensor that records the necessary visuospatial information; an image processing unit, which converts the visuospatial information into values that drive the output; and finally, a tactile output, delivered through a series of vibrating motors arranged in a 16 × 8 matrix on the user’s abdomen (Wacker et al., 2016).
While visuospatial-into-audio devices were created much earlier (e.g., Noiszewski’s Elektroftalm in 1897; Starkiewicz & Kuliszewski, 1963), they only gained in popularity for scientific research after the introduction of “the vOICe” (Meijer, 1992). This software converts 2D greyscale images into sounds played through headphones, utilising stereo playback and time for the horizontal axis (scanning from left to right over time), pitch changes to represent verticality (higher pitches = higher locations), and volume indicating brightness (louder = brighter). In combination, an image of a line rising from the bottom left to the top right would produce a low-pitched tone in the left ear that rises to a high-pitched tone in the right ear. The vOICe is still considered one of the higher resolution auditory devices, with the device typically outputting at a resolution of 176 × 64 and allowing a perceptual resolution bordering on 20/200 on single object discrimination (Haigh, Brown, Meijer, & Proulx, 2013; Striem-Amit, Guendelman, & Amedi, 2012).
More recent SSDs have tended to prioritise 3D information owing to its immediate practicality for locating individual objects, segmenting their shape, and utility for navigation (Caraiman et al., 2017; Dunai, Peris-Fajarnés, Lluna, & Defez, 2013; Fristot, Boucheteil, Granjon, Pellerin, & Alleysson, 2012; Spagnol, Baldan, & Unnthorsson, 2017; Stoll et al., 2015). When objects within the field of view (FoV) of a 3D sensor are sonified in a manner that replicates the perception of sounds emanating from the location of the object, this process is also referred to as a “virtual acoustic space” (Eckert, Blex, & Friedrich, 2018; González-Mora, Rodriguez-Hernandez, Burunat, Martin, & Castellano, 2006; González-Mora, Rodriguez-Hernandez, Rodriguez-Ramos, Díaz-Saco, & Sosa, 1999; Rodríguez-Hernández et al., 2010). One modern incarnation of this is the “Synaestheatre” that converts a depth image from a 3D sensor into realistically spatialised sounds (Hamilton-Fletcher, Obrist, et al., 2016). The sounds are spatialised using head-related transfer function (HRTF), which describes how the spatial positions of the listener and sound source alter the received sounds in terms of interaural timing, intensity, as well as distortions created by the head and pinnae (Algazi, Avendano, & Duda, 2001; Kistler & Wightman, 1992; Kulkarni, Isabelle, & Colburn, 1999; Potisk, 2015). The Synaestheatre produces audio that updates in real time with sounds that are fully 3D spatialised. As such, the Synaestheatre is similar to the VibroVision in that azimuth and elevation are conveyed spatially (rather than through abstract codes), albeit through hearing and touch, respectively.
To date, studies examining how effectively visuospatial information can be converted into hearing or touch have been difficult to compare owing to several confounds that obscure why such differences exist, as well as lacking a “visual” benchmark to contextualise these spatial discrimination abilities. Prior studies examining auditory/haptic approaches (either directly or indirectly) have varied in the following ways: the type of spatial information being provided (e.g., 2D luminance vs. a single-point depth sensor—Bermejo, Di Paolo, Hüg, & Arias, 2015); the spatial resolution being encoded on devices (12 × 12—Sampaio et al., 2001; 20 × 20—Nau et al., 2013; 176 × 64—Haigh et al., 2013; Striem-Amit et al., 2012); variations in the temporal resolution (TDU/BrainPort updating in real time; the vOICe updating once per second); the presentation of information to the user (e.g., “all-at-once” for the TDU/BrainPort or “column-by-column” for the vOICe); there is also an inconsistent use of abstract mappings (e.g., verticality is represented spatially in the BrainPort but via pitch in the vOICe); and finally, without a “visual” condition using this same information to contextualise performance, it is unclear how much information is “lost” in the translation and what the upper bound of performance is likely to be (Cha, Horch, & Normann, 1992). By keeping these aspects consistent across visual, auditory, and haptic conditions, it becomes possible to establish the extent of information loss that occurs from transforming between modalities while enabling a “fairer” comparison.
To address these gaps in knowledge, the present study compares SSDs that output the same spatial resolution (16 × 8) into spatially distributed patterns of visual, auditory, or tactile stimulation. Users are then evaluated in terms of spatial acuity when using these devices to determine the extent of information loss incurred from a purely spatial transformation for sensory substitution. The Synaestheatre device (auditory) and the VibroVision vest (haptic) were selected for comparison as they can output the same spatial resolution into spatialised sound or touch, making them an ideal comparison. To establish the upper limit possible for using a 16 × 8-depth map, a final condition of viewing this information through spatialised patches of light was created (also using the Synaestheatre software). To examine their spatial acuity, participants were tasked with identifying which of two stimuli was either highest (in the vertical condition) or closest (in the distance condition), using the Synaestheatre (visual), VibroVision (haptic), and Synaestheatre (auditory) in turn. Through a comparison of visual, haptic, and auditory approaches, the spatial information loss attributable to transformations between modalities from using traditional SSD designs can be established.
Hypotheses
Given that the participants were healthy sighted individuals who will be most familiar with using their vision for spatial perception, it is extremely likely that they will be most successful at spatial awareness tasks using their eyesight, compared with haptic or auditory approaches (H1). It is also expected that participants will yield a greater accuracy on the depth perception tasks compared with verticality (H2) across all sensory modalities; this is due to the form of the SSDs’ output, the signal for depth being primarily represented through amplitude (brightness, loudness, vibration intensity), compared with verticality which requires the user to perform a more complex interpretation of the signal to discriminate positions (detailed later). As the spatial resolution is identical for both the Synaestheatre and the VibroVision, it is possible that the users will perform equally well on the spatial perception tasks (H0), indicating that the technology provides the limit on discrimination rather than the user’s discrimination abilities with each modality.
Method
Participants
Right-handed participants (N = 22; age = 24 ± 1 years; 16 female, 6 male) were recruited via internal university communications and were predominantly of an academic background; participation was incentivised by the possibility of winning £25 and an opportunity to use novel technology. All the participants read an information sheet about the experiment and signed a written form of consent. Ethical clearance was provided by the host institution. Participants were also instructed that they may withdraw from the experiment at any time of their choosing. Only participants who had normal, or corrected-to-normal, eyesight and who explicitly reported normal hearing were accepted. No participants had prior experience of using SSDs.
Materials
Synaestheatre
The Synaestheatre device converts visuospatial information into an audio signal using a 3D camera sensor (Structure Sensor, Occipital, USA) mounted on a virtual reality headset (Z4, BOBOVR, China) and a smartphone application (SE, Apple, USA; coded in Objective-C). The application converts the depth image relayed from the sensor into grey noise. Grey noise was chosen as it is perceptually the same loudness across all frequencies, making modulation by HRTF easier to identify. The audio sample is filtered by Panorama (version 5.86, Wavearts, Inc) for Reaper audio software (version 5, Cockos, Inc), which provides HRTF modulation to the grey-noise audio samples. To replicate the HRTF characteristics, the Panorama plug-in filter creates a virtual sound source and a virtual listener, then synthesises a stereo audio signal to replicate the effect of a sound source 2 m away from a human head, varying in azimuth and elevation. To align the depth map and the audio file HRTF locations, the FoV of the camera (horizontal = 58°; vertical = 45°) was separated into a 16 × 8 resolution, allowing us to generate 128 audio files at combinations of the following increments: horizontal degrees –29, –25, –21, –17, –14, –10, –6, –2, +2, +6, +10, +14, +17, +21, +25, +29; vertical degrees –23, –16, –10, –3, +3, +10, +16, +23. The measurements to calculate the HRTF were appropriated from a KEMAR head dummy, modelled from the CIPIC HRTF database (Algazi, Duda, Thompson, & Avendano, 2001). The sound source was 500 ms of grey noise, with HRTF varying its received interaural timing and intensity (for horizontal localisation) and spectral frequency content (for vertical localisation).
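As an illustration of how the 128 azimuth and elevation combinations can be enumerated, the following minimal Python sketch reproduces the listed increments. It rests on an assumption that is not stated in the text: that the listed values correspond to 16 points evenly spaced between –29° and +29° and 8 points evenly spaced between –23° and +23°, each rounded to the nearest degree. The function and variable names are hypothetical and are not taken from the Synaestheatre software.

```python
def evenly_spaced(lo, hi, n):
    """Return n evenly spaced values from lo to hi inclusive."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


# Assumption: end points are the outermost angles listed in the text.
AZIMUTHS = [round(a) for a in evenly_spaced(-29, 29, 16)]   # horizontal degrees
ELEVATIONS = [round(e) for e in evenly_spaced(-23, 23, 8)]  # vertical degrees

# One (azimuth, elevation) pair per cell of the 16 x 8 depth map.
GRID = [(az, el) for el in ELEVATIONS for az in AZIMUTHS]

print(AZIMUTHS)
print(ELEVATIONS)
print(len(GRID))  # 128 combinations, one per audio file
```

Under this assumption, the rounded values coincide with the increments listed above, which suggests that each audio file corresponds to one cell of the depth map.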
Grey noise is ideal for using HRTF to interpret verticality as it is only possible to distinguish the vertical locations of “broadband sounds” which contain a wide range of frequencies (e.g., noise, music, or speech). By contrast, it is not possible to tell the vertical location of a single frequency from hearing alone (Blauert, 1997). This is due to the shape of the pinnae (outer ear). When sound bounces off the pinnae, and the source is from a higher vertical place, it preserves more high-frequency content, giving the perceived sound a “tinnier” quality. When the sound source is from a lower place, the higher frequencies are suppressed, giving the perceived noise a quality of bass (Parise, Knorre, & Ernst, 2014). Because grey noise is the same loudness across all frequencies, this makes any subtle sound changes caused by vertical HRTF easier to discriminate for the user (see Figure 1). Closer distances are represented through increasing loudness to the user. The combination of HRTF and depth-into-loudness allows the user to locate objects in 3D space. During sonification of the depth map, a slight horizontal timing offset is applied to help separate left/right objects; however, changes in signal are instantly responsive to movement of the head or objects. In the visual condition, participants view the Synaestheatre’s depth map directly through a stereo image displayed on an iPhone screen that is mounted in a mobile VR headset (see Figure 2). For the end user wearing the VR headset, this is experienced as viewing a 2D “screen” in front of them with a 16 × 8 resolution; the horizontal and vertical position of external objects are mapped to their respective horizontal and vertical pixel locations, with the proximity of objects being conveyed through the brightness of the pixels on the screen.

Figure 1. Spectrogram of noise at different elevations using head-related transfer function; darker patches indicate higher amplitudes. Noise at lower elevations suppresses higher frequencies, while at higher elevations, the higher frequencies are preserved while lower frequencies are suppressed.

Figure 2. Left image is of a Synaestheatre user; right image is of the Synaestheatre observing two disc stimuli that vary in distance to the sensor. In the visual condition, users directly observe the 16 × 8-depth map. The iPhone shows two identical 16 × 8-depth maps, as one is presented separately to each eye within the VR headset; these are subjectively perceived as a single image. In the “auditory” condition, users do not receive any visual stimulation and only hear sounds played from those spatial locations.
VibroVision
The VibroVision vest is a device developed by Wacker et al. (2016) that converts visuospatial information into a 2D vibration image on the abdomen. Like the Synaestheatre device, it converts space into a 16 × 8-depth map matrix (128 depth points), but instead of the output signal being provided as unique sound files, the information is output through 128 eccentric rotating mass pager motors (see Figure 3). For the end user, the presence of an object picked up by the sensor is felt as a distinct patch of vibration on the abdomen; the horizontal and vertical spatial location of the stimuli in the sensor image is mapped to the horizontal and vertical position on the abdomen, with closer objects making the vibrational patch on the abdomen vibrate more intensely. The spatial information is gathered using a 3D camera (Xtion PRO LIVE, ASUS, Taiwan) mounted to the chest area on the vest; the depth information is relayed to a small computer (2, Raspberry Pi, UK) and is then processed to provide a signal for the vibration motors; processing and vibration motor control are coded in Python, using the OpenCV library. Each electric motor vibrates at a frequency of between 183 and 233 Hz and is sized at 2.25 cm², with an average centre-to-centre intertactor spacing of 2.98 cm horizontally and 3.04 cm vertically. This creates an overall stimulation space of 47 cm horizontally by 23.5 cm vertically centred on the subject’s torso. The centre-to-centre intertactor spacing of the VibroVision vest exceeds the 2 cm two-point threshold for the abdomen for the 18 to 28 age range of our participants (Stevens & Choo, 1996; also see Lederman & Klatzky, 2009; Weinstein, 1968).
While the spatial acuity for vibrotactile stimuli on the abdomen remains an open question (although see Cholewiak, Brill, & Schwab, 2004), the spatial discrimination of vibrotactile stimuli on the back using 4 cm intertactor spacings can reach up to 70% to 77% correct responses in a 3AFC task (Jóhannesson et al., 2017). Because the abdomen has a more sensitive two-point discrimination threshold than the back, the abdomen may also have a higher spatial acuity for vibrotactile stimulation. The vest was calibrated to the same scale depth map as the Synaestheatre, with the FoV of the sensor at 58° (horizontal) and 48° (vertical). For the present study, the vest was powered with mains 240 V electricity through a cable; however, an option exists to use a 5 A LiPo battery pack for extra manoeuvrability.

Figure 3. Left image shows a VibroVision vest user; the chest-mounted camera provides 3D information that is conveyed via spatialised vibrations arranged in a 16 × 8 matrix (right image).
Both the Synaestheatre and VibroVision were given a maximum sensing range of 2 m so that only the stimuli were detected by the sensor. Distance was conveyed to users via a linear increase in intensity as distance decreased from 2 m to 0 m, manifesting as light intensity, auditory amplitude, or vibrotactile amplitude. In addition, as a result of the visual angle between each “pixel,” at certain distances (e.g., closer than 96.4 cm), the stimulus discs could be spread across additional “pixels,” producing a larger area of stimulation (see Figure 4).
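The linear distance-to-intensity mapping described above can be sketched in Python as follows. This is a minimal illustration rather than the devices' actual code: the normalised 0 to 1 output scale, the function name, and the treatment of out-of-range readings as silence are assumptions made for the example.

```python
def depth_to_intensity(depth_m, max_range_m=2.0):
    """Map a depth reading (metres) to a normalised intensity in [0, 1].

    Intensity ramps linearly from 0 at the maximum sensing range to 1 at
    0 m. Readings at or beyond the range, or missing readings (None), are
    treated as "nothing detected" (an assumption for this sketch).
    """
    if depth_m is None or depth_m >= max_range_m:
        return 0.0
    depth_m = max(depth_m, 0.0)  # guard against negative sensor noise
    return 1.0 - depth_m / max_range_m


# Example: two discs offset by +/- 50 cm around the 120 cm central point.
near = depth_to_intensity(0.70)
far = depth_to_intensity(1.70)
print(near > far)  # the closer disc is brighter / louder / stronger
```

The same normalised value could then be scaled to pixel brightness, audio gain, or motor drive level, depending on the output modality.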

Figure 4. How stimuli are perceived by subjects. The top row showcases the depth condition, and the bottom row showcases the verticality condition. The “Stimuli” column shows visual image examples of stimuli disc locations. The “Visual SSD” column showcases the visual images observed by subjects while wearing the VR headset; distance information is coded as pixel brightness (closer = brighter). The “Audio SSD” column shows which audio files are active (audio icons) and their loudness (sound waves) for these stimuli locations; distance information is coded as audio file loudness (closer = louder). Examples of the auditory SSD can be found at the following links: https://youtu.be/QO2omDc3Orw and https://youtu.be/5_rASaDsg5c. The “Haptic SSD” column shows which vibrotactile motors would be active (white blocks) and their vibrotactile intensity (block size) for these stimuli locations; distance is coded as vibrotactile intensity. As a result of visual angles, at certain thresholds (e.g., closer than 96.4 cm), closer stimuli would activate more “pixels,” resulting in a larger area of stimulation across all SSDs. Subjects were stationary in all conditions but could “look around” by varying the pitch, yaw, and roll of the sensor. Subjects were blindfolded in the auditory and haptic SSD conditions.
Stimuli
Each stimulus consisted of a wooden base (20 cm × 20 cm × 4 cm) with a 1.2-cm hole drilled in the centre. A dowel rod (1.2 cm × 150 cm), marked in 1 cm increments, was inserted in the hole flush with the bottom of the base of the stimulus. The final piece of each stimulus (the head) was a cardboard disc (diameter = 34 cm) with a peg attached to the centre, allowing the disc to be moved to any position on the dowel and, therefore, its vertical position to be measured. Two parallel tape measures (200 cm from the participant test mark) were placed on the floor 100 cm apart to measure depth.
Procedure
Participants were given minimal training and instruction to ensure that results were indicative of device intuition and not meticulous practice. This focus on initial user abilities expands upon our earlier work on the initial impressions of SSDs by potential blind users, who expressed a desire for devices that intuitively conveyed spatial information (Hamilton-Fletcher, Obrist, et al., 2016). A positive initial user experience can bypass the difficulties cited by current SSD experts during their initial learning phase and lower the barrier to entry for long-term adoption and accessing the benefits of SSD expertise (Ward & Meijer, 2010).
Participants were shown depth map representations of the disc stimuli (see Figure 2) while not blindfolded to aid them in understanding how they would perceive the stimuli once sight was removed. The presentation of stimuli for all practice trials and conditions was fully randomised in Microsoft Excel as to whether the left or right disc stimulus was the correct answer (i.e., higher or closer). Before completing the experimental trials, each participant would complete two practice trials using each device, in which they would receive feedback as to whether they made a correct guess. During the experimental trials, the participants received no feedback as to whether they made correct guesses.
The experiment followed a within-subject staircase design, with all participants using every device to complete both depth and verticality spatial perception tasks. All distances and measurements described in relation to the stimuli refer to their central points. The depth task involved standing on a mark on the floor in front and centre of the two parallel stimuli (100 cm apart); the disc stimuli were set 100 cm from the floor; and the “central point” from which the disc stimulus locations varied was set 120 cm from the participants’ location (see Figure 5). The participant was then required to state which stimulus they thought was closer to them by responding verbally with “left” or “right.” The experimenter wrote down each response, which also informed them as to whether the difficulty should increase or decrease for the subsequent trial. The distance between the stimuli was initially set at ± 50 cm (e.g., the right stimulus would be set 50 cm closer than the central point, and the left stimulus would be 50 cm further than the central point). This distance would reduce by 15 cm for every correct guess, thereby increasing the difficulty in identifying the difference between the stimuli. When the participant eventually made a mistake, a reversal would occur, increasing the distance by 5 cm. From then on, each correct guess would decrease the difference by 5 cm, and each incorrect guess would increase the difference by 5 cm. The stimuli would always increase or decrease relative to the central point; this was done to ensure a uniform increase/decrease in difficulty across all devices and participants. There were 10 trials for depth perception for each of the devices. The benefit of following this type of adaptive staircase procedure, as opposed to the standard “3-down-1-up” method (Cornsweet, 1962), is that the number of trials could be decreased drastically. Due to the manual movement of stimuli, each trial was time costly.
Therefore, it was practical to reduce the number of trials to minimise the overall experimentation time. However, one potential downside of this method was that if a participant was to make a mistake in the first few trials, and this mistake was not indicative of their actual ability at using the device, it could be impossible for them to reach their actual performance limit in the remaining trials. This was controlled by allowing subjects to return to increasing the difficulty in increments of 15 cm on correct answers, if a mistake was made in one of the first two trials, followed by 3 correct answers in a row. This ensured that a plateau could always be reached by the subject within the 10 trials. In the study, this rule ended up only being used for one subject with one SSD. Our pilot testing used two subjects to explore a variety of protocols (e.g., performance level at each difficulty; 1-down-1-up only), however, the time taken and number of trials made alternative approaches impractical. Further testing allowed us to estimate the number of trials at which a participant would hit their peak performance; this typically occurred at Trial Number 6. To further ensure that subjects had reached the limit of their spatial discrimination abilities, each participant’s dataset was visually inspected to ensure that their performance had reached a plateau; this was the case for all participants using each sensory modality.
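The core of the staircase rule described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the procedure's actual implementation: the function names are hypothetical, the minimum difference of 1 cm is assumed (the text does not specify a floor), and the exception that allowed a return to 15 cm steps after an early mistake is omitted for brevity. The final score is computed as the mean difference over the last three trials.

```python
def run_staircase(responses, start_cm=50, coarse_cm=15, fine_cm=5, min_cm=1):
    """Return the stimulus difference (cm) presented on each trial.

    responses: sequence of booleans, True for a correct answer.
    The difference shrinks by coarse_cm per correct answer until the first
    mistake; from then on it moves in fine_cm steps (down when correct,
    up when incorrect). min_cm is an assumed floor, not from the text.
    """
    diff = start_cm
    step = coarse_cm
    presented = []
    for correct in responses:
        presented.append(diff)
        if correct:
            diff = max(min_cm, diff - step)
        else:
            step = fine_cm            # first mistake switches to fine steps
            diff = diff + fine_cm
    return presented


def final_score(presented):
    """Mean difference over the last three trials."""
    last_three = presented[-3:]
    return sum(last_three) / len(last_three)


# Hypothetical participant: three correct answers, then alternating errors.
answers = [True, True, True, False, True, False, True, False, True, True]
levels = run_staircase(answers)
print(levels)               # [50, 35, 20, 5, 10, 5, 10, 5, 10, 5]
print(final_score(levels))  # mean of 5, 10 and 5
```

In this hypothetical run, the difference oscillates between 5 cm and 10 cm from the fourth trial onwards, which is the kind of plateau the visual inspection described above was intended to confirm.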

Figure 5. A top-down view of the depth condition. The (a) stimuli consist of two discs of 34 cm diameter suspended 100 cm from the floor and 100 cm horizontally from one another. The task involved (b) systematically varying the stimuli in their distance (±) from the (c) central point indicated by the horizontal red (dark grey) dotted line, which is set to a distance of 120 cm from the (d) participants’ location. The participant is tasked with answering whether the left or right stimulus disc is closer to them using the provided sensory substitution device (Synaestheatre, VibroVision). Note: Please refer to the online version of the article to view the figures in colour.
The protocol for testing vertical perception was similar to the protocol used in the depth perception condition. The stimuli were placed along the central point at 120 cm from the participant. The disc stimuli were offset from one another relative to the “middle point” 100 cm above floor level, starting at ± 50 cm from this point for each disc stimulus (see Figure 6). The measurement for verticality was taken from the centre point of each stimulus disc; therefore, considering the disc’s diameter (34 cm), if a participant managed to score an accuracy of 20 cm, each stimulus disc would effectively be overlapping the other by 14 cm in terms of vertical space. Measurements were taken in this way to diminish the time between trials and to ensure accurate data collection. As with the depth perception, the participant would answer which one did they perceive as higher (their left or right), and the distance would reduce by 15 cm until the participant made a mistake. After a reversal, changes to the difference would occur at 5 cm increments. All changes in the distance between the stimuli heads were anchored around the 100 cm mark on the dowel rods. There were 10 trials for vertical perception for each device.
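The overlap arithmetic above, together with a conversion from a vertical separation to an approximate visual angle, can be sketched in Python as follows. The angle conversion is an assumption made for illustration: it treats the separation as subtended at a viewpoint 120 cm away and level with the stimuli, which may differ from the conversion used to report the results. The function names are hypothetical.

```python
import math


def disc_overlap_cm(centre_separation_cm, disc_diameter_cm=34.0):
    """Vertical overlap between two discs whose centres differ in height."""
    return max(0.0, disc_diameter_cm - centre_separation_cm)


def visual_angle_deg(separation_cm, distance_cm=120.0):
    """Approximate angle subtended by a separation at a given distance.

    Assumes a simple right-angle geometry (an illustrative simplification).
    """
    return math.degrees(math.atan(separation_cm / distance_cm))


print(disc_overlap_cm(20))   # 14.0, matching the example in the text
print(visual_angle_deg(20))  # angle for a 20 cm separation at 120 cm
```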

Figure 6. A first-person view of the verticality condition. The (a) stimuli consist of two discs of 34 cm diameter, positioned 120 cm in front of the participant’s location; the stimuli are set with a horizontal centre-to-centre distance of 100 cm from one another. The task involved (b) systematically varying the stimuli in their distance (±) from the (c) middle point indicated by the horizontal red (dark grey) dotted line. The participant is tasked with answering whether the left or right stimulus disc is higher using the provided sensory substitution device (Synaestheatre, VibroVision). Note: Please refer to the online version of the article to view the figures in colour.
The first data collected from each participant was in the control condition. The participant used the Synaestheatre device with the sound turned off, and the visual depth map turned on. This condition provides excellent control data, as both devices were calibrated to the same 16 × 8-depth map; therefore, in sighted individuals, this represented the highest probable accuracy. The second device that the participants used was the VibroVision vest; during this condition, the participants were blindfolded and wore sound-reducing earplugs (to minimise the sound effect of the vibrating motors). The vest was tightly wrapped over one layer of the participant’s clothing, and if the participant was of a slight build, additional cling film was wrapped around the vest to make certain that the motor matrix had sufficient contact with the torso. Finally, the participants used the Synaestheatre with the visual depth map switched off and the sound switched on. During the tasks, subjects stood in a fixed stationary position, from which they could “look around” using the sensor by varying its pitch, yaw, and roll. For every device, the participant would complete the verticality trials first, followed by the depth trials. The visual condition came first as it was the easiest and was useful to familiarise participants with the task. The head-mounted camera used in the visual and auditory conditions caused minor discomfort to wear, so these conditions were scheduled as the first and last tasks. These pragmatic considerations come at the expense of counter-balancing, which should be considered by future studies. Finally, after each trial, the participant would give a confidence rating of their answer from 1 to 10 (1 = complete guess, 10 = completely certain).
Analysis
Final scores were taken as the average from the last three trials of each condition and analysed using SPSS (version 23, IBM Corp.). Six dependent variables were created including visual SSDs for vertical tasks (V-Vert), visual SSDs for depth tasks (V-Depth), haptic SSDs for vertical tasks (H-Vert), haptic SSDs for depth tasks (H-Depth), auditory SSDs for vertical tasks (A-Vert), and auditory SSDs for depth tasks (A-Depth). Any outliers (of which there were two) were detected and removed by using the outlier labelling method with g = 2.2; the reason for which was due to a relatively small sample size, and other methods (including the common outlier labelling method, with g = 1.5) tending to perform on the conservative side when dealing with smaller groups (Hoaglin, Iglewicz, & Tukey, 1986). The data were checked for normality by examining skewness and kurtosis, in addition to visually inspecting the plots. Furthermore, to check for any violations of sphericity, Mauchly’s test of sphericity was employed. A one-way repeated measures analysis of variance (ANOVA) was conducted, and subsequent post hoc pairwise comparisons made using the Bonferroni correction method. Effect sizes were calculated using Cohen’s d formula by dividing the mean difference by the pooled standard deviation. The participants’ confidence scores for each trial were separately compiled with mean scores calculated on a trial-by-trial basis.
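For readers unfamiliar with the outlier labelling rule, the sketch below shows the general form of the calculation: bounds are placed g times the interquartile spread below the first quartile and above the third quartile. It is a minimal illustration, not a reproduction of the SPSS analysis; the function names and the example data are hypothetical, and the quartile definition used here (Python's "inclusive" method) is an assumption that may differ from the one SPSS applies, so the resulting bounds can differ slightly.

```python
import statistics
from typing import List, Tuple


def outlier_bounds(values: List[float], g: float = 2.2) -> Tuple[float, float]:
    """Outlier labelling rule: [Q1 - g * (Q3 - Q1), Q3 + g * (Q3 - Q1)]."""
    # Quartile definitions vary between packages; "inclusive" is an assumption.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - g * spread, q3 + g * spread


def remove_outliers(values: List[float], g: float = 2.2) -> List[float]:
    """Keep only the values that fall inside the labelling bounds."""
    lower, upper = outlier_bounds(values, g)
    return [v for v in values if lower <= v <= upper]


# Made-up scores (cm) for illustration only; not data from the study.
example_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
lower, upper = outlier_bounds(example_scores)
cleaned = remove_outliers(example_scores)
```

Setting g = 1.5 gives the more familiar boxplot rule; the larger multiplier of 2.2 widens the bounds so that fewer points are labelled as outliers.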
Results
After data were imported into SPSS, the outlier labelling method (Hoaglin et al., 1986) established the acceptable bounds for each variable (V-Vert = 0 to 6.34 cm; V-Depth = 0 to 6.34 cm; H-Vert = 0 to 152.33 cm; H-Depth = 0 to 96.34 cm; A-Vert = 0 to 102.99 cm; and A-Depth = 0 to 50.67 cm) and that two outliers existed, which were subsequently removed from the dataset. Visual inspection of the data points on a histogram, in addition to examining skewness and kurtosis for V-Vert (skewness = 1.54, SE = .51; kurtosis =
Mean Spatial Discrimination Scores, Visual Angle, Standard Deviations, and Confidence Intervals for the Visual, Haptic, and Auditory Devices.
Note. SSDs = sensory substitution devices.
Mauchly’s test of sphericity showed that the assumption of sphericity was not met for verticality, χ²(2) = 6.43, p = .04, or depth, χ²(2) = 19.97, p < .001; therefore, Huynh-Feldt and Greenhouse-Geisser estimates of sphericity correction were applied, respectively (ε = .77 and ε = .60). The repeated measures ANOVA suggested that there were main effects for both the verticality, F(1.65, 31.29) = 24.44, p < .001, η² = .56, and depth tasks, F(1.20, 22.75) = 26.19, p < .001, η² = .58; therefore, post hoc pairwise comparisons were made using a Bonferroni adjustment (see Table 2 for mean differences for verticality and Table 3 for mean differences for depth).
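The fractional degrees of freedom in these F tests come from multiplying the uncorrected values by the sphericity estimate ε. The short sketch below illustrates this for the depth task only. The function name is hypothetical, and the sample size of 20 is an assumption inferred from the t(19) values reported for the paired comparisons rather than a figure stated in this section.

```python
from typing import Tuple


def corrected_df(epsilon: float, n_conditions: int,
                 n_participants: int) -> Tuple[float, float]:
    """Scale repeated-measures ANOVA degrees of freedom by epsilon."""
    df_effect = epsilon * (n_conditions - 1)
    df_error = df_effect * (n_participants - 1)
    return df_effect, df_error


# Depth task: three devices, assumed n = 20, rounded epsilon of .60.
depth_df_effect, depth_df_error = corrected_df(0.60, 3, 20)
```

Using the rounded ε of .60 gives approximately 1.20 and 22.8; the reported error term of 22.75 would follow from an unrounded ε slightly below .60.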
Mean Difference Between Devices for Verticality.
Note. Repeated values in grey.
**p≤.001.
Mean Differences Between Devices for Depth.
Note. Repeated values in grey.
*p≤.05. **p≤.001.
For discriminating verticality, users were significantly more accurate with visual SSDs than both haptic (p < .001, d = 2.65) and auditory SSDs (p < .001, d = 2.67); there was no significant difference in performance between haptic and auditory SSDs.
For discriminating depth, users were significantly more accurate with visual SSDs than both haptic and auditory SSDs (p < .001, d = 2.44; and p = .01, d = 1.41, respectively). However, unlike the outcomes for the verticality tasks, in the depth condition, users were significantly more accurate with auditory SSDs than haptic SSDs (p = .001, d = 1.48).
To compare whether discriminating verticality or depth differed in their difficulty, a series of Bonferroni-corrected paired sample t tests were performed with each device. On average, with the visual SSD, subjects performed slightly better with vertical discrimination (M = 2.09, SE = 0.28) than depth discrimination (M = 2.34, SE = 0.26); this difference, of –0.25 cm, was not significant, t(19) = –0.59, p = .561, d = 0.21. For the haptic SSD, subjects performed slightly worse with vertical discrimination (M = 44.42, SE = 6.69) than depth discrimination (M = 28.83, SE = 4.60); this difference, of 15.58 cm, was not significant, t(19) = 1.76, p = .095, d = 0.61. For the auditory SSD, subjects discriminated verticality to an average of 30.08 cm (SE = 4.22), with substantially superior depth discrimination abilities of 8.25 cm (SE = 1.62); this difference, of 21.83 cm, was highly significant, t(19) = 5.94, p < .001, d = 1.53. Overall, this indicates that discriminating verticality and depth was of a similar difficulty for visual and haptic approaches but that depth was significantly easier to discriminate than verticality for auditory SSDs.
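The effect sizes in this paragraph can be checked from the reported means and standard errors. The sketch below is an illustration under stated assumptions: the sample size of 20 is inferred from t(19), standard deviations are recovered as SE × √n, and the pooled standard deviation is taken as the root mean square of the two standard deviations (the equal-n form). The function names are hypothetical.

```python
import math

N = 20  # assumption: inferred from the t(19) degrees of freedom


def sd_from_se(se: float, n: int = N) -> float:
    """Recover a standard deviation from a standard error of the mean."""
    return se * math.sqrt(n)


def cohens_d(mean_a: float, se_a: float, mean_b: float, se_b: float,
             n: int = N) -> float:
    """Cohen's d: absolute mean difference over the pooled SD (equal n)."""
    sd_a = sd_from_se(se_a, n)
    sd_b = sd_from_se(se_b, n)
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2.0)
    return abs(mean_a - mean_b) / pooled_sd


# Vertical versus depth discrimination for each device (values from the text).
d_visual = cohens_d(2.09, 0.28, 2.34, 0.26)
d_haptic = cohens_d(44.42, 6.69, 28.83, 4.60)
d_auditory = cohens_d(30.08, 4.22, 8.25, 1.62)
```

Rounded to two decimals, these give 0.21, 0.61, and 1.53, matching the values reported above.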
For confidence ratings, vision had the highest mean across all 10 trials for vertical perception (see Figure 7, upper row) and had a total combined average of 7.34 ± 1.67. The mean scores for haptic and auditory were very similar at 4.32 ± 0.63 and 4.26 ± 0.58, respectively. Regarding depth perception (see Figure 7, lower row), vision had a higher total mean score than both haptic and auditory, scoring at 7.24 ± 2.01, compared with 5.59 ± 0.56 (haptic), and 5.44 ± 0.97 (auditory). However, examination of Figure 7 shows that visual confidence, despite remaining high over the first six trials, rapidly declined to below haptic by Trial Number 10 (4.00 ± 2.32 vs. 4.60 ± 2.13) and remained only slightly above auditory (3.95 ± 1.91).

Mean confidence scores plotted against mean performance across all 10 trials for vertical (top row) and depth (bottom row) perception for visual, haptic, and auditory SSDs, respectively. Scores are given in blue (dark grey) and confidence ratings in orange (light grey); error bars represent 95% confidence intervals. Note: Please refer to the online version of the article to view the figures in colour.
Discussion
The present research aimed to compare spatial perception by users when the same information (a 16 × 8-depth map) is perceived as spatialised light, sound, or touch using sensory substitution. This approach allows a quantification of the level of information loss attributable to transforming between modalities in traditional SSD designs across two tasks (vertical and depth localisation), as well as of users’ confidence in making these assessments. In the verticality condition, visual SSDs had a significantly higher accuracy than our auditory and haptic approaches, discriminating between visual angles of 1°, 14°, and 21°, respectively. For depth perception, visual feedback was once again significantly more accurate than auditory and haptic feedback in our SSDs; however, this time auditory feedback significantly outperformed haptic feedback, with discrimination abilities at 2 cm, 8 cm, and 29 cm for visual, auditory, and haptic SSDs, respectively. Users had the highest levels of confidence using visual feedback; however, despite users being equally confident in using haptic and auditory SSDs, auditory specialisation appeared to yield higher accuracies relative to haptic approaches.
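The relationship between the vertical thresholds in centimetres and the visual angles quoted here can be illustrated with a small conversion. This is a sketch under an assumption: the formula used for the conversion is not stated in this section, so the version below treats the separation as centred on the line of sight at the 120 cm viewing distance. The function name is hypothetical.

```python
import math


def visual_angle_deg(separation_cm: float,
                     viewing_distance_cm: float = 120.0) -> float:
    """Angle subtended by a separation centred on the line of sight."""
    half_angle = math.atan(separation_cm / (2.0 * viewing_distance_cm))
    return math.degrees(2.0 * half_angle)


# Mean vertical discrimination scores (cm) reported in the Results.
angle_visual = visual_angle_deg(2.09)
angle_auditory = visual_angle_deg(30.08)
angle_haptic = visual_angle_deg(44.42)
```

Under this assumption, the three results round to 1°, 14°, and 21°, in line with the figures given for the visual, auditory, and haptic SSDs.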
For both vertical and depth perception tasks, the visual SSDs were significantly more accurate than SSDs utilising alternative sensory modalities (H1 was accepted). As predicted, our participant groups had superior spatial perception from using visual information. However, their degree of spatial acuity was more surprising; for objects at 1.2 m, users could interpret visual information to within just a few centimetres (V-Vert = 2.09 ± 1.24 cm; V-Depth = 2.34 ± 1.16 cm). Considering that all the devices work from identical 16 × 8-depth maps, the visual scores likely represent the best possible performance from this information, providing useful knowledge about the practical capacity of using “low” spatial resolutions in SSDs and creating a benchmark towards which users of audio and haptic SSDs could conceivably improve with additional training or design changes.
These “visual-into-visual” SSDs may allow experimenters and designers to quickly assess the effectiveness of the information provided and establish the upper bound of performance users can get from this information. This provides the advantage of eliminating the additional confounds that arise from converting between modalities (Cha et al., 1992) while also reducing the time required to train users on how to interpret auditory or tactile information in a visual manner, or reach “expertise” (Ortiz et al., 2011; Ward & Meijer, 2010). This considered, care must also be taken to represent the final desired output fairly in vision, for example, the vOICe SSD outputs information serially, column-by-column to the end user, and as such, any visual comparison should also output information in a similar manner such as through the use of column-like apertures (Day & Duffy, 1988; James, Huh, & Kim, 2010; Mateef, Popov, & Hohnsbein, 1993; Morgan, Findlay, & Watt, 1982). These parallels also provide new opportunities for SSD design to learn from how this information can be more efficiently delivered and processed by the end user (Craddock, Martinovic, & Lawson, 2011; Króliczak, Goodale, & Humphrey, 2003).
Comparing the difficulties of the vertical discrimination task and the depth discrimination task revealed that while for the visual and haptic SSDs these tasks did not significantly differ in their difficulty, for the auditory SSDs, discriminating verticality was significantly more difficult than discriminating depth (H2 was partially accepted). This is likely due to the differences in difficulty between discriminating two bursts of noise that varied in loudness (as required in the depth task) and discriminating between two bursts of noise that varied in spectral frequency content (as required in the verticality task). The lack of familiarity with vertical hearing or the complexity of discriminating between subtle changes in frequency is likely to underlie these differences. It could also be argued that the depth task provided additional redundant information in terms of the total “area of stimulation,” and that it is therefore unclear whether subjects were relying on intensity or on this area cue. However, the discrimination thresholds reached by subjects in all SSDs surpassed the distance at which stimuli took up additional “pixels,” making this cue unavailable. This could be further evaluated in the future by utilising stimuli of different physical sizes.
These results provide an important baseline for SSDs, in assessing translation methods that piggyback off of veridical spatial perception for each sense. In turn, these baselines can inform future SSD design by comparing changes in design with changes in performance. To date, many different types of SSD have been reported in the scientific literature, utilising a huge variety of signal outputs (Ertan et al., 1998; Hamilton-Fletcher, Obrist, et al., 2016; Hamilton-Fletcher & Ward, 2013; Jóhannesson, Balan, Unnthorsson, Moldoveanu, & Kristjánsson, 2016; Jones et al., 2004; Rochlis, 1998; Spanlang et al., 2010; Wacker et al., 2016; Ward & Meijer, 2010). Historically, these devices have been difficult to compare against each other, across different tasks, or against a reasonable benchmark, limiting our knowledge on effective approaches to turning vision into sound or touch. We propose that a comparison of the same information—either kept within the source modality (e.g., vision) or transferred to another (e.g., hearing, touch)—allows the most accurate assessment of how well the translation to other modalities works within these devices.
With a baseline comparison of “visual-into-visual” SSDs in place, any changes in sensory substitution design that reduce any “modality gap” between visual and audio/haptic approaches can be evaluated. Knowing the upper limit of performance on an SSD can set expectations in how best to present this information to the user. This covers a huge range of design considerations, from utilising the substituting senses’ ability to discriminate and categorise information (e.g., utilising the principles of spatialised hearing—Blauert, 1997, or how “auditory objects” are perceived from an audio stream—Bregman, 1994), a consideration of users’ differing abilities (e.g., early-blind have impaired vertical hearing localisation—Lewald, 2002; Zwiers, Van Opstal, & Cruysberg, 2001; but superior tactile acuity, discrimination of auditory pitch, loudness, and horizontal localisation cues—Fieger, Röder, Teder-Sälejärvi, Hillyard, & Neville, 2006; Goldreich & Kanics, 2003; Gougoux et al., 2004; Kolarik, Cirstea, & Pardhan, 2013; Norman & Bartholomew, 2011; Röder et al., 1999; Wan, Wood, Reutens, & Wilson, 2010), and when appropriate, taking advantage of the users’ multisensory processing biases (e.g., cross-modal correspondences—Deroy, Fasiello, Hayward, & Auvray, 2016; Hamilton-Fletcher et al., 2018; Hamilton-Fletcher, Witzel, Reby, & Ward, 2017; Hamilton-Fletcher, Wright, & Ward, 2016). Beyond this, many other factors remain to assess, such as selecting task-appropriate information; appropriate spatial, temporal, and colour resolutions; avoiding sensory and attentional overloading; training and usability in daily life; and finally, changes in perception, externalisation, and qualia (Auvray, Hanneton, & O'Regan, 2007; Bertram & Stafford, 2016; Brown & Proulx, 2016; Brown, Simpson, & Proulx, 2014, 2015; Hamilton-Fletcher & Ward, 2013; Hartcher-O’Brien & Auvray, 2014; Kristjánsson et al., 2016; Ortiz et al., 2011; Ward & Meijer, 2010).
When comparing between the substituted modalities (hearing, touch), there was partial support for H1 for the difference between haptic and auditory devices, specifically for the assessment of depth, with the Synaestheatre being significantly more accurate than the VibroVision vest (by 20.58 cm). This suggests that comparing two objects that differ in amplitude (and sometimes area of stimulation) is easier to discriminate using our auditory rather than our tactile output. While vertical localisation performance was superior for the Synaestheatre (14°) than the VibroVision (21°), this is not to a significant degree. In terms of vertical localisation within hearing, users are typically very limited (3.65°—Perrott & Saberi, 1990; with movement increasing this resolution—Strybel, Manligas, & Perrott, 1992). As such, there is room to improve spatial localisation in audio SSDs to reach this level, most likely through personalised rather than generic HRTFs (reaching 9.47°—Pec, Bujacz, Strumillo, & Materka, 2008). There also exists the possibility of exaggerating natural elevation cues from the pinnae or spatial coordinates for peripheral stimuli, although it is currently unclear if any perceptual or spatial exaggerations can be effectively utilised or integrated with veridical hearing. The most common alternative is the use of cross-modal correspondences, such as substituting height for pitch, which may create easier to discriminate signal changes for users (Haigh et al., 2013; Striem-Amit et al., 2012); however, this can have similar limitations (Brown & Proulx, 2016; Brown et al., 2014, 2015), and while this also has the advantage of being intuitive for those with prior visual experience, it is potentially not intuitive for the congenitally blind (Deroy et al., 2016; Eitan, Ornoy, & Granot, 2012). 
In addition, the use of pitch for height limits the use of pitch for other qualitative content such as texture, colour, or shape (Hamilton-Fletcher, Wright, et al., 2016, 2017, 2018).
The tactile approach taken by the VibroVision resulted in errors in vertical localisation at 44.42 cm (± 29.90) and depth localisation at 28.83 cm (± 20.59) for objects at 1.2 m. Tactile approaches also showed the largest amount of variation in user abilities (see Table 1). Some of this may be due to physical factors limiting skin resolution, such as variations in clothing insulation or body size influencing the location of the vibrotactile pads. Jóhannesson et al. (2017) examined the vibrotactile spatial acuity of the back, providing important implications for vibrotactile SSDs. They note that as the distance between tactor motors is reduced, the total information increases; however, user accuracy decreases. Furthermore, they found that tactors mounted to sponge with a 3-cm distance (centre to centre) were significantly more accurate than tactors mounted to fabric (as with the VibroVision) at 4-cm distance (centre to centre). They suggest that this is due to the sponge reducing the spread of vibration from the tactors, allowing a more localised sense of vibration. Van Erp (2005) attached tactor motors directly to the skin, which is not practical for widespread use, but in the case of a single individual using this technology, direct skin-tactor contact provided improved accuracy.
Within the present study, subjects were able to conduct a limited range of movement with the sensor, and previous studies have shown that this form of self-initiated motion can also improve users’ spatial abilities. For example, when viewing an object, manipulating the position of the sensor will produce predictable movement of the object’s tactile representation on the body, allowing users both to find the threshold at which an object’s signal moves to a new tactor and to easily predict which tactor is most likely to become active based on this movement. This predictability in signal change reduces the ambiguity of which tactors are active, effectively increasing the users’ functional resolution (Van Erp, 2005). In discriminating multiple tactors, the role of training is particularly important, with even short training sessions improving spatial discrimination by 36% and intensity discrimination by 44% (Stronks, Walker, Parker, & Barnes, 2017). These considerations could help narrow the gap between the VibroVision and alternative modalities while keeping the advantages inherent to the tactile channel, such as no auditory interference and using areas of the skin not essential for daily life.
Outside of SSD research, previous attempts to compare the processing of similar spatial information through vision or touch (via the fingertips) have reduced the visual resolution of the stimuli to simulate equivalent receptor densities between modalities. When this kind of equivalent information is presented via image blurring, comparable performance has been found for letter and Braille stimuli presented as blurred images or raised patterns (Loomis, 1981, 1982); however, further differences have emerged for how the modalities process dot and joined-Braille stimuli, with touch better at separating out line distractors and vision better at separating out dot distractors (Loomis, 1993). When equivalence is reached through miniaturising the visual image, functional equivalence was reached for images sized at 0.037° visual angle and 5 mm for passive tactile stimulation (4 mm for active exploration), which corresponds to similar multiples of receptor spacing, whether SA for touch or cones for vision (Phillips, Johnson, & Browne, 1983). Cho, Craig, Hsiao, and Bensmaia (2015) compared spatial patterns on larger stimuli that spanned either ∼100 SA1 afferents (1 cm²) through tactile stimulation or ∼100 retinal cones (0.083° visual angle) via visual stimulation and found superior performance for visual presentations. This suggests that visual performance increases more than tactile performance when increasing numbers of afferents are involved. Relevant to the present SSD research, while subjects can discriminate vibrational stimuli at closer distances than the intertactor spacings used by the VibroVision vest, this refers to minimum thresholds for discrimination, not perfect performance (Jóhannesson et al., 2017), which means that the ability to resolve points of stimulation for the haptic SSD is likely inferior to that for visual SSDs.
We observed an interesting disparity between participant confidence and abilities. Predictably, confidence was highest with vision, until the task became vastly more difficult in the last few trials; here, participants reduced in confidence, but not ability. For haptic and auditory SSDs, confidence was similar throughout the trials, despite differences in ability. For depth, the auditory Synaestheatre was significantly more accurate than the VibroVision vest (by 20.58 cm), and although statistically nonsignificant, again auditory perception was more accurate than haptic for discriminating verticality (by 14.33 cm). This discrepancy may arise from users overestimating their tactile localisation/intensity discrimination abilities, or from tactile sensations being perceived as more intuitive to interpret. It is unclear to what extent this lowering of confidence from visual SSDs to auditory or haptic versions is due to the increased difficulty with discriminating the outputted signal, or unfamiliarity with utilising auditory/haptic information in this way. One way to disentangle these competing explanations is by introducing signals that are easier to discriminate (e.g., verticality represented by pitch changes or tactile codes), to see if these can reach parity with visual SSDs in confidence as well as ability. Future studies may also wish to examine the difference in ability and confidence of blind end users in using SSDs to solve spatial tasks, especially because blind individuals are likely to focus more extensively on hearing and touch to perform spatial discrimination tasks in daily life. In terms of SSD training, users may feel more inclined to continue to learn and practice with SSDs they are confident using; nevertheless, if a user’s confidence and actual abilities are mismatched, this could have negative consequences. 
Understanding how users evaluate their abilities (metacognition) will be essential to furthering training regimes and improving user abilities and safety.
SSDs that provide a sense of spatial perception have the capacity to alter both the perceptions and representations of space in users with visual impairments or blindness. As previously noted, individuals with no visual experience tend to have more separated spatial reference frames, prefer egocentric co-ordinates for representing space, have impaired spatial memory/recognition, have less flexibility in navigation, have distorted representations of space, have impaired vertical hearing, and find it harder to replicate spatial relationships relative to those with prior visual experience such as the sighted, or late-blind (Corazzini et al., 2010; Gori et al., 2017; Hötting & Röder, 2004; Hötting et al., 2004; Iachini et al., 2014; Kolarik et al., 2017; Lewald, 2002; Pasqualotto & Proulx, 2012; Pasqualotto et al., 2013; Ruggiero et al., 2018; Vercillo et al., 2018). Despite any differences, functional equivalence is also frequently observed on a variety of spatial tasks irrespective of the participant’s experience of sight or their given information source, whether from vision, hearing, touch, or language (for a review, see Giudice, 2018). Any differences in how various sightedness groups tend to discriminate, represent, and utilise the available information need not be inevitable or immutable; rather, additional training and information can change all of these factors. Their use of SSDs can reveal what specific spatial information is key to integrating, calibrating, and utilising more effective representations of external space. This has begun to be explored through fostering specific audio-motor contingencies (Cappagli, Finocchietti, et al., 2017); however, only a few alterations to spatial perception have been explored with the more “visual-like” SSDs, such as the use of ego/allocentric representations (Pasqualotto & Esenkaya, 2016). 
By contrast, having prior visual experience can also carry over vision-specific distortions such as the Ponzo illusion or vertical–horizontal illusion to stimuli sonified by auditory SSDs that the congenitally blind do not experience, allowing a more accurate reconstruction (Renier, Bruyer, & De Volder, 2006; Renier et al., 2005). Overall, the reported range of effects to spatial perception found in congenital blindness remains an open area for further exploration for SSDs as an intervention going forward.
While this study examined SSDs that convert visuospatial information into spatialised light, sound, or touch in isolation, more open questions remain as to the ability of users to integrate this spatial information across modalities (i.e., visuospatial into both audio and tactile stimulations). It is currently unclear whether these hybrid SSDs confer any advantages in performance or whether this information can be effectively integrated. An additional question relates to whether distinct information presented to each sense (e.g., visuospatial into touch but colour information into sound) can also be effectively integrated into a unified coherent representation of external space. Part of the promise of hybrid SSDs is the potential to expand past the processing limitations inherent to individual modalities; however, this remains to be realised.
Overall, the present study provides evidence of the functional spatial resolutions possible for participants using visual, auditory, or haptic SSDs when the visuospatial information provided remains constant. Our methodology showcases how the “upper limit” of performance for SSDs can be quickly established and isolated from any reductions in the functional spatial resolution attributable to converting spatial signals into other modalities using conventional SSD designs. Furthermore, we identify multiple ways in which any reductions in spatial resolution attributable to this transformation can be reduced for a wide range of users. This information can help identify optimal ways of providing visuospatial information to the visually impaired to further enhance their representation of external space.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
