Abstract
Social media is an overwhelmingly visual medium, and we ask the simple question: How can the data and images of social media posts be transformed into something as meaningful and vivid in the auditory sense? Such a design would be useful for eyes-free browsing and could enhance the existing visual media. Our strategy first uses artificial intelligence systems to transform low-level input data into high-level sociocultural features. These features are then conveyed using a multifactored temporal design that uses speech, sonification, auditory scenes, and music.
Keywords
Sonifying social-media feeds increases accessibility and enhances cultural meaning and enjoyment for users.
This article starts with a simple question: How might a social media feed be heard? Then, more specifically, how could the layers of meaning and information embedded in such a feed be represented through sonic information design? Although intriguing in its own right, such a design could also be useful in situations where someone wishes to experience the feed without using their eyes. Similar to listening to podcasts or music, social media could be used while on the go or while doing other manual activities, like cooking or housework. Sounds could also supplement the existing visual browsing experience.
In this article, we present an aspirational design for such a system with two parts. First, we use artificial intelligence systems to transform the raw data of the feed into more meaningful high-level features. Second, we use a heterogeneous mixture of auditory display types to represent this new information content. We believe both strategies will be useful for future work in the area, and we will illustrate our design with examples contrasting current text-to-speech and enhanced versions.
Background and Motivation
Social media is an increasingly important means of civics, communication and learning about the world (Gil de Zúñiga, Jung, & Valenzuela, 2012; Kavanaugh et al., 2012). It is a multimodal medium that draws heavily upon visual culture: representing content and telling stories in many visually compelling ways (Gamson, Croteau, Hoynes, & Sasson, 1992; Lambert & Press, 2013).
In this article, we ask how social media might be transformed into an auditory form – that is to say, represented with sound. When this transformation is objective, systematic, and reproducible, such that it can be used with any feed whatsoever, it is called sonification (Hermann, 2008). Although there are many applications of sonification, including assistive technology, process monitoring, intelligent alarms, data exploration, and aiding movement (Hermann, Hunt, & Neuhoff, 2011), new trends in the field seek to address the aesthetic (Vickers & Hogg, 2006), social (Supper, 2012), and cultural (Barrass, 2012) potentials of the medium. In particular, an open question remains as to how sonifications might become popular and fun for broad audiences (Barrass, 2012) while staying true to their systematic and objective nature (Winters & Weinberg, 2015). We propose that sonification of social media and images provides such an opportunity. Consider the four posts in Figure 1: What would they sound like?

Figure 1. A collection of four social media posts with images. This article presents an aspirational design for displaying them as audio and includes examples comparing their enhanced versions with the default screen-reader options.
Social media accessibility
Text-to-speech is the standard for Web accessibility (Leuthold, Bargas-Avila, & Opwis, 2008), and many parts of the Web-browsing experience can be made accessible through semantics (Semaan, Tekli, Issa, Tekli, & Chbeir, 2013). However, several challenges have emerged for making the social media experience accessible to visually impaired users, one of which is the increasing pervasiveness of digital images (Morris et al., 2016). With modern machine learning systems, it is possible to generate automatic “alt text” or captions for images (Fang et al., 2015), which has been shown to enhance the experience of visually impaired users of Facebook (Wu, Wieland, Farivar, & Schiller, 2017). However, there is still much room for improvement, including fostering appropriate levels of trust among users (MacLeod, Bennett, Morris, & Cutrell, 2017) and using nonspeech audio to create “rich and evocative understandings” of digital images that include a “sense of presence or aesthetic” (Morris, Johnson, Bennett, & Cutrell, 2018).
Approaches to listening to images
There are many ways in which nonspeech audio can be used to understand or represent an image in an eyes-free setting. For example, many sensory substitution devices (SSDs) transform low-level image data, like color, hue, saturation, or brightness, into auditory parameters, like timbre, pitch, or volume (see Hamilton-Fletcher & Ward, 2013, for a review). Audiophotography systems associate images with recorded “sounds of the moment,” which have been shown to enliven photographs and enhance memories (Frohlich & Tallyn, 1999). In the present design, we introduce a fusion of these two types of systems wherein sounds of the moment and music are added retrospectively based upon content automatically detected in the image. Compared with SSDs that operate on low-level image data (e.g., RGB or HSV), we hypothesize that this high-level approach will aid in faster auditory recognition of people, actions, and objects in an image. (Imagine the sonification of pixels in a photograph compared with the sound of a man laughing. Even without significant training, a user would likely recognize the gender and emotion of a person in the image faster in the latter case.)
Design Description
Preprocessing
To explore this design space, we used a collection of tweets that represented a range of topics, including news, politics, celebrities, and humor, all of which contained images. (That collection can be found at https://twitter.com/msrbvipstudy. See MacLeod et al., 2017, for more details on how this collection was curated.) Data from these posts were extracted and analyzed using several Microsoft Cognitive Services artificial intelligence systems (https://azure.microsoft.com/en-us/services/cognitive-services). In particular, the Computer Vision system was used to generate an automatic description of the image, list objects in the image (“tags”), identify any celebrities, and determine the coordinates and gender of any faces in the image. The Emotion system was used to link the coordinates of each face generated by the Computer Vision system with a value for each of seven emotion categories. Finally, the Text Analytics system was used to determine the sentiment of the text in each post.
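One way to picture the face-linking step is as rectangle matching: each face rectangle from the vision results is paired with the emotion result whose rectangle overlaps it most. The sketch below is a minimal illustration under that assumption. The `Rect` type, the `iou` and `link_faces` helpers, the overlap threshold, and the data layout are all hypothetical; they are not taken from the Cognitive Services API or from the implementation used in this design.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Rect:
    """Axis-aligned face rectangle in pixel coordinates (hypothetical layout)."""
    left: float
    top: float
    width: float
    height: float


def iou(a: Rect, b: Rect) -> float:
    """Intersection-over-union of two rectangles, a value in [0, 1]."""
    x1 = max(a.left, b.left)
    y1 = max(a.top, b.top)
    x2 = min(a.left + a.width, b.left + b.width)
    y2 = min(a.top + a.height, b.top + b.height)
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = a.width * a.height + b.width * b.height - inter
    return inter / union if union > 0 else 0.0


def link_faces(
    vision_faces: List[Rect],
    emotion_faces: List[Tuple[Rect, Dict[str, float]]],
    min_iou: float = 0.5,  # assumed threshold, not from the article
) -> List[Optional[Dict[str, float]]]:
    """For each vision face, return the emotion scores of the best-overlapping
    emotion rectangle, or None when no rectangle overlaps by at least min_iou."""
    linked: List[Optional[Dict[str, float]]] = []
    for face in vision_faces:
        best_scores: Optional[Dict[str, float]] = None
        best_overlap = min_iou
        for rect, scores in emotion_faces:
            overlap = iou(face, rect)
            if overlap >= best_overlap:
                best_overlap = overlap
                best_scores = scores
        linked.append(best_scores)
    return linked
```

A face with no sufficiently overlapping emotion result is left unlinked (`None`) rather than being assigned a guess, so a later stage can simply skip the emotion icon for that face.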
An additional quantity termed impact was imagined, designating the relative quantity of activity generated by that post (e.g., number of likes, reblogs). No algorithm or system was available to produce this quantity, so for the purposes of design, its value was distributed evenly across the collection. A diagram that summarizes the full high-level transformation is provided in Figure 2.

Figure 2. A figure summarizing the transformation. The image and text of the social media post are sent to Microsoft Cognitive Services, which generates socially and culturally meaningful data. These new features become the basis for a multifaceted approach to auditory display incorporating auditory icons, soundscapes, music, speech, and sonification. OCR stands for optical character recognition and is used for reading text on images. The cloud symbol marks a parameter that was imagined and not fully implemented at this time.
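The output of this transformation can be thought of as one record of high-level features per post. The following sketch shows what such a record might look like, together with a placeholder for the imagined impact value. All class and field names are hypothetical, the 0-to-1 sentiment scale is an assumption, and spreading impact evenly over the range 0 to 1 is only one possible reading of "distributed evenly across the collection."

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Face:
    """One detected face: gender label plus a score per emotion category."""
    gender: str
    emotions: Dict[str, float]

    def dominant_emotion(self) -> str:
        # Highest-scoring category; raises ValueError if no scores are present.
        return max(self.emotions, key=self.emotions.get)


@dataclass
class PostFeatures:
    """High-level features for one post (hypothetical field names)."""
    user_name: str
    text: str
    caption: str                      # automatically generated image description
    tags: List[str] = field(default_factory=list)
    celebrities: List[str] = field(default_factory=list)
    faces: List[Face] = field(default_factory=list)
    sentiment: float = 0.5            # assumed scale: 0 = negative, 1 = positive
    impact: Optional[float] = None    # imagined quantity, filled in below


def assign_even_impact(posts: List[PostFeatures]) -> List[PostFeatures]:
    """Give each post a placeholder impact, evenly spaced over [0, 1]."""
    n = len(posts)
    for i, post in enumerate(posts):
        post.impact = 0.5 if n == 1 else i / (n - 1)
    return posts
```

Keeping impact as an ordinary field means a real engagement measure (likes, reblogs) could later replace the placeholder without changing anything downstream.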
Mapping strategy
To display the additional information made available by the artificial intelligence systems, the sound design sought to create a coherent balance of the following types of sound:
Speech
Sonification
Auditory scenes
Music
Speech was used to deliver the text of the user name, main post, and automatically generated image caption. Sonification was used to signal the sentiment and impact using acoustic cues designed to communicate emotion (Winters & Wanderley, 2013). For example, a low-impact positive-sentiment post would sound slow, soft, consonant, and major, and a high-impact negative-sentiment post would sound fast, loud, dissonant, and minor. A unique auditory scene – composed of short auditory icons and longer “soundscapes” – was created from the high-level image content. Short, nonspeech auditory icons of males and females expressing various emotions were used to represent the gender and emotion of any faces in the image (e.g., a woman laughing, a man crying), and additional auditory icons represented any sound-producing objects (e.g., a knife sharpening, a bird chirping). Soundscapes were longer in duration and were used to represent recognized actions or environments in the scene (e.g., the sounds of a baseball game or a park). Finally, a random selection of nonvocal background music was used to set the mood of an image, and “theme music” was used if any celebrities were identified. For example, identifying Paul McCartney in the image would trigger playing an excerpt of the song “Band on the Run.”
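The sonification mapping described above can be illustrated with a small parameter-mapping function. This is a sketch under stated assumptions, not the mapping used in the demonstration: it assumes that impact drives tempo and loudness while sentiment drives mode and dissonance, that both inputs lie on a 0-to-1 scale, and that the numeric ranges shown are reasonable. The names `SonificationCues` and `map_cues` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SonificationCues:
    tempo_bpm: float   # slow .. fast
    loudness: float    # linear gain, soft .. loud
    mode: str          # "major" or "minor"
    dissonance: float  # 0 = consonant, 1 = dissonant


def _clamp01(x: float) -> float:
    """Restrict a value to the range [0, 1]."""
    return min(1.0, max(0.0, x))


def map_cues(
    sentiment: float,        # assumed: 0 = negative, 1 = positive
    impact: float,           # assumed: 0 = low, 1 = high
    min_bpm: float = 60.0,   # illustrative ranges, not from the article
    max_bpm: float = 160.0,
    min_gain: float = 0.2,
    max_gain: float = 1.0,
) -> SonificationCues:
    """Map sentiment and impact onto four emotion-related acoustic cues."""
    s = _clamp01(sentiment)
    k = _clamp01(impact)
    return SonificationCues(
        tempo_bpm=min_bpm + k * (max_bpm - min_bpm),    # higher impact: faster
        loudness=min_gain + k * (max_gain - min_gain),  # higher impact: louder
        mode="major" if s >= 0.5 else "minor",          # positive: major
        dissonance=1.0 - s,                             # negative: dissonant
    )
```

Under these assumptions, `map_cues(sentiment=1.0, impact=0.0)` yields the slow, soft, consonant, major case from the text, and `map_cues(sentiment=0.0, impact=1.0)` yields the fast, loud, dissonant, minor case.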
These new auditory components were layered and arranged in time according to three guiding principles. First, we followed a conventional ordering of spoken accessibility content: user name, post text, then image caption. Second, we used layering to minimize the amount of extra time beyond the conventional spoken content. Finally, we began the auditory scenes and music before the main text of the post was spoken – “setting the scene” for the spoken post using the audio generated by the image. The strategy for temporal evolution reflecting these choices is summarized in Figure 3.

Figure 3. A figure summarizing the approach to the temporal evolution of each post. The user name is spoken first, followed closely by the sonification of sentiment and impact. The sonification fades away as the short auditory icons representing the gender and emotion of faces in the scene are introduced. The music and soundscape fade in, reaching maximum volume as the objects in the scene are introduced. The music and soundscapes then fade into the background as the post text is spoken and finally fade to silence before the spoken image caption.
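The temporal arrangement can also be written down as a simple schedule of layered events. The sketch below is one possible reading of that ordering, assuming hypothetical component durations in seconds, a short crossfade between the sonification and the face icons, and a music and soundscape bed that enters with the face icons and ends when the caption begins. The `Event` type and `build_timeline` function are illustrative names only, and fade curves are left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    layer: str
    start: float     # seconds from the start of the post
    duration: float  # seconds


def build_timeline(
    name_s: float,
    sonification_s: float,
    face_icons_s: float,
    object_icons_s: float,
    text_s: float,
    caption_s: float,
    overlap_s: float = 0.5,  # assumed crossfade between sonification and icons
) -> List[Event]:
    """Lay out one post: name, sonification, icons over a music bed, text, caption."""
    overlap = min(overlap_s, sonification_s)
    sonification_start = name_s
    # Face icons enter while the sonification is fading away.
    faces_start = sonification_start + sonification_s - overlap
    # Object icons follow the face icons; the bed is assumed loudest here.
    objects_start = faces_start + face_icons_s
    # The post text is spoken over the bed, which has dropped to the background.
    text_start = objects_start + object_icons_s
    # The bed is silent by the time the caption is spoken.
    caption_start = text_start + text_s
    events = [
        Event("speech:user_name", 0.0, name_s),
        Event("sonification", sonification_start, sonification_s),
        Event("icons:faces", faces_start, face_icons_s),
        Event("music+soundscape", faces_start, caption_start - faces_start),
        Event("icons:objects", objects_start, object_icons_s),
        Event("speech:post_text", text_start, text_s),
        Event("speech:caption", caption_start, caption_s),
    ]
    return sorted(events, key=lambda e: e.start)


def total_duration(events: List[Event]) -> float:
    """End time of the last-finishing event."""
    return max(e.start + e.duration for e in events)
```

Because the icons, music, and soundscape overlap with one another and with the spoken post text, the only time added beyond the three spoken segments in this sketch is the sonification and the icon passage before the text begins.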
Demonstration
A demonstration of our design applied to a collection of five social media posts is presented in an online video (https://archive.org/details/Auditory_Display_of_Social_Media). For each post, the video plays a text-to-speech version followed by the enhanced version. The first version represents the current speech-only auditory experience, and the second includes a sonification, auditory scene, music, and image caption.
Opportunities for Auditory Content
We introduce this aspirational design for several reasons. First, we think that social media data provide a clear example where sonification can be applied effectively as a social and cultural medium (Barrass, 2012). Listeners unfamiliar with sonification as a scientific method can still enjoy listening to sonifications of their feed for the objective information they convey. The layers of information represented in each unique feed, the number of social media users, and the dearth of auditory content in current social media displays make a strong case for its application.
We also think that our data pipeline and mapping strategy point to more general strategies that will be useful in the field. Instead of sonifying low-level image data, like HSV or RGB, we use artificial intelligence as a preprocessing layer to extract high-level image content and enliven the image with automatically generated “sounds of the moment.” In general, we hypothesize that artificial intelligence systems will become an essential component of many auditory display systems, enabling more efficient and cognitively meaningful data-to-sound mappings.
Finally, our design leverages multiple auditory display types, including speech, sonification, auditory icons, soundscapes, and music. Although the field typically separates these as independent display strategies and does not include music (Hermann et al., 2011), we believe that our mixture of display types is most appropriate for the sociocultural context. All auditory display types have different strengths and weaknesses, and for data that are heterogeneous in nature, we believe the most compelling display will be created only through thoughtful combination.
Conclusion
Social media and images provide a rich context to explore the sociocultural potential of sonic information design, and the future is full of possibilities for the application of auditory display. We hope that our aspirational design and examples can inspire ideas and guide opportunities for this ever-changing information stream.
Acknowledgements
We thank Haley MacLeod and Cindy Bennett for their insightful thoughts and contributions as we worked towards the final design. The design described in this manuscript was pursued as part of an internship by the first author at Microsoft Research.
