Abstract
We perform a socio-computational interrogation of the google search by image algorithm, a main component of the google search engine. We audit the algorithm by presenting it with more than 40 thousand faces of all ages and more than four races, and by collecting and analyzing the assigned labels with the appropriate statistical tools. We find that the algorithm reproduces white male patriarchal structures, often simplifying, stereotyping, and discriminating against females and non-white individuals, while providing more positive descriptions of white men. By drawing from Bourdieu’s theory of cultural reproduction, we link these results to the attitudes of the algorithm’s designers and owners, and to the dataset the algorithm was trained on. We further underpin the problematic nature of the algorithm by using the ethnographic practice of studying-up: We show how the algorithm places individuals at the top of the tech industry within the socio-cultural reality that they shaped, many times creating biased representations of them. We claim that the use of social-theoretic frameworks such as the above can contribute to improved algorithmic accountability and algorithmic impact assessment, and can provide additional and more critical depth in algorithmic bias and auditing studies. Based on the analysis, we discuss the scientific and design implications and provide suggestions for alternative ways to design just socio-algorithmic systems.
Introduction
Scientists and philosophers have long stated that any type of technological medium de facto transforms the social, cultural, and political relations existing in that world (McLuhan & Fiore, 1967; Winner, 1980). Algorithms are no exception, as their invasion and ubiquity in every aspect of society constantly redefines human sociability, political conduct, and social structures. Given this, scientists investigate how algorithms impact society (Pasquale, 2015), the context of their use (Selbst et al., 2019), the conditions of their creation (Kärkkäinen & Joo, 2019) and properties of their structure (Yang et al., 2020). These investigations can take place on a theoretical level (Blodgett et al., 2020) or a mathematical level (Barocas et al., 2017), can take place by testing already existing algorithmic implementations in the industry and society (Buolamwini & Gebru, 2018), by evaluating the people and social groups who are influenced by the algorithms (Woodruff et al., 2018), or by evaluating the algorithms’ designers themselves (Barrett & Kreiss, 2019).
Furthermore, scientists develop frameworks that can assist researchers and policymakers with the systematic analysis, evaluation, design, and governance of algorithms (Mittelstadt et al., 2016). Selbst et al. (2019) stated that the use of social theories (e.g., from science and technology studies) can contribute to the development of context-aware machine learning applications. In the case of using computation as a means for locating algorithmic injustice, Abebe et al. (2020) described specific roles that computational analysis can take in order to confront deeper patterns of inequality, which many studies related to algorithmic bias fail to define and take into consideration (Blodgett et al., 2020). Barabas et al. (2020) argued that there is also a need to “study up,” that is, to reorient algorithmic studies by focusing on the relationship between socially dominant groups and technological artifacts. This can contribute to further knowledge generation about algorithms, as well as provide critical perspectives of society in algorithmic studies.
Statement of the Problem
By taking into consideration the proposals to rethink the use of computational tools, the need to complement them with social-theoretic work, and the call to reorient algorithmic studies upwards, this study investigates the google search by image algorithm, a basic component of the google search engine. Also known as reverse image search, the algorithm eliminates the requirement to input keywords or terms into the search function and the image itself becomes the query. Results include similar images, web results, pages with the image, and different resolutions of the image. Further, the algorithm offers a “possible related search” based on the image’s metadata (Figure 1(a)).
Figure 1. (a) Overview of the google search by image algorithm. For a given image, the algorithm generates a label and returns similar images and potentially related URLs. (b) Label returned by the algorithm when shown an African woman in tribal dress.
Google does not disclose much about how this process is performed. According to a general explanation from 2011 (see Google (2011)), the algorithm uses computer vision tools to analyze the image for identifiers such as color, points, lines, and texture. Based on this information, the algorithm creates a search query, which is then matched against the images in Google’s extensive back-end dataset of images. Finally, matching and similar images are returned as results to the user. This description reveals that the algorithm belongs to the class of content-based image retrieval systems (Wan et al., 2014); nonetheless, its exact structure remains unknown.
Prior work has shown that various components of the google search engine might result in racist inferences, discriminate against social groups, and (re)produce social power and information asymmetries (Diaz, 2008; Noble, 2018; Urman et al., 2021; Vincent, 2018). The same applies to the google search by image algorithm. For example, as Figure 1(b) shows, the algorithm labels a picture of an African woman in traditional tribal dress as “most black women in the world.” Provoked by cases such as this, we uncover how the google search by image algorithm (re)produces social structures and biases by investigating how it views people of different races, genders, and ages. To raise awareness about the algorithm’s function, and by “studying up”, we evaluate how the algorithm views key stakeholders in the tech industry, and show how these actors are placed in the sociopolitical reality that they have designed. To achieve the above, we perform a socio-computational interrogation and seek answers to the following research questions:
RQ1: How does the google search by image algorithm see people of different races, genders, and ages?
RQ2: Which cultural and social structures are (re)produced by the algorithm?
RQ3: What happens when individuals on the top of the techno-hierarchical ladder become the objects of the algorithm?
Significance of the Study and Contributions
The study provides novel insights into an understudied component of the most popular search engine in the world, which attracts 88% of search queries worldwide (J. Johnson, 2021). Not only does it show how the algorithm views people of different genders, races, and ages, but it also evaluates the generated knowledge, uncovering sociopolitical structures immanent in the algorithm. The study also critically confronts individuals who have the power to change algorithmic inferences, by placing them in the algorithm’s sociopolitical reality. The study serves as a scientific and design provocation and provides the following concrete contributions:
• We evaluate the google search by image algorithm’s culture by drawing from Bourdieu’s theory of cultural reproduction, treating it as a subject and interrogating it through computational means. The interrogation takes place by showing the algorithm more than 48 thousand images of people of different ages, genders, and races (prompts) and analyzing the collected labels (responses) with quantitative and qualitative tools. Since we want to give minimal stimuli to the algorithm to uncover its social perceptions, the images are people’s portraits and we specifically investigate how the algorithm evaluates human appearance.
• We find that the algorithm’s responses focus on the following categories: celebrity status, gender, age, external features (beauty, shape, and hairstyle), sentiment, ethnic and racial heritage, and location.
• We show that label selection is dependent on an individual’s race and gender, both in terms of the language the algorithm uses and the content of the label.
• We find that the algorithm (re)produces sociopolitical and cultural structures that exist in the white patriarchal society. This is visible in the algorithm’s simplified description of women and non-white individuals, its frequent stereotyping, and its inability to make accurate predictions about people. Furthermore, the algorithm exoticizes non-white individuals in terms of beauty, associates them with negative sentiments, and tends to assign more adjectives related to their physique and external features. In contrast, the algorithm more often relates white males with terms about character virtues and higher social status, and provides richer descriptions of them.
• We study up individuals who are in high positions within the techno-hierarchical ladder. By showing images of them to the algorithm, we illustrate how the algorithm places them in the already described culture, creating biased representations of them. In this way, we provoke them to reflect on the influence algorithms such as this have on people, and to consider ways to bend technology towards justice.
• Based on the analysis, we discuss the benefits of integrating social-theoretic frameworks such as the above in algorithmic bias and auditing studies. We further comment on the scientific and design implications, point out the ethical issues we faced in performing our investigation, and propose scientific and design alternatives that can work towards socio-algorithmic justice.
Theoretical Approach
Beyond Algorithmic Bias
When an algorithm becomes the object of scientific investigation, the tactics to investigate it are also defined by the limits and possibilities of the extracted knowledge. To understand an algorithm’s culture, we combine theoretic and computational means. We do so because pure algorithmic bias studies often lack the ability to associate algorithms with important cultural and sociopolitical aspects (Abebe et al., 2020; Blodgett et al., 2020). On the other hand, theories coming from philosophy, sociology, science and technology studies, and ethnography, have successfully generated knowledge about cultural and political aspects of algorithms (Cave & ÓhÉigeartaigh, 2018; Seaver, 2017), as well as unfair and problematic outcomes (Mittelstadt et al., 2016; Selbst et al., 2019). Therefore, we argue that quantitative studies, be those algorithmic bias analyses or algorithm audits, can benefit from incorporating frameworks coming from outside fields such as the above.
We locate the benefits of combining theoretic and computational means for knowledge extraction in four distinct dimensions. First, studying algorithms based on social theories helps us understand the algorithms and the inferences they produce. Even studies that investigate algorithmic inferences purely mathematically (Obermeyer & Mullainathan, 2019; Papakyriakopoulos et al., 2020) conclude that it is inadequate to analyze them apart from the social conditions in which they were created, be that an NLP architecture or an automated decision making system. Second, an algorithm’s placement in a social reality leads us to a better understanding of the world. For example, combining and interpreting social and algorithmic patterns dissolves the false dichotomy of the digital and the real and gives a holistic picture of the world we are all situated in (Haines, 2017; Murray-Rust et al., 2019). It also provides knowledge on how social groups that are the objects of algorithmic investigations are positioned in society, as well as which general structures of power and ethics are being reproduced. Third, such an analysis of algorithms uncovers features of the coupling between algorithms and society. It detects how algorithms prescribe objectivity, learns which social groups are in, and which excluded, and shows how an algorithm can lead to calculated publics (Gillespie, 2014), that is, how the objects of algorithmic inferences learn to perceive themselves. Furthermore, the social context-aware analysis of an algorithm leads to an understanding of how an algorithm participates in a social reality, how it composes the social reality, and how it becomes the social reality (Neyland, 2019; van den Broek, 2019), all the while transforming what is considered ethical, acceptable and normal. 
Last but not least, a socio-computational study of algorithms provides the benefit of computation: the deployment of algorithms for generating knowledge about a part of the world provides researchers a vast amount of structured information that can complement and support any qualitative analysis (Abebe et al., 2020; Christin, 2020; Markham, 2016). When algorithms become the object of investigation, computation can function as an especially ideal mediator between researchers and algorithms, acting as a translator of thought between the investigator and the investigated.
Studying Up Technology
Any knowledge extracted during the practice of developing, deploying, or analyzing an algorithm is situated knowledge (Elish & Boyd, 2018; Haraway, 1988), that is, knowledge that is dependent on the social and political context adopted by each individual when interacting with the algorithm (consciously or unconsciously). This perspective dependent knowledge generation process allows researchers to elect the appropriate framework to investigate algorithms and answer their descriptive or prescriptive research questions. Building on this, Barabas et al. (2020) called for researchers to study up in the field of algorithmic fairness. Studying up (Nader, 1972) denotes the reflexive analysis of the upper end of the social power structure when investigating a cultural system. In the case of algorithms, studying up includes the focus of the analysis on the powerful actors of an algorithmic case study, be that tech companies, regulators, or social groups and individuals that benefit the most from the existence of algorithmic systems. As a practice, studying up provides three concrete benefits. First, it is a more integrative scientific practice, because it allows knowledge generation on a generally understudied part of the social hierarchy. Second, it promotes a democratized scientific paradigm, since power asymmetries become the epicenter of an investigation. Third, it allows researchers to investigate new objects that might be more interesting to them, avoiding the re-analysis of objects and structures that have been studied repeatedly in the past.
We perform the socio-computational interrogation of the google search by image algorithm also by studying powerful individuals in the tech industry, in order to bring new perspectives and provocations to the community of algorithmic fairness, which holds a peculiar social structure in itself. Specifically, D’Ignazio and Klein (2020) argue that a large part of data ethics research is shaped and promoted by the very institutions that have yet to become the objects of critical scientific investigation. This can hinder scientific work, since the boundary between studier and studied is blurred, raising troubling ethical and political concerns that technoscience has sought to overcome for a long time (Forsythe, 1999). To constructively contribute to the debate, we adopt a relational thinking framework when performing the interrogation (Stich & Colyar, 2015). We exploit the nature of situated knowledge and we reflect on what it means for an individual to be on the top of the tech industry as an algorithm designer, but also how their privilege is obscured when they become objects of the algorithmic inferences. We also apply this relational thinking to ourselves when studying up. Since the study is performed by a black female and a white male, we reflect on what this means when we interpret the specific algorithmic inferences in emic terms, and how we come to the final intersubjectively generated knowledge (Peters & Wendland, 2016). Furthermore, we acknowledge our individual position in the social hierarchy, and the automatic limitations this poses in holistically uncovering all of its features.
Conceptual Framework and Interrogation Tactics
An ideal way to sociopolitically evaluate the google search by image algorithm is to interrogate, investigate, and understand how the algorithm’s designers, engineers, and owners conceptualized and shaped the artifact. This can explain how practices, social context, histories and values shape the algorithm’s culture (Denton et al., 2020; Seaver, 2018). Nevertheless, the socio-technical systems of proprietary algorithms such as the one we are analyzing are extremely opaque, making it infeasible to perform an analysis of that magnitude and type (Alvarado & Humphreys, 2017; Burrell, 2016). To overcome this issue, we adopt an alternate framework that follows Bourdieu’s theory of cultural reproduction (Bourdieu & Passeron, 1990), treating the algorithm itself as a subject that was shaped by a specific culture and that, like any individual, (re)produces the conditions of its creation. The theory dictates that an individual’s world views, behaviors, and language will contain and replicate the values, culture, and power structures of the society they were created in. As a child being educated in a society replicates the values and behaviors taught at school and in the family, the algorithm will also replicate values based on its input data, the decisions of the machine learning practitioners that created it, and the company that owns it. This allows us to overcome issues of opacity, since we can easily interact with the algorithm through the google website, and by learning about it we can also learn about the overall socio-technical system, which includes the algorithm’s developers, owners, and designers. In this way, we can obtain an integrative view of the algorithm’s culture. We acknowledge that the field of cultural studies holds a critical position against quantification (Dixon-Román, 2016). Nevertheless, we claim that by using such a theoretic framework to evaluate quantitative results, we are able to provide additional depth to our analysis. 
Furthermore, we chose Bourdieu’s cultural reproduction framework not only because it suits our study, but because it shows that algorithmic bias studies can benefit from social scientific frameworks in ways that are not directly visible, or were not explicitly developed for understanding technological systems.
To extract information from the algorithm-subject, we construct a pipeline, which allows us to systematically and rigorously audit it (Sandvig et al., 2014) in the form of an interrogation (Figure 2). Since the algorithm takes images as input, the selection of an image and the feeding of it into the algorithm denotes the process of prompting the subject. The algorithm, in return, provides a label for each image it receives: the response. Nevertheless, each image and the corresponding label provide limited knowledge to us about how the algorithm sees the world. Therefore, we develop a data-intensive process that allows us to show thousands of images to the algorithm and save the generated responses. The process is structured as follows: First, we create datasets of images that comprise the pool of our prompts. Then, we program an automated anonymous browsing agent that crawls the google images website, iteratively uploading each image in the dataset and then saving the algorithm’s labels (responses). The anonymity of the browser ensures that the google search engine will personalize its responses only to a minimal degree. Afterwards, we employ qualitative techniques and computational tools to understand the collected information about the algorithm’s culture through its handling of people’s faces and bodies. The performed socio-computational interrogation has the advantage of scaling (Hsu, 2014), improves transparency and replication (Abramson et al., 2018), and enhances algorithmic bias research by incorporating a social scientific perspective when evaluating the algorithm’s behavior.
Figure 2. Overview of the interrogation tactics. The researchers show the algorithm images (prompts) and collect and analyze the corresponding labels (responses).
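The prompt-and-response loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the crawler used in the study: `query_label` is a hypothetical callable standing in for the anonymous browsing agent (one upload, one returned label), and `stub_query_label` with its file names and canned labels is invented purely for demonstration. No real search interface is reproduced here.

```python
import csv
import io
from typing import Callable, Iterable, TextIO


def interrogate(image_names: Iterable[str],
                query_label: Callable[[str], str],
                out: TextIO) -> int:
    """Show each image (prompt) to the subject and store its label (response)."""
    writer = csv.writer(out)
    writer.writerow(["image", "label"])
    n_saved = 0
    for name in image_names:
        # Hypothetical step: in a real audit this call would upload the
        # image through an anonymous browser session and read back the label.
        label = query_label(name)
        writer.writerow([name, label])
        n_saved += 1
    return n_saved


def stub_query_label(image_name: str) -> str:
    """Stand-in for the browsing agent; returns canned, made-up labels."""
    canned = {"portrait_001.jpg": "gentleman", "portrait_002.jpg": "girl"}
    return canned.get(image_name, "")


# Small demonstration with an in-memory file instead of a file on disk.
buffer = io.StringIO()
n_saved = interrogate(["portrait_001.jpg", "portrait_002.jpg"],
                      stub_query_label, buffer)
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
```

Passing the query function in as an argument keeps the loop independent of how the subject is actually reached, so the same bookkeeping can be exercised with a stub before any real requests are made.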
The primary scope of the study is to understand how the algorithm views people of different races, genders, and ages (RQ1), and based on that to uncover cultural, institutional, and social structures reproduced by it (RQ2). Furthermore, we study up powerful actors in the tech-industry (RQ3), by investigating how the algorithm perceives them. To perform these tasks, we split the interrogation into two distinct parts: exploratory prompting and focused exploration. Exploratory prompts are the input pictures we show to the algorithm in order to uncover general regularities in its perceptions. Focused exploration, on the other hand, encompasses additional computational analyses that aim to investigate how the subject organizes knowledge about a specific topic.
Exploratory Prompts
Overview of Various Computer Vision Datasets and Their Properties.
The ratios and frequency of classes in the UTKFace dataset can be found in Figure 3. According to the dataset creators, the images were collected by using the actual google search engine, but exclusively with age-related queries. The fact that the image extraction was not related to ethnic group, gender, or appearance makes the dataset ideal for our experimental design, as it rules out data leakage. Any labels returned by the google search by image algorithm that are gender, ethnic group, or appearance-related will depend on the function of the algorithm, and not on the decisions made by the creators of the UTKFace dataset, with which we audit the search engine.
Figure 3. Distribution of the metadata categories race, age, and gender in the UTKFace dataset.
As the dataset creators state, the age labels created for each image were generated by the DEX algorithm (Rothe et al., 2015), while the gender and race labels were assigned by two human coders. Given this, the data generation process of the dataset automatically poses specific restrictions to our study and directly situates the knowledge created. First, since the ground truth values are algorithmic predictions or human evaluations, they are approximations of the actual ones, inserting a bias in our study. Second, race is a social construct, hence the given racial categories are a product of the views of the algorithm’s creators, and not of the individuals in the dataset. Third, the algorithm used classified people into binary genders, automatically erasing non-binary genders. We acknowledge that these facts automatically pose a limitation to our analysis and are a serious ethical concern that should be taken into consideration by the whole scientific community. We decided to use the features in our analysis, as they give important insights about the google algorithm, but we discuss the issue further later on and include it in the study’s scientific and design implications.
To gather adequate information about the algorithm’s views, we show the images to it in two formats. The first format includes the original photos, which might include the whole body of an individual, and might show the context and background in which the individual is in. The second format is the close-up, with images containing only the face of an individual. In this way, we extract as much information from the algorithm as possible, and see what is of interest to it in its answers. At the exploratory prompts phase, we feed the algorithm a total of 48,200 images (examples in Figure 4).
Figure 4. Exemplary images in the UTKFace dataset (Zhang & Qi, 2017).
We use the google translate API (Cloud, 2020) to detect the languages in which the algorithm returns its responses. We create weighted graphs of the most frequently assigned labels for male and female individuals across racial groups, and we measure the normalized entropy of the labels’ distribution for each race-gender subgroup, since sample size varied between them. The equation has the form
$$H = -\frac{\sum_{i=1}^{n} p(x_i)\,\log p(x_i)}{\log n}$$
where H is the subgroup-specific entropy value, n is the total number of labels appearing for a subgroup and p(xi) is the probability of appearance for a subgroup-specific label. Mathematical entropy serves as a measure that quantifies how diverse labels are for each population, with higher entropy corresponding to higher diversity. Furthermore, we use the pre-trained language models of spacy (Honnibal & Montani, 2017) and perform named entity recognition and part-of-speech-tagging. In this way, we extract important information about the language and content in the algorithm’s inferences. Next, we go through all the labels manually, and extract frequent appearance adjectives assigned to the different social groups and genders, as well as stereotypes and simplifications. We also locate and evaluate unexpected labels that the algorithm returns. We visualize some of these findings through word-clouds, in order to make algorithmic inferences more tangible to the reader. The above results help us understand how the algorithm views society (RQ1) and the cultural and social structures (re)produced by the algorithm (RQ2).
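As an illustration of the entropy measure described above, the following sketch computes a normalized Shannon entropy for a list of labels. It assumes the common normalization by log n, where n is the number of distinct labels observed for the subgroup; the function name and the toy labels are hypothetical and not taken from the study's data.

```python
import math
from collections import Counter
from typing import Iterable


def normalized_entropy(labels: Iterable[str]) -> float:
    """Shannon entropy of the label distribution, divided by log(n).

    n is the number of distinct labels, so the result lies in [0, 1]:
    0 when a single label dominates completely, 1 when all distinct
    labels are equally frequent.
    """
    counts = Counter(labels)
    n = len(counts)
    if n <= 1:
        # With zero or one distinct label there is no diversity to measure.
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(n)


# Toy example: three images labeled "man", one labeled "gentleman".
toy_value = normalized_entropy(["man", "man", "man", "gentleman"])
```

Because the value is divided by log n, subgroups with different numbers of images and distinct labels can be compared on the same 0 to 1 scale.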
Focused Exploration
After getting a general overview of the algorithm’s inferences, we perform a focused exploration to deepen our understanding of the algorithm’s culture (RQ1 and RQ2). Given the appearance of specific regularities between labels and social groups, we develop a multinomial logistic regression model to quantify and confirm these associations, as well as to uncover more. In the model, we use labels as our dependent variable and gender, racial group, and age as independent variables. We create a list of 39 adjectives (labels) that we insert into our model, each of them relating to a person’s appearance, character, sentiments, size or social status. The list is created by manually going through each of the labels in the dataset and isolating all of the unique adjectives that belong to the above categories, which number 39 in total. We replace any label that contains one of these adjectives with the adjective, and any other label with the other_token label, which we also use as the baseline label in our model. This means that we create a multinomial logistic regression model that predicts, based on an image’s metadata, the probability that the image will be assigned a label that contains a specific adjective, or none of them. The model has the form
where C is an adjective-specific label in the dataset, K is the total number of labels, $\beta_{ic}$ are label-specific estimators, age is the continuous variable depicting the age, and black, asian, indian, other are indicator variables representing social groups. To fit the model, we used the MNLogit function from the statsmodels library in Python (Seabold & Perktold, 2010), which calculates model parameters by maximum likelihood estimation.
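For orientation, one standard way to write a multinomial logit with these variables is given below. It is a sketch under assumptions that are not stated in the text: gender enters as a single `female` indicator (white males being the reference population), other_token is the reference label with all coefficients fixed to zero, and the ordering of the coefficients is arbitrary.

$$P(y = C \mid x) = \frac{\exp(\eta_C)}{1 + \sum_{k=1}^{K-1} \exp(\eta_k)}, \qquad \eta_c = \beta_{0c} + \beta_{1c}\,\mathrm{female} + \beta_{2c}\,\mathrm{age} + \beta_{3c}\,\mathrm{black} + \beta_{4c}\,\mathrm{asian} + \beta_{5c}\,\mathrm{indian} + \beta_{6c}\,\mathrm{other}$$

The Python sketch that follows illustrates the two steps with the standard library only: collapsing raw labels to an adjective or other_token, and turning given coefficients into label probabilities. The adjective list, covariate names, and coefficient values are hypothetical; estimating the coefficients themselves is left to a fitting routine such as the MNLogit function mentioned above.

```python
import math
from typing import Dict, List

OTHER = "other_token"  # baseline label, as named in the text


def collapse_label(label: str, adjectives: List[str]) -> str:
    """Replace a raw label by the first listed adjective it contains."""
    words = label.lower().split()
    for adjective in adjectives:
        if adjective in words:
            return adjective
    return OTHER


def mnlogit_probabilities(x: Dict[str, float],
                          coefs: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Label probabilities for one image under a multinomial logit.

    x     : covariates, e.g. {"female": 1, "age": 30, "black": 0, ...}
    coefs : per-label coefficients for every non-baseline label, with an
            intercept stored under the key "const" (hypothetical layout).
    """
    scores = {OTHER: 0.0}  # baseline: linear predictor fixed to zero
    for label, beta in coefs.items():
        scores[label] = beta.get("const", 0.0) + sum(
            beta.get(name, 0.0) * value for name, value in x.items()
        )
    denominator = sum(math.exp(s) for s in scores.values())
    return {label: math.exp(s) / denominator for label, s in scores.items()}


# Toy usage with made-up adjectives and all-zero coefficients.
toy_adjectives = ["beautiful", "old"]
toy_collapsed = collapse_label("Beautiful woman portrait", toy_adjectives)
toy_probs = mnlogit_probabilities(
    {"female": 1.0, "age": 30.0},
    {"beautiful": {"const": 0.0}, "old": {"const": 0.0}},
)
```

With all coefficients at zero the three labels are equally likely, which is a quick sanity check that the baseline is handled as one of the K categories.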
In order to continue studying up, we use the white male as the baseline population for our model, so that we can compare how differently the algorithm depicts white men as compared with groups lower in the social hierarchy. We evaluate the results based on the existing sociopolitical and socio-historical conditions in our society, uncovering biases but also showing cases where the algorithm deconstructs white male privilege. We also investigate the algorithm’s predictive abilities. Prior research shows that commercial algorithms discriminate based on classes such as gender and race, by making poor predictions for those categories (Buolamwini & Gebru, 2018). Therefore, we compare whether gender-specific labels generated by the algorithm identify individuals’ gender in the same way as the actual metadata in the dataset. We compare the agreement rates for all social groups and investigate whether there is systematic discrimination of individuals of a specific race. Taking into consideration the above results, we apply Bourdieu’s theory of cultural reproduction to critically evaluate the traced associations.
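The agreement comparison described above can be illustrated with a short sketch. It assumes that a label counts as gender-specific when it contains a word from a small term list, and that labels with no gendered term (or with terms of both genders) are skipped; the term lists, function names, and toy records are hypothetical and much smaller than anything a real audit would need.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical, deliberately tiny term lists for illustration only.
MALE_TERMS = {"man", "gentleman", "boy"}
FEMALE_TERMS = {"woman", "girl", "lady"}


def inferred_gender(label: str) -> Optional[str]:
    """Gender implied by a label, or None if absent or ambiguous."""
    words = set(label.lower().split())
    has_male = bool(words & MALE_TERMS)
    has_female = bool(words & FEMALE_TERMS)
    if has_male == has_female:
        return None
    return "male" if has_male else "female"


def agreement_by_race(records: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Share of gender-specific labels that match the dataset metadata.

    records: (race, dataset_gender, label) triples, one per image.
    """
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for race, dataset_gender, label in records:
        predicted = inferred_gender(label)
        if predicted is None:
            continue  # label carries no usable gender information
        totals[race] += 1
        hits[race] += int(predicted == dataset_gender)
    return {race: hits[race] / totals[race] for race in totals}


toy_agreement = agreement_by_race([
    ("asian", "male", "girl"),
    ("asian", "male", "man"),
    ("white", "female", "woman"),
    ("white", "male", "close up"),
])
```

Reporting the rate per racial group, rather than one pooled figure, is what makes systematic differences between groups visible.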
List of Tech Industry Stakeholders That we Study up.
Socio-Computational Interrogation
In the following, we present the results of our socio-computational interrogation. We split it into three parts. The first part corresponds to how the algorithm recognized the world generally. The second part illustrates more concrete group perceptions and associations. The third part evaluates the algorithm’s culture, and offers a critical provocation of the techno-hierarchical ladder.
An Algorithmically Visioned Society
Categorization of Terms Appearing in Algorithm’s Labels, Their Frequency, and Top Examples.
The overview shows that all metadata categories (gender, race, age) existing in the initial dataset were recognizable by the algorithm. Not only were social constructs such as gender and race inferred by the algorithm, but age also played a big role in how it saw individuals and how it described them. Furthermore, the algorithm inferred an individual’s gender in 15% of the labels, assigning a non-binary gender to them (e.g., trans) 28 times. It also returned labels about the sexual orientation of individuals (e.g., gay, lesbian) 44 times. This shows that in the algorithm’s views and culture people were neither binary gendered nor exclusively heterosexual. Nevertheless, it also means that the algorithm made assumptions about individuals that are not dependent on one’s appearance, providing information about how the algorithm evaluated gender and sexuality in its social reality.
Besides the algorithm’s ability to recognize social constructs, it also adapted the language and content of the generated labels according to them. Especially for the language of labels, the assumed racial group of an individual played a significant role. As Figure 5 shows, the majority of labels were in English (between 70% and 90% depending on the category). Nevertheless, when the picture included someone classified as Asian in the initial dataset, the chance of returning a label in Chinese was significantly higher (10% of the labels). The same applies to pictures of individuals classified as Indian (6% of labels in Hindi) or individuals from the ‘other’ category, which included individuals with Latinx origin (6% of labels in Spanish). This result demonstrates that the algorithm viewed language as an element of the respective culture of individuals.
Figure 5. Distribution of the top 13 label languages across different racial categories.
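A per-group language distribution of the kind reported above can be tabulated as in the following sketch. It assumes that the language of each label has already been detected upstream and is available as a language code; the function name, the codes, and the toy records are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def language_shares(records: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Share of labels per detected language within each racial category.

    records: (race, language_code) pairs, one per returned label.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for race, language in records:
        counts[race][language] += 1
    shares = {}
    for race, counter in counts.items():
        total = sum(counter.values())
        shares[race] = {lang: c / total for lang, c in counter.items()}
    return shares


# Toy data: four labels for one group, two for another.
toy_shares = language_shares([
    ("asian", "en"), ("asian", "en"), ("asian", "en"), ("asian", "zh"),
    ("white", "en"), ("white", "de"),
])
```

Normalizing within each category, rather than over the whole dataset, keeps the shares comparable even when the categories differ in size.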
Gender Classification Agreement by Race Between The Initial Dataset and Google Search by Image Algorithm’s Predictions.
The above results already provide significant information about how the algorithm viewed individuals of different genders, ages, and races (RQ1), as well as properties of the algorithm’s culture (RQ2). The algorithm (re)produced a society categorized by race, gender, and age, in which sexuality is an existent element, and language is seen as a cultural property. Furthermore, the algorithm’s culture contained a hierarchical structure among genders and races, as seen by the above biases. In the following, we illustrate further features of this hierarchy.
Hierarchical Perceptions and Attitudes
The tendency of the algorithm to behave differently based on race and gender becomes clearer when analyzing the distribution of top labels. The networks in Figure 6 show how the top labels for each gender were distributed across racial groups. The width of each connection shows the strength of the labels’ association within each group. In the Male network space, we can see that the language shared across the races included “close up”, “gentleman” and “man”, while the Female network space shows that women were typically labeled as “girl” across all ages, a troublesome term when used in the wrong social context (Fogarty, 2019). Asian men were also often misclassified as “girls”, a result of the algorithmic bias already discussed above. The networks show that labels related to age were also highly prevalent, with those about children and babies appearing most often for Indian, Asian, and Other individuals of both genders. On the other hand, labels associated with the elderly, such as “elders” and “senior citizen”, had higher weights for White men and women. These differences were nevertheless a result of the data distribution in our prompts, and not of the algorithm’s tendency to associate specific races with these terms. Hair was a particularly relevant label for both black men and women, with words such as “afro”, “buzz-cut” and “frisur” (German for haircut) being some of their most common attributes, an association replicating the actual socio-political nature of hair for black people (Dash, 2006; Okazawa-Rey et al., 1987). The fact that one of the top labels was in German can be attributed to the German IP address the crawler ran from, providing evidence that the algorithm adapted its language based on its knowledge about our location, which is one of the personalization features that the google search engine uses (Barysevich, 2018). 
Hair was also a distinguishing feature for the algorithm when shown white females, often simplifying their appearance under the term “blond”, another term that carries a socially ambiguous meaning, often seen as positive while other times associated with various negative stereotypes (Sherrow, 2006). Results also suggest that Black men were frequently labeled as “Player”, while Asian and White men were labeled as “businessperson”, reflecting stereotypes about their respective careers (Paek & Shah, 2003; Tyler Eastman, 2001). No such career or action oriented labels appear for women (with the exception of “sitting”, associated with Asian women, attributed to the initial dataset). Finally, the objectively assumed truth of “Human” was more closely related to Black men and to Black and Asian women. In general, the word “human” was associated with images featuring mostly non-white men. This classification, although technically accurate, suggests that, for the algorithm, “human” was implicit in whiteness but not in other races, especially for men. Furthermore, the label “human” is an oversimplification, which is a structural element of stereotyping as a human heuristic (McCauley et al., 1980). The value of these results increases, given the diversity of labels the algorithm assigned to each social group. The analysis showed that the algorithm, in terms of entropy, did not generate significantly more diverse labels for any specific gender-ethnicity group, with differences being marginal (White men: 0.64, White women: 0.61, Black men: 0.67, Black women: 0.65, Asian men: 0.61, Asian women: 0.61, Indian men: 0.55, Indian women: 0.64, Other men: 0.68, and Other women: 0.61). This means that although the algorithm created multiple labels for each social group, the ones related to non-white populations contained stereotypical representations.
Figure 6. The networks show how the algorithm’s top generated labels for each gender were distributed across racial groups. The width of each connection corresponds to the number of images containing the label for a gender-race group divided by the total number of images for that gender-race group in the dataset.
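The two quantities used above, the connection width and the label entropy per group, can be illustrated with a minimal sketch. The record layout, the function names, and the toy data are hypothetical. The normalization of entropy by the logarithm of the number of distinct labels is an assumption, since the exact entropy variant used in the analysis is not specified here.

```python
import math
from collections import Counter, defaultdict


def edge_weights(records):
    """Share of a group's images that received each label.

    records: iterable of (image_id, group, labels), where `group` is a
    (gender, race) key and `labels` are the labels returned for the image.
    The layout is a hypothetical one chosen for illustration.
    """
    images_per_group = Counter()
    images_with_label = defaultdict(Counter)
    for _image_id, group, labels in records:
        images_per_group[group] += 1
        for label in set(labels):  # count a label at most once per image
            images_with_label[group][label] += 1
    return {
        group: {label: n / images_per_group[group] for label, n in counts.items()}
        for group, counts in images_with_label.items()
    }


def normalized_entropy(label_counts):
    """Shannon entropy of a label distribution, scaled to [0, 1].

    Assumption: entropy is divided by log(number of distinct labels).
    """
    probs = [n for n in label_counts.values() if n > 0]
    total = sum(probs)
    if len(probs) < 2:
        return 0.0
    probs = [n / total for n in probs]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(probs))


# Toy data, invented for illustration only.
records = [
    ("img1", ("Female", "Black"), ["girl", "afro"]),
    ("img2", ("Female", "Black"), ["girl"]),
    ("img3", ("Male", "White"), ["gentleman", "man"]),
]
weights = edge_weights(records)
group_counts = Counter(
    label
    for _id, group, labels in records
    if group == ("Female", "Black")
    for label in labels
)
diversity = normalized_entropy(group_counts)
```

In this toy example, “girl” appears in both images of the first group and “afro” in one of the two, so their connection widths are 1.0 and 0.5 respectively.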
Logistic regression results. Dependent variable levels comprised 39 adjectives belonging to six categories: attractiveness, character, feature, sentiment, size, and status. Independent variables included race, gender, and age. The model baseline was White male, and reported results show statistically significant variables (p ≤ 0.05) and the sign of the estimator (positive/negative). More detailed results can be found in the Appendix.
A characteristic part of the algorithm’s culture was its perception of attractiveness. The algorithm associated terms from the “attractiveness” category more with women than with men (with the lone exception of the label “Handsome”) and with younger ages. Furthermore, regardless of whether they were positive or negative, the algorithm associated labels around attractiveness with non-white races (e.g., “beautiful”, “ugly”), once again confirming its focus on the appearance of these races. This persistent association of attractiveness with non-white individuals was not a coincidence. It can be seen not only in terms such as “hot” or “sexy” in the logistic regression results, but also in the labeled dataset, where images tagged “hot” were mostly of non-white women. This finding was also reflected in labels in the dataset that were expressed in superlative forms. As seen in the word cloud in Figure 7, labels such as “best african beautiful girls” and “hottest hong kong women” were common, whereas labels including the white race were non-existent, suggesting a theme of exoticization towards non-white individuals. This behavior of the algorithm again mirrors perceptions existing in western societies, where female non-whites are often exoticized and sexualized (Forrest-Bank & Jenson, 2015; Sue et al., 2007).
Figure 7. Word cloud of superlative terms used in the labels.
In terms of age, the algorithm associated different groups of adjectives with younger and older people. Positive attractiveness terms such as beautiful, cute, pretty and sweet were associated with younger individuals. The same applied to specific size related terms (big, small, chubby). These algorithmic attitudes conform with general ageist tendencies related to beauty in society, in both its idealizing and oppressive aspects (Veresiu & Parmentier, 2021). In contrast, concepts revealing well-being such as classy, rich and happy, together with adjectives revealing triviality (average, human), were associated more with older populations. This shows that the algorithm evaluated age in a generally complex manner. On the one hand, it idealized the appearance of younger individuals, while it ignored or reduced that of older people. On the other hand, the algorithm “recognized” in older individuals’ portraits specific socially enticing values, such as happiness and social status.
The above analysis provided further evidence of the algorithm’s culture (RQ2) and of its perceptions of age, gender and race (RQ1). Not only does its culture include a clear hierarchical structure stemming from the white patriarchy, but the algorithm also reproduced behaviors and generated inferences that discriminate against and typecast individuals. Moreover, the algorithm possesses a clear conception of beauty, which strongly aligns with its hierarchical view that white and male is normal.
Critical Confrontation
The described white patriarchy, the exoticization of female appearance, the idealization of the young body, the hierarchy of masculinity, the ageist associations and the trivialization of non-white and older populations are not something that the algorithm conceives and produces on its own. The above structures are what Bourdieu would call the habitus of a social subject. They are the dispositions, perceptions and attitudes that an individual holds and comprise a structured and structuring structure (Bourdieu, 1990). They are a structure, because they emerge systematically and regularly through the algorithm’s inferences. They are structured, because they were learned by the algorithm through a formal training process on a specific dataset, based on designers’ and creators’ incentives. They are structuring, because they shape the present and future practices of the individuals who use the algorithm. The algorithm’s habitus, therefore, is per se relational, and was created and functions within a field (Maton, 2014), a space which includes the individuals and institutions who shaped and are shaped by the algorithm (Figure 8).
Figure 8. Cultural reproduction within the field of the google search by image algorithm. The algorithm reproduces a culture existing in the dataset it was trained on, and the culture of its owners and designers. Furthermore, the culture of the dataset itself is a result of owners’ and designers’ decisions. This culture is further replicated in society through individuals’ use of the search engine.
Cultural reproduction is a constitutive part of this field. In our case, the algorithm externalized a specific culture through its inferences, which are constantly shown and prescribed to the users of the search engine. Since the algorithm’s habitus is relational, this culture was prescribed to the algorithm by its owners, designers, and training dataset. Even if the detected biases stem from problematic user queries that comprise part of the dataset, the responsibility for discriminatory algorithmic inferences still lies with the owners and designers. The culture of the training dataset was allowed to take a specific form because of their perceptions, choices and incentives. They are the ones who decided to use information that contains strong social hierarchies, ignoring its problematic nature. As the powerful actors in the socio-technical context, designers and owners led the algorithm to reproduce the problematic culture and diffuse it further in society.
Bourdieu’s theory allows us to draw a direct connection between algorithmic inferences, society, and the algorithm’s designers and owners. We are able to lay the foundations for impact assessment, since we detect specific social groups that are discriminated against by the algorithm, and explain the nature of the discrimination based on social-theoretic knowledge. We are able to formulate statements of accountability, as the framework assigns direct responsibility to the algorithm’s owners and designers for any detected effects in society.
Since Bourdieu’s theory draws a direct connection between algorithmic effects and the algorithm’s owners, the ethnographic practice of studying up allows us to challenge the power of powerful individuals in the tech industry to generate culture and to criticize them for their choices. We showed the algorithm the faces of 9 men and women high in the techno-hierarchical ladder ten times, after stripping them of their privilege, and evaluated the answers returned. The findings, found in Figure 9, show that the algorithm placed individuals in the previously described socio-cultural reality, creating biased representations of them. First, the algorithm produced much more diverse labels for men than for women (45 vs. 13 unique labels). Women were usually described as “blond” and “girl” (81% of the time), while the most frequent label for men was “gentleman” (30% of the time), with no other label appearing more than 5 times. As is directly visible, labels were much more positive for men than for women. Besides the troublesome use of terms such as “blond” and “girl”, the algorithm described female individuals as spokespersons, commented on the hairstyle of Bozoma St. John, and generated a sexual label for Marissa Mayer. Furthermore, the algorithm recognized women or their profession only 6% of the time. In contrast, the algorithm recognized male individuals or their profession 22% of the time, and associated them with much more action and career oriented terms, such as “business person”, “public speaking”, and “event”. Male labels also carried many more positive qualities, such as “fun” and “talented.” The above results demonstrate that the algorithm treated individuals differently based on their gender, confirming that even in the case of powerful actors, the algorithm’s biased perceptions persisted. Furthermore, regardless of gender, the algorithm failed to understand individuals’ behavior accurately. For example, the algorithm returned the label “singing” because individuals were holding a microphone. Moreover, the algorithm characterized Jeff Bezos as a nerd, and saw Eric Schmidt as a police officer.
Figure 9. Labels assigned by the google search by image algorithm to women and men at the top of the techno-hierarchical ladder. The number embedded in colored circles corresponds to the number of times the label in the respective color was associated with an individual. Since the algorithm associated many more labels with men, for simplicity we did not connect all of them to individuals in the visualization.
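Summary figures of the kind reported above (unique labels per gender, share of the most frequent label) amount to a simple tally over the returned labels. The following is a minimal sketch of such a tally. The function name, the input layout, and the toy observations are hypothetical and do not reproduce the study’s data.

```python
from collections import Counter, defaultdict


def summarize_labels(observations):
    """Tally labels per gender.

    observations: iterable of (gender, label) pairs, one pair per label
    returned by the algorithm (hypothetical layout).
    """
    by_gender = defaultdict(Counter)
    for gender, label in observations:
        by_gender[gender][label] += 1

    summary = {}
    for gender, counts in by_gender.items():
        total = sum(counts.values())
        top_label, top_count = counts.most_common(1)[0]
        summary[gender] = {
            "unique_labels": len(counts),    # diversity of labels
            "top_label": top_label,          # most frequent label
            "top_share": top_count / total,  # its share of all labels
        }
    return summary


# Toy observations, invented for illustration only.
toy = [
    ("female", "girl"), ("female", "blond"), ("female", "girl"),
    ("male", "gentleman"), ("male", "businessperson"),
    ("male", "event"), ("male", "gentleman"),
]
summary = summarize_labels(toy)
```

Comparing `unique_labels` across genders corresponds to the 45 vs. 13 contrast reported above, and `top_share` to the frequency of the dominant label.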
By studying up powerful individuals, and making them the object of algorithmic investigation (RQ3), we managed to point out the algorithm’s culture, predispositions, and vulnerabilities in a provocative way. Not only did the algorithm externalize its biased views about the world on individuals, discriminating against women, but it also exhibited frequent misclassifications. On top of that, by changing the population under investigation and focusing upward, we were able to show the extreme lack of racial and gender diversity in high positions in the tech industry. Most individuals are white, and women had professions of lower importance in the hierarchy. Overall, by performing our analysis, we uncovered an important part of the algorithm’s culture, as well as of the total system that the algorithm is an element of.
Scientific and Design Impetus
The socio-computational interrogation of the google search by image algorithm uncovered part of the social and cultural structures (re)produced by a main component of the most popular search engine in the world. The analysis revealed that a major commercial algorithmic implementation results in unjust associations, inferences, and predictions, favoring white male individuals and discriminating against people of other genders and races. Besides describing this phenomenon in detail, the study provoked key stakeholders in the tech industry to reflect on the issue, by making them the object of the algorithm’s hierarchical conceptions.
The integration of the sociological framework of Bourdieu and the anthropological tactics of Nader allowed us to perform an algorithmic bias study that goes beyond simply auditing the algorithm. The theory of cultural reproduction allowed us to frame algorithmic behavior in terms of the cultural values it mediates and to locate discriminatory practices stemming from the white patriarchy, instead of just reporting mathematical differences between social groups. This knowledge can be useful when performing further studies or auditing procedures that quantify the exact impact of biases that algorithmic systems have on user groups (see, e.g., Metaxa et al., 2021; Mitchell et al., 2019). Furthermore, the theory assigned clear-cut responsibility to the algorithm’s designers and owners for its problematic behavior by positioning the algorithmic bias study within the social context in which the algorithm functions, hence promoting visibility and accountability. Similarly, the practice of studying up allowed us to highlight existing power dynamics in the environments that create these algorithms. It also allowed us to critically confront these environments, making them the object of the algorithm’s inferences and underpinning the normative dimension of our study in a more interesting way. We do not claim that we exhaustively applied the above theories in interpreting the google search by image algorithm. Nonetheless, we argue that we showed the value of using these or similar social-theoretic frameworks when performing algorithmic bias or auditing studies.
While making the above contributions, we encountered multiple issues related to the scientific rigor of our analysis, as well as the design vulnerabilities of the investigated algorithmic system.
From a scientific perspective, we faced technical difficulties and ethical dilemmas during study design and evaluation. First, in order to analyze the algorithm’s conceptions about race and gender, we had to use a dataset containing images with algorithmically generated metadata, which only included four main races and two genders. This not only automatically erased other races, making our own analysis less diverse, but also meant that the initial data did not correspond to an actual ground truth, only an approximate one. Given this approximation, it was troublesome for us to define the borders of the social constructs of race and gender and to perform the analysis in a robust way. Taking this into consideration, we urge the scientific community to consider developing datasets that present identities as constructed and fluid features, and not as rigid categorizations. In cases where this is not possible, it is scientifically and ethically preferable to describe in detail the ethical and practical limitations and consequences of the study. This path is necessary, especially in cases when the actual individual did not provide information on or about themselves. We faced similar issues when using pre-trained language models for lingual entity recognition. Not only did the models not cover all the languages appearing in our dataset, but algorithmic predictions were not perfect either. Scientists should therefore be conscious that existing technical solutions pose serious limitations for diverse and inclusive research. In addition, diversifying scientific analysis requires much additional work, which scientists should be willing to perform, by considering at the start of the study who should be included and why.
From an algorithmic design perspective, the fact that the google search by image algorithm (re)produces the structures of white patriarchy is highly problematic. The oversimplified representation of females and non-white individuals, and the stereotyping and discrimination against them, illustrate once again how much should be done by private and public actors toward just algorithmic implementations. Especially when going through the generated labels, we found the recurring regularity of non-white individuals being presented under the themes “hot” and “human” (Figure 10). Non-white women, especially the young ones, were exoticized and sexualized, while non-white men were reduced to the self-evident description of being human, the least information an algorithm could provide about an individual. As described above, the algorithm’s behavior mirrors existing social conditions, which are the result of socio-historical and political processes that have generated structural asymmetries. Nevertheless, biases and representations such as the above should not be part of one of the most popular algorithmic implementations in the world. Even if the algorithm learned and returned queries corresponding to associations that users previously made, designers should generate frameworks that prevent harm to individuals regardless of their social group; not doing so may suggest that the above described issues are not of central concern to them. The mitigation of harms could be achieved by making equal predictions for individuals regardless of their gender, race, or age, by avoiding oversimplifications and historical stereotypes associated with groups, and by providing ethical guidelines that clearly state the result categories that search algorithms should yield. A prerequisite for this is the understanding of the social context in which an algorithm is employed, how it will be used, and how the algorithm will impact the diverse user population (Selbst et al., 2019).
There is still much to be done towards just algorithmic implementations, and public and private actors should work toward this direction using transparent and accountable means. This study shed light on the issues of a sub-component of the most popular search engine in the world, but further research needs to follow that investigates other search engines that are available and used by billions of users.
Figure 10. Image cloud for a random sample of images labeled as hot (left) and human (right). In the vast majority of cases, the algorithm labeled as such non-white females and men, respectively, exoticizing and oversimplifying individuals.
Conclusion
In this study, we performed a socio-computational interrogation of the google search by image algorithm, a main component of the google search engine. Drawing from prior theoretical work, we created a framework for understanding and evaluating the algorithm’s culture. By using computational and qualitative tools, we found that the algorithm reproduced structures stemming from white patriarchal society, discriminating against and stereotyping females and non-white social groups. As a scientific and design provocation, we studied up key stakeholders in the tech industry, stripping them of their privileges and showing how the algorithm places them within its socio-algorithmic reality, generating biased representations of them. Based on the results, we discussed scientific and design implications and solutions towards just and inclusive socio-algorithmic systems.
Acknowledgments
This work was supported in part by the Center for Information Technology Policy at Princeton University. The authors thank Elizabeth Ann Watkins and Ashley Gorham for their constructive feedback, Chelsea Barabas for her work that inspired this study, as well as the participants of the CITP fellow meeting and the CITP lunch seminar for their ideas on how to improve the study. The authors would also like to thank the editor of the journal and the anonymous reviewers for their valuable support.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Appendix
Multinomial logistic regression results. Significance codes: ** p ≤ 0.01, * p ≤ 0.05.
Attractiveness

| | Beautiful | Cute | Handsome | Hot | Pretty | Sexy | Ugly | Beauty | Attractive | Prettiest | Sweet | Smart |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | −0.01** | −0.13** | −0.01 | −0.01** | −0.05** | −0.01 | 0.01 | −0.04** | −0.03 | −0.06 | −0.08 | |
| Female | 2.72** | 0.19 | −3.85** | 0.40** | 2.14** | 0.99** | 0.22 | 2.53** | −0.12 | 3.25 | 1.08 | |
| Black | 1.69** | 0.15 | 0.90** | 0.15 | −0.30 | 0.99** | 2.18** | 1.29** | 0.81 | −2.14 | −0.80 | −4.17 |
| Asian | 0.03 | −0.83** | 0.26 | −0.78** | −0.96 | 0.58 | 1.41 | −0.99 | −1.28 | 1.02 | −4.80 | −4.43 |
| Indian | 0.89** | 0.58** | −0.16 | 0.78** | −1.05 | −0.07 | −3.21 | 0.23 | −4.24 | −2.86 | 0.24 | 0.71 |
| Other | 0.33 | −0.96** | 0.15 | 0.08 | 0.81 | −0.24 | 2.56** | −4.20 | −3.55 | −2.42 | −4.04 | −3.54 |
| β₀ | −7.51** | −2.85** | −6.18** | −4.47** | −7.49** | −7.20** | −9.85** | −8.32** | −6.94** | −11.21** | −6.65** | |

Character

| | Sweet | Nice | Kind | Bad | Good | Cool | Smart |
|---|---|---|---|---|---|---|---|
| Age | −0.08** | 0.00 | −0.12** | −0.01 | −0.01 | 0.00 | −0.05** |
| Female | 1.08 | 1.42 | 0.16 | −0.32 | −0.72** | −2.14* | −1.67* |
| Black | −0.80 | 1.42 | −1.12 | 0.98* | −0.31 | 0.08 | −4.17 |
| Asian | −4.80 | 2.02* | −1.31** | −4.29 | −1.41* | −3.84 | −4.43 |
| Indian | 0.24 | 0.48 | −6.39 | 1.14* | −1.03 | −4.12 | 0.71 |
| Other | −4.04 | −2.60 | −2.62** | 0.73 | −1.09 | −3.01 | −3.54 |
| β₀ | −6.65** | −10.13** | −3.95** | −7.21** | −5.82** | −7.46** | −5.99** |

Sentiment

| | Happy | Sad | Angry | Confused |
|---|---|---|---|---|
| Age | 0.02** | −0.01 | −0.02 | −0.06 |
| Female | −0.39 | 0.26 | −0.31 | −0.49 |
| Black | −0.64 | 1.35** | −1.00 | 0.05 |
| Asian | 0.38 | 0.51 | 0.65 | −0.66 |
| Indian | 0.16 | 1.42** | 0.87 | −3.38 |
| Other | 0.84 | 0.07 | −0.30 | −2.88 |
| β₀ | −7.17** | −7.37** | −7.27** | −6.96** |

Status

| | Classy | Traditional | Tribal | Rich | Poor |
|---|---|---|---|---|---|
| Age | 0.05* | −0.01 | −0.03 | 0.03** | 0.00 |
| Female | 1.07 | 2.61** | 0.82 | −0.41 | 0.58* |
| Black | 2.56 | 5.71 | 6.22 | 0.40 | −0.54 |
| Asian | −3.05 | 4.18 | −3.00 | −0.92* | −4.22 |
| Indian | −2.49 | 5.54 | −2.25 | −1.36** | 2.69** |
| Other | −1.74 | −2.43 | −1.78 | −0.52 | −3.48 |
| β₀ | −13.69** | −13.84** | −12.41** | −6.98** | −7.93** |

Feature

| | Dark | Light | Natural | Human | Average | Perfect | Normal |
|---|---|---|---|---|---|---|---|
| Age | −0.02 | 0.00 | −0.01 | 0.02** | 0.02* | 0.00 | −0.01 |
| Female | 0.23 | 0.27 | 1.76** | −1.68** | 0.29 | −0.39 | 1.03 |
| Black | 1.67** | 1.40** | 2.76** | 1.05** | −0.01 | 0.21 | −0.04 |
| Asian | −1.84 | −4.24 | −0.61 | 0.41** | −0.43 | −0.33 | 0.04* |
| Indian | 0.17 | −1.21 | −0.87 | 0.49** | −0.61 | −3.66 | −3.62 |
| Other | 0.01 | 1.03 | 0.08 | 0.23 | 0.41 | 0.47 | −2.95 |
| β₀ | −6.76** | −7.73** | −8.05** | −4.72** | −8.60** | −8.21** | −8.57** |

Size

| | Big | Small | Thin | Fat | Chubby | Slim | Skinny | Strong |
|---|---|---|---|---|---|---|---|---|
| Age | −0.01* | −0.04* | −0.01 | 0.01 | −0.07* | 0.00 | 0.07** | 0.03 |
| Female | 0.40 | 0.77 | 0.63** | −0.08 | −0.13 | 0.29 | 0.87 | −0.75 |
| Black | 0.11 | 0.90 | 0.05 | 1.01** | −0.02 | 4.27 | 2.08 | 1.55 |
| Asian | −0.62 | 0.21 | −1.07 | 0.62 | −0.07 | 2.83 | −2.99 | 1.55 |
| Indian | −0.09 | 1.14 | −1.44** | −0.01 | −3.40 | 3.88 | 2.38 | −2.98 |
| Other | −0.60 | 0.19 | −0.21 | 0.21 | −2.95 | −2.06 | −1.71 | −2.00 |
| β₀ | −6.26** | −7.97** | −6.11** | −7.29** | −6.91** | −11.69** | −14.81** | −10.19** |
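To aid reading the coefficients above, the following minimal sketch converts them into odds ratios relative to the White-male baseline. It assumes that each coefficient is a log-odds term, as in standard (multinomial) logistic regression, and that the model contains no interaction terms, which is what the table suggests but is not confirmed here. The helper names are hypothetical. The two values used are the Female and Black coefficients reported for “Beautiful”.

```python
import math


def odds_ratio(coefficient):
    """Multiplicative change in odds implied by one log-odds coefficient."""
    return math.exp(coefficient)


def contrast_odds_ratio(coefficients, active_terms):
    """Odds ratio of a profile against the baseline at the same age.

    Sums the coefficients of the dummy variables switched on for the
    profile; this is valid only under the no-interaction assumption.
    """
    return math.exp(sum(coefficients[term] for term in active_terms))


# Coefficients for the label "Beautiful", copied from the table above.
BEAUTIFUL = {"Female": 2.72, "Black": 1.69}

female_vs_male = odds_ratio(BEAUTIFUL["Female"])
black_woman_vs_white_man = contrast_odds_ratio(BEAUTIFUL, ["Female", "Black"])
```

Under these assumptions, the Female coefficient of 2.72 corresponds to odds of receiving the label “Beautiful” roughly 15 times those of the baseline, and the combined Female and Black terms (2.72 + 1.69 = 4.41) to odds roughly 82 times those of a White man of the same age.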
