Abstract
The web of linked data, otherwise known as the semantic web, is a system in which information is structured and interlinked to provide meaningful content to artificial intelligence (AI) algorithms. As the complex interactions between digital personae and these algorithms mediate access to information, it becomes necessary to understand how these classification and knowledge systems are developed. What are the processes by which those systems come to represent the world, and how are the controversies that arise in their creation overcome? As a global form, the semantic web is an assemblage of many interlinked classification and knowledge systems, which are themselves assemblages. Through the perspectives of global assemblage theory, critical code studies and practice theory, I analyse netnographic data of one such assemblage. Schema.org is but one component of the larger global assemblage of the semantic web, and as such is an emergent articulation of different knowledges, interests and networks of actors. This articulation comes together to tame the profusion of things, seeking stability in representation, but in the process, it faces and produces more instability. Furthermore, this production of instability contributes to the emergence of new assemblages that have similar aims.
Since AI technologies contain and afford certain understandings and representations of the world, it behooves us to understand how those technologies are developed and encoded. This much-discussed technology is still not fully understood from a social perspective, as the technologies surrounding AI and the intelligent web are only recently coming to fruition. As global socio-technical assemblages, AI systems and their diverse knowledge bases contain globally distributed material and expressive components that continuously attempt to affirm the assemblage’s identity and represent the world (DeLanda, 2006, 2011). However, attempts at encoding the assemblage produce controversies and ambiguities that threaten to destabilise it. This has the effect of enabling certain understandings while occluding others (Knobel, 2010). Understandings are then reproduced via search-based systems. How, then, do we understand the development and cementing of knowledge and classification systems as they relate to search-based AIs? What are the processes by which those systems come to represent the world, and how are the controversies that arise overcome?
Through the perspectives of global assemblage theory, critical code studies and the theory of informatic practice, I will analyse one such assemblage. Schema.org is but one component of the larger global assemblage of the semantic web, and as such is an emergent articulation of different knowledges, interests and networks. This articulation comes together to tame the profusion of things, seeking stability in representation, but in the process, it faces and produces more instability. Furthermore, this production of instability contributes to the emergence of new assemblages that have similar aims. In order to arrive at this conclusion, I employ netnographic analysis of data that captures the components, work processes, decision-making, encoding and creation of Schema.org. This allows me to understand the ways in which the assemblage comes to classify and encode the world through an analysis of how vocabulary proposals are managed, how controversies relating to those proposals arise and are dealt with and which parties are involved in those decisions. Following assemblage theory, I will examine the various components and the ways that they territorialise and deterritorialise the assemblage. I begin with an introduction to the semantic web and its controversies. Following that introduction, I situate the semantic web within literatures on global assemblage theory, critical code studies and practice theory. Following the literature review, I will discuss the study’s case and methodology, after which I interpret and discuss the results of how assemblage theory aids in understanding the semantic web and its controversies.
Literature and Background
Initially a proposal by the inventor of the World Wide Web, the semantic web is an additional layer to the existing web where content is given machine-readable context and meaning (Berners-Lee, Hendler & Lassila, 2001; Giri, 2011; Halford et al., 2012). Absent semantic web technologies, information on the web is simply ‘raw’ text. Meaning making is left to users, as algorithms are only able to crawl the texts for instances of identified keywords or criteria, that is, there is nothing that identifies Barack Obama as the 44th president of the United States apart from the co-occurrence of those terms within text. A web search for ‘Who is the president of the U.S.?’ would search for each of those terms and present information to the user via some relevancy algorithm. The determination that Barack Obama is the current president and not George Bush is left to the user, despite the algorithm’s likelihood of presenting one set of results over another.
The technology that makes this ‘raw’ information machine-readable, semantic data, relies on providing a definition of terms, properties and a formal statement of relationships, all of which become marked and interlinked by web developers. For this to occur, there needs to be a schema for describing a thing’s properties and some form of semantic structure that determines how those properties are attached to specific instances. Finally, the collection of statements needs to be represented in a formal set of relationships, called an ontology. These define the rules of representation and establish relationship hierarchies. Linked data contextualise data points by supplying more information on that data, providing a way for machines to more easily understand information by traversing class hierarchy paths and linking bits of meaningful data across domains and uses (Berners-Lee et al., 2001; Coyle, 2008; Giri, 2011; Halford et al., 2012). For instance, The New York Times, when referring to Barack Obama, links its occurrences to a number of ontologies, such as SKOS and DBpedia, that, themselves, define who Barack Obama is and how he relates to other people, places and things. In DBpedia, for instance, Barack Obama is the successor of George W. Bush and is the 44th person listed as president. Bush is himself identified by his own Uniform Resource Identifiers (URIs) that link to his respective page and his own set of relationships described in various ontologies. In this contextualised case, the machine knows the terms of the search and how the terms relate to one another. That is, it knows who Barack Obama is, how he relates to the Presidency and where in the line of succession he sits.
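The contrast between keyword matching and linked statements can be made concrete with a minimal sketch in Python. It is an illustration under stated assumptions and not a description of how DBpedia or any search engine is implemented: the `ex:` predicate names, the helper functions and the selection of statements are all hypothetical, and only the resource addresses follow DBpedia's naming pattern.

```python
# Minimal sketch: facts stored as (subject, predicate, object) statements.
# The "ex:" predicate names are hypothetical and are NOT DBpedia's actual
# property names; they only stand in for formally defined relationships.
DBR = "http://dbpedia.org/resource/"

triples = [
    (DBR + "Barack_Obama", "ex:holdsOffice", "ex:PresidentOfTheUnitedStates"),
    (DBR + "Barack_Obama", "ex:orderInOffice", 44),
    (DBR + "Barack_Obama", "ex:successorOf", DBR + "George_W._Bush"),
    (DBR + "George_W._Bush", "ex:holdsOffice", "ex:PresidentOfTheUnitedStates"),
    (DBR + "George_W._Bush", "ex:orderInOffice", 43),
]


def objects(subject, predicate):
    """Return every object asserted for a subject/predicate pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def subjects(predicate, obj):
    """Return every subject for which a predicate/object pair is asserted."""
    return [s for s, p, o in triples if p == predicate and o == obj]


# The answer follows from explicit relationships between identified things,
# not from the co-occurrence of words within a text.
forty_fourth = subjects("ex:orderInOffice", 44)
predecessor = objects(DBR + "Barack_Obama", "ex:successorOf")
```

In this sketch, `forty_fourth` and `predecessor` are resolved by following statements between URIs, which is the sense in which the machine "knows" how the terms of a search relate to one another.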
Of course, the process of marking all web data is time consuming and fraught with ambiguity. When the data come from different places, one needs a standard way of making sure that name and type are being used in the same ways. For instance, maybe name is not the same in different cases, as name could mean first name, surname or username. In this case, differences exist despite potentially coming from the same webpage. Here one needs to remove this ambiguity by providing additional information on how these two artefacts are related. Exact URIs need to be used to describe each specific instance, as well as to identify exactly which ontologies are being drawn on to provide semantic context. There are problems here though, as time, ambiguity and difficulty set up a conflict between precision and practicality. Additionally, there are few dedicated standards and no universal ontologies, which leads to markup issues. The problem of drawing on various ontologies, or ontologies in general, is that they sometimes make differences invisible and make equivalency statements between instances that may not be valid (Halpin et al., 2010; Poirier, 2015; Waller, 2016). This problem is exacerbated when two or more linked ontologies treat an instance differently. Mobilising any understanding or representation of an idea or phenomenon can have the effect of occluding other understandings. This ‘ontic occlusion’ (Knobel, 2010, p. 3) leaves those understandings and representations outside of legitimate narratives, or in the present case, renders them invisible to users’ attempts at acquiring information.
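The ambiguity of a bare term such as name, described above, can be sketched in a few lines of Python. The two records, the source labels and the per-source mapping are invented for illustration. The property addresses use the schema.org terms `familyName` and `alternateName`, but the choice of those terms for these sources is an assumption of the sketch.

```python
# Two sources both use the bare key "name" but mean different things.
# Records and source labels are invented for illustration.
registry_record = {"name": "Lovelace"}  # here "name" is a surname
forum_record = {"name": "ada_l"}        # here "name" is a username

# Hypothetical per-source mapping from local keys to full property URIs.
contexts = {
    "registry": {"name": "https://schema.org/familyName"},
    "forum": {"name": "https://schema.org/alternateName"},
}


def expand(record, source):
    """Replace each local key with the full URI declared for that source."""
    mapping = contexts[source]
    return {mapping[key]: value for key, value in record.items()}


# Merging on the bare key silently overwrites one meaning with the other.
naive_merge = {**registry_record, **forum_record}

# Merging on full URIs keeps the two meanings apart.
qualified_merge = {
    **expand(registry_record, "registry"),
    **expand(forum_record, "forum"),
}
```

The naive merge retains a single value and loses the surname. The qualified merge keeps both, at the cost of someone having to decide, for every source, which exact property each local term refers to. That cost is the conflict between precision and practicality noted above.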
A notable example refers to the owl:sameAs statement. The idea is that the statement, owl:sameAs, links two instances, indicating that the URIs for both refer to the same instance. This has the necessary consequence that a statement about one is true for the other. This statement is helpful in delineating equivalencies across web domains when an instance is not textually referred to in identical ways (Halpin et al., 2010). For instance, Poirier (2015) notes that in DBpedia, content about Caitlyn Jenner is owl:sameAs content about Bruce Jenner despite the very important and relevant identity transformations and political problems that exist in such a statement. The linked nature of the semantic web makes it so that any link to existing data sets with owl:sameAs makes a statement about that instance applicable to both, independent of the truth of the statement. This presents clear issues relating not only to accurate representation but also classification.
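The consequence of an owl:sameAs link, namely that a statement about one URI becomes a statement about the other, can be illustrated with a minimal Python sketch. The `ex:` identifiers, the labels and both helper functions are hypothetical; only the owl:sameAs address is a real identifier. The sketch does not reproduce any particular reasoner.

```python
# Minimal sketch of owl:sameAs semantics. The "ex:" URIs and the
# "ex:label" predicate are hypothetical placeholders.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

triples = {
    ("ex:recordA", OWL_SAME_AS, "ex:recordB"),
    ("ex:recordA", "ex:label", "first label"),
    ("ex:recordB", "ex:label", "second label"),
}


def same_as_closure(uri, graph):
    """Collect every URI reachable from `uri` through owl:sameAs links,
    following the links in both directions."""
    seen, frontier = {uri}, [uri]
    while frontier:
        current = frontier.pop()
        for s, p, o in graph:
            if p != OWL_SAME_AS:
                continue
            if s == current:
                neighbour = o
            elif o == current:
                neighbour = s
            else:
                continue
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append(neighbour)
    return seen


def statements_about(uri, graph):
    """Return (predicate, object) pairs that hold for `uri` once sameAs
    equivalence is applied: statements about any equivalent URI count."""
    equivalents = same_as_closure(uri, graph)
    return {(p, o) for s, p, o in graph if s in equivalents and p != OWL_SAME_AS}


about_b = statements_about("ex:recordB", triples)
```

After the link is asserted, a query about `ex:recordB` returns both labels. Nothing in the procedure checks whether the first label is actually true of the second record, which is the sense in which the equivalence applies independent of the truth of the statement.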
A second issue arises when there are contextual differences. A link of an instance in one context may not be appropriate across contexts. Making such an equivalency erases the distinctions that contexts add to identity and definition (Halpin et al., 2010). A third issue is the confusion of identity, representation and properties of things. Properties of things, or that describe things, are not the same as representations or identities of those things, though equivalency statements are often made between the properties and the thing itself. Furthermore, representation is not the same as identity. Uniform Resource Identifiers (pictures, email addresses, social security numbers, etc.) can stand in as signifiers of a person, referring to a representation of a person, but are not reducible to the person itself and in no way address the complexities of their identity. When access to information is mediated by systems that draw upon semantic web technologies, it becomes vital to understand how these understandings are mobilised, how representations are cemented and how the decisions behind them are arrived at.
That the semantic web is an extension of the existing web makes it impossible to discuss without recourse to discussions of globality. The semantic web intersects with these discussions in two primary ways. The first intersection is with Collier and Ong’s (2005) discussion of global assemblages. In this work, they distinguish between global forms and the actual global. Global forms are universalised phenomena that ‘have a distinctive capacity for decontextualisation and recontextualisation, abstractability and movement, across diverse social and cultural situations and spheres of life’ (Collier & Ong, 2005, p. 11). The semantic web is a global form in the sense that it is not referring to any specific or concrete instance but rather is a global term to refer to heterogeneous uses across domains. By their account, the actual global is the specific form of interactions between contexts, social and cultural situations, technologies and users. In the current case, the actual global is the specific assemblage of ontologies, schemas, actors, links and use cases that represent any particular articulation of the semantic web.
The second point of intersection is more conventionally global in the sense that the various assemblages that make up the larger semantic web are themselves distributed across the globe, enacting an open source, peer-produced, semantic environment. The servers, computers, companies and people, all are distributed in global space, as are the work processes, articulations and expressions of the semantic web. It is a coded space where different networks—ontologies, use cases, etc.—and individuals with different global knowledges—programmers, academics, entrepreneurs, librarians, etc.—come together to manage the world’s complexity, creating a system of representation and classification. However, in doing so, they encounter and produce more complexity. At this stage, it is necessary to engage with the literature on assemblage theory in more detail.
Assemblages are mixtures of components, material and immaterial, that enter into relations with one another for the purposes of affecting the world. They consist of open relationships between components (themselves assemblages) that have no essential unity, but rather relations of exteriority, where relations between components are fluid, contingent and irreducible to the properties of the components themselves (Bogard, 2006; DeLanda, 2006, 2011). These relationships only ever appear unified and internally coherent because of the actual fact of their having relations and their ability to affect and be affected. DeLanda (2006, 2011) defines an assemblage along a double axis that traverses roles of materiality and expressivity, as well as processes of stabilisation and destabilisation. Each of the two axes serves to mark the identity of the assemblage in its own distinct but related way. The material/expressive axis serves to mark the assemblage’s content and manner of expression. The second, territorialising, axis serves to mark out the assemblage’s identity as it pertains to its boundaries. However, all assemblages carry with them deterritorialising processes that serve to destabilise the assemblage, introducing variation into the assemblage and setting the conditions for the breakdown and/or rearranging of relationships within and between assemblages. Assemblages then, always, and at all times, contain the possibility of transformation and the emergence of new assemblages in a contingent manner (DeLanda, 2006, 2011; Wood, 2013).
That deterritorialisation processes are inherent in systems of classification and representation is not a novel idea. Busch argues that the production of standards and classifications only ever ‘produce partial and impermanent orderings’ (Busch, 2011, p. 6). Bowker and Star (2000) made similar arguments with respect to disease and race classifications. Likewise, Foucault (1973) argued that instances will always escape taxonomic ordering. Deleuze and Guattari (1987) similarly argued that territorialisation and deterritorialisation are inseparably linked, as new lines of flight will form from any attempt at striation. Waller (2016) makes a similar argument, noting that the representation of knowledge in the semantic web necessarily limits its possible expression but does not examine the practices and negotiations that occur in that construction. Central to these arguments is that particularities are always more granular than their attempts at capture, and difference will always emerge where attempts at reduction and equivalency occur.
The study of these types of systems presents significant difficulty to researchers, especially when they are encoded into software. Classification and coded systems are often ‘black boxed’ and unavailable for scrutiny, embedded and interwoven with numerous other algorithms and code, ever changing and not easily predicted, or simply taken for granted (Bowker & Star, 2000; Kitchin, 2014; Seaver, 2013). Most studies of coded systems have focused their attention on effects and the work they do in the world (Kitchin & Dodge, 2011; Lupton, 2014; Manovich, 2013). As yet, little work has engaged in sustained empirical investigation of the processes of creation and modification underpinning code systems, and less still has focused on semantic technologies. This contributes to a gap in our collective understanding of the way order is encoded into technical artefacts, a gap that this study hopes to begin to fill (Kitchin, 2014). As all software and classification systems are products of socio-technical assemblages and are embedded in broad cultural and historical contexts, they have values and politics implicated in their encoding that are inseparable from the encoding itself (Benkler & Nissenbaum, 2006; Bowker & Star, 2000; Kitchin, 2014; McKelvey, 2010). While important, it is not enough to simply say that the programmers and organisations have politics that they explicitly or implicitly embed in these systems. Since ‘creating an algorithm unfolds in context through processes such as trial and error, play, collaboration, discussion, and negotiation’ (Kitchin, 2014, p. 10), we need to understand those processes, how issues emerge and how they are dealt with.
Analyses of informatic practice allow just that. The concept of informatic practice borrows from Knorr-Cetina’s (2001) theory of objectual practice which argues that objects of inquiry only ever become partial objects, as they are first simulated and represented and then subject to evaluation, correction and improvement. They never completely shift into the background of habitual practice. As assemblages themselves, these objects, and the relations in which they are implicated, are never complete; they are always emerging and transforming based on the intra-actions between the objects, the individual(s) and discourses engaging with them. Investigation and interaction with them produce greater levels of complexity rather than a reduction in it.
French (2014) conceives of information, its relationships and objects as never complete. Advancing the concept of informatic practice, or ‘the assumption that information has a material basis in the spatio-temporal milieu of everyday life’ (French, 2014, p. 230), he details the ways that data are collected, coded and embedded in material practice. In this way, he examines how information is enacted, or brought into being during the course of practice. As such, an understanding of immaterial information requires attention to the mundane performance of routinised work—filling out forms, categorisation, filing, communicating effectively. I adapt this line of thinking to focus attention on the practices and processes of expression and territorialisation of the semantic web. I argue that the expression of openness contained in the global production and use of the semantic web, rather than constructing a universal and fully developed classification or knowledge system, only ever creates contestable and partial representations that perform ontic occlusion. Further, this combined with the paradoxical nature of its territorialising and deterritorialising processes leads to the emergence of new articulations of semantic web assemblages.
I turn now to an explanation of the case study, the description of the data and the analytical approach used for investigation after which I follow with a discussion of the results from my analysis and a conclusion which charts directions for future research.
Case and Methodology
To answer my questions of how digital knowledge systems come to represent the world, and the controversies that informatic practice introduces, I draw on netnographic data from a larger study of the Schema.org project, a global, open source, peer-produced linked data ontology. Schema.org is a semantic web project spearheaded by Google, Microsoft, Yahoo and Yandex and covers approximately 10 per cent of existing web content (Guha, Brickley & Macbeth, 2015). Major use cases include The New York Times, The Guardian, IMDB, Monster.com, LinkedIn, Yelp, Zillow and the Google, Microsoft, Yahoo and Yandex search-based ecosystems among others. At the time of writing, Schema.org maintains a community group of 223 participants representing at least sixty-six organisations that contribute to the interlinking, development and maintenance of the project. Of that group, a small number of official representatives of the stakeholders shepherd Schema.org and make final decisions regarding its direction. The most prominent sources of participation are Google, Good Relations, Microsoft, Yandex and the World Wide Web Consortium (W3C). Nearly all participation is from Western collaborators, the vast majority coming from people located in the United States, England and Germany.
My primary sources of data for this article come from the main work sites for Schema.org. This includes the full history of collaboration on the project from April 2014 to September 2015 as housed by GitHub.com, and the W3C from April 2015 to September 2015. GitHub data include 812 total issues handled across 345 distinct threads relating to the development and management of Schema.org. W3C community mailing list data include 279 total messages with content similar to GitHub. Issue data and W3C mailing list data contain a variety of modification requests and proposals, discussions of their applicability, debates about semantic accuracy, requests for assistance in application of the ontology and responses by community members on those issues. While both platforms are publicly open, contributors are primarily web developers looking to apply Schema.org to their sites or prominent community members contributing to the day-to-day development and operation of the project. There is substantial variation in the length of issue threads and email chains, as well as in how detailed those data are. They range from zero to forty-two responses, averaging six responses per issue, with each response running to a paragraph or more and often including lengthy examples of web markup. Supplementary data, to understand the range of components that compose Schema.org, were drawn from Schema.org’s own website as well as Linked Open Vocabularies, a project that maps and catalogues semantic web ontologies and their interpenetrations.
GitHub and the W3C mail archives provide ideal sites for this approach because, since April 2014, nearly all open discussion, work requests and vocabulary development have been occurring through GitHub’s platform. Those not on GitHub are found on the W3C mailing list. While the data contain all of the public discussion on how to best alter and apply the ontology, they do not contain any backchannel communications either in person or via email. Additionally, since the data are entirely textual and interviews were not performed for this study, there is no possibility for follow-up or clarification questions that are not in the text. The effects of these limitations seem negligible, however, since all modifications are public and the data of primary interest are the discussions and debates about the ontology’s coverage.
This study followed the netnographic approach developed by Kozinets (2010) on conducting digital ethnographies of online communities. The practice of netnography allows for a situated analysis of community engagement available in online forums. While I abstained from participation to avoid being disruptive, netnographic research, like traditional ethnographic methods, necessarily includes a range of techniques including archival and documentary analysis, the tracing of event timelines and narratives and in-depth case study as deployed here. Springer (2015) argues that this provides a set of observational data similar to, but more reliable than field notes and artefacts from traditional ethnographic methods. It additionally provided for a large degree of researcher reflexivity, as I was able to easily revisit data in its original pre-coded form (Springer, 2015).
The textual analysis followed the approaches of Berg (2004) and Strauss and Corbin (1990). I began the coding process by establishing codes derived from the language of the actors themselves. This had the benefit of providing a view towards the meanings that actors gave to their actions. As such, the first step in the data analysis process was to read all data in light of the focus on determining how the assemblage was developed and cemented. Here I identified statements that dealt directly with mission statements, rationales for actions, strategies for developing and expanding the project and how controversy was handled. I searched for those narratives as well as previously undefined categorical narratives that relate to my research project more generally. Next, I established grounded categories where I placed this material into context with the narrative- and action-specific conditions of texts, for example, referencing practicality, usability and simplicity when discussing extensions to the ontology or coverage of new domains.
The second stage of coding involved coding texts based on constructs and questions guided by themes relevant to my research questions. These themes related primarily to territorialisation/deterritorialisation, conflict and consensus and classification, which itself was broken down into instances where false equivalencies were made and recognised, where properties, representations and identities were conflated and when instances evaded proper classification within the ontology. This phase also included coding of the actors and components involved in the assemblage. Here I coded actors by reported country, affiliation and role. Additionally, I coded actors’ involvements with other linking ontologies. The final stage of the coding process involved drawing connections between codes from both stages of the coding process. In this phase, I modified and collapsed codes where appropriate and placed them into more direct conversation with the study’s research questions and the contexts they implied. This allowed me to draw connections between themes that may have initially been treated separately and apply coded information directly to the study’s focal interests, for instance, the observation that usability and practicality took precedence over accuracy in representation and classification.
Discussion of Results
In the following sections, I first discuss the expressive components of Schema.org. Here I will show the role that openness and interlinking between ontologies plays in the project. This openness not only allows for the contribution and adoption by diverse sources, enabling its globality, and increasing the scope of its reach as a knowledge base but also sets the stage for a host of semantic issues that serve to deterritorialise the assemblage. This deterritorialisation causes a reduction in the complexity of information as known and presented by AI systems. These twin processes of territorialisation and deterritorialisation will be a topic of the second section.
Expression
Despite being dominated by a small number of organisations and individuals, Schema.org is openly available to anyone who is able to collaborate or has cause to use its markup. It is also worth noting that modifications to the ontology have originated and continue to originate from marginal collaborators, and that Schema’s explicit operational model is to make changes and extensions as simple and free as possible. This is most evident in the fallback to practicality and usability as a guiding mantra whenever contentious or highly difficult vocabulary proposals are introduced. Schema.org’s open expression acts as a catalyst for the modification, clarification and extension of the underlying ontology and its form of representation. So, Schema.org is not reduced to the properties of the major companies funding it, the characteristics of its many collaborators or the properties of the numerous ontology extensions but is instead the actual deployment of their collective capacities to interact, create, define and link.
The primary expressive components are the ontology extensions, generalisations, specialisations, use cases and the Schema.org ontology itself. There are currently forty-eight different ontologies that link to Schema.org and use it as their backbone. In seven of these cases, the linked ontology simply overlaps with Schema.org, marking out ontological equivalencies. This allows developers using Schema.org or the other linked ontologies to indicate that a particular instance is the same as an instance in the comparable ontology. While the specific organisation and depth of the various ‘equivalent’ ontologies may differ, they serve to increase the spread and scope of one another across use cases. The generalisations of Schema.org express its identity by simply mapping their own particular uses on top of Schema.org’s formal model of representation. That is, the generalised ontology directly adapts Schema.org’s structure and class hierarchy for its own design. These expressions focus on the adaptability of Schema.org’s structures of organisation for ontology construction. Specialisations and extensions express the assemblage in two distinct but related ways. Extensions take Schema.org as their foundational ontology and extend it to domains that Schema.org has either decided not to cover or has not yet covered. Once extended, the basic principles of linked data obviate the need for Schema.org to address that new domain within its ontology. This further expresses the assemblage’s open identity. Relatedly, specialisations occur when new ontologies make Schema.org’s general classifications more domain specific. For instance, GS1 is a separately developed ontology that links to Schema.org but exclusively specialises in consumer goods, at a level of granularity that would be difficult for the core Schema.org community to accomplish.
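How a specialisation can hang off a general classification may be sketched as a small class hierarchy in Python. This is an illustration under stated assumptions and does not reproduce GS1 or any other linked ontology: `schema:Thing` and `schema:Product` name real Schema.org types, while the `ext:` classes and both helper functions are hypothetical.

```python
# Minimal sketch of a specialised vocabulary built on a general one.
# The "ext:" classes are hypothetical stand-ins for a domain-specific
# ontology; they are not taken from any published vocabulary.
SUBCLASS_OF = {
    "schema:Product": "schema:Thing",
    "ext:ConsumerGood": "schema:Product",
    "ext:PackagedFood": "ext:ConsumerGood",
}


def ancestors(cls):
    """Walk subclass links upward and return the chain of superclasses."""
    chain = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def nearest_known(cls, known_prefix="schema:"):
    """Return the closest class (including cls itself) that a consumer
    who only understands the core vocabulary can interpret."""
    for candidate in [cls] + ancestors(cls):
        if candidate.startswith(known_prefix):
            return candidate
    return None


fallback = nearest_known("ext:PackagedFood")
```

In this sketch, a consumer that only understands the core vocabulary can still treat an `ext:PackagedFood` instance as a `schema:Product`. This is one way to read the claim that an extension removes the need for the core ontology to address the new domain itself.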
Territorialisation/Deterritorialisation
Recall that the various components of any assemblage engage in processes that either help define the assemblage or work to destabilise and alter that identity, deterritorialising it and encouraging the formation of new relationships and the emergence of new assemblages. Thus, all components of any given assemblage engage in territorialising and deterritorialising operations, expressing and maintaining the assemblage’s identity or altering it in unknown ways. At the outset, one should realise that there are two main identity outcomes for Schema.org. The first is the affirmation and adoption of the project. This means at a basic level, global adoption, use and work on the project are identity affirming and territorialising. I refer to this spread and adoption as first-order territorialisation. The second outcome concerns the status of the ontology itself. In order to build a linked, semantic layer onto the existing Internet, Schema.org must manage the profusion of things by developing a formal set of definitions and relationships between them. The more formally they represent the world and the more they are able to simplify the play of linguistic difference, the more territorialised this specific representation of knowledge can become. Thus, work must take place to refine, extend and build complexity into the ontology to establish correct relationships and mitigate semantic ambiguity. This second-order territorialisation refers to the work and those processes that stabilise the ontology relative to the limitless complexities of language and difference. In an ideal sense, this leads to a richer and more semantically complex representation of the world by AI systems.
As I noted previously, the expressive role that openness plays in defining Schema.org cannot be overstated. That Schema can be expressed across multiple global standards—JSON-LD, Microdata and RDFa—makes its adaptability to other ontologies and to a wide variety of webmasters much greater in theory. The same can also be said of its commitment to open peer production. The open proposal of new projects and the granular distribution of tasks on a global platform virtually ensure that projects important to users will be implemented or at minimum command earnest discussion. Indeed, the data bear this out. Proposals that are simply stated, simple to code and simple to implement are adopted, discussed and completed. However, there are numerous issues that arise when that simplicity is no longer present. Most notably, deterritorialisation occurs at the first order when proposals are too long or complicated, when they draw attention and resources away from near term, more manageable goals, or when they introduce conflict into the existing ontology.
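To give a sense of what Schema.org markup looks like in one of these serialisations, the following Python sketch builds a small JSON-LD description. The person, job title and web address are invented for illustration. `@context` and `@type` are JSON-LD keywords, and `Person`, `name`, `jobTitle` and `url` are Schema.org terms.

```python
import json

# Minimal sketch of Schema.org markup expressed as JSON-LD.
# The values are invented; they do not describe a real person or page.
person = {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "Jane Example",
    "jobTitle": "Librarian",
    "url": "https://example.org/jane",
}

# A web page would typically carry this text inside a
# <script type="application/ld+json"> element.
markup = json.dumps(person, indent=2)
```

Microdata and RDFa express comparable statements as attributes on a page's existing HTML elements rather than as a separate block, which is part of what makes the vocabulary adaptable to different webmasters' working practices.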
A contributing factor to this first-order deterritorialisation relates to the territorialisation processes at the second order, to which I now turn. As mentioned, territorialisation needs to be thought of in multiple ways. At one level, Schema.org is trying to territorialise and encode itself across web domains as a system for classifying things. At another level, it is trying to territorialise knowledge as a way of making things meaningful to machines via accurate distinctions and formalised relationships. One example is the periodic practice of refining types to reflect their specific distinctions more accurately, as in the cases of refining Series into CreativeWorkSeries or Season into CreativeWorkSeason. Territorialisation processes at the second order, then, concern the assemblage’s components trying to extend, modify and refine the existing ontology into new content areas and/or specialisations of existing areas.
Second-order territorialisation for Schema.org happens in three main ways. The first is through the addition and specification of an altogether new area. Once a proposal to modify the vocabulary is made, variable amounts of debate and discussion occur until some settlement is reached. The level of debate and discussion itself depends on the sophistication, specificity and level of difficulty that the proposal presents. In this particular form, debate is intense, though not contentious; participants maintain a professional work environment in all cases. The data indicate that proposals to modify Schema.org’s ontology involve more individuals, discussion and conflict than proposals or requests that can be satisfied in other ways. In the cases I observed, these proposals were larger and more detailed than other proposals, necessitating lengthy discussion to ensure feasibility and to confirm that no ontological conflicts exist. For instance, three of the longest discussions in the data were debates around the feasibility, size and unintended consequences of adding complicated medical, bibliographic and event series extensions to the ontology. Each proposal created its own set of issues for the community to work through. The medical proposal was so complex that it was postponed so as to not distract from other goals and so that the community could ruminate on the best ways to integrate the proposal in a usable and harmonious way. Similarly, the bibliographic extension created the need for additional terms that were difficult to define, would create conflicts with existing terms and would be entangled with other types in the ontology. Debates about event series sought to determine if a serialised event like the Olympics, among others, should be considered at the most general level as an event, at a more granular level as a specific event (the 2012 London Summer Olympics) or at an even more granular level that accounts for the specific sporting events therein.
Because each declaration is linked to other objects such as time, location and recurrence, each with its own granular complications (start date, end date, specific event date, time and location), the community was unable to use a more generic application of the existing ontology, despite some members’ reluctance to add complications. In all three cases, the proposals were eventually added to the ontology, with the medical and bibliographic proposals as hosted extensions.
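The three levels of granularity debated for serialised events can be made concrete with a small sketch. The following is a hypothetical illustration only, not markup drawn from the data or from the proposal as adopted: it assumes the general Schema.org terms Event, SportsEvent, Place, superEvent, subEvent, startDate, endDate and location, and the identifiers and values are invented for the example.

```python
import json

# Hypothetical sketch: one serialised event described at three levels of
# granularity. All "@id" values and names are illustrative assumptions.

# Most general level: the recurring series itself.
series = {"@id": "#olympic-games", "@type": "Event", "name": "Olympic Games"}

# More granular level: one specific edition, with its own dates and place.
games = {
    "@id": "#london-2012",
    "@type": "SportsEvent",
    "name": "2012 London Summer Olympics",
    "startDate": "2012-07-27",
    "endDate": "2012-08-12",
    "location": {"@type": "Place", "name": "London"},
    "superEvent": {"@id": series["@id"]},
    "subEvent": [{"@id": "#mens-100m-final"}],
}

# Most granular level: a single sporting event within that edition.
final = {
    "@id": "#mens-100m-final",
    "@type": "SportsEvent",
    "name": "Men's 100 metres final",
    "startDate": "2012-08-05",
    "location": {"@type": "Place", "name": "Olympic Stadium, London"},
    "superEvent": {"@id": games["@id"]},
}

# Collect the three descriptions into a single JSON-LD document.
document = {"@context": "https://schema.org", "@graph": [series, games, final]}
print(json.dumps(document, indent=2))
```

Even in this reduced form, each level carries its own dates and locations and must point to the levels above and below it, which suggests why a more generic application of the ontology proved difficult.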
The second way that territorialisation occurs is through the further specification of a vague or incomplete portion of the ontology. The actual process by which this occurs and is resolved is not substantially different from the process above, only that this form of proposal tends to produce more dissent. Activity relating to these proposals often involves debates on the need for and merit of the refinement as well as exposing false equivalencies and ambiguities. For example, a proposal to include where an artefact was created exposed issues in coding locations as they related to creative works. The current ontology conflates a given work’s location of creation, residence and description and runs into additional difficulty when referring to concrete versus virtual works as well as referring to works in the abstract. Members often disagree with the need to modify or extend the ontology at more granular levels, as many prefer to use ad hoc descriptions as opposed to formal ontological classification. As mentioned, feasibility in these proposals is a major concern, and as a result, steering group members employ a triage approach to ensure that proposals determined to be important and manageable are completed within the deadlines imposed. At times this results in larger, more detailed proposals being abandoned or postponed, as in the previous case. Interestingly, this attempt at second-order territorialisation produces the most instability in the assemblage and paradoxically contributes to processes of deterritorialisation.
The third means of territorialisation happens when one linked component may sufficiently cover an area so that the central Schema ontology need not make changes providing that there are no contradictions between the two linked ontologies. This method happens when a proposal is either too complex, out of the project’s current scope, or is simply solved elsewhere. Settlement in this way is reached through the suggestion of one of the central collaborators who recognises that a compatible solution exists elsewhere. In some cases though, collaborators are simply seeking advice on how to modify their own linked vocabularies to more closely resemble and more cohesively fit with Schema.org’s. In one exchange, a community member requesting help with linking individuals to their social media accounts found Schema.org’s coverage inadequate and so requested a better solution. As a response, more active members suggested that the first member draw on one of two external ontologies, Semantically Interlinked Online Communities (SIOC) and Friend of a Friend (FOAF), that specialise in semantic markup of various social networks and media. This would connect any markup using Schema.org to the sections using the social media ontologies.
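A minimal sketch of what such interlinking might look like follows. It is not the markup from the exchange itself: it assumes JSON-LD as the syntax, uses the published FOAF and SIOC namespaces, and relies on the FOAF terms account, OnlineAccount, accountName and accountServiceHomepage together with the SIOC class UserAccount. The person, handle and URL are hypothetical.

```python
import json

# Hypothetical sketch: a Schema.org Person whose social media account is
# described with FOAF and SIOC terms rather than with Schema.org itself.
context = {
    "@vocab": "https://schema.org/",       # unprefixed terms resolve to Schema.org
    "foaf": "http://xmlns.com/foaf/0.1/",  # Friend of a Friend namespace
    "sioc": "http://rdfs.org/sioc/ns#",    # SIOC namespace
}

person = {
    "@context": context,
    "@type": "Person",                     # Schema.org type
    "name": "Example Person",              # illustrative value
    "foaf:account": {                      # external vocabulary covers the gap
        "@type": ["foaf:OnlineAccount", "sioc:UserAccount"],
        "foaf:accountName": "example_handle",
        "foaf:accountServiceHomepage": {"@id": "https://social.example/"},
    },
}

print(json.dumps(person, indent=2))
```

In this arrangement the central ontology is left unchanged: the prefixes declared in the context carry the connection between the Schema.org description and the specialised vocabularies.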
Within the data, there is a stress on practicality of use and on simple, maximal coverage, which has an interesting effect on deterritorialisation. The data indicate that first-order territorialisation and second-order territorialisation are often incompatible. The following comments by a steering group member provide an example of the conflict that comes with extension and accuracy:
In the previous medical/health additions we made the mistake of including terms whose name was implicitly contextualized to medical/health scenarios. So we ‘used up’ the word ‘action’ on a property that we now call…muscleAction and so on…every change needs to make sense [a]cross domain, since schema.org is a cross domain vocabulary.
Here a member points out that a previous vocabulary specialisation came into conflict with an extension to Schema.org. Another member notes, ‘As Schema.org grows, generic names for types and properties leads to unfortunate collisions.’ That is, the two orders of territorialisation are in conflict with one another. One is guided by practicality and ease of use, while the other is guided by understanding semantic differences. A crucial question, then, is which takes precedence and why.
The data indicate that first-order territorialisation is most important and is guided by rules of simplicity. Steering group members repeatedly stress the need to cover the majority of use cases in contentious proposals, as one steering member mentions while arguing for a less precise and more general markup rule, ‘I think we need to find a sweet spot that covers 80–90% of the cases. A formally proven perfect solution for the underlying problems is IMO beyond what schema.org can achieve.’ The data indicate that semantic accuracy only prevails when simplicity and practicality agree with it. This alignment most commonly occurs when modifications are small, clear and introduce novel and superficial coverage to the ontology, for instance, when the community added Exhibition to the ontology and then further distinguished between Exhibition as CreativeWork and Exhibition as Event. In this example, there was consensus about ease of use, ease of implementation and semantic accuracy. Thus, there was alignment between the forces that push both orders of territorialisation.
Second-order deterritorialisation occurs when specifications are made to existing domains or when new domain proposals involve a high degree of semantic complexity. As one member remarks about a proposal for integrating legal decisions and terminology:
The problem is that internationally there are SO many different vocabularies—I’m not just talking about language differences, I’m talking about the way legal concepts are referred to and thought of in different jurisdictions. Not to mention at all levels from international…to national to regional (state, province) to municipal and other local levels. So trying to include all those existing vocabularies is not only monumental but probably unworkable. In a situation like this it is actually more helpful to…come up with high level generic terms to which individual vocabularies can be mapped.
In this example, semantic complexity is simply too great to manage, let alone integrate in a way that developers could easily use. The trade-off in this case was a proposal to use generic legal terms, but even those terms ran into semantic difficulties because they were based on the US legal system and were not sensitive to the differences in other local, regional and global legal systems. Since the ontology must be simple to understand and markup must be easy to do for first-order territorialisation to occur, territorialisation at the first level generally means that territorialisation at the second level cannot proceed, and the assemblage is destabilised, with the notable exception of cases in which the two orders align. That is, adaptation and integration with the assemblage must be within the realm of feasibility for an average web developer. Added complexity (territorialisation at level two) increases the likelihood of error and miscoding. The actual practical effect of a miscode may or may not be severe, but it does work against the affirmation of the assemblage’s identity.
The paradox of de/territorialisation sets the stage for the emergence of new assemblages and/or modifications to the existing set of relations. Let two examples from the data suffice. In the first example, the absence of a sufficiently detailed product and item taxonomy led to a complicated proposal that broke with general rules guiding feasibility. Rather than enacting that proposal, a new proposal to integrate an alternate ontology, with similar coverage, was created. In the second example, inadequacies in the ontology, stemming from complexities in the real estate market, prompted collaborators to include Zillow.com as well as content from two other real estate ontologies. In each of these examples, the tension between applicability and accuracy played a major role in the establishment of new connections and the emergence of new groups of relationships. So as Schema.org is destabilised by its inability to specialise or extend its ontology, it provides the components for new semantic assemblages to emerge alongside it. However, the creation of new assemblages through interlinking can raise the very same issues of compatibility discussed previously.
This conflict between the orders of territorialisation is significant. Since search-based AIs increasingly hold a privileged position vis-à-vis our access to knowledge, as knowledge consumers we would want that knowledge to be as semantically rich as possible. However, the spread and adoption of the knowledge base (first-order territorialisation) often result in a loss of complexity. As shown, this second-order deterritorialisation has the unintended consequences of ontic occlusion and the blunting of complex distinctions. For example, Google’s recent decision to provide medical diagnosis ‘cards’ as a priority feature of its semantic search implicitly advocates for specific and select medical information independent of patient context and history, not to mention occluding alternate sources of advice. Relatedly, the major users of Schema.org are large companies with goals that may not be aligned with broad or open access to information. This, of course, comes at the expense of competing interests. As semantic web technologies become the global standard mediating access to information, use ceases to be a matter of choice for knowledge consumers, while those technologies transform into a sieve for the web of knowledge, presenting only what has not been allowed to pass through its holes. How the decisions on what is or is not considered too granular are made bears directly on these outcomes.
Conclusion
Global assemblages, Schema.org in particular and the semantic web in general, contain components that represent many different knowledges, intentions and sets of relationships that are deployed to create and encode a certain understanding of the world while occluding others. I argue that in the process of capture and representation, the semantic web creates and faces increasing amounts of complexity and ambiguity. This complexity and ambiguity serve to destabilise the particular articulations of that global form, leading to the establishment of new sets of relationships among components and the emergence of new articulations. In this sense, this study follows in the footsteps of prior work that found capture and escape to be inextricably linked within knowledge and classification systems (Bowker & Star, 2000; Busch, 2011; Foucault, 1973). This study adds to that rich work by expanding the scope of analysis to domains of code-based, automated and purportedly value-neutral decision-making. Additionally, through an assemblage theoretical approach, it finds that escape itself contributes to the emergence of new particular knowledge and classification systems.
That such controversy exists within these systems of knowledge and representation has major importance to a world that increasingly relies on AI systems. When exposure to information, employment opportunities, residential choices, opportunities for consumption and the social sorting that results from digital surveillance more generally are mediated by the decisions of AIs, it becomes vitally important to understand the biases that might exist within those systems. This necessarily includes analyses of the processes through which those systems are created. While outside the specific purview of Schema.org, the markup it enables and the results that search engines present could become tailored to specific individuals, further entrenching user-targeted search and consumer profiling. Semantic results for housing could integrate data profiles about individuals to the effect of excluding certain people from certain areas. Thus, future work would do well to examine the creation of other instances of the semantic web using additional methodologies that allow researchers to interrogate the decision-making process at a more intimate level. Additionally, future research should examine the ontologies themselves in more detail, exposing the identities and understandings of the world that are occluded or rendered invisible. Furthermore, concern should be directed towards the actual uses of this technology, attending to the practical effects that AI-generated information has on users. All of this requires a sustained and interdisciplinary empirical commitment to an opaque and enormously complex facet of the coming age.
