Abstract
I argue that web historiography should be placed higher on the Internet Studies research agenda, since a better understanding of the web of the past is an important condition for gaining a more complete understanding of the web of today, regardless of our focus (e.g. political economy, language and culture, social interaction or everyday use). Building on reflections about ‘historiography’ and the ‘web’, I discuss several major challenges of web historiography vis-à-vis historiography in general, focusing on the characteristics of the archived website and the web sphere, and the consequences of these characteristics for web historians. I conclude by outlining future directions for web historiography.
It is not unusual within a relatively young field of study for scholars to begin discussing the state of the field and to outline compelling problems of the future. In Internet Studies, this interest is articulated in a special issue of The Information Society dedicated to the discussion of whether Internet Studies is a discipline or a field (Baym, 2005: 229).
However, the history of the internet and of the web is not on the agenda of this special issue. One reason may be that the contributors were charged with discussing ‘Internet Studies’ in general terms and as opposed to other well-established fields and disciplines, such as the social sciences, computer science and communication and media studies. However, this absence may also index the simple fact that reflections on historiography did not play a significant role during the first 10 years of Internet Studies, and with good reason. Firstly, when establishing a new field/discipline, studying the past is less urgent than studying the present. Secondly, the very short past of the web in particular may not even be considered a history. Internet Studies is now a well-established field of study, however, and it is time to place historiography higher on the research agenda.
In this article I outline some of the challenges presented by a sub-set of internet history, namely, web historiography. I limit the discussion to the history of the World Wide Web, rather than to internet history in general, because over the past two decades, the web has largely become the pivotal point of the internet as well as of Internet Studies. 1 Of course, the internet is a precondition for the advent and development of the web, and hence it is part of web history; it is also likely that the internet will foster new forms that will supplement the web, for example the use of apps on mobile devices such as smartphones and tablet computers. 2
In terms of impact and number, histories of the web have not played a significant role in the first years of Internet Studies. However, historical studies of the web have been done (for a summary, see: Brügger, 2010a: 8–13). The general picture that emerges here is, firstly, that web history has not yet been constituted as a sub-field of study in its own right within Internet Studies; this is indicated by a lack of shared theoretical and methodological assumptions and discussions. Secondly, although the number of contributions is very limited compared to Internet Studies in general, a growing interest in the field has become apparent within the last couple of years, from approximately 20 publications up through 2005 to some 35 between 2005 and 2010, in both periods equally distributed among journal articles, books and dissertations. 3
The argument for encouraging internet scholars to further develop web historiography is simple: a better understanding of the web of the past is an essential condition for gaining a more complete understanding of the web of today, regardless of whether our focus is on political economy, language and culture, social interaction or everyday use. Obviously, the present forms and uses of the web have not been created ex nihilo: they all come with a history. In addition, the past of the web becomes longer as we move forward in time, and since the web will probably continue to change rapidly, it will become still more entangled in a complicated network of past and present lines of influence, which will thus become increasingly important to describe and analyse. Correspondingly, the need for general and systematic reflections on web historiography and its theoretical and methodological challenges becomes ever more urgent.
With these general considerations in mind, I begin with comments on what is understood by ‘web’ and ‘historiography’ and then discuss some of the major challenges presented by web historiography. The main argument here is that many of the challenges specific to web historiography revolve around the characteristics of one of the source types on which many historical web studies will probably build, namely archived web material. After a brief introduction to web archiving, I turn to the characteristics of the archived website and the web sphere, and the consequences of these characteristics for the web historian, specifically with regard to source reliability and to making historical studies of a web sphere.
Web and web strata
Whether studying the web of today or of the past, we can focus on five different web strata: the web element, for example an image on a webpage; the webpage, which is what we see in a browser window; the website, which is a number of coherent webpages; the web sphere, which is the web activity related to a theme, an event or the like; and the web as a whole, which is anything that transcends the web sphere, such as the general technical infrastructure of the web or the content of the web in its totality (see Brügger, 2009: 122–125, for further details and definitions).
Although the literature on web histories is limited, one can find examples of historical studies of each web stratum: a study of the banner ads of the top US websites between 2000 and 2007 (Li and Zhunag, 2007), or of Danish web ads between 2004 and 2009 (Jessen, 2010); a digital style history studying the development of graphic design on webpages (Engholm, 2002), or of ‘the look of the web’ (Ankerson, 2010); a number of historical studies of one or more websites – for instance, in relation to political communication (Jarvis and Wilkerson, 2005; Schweitzer, 2008), online newspapers (Nerone and Barnhurst, 2001) or broadcasters’ websites (Burns, 2000, 2008); historical studies of web spheres, such as web campaigning in relation to US presidential elections (Foot and Schneider, 2006) or the development of weblogs from 2003 to 2004 (Herring et al., 2007); and, finally, studies of issues related to the web as a whole, such as the development of the use of cookies in web browsers (Elmer, 2002).
In addition, the five strata are interrelated since they constitute each other’s contexts. For instance, they can be ‘nested’ in one another: we can study the use of images (web element) as they appear on front pages (webpage) in online newspapers (website) during general elections (web sphere). An example of a historical study of nested web strata is a study of the use of thumbnail pictures on the Sydney Morning Herald online between 2002 and 2006, in which the analysis of the small images is related to the changing design of webpages and the website (Knox, 2009).
This focus on the five web strata is mainly concerned with the web as a medium and as a text (Brügger, 2009: 118–119). Obviously, one could also broaden the perspective and study how the web and its strata were shaped in an interplay with producers and users, as well as with the contexts in which production, web and use are situated (economic, political, institutional, cultural, other media, etc.). This could, for instance, include science and technology studies inspired by actor-network theories focusing on the mutual shaping of producers/users and technology (cf. Latour, 2005; Law and Hassard, 1999). We can further add studies that map and explain the mutual influence between producers, web, users and contexts, as a whole (cf. Brügger, 2010b: 30–41). However, to some extent these kinds of studies may also have to include reflections as to what producers or users are actually producing or using, be that web elements, webpages or one of the other web strata.
(Web)historiography
Internet scholars intent on writing histories of the web – regardless of which stratum and larger topoi (e.g. political economy, language and culture, social interaction or everyday use) are in focus – share several fundamental questions with historiography in general. Firstly, one cluster of general concerns revolves around the purpose of the study: is the aim to establish a chronology of events and actions or is it (also) to identify the processes and forces that drive historical change? And how is the history related to the present in which it is written? Secondly, another set of general issues relates to the philosophy and theory of history: how should we understand the changes in the past? Are they ascribed some kind of ‘direction’ (development, progress, decline) or should they rather be seen as a number of ‘neutral’ transformations? And can the history we are writing be divided into different phases (‘periodization’)? Based on which criteria? A third set of fundamental questions relates to methodological issues concerning the source material: which sources are available, and how will the choice of sources determine and bias our study? What are the criteria for our interpretation of the sources? And how can the source reliability be determined, based for instance on an evaluation of how, where and in what condition the sources were found?
All these concerns are well known to any historian, but in relation to at least one of the points mentioned above – the sources – web historiography faces a number of specific challenges. And since the sources are the basis of the study, these challenges may affect all the other issues.
Web historiography can make use of a variety of source types, but in contrast to studies of the present-day web, the web historian cannot create new sources that ‘emerge’ directly from the object of study (e.g. surveys, focus groups or any other source referring to current web activities). Therefore, the web historian must frequently use source types that have been handed down to him from the past, such as archival material, books, reports and content in traditional mass media, as well as any other kind of electronic and digital source. However, one of these source types stands apart, namely archived web material, since it poses some challenges not familiar from other source types. 4
Although archived web material is by no means the only source type for the web historian, it will probably be one of the sources for web history in most cases. The archived web material can be used like any other kind of source – to give information about and document the study – but it is also the entity that contributes to keeping the analysis together. When setting out to write the history of one specific website or of a web sphere related to an event, we can make use of archived versions of either the website or the web sphere as important means to delimit our object of study, which can then incorporate other sources.
Generally, historiography is in most cases based on incomplete source material: for whatever reasons, things have been lost or destroyed, or simply not preserved. Thus, individual sources as well as collections of sources may be incomplete, and the historian usually has to make do with what can be found, just as he has to reflect on the fact that something is probably missing. The same holds true for the web historiographer. However, as will be shown in the following, the incompleteness understood in the sense that something is missing is not necessarily an adequate description of the incompleteness of the archived web.
Web archiving
Since the ways in which online web material enters a web archive have an impact on the characteristics of the archived web sources, it is necessary to introduce some of the main elements of web archiving (Brügger, 2005, 2011a; see also Brown, 2006; Van den Heuvel, 2010; Masanès, 2006; Schneider et al., 2009).
We must first distinguish digitization and digital preservation from web archiving. In the former, material that was initially offline and not digital (e.g. paper, tapes or records) is made digital; in the latter, the collection is built of online and ‘born-digital’ material. Concerning digitization and digital preservation, the main issues include how to perform the transformation to digital media while limiting data loss and securing long-term preservation. A number of these issues are also of relevance to web archiving – along with several additional ones.
Firstly, web material can be collected and archived in several different ways. The most widespread collection method is web harvesting, or downloading files from web servers based on a seed-list of URLs, but other methods are also possible, for instance making screen dumps or screen movies.
Secondly, the archiving process can be carried out as either micro- or macro-archiving. Micro-archiving is done on a small scale by individuals without professional knowledge about web archiving, who collect web material for a specific purpose, such as a research project. Macro-archiving is carried out on a large scale by professional archiving institutions in order to conserve pieces of the cultural heritage (cf. Brügger, 2005: 10–11). 5
Thirdly, making a copy on a 1:1 scale of what was online on the web at a given point in time is very much the exception. The different types of archiving, as well as choices to be made regarding what we want to archive, require different archiving strategies. In the case of micro-archiving, the strategy often depends on the purpose of the research project. In macro-archiving, the subsequent use of the archived material is not known and therefore one (or more) of the following three strategies is normally used (in general, all based on web harvesting): snapshot, selective and event archiving. The snapshot strategy aims at archiving a certain part of the web, usually a large portion such as an entire country-code Top Level Domain (.uk, .de, .fr, etc.), which may take several months. The selective strategy intends to archive a limited number of websites that have been selected individually before the archiving process; it is usually used on websites that are frequently updated. Finally, the event strategy seeks to archive web activity in relation to a specified event (e.g. political elections, sporting events or catastrophes), based on prior selection.
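Web harvesting from a seed list, the collection method on which these strategies generally rest, can be illustrated with a minimal sketch. It is not the software of any particular archive: the function name `harvest`, the `fetch` callable and the `max_pages` limit are assumptions made for illustration, and a real harvester would also handle politeness rules, file formats and storage.

```python
from collections import deque
from urllib.parse import urlsplit


def harvest(seeds, fetch, max_pages=1000):
    """Breadth-first harvest starting from a seed list of URLs.

    `fetch` is a hypothetical callable: fetch(url) returns a tuple
    (content, links), or None if the URL could not be retrieved.
    """
    # Only hosts named in the seed list are archived (an assumed scope rule).
    allowed_hosts = {urlsplit(seed).hostname for seed in seeds}
    queue = deque(seeds)
    seen = set()
    archived = {}
    # max_pages is a crude guard against endless crawls.
    while queue and len(archived) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        if urlsplit(url).hostname not in allowed_hosts:
            continue
        result = fetch(url)
        if result is None:
            continue  # material that cannot be fetched leaves a gap
        content, links = result
        archived[url] = content
        queue.extend(links)
    return archived
```

In this sketch the strategies differ mainly in how the seed list is drawn up: a snapshot would start from a very large list, whereas selective and event archiving would start from a short list chosen in advance.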
Challenges of using archived web sources
Generally, the ways in which archives are created and made accessible for research tend to favour certain kinds of studies based on archived material and neglect others.
I now turn to the effects of the archiving process on the archived material vis-à-vis each of the five web strata mentioned above, and the consequences of this process for the web historian intending to use the material.
The most conspicuous but often ignored impact of the archiving process on the archived web material is that in practice the website is almost always the basic unit in a web archive. This is because most archiving software is based on the (total or partial) archiving of domain names in the form of URLs (the exception being screen dumps, screen filming and individual webpages). The predominance of the website as the core archival unit means that in most cases the website’s domain name is de facto the entry point to the archive, regardless of the web stratum studied. If, for instance, we are doing a historical study of the use of images or video in fan communities, we have to know the URLs of all the relevant websites we would like to include (this problem may be partly solved if web archives get free text search, which almost no web archives have today). In addition, since in practice the website is the basic archival unit, its characteristics, further outlined below, also affect the other strata to varying degrees.
Despite the important role of the website generally among the archived web strata, the other strata are nevertheless also affected by the archiving process, each in its own way and with specific consequences for the web history scholar. However, in contrast to the archived website and web sphere, the web element, the webpage and the web as a whole are only affected in one way by the archiving process: something may be missing. I begin by discussing these three strata and then focus on the website and on the web sphere, since, as we will see below, they are both affected by the archiving process in ways that go beyond the problem of missing elements.
Web element, webpage and the web as a whole
Web elements and webpages may be missing in the archived material. For instance, images, sound, video or hyperlinks may not have been archived for a variety of technical reasons, or specific webpages may have been deliberately omitted when a website was archived. The consequences are obvious: if the aim is to study the history of the use of video, of web design/layout or of networks of hyperlinks, these gaps may be significant enough to make the study impossible.
Concerning the web as a whole, what is missing in a web archive is specifically the kind of information about ‘the web’ that the web itself usually offers us when we study the online web, such as search results and other information about what ‘the web’ looks like here and now. Either such general information about the web at a particular moment in the past has not been archived at all or it has only been archived randomly. As a first example, suppose the aim is to study the history of ranking practices of search engine results. Since we cannot do web searches backwards in time, and search result pages are not routinely archived, this study is only possible if the result pages of search engine queries in relation to a specific word have been continuously archived in the past (cf. Rogers, 2009: 20).
One may also want to use various mappings of the web as a source, for example The Open Directory Project, claiming to be ‘the largest, most comprehensive human-edited directory of the Web’ (http://www.dmoz.org/docs/en/about.html, accessed 28 November 2010), or Internet World Stats, which claims to feature ‘up to date world Internet Usage, Population Statistics and Internet Market Research Data, for over 233 individual countries and world regions’ (http://internetworldstats.com, accessed 28 November 2010). Both of these publicly available resources could be extremely valuable for the web historian years later, but they only provide information about the current web: information about the web of the past is not available. The web historian who wants to map a past web must therefore rely on the existence of older versions of these services in web archives.
The archived website
The fundamental characteristic of the archived website is that it is a reconstructed and unique version of what was once online, rather than a copy of it. There are three reasons for this (Brügger, 2009: 125–128).
Firstly, the scholar or the archiving institution carrying out the archiving must choose between different archiving strategies and different archiving software, as well as between different concrete ways of performing the archiving (where to start, should material on other servers be archived, should specific file formats be in-/excluded, etc.).
Secondly, the website to be archived may be changed on its server during the archiving process, and we do not know if, when and where any changes may take place. Therefore, the archived website in toto may not necessarily correspond to the online website: the asynchrony between archiving and a possible updating of the website may have resulted in either something being missing or the archived website being composed of webpages that were never online at the same time – or both. The archived website may therefore not only be incomplete but also inconsistent compared to what was once online.
Thirdly, the actual archiving process may not be performed as planned. For a number of technical reasons, the archived website may not be complete – for example, because parts of the content on the web server were impossible to archive (for instance, JavaScript, streamed sound or video) or because the archiving software was caught in crawler traps, such as calendars, which tend to make the archiving process endless.
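The inconsistency described in the second reason above can be made concrete with a small sketch. It assumes, hypothetically, that we know for each archived webpage the period during which that version was online; archives rarely record this, so the names and fields below are illustrative only. The function checks whether there was any moment at which all the archived page versions were online together.

```python
from datetime import datetime
from typing import NamedTuple


class PageVersion(NamedTuple):
    url: str
    online_from: datetime   # hypothetical: when this version went online
    online_until: datetime  # hypothetical: when it was changed or removed


def common_online_window(pages):
    """Return (start, end) of the period in which all versions coexisted.

    Returns None if the versions were never online at the same time,
    i.e. the archived website is inconsistent in the sense used above.
    """
    if not pages:
        return None
    start = max(page.online_from for page in pages)
    end = min(page.online_until for page in pages)
    return (start, end) if start <= end else None
```

A result of None would indicate that the archived website is composed of webpages that were never online together, even if nothing is missing.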
In addition, the archived website is a reconstruction in the sense that it is reconstituted in the archive on the basis of the different archived web elements or webpages. This re-assembling can be carried out by the archiving institution, either as part of or after the archiving process, or by the scholar using the archived website. Thus, in a number of cases an archived website may exist in three different versions: one version when archived, another when re-assembled in the archive after the archiving, and a third when used by a scholar. In this sense archived web material is not only born digital, but it is also ‘re-born digital’. Therefore, versions of the same website, archived on the same date, may well be different in different archives – and even in the same archive – but it can be very hard to determine whether and to what extent they are.
The character of the archived website can be summarized as follows:
The process of archiving creates a unique version rather than a copy, and it is a version of an original which we can never expect to find in the form it actually took on the web; we can neither find an original among the different versions nor reconstruct an original based on the different versions. (Brügger, 2010a: 7)
These characteristics distinguish the archived website from other archived media types; they further have several consequences for the web historian using the material later. We can now examine one of these: the issue of source reliability. 6
Generally, we can approach source reliability in two distinct but interrelated ways. The first focuses on the relation between an event and the different sources that report or interpret what happened or in other ways attest to the event. The historian’s task then involves assessing why and in which situation the sources were created, how close (in space and time) the source’s authors were to the event, and whether they had any particular interest in accounting for the event in a positive/negative way.
A second approach to the question of source reliability is to use a more media-oriented perspective focusing on the reliability of the source as a document in its own right and in relation to an original document of which it may be a copy, rather than on the source’s relation to the event. In an age of mechanically reproduced media, this philological approach has largely been relegated to the study of handwritten documents (e.g. letters, manuscript drafts and manuscript books). However, with the advent of the archived website we may be in need of a website philology, since in a number of cases it can be important and relevant to determine which archived version is closest to what was once online. That said, a number of differences exist between, for instance, medieval manuscript books and archived websites. One of the most important is that the web historian aiming to determine what a website looked like at a specific time online – for instance, based on three different versions – cannot hope to trace a provenance backwards in time as can the medieval philologist who is mapping the provenance of copies and originals by answering such questions as: is this manuscript book a copy of this other book, which is 10 years older? On the contrary, rather than examining successive versions over time, the web historian examines versions that were created almost simultaneously, and she cannot expect to identify one of the versions as an original and the others as copies – they are all versions (cf. the above quote, Brügger, 2010a: 7).
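One hedged way of making such a website philology concrete is to compare the available versions pairwise. The sketch below is not a method proposed in the literature cited here; it assumes that the versions of a webpage are available as plain text under illustrative labels, and it uses the `SequenceMatcher` class from Python’s standard `difflib` module to measure how similar they are. It cannot show which version is closest to the lost online original, only how far the versions diverge from one another.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_similarity(versions):
    """Map each pair of version labels to a similarity ratio in [0, 1].

    `versions` is a dict from an illustrative label (e.g. the archive a
    version comes from) to the text of that archived version.
    """
    return {
        (a, b): SequenceMatcher(None, versions[a], versions[b]).ratio()
        for a, b in combinations(sorted(versions), 2)
    }
```

A ratio of 1.0 means that two versions are textually identical; lower values point to places where the scholar has to decide how the differences arose.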
In summary, when discussing source reliability in relation to the archived website, of first importance is its reliability as a document (version) vis-à-vis the (lost) online original, and not the relation between document and event. However, obviously the reliability of the archived website as a document may also be an issue when using this kind of document to re-establish an event, since differences between versions can challenge efforts to determine what happened at a specific time. Overall, these considerations indicate that archived websites are not necessarily as they were online, that different versions of the same online web phenomenon may exist, and that versions of archived websites must be dealt with in different ways compared to other media, to get as close as possible to what was actually online (cf. Brügger, 2011a: 34–38).
The archived web sphere
Kirsten Foot and Steven Schneider coined the concept of ‘web sphere’ as the basic analytical approach in their historical study of web campaigning in US elections in 2000, 2002 and 2004 (Foot and Schneider, 2006):
We conceptualize a Web sphere as not simply a collection of Web sites, but as a set of dynamically defined digital resources spanning multiple Web sites deemed relevant or related to a central event, concept, or theme. (…) the boundaries of a Web sphere are generally delimited by a shared topical orientation across Web resources and a temporal framework. (Foot and Schneider, 2006: 20)
When evaluating the challenges that archived web material poses for web sphere historiography, it is important to note that Foot and Schneider’s study is based on a collection of web material with two characteristics. Firstly, the scholars themselves were responsible for creating the web collection (Foot and Schneider, 2006: 28); secondly, the collection was created while the event was unfolding on the web (Foot and Schneider, 2006: 41–42). They therefore had a broad overview of the possible material to include in the web sphere, and they were able to continually shape and adjust the creation of the collection to make it fit the aims of the study (cf. Foot and Schneider, 2006: 30–35). However, this way of proceeding will probably be the exception in many cases of web sphere historiography. Instead, historical studies of the web sphere will increasingly have to cope with the material that can be provided from existing archives – more specifically, web material that has been created by others and that the scholar may have to assemble into a web sphere years after the event has taken place.
Consequently, the major challenges for web sphere historiography are to recreate the web sphere backwards in time, and then address the fundamental questions of how and where the web material can be found and how it can be claimed to have formed a web sphere in the past. When addressing these two questions, a web historian can follow two interrelated paths: she can hope to find something that resembles a web sphere, or she can set out to recreate the web sphere herself, in both cases by using web material in one or more archives. These points require further elaboration.
The web historian may be fortunate to find something like a web sphere in an archive, created by a fellow scholar or an archiving institution. For instance, themed collections in the Library of Congress about September 11, 2001, or the Iraq War in 2003 can be considered web spheres, and any archive using the event strategy for the archiving process may hold similar collections. Moreover, archives such as the national Australian internet archive Pandora allow users to browse the collections by subject (Arts, Business & Economy, Defence, etc.); these categories may also be considered web spheres (cf. http://pandora.nla.gov.au). However, since demarcating the web sphere and identifying the web material to be included therein is a dynamic process and, in most cases, a function of a specific study (Foot and Schneider, 2006: 29), we cannot be certain that the themed collection can serve as a web sphere, since we may have wanted other websites included. In addition, when evaluating to what extent a themed collection can be used as a web sphere, it is critical that its creation has been thoroughly documented. For instance, we must know the criteria and the processes for identifying the web material to include, as well as the period of time and the intervals when this was done (cf. Foot and Schneider, 2006: 33–35, 211–213). In many cases such documentation either does not exist, or is insufficient.
If no collection exists that can be considered a web sphere, the scholar must go through the archive himself to find relevant material. However, at least two problems arise here. Firstly, in many cases we have no precise idea of the size and possible content from which to choose the web material to include in the web sphere (e.g. some archives have collected a Top Level Domain name, while others have not). This means that we do not know to what extent the archive is complete compared to what was once online, and therefore we have no idea of how similar the web sphere is to what it would have looked like had we made it in the past. Secondly, as mentioned above, few web archives have free text search; therefore, to find the URLs to include in the web sphere we have to use methods other than those we would have used while the web material was online (e.g. search engines or online databases). We may, therefore, have to base our work on non-web sources (e.g. printed material, electronic media and the like) for finding information about possible websites to include. Ultimately, compared to when the material was online, we may reconstruct the web sphere less systematically and with the risk of greater bias. In addition, since no overview exists of the possible web material to include, it is difficult to know whether and to what extent it is biased.
To summarize: either the web historian is dependent on others’ ways of conceiving of the web sphere, or she may have some difficulty creating the web sphere herself.
If the web historian does not limit his search for web material to include in the web sphere to one archive, the problems outlined above are both repeated and aggravated, since they may be at work in several different ways in different web archives (i.e. differences in archiving strategies and software, terms of searching and accessing the archive, etc.). Accordingly, the questions related to existing themed collections, completeness of the collection and finding the URLs may now have to be answered in different ways for each archive used.
To give an impression of this challenge, imagine doing a historical study of the web sphere in relation to a sporting event such as the Olympic Games, based on the following existing resources: my own private web archive (random and unsystematic micro- and event archiving, archived with Internet Explorer for Macintosh and saved as WAFF files); the special collections in the Australian national web archive Pandora (macro- and event harvesting, open access online); the Danish national web archive, Netarchive (macro- and event archiving; relevant URLs must be known; only researchers have access; and the material must not be removed from the archive); and the Internet Archive (macro- and random snapshot archiving; relevant URLs must be known; open access). Subsequently, using the web material in these different archives would be complicated regarding both access and possible technical integration. Fundamental tasks such as using collaborative coding tools (e.g. those used by Foot and Schneider; cf. Foot and Schneider, 2006: 214–220) or checking for duplicates would be complicated. Simple but important manoeuvres such as following links through the web sphere would also not be trivial, nor would automated analyses of the web sphere through, for instance, network analysis. 7
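Checking for duplicates across archives, one of the tasks mentioned above, might be approached along the following lines. The sketch assumes, hypothetically, that each archive can deliver records of the form (archive name, URL, content as bytes); none of the archives named above is claimed to offer such an export, and the normalization rules are illustrative choices rather than a standard. URLs are lightly normalized and content is hashed with SHA-256, so that identical captures held in different archives can be grouped.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce an absolute URL to host, path and query for comparison.

    Illustrative rules: the scheme and a leading 'www.' are dropped, the
    host is lower-cased and a trailing slash on the path is ignored.
    """
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    query = "?" + parts.query if parts.query else ""
    return host + path + query


def find_duplicates(records):
    """Group records with the same normalized URL and identical content.

    Returns a dict mapping (normalized URL, content digest) to the set of
    archive names holding that capture, for captures in two or more archives.
    """
    groups = defaultdict(set)
    for archive, url, content in records:
        key = (normalize_url(url), hashlib.sha256(content).hexdigest())
        groups[key].add(archive)
    return {key: names for key, names in groups.items() if len(names) > 1}
```

Only byte-identical captures are grouped here; versions of the same webpage that differ slightly between archives would not be detected and would still have to be compared by the scholar.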
Moreover, since in many cases the web sphere is composed of archived websites, the above-mentioned challenges related to the individual archived website likewise come into play; the potential for different versions of the same website in the various archives poses serious difficulties for delimiting the web sphere.
In short, while it is possible to undertake historical studies of a web sphere based on archived web material, we must consider several methodological issues when (re)constructing the web sphere of the past in the present.
The future of web historiography
Although the web historian’s task faces several challenges, these should not discourage us from undertaking historical studies of the web. One of the aims of Internet Studies is to fully understand the conditions for contemporary internet forms and uses: historical perspectives are essential to such understanding.
I conclude by outlining three ways of doing future web historiographies. Firstly, as mentioned at the beginning of this article, web histories have slowly grown in number in recent years; this growth will surely continue as the past offers more and more to study.
Secondly, alongside these ‘new’ histories, a number of originally non-historical studies may be repeated in order to discuss changes and developments from the past to the present in a clear historical perspective. A few studies have already done so – for example, a study of online editions of US newspapers from 2005, which replicated a previous study about the same newspapers from 2001 (Barnhurst, 2010).
Thirdly, the need will arise for broader historical studies of the web that provide comprehensive foundations for more detailed studies. These could, for instance, be studies of the development of the web in national contexts, studies of the commercial web or of the web of civil society.
One way of fostering the emerging field of web historiography would be to initiate cross-national studies of the history of transnational events on the web, for example predictable events such as the Olympic Games. There are several advantages to making historical studies of the web cross-national. Firstly, the number of scholars involved would increase, thus generating a critical mass from the relatively small number of web historiography scholars in each country. Secondly, cross-national studies would provide analyses and empirical data (e.g. along the lines of the three kinds of historical studies of the web mentioned above) that would benefit each research environment. Thirdly, cross-national studies would constitute an important and much-needed academic forum for discussing fundamental theoretical and methodological issues. Some of these methodological problems have been raised here and, in particular, cross-national and cross-archive web sphere studies evoke compelling problems; thus, one of the key issues to address is the differences in national approaches to web archiving. This is why it would be important also to involve the different national archiving institutions, and possibly a transnational organization such as the International Internet Preservation Consortium (IIPC).
In conclusion, cross-national studies of the history of the web, along with general theoretical and methodological discussions, could provide important and necessary perspectives to complete the picture of Internet Studies.
Footnotes
Funding
This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.
