Abstract
In the streaming era, the very thing that defines it is what threatens to impede access to important media history and cultural heritage. Streaming’s barriers to entry and its interim content catalogs challenge the collection and preservation of streaming services for research and teaching purposes. If researchers and libraries do not work together to document and preserve these services, we will keep losing important sources and data. From a collection perspective, we argue that streaming services consist of their catalog, metadata, and graphical user interfaces. First, we map the large-scale legal deposit collection of streaming at a national library as well as a media researcher’s small-scale targeted collection. Second, we compare the resulting collections of web sites and graphical user interfaces in order to discuss methodological challenges. The findings of this comparative analysis indicate the existing deficiencies in both collections and suggest potential improvements in the collection and preservation of streaming services.
Introduction
Online distribution, under the moniker ‘streaming’, challenges the collection of media content and subsequent research that relies on well-documented media archives. Streaming as an everyday term covers a paradigm shift in media culture that transforms media use, its infrastructures, and its business models. Streaming is a complex and composite iteration on distributive technology that dissolves and reinvents distribution and media use in continuous exchange between Internet media and traditional broadcast and flow media (Spilker and Colbjørnsen, 2020).
Denmark, along with other countries that have high Internet penetration, has reached an interesting inflection point in this exchange. On a daily basis, Danes stream more TV (47 pct.) than they watch flow-TV (37 pct.) and six out of 10 Danes subscribe to Netflix (Kantar-Gallup, 2022).
In this Danish case study, we wish to point to a particular problem regarding streaming, which is that the actual interfaces of these services are currently poorly collected and not well preserved for future reference. They are essentially lost to us on a daily basis. This issue would be best addressed at a policy level since it is a problem not only for researchers but also for the public as there is little transparency about what actually happens within the dynamic interfaces of streaming services. In the Danish media political context, the latest Media Agreement 2023-2026 between several parties in the parliament addresses some issues concerning streaming, especially how global streaming services may pose an external threat to a small nation’s media system (Ministry of Culture, 2023). In a particular sentence, the agreement also calls for more transparency regarding personalization on the national broadcasters’ streaming services. However, in the current situation, it is very difficult to collect the evolving interfaces of streaming services and document any changes in the practices of the streaming services.
Streaming is a standard for many and no longer a first-mover activity, at least in the Nordic region of Europe (Lüders et al., 2021). We argue that given the current state of affairs, it is difficult to collect streaming services, their contents, metadata, and interfaces. Most of the constituent parts of streaming services that give us ‘the streaming experience’ are essentially non-downloadable and un-collectable. However, a legal mandate to collect and preserve cultural heritage drives the national library to address this issue (The Royal Danish Library, 2024a). The same issue sparks engagement among scholars who wish to collect streaming for research purposes, but who end up making individual archives that remain unknown to their colleagues. Therefore, we wish to address this issue in collaboration by answering the following research questions: What methodological challenges do we find when we collect and study streaming services using our two different collection methods? What characterizes collections of streaming interfaces and how can we improve future collections?
For this comparison and discussion of methods, we primarily draw on media studies, archive studies, and especially the subfields of television studies and web archive studies. First, we map collection practices and their resulting collections in a large-scale legal deposit collection of streaming at the Royal Danish Library (2005-present) and in a media researcher’s small-scale targeted collection (2019-present). Second, we compare the resulting collections with a focus on the collection of graphical user interfaces of video on demand services (VODs). In order to give specific examples, we have chosen to focus on two particular video streaming services: the VOD of the central Danish public service media (DRTV) and Netflix, which is the dominant commercial player among the available VODs in Denmark. We would be hard-pressed to find a typical Nordic case, since each service, irrespective of media content type, has its own specificity and can reflect various dimensions of the concept of streaming (Lobato, 2018; Spilker and Colbjørnsen, 2020). DRTV is online TV that Danes can access via tax-subsidized subscription, and login is only required when accessing the service from ISPs outside Denmark. Netflix in Denmark is a subscription-based VOD with nation-specific pricing and content catalog. The two services are neither too similar nor too different. Still, we expect that our comparison will have valuable points that are also relevant for those who study other kinds of streaming services, for example, for music, books, or gaming, as well as other dimensions of streaming.
Finally, we discuss methodological challenges in a comparison of the collections and collection practices. In doing so, we identify differences but also interdependencies between approaches to collecting and the resulting collections.
Theoretical framework
Studying a streaming service can mean vastly different things, from studies of diversity and prominence to studies of interaction design or genre use. Depending on which perspectives you plan to use in your study, these will require different things from the data collected. As we mentioned in the introduction, streaming is a complex and composite iteration of media as delivery technology. We address this issue at a time when libraries, archives and museums are also in transition (Hvenegaard Rasmussen et al., 2022). We take a holistic approach that acknowledges streaming as multiple variations on non-permanent access to media content. Therefore, our approach bridges all forms of media content and its metadata. For the sake of brevity, our case focuses on services that provide stream-based access to videos on-demand (VODs). Our definition of metadata incorporates what others may define as paratexts, for example, program descriptions and images (Gray, 2010; Johnson, 2019; Kelly and Sørensen, 2021).
We use the concept of the interface to describe the Web site of the streaming service that presents the content including the graphical user interface that frames the playback of content. There is often a fluid transition between the two and we argue that both play a central role in the use of a streaming service. Hence, we address what Hesmondhalgh and Lotz (2020) see as the service interface that is layered among device interfaces and marketplace interfaces etc.
During the past few years, the field of Web history and archive studies has been opened to researchers with an interest in digital humanities (Brügger and Laursen, 2019; Brügger and Milligan, 2022; Gomes et al., 2021). For our purposes, the terms ‘collecting’ and ‘archiving’ and the resulting collections and archives are somewhat interchangeable. Brügger (2018: 79) defines web archiving as any form of deliberate and purposive collection and preservation of web material. We do purposefully collect web sites and preserve them to pursue immediate research and save them for posterity. Preservation for posterity is a primary concern for the national library. Even if it is not a primary concern to researchers, collecting samples of highly dynamic and changing interfaces and addressing them through research is inherently preservatory. This article does not consider aspects of preservation and accessibility. We refer to the practice of collecting, which results in collections. At any rate, we aim to further the digital collection and archival literacy needed across historically inclined fields of research (Jensen, 2020). In our case, it stems from a need to address the highly fragmented and dynamic publication and distribution scenario that enables a multifaceted ‘streaming experience’ based on ephemeral catalogs that are subject to permanent reconfiguration (Colbjørnsen et al., 2021; Johnson, 2019; Parikka, 2012). The catalogs and their materials are collected to serve as documentation. Regardless of how fragmented and dynamic they were and what system designs might have led to them being collected and preserved, they are reborn digital materials because of the process (Brügger, 2016). Streaming services are online products at the mercy of the highly dynamic and interchanging characteristics of the web and the highly competitive environment for online entertainment services.
Our case will underscore that their online web presence cannot be considered a stable original that supports the collections as a proof or backup (Brügger, 2018: 85).
From television studies, Lobato (2018) and Kelly (2022) provide discussions and practical advice for researchers who grapple with television’s changing technological compositions, distribution systems, and market logics. In terms of methods, streaming services such as Netflix are hybrid systems that combine television, cinema, and Internet technologies. This makes them naturally responsive to emergent digital methods such as scraping and big data analytics, but Netflix’s international catalog system can also be studied in relation to legacy media studies questions (Lobato, 2018: 252). However, Lobato (2018) also notes that Netflix has been structurally transformed by its internationalization. Due to regulatory pressure and licensing agreements that vary between geographical regions, Netflix behaves differently in different countries. In some senses it may now be more appropriate to see Netflix as a collection of national media services tied together under one brand rather than as a uniform global service (Lobato, 2018: 245). Taking one step at a time, we provide a case of collecting one national version of Netflix, namely the Danish, in two different collections. At this stage, we cannot find indications that anyone, including national archives and the Internet Archive, holds a copy of the Netflix catalog for research purposes, nor of its interfaces, metadata, or its many apps across different devices that frame users’ access to the Netflix catalog. At the very least, it seems that nobody has collected the Netflix catalog as arranged and presented to a national audience (in and of itself a daunting task). If Netflix and similar streaming services do not grant access, by law or otherwise, then researchers must build their own collections or turn to the national collections.
The practical approach we subscribe to is thoroughly discussed by Laursen et al. (2017), who specify four different ways of collecting digital cultural content: still image screen capture, Web recording, API-collection, and Web collecting. In this article, we mainly focus on the combination of still image screen capture and web collecting as a lean and efficient go-to method. We find that this easily documents what has been collected by the library and what can be collected at present from live versions of the streaming services’ interfaces.
Maemura et al. (2018) propose a framework for documenting key aspects of a collection that addresses the situated nature of the organizational context, technical specificities, and unique characteristics of web materials that are the focus of a collection. We draw loosely on their framework with regard to key questions and relevant information for the optimal documentation and tracking of collections’ provenance.
Circling back to media studies, we use the concept of five dimensions of streaming as a framework for thinking about and mapping the streaming concept across types of media content and media industries (Spilker and Colbjørnsen, 2020). This helps us identify and document which dimensions of streaming we are studying and which we are not collecting. Furthermore, the five dimensions help us characterize traits that will influence how we collect material that is transient, never fixed, and whose availability is contingent (Colbjørnsen et al., 2021). We could even frame this as the absence of an original to collect.
We aim to exemplify and discuss the methodological challenges for the collection of streaming. However, we acknowledge our bias that will inevitably shape the theoretical and practical analytical inclinations that influence our approach. Therefore, as our own conceptual contribution, we include three different approaches to analyzing a streaming service. We do this to demonstrate the various interests and approaches that libraries and media researchers could have when collecting streaming services.
As the first approach, we can understand and analyze the service as an interface and a webpage by analyzing the front page as a curated webpage of high communicative importance (Johnson, 2019; Maasø and Spilker, 2022). A particularly interesting element is the so-called ‘carousel’ (or hero board) at the top of the page and how its content can reveal a lot about the service’s implied target groups and genre assumptions (Bruun, 2020: 95ff). Such an approach can mix elements of textual analysis and content analysis, respectively, at a micro or macro-level, and it will often be both qualitative and quantitative.
As the second approach, we can understand the service as a catalog of various kinds of content and analyze the composition of the catalog by quantifying and counting content within different categories (Lobato, 2018). This data-heavy approach has some similarities with analyzing flow programming schedules (Williams, 1975: 86ff). It can track the continuous evolution over time and check for significant variations such as the genre variation or the degree of locally produced content.
As the third and final approach, we suggest a comparative analysis of different services that, for example, have different institutional mandates (public service vs commercial), are from different media industries (e.g., video streaming vs podcast) or have different company histories (broadcast native vs online native). This approach potentially offers a more media-systemic and macro-oriented look at the services’ overall structural setups and visual identities combined with an understanding of their inherent media logics (cf. Hesmondhalgh and Lotz, 2020). In our experience, it is especially in these comparisons to other services that it becomes clear to the researcher what is special about a particular streaming service, for example, how it could perfectly represent certain dimensions of streaming or add new ones. We argue that our empirically grounded discussion of the collections, our methods, and their closeness to everyday collection practices merits a comparison that is urgently needed.
By focusing on conventional on-demand streaming services from global commercial players and national public service providers, we have delimited the analytical scope. For example, we do not include the topic of pornographic videos online or the topic of live streaming (e.g., sports). Both have played important roles historically and they continuously influence the dynamics of the streaming era (see Hutchins et al., 2019; Paasonen, 2022). However, we hold that the discussion of methodological challenges stemming from the collection of conventional video streaming services is relevant across different types of content regardless of its ethical status in society or its degrees of liveness.
Methods at the Royal Danish Library
Author A has studied how The Royal Danish Library (the library) collects streaming services and is an employee in the department for digital cultural heritage with access to the library’s collections.
Based on author A’s research, we briefly outline the central methods used at the library to collect streaming services. The library collects physical and born-digital cultural heritage under the provision of the Danish legal deposit legislation, which applies to published works connected to Denmark (n.d.). This is a large-scale collection of many types of works. At the library, the concept of ‘works’ covers web pages, software, books, images, and recordings of audio and visual content, all in multiple formats and with as much context as possible. The aim is twofold: to document and preserve our culture and, in the process, to support as many different research approaches and disciplines as possible within the confines of the legal deposit law.
However, the library receives diminishing amounts of deposited physical books, newspapers, magazines, small press, etc. The library contracts with a national distribution network operator to record the full flow-based schedule from the major nationwide TV-stations and radio stations. Recording flow broadcasts only secures the content if it has had a broadcast stint and it was collected as such. This means that it is different from how it is (possibly) formatted and presented in new or simultaneous appearances in streaming catalogs and interfaces.
The distinction between streaming-only content and streaming-first content is based on the industry’s publication strategic nomenclature. This is an important difference, especially if a library aims to reduce duplicate collections. Large portions of the content available on streaming services have already been collected on physical formats or as broadcast recordings. Hence, particular designations that mark a certain publication as exclusive to one service are especially interesting, for example, Viaplay Original or Netflix Exclusive.
The Danish Broadcasting Corporation (DR) is the main public service media provider in Denmark with statutory incentives to transition to web-based services. In 2020, DR converted two of its flow TV channels to be streaming-only catalogs on its streaming service, DRTV. This led the library to implement the collection of streaming-only content, which has confounded its data infrastructures and data models (Aegidius, 2021). The collection via DR’s application programming interface (API) began in 2022. It collects the content and the metadata, which includes program descriptions and images, but not the interface of DRTV. The library has not made similar attempts to collect Netflix Exclusives or Originals since there are too few that fall under the purview of the Danish legal deposit law.
Since 2005, the library has collected streaming services’ interfaces as part of its collection of all Danish web domains and Danish content published on international domains. In essence, this is the Danish web sphere that extends beyond the Danish top-level domain (Brügger, 2018). The backbone of this collection is a seed list provided by the national top-level domain administrator Punktum.dk (n.d.). The process of web harvesting uses four collection strategies in combination: quarterly cross-sectional collections, event-based collections, selective collections, and special collections typically upon request (The Royal Danish Library, 2024). The Danish Web archive (Netarkivet) is the central collection to survey with the aim of understanding how well the library has collected the interfaces and metadata of streaming services.
Content collected for our two cases in the Danish web archive with status code 200.
Author A used the library’s SolrWayback interface to its Web archive, which ranks and plays back search results (Egense, 2021; Lauridsen, 2021). The quantities of collected web content might look promising, with a high number of web pages for the domains over the years and a consistent collection of hits from all the years. However, to illustrate the underwhelming quality of the collected content, we provide selected screenshots in a later section. From the content type collected, we can see that web pages are the majority and video files are nearly nonexistent. The absence of video files is to be expected for two main reasons. Firstly, the web curators state that, in general, embedded videos are not easily collected. Secondly, most services utilize a paywall that the current collection method cannot pass. An international collaboration is underway to create a browser-based collection method that should mitigate these exact issues and more (Myrvoll et al., n.d.).
The number of hits includes duplicates and versions of interfaces. Author A filtered the search to include only content with the HTTP response status code 200, which indicates a successful server request. The numbers in Table 1 reflect web pages and other content that is collected as fully as possible by excluding redirects and errors (e.g., status codes 301 and 404). A description of the search procedure and the screenshots of the collected versions of streaming services’ interfaces in the library’s collection are available in an open access data repository (The Royal Danish Library, 2023).
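The filtering step can be illustrated with a minimal sketch. It does not reproduce the library's SolrWayback search. The records and the field names (`url`, `crawl_year`, `status_code`, `content_type`) are hypothetical stand-ins for index entries, and the only logic taken from the text is that entries with status code 200 are kept while redirects and errors are excluded.

```python
from collections import Counter

# Hypothetical index records; the field names and values are assumptions
# made for illustration and are not taken from the library's archive.
records = [
    {"url": "https://www.dr.dk/tv/", "crawl_year": 2015, "status_code": 200, "content_type": "html"},
    {"url": "https://www.dr.dk/tv/old", "crawl_year": 2015, "status_code": 301, "content_type": "html"},
    {"url": "https://www.netflix.com/dk/", "crawl_year": 2018, "status_code": 200, "content_type": "html"},
    {"url": "https://www.netflix.com/dk/missing", "crawl_year": 2018, "status_code": 404, "content_type": "html"},
    {"url": "https://www.netflix.com/dk/logo.png", "crawl_year": 2018, "status_code": 200, "content_type": "image"},
]


def successful(items):
    """Keep only entries whose HTTP status code is 200 (successful request)."""
    return [item for item in items if item["status_code"] == 200]


def count_by_year_and_type(items):
    """Count successfully collected entries per (crawl year, content type)."""
    return Counter((item["crawl_year"], item["content_type"]) for item in successful(items))


counts = count_by_year_and_type(records)
for (year, content_type), n in sorted(counts.items()):
    print(year, content_type, n)
```

Counts of this kind still include duplicates and repeated versions of the same page, so they describe the volume of the collection rather than its quality.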
We have included legacy versions of DRTV in Table 1. Before 2010, the interface on dr.dk/tv provided on-demand access to approximately 400 programs from DR’s two main channels. DR’s first streaming service was named DR NU (DR NOW) to emphasize the newness of its immediate on-demand offering and it was active for 4 years (2010–2014). The service was then rebranded as DR TV and later DRTV, with the original URL (/tv) re-directing between the latter two during and after the subtle name change. This explains the high number of web pages collected through the years on the original URL, https://www.dr.dk/drtv/.
To sum up, through a concerted effort that relies on a combination of collection methods the library is able to secure streaming services by way of their published content, published metadata, and their interfaces. Below we will analyze how well the current collection fares. The library registers streaming services in Denmark and streaming services that do and could produce or re-distribute Danish content. The register, based on an initial mapping, shows 70+ services. The number includes services based on text, sound or video and combinations hereof (The Royal Danish Library, 2023). The number of streaming services is high because of legacy services that are no longer available. Furthermore, the register overlaps with existing efforts to collect hybrid services that include social media and require yet a different set of collection methods.
At present, the library is not able to reconstitute a working version, also called playback, of any of the collected streaming services to display a legacy streaming experience for future users. However, the library can give research access to the separate parts in the library’s collections. Interestingly, access to the radio and TV section of the digital audio-visual collection functions as a subscription-based streaming service in its own right, which highlights the widespread adoption of streaming technologies in both commercial and public service sectors.
Collection methods used by the media researcher
Author B has been using a targeted approach to collect screenshots of the front pages of DRTV, TV 2 PLAY, Viaplay and Netflix from November 2019 until the present, which in total amounts to about 1,300 screenshots. The purpose of this collection is to make it possible to conduct longitudinal analyses of how these services’ front pages develop over time and to analyze developments within specific genres like reality shows or news content. The preliminary studies of material from this collection reveal not only the amount of content within a certain genre but also its placement and priority on the service compared to other content genres.
For media researchers, there can be several good reasons to create our own collection of screenshots of particular streaming services. The purpose of constructing a separate collection is to create a more specific and detailed data set about particular services. Specifically, Author B took screenshots of selected areas within the streaming services’ interfaces. Additionally, we might be interested in doing this several times over a given period in order to do a longitudinal analysis of exactly how that platform develops over time. Practically speaking, the downside is that it can be quite tiresome and time-consuming to take screenshots, especially if we want to document several services every time. Another downside is that a screenshot does not preserve the interactivity that characterizes the streaming experience. Additionally, the use of personalization by some services to cater to the user’s personal tastes can alter the appearance of the interface (Van den Bulck and Moe, 2018). To avoid this, Author B used a separate profile when this was possible. Optimally, this will allow for screenshots of a somewhat neutral version of the interface. However, we must acknowledge that this does not change the fact that users can and most likely will be presented with very different versions of these interfaces. In exposing the operational logics of the Netflix recommender system, Pajkovic (2021) also provides means to acknowledge and utilize this fact in the service of documenting the many ‘versions’ of the same streaming services. However, this makes it even more difficult to say what the ‘correct’ version of the service actually is, and perhaps all these versions are relevant to the researcher. Documenting the intended collection and reflecting on it alongside the specificities of the actual collected materials will make it easier to consider whether the collected version is correct, clean, or neutral (Maemura et al., 2018).
Since online content is dynamic and can change every minute, hour and day, it is challenging if not impossible to document all the many changes as they can quickly disappear again. For these exact reasons, it is important to generate a collection for documentation purposes with timestamps on the samples (Riffe et al., 2019: 73).
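One way to attach timestamps to samples is to record a small manifest entry for every screenshot. The following is a minimal sketch under stated assumptions, not a description of Author B's actual procedure: the function name `manifest_entry`, the field names, and the example values are hypothetical, and the checksum is an added suggestion for later verifying that a file has not changed.

```python
import hashlib
import json
from datetime import datetime, timezone


def manifest_entry(service, area, image_bytes, captured_at=None, profile="separate profile"):
    """Describe one screenshot: what was captured, when, and under which profile.

    All field names are illustrative assumptions. If no capture time is
    given, the current time in UTC is used.
    """
    captured_at = captured_at or datetime.now(timezone.utc)
    return {
        "service": service,
        "area": area,
        "captured_at": captured_at.isoformat(),
        "profile": profile,
        # The SHA-256 checksum lets a later reader verify the file is unchanged.
        "sha256": hashlib.sha256(image_bytes).hexdigest(),
    }


# Placeholder bytes stand in for the content of a real screenshot file.
entry = manifest_entry(
    "DRTV",
    "front page carousel",
    b"placeholder image bytes",
    captured_at=datetime(2022, 12, 1, 9, 0, tzinfo=timezone.utc),
)
print(json.dumps(entry, indent=2))
```

Recording the profile next to the timestamp also documents whether a sample was taken from a personalized or a somewhat neutral version of the interface.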
In content analysis in general, we can distinguish between various kinds of probability sampling (such as random, systematic or stratified sampling) or various kinds of non-probability sampling (such as convenience or purposive sampling) (Eskjær and Helles, 2015; Riffe et al., 2019). The sampling strategy that we choose will affect how we collect the data, for example, how often and how extensively we collect. In terms of how many samples we need, there is no simple answer as this depends on several conditions such as how much data we can manage, what degree of uncertainty we can accept, and the amount of resources available (Bryman, 2012: 197). Following Riffe et al. (2019), Author B used ‘constructed weeks’, which according to their statistical tests is a more efficient sampling strategy than both specific weeks and random selection. For instance, if we want to document 1 year in the life of a streaming service, then we can collect a number of constructed weeks (using random Mondays, random Tuesdays, etc.) (Riffe et al., 2019: 85). Still, what days we end up collecting can make a big difference, at least in our experience, since the front pages of some streaming services are very influenced by daily news and big media events (such as big sports events, elections, and holidays). Depending on the purpose behind our collection, we may see these special events as an undesirable bias in the sample or as an important moment that we really want to preserve. Again, documenting the planned process as well as actions taken when unscheduled events or process anomalies occur can increase levels of methodological transparency and reflexivity (Maemura et al., 2018).
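The constructed-week idea can be sketched in a few lines. The sketch below is one possible reading of the strategy (random Mondays, random Tuesdays, and so on, drawn from the whole period) and not Author B's actual schedule. The function name `constructed_weeks`, its parameters, and the example dates are assumptions made for illustration.

```python
import random
from datetime import date, timedelta


def constructed_weeks(start, end, n_weeks, seed=None):
    """Build n_weeks constructed weeks from the period start..end (inclusive).

    For each weekday, n_weeks distinct dates are drawn at random from all
    dates in the period that fall on that weekday. Each returned week is a
    list of seven dates, ordered Monday to Sunday.
    """
    rng = random.Random(seed)

    # Group every date in the period by weekday (0 = Monday, 6 = Sunday).
    by_weekday = {weekday: [] for weekday in range(7)}
    day = start
    while day <= end:
        by_weekday[day.weekday()].append(day)
        day += timedelta(days=1)

    # Sample without replacement so that no date is used twice.
    # random.sample raises ValueError if the period is too short.
    picks = {weekday: rng.sample(days, n_weeks) for weekday, days in by_weekday.items()}

    return [[picks[weekday][i] for weekday in range(7)] for i in range(n_weeks)]


# Example: two constructed weeks for one calendar year.
for week in constructed_weeks(date(2022, 1, 1), date(2022, 12, 31), n_weeks=2, seed=1):
    print([d.isoformat() for d in week])
```

Passing a fixed `seed` makes the schedule reproducible, which allows the planned collection dates to be documented before collection begins.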
Another important choice for the researcher is where to focus. Here we can choose to focus on the service’s front page in order to document the (arguably most important) page that the user sees first, or we can choose to dive deep into specific genres or areas of the service (e.g., sports, news, children’s content, etc.). Author B chose to focus on the service’s front page and specifically the carousel at the top of the front page, which many streaming services have. The carousel is usually an interesting space to document since this prominent placement is commonly where the services promote what they consider to be the most important content.
Comparison of collections
Overview of the two collections compared.
Besides their difference in how we have applied rules for paywalls, the major distinguishing factors are the collection timeframe and the number of services collected. In theory, the library’s collection contains any streaming service’s interface contents to the extent that they could be automatically collected from the Danish top-level domain (.dk) or via links to services on international domains. Author A chose a semi-open approach to find the most prominent streaming services and identified and made screenshots of 41 of the streaming services collected by the library. This means that other streaming services would exist in the library’s collection, certainly ones providing access to pornography or live sports that were initially excluded from the search. Author B chose a closed approach by collecting four streaming services in order to focus on the development of their interface for a period of three and a half years and onwards.
We compare what is collected of two streaming services: the international and commercial service Netflix and the national public service media DRTV. They represent the majority of streaming services in use in Denmark. With reference to the dimensions of streaming proposed by Spilker and Colbjørnsen (2020), we see that both VOD services provide professional content as opposed to user-generated content, offer on-demand access without live streams, and address a general audience rather than a niche audience via dedicated single-purpose services. To strengthen its general approach, the library is grappling with semi-automated collection of YouTube. To a limited extent, this effort covers other dimensions of streaming, namely, user-generated content that also caters to niche audiences via a multi-purpose service.
Netflix
Netflix.com was collected as early as 2005. At that time, it was not a streaming service. We opted to include it as historical context for the collection. However, we focus on the Danish version of Netflix that we found to have been collected from 2014 to 2022.
Netflix requires a subscription-based login to access its catalog. From 2014 to 2022, the library collected Netflix’s web pages before login, outside the paywall that separates the interface into a presentational Web site and the central video player software that serves the Netflix catalog. This means the library has no collected instances of the Netflix web-based video player’s interface.
From 2018 to 2021, the style sheet elements of the interface, which are a critical part of the building blocks that arrange the content, were not collected during the automated collection process. Figure 1 shows how this makes it difficult to discern how Netflix presented itself to users. The content of the interface (logos, buttons, and images) is displayed one after the other as a vertical list and the browser has defaulted to a standard font. Interestingly, we found images from the Web site that depict the Netflix interface on various devices after login, see Figure 1. From November 2019, we found short videos that highlight the interface with an emphasis on new functionalities. These are lucky breaks in what is otherwise a fragmented collection of interfaces, their content, and metadata. The images and videos have a unique identifier in the PWID XML files for each screenshot year.
Excerpt of screenshot of playback of collected Netflix Web site, 2018.
Particularly on commercial services like Netflix, Author B’s hand-held collection method has the major advantage of being able to collect content behind paywalls and inside services that require login. Even though Netflix does not have a carousel, we get a clear picture of the top tile/title, which in this case is the series Copenhagen Cowboy (Winding Refn, 2022). However, in this instance, we cannot tell from the interface whether we are presented with exactly this Danish-produced series because of personalization and/or geography since we only have access to the version of Netflix that is available in Denmark. This demonstrates why, in the case of international services like Netflix, we typically need comparative research and larger collaborations between researchers, and perhaps the industry, from various countries in order to understand the full scope of the service’s actions.
Danish Broadcasting Corporation's streaming service
Figure 2 shows the DR TV front page of 2015 with image placeholders in the grid, but no actual images have been collected. From 2012 to 2018, the presentational images were not collected from the interfaces of DR’s streaming service, DRTV, and its predecessors DR TV and DR NU.
Presentational images missing from playback of collected page for DR TV, 2015.
The applied method has collected the links that support the call to the API that provides the images to the interface. This could be interesting because we found that as far back as 2014, these presentational images are still available from DR’s server (DR, 2023). The library was not able to retrieve them at the time of collection. Given the images are still online, there is a chance that the library can retroactively harvest resources for its 2015 collection of DRTV and other instances of the web sites that might link to the same images. The links to the images are timestamped from the day of collection. This would help ensure the historical integrity of the collection in cases of re-collection.
When we further investigate one missing asset, a presentational image for a TV-series, we can find its URL registered in the Web archive. In fact, the collection method of automatically harvesting web resources, in use at the library, has registered the URL 43 times in 2015 without collecting the image: ‘Url has never been harvested: https://www.dr.dk/mu-online/api/1.3/Bar/505f492e860d9a340c5a1902’ (2023). There can be various reasons for this. The image could be used on several pages of the DR.dk domain. The automated collection process is typically set to a limited number of hops, meaning how many links deep it goes in requesting assets by ‘clicking’ the available hyperlinks on a web page, and it seemingly comes across the same URL several times. A recent study of the Danish Web archive found that ‘approximately 50% of the files in a broad crawl had been archived more than once’ (Brügger et al., 2020). Unlimited hops would vastly increase the data redundancy in the web archive. It is important to mention that this is common for the automated collection process. Once it is initiated, the web curators at the library rarely know precisely what happens during the automated collection process (Brügger, 2018: 54).
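To illustrate how a hop limit and shared assets can interact, the following minimal Python sketch walks a small, entirely hypothetical link graph breadth-first. It is not the library’s harvester: the page names, the `crawl` function, and the `max_hops` parameter are assumptions made for illustration only. The sketch merely shows that an asset referenced from several pages is registered once per referencing page, and that URLs beyond the hop limit are registered without being followed.

```python
from collections import Counter, deque

# Hypothetical link graph: page -> URLs it references (all names are illustrative).
LINKS = {
    "front": ["series-a", "series-b", "img-1"],
    "series-a": ["episode-a1", "img-1"],
    "series-b": ["episode-b1", "img-1"],
    "episode-a1": ["img-1"],
    "episode-b1": ["img-1"],
}


def crawl(seed, max_hops):
    """Breadth-first walk that follows links at most max_hops away from seed."""
    registered = Counter()  # how often each URL was encountered as a link
    visited = {seed}        # URLs that were actually followed
    queue = deque([(seed, 0)])
    while queue:
        url, hops = queue.popleft()
        for target in LINKS.get(url, []):
            registered[target] += 1  # the URL is noted every time it is seen
            if hops + 1 <= max_hops and target not in visited:
                visited.add(target)
                queue.append((target, hops + 1))
    return registered, visited


registered, visited = crawl("front", max_hops=1)
print(registered["img-1"])      # number of times the shared image URL was seen
print("episode-a1" in visited)  # whether a page beyond the hop limit was followed
```

In this toy graph, raising `max_hops` lets the walk reach more pages, and the shared image URL is registered correspondingly more often, which mirrors the trade-off between depth and redundancy described above.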
DR’s streaming service does not require login to view published content (for IP addresses placed in Denmark). However, the collection is missing the web player interface for most of the TV programs. This is a significant gap in the collection of DR’s streaming efforts. We did find collected instances of web pages for a specific TV program in which a web player was embedded. The collected page included metadata describing the content, its participants, duration, genre, etc. However, the collection process had not managed to collect the video file and, moreover, the player element displayed a typical error message, that is, ‘no video with supported mime type found’ or ‘Adobe Flash Player required’. This means that we can neither play back the TV program in situ nor see the play/pause button, audio volume controls, or other typical elements of the then DR TV web player.
From 2019 to 2022, more images were successfully collected from DRTV. Hence, playback of the front pages of DRTV is better populated with presentational images. We still have no playback of video files, but we do get a good indication of what everything surrounding the embedded web player looked like in 2022.
Figure 3 shows attempted playback of a TV documentary. Notice the red playback button that shows a frozen loading circle below a placeholder element that is missing its presentational image. From this collection we cannot know how the interface looked and what interactive features it presented to the user.
Attempt at playback of TV documentary episode from DRTV.
Figure 4 is a screenshot from Author B’s collection and shows the carousel at the top of DRTV’s front page – or rather one of the 11 titles and images that were in the carousel’s slideshow on that particular day in January 2023. Compared to the library’s collection, this approach has some advantages. First, it is an advantage that researchers can take screenshots, which capture the whole appearance of the service’s front page without the aforementioned missing elements from the library’s collection. Second, we also have the option to take several screenshots to get all of the 11 images and to collect the entire contents of the carousel instead of settling for just documenting one out of those 11 titles.
The carousel at the top of DRTV’s front page (January 2023).
However, one disadvantage can be that some services use autoplay and start playing presentational videos (trailers) automatically for the title in the front. If a switch to video-based screen capture is not an option, we must decide whether we will try to get the timing just right in order to collect the initial frame or accept a random sample, as is the case in Figure 5, where the character in the picture is saying the line ‘Mom?’.
Netflix’s front page while it is in the middle of automatically playing a trailer (January 2023).
Music and book streaming services seem to have more static interfaces than video streaming services. A quick glance at the library’s collection of Spotify interfaces with Danish content shows terrible results with empty redirects for the domain spotify.com and spotty attempts at collecting the web player on the URL https://open.spotify.com. From 2007 to 2022, the library cannot boast of even a moderate success in automatically collecting the Spotify playback interface.
Concluding discussion
In the following, we will focus on methodological challenges surrounding the collection process. We will divide our discussion into three sections. First, we will discuss the advantages and disadvantages of the two approaches. Second, we will discuss the extent to which the two collections can supply or support each other and in that way mitigate their methodological challenges. Finally, we will discuss steps to achieve greater transparency and better data about and from streaming services.
We have described the national library’s method as a general collection approach and the media researcher’s method as a targeted collection approach. A parallel difference is whether a collection has a macro-level and/or micro-level approach. To exemplify this, we briefly recount how the library has initiated a general automated collection from commercial aggregators of born digital music and books, along with a targeted automated effort to collect streaming-only TV programs from the two major public service TV-stations. However, the library’s current attempted micro-level collection effort provides but a narrow slice of the dimensions of streaming. So far, the method does not capture the graphical user interface, only video files, program descriptions, images, and various other metadata.
The researcher’s targeted approach seems to have a wider margin of success by collecting a few streaming services in full or thematically. This should provide data that satisfies that researcher’s immediate needs. However, it runs the risk of being too narrow and hence supporting very few insights into the different dimensions and roles of streaming services in contemporary society.
We have already touched upon another important distinction, which concerns the degree of automation of the collection process. At the library, curatorial and technical staff thoroughly test and improve upon their practice of automated collection through many years of quality control. Yet, the library’s collection only happens outside any paywalls, which is a key collection bias and analytical disadvantage. What is not automatically collected will be missing from the collection. The interfaces, arguably the most important aspect of the streaming experience, are missing. In other words, our evidence suggests that the automated process has deficiencies in terms of the ‘lack of depth’ during the collection of the interfaces of the streaming services.
In contrast, Author B has documented versions of streaming services’ interfaces with manual screenshots after having logged in to the services. If a web page does not load successfully, the researcher can just reload the web page to capture all the content in the subsequent screenshot. In the hand-held collection process, the researcher will inevitably tackle such obstacles in the short run on the micro-level. However, the obstacles could accumulate during a longer scheduled collection. In other words, the hand-held collection is subject to the researcher’s stamina. If collection fatigue sets in at some point, the consistency of that collection could suffer. This is a potential hindrance for studies that aim to provide longitudinal insights on the development of streaming services. Author B remedied this by downshifting from monthly to quarterly collection. It seems that the hand-held method has longitudinal potential while the library’s collection method is established as longitudinal. Since there is no exact time-based definition of a longitudinal study, we could argue that the library’s collection method should provide very lengthy longitudinal advantages. Its allocated work effort of testing and improving its method should increase the replicability of the automated method. A threat worth mentioning stems from the fact that the very lengthy longitudinal collection is subject to the library’s strategic priorities, which it must renegotiate every 3–5 years.
Still, the two methods could support each other simply because they cover different depths and timelines of streaming. We will describe two possible interdependencies, at the practical level and at the policy level. The targeted approach can document user profiles better and collect more micro-level details, for example, full samples of images in a carousel and aspects of personalization, which are missing in the automatic collection. The choice to collect, or attempt to collect, inside paywalls or logins is a very important one. The library’s general collection of other types of content and related materials from streaming services can serve as context for both methods. In the case of national services, the library should be able to give researchers access to the videos featured in the screenshots and historic news coverage of specific services. However, we should consider the methods as interdependent rather than one being subordinate to the other. In other words, researchers can help the library patch the obvious holes in past collections. Looking forward, researchers can help the library adjust and improve collections. Given its wide and ambitious scope, it should be in everyone’s interest to have an optimal, broad, and very lengthy longitudinal collection of streaming. It is a demanding task for the curators to assess what future needs will be. Researchers and eventually the public will continuously have to help the library assess the merit of its general approach that aims to collect everything automatically. This will require a better and continuous dialogue between the library and these stakeholders about what should be collected from the Internet (Brügger, 2018; Schafer and Winters, 2021). The same goes for the library and researchers concerning the various dimensions of streaming captured in varied datasets (Kelly and Sørensen, 2021: 87). The collections are thus mutually beneficial on a practical level.
We reiterate that they are interdependent at a policy level. The library initiated its collection practices based on investigations made by appointed experts and researchers that led to legal deposit legislation for dynamic web sites (Bache and Finnemann, 2003). Without renewed and continued exchange of advice and assessments of what needs to be collected there will not be a wide, long, and lasting contextual collection to draw from and build upon. As such, it is a fundamental interdependence at the level of policy and at the level of the individual collections of specific types of content, as shown here in the case of streaming services. There are multiple advantages of developing relationships between researchers and institutions. Not only do they facilitate greater access to data but they may also potentially increase access to the expertise and tools required to make sense of said data while also enforcing appropriate digital data preservation policies (Kelly, 2022: 16–17).
We wish to add that the streaming services are also important partners and collaborators. However, their preservation needs and research interests might not match those of the libraries and the researchers. Nevertheless, greater awareness of the legal deposit obligations and the benefits of research collaborations could produce better preservation of digital cultural heritage, more research insights, improved public-service offers, and potential commercial gains. The dynamic, elusive Netflix catalog seems like a hyperobject, an n-dimensional non-entity on par with the Internet itself (Morton, 2013). Yet, Netflix does have a research division and a Web site that links to its research (Netflix, n.d.). A quick survey of the site suggests that while its researchers do present at conferences and publish in ACM proceedings, they do not seem open to collaboration with independent researchers.
We can pick up, browse through, and pass on books made of paper and their born digital counterparts. We cannot, at this point in time, ‘press play’ in a collected version of the software and content assemblage we call streaming services. Their constituent parts are spread across various collections. Curators and researchers face a near-insurmountable task of reconstructing them as a piece of cultural heritage and as a research object. Potentially, an automated collection process inside login would collect every page, with its contents, that is linked from the interface shown in Figure 5. Barring any conventional download of content in a ripping manner, this would be akin to having a bot ‘watch’ all Netflix shows in real time or work its way through a curated playlist of content that is covered by Danish legal deposit law. Such a scenario presents a probable collection method with a very high degree of complexity. Which elements in the interface should be ‘clicked’ and in which sequence? Will the play button be consistently placed in the interface? Will the ‘more info’-button load a pop-up or take us to a different ‘site’ of the interface? Let alone the apparent impossibility of automatically collecting interactive TV-series!
In summary, this means that collecting streaming services is fraught with challenges. Nonetheless, we are hopeful since we can identify overlapping collections that can support each other. The collections can be built upon and further enriched by research requests if researchers and institutions, in collaboration with the services, acknowledge interdependencies and produce rich documentation and metadata.
We urge researchers and curators to help politicians realize how big a problem the lack of transparency around streaming actually is. As we have discussed, the actual interfaces of these services are not well preserved for future reference, and every single day, important material is regrettably not collected. Also, if a streaming service used its interface in a way that was societally problematic, the current state of our collections might make that difficult for anyone to prove. One solution is a political intervention that makes it mandatory for streaming services to hand over documentation of their service’s appearance (inside paywalls) to the library on a recurring basis. Another major transparency issue is the lack of reliable ‘ratings’ or numbers that document streaming use, which most services do not share with the public. Altogether, these circumstances show how streaming services are difficult to research and how they escape important critical observation.
In the meantime, we recommend that more parties test and share their experiences of using various tools that support archival and research-based collection of online materials. An example of this is Webrecorder’s software Browsertrix Cloud, which provides a user interface for non-developers while being compliant with standardized Web archive formats (Myrvoll et al., n.d.). As tools become easier to use, it is increasingly important to document choices made before and during the collection to increase the methodological transparency and reflexivity of a given collection. This will help future exchanges and access to collections of streaming services.
Acknowledgements
We would like to thank the following staff at the Royal Danish Library for their invaluable support and curiosity: Program manager for The Danish Web archive (Netarkivet) Anders Klindt Myrvoll, Web curators and technicians Thomas Martin Elkjær Smedebøl, Stephen Hunt, and Thomas Egense. We are also thankful for the assistance we received from student helper Johan Flensmark. We highly appreciate our reviewers and the editors of the journal for their tremendously valuable critiques and thoughtful suggestions.
Funding
A grant from the Ministry of Culture Denmark funded the research project that provided insights for this article: FPK-2021-0004.
