WARChain: Consensus-based trust in web archives via proof-of-stake blockchain technology¹

Abstract

Web archives store born-digital documents, which are usually collected from the Internet by crawlers and stored in the Web Archive (WARC) format. The trustworthiness and integrity of web archives are still an open challenge, especially in the news portal domain, which faces the additional challenge of censorship even in democratic societies. The aim of this paper is to present a lightweight, blockchain-based solution for web archive validation, which would ensure that documents retrieved by crawlers can be verified as authentic for many years to come. We developed our archive validation solution as an extension and continuation of our work in web crawler development, mainly targeting news portals. The system is designed as an overlay over a blockchain with a proof-of-stake (PoS) distributed consensus algorithm. PoS was chosen due to its lower ecological footprint compared to proof-of-work solutions (e.g. Bitcoin) and lower expected investment in computing infrastructure. We based our prototype on the open-source Nxt blockchain and implemented it in Python. The prototype was tested on web archive content crawled from Hungarian news portals at two different timestamps, with more than 1 million articles in total. We concluded that the proposed solution is accessible, usable by different stakeholders to validate crawled content, deployable on cheap commodity hardware, tackles the archive integrity challenge and is capable of efficiently managing duplicate documents.

Keywords

Web archive, validation, blockchain, proof-of-stake, web crawling, censorship

1. Introduction

Web archives are snapshots of web content collected by processes called crawlers. The contents of such web archives are trusted only as much as the institutions or individuals who created them, e.g. we can (mostly) trust that the archives held by national archives in democratic societies are unaltered and stored as they were crawled in the past. These repositories may become targets of hacking attempts aiming to reduce consensual trust and cause confusion in order to increase the value of certain preferred, ‘alternative’ information sources. This is an open societal challenge, as there is broad consensus about the significance of being able to research and access web content which was available five, ten or even more years ago. Unfortunately, there are various reports about attempts to alter web archives or the original content (i.e. ‘cancel culture’). If content sought to be removed is still available in the archives and can be used as proof, the archives will soon be the next target. This is an additional reason warranting the development of solutions which distribute trust and allow different stakeholders (e.g. archive holders and researchers) to participate in the validation of at least the archived content.²

²
As the number of content providers is orders of magnitude greater than the number of memorial institutions, it is easier to develop and test a standard workflow and extend it to the scale of the content providers.

Considering the fact that web archival efforts are usually understaffed, underfunded and lack computer scientists and system administrators to develop and maintain their systems, any solution in this domain should be easy to deploy, maintain and run on cheap hardware attainable in limited budgets.

Our goal is therefore to propose a blockchain-based web archive validation solution which tackles trust, is simple and cheap to implement. It should be accessible to different web archive stakeholders, e.g. national archives, research institutions and groups, as well as the general public. Essentially it can be one piece of a puzzle allowing future generations to enjoy and research the Internet as it was in the past.

2. Related works

Born-digital (cultural) heritage can be regarded as an unwanted orphan: most undertakings in the domain face a lack of financial resources (i.e. formal funding) and rely entirely on the competence and enthusiasm of the human element involved [26]. Web archives are used for long-term storage of (usually static) born-digital documents harvested from the web [11,12,15,16,42]. They allow content to be replayed in the future in close resemblance to the original versions at the time of capture [4,5]. The Internet Archive’s (IA) Wayback Machine [17] is the largest web archive.³

³
Internet Archive, https://archive.org/.

Standardized data access via uniform application programming interfaces to the IA and other archives is usually supported by the Memento framework [3,34]. A memento is a timestamped, archived version of a resource retrieved from the web in the past. Content retrieved from a web archive (i.e. a memento) is often marked with an archival replay banner consisting of data about the memento (i.e. archived resource) and the corresponding original resource [2]. It is imperative to know that a memento was not forged or altered beyond the necessary changes for replay, e.g. rewriting links inside documents. Signed Exchanges are a step towards achieving those goals, yet their reliance on short-lived digital signatures limits their applicability in long-term web archival projects. When combined with Certificate Transparency, they might become a viable temporally-aware digital signature validation model [20]. Web Packaging is a novel standard for bundling and optionally signing resources for authenticity preservation and non-repudiation [31,41]. The Text Encoding Initiative (TEI) recommendation is not often used as a base format in web crawling and archiving workflows, although it allows versioning and annotation of enriched articles, allows the metadata and the body of the document to be stored structurally in one file, and supports verification of the authenticity of the source text [29]. An example where TEI XML is an optional output format is available in reference [6].

The above initiatives aim to address real-life threats of censorship and tampering in web content with trusted archives, which is a justified research and development effort, as there were reports about manipulated web content (e.g. politically-correct (PC) speech and ‘cancel culture’ in general) and even archives, e.g. blog posts in the Wayback Machine [8]. Lernet et al. [23] go one step further and present proof-of-concept attacks for showing deliberately modified resources to users without tampering with the archive content itself.

Additional important challenges in web archives are duplicates, as well as inaccurate or missing metadata and unwanted boilerplate text [13,24,27,33]. Countering the above-listed challenges in the web archival domain is a task which is yet to be fully solved. Currently trust is reputation-based, which means that we either trust an institution or individual maintaining a web archive, or we do not.⁴

⁴
Trust in the institution can therefore be easily manipulated by pulling emotional strings, reducing its popularity and therefore the trust placed in it.

There are proposals to rely on blockchain technology [21] and apply a distributed ledger [36] on the Ethereum infrastructure to ensure the long-term integrity of public web archives [10] thereby ensuring public trust in the archived resources [37]. Essentially, they are advocating a shift from institutional trust to technological trust, i.e. instead of institutions, they propose to trust technology (i.e. the blockchain), which is guaranteed to be impartial. When considering blockchain-based solutions, the use of proof-of-stake [19] solutions significantly reduces the computing power necessary to reach distributed consensus and thereby becomes a viable alternative to platforms based on proof-of-work. PoS blockchains were not yet part of the mainstream in 2020–2021, e.g. Ethereum Casper was scheduled to be introduced in January 2020, but it was delayed [30].

Based on the above we concluded that data acquisition, storage, retrieval, analysis, indexing/searching and visualization are the key topics and challenges both practitioners and researchers active in the web archival domain try to face. Conservation and preservation are on the radar but defined as concepts of lesser relevance. Security in general with topics such as trust, integrity (i.e. protection against alteration), censorship and other forms of unwanted filtering and modification are seldom addressed issues in digital heritage in general and web archives specifically [26, Fig. 14].

3. Problems to solve

As we stated above, researchers and practitioners of web archiving prioritize the practical problems of storing and searching web-crawled data over the more complex heritage curation activities of conservation and preservation. There is also an obvious lack of trust in the research communities, which in many respects goes back to the complexity of handling web-crawled material as well as to the way memory institutions curate their collections. Within this context we specifically focus on stakeholder trust in web archival solutions on the metadata, document and political levels.

3.1. Metadata level

Although certain efforts aim to standardize the process and shape future web archives, they are confined to the metadata level [1]. The textual level of web content curation is not standardized, and this drawback is the main reason that born-digital material, and more specifically web archives, are underrepresented in current scientific research compared to analogue or digitized sources. The fact that there are certain efforts to standardize the metadata describing crawled documents does not mean that adequate amounts of high-quality metadata are available to researchers. A scientific citation of a source must contain stable authorship and temporal metadata to fulfil scientific standards, and these are most often lacking in the case of born-digital sources [39].

The Internet mostly consists of discrete documents (scientific papers, news articles, forum posts, product descriptions and ratings, etc.), which contain (at least implicitly) metadata. These metadata are often treated as secondary when it comes to maintaining or standardising them to facilitate usage in (internal) archives or global search engines.⁵

⁵
Such big companies have the resources to handle diverse documents at scale; therefore they can quickly become targets of censorship and are prone to losing trust.

The authenticity and long-term preservation of the sources are left even further behind. Trusted preservation of such metadata (e.g. cryptographic hashes of content) could protect against tampering and censorship if content creators used a solution to create and maintain standardised metadata from their data and deposited them voluntarily to preserve their authenticity and increase public trust in such providers. However, without exact standard metadata we have to go one level deeper, to the actual documents.

3.2. Document level

Documents crawled from the web lack three other important features, which hinders their use in research as trusted sources. (1) There are identical or, what is worse, nearly identical versions of the “same” document available in a web archive, crawled at different timestamps. In current web applications, an HTML document can differ between crawls done within seconds of each other, as it is generated on the fly by portal engines. (2) The textual integrity of a born-digital object is far more complex than that of an analogue one, as distinguishing between dirt (boilerplate) and the actual content is often more than a simple technical decision. (3) As web documents are easily alterable, it is very hard to track the actual agent and the motivation behind the changes, which in many cases are made by purely technical actors, sometimes by the authors/editors themselves for various reasons and to various extents, and sometimes by malicious algorithms or personae.
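The difference between identical and nearly identical captures in point (1) can be illustrated with a minimal sketch. It is not part of the described prototype: the whitespace and case normalization, the `difflib` similarity measure and the 0.9 threshold are all assumptions chosen for illustration.

```python
import hashlib
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase, so trivial rendering differences vanish."""
    return re.sub(r"\s+", " ", text).strip().lower()


def classify_pair(a: str, b: str, threshold: float = 0.9) -> str:
    """Return 'identical', 'near-duplicate' or 'different' for two captures.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    na, nb = normalize(a), normalize(b)
    # Exact match after normalization: compare SHA-256 digests.
    if hashlib.sha256(na.encode("utf-8")).digest() == hashlib.sha256(nb.encode("utf-8")).digest():
        return "identical"
    # Otherwise fall back to a character-level similarity ratio in [0, 1].
    ratio = SequenceMatcher(None, na, nb).ratio()
    return "near-duplicate" if ratio >= threshold else "different"
```

A hash comparison alone only detects exact matches; a page that differs by a generated timestamp produces a completely different digest, which is why a separate similarity measure is needed to recognise near-duplicates.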

Luckily, there are good examples of versioned, collaboratively edited, born-digital knowledge bases, e.g. Wikipedia and its derivatives. They adhere to most of the aforementioned technical requirements. There is also WordPress, an open-source CMS which held more than half of the market in 2021.⁶

⁶
https://www.tooltester.com/en/blog/cms-market-share/ last accessed, 2021/10/01.

If the developers of such platforms adopted the required curation features, the Internet would become friendlier to digital heritage. Apart from the missing technical know-how that we want to address in this paper, there are further, purely political issues to be solved as well.

3.3. Politics

Traditional archives have always been embedded in a context of power, channelling access to the past in more ways than one: first, by regulating the entry of documents into the archives; second, by organising the archive’s material in a way that determined what was searchable in the sea of documents; and third, by regulating who has access to the documents.

This complex system, which Cornelia Vismann describes so thoroughly [35], has only become more complex as culture, as Manuel Castells puts it, is now being digitally mediated [9]. Over the two-decades-long history of web archiving [38], the stakeholders directly involved in the archiving process have become increasingly heterogeneous. The first and most pressing problem is that a good part of the web communication to be archived today is in the hands of companies whose services have themselves become archives. However, they are even less able than traditional archives and international or national web archives to meet the strict principles of archiving and curation: reliability, stability and transparency. There are a great number of well-documented examples of such data destruction or loss, such as the closure of GeoCities by Yahoo! in 2009 or the data loss from MySpace following “a server migration”⁷ [25].

⁷
Chokshi, N. (2019). Myspace, once the king of social networks, lost years of data from its heyday. New York Times. https://www.nytimes.com/2019/03/19/business/myspace-user-data.html last accessed, 2021/10/01.

The delicate and precarious balance between service providers, content producing institutions and individuals, international and national web archiving initiatives, and the growing number of audiences (academics, politicians, companies, individuals) who use the results of web archiving is also threatened by the uncertainty of the legal environment. One symbolic example is the attack by publishers on the Internet Archive.⁸

⁸
https://blog.archive.org/2020/07/29/internet-archive-responds-to-publishers-lawsuit/ last accessed, 2021/10/01.

But copyright holders are only one of the groups that deny the social necessity of web archiving, or at least some of its practices. There are counter-interests at both institutional and private levels. One type is the influence of power censorship, which distorts the correct representation of web content in web archives. This includes not only the practices of dictatorships restricting access to the Internet but also cases where web archiving or access to web archives is blocked for political reasons. In November 2013 the Conservative Party of the United Kingdom deleted more than a decade’s worth of speeches from its website. The party also blocked access to the Internet Archive’s Wayback Machine [40].

But it is not only political power that can be confronted with the practice of web archiving but also the private sphere of the individual. This is illustrated by the case of Tumblr’s decision to delete 700,000 blogs classified by an AI tool as containing prohibited sexual content. Volunteers made a desperate attempt to archive the material,⁹ thereby preserving a mass of private content not exactly destined for the archive.

⁹
Captain, S. (2018). The frantic, unprecedented race to save 700,000 NSFW Tumblrs for posterity. Fast Company. https://www.fastcompany.com/90279321/the-frantic-unprecedented-race-to-save-700000-nsfw-tumblr-for-posterity last accessed, 2021/10/01.

It is even possible that civil initiatives with an otherwise obviously indispensable memory policy function, such as Documenting the Now, which aims to archive the social media presence of social movements, will confront the global web archiving institution, the Internet Archive, in order to protect the security of individuals [28].

One of the many challenges facing web archiving, and maybe one of the largest, is responsiveness. The COVID-19 pandemic has shown that the centuries-old policy architecture of traditional archives is unsustainable in a born-digital culture and that documenting sudden environmental and social changes requires archivists to react with a speed previously unimaginable. The difficulties of web archiving at the data and policy level mean that it is almost impossible to maintain a stable consensus among all stakeholders, which would be a prerequisite for building trust in web archives and for embedding the whole practice in society. We therefore argue that, in addition to the trust that can be achieved discursively, greater emphasis should be placed on achieving trust through technical means. It is very complicated, maybe even impossible, to provide solutions for each of the above problems. In the following, we summarize the threats faced by web archives and propose a blockchain-based solution which addresses at least some of the challenges identified.

4. Threat model

In threat modelling, one attempts to think like an attacker and identify threats to different assets of a system, taking into consideration assumptions about the capabilities of the potential threat sources (i.e. attackers). In most threat modelling approaches it is necessary to create an inventory of relevant assets, list potential threat sources (attackers) and utilize a proper framework for identifying threats [7]. Microsoft’s STRIDE is commonly used to group threats into six broad categories, from whose names it obtained its abbreviation, namely: spoofing identity, tampering with data, repudiation threats, information disclosure, denial of service and elevation of privilege [32].

The assets we aim to protect with the WARChain system are born-digital documents. In the context of our specific analysis, we focused on the protection of newspaper articles published by news portals. We performed an attacker-centric threat modelling and identified the following categories of threat sources, which were actually the usual suspects targeting most types of infrastructures, but with different motives in our specific context:

Nation states – either democratic or autocratic states which push propaganda or target content which is not aligned with their national interests, e.g., criticism against a certain government.

Affected external individuals, e.g., influential politicians or businessmen intent on hiding or modifying content published about their activities.

Criminal organizations, which are either hired by nation states or individuals to perform attacks, or are intent on modifying born-digital articles for their own purposes.

Companies, mainly multi-nationals intent on covering up or shaping facts published by independent journalists.

Insiders employed by the news portals or their subcontractors motivated by financial rewards, revenge or ego.

We analyzed the current ecosystem of online news portals and their usual data management and backup policies and identified the following most important threats grouped into the STRIDE categories:

Spoofing: (1) Hide or alter the true identity of the original author or entity modifying a news article.

Tampering: (2) Data integrity attacks against accessible news articles (e.g., tampering with news portal data), (3) backup or historical data integrity attacks (e.g., modifying a news entry in the Wayback Machine) or (4) fake news generation and insertion.

Repudiation: (5) Claim that a news entry was created or modified by somebody else.

Information disclosure: (6) Disclose author information for articles with sensitive content e.g., criticism against an authoritarian government, potentially leading to arrests and prosecution.

Denial of service: (7) Intentional harm caused to data storage devices in a data center or backup facility with the ultimate goal to render certain born-digital documents unavailable for good.

Elevation of privilege: (8) News portal back-office access and privilege escalation or (9) the same attack in backup systems used to store historical newspaper article data. These threats can result in (often) untraceable spoofing, tampering, repudiation or information disclosure. Essentially, this threat category is an enabler for other threat categories in our specific context.

We will show that our proposed solution successfully tackles a significant subset of the challenges identified in this section. More specifically, it will be a valuable asset in countering spoofing, tampering and repudiation threats.

5. Solution

We propose to introduce the system architecture shown in Fig. 1, which consists of crawlers, validators, archived content storage and consumers. Crawlers, storage and consumers are existing components and out of scope for this paper. Crawlers harvest documents from the web [18], which are stored in archives (depicted by the WARC [15] Store processes) and subsequently used by consumers. The key novel elements of the proposed WARChain system (the name was coined as a combination of the words ‘web archive’ and ‘blockchain’) are the validator processes. The task of these proposed nodes is to participate in a distributed consensus algorithm with the goal of validating each new web archive entry deposited by the crawlers, as well as to certify the validity of each entry when requested by peer nodes or archive consumers, depicted by the laptops in the bottom right of Fig. 1. This solution is intended to replace similar, informal validation systems based on document hashes and institutional document repositories (e.g. in Zenodo communities¹⁰). It is important to note that we propose to introduce a distributed system of multiple such nodes implemented at different stakeholders worldwide, thereby distributing trust.

¹⁰
Zenodo, https://zenodo.org/.

We consider crawlers, storage nodes and validators as trusted elements of the global web archival ecosystem, which might be targeted by the diverse threat sources identified in Section 4. Consumers are not necessarily trusted and might be considered as potential threat sources. We further propose that the validators reach distributed consensus by relying on a proof-of-stake (PoS) blockchain and majority voting. We advocate for the use of PoS as opposed to proof-of-work (PoW) solutions, which consume significant amounts of computing power and electricity, thereby increasing their ecological footprint, i.e., they are dirty solutions. The ‘stake’ in such a system would not be measured by the size of financial deposits, but instead by the trust placed in the institutions participating in the system which host the validator nodes. The trusted systems are depicted in Fig. 1 by the icons of academic and government institutions, which can be national archives, universities or other trusted institutions. One such participating institution could host different node types, e.g. a university or other research outfit might host a crawler, a storage node, one or more validators and consumers, who in this context would be the researchers accessing the trusted web archives.

Fig. 1.

WARChain system architecture.

In our proposed architecture not all node types participate in the PoS blockchain; more specifically, only the WARChain validator nodes participate in it. Validators might be hosted only by trusted institutions, which stake the public trust put in them while participating. Ideally, only national archives and highly respected higher education and research institutions would be allowed to delegate validator nodes to the system, thereby making it a semi-private blockchain used for a specific purpose. This is different from the general-purpose blockchains usually utilized by cryptocurrencies and other common usage scenarios. The authors of the ARCHANGEL solution [10] present a somewhat similar solution, with the notable differences that it relies on a public proof-of-work blockchain, which wastes significant amounts of energy, and that it does not formally introduce the different node types shown in Fig. 1.
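The idea of majority voting among validators whose ‘stake’ is institutional trust can be illustrated with a minimal sketch. It is a hypothetical illustration only: the numeric trust weights, the function name and the strict more-than-half rule are assumptions, and the sketch does not reproduce how the underlying PoS blockchain selects block producers.

```python
from collections import defaultdict
from typing import Dict, Optional


def weighted_majority(votes: Dict[str, str], trust: Dict[str, float]) -> Optional[str]:
    """Return the document hash backed by more than half of the total trust, else None.

    votes: validator id -> document hash that validator reports for an entry
    trust: validator id -> non-negative trust weight (the hypothetical 'stake')
    """
    tally = defaultdict(float)
    for validator, doc_hash in votes.items():
        # Validators without a registered trust weight contribute nothing.
        tally[doc_hash] += trust.get(validator, 0.0)
    total = sum(trust.values())  # includes validators that did not vote
    if not tally or total <= 0:
        return None
    winner, weight = max(tally.items(), key=lambda kv: kv[1])
    return winner if weight > total / 2 else None
```

Measuring the threshold against the total registered trust, rather than only the trust of those who voted, means that a small group of validators cannot certify an entry while the rest are offline.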

Essentially, we propose an extension of the existing systems in which web archives are crawled and stored in storage nodes utilizing on-premises or cloud-based glacier storage, which can be rented even on a limited budget. The novel processes participating in the WARChain are used only for storing the limited information necessary for web archive validation. We propose a system in which trusted crawlers harvest documents from the web, persist them into the storage nodes and generate tuples of additional validation information consisting of the following elements:

Crawler process or institution identifier. Storage complexity: up to 128 bits for a globally unique identifier.

Crawl date. The timestamp of the document crawl. Storage complexity: up to 64 bits for a Unix timestamp.

URL hash. A hash of the URL of the crawled document. Storage complexity: 256 bits (if SHA256 is used).

Document hash. A hash of the document harvested from the URL specified. Storage complexity: 256 bits.

The above tuple ensures that each raw document stored in the storage archive has a unique and irrefutable link to its corresponding block and/or transaction within the blockchain, i.e. the crawler identifier, crawl date and URL hash uniquely identify a web document crawled by a specific crawler at a specific time.
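The four fields and their sizes can be made concrete with a minimal sketch of a fixed-width binary encoding. The class name, the use of a UUID as crawler identifier, the big-endian layout and UTF-8 encoding of the URL are assumptions made for illustration; the text does not prescribe a serialization.

```python
import hashlib
import struct
import uuid
from dataclasses import dataclass

# 16-byte crawler id, unsigned 64-bit big-endian timestamp, two 32-byte SHA-256 digests.
_LAYOUT = struct.Struct(">16sQ32s32s")


@dataclass(frozen=True)
class ValidationTuple:
    crawler_id: uuid.UUID   # 128 bits
    crawl_time: int         # Unix timestamp in seconds, 64 bits
    url_hash: bytes         # SHA-256 digest of the URL, 256 bits
    doc_hash: bytes         # SHA-256 digest of the document, 256 bits

    @classmethod
    def from_crawl(cls, crawler_id: uuid.UUID, crawl_time: int, url: str, document: bytes):
        """Hash the URL and the raw document and build the tuple."""
        return cls(crawler_id, crawl_time,
                   hashlib.sha256(url.encode("utf-8")).digest(),
                   hashlib.sha256(document).digest())

    def pack(self) -> bytes:
        """Serialize to exactly 88 bytes (704 bits)."""
        return _LAYOUT.pack(self.crawler_id.bytes, self.crawl_time,
                            self.url_hash, self.doc_hash)

    @classmethod
    def unpack(cls, raw: bytes):
        cid, ts, url_hash, doc_hash = _LAYOUT.unpack(raw)
        return cls(uuid.UUID(bytes=cid), ts, url_hash, doc_hash)
```

With this layout the packed size is 16 + 8 + 32 + 32 = 88 bytes, matching the sum of the per-field storage figures listed above.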

We propose to use strong one-way hashing functions (e.g. SHA256) which are expected to be available and secure for many years to come. Additionally, the chosen hashing function should be ready for the coming age of quantum computers, i.e. able to resist hacking attempts performed on novel quantum computers of ever increasing strength.

The storage complexity of this solution is limited, as for each (id, timestamp, URL hash, document hash) tuple we propose to introduce up to 704 bits (128 + 64 + 256 + 256 bits), i.e. 88 bytes, of additional data per web document. Considering that the Bitcoin blockchain’s size exceeded 350 GB in November 2021,¹¹ an equally sized WARChain would be able to contain a theoretical maximum of validation information about roughly 4 billion documents (350 GB divided by 88 bytes per tuple), which is comparable to the size of the currently indexed Web (3.8 billion pages).¹²

¹¹
https://www.blockchain.com/charts/blocks-size.

¹²
https://www.worldwidewebsize.com.

Obviously, an implemented version would have slightly higher storage consumption, but would still be able to contain validation information about large portions of the indexed Web, while having a similar storage complexity to Bitcoin. Based on the above calculation we estimate that WARChain could contain validation information for web archives covering the entire indexed web, while running on a general-purpose personal computer with storage devices available today, e.g. on a few disk drives with 1–2 TB capacity.
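The capacity estimate can be reproduced with a short worked calculation. This is a back-of-the-envelope sketch: it assumes that 350 GB means decimal gigabytes and ignores all block and transaction overhead. The conversion from bits to bytes is kept explicit, because reading the 704 figure as bytes instead of bits changes the result by a factor of eight.

```python
TUPLE_BITS = 128 + 64 + 256 + 256      # 704 bits per document, as listed above
TUPLE_BYTES = TUPLE_BITS // 8          # 88 bytes

CHAIN_BYTES = 350 * 10**9              # assumption: 350 GB read as decimal gigabytes
INDEXED_PAGES = 3.8e9                  # indexed Web size quoted in the text


def max_tuples(chain_bytes: int, tuple_bytes: int = TUPLE_BYTES) -> int:
    """Upper bound on the number of tuples that fit, ignoring any overhead."""
    return chain_bytes // tuple_bytes


def bytes_needed(documents: float, tuple_bytes: int = TUPLE_BYTES) -> float:
    """Raw bytes required to hold one tuple per document."""
    return documents * tuple_bytes
```

Under these assumptions `max_tuples(CHAIN_BYTES)` gives roughly 3.98 billion tuples, and `bytes_needed(INDEXED_PAGES)` gives about 334 GB for one tuple per indexed page.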

6. Implementation

In Fig. 1 we visualized crawlers, storage, validators and consumers. Our team implemented and experimented with all of the listed components. We discussed crawlers and storage in [13] and [14]. As a proof of concept we implemented the WARChain as an overlay over the open-source Nxt proof-of-stake blockchain and published our prototype on GitHub.¹³ In this section we discuss the implementation of the WARChain and its underlying infrastructure in more detail.

¹³
https://github.com/lendak/warchain.git.

6.1. The Nxt blockchain

Instead of the EduPoS blockchain prototype utilized in reference [22], this research was based on the Nxt proof-of-stake blockchain.¹⁴

¹⁴
https://nxtdocs.jelurida.com/Nxt_Whitepaper

Nxt was chosen as it is easy to install, it is open source, has an easy-to-use web-based application programming interface and implements a proof-of-stake distributed consensus algorithm which is less energy intensive compared to proof-of-work solutions like Bitcoin.

Nxt’s private blockchain evaluation kit¹⁵ is an Nxt bundle which is easy to deploy and suitable for academic research and evaluation. A notable potential drawback of Nxt in our use case is that it is published under the Jelurida Public License (JPL). JPL is an extended version of GPL with the addition of an airdrop requirement if a developer decides to create an Nxt fork and use it as a cryptocurrency.¹⁶ Fortunately, this limitation did not materialize in our case, as we did not use our Nxt-based blockchain for cryptocurrency.

¹⁵
https://www.jelurida.com/nxt/evaluation

¹⁶
https://www.jelurida.com/resources/jpl

The Nxt evaluation kit comes with ten genesis blocks which allowed us to start experimenting with an empty blockchain, but with multiple usable and well-funded accounts. The existence of funds was relevant as both cryptocurrency transfers and sending messages incur fees in Nxt.

6.2. Validators

Each WARChain validator node was an Nxt peer, i.e. a full member of the Nxt blockchain. Our additional code implemented the following domain-specific functionality:

A searchable dictionary of (URL hash, document hash) pairs.

A counter and reporting capabilities for duplicate entries.

Detection and reporting of potentially tampered-with documents which are reachable via a shared URL, but differ between crawls.
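The three functions listed above can be sketched as a small in-memory index. This is a hypothetical illustration and not the code of the published prototype; the class name, method names and status labels are assumptions.

```python
from collections import defaultdict


class ValidatorIndex:
    """In-memory index: URL hash -> list of distinct document hashes seen."""

    def __init__(self):
        self._docs = defaultdict(list)   # url_hash -> [doc_hash, ...] in arrival order
        self.duplicates = 0              # number of repeated (url_hash, doc_hash) pairs

    def add(self, url_hash: str, doc_hash: str) -> str:
        """Register a pair and return 'new', 'duplicate' or 'changed'."""
        known = self._docs[url_hash]
        if doc_hash in known:
            self.duplicates += 1
            return "duplicate"
        # A known URL with an unseen document hash differs between crawls.
        status = "changed" if known else "new"
        known.append(doc_hash)
        return status

    def lookup(self, url_hash: str):
        """Return all document hashes recorded for a URL hash (possibly empty)."""
        return list(self._docs.get(url_hash, []))

    def changed_urls(self):
        """URL hashes whose content differed between crawls (potential tampering)."""
        return [u for u, hashes in self._docs.items() if len(hashes) > 1]
```

A 'changed' status only signals that the content behind a shared URL differs between crawls; whether that is tampering or a legitimate edit still has to be decided by other means.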

The validator nodes worked with (crawler id, crawl time, URL hash, document hash) tuples. One example tuple serialized to JSON is shown in Listing 1.

Listing 1.

Document hash representation in JSON

The document and URL hashes were created with the SHA256 cryptographic one-way function with an empty salt value in this example. SHA256 is considered a strong one-way function, which will most probably remain secure for an extended period of time.
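A JSON serialization of such a tuple could look like the following minimal sketch. The field names, the string crawler identifier and the choice to prepend the salt to the data are hypothetical assumptions for illustration and are not taken from the prototype.

```python
import hashlib
import json


def sha256_hex(data: bytes, salt: bytes = b"") -> str:
    """SHA-256 of salt + data as a hex string; the default is an empty salt."""
    return hashlib.sha256(salt + data).hexdigest()


def to_json(crawler_id: str, crawl_time: int, url: str, document: str) -> str:
    """Serialize one (crawler id, crawl time, URL hash, document hash) tuple."""
    record = {
        "crawler_id": crawler_id,          # hypothetical field names throughout
        "crawl_time": crawl_time,          # Unix timestamp in seconds
        "url_hash": sha256_hex(url.encode("utf-8")),
        "document_hash": sha256_hex(document.encode("utf-8")),
    }
    return json.dumps(record, sort_keys=True)
```

Sorting the keys makes the output deterministic, so two nodes serializing the same tuple produce byte-identical JSON.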

6.3. Crawler simulator

As we intended to work with large WARC files [15] and switch between different versions of our prototypes, we needed an efficient way to experiment with real-life WARC files. Therefore, we decided to implement a crawler simulator in Python. Stage one of the simulator works with WARC files by relying on the WARCIO streaming library.¹⁷ It extracts URL and article data and saves them to easily accessible Excel and comma-separated formats. We decided to extract the newspaper articles from the archives we worked with, i.e. we focused only on article text and removed the HTML markup and media files by relying on existing functionality in the BeautifulSoup library¹⁸ and our own custom WARC content filtering.

¹⁷
https://github.com/webrecorder/warcio.

¹⁸
https://www.crummy.com/software/BeautifulSoup/.

Stage two of the simulator loads the cleaned URL and article data, creates transactions and pushes them into the WARChain. This part of the simulator was wrapped as a console application.
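The two stages can be sketched as follows. This is a simplified illustration, not the published simulator: it omits the intermediate Excel/CSV files and the custom content filtering, the JSON field names are hypothetical, and it assumes that tuples are submitted as Nxt arbitrary messages through the `sendMessage` API request, which the text does not state explicitly. The request is only built, not sent.

```python
import hashlib
import json
from urllib.parse import urlencode


def iter_articles(warc_path):
    """Stage one: yield (url, text) for each HTTP response record in a WARC file."""
    # Third-party imports are kept local so the helpers below work without them.
    from warcio.archiveiterator import ArchiveIterator
    from bs4 import BeautifulSoup

    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()
            # Strip markup and keep only the visible text.
            text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
            yield url, text


def build_message(crawler_id, crawl_time, url, text):
    """Stage two: serialize one validation tuple as a JSON message body."""
    return json.dumps({
        "crawler_id": crawler_id,            # hypothetical field names
        "crawl_time": crawl_time,
        "url_hash": hashlib.sha256(url.encode("utf-8")).hexdigest(),
        "document_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }, sort_keys=True)


def build_request(message, recipient, secret_phrase, fee_nqt=100000000, deadline=60):
    """Form-encode parameters for an Nxt sendMessage call (assumed transport)."""
    return urlencode({
        "requestType": "sendMessage",
        "recipient": recipient,
        "secretPhrase": secret_phrase,
        "feeNQT": fee_nqt,                   # assumed fee; messages incur fees in Nxt
        "deadline": deadline,                # transaction deadline in minutes
        "message": message,
        "messageIsText": "true",
    })
```

The resulting string would be posted to the `/nxt` endpoint of a validator node; the host, port, account and fee depend on the local evaluation kit configuration and are left to the caller.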

7. Experiments

We started our experiments with the assumption that the implementation of the underlying Nxt proof-of-stake blockchain was correct and that it inherently supported distributed consensus and immutability of blocks. Further, we assumed that the data stored inside the chain is unchangeable even in the presence of one or more malicious participants in the system depicted in Fig. 1. Instead of checking the functionality of the underlying blockchain, our intention was to show that the proposed WARChain system can run effortlessly on commodity computer hardware and that it manages duplicate entries and integrity checking in the web archival context. Along this line of thought, our first experiment focused on the efficiency and ease of use of the system, followed by the verification of the system’s strong integrity validation and duplicate management capabilities.

Table 1
Overview of the experimental setup

Test hardware	Intel i7-9750 CPU, 16 GB RAM, SSD
Web archive #1 [documents]	467,115
Web archive #2 [documents]	592,896
Crawlers	1 or 2
WARChain validator nodes	1 or 2

We conducted our experiments in a setup which consisted of the following elements (see Table 1 for a summary):

One or more crawler simulators implemented in the Python programming language. The simulators extracted articles from WARC files which contained web pages crawled from a newspaper portal [14]. The WARC content was crawled on two different dates and the two archives contained 467,115 and 592,896 article entries, respectively.

The WARChain consisted of a varying number of nodes.

As an efficient way to compare both article text and URL values between the two archives, we hashed them with SHA256 and compared those hash values when looking for exact URL/article matches.

We ran the Nxt private evaluation kit with default, out-of-the-box settings – we configured only the known peers. The experiments were run on a high-end personal computer with an Intel i7-9750 CPU, 16 GB RAM and a pair of non-RAID SSD drives.

7.1. Experiment #1: Proof of concept

The specific goal of our first experiment was to show that the proposed solution is not just easy to deploy, but also able to run on commodity personal computers, thereby allowing any stakeholder with limited funds to install it on any computer with an appropriate network connection. The starting point of the experiment was a WARChain system consisting of a single validator node. We populated the validator node by running one crawler simulator, taking the smaller WARC file (467 thousand entries) as input. For simplicity, our experimental setup did not include any storage nodes. This setup was similar to a situation in which a single institution implemented a single WARChain node which was then populated by document descriptions created by a single crawler process. It is important to note that we started this experiment with a completely empty ledger (i.e. blockchain).

We did not explicitly control the process in which Nxt grouped transactions into blocks. We serialized the tuples corresponding to separate web documents to JSON and experimented with different numbers of such descriptions in a single message submitted via Nxt’s SendMessage API function. We listed a subset of messages sent in this manner in Nxt’s web-based graphical user interface and include it in Fig. 2.

Fig. 2.

JSON-serialized document descriptions in the Nxt user interface.

We sent the messages to a well-known genesis account which comes as part of the default Nxt private blockchain evaluation kit. For simplicity, we did not use message encryption in our experimental setting. This configuration can be changed if necessary as it would be sufficient that trusted, legitimate blockchain users knew the passphrase corresponding to the account and could load its messages for web archive validation purposes.

As one JSON-serialized document description was roughly 200 bytes long and SendMessage allowed prunable¹⁹ messages to be up to 42 kilobytes long, we experimented with different numbers of document descriptions in a single JSON message (1, 10, 20, 25, 35, 50, 100). Our experiments showed that the web API returned a JSON serialization error for any value equal to or higher than 50, although the submitted JSON was valid. Therefore, we decided to conduct the experiment with 35 document descriptions submitted as a single message.

¹⁹ Prunable messages are moved from the blockchain to long-term storage after a specified time.
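The batching described above can be sketched as follows. The field names of the document descriptions are hypothetical, and the request parameter names (`requestType`, `recipient`, `secretPhrase`, `message`, `messageIsPrunable`) are assumptions modelled on the Nxt HTTP API that should be checked against the Nxt documentation before use. The sketch only builds the request parameters and does not contact a node.

```python
import json

BATCH_SIZE = 35                 # number of descriptions per message used in the experiment
MAX_MESSAGE_BYTES = 42 * 1024   # prunable message limit, assuming 1 kilobyte = 1024 bytes


def make_batches(items, batch_size=BATCH_SIZE):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def build_request(batch, recipient, passphrase):
    """Serialize one batch to JSON and wrap it in (assumed) sendMessage parameters."""
    message = json.dumps(batch, separators=(",", ":"))
    if len(message.encode("utf-8")) > MAX_MESSAGE_BYTES:
        raise ValueError("serialized batch exceeds the prunable message limit")
    return {
        "requestType": "sendMessage",   # assumed API request type
        "recipient": recipient,         # account that collects the messages
        "secretPhrase": passphrase,     # passphrase of the sending account
        "message": message,
        "messageIsPrunable": "true",
    }


# 100 placeholder descriptions with dummy 64-character hashes.
descriptions = [
    {"crawler": "crawler-1", "date": "2020-05-15",
     "url_hash": "a" * 64, "doc_hash": "b" * 64}
    for _ in range(100)
]
batches = make_batches(descriptions)
requests_to_send = [build_request(b, "RECIPIENT-ACCOUNT", "passphrase") for b in batches]
```

With roughly 200 bytes per description, a batch of 35 descriptions serializes to about 7 kilobytes, well below the 42 kilobyte limit, which suggests that the error observed for 50 or more descriptions was not caused by the message size limit alone.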

We measured the memory and time complexity of populating a single WARChain node. Our measurements are shown in Table 2. Memory consumption was measured as the memory used by the single validator node running on the high-end personal computer we used during our tests. Memory use was moderate considering that Nxt is implemented in Java.

Table 2
Memory consumption and average transaction time

Memory consumption	235.3 MB
Average transaction time	0.03 s

The average transaction time was measured as the time needed to submit 100 tuples created for the documents in the input archive and serialized to JSON. We were somewhat perplexed by the fact that sending a message to the Nxt blockchain took two seconds on average, regardless of message size, which we varied between JSON strings corresponding to 1 through 35 documents as stated above. We attributed this seemingly sluggish performance to the default Nxt configuration we used in this experiment and did not attempt to tweak the performance of the underlying blockchain. As we had only a single validator node in this experiment, our time measurements did not include the overhead caused by the communication infrastructure and the distributed consensus algorithm. Considering that an actual deployment of the WARChain would consist of a smaller number of nodes, the time to validate transactions can and should be significantly lower than the 5–10 minutes currently measured in Bitcoin²⁰ or Ethereum.²¹

²⁰ https://www.blockchain.com/charts/median-confirmation-time.

²¹ https://ethgasstation.info/blog/ethereum-transaction-how-long.

Based on our first experiment we concluded that the WARChain is easy to deploy, that the validation code can be quite simple (at least in the context of the Nxt PoS blockchain), and that its time and computing requirements are limited. This experiment also showed that the WARChain is able to act as an immutable platform for storing information about web crawls conducted by different actors, guaranteeing the integrity of documents potentially crawled in the past and no longer available.

We were surprised that Nxt was somewhat slow when running it with its default configuration and we intend to investigate this further as part of our future work. Even if it turned out in the future that Nxt could not perform better, we expect that there will be other similar solutions with higher throughput to which we could easily port the WARChain system.

7.2. Experiment #2: Validation

The specific goal of our second experiment was to test the system’s capability to efficiently detect and report alterations in web documents collected by crawlers at different timestamps and by different WARChain stakeholders. We started by repeating experiment #1 and followed that up by running the crawler simulator on the second, extended dataset collected from the same news portal at a different timestamp. This second web archive consisted of 592,896 entries as mentioned above. As in the previous experiment, we submitted Nxt messages consisting of 35 document descriptions encoded in JSON (see Section 7.1 for details). We expected the system to be capable of discerning the following types of document entries:

New entries which were included only in the second archive.

Modified entries which existed in both archives, but were different.

Removed entries which were inside the first, smaller dataset, but not included in the second, larger WARC file.

We found that the overlap between the two archives was only partial, as 173,784 URLs were shared between them (about 37% of the 467,115 entries in the first archive). We did not find document discrepancies between the two datasets which would point to possible integrity issues caused by planned corrections, tampering or censoring. More specifically, we found 173,784 shared URL hashes between the two datasets, and the corresponding article hashes were equal in every case. We therefore concluded that no alterations were made to the shared documents between the collection times of the two web archives we used in our experiments. Table 3 summarizes these findings.
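The comparison behind these categories can be sketched with plain dictionaries that map URL hashes to document hashes. The function name `compare_archives` and the short placeholder hash values are illustrative assumptions and do not reproduce the prototype's code or data.

```python
def compare_archives(first: dict, second: dict) -> dict:
    """Classify entries of two archives given as {url_hash: doc_hash} mappings."""
    shared = first.keys() & second.keys()
    return {
        # URLs present only in the second (later) archive
        "new": second.keys() - first.keys(),
        # URLs present only in the first (earlier) archive
        "removed": first.keys() - second.keys(),
        # shared URLs whose document hashes differ (integrity check failures)
        "modified": {u for u in shared if first[u] != second[u]},
        # shared URLs whose document hashes match
        "unchanged": {u for u in shared if first[u] == second[u]},
    }


# Placeholder hashes, shortened for readability.
archive_1 = {"u1": "d1", "u2": "d2", "u3": "d3"}
archive_2 = {"u2": "d2", "u3": "d3-changed", "u4": "d4"}

result = compare_archives(archive_1, archive_2)
shared_count = len(result["modified"]) + len(result["unchanged"])
overlap_share = shared_count / len(archive_1)
```

Applied to the figures reported above, the same ratio gives 173,784 / 467,115 ≈ 0.372, i.e. the reported overlap of about 37% relative to the first archive, with an empty set of modified entries.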

Table 3
Document validation across multiple web archives

Web archive overlap [documents]	173,784
Web archive overlap [%]	37%
Integrity check failures [documents]	0

We were surprised to see a smaller than expected overlap between the two archives, which we intend to investigate in the future, i.e. we intend to find out why only 37% of the articles were shared between the archives collected at different times from the same news portal. It was either caused by a change in the crawler implementation, or there might be more sinister reasons behind this finding.

We decided against measuring the average time it took to perform a single integrity check, as we reckoned that it is near zero compared to the length of other tasks performed in the system, e.g., the crawling process and reaching distributed consensus in the underlying proof-of-stake blockchain.

7.3. Experiment #3: Duplicate detection

In our third experiment we wanted to assess the merits of the proposed solution in duplicate document detection. We initialized a slightly different experimental environment with a blockchain consisting of two validator nodes and ran two crawler simulators which were inserting validation information at two different WARChain nodes – Table 4 contains the list of processes involved in this experiment. When looking at Fig. 1, this can be understood as if there were two black processes (validators) and two white processes (crawlers). We deployed this system on the mobile workstation specified in Table 1 and a different general-purpose workstation running the same version of the Nxt private blockchain. We configured the well-known peers for both blockchain nodes, thereby creating a single system. It is obvious that this setup could be easily scaled further by using cloud-based, containerized solutions or multiple workstations and/or servers.

Table 4
Duplicate detection environment

WARChain nodes	2
Crawler simulators	2
Storage nodes	0

With this experiment we tested the way the system would behave in a more realistic setting in which multiple web crawlers collect documents from potentially overlapping parts of the web. As the blockchain is not designed to perform update and delete operations, we allowed each crawler to create new transactions for each batch of crawled documents. This way the blockchain contained numerous (see Table 1 for a quantitative measure) transactions essentially describing the same archived content by different crawlers, i.e. there were duplicate transaction entries in the blockchain which differed only in the crawler identifier and/or crawl timestamp of the JSON representation shown in Listing 1. If the system were implemented on a global scale with participation of many validators and crawlers, then the above described duplicate data entry feature of the system would be quite useful in detecting duplicate crawling efforts carried out by different organizations and teams worldwide, as well as allowing cross-validation between the duplicate content crawled.
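One way to surface such duplicate entries is to group the stored descriptions by their URL and document hashes and report every group that was submitted by more than one crawler. The sketch below assumes the same hypothetical field names as a JSON description with crawler identifier, date, URL hash and document hash; it works on an in-memory list rather than on messages read from the blockchain.

```python
from collections import defaultdict


def find_duplicates(transactions):
    """Return {(url_hash, doc_hash): crawler ids} for content reported by several crawlers."""
    crawlers_by_document = defaultdict(set)
    for tx in transactions:
        key = (tx["url_hash"], tx["doc_hash"])
        crawlers_by_document[key].add(tx["crawler"])
    return {
        key: crawlers
        for key, crawlers in crawlers_by_document.items()
        if len(crawlers) > 1
    }


# Placeholder transactions from two crawler simulators.
transactions = [
    {"crawler": "crawler-1", "date": "2020-05-15", "url_hash": "u1", "doc_hash": "d1"},
    {"crawler": "crawler-2", "date": "2020-06-01", "url_hash": "u1", "doc_hash": "d1"},
    {"crawler": "crawler-1", "date": "2020-05-15", "url_hash": "u2", "doc_hash": "d2"},
]
duplicates = find_duplicates(transactions)
```

Entries that share a URL hash but differ in document hash are deliberately not grouped here, since they indicate a changed document rather than a duplicate crawl.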

Another significant aspect of duplicate management is the role of storage nodes and the duplicate storage of overlapping content crawled by different digital archival projects. In this context we propose that the WARChain be used to handle that sub-problem of duplicate management by allowing web crawling projects to check in the WARChain who crawled their target content and when, thereby allowing them to decide against storing duplicate content that is already available in multiple locations. We intend to investigate this feature as part of our future research.

8. Conclusion and future work

We stated that ensuring the long-term authenticity of web content held by archives is an open scientific challenge. We approached this challenge by identifying the system’s key assets and common threat sources. We continued by applying the STRIDE methodology to pinpoint the most likely threats grouped into the spoofing, tampering, repudiation, information disclosure, denial of service and elevation of privilege threat categories. We then proposed to mitigate the majority of identified threats with WARChain, a light-weight, blockchain-based solution for web archive validation. The WARChain was implemented as an overlay on the Nxt proof-of-stake blockchain, thereby requiring less computing power than proof-of-work solutions such as Bitcoin. Due to the high cost of storing data in any blockchain, we separated archive storage from the blockchain-based archive validation. The nodes participating in the WARChain store only tuples consisting of crawler identifier, archive date, URL hash and document hash. We thereby rely on two layers of cryptography, one to hash the URLs and documents, and the other to hash the blocks in the blockchain and thereby add immutability and authenticity to the web archival process.

We implemented the WARChain in Python and as an overlay over the Nxt proof-of-stake blockchain. We validated the solution by testing it on two medium-sized web archives crawled from a news portal and consisting of approximately 500,000 entries each. We extracted the article texts from the archived documents and worked with hashed URL and article values.

The experiments showed that the system can run on cheap commodity hardware, is capable of validating large amounts of archived web content and is able to manage duplicate entries resulting from multiple overlapping crawls. If a similar system were deployed at multiple digital archives worldwide, we could distribute technological trust between those participants, who would ideally be national archives and research institutions. That kind of distributed trust and non-repudiation would ensure that the web archives created by crawlers are not tampered with and remain available for validation even many years in the future.

As future work, we intend to experiment with additional web archives, port the system to mainstream proof-of-stake blockchains as they become available (e.g. Ethereum Casper) and implement a more fine-grained duplicate detection solution based on minhashing, i.e. split the article texts into sentences or bags of words and store multiple hash values for each text document. Any PoS-based solution also needs to penalize the misbehavior of nodes, a mechanism which we still need to explore and experiment with in the future. On the deployment side we intend to package our solution as Docker image(s) and thereby ease its deployment on personal computers, servers or in the computing cloud.
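The minhashing idea can be illustrated with a minimal sketch that stores several hash values per text and estimates the similarity of two texts from them. The word-level shingle length, the number of hash functions and the use of seeded SHA-256 digests are assumptions chosen for illustration, not design decisions of the WARChain prototype.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set:
    """Split a text into overlapping word k-grams (one shingle if the text is short)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text: str, num_hashes: int = 64, k: int = 3) -> list:
    """For each seeded hash function, keep the smallest digest over all shingles."""
    items = shingles(text, k)
    signature = []
    for seed in range(num_hashes):
        prefix = f"{seed}:".encode("utf-8")
        signature.append(
            min(hashlib.sha256(prefix + s.encode("utf-8")).digest() for s in items)
        )
    return signature


def estimated_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


original = "the city council approved the new budget for public transport on monday"
edited = "the city council approved the new budget for public transport on tuesday"
unrelated = "heavy rain is expected across western regions during this weekend"

sim_same = estimated_similarity(minhash_signature(original), minhash_signature(original))
sim_edit = estimated_similarity(minhash_signature(original), minhash_signature(edited))
sim_other = estimated_similarity(minhash_signature(original), minhash_signature(unrelated))
```

Unlike a single SHA256 hash of the whole article, which changes completely after a one-word edit, the signature of a slightly edited text is expected to agree with the original in most positions, which is what would allow near-duplicates to be detected.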

Footnotes

Acknowledgment

This research was supported by the Institutional Excellence Program for Higher Education (FIKP) of the Republic of Hungary.

References

1. 14873 ISO/DTR, Information and documentation – Statistics and Quality Indicators for Web Archiving, https://netpreserve.org/resources/IIPC_project-SO_TR_14873__E__2012-10-02_DRAFT.pdf, last accessed 2021/02/28.

2. Alam, Kelly, M.C. Weigle and M.L. Nelson, Unobtrusive and extensible archival replay banners using custom elements, in: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, IEEE, 2018, pp. 319–320. doi:10.1145/3197026.3203881.

3. Alam, Weigle, Nelson, Melo, Bicho and Gomes, MementoMap framework for flexible and adaptive web archive profiling, in: 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL), IEEE, 2019, pp. 172–181. doi:10.1109/JCDL.2019.00033.

4. Alam, M.C. Weigle, M.L. Nelson, Klein and Van de Sompel, Supporting Web Archiving via Web Packaging, 2019, arXiv preprint arXiv:1906.07104.

5. Aturban, Alam, M.L. Nelson and M.C. Weigle, Archive assisted archival fixity verification framework, in: Proceedings of the 19th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL’19), ACM, 2019, pp. 162–171.

6. Barbaresi, Generic Web Content Extraction with Open-Source Software, in: Conference on Natural Language Processing (“Konferenz zur Verarbeitung Natürlicher Sprache”, KONVENS 2019), Erlangen, Germany, 2019, pp. 267–268.

7. D.J. Bodeau, C.D. McCollum and D.B. Fox, Cyber Threat Modeling: Survey, Assessment, and Representative Framework, Homeland Security Systems Engineering & Development Institute, 2018.

8. Butler, Addressing Recent Claims of “Manipulated” Blog Posts in the Wayback Machine, http://blog.archive.org/2018/04/24/addressing-recent-claimsof-manipulated-blog-posts-in-the-wayback-machine, last accessed 2020/07/08.

9. Castells, The Rise of the Network Society, John Wiley & Sons, 2012.

10. Collomosse, Bui, Brown et al., ARCHANGEL: Trusted archives of digital public documents, in: Proceedings of the ACM Symposium on Document Engineering 2018, ACM, 2018, pp. 1–4.

11. Gomes, Miranda and Costa, A survey on web archiving initiatives, in: International Conference on Theory and Practice of Digital Libraries, Springer, Berlin, Heidelberg, 2011, pp. 408–420.

12. Holzmann, Goel and Anand, Archivespark: Efficient web archive access, extraction and derivation, in: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries, ACM, 2016, pp. 83–92. doi:10.1145/2910896.2910902.

13. Indig, Kákonyi and Novák, Crawling in reverse – lightweight targeted crawling of news portals, in: Proceedings of the 9th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, Kubis, ed., Wydawnictwo Nauka i Innowacje, Poznan, Poland, 2019, pp. 81–87.

14. Indig, Á. Knap, Sárközi-Lindner, Timári and Palkó, The ELTE.DH pilot corpus – creating a handcrafted GigawordWeb corpus with metadata, in: Proceedings of the 12th Web as Corpus Workshop, Marseille, France, 2020.

15. ISO 28500, WARC (Web ARChive) file format, http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml, last accessed 2020/05/15.

16. Johnson and Thomas, Interfaces with the past...present and future? Scale and scope: The implications of size and structure for the digital archive of tomorrow, in: Proc. Digital Heritage Conf, 2013.

17. Kelly, Alam, M.L. Nelson and M.C. Weigle, InterPlanetary wayback: Peer-to-peer permanence of web archives, in: International Conference on Theory and Practice of Digital Libraries, Springer, Cham, 2016, pp. 411–416.

18. Kelly and M.C. Weigle, WARCreate: Create wayback-consumable WARC files from any webpage, in: Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, ACM, 2012, pp. 437–438. doi:10.1145/2232817.2232930.

19. Khullar, Implementing Proof of Stake Part – 2, Medium.com, https://medium.com/coinmonks/implementing-proof-of-stake-part-2-748156d5c85e, January 2019.

20. Laurie, Langley and Kasper, Certificate Transparency, Internet RFC 6962, 2013, https://tools.ietf.org/html/rfc6962.

21. V.L. Lemieux, Blockchain Technology for Recordkeeping: Help or Hype? Technical Report, U. British Columbia, 2016.

22. Lendák, Indig and Palkó, WARChain: Blockchain-based validation of web archives, in: 10th International Workshop on Socio-Technical Aspects in Security, Guildford, United Kingdom, 2020.

23. Lerner, Kohno and Roesner, Rewriting history: Changing the archived web from the present, in: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS 2017), ACM, 2017, pp. 1741–1755. doi:10.1145/3133956.3134042.

24. Milligan, Lost in the infinite archive: The promise and pitfalls of web archives, International Journal of Humanities and Arts Computing 10(1) (2016), 78–94. doi:10.3366/ijhac.2016.0161.

25. Milligan, Welcome to the web: The online community of GeoCities during the early years of the World Wide Web, in: The Web as History: Using Web Archives to Understand the Past and the Present, Brügger and Schroeder, eds, UCL Press, 2017, pp. 137–158.

26. Münster, Digital heritage as a scholarly field – topics, researchers, and perspectives from a bibliometric point of view, Journal on Computing and Cultural Heritage (JOCCH) 12(3) (2019), 1–27. doi:10.1145/3310012.

27. Pomikalek, Removing boilerplate and duplicate content from web corpora, Ph.D. thesis, Masaryk University, Faculty of Informatics, Brno, Czech Republic, 2011.

28. Schafer and Winters, The values of web archives, in: International Journal of Digital Humanities, 2021, pp. 1–16. doi:10.1007/s42803-021-00037-0.

29. Schreibman, Siemens and Unsworth, A Companion to Digital Humanities, John Wiley & Sons, 2008.

30. Scwartz, Ethereum 2.0: A Complete Guide. Casper and the Beacon Chain, Medium.com, https://medium.com/chainsafe-systems/ethereum-2-0-a-complete-guide-casper-and-the-beacon-chain-be95129fc6c1, last accessed 2020/07/08.

31. Shatnawi, Munson and Thao, Maintaining integrity and non-repudiation in secure offline documents, in: Proc. Intl. Conf. on Document Engineering, ACM, 2017.

32. Shostack, Threat Modeling: Designing for Security, Wiley, 2014.

33. Sigurðsson, Managing duplicates across sequential crawls, in: Proceedings of the 6th International Web Archiving Workshop (IWAW), 2006.

34. Van de Sompel, M.L. Nelson and Sanderson, HTTP framework for time-based access to resource states–Memento, RFC 7089, 2013.

35. Vismann, translated by Winthrop-Young, G., Law and Media Technology, Stanford University Press, Stanford, CA, 2008.

36. W3C, An Overview of the PROV Family of Documents, https://www.w3.org/TR/prov-overview, last accessed 2020/07/08.

37. Walport, Distributed Ledger: Beyond Blockchain, Technical Report, UK Government, 2015.

38. Webster, Users, technologies, organisations: Towards a cultural history of world web archiving, in: Web 25, Histories from 25 Years of the World Wide Web, 2017, pp. 179–190.

39. Winters, Coda: Web Archives for Humanities Research Some Reflections, UCL Press, 2017, pp. 238–248.

40. Winters, Breaking in to the mainstream: Demonstrating the value of Internet (and web) histories, in: Internet Histories 1.1-2, 2017, pp. 173–179.

41. Yasskin, Web Packaging, https://github.com/WICG/webpackage, last accessed 2020/07/08.

42. L.L. You, K.T. Pollack and D.D. Long, Deep store: An archival storage system architecture, in: 21st International Conference on Data Engineering (ICDE’05), IEEE, 2005, pp. 804–815. doi:10.1109/ICDE.2005.47.


²
As the number of content providers is orders of magnitude greater than the number of memorial institutions, it is easier to develop and test a standard workflow and extend it to the scale of the content providers.

³
Internet Archive, https://archive.org/.

⁵
Such big companies have the resources to handle diverse documents at scale therefore they can quickly become the target of censorship and prone to lose trust.

⁶
https://www.tooltester.com/en/blog/cms-market-share/ last accessed, 2021/10/01.

⁷
Chokshi, N. (2019). Myspace, once the king of social networks, lost years of data from its heyday. New York Times. https://www.nytimes.com/2019/03/19/business/myspace-user-data.html last accessed, 2021/10/01.

¹⁰
Zenodo, https://zenodo.org/.

¹³
https://github.com/lendak/warchain.git.

¹⁴
https://nxtdocs.jelurida.com/Nxt_Whitepaper

¹⁷
https://github.com/webrecorder/warcio.
