Abstract
The world wide web has two main forms of architecture, the first is that which is explicitly encoded into web pages, and the second is that which is implied by the web content, particularly pertaining to look and feel. The latter is exemplified by the concept of a website, a concept that is only loosely defined, although users intuitively understand it. The Website Boundary Detection (WBD) problem is concerned with the task of identifying the complete collection of web pages/resources that are contained within a single website. Whatever the case, the concept of a website is used with respect to a number of application domains including; website archiving, spam detection, and www analysis. In the context of such applications it is beneficial if a website can be automatically identified. This is usually done by identifying a website of interest in terms of its boundary, the so called WBD problem. In this paper seven WBD techniques are proposed and compared, four statistical techniques where the web data to be used is obtained apriori, and three dynamic techniques where the data to be used is obtained as the process progresses. All seven techniques are presented in detail and evaluated.
Keywords
Introduction
A disconnect exists between the underlying structure of the Web and the structure perceived by humans. Humans can intuitively identify an implied organisation of the Web that is not obvious from the underlying explicit structure. More specifically, humans are able to infer website boundaries in a manner that is not necessarily explicitly encoded in any underlying structure. Conceptually, this human perceived architecture exists in a layer above the encoded structure of the web. The process whereby humans derive this architectures is founded on relationships and/or similarities between features encoded in the web, such as the topic or subject of web pages, layout, navigation menus, imagery and so on. Although humans are able to interpret the WWW in this way, it is a difficult task for a machine to achieve. The ability for a machine to identify this underlying architecture remains an open problem [10].
The standard graph model of the web assumes that each web page is a single node within a graph, with hyperlinks (edges) connecting it to other nodes. This model has been used for various web analysis tasks [11,19,33]. The advantage of this standard model is that it is explicitly encoded in the Web and it can be easily extracted using simple web crawls, however it conveys very little information about the higher level web architectures that humans can perceive.
The work presented in this paper is motivated by a number of applications whereby knowledge of the higher level web architecture, comprised of collections of websites each delimited by a web site boundary of some kind, is required. More specifically the work is motivated by a need for automated Website Boundary Detection (WBD), knowledge that cannot be directly obtained from simple web graph analysis. Typical applications where website boundary knowledge is required include automated: (i) Website archiving and digital preservation [1,50], (ii) Web directory generation [31], (iii) Website map generation [19,34,35] and (iv) Web spam detection [8,9,13,32,38,52–54]. Further more, the study of the WWW at the website level, rather than the web-page level, provides benefits with respect to a variety of web analysis activities [10]: (i) Content authorship analysis [18], (ii) Inter (between) and Intra (within) website connectivity analysis [33], and (iii) Statistical analysis of dynamic website growth and website generation on a website level [11]. Finally, arguably the most prominent motivation for WBD is for the purposes of digital preservation; the identification of complete web (site) content and its archiving for future generations [1,12,17,50].
A naive approach to WBD would be to try to identify a website boundary using the urls of web pages. This would not work for a website were content resides on multiple domains or uses a directory structure (for example so as to logically segment content based on media type). Supervised learning could also be thought of as an option for WBD. But as websites vary in size, it can be extremely difficult to devise training data on which to build a model for classification. Further challenges are discussed in Section 2.
The main challenge of WBD is the resource cost required to access, store and process web content. In terms of accessing web content there is an overhead associated with the retrieval of content; as content is stored on servers around the world there is a time and bandwidth cost associated with accessing pages/resource on the web. In the context of WBD, so as to use computational resources effectively, we wish only to consider relevant pages. The process of WBD can be considered in either a static or dynamic context. The distinction is that in the static context all the www pages to be considered are first downloaded. In the dynamic context only portions of the Web are considered in turn, the analysis is thus conducted in a step by step manner. The advantage of the first is that decisions on what groupings to classify web pages into, either contained within a website (target pages) or outside of the website (noise pages), can be made based on all data apriori. The advantage of the second is that the amount of irrelevant (noise) pages that are requested, stored and processed can potentially be reduced, as the gathering of pages is considered at the same time as the classifying of pages which in turn can be used to inform the decision of whether to access subsequent pages or not.
In this paper both static and dynamic WBD algorithms are considered. Recall from the foregoing that the distinction between the two is that in the first case the search space, a collection of web pages, is obtained in advance, while in the second case the search space is explored dynamically. Two static WBD approaches are considered: (i) content based and (ii) structure based. The focus of the first, as the name suggests, is the content of web pages. The idea is to describe web page content in terms of appropriate attributes and to use these as features to represent web pages. Web pages are then clustered using these features in order to identify a websites boundary. Two static content based techniques are considered: (i) the Composite Feature (CF) technique and (ii) the Discriminant Feature (DF) technique. The focus for the static structure based WBD approach is the hyperlink structure of web pages, as such the approach is founded on graph partitioning methods. Two graph partitioning techniques are considered in the context of the static structure based WBD approach: (i) Hierarchical Graph Partitioning (HGP) and (ii) Mincut Graph Partitioning (MGP). The proposed dynamic approach is founded on the use of a Random Walk (RW) web graph traversal coupled with an incremental k-means clustering mechanism. Three variations are considered: (i) Random Walk (RW), (ii) Self Avoiding Random Walk (SARW) and (iii) Metropolis Hastings Random Walk (MHRW).
The remainder of the paper is organised as follows. In Section 2 a literature review of some existing work on the resolution of the WBD problem is presented. Section 3 then presents a formal description of the WBD problem. Section 4 presents the first of the static approaches, namely the content based approach (two alternative techniques). The second static approach, the structure based approach, is presented in Section 5 (two alternative techniques). Section 6 then presents the proposed dynamic approach to the WBD problem based on Random Walks (three alternative techniques). A complete evaluation is then presented in Section 7 and some conclusions in Section 8.
Literature review
The Website Boundary Detection (WBD) problem is recognised as an open and difficult problem to solve [3–7,10,28–30]. As already noted, the WBD problem is concerned with the task of identifying the complete collection of web resources that are contained within a single website. WBD broadly falls within the domain of “web mining” [14,15,22,43,51]. Web mining is directed at the discovery of hidden patterns or knowledge in web data using techniques that utilise web structure (hyperlinks) and web content (attributes of a page). This section presents some selected related work which discusses a number of alternative techniques, to that presented in this paper, that have been used to address the WBD problem, namely: (i) link block graphs, (ii) logical websites, (iii) compound documents and (iv) directory based websites.
Starting with the link block graph technique, a link block graph is a set of vertices and edges that represent certain elements of a web page referred to as a block (a coherent sets of information). Examples of such blocks are a navigation menu across the top of a page, a set of related links at the side of a page or a section of text and images. Given a set of web pages a link block graph is typically created from these pages as follows. First the individual pages are segmented into “blocks”. These individual blocks are created by segmenting the HTML Document Object Model (DOM). Second a graph structure of vertices and edges is created from the blocks. Each block is a vertex. An edge between blocks is created based on a notion of similarity. The results is a link block graph. In [45], and in related work presented in [30], the link block graph concept was used to model web data, more specifically link blocks were used to represent structural menus (s-menus) of web pages; an s-menu is a link block graph, as described above, but describing only the internal navigation elements of web pages. A relationship between web pages exists when a link from a s-menu can be used to navigate to another page that contains the same s-menu. Web pages that contain common s-menus can be used to define the boundaries of a website.
In the work by Senellart [49] the aim was to find all web pages that are determined to be contained in a “logical website”. A logical website in this context is a collection of pages that is said to represent a logical structure and content, such that they are more connected in terms of hyperlinks than other pages. An idea also adopted with respect to work presented later in this paper. The work by Senellart used the concept of flow networks in order to discover web pages that made up a logical website. An assumption is made that if a flow is pushed around the network “bottlenecks” will occur in the less connected noise pages, but flow more freely around the highly connected target pages of the website. To detect the boundaries of a website flow is pushed from a set of seed pages until a bottle neck is created around the boundaries of the website. The set of nodes that have flow pushed to them between the seed and the “bottle neck” are then identified as part of the website. The significance of the work by Senellart is that the concept of flow networks is also adopted with respect to the work presented later in this paper.
Research by Eiron [21] proposed the notion of compound documents; a set of web pages that can be aggregated into a single coherent information entity. A simple example of a compound document is an online multi-media news article. In the work by Dmitriev [18] a method to discover compound documents is proposed. The method used supervised learning techniques to train a model using manually labelled compound documents. This method was then applied to unlabelled web pages in order to determine which of these pages should be aggregated to represent a compound document. This is a similar problem to the WBD problem, but applied in a supervised learning environment.
In [7] the concept of a “directory based (web) site” is described; the partition of a web server over which some individual (website author/owner) has full control. The method proposed to address the WBD problem in this case used various filters based on characters and directories in the URLs of web pages so as to categorise which child pages fall under the control of which individual and thus identifying a website’s boundaries in terms of authorship. The disadvantage of this technique is that it assumes that all pages contained within a website are located within the same directory hierarchy as one another. In this paper it is argued that this is often not the case, as individual web pages, images and other web resources can be located using different urls and servers but can still make up a single website.
Formal description
The general WBD problem can be formally described as follows. A web page is defined as a WWW resource, written in HTML format, and denoted by w. A collection of n web pages is given by W such that
With respect to the work presented in this paper the hyperlink structure of a set of web pages is represented as a (web) graph G such that
The HTML content of a page
WBD problem
Using the above formalism, the WBD problem can be defined as follows. Given a collection of web pages W, comprising n individual pages such that
A basic WBD solution used later in this paper is founded on a clustering process. This subsection is thus concluded with a brief description of this process. Broadly, clustering is used to groups similar web pages in W together based on their content, more specifically on the values in an associated set of feature vectors F. A cluster configuration
Static vs. dynamic
As noted in the introduction to this paper the WBD problem is considered in both the static and dynamic contexts. In the static context, the entire vector space is made available prior to the commencement of the WBD process. Thus the collection of web pages
In the dynamic context the entire vector space is not known in advance. The web pages in the vector space are populated in a dynamic manner as part of a web traversal (crawling) process. This essentially means that the vector space is updated at each “step” of the traversal process. The seed page
Content based static WBD
This section presents the first of the static WBD approaches, namely the content based static WBD approach. Recall that in the static context the search space is identified apriori. The content based static WBD approach uses a range of features, extracted from the content and meta-data of web pages, to define individual web pages in the set W. A clustering algorithm is then applied, as described above, and the individual web pages in W allocated to a set of clusters K; as noted above the cluster holding
Feature representations
There are many features that can be extracted from web pages that can serve to describe content and thus be utilised with respect to the static WBD approach. In previous research textual based features are commonly used to represent web pages. Thw work presented in this paper proposes some additional features referred to as link based features. The particular link based features of interest are: image links, mailto links, page anchor links, resource links, and script links. Further details on all the features considered with respect to the work presented in this paper are listed below. Further details on all the features considered with respect to the work presented in this paper are listed below. In each case we consider our motivation for selecting the feature and how the feature may be represented in terms of a binary valued feature space representation. The features considered are categorised into two groups: (i) link (URL) based features and (ii) textual based features.
The link based features were extracted and represented as follows. The HTML content for each page was first downloaded and validated into a well formed Hyper Text Mark-up Language (HTML) Document Object Model (DOM) object. This was done because HTML documents are notoriously malformed and subsequently difficult to parse, for example documents often have missing tags. The steps required to validate the HTML into a well formed document are therefore a necessary precursor to the successfully extraction of link based features. Each link (URL) in each web page was then parsed and extracted from the HTML by splitting it into “words” using the standard delimiters found in URL’s. For example the URL
The textual based features were extracted from each web page using a html text parser/extractor to extract text as it would be rendered by a web browser. This was deemed to be the same text that a user would use to judge a pages topic/subject. Note that for this purpose non-textual characters were removed, along with words contained in a standard “stop list”. Thus the identified textual feature values comprised only what were deemed to be the most significant words. Each word represented a binary valued feature in our feature space.
The link based features that were considered in this work were as follows:
The motivation for using hyper links as one of our features is the observation that web pages that are related tend to share many of the same hyper links. The shared links may be other pages in the same website (e.g. the website home page) or significant external pages (most links point to a related set of pages with some common topic). The use of image links was prompted by the observation that web pages that link to the same images were likely to be related, for example a common set of logos or navigation images could imply a relationship. If a set of web pages includes identical email links ( Page anchors are used to navigate to certain places on the same page, these can be useful with respect to the resolution of the WBD problem as they often have meaningful names. It was conjectured that if the same or related names are used on a set of web pages it could imply related content. The motivation for using resource links was that the styling of a page is often controlled by a common Cascading Style Sheet (CSS) which could therefore imply that a collection of pages that use the same style sheet are related. It was observed that some scripting functions that are used in web-pages can be written using some form of common script file; if pages have common script links then they could also be related. URLs are also likely to be an important factor in establishing whether subsets of web-pages are related or not. A URL is used to request the actual page from the WWW, it contains a substantial amount of information like server and directory information. It is conjectured that the title text used within a collection of web pages belonging to a common web site is a good indicator of “relatedness”. Another key indicator of web page “relatedness” is content; the text contained in individual WWW pages.
Static content based WBD
This sub-section presents the two static content based WBD techniques: (i) the Composite Feature (CF) technique and (ii) the Discriminant Feature (DF) technique. Both comprises three inter-related processes: (i) feature selection, (ii) clustering and (iii) web site extraction, as illustrate by the process template given in Table 1. The first technique, CF, is the simplest of the two variations. The CF technique operates by simply using a user prescribed individual feature, or combination of features, to represent the web pages in W. The features (see previous section, Section 4.1) are extracted from web pages and represented as a vector space model. The clustering approach described above is then applied and the cluster containing the seed page (
Process template for the static approach to WBD
Process template for the static approach to WBD
The second technique, DF, is founded on a discrimination method whereby individual features, and combinations of features, are generated and evaluated using a small amount of prelabelled training data (both target and noise web pages). The idea is to learn a “best” set of features in terms of a given WBD problem. A scoring mechanism is used for the evaluation. This is done using the score metric presented later in this paper in Section 7.1. The discrimination process is an iterative one, commencing with single features and then on each iteration adding additional features. In this manner a “combination hierarchy” of features is produced. Given a set of features the same clustering approach as described above can be used to identify ω and consequently B. The process is illustrated in Table 2 with an example set of features
Example hierarchy generated using the feature discrimination technique
This section presents the second static WBD approach, the structure based static WBD approach, whereby the web graph hyperlink “structure” is used to identify the desired website boundary B. The approach makes the assumption that a website comprises a set of “strongly” connected components compared to the rest of the web graph under consideration. Two variations of the static structure based WBD approach are considered: (i) Hierarchical Graph Partitioning (HGP) and (ii) Mincut Graph Partitioning (MGP). For the first the Newman hierarchical partitioning algorithm [39,40] was used. For the second the Ford Fulkerson algorithm was used which in turn is founded on the mincut-max flow theorem described in [24,25] directed at the concept of flow networks. The two variations are described respectively in Sections 5.1 and 5.2 below; both aim to partition the web graph such that strongly connected pages are segmented from the rest, thus producing a WBD solution.
Hierarchical graph partitioning structure based static WBD
This sub-section describes the hierarchical graph partitioning variation of the Structure Based Static WBD approach founded, as noted above, on the Newman hierarchical partitioning algorithm [39,40]; an agglomerative hierarchical mechanism for identifying clusters in graph data according to the similarity of graph vertices. The algorithm produces p graph clusters, where p is determined by the algorithm itself (no need for user input). The hierarchical graph clustering approach is centred around a modularity value Q; a measure of how “good” a given cluster configuration is. The Q value is calculated, according to the inter-connectivity of the vertices in each individual cluster, by comparing the connectivity of records within each cluster with respect to the connectivity if the records were connected randomly [41].
More specifically the Q value is calculated as follows. Given: (i) a directed graph
The Newman clustering algorithm is an iterative hierarchical clustering method applied to graph networks (such as social networks); the aim being to generate a set of clusters K, where each individual cluster contains one or more vertices. The input to the algorithm is a configuration K generated by placing each vertex in V in its own cluster. Thus on start up
Using the Newman algorithm a set of clusters K is produced. However, as noted above, the number of clusters is not predetermined, it is derived by the algorithm. The seed page
Mincut graph partitioning structure based static WBD
This sub-section describes the Mincut graph partitioning approach based on the Max-Flow Min-Cut theorem applied to flow networks (recall that flow network theory was also used in [49] for WBD as disccused in Section 2). As noted previously, in graph theory a flow network is a graph of vertices and edges where edges can have a flow associated with them, which is an indicator of material “flowing” between vertices. Flow networks can be used to model a variety of problems [2,26]; in the context of the work described in this paper they can be used to identify “cuts” (divisions) within a network in order to produce graph partitions [47]. A “minimum cut” is a cut that splits a network into two partitions such that a “minimum” number of edges are cut, thus forming two clusters comprising dense (or at least denser) connections [47]. The minimum cut of a network can be found effectively by applying the Max-Flow Min-Cut theorem [16,24].
The concept of a flow network model is formally described as follows (the description and examples are based on [16]). A flow network is a directed graph
Using the above definition of a flow network, a flow f inside a network can be formally defined as follows. The flow f, in a flow network G, is considered to be a value which can be measured between vertices u and v. A flow
A residual network
A
Given a flow network A flow f is a maximum flow in G The residual network The flow in the network
(1) states that a flow f needs to be a maximum flow in G. This can be proven using (2). If f is said to be a max flow, while there is an augmenting path in
The Ford Fulkerson Algorithm [24] is used to compute the max-flow in a given flow network
To produce a WBD solution from a graph partitioned into two clusters using the min cut of the graph a similar method was used to that used with respect to the static WBD approaches described above. Namely that the cluster containing the seed node
The WBD approaches presented in this section focused on exploiting only the hyper link structure of W when producing WBD solutions, whilst the WBD approach presented in the previous section (Section 4) exploited only the content of the web pages in W with respect to the WBD problem. The approaches presented in the following section (Section 6) are directed at both the content and hyper link structure of web pages when producing WBD solutions.
Random walk dynamic WBD
This section presents the third and final WBD approach considered in this paper, the dynamic approach. Recall that in the dynamic context the web data is not available prior to the start of the analysis, the web data is “fetched” as the approach proceeds. The three approaches in this section are based on the Random Walk concept, a method of traversing a graph structure in the form of a succession of random steps from node to node. The idea is to include an element of serendipity (chance discovery) in the process. A random walk is used in this work as a probabilistic method of visiting nodes (web pages) of a web graph in a non-deterministic order. It is argued in this work that such a probabilistic method provides advantages with respect to WBD. Thus in the context of WBD we again conceive of a collection of web pages, in which we wish to find a web boundary centred on some seed page
The remainder of this section is organised as follows. Section 6.1 provides an overview of the dynamic WBD approach, whilst Section 6.2 provides details on the graph traversal methods used. The remaining three sub-sections (Sections 6.3, 6.4 and 6.5) present the three variations of the dynamic approach respectively.
Overview of the dynamic WBD approach
The generic dynamic WBD process is described by the template given in Table 3. At the start (line 1) the set of target pages
As the traversal progresses, for each identified node w (web page), we determine whether the current node belongs to
Template for the dynamic approach to WBD
Template for the dynamic approach to WBD
This section details the three random walk graph traversal methods considered: (i) standard Random Walk (RW), (ii) Self Avoiding Random Walk (SARW) and (iii) Metropolis Hastings Random Walk (MHRW). These are all probabilistic approaches. The distinction between a deterministic and a probabilistic approach is that, given a web graph G, the first will always traverse the graph in the same manner while the second will (at least potentially) traverse the graph in a different manner each time the algorithm is applied to G. Deterministic approaches use a heuristic process to determine which node to visit next thus the ordering of web pages accessed using a deterministic method remains fixed for every traversal of the same graph. Thus using a probabilistic approach there is an element of choice and unpredictability involved (serendipity). Each of the proposed probabilistic graph traversal techniques is considered in detail in the following three sub-sections in the context of the template presented previously in Table 3. For the evaluation presented later in this paper the operation of the proposed probabilistic traversal techniques are compared with two deterministic techniques: (i) breadth-first Search (BFS) and (ii) depth-first Search (DFS).
Pseudo code for Random Walk (RW)
Pseudo code for Random Walk (RW)
The Random Walk (RW) method of graph traversal is described by the pseudo code presented in Table 4. As before the input to the algorithm is a start seed page
Self Avoiding Random Walk (SARW)
The Self Avoiding Random Walk (SARW) “crawls” the graph in a random manner without returning to previously visited nodes. It does this by maintaining a list of nodes R visited so far. The pseudo code is shown in Table 5. The input to the algorithm is a starting seed page
Metropolis Hastings Random Walk (MHRW)
The random traversal of the RW and SARW methods is greatly effected by the underlying graph structure. If a node has a high degree, then it is more likely to be chosen randomly, as it has an increased number of neighbours. The Metropolis Hastings Random Walk (MHRW) aims to reduce this behaviour. The approach taken is to use a calculation which effectively produces an inverse probability of choosing a neighbour which has a high degree. The pseudo code for the MHRW is given in Table 6. As in the case of the previous two random walk algorithms the input is a seed page
Pseudo code for Self Avoiding Random Walk (SARW)
Pseudo code for Self Avoiding Random Walk (SARW)
Pseudo code for Metropolis Hastings Random Walk (MHRW)
From the foregoing seven distinct techniques to address the WBD problem have been proposed. Four static techniques, where the input data is collected in advance, and three dynamic techniques where this is not the case. The four static techniques were characterised as being either content based or structure based (two variations of each). The two variations of the content based static WBD approach were: (i) Composite Feature (CF) and (ii) Discriminative Feature (DF). The two variations of the structure based static WBD approach were: (i) Hierarchical Graph Partitioning (HGP) and (ii) Mincut Graph Partitioning (MGP). The three random walk dynamic WBD variations were: (i) standard Random Walk (RW), (ii) Self Avoiding Random Walk (SARW) and (iii) Metropolis Hastings Random Walk (MHRW). This section provides an evaluation of these seven WBD approaches.
The standard mechanism for evaluating solutions to the WBD problem is to use hand annotated collections of data extracted from the WWW [27,30,44]. This method was thus also adopted with respect to the work presented in this paper. The four data sets used for the evaluation were departmental web pages hosted by the University of Liverpool (www.liv.ac.uk). The web pages collected were manually labelled by an assessor with respect to the WBD solution. Each data set contained between 450 to 500 web pages. The four target web sites were: (i) Chemistry (LivChem), (ii) History (LivHistory), (iii) Mathematics (LivMaths) and (iv) Archaeology, Classics and Egyptology (LivAce). This collection of data sets will be referred to as the Real Web Graph (RWG) collection.
The remainder of this section is organised as follows. Section 7.1 presents the metrics used for the evaluation. The next three sub-sections, Sections 7.2, 7.3 and 7.4, present the results of the evaluation of the three WBD approaches considered in isolation. Section 7.5 then presents a comparative evaluation of the best performing WBD solutions from the foregoing evaluation. The final sub-section presents a summary of the conducted evaluation.
Evaluation metrics
To measure the quality of the proposed WBD approaches performance score was used; the average of the measured accuracy, precision, recall and F-measure (all standard evaluation metrics taken from the domain of data mining). In the case of the dynamic approaches coverage and number of steps was used. Coverage is a measurement of the fraction of total target pages
Content based static WBD evaluation
This subsection presents the evaluation of the content based static approach. The evaluation was broadly directed at evaluating the most appropriate features, and combinations of features, that could best be used to represent web pages, with respect to WBD. More specifically the objectives were:
To analyse the performance of the CF technique in the context of the best features to be used and the WBD solutions generated. To analyse the performance of the DF technique in the context of the WBD solutions generated. To identify the most appropriate clustering algorithm to be used in the context of content based static WBD.
Each is discussed in further detail in the following three sub-sections.

Performance of the top nine single features, ordered from best performing 1, to worst performing 9.

The performance of feature pairings, ordered from best performing 1, to worst performing 36.
This subsection reports on the evaluation directed at determining the most appropriate set of features to be used in conjunction with the CF technique. Recall, that this technique works by selecting a feature or combination of features for WBD as presented previously in Section 4.2. A sequence of experiments was undertaken using single features, pairs of features and triples of features. The results are presented in Figs 1 and 2. Figure 1 lists the nine top performing single features ordered according to their performance score. From the Figure it can be seen that the top performing features are resource, script and image links. These features are intuitively indicative of the authors intent to express relationships in terms of a website. Resource and script links are used to express an underlying similar “look and feel” of pages in the same collection. It is very common for shared scripts, or blocks of code providing functionality, to be stored once, in a common location, and referred to many times by different resources. A similar practice is used in the styling of pages, which may link to common CSS or shared design elements. From Fig. 1 the features that do not seem to provide a good WBD solution are: page anchors, text and “mailto” links.
Figure 2 shows the best results obtained when considering pairs of features. In the figure the x-axis lists the 36 possible pairings ordered according to their performance score. It is interesting to note that the best performing pairs were found to be made up of combinations of a strong and weaker single features. The results demonstrate that when using pairings better results are obtained than when using single features. With respect to the pairings, script, image and resource links appear in the top performing combinations. In contrast to what might have been predicted, combining the top performing single features does not necessarily yield top performing combinations. The features that performed the best with respect to the single feature experiments, do however, seem to perform much better when combined with features that appear lower in the performance list generated with respect to the single feature analysis. The most effective lower performing single features, when used in combination with higher performing single features, appear to be: url, mailtolinks and title features.

The best performing feature combinations and performance scores for each RWG data. The highest scoring feature combination in each case has been highlighted.
The results from the triple feature combinations, not shown here, were also evaluated. It was observed that the results obtained were consistent with the results obtained using pairs of features; under performing features appear to increase the performance of the high performing features when used in combination. In conclusion the top performing features are resource, script and image links.
This sub-section presents the results obtained from the evaluation conducted to determine WBD solutions using the DF technique. Recall that the CF technique uses a predefined set of features and, as demonstrated in the foregoing sub-section, there is no best set of features guaranteed to produce a best WBD solution. The DF technique addresses this issue by using a small amount of training data whereby a best set of features can be learnt. The DF technique iteratively combines features, starting with a single feature, until a best set of features is arrived at which can then be used to derive a WBD solution in the same manner as in the case of the CF technique. On each iteration, the combination that provides the biggest increase in performance score with respect to the WBD solution is chosen to be the “feature set so far”.
The training data that was used for the DF technique was a labelling of pages within the website boundary solution for each of the input data sets. Recall that as the DF technique progresses this labelled data was used to evaluate the best combination of features, such that a best final feature combination was derived which could then be used to generate a WBD solution. On each iteration the resulting WBD solutions are compared to the known training data solution and the best performing feature combination selected for use in the next iteration. This iterative process builds combinations of features. The results obtained are given in Fig. 3 which shows the best feature combination, and corresponding performance score, from each iteration as the DF algorithm progresses (with respect to the RWG evaluation data sets). From the table it can be seen that, as expected, the highest performing feature combinations contain features that were identified as high performing using the CF approach. Namely resource, script and image links.
Figure 4 shows the same results as presented in Fig. 3 but in the form of line graphs. In Fig. 3 performance score is listed on the y-axis and size of feature combination, from 2 to all 9, along the x-axis. From the figure it can be seen that by using all features in combination produced the worst performing WBD solution with respect to all four test data sets (probably because the representation got to convoluted).

The best WBD performance score for each feature combination size, the highest scoring combination for each data set is highlighted by a dot.
The results presented in Fig. 3 and Fig. 4 corroborate the results presented in Section 7.2.1, namely that the best feature combinations are those that comprise both at least one high and at least one mid performing single feature. It can also be concluded that there always exists a combination of features that gives a highest WBD performance score, and that the size of this best combination is somewhere between three and seven.
In this sub-section the results from experiments conducted to identify the most appropriate clustering algorithm to be used are presented. Four clustering algorithms were considered as follows:
Kmeans (Centroid based) [37]
Bisecting Kmeans (Hierarchical based) [55]
DBScan (Density based) [23]
KNN (Centroid based) [20]
Figure 5 shows the ranked average WBD performance score, with respect to each clustering algorithm, with respect to the four data sets considered and using single, double and triple feature combinations. Both the Kmeans clustering algorithm and the bisecting Kmeans algorithm require the number of clusters to be pre-specified. In contrast, KNN and DBScan do not use such a value, instead they rely on a user provided similarity threshold value to determine a cluster configuration. In the evaluation presented in this section the parameters for each algorithm were systematically varied and the best performing output reported. The best performing parameter in each case was as follows: (i) for Kmeans the best number of clusters was

Clustering algorithms ranked according to average WBD performance score across the RWG data using single, double and triple feature combinations.
The main problems with applying clustering algorithms to the different kinds of feature space used in this work is the so called “curse of dimensionality”; whereby, as the number of dimensions increases it becomes increasingly computational intensive to evaluate distance functions with respect to the feature space. The fact that the feature space becomes very sparse as the number of dimensions increases can be considered to be a contributing factor to the poor performance when all features are used.
The evaluation of the two variations of the structure based static WBD approach is presented in this sub-section: (i) Hierarchical Graph Partitioning (HGP) and (ii) Mincut Graph Partitioning (MGP). Recall that the first was based on Newman hierarchical graph partitioning (see Section 7.3.1), while the second was based on the minimum cut algorithm which in turn was based on min-cut max flow theorem (see Section 7.3.2). The objective of the evaluation presented in this section is to measure the performance of the two structure based approaches (HGP and MGP) with respect to the WBD solutions produced. The evaluation results for each technique are presented and discussed in further detail in the following two sub-sections. The results, in both cases, indicated that the structure based static WBD approach has potential in terms of producing high quality WBD solutions. The results obtained also demonstrated that the connectivity between vertices of a single website boundary are in fact encoded in the underlying web graph structure.
HGP evaluation
The evaluation of the HGP technique was conducted by measuring the performance of the Newman HGP algorithm using a sequence of web graph snapshots varying in size from 50 to 450 vertices increasing in steps of 50. The pages contained in each snapshot were collected using a breadth first crawl from a seed (home) page
The results are presented, in terms of performance score as before but also precision and recall, in Fig. 6. From the figure it can be seen that as the snapshot size increased recall also increases while precision decreased. The increase in recall indicated a corresponding increase in the number of target web pages grouped with the seed page. This in turn demonstrated that, as the snap shot size increased, the Newman algorithm was increasingly partitioning the graph in such a way that target pages were not being allocated to too many differing clusters, but were in fact being mostly grouped together (the desired result). The decrease in precision indicated that the number of noise pages grouped with the seed page increased as the snapshot size increased. In other words as the snapshot size increased the HGP technique was increasingly partitioning the graph in such a way that pages belonging to the website of interest (defined by

The results obtained using the HGP technique on the RWG data with varying snapshot sizes.
The results presented in Fig. 6 also indicate that the most appropriate WBD solution associated with each data set is not produced using a single particular snapshot size (number of vertices/pages). What is consistent is that the best WBD solution is arrived at using smaller sized snapshot than larger sized snapshots. This intuitively indicates that reducing the amount of noise improves the WBD solution using the proposed HGP technique. What this suggests in terms of the deployment of the technique is that an improved crawling strategy, that serves to reduce the number of noise pages included in the snapshot, such as that employed by the random walk techniques also considered in this paper, will be beneficial in the context of deriving WBD solutions. In the figure it can also be observed that there is some volatility in the results. The precision values of LivChem and LivMaths drops for certain data set sizes, as does the score in some cases. This is due to the nature of the graph partitioning algorithm working on the hyperlink structure of varying sized snapshots. These snapshot sizes could contain a vastly different hyperlink structure if they contain or omit a high degree page (a page that has many links to other pages). This then causes the HGP technique to segment the graph into a very different set of partitions. Hence the volatility in the results, as shown in Fig. 6.
The evaluation of the proposed MGP technique was directed at analysing the Mincut algorithm’s ability to segment a graph using the concept of flow networks so as to generate a WBD solution. Recall that the Mincut algorithm uses a source s and a sink t vertex within the graph to conceptualise flow, and then makes a cut of the network at the “bottle necks”. For the evaluation of the MGP technique the source s and sink t were iterated over all the possible vertices in the RWG data sets. Thus for a graph containing n vertices this involved

WBD performance scores using the MGP technique with iterating source s and sink t vertices. (performance score on y-axis combination identifier on x-axis).
The results are presented in Fig. 7. The figure shows the performance scores for each of the WBD solutions along the y-axis and the identifier for each possible vertex combinations of s and t along the x-axis. The x-axis is ordered according to the source/sink combination, in BFS order from the seed node, which is displayed at point 0 on the x-axis. From the figure it can be observed that for each of the data sets there exists at least one WBD solution that has a performance score of
This section presents a comparative evaluation of the three proposed dynamic probabilistic random walk techniques for WBD: (i) RW, (ii) SARW and (iii) MHRW. The reported evaluation was conducted by comparing the proposed techniques with two deterministic techniques, namely Breadth First Search (BFS) and Depth First Search (DFS). Recall that the proposed dynamic approaches are based on random walk traversals of the web graph, combined with incremental clustering of web pages to produce a WBD solution. The objectives of the evaluation were:
To compare the operation of the considered techniques in the context of the most appropriate features with which to represent web pages. For this purpose five single features were considered: body text, title text, script links, resource links and image links. These features were chosen because they were the top performing features identified with respect to the analysis presented in Section 7.2 above. Note that, for simplicity, only single features (in contrast to combinations of doubles or triples) were used with respect to the evaluation results presented here.
To compare the performance of the five techniques in terms of runtime and coverage.
To compare the operation of two different clustering algorithms, the Incremental Kmeans (IKM) algorithm [42,48] and the Incremental Clustering Algorithm (ICA) [36], in the context of the techniques under consideration.
With respect to the first objective the results are presented in Fig. 8. The results demonstrate that the title feature exhibits the best performance in terms of the proposed dynamic WBD solutions. The performance of the title feature in the dynamic context contrasts to that of the static context (see Section 7.2.1). The reason why the title feature performed well in the context of the dynamic WBD approaches, whereas it did not perform so well in the context of the static CF approaches, was found to be due to the nature of the incremental clustering and web crawling used. The dynamic approaches were found to be significantly influenced by the order in which web pages were received which in turn was a consequence of the adopted web crawl strategy used.
The advantages offered by usage of the title feature were: (i) it is written by the author/publisher to convey some summary of the page that is quickly digestible by users, (ii) it is focussed on a particular subject, (iii) it is written in a concise manner designed to “round-up” the subject of the www page, (iv) it contains much less noise than (say) the body text of a www page, (v) it is short (few words are typically used) and (vi) it is focussed on a particular subject. As a result usage of www page titles as a feature with which to represent www pages offers good similarity measurement between related pages. The image, script and resource links performed the worst in the dynamic context because they did not provide for effective comparison to allow target and noise pages to be distinguish. It is suggested that the nature of the dynamically changing feature space is the reason for the difference in performance between the static and dynamic approaches founded on the same features.

Average WBD Performance score for dynamic techniques using Body text, Title text, Script links, Resource links and Image links as the feature representation.
In the context of the evaluation with respect to the second objective for the dynamic approach to WBD only the title feature was considered (because the previous experiments had indicated this provided the best performance). The runtime results are presented in Table 7 where the rows represent the five dynamic techniques considered. The first three columns give the
Figure 9 gives plots of the average coverage, target coverage and noise coverage as the dynamic techniques proceed (number of steps along x-axis). The figure shows how the coverage values change over time. From the figure it can also be seen that the graph coverage using RW reaches a high level earlier on in the process in comparison with SARW and MHRW. MHRW is the slowest at covering the total graph although the MHRW technique covers noise pages at a lower rate whilst maintained a high coverage of target pages. Note that the randomised traversal using RW, SARW and MHRW are all subject to the underlying structure of the graph, the hyperlink structure can be complex, and unpredictable. The MHRW approach deals well with the complex structure and produces the best WBD score.
WBD performance for the dynamic techniques considered, using the RWG data sets, ordered according to performance score
From the foregoing reported evaluation it was concluded that the MHRW approach was the best performing in terms of performance score. The approach produced an acceptable WBD solution, while at the same time reducing the amount of noise pages covered. This effectively reduces the resources required to produce a WBD solution.

The average total coverage (top), target coverage (middle) and noise coverage (bottom) for the five dynamic WBD techniques using the RWG data sets and the title feature.
The third and final objective of the evaluation directed at the dynamic approaches was to compare the usage of the IKM and ICA clustering algorithms in the context of dynamic WBD. The average WBD performance score of the BFS and Random Walk (RW) approaches, using ICA and IKM, are shown in Fig. 10. The BFS and RW techniques was chosen for the comparison between the probabilistic and deterministic techniques because they produce the best performance with respect to earlier experiments as reported in Table 7. Figure 10 show the “history” of the graph traversal using both clustering algorithms as the pages of the graph were visited step by step. The history of the walks show how the performance of the WBD solutions changes as the walk proceeds. The BFS technique of traversing the graph produced an ordering of pages such that both the ICA and IKM algorithms effectively converge in under

The average WBD performance score using the BFS and RW techniques and either IKM or ICA clustering.

Comparison of the highest performing approaches for the production of WBD solutions: (Top) WBD performance score, (Middle) Target coverage and (Bottom) Noise coverage.
This section presents an overall comparison of the three WBD approaches presented in this paper. Recall that the above reported evaluations were undertaken in terms of WBD performance score, coverage and run time. The WBD performance score was used to measure how representative the website boundary solution was with respect to each approach. Graph coverage was used with respect to the dynamic approach measure the amount of data gathered (noise and target pages) to produce a WBD solution, which in turn provided an indicator of the potential resource cost associated with requesting, downloading and pre-processing data from the web. Run time was used to measure how long each technique took to produce a WBD solution. The amount of time taken in terms of number of steps was also used to comparatively evaluate the techniques. In addition the foregoing evaluation considered the effect of using various parameters and variations. For the overall comparison reported in this sub-section the best performing variations of each of the proposed approaches was used as follows:
Static content based : DF Static structure based: MGP Dynamic: MHRW using IKM
As before the evaluation was again conducted using the RWG data collection.
The results are presented in Fig. 11. In the figure three bar graphs are presented: performance score, target coverage and noise coverage with respect to the RWG data sets. From the figure it can be seen that the static approaches performed much better than the dynamic approaches in the context of both performance score and coverage. The highest performing WBD solutions were produced using the DF and MGP techniques. This outcome would be expected as the static approaches have access to all the data apriori (assuming a large enough sample); the static approaches can thus use all the data to make relative decisions on what pages are included with respect to the WBD problem. In the dynamic context it is not the case that all data is known in advance, only partial (step-by-step) data is used to produce a WBD solution, but as the process proceeds we can expect the approach to get better. Therefore the decisions made using the dynamic approach are made relative to what is known about the partial data. It can be argued that the dynamic approach produces an adequate WBD solution while at the same time maximising the number of target pages gathered (Fig. 11 Middle) and reducing the number of unwanted noise pages (Fig. 11 Bottom). If a standard cost is to be associated with each web page, which is consistent with the requesting, downloading and pre-processing, then it can be seen from the comparative evaluation that the dynamic approach provides a much lower cost WBD solution, while still maintaining an adequate WBD performance. This is shown by the higher WBD score, and lower noise coverage (which is associated with a higher cost).
Conclusion
In this paper a number of approaches/techniques for addressing the WBD problem have been proposed and evaluated. The significance of WBD is that it has application with respect to automated website archiving and digital preservation to give two examples (see Section 1). The work presented in this paper considered both static and dynamic approaches for resolving the WBD problem. The distinction is that in the static context all the www pages to be considered are first downloaded. This has the consequent advantage that decisions on what groupings to classify web pages into, either contained within a website (target pages) or outside of the website (noise pages), can be made based on all data apriori. In the dynamic context only portions of the Web are considered in turn, the analysis is thus conducted in a step by step manner. This has the consequent advantage that the number of irrelevant (noise) pages that are requested, stored and processed can potentially be reduced. The evaluation of the proposed approaches presented in this work concentrated on an important challenge with respect to WBD which is the resource cost required to request, store and process web content (see Section 1). This was conducted by evaluating the noise and target page coverage. The experimental analysis and evaluation of the static approaches was used to inform the dynamic approach to WBD. The static feature combination and graph partitioning analysis informed the dynamic approaches. The dynamic approach exploited the structural properties of the web graph while also using the content based attributes of web pages with respect to WBD.
Four static approaches, CF, DF, HGP and MGP were considered (and evaluated); content based (CF and DF) and structure based (HGP and MGP). The evaluation of the content based approach concentrated on the feature representation of web pages. The evaluation concluded that the most appropriate features to be used to represent a web page in the context of providing a good WBD solution were: image, script and resource links. The most appropriate clustering algorithm was shown to be Kmeans. The structure based approach concentrated on the structural properties of the hyperlinks in the web graph for the purpose of WBD. The evaluation concluded that hyperlinks could be successfully exploited with respect to WBD. The evaluation of the structure based static approach indicated that the MGP technique, which used the max-flow min-cut theorem, was the most effective structure based WBD approach.
The three dynamic probabilistic random walk techniques were considered: (i) RW, (ii) SARW and (iii) MHRW. In the evaluation the operation of these techniques was compared to two deterministic techniques, namely Breadth First Search (BFS) and Depth First Search (DFS). From the evaluation it was concluded that the MHRW approach was the best performing in terms of performance score. The approach produced an acceptable WBD solution, while at the same time reducing the amount of noise pages covered. This effectively reduces the resources required to produce a WBD solution.
Overall this work argues that the dynamic approach using Random Walk graph traversal and incremental Kmeans clustering provides for both an effective WBD performance and a minimisation of the number of irrelevant noise pages considered, which is increasingly important if we wish to scale the proposed dynamic approach to much larger websites. In particular the Metropolis Hastings Random Walk (MHRW) graph traversal method, coupled with the title feature representation and the incremental Kmeans algorithm (IKM), was the best performing WBD method.,
