UCrawler: A learning-based web crawler using a URL knowledge base

Abstract

Focused crawlers, as fundamental components of vertical search engines, focus on crawling the web pages related to a specific topic. Existing focused crawlers commonly suffer from the problems of low efficiency of crawling pages and subject migration. In this paper, we propose a learning-based focused crawler using a URL knowledge base. To improve the accuracy of similarity, the similarity of the topic is measured with the parent page content, anchor information, and URL content. The URL content is also learned and updated iteratively and continuously. Within the crawler, we implement a crawling mechanism based on a combination of content analysis and simple link analysis crawler strategy, which decreases computational complexity and avoids the locality problem of crawling. Experimental results show that our proposed algorithm achieves a better precision than traditional methods including the shark-search and best-first search algorithms, and avoids the local optimum problem of crawling.

Keywords

Focused crawler crawling strategy URL knowledge base URL learning

1. Introduction

With the rapid growth of data volume on the Internet, numerous people retrieve information using search engines such as Google and Baidu. Although those search engines can provide comprehensive search results, general search engines also have low recall when the required results belong to a specific topic, resulting in query quality degradation, which is often expressed as many irrelevant results and omission of some relevant results. To address these problems, many researchers propose to build vertical search engines that focus on the web pages related to a specific topic. Designing an effective focused crawler is a significant and challenging issue in implementing vertical search engines. The goal of the focused crawler is to fetch as many relevant web pages as possible and discard irrelevant web pages. To implement an effective and efficient focused crawler, several problems should be solved [1], including defining the topic being focused on, judging whether a web page is related to the topic, determining the order of scheduling web crawl, etc.

Several crawling strategies for focused crawlers have been proposed in the literature. These crawling strategies can be divided into three categories: Text-Based Crawling Strategies (TBCS for short) [2, 3, 4] that utilize the text content guiding the crawl trajectory; Link-Based rawling Strategies (LBCS for short) [5, 6, 7] that achieve a comprehensive description of web link information and reflect the entire web; and classification-based crawling strategies (CBCS) [8, 9, 10] that obtain topic web pages by classifying the web pages. TBCS crawlers achieves good performance in finding the pages relevant to a specific subject, but it is apt to fall into a local optimum. LBCS achieves better guidance than generic crawlers, but it is prone to “topic drift.” CBCS has difficulty in reflecting the overall structure of the web, and the prevalence of high computational complexity. Some focused crawlers are built based on a combination of CBCS and LBCS, but computation cost is high.

To address these problems, we designed a focused crawler named UCrawler, which is constructed based on URL information learning combined with content and link analyses. In UCrawler, potential information in the text and URL is mined to avoid locality problem, using the methods of web content analysis and link analysis. More specifically, the first step is to calculate the similarity between the target URL and the parent page content, and then obtain the information owned by the URL string using a machine learning method, to improve the accuracy of URL relevance computation. We also propose a novel crawling strategy that combines text-based content analysis and simple link analysis to avoid crawlers operating only in a local optimum and reduce the high computational complexity.

The rest of this paper is organized as follows. We discuss the related work on focused crawling in Section 2. Section 3 introduces the design of UCrawler. Section 4 describes the relevance calculation model. Details of the URL knowledge construction, URL learning, and URL knowledge update of UCrawler are presented in Section 5. Section 6 describes the crawling strategy. We evaluate our proposed focused crawler and compare it with two classic focused crawlers in Section 7. We conclude this paper and introduce our future work in Section 8.

2. Related works

A crawler is a program that downloads a portion of the web via a breadth-first search. Starting from seed URLs, a crawler will systematically download all links branching outward, and then the links branching from those web pages. The focused crawler of a special-purpose search engine aims to selectively seek out pages that are relevant to a pre-defined set of topics, rather than to exploit all regions of the web. A focused crawler is developed to collect relevant web pages of interesting topics from the Internet. Further, crawlers retrieve pages at rapid speed to keep the search engine indices up-to date.

Figure 1.

Focused crawler architecture.

Figure 1 depicts the architecture of a focused crawler. The user only wants relevant results, so a focused crawler is required to maximize relevancy. This is accomplished by the use of a priority queue and a relevance classifier. This classifier is attuned to the interests of the user and is applied to each web page, only those with a high relevancy score are crawled. There are various vector space models used to determine relevancy as well as self-learning statistical models.

TBCS methods use web content, anchor text, context, and other information to select a direction to traverse the Internet. The procedure in TBCS can be described as follows. The crawler starts with a set of seed page URLs, which guide the search for pages of interest. They follow the criteria for assigning higher download priorities to links based on their likelihood of leading to pages on the topic of queries. Pages pointed to by links with higher priorities are downloaded first. The crawler proceeds to recursively fetch the links contained in the downloaded pages. Typically, download priorities are measured based on the similarity between the topic and the anchor text of a page link, or between the topic and text of the page containing the link. Text similarity is computed by similarity measurements such as the Boolean model or the Vector Space Model (VSM). The emphasis algorithms of the strategy include the fish-search algorithm, the shark-search algorithm, and the best-first search algorithm.

Fish-search algorithm dynamically maintains a list of non-crawled URLs in the order of priority of being gathered and downloads the web pages according to the URL queue. In this procedure, URLs in the relevant web pages are assigned with higher priorities than the others in the irrelevant web pages. The procedure for the fish-search crawler model is straightforward. Dynamic crawling and crawling as a guided strategy have some effects. Meanwhile, several shortcomings are identified, such as using string matching to judge whether a web page is related to a subject; this approach has limited judgement accuracy.

Shark-search algorithm improves on the fish-search algorithm by considering the effects of the characters in the hyperlinks when the priority of URLs is being calculated in this algorithm. At the same time, the VSM is utilized to calculate the similarity of web pages. The shark-search algorithm has the advantage of a good theoretical basis, and its relevance computation is simple. The algorithm shows good performance at a distance closer to the relevant page. However, because the text information lacks global semantics, the overall situation of the web is difficult to reflect. The crawler always downloads pages with high relevance scores, and discards other pages with low relevance scores. Qiu et al [23] proposed an improved version of shark-search algorithm, which evaluates the value of URLs by combining anchor text, the adjacent text of anchor text and the value of authority web. This method increases the speed of the topic crawler even with the increasing of number of crawled pages.

The best-first search estimates page-to-topic relevance as document similarity. Computing the similarity measured by traditional information retrieval models (e.g., using VSM as best-first crawlers do) is based on lexical term matching, that is, two documents are similar if they share common terms. However, the lack of common terms in two documents does not necessarily mean that the documents are unrelated. Despite its simplicity and rich theoretical foundation, good performance is difficult to achieve in environments where diverse topics exist. TBCS failed to take into account the information owned by hyperlinks, which contribute to the performance improvements generated by the focused crawler crawling strategy.

LBCS methods achieve a comprehensive description of the web link information and can reflect the entire web. Traditional examples of these algorithms include PageRank [5] and HITS [6] algorithms. Zong et al. [7] suggested improvements for the HITS algorithm. PageRank crawler can reflect link relationships among pages, and does not consider the relationship between the web pages and specific topics. Consequently, PageRank crawlers are apt to lose their directions to a specific topic, and thereby miss more topic-specific pages. An improved version of HITS [7] introduces a center website (Hub) and an authority website (Authority) through the Hub web pages and Authority web pages to reflect the overall situation of the web link. All these algorithms achieve high effectiveness for general crawlers. They are suitable for finding high-authority web pages rather than topic-specific pages. In addition, the calculation complexity is relatively higher and the update latency is larger, which seriously affects crawler crawling. Cai [24] combined fish-search and PageRank algorithm to calculate the relevance between the web content and the topic. Liu and Li [25] introduced HITS algorithm, and integrated the page content and web links to improve the topic relevance and authority.

Chakrabarti et al. [8] proposed CBCS for comprehensive crawler systems. Their research goals concentrate on the relations between web links and semi-supervised learning. In these systems, the focused crawler uses standard topic classification, and on the view of a user-specified starting point, labels web pages that are of interest to the user and classifies web pages according to user interests. Wang et al. [9] proposed a focused crawler based on the Bayesian classifier, using the naive Bayes classifier to calculate topic relevance. This classification-based focused crawler can deeply determine user interests. However, it cannot reflect the overall structure of the web, because its computational complexity is higher and the threshold is hard to determine.

Yuki et al. [11], Chen and Zhang [12], and Luo et al. [13] proposed a combination of content and link strategies. References [11, 12] combine a web content analysis and PageRank algorithm, and [13] is a combination of the shark-search and HITS algorithms. Although these algorithms improve the precision rate, the implementation is complex and the computation cost is high.

Zheng et al. [14] proposed a learnable semantic-focused crawling framework based on ontology. An artificial neural network (ANN) was constructed using a domain-specific ontology and applied to the classification of web pages. Batsakis et al. [15] used multiple sets of experimental verification: semantic crawlers based on the ontology did not improve performance over text-based crawling, with better results obtained when semantic relations were restricted to synonymous terms. However, expecting that their performance can be improved is plausible using application-speci?c ontologies, instead of general purpose ontologies such as the WordNet [16]. Therefore, this group of crawlers is unsuitable for guiding most focused crawlers.

Hsieh et al [21] introduced an extensible crawler, a utility service that crawls the Web on behalf of its many client applications. The extensible crawler lets clients specify filters that are evaluated over each crawled Web page. The act of crawling the Web is decoupled from the application-specific logic of determining if a page is of interest, shielding Web-crawler applications from the burden of crawling the Web themselves.

Ahlers [22] developed a focused crawler with an adaptive geospatial focused on efficiently retrieves and identifies location-relevant documents on the Web. Based on geospatial features of both Web pages and the link graph, the crawler prioritizes promising links and pages, leading to high-efficiency access of the desired geospatial topic.

Jiang et al [26] described the design and implementation of a university focused crawler that runs on BP network classifier for prediction of the links leading to relevant pages. These work in [21, 22, 26] are orthogonal to our research and can be easily combined with UCrawler.

To Address the disadvantages of various existing focused crawlers, we present a focused crawler based on the URL knowledge base combined with content and link analyses. To guide the URL to calculate URL relevance, URL knowledge is updated iteratively by URL learning. Extensive experiments demonstrate that our proposed crawler can achieve good performance.

3. The architecture of UCrawler

In a focused crawler system, we mainly focus on the design of three modules: a feature vector extraction module, a relevance calculation module, and a web scheduling crawl module. The feature vector extraction module is used to extract the characteristics of the topic and expressed them as a weighted vector. The relevance calculation module calculates the relevance between the web page and the topic. The web scheduling crawl module prioritizes pages according to the scheduling strategy. High-priority pages are crawled preferentially, thus ensuring the high quality of the page that has the highest relevance to the topic and that can be crawled in a short time, as well as the ability of the crawler to better extend to the entire web.

3.1 Design goals

UCrawler architecture has been guided by several principles and system goals:

High efficiency. The primary goal of a focused crawler is to reduce the number of irrelevant Web pages, while preserving pages that users interest. To minimize the cost of page crawling, the crawler is required to pre-evaluate the relevancy of a page before the page is download. For guiding the crawling process, a topic filter with topic-specific knowledge base, is used to make sure all crawled pages are limited to a given topic. High coverage. Relevant pages for a given topic are always distributed in a large number of websites. Focused crawler should starts from a small set of seed sites, and finding the relevant sites as many as possible. Achieving a high recall is one of key challenges for focused crawlers. High Adaptivity. Topic locality is a traditional problem for focused crawler. New terms or topics rise all over the time. Recognition of new terms put forward a significant challenge to the crawler. UCrawler is constructed based on a URL knowledge base, which is continuously updated. High extensibility. A focused crawler should support various web page analyzer and, relevance calculation model and topic filters. One of our specific goals is to design these components of UCrawler as pluggable parts, and make UCrawler extended to more generable models and topics.

3.2 Ucrawler design

The architecture of UCrawler is illustrated in Fig. 2. The UCrawler workflow is executed as follows: (a) The web crawler starts downloading from a given URL seeds that are center web pages and authority web pages of the specific topic. (b) The web page analyzer analyzes the web content and links of the download to eliminate noise information and extract web page content, anchor information, web link information, and some topic-related information. (c) The relevance calculation is conducted based on content, anchor information, and URL information. (d) From the result of the relevance calculation, the topic filter analyzes whether the content of parsed pages is related to the topic or not. (e) URL learning learns information that is relevant to the topic, extracting the topic information. (f) The web crawl scheduler formulates a crawling strategy to prioritize each URL, and then adds them to the priority queue. The web crawler downloads the seeds in the URL after the priority queue has been established.

The process pseudocode of the crawler system to build the URL knowledge base is shown in Table 1.

Table 1
URL knowledge base learning

Input: URL seeds
Output: topic correlation thesaurus
#keep updating the URL knowledge base
while(True):
if URL_queue.size() $>$ 0:
# the first URL in URL_queue is fetched
current_URL $=$ URL_queue.get()
# store the web page
store(current_URL)
# extract the URL
for next_URL in extract_URLs(current_URL):
if next_URL not in seen:
seen.put(next_URL)
URL_queue.put(next_URL)
else:
break

Figure 2.

Design of UCrawler.

4. Relevance calculation

The topic crawler selectively crawls topic-related web pages, it filters the less relevant pages by calculating the relevance of each page to a particular topic. In this paper, the vector space model (VSM) is used to calculate the correlation between web pages and topics. Therefore, the establishment of topic vectors becomes the primary problem in the calculation of topic relevance.

4.1 Establishment of the topic vector

The topic relevance of web pages is largely dependent on the topic vector, which directly determines the crawling effect of a focused crawler. The topic vectors are extracted from hub and authority web pages, formed as a set of representative keywords with corresponding weights.

In the UCrawler, the calculation of web page topic relevance relies on the two vectors: the keywords vector for web content and the keywords vector for anchor, which are denoted as Vc and Vn, respectively. Based on Vc and Vn, the similarity of a crawled page can be measured. The list of initial keywords can be selected by some experts at the beginning, and then the keywords can be evolved with more pages are crawled and analyzed. Meantime, the weights for individual keywords will be iterated and converged, forming the topic vector.

4.2 Relevance calculation model

The UCrawler system uses the contents of the parent page, URL anchor information, and the URL string information to determine the relevance between a web page and a topic. Web pages and topics are represented as keyword vectors, and relevance between them is calculated using VSM. The model formula is defined as Eq. (1):

$\displaystyle S(P_{j},N)=\frac{\sum^{t}_{i=1}w_{i,j}\times w_{i,r}}{\sqrt{\sum% ^{t}_{i=1}w^{2}_{i,j}\times\sum^{t}_{i=1}w^{2}_{i,r}}}$ (1)

Where $P_{j}$ is the topic-specific weight vector and $N$ is the web page under investigation. The notions $w_{i,j}$ and $w_{i,r}$ are the weights of term “ $i$ ” in the topic weight vector and in the web page, respectively. The range of $S(P_{j},N)$ is between 0 and 1. The term weights for a document are computed using TF-IDF, where TF is the frequency of a word in given link context and IDF decreases the weight of a word if a page appears numerous times on the page. If a word does not exist in context, its weight is zero.

4.3 Topic relevance calculation

The relevance between a web page and a topic is defined as follows:

$\displaystyle S(\textit{URL},N)=\alpha\cdot S(\textit{content},V_{C})+\beta% \cdot S(\textit{anchor},V_{C})+\omega\cdot S(\textit{URL},V_{N})$ (2)

where $S(\textit{content},V_{C})$ , $S(\textit{anchor},V_{C})$ , $S(\textit{url},V_{N})$ represent, respectively, the relevance between the parent page and the topic, the anchor information and the topic, and the URL content and the topic. Relevance calculation combines the VSM of Eq. (1) where the parameters $\alpha$ , $\beta$ and $\omega$ satisfies:

$\displaystyle\alpha+\beta+\omega=1$ (3)

To find optimal values for alpha, beta and omega, two groups of training data are required. One group contains the relevant urls and pages, the other groups contains the irrelevant urls and pages. At the intial stage, these three parameters are set to 1/3 uniformly. Then the similiarty scores are calculated and used for judging a given URL relevant or not. The parameters are iterately updated for minimizing the error times of url relevance judgement.

5. Constructing the URL knowledge base

In [17], a focused crawler based on URL rule learning is proposed. URL rules related to the topic, represented by regular expressions, are obtained by classifying and clustering. In order to mine more URL rules, a URL knowledge base is built, and updated iteratively by URL learning methods [18, 19, 20].

Figure 3.

URL information learning structure.

As illustrated in Fig. 3, the basic process of learning URL information is described as follows: First, the URL of a given number of pages is specified to obtain a training data set. Second, a topic knowledge base is acquired by learning the obtained URL information using a selected learning algorithm. Third, the topic knowledge base guides the URL to calculate URL relevance and feedback relevance to guide the learning of URL information. Finally, after learning the URL information, the topic knowledge base is updated for further learning and improving system performance.

5.1 URL information learning process

In order to build accurate topic keyword vector, weights of keywords are dynamically updated during learning process, and new keywords are also added automatically. The URL learning process is as follows.

5.1.1 Training data

We analyze a large number of different web pages, which contain not only the authority and center web pages that are relevant to the topic, but also some unrelated pages. Training data on these web pages were collected and categorized into two groups of URL types. Based on the training data, two threshould values, $S_{\text{max}}$ and $S_{\text{min}}$ will be derived. $S_{\text{max}}$ is the lower bound of topic-related page relevances, and $S_{\text{min}}$ is the upper bound of non-related page relances. Based on $S_{\text{max}}$ and $S_{\text{min}}$ , we can get two data samples. Sample 1 is related to the topic with topic relevance greater than $S_{\text{max}}$ , and Sample 2 is not related to the topic, with topic relevance lower than $S_{\text{min}}$ .

5.1.2 Data processing

The URLs are divided into words. The meaningless stop words such as “a,” “an,” and “is,” are removed. The word that can feed back some information is subsequently extracted.

For example, after splitting and processing the URL “http://sports.sina.com.cn/nba/,” we obtain the words “http,” “sports,” “sina,” “com,” “cn,” and “nba.” We then remove the words “http,” “com,” and “cn,” which do not contain useful information. Finally, we acquire the words “sports,” “sina,” and “nba,” which can feed back some information.

After the above process, we derive three thesauruses, each consists of keywords and corresponding weights. A topic relevant thesaurus is obtained from positive training data, i.e. Sample 1. A topic-irrelevant thesaurus is built from negative training data Sample 2, with all weights initialized to zero. We also build a learning thesaurus, which is used to record the keywords and the weights of the keywords related to the topic that are obtained by machine learning. These three thesauruses form the URL knowledge base, which guides the URL relevance calculation.

5.1.3 URL feedback learning

A set of phrases is obtained after splitting the URL as described in the process above. The weights for these phrases are defined on basis of $S(\textit{url},V_{N})$ , which represents the relevance between the URL content and the topic. The URL learning process and the URL knowledge base maintenance process are as follows:

(1)
If $S(\textit{url},V_{N})\leqslant S_{\text{min}}$ , these phrases are added to the topic-irrelevant thesaurus, and the weights of the phrases updated as follows:

$\displaystyle\textit{New}(w_{i})=0.5\times\textit{old}(w_{i})+0.5\times S(% \textit{url},V_{N})$ (4)
(2)
If $S(\textit{url},V_{N})\geqslant S_{\text{max}}$ , the phrase is processed as follows:

–
If the phrase contained in topic-irrelevant thesaurus, then it is discarded.
–
If the phrase contained only in topic relevant thesaurus, the weight of the phrase is updated in the topic-relevant thesaurus, as shown below:

$\displaystyle\textit{New}(w_{i})=0.5\times\textit{old}(w_{i})+0.5\times S(% \textit{url},V_{N})$ (5)
–
If the phrase contained in none of the three thesauruses contain the phrase, including the phrase in the learning thesaurus, the number of occurrences of this phrase is set to 1. The phrase is then added to the topic-relevant thesaurus, and the weight of this phrase is initialized as

$\displaystyle w_{j}=\frac{1}{n^{2}}S(\textit{url},V_{N})$ (6)
–
If the phrase contained in both the learning thesaurus and the topic-relevant thesaurus, the number of occurrences $m$ of this phrase is increased by 1. Given a threshold $n$ , the weight is updated as follows:

If $m>n$ , the weight of this phrase is updated as

$\displaystyle\textit{New}(w_{i})=0.5\times\textit{old}(w_{i})+0.5\times S(% \textit{url},V_{N})$ (7)

If $m\leqslant n$ , the weight of this phrase is updated as

$\displaystyle\textit{New}(w_{i})=0.5\times\textit{old}(w_{i})+0.5\times\frac{(% m-1)^{2}}{n}S(\textit{url},V_{N})$ (8)

The $\textit{old}(w_{i})$ in Eqs (4)–(7) is the original weight of the phrase in the topic-relevant thesaurus. $S(\textit{url},V_{N})$ is relevance between URL and the topic. $\textit{New}(w_{i})$ is the updated weight of phrase “i.” $n$ in Eqs (6) and (8) is a customized threshold parameter. $\frac{1}{n^{2}}S(\textit{url},V_{N})$ and $\frac{(m-1)^{2}}{n}S(\textit{url},V_{N})$ in Eqs (6) and (8) are formulated to reduce misjudgment. Only when the number of occurrences of the phrase in the learning thesaurus greater than $n$ , the phrase is considered as related to the topic, and the weight is updated according to Eq. (7).

5.2 Updating the url knowledge base

As a large number of URLs are generated and disappeared, the URL knowledge base needs to be updated. The update policy is described as follows:

(1)
The phrase in the topic-relevant thesaurus with weight lower than a threshold is removed, because these phrases may be topic-irrelevant.
(2)
The phrases in the topic-irrelevant thesaurus are sorted ascending by weight, then first h% phrases are removed, because they may be topic relevant phrases.

With the URL information learning process, new topic-related phrases are found and existing phrases’ weights are updated continuously. Therefore URL relevance calculation become more accurate.
6. Scheduling web page crawl

The main gain of web page crawl scheduling policy for a focused crawler is to discover and download more topic-related pages, and to crawl high-quality pages with a high priority.

The web page content evaluation scheduling crawling model mainly uses text information in the page text as an analytical basis and calculates the relevance between the page text or the anchor text and the intended topic to determine whether a link is worthy to be crawled. The URL analysis model mainly considers the unique semi-structured characteristics of the web pages and analyzes the relevance, to which hyperlinks point among the web pages to calculate the relevance between the page and the intended topic.

UCrawler adopts a combination of content analysis and simple link analysis crawling strategy. In the web-crawling process, UCrawler, according to the topic relevance of the web content and the URL link analysis, evaluates the priority of the URL. Using Eq. (2) in Section 4.3, URL relevance is calculated as a measure of the web crawl, and then a simple analysis of the number of topic page outputs and the number of new topic web page links found on the web page is performed. The detailed process of the UCrawler web crawl strategy is as follows:

(1)
First, we calculate the relevance between the URL and the topic and obtain the relevance value $s_{0}$ . If the value is less than a given threshold, the URL is omitted.
(2)
A queue named list is created, and all the distinct hosts visited by the crawler are recorded.
(3)
When the number of topic-related pages pointed by this page is greater than a given threshold, a bonus score $k_{0}$ is indicated for the current page.
(4)
When the host of current page isn’t contained in list, a bonus score $k_{1}$ is indicated for the web page, and the page host is added to the list.
(5)
When the topic web page has a different host than its parent web page, a bonus score $k_{2}$ is indicated for the web page.
(6)
rom the above steps (1)–(5), we obtain the final score of a URL as:

$\displaystyle s=s_{0}+k_{0}+k_{1}+k_{2}$ (9)
(7)
According to the URL’s final score, a priority queue is generated, and determine the web page crawling order.

The crawling strategy proposed above is a combination of the advantages of content-based methods and link-based methods. Step (3) emphasizes the role of the center web pages, which points to many high quality topic-relative web pages. Steps (4) and (5) give reward to cross host web links, so that more topic-related web pages can be discovered. Otherwise, the focused crawler may fall into local optimum, i.e. crawling only within the same host.
7. Experiments and analysis

We implemented UCrawler based on the methods proposed in this paper, and evaluate the algorithm performance, the scope of the crawler to crawl, the maintenance of the knowledge base, and the URL learning effect. Precision metric is defined as follows to evaluate crawler performance.

$\displaystyle\text{Precision}=\frac{R}{M}\times 100\%$ (10)

where $R$ is the number of topics relevant to the web page, and $M$ is number of all web pages crawled.

This study tests the algorithm performance of the focused crawler UCrawler using the topics movie, mobile phone, and finance, and compares the algorithm performance with those of the shark-search algorithm and the best-first search algorithm. Each algorithm crawls 10,000 web pages.

Figures 4–6 compare UCrawler with the shark-search and best-first search algorithms on movie, mobile phones, and finance and economics fields.

Figure 4.

Comparison of the precision ratio in the field of movies.

Figure 5.

Comparison of the precision ratio in the field of mobile phones.

As shown in Figs 4–6, the precision ratio of the proposed algorithm is higher than those of the shark-search and best-first search algorithms. The precision of the proposed algorithm decreases moderately even with the increasing of number of crawled pages. Note that the precision will decrease with the number of crawled pages increasing. This is caused by more irrelevant pages are fetched and retrieved. However, the precision of UCrawler decreases moderately even with the increasing of number of crawled pages.

The number of crawled pages is counted to evaluate whether UCrawler exists locality problem. The number of hosts visited by the crawler is shown in Table 2.

Table 2

Comparison of the number of hosts visited

Topic/host number	Shark-Search algorithm	UCrawler algorithm
Movie	8	32
Mobile phone	6	23
Finance	12	35

Figure 6.

Comparison of the precision ratio in the fields of finance.

In Table 2, the number of hosts related to the topic the UCrawler algorithm has visited is higher than that visited by the shark-search algorithm. This result shows that UCrawler achieves a higher ability of tracing the entire web and finding more topic-specific pages than Shark-search.

Base on the number of crawled web pages, we found that shark-search algorithm visits web pages from a limited number of hosts, while UCrawler visits web page from a comprehensive and dispersive hosts. Therefore, we can conclude that the UCrawler has a higher ability of avoiding crawling only in the local optimum compared with the other two.

URL knowledge base maintenance and URL learning are also analyzed in this experiment. The URL knowledge base is constantly updated and improved by URL learning. Table 3 presents the URL knowledge base update process in the field of films. The dash (–) means that the word does not exist in the URL knowledge base.

Table 3

Update of the URL knowledge base in the field of movies

Topic/sweight	Initial value	Update 1	Update 2	Update $k$	Update $n$
film	0.6	0.545434	0.501620	……	0.501620
chinafilm	0.5	0.515143	0.478945	……	0.487425
movie	0.6	0.6	0.492264	……	0.484608
entertainment	0.4	0.431407	0.481871	……	0.478854
movie	0.6	0.6	0.578301	……	0.578301
TV	0.5	0.546677	0.535322	……	0.49566
goldhorse	0.5	0.467734	0.487652	……	0.475666
video	0.3	0.3	0.317966	……	0.317966
jinma	0.5	0.532123	0.521124	……	0.487543
hollywood	05	0.5	0.483223	……	0.484634
mtime	0.5	0.5	0.481163	……	0.483355
horse	–	–	0.287390	……	0.317390
oscar	–	–	–	……	0.496897
redjsen	–	–	0.438713	……	0.487132
golden	–	–	0.237858	……	0.378581
hyfy	–	0.002569	–	……	–
UVLQS	–	0.001233	–	……	–
spxw	–	0.002519	–	……	–
lph	–	–	0.002557	……	–
dcbjg	–	–	0.001338	……	–
……	……	……	……	……	……

In Table 3, the process of URL learning of the topic-relevant thesaurus exhibits irrelevant phrases, such as lph $=$ 0.002557, dcbjg $=$ 0.001338, and so on. Their weights are updated by UCrawler URL feedback learning module to very small values. These phrases are finally removed from topic-relevant thesaurus because of their small weight. Hence, the phrases have almost no effect on the relevance calculation.

During the URL learning, the weight of the phrase in the topic-relevant thesaurus will change iteratively, and we learn some phrases that are related to the topic, such as oscar $=$ 0.496897, horse $=$ 0.317390, golden $=$ 0.378581, redjsen $=$ 0.487132, and so on.

Experimental results shows that, the URL knowledge base is constantly improved by URL learning. Therefore the relevances between web page and topic can be calculated more accurately.

8. Conclusion

One of the key problems of vertical search engines is to construct a focused crawler, which can crawl topic specific web pages accurately and completely. We presented UCrawler, an effective focused web crawler that calculates the topic similarity using parent page content, anchor information, URL content. Our crawling strategy is based on combination of content analysis and simple link analysis. Experimental results show that our approach has better performance than the shark-search and best-first search focused crawlers, and avoids the local optimum problem of crawling. Although the initial results are encouraging, extensive work needs to be done to improve crawling efficiency. A major open issue for future work is to do a more extensive test with a different topic, allowing the crawler to automatically adapt to a different topic.

References

Zhou

L.-Z.

and Lin

, Survey on the research of focused crawling technique, Computer Applications 25(9) (2005), 1965–1968.

Meusel

Mika

and Blanco

, Focused crawling for structured data, ACM International Conference on Information and Knowledge Management, 2014, pp. 1039–1048.

Teregowda

Fernández Ramírez

J.P.

et al., The evolution of a crawling strategy for an academic document search engine: whitelists and blacklists, in Proc. 3rd Annual ACM Web Science Conference, Evanston, IL, USA, June 2012, pp. 340–343.

Zhao

Zhou

Nie

et al., SmartCrawler: A Two-stage crawler for efficiently harvesting deep-web interfaces, IEEE Trans Services Computing 9(4) (2016), 608–620.

Page

Brin

and Motwani

, The PageRank Citation Ranking: Bring Order to the Web, Stanford, CA: Stanford University, 1998.

Kleinberg

J.M.

, Authoritative sources in a hyperlinked environment, Journal of the ACM (1999), 604–632.

Zong

Shen

and Liao

, Improvement of HITS for topic-specific web crawler, Lecture Notes in Computer Science 3644 (2005), 524–532.

Chakrabarti

van den Berg

and Dom

, Focused Crawling: A New Approach to Topic-Specific Web ResourceDiscovery, in Proc. 8th International WWW Conf., Toronto, 1999, pp. 1623–1640.

Wang

Chen

Zou

et al., A Focused Crawler Based on Naïve Bayes Classifier, International Symposium on Intelligent Information Technology and Security Informatics, 2010, pp. 517–521.

10.

Taylan

Poyraz

Akyokus

et al., Intelligent focused crawler: Learning which links to crawl, Innovations in Intelligent Systems and Applications (INISTA), 2011, pp. 504–508.

11.

Uemura

Itokawa

Kitasuka

et al., An Effectively Focused Crawling System, Studies in Computational Intelligence 376 (2012), 61–76.

12.

Chen

and Zhang

, HAWK: A Focused Crawler with Content and Link Analysis, in Proc. the 2008 IEEE ICEBE, 2008, pp. 677–680.

13.

Luo

L.-B.

Chen

and Wu

Q.-X.

, Research on topical crawler of shark-search algorithm and hits algorithm, Computer Technology and Development 20(11) (2010), 76–79.

14.

Zheng

H.-T.

Kang

B.-Y.

and Kim

H.-G.

, An ontology-based approach to learnable focused crawling, Information Sciences 178(23) (2008), 4512–4522.

15.

Batsakis

Petrakis

E.G.M.

and Milios

, “Improving the performance of focused web crawlers, Data & Knowledge Engineering 68(10) (2009), 1001–1013.

16.

WordNet, Available: http://wordnet.princeton.edu/wordnet/.

17.

Zheng

Zhou

et al., URL Rule Based Focused Crawler, IEEE International Conference on e-Business Engineering (2008), 147–154.

18.

Priyanka

, Efficient crawler for gathering and exploring relevant sites from hidden web, International Journal of Emerging Technology and Computer Science 2(2) (2017).

19.

Zheng

Liu

Tian

et al., The web of knowledge mining: Theory, Methods and Applications, Science Press, 2010, pp. 111–116, 136–161.

20.

Dong

and Hussain

F.K.

, Self-adaptive semantic focused crawler for mining services information discovery, IEEE Transactions on Industrial Informatics 10(2) (2014), 1616–1626.

21.

Hsieh

J.M.

Gribble

S.D.

and Levy

H.M.

, The Architecture and Implementation of an Extensible Web Crawler, NSDI (2010), pp. 329–344.

22.

Ahlers

and Boll

, Adaptive geospatially focused crawling, CIKM (2009), pp. 445–454.

23.

Qiu

Lou

and Chang

, A focused crawling algorithm for improving Shark-Search, Micro-computer Applications 33(2) (2017), 19–21.

24.

Cai

, Design and implementation of topic-oriented multi-threaded web crawler, Lanzhou: Northwest University for Nationalities, 2017.

25.

Liu

and Li

, The theme reptile algorithm based on fusion link structure, Journal of Huaqiao University: Natural Science 38(2) (2017), 195–200.

26.

Jiang

Han

Lin

et al., Design and implementation of university focused crawler based on BP network classifier, 2009 Second International Workshop on K, 2009.

UCrawler: A learning-based web crawler using a URL knowledge base

Abstract

Keywords

1. Introduction

2. Related works

3.1 Design goals

3.2 Ucrawler design

Table 1 URL knowledge base learning

4.1 Establishment of the topic vector

4.2 Relevance calculation model

5.1.1 Training data

5.1.2 Data processing

5.1.3 URL feedback learning

References

Table 1
URL knowledge base learning