Improved algorithm of Context Graph based on feature selection

Abstract

To address the low efficiency of traditional theme crawlers in finding theme-related pages, this paper examines the crawling algorithm based on the Context Graph. After analyzing the working principle and process of the algorithm, we introduce a feature selection algorithm based on word frequency difference and use it to improve the original TF-IDF formula, so that the weights of feature words are calculated more accurately.

Keywords

Context Graph, feature selection, theme crawler, TF-IDF

1. Introduction

Many theme crawling algorithms have been proposed over the history of crawler research. Traditionally, they fall into two main types: algorithms based on text content and algorithms based on hyperlink structure. To implement the Context Graph based theme crawling algorithm, a Context Graph must first be built so that the crawler can find crawling paths on its own; the graph is essentially a hierarchical diagram of the links between pages.

Crawling algorithms based on content analysis can effectively predict the relevance of a page. They have a sound theoretical basis and are relatively simple to compute, but they ignore the link structure between pages. Crawling algorithms based on link structure, represented by PageRank, determine the importance of pages by analysing the link relations between them and order the links to be visited accordingly. However, they consider only the links between pages and ignore the relevance of the page content itself; they are also computationally expensive, which slows down theme crawling. Moreover, both types of algorithm predict the importance of links mainly from information obtained during crawling, which is often only local. Both therefore tend to suffer from the “local optimal” problem, which the improved topic crawling algorithm based on the Context Graph handles better.

Figure 1.

Universal crawler structure.

The Context Graph approach trains a classifier on the page set of each layer using machine learning, estimating the prior probability of each layer and the probabilities of the relevant feature words. During theme crawling, the classifier of each layer is trained first; when a new page is acquired, the text classifier calculates the probability that the page belongs to each layer. The probabilities are then sorted, and the page is assigned to the layer with the highest probability, which places it on the link path.
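The layer assignment described above can be sketched as follows. This is only an illustration: the function name `assign_layer`, the `"other"` label and the `threshold` parameter are assumptions introduced here, not part of the original algorithm description.

```python
def assign_layer(layer_probs, other_label="other", threshold=0.0):
    """Pick the layer with the highest estimated probability.

    layer_probs maps a layer index to the probability produced by the
    text classifier for a newly acquired page. The threshold below which
    the page is sent to the "other" category is a hypothetical parameter.
    """
    best_layer = max(layer_probs, key=layer_probs.get)
    if layer_probs[best_layer] <= threshold:
        return other_label
    return best_layer


# Example: a page that most likely lies one link away from a target page.
print(assign_layer({0: 0.10, 1: 0.65, 2: 0.25}))
```

A crawler could then order its frontier so that pages assigned to lower layers, which are closer to the target pages, are visited first.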

2. Construction of the Context Graph model

M. Diligenti first proposed the Context Graph approach to topic crawling. It rests on the assumption that pages on similar topics are reached through similar link paths. For this crawling strategy to be more reasonable than others, the algorithm must consider not only the theme feature of crawling but also the tunnel feature. For example, when a crawler searches for papers on a particular subject, many pages related to the subject do not contain the papers themselves; instead, they link to other pages that may hold related material or download addresses. Layer by layer, these pages form a hierarchy, and this hierarchy describes the paths along which the target pages can be reached.

To design the topic crawling algorithm based on the Context Graph, the first step is to use the reverse search function of an existing search engine to expand outward from the theme-related sample pages provided by the user, and to check whether these sample pages can be reached from the retrieved pages through a limited number of links. Processing the results of these queries yields the corresponding Context Graph model.

With the Context Graph, the crawler can judge not only whether a newly visited page that is not a target page lies on a path to one, but also how far the page is from the target page, that is, its layer in the path. Training the classifiers in advance is therefore a top priority, and providing the training sample set for the text classifier is an important goal of constructing the Context Graph.

Figure 2.

The Context Graph of the two layers.

3. Classifier training and its application in crawling stage

Classifier training is carried out on a specific set of pages. The content of each page expresses its theme through the keywords that represent it. To train the classifier for one layer, the pages are divided into layers according to the established model, and the pages of the same layer are merged. The result is the so-called merged-page Context Graph.

Here, we use TF-IDF for analysis. TF-IDF (term frequency-inverse document frequency) is a commonly used weighting technique in information retrieval and text mining. Its main idea is that if a word or phrase appears frequently in one article (high TF) but rarely in other articles, it has good discriminating power and is suitable for classification.

The first step is to select the features of each layer and, as in the original algorithm, use the TF-IDF formula to calculate the weight of each feature word. For instance, to calculate the weight of a word $m$, let $D$ be the document it belongs to; the weight expresses the importance of the word in document $D$. The TF-IDF weight is calculated as follows:

$\displaystyle v(m)=\frac{f^{D}(m)}{f^{D}_{\max}}\times\log\left(\frac{N}{f(m)}+0.01\right)$ (1)

In this formula, $N$ is the total number of documents in the document collection, $v(m)$ is the weight of word $m$, and $f^{D}(m)$ is the number of times word $m$ appears in document $D$. The number of documents in the collection that contain word $m$ is $f(m)$, and $f^{D}_{\max}$ is the number of occurrences of the most frequent feature word in document $D$. The formula shows that the length of the document affects the calculated weight, so normalization is necessary. To bring the weights into the range between 0 and 1, the following formula is used:

$\displaystyle v(m)=\frac{f^{D}(m)/f^{D}_{\max}\times\log(N/f(m)+0.01)}{\sqrt{\sum_{m\in D}[f^{D}(m)/f^{D}_{\max}\times\log(N/f(m)+0.01)]^{2}}}$ (2)
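As a minimal sketch of Eqs. (1) and (2), the following function computes the normalized TF-IDF weights of one document. The function and parameter names are illustrative assumptions, the natural logarithm is assumed because the text does not state the base, and `doc_freq` is assumed to contain an entry for every word of the document.

```python
import math
from collections import Counter


def tfidf_weights(doc_tokens, doc_freq, n_docs):
    """Normalized TF-IDF weights for one document.

    doc_tokens: list of words in document D
    doc_freq:   maps a word m to f(m), the number of documents containing m
    n_docs:     N, the total number of documents
    """
    counts = Counter(doc_tokens)          # f^D(m) for every word in D
    if not counts:
        return {}
    f_max = max(counts.values())          # f^D_max
    # Unnormalized weight of Eq. (1); natural log is an assumption.
    raw = {
        m: (c / f_max) * math.log(n_docs / doc_freq[m] + 0.01)
        for m, c in counts.items()
    }
    # Normalization of Eq. (2): divide by the Euclidean length.
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {m: (w / norm if norm else 0.0) for m, w in raw.items()}


weights = tfidf_weights(
    ["car", "engine", "car", "road"],
    {"car": 2, "engine": 1, "road": 4},
    n_docs=4,
)
print(weights)
```

In this toy input, "road" occurs in every document, so its weight is close to zero, while "car" and "engine" dominate the vector.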

This process yields a set of feature words, denoted $K$. The feature words correspond to the page set of each layer of the Context Graph, and each page is represented in vector space form as in the original algorithm, which gives training samples made up of the corresponding standard vectors. The classifier of each layer is then trained on these samples, with the ultimate aim of obtaining the prior probabilities. If the constructed Context Graph has $N$ layers, $N+2$ prior probabilities must be trained: $P(c_{i})\ (i=0,1,2,\ldots,N)$ and $P(c_{\textit{other}})$ (a page that does not belong to any category from 0 to $N$ is assigned to category $c_{\textit{other}}$). The prior probability of a category is estimated from the number of documents of that category in the document set, so a layer with more documents has a larger prior probability. In addition, the probability $P(m_{t}|c_{j})\ (t=1,2,3,\ldots,|v_{j}|)$ that each word appears in the category must be estimated for every layer. The calculation formula is:

$\displaystyle p(m_{t}|c_{j})=\frac{1+\sum_{D_{i}\in E_{j}}f^{D_{i}}(m_{t})\,p(c_{j})}{|v_{j}|+\sum_{D_{i}\in E_{j}}\sum_{s=1}^{|v_{j}|}f^{D_{i}}(m_{s})\,p(c_{j})}$ (3)

$f^{D_{i}}(m_{t})$ refers to the number of times that the word $m_{t}$ appears in the document $D_{i}$, and $|v_{j}|$ refers to the total number of words in the feature word set.
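The smoothed estimate in Eq. (3) can be illustrated with a short sketch. It makes a simplifying assumption that is not stated in the text: every training document is counted with weight 1 for its own layer, so the factor $p(c_{j})$ drops out and the formula reduces to add-one (Laplace) smoothing. The names `word_probabilities`, `layer_docs` and `vocabulary` are hypothetical.

```python
from collections import Counter


def word_probabilities(layer_docs, vocabulary):
    """Add-one smoothed word probabilities for one layer (category).

    layer_docs: list of token lists, the training documents of the layer
    vocabulary: list of distinct feature words
    Assumption: each document counts fully towards this layer.
    """
    vocab = set(vocabulary)
    counts = Counter()
    for doc in layer_docs:
        counts.update(t for t in doc if t in vocab)
    total = sum(counts[m] for m in vocab)   # all feature word occurrences
    size = len(vocab)                       # |v_j|
    return {m: (1 + counts[m]) / (size + total) for m in vocab}


probs = word_probabilities(
    [["car", "engine", "car"], ["car", "wheel"]],
    ["car", "engine", "wheel", "road"],
)
print(probs["car"], probs["road"])
```

Because of the added 1 in the numerator, a feature word that never occurs in the layer ("road" here) still receives a small non-zero probability.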

To construct an $N$-layer Context Graph, $N+2$ prior probability categories are required, and the text classifier accordingly needs to be built with $N+1$ categories. This approach avoids the problem of collecting geometry for training documents.

4. Feature selection and algorithm application

Traditional theme crawling algorithms have obvious deficiencies that reduce search speed. The topic crawling algorithm based on the Context Graph solves part of the problem: it judges whether a newly visited page is a predecessor node of a target page, predicts the page's layer in the crawl path from its distance to the target page, and retains as many potentially valuable crawling paths as possible. Compared with traditional crawling algorithms, it therefore has some noticeable advantages. However, it also has shortcomings. The weight of each feature word is calculated from the page set obtained by the crawler, but the resulting values are not accurate, because the page set crawled for each layer is not typical enough and the TF-IDF formula itself has certain deficiencies. To make up for these shortcomings, this paper proposes an improved algorithm that also refines the TF-IDF formula and, in particular, addresses the inaccurate weight calculation of feature words.

4.1 Document feature application

The higher the dimension of the feature word set, the more complex feature extraction becomes and the longer crawling takes, which also affects the quality of crawling. Reducing the dimension is therefore the problem to be solved and the core of feature selection for a document set. The feature selection process for a document set is shown in Fig. 3.

Figure 3.

Processing of document set feature selection.

The feature words in a raw feature set are not all representative, so their exact weights cannot be calculated directly. A feature selection algorithm is therefore needed to evaluate them, and the evaluation is carried out by constructing an evaluation function. After the evaluation, the representative feature word set and the corresponding weights are obtained. The evaluation function is thus the core of the evaluation process. Two commonly used evaluation functions are described below.

4.1.1 Information gain evaluation

First, we fix some notation. $\textit{IG}(m_{k})$ is the information gain of the word attribute $m_{k}$, $D$ denotes the whole document collection, and $d$ denotes a specific category in it. $p(d)$ is the probability that a document belongs to category $d$, $p(d|\overline{m_{k}})$ is the conditional probability that a document belongs to category $d$ given that it does not contain the word $m_{k}$, and $p(d|m_{k})$ is the conditional probability that a document belongs to category $d$ given that the word $m_{k}$ appears in it. With $H$ denoting entropy, $\textit{IG}(m_{k})$ is calculated by the following formula:

$\displaystyle\textit{IG}(m_{k})=H(D)-H(D|m_{k})$ (4)
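Eq. (4) can be made concrete with a small sketch that estimates the probabilities from labelled documents. The data layout (a list of `(category, set_of_words)` pairs), the function names and the use of base-2 logarithms are assumptions made for illustration; the conditional entropy is taken as the usual weighted sum over documents that contain the word and documents that do not.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum(
        (c / total) * math.log2(c / total) for c in Counter(labels).values()
    )


def information_gain(docs, word):
    """IG(word) = H(D) - H(D | word), estimated from labelled documents.

    docs: list of (category, set_of_words) pairs (hypothetical layout)
    """
    all_labels = [cat for cat, _ in docs]
    with_word = [cat for cat, words in docs if word in words]
    without_word = [cat for cat, words in docs if word not in words]
    p_with = len(with_word) / len(docs)
    conditional = (
        p_with * entropy(with_word) + (1 - p_with) * entropy(without_word)
    )
    return entropy(all_labels) - conditional


docs = [
    ("car", {"engine", "wheel"}),
    ("car", {"engine"}),
    ("sport", {"ball"}),
    ("sport", {"ball", "wheel"}),
]
print(information_gain(docs, "engine"), information_gain(docs, "wheel"))
```

In this toy collection "engine" separates the two categories perfectly and receives the maximum gain of 1 bit, whereas "wheel" is spread evenly over both categories and receives a gain of 0.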

4.1.2 Mutual information function MI

Let $p(d,m_{k})$ be the probability that the word $m_{k}$ occurs in category $d$, $p(d)$ the occurrence probability of category $d$, and $p(m_{k})$ the occurrence probability of the word $m_{k}$. The mutual information $\textit{MI}(d,m_{k})$ of attribute $m_{k}$ is then calculated by:

$\displaystyle\textit{MI}(d,m_{k})=\log\left(\frac{p(d,m_{k})}{p(d)p(m_{k})}\right)$ (5)

In practice, the average over all categories is usually taken:

$\displaystyle\textit{MI}_{\textit{avg}}(m_{k})=\sum_{d\in D}{p(d)\textit{MI}(d% ,m_{k})}$ (6)
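A minimal sketch of Eqs. (5) and (6) follows. Probabilities are estimated as document fractions, which is an assumption, as are the function names and the `(category, set_of_words)` data layout. Categories in which the word never occurs are skipped, because the logarithm in Eq. (5) is undefined for a zero joint probability.

```python
import math


def mutual_information(p_joint, p_d, p_m):
    """Eq. (5): MI(d, m_k) = log(p(d, m_k) / (p(d) * p(m_k)))."""
    return math.log(p_joint / (p_d * p_m))


def average_mi(docs, word):
    """Eq. (6): category-weighted average of MI(d, word).

    docs: list of (category, set_of_words) pairs (hypothetical layout)
    """
    n = len(docs)
    p_m = sum(1 for _, words in docs if word in words) / n
    total = 0.0
    for cat in {c for c, _ in docs}:
        p_d = sum(1 for c, _ in docs if c == cat) / n
        p_joint = sum(1 for c, w in docs if c == cat and word in w) / n
        if p_joint > 0:  # skip undefined log(0) terms (an assumption)
            total += p_d * mutual_information(p_joint, p_d, p_m)
    return total


docs = [
    ("car", {"engine", "wheel"}),
    ("car", {"engine"}),
    ("sport", {"ball"}),
    ("sport", {"ball", "wheel"}),
]
print(average_mi(docs, "engine"), average_mi(docs, "wheel"))
```

As with information gain, a word concentrated in one category ("engine") scores higher than a word spread evenly over the categories ("wheel", whose average mutual information is 0 here).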

4.2 Algorithm improvement application

If $p(w)$ represents the probability of event $w$, the amount of information it carries is defined in information theory by the following formula:

$\displaystyle I(w)=\log\left(\frac{1}{p(w)}\right)$ (7)

This shows that the greater the probability of an event, the less information it provides when it occurs, and vice versa. Similarly, the less frequently a word appears, the higher its weight and the higher its probability of being selected as a feature word.

Based on this theory, once $p(d)$, the occurrence probability of documents belonging to category $d$, is determined, the value of $p(d,m_{k})$ generally does not vary much. Expanding Eq. (5) gives Eq. (8):

$\displaystyle\textit{MI}(d,m_{k})=\log\left(\frac{p(d,m_{k})}{p(d)p(m_{k})}\right)=\log p(d,m_{k})-\log p(d)-\log p(m_{k})$ (8)

This formula shows that, with $p(d)$ fixed, $\log p(m_{k})$ must be small for $\textit{MI}(d,m_{k})$ to be large. A word that appears less frequently therefore receives higher importance. As a consequence, although such words make up only a small proportion of the word segmentation set, they are extracted as feature words to represent the specific content and characteristics of a document.

After the text classifier classifies the training samples, the resulting classification is not detailed or accurate enough. In practical applications, the document set representing a category should therefore be refined into subcategories. For example, the computer category can be subdivided into mainframe, microcomputer and other subcategories. A large category may contain many subcategories in different proportions, so the sample set itself restricts the application. For instance, in a collection of computer-class documents, 70% of the documents may belong to the minicomputer subcategory, 20% to the mainframe subcategory, and only 10% to the giant-scale computer subcategory. Different subcategories may share feature words, and these shared words indicate that the subcategories belong to the computer category. However, because the giant-scale computer subcategory contributes only a few feature words and documents to the whole computer collection, its words receive little attention and, following the reasoning above, are wrongly discarded.

To avoid this, this paper proposes another criterion: a word is important enough to be extracted as long as it occurs frequently in the document set of some subclass, even if it appears only rarely in the document set of the class as a whole.

On this basis, this paper proposes a new feature selection method to remedy these deficiencies. The feature selection algorithm is based on word frequency difference, and it improves the existing TF-IDF formula.

We use $m_{k}$ to represent the $k$-th feature word in the document set. The frequency of the word $m_{k}$ in the documents of category $d_{i}$ is represented by $f_{k}^{d_{i}}$, the frequency of $m_{k}$ in the documents of category $d_{j}$ by $f_{k}^{d_{j}}$, and $N$ represents the total number of categories in the document set. The coefficient $\lambda$ is used to adjust the proportion. If $M(m_{k},d_{i})$ denotes the category weight of the word $m_{k}$ on category $d_{i}$, the category weight is calculated by Eq. (9), where $j\neq i$:

$\displaystyle M(m_{k},d_{i})=\exp\left[\frac{1}{f_{k}^{d}\times(N-1)}\sum_{j=1}^{N}(f_{k}^{d_{i}}-f_{k}^{d_{j}})\right]\times\left[1+\lambda\frac{\sqrt{D(d_{i},f_{k}^{m_{j}})}}{E(d_{i},f_{k}^{m_{j}})}\right]$ (9)

Then we can come up with the following improved TF-IDF formula:

$\displaystyle V(m_{k})=\frac{tf(m_{k},d)\times\log\left(\frac{N_{i}}{f(m_{k})}+0.01\right)\times M(m_{k},d_{i})}{\sqrt{\sum_{m_{k}\in d}\left[tf(m_{k},d)\times\log\left(\frac{N_{i}}{f(m_{k})}+0.01\right)\times M(m_{k},d_{i})\right]^{2}}}$ (10)

Eq. (10) adjusts the original formula by multiplying it by the category weight of the word. The reason is that documents of different categories should assign different degrees of importance to the same word: a word that is important for documents of one category is comparatively unimportant for other categories. The improved algorithm reflects this difference.
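The effect of the category weight in Eq. (10) can be sketched as follows. The category weights $M(m_{k},d_{i})$ are taken as given input rather than computed from Eq. (9). The text does not define $N_{i}$; the sketch assumes it is a document count passed in as `n_docs`, and it assumes the natural logarithm. All function and parameter names are hypothetical.

```python
import math
from collections import Counter


def improved_tfidf(doc_tokens, doc_freq, n_docs, category_weight):
    """Category-weighted, normalized TF-IDF in the spirit of Eq. (10).

    doc_tokens:      list of words in document d
    doc_freq:        maps a word to f(m_k), the documents containing it
    n_docs:          assumed meaning of N_i (a document count)
    category_weight: maps a word to M(m_k, d_i), computed beforehand
    """
    counts = Counter(doc_tokens)          # tf(m_k, d)
    if not counts:
        return {}
    raw = {
        m: tf * math.log(n_docs / doc_freq[m] + 0.01) * category_weight[m]
        for m, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {m: (w / norm if norm else 0.0) for m, w in raw.items()}


weights = improved_tfidf(
    ["engine", "turbo"],
    {"engine": 2, "turbo": 2},
    n_docs=10,
    category_weight={"engine": 1.0, "turbo": 2.0},
)
print(weights)
```

The two words have the same term frequency and document frequency, so plain TF-IDF would rank them equally; the category weight alone makes "turbo" count twice as much as "engine".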

On the basis of the original crawling algorithm, we make the following changes using the new feature selection idea and the improved word weight formula:

(1) Samples are collected from an existing corpus (such as Sogou). According to the sample classification, the category weight $M(m_{k},d_{i})$ of each segmented word in the corpus is calculated first. After the weights of all the words are calculated, their average value $\overline{M}(m_{k},d_{i})$ is obtained.

(2) For a document of subject category $d_{i}$, word weights are calculated with the improved TF-IDF formula using the category weights $M(m_{k},d_{i})$ calculated before. For a segmented word whose category weight was not calculated in advance, the value $\overline{M}(m_{k},d_{i})$ (the average of all word weights in the category) is used. The words with the largest results are then extracted to form the feature words of the document.

(3) For the document set of subject category $d_{i}$, the same calculation is applied with the category weights $M(m_{k},d_{i})$ and the improved TF-IDF formula, again using $\overline{M}(m_{k},d_{i})$ for words whose category weight was not calculated in advance. The words with the largest results are then extracted to form the feature word set of the document set.
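Two parts of the steps above lend themselves to a short sketch: falling back to the average category weight for words without a precomputed weight, and keeping the words with the largest scores. The function names, the parameter `k` and the example values are hypothetical and only illustrate the idea.

```python
def fill_missing_weights(words, known_weights):
    """Use the average known category weight for words that have none."""
    if not known_weights:
        raise ValueError("at least one precomputed weight is required")
    mean = sum(known_weights.values()) / len(known_weights)
    return {w: known_weights.get(w, mean) for w in words}


def top_features(scores, k):
    """Return the k words with the largest scores, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:k]]


# "turbo" has no precomputed category weight, so it receives the average.
category_weights = fill_missing_weights(
    ["engine", "turbo", "road"], {"engine": 1.2, "road": 0.8}
)
print(category_weights)
print(top_features({"engine": 0.3, "turbo": 0.9, "road": 0.5}, k=2))
```

The completed weights would then be passed to the improved TF-IDF calculation, and the resulting scores to the selection step.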

5. Experiment and result analysis

Precision ratio, recall ratio and F1 value are the three measures generally used to evaluate the performance of a crawler system. The precision ratio requires judging whether each newly acquired page is related to the topic, which is relatively simple compared with the other two. The recall ratio requires counting the total number of pages related to the crawling topic, which takes considerably more work. Since the precision ratio and the recall ratio should be considered together, this paper uses the F1 value (F-measure) to combine them:

$\displaystyle H(j)=\frac{2}{\frac{1}{r(j)}+\frac{1}{p(j)}}=\frac{2r(j)p(j)}{r(% j)+p(j)}$ (11)

First, the crawling results are sorted, and the first $j$ pages are taken as the sample. $r(j)$ is the recall ratio and $p(j)$ is the precision ratio calculated on this sample, and $H(j)$ is the resulting F1 value.
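Eq. (11) is the harmonic mean of precision and recall and can be checked with a few lines of code. The function name `f1_score` is an illustrative choice; the input values are the precision and recall reported in Table 1 for the original algorithm at 1000 pages.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, as in Eq. (11)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Original Context Graph algorithm at 1000 pages (values from Table 1).
print(round(f1_score(0.689, 0.715), 4))  # 0.7018
```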

At present, the Sogou corpus is relatively comprehensive. It mainly contains news saved from the Sohu website, with the news texts and their category information sorted and classified manually, and it provides many experimental samples. This paper also uses the Sogou corpus as the data source for extracting documents and feature words.

To compare performance, this paper runs the original crawling algorithm based on the Context Graph and the improved theme crawling algorithm on the theme “car”. To obtain comparable data, the recall ratio, precision ratio and F1 value of both algorithms are calculated when the number of pages obtained reaches 1000, 1500, $\ldots$, 4000. The two crawling algorithms are then compared on these data. The results are shown in Table 1.

Table 1

Comparison of recall ratio, precision ratio and F1 value

Number of pages searched	Algorithm	Precision ratio	Recall ratio	F1 value
1000	Topic crawling algorithm based on Context Graph	0.689	0.715	0.7018
	Improved topic crawling algorithm based on Context Graph	0.775	0.813	0.7935
1500	Topic crawling algorithm based on Context Graph	0.710	0.718	0.7140
	Improved topic crawling algorithm based on Context Graph	0.813	0.815	0.8140
2000	Topic crawling algorithm based on Context Graph	0.713	0.725	0.7190
	Improved topic crawling algorithm based on Context Graph	0.830	0.819	0.8224
2500	Topic crawling algorithm based on Context Graph	0.715	0.736	0.7253
	Improved topic crawling algorithm based on Context Graph	0.800	0.809	0.8045
3000	Topic crawling algorithm based on Context Graph	0.715	0.723	0.7190
	Improved topic crawling algorithm based on Context Graph	0.778	0.805	0.7913
3500	Topic crawling algorithm based on Context Graph	0.714	0.716	0.7155
	Improved topic crawling algorithm based on Context Graph	0.770	0.812	0.7951
4000	Topic crawling algorithm based on Context Graph	0.716	0.727	0.7215
	Improved topic crawling algorithm based on Context Graph	0.789	0.822	0.8061

With the new feature selection algorithm and the improved TF-IDF formula, the performance of the theme crawler system improves considerably: in Table 1 the F1 value rises from between 0.70 and 0.73 for the original algorithm to between 0.79 and 0.82 for the improved one. The figure below compares the performance of the two algorithms more intuitively:

Figure 4.

Comparison of F1 values of the two algorithms.

Comparing the values obtained by the two algorithms shows that the improved algorithm achieves a higher recall ratio than the original topic crawling algorithm based on the Context Graph at every sample size in Table 1, and a higher precision ratio as well. The improved Context Graph crawling algorithm therefore has an advantage over the original in both precision and recall.

6. Conclusions

With the rapid development of the network, theme crawlers can meet the needs of users who pursue “specialization, sophistication, profundity”. Traditional crawlers cannot handle both the theme feature and the tunnel feature of a crawling search strategy. The search strategy based on the Context Graph handles the tunnel problem well, but it does not take link anchor text, titles or page text into account, and judging layers alone addresses only the crawling speed problem. After analyzing the working principle and process of the algorithm, this paper introduces a feature selection algorithm and improves the original TF-IDF formula accordingly, which makes the feature word weights more accurate and improves crawling performance.

Some deficiencies remain. In practical applications, noisy pages may appear, that is, pages whose links do not match the search theme; how to eliminate them is a matter for further study.

Acknowledgments

This work was supported in part by the Science and Technology Project of the Jilin Provincial Department of Education (JJKH20180949KJ, JJKH20191203KJ) and by the Jilin Provincial Science and Technology Department (Nos. 20190201195JC, 20180101047JC).
