Abstract
In this paper we define the document phrase maximality index (DPM-index), a new measure to discriminate overlapping keyphrase candidates in a text document. As an application we developed a supervised learning system that uses 18 statistical features, among them the DPM-index and five other new features. We experimentally compared our results with those of 21 keyphrase extraction methods on SemEval-2010/Task-5 scientific articles corpus. When all the systems extract 10 keyphrases per document, our method enhances by 13% the F-score of the best system. In particular, the DPM-index feature increases the F-score of our keyphrase extraction system by a rate of 9%. This makes the DPM-index contribution comparable to that of the well-known TFIDF measure on such a system.
1. Introduction
Given a text document, the keyphrase extraction problem consists of finding terms (phrases or n-grams) that best describe the document. Documents are of different types: scientific papers (most of bibliographical references deal with this document type), news articles [1], meeting transcripts [2], web pages [3], emails [3], social networks [4], etc. Applications of automatic keyphrase extraction include digital libraries management [5], web content-based advertising [6], content-based tag recommendation [7], document and web page retrieval [8], summarization [1], document clustering [9], query expansion [10] and automated sentiment analysis.
In any text extraction problem we are faced with the choice between improving the specificity of an n-gram by extending its size n, which reduces its frequency, or improving its statistical significance by decreasing its size, which reduces the specificity to the considered document. This choice between overlapping n-grams of different sizes needs a salience criterion in order to extract significant terms, which are also specific to the document we are considering. For example, the candidate 3-gram multiple sequence alignment contains two 2-grams, multiple sequence and sequence alignment. These three n-grams could have close TFIDF values – how does one choose the real keyphrases among them? Intuitively, multiple sequence is a wrong candidate, but sequence alignment or multiple sequence alignment or both could be in the main topics of the text document. The comparison of the frequencies of these n-grams in the document can help a learning system to decide which of them are keyphrases. This is achieved in this paper by the DPM-index metric, which gives a way to control both the size and the frequency of an n-gram by preserving both its significance and its specificity.
Recent works in keyphrase extraction area focused mainly on using features based on external knowledge information or document structural information. Our general goal in this paper is to return to the fundamental knowledge extraction question addressed by the pioneer works in the area, which is: how to extract keyphrases from any type of text document without a priori knowledge on keyphrases. For this purpose we developed a supervised learning system that uses 18 statistical features, among them DPM-index and five other new measures. After selecting keyphrase candidates using linguistic information (Part-of-Speech, POS, tags), the system offers the possibility to evaluate the efficiency of the contribution of each statistical feature.
We experimentally compared our results with those of 21 keyphrase extraction methods on SemEval-2010/Task-5 scientific articles corpus. When all the systems extract 10 keyphrases per document, our method increases by a rate of 13% the F-score of the best system. In particular, the DPM-index feature increases the accuracy of our keyphrase extraction system by a rate of 9%. We also show on this data that combining the 18 features using any supervised learning paradigm (boosting, bagging and regression) yields the best performance.
This paper is organized as follows. In Section 2 we give a synthetic state-of-the-art of keyphrase extraction methods based on the features they use to solve the problem. The preprocessing step of our extraction system is summarized in Section 3. The DPM-index measure and other features we use are described in Section 4. In Section 5 we present the utilized classifier. We analyse and compare our results in Section 6 and end with a conclusion.
2. Related works
Designing a keyphrase extraction system consists of selecting the properties that could distinguish keyphrases from other terms. These properties, called features, can be classified according to their origin: term-level features, document-level features, corpus-level features and external knowledge-based features. Keyphrase extraction methods are then traditionally classified according to the features they use into two categories: document-dependent and corpus-dependent. These two categories become four if we take into account that they use external knowledge sources or not.
Document-dependent approaches [11–17] are based on term- and document-level features. These methods are unsupervised and generally use the co-occurrence frequencies of the terms in the considered document. In addition to term and document features, corpus dependent approaches (the majority of cited references) generally use the number of corpus documents that contain a term as an indicator of its specificity. Terms that appear frequently in the document but rarely in the remainder of the corpus are more likely to be keyphrases. External knowledge based approaches use external knowledge sources such as terminological databases: GRISP, MeSH, etc. [5, 18–21], linguistic ressources (WordNet, etc. [22, 23]), article databanks (Wikipedia, etc. [7, 8, 14, 24–28]), search engine query logs [6, 8] and search engine results (rank scores and/or text-snippets from Google, Bing, MSN etc. [24, 29–32]). Terminological databases were used early by text categorization methods [33] that are essentially based on such information to assign a priori known keyphrases to a document.
Features can also be classified according to their nature: statistical, linguistic and structural features. The most used statistical features are term length, that is, the number of words it contains [34, 3], term frequency (TF) in the document [34, 3], inverse document frequency (IDF), which depends on the number of corpus documents that contain the term, TFIDF [35, 36], which combines TF and IDF, the first position in the document [3, 34–36] and the co-occurrence frequency of the term with other document terms [11, 12]. Part-of-speech tags and noun phrases chunks [37–39] are linguistic features that try to capture the linguistic properties of keyphrases. Structural features that are provided by HTML or XML documents (like occurrence in title, document header, hypertext link, etc.) also help keyphrase identification [6, 20, 40–43].
The selected features are combined in order to give a synthetic score for each document term. This score is used to rank all the candidate terms; the k-top ranked terms are output as the keyphrases we look for. The way this score is computed depends on the method. Some use an unsupervised approach when others use supervised learning on the corpus. Some unsupervised approaches are simply based on a score formula that combines the feature values like the baseline TFIDF or other expressions [44–46]. Other approaches use more sophisticated unsupervised algorithms [1, 2, 11–17, 23, 26, 47–51], generally a clustering method or a graph-based ranking approach [52] inspired by the well-known PageRank algorithm [53]. Supervised learning is achieved with machine learning approaches like bagged C4.5 [3], naive Bayes [35, 36], neural networks [54], SVM [40, 55] maximum entropy [6] and so on.
3. Candidate term identification
Identifying the candidate terms of each document needs to preprocess the texts. The way this classical step is done can significantly affect the accuracy of the keyphrase extraction method. Thus it is important to give some details on how we have implemented it. The preprocessing stages are as follows:
Tokenization and POS-tagging. The first step of this stage includes sentence boundary detection, tokenization and POS tagging. We use the Stanford POS tagger [56] for these purposes. For each sentence, we generate all possible n-grams (subsequences of up to n tokens) with
Stemming. Term variation can affect its frequency, which is an important indicator of whether it is a keyphrase or not. The solution consists of replacing each word by its stem. A stem is generally what remains when we remove a suffix from a word. There are different stemmers [57]; we use the iterated Lovins stemmer [58], which is more aggressive than other stemmers. Aggressive stemming gives better results for keyphrase extraction method [3].
Document indexing. All positions of a stemmed term are conserved. As the occurrences of a stemmed term can appear in different forms (different n-grams), we decided that the n-gram of a term is the most frequent one in the document. The same problem arises for POS tags. We have considered two cases: if the majority of POS tags correspond to noun phrases (as defined next) we choose the most frequent noun phrase POS tag, otherwise a POS tag which is not a noun phrase is chosen. At the end of this stage the document is transformed to a set of terms, where each term t in a document d is defined as a tuple: (n-gram, stem, tag, d, {positions}).
Linguistic filtering. According to linguistic knowledge, the noun phrases are most likely to be keyphrases. Once all the terms are discovered, we only keep those which are considered as noun phrases. Our observation of the real keyphrases in the used train data suggest a very large POS tag definition of noun phrases which satisfy the regular expression: (NN | NNS | NNP | NNPS | JJ | VBC | VBN | NN IN | NNS IN)* (NN | NNS | NNP | NNPS | VBG)
Corpus indexing. To compute corpus dependent features we need to index all the documents of the train data. After corpus indexing, each document term becomes a tuple: (n-gram, stem, tag, d, {positions}, {documents}) where {documents} represents the set of document paths of the training data containing this term, that is, a term which has the same stem.
4. Feature selection
Recent works in keyphrase extraction area focused mainly on using features based on external knowledge information or document structural information. Our general goal in this paper is to return to the fundamental knowledge extraction question addressed by the pioneer works of Turney et al. [34] and Frank et al. [35, 36], which is to extract keyphrases from any type of text document using statistical features without a priori knowledge on keyphrases, the opposite of text categorization approaches [33].
Some features used in the literature have different definitions that have different impacts on keyphrase extraction accuracy. In order to define precisely each used feature we propose to use the notations given in Table 1.
Notations used in this paper.
4.1. DPM-index
State-of-the-art keyphrase extraction features do not deal with the fact that a candidate term may contain or be contained in another candidate. This relationship between n-grams can be expressed by a feature that characterizes real keyphrases among different overlapping subterms of the same n-gram. For example, the candidate 3-gram multiple sequence alignment contains two 2-grams, multiple sequence and sequence alignment. These three n-grams could have close TFIDF values; how does one choose the real keyphrases among them? Intuitively, multiple sequence is a wrong candidate, but sequence alignment or multiple sequence alignment or both could be in the main topics of the paper.
The comparison of the frequencies of these n-grams in the considered document can help a learning system to decide which of them are keyphrases. The term multiple sequence by itself is not a keyphrase because it is not used outside multiple sequence alignment. This means that the frequency of multiple sequence is equal to the frequency of multiple sequence alignment. So we can suppose that, if a term t is contained in a superterm s and f(t, d) = f(s, d), i.e. f(s, d)/f(t, d) = 1, then t cannot be a keyphrase.
Now deciding whether t = sequence alignment is a keyphrase or not will depend on its frequency compared with the frequency of s = multiple sequence alignment. Let us, for example, consider a document d where multiple sequence alignment occurs 10 times. If sequence alignment occurs 100 times (f(s, d)/f(t, d) = 0.1) it is more likely to be a keyphrase than if it occurs 11 times (f(s, d)/f(t, d) = 0.91). The more sequence alignment is used outside multiple sequence alignment, the more it is likely to be in the main topics of the considered document. It is unlikely that, when the frequency of sequence alignment is close to the frequency of multiple sequence alignment, it means that the document deals specifically with the question of multiple sequence alignment. Suppose now that we have three documents d1, d2 and d3 where t = sequence alignment is contained in the superterms:
s 1 = multiple sequence alignment;
s 2 = pairwise sequence alignment;
x = w sequence alignment, where w denotes any other word different from multiple and pairwise.
The frequencies of t and its superterms are given in the Table 2. One can observe that, the more the maximum of the ratio f(s, d)/f(t, d) is close to 1, the more t is unlikely to be a keyphrase by itself.
The DPM-index of the term t = sequence alignment in three documents.
We define the document phrase maximality index of a term t as:
As we defined it, DPM-index(t, d) tends to 1 when t is not included in a superterm that has a similar frequency in the document d. In contrast, it tends to 0 when t is almost always a part of a bigger term in d, which indicates that it is unlikely to be a pertinent keyphrase for this document. By means of the DPM-index measure, we can immediately see in the example of Table 2 that the document d1 does not deal with the general topic of t = sequence alignment, which has a DPM-index of 0.09, but is specific to s1 = multiple sequence alignment. The opposite arises in the document d3 where its DPM-index is 0.91. The document d2 represents an intermediate situation where the DPM-index of 0.45 indicates that sequence alignment is more likely to be in the main topics of d2 than in those of d1.
These examples about three potential overlapping noun phrase candidates multiple sequence alignment, multiple sequence and sequence alignment can arise in any extraction system where we have to choose between extending the size n of an n-gram and reducing its frequency, or at the opposite, to reduce n and augment its frequency. The DPM-index gives a way to control both the size and the frequency of an n-gram by preserving its significance and its specificity to the document d.
The DPM-index of a term in a document resembles at first sight the C-value of a term in a corpus [59] which is:
The C-value is known to be an essential criterion for terminology extraction. It could be adapted to keyphrase extraction in a document by replacing the document frequency in the corpus by the frequency in the document d. We denote this adaptation by docC-value:
The DPM-index differs completely from docC-value because it uses the maximum of the superterm frequencies, rather than their mean, as is the case in the docC-value. The mean is not efficient for keyphrase extraction as it does not point out that a term t is almost always contained in a particular superterm. For the example of Table 2, in both d1 and d2, the means of the frequencies of the superterms of t are the same and equal 5.5, when their maximum is 10 in d1 and 6 in d2. This leads to a DPM-index of 0.09 for d1 and 0.45 for d2, while their docC-values are identical. Thus deciding whether t is a keyphrase or not does not depend on the mean of the superterm frequencies but on their maximum. This was confirmed by our experiments; both C-value and docC-value do not improve keyphrase extraction accuracy.
4.2. Other proposed features
4.2.1. TFIDF ratio
When a keyphrase is an n-gram that contains more than one compound, it is frequent that one of them is a specific word to the document. In You et al. [46] this compound is called core word and is defined simply as a word with a frequency greater than a certain threshold. In our case we do not use a fixed threshold and, rather than using TF to characterize this compound, we use TFIDF. We define the feature TFIDFRatio(t, d, D) as the ratio between the TFIDF of a term t and the maximum value of the TFIDF of a compound of t (see the formula in Table 3). This indicator tends to be small when the term has a compound with high TFIDF.
Features used in our system.
4.2.2. Position mean and 2-means
Most of keyphrase extraction systems consider only the first position of a candidate term as a feature. The position of the first occurrence of a term is known to be a very useful feature for keyphrase prediction; however, we want to benefit from the distribution of all its positions in the document. We conjecture that keyphrase positions in the document are clustered differently than other term positions. Thus we propose using as features the mean of these positions and their 2-means. We recall that the k-means
4.2.3. DPM-TFIDF
In our experiments we noticed that combining the DPM-index with TFIDF (see the formula in Table 3) improves the accuracy of this later when we use it as an unsupervised score for automatic keyphrase extraction.
4.3. Other used features
After trying approximately 30 features used in the literature, we retained 12 features that work well for keyphrase extraction in our experiments. Among these features four are rarely used in the literature:
(1)
(2)
(3)
(4)
We added the six proposed features to them obtaining the 18 features we utilize in our system. The Table 3 reviews all the features used in the system; our six features are numbered from 13 to 18.
5. Classification
Our keyphrase extraction system utilizes a classifier trained using a supervised machine learning algorithm. Owing to the difficulty of the keyphrase extraction problem, instead of using directly the classifier outputs, one rather utilizes the probability of being classified as a keyphrase [36]. These probabilities are used as scores to generate a ranked list of keyphrases. According to a fixed parameter k, the system outputs the k-top score terms as predicted keyphrases. We use logistic regression as the learning algorithm. We also tested other learning algorithms, for instance bagged C4.5 decision trees, random forests and LogitBoost, but in every case logistic regression gave good results. We used the Weka implementation of these methods [65].
The accuracy of our system greatly depends on the way the 18 features are transformed into input attributes for the logistic regression algorithm. After we tested different data mining methods, we found that the best results are obtained after discretization and binarization of the input features. In order to explain this we need to recall some details on logistic regression.
A classifier is able to estimate the probability
Logistic regression tries to estimate
where Xi = (xi1, xi2,…, xim) is the vector of m attribute values of the training instance i. This means that, for the logistic regression model, we have:
The values of the parameters
In order to deal with the nonlinearity of feature effects on keyphrase probability we propose to discretize [66] each feature and binarize the intervals resulting from discretization. Precisely the logit function becomes:
where
6. Experiments
6.1. Corpus
In order to evaluate our system, we used the SemEval-2010 data for task 5: Automatic Keyphrase Extraction from Scientific Articles [67]. These data consist of 244 scientific conference and workshop papers from ACM Digital Library. Papers were selected from four research areas: Distributed Systems, Information Search and Retrieval, Learning and Social and Behavioral Sciences.
For each paper at most 15 keyphrases were manually assigned by both paper authors and readers. From this corpus 144 papers were provided for training and the evaluation was done on 100 articles. The main advantage of using SemEval-2010/Task-5 corpus is that we can compare our results with those obtained by 19 teams that participated in the challenge. Furthermore, two recent papers [51, 46] also used these data.
We followed the procedure given in the challenge for the evaluation of our system. In this task the methods were compared using three exact match evaluation metrics. An exact match evaluation metric measures how well the automatically generated keyphrases match exactly the manually assigned ones. More flexible metrics could be used [68]; however, exact match is stricter and enables us to compare our results with those of the 21 teams. Specifically, the three metrics used are: the precision, which represents the proportion of the extracted keyphrases that match the manually assigned ones; the recall, which is the proportion of the keyphrases manually assigned that are extracted by the keyphrase extraction system; and the F1-score, which is defined as 2 × precision × recall/(precision + recall).
6.2. Results
We used SemEval-2010/Task-5 data for two experiments. First we followed the procedure of the challenge in order to compare our results with those of others systems. In this procedure 144 documents are used for training and 100 documents for the evaluation. Second, we grouped the 244 documents and used them for a 5-fold cross-validation experiment in order to confirm the results.
6.2.1. Challenge procedure
The performances of each system are given over the numbers of keyphrase candidates: top 5, 10 and 15. Table 4 shows the performances of each system ranked by the F1-Score over the top 15 keyphrases. Our system ranks first over the three keyphrase candidates and for the three metrics used. For 10 keyphrases, our system increases by a rate of 13% (29.4/26.0% − 1) the F1-score of HUMB [20]. Notice that, opposite to our system, HUMB uses structural features and different external knowledge features in order to improve its performance. These knowledge bases (GROBID/TEI, GRISP and HAL) are specific to scientific papers. Then the most important feature of these results is that, by using only statistical features on linguistically filtered terms, our system outperforms the others without loss of generality.
Our system compared with 21 systems according to precision (P), recall (R) and F1-score (F).
In order to measure the contribution of each feature to the overall system performance, we represent in Table 5 the results obtained by our method when removing each feature from the learning attributes. According to these experiments, we can see that the DPM-index is the most important of our features; removing it significantly decreases the performance of our system. When DPM-index is used, the F1-score of the keyphrase extraction system increases by 9% (for the 10 top keyphrases), which makes its contribution comparable to that of all corpus-level features including the popular TFIDF. Note also that using our six proposed features increases the F1-score by a rate of 13.5% for the 10 top keyphrases.
Our system performances after removing a feature or a family of features compared with HUMB and WINGNUS.
DPM-TFIDF is replaced by TFIDF.
DPM-index is removed and DPM-TFIDF is replaced by TFIDF.
IDF, logTFIDF, KLD, DPM-TFIDF and TFIDFRatio are removed.
We also tested our system using different machine learning algorithms. The goal was to find the best classifier. As one can observe in Table 6, combining the 18 features using any learning paradigm (boosting, bagging and regression) yields the best performance, confirming that our results do not depend on a specific learning approach.
Our system performances with different learning methods compared with HUMB and WINGNUS
6.2.2. Five-fold cross-validation
We used 5-fold cross-validation in order to confirm the stability and the robustness of the features. The 244 documents of the entire corpus are partitioned randomly into five sets of 48 documents (four documents remain). Testing is performed on each set and the remaining is used for training. The results are averaged over all runs. Table 7 presents the 5-fold cross-validation results obtained by our method when removing each feature from the learning attributes. These results confirm that DPM-index improves keyphrase extraction (by 9.6%). Our three other features impact less the F1-score when they are used individually, but together with DPM-index they significantly improve the performance. Table 8 gives the results we obtained with different machine learning algorithms. In the 5-fold cross-validation experiment LogitBoost and logistic regression give similar results on average.
Five-fold cross-validation performances (mean and standard deviation) after removing a feature or a family of features
5-fold cross-validation performances (mean and standard deviation) with different learning methods
7. Conclusion
This paper presents a new measure the DPM-index that discriminates the overlapping n-grams in a document. We developed a new supervised learning keyphrase extraction system that combines 18 statistical features, among them the DPM-index. We showed experimentally that this latter makes a contribution comparable to corpus-level features for the extraction of 10 keyphrases. The results of this system demonstrate an improvement over other systems without using any external knowledge or document structural features, which makes it more flexible and adaptable to other types of corpora.
Footnotes
Acknowledgements
Professor Aïcha Mokhtari (USTHB) and Professor Thierry Lecroq (Université de Rouen) are gratefully acknowledged for their help and support.
Funding
M.H. was a recipient of a PhD fellowship from the PROFAS French–Algerian cooperation programme.
