Abstract
Keyphrases are important phrases that represent the theme of a document. With the help of keyphrases, people can quickly find useful information in massive data. Traditional statistics-based methods for keyphrase extraction only make use of the statistical features of words and ignore the semantic relationships between words. Recently, emerging methods based on deep neural networks extract keyphrases by capturing semantic contextual information without considering statistical features. In this paper, we propose a new keyphrase extraction method based on a neural network architecture composed of deep and wide learning parts. In the deep learning part, BERT (Bidirectional Encoder Representation from Transformers) and Bi-LSTM (Bidirectional Long Short-Term Memory) models are used to capture the contextual semantic information from the word sequence, while in the wide learning part several important statistical features are considered to jointly train the keyphrase extraction model. The experimental results on two public datasets show that our proposed model performs better than eight commonly used baseline keyphrase extraction methods.
Introduction
In the research field of natural language processing, keyphrases can improve the efficiency of extracting useful information from massive data. The keyphrase extraction technology aims to find the phrases that best capture the main topics of the document. Especially for web articles that do not provide keyphrases, this technology can enable readers to quickly understand the core meaning of the article. It has important application scenarios in tasks such as text classification, dialogue system, automatic summarization, and information retrieval [37, 42]. At present, with the rapid development of deep learning methods and great applications in NLP domains, how to integrate deep learning models into keyphrase extraction methods has become an active field of keyphrase extraction research.
Existing traditional keyphrase extraction algorithms mainly fall into two categories: supervised and unsupervised methods [21]. A supervised algorithm requires many training documents with manually labeled keyphrases to train the keyphrase extraction model. Several important features, including part-of-speech, TF-IDF, and first occurrence position, are widely used in supervised methods to train the classifier. The supervised approach has two disadvantages. Firstly, the training process consumes a lot of manpower and time. Secondly, the subjectivity and accuracy of the manual labeling of the corpus directly affect the extraction quality. Unsupervised methods assign a score to each candidate phrase by considering various features. The most well-known unsupervised method is the graph-based approach, which builds a graph according to the co-occurrence degrees between words within a window and ranks candidate words with the PageRank algorithm [23]. Two major limitations of this graph-based unsupervised method are: (1) when building a graph model for a document, the graph focuses on expressing the degree of co-occurrence between two words, so the contextual information it captures is very limited and it cannot describe the different meanings of the same word in different sentences; and (2) the extraction process ignores the sequential nature of the text, which makes it harder to capture the relationships between words.
In recent years, deep learning models have been applied to keyphrase extraction to obtain the contextual semantic information of words in the document, and they achieve good results when extracting keyphrases from short texts. This type of method is usually combined with traditional methods: a deep neural network architecture learns the embedding information of words, and the embedding results are combined with a graph model to complete the extraction task [21]. Although this approach adds semantic feature information to the traditional graph model, it still captures too little of the order information of the text. In addition, some scholars regard keyphrase extraction as a sequence labeling task [1], which can effectively obtain the context information of words but loses important statistical features, such as part-of-speech and word frequency.
Figure 1 shows an example of keyphrases in an abstract, with red bold words representing the keyphrases noted by the author. It can be seen from Fig. 1 that keyphrases are usually composed of nouns and adjectives, and they often appear in multiple places. Moreover, from the perspective of position, important words are more likely to appear at the beginning and end of paragraphs. Therefore, when the algorithm calculates the statistical information of a word, the part-of-speech, word frequency, and position features should all be considered. In addition, for representing document semantics, a deep learning model can effectively represent words as vectors at the semantic level and thus capture the contextual semantic relationships of individual words. In order to improve the accuracy of keyphrase extraction, this paper focuses on how to take into account both the statistical information and the semantic contextual information of words in a deep-learning-based extraction algorithm.
Red bold phrases represent the author-input keyphrases for this abstract from the kp20k dataset. 
Therefore, this paper proposes a new keyphrase extraction algorithm called WD-Rank. In order to learn more feature information of words that is helpful to the extraction result, we propose a new feature representation algorithm, and the overall framework of our algorithm includes deep and wide feature learning and extraction components. Specifically, in the deep component, in order to fully consider the interrelationships between words in the document, we construct a deep neural network combining BERT (Bidirectional Encoder Representation from Transformers) and Bi-LSTM (Bidirectional Long Short-Term Memory) structures to obtain the semantic vector representation of each word. In the wide component, the vector representation of the statistical features of a word is obtained from its part of speech, word frequency, and position. Finally, the algorithm fuses the two sets of features produced by the deep and wide parts. The algorithm can thus integrate various features to train the extraction model, thereby improving the performance of keyphrase extraction. In our work, we also treat keyphrase extraction as a sequence labeling task. Our contributions are as follows:
(1) We explore a deep learning model that combines the BERT and Bi-LSTM frameworks. With this combined model, we can better obtain the contextual semantic information of a single document. In addition, our algorithm fuses the statistical characteristics of the document to construct a deep learning-based keyphrase extraction model. (2) We design a new feature representation algorithm that captures both the statistical and the semantic features of a document in a single extraction model. (3) Through extensive experiments, we show that the loss of any important feature representation of a word degrades the extraction result. Our comparative experiments on two widely used datasets show that our algorithm outperforms current mainstream keyphrase extraction algorithms.
The rest of the paper is organized as follows. Section 2 summarizes the related work of keyphrase extraction from two perspectives of traditional methods and deep learning methods. Section 3 elaborates our proposed method, Section 4 describes the details of the experiment and the analysis of the results, and Section 5 is the conclusion and outlook for the next step.
In this part, we mainly focus on the related work of mainstream keyphrase extraction methods.
Traditional keyphrase extraction method
In general, traditional keyphrase extraction methods include two categories: supervised and unsupervised [27]. The supervised method regards the keyphrase extraction as a binary classification task (keyphrases or non-keyphrases). This method fuses the feature information of phrases and uses pre-labeled samples to train an optimal classification model, which is used to judge whether a candidate phrase is a keyphrase. The supervised method needs to design a classifier, such as the naive Bayes classifier [36], the decision tree classifier [32], the logistic regression classifier [12]. Supervised methods usually outperform unsupervised methods.
The unsupervised method reduces the need for manual pre-annotation of data, so it is simpler to implement. The extraction process generally first selects candidate phrases from the document according to the part of speech rules, and then scores the candidate phrases according to different extraction rules and algorithms. Finally, the candidate phrases with higher scores are selected as the keyphrases. The difference between various unsupervised methods lies in the different algorithms for scoring candidate keyphrases. Commonly used methods include statistics-based methods, graph-based methods, and topic-based methods.
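The candidate-selection and top-k steps described above can be sketched as follows. This is a minimal illustration, not the procedure of any cited method: the tag set `ADJ_NOUN_TAGS`, the helper names `candidate_phrases` and `top_k`, and the pre-tagged example sentence are all assumptions introduced here, and the scoring function is left open because it is what distinguishes the individual methods.

```python
from typing import Dict, List, Tuple

# Penn Treebank tags treated as adjectives or nouns (an assumption for this sketch).
ADJ_NOUN_TAGS = {"JJ", "JJR", "JJS", "NN", "NNS", "NNP", "NNPS"}


def candidate_phrases(tagged: List[Tuple[str, str]]) -> List[str]:
    """Collect maximal runs of adjective/noun tokens as candidate phrases."""
    candidates, current = [], []
    for word, tag in tagged:
        if tag in ADJ_NOUN_TAGS:
            current.append(word.lower())
        elif current:
            # A non-matching tag closes the current run.
            candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring candidates (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [phrase for phrase, _ in ranked[:k]]


# Hypothetical pre-tagged input; a real pipeline would obtain tags from a POS tagger.
tagged_sentence = [
    ("Keyphrase", "NN"), ("extraction", "NN"), ("relies", "VBZ"),
    ("on", "IN"), ("statistical", "JJ"), ("features", "NNS"), (".", "."),
]
phrases = candidate_phrases(tagged_sentence)
```

Any scoring scheme (statistical, graph-based, or topic-based) can then fill the `scores` dictionary passed to `top_k`.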
The earlier research was based on the statistical method, which focuses on the processing of the statistical information of the words in the document, that is, the reasonable quantification of different feature values. The KP-miner algorithm restricts the selection rules of candidate words when calculating the word frequency and the inverse document frequency for the candidate phrase, and adds location feature information to improve the extraction efficiency [9]. In addition to calculating the statistical information of phrases in a single document, some studies have been proposed to use co-occurrence statistics on external resources to improve extraction efficiency [20]. Besides, some scholars are no longer limited to just counting the feature information of keyphrases, but also add document-level feature information to the extraction algorithm [10].
With the emergence of TextRank [23], scholars began to study keyphrase extraction algorithms based on graph models [34, 11, 38]. TextRank builds a graph for a document where the vertices represent words and the edges represent the co-occurrence relationship between words. The importance of each node is then evaluated with Google’s PageRank algorithm. Subsequent improvements to graph-based algorithms fall mainly into three areas: (1) changes to the graph construction rules, for example, a node no longer represents a single word but a phrase, or a word sequence filtered according to certain rules; (2) changes in the weight assignment of edges, which represent the degree of association between two vertices, for example, using the semantic similarity between words as the weight; and (3) changes to the calculation rules of the PageRank algorithm, such as changing the initial weight assigned to each node. To handle the informality and noise of Twitter articles, NE-Rank [3] considers both the weight of the node and the weight of the edge when calculating the importance of a node in the graph. PositionRank [11] changes the initial value of each node by incorporating both the positions of a word and its frequency in a document into a biased PageRank. Biswas et al. [5] argue that the importance of a word is determined by several different factors, such as word frequency, distance from the central node, word position, and the importance of neighbor nodes. Their experimental results show that each kind of node feature has an impact on the extraction quality.
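The basic graph construction and ranking idea can be illustrated with a small, self-contained sketch. It is a simplified, unweighted variant written for illustration only and is not the reference implementation of TextRank or of any later method; the function names, the tiny window, and the toy word list are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_cooccurrence_graph(words: List[str], window: int = 2) -> Dict[str, Set[str]]:
    """Link distinct words that appear together inside a span of `window` tokens."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for i, word in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != word:
                graph[word].add(words[j])
                graph[words[j]].add(word)
    return graph


def pagerank(graph: Dict[str, Set[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power iteration on an undirected, unweighted graph."""
    nodes = list(graph)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            # Each neighbor shares its score equally among its own neighbors.
            incoming = sum(score[m] / len(graph[m]) for m in graph[n])
            updated[n] = (1.0 - damping) / len(nodes) + damping * incoming
        score = updated
    return score


# Toy token sequence (stop words assumed to be filtered out already).
words = ["graph", "model", "ranks", "graph", "nodes"]
scores = pagerank(build_cooccurrence_graph(words, window=2))
best_word = max(scores, key=scores.get)
```

The variants discussed above would change one of three places in this sketch: what a node stands for, how edges are weighted, or how the scores are initialized and biased.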
Besides, many scholars try to integrate topic information into the graph to improve the quality of the extracted keyphrases [6, 18]. Topical PageRank [19] incorporates the LDA topic model into the algorithm. The graph model assigns different weight values to edges under different topics. The PageRank algorithm is used to calculate the score of each word under each topic, and finally the topic distribution information is combined to calculate the final score of each word. Sterckx et al. [31] proposed that the calculation of word scores depends on the cosine similarity of the word-topic probability vector of a single document itself and the document-topic probability vector. TopicRank [7] divides candidate words into different topic clusters. When constructing the word graph model, the vertices are topic clusters representing different topics.
Keyphrase extraction method based on deep learning
Deep learning methods have been successfully applied in the domains of computer vision, natural language processing, and speech processing. With the emergence of embedding technology [25, 24, 29], different parts of a text, such as paragraphs, sentences, and words or phrases, are mapped into semantically related vectors in a low-dimensional space [17, 26]. Therefore, scholars began to study how to use deep learning models, such as BERT (Bidirectional Encoder Representation from Transformers) [8] and LSTM (Long Short-Term Memory) [14], to complete the keyphrase extraction task. For example, in response to the length restrictions of tweets, the authors of [39] adopt a two-hidden-layer RNN (recurrent neural network) structure to capture more contextual semantic information, thereby addressing the limitation of extracting keyphrases from single short tweets.
EmbedRank embeds the document and the candidate phrases in the same vector space, calculates the semantic similarity between them, and extracts the candidate phrases with higher similarities as keyphrases [4]. On the basis of this research, the GKE algorithm improves keyphrase extraction by introducing the embedding technology into the graph model. First, the semantic similarity of the word and the document is used as the initial weight of each word in the graph, and then the random walk algorithm is run to calculate the score of each word [42]. The GLEAKE method calculates single and multi-word embeddings at the same time and integrates global and local semantic information into the graph model [2]. There are other studies on the combination of graph models and embedding technology [35, 21, 40].
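Similarity-based ranking of this kind can be sketched in a few lines. The example below only illustrates the idea of ranking candidates by cosine similarity to a document vector; it is not the implementation of EmbedRank, GKE, or GLEAKE, and the three-dimensional vectors are made-up toy values standing in for real embeddings.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_similarity(doc_vec: Sequence[float],
                       candidate_vecs: Dict[str, Sequence[float]]) -> List[str]:
    """Order candidate phrases by decreasing similarity to the document vector."""
    return sorted(candidate_vecs,
                  key=lambda phrase: cosine(doc_vec, candidate_vecs[phrase]),
                  reverse=True)


# Toy vectors (assumptions); real systems would use learned embeddings.
doc_vec = [0.9, 0.1, 0.0]
candidate_vecs = {
    "neural network": [0.8, 0.2, 0.0],
    "coffee break": [0.0, 0.1, 0.9],
    "deep learning": [0.7, 0.0, 0.3],
}
ranking = rank_by_similarity(doc_vec, candidate_vecs)
```

In a graph-based extension such as the one described for GKE, the same similarity values would serve as initial node weights rather than as final scores.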
In the past two years, many scholars have focused on how to use the deep learning model to solve the problem of keyphrase extraction treating it as a sequence labeling task [28, 30]. The Bi-LSTM-CRF model [1] is proposed for keyphrase extraction in which the Bi-LSTM component captures the contextual semantic information, and the CRF component obtains the sentence-level tag information. In order to solve the expensive manual annotation corpora for the supervised method, Zhu et al. introduce self-training method into the neural model to leverage the unlabeled articles [41].
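When extraction is cast as sequence labeling, the predicted tag sequence has to be turned back into phrases. The sketch below assumes a B/I/O tagging scheme (begin, inside, outside), which the text does not specify; the function name `decode_bio` and the example sentence are likewise illustrative assumptions.

```python
from typing import List


def decode_bio(tokens: List[str], tags: List[str]) -> List[str]:
    """Group tokens tagged B (begin) and I (inside) into keyphrases."""
    phrases, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new phrase starts; flush any phrase still open.
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            # "O", or a stray "I" with no open phrase, ends the current phrase.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


tokens = ["we", "propose", "a", "keyphrase", "extraction", "model", "based", "on", "BERT"]
tags = ["O", "O", "O", "B", "I", "I", "O", "O", "B"]
keyphrases = decode_bio(tokens, tags)
```

A CRF layer, as in the Bi-LSTM-CRF model mentioned above, would additionally discourage invalid transitions such as an "I" directly after an "O"; this decoder simply ignores them.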
In this work, we propose a keyphrase extraction algorithm that introduces a new feature representation method based on a deep neural network architecture composed of wide and deep frameworks. Through the two components of deep learning and wide learning, the advantages of deep learning algorithms and traditional statistical methods can be integrated. To our knowledge, this is the first work that attempts to integrate traditional feature representation with a deep semantic learning framework based on BERT and Bi-LSTM for keyphrase extraction. We also conduct a large number of comparative experiments and ablation experiments on public corpora to verify the validity and reliability of the proposed method.
Proposed model
In this section, we describe our novel keyphrase extraction algorithm. The proposed method, WD-Rank, combines the semantic information and feature information of the word in the process of keyphrase extraction. The overall architecture of our proposed methodology is shown in Fig. 2. For each word, we compute the statistical feature representation vector from the wide component and the contextual semantic representation vector from the deep component. The detail of our methodology is discussed in the following subsections.
Overview of the proposed model. 
The wide component is used to collect the statistical information of words in a document, as shown in Fig. 2 (left). In our algorithm, three kinds of important statistical features are considered including word frequency, part-of-speech, and position.
For the part-of-speech features of words, previous keyphrase extraction algorithms usually delete the high-frequency stop words in the preprocessing stage, and consider that the candidate phrases are composed of only adjectives and nouns [34]. Therefore, our algorithm also uses part-of-speech tagging for words, and maps the part-of-speech of each word
Since keyphrases often appear at special positions such as the beginning and end of document paragraphs, location feature information is also very important. This algorithm will mark the absolute position of each word in a document in order from the beginning. For example, the length of document
Finally, because some adverbs or stopwords in the document appear more frequently than other words, when counting the word frequency information of a word, we cannot simply calculate the frequency of the word in the text. This algorithm calculates the TF-IDF (Term Frequency-Inverse Document Frequency) value of a word. Its core idea is that the importance of a word is proportional to the number of occurrences in a single document, but inversely proportional to the number of occurrences in the corpus. It should be noted that, unlike the previous feature vector representation of the part-of-speech and position, the dimension of TF-IDF is only one. There is no operation to convert the discrete numeric mapping into a vector. Instead, this real-valued TF-IDF vector is directly calculated using document and corpus information. The TF-IDF value
where
Based on the three vector representations (part-of-speech, position, and word frequency) calculated above, we get the statistical feature representation
Where “:” denotes the concatenation between those vectors.
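The construction of the wide feature vector can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the TF-IDF variant (relative term frequency times log(N/df)), the tag inventory `POS_TAGS`, the position dimension, and the randomly initialized lookup tables are all hypothetical; in a trained model the tables would be learned parameters. Only the part-of-speech dimension of 5 is taken from the experiments reported later.

```python
import math
import random
from collections import Counter
from typing import List


def tf_idf(word: str, doc: List[str], corpus: List[List[str]]) -> float:
    """One common TF-IDF variant (an assumption): tf * log(N / df)."""
    tf = Counter(doc)[word] / len(doc)
    df = max(1, sum(1 for d in corpus if word in d))
    return tf * math.log(len(corpus) / df)


def make_table(num_rows: int, dim: int, seed: int) -> List[List[float]]:
    """Random lookup table standing in for a learned embedding matrix."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_rows)]


POS_TAGS = ["NN", "JJ", "VB", "OTHER"]      # hypothetical, reduced tag inventory
POS_DIM, POSITION_DIM, MAX_LEN = 5, 5, 512  # POSITION_DIM and MAX_LEN are arbitrary here

pos_table = make_table(len(POS_TAGS), POS_DIM, seed=0)
position_table = make_table(MAX_LEN, POSITION_DIM, seed=1)


def wide_features(index: int, word: str, tag: str,
                  doc: List[str], corpus: List[List[str]]) -> List[float]:
    """Concatenate part-of-speech vector, position vector, and the TF-IDF scalar."""
    pos_id = POS_TAGS.index(tag) if tag in POS_TAGS else POS_TAGS.index("OTHER")
    pos_vec = pos_table[pos_id]
    position_vec = position_table[min(index, MAX_LEN - 1)]  # absolute position, clipped
    tfidf_vec = [tf_idf(word, doc, corpus)]                  # one-dimensional, no lookup
    return pos_vec + position_vec + tfidf_vec


corpus = [["keyphrase", "extraction", "model"], ["graph", "model"], ["topic", "model"]]
doc = corpus[0]
features = wide_features(0, "keyphrase", "NN", doc, corpus)
```

Note that "model" occurs in every toy document, so its TF-IDF value is zero, which mirrors the motivation given above for not using raw frequency.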
In this part, we design a feed-forward neural network, as shown in Fig. 2 (right), which is used to learn the semantic feature information of the word and document. Our method utilizes the BERT (Bidirectional Encoder Representation from Transformers) and Bi-LSTM (Bidirectional Long Short-Term Memory) models to map the original data into the feature vector space to obtain a vector representation with semantic features. It is generally believed that BERT has a stronger ability to learn latent semantic information than LSTM. Since BERT extracts contextual semantic feature information that is biased towards fine-grained aspects in the document, the semantic knowledge it obtains is more detailed. However, LSTM has a stronger ability to learn global knowledge and is easy to extract semantic feature information that is coarse-grained. In Fig. 2 (right), the discrete words are first embedded in the BERT layer, which uses the BERT model, an encoder representation based on the bidirectional Transformer feature extractor, to obtain the word vector embedding representation, and then output the results to the next layer for training.
Structure of BERT. 
The structure of the BERT model is shown in Fig. 3. When the BERT model is trained, the lowest layer in Fig. 3 represents the input vector
Throughout the training process of BERT, the weight of each word is dynamically adjusted by fully considering the relationship between each word and other words in the sentence. Therefore, the final trained vector representation not only contains the semantic information of the word itself, but also contains the semantic connection with other words in the context. It is not difficult to see that BERT assigns different feature vector representations for the same word in different positions. The
Bidirectional LSTM network is an improvement of LSTM. LSTM can effectively solve the problem of gradient vanish in RNN [13]. The LSTM model mainly includes input gates
where
When processing the text sequence, the semantic information of the current word is not only related to the previous word sequence, but also related to the subsequent text information. Therefore, we use the bidirectional LSTM, which can consider the semantic relationship of the past sequence and future sequence of the word at the same time during training, and finally capture the context information of the entire document. The word output vector
where the vector
In Fig. 2, the output of the Bi-LSTM layer and the vector representation of the statistical features are passed together to a linear layer. Specifically, the input vector of the linear layer is obtained by concatenating the vector
Algorithm 1 summarizes the learning procedure of our proposed model.
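The fusion of the deep and wide components can be sketched in PyTorch as below. This is a hedged illustration, not the authors' code: the class name `WideDeepTagger`, the wide dimension of 11, the three output tags, and the choice to split the 256-dimensional hidden size across the two LSTM directions are assumptions, and random tensors stand in for the 768-dimensional BERT embeddings so that the sketch runs without downloading a pre-trained model.

```python
import torch
import torch.nn as nn


class WideDeepTagger(nn.Module):
    """Bi-LSTM over (stand-in) BERT embeddings, fused with wide features."""

    def __init__(self, bert_dim: int = 768, hidden_dim: int = 256,
                 wide_dim: int = 11, num_tags: int = 3, dropout: float = 0.25):
        super().__init__()
        # hidden_dim // 2 per direction, so the concatenated output is hidden_dim.
        self.bilstm = nn.LSTM(bert_dim, hidden_dim // 2,
                              batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_dim + wide_dim, num_tags)

    def forward(self, bert_embeddings: torch.Tensor,
                wide_features: torch.Tensor) -> torch.Tensor:
        deep, _ = self.bilstm(bert_embeddings)   # (batch, seq_len, hidden_dim)
        # Concatenate contextual and statistical vectors token by token.
        fused = torch.cat([self.dropout(deep), wide_features], dim=-1)
        return self.classifier(fused)            # (batch, seq_len, num_tags)


torch.manual_seed(0)
model = WideDeepTagger()
bert_embeddings = torch.randn(2, 10, 768)  # stand-in for BERT layer output
wide_features = torch.randn(2, 10, 11)     # stand-in for wide-component output
logits = model(bert_embeddings, wide_features)
```

In a full pipeline the first tensor would come from a pre-trained BERT encoder and the second from the statistical feature computation of the wide component.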
Experimental results
In order to assess the performance of our proposed method, we conduct a series of experiments. In this section, we first describe the statistics of the two commonly used public datasets used for evaluation, then analyze the effectiveness of our model through ablation experiments, and finally describe the comparison experiment with eight baseline methods.
Datasets and evaluation measures
We use two popular public keyphrase extraction datasets for our evaluation experiments. The first dataset is Inspec [15], which is composed of 2000 English abstracts of scientific articles. Inspec divides the entire corpus into a training set, a validation set, and a test set, which contain 1000, 500, and 500 abstracts respectively. The second dataset, kp20k, was constructed by Meng et al. [22]. The kp20k dataset is composed of 567,830 scientific articles collected from various online digital libraries, such as the ACM Digital Library and ScienceDirect, and is divided into three sets: a training set containing 527,830 documents for model training, a validation set containing 20,000 documents for parameter tuning, and a test set containing 20,000 documents for model evaluation. The detailed statistics of these two datasets are shown in Table 1.
The statistics of our datasets. Columns are: number of documents contained in the database; average length of documents; total number of manually labeled keyphrases
Each document in the corpora includes the title, the abstract, and correct keyphrases manually labeled. When the algorithm extracts keyphrases, the title and abstract are used as input data, whereas the manual annotated keyphrases are used when evaluating the results of the algorithm. In our experiment, the comprehensive evaluation index
where
We use Python 3.7 and PyTorch 1.1.0 for implementation. We run the programs on a 64-bit PC with 32 GB RAM, an Intel Core i7-9700K CPU @ 3.60 GHz (8 cores), and a GTX 1080 GPU, running Ubuntu 16.04 LTS.
In the preprocessing stage, we use the NLTK toolkit to tag the part-of-speech of each word. The first layer of our neural network uses BERT to convert the discrete words in the document into vectors representing semantic information. Google provides a variety of pre-trained BERT models for different languages and different model sizes. In our experiment, we use the “bert-base-uncased” pre-trained model and obtain a 768-dimensional embedding vector for each word. The dimension of the output vector in the hidden layer is set to 256. The setting of the feature dimensions of the wide component is elaborated in subsection 4.3. The optimizer used for training our neural network model is Adam [16] with an initial learning rate of 3e-5 and gradient clipping set to 1.0. In addition, we use the cross-entropy loss function, and the dropout rate is set to 0.25 to avoid over-fitting.
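The reported optimization settings can be wired together as in the following PyTorch sketch of a single training step. It is illustrative only: a small linear classifier and random tensors stand in for the real network and data, the feature size of 16 and the three tags are arbitrary, and interpreting "gradient clipping set to 1.0" as clipping the global gradient norm is an assumption.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in token classifier (hypothetical); dropout rate as reported in the text.
model = nn.Sequential(nn.Dropout(0.25), nn.Linear(16, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)
loss_fn = nn.CrossEntropyLoss()

features = torch.randn(4, 10, 16)      # (batch, seq_len, feature_dim), random stand-in
labels = torch.randint(0, 3, (4, 10))  # one tag id per token, random stand-in

model.train()
optimizer.zero_grad()
logits = model(features)               # (batch, seq_len, num_tags)
# CrossEntropyLoss expects (N, num_classes) logits and (N,) targets.
loss = loss_fn(logits.reshape(-1, 3), labels.reshape(-1))
loss.backward()
# Assumed interpretation: rescale gradients so their global norm is at most 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

Clipping is applied after `backward()` and before `optimizer.step()`, so the update never uses gradients whose combined norm exceeds the threshold.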
Ablation experiments and feature parameter settings
In order to validate the effectiveness of our proposed method, we conduct a wide range of parameter adjustment comparison experiments and ablation experiments to evaluate the role played by each component in WD-Rank.
First, we performed ablation experiments on the deep component by removing only one neural network model at a time from the deep part. Specifically, we compared three different models including BERT, Bi-LSTM, and the hybrid model composed of BERT and Bi-LSTM. The experimental results are shown in Table 2.
The results of the ablation experiment of the Deep component in our model on the kp20k and inspec datasets
From the experimental results in Table 2, it can be seen that our model obtained the highest
Interestingly, when we removed the BERT model and only retained the Bi-LSTM model, the
Second, in order to verify the effectiveness of each statistical feature of the word, we add different statistical features to the wide component one by one to observe their impact on the extraction results. The experimental results on the Inspec dataset are shown in Table 3. The results in Table 3 show that each time a new statistical feature is added to the wide component, the performance of keyphrase extraction improves, and the best
The results of adjusting different statistical features on the Inspec dataset
On the other hand, we observe the impact on the model results by adjusting the dimensions of different statistical feature vectors in the wide component. In the above experiment of testing the statistical features of the wide component, we first added the part-of-speech feature information. The blue curve in Fig. 4 represents the change of model performance under different part-of-speech vector dimensions. The results show that the best
After that, we set the part-of-speech dimension to 5 and vary the position dimension from 5 to 40. The red curve in Fig. 5 represents the change in model performance under different position vector dimensions when the wide component adds part of speech and position features. The best
We compare our proposed method to eight state-of-the-art keyphrase extraction methods including: EmbedRank [4], Bi-LSTM [1], Bi-LSTM-CRF [1], PositionRank [11], TF-IDF, TextRank [23], TopicRank [7], and SingleRank [34]. For the three graph-based methods of PositionRank, TextRank, and SingleRank, the window size is uniformly set to 10, and the damping factor is set to 0.85. For the topic-based method TopicRank, the centroid strategy is used when selecting keyphrases in the topic cluster, and the trade-off parameter
Comparison of our proposed approach with existing state-of-the-art models on the Inspec and kp20k datasets
As shown in Table 4, the performance of our method is significantly better than that of the other baseline methods on the two public datasets. Compared with the three methods based on deep learning, on the Inspec dataset, the
The comparative experiment results show that the performance of the methods based on deep learning models is generally higher than that of other types of extraction methods. The most plausible explanation is that the neural network architecture has a powerful representation learning ability for extracting useful feature information, which also shows that deep learning models can greatly improve the quality of the extracted keyphrases. The results also show that fusing multiple kinds of feature information in the extraction algorithm achieves better results than relying on a single kind of feature information, such as position information (PositionRank) or topic information (TopicRank).
In addition, during the experiments we found that the Inspec dataset contains noisy data: some manually labeled keyphrases do not appear in the text, which leads to inaccurate accuracy calculations. After removing such documents, 502 papers remain. However, experiments on this reduced set show that the resulting accuracy differs by only about 1% from the previous results. Therefore, the influence of the noisy data on the experimental results is not great.
To illustrate the extraction quality of our model, we show anecdotal evidence from different models using two documents from the Inspec test set. In Fig. 6, the first column shows the titles, abstracts, and manually annotated keyphrases of the two documents, in which we have marked the manually annotated keyphrases in bold. The second column shows the keyphrases extracted using the BERT and Bi-LSTM hybrid model mentioned in section 4.3, which has only the deep component. In particular, we use a blue bold font to mark the correctly extracted keyphrases. The third column shows the keyphrases extracted from the document using our proposed model (including both deep and wide components), with the correctly extracted keyphrases marked in a red bold font. As can be seen from Fig. 6, for the first document, the keyphrases predicted by the BERT and Bi-LSTM hybrid model cover two out of six author-assigned keyphrases, whereas our model correctly labels all the keyphrases. For the second document, the BERT and Bi-LSTM hybrid model hits three of the six author-assigned keyphrases, whereas our model hits four, and the other two extracted phrases are substrings of a remaining author-assigned keyphrase.
Examples of the extraction model only using deep component vs. our model for capturing keyphrases of documents, which are named “
The comparison of the above two examples shows that our model performs much better than the extraction model that uses only the deep component. This also directly reflects the importance of the wide component in our model. For the first document in Fig. 6, phrases that meet the part-of-speech rules in special positions (including the title, first sentence, and last sentence) are successfully found, such as “knowledge workers”, “kimura system”, “pervasive computing” and “data sources”. We also observe that the noun “populations” and the adjective “electronic”, which our model predicts for the second document but which are counted as misses, are substrings of the author-assigned keyphrase “electronic population data resources”. Besides, the predicted word “populations” appears frequently in the document, but neither of these two words is predicted by the model that uses only the deep component.
In this paper, we regard keyphrase extraction as a problem of sequence labeling and propose a new keyphrase extraction algorithm based on deep learning, called WD-Rank, which increases the accuracy of word labeling by fusing the statistical and semantic features of words. Our model architecture includes deep and wide feature representation learning and extraction components. Specifically, we design the wide component to extract the statistical feature information of the document, and adopt the network architecture that combines the BERT and Bi-LSTM model to form the deep component to extract the contextual semantic information of the word. Experimental results show that our proposed WD-Rank achieves better extraction results than other mainstream keyphrase extraction methods.
In future work, we have the following two considerations: The first is to use other neural network models to learn more feature information of words to improve the efficiency of keyphrase extraction. In addition, most of the current keyphrase extraction algorithms are effective in short texts, so how to improve the extraction effect in long texts is the second interesting direction.
Acknowledgments
This research was funded by the National Natural Science Foundation of China (No.61503116), the Special Project of Provincial Scientific Research Platform of Hefei Normal University (No.2020PT15), the Natural Science Foundation of the Anhui Higher Education Institutions of China (No.KJ2021A0902), the Major Science and Technology Projects of Anhui Province (No.201903a05020047), the Key Research and Development Program of Anhui Province (No.201904d07020012), the University Synergy Innovation Program of Anhui Province (No.GXXT-2021-090).
