Abstract
Automatic keyphrase extraction from texts is useful for many computational systems in the fields of natural language processing and text mining. Although a number of solutions to this problem have been described, semantic analysis is one of the least exploited linguistic features in the most widely-known proposals, which contributes to the low accuracy and performance rates of the results obtained. This paper presents an unsupervised method for keyphrase extraction, based on lexico-syntactic patterns for extracting information from texts and on fuzzy topic modeling. An OWA operator combining several semantic measures was applied to the topic modeling process. This new approach was evaluated with the Inspec and 500N-KPCrowd datasets. Several approaches within our proposal were evaluated against each other, and a statistical analysis was performed to substantiate the best approach of the proposal. This best approach was also compared with other reported systems, giving promising results.
Introduction
The exponential growth of textual and unstructured data in digital format has led to a significant challenge in textual information processing: distilling the most important information from the large amount of information available. The development of computational solutions based on natural language processing (NLP) and text-mining techniques has emerged as the most promising option for dealing with this challenge.
In this context, a high-level description of a document can be obtained through relevant words or phrases that are strongly related to the main topic(s) addressed in the document, so automatic keyphrase extraction is an essential task for many text-mining solutions [19, 10]. Keyphrases provide a concise understanding of a text, enabling one to grasp the central idea and the main topics discussed in a text document, and they facilitate the construction of text-representation models, such as graph-based models. Several automatic keyphrase extraction models have been created over the last few years, some following a supervised approach [18, 11] and others using unsupervised techniques [3, 16, 20, 25, 26, 27, 28].
In this study we focus on unsupervised keyphrase extraction, which does not require human-annotated training data for a machine-learning algorithm. The solutions reported still have low rates of accuracy and performance [10, 19], and semantics is one of the least-exploited linguistic features in the most widely-reported proposals, especially in unsupervised approaches.
According to [19], it is essential to focus on semantically and syntactically correct phrases and to make sure that the keyphrases are semantically relevant to the document topic and context. Topic modeling for keyphrase extraction from texts is reported in [26, 25, 3]; however, semantic analysis has not been considered in those proposals, or at least not in all its possible dimensions, which constitutes a weakness. The semantic analysis of textual content, at the level of word meaning or relationships between words, is usually influenced by subjectivity, vagueness and imprecision, due to the inherent ambiguity of natural language, which constitutes a challenge for computational solutions that require intensive semantic processing. Fuzzy logic offers a number of techniques for dealing with these problems, such as fuzzy set techniques, fuzzy clustering algorithms, and aggregation operators, among others. Despite these advantages, few keyphrase extraction proposals using a fuzzy logic approach to semantic analysis have been identified [26].
This paper proposes an unsupervised method for automatic keyphrase extraction from a single document. The method combines lexico-syntactic patterns with graph-based topic modeling, which is carried out from the fuzzy logic perspective. Syntactic and semantic measures are combined by applying the OWA (Ordered Weighted Averaging) aggregation operator [31], to increase the level of semantic processing of the candidate phrases in topic identification. The description of the method includes an example that uses a document from the Inspec dataset and shows the state of the text at each stage of the proposal. The method was evaluated with the Inspec [11] and 500N-KPCrowd [18] datasets, and the performance was measured using the precision, recall, and F-measure metrics. Several experiments were carried out to provide a deeper grounding for the contribution of the proposal. Different topic modeling approaches were evaluated, taking into account single and multi-criteria approaches. The multi-criteria approaches were compared using different numbers of keyphrases, in order to analyze the behavior of the proposal for different outputs. The best approach of the proposal was substantiated through a statistical analysis based on the well-known Friedman Test and Post-Hoc Tests. The approaches with the best results were compared with other state-of-the-art unsupervised proposals and improved on the results of the systems included in the comparison.
Specifically, the contributions of this paper are the following: (1) we propose a new way of processing the semantic information in topic-modeling based keyphrase-extraction solutions, applying a fuzzy aggregation operator (OWA), and (2) we show, on two datasets, that the fuzzy topic modeling proposed can improve accuracy in the unsupervised automatic keyphrase-extraction process.
The rest of the paper is organized as follows: Section 2 summarizes the analysis of related work; Section 3 sets out the theoretical background of the main concepts; Section 4 describes the proposed method; Section 5 presents the datasets, metric description and experimental results and the corresponding analysis. Conclusions and future lines of work are given in Section 6.
Related works
Solutions for automatic keyphrase extraction in text documents are usually designed in four phases: pre-processing, identification and selection of candidate phrases, keyphrase determination, and evaluation [19]. Unsupervised keyphrase extraction approaches typically follow a standard three-stage process [10]. The first stage involves choosing the candidate lexical units with respect to some heuristic, such as the exclusion of stop words or the choice of words that are nouns or adjectives. The second stage is ranking these lexical units by measuring their importance through co-occurrence statistics or syntactic rules. The final stage concerns keyphrase formation, where the top-ranked lexical units are used either as keywords or as components of keyphrases. The unsupervised approach has the advantage of using only the information contained in the input text to determine the keyphrases [3, 16, 20, 25, 26, 27, 28].
The common baseline approach for unsupervised keyphrase extraction is tf-idf [12]. It ranks phrases in a particular document according to their frequency in this document (tf), multiplied by the inverse of their frequency in all documents of a collection (idf). Recently, Florescu and Caragea [8] proposed an approach for combining tf-idf with any other word-scoring approach. In their approach, a phrase’s score is computed by multiplying its frequency within the document (tf) with the mean of the scores of the words in the phrase.
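As an illustration of the tf-idf baseline described above, the following is a minimal sketch in Python. The function name `tf_idf`, the toy collection, and the smoothed form of the idf term are assumptions made for this example; several idf variants exist and [12] may use a different one.

```python
import math
from collections import Counter

def tf_idf(doc_phrases, collection):
    """Score each phrase of one document by tf * idf.

    doc_phrases: candidate phrases (strings) found in the document.
    collection:  list of documents, each given as a list of phrases.
    """
    tf = Counter(doc_phrases)
    n_docs = len(collection)
    scores = {}
    for phrase, freq in tf.items():
        # Number of documents in the collection that contain the phrase.
        df = sum(1 for doc in collection if phrase in doc)
        # Smoothed idf: an assumption, chosen to avoid division by zero.
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[phrase] = freq * idf
    return scores

# Hypothetical toy collection of three documents.
collection = [
    ["ion exchange", "inverse problem", "ion exchange"],
    ["inverse problem", "numerical method"],
    ["numerical method", "graph model"],
]
scores = tf_idf(collection[0], collection)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy case "ion exchange" is ranked first because it is frequent in the first document and absent from the others.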
Studies often extract keyphrases by collecting adjacent important adjectives and nouns. In [13], a statistical study of four public corpora shows that about 15% of keyphrases contain other kinds of words. This proposal introduces words other than adjectives and nouns into keyphrases, which improves the performance of extraction. It describes a novel approach to extracting keyphrases by collecting noun phrases (NPs) as candidate keyphrases using syntactic information, i.e. chunks and constituent syntactic parsing trees. Hence, the well-formedness of keyphrases is ensured by taking noun phrases from chunks and parsing trees. In addition, words other than adjectives and nouns are also considered to be part of the keyphrase pattern if they appear in candidate noun phrases.
In SMAF Extractor [1], the n-grams resulting from removing all stop words from the text constitute the list of candidate keywords. For each keyword in that list, the algorithm fetches its synonyms from an external source. The synonyms in the new list are then compared to the keywords of the original list; if there is a match, the original list of keywords is updated and the weights are recalculated to include the new frequencies. To decide which keywords to extract, SMAF Extractor relies on combined statistical metrics: the traditional term frequency measure, the keyword heading weight, and the keyword first occurrence position weight.
YAKE [6] also uses a statistical analysis over the candidate terms. The statistics of each term are computed considering structure, term frequencies, and co-occurrence. From these terms, the most relevant keywords are selected. The relevance of each candidate term is computed by aggregation of a number of statistical features: casing; term position; term frequency normalization; term relatedness to context and term occurrences in different sentences.
In TextRank [20] the candidate terms and their relationships are represented in an unweighted and undirected graph, whose vertices represent the terms and whose arcs represent the co-occurrence relationships between them. An algorithm similar to PageRank [4] is applied to the constructed graph to determine the relevance of each vertex. Next, a third of the vertices of the whole graph are chosen as the most relevant vertices. Finally, the relevant terms are marked in the text and the sequences of adjacent words are selected as keyphrases. A similar solution is considered in the Salience Rank algorithm [28], in which the ranking of the words in the document is combined with other salience measures in the context of an LDA (Latent Dirichlet Allocation) based topic modeling approach. [26] describes the co-occurrence graph of the words of the input text, which is customized for each topic by using the semantic information obtained from the topic model (built from Wikipedia articles) to form the topic graphs. Next, the communities and central nodes of these topical graphs are identified. This is done using the fuzzy modularity criterion for measuring the goodness of overlapped community structures. RAKE [27] also applies a co-occurrence graph built with all the individual words found in the candidate keyphrases. The word score is calculated from the word degree as well as the word frequency. For multiple-word expressions, the weight is calculated by summing the weights of the member words.
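The RAKE scoring just described can be sketched as follows. This is a minimal illustration, assuming the commonly used degree-to-frequency ratio as the word score; the function name `rake_scores` and the example phrases are hypothetical.

```python
from collections import defaultdict

def rake_scores(candidate_phrases):
    """Score candidate phrases from word degree and word frequency."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in candidate_phrases:
        words = phrase.lower().split()
        for w in words:
            freq[w] += 1
            # Degree counts co-occurrences inside the phrase,
            # including the word itself.
            degree[w] += len(words)
    # Assumed word score: degree divided by frequency.
    word_score = {w: degree[w] / freq[w] for w in freq}
    # A phrase score is the sum of the scores of its member words.
    return {
        p: sum(word_score[w] for w in p.lower().split())
        for p in candidate_phrases
    }

phrases = ["ion exchange", "compressible ion exchanger", "inverse problems"]
scores = rake_scores(phrases)
```

Under this scoring, longer phrases tend to obtain higher scores, since each member word contributes its own weight to the sum.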
WikiRank [34] uses a topic annotator to identify meaningful sequences of words (concepts) in the text and link them to a related Wikipedia page. In addition, WikiRank uses noun-phrase patterns to identify candidate keyphrases. A semantic graph is built linking the concepts with the candidate keyphrases that contain them. The candidate keyphrases with the most links to concepts are selected as keyphrases. [33] proposes another graph-based approach. The graph is built using the identified words and sentences as vertices. Edges correspond to three kinds of relationships: sentence-to-sentence, word-to-word and sentence-to-word. In this approach a document can be grouped with many topics. [29] presents a multi-centrality index (MCI) approach, which aims to find the optimal combination of word rankings according to the analysis of nine centrality measures (Betweenness, Clustering Coefficient, Closeness, Degree, Eccentricity, Eigenvector, K-Core, PageRank, Structural Holes) for identifying keywords in co-occurrence word-graph representations of documents.
TopicRank [3] proposes a strategy based on the identification and analysis of topics to extract the relevant phrases. In this method, the longest sequences of nouns and adjectives in the text are extracted as candidate phrases, and the syntactically similar noun phrases are clustered into a theme or topic, using a hierarchical agglomerative clustering (HAC) algorithm [22]. Next, a graph is constructed where each vertex represents a topic and the arcs are labeled with a weight that represents the strength of the contextual relationship in the text between the candidate phrases contained in a topic and those grouped in another topic to which it relates. Finally, only one keyphrase from each topic is selected, which is a weakness because a topic can be represented by more than one keyphrase in the same text. This proposal is improved in [25], which conceives a more flexible procedure for keyphrase selection from topics and incorporates the definition of a “distance-between-phrases” function into the candidate-phrases clustering process, although semantic processing remains limited, as in the case of [3]. Liu et al. [16] also consider the clustering of candidate phrases to represent the document’s themes, and a co-occurrence-based relatedness measure is applied for computing the semantic relatedness of candidate terms in this process.
According to the related studies analyzed, graph-based term representation and topic modeling appear to be promising alternatives for unsupervised keyphrase extraction from text. The unsupervised methods offer more significant strengths than supervised ones; nevertheless, they have a weakness: the graph-based approach does not guarantee that the extracted keyphrases represent all the main topics of the document, and it fails to reach a reasonable coverage level of the text document [19]. The good keyphrases of a document should be semantically relevant to the document theme or topic and cover the whole document well [16]. In this sense, the analyzed work shows a low use of semantic analysis in the clustering and topic modeling process carried out, or in any other task included. This semantic processing has focused on computing the co-occurrence relatedness [16, 20, 26, 27] or distance-based contextual relationship [25]. However, there are other levels and measures of semantic analysis, such as semantic similarity and semantic relatedness measures, which have not been explored. Our work is aimed at assessing the benefits of these other semantic measures in topic modeling from the fuzzy logic perspective to improve the outcomes of the unsupervised keyphrase extraction process.
Background
Although a number of automatic keyphrase extraction solutions have been developed over the last few years, semantic issues are one of the least exploited linguistic features, as can be seen in the analysis of related works. The process of semantic analysis can be conceived, fundamentally, from two perspectives: (1) considering only the textual content, for example, exploiting the contextual or co-occurrence relationship between terms, or (2) taking advantage of an external knowledge base, such as WordNet [21].
WordNet [21] is a lexical database widely used to capture the underlying semantics of texts, whose basic structure is the synset (short for synonym set). A synset defines the meaning of a set of words that share a sense. Synsets are interconnected by several types of lexical and semantic relations and are distributed in the form of a semantic network. Through this semantic network, it is possible to determine the semantic similarity or relatedness between two words by analyzing the synset path formed by the different relations that connect them, directly or otherwise. WordNet::Similarity [24] is a freely available software package developed for this purpose, which makes it possible to measure the semantic relatedness of a pair of concepts (or word senses). In our proposal, two types of semantic relatedness measures from WordNet::Similarity were evaluated in the topic modeling process, specifically, LCH (Leacock & Chodorow) and JCN (Jiang & Conrath), with the latter being the most promising, according to [5].
In the automatic keyphrase extraction process, topic modeling refers to a clustering process of candidate phrases which may be strongly linked from different perspectives, and which are parts of the core topics. According to Liu et al. [16], the good keyphrases of a document should be semantically relevant to the document’s theme or topic and cover the whole document well. Therefore, the semantic processing in topic modeling reaches a higher level of importance. The semantic analysis of textual content, at the level of word meaning or relationships between words, is usually prone to problems of subjectivity, vagueness and imprecision, due to the inherent ambiguity of natural language. Fuzzy logic offers several techniques for dealing with these problems, such as fuzzy sets, fuzzy clustering algorithms, aggregation operators, and others. On the other hand, subjectivity and imprecision suggest that the semantic analysis in the identification of topics from candidate phrases should be complemented with other linguistic features (e.g. syntactic aspects) and context analysis, transforming topic detection into a multi-criteria decision problem.
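For reference, the two measures are usually stated as follows. This is a sketch of the commonly cited definitions and not a formula taken from this paper: $\mathrm{len}(c_1, c_2)$ denotes the length of the shortest path between the two concepts in the taxonomy, $D$ the maximum depth of the taxonomy, $\mathrm{lcs}(c_1, c_2)$ the least common subsumer of the two concepts, and $IC(c)$ the information content of a concept estimated from a corpus. The notation is an assumption of this illustration.

```latex
\begin{align*}
\mathrm{sim}_{LCH}(c_1, c_2) &= -\log \frac{\mathrm{len}(c_1, c_2)}{2D} \\
\mathrm{rel}_{JCN}(c_1, c_2) &= \frac{1}{IC(c_1) + IC(c_2) - 2\, IC(\mathrm{lcs}(c_1, c_2))},
\qquad IC(c) = -\log p(c)
\end{align*}
```

LCH therefore depends only on the path structure of WordNet, whereas JCN also uses corpus statistics through the information content.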
Many aggregation operators have been developed to aggregate information [30], including the Ordered Weighted Averaging (OWA) operator, which has been widely used as a solution to multi-criteria decision problems. Aggregation refers to the process of combining values (numerical or non-numerical) into a single value, so that the final result of aggregation takes into account, in a given fashion, all the individual aggregated values [9]. Therefore, the OWA operator can be very useful in combining semantics with other linguistic aspects, through weightings assigned to each measure to be aggregated. This operator allows clusters of phrases to be found that are strongly related from different semantic dimensions, and at the same time, to achieve a wide coverage of the whole document in the topic modeling process.
There are different methods for determining the weights to be used in an OWA operator and the use of linguistic quantifiers is one of them [35], e.g. RIM (Regular Increasing Monotone) quantifiers. Yager proposed a method to calculate the weights of an OWA by means of RIM quantifiers [32], which is defined in Eq. (2). In our proposal, four RIM quantifiers were evaluated (see Table 1), as the first approach to measure the performance of the OWA operator in the keyphrase extraction problem.
Linguistic quantifiers RIM
Specifically, in our proposal, we apply the RIM quantifier “Most” (Feng & Dillon) reported in [7] (see Eq. (5)), as the first approach to measure the performance of the OWA operator in the keyphrase extraction problem.
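The weighting scheme and the aggregation step can be sketched as follows. This is a minimal illustration, assuming the usual form of Yager's method, $w_i = Q(i/n) - Q((i-1)/n)$, and a power quantifier $Q(r) = r^{\alpha}$ as a stand-in RIM quantifier. The function names, the value of $\alpha$ and the four measure values are hypothetical, and the quantifiers of Table 1 may be defined differently.

```python
def rim_weights(n, quantifier):
    """OWA weights from a RIM quantifier Q: w_i = Q(i/n) - Q((i-1)/n)."""
    return [quantifier(i / n) - quantifier((i - 1) / n) for i in range(1, n + 1)]

def owa(values, weights):
    """Ordered weighted average: weights apply to values sorted descending."""
    ordered = sorted(values, reverse=True)
    return sum(w * b for w, b in zip(weights, ordered))

def power_quantifier(alpha):
    """Assumed RIM quantifier Q(r) = r**alpha (not necessarily one of Table 1)."""
    return lambda r: r ** alpha

# Hypothetical values of four measures computed for one pair of phrases.
measures = [0.8, 0.2, 0.6, 0.4]
weights = rim_weights(len(measures), power_quantifier(2.0))
score = owa(measures, weights)
```

With $\alpha = 1$ the operator reduces to the arithmetic mean, while larger values of $\alpha$ shift weight toward the smallest measures, so a high aggregated score requires most of the measures to agree.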
The proposed method combines lexico-syntactic patterns with topic modeling carried out from a fuzzy perspective. The method has four phases, as shown in Fig. 1: (1) text pre-processing, (2) fuzzy topic identification, (3) relevance evaluation of topics, and (4) keyphrase selection. Lexico-syntactic patterns were defined for extracting candidate phrases from the text, and a fuzzy clustering of candidate phrases is proposed for identifying the main topics in the texts, to improve the semantic analysis with respect to other proposals [3, 25, 26]. The method also incorporates a more flexible mechanism of keyphrase selection from the relevant topics identified, which allows the extraction of more than one keyphrase per topic and solves the weakness identified in TopicRank [3].
Process of keyphrases extraction.
For a better understanding of the proposal, an example is developed step by step throughout the description of the method. The example text, shown below, is a document selected from the Inspec dataset. The gold keyphrases are highlighted in bold.
Example Text:Inverse problems for a mathematical model of ion exchange in a compressible ion exchanger.A
At this stage, different NLP tasks are carried out to extract from the text the syntactic information which is required for the candidate phrase extraction process. Initially, plain text from the input file is extracted, segmented into paragraphs and sentences, and the set of tokens (e.g., words, numbers, and others) is obtained from each sentence. Subsequently, a deep syntactic analysis is carried out using the Freeling parser.
The extraction of candidate phrases is based on the identification of conceptual phrases and a set of lexico-syntactic patterns defined for this purpose, such as: [D | P | Z]
Table 2 lists the candidate phrases and the corresponding patterns resulting from the pre-processing phase. The candidate phrases highlighted in bold match a reference keyphrase exactly, and the two that are underlined contain a reference keyphrase (ion exchange). In this case, the patterns used do not identify that reference keyphrase exactly, but it is included in the two underlined candidate phrases; only one reference keyphrase cannot be identified at all.
Candidate phrases and patterns
The topic identification process is carried out using a hierarchical agglomerative clustering algorithm [22] over the extracted candidate phrases, which is addressed as a fuzzy logic problem for reinforcing the semantic analyses in the phrases clustering. The hierarchical agglomerative clustering algorithm was selected following the approach of TopicRank [3]. This algorithm assumes a similarity function to determine the similarity between two instances (or candidate phrases), which is perfectly suited to our problem. The need for a clustering algorithm that does not restrict the number of clusters is another point in its favor, given that a changing number of topics may be present in different texts, especially when it comes to achieving a general solution, i.e. for documents of different lengths.
Although the use of clustering algorithms for topic modeling has also been reported in [3, 25, 26], semantic analysis is not considered in those proposals, or at least not in all its possible dimensions. This is a weakness considering the assumption that a topic could be modeled through the clustering of concepts that frequently appear together as well as concepts with similar meanings or that are semantically related. To address this weakness, in our new unsupervised approach the phrase clustering process is carried out considering the score resulting from combining the syntactic similarity and distance-between-phrases measures reported in [25] with a further two semantic similarity measures, by applying a fuzzy aggregation operator. Moreover, the average distance (in words) between each pair of words of each pair of candidate phrases is calculated by Eq. (6), where
The two semantic similarity measures were conceived according to the sentence-to-sentence similarity metric reported in [15] and using two word-to-word semantic relatedness metrics from WordNet::Similarity package, specifically the Jiang and Conrath and Leacock and Chodorow metrics [24]. Additionally, the words distance metric reported in [25] was redefined (see Eq. (7)).
Where
The hierarchical agglomerative clustering process is carried out by creating a square symmetric matrix of size n (the total number of candidate phrases identified), where each topic identifies a row and a column, and the intersection between each pair of topics contains the SRS (weight value) between the pair of candidate phrases that represent the corresponding topics. The relatedness matrix created from the example text is shown in Table 3.
Matrix creation
Topics clustering: First iteration
Initially, each candidate phrase is considered as a topic. In each iteration, the pair of topics with the highest weight value is merged. The average of the weight values is used as the clustering strategy for a pair of topics because it represents a balance between complete linking and single linking. Through the use of average linking, the weighting of the relation between the newly formed topic
being
Table 5 represents the second iteration, performed on the matrix of Table 4. In the case of the text taken as an example, the second iteration is the last one, and thus the topics are formed. In this iteration, the topics merged into a new topic were [proposed methods] and [numerical solution methods], with the best weight, 0.4958.
Topics clustering: Second iteration
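The iterative merging illustrated in these tables can be sketched as follows. This is a minimal illustration of average-linkage agglomerative clustering over a similarity matrix, not the authors' Algorithm 1: the function name, the stopping threshold and the toy matrix are assumptions, and the linkage is taken here as the unweighted mean over all phrase pairs of the two topics.

```python
def average_linkage_clustering(sim, threshold):
    """Merge topics while the best average similarity reaches the threshold.

    sim: symmetric n x n similarity matrix between candidate phrases.
    threshold: assumed stopping criterion (hypothetical parameter).
    Returns a list of topics, each a list of phrase indices.
    """
    clusters = [[i] for i in range(len(sim))]  # each phrase starts as a topic

    def link(a, b):
        # Average linkage: mean similarity over all cross-topic phrase pairs.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = link(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break  # no remaining pair of topics is related strongly enough
        x, y = pair
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters

# Hypothetical similarity matrix for four candidate phrases.
sim = [
    [1.0, 0.6, 0.1, 0.2],
    [0.6, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.5],
    [0.2, 0.1, 0.5, 1.0],
]
topics = average_linkage_clustering(sim, threshold=0.4)
```

Because the number of topics is determined by the data and the threshold rather than fixed in advance, the same procedure applies to documents of different lengths.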
Fuzzy Identification of Topics
The process of clustering is shown in Algorithm 1. The algorithm input is a list of candidate keyphrases resulting from the pre-processing phase and the output is a list of all identified topics. Firstly, the matrix is filled out (lines 2 to 6), where the distance (line 4) between each pair of phrases (
The stage concludes by generating a graph representation of the text, in which the identified topics are represented as vertices and these are linked by arcs labeled with the weight of the relation between them. Each weight represents the strength of the existing semantic relationship between the pair of topics. Topics A and B have a strong semantic relationship if the candidate phrases included in these topics frequently appear close together in the text. The weight
Figure 2 shows a sample of the graph built considering the identified topics from the example text and illustrates the output of this stage. The vertices correspond with the topics identified and the edge weights correspond to those calculated according to Eq. (9).
Graph of topics.
The example shows the clusters resulting from the clustering approach. Indeed, the clustering succeeds in grouping “model of ion exchange” and “process of ion exchange”, which share a high semantic content.
At this stage, the relevance of each topic represented in the constructed topic graph is evaluated using the TextRank [20] model.
The relevance score computed for each topic
The process of edge and vertex weighing of the complete graph is described in Algorithm 2. The input is a topic list (unweighted topics)
Topic Evaluation
Table 6 shows an example of the ranking of topics identified from the text shown above. From these topics, the keyphrases will be selected.
Ranking of topics
The best weighted topics are those with the strongest semantic relationship to other topics; therefore, the best keyphrases should be identified from these topics. In this sense, it can be seen below, in Section 5, that the keyphrases that coincide with the reference keyphrases are identified from the best-valued topics.
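The ranking of topics with a TextRank-style model can be sketched as follows. This is a minimal illustration of the weighted TextRank recurrence on an undirected graph, not the authors' Algorithm 2: the function name, the damping factor of 0.85, the fixed number of iterations, and the topic labels and edge weights are all assumptions of this example.

```python
def rank_topics(edges, damping=0.85, iterations=50):
    """Weighted TextRank on an undirected graph.

    edges: dict mapping (topic_a, topic_b) to a positive weight.
    Returns a dict mapping each topic to its relevance score.
    """
    nodes = sorted({n for edge in edges for n in edge})
    adj = {n: {} for n in nodes}
    for (u, v), w in edges.items():
        adj[u][v] = w
        adj[v][u] = w
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        new = {}
        for n in nodes:
            total = 0.0
            for m, w in adj[n].items():
                # Neighbor m passes on a share of its score proportional
                # to the weight of the edge (m, n).
                total += w / sum(adj[m].values()) * score[m]
            new[n] = (1 - damping) + damping * total
        score = new
    return score

# Hypothetical complete graph over three topics.
edges = {("T1", "T2"): 3.0, ("T1", "T3"): 1.0, ("T2", "T3"): 0.5}
scores = rank_topics(edges)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph the topic with the strongest connections to the others is ranked first, which matches the intuition stated above.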
The selection of keyphrases from the most relevant topics identified in the previous phases is carried out according to the following criteria: (1) the candidate phrase that first appears in the text; (2) the most frequently used candidate phrase; and (3) the candidate phrase that has the closest relationship with the others in each topic (centroid role). A mechanism that allows the three criteria to be combined has been implemented in our proposal, offering the possibility of extracting more than one keyphrase from each topic and greater flexibility in its execution with respect to the mechanism reported in [3], where only one of the criteria is considered, which affects the coverage of the keyphrase extraction process. If more than one candidate phrase (associated with a topic) with the same highest frequency is identified, and the frequency value is higher than one, then all of them are selected. Otherwise, only the first candidate phrase that appears in the text is chosen.
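The frequency and first-occurrence rule stated in the last two sentences can be sketched as follows. This is one literal reading of that rule and not the authors' Algorithm 3: the centroid criterion is omitted, and the function name and the example data are hypothetical.

```python
from collections import Counter

def select_keyphrases(topic, occurrences):
    """Select keyphrases from one topic.

    topic: set of candidate phrases grouped in the topic.
    occurrences: candidate phrase occurrences in text order.
    Assumes at least one phrase of the topic occurs in the text.
    """
    in_topic = [p for p in occurrences if p in topic]
    freq = Counter(in_topic)
    top = max(freq.values())
    tied = [p for p in freq if freq[p] == top]
    # Several phrases share the highest frequency, which is above one:
    # all of them are selected.
    if top > 1 and len(tied) > 1:
        return tied
    # Otherwise only the phrase that appears first in the text is chosen.
    return [in_topic[0]]

# Hypothetical sequence of candidate phrase occurrences.
occurrences = [
    "model of ion exchange",
    "process of ion exchange",
    "model of ion exchange",
    "process of ion exchange",
    "ion exchanger",
]
both = select_keyphrases(
    {"model of ion exchange", "process of ion exchange"}, occurrences
)
single = select_keyphrases({"ion exchanger"}, occurrences)
```

Here the first topic yields two keyphrases, which is the behavior that distinguishes this selection step from choosing a single phrase per topic.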
The process of keyphrase selection is described in Algorithm 3. The input is a ranked topic list
Key-phrases Selection
Following the description of the method and the example developed through it, the resulting keyphrases are listed below. In the list of keyphrases identified from the example text, those that match the reference are highlighted in bold, and those that include a reference keyphrase are underlined:
For this example, we can see a high level of matching (4.5 out of 7) between the keyphrases identified by our proposal and the reference keyphrases, taking into account the one that includes a reference keyphrase. In this sense, the accuracies reached, as reflected by the Precision and Recall metrics (addressed later in Subsection 6.1), are 75% and 64.3% respectively.
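The reported values can be checked with the following arithmetic. The number of extracted keyphrases (6) is not stated in the text; it is inferred here from the reported precision of 75%, and the count of 4.5 is read as partial credit for the phrase that includes a reference keyphrase. The F-measure is derived for illustration and is not a value reported for this example.

```latex
\begin{align*}
R &= \frac{4.5}{7} \approx 0.643, \\
P &= \frac{4.5}{6} = 0.75, \\
F &= \frac{2PR}{P + R} = \frac{2 \cdot 4.5}{6 + 7} \approx 0.692.
\end{align*}
```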
Experiments and results
Datasets description
To evaluate the effectiveness of our proposal, we used two standard and publicly available datasets characterized by different types and sizes of documents. The Inspec dataset [11] consists of 2000 abstracts of scientific journal papers in computer science collected between the years 1998 and 2002, and divided into sets of 1000, 500, and 500, as training, validation and test datasets respectively. Each document has two lists of keywords assigned by humans: controlled keywords, which are assigned by the authors, although restricted by the Inspec thesaurus, and uncontrolled keywords, which are freely assigned by expert readers. The controlled keywords are mostly abstractive, and therefore may not appear in the document, whereas the uncontrolled ones are mostly extractive.
The 500N-KPCrowd dataset [17] consists of 500 English broadcast news stories in 10 different categories (e.g. Politics, Sports) with 50 documents per category. The ground truth or gold standard is defined using Amazon’s Mechanical Turk service to recruit and manage taggers. Multiple annotators were required to look at the same news story and assign a set of keywords from the text itself. The final ground truth consists of keywords selected by at least 90% of the taggers. A statistical characterization of these datasets is shown in Table 7.
Datasets characterization
The performance of the method was measured using the precision (P), recall (R), and F-measure (F) metrics.
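These metrics, averaged over documents (macro-averaging), can be sketched as follows. This is a minimal illustration assuming exact string matching between extracted and gold keyphrases; the function names and the two toy documents are hypothetical, and the matching criterion used in the experiments may differ.

```python
def prf(extracted, gold):
    """Precision, recall and F-measure for one document (exact matching)."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    p = correct / len(extracted) if extracted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def macro_average(per_document):
    """Average each metric over documents, giving every document equal weight."""
    n = len(per_document)
    return tuple(sum(doc[i] for doc in per_document) / n for i in range(3))

# Two hypothetical documents: (extracted keyphrases, gold keyphrases).
documents = [
    (["a", "b", "c", "d"], ["a", "b", "e"]),
    (["x", "y"], ["x", "y"]),
]
macro_p, macro_r, macro_f = macro_average([prf(e, g) for e, g in documents])
```

Macro-averaging gives each document the same influence on the final score regardless of how many keyphrases it has.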
In this section, we will describe the experimental setup that was considered for both datasets and used to evaluate the effectiveness of the OWA-based topic modeling in keyphrase extraction. In our experiment, we consider the uncontrolled keywords from Inspec as gold-standard keyphrases to guarantee that the keywords appear in the text. In the case of the 500N-KPCrowd, the most and second most selected keyphrases in the ranked keyphrases of each document were considered as gold standard. For each document and each algorithm, we compute the macro-averaged precision, recall and F-measure for measuring the algorithm’s performance. The following experimental tasks were performed:
(1) Evaluating four variants of the OWA operator, using four different linguistic quantifiers, and comparing them with two other variants based on the use of single semantic relatedness metrics, in order to select the quantifier that provides the best results and to verify the benefits of the aggregation metric in semantic processing. (2) Evaluating the impact of each variant of the OWA-based solution according to the Top N keyphrases extracted, which provides more detail for the evaluation of the different quantifiers under assessment. (3) Comparing the results obtained by the best OWA solution identified in the previous tasks with the results obtained by other proposals.
Friedman’s Test was performed to validate the results obtained. From each dataset, 250 texts were randomly selected to constitute the sample group, up to 50% of the processed texts. In each test, the confidence level was 95%, which means that the null hypothesis will be rejected when the
Table 8 shows the results of the first experimental task, where the evaluated solutions are grouped into single-criterion (topic modeling based on the use of single semantic relatedness metrics) and multi-criteria (topic modeling through the OWA-based aggregation metrics) approaches; the best results are highlighted in bold. As shown in Table 8, the precision obtained by the use of the LCH or JCN measures in Inspec, as well as the recall in 500N-KPCrowd, were good, although not sufficient. This experiment showed that, with single semantic relatedness metrics, although the precision increases, the recall significantly diminishes when the size of the documents increases. This behavior is not shown in the OWA-based solution of topic modeling, which reached higher values in most of the evaluation metrics in both datasets, specifically the OWA
In [25], the contribution of single metrics of syntactic similarity or distance between phrases to keyphrase extraction was demonstrated for short and long texts. Then, based on those reported results, and to obtain a more general proposal that offers good results for different types of documents, these metrics were aggregated with the semantic measures in the topic modeling process, which is the key process in our approach. According to the results shown in Table 8, using the OWA
Results with Inspec and 500N-KPCrowd according to the topic modelling approaches
To verify the effectiveness of these solutions, they were evaluated considering the Top N (N
Impact of the proposed OWA-based solution according to the Top N keyphrases extracted
These results were validated through statistical tests, applying Friedman’s Test to obtain the solution ranking, complemented with Post-Hoc Tests to find significant differences between the solutions. The previous results show the following to be true:
H H H H
First, statistical tests were performed to analyze the linguistic quantifiers in the OWA-based approaches; the results obtained from the sample group of each dataset are shown in Table 10. Friedman’s Test showed that the best solutions are OWA
Results of the Friedman Test and Post-Hoc procedures applied to each OWA-based proposal
Having identified the best OWA solution for each dataset, a similar statistical evaluation was performed, this time comparing these solutions with the single semantic measures; the results are shown in Table 11. In this case, Friedman’s Test shows that the best solutions are OWA
Results of the Friedman Test and Post-Hoc procedures applied to the best OWA-based solution and the single criteria approach
Having established that the best approach of the proposal is the OWA-based solution, its results were compared with those of other algorithms reported in the state of the art. This comparison is shown in Table 12, where the proposal gives promising results.
Comparative results with other algorithms reported
On Inspec, our proposal shows a slight improvement in F-measure (just 0.2%) over the state of the art reported by Vega et al. [29]. Its recall and precision are both the third best. The best recall on Inspec, achieved by Liu et al. [16] (66%), is due to their clustering based on term semantic relatedness, which guarantees that the extracted keyphrases have good coverage of the document. Nevertheless, our proposal generally obtains the best results.
The recall achieved with 500N-KPCrowd was the least satisfactory result of our proposal, although the precision and F-measure were significantly better than those obtained by the other proposals. Although the recall of Yign et al. [33] is the highest for 500N-KPCrowd, its precision is approximately 9% lower and its F-measure 2% lower than those of our method.
The low recall value can be explained by the high number of named entities annotated as keyphrases in 500N-KPCrowd. The identification of named entities as candidate phrases was not considered within the patterns defined in the pre-processing phase of the proposed approach, because this type of phrase is not often identified as a keyphrase. In addition, the OWA operator applied in the proposed fuzzy modeling of topics aggregates several semantic measures, which may fail in the case of named entities. This situation suggests that a specific analysis for this type of phrase is needed in subsequent applications of our proposal. Nevertheless, the experiments demonstrate the improvement in effectiveness achieved by our method and by the proposed fuzzy-based semantic processing in automatic keyphrase extraction from two types of texts, namely paper abstracts and news stories.
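To make the aggregation step concrete, the following sketch shows how an OWA operator guided by a linguistic quantifier combines several relatedness scores into one value. It is an illustration under assumptions, not the configuration used in the experiments: the quantifier family Q(r) = r^alpha, the alpha values, and the input scores are all chosen here for demonstration, and the function names are hypothetical. The weights follow the usual quantifier-guided construction w_i = Q(i/n) - Q((i-1)/n).

```python
from typing import Callable, List, Sequence


def rim_quantifier(alpha: float) -> Callable[[float], float]:
    """Q(r) = r**alpha, a regular increasing monotone quantifier.

    Assumption: this family is used only for illustration; alpha > 1
    behaves like "most", alpha < 1 like "at least some".
    """
    return lambda r: r ** alpha


def owa_weights(n: int, quantifier: Callable[[float], float]) -> List[float]:
    """Quantifier-guided weights w_i = Q(i/n) - Q((i-1)/n); they sum to 1."""
    return [quantifier(i / n) - quantifier((i - 1) / n) for i in range(1, n + 1)]


def owa(values: Sequence[float], weights: Sequence[float]) -> float:
    """Ordered weighted average.

    The weights are attached to positions after sorting the values in
    descending order, not to the measures that produced the values.
    """
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))


if __name__ == "__main__":
    # Invented relatedness scores from four measures for one pair of phrases.
    scores = [0.8, 0.4, 0.6, 0.2]
    mean_like = owa(scores, owa_weights(len(scores), rim_quantifier(1.0)))
    most_like = owa(scores, owa_weights(len(scores), rim_quantifier(2.0)))
    print(round(mean_like, 3), round(most_like, 3))
```

With alpha = 1 the operator reduces to the arithmetic mean. With alpha = 2 the lower scores receive more weight, so a pair of phrases is considered related only when most of the measures agree. A measure that returns no usable score, as may happen for named entities, lowers the aggregated value under such a quantifier.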
This paper presents a new unsupervised method for automatic keyphrase extraction from text, which combines the use of lexico-syntactic patterns to identify the candidate phrases with a fuzzy modeling of topics. The use of linguistic patterns allowed more possibilities for identifying the candidate phrases and improved the coverage of the text. Several syntactic and semantic measures for modeling the most relevant linguistic features of the candidate phrases were aggregated by applying an OWA aggregation operator. The aggregation of these measures through the OWA operator increased the semantic processing of the candidate phrases in the topic identification, which is a little-considered aspect in most of the existing proposals. The proposed method was evaluated on two datasets with different types of texts, and the results obtained were compared with those from other unsupervised schemes. From the different approaches analyzed in the proposal, it was possible to demonstrate that aggregating several semantic measures (multi-criteria) achieves better results than using these measures independently (single criteria). This was shown not only by the differences between the results, but also by the statistical analysis, which confirmed the benefits of using a multi-criteria approach. In this sense, it was also concluded that increasing the number of identified keyphrases can improve the results.
A slight improvement in F-measure was achieved in both datasets compared with other proposals reported in the state of the art. The most significant result was obtained in 500N-KPCrowd, where a remarkable improvement in precision was found compared with the other proposals. Although in general the best values of precision and recall were not achieved, it was possible to obtain a better balance of these metrics, which contributed to the improvement of the F-measure values. The results obtained with the proposed method are promising, demonstrating the contribution of the applied fuzzy topic modeling to improving the keyphrase extraction process, both in paper abstracts and in more general texts, such as news stories. Improving the recall on general-domain texts is a challenge to be solved in the future, considering a specific analysis for named entities. Additionally, other linguistic quantifiers applied to the OWA operator will be assessed and their performance in the keyphrase extraction process will be measured.
Acknowledgments
This work has been partially supported by FEDER and the State Research Agency (AEI) of the Spanish Ministry of Economy and Competitiveness under grant MERINET: TIN2016-76843-C4-2-R (AEI/FEDER, UE).
