Abstract
The constant increase in the production of scientific literature is making it very difficult for experts to keep up to date with the state-of-the-art knowledge in their fields. The use of Natural Language Processing (NLP) is becoming a necessary aid to tackle this challenge. In the NLP field, the task of measuring semantic similarity between two sentences plays a vital role. It is a cornerstone for tasks like Q&A, Information Retrieval, Automatic Summarization, etc., and it is a crucial element in the ultimate goal of computers being able to decode what is conveyed in human language expression.
Measuring Semantic Similarity (SS) in short texts has specific challenges. Because there are fewer words to be compared, the meaning contribution of each word is more relevant, and it is important to take into account the syntax’s contribution to the composed meaning. In addition, the highly specific and specialized vocabulary of our domain, Microbial Transcriptional-Regulation, implies a lack of massive training resources. Our approach has been to use an ensemble of similarity metrics, including string, distributional, and knowledge-based metrics, and to combine the results of such analyses. We have trained and tested these methods on a similarity corpus developed in-house.
The task has proved very challenging, and the ensemble strategy has proved to be a good approach. Even though there is still much room for improvement in how closely our methods match the human evaluation, the ensemble reaches a strong correlation with the expert scores (ρ = 0.700).
Introduction
Semantic Textual Similarity (STS) has many practical applications, frequently as a component of more complex tasks, such as information retrieval, text generation, automatic summarization, and plagiarism and paraphrase detection, among others.
STS can also be used to search similar statements within a collection of texts (a corpus). For example, it can be used to recover supporting information about a target statement, to locate the same proposition within different contexts, and to detect unique statements within the corpus.
Automatic processes like these are becoming very important in an era where the exponential increase in information production has made it increasingly difficult to maintain an integral and updated perspective.
One such case is the biomedical field, where the amount of information created by the scientific community is becoming unmanageable. For instance, the reference literature repository in the field, PubMed, has more than 26 million articles, and it grows at a rate of two new publications per minute.
Even in very narrow topics such as microbial transcriptional regulation, the scientific community struggles to cope with the vast amount of information continuously produced.
A representative example is the RegulonDB project, an international reference for the subject and the primary database regarding transcriptional regulation in Escherichia coli strain K-12. The knowledge contained in RegulonDB is the product of vast amounts of time and effort of several highly qualified curators who for years have been reading the related research articles. This literature curation is a continuous, slow, and costly task where the experts have to grasp the knowledge contained in the disaggregated literature and structure it into a database.
RegulonDB is an ongoing project in which, in the spirit of finding new means to increase the efficiency and quality of curation, NLP tools have been explored to assist curators. Semantic similarity (SS) applied to the Transcriptional-Regulation literature is one of these research lines.
We first tried two STS baseline methods — Vector Space Models and an average of a word-to-word similarity using WordNet — with poor results (ρ <0.4 Pearson’s correlation).
It was clear that we needed more sophisticated strategies, but most state-of-the-art methods rely on vast amounts of training data. It is worth noting that STS is very sensitive to the intended task and to the targeted domain. Certain text features, domain relations, and specific conceptual interpretations delineate the kind of textual similarity; that is, whether what is being evaluated is closer to syntactic similarity, to conceptual-ontological similarity, or to analogy.
Therefore, despite the availability of several paraphrase and graded similarity corpora, such as Microsoft Research Paraphrase (MSRP) ([6]) and the SEMEVAL corpus ([1]), these are not representative of the SS within such a specific topic as ours. The challenges of STS in Transcriptional-Regulation go beyond the very specialized vocabulary. To illustrate, experts weight certain conceptual resemblances differently according to the class of the concepts and to the context of the accompanying sentences; e.g., if two sentences mention objects of the same class, like Transcription Factors (TFs) or Promoters, they are evaluated as more similar than two sentences mentioning objects of the Gene class. Hence, we decided to use an ad hoc corpus of pairs of sentences from the Microbial Transcriptional-Regulation literature, in which the semantic similarity of each pair of sentences was manually scored by experts [18].
Because our corpus is not big enough for data-intensive machine learning (ML) methods, we were limited to heuristics and unsupervised measures. After trying some common similarity measures independently, with poor results, we chose a combination strategy where multiple measures were merged. These strategies are more resilient to data variability, they can use different types of measures (structured knowledge, textual, and statistical), and they have the flexibility to give specific weights to syntax, lexicon, and textual resemblances in order to adjust to what the domain experts consider semantically similar. Besides, they do not require much training data; a small set such as our corpus is enough to train the combination method.
The presented approach resulted in acceptable correlation scores (ρ = 0.700). Although there is much margin for improving the scores, they are good enough for some of our purposes, namely their application in assistance tools. Moreover, this study gave us valuable insight into STS within the Transcriptional-Regulation domain and some ideas of how to improve it.
One classification of NLP’s similarity measures is according to their information source. It can be lexical or semantic, and the measures can be grouped as follows: textual based, those which use the textual resemblance to compute the degree of similarity among written language fragments; structured resource based, which use the human knowledge encoded in dictionaries, taxonomies, and ontologies to identify and compare the composing elements of the compared texts; and unstructured resource based, for those whose source of knowledge is, commonly, textual corpora (e.g., literature) [3, 13].
Among the structured resources, ontologies are the richest in knowledge and, thus, frequently preferred. They are systems of abstract, and commonly hierarchical, descriptions of specific domains. They have the advantage of an underlying structure which can also be considered to measure the similarity among their elements. Most similarity methods based on these kinds of resources can be classified into hierarchical based, feature based, information content based, and the ones that are a combination of the others. Some examples of these methods are summarized in references [7, 17].
Measures which use unstructured resources are frequently based on the distributional hypothesis, i.e., words occurring in similar contexts tend to have similar meaning [8, 12]. Works like the studies described in references [2, 34] use this hypothesis and represent words as multidimensional vectors with dimensions’ magnitudes related to the cooccurrence frequencies. These vectors represent the words within a geometrical space, where it is easier to measure similarity (e.g., Euclidean distance, cosine). Recently, there has been much attention on word embeddings, representations based on the same principle but where the dimensionality is reduced by a number of means. Most representative word embedding examples involve Word2Vec ([20]) and GLOVE ([24]). Although these representations were initially focused on lexical units (words), lately there have been approaches where phrases or whole sentences are encoded in embeddings ([16, 30]).
Finally, there are strategies that combine multiple information sources, not necessarily of the same type (e.g., textual, grammatical, and semantic) ([14, 28]). Even at the semantic level, it is common to utilize more than one type of knowledge resource. The advantage is that the resources complement each other: the knowledge implicit in the literature covers what is not necessarily present in ontologies or thesauri (knowledge explicitly encoded by people), and vice versa. For example, from an ontology it is possible to compare the taxonomical resemblance between shark and dolphin, but the fear socially associated with sharks and not with dolphins would probably be missing. As a consequence, these ensemble strategies are usually more robust and can be better adapted to nuances of the expected similarity.
Materials and Methods
We used an ad hoc corpus specific to the Transcriptional-Regulation literature ([18]) which better represents the nuances of SS in the domain. The corpus consists of 171 pairs of sentences extracted from 5,600 articles related to the topic, each rated by 3 annotators taken by chance from a group of 7 (non-fully-crossed design). The rating scale was ordinal and ranged from 0 to 4; 4 means that both sentences express the same meaning, and 0 means that their meanings do not overlap at all. Although not balanced — it presents a bias towards no similarity with 40% of 0 rated pairs — it has >50% pairs evaluated with a score within the range of 1-3. Besides, it has a very good interagreement ([33]) coefficient (Gwet’s AC2) of 0.8696.
The implemented pipeline consisted of preprocessing the sentence pairs, applying similarity measures, and finally, combining the individual scores.
In the preprocessing step, we used a list of objects from RegulonDB with multiword names, sorted the objects (from longest to shortest), and looked for occurrences within texts by using the left-right longest match. In this step we identified multiword concepts such as isoosmotic condition, transcriptional dual regulator, DNA-binding protein, KdpE-phosphorylated, acid-responsive, and cold-shock, among others (almost 400). Next, lexical units were lemmatized and tagged with their part of speech (POS), and sentence constituency trees were parsed.
The STS step consisted of applying three types of measures: string metrics, which compare how similar the words and their order are between two sentences; distributional metrics, which are based on the premise that context is representative of word meaning, so words are represented as multidimensional vectors in geometric spaces where proximity is used as the SS; and ontological metrics, which rely on the structure of the knowledge to determine how close the meanings of two concepts are.
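The left-right longest match over multiword names can be illustrated with a minimal sketch. The function name `tag_multiword_terms`, the underscore joining, and the case-insensitive comparison are assumptions made for illustration; the paper does not describe its implementation at this level.

```python
def tag_multiword_terms(tokens, terms):
    """Greedy left-to-right longest match of multiword terms over a token list.

    Matched spans are merged into a single token joined by underscores
    (an illustrative choice, not taken from the paper).
    """
    # Represent each term as a tuple of lowercased tokens, longest first.
    term_tuples = sorted(
        {tuple(t.lower().split()) for t in terms if t.strip()},
        key=len,
        reverse=True,
    )
    out, i = [], 0
    while i < len(tokens):
        for term in term_tuples:
            n = len(term)
            window = tuple(tok.lower() for tok in tokens[i:i + n])
            if window == term:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No term starts at position i: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out
```

Because candidates are tried from longest to shortest at each position, "transcriptional dual regulator" wins over a shorter entry such as "dual regulator" when both are in the list.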
Finally, in the combination step, regression and ensemble algorithms were used to merge the individual results from the previous step. We experimented with the following regression algorithms: Linear Regression, Multilayer-Perceptron, and Random Forest. On top of that, we applied the ensemble algorithms Bagging and Voting to investigate if they provided any improvements.
The rest of this section is focused on explaining the similarity measures used.
Textual similarity measures
Levenshtein
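The Levenshtein distance metric measures the difference between two strings by computing the number of insertions, deletions, and substitutions needed to transform a source sentence (S1) into a target sentence (S2). The resulting score represents a distance; thus, the greater the distance, the more different the strings are. The distance between S1 and S2 is given by $LD_{S_1,S_2}(|S_1|, |S_2|)$, where:

\[
LD_{S_1,S_2}(i,j) =
\begin{cases}
\max(i,j) & \text{if } \min(i,j) = 0, \\
\min
\begin{cases}
LD_{S_1,S_2}(i-1,j) + 1 \\
LD_{S_1,S_2}(i,j-1) + 1 \\
LD_{S_1,S_2}(i-1,j-1) + 1_{(S_{1_i} \neq S_{2_j})}
\end{cases} & \text{otherwise,}
\end{cases}
\]

and $1_{(S_{1_i} \neq S_{2_j})}$ equals 1 when the i-th element of S1 differs from the j-th element of S2 and 0 otherwise.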
In order to use it as a similarity metric, it can be normalized and complemented. If strings are identical, the score is 1, and as similarity decreases, the score shifts towards 0.
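A minimal sketch of the normalized, complemented score follows. The text does not state the normalizer, so dividing by the length of the longer sequence is an assumption here, as are the function names. The code works on character strings or on token lists alike.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (x != y),  # substitution (free if elements match)
            ))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """1 for identical sequences, moving towards 0 as they diverge.

    Assumption: the distance is normalized by the longer sequence length.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

For example, `levenshtein("kitten", "sitting")` is 3, so the similarity under this normalization is 1 - 3/7.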
Jaccard
The Jaccard score measures the similarity between two finite sets by dividing the cardinality of the sets’ intersection by the cardinality of their union. The Jaccard similarity between two sentences, represented by the word sets $S_1 = \{w_1, w_2, \dots, w_n\}$ and $S_2 = \{w_1, w_2, \dots, w_m\}$, is then given by:

\[
Jaccard(S_1, S_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}
\]
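As a minimal sketch, the Jaccard score over the word sets of two tokenized sentences can be computed as follows. Returning 1.0 for two empty sentences is an assumption made to avoid a division by zero.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Size of the intersection divided by size of the union of two word sets."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # assumption: two empty sentences are treated as identical
    return len(a & b) / len(a | b)
```

Since the sentences are reduced to sets, word order and repeated words have no effect on this score.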
N-gram
The n-gram similarity is a generalization of the longest common subsequence (LCS) based on the number of elements of size n (n-grams) which are shared between two strings [15].
If a string is represented by a list of elements, each of which is a group of n adjacent elements (n-gram), then the ratio between the number of shared n-grams and the larger of the two counts of unique n-grams represents the similarity between both strings.
Dice and Overlap coefficients are ways to compute an n-gram similarity measure.
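The three n-gram ratios mentioned above can be sketched as follows, using sets of unique n-grams. The default n = 2 and the convention of returning 0.0 when either side has no n-grams are assumptions for illustration.

```python
def ngrams(seq, n):
    """Set of unique n-grams (as tuples) of a string or token list."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def _shared(a, b, n):
    ga, gb = ngrams(a, n), ngrams(b, n)
    return ga, gb, len(ga & gb)


def ngram_similarity(a, b, n=2):
    """Shared n-grams divided by the larger count of unique n-grams."""
    ga, gb, shared = _shared(a, b, n)
    if not ga or not gb:
        return 0.0
    return shared / max(len(ga), len(gb))


def dice_coefficient(a, b, n=2):
    """Twice the shared n-grams divided by the sum of both counts."""
    ga, gb, shared = _shared(a, b, n)
    if not ga or not gb:
        return 0.0
    return 2 * shared / (len(ga) + len(gb))


def overlap_coefficient(a, b, n=2):
    """Shared n-grams divided by the smaller count of unique n-grams."""
    ga, gb, shared = _shared(a, b, n)
    if not ga or not gb:
        return 0.0
    return shared / min(len(ga), len(gb))
```

The three scores differ only in the denominator, so for any pair the overlap coefficient is at least as large as Dice, which in turn is at least as large as the max-normalized ratio.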
Distributional measures
Distributional metrics use large text corpora as the knowledge source, and they are based on the assumption that words occurring in similar contexts are semantically related, i.e., the distributional hypothesis ([8, 12]). In these models, words, sentences, or even documents are represented as multidimensional vectors within geometric spaces where it is possible to measure semantic proximity.
One of the main advantages of these models is that they can be “learned” from unlabeled data, e.g., a textual corpus; furthermore, these representations encode knowledge from how the words are used and not from explicit definitions (e.g., in dictionaries).
Global Vectors for Word Representation (GLOVE) ([24]) is one of the most representative exemplars of these models. GLOVE embeddings are trained on global word-word cooccurrence counts over which a global log-bilinear regression model is applied. The aim is to produce spaces whose dimensions are related to dimensions of meaning.
GLOVE embeddings represent lexical units (words). Because we are dealing with sentences, before comparing them it is necessary to apply a composition over the word embeddings of each sentence to generate a single embedding.
Averaged word embedding
One of the simplest composition methods is to average the sentence’s word embeddings ([22]). Once we have an embedding for each sentence, we use the cosine between them as the similarity measure.
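A minimal sketch of this composition follows. The `embeddings` argument is a hypothetical mapping from word to vector (for example, loaded from a GLOVE text file), and skipping out-of-vocabulary words is an assumption, since the text does not say how they were handled.

```python
import math


def average_embedding(tokens, embeddings):
    """Component-wise mean of the embeddings of the known tokens.

    Returns None when no token has an embedding (assumed behavior).
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def averaged_embedding_similarity(tokens_a, tokens_b, embeddings):
    """Cosine between the averaged embeddings of two tokenized sentences."""
    va = average_embedding(tokens_a, embeddings)
    vb = average_embedding(tokens_b, embeddings)
    if va is None or vb is None:
        return 0.0
    return cosine(va, vb)
```

Averaging discards word order, so two sentences with the same words in a different order receive a score of 1 under this measure.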
We used this technique to compute two similarity measures: (1) using out-of-the-box GLOVE vectors, i.e., 300-dimension embeddings pretrained over Common Crawl (42B tokens, 1.9M vocab, uncased); and (2) using ad hoc embeddings (Ad Hoc embeddings) trained over 6k Transcriptional-Regulation publications (31M tokens, 108k vocab, uncased).
The ad hoc embeddings were generated from over 6k Transcriptional-Regulation publications. First, all articles (in PDF format) were processed using an in-house-built tool that allowed the extraction of complete sentences from the article’s main content, i.e., without taking into account captions, tables, footnotes, etc. This resulted in a plain text file with one extracted sentence per line. Next, the text files were tokenized and merged into a single text file where one space separated all of an article’s words and a new line character separated different articles. With this preprocessed corpus (the single text file) and using the publicly available GLOVE scripts, we constructed the vocabulary, which consisted of unigram counts, with words below a minimum frequency count of 5 filtered out. Next, the word-word cooccurrence statistics were constructed using a window size of 15 words over the corpus. Finally, for each word of the vocabulary, word embeddings of 300 dimensions were trained over the cooccurrence statistics.
Sentence embedding from common vector
This composition operation consists of the following steps:
1. From comparing two sentences (A and B), generate a common dictionary D.
2. For each sentence:
   a. Create a vector V of the size of the dictionary (D), i.e., |V| = |D|.
   b. Compare each word of D (D_i) with each word of the sentence (using the cosine between their corresponding word embeddings).
   c. Select the highest value of these comparisons and use it as the magnitude of the i-th dimension of V.
Finally, we measured the SS using the cosine between the composed sentence embeddings.
We applied this technique to compute one similarity measure using the Ad Hoc embeddings trained over the Transcriptional-Regulation literature.
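The common-vector composition can be sketched as follows. The `embeddings` mapping is hypothetical, and two details are assumptions not stated in the text: the common dictionary is taken as the union of the words of both sentences, and words without an embedding are dropped.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def common_vector_similarity(tokens_a, tokens_b, embeddings):
    """Compose one vector per sentence over a shared dictionary, then compare."""
    # Assumption: out-of-vocabulary words are ignored.
    sent_a = [t for t in tokens_a if t in embeddings]
    sent_b = [t for t in tokens_b if t in embeddings]
    if not sent_a or not sent_b:
        return 0.0
    # Common dictionary D (assumed to be the union of both word sets).
    dictionary = sorted(set(sent_a) | set(sent_b))

    def compose(sentence):
        # Dimension i holds the best cosine between D_i and any sentence word.
        return [
            max(cosine(embeddings[d], embeddings[w]) for w in sentence)
            for d in dictionary
        ]

    return cosine(compose(sent_a), compose(sent_b))
```

Unlike plain averaging, each dimension here records how well one dictionary word is covered by the sentence, so a near-synonym in one sentence can still yield a high value for a word that appears only in the other.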
Ontology-based measures
As we have seen, string metrics focus on comparing words and characters contained in sentences, but these methods do not take into account the meaning conveyed by those words. On the other hand, when distributional methods locate close words that express a similar sense, they are dealing with the words’ meanings but only in an implicit manner and just based on their context. Another way to map plain symbols (i.e., words, multiterms) into a semantic space is to use Knowledge-Organized Structures (KOS) (e.g., ontologies, taxonomies, thesauri, etc.) and then measure how similar the mapped concepts are within the semantic space. Metrics of this type allow leveraging the advantages of stated expert knowledge.
KOS can be very general in order to be applied across domains. This is favorable to the semantic comparison of general language, but it falls short when dealing with domain-specific terms. In those cases, a domain-dependent KOS is needed. Because we are analyzing literature in the very specific field of transcriptional gene regulation, the use of domain-specific ontologies is not just desired but mandatory. To cover both needs, we decided to use both general and specific KOS.
As a general-purpose KOS, we used WordNet ([21]), a lexical database that contains conceptually equivalent words (nouns, verbs, adjectives, and adverbs) grouped together in synsets, a kind of thesaurus. The difference from a thesaurus is that synsets are interlinked through their lexical and semantic relations, forming a network. Although it is not an ontology, because of the relations between synsets it can be seen as a lightweight ontology structured through the is-a relation (i.e., hypernym/hyponym).
As a domain-dependent KOS, we used an ad hoc ontology, based on RegulonDB’s objects, that was developed specifically to model the Microbial Transcriptional-Regulation domain.
We used the Lin98 metric to measure conceptual similarity within the ontologies.
Given two concepts, S1 and S2: depth(s) is the depth of concept s in the ontology, and LCS(S1, S2) is the Lowest Common Subsumer of S1 and S2, i.e., the deepest ancestor in the tree that both concepts share.
Hence, Lin98 is defined as follows:

\[
Lin98(S_1, S_2) = \frac{2 \cdot depth(LCS(S_1, S_2))}{depth(S_1) + depth(S_2)}
\]
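A minimal sketch of a depth-based score built from these two definitions, assuming the ratio 2 · depth(LCS) / (depth(S1) + depth(S2)). The taxonomy is given as a hypothetical child-to-parent dictionary with a single parent per concept, and counting the root as depth 1 is an assumption (with depth 0 at the root, any pair meeting only at the root would score 0 either way).

```python
def ancestors(concept, parent):
    """Path from a concept up to the root, the concept itself first."""
    path = [concept]
    while concept in parent:
        concept = parent[concept]
        path.append(concept)
    return path


def depth(concept, parent):
    """Number of nodes on the path from the root to the concept (root = 1)."""
    return len(ancestors(concept, parent))


def lowest_common_subsumer(c1, c2, parent):
    """Deepest node that is an ancestor of (or equal to) both concepts."""
    seen = set(ancestors(c1, parent))
    for node in ancestors(c2, parent):  # walks upwards, so first hit is deepest
        if node in seen:
            return node
    return None


def depth_similarity(c1, c2, parent):
    """2 * depth(LCS) / (depth(c1) + depth(c2)); 0.0 if no common ancestor."""
    lcs = lowest_common_subsumer(c1, c2, parent)
    if lcs is None:
        return 0.0
    return 2 * depth(lcs, parent) / (depth(c1, parent) + depth(c2, parent))
```

With a toy taxonomy where "transcription factor" and "sigma factor" both hang from "protein", and "promoter" sits under a different branch of the root, the first pair scores higher than "transcription factor" versus "promoter", because their common subsumer is deeper.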
Results
The measures used in the experiments are reported in the following order:
1. Ontology-based measure, using the common vector technique, WordNet as the knowledge source, and Lin98 metric
2. Normalized Levenshtein
3. Jaccard
4. N-Gram
5. Averaged word embedding, using pretrained GLOVE vectors
6. Averaged word embedding, using ad hoc vectors trained over Transcriptional-Regulation literature
7. Sentence embedding from common vector using ad hoc vectors trained over Transcriptional-Regulation literature
8. Ontology-based measure, using domain-dependent ontology and Lin98 metric
These measures were trained and evaluated using the Transcriptional-Regulation similarity corpus. For each sentence pair, we used the average of their scores as the gold standard (GS). Neither these measures nor the GS had a normal distribution (Table 1).
First, we performed a correlation test between each measure and the GS (see Table 2). The measure with the highest Pearson’s correlation was EmbbedingSimCommonVect using the GLOVE embeddings trained over RegulonDB’s literature (GLOVE-AdHoc), with ρ = 0.604, and the lowest was Levenshtein with ρ = 0.235.
We computed Pearson’s (ρ) and Spearman’s (r_s) correlations to gain more insight into the measures’ behavior within the corpus. Pearson’s correlation indicates a linear relationship, whereas Spearman’s correlation is higher when there is a monotonic relationship that is not necessarily linear. While Spearman’s scores are a little higher for all measures, the differences are small for almost all of them except PhraseEmbbedingByGlove when using GLOVE-AdHoc embeddings, for which Spearman’s score is 12% higher. This means that a linear regression would be unfavorable for this measure, which in fact is one of the measures that most closely correlate with the GS.
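The difference between the two coefficients can be made concrete with a minimal stdlib-only sketch: Spearman’s coefficient is Pearson’s coefficient computed on ranks, with tied values sharing their average rank. The function names are illustrative, and returning 0.0 for a constant input is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: undefined correlation reported as 0
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

For a monotonic but nonlinear relation such as y = x² on positive x, `spearman` reaches 1 while `pearson` stays below 1, which is the pattern described above for the measure with the larger gap.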
Next, we experimented with different algorithms to combine similarity measures: Linear Regression, Random Forest (with 100% substitution, 100 iterations, a batch size of 100, and no maximum depth), and Multilayer Perceptron (with learning rate of 0.3, momentum of 0.2, and 6 hidden layers). Each experiment was done using 10-fold cross-validation on the available data. Table 3 shows the correlation comparisons among these algorithms. As can be seen, Random Forest gives the best result (0.683 Pearson’s correlation) with a statistically significant gain over Perceptron (second best).
We also tried two ensemble algorithms: Bootstrap Aggregation (Bagging) and Voting. In Bagging, multiple learning models are trained with sample subsets drawn (with replacement) from the training data, and the models’ predictions are averaged. We used Bagging applied to a decision tree (Classification and Regression Tree [CART]) with a bag size of 100% and 10 classifiers. On the other hand, Voting combines the predictions of multiple models. The submodels used in the Voting ensemble were an arithmetic mean, a Linear Regression, a Random Forest, and a Perceptron; predictions were averaged. The ensembles’ performances are shown in Table 4. Both Bagging and Voting got better results than Random Forest, but differences were not statistically significant. Hence, to continue with the experiments we selected Voting, because it got the best result, and Random Forest, because it is much simpler and its performance is almost as good.
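The two ensemble schemes can be sketched in a few lines. This is an illustration only: a one-variable least-squares line over the mean of the measure scores stands in for the decision tree and the other submodels, and all function names, the seed, and the degenerate-sample fallback are assumptions.

```python
import random


def voting(models):
    """Voting for regression: average the predictions of all submodels."""
    return lambda row: sum(m(row) for m in models) / len(models)


def fit_line(rows, targets):
    """Stand-in base learner: least-squares line on the mean of each row."""
    xs = [sum(r) / len(r) for r in rows]
    mx, my = sum(xs) / len(xs), sum(targets) / len(targets)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx < 1e-12:
        # Degenerate bootstrap sample (no spread in x): predict the mean target.
        return lambda row: my
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, targets)) / sxx
    return lambda row: my + slope * (sum(row) / len(row) - mx)


def bagging(fit, rows, targets, n_models=10, seed=0):
    """Train n_models learners on bootstrap samples and average them."""
    rng = random.Random(seed)
    n = len(rows)
    models = []
    for _ in range(n_models):
        # Bag size of 100%: draw n examples with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit([rows[i] for i in idx], [targets[i] for i in idx]))
    return voting(models)
```

Here each row holds the similarity scores of one sentence pair and each target is its gold-standard score; `bagging(fit_line, rows, targets)` returns a predictor that can be called on a new row of scores.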
The last experiment was an ablation test to inspect the impact of each similarity measure when combined into the final score. The test was applied to the two previously selected algorithms, Random Forest and Voting (see Table 5).
Measure 5 had the highest correlation merit for Random Forest, with a loss of 3% when removed, whereas for Voting it was measure 7, with a 4.1% loss. In the case of negative impact — correlation improvement when the measure is removed — the measure based on WordNet (measure 1) was the most harmful for both Voting and Random Forest. Without this measure, the Voting strategy had a gain of 0.7% to reach a Pearson’s correlation of 0.700.
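The ablation procedure itself is a leave-one-measure-out loop, sketched below. A plain arithmetic mean stands in for the trained combiner and no cross-validation is performed, both of which are simplifying assumptions; the sign convention (positive change means the correlation improves when the measure is removed) follows the description above.

```python
import math


def pearson(x, y):
    """Pearson correlation; 0.0 when either input is constant (assumption)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def mean_combiner(row):
    """Stand-in combiner: arithmetic mean of the measure scores."""
    return sum(row) / len(row)


def ablation(rows, gold, combine=mean_combiner):
    """Correlation with all measures, and the change when each one is removed.

    rows: one list of measure scores per sentence pair.
    gold: gold-standard score per sentence pair.
    Returns (full_correlation, {measure_number: change}), measures numbered from 1.
    """
    full = pearson([combine(r) for r in rows], gold)
    changes = {}
    for k in range(len(rows[0])):
        reduced = [r[:k] + r[k + 1:] for r in rows]
        changes[k + 1] = pearson([combine(r) for r in reduced], gold) - full
    return full, changes
```

A negative change marks a measure whose removal hurts the combination, and a positive change marks a harmful one, as reported for the WordNet-based measure.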
Discussion
In this paper, we examined the automatic evaluation of semantic textual similarity within the specialized domain of Microbial Transcriptional-Regulation literature. STS is always a demanding endeavor, because the conveyed meaning is very dependent on the word order, the concept composition, the use of synonyms, and the shades of similarity between two expressions. Even for humans, evaluating the semantic overlap between two statements is not easy. This study showed that besides the challenges intrinsic to STS, there are other difficulties when it is applied to specialized domains.
We have addressed not only the development of a model which gave good correlation scores but also the insights afforded by the STS particularities within the domain. The next pair of sentences is an example of equivalent sentences (SS score of 4):
The dcuB gene is strongly activated anaerobically by FNR, repressed in the presence of nitrate by NarL, and subject to cyclic AMP receptor protein (CRP)-mediated catabolite repression.
The results show that the dcuB gene is expressed exclusively under anaerobic conditions in a manner that is FNR dependent and that it is repressed by NarL in the presence of nitrate and is subject to CRP-mediated catabolite repression.
Both sentences have several challenges: subordinated clauses, long coreferences, use of synonyms, and even the use of conceptual generalizations (e.g., activated - expressed). Interestingly, nonexperts gave them an average SS score of 3, whereas experts classified them as paraphrases (SS score of 4). Beyond the difficulties that nonexperts had in understanding the sentences, this small example shows the impact of specialized knowledge in the semantic interpretation in such special purpose domains. As a side observation, the syntactic complexity in these sentences was more evident, and harder, for nonexperts.
The next example is a pair of sentences rated with an SS of 2 (share specific objects and some other similarities).
Aerobic regulation of the sucABCD genes of Escherichia coli, which encode α-ketoglutarate dehydrogenase and succinyl coenzyme A synthetase: roles of ArcA, Fnr, and the upstream sdhCDAB promoter.
Transcription of the fdnGHI and narGHJI operons is induced during anaerobic-growth in the presence of nitrate.
An interesting singularity from the expert evaluation of these sentences is that they considered aerobic and anaerobic conditions as encompassed in a contrasting condition feature (oxygen availability). Therefore, even when in dictionaries both terms are antonyms, in this domain the contrasting conditions share a certain degree of similarity significant for experts.
The SS particularities observed in these examples are due to the fact that experts assign higher weights to certain commonalities based on their own intuition and experience in the topic. We infer that this is why similarity measures give poor results when applied independently, even though they have been shown to be robust in other domains.
To our knowledge, this is the first study to investigate STS in the domain of Microbial Transcriptional-Regulation literature. Our approach was to combine textual and semantic commonalities in order to better model the semantic similarity implicitly expressed by experts through the ad hoc built corpus. A relevant factor for measuring STS is the specialized lexicon; to deal with this we included two semantic measures based on different knowledge sources. One measure used an in-house Transcriptional-Regulation ontology to identify and compare conceptual similarities, and the other used word embeddings built from the Transcriptional-Regulation literature to represent concepts in a geometrical space and compare them.
Our experimental results concurred with the difficulty observed in the corpus, i.e., the domain’s specific phenomena. The relevance of conceptual similarity was confirmed by our results, where textual measures like Levenshtein and n-grams had poor performance whereas semantic measures resulted in higher correlations with the GS. In particular, word embeddings extracted from the Transcriptional-Regulation literature (measure 7) and the measure based on the Transcriptional-Regulation ontology (measure 8) were closest to the experts’ evaluations. However, it should be noted that the impact of measure 8 became small when it was combined with the others (see Table 5). Despite the low performance of individual measures, with the combination strategy we obtained a high correlation ([23]) of ρ = 0.700.
Although the presented approach results are good enough to be used in tools or more complex tasks, there is still room for improving the correlation scores. Even though we are implicitly considering the syntax of sentences within some of the measures, we think that good performance gains could be obtained if we included measures explicitly based on syntax similarity. We are already working in this direction, having in mind works like those described in references [9, 17].
The presented approach is applicable for developing tools that facilitate the access to knowledge conveyed in the Microbial Transcriptional-Regulation literature. In particular, we are already integrating it in assistance tools for the literature curation process.
Conclusions
To evaluate the semantic similarity grade of a pair of sentences is a difficult task but it has proven to be even harder in the context of the Microbial Transcriptional-Regulation literature. The specialized lexicon and the lack of large training corpora restricted the type of approaches that could be used. In particular, data-intensive machine-learning methods are inaccessible. Besides, in this type of literature it is not uncommon to find long and complex sentences with several subordinate clauses.
To cope with these challenges, we decided to use a strategy based on combining different types of similarity measures. Three textual measures were combined with 5 semantic measures of two different types (2 structured resource based and 3 unstructured resource based). With this approach we obtained good results, where the best ρ score was 0.700. Although we are still far from state-of-the-art results, it is worth noting that usually these top scores are obtained in general domains for which several large training corpora are available.
We developed a model whose performance in the Microbial Transcriptional-Regulation domain is good enough to be used as a resource for more complex tasks or as a component in curation assistance tools (as we are already doing). Furthermore, we have contributed interesting insights into the semantic similarity task in a very specialized and narrow domain, and also into how different measures correlate with field experts’ evaluations. We think this is a stepping stone towards a better understanding of semantic similarity in these kinds of domains and towards the progress of NLP in the bioinformatics field, a key element for better leveraging the rich knowledge conveyed in the literature.
Competing interests
The authors declare that they have no competing interests.
Funding
We acknowledge funding from the UNAM and from the National Institutes of Health (grant number 5R01GM110597-03).
