Abstract
The aim of this study is to identify the power of text-based metrics (Cosine and Lucene similarity) and linked-based metrics (co-citation, bibliographic coupling, Amsler, PageRank, and HITS), and of their combination, in estimating the similarity of articles to each other. The experiments were conducted on a test collection of 26,262 articles in the PubMed Central Open Access Subset (PMC OAS) of CITREC for which, in addition to linked-based metrics, the full text was available for calculating text-based metrics. Thirty articles were selected as primary articles, and articles related to each of them were retrieved based on the MeSH similarity metric. Then, the similarity of the retrieved documents based on text-based and linked-based metrics was also extracted. In the next stage, text-based, linked-based, and hybrid metrics were entered into a generalized regression model to estimate the similarity of the articles and determine their power; finally, the performance of the models was compared based on the mean squared error and correlation. The results showed that, among the text-based metrics, both Cosine and Lucene similarity entered the model. Among the linked-based metrics, HITS (hub), HITS (authority), PageRank, and co-citation had the highest power, in that order, whereas bibliographic coupling and Amsler could not enter the model. In general, a comparison of the performance of text-based, linked-based, and hybrid metrics indicated that the linked-based model estimates similarity between articles better than the text-based model, and that combining text-based and linked-based metrics does little to improve the power of estimating the similarity of articles. Despite the importance and application of text-based and linked-based metrics for measuring the similarity of articles, no previous study was found that examines the power of these metrics, alone and in comparison with each other, in estimating the similarity of articles.
Introduction
Information retrieval systems use different similarity metrics to measure the similarity of documents to queries and of documents to each other, to identify related documents, and to rank them. These metrics are divided into two main groups: text-based and linked-based. Text-based metrics primarily use the words and phrases in each text to calculate the similarity of documents to each other. For example, with metrics such as term frequency-inverse document frequency (TF-IDF), cosine similarity, BM25, and the Lucene score, the similarity of two texts is often measured by the number of terms shared by the document and the query, or by the two documents (Oghbaie and Mohammadi Zanjireh, 2018; Yoon et al., 2011), so that the similarity usually increases as the number of words common to the two documents increases. Text-based metrics face challenges due to issues related to natural language processing, such as ambiguity of pronoun references, polysemy, and synonymy (Kim et al., 2017; Vallez and Pedraza-Jimenez, 2007; Zarrinkalam and Kahani, 2012). These issues reduce the effectiveness of text information retrieval, document clustering, and retrieval systems.
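As an illustration of how a text-based metric rewards shared terms, the following minimal sketch computes cosine similarity between two texts. It assumes raw term-frequency vectors and a simple lowercase alphanumeric tokenizer; these choices and the function names are illustrative assumptions, not the CITREC or Lucene implementation.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Lowercase the text and count its alphanumeric tokens (assumed tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two raw term-frequency vectors."""
    tf_a, tf_b = term_frequencies(text_a), term_frequencies(text_b)
    # Only terms present in both texts contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Two texts with no common term score 0, and the score rises toward 1 as the proportion of shared terms grows, which is also why synonymy and polysemy can mislead a purely term-based score.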
In contrast, linked-based metrics use links between documents to retrieve and rank related documents. The citation links among the documents and their graph structure are considered to calculate the similarity (Reyhani Hamedani and Kim, 2020; Reyhani Hamedani et al., 2013). Some believe that citations are often used to measure the implicit similarity between articles and that document retrieval using citation links can find documents that have not been identified by textual analysis (Eto, 2019); on the other hand, some point out that these metrics only consider citation relationships between scientific articles and ignore the content of the articles (Reyhani Hamedani et al., 2013).
Co-citation, bibliographic coupling, and Amsler are the classical linked-based metrics. Co-citation is a link made by new authors to previous articles; thus, co-citation is the frequency with which two documents are cited together in subsequent documents (Small, 1973). Bibliographic coupling examines common references; in other words, a reference shared by two articles is the unit of coupling between them, and the more references two articles share, the closer they are to each other in terms of content (Kessler, 1963). The Amsler metric, also called linkage similarity, considers bibliographic coupling and co-citation simultaneously to calculate the similarity of documents (Bichteler and Eaton, 1980).
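The three classical metrics can be sketched from a list of (citing, cited) pairs as follows. This is a minimal illustration that returns raw, unnormalized counts; the function names are hypothetical, and the treatment of Amsler as the number of neighbours shared in either citation direction is an assumption made for the example rather than the exact formula used in CITREC.

```python
from collections import defaultdict


def build_neighbours(citations):
    """Return (cited_by, references) maps from (citing, cited) pairs."""
    cited_by = defaultdict(set)    # paper -> papers that cite it
    references = defaultdict(set)  # paper -> papers it cites
    for citing, cited in citations:
        cited_by[cited].add(citing)
        references[citing].add(cited)
    return cited_by, references


def co_citation(a, b, cited_by):
    """Number of later papers that cite both a and b."""
    return len(cited_by.get(a, set()) & cited_by.get(b, set()))


def bibliographic_coupling(a, b, references):
    """Number of references that a and b have in common."""
    return len(references.get(a, set()) & references.get(b, set()))


def amsler(a, b, cited_by, references):
    """Number of papers linked to both a and b in either direction (assumed form)."""
    links_a = cited_by.get(a, set()) | references.get(a, set())
    links_b = cited_by.get(b, set()) | references.get(b, set())
    return len(links_a & links_b)
```

For example, if papers C and D both cite A and B, and A and B both cite R1, then A and B have a co-citation count of 2, a bibliographic coupling count of 1, and an Amsler count of 3 under this sketch.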
In addition to the three aforementioned linked-based metrics, there are two newer metrics, Hyperlink-Induced Topic Search (HITS) and PageRank, which were created for the web environment, especially for ranking web pages. Researchers have nevertheless examined their ability to identify and rank essential and prestigious scientific documents, and have validated their performance in measuring the similarity of documents (Chen et al., 2007; Lin, 2008; Liu and Lin, 2007; Liu et al., 2012; Yin et al., 2009, 2011; Zhuge and Zhang, 2010). Both PageRank and HITS are recursive, iterative metrics based on document linking in the web environment; however, PageRank tends to reflect old articles, and HITS tends to reflect new articles (Devi et al., 2014; Farahat et al., 2006; Jiang et al., 2016; Ng et al., 2001).
Both text-based and linked-based metrics are widely used in various databases to find the most relevant documents. For example, ScienceDirect, in addition to showing the documents identified by the query, suggests the documents cited and citing the retrieved documents and uses co-citation to offer similar documents (Eto, 2013, 2019). Web of Science and Scopus databases use bibliographic coupling (Ahlgren and Jarneving, 2008; Burnham, 2006; Char and Ajiferuke, 2013; Nicolaisen and Frandsen, 2012), ACM digital library uses co-citation (Eto, 2013) and CiteSeer uses Common Citation-Inverse Document Frequency (CC-IDF) as well as co-citation to recommend similar and related documents to the user (Eto, 2019; Gipp and Beel, 2009; Lu et al., 2006; Wanjantuk and Keane, 2004). Google search engine uses PageRank, and the ASK search engine uses HITS to retrieve and rank documents related to user queries (Goswami et al., 2017; Thelwall, 2003).
Despite the use of complex metrics and algorithms to find the most relevant documents, the most crucial challenge in information retrieval remains relevance and user satisfaction. In practice, people prefer the items at the top of the retrieved list, either because of their expectations of the quality of the retrieval system's algorithm or because of their laziness. They largely trust the ranking of the information retrieval system (Agrahri et al., 2008) and, following the principle of least effort (Zipf, 1949), expect to satisfy their information needs through the first few results in the retrieved list.
Since the metrics used in each information retrieval system are different, even assuming that the same documents are in their collections, the retrieved documents may still differ from one system to another (McCain, 1989; Pao and Worthen, 1989). Therefore, it is essential that the metrics used in information retrieval systems be effective enough not only to retrieve all the documents related to the user query but also to sort the documents in such a way that the user's information needs are met with minimal effort and time. The first few results should be the most relevant, the best, and the most suitable answer to the given query and should also provide user satisfaction (Bar-Ilan et al., 2007), because the user, regardless of the measure used, seeks the most relevant documents in response to his or her information needs. We therefore want to identify the best of these metrics so that they can be used in algorithms and databases and assigned more weight.
Accordingly, the current study intends to examine the power of each text-based and linked-based metric to estimate the similarity of articles to each other, and then to test the potential of combining the two groups of metrics to improve the estimation of similarity between articles. This study therefore seeks to answer the following questions:
1- To what extent can text-based metrics estimate the similarity of articles?
2- To what extent can linked-based metrics estimate the similarity of articles?
3- To what extent can the combination of text-based and linked-based metrics estimate the similarity of articles?
Literature review
Comparison of the performance of linked-based metrics with each other
The performance of linked-based metrics has always been considered by researchers from the past to the present. In the initial research, the performance of metrics such as the number of citations, bibliographic coupling, co-citation, and Amsler in different areas of information retrieval as well as in comparison with each other, was tested. The findings of Bichteler and Eaton (1980) showed that the Amsler (combination of bibliographic coupling and co-citation) was superior to the bibliographic coupling alone. In another study, Shibata et al. (2009) compared the three metrics of co-citation, bibliographic coupling, and “direct citation” in terms of the performance of these metrics for detecting emerging research fronts and core papers. Findings showed that direct citation had the best performance, and co-citation had the worst performance compared to others.
With the advent of the World Wide Web, which is based on hyperlinks, the question arose whether the previous metrics, especially citation-based metrics closely related to hyperlinks, could also be used in the web environment and, if so, in what context each metric performs better. Given that citation counts and citation-based metrics do not consider the authority that documents confer on each other, the potential of graph-based algorithms such as PageRank and HITS for measuring document similarity was considered, and the capability of these new linked-based metrics was tested in comparison with previous metrics.
Ma et al. (2008) proposed a method based on Google's PageRank for measuring the importance of scientific articles. They compared the ranking of articles by PageRank with the ranking by number of citations and observed a correlation between the two. Herskovic and Bernstam (2005) also compared these two metrics, and the results showed that the performance of PageRank was better than that of the number of citations. In another study, Lin (2008) compared two new linked-based metrics, PageRank and HITS. He examined the use of graph analysis algorithms such as PageRank and HITS on the PubMed related-article network for biomedical text retrieval. This study showed that PageRank is more effective than HITS in analyzing the link structure of related-document networks. Yin et al. (2011) examined three link analysis algorithms, namely degree distribution, PageRank, and HITS, in terms of efficiency in retrieving biomedical texts. Analysis of the extracted data showed that, although all three algorithms improve biomedical text retrieval, the degree distribution algorithm performs better than the others, with PageRank and HITS ranking next, respectively; they also proposed a probabilistic combination that combines citation information with a content-based probabilistic weighting model, which they believe will improve biomedical text retrieval. Yoon et al. (2016) discussed the similarity metrics SimRank, Reverse SimRank (rvs-SimRank), and penetrating rank (P-Rank), which are the recursive versions of co-citation, bibliographic coupling, and Amsler, respectively, together with connectors rank (C-Rank). C-Rank calculates the similarity score based on the number of connectors in the undirected graph. The recursive metrics performed better than their non-recursive versions; among them, C-Rank had the best performance.
He and Chen (2018) proposed approaches for representing the changing citation contexts of cited publications in different periods as sequences of vectors by training temporal embedding models. They could utilize the resulting representations to quantify how much the roles of publications changed and to interpret how they changed. Their study in the biomedical domain showed that their metric of changes in publication roles is stable at the group level, while it can still account for the variation of individual publications. Liu and Hsu (2019) proposed a novel measure that improves bibliographic coupling (BC), named BCCCC (Bibliographic Coupling with Category-based Co-citation). The performance of BCCCC was evaluated by experimentation and a case study. The results showed that BCCCC performs significantly better than state-of-the-art variants of BC in identifying highly related articles. Yun (2022) proposed novel methods to estimate intralayer similarity on a node-split network using personalized PageRank (PPR) and neural embedding (EMB). He demonstrated that PPR strongly correlates with the coupling measures and that the proposed method can yield distinct similarities between items even if they are distant.
Comparison of the performance of text-based metrics with each other
In studies of text-based metrics, researchers have used different text-based metrics to measure the similarity of articles. In several of these studies, the performance of text-based metrics in comparison with each other has been tested. Sternitzke and Bergmann (2009) examined the inclusion index, the Jaccard index, and the cosine index for calculating the similarity between documents. The results showed that, for finding similar content among a volume of different documents, the inclusion index provides more accurate results, mainly when the degree of similarity is measured based on citation data. Boyack et al. (2011) compared the accuracy of nine text-based similarity approaches using more than 2 million MEDLINE records. These nine approaches consisted of five analytical techniques applied to two data sources. The five analytical techniques were cosine similarity, Latent Semantic Analysis (LSA), topic modeling, and two language models, BM25 and PubMed-related articles. The two data sources were medical subject headings (MeSH) and words derived from the title and abstract. The results showed that the PubMed-related articles and BM25 approaches using the title and abstract were the most accurate, and that strategies that used only MeSH subjects were less accurate and not comparable to those based on the title and abstract. Thada and Jaglan (2013) compared the three similarity coefficients of Jaccard, Dice, and Cosine for finding the essential documents in a dataset. The results showed that although the three metrics can provide promising results in finding the most important similar documents, there is still a long way to go to achieve maximum efficiency. In addition, Jain et al. (2020) compared the Pearson correlation coefficient, cosine correlation, constrained Pearson correlation coefficient, sigmoid function-based Pearson correlation coefficient, Jaccard similarity, and Minkowski distance metrics; their results showed that the Minkowski metric performs better than the other similarity metrics in content-based recommender systems.
Comparison of text-based and linked-based metrics with each other and proposing new approaches to calculate document similarity
Since the 1990s, with the development of textual information processing and retrieval systems, researchers have compared linked-based metrics with text-based metrics. One of the main findings of this research was that the documents retrieved using text-based metrics differ from the documents retrieved through linked-based metrics. The studies of McCain (1989) and Pao and Worthen (1989) were among the first to report this finding, which was later confirmed by Ahlgren and Jarneving (2008).
Ahlgren and Jarneving (2008) compared the citation-based approach of bibliographic coupling with a text-based approach based on common abstract stems in the context of science mapping. They used 43 articles published in the Journal of Information Retrieval between 2004 and 2006 as experimental articles. An information retrieval specialist classified the articles. The cosine measure was used for normalization, and complete linkage was used for clustering. The results of this study showed that the agreement between the two approaches, and their agreement with the classification performed by the subject specialist, was low.
In studies that compared the performance of these two groups of metrics with each other, the results obtained were different. A few studies have suggested that text-based metrics perform better than linked-based metrics in retrieving similar documents (Ahlgren and Colliander, 2009); in other studies, the results showed that linked-based metrics could perform better in retrieval by overcoming problems such as synonyms or ambiguous expressions, which exist in textual similarity and lead to misleading results (Ahlgren et al., 2020; Bernstam et al., 2006; Janssens et al., 2020; Yin et al., 2009). The results of Bernstam et al. (2006) showed that citation-based algorithms are more effective than non-citation algorithms for identifying important articles, and that successful web environment algorithms such as PageRank can be used to retrieve scientific information in biomedicine. Yin et al. (2009) examined the combination of text-based information retrieval results with linked-based document importance scores to improve performance on the TREC biomedical dataset. They looked at three link-based ranking algorithms (PageRank, HITS, and InDegree) and the BM25 probabilistic model in terms of efficiency in retrieving biomedical texts. The results showed that all three link analysis algorithms improved biomedical information retrieval and that the InDegree algorithm performed better than the other algorithms. Ahlgren et al. (2020) compared relatedness measures for community detection in a large set of PubMed publications. The results showed that extended direct citation had the best performance. Janssens et al. (2020) developed a citation-based search method called CoCites, which was designed to be more efficient than traditional keyword-based methods. This method starts with identifying one or more important sources (query articles) and involves two searches: a co-citation search that ranks articles based on their co-citation frequency with the query articles, and a citation search that ranks them based on the total number of citations that cite or are cited by the query articles. The method was tested on 250 review articles, and the results showed that CoCites is an efficient and accurate method for finding relevant articles.
Finally, the results of studies that have sought to improve the estimation of document similarity by combining the two groups of metrics mentioned above have been somewhat contradictory. In some studies (Boyack and Klavans, 2010; Lu et al., 2006; Menczer, 2004), combining linked-based metrics with text-based metrics improved system performance in retrieving related documents. Menczer (2004) examined web page content, link, and similarity metrics. The results showed that although the correlation between the different metrics is small, it is significant due to the large volume of data used. He also concluded that combining content with link similarity metrics can be crucial for retrieving and ranking web pages. Boyack and Klavans (2010) examined the accuracy of four document-document similarity approaches (co-citation, bibliographic coupling, direct citation, and a bibliographic coupling-based citation-text hybrid approach). The results showed that, among the three citation approaches, bibliographic coupling performed better than co-citation in identifying similarities between documents, and direct citation performed worst. In addition, the hybrid citation-text approach was more accurate. Boyack and Klavans (2020) compared large-scale science models based on textual, direct citation, and hybrid relatedness. They compared PubMed models created using seven relatedness measures: two based on direct citation, one on text, and four using hybrid text and citation measures. They found that the hybrid relatedness measures outperform those based solely on text or direct citation.
In the other group, the results show that combining linked-based metrics with text-based metrics does not improve system performance in retrieving relevant documents (e.g. Ahlgren and Colliander, 2009). Ahlgren and Colliander (2009) examined document-document similarity metrics, including text-based metrics (Term Frequency-Inverse Document Frequency (TF-IDF) and singular value decomposition), the citation-based metric of bibliographic coupling, and combined text-based and bibliographic coupling metrics, in the context of science mapping. The basis of the similarity of the documents was the cosine similarity scale, which was applied in two forms: direct (first-order) and indirect (second-order) similarity. The results showed that indirect similarity performed better than direct similarity in all methods. In direct similarity, the method “Term Frequency-Inverse Document Frequency (TF-IDF) - Singular value decomposition - bibliographic coupling,” which is a linear hybrid approach, performed better than the others. In indirect similarity, the text-based method of “term frequency-inverse document frequency” was better than the others. The citation-based bibliographic coupling approach was the worst in all cases. Regarding the similarity of documents, citation-based approaches were worse than text-based approaches. Moreover, when citation-based and text-based approaches were combined, the performance decreased relative to the text-based approaches alone.
Use of representation of texts as complex networks
Amancio et al. (2012) argued that a fair assessment of individual researchers and of journals requires that the criteria for selecting references in a given manuscript be unbiased with respect to the authors or journals cited. Using formalisms of complex networks on two datasets of papers from the arXiv and the Web of Science repositories, they showed that neither of these criteria is fulfilled in practice. They estimated a similarity index between pieces of text. To simulate a systematic search in the citation network, they employed a traditional random walk search (i.e. diffusion) and a random walk whose transition probabilities are proportional to the number of ingoing edges of the neighbors. Based on these results, they proposed a fairer approach for evaluating and complementing the citations of a given author, effectively leading to virtual scientometry. De Arruda et al. (2018) captured the mesoscopic characteristics of semantic content in written texts and formed a network model that can analyze documents in a multi-scale way. They showed that the mesoscopic structure of a document, modeled as a network, reveals many semantic characteristics of texts. Such an approach paves the way for a myriad of semantic-based applications. Correa and Amancio (2019) investigated Word Sense Induction (WSI). They devised a method that leverages recent findings in word embeddings research to generate context embeddings. They modeled the set of ambiguous words as a complex network to induce senses. In the developed network, two instances (nodes) are connected if the respective context embeddings are similar. Using well-established community detection methods to cluster the obtained context embeddings, they found that the proposed method yields excellent performance for the WSI task. Quispe et al. (2021) investigated whether using word embeddings to create virtual links in co-occurrence networks may enhance the quality of classification systems. Their results showed that the discriminability in the stylometry task is improved when using GloVe, Word2Vec, and FastText. Moreover, they found that optimized results are obtained when stop words are not disregarded and a simple global thresholding strategy is used to establish virtual links. Likewise, theoretical language studies could benefit from the adopted enriched representation of word co-occurrence networks.
Methodology
The test collection used for this study was the set of articles in the PubMed Central Open Access Subset (PMC OAS) of CITREC. We used CITREC because it is an open evaluation framework for citation-based and text-based similarity measures. It uses the Medical Subject Headings (MeSH) thesaurus as a gold standard, together with expert relevance feedback. A significant advantage of deriving a gold standard from MeSH descriptors is that most documents in the CITREC test collection have been manually tagged with MeSH descriptors, whereas, due to time and cost constraints, most other test collections can collect human relevance feedback only for a small fraction of the included documents. To facilitate performance comparisons, the CITREC framework provides open-source Java code for computing 35 citation-based and text-based similarity measures, as well as pre-computed similarity scores for those measures. This set contains 255,339 articles; however, to compare the indicators with each other, it was necessary to select articles for which all three classical indicators studied here (co-citation, bibliographic coupling, and Amsler) were available. Applying this condition reduced the data set to 26,357 articles. Moreover, because the similarity of the articles also had to be obtained from the textual indices of Lucene and Cosine, in addition to the citation indicators, we selected from this set the articles whose full text was available in the experimental collection; as a result, about 100 more articles were removed, leaving 26,262 articles in the final set.
To obtain the full text of the selected articles for measuring text-based similarity, the PubMed Central Identifier (PMCID) of every article was retrieved programmatically and, using a purpose-built downloader program, the full-text file of each article was extracted in nXML format. The extracted articles were placed in a folder consisting of 27 subfolders (each containing 1000 files) and prepared for further analysis. Next, a MySQL database was created using the WampServer development environment and phpMyAdmin. Then, an output was prepared using the online demo of the CITREC test collection and imported into the MySQL database containing the research data set, which created the main structure of its tables. Finally, all the required code from the CITREC source code package was studied, adapted as needed, and run, and the results were entered in the created MySQL database. NetBeans was used to run the open-access code of the CITREC test collection, and the co-citation, bibliographic coupling, Amsler, and MeSH metrics were calculated in this way.
The PubMed Central Open Access Subset (PMC OAS) of CITREC uses the medical subject headings (MeSH) assigned to each document to measure the similarity of the documents to the queries. CITREC treats MeSH as an expert relevance judgment. These judgments can be used to create a gold standard for subject relevance, which enables researchers to measure how well citation-based and text-based similarity metrics reflect subject relevance. Therefore, this metric was considered the benchmark of the present study. This technique has been used in several other studies to measure the similarity of documents (Batet et al., 2010; Eto, 2012; Lin and Wilbur, 2007; Zhu et al., 2009, as cited in Gipp et al., 2015). In this test collection, the MeSH similarity metric (expert relevance) is calculated for each query-document pair using the Jaccard metric, that is, by dividing the size of the intersection of their MeSH terms by the size of their union.
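The Jaccard calculation described above can be sketched in a few lines. The function name and the example descriptors are hypothetical and serve only to illustrate the ratio of intersection to union; the CITREC Java code is the authoritative implementation.

```python
def mesh_jaccard(terms_a, terms_b):
    """Jaccard similarity of two collections of MeSH descriptors."""
    set_a, set_b = set(terms_a), set(terms_b)
    union = set_a | set_b
    if not union:
        return 0.0  # two documents without descriptors share nothing
    return len(set_a & set_b) / len(union)
```

For instance, a query tagged with {Humans, Mice, Neoplasms} and a document tagged with {Humans, Mice, Apoptosis} share two of four distinct descriptors, giving a similarity of 0.5.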
To determine the similarity of the documents, the MeSH intersection code had to be adapted and executed. For this purpose, a new table was created in the MySQL database, and each document was compared with all the documents in the collection. The similarity of the documents was then obtained based on the intersection of their MeSH descriptors, sorted in descending order, and stored in the MySQL database in a table named Sim-Mesh-intersections. Finally, the article files, all in nXML format, were converted to index files for the Lucene text search engine and saved.
In the next step, for documents that were similar to each other based on MeSH intersection, the text similarity metrics used in the PubMed Central Open Access Subset (PMC OAS) of CITREC, namely cosine similarity and Lucene, were calculated. These metrics were calculated for the articles previously stored in the Sim-Mesh-intersections table. Before executing the relevant code, the IndexCopier code was executed to create an index of the articles; then, the Cosine and Lucene similarity code was executed to complete the tables in the MySQL database.
Using an SQL query, the entire citation network of the set was extracted and stored in a comma-separated values (CSV) file; due to the large size of the file, the available software could not be used to calculate PageRank and HITS. Therefore, a program was written in Python to open and process this large file and compute the metrics. The PageRank and HITS (authority and hub) values were thus calculated and stored separately. For HITS, the authority and hub values were calculated from each other in a mutually recursive calculation.
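The study's own Python program is not reproduced here. The following is only a minimal sketch of how PageRank and HITS can be computed by power iteration over a list of (citing, cited) pairs; the damping factor of 0.85, the fixed number of iterations, the even redistribution of rank from papers without references, and the Euclidean normalization of HITS scores are assumptions of this example, not settings reported by the study.

```python
import math


def pagerank(edges, damping=0.85, iterations=50):
    """PageRank by power iteration; edges are (citing, cited) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    if not nodes:
        return {}
    out_links = {node: [] for node in nodes}
    for citing, cited in edges:
        out_links[citing].append(cited)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by papers without references is spread evenly (assumption).
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for citing, targets in out_links.items():
            if targets:
                share = damping * rank[citing] / len(targets)
                for cited in targets:
                    new_rank[cited] += share
        rank = new_rank
    return rank


def _normalise(scores):
    """Scale scores to unit Euclidean length (all zeros stay zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {k: (v / norm if norm else 0.0) for k, v in scores.items()}


def hits(edges, iterations=50):
    """Return (hub, authority) scores computed from each other in turn."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {node: 1.0 for node in nodes}
    authority = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the papers citing this paper.
        authority = {node: 0.0 for node in nodes}
        for citing, cited in edges:
            authority[cited] += hub[citing]
        # Hub: sum of authority scores of the papers this paper cites.
        hub = {node: 0.0 for node in nodes}
        for citing, cited in edges:
            hub[citing] += authority[cited]
        authority, hub = _normalise(authority), _normalise(hub)
    return hub, authority
```

For a CSV file too large for desktop tools, the edge list could be read row by row with Python's csv module and passed to these functions; the two-step update inside hits is the mutually recursive calculation of authority and hub values mentioned above.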
Thirty articles were randomly selected as primary documents, and each of them was treated as a query. By searching for each article in the MySQL tables created in the previous step, documents similar to each of the 30 primary documents were identified and retrieved based on the MeSH similarity scale; then, to determine the similarity of the retrieved articles for each query, linked-based metrics (co-citation, bibliographic coupling, Amsler, PageRank, HITS) and text-based metrics (Lucene and cosine similarity) were extracted from the relevant tables. The Statistical Package for the Social Sciences (SPSS), version 23, was used to perform the statistical tests. Because the dependent variable was not normally distributed, linear regression could not be used for estimation; therefore, generalized linear models with gamma and inverse Gaussian distributions were used to model the data. In forecasting and estimation problems, performance criteria are used to measure a model's performance. This study used the mean squared error (MSE) and correlation to compare the models' performance.
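The two performance criteria can be illustrated with a short sketch. The study computed them in SPSS; the functions below are a hypothetical stand-in that assumes the correlation is a Spearman rank correlation, computed as the Pearson correlation of average ranks so that tied values are handled.

```python
import math


def mean_squared_error(observed, predicted):
    """Average squared difference between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman_correlation(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    # Undefined (division by zero) if either input is constant.
    return cov / math.sqrt(var_x * var_y)
```

Here the observed values would be the MeSH similarity scores and the predicted values the output of a fitted model; a lower MSE and a higher correlation indicate a better model.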
Research findings
Before presenting the findings, it is necessary to compare the gamma and inverse Gaussian models in terms of goodness of fit and to select one. The goodness-of-fit indices of the gamma and inverse Gaussian models for the three metric groups examined (text-based, linked-based, and a combination of text-based and linked-based) show that the inverse Gaussian distribution has a larger log-likelihood value than the gamma distribution; also, the inverse Gaussian model has smaller values of Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Therefore, the inverse Gaussian model was found to be more suitable for describing the data from all three groups: text-based, linked-based, and hybrid (Table 1).
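The relationship between the log-likelihood and the two information criteria can be made explicit with their standard definitions, AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of estimated parameters and n the number of observations. The sketch below is illustrative only; the function names are hypothetical and the values in Table 1 come from SPSS, not from this code.

```python
import math


def aic(log_likelihood, n_parameters):
    """Akaike's Information Criterion: 2k - 2 ln L."""
    return 2 * n_parameters - 2 * log_likelihood


def bic(log_likelihood, n_parameters, n_observations):
    """Bayesian Information Criterion: k ln n - 2 ln L."""
    return n_parameters * math.log(n_observations) - 2 * log_likelihood
```

When two models are fitted to the same data with the same number of parameters, as with a gamma and an inverse Gaussian model using the same predictors, the model with the larger log-likelihood necessarily has the smaller AIC and BIC, which is consistent with the pattern reported above.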
Table 1. Goodness-of-fit indices of the models by metric group.
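The selection rule applied here can be made concrete. The sketch below assumes the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of estimated parameters, n the number of observations, and ln L the maximized log-likelihood. The function names are illustrative.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike's Information Criterion; smaller is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion; smaller is better."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


def preferred_model(candidates, n_obs):
    """Return the model with the lowest AIC and BIC, or None if they disagree.

    candidates maps a model name to (log_likelihood, n_params).
    """
    by_aic = min(candidates, key=lambda m: aic(*candidates[m]))
    by_bic = min(candidates, key=lambda m: bic(*candidates[m], n_obs))
    return by_aic if by_aic == by_bic else None
```

When two models have the same number of parameters, as with a gamma and an inverse Gaussian model fitted to the same predictors, the model with the larger log-likelihood necessarily has the smaller AIC and BIC, so the three indices agree.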
The role of text-based metrics in estimating the similarity of articles
To measure the role of the two text-based metrics, Cosine and Lucene similarity, in estimating the similarity of articles, these two metrics were entered into the regression model as independent variables, with mesh similarity as the dependent variable (Table 2). According to Table 2, the generalized model with inverse Gaussian distribution fitted with the text-based independent variables is significant (χ2(2) = 164.525, p < 0.01).
Table 2. Significance test of the proposed model for combining text-based metrics.
According to Table 3, in the generalized model with inverse Gaussian distribution fitted with the text-based independent variables, both text-based variables had a significant effect on the dependent variable.
Table 3. Estimation of parameters and effectiveness of text-based variables.
p < 0.01.
Table 3 shows the results of the Wald Chi-Square test used to evaluate the effect of the independent variables. A higher value of this statistic indicates a greater effect of the variable; thus, among the text-based metrics, cosine similarity (10.303) had a greater effect on the dependent variable than Lucene similarity (7.239). In other words, cosine similarity had greater power than Lucene similarity in estimating the similarity of articles.
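For a single coefficient, the Wald statistic is the squared ratio of the estimate to its standard error, and it is compared with a chi-square distribution. The sketch below assumes one degree of freedom per coefficient, since the table itself is not reproduced here; the function names are illustrative.

```python
import math


def wald_chi_square(estimate, std_error):
    """Wald statistic for one coefficient: (estimate / standard error) squared."""
    return (estimate / std_error) ** 2


def chi_square_p_value_df1(statistic):
    """Upper-tail p-value of a chi-square variable with 1 degree of freedom.

    For df = 1 this equals P(|Z| > sqrt(statistic)) for a standard normal Z.
    """
    return math.erfc(math.sqrt(statistic / 2.0))
```

Under this assumption, both statistics quoted above (10.303 and 7.239) give p-values below 0.01, because each exceeds the 0.01 critical value of about 6.635.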
In the next step, the output of the generalized model with inverse Gaussian distribution was extracted for the combination of the text-based independent variables (Cosine and Lucene similarity). The model output was correlated with the similarity of articles (rs = 0.069, p < 0.001), and the mean squared error was MSE = 0.00447; that is, the proposed model estimates the similarity of articles with an error of 0.00447. Based on the Spearman correlation test, it can be concluded with 99% confidence (significance level below 0.01) that there is a significant relationship between the similarity of articles and the output of the text-based metrics model (Table 4).
Table 4. Correlation test results between the similarity of articles and text-based model.
N = 15,000.
*p < 0.001.
The role of linked-based metrics in estimating the similarity of articles
To assess the role of bibliographic coupling, co-citation, Amsler, PageRank, HITS (authority), and HITS (hub) in estimating the similarity of articles with each other, these six metrics were entered into the regression model as independent variables, with the similarity of articles as the dependent variable. According to Table 5, the generalized model with inverse Gaussian distribution fitted with the linked-based independent variables is significant (χ2(2) = 801.656, p < 0.001).
Table 5. Significance test of the proposed model for combining linked-based metrics.
In the generalized model with inverse Gaussian distribution fitted with the linked-based independent variables, four of the six independent variables had a significant effect on the dependent variable (Table 6). According to the Wald Chi-Square statistic, HITS (hub), HITS (authority), PageRank, and co-citation had, in that order, the greatest effect on the dependent variable. The other two metrics, bibliographic coupling and Amsler, did not significantly affect the dependent variable and did not enter the model.
Table 6. Estimation of parameters and effectiveness of linked-based variables.
The output of the generalized model with inverse Gaussian distribution for the combination of the linked-based independent variables was correlated with the similarity of articles (rs = 0.204, p < 0.001), and the mean squared error was MSE = 0.00433; that is, the proposed model could estimate the similarity of articles, as measured by the mesh similarity metric, with an error of 0.00433. Based on the Spearman correlation test, it can be concluded with 99% confidence (significance level below 0.01) that there is a significant relationship between the similarity of articles and the output of the linked-based metrics model (Table 7).
Table 7. Correlation test results between the similarity of articles and linked-based model.
N = 15,000.
*p < 0.001.
Combination of text-based and linked-based metrics to estimate the similarity of articles with each other
Two text-based metrics (Cosine and Lucene similarity) and six linked-based metrics (co-citation, bibliographic coupling, Amsler, PageRank, HITS (authority), and HITS (hub)) were entered into the regression model as independent variables, with the similarity of articles as the dependent variable. According to Table 8, the generalized model with inverse Gaussian distribution fitted with the text-based and linked-based independent variables is significant (χ2(2) = 820.137, p < 0.001).
Table 8. Significance test of the proposed model for combining text-based and linked-based metrics.
In fitting and modeling the dependent variable with independent linked-based and text-based variables, in the generalized model with inverse Gaussian distribution, six out of eight independent variables affect the dependent variable (Table 9).
Table 9. Estimation of parameters and effectiveness of text-based and linked-based variables.
According to the Wald Chi-Square statistics, HITS (hub), HITS (authority), PageRank, Lucene similarity, and cosine similarity had, in that order, the greatest effect on the dependent variable (p < 0.001). The co-citation metric entered the model after these metrics (p < 0.01). The two metrics of bibliographic coupling and Amsler did not significantly affect the dependent variable and did not enter the model.
The output of the generalized model with inverse Gaussian distribution for the combination of the text-based and linked-based independent variables was correlated with the similarity of articles (rs = 0.205, p < 0.001), and the mean squared error was MSE = 0.00433; in other words, the proposed model can estimate the similarity of articles, as measured by the mesh similarity metric, with an error of 0.00433. Based on the Spearman correlation test, it can be concluded with 99% confidence (significance level below 0.01) that there is a direct and positive correlation between the similarity of articles and the output of the hybrid model (Table 10).
Table 10. Correlation test results between the similarity of articles and hybrid model output.
N = 15,000.
*p < 0.001.
Discussion and conclusion
This study evaluated the power of text-based metrics (Cosine and Lucene similarity), linked-based metrics (co-citation, bibliographic coupling, Amsler, PageRank, and HITS), and their combination in estimating the similarity of articles. The results indicate that the linked-based metrics have more power in estimating the similarity of articles than the text-based metrics. Also, combining text-based and linked-based metrics, relative to using linked-based metrics alone, improves model performance only slightly.
Among the linked-based metrics, HITS (hub) had the greatest effect on the dependent variable, followed by HITS (authority), PageRank, and co-citation, indicating that newer metrics such as HITS and PageRank estimate the similarity of articles better than traditional metrics. These metrics take visibility and authority into account simultaneously by using the link structure of article networks. This finding confirms previous research on the power of web-environment algorithms, including PageRank and HITS, in retrieving and ranking related documents (Chen et al., 2007; Lin, 2008; Liu and Lin, 2007; Liu et al., 2012; Yin et al., 2009, 2011; Zhuge and Zhang, 2010). The HITS algorithm, in addition to the graph structure, also uses content analysis and is query-dependent; therefore, compared to the PageRank algorithm, which considers only the graph structure and is query-independent, it was more successful in estimating the similarity of the articles. The poorer performance of the co-citation metric compared to HITS and PageRank has also been found in previous research and was to be expected (Ahlgren et al., 2020; Shibata et al., 2009). This can be attributed to a limitation of the co-citation metric: a newly published article that has not yet received citations cannot be co-cited with other articles. Moreover, the CITREC test collection is static, so the number of co-cited articles in it cannot grow.
In this study, two of the linked-based metrics, bibliographic coupling and Amsler, could not enter the model. One reason may be the much higher power of HITS and PageRank (Wald Chi-Square of HITS = 94.737 versus 1.739 for bibliographic coupling and 0.850 for Amsler in Table 6), so that metrics with low power could not enter the model. Also, since the Amsler metric is calculated from bibliographic coupling and co-citation, the poor performance of these two metrics weakened the performance of Amsler in the present study. In general, the better performance of linked-based metrics than text-based metrics confirms that the link structure of article networks can improve the effectiveness of information retrieval systems. The weak power of text-based metrics in estimating the similarity of articles may be due to features of natural language such as linguistic diversity and ambiguity, which have also been mentioned in previous research (Kim et al., 2017; Vallez and Pedraza-Jimenez, 2007; Zarrinkalam and Kahani, 2012).
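The dependence of Amsler on the other two metrics can be seen in a small sketch. Co-citation counts the articles that cite both members of a pair, and bibliographic coupling counts the references they share. For Amsler, the code uses one common normalized formulation, the overlap of each article's combined set of citing and cited articles; this formulation and the function names are assumptions made here, and the values stored in CITREC may be computed differently.

```python
def build_neighbors(edges):
    """Index a list of (citing, cited) pairs in both directions."""
    cites = {}     # article -> set of articles it cites (its references)
    cited_by = {}  # article -> set of articles that cite it
    for citing, cited in edges:
        cites.setdefault(citing, set()).add(cited)
        cited_by.setdefault(cited, set()).add(citing)
    return cites, cited_by


def co_citation(a, b, cited_by):
    """Number of articles that cite both a and b."""
    return len(cited_by.get(a, set()) & cited_by.get(b, set()))


def bibliographic_coupling(a, b, cites):
    """Number of references shared by a and b."""
    return len(cites.get(a, set()) & cites.get(b, set()))


def amsler(a, b, cites, cited_by):
    """Overlap of the combined citing and cited sets of a and b (assumed form)."""
    linked_a = cites.get(a, set()) | cited_by.get(a, set())
    linked_b = cites.get(b, set()) | cited_by.get(b, set())
    union = linked_a | linked_b
    return len(linked_a & linked_b) / len(union) if union else 0.0
```

In this formulation, a pair with few shared citing articles and few shared references also has a low Amsler value, which is consistent with the explanation given above.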
The findings on the hybrid model of text-based and linked-based metrics showed that combining the two groups produces a slight increase in the power to estimate the similarity of articles. The difference in power between the hybrid model and the linked-based metrics alone was not significant, but this small relative increase should still be considered. The results of various studies show that documents retrieved using text-based metrics differ from documents retrieved using linked-based metrics (Ahlgren and Jarneving, 2008; McCain, 1989; Pao and Worthen, 1989); accordingly, combining the two groups, even with a slight improvement, can affect the performance of the model. Also, text-based metrics focus on the content of articles and neglect the citation relationships between them, whereas linked-based metrics consider citation relations and ignore content; combining the two groups can therefore show the similarity between articles better, with each group compensating for the weaknesses of the other. The recommendation to use a combination of text-based and linked-based metrics has been emphasized in several previous studies (Ahlgren et al., 2020; Boyack and Klavans, 2010, 2020; Jiang et al., 2016; Reyhani Hamedani et al., 2013, 2016; Yin et al., 2009). The literature has emphasized that users, especially subject experts and researchers, need efficient retrieval systems that, as far as possible, retrieve only the most relevant and important documents in the shortest possible time; for example, research has reported that users consider only the first few results and ignore the other retrieved results (Bar-Ilan et al., 2006, 2009; Lewandowski, 2008, 2017).
One of the most critical factors directly affecting this goal is the quality of the similarity metrics: the better the similarity metrics used, the better the retrieval results that can be expected. Because, in the present study, the simultaneous use of linked-based and text-based metrics led to a slight improvement in estimating the similarity of articles, using a combination of metrics is suggested as a way to measure the similarity of documents more accurately.
On the other hand, similarity metrics have many applications in contexts such as search engines, databases, recommendation systems, and digital libraries, so combining these metrics can be beneficial for accessing similar articles, suggesting similar articles, and ranking retrieval results. Also, considering the better performance of linked-based metrics compared to text-based metrics, more weight can be allocated to linked-based metrics when designing database and citation algorithms.
This study showed that by using the citation network of articles, it is possible to calculate PageRank and HITS for scientific articles, and they can be used to determine the prestige and importance of articles and also measure the similarity of articles based on them. It is suggested that citation databases, in addition to the number of articles’ citations, calculate PageRank and HITS and report the result.
Finally, given the research gaps in comparing and combining similarity metrics, this study obtained the power of only some of these metrics in estimating the similarity of articles. In the current study, Cosine and Lucene similarity were used to measure the text-based similarity of articles. To reach a definitive judgment on the potential of text-based metrics, it is suggested to use other feature-extraction and encoding techniques (e.g. TF-IDF, Word2vec, BERT) together with vector similarity measures such as Euclidean distance and word mover’s distance (WMD). Since the literature mentions the usability of node embedding for measuring similarity and for information retrieval, it is also suggested to study the performance of other network-based metrics, such as centrality, in comparison with text-based and linked-based metrics as well as their combination, in predicting the similarity of articles. The dataset of the present study was the PubMed Central Open Access Subset (PMC OAS) of CITREC. Further research could be done with PubMedBERT, and its results compared with those of the current study. In this study, the role of text-based and linked-based metrics in estimating the similarity of articles was obtained using a regression model. In future research, we intend to obtain an optimal combination of text-based and linked-based metrics using an artificial neural network. With the optimal combination of metrics, suggestions can be made to increase the effectiveness of information retrieval systems.
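As an illustration of the vector-based measures suggested above, the sketch below builds TF-IDF vectors for a few documents and compares them by cosine similarity and Euclidean distance. It is a simplified example, not the pipeline of the present study: whitespace tokenization, the smoothed inverse document frequency, and all function names are assumptions made here.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse {term: weight} TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Relative term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two sparse vectors."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))
```

Cosine similarity depends only on the direction of the vectors, whereas Euclidean distance also reflects their length, so the two measures can rank the same document pairs differently.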
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
