Semantic-driven bibliometric techniques for co-citation analysis

Abstract

Co-citation analysis is a bibliometric technique for mining information on the relationships between scientific papers. Existing methods rely, however, on co-citation counting techniques that take the semantic aspect only marginally into consideration. The present study proposes semantic-driven bibliometric techniques for co-citation analysis that measure the semantic similarity (SS) between the titles of co-cited papers. Several computational measures rely on knowledge resources, such as the WordNet “is a” taxonomy, to quantify semantic similarity. Our proposal analyzes the SS between the titles of co-cited papers using word-based SS measures. Two major analytical experiments are performed: the first uses the benchmarks designed for testing word-based SS measures and expresses the efficiency of each measure through correlation coefficients; the second exploits the DBLP citation network dataset. The results show that the similarities estimated automatically by the semantic similarity measures correlate well with human judgements. The lexical similarity can therefore be used for the automatic assessment of similarity between co-cited papers. The analysis of highly repeated co-citations demonstrates that the different SS measures display almost similar behaviours, with slight differences due to the distribution of the provided SS values. Furthermore, we note a low percentage of similar referenced papers among the co-citations.

Keywords

Co-citation analysis, bibliometrics, WordNet, titles analysis, semantic measures

1. Introduction

This paper is an extended version of previous work highlighting the co-citation concept in the scientometric field [49, 58]. A citation usually reflects that an author is influenced by the work of another but does not provide an explicit indication of the degree or direction of that influence [17]. By contrast, co-citation quantifies the relationship between co-cited documents under the assumption that more frequently co-cited documents express greater co-citation strength [31]. Many works such as [6, 16, 32, 38, 42, 59, 60] showed that paper co-citation analysis is useful for defining the structure and evolution of research in a scientific domain. That is why scientists have been interested in developing methods for the analysis of co-citation behaviors using a variety of techniques [17, 48] ranging from Natural Language Processing to Network Science, as shown in Fig. 1.

Figure 1.

Keyword cooccurrence network for the research papers indexed in Web of Science Core Collection between 2017 and 2019 and dealing with co-citation (Software: VOSViewer 1.6.13 [41]).

In another context, some works such as [14, 18] showed that the title plays an important role as the first point of contact between writer and potential reader. Titles can therefore be exploited for studying the semantic perspective of the relation between co-cited papers.

That is why the present paper focuses on the semantic level of the highly repeated co-citations through the analysis of the DBLP citation network. It computes the SS between the co-cited papers based on the nouns in their titles. Conventional approaches do not take the semantic aspect of co-citations into account when identifying the intellectual structure of a target domain based on the titles of the cited papers. The choice of the paper title is essentially driven by the fact that authors usually take substantial care to include in the title appropriate words that best reflect the important content of their paper [26].

The computation of the semantic similarity between words is an important application in a wide range of research fields, including knowledge management, information retrieval, artificial intelligence, natural language processing, and the biomedical domain. A measure of SS takes two concepts as input and returns a numeric score that quantifies how much they are alike. Semantic similarity differs from semantic relatedness, which is not restricted to similarity but includes any semantic relation that can be interpreted by semantic computation models. Among the semantic relatedness approaches, the distributional models are the best known. They are based on statistical computational models such as tf-idf [2, 10] and BM25 [28], which express semantic relatedness rather than semantic similarity. Moreover, word embeddings are an evolution of the distributional semantic approach proposed by [50]. They provide a latent representation of a word based on its contexts learned from corpora. The family of word embedding models can be divided into two categories: text-based models and recent hybrid embedding models that combine text-based models with the use of ontologies [51]. These latent representations can be used to provide semantic similarity scores using the cosine similarity measure.

Here, we focus on the ontology-based SS measures because they provide a rich sense representation thanks to semantic relations that are expressed explicitly. They are based on two dimensions. The first is the sense representation, which exploits semantic resources such as ontologies that provide a machine-readable representation of concepts. The second is the computing model used for quantifying the shared features between two concepts, using for example the topological parameters of the ontologies (WordNet for the general domain, and MeSH and SNOMED CT for the biomedical field).

Several measures, regarded as structural approaches, exploit the taxonomic parameters extracted from the “is a” taxonomy. Among these approaches we can cite:

• Taxonomic measures estimate the similarity based on topological parameters [11, 13, 15, 19, 21, 46]: the number of taxonomic links, the depth, the hyponyms and the lowest common ancestor (LCA).

• Information content-based approaches quantify the similarity between concepts as a function of the Information Content (IC) that both concepts have in common in a given ontology. The basic idea is that general and abstract entities found in a discourse present less IC than more concrete and specialized ones [12, 20, 25, 29].

• Hybrid measures combine several measures in order to merge their advantages [47].

The rest of the paper is organized as follows. Section 2 presents an overview of the related work. Section 3 describes the proposed methodology for computing the semantic similarity between paper titles in order to estimate the semantic similarity between the referenced papers of a co-citation. Section 4 presents some semantic similarity measures based on the topological parameters extracted from the WordNet “is a” taxonomy. Section 5 describes the datasets used for performance evaluation, namely the semantic similarity datasets and the DBLP citation network exploited for the determination of co-cited papers and the study of their semantic perspective. Section 6 reports on the evaluation of the benchmarks and the study of semantic similarity between highly repeated co-citations. The final section is devoted to our conclusions and recommendations for future research.

2. Related work

2.1 The role, value and usage of research paper titles

Several works focused on the importance of the titles in the scientific papers as they are the most freely and easily available data about scientific papers [22].

The title is considered the most attractive component of a scientific paper. The authors of [14] studied how readers react to the titles of research papers from several disciplines, such as science and literature, and how titles motivate them to read the papers in depth.

The analysis by [18] shows that research papers with shorter titles are cited more than others. In fact, a short title is more attractive because its carefully selected words play a marketing role. This study was based on data collected from Scopus and bibliometric platforms such as DBLP.

Therefore, as a short, freely available and precise description of the paper, the title is a useful component that can be exploited for a quick semantic comparison between the contents of papers.

That is why several scientists have, since the 1980s, assessed the efficiency of using titles to verify the similarity of research papers with the co-word analysis technique [37]. These scientists concluded that co-word analysis of titles fails to establish paper similarity precisely because it does not recognize synonyms and abbreviations (e.g., NO$_{\text{x}}$ and nitrogen oxides, PAN and peroxyacetyl nitrate), as this technique is based only on the direct comparison of the titles of papers [45].

In this research paper, we propose semantic similarity measures as a more effective method to assess the similarity of papers, particularly co-cited ones, as these SS measures depend on large-scale linguistic resources and are consequently able to detect synonyms and abbreviations within paper titles.

2.2 Coupling co-citation analysis and semantic measures

Paper co-citations have been studied by several authors using clustering techniques, co-citation proximity analysis and quantitative methods [6, 8, 32, 38], as they are useful for extracting the relationships between the topics covered in scientific literature, finding the relations between research works, and studying the structure and evolution of the research efforts in a specific domain [30, 33, 35, 36, 44].

However, despite its importance, co-citation analysis gives limited information about the context in which papers are co-cited, as it is mainly based on statistical approaches [17]. That is why several scientists have used semantic methods to give another dimension to co-citation analysis [40].

Braam et al. [3, 4] proposed to define the thematic contexts of paper co-citations by retrieving the main keywords for each paper co-citation cluster. They proved the accuracy of their method to easily verify the results of co-citation analysis, mainly concerning aspects related to the cognitive content of publications. They also confirmed that co-word analysis can be used to enrich co-citation analysis results, particularly when the number of co-citations is limited due to the lack of focused internal references within the analyzed set of research papers.

Figure 2.

The semantic similarity computation process between $t_{P_{1}}$ and $t_{P_{2}}$ .

Chen [5] used Latent Semantic Indexing and Pathfinder Network Scaling to recognize the common field of research interest for each author co-citation cluster. In fact, he analyzed the titles, abstracts and keywords of the set of papers of each two co-cited authors to generate a document-document lexical similarity matrix. All built matrices are successfully used later to define the main research topics of each author co-citation cluster.

Elkiss et al. [7] proposed to justify paper co-citation through the analysis of the lexical similarity of the abstracts and full texts of co-cited papers. They showed that the lexical similarity between co-cited papers is significantly higher when the co-citing sentences are closer to each other within the full texts of citing papers ( $\rho<0.001$ ). They also showed that the lexical similarity of papers co-cited multiple times is slightly but significantly higher than that of papers co-cited once, although there is no strong correlation between the number of received co-citations and the lexical similarity of co-cited papers.

The works [17, 34] were interested in finding the reasons behind co-citations through the analysis of the sentences introducing them within the full text of citing papers. Small [34] proposed to find the relationship between co-cited papers by retrieving and analyzing the structures that link the sentences citing the co-cited papers within the full texts of the analyzed papers. He postulates that the relationship between co-cited papers can be symmetric (similarity [Similar to] and opposition [In Contrast]) or asymmetric (explanation [explains], characterization [Property of] and connective transition [And Also]). As for Jeong et al. [17], they computed the lexical cosine similarity of the sentences defining the co-citations received by each pair of authors within the full text of each analyzed paper, after stemming and removing stop words, so that the author co-citation network takes the semantic contexts of author co-citations into consideration. They showed that their adjusted co-citation network gives a better overview of authors’ topics of interest than traditional co-citation networks.

In this work, we apply semantic similarity measures to the titles of co-cited papers to determine whether co-cited papers deal with similar topics. A detailed explanation of the methods we used can be found in Sections 3 and 4.

3. Computing semantic similarity between paper titles

In order to evaluate the semantic similarity between the referenced papers in a co-citation ( $p_{1}$ , $p_{2}$ ), we choose to study the semantic similarity of their titles. This choice is motivated by the fact that a title is a short phrase that includes a set of strategically selected words that best reflect the content covered in a given paper. The proposed process is detailed in Fig. 2. It includes two main modules: the pre-treatment module, which filters and extracts the nouns $N(t_{P_{i}})$ that are considered the main elements (topical terms) of the titles ( $t_{P_{1}}$ , $t_{P_{2}}$ ), and the semantic similarity computation module, which estimates semantic similarity using the semantic resource WordNet and computational models. We focus on the nouns because recent studies such as [52] show that nouns contribute most to estimating the semantic similarity between sentences. After that, a thresholding function can be applied to transform the provided soft value into 0 (not similar) or 1 (similar). Whether the binary decision is needed depends on the task that uses the semantic similarity: if the problem is to decide whether the titles of co-cited papers are similar or not, we apply the tuned threshold; otherwise, the fractional value can be used directly.

3.1 Pre-treatment step

The stop words in each paper title are removed according to a list of 673 stop words. The POS tagger of the Stanford CoreNLP toolkit [23] is then used to select the nouns, identified by the tags NN (noun, singular), NNS (noun, plural), NNP (proper noun, singular) and NNPS (proper noun, plural). After that, each noun is lemmatized to obtain its singular form. Finally, the sets of nouns of each title, $N(t_{P_{1}})=\textit{Nouns}(t_{P_{1}})$ and $N(t_{P_{2}})=\textit{Nouns}(t_{P_{2}})$ , are passed to the second module as described in Fig. 2. Appendix A gives an example that illustrates the application of this process to a co-citation.
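The pre-treatment step can be sketched as follows. This is a minimal illustration and not the authors' pipeline: it assumes the title has already been POS-tagged (for instance by Stanford CoreNLP), uses a tiny illustrative stop list instead of the 673-word list, and replaces the lemmatizer with a hypothetical `toy_lemma` helper that only strips a plural "s".

```python
# Penn Treebank noun tags named in the text.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

# Tiny illustrative stop list; the 673-word list is not reproduced here.
STOP_WORDS = {"a", "an", "the", "of", "for", "in", "on", "and", "using"}


def toy_lemma(noun):
    """Hypothetical stand-in for a real lemmatizer: strips a plural 's'."""
    word = noun.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "is", "us")):
        return word[:-1]
    return word


def extract_nouns(tagged_title):
    """Return the set of lemmatized nouns N(t_P) from (token, tag) pairs."""
    nouns = set()
    for token, tag in tagged_title:
        if token.lower() in STOP_WORDS:
            continue  # stop-word removal
        if tag in NOUN_TAGS:
            nouns.add(toy_lemma(token))  # reduce the noun to its singular form
    return nouns


# Hypothetical pre-tagged title, for illustration only.
tagged = [("Semantic", "JJ"), ("similarity", "NN"), ("measures", "NNS"),
          ("for", "IN"), ("the", "DT"), ("analysis", "NN"),
          ("of", "IN"), ("titles", "NNS")]
print(sorted(extract_nouns(tagged)))
```

Returning a set mirrors the notation $N(t_{P_{i}})$, which treats the nouns of a title as a set, so a noun repeated in a title is counted once.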

3.2 Computing semantic similarity

The estimation of the degree of semantic similarity between the titles $t_{P_{1}}$ and $t_{P_{2}}$ is based on the noun set assigned to each title and extracted by the first module. The second module is based on two main components: WordNet, as a semantic resource representing word semantics, and the computational models exploiting the semantic information.

Figure 3.

WordNet “is a” taxonomy fragment [12].

3.2.1 WordNet

WordNet is a semantic resource for the English language [9]. As illustrated in Fig. 3, it uses the synset concept to represent word senses pertaining to different parts of speech. A synset represents a specific sense of a word and includes the words sharing that meaning. These synsets are related through various semantic relations. A polysemous word $N_{i}$ is thus represented by a set of synsets $\textit{Syn}(N_{i})$ referring to its different meanings.

3.2.2 Computational model

The Semantic Similarity (SS) between two titles is based on the calculation of the SS between their constituent nouns using the SS measures. The computing process is expressed as follows:

$\displaystyle\textit{SemSim}\left(t_{P_{1}},t_{P_{2}}\right)=\frac{\sum\nolimits_{i,j}\max\limits_{\begin{subarray}{c}N_{i}\in\textit{Nouns}\left(t_{P_{1}}\right);\\ N_{j}\in\textit{Nouns}\left(t_{P_{2}}\right)\end{subarray}}\textit{SemSim}\left(N_{i},N_{j}\right)}{\max\left(\left|\textit{Nouns}\left(t_{P_{1}}\right)\right|,\left|\textit{Nouns}\left(t_{P_{2}}\right)\right|\right)}$ (1)

where $\textit{SemSim}(N_{i},N_{j})$ in Eq. (1) is used to compute the semantic similarity between the two words $N_{i}$ and $N_{j}$ . It refers to any word-based semantic similarity measure whose efficiencies are explained in the next section.
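A minimal sketch of Eq. (1) follows. The index convention of the sum is read here in one plausible way, which is an assumption: each noun of the first title is matched with its most similar noun in the second title, the best scores are summed, and the total is divided by the size of the larger noun set. The `toy_word_sim` function is a hypothetical placeholder for any word-level measure.

```python
def title_similarity(nouns_1, nouns_2, word_sim):
    """Eq. (1) under the stated reading: best match per noun, then normalize."""
    if not nouns_1 or not nouns_2:
        return 0.0  # assumption: an empty noun set yields zero similarity
    total = sum(max(word_sim(n_i, n_j) for n_j in nouns_2) for n_i in nouns_1)
    return total / max(len(nouns_1), len(nouns_2))


def toy_word_sim(a, b):
    """Hypothetical word measure: 1 for identical words, 0.5 for one listed pair."""
    related = {frozenset(("graph", "network"))}
    if a == b:
        return 1.0
    return 0.5 if frozenset((a, b)) in related else 0.0


score = title_similarity({"graph", "analysis"},
                         {"network", "analysis", "citation"},
                         toy_word_sim)
print(score)  # (0.5 + 1.0) / 3
```

Dividing by the larger set size penalizes titles of very different lengths, since unmatched nouns lower the score.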

In WordNet, each noun $N$ is expressed by a set of synsets (considered as concepts) $\textit{Syn}(N)$ that represents the possible meanings of the word $N$ . So, the SS between two words $N_{i}$ and $N_{j}$ is the maximum similarity found between two specific senses pertaining to $\textit{Syn}\left(N_{i}\right)\times\textit{Syn}\left(N_{j}\right)$ , as expressed in the following equation:

$\displaystyle\textit{SemSim}\left(N_{i},N_{j}\right)=\max\limits_{\begin{subarray}{c}\left(c_{1},c_{2}\right)\in\textit{Syn}\left(N_{i}\right)\\ \times\textit{Syn}\left(N_{j}\right)\end{subarray}}\textit{SemSim}\left(c_{1},c_{2}\right)$ (2)

where $\textit{SemSim}(c_{1},c_{2})$ is computed using a semantic similarity measure.
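Eq. (2) can be illustrated with a short sketch. The sense inventory `toy_synsets` and the concept scores `toy_scores` are hypothetical and stand in for WordNet synsets and for any concept-level measure from Section 4. Returning 0 for a word without synsets is also an assumption.

```python
from itertools import product


def word_similarity(word_1, word_2, synsets, concept_sim):
    """Eq. (2): maximum concept similarity over Syn(word_1) x Syn(word_2)."""
    pairs = product(synsets.get(word_1, ()), synsets.get(word_2, ()))
    # default=0.0 is an assumption for words that have no synset
    return max((concept_sim(c1, c2) for c1, c2 in pairs), default=0.0)


# Hypothetical sense inventory and concept-level scores, for illustration only.
toy_synsets = {"bank": ["bank.river", "bank.finance"],
               "money": ["money.currency"]}
toy_scores = {("bank.river", "money.currency"): 0.1,
              ("bank.finance", "money.currency"): 0.8}


def toy_concept_sim(c1, c2):
    """Look up a toy score; unknown pairs score 0."""
    return toy_scores.get((c1, c2), 0.0)


best = word_similarity("bank", "money", toy_synsets, toy_concept_sim)
print(best)  # 0.8: the financial sense of "bank" is the closest to "money"
```

Taking the maximum over sense pairs acts as an implicit word sense disambiguation: the pair of senses that are closest to each other determines the word-level score.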

In the literature, the quantification of semantic similarity between words is based mainly on “is a” taxonomies like the one present in WordNet, because they represent the sharing of common characteristics.

4. Taxonomic-based semantic similarity measures between words

This section presents some WordNet-based SS measures that exploit topological parameters of concepts, including the depth, hyponyms, hypernyms and Lowest Common Subsumer (LCS). These measures are based on the hierarchical structure of the taxonomies independently from the language of the semantic resource as will be explained in next paragraphs.

4.1 Path and depth-based measures

Wu and Palmer [46] proposed a new measure (WP) defined as follows:

$\displaystyle\textit{SemSim}_{WP}\left(c_{1},c_{2}\right)=\frac{2\times H}{N_{1}+N_{2}+2\times H}$ (3)

where $N_{1}$ and $N_{2}$ refer to the number of “is a” links from $c_{1}$ and $c_{2}$ to the lowest common subsumer $c$ , respectively, and $H$ to the number of “is a” links from $c$ to the root of the taxonomy. The depth is commonly used to express the specificity of a concept inside a taxonomy, such as the WordNet “is a” noun taxonomy. In the experiments related to the WP measure, we use the $\textit{depth}_{WH}$ [43].
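As a worked illustration of Eq. (3), the sketch below takes the three link counts as plain integers. Extracting them from WordNet is left out, and the handling of the all-zero case is an assumption.

```python
def wu_palmer(n1, n2, h):
    """Eq. (3): 2H / (N1 + N2 + 2H), with link counts given as integers."""
    denominator = n1 + n2 + 2 * h
    # assumption: if every count is zero (both concepts are the root),
    # the concepts are treated as identical
    return 1.0 if denominator == 0 else 2 * h / denominator


# c1 is 1 link below the LCS, c2 is 2 links below it, the LCS is at depth 3.
print(wu_palmer(1, 2, 3))  # 6 / 9, about 0.667
```

The score grows with the depth $H$ of the common subsumer, so two concepts that meet low in the taxonomy are rated as more similar than two concepts separated by the same number of links near the root.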

Li et al. [19] have proposed a similarity measure (Li) to overcome the limitations associated with the edge counting methods.

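$\displaystyle\textit{SemSim}_{Li}\left(c_{1},c_{2}\right)=e^{-\alpha L}\times\frac{e^{\beta H}-e^{-\beta H}}{e^{\beta H}+e^{-\beta H}}$ (4)

where $L$ is the shortest path length between $c_{1}$ and $c_{2}$ , $H$ is the depth of their lowest common subsumer, and $\alpha$ and $\beta$ are tuning parameters.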
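Eq. (4) can be sketched in a few lines, taking $L$ as the shortest path length between the two concepts and $H$ as the depth of their lowest common subsumer, as in Li et al. [19]. The fraction in Eq. (4) is the hyperbolic tangent of $\beta H$, so the sketch uses `math.tanh`. The parameter values in the example are illustrative only.

```python
import math


def li_similarity(path_length, lcs_depth, alpha, beta):
    """Eq. (4): exp(-alpha * L) * tanh(beta * H)."""
    return math.exp(-alpha * path_length) * math.tanh(beta * lcs_depth)


# Illustrative values: path of 2 links, common subsumer at depth 4.
print(li_similarity(2, 4, alpha=0.2, beta=0.6))
```

The first factor decays with the path length, while the second saturates towards 1 for deep subsumers, which keeps the score inside [0, 1].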

Hao et al. [15] proposed a measure (Hao) using the semantic distance between two concepts (the shortest path length: $\left|{\textit{path}\left({c_{1},c_{2}}\right)}\right|)$ and the depth of LCS in the taxonomy. They proposed the following formula:

$\displaystyle\textit{SemSim}_{\textit{Hao}}\left(c_{1},c_{2}\right)=\left(1-\frac{\left|\textit{path}\left(c_{1},c_{2}\right)\right|}{\left|\textit{path}\left(c_{1},c_{2}\right)\right|+\textit{Depth}\left(\textit{LCS}\left(c_{1},c_{2}\right)\right)+\beta}\right)\times\left(\frac{\textit{Depth}\left(\textit{LCS}\left(c_{1},c_{2}\right)\right)}{\left|\textit{path}\left(c_{1},c_{2}\right)\right|+\textit{Depth}\left(\textit{LCS}\left(c_{1},c_{2}\right)\right)/2+\alpha}\right)$ (5)

The interval of $\alpha$ is [0, 1], and the increasing step is 0.1. The interval of $\beta$ is [1, 10], and the increasing step is 1.
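The sketch below transcribes Eq. (5) exactly as it is printed in the text, without checking it against the original publication [15]. The inputs are the shortest path length and the depth of the LCS, given as plain numbers; the parameter values in the example are illustrative.

```python
def hao_similarity(path_length, lcs_depth, alpha, beta):
    """Eq. (5) as printed: a distance factor times a depth factor."""
    distance_factor = 1 - path_length / (path_length + lcs_depth + beta)
    # with alpha = 0, path_length = 0 and lcs_depth = 0 this division is undefined
    depth_factor = lcs_depth / (path_length + lcs_depth / 2 + alpha)
    return distance_factor * depth_factor


# Path of 2 links, LCS at depth 4, alpha = 0.5, beta = 1.
print(hao_similarity(2, 4, alpha=0.5, beta=1))  # (5/7) * (4/4.5) = 40/63
```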

Liu et al. [21] presented a different measure to estimate the SS between concepts in WordNet using edge-counting techniques. The fundamental idea of this measure is based on the assumption that the human judgment process for semantic similarity can be simulated by the ratio of common features to the total features between words using two formulae (Liu1) and (Liu2):

$\displaystyle\textit{SemSim}_{\textit{Liu}-1}\left(c_{1},c_{2}\right)=\frac{\alpha\times d}{\alpha\times d+\beta\times l}$ (6)

$\displaystyle\textit{SemSim}_{\textit{Liu}-2}\left(c_{1},c_{2}\right)=\frac{e^{\alpha d}-1}{e^{\alpha d}+e^{\beta l}-2}$ (7)

where $l$ is the shortest path length between $c_{1}$ and $c_{2}$ ; $d$ is the depth of the subsumer between $c_{1}$ and $c_{2}$ , and $\alpha$ and $\beta$ are smoothing factors $(0<\alpha,\beta\leqslant 1)$ .
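Both formulae of Liu et al. can be written directly from Eqs. (6) and (7). This is a minimal sketch that assumes $d$ and $l$ are not both zero, since the two ratios are undefined in that case; the parameter values are illustrative.

```python
import math


def liu_1(d, l, alpha, beta):
    """Eq. (6): linear ratio of common features to total features."""
    return alpha * d / (alpha * d + beta * l)


def liu_2(d, l, alpha, beta):
    """Eq. (7): exponential variant of the same ratio."""
    common = math.exp(alpha * d) - 1
    return common / (math.exp(alpha * d) + math.exp(beta * l) - 2)


# Subsumer at depth 4, path of 2 links, alpha = beta = 0.5.
print(liu_1(4, 2, 0.5, 0.5))  # 2 / (2 + 1)
print(liu_2(4, 2, 0.5, 0.5))
```

In both variants the depth $d$ plays the role of the common features and the path length $l$ that of the distinguishing features, so a path length of 0 gives a similarity of 1.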

Gao et al. [11] presented an approach for measuring the SS (Gao) based on edge-counting and information content theory. In the second strategy, instead of weighting each edge along the path from $c_{1}$ to $c_{2}$ , the number of the edges separating $c_{1}$ and $c_{2}$ is counted first.

4.2 Depth and hyponyms based measures

Hadj Taieb et al. [13] proposed an ontology-based method (Hadj1) exploiting the depth and the hyponyms for estimating the semantic similarity between two words as follows:

$\displaystyle\textit{SemSim}_{\textit{Hadj1}}\left(N_{1},N_{2}\right)=\begin{cases}\max\limits_{\begin{subarray}{c}\left(c_{1},c_{2}\right)\in\textit{Syn}\left(N_{1}\right)\\ \times\textit{Syn}\left(N_{2}\right)\end{subarray}}\textit{Sim}\left(c_{1},c_{2}\right)&\text{if }N_{1}\neq N_{2}\\ 1&\text{otherwise}\end{cases}$ (8)

with:

$\displaystyle\textit{Sim}\left(c_{1},c_{2}\right)=\left|\textit{TermDepth}\left(c_{1},c_{2}\right)-\Lambda\left(N_{1},N_{2}\right)\right|\times\textit{TermHypo}\left(c_{1},c_{2}\right)$ (9)

where $\Lambda\left(N_{1},N_{2}\right)$ is a term used as an adjustment factor for resolving the problem of the fine granularity of WordNet and $\textit{TermHypo}(c_{1},c_{2})$ is a term for quantifying the subgraph of descendants.

4.3 IC-based measures

The IC-based measures consist of a pair including the computing IC method and the similarity measure. Hadj Taieb et al. [12] proposed an IC computing method (Hadj2) based on the quantification of the subgraph formed by the ancestors of a target concept $c$ in the “is a” taxonomy. The IC of a given concept Con is computed as follows:

$\displaystyle IC\left(\textit{Con}\right)=\left(\sum\limits_{c\in\textit{Hyper}\left(\textit{Con}\right)}\textit{Score}\left(c\right)\right)\times\textit{AverageDepth}\left(\textit{Con}\right)$ (10)

The similarity measure proposed by Lin exploits the IC of the lowest common subsumer (LCS) because it represents the commonality between the two concepts $c_{1}$ and $c_{2}$ :

$\displaystyle\textit{SemSim}_{\textit{Hadj2}}\left(c_{1},c_{2}\right)=\frac{2\times IC\left(\textit{LCS}\left(c_{1},c_{2}\right)\right)}{IC\left(c_{1}\right)+IC\left(c_{2}\right)}$ (11)

Sánchez et al. [29] proposed another strategy for computing Information Content (IC) (Sanchez) by using the hyponyms through the leaves of the hyponym subgraph of a concept and integrated a novel parameter, ancestors(c). The IC formula is expressed as follows:

$\displaystyle IC\left(c\right)=-\log\left(\frac{\frac{\left|\text{leaves}\left(c\right)\right|}{\left|\text{ancestors}\left(c\right)\right|}+1}{\text{max\_leaves}+1}\right)$ (12)

This IC computation method is then used with the similarity measure of Meng and Gu [24] which is based on Lin’s measure. It is expressed by the following equation:

$\displaystyle\textit{SemSim}_{\textit{Sanchez}}\left(c_{1},c_{2}\right)=e^{\textit{Sim}_{\textit{Lin}}\left(c_{1},c_{2}\right)}-1$ (13)
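The chain formed by Eqs. (12), (11) and (13) can be sketched as follows. The counts of leaves and ancestors are passed as plain numbers; the natural logarithm and an ancestor count of at least 1 (the concept counted among its own ancestors) are assumptions made to keep the example well defined.

```python
import math


def sanchez_ic(leaves, ancestors, max_leaves):
    """Eq. (12): IC from the leaves and ancestors of a concept (natural log assumed)."""
    return -math.log((leaves / ancestors + 1) / (max_leaves + 1))


def lin_similarity(ic_1, ic_2, ic_lcs):
    """Lin's ratio as in Eq. (11): 2 * IC(LCS) / (IC(c1) + IC(c2))."""
    return 2 * ic_lcs / (ic_1 + ic_2)


def sanchez_similarity(ic_1, ic_2, ic_lcs):
    """Eq. (13): exp(Sim_Lin) - 1."""
    return math.exp(lin_similarity(ic_1, ic_2, ic_lcs)) - 1


# Hypothetical taxonomy with 1000 leaves in total.
ic_root = sanchez_ic(leaves=1000, ancestors=1, max_leaves=1000)  # no information
ic_leaf = sanchez_ic(leaves=0, ancestors=5, max_leaves=1000)     # maximal information
print(ic_root, ic_leaf)
print(sanchez_similarity(ic_leaf, ic_leaf, ic_leaf))  # e - 1 for identical concepts
```

With this IC definition the root carries no information and a leaf carries the maximum, $\log(\text{max\_leaves}+1)$, which matches the idea that specialized concepts are more informative than general ones.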

4.4 Hybrid measures

Zhou et al. [47] proposed a measure (Zhou) that takes information content measures and path based measures as parameters by using a tuning factor $k$ :

$\displaystyle\textit{SemSim}_{\textit{Zhou}}\left(c_{1},c_{2}\right)=1-k\left(\frac{\log\left(\textit{len}\left(c_{1},c_{2}\right)+1\right)}{\log\left(2\times\left(\textit{deep\_max}-1\right)\right)}\right)-\left(1-k\right)\times\frac{IC\left(c_{1}\right)+IC\left(c_{2}\right)-2\times IC\left(\textit{LCS}\left(c_{1},c_{2}\right)\right)}{2}$ (14)
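A minimal sketch of Eq. (14) is given below. It assumes that the IC values are already normalized to [0, 1] and that the maximum depth is greater than 1, so that both penalty terms stay bounded; all numbers in the example are illustrative.

```python
import math


def zhou_similarity(path_len, ic_1, ic_2, ic_lcs, deep_max, k):
    """Eq. (14): 1 minus a weighted path penalty and a weighted IC penalty."""
    path_penalty = math.log(path_len + 1) / math.log(2 * (deep_max - 1))
    ic_penalty = (ic_1 + ic_2 - 2 * ic_lcs) / 2
    return 1 - k * path_penalty - (1 - k) * ic_penalty


# Identical concepts: zero path length and IC(c1) = IC(c2) = IC(LCS).
print(zhou_similarity(0, 0.7, 0.7, 0.7, deep_max=20, k=0.5))  # 1.0
# A more distant pair with a less informative common subsumer.
print(zhou_similarity(6, 0.8, 0.6, 0.3, deep_max=20, k=0.5))
```

The factor $k$ balances the two sources of evidence: $k=1$ reduces the measure to a pure path-based score and $k=0$ to a pure IC-based score.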

In the next section, we detail the dataset DBLP citation network exploited in the determination of co-cited papers for studying their semantic perspective based on their titles.

5. Evaluation datasets

The performance evaluation of the method proposed in the present work for measuring the Semantic Similarity (SS) between co-cited papers is based on two analytical experiments. The first is based on word-based benchmarks and aims to study the performance of the word-based semantic similarity measures outlined in the previous section; the second analyzes the semantic similarity between the highly co-cited papers using the DBLP citation network dataset.

5.1 Semantic similarity benchmarks

In this study, we experimentally evaluated machine-generated values of semantic similarity between words and compared them against human ratings [57]. Rubenstein and Goodenough [53] (RG65) obtained “synonymy judgments” on 65 pairs of words. The participants were asked to rate each pair on a scale of 0.0 to 4.0 according to similarity of meaning. Miller and Charles [27] (MC30) extracted 30 pairs from the original 65 and then obtained similarity judgments from 38 participants. Likewise, Agirre et al. [1] created a semantic similarity dataset (AG203) that contains 203 pairs of terms from Fin353.

Halawi et al. [54] created a new dataset (MTurk771) wherein the similarity value of each word pair was taken as the mean score given by the workers in Amazon Mechanical Turk (AMT). Bruni et al. [56] created a benchmark dataset (MEN3000) that consists of 3000 word pairs, randomly selected from words that occur at least 700 times in the freely available ukWaC and Wackypedia corpora. Hill et al. [55] presented SimLex-999, which contains a range of concrete and abstract adjective, noun and verb pairs. In this paper, we exploit only the noun subset (SimLex666).

5.2 Evaluation metrics

The comparison between values provided by a measure and human judgments is particularly based on correlation coefficients.

Figure 4.

The formatted entry of each paper in the dataset DBLP citation network.

Figure 5.

Paper number distribution according to year of publication.

Figure 6.

Co-citation frequency distribution in logarithmic scale (a) and frequency distribution of highly repeated ( $\geqslant$ 50) co-citations (b).

Figure 7.

Curves representing Pearson correlation values ( $r$ ) when varying the parameters $\alpha$ and $\beta$ of the measure [19] applied on datasets (RG65, MC30, AG203, MTurk771, SimLex666 and MEN3000).

5.2.1 Pearson coefficient

The Pearson product-moment correlation coefficient $r$ can be employed as an evaluation metric. It indicates how well the results of a measure resemble human judgments. Pearson’s $r$ is calculated as follows:

$\displaystyle r=\frac{n\left(\sum x_{i}y_{i}\right)-\left(\sum x_{i}\right)\left(\sum y_{i}\right)}{\sqrt{n\left(\sum x_{i}^{2}\right)-\left(\sum x_{i}\right)^{2}}\,\sqrt{n\left(\sum y_{i}^{2}\right)-\left(\sum y_{i}\right)^{2}}}$ (15)

where $x_{i}$ refers to the $i^{\text{th}}$ element in the list of human judgments, $y_{i}$ refers to the corresponding $i^{\text{th}}$ element in the list of SS computed values, and $n$ to the number of word pairs.
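The Pearson coefficient can be computed directly from the sums that appear in Eq. (15), using the standard form in which each square root contains $n\sum x_{i}^{2}-(\sum x_{i})^{2}$. The ratings in the example are invented for illustration and are not taken from any benchmark.

```python
import math


def pearson_r(x, y):
    """Pearson correlation from raw sums, following the layout of Eq. (15)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator


human = [1, 2, 3, 4]     # invented human ratings
machine = [1, 3, 2, 4]   # invented scores of a measure
print(pearson_r(human, machine))  # 16 / 20 = 0.8, up to rounding
```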

5.2.2 Spearman coefficient

This is used to correlate word pair rankings. The quality of such ranking is quantified by the Spearman rank order correlation coefficient $(\rho)$ . The parameter $d_{i}$ is the difference between the ranks of $x_{i}$ and $y_{i}$ .

$\displaystyle\rho=1-\frac{6\sum d_{i}^{2}}{n\left(n^{2}-1\right)}$ (16)
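Eq. (16) can be sketched as follows. The closed form is only exact when there are no tied values, so the helper `ranks` assumes distinct values in each list; the data are invented for illustration.

```python
def ranks(values):
    """Rank 1 goes to the largest value; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Eq. (16): 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


human = [1.0, 2.0, 3.0, 4.0]    # invented human ratings
machine = [0.1, 0.6, 0.4, 0.9]  # invented scores of a measure
print(spearman_rho(human, machine))  # sum of d^2 is 2, so 1 - 12/60 = 0.8
```

Because only the ranks matter, any monotonic rescaling of the scores of a measure leaves $\rho$ unchanged, unlike Pearson's $r$.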

Table 1

Results of semantic similarity measures applied on a set of word-based benchmarks using the Pearson correlation ( $r$ )

	RG65	MC30	AG203	SimLex666	MTurk771	MEN3000
Path-Depth
WP	0.836	0.816	0.654	0.504	0.505	0.413
Li	0.858	0.809	0.641	0.604	0.516	0.397
Liu1	0.845	0.775	0.640	0.580	0.520	0.381
Liu2	0.840	0.772	0.635	0.553	0.504	0.405
Hao	0.842	0.823	0.673	0.584	0.491	0.424
Gao	0.866	0.828	0.680	0.591	0.522	0.431
Depth-Hyponyms
Hadj1	0.871	0.849	0.717	0.627	0.550	0.465
Information content
Hadj2	0.730	0.655	0.555	0.445	0.367	0.585
Sanchez	0.835	0.805	0.646	0.583	0.500	0.379
Hybrid
Zhou	0.877	0.857	0.623	0.626	0.541	0.385

5.3 DBLP citation network

The DBLP [39] citation network consists of all papers dated before September 29, 2013. Each entry represents a paper and includes the title, authors, year, publication venue, index id, pid in the ArnetMiner database, ids of the references of the paper, and the abstract. The dataset contains 2,084,055 papers and 2,244,018 citation relationships. Figure 4 shows the format of each paper entry.

Figure 5 shows the number of papers according to the publication year. The majority of the papers in the dataset date from the period between 2000 and 2010.

This dataset is exploited to study the semantic similarity between highly co-cited pairs of papers based on their titles.

Figure 6a shows the co-citation frequency distribution on a logarithmic scale, and Fig. 6b zooms in on the co-citations repeated 50 times or more. Figure 6 also shows that most of the co-citations occur only once (1,904,758 co-citations, i.e. 82.63%). The top pairs of co-cited papers can be found in Appendix B.
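Co-citation frequencies of the kind plotted in Fig. 6 can be obtained from reference lists with a short counting routine. The sketch below is an illustration under assumptions: the citation network is represented as a hypothetical dictionary that maps each citing paper to the ids of its references, which is not the raw DBLP file format.

```python
from collections import Counter
from itertools import combinations


def count_cocitations(references_by_paper):
    """Count how many citing papers reference both members of each pair."""
    counts = Counter()
    for refs in references_by_paper.values():
        # sorted(set(...)) removes duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1
    return counts


# Hypothetical toy network: citing paper -> ids of the referenced papers.
toy_network = {"A": ["p1", "p2", "p3"], "B": ["p1", "p2"], "C": ["p2"]}
counts = count_cocitations(toy_network)
print(counts.most_common(1))  # [(('p1', 'p2'), 2)]

once = sum(1 for c in counts.values() if c == 1)
print(100 * once / len(counts))  # share of co-citations that occur only once
```

A citing paper with $m$ references contributes $m(m-1)/2$ pairs, so long reference lists dominate the number of distinct co-citations.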

6. Experiments’ results

The results from the analytical experiments are detailed in three parts: the first concerns the determination of the optimal values for the parameterized SS measures, the second part treats the results according to the word-based benchmarks, and the third part analyzes the semantic similarity between the highly repeated co-citations in the DBLP citation network dataset.

6.1 Determination of optimal tuning values

Some of the semantic similarity measures have tuning parameters, such as those of the measure of Li et al. [19] (Eq. (4)). The first step is therefore to determine the optimal tuning values using the cited benchmarks. We compute the Pearson correlation values ( $r$ ) obtained when varying the parameters $(\alpha,\beta)\in[0,1]^{2}$ and adopt the values of $\alpha$ and $\beta$ corresponding to the best $r$ value (Fig. 7). These settings are then used when applying the measure to determine the similarity between co-cited papers.
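The tuning procedure amounts to a grid search over $(\alpha,\beta)$. The sketch below illustrates it for the measure of Eq. (4) under several assumptions: a step of 0.1, a grid that starts at 0.1 rather than 0 (with $\beta=0$ every score is 0 and the correlation is undefined), and a hypothetical toy benchmark in which each word pair is summarized by its path length and LCS depth.

```python
import math


def pearson_r(x, y):
    """Pearson correlation, written with deviations from the mean."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def li_similarity(path_length, lcs_depth, alpha, beta):
    """Eq. (4): exp(-alpha * L) * tanh(beta * H)."""
    return math.exp(-alpha * path_length) * math.tanh(beta * lcs_depth)


def best_parameters(features, human, steps=10):
    """Return (alpha, beta, r) with the highest Pearson r on the grid."""
    best_alpha, best_beta, best_r = None, None, -math.inf
    for i in range(1, steps + 1):
        for j in range(1, steps + 1):
            alpha, beta = i / steps, j / steps
            scores = [li_similarity(l, h, alpha, beta) for l, h in features]
            r = pearson_r(scores, human)
            if r > best_r:
                best_alpha, best_beta, best_r = alpha, beta, r
    return best_alpha, best_beta, best_r


# Hypothetical benchmark: (path length, LCS depth) per word pair and human ratings.
toy_features = [(0, 5), (2, 4), (4, 3), (8, 1)]
toy_human = [4.0, 3.0, 2.0, 0.5]
alpha, beta, r = best_parameters(toy_features, toy_human)
print(alpha, beta, round(r, 3))
```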

6.2 Results of word-based benchmarks

Tables 1 and 2 show the correlation values obtained using the Pearson ( $r$ ) (Table 1) and Spearman ( $\rho$ ) (Table 2) correlation coefficients to determine the performance of SS measures in a set of benchmarks designed for the study of semantic similarity.

Table 2
Results of semantic similarity measures applied on a set of word-based benchmarks using the Spearman correlation ( $\rho$ )

	RG65	MC30	AG203	SimLex666	MTurk771	MEN3000
Path-Depth
WP	0.782	0.758	0.623	0.568	0.491	0.331
Li	0.787	0.740	0.615	0.590	0.496	0.335
Liu1	0.782	0.749	0.629	0.568	0.491	0.331
Liu2	0.789	0.741	0.625	0.560	0.480	0.337
Hao	0.783	0.759	0.627	0.570	0.488	0.332
Gao	0.784	0.746	0.619	0.577	0.494	0.337
Depth-Hyponyms
Hadj1	0.761	0.737	0.639	0.613	0.528	0.351
Information content
Hadj2	0.728	0.687	0.562	0.462	0.402	0.576
Sanchez	0.816	0.786	0.638	0.594	0.497	0.425
Hybrid
Zhou	0.823	0.792	0.632	0.612	0.510	0.368

Figure 8.

Curves representing the distribution of similar co-citations using different semantic similarity measures with a variation of the threshold $\theta$ .

Tables 1 and 2 show that the results on the RG65 and MC30 datasets are very close to human judgments; for example, the Zhou measure reaches a correlation value of $r=0.877$ on RG65. Moreover, the Hadj1 measure presents good correlations of $r=0.717$, $r=0.627$ and $r=0.55$ on the larger datasets AG203, SimLex666 and MTurk771, respectively. The good performance demonstrated at the word level can therefore be transferred to the title level: these measures are exploited to determine the semantic similarity between paper titles and to study the semantic similarity of the referenced papers pertaining to a co-citation.

6.3 Studying the semantic similarity of co-citations

We extract 2305156 distinct co-citations from the DBLP dataset. From these, we select a subset ( ${\Upsilon}$ ) comprising the co-citations with a frequency greater than 3. The resulting subset contains 83143 different co-citations. The semantic similarity between the titles of the referenced papers in each co-citation belonging to ${\Upsilon}$ is then computed. The computing process is described in Fig. 2 and modeled in Eq. (1). The different SS measures cited in the present work are used to visualize the percentage of similar paper pairs among the co-citations as a function of the threshold $\theta$ (see Fig. 8).
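The extraction of co-citations and the selection of the subset ${\Upsilon}$ can be illustrated with the following sketch. It assumes that each citing paper is available as a list of referenced paper identifiers; the identifiers, the function names and the tiny citation network are hypothetical and serve only to show the counting logic, not the actual DBLP processing.

```python
from collections import Counter
from itertools import combinations

def count_cocitations(reference_lists):
    """Count, for each unordered pair of papers, how many citing papers cite both."""
    counts = Counter()
    for refs in reference_lists:
        # Deduplicate and sort so that (a, b) and (b, a) map to the same key.
        for a, b in combinations(sorted(set(refs)), 2):
            counts[(a, b)] += 1
    return counts

def frequent_subset(counts, min_freq=3):
    """Keep the co-citations whose frequency is strictly greater than min_freq."""
    return {pair: freq for pair, freq in counts.items() if freq > min_freq}

# Hypothetical citing papers (c1..c5) and their reference lists.
citing = {
    "c1": ["p1", "p2", "p3"],
    "c2": ["p1", "p2"],
    "c3": ["p2", "p1", "p4"],
    "c4": ["p1", "p2", "p3"],
    "c5": ["p3", "p4"],
}

counts = count_cocitations(citing.values())
upsilon = frequent_subset(counts)
print(len(counts), "distinct co-citations;", len(upsilon), "with frequency > 3")
```

In this toy network only the pair (`p1`, `p2`) is co-cited more than three times, so it is the only member of the selected subset.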

Despite the differences between the underlying computational models, Fig. 8 shows that several semantic similarity measures behave similarly, such as (Liu1 and Liu2) and (Hao and WP). Moreover, we note that the Zhou measure provides the highest similarity values compared with the other measures. We also note that the curves are tightly bunched for the $\theta$ -values $\in[0.5,1]$. This could presumably be attributed to the fact that highly similar co-citations are rare [7] (for example, for $\theta=0.7$ and the Zhou measure, the percentage is 2.43%). Furthermore, Fig. 8 shows a set of measures (Li, Liu1, Liu2 and Hadj1) whose curves lie in the middle of the plot.

Each point on a curve gives the percentage of co-citations whose similarity degree is greater than or equal to the threshold $\theta$ . For example, when $\theta=0.1$ for the curve assigned to the WP measure, the percentage of co-citations with $\textit{SemSim}\left({t_{P_{1}},t_{P_{2}}}\right)\geqslant 0.1$ is 57.29%. The analysis of the semantic similarity between the paper pairs referenced by the co-citations through their titles shows that the majority are not similar (Fig. 8, $\theta\geqslant 0.6$ ). This supports the view that most papers are not co-cited because they are similar; rather, they are mainly co-cited so that they can be used in a complementary way to explain an idea in the target paper, as shown in [34].
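The construction of one such curve can be sketched as follows. The similarity values are invented and the function names are hypothetical; in practice the list would hold one title-level similarity score per co-citation of the selected subset, computed with a given SS measure.

```python
def percent_at_least(similarities, theta):
    """Percentage of co-citations whose similarity is >= theta."""
    if not similarities:
        return 0.0
    hits = sum(1 for s in similarities if s >= theta)
    return 100.0 * hits / len(similarities)

def threshold_curve(similarities, steps=10):
    """Sample the curve at theta = 0, 1/steps, ..., 1."""
    return [(k / steps, percent_at_least(similarities, k / steps))
            for k in range(steps + 1)]

# Invented title-level similarity scores for ten co-citations.
toy_scores = [0.05, 0.12, 0.08, 0.33, 0.71, 0.10, 0.02, 0.45, 0.09, 0.64]

curve = threshold_curve(toy_scores)
for theta, pct in curve:
    print(f"theta = {theta:.1f}: {pct:.1f}% of co-citations")
```

By construction the curve starts at 100% for $\theta=0$ and can only decrease as $\theta$ grows, which is the shape expected for every measure in a plot of this kind.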

7. Conclusion and future work

In this paper, we propose a new method for analyzing the semantic similarity between the referenced papers in a co-citation. This method can be considered a continuation of [7, 17]. Our approach is based on semantic similarity measures applied to paper titles. Each title undergoes a pre-treatment step, relying principally on the Stanford POS tagger, to extract the nouns and reduce them to their lemmas. The analysis then proceeds with a semantic similarity computation procedure that integrates word-based semantic similarity measures built on topological parameters extracted from the WordNet “is a” taxonomy. The exploited parameters are the depth, the hyponyms, the hypernyms and the LCS. Among the SS measures used, we exploit intrinsic IC-based SS measures, which combine IC-computing methods with IC-based measures. The evaluation comprises two parts: the use of both recent and older word-based benchmarks, and the use of the DBLP citation network dataset.

The results of the experiments performed on the benchmarks show that the SS measures approximate human similarity judgments well, as indicated by the good correlations obtained with the Pearson ( $r$ ) and Spearman ( $\rho$ ) coefficients. These results demonstrate that the computed SS estimations of lexical similarity are very close to human judgments.

Therefore, these SS measures are exploited to calculate the SS between the titles of referenced papers related to highly repeated co-citations. A pair of paper titles is considered similar if the value provided by an SS measure reaches or exceeds a threshold $\theta$ . The analysis of the semantic similarity of the co-citations in the DBLP dataset shows that most of the highly repeated co-citations are dissimilar or only slightly similar, which supports the complementarity between the co-cited papers used to explain an idea in the target paper.

Given the promising results obtained with the method proposed in the present study, further research is needed, some of which is currently underway in our laboratory. This research would apply the method in a process of co-citation clustering, investigate the topical relatedness between co-cited papers, and examine whether SS measures of the titles of co-cited papers correlate significantly with co-citation proximity, with co-citation frequency, and with human judgment of the similarity of co-cited papers.

Footnotes

Stanford CoreNLP provides a set of natural language tools for processing text and gives the base forms of words and their parts of speech. http://nlp.stanford.edu/software/corenlp.shtml.

https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.

http://wordnet.princeton.edu/.

http://aminer.org/billboard/DBLP_Citation.

http://clic.cimec.unitn.it/~elia.bruni/MEN.html.

http://wacky.sslmit.unibo.it/doku.php?id=corpora.

http://www.cl.cam.ac.uk/~fh295/simlex.html.

Appendix A

Appendix B: Top pairs of co-cited papers

Rank	Co-cited papers	Frequency
1	J. Ross Quinlan. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.	354
	J.R. Quinlan. 1986. Induction of Decision Trees. Mach. Learn. 1, 1 (March 1986), 81–106.
2	J. Ross Quinlan. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.	202
	Leo Breiman. 1996. Bagging predictors. Mach. Learn. 24, 2 (August 1996), 123–140.
3	V. Jacobson. 1988. Congestion avoidance and control. In Symposium proceedings on Communications architectures and protocols (SIGCOMM ’88), Vinton Cerf (Ed.). ACM, New York, NY, USA, 314–329.	151
	Sally Floyd and Van Jacobson. 1993. Random early detection gateways for congestion avoidance. IEEE/ACM Trans. Netw. 1, 4 (August 1993), 397–413.
4	C.A.R. Hoare. 1978. Communicating sequential processes. Commun. ACM. 21, 8 (August 1978), 666–677.	139
	R. Milner. 1982. A Calculus of Communicating Systems. Springer-Verlag New York, Inc., Secaucus, NJ, USA.
5	Ricardo A. Baeza-Yates and Berthier Ribeiro-Neto. 1999. Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.	137
	Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the seventh international conference on World Wide Web 7 (WWW7), Philip H. Enslow, Jr. and Allen Ellis (Eds.). Elsevier Science Publishers B. V., Amsterdam, The Netherlands, The Netherlands, 107–117.

References

1. Agirre, Alfonseca, Hall, Kravalova, Paşca and Soroa, A study on similarity and relatedness using distributional and wordnet-based approaches, in: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2009, pp. 19–27. Association for Computational Linguistics.

2. Ben Aouicha, Hadj Taieb M.A. and Ben Hamadou, LWCR: Multi-layered wikipedia representation for computing word relatedness, Neurocomputing 216 (2016), 816–843.

3. Braam R.R., Moed H.F. and Van Raan A.F., Mapping of science by combined co-citation and word analysis I. Structural aspects, Journal of the American Society for Information Science 42(4) (1991), 233.

4. Braam R.R., Moed H.F. and Van Raan A.F., Mapping of science by combined co-citation and word analysis II. Dynamical aspects, Journal of the American Society for Information Science 42(4) (1991), 252.

5. Chen, Visualising semantic spaces and author co-citation networks in digital libraries, Information Processing & Management 35(3) (1999), 401–420.

6. Chen, Song I.Y. and Zhu, Trends in conceptual modeling: Citation analysis of the ER conference papers (1979–2005), in: Proceedings of the 11th International Conference on the International Society for Scientometrics and Informetrics, 2007, pp. 189–200.

7. Elkiss, Shen, Fader, Erkan, States and Radev, Blind men and elephants: What do citation summaries tell us about a research article? Journal of the Association for Information Science and Technology 59(1) (2008), 51–62.

8. Eto, Evaluations of context-based co-citation searching, Scientometrics 94(2) (2013), 651–673.

9. Fellbaum, WordNet: An Electronic Lexical Database (Language, Speech, and Communication), illustrated edition. MIT Press. 1998.

10. Gabrilovich and Markovitch, Computing semantic relatedness using wikipedia-based explicit semantic analysis, IJCAI 7 (2007), 1606–1611.

11. Gao J.B., Zhang B.W. and Chen X.H., A WordNet-based semantic similarity measurement combining edge-counting and information content theory, Engineering Applications of Artificial Intelligence 39 (2015), 80–88.

12. Hadj Taieb M.A., Ben Aouicha and Ben Hamadou, A new semantic relatedness measurement using WordNet features, Knowledge and Information Systems 41(2) (2014), 467–497.

13. Hadj Taieb M.A., Ben Aouicha and Ben Hamadou A.B., Ontology-based approach for measuring semantic similarity, Engineering Applications of Artificial Intelligence 36 (2014), 238–261.

14. Haggan, Research paper titles in literature, linguistics and science: Dimensions of attraction, Journal of Pragmatics 36(2) (2004), 293–317.

15. Hao, Zuo, Peng and He, An approach for calculating semantic similarity between words using WordNet, in: Digital Manufacturing and Automation (ICDMA), 2011 Second International Conference on, 2011, pp. 177–180. IEEE.

16. Hou, Yang and Chen, Emerging trends and new developments in information science: A document co-citation analysis (2009–2016), Scientometrics 115(2) (2018), 869–892.

17. Jeong Y.K., Song and Ding, Content-based author co-citation analysis, Journal of Informetrics 8(1) (2014), 197–211.

18. Letchford, Moat H.S. and Preis, The advantage of short paper titles, Royal Society Open Science 2(8) (2015), 150266.

19. Li, Bandar Z.A. and McLean, An approach for measuring semantic similarity between words using multiple information sources, IEEE Transactions on Knowledge and Data Engineering 15(4) (2003), 871–882.

20. Lin, An information-theoretic definition of similarity, ICML, 1998, 296–304.

21. Liu X.Y., Zhou Y.M. and Zheng R.S., Measuring semantic similarity in WordNet, in: Machine Learning and Cybernetics, 2007 International Conference on. 6, 2007, pp. 3431–3435. IEEE.

22. Magerman, Van Looy and Song, Exploring the feasibility and accuracy of latent semantic analysis based text mining techniques to detect similarity between patent documents and scientific publications, Scientometrics 82(2) (2010), 289–306.

23. Manning, Surdeanu, Bauer, Finkel, Bethard and McClosky, The Stanford CoreNLP natural language processing toolkit, in: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2014, pp. 55–60.

24. Meng and Gu, A new model for measuring word sense similarity in WordNet, in: Proceedings of the 4th International Conference on Advanced Communication and Networking, SERSC, 2012, pp. 18–23. Jeju, Korea.

25. Meng and Zhou, A new model of information content based on concept’s topology for measuring semantic similarity in WordNet, International Journal of Grid and Distributed Computing 5(3) (2012), 81–94.

26. Merrill and Knipps, What’s in a title? The Journal of Wildlife Management 78(5) (2014), 761–762.

27. Miller G.A. and Charles W.G., Contextual correlates of semantic similarity, Language and Cognitive Processes 6(1) (1991), 1–28.

28. Robertson S.E. and Sparck Jones, Document retrieval systems, in: Willett, Relevance weighting of search terms, 1988, pp. 143–160. London, UK: Taylor Graham Publishing.

29. Sánchez, Batet and Isern, Ontology-based information content computation, Knowledge-Based Systems 24(2) (2011), 297–303.

30. Small, Co-citation context analysis and the structure of paradigms, Journal of Documentation 36(3) (1980), 183–196.

31. Small, Co-citation in the scientific literature: A new measure of the relationship between two documents, Journal of the Association for Information Science and Technology 24(4) (1973), 265–269.

32. Small H.G., A co-citation model of a scientific specialty: A longitudinal study of collagen research, Social Studies of Science 7(2) (1977), 139–166.

33. Small, Macro-level changes in the structure of co-citation clusters: 1983–1989, Scientometrics 26(1) (1993), 5–20.

34. Small, The synthesis of specialty narratives from co-citation clusters, Journal of the American Society for Information Science 37(3) (1986), 97–110.

35. Small and Sweeney, Clustering the science citation index® using co-citations: I. A comparison of methods, Scientometrics 7(3–6) (1985), 391–409.

36. Small, Sweeney and Greenlee, Clustering the science citation index using co-citations. II. Mapping science, Scientometrics 8(5–6) (1985), 321–340.

37. Sternitzke and Bergmann, Similarity measures for document mapping: A comparative study on the level of an individual scientist, Scientometrics 78(1) (2009), 113–130.

38. Sullivan, Koester, White and Kern, Understanding rapid theoretical change in particle physics: A month-by-month co-citation analysis, Scientometrics 2(4) (1980), 309–319.

39. Tang, Zhang, Yao, Zhang and Su, Arnetminer: extraction and mining of academic social networks, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 990–998. ACM.

40. Thijs and Glänzel, The contribution of the lexical component in hybrid clustering, the case of four decades of “Scientometrics”, Scientometrics 115(1) (2018), 21–33.

41. van Eck N.J. and Waltman, Citation-based clustering of publications using CitNetExplorer and VOSviewer, Scientometrics 111(2) (2017), 1053–1070.

42. Wang, Liang, Jia, Xue and Wang, Cloud computing research in the IS discipline: A citation/co-citation analysis, Decision Support Systems 86 (2016), 35–47.

43. Wang and Hirst, Refining the notions of depth and density in wordnet-based semantic similarity measures, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2011, pp. 1003–1011. Association for Computational Linguistics.

44. Wang, Zhao, Liu and Zhang, Knowledge-transfer analysis based on co-citation clustering, Scientometrics 97(3) (2013), 859–869.

45. Whittaker, Creativity and conformity in science: Titles, keywords and co-word analysis, Social Studies of Science 19(3) (1989), 473–496.

46. Wu and Palmer, Verbs semantics and lexical selection, in: Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics, 1994, pp. 133–138. Association for Computational Linguistics.

47. Zhou, Wang and Gu, New model of semantic similarity measuring in wordnet, in: Intelligent System and Knowledge Engineering, 2008. ISKE 2008. 3rd International Conference on. 1, 2008, pp. 256–261. IEEE.

48. Shiau W.-L., Dwivedi Y.K. and Yang H.S., Co-citation and cluster analyses of extant literature on social networks, International Journal of Information Management 37(5) (2017), 390–399.

49. Hadj Taieb M.A., Ben Aouicha and Turki, Paper co-citation analysis using semantic similarity measures, in: International Conference on Intelligent Systems Design and Applications (ISDA 2019), 2019.

50. Harris, Distributional structure, Word 10 (1954), 146–162.

51. Lastra-Díaz, Goikoetxea, Hadj Taieb M.A., García-Serrano, Ben Aouicha and Agirre, A reproducible survey on word embeddings and ontology-based methods for word similarity: Linear combinations outperform the state of the art, Eng. Appl. of AI 85 (2019), 645–665.

52. Hadj Taieb M.A., Ben Aouicha and Bourouis, FM3S: Features-based measure of sentences semantic similarity, HAIS 2015 (2015), 515–529.

53. Rubenstein and Goodenough J.B., Contextual correlates of synonymy, Commun ACM 8(10) (1965), 627–633.

54. Halawi, Dror, Gabrilovich and Koren, Large-scale learning of word relatedness with constraints, in: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, 2012, pp. 1406–1414.

55. Hill, Reichart and Korhonen, Simlex-999: Evaluating semantic models with (genuine) similarity estimation, Comput Linguist 41(4) (2015), 665–695.

56. Bruni, Tran N.K. and Baroni, Multimodal distributional semantics, J Artif Int Res 49(1) (2014), 1–47.

57. Hadj Taieb M.A., Zesch and Ben Aouicha, A survey of semantic relatedness evaluation datasets and procedures, Artif Intell Rev, 2019.

58. Zheng, Using mutual information as a cocitation similarity measure, Scientometrics 119 (2019), 1695–1713.

59. Réale, Khelfaoui, Montiglio et al., Mapping the dynamics of research networks in ecology and evolution using co-citation analysis (1975–2014), Scientometrics 122 (2020), 1361–1385.

60. Bonilla-Aldana D.K., Quintero-Rada, Montoya-Posada J.P. et al., SARS-CoV, MERS-CoV and now the 2019-novel CoV: Have we investigated enough about coronaviruses? – A bibliometric analysis, Travel Med Infect Dis, 2020.
