Abstract
The name disambiguation task is designed to solve the name ambiguity problem of documents of multiple persons who have the same name with one another. The task aims to partition all the publications belonging to multiple person with the same name and realize that each decomposed partition is composed of publications of a unique person. Many works on name disambiguation task have a common feature that clustering method is usually used in the last step. The paper presents a complementary study to these works from another point of view. Based on the idea that documents with strong association relationships are likely to belong to the same author, this paper proposes a method of discovering meta clusters by graph partition with a heuristic rule to improve these clustering-based works. Specially, different from these works, this work uses clustering ensemble method instead of clustering method in the last step. Experimental results on a real-life dataset show that the improved method has satisfactory performance compared with the clustering-based baseline method.
Introduction
In the online bibliography systems, such as DBLP and CiteSeerX, multiple authors have the same abbreviated name or multiple authors have the same full name, which usually result in author name ambiguity. Such phenomenon decreases the performance of document retrieval, Web search and information integration. For example, when searching for “Ajay Gupta” in Microsoft Bing, one may find a page (URL: https://www.cs.wmich.edu/∼gupta/) of a Professor of Computer Science at WesternMichigan University, USA, a page (URL: https://my.clevelandclinic.org/staff/4898-ajay-gupta) of a famous doctor in Cleveland clinic, USA, and the pages of other persons sharing the name. Evidently, name ambiguity also appears in digital library for the reason that many distinct authors share the same reference. These authors may have the same abbreviated name, or even have the same full name. For example, an online search query for “Ajay Gupta” in Google scholar profile retrieves many distinct persons’ publications that are mixed together. Therefore, it is necessary for digital library to study name disambiguation to purify online search query results and estimate a researcher’s citation related impact metrics accurately.
The traditional solution for solving the name disambiguation task is a clustering-based algorithm. The clustering-based method for name disambiguation problem can be formalized as: given a list of publications from many distinct authors having same name, the task is solved by assigning these publications to some different clusters, each of which corresponds to a unique person. In essence, all publications are transformed into a high-dimensional vector, and then publications corresponding to a unique author are clustered together by a clustering method.
In order to improve the accuracy of name disambiguation, this paper proposes a trick that meta clusters should be discovered firstly and then be used to cluster instead of clustering publication vector directly, based on the idea that publications with strong association relationships are very likely to belong to a same unique author. The concept of meta cluster is introduced in detail in Section 5.Our experiments results show that the cluster method using meta clusters performs better than previous cluster-based method. What’s more, considering that different clustering algorithms have different characteristics, we adopt a clustering ensemble method to further improve the effect of name disambiguation instead of using one kind of cluster method as previous works. By comprehensively utilizing the advantages of various clustering algorithms, the proposed method further improves the pairwise F1 value of the name disambiguation task. In the end, we compare the performance of the proposed method with that of some existing methods on the name disambiguation task. We name this novel name disambiguation algorithm using
The key contributions of this work include: Proposing an effective name disambiguation algorithm named MCCE. Using the Arnetminer [6] dataset for evaluating the performance of MCCE method and the results show that the method performs satisfactorily.
The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 presents a formal definition of the research questions in this paper. In Section 4, the MCCE method is proposed, including two processes: the meta cluster finding and cluster ensemble. Section 5 introduces the experimental details and analysis of the experimental results. Finally, we summarize the paper in Section 6.
Related work
There is a rich body of work on the author name disambiguation problem. In general, existing works for name disambiguation have considered supervised, unsupervised, and probabilistic relational models.
For the supervised name disambiguation, Han et al. [2] proposed two supervised approaches by utilizing Naive Bayes and SVM. The two approaches attempt to learn the manually labeled training data and trains a specific classification model for each author’s name. Then, the learned model is used to predict the corresponding real author of each paper. However, one drawback of the supervised approach is its bad scalability, and it is impractical to train the model for each author in a large publication retrieval system.
In the unsupervised setting, clustering algorithms are used to cluster publications, and different publication clusters correspond to different authors. Giles et al. [3] put forward an unsupervised method based on K-way spectral clustering algorithm. For each name dataset, they calculated a Gram matrix and clustered Gram matrix by K-way spectral clustering algorithm to disambiguate this name. Wang et al. [4] introduced a name disambiguation algorithm based on finding atomic clusters using a bias classifier. Arif et al. [5] presented an enhanced vector space model for ambiguous authors and their publications. The three features of email, organization and coauthor are clustered separately, and the next clustering is conducted on the basis of the last cluster to achieve mixed similarity clustering. Some methods [3] clustered data graphs based on node similarity, and some graph clustering methods [7, 8] divide data graphs according to topological structure. In [9], a search engine driven clustering method is proposed. [10] proposed an approach named DISTINCT with two similarity measures including set resemble of neighbor tuples and random walk probability. Zhou et al. [11] put forward a novel graph clustering algorithm called SA-Cluster, based on both structural and attribute similarities. Zhang et al. [1] proposed a new representation learning model by using the anonymous graph of relational data. They used the relational data graph to train an embedded vector for each document, and then solved the name disambiguation task by hierarchical clustering algorithm. Different from other unsupervised method, the method leverages only relational data in the form of anonymized graphs to protect privacy of the people in the dataset.
As for the approach based on probabilistic relational models, Zhang et al. [12] presented a semi-supervised constraint-based probability model for name disambiguation algorithm. They used Hidden Markov Random Fields to formalize the problem of the name disambiguation in the constraint-based probability framework, and then defined six kinds of constraint types, and used EM algorithm to measure different distances for different people. Wang et al. [13] introduced a pairwise factor graph (PFG) and proposed an ADANA method based on the PFG model,aiming to improve the disambiguation performance via active user interactions. Si et al. [14] put forward a conditional random field model for name disambiguation. They defined an objective function and use parameters learning algorithm to get the suitable parameters. Experimental results indicated that their approach significantly outperforms other different traditional clustering methods.
Inspired by these articles, we propose a new name disambiguation method using meta clusters and clustering ensemble. Compared with these articles, our method has the following advantages: 1) No need to manually labeled a training data set. 2) Since the meta clusters finding process only needs to establish edges between nodes, the efficiency of our method is higher. 3) In the final clustering stage, all the articles of the cluster are regarded as an meta cluster, and the features are connected together to calculate the similarity, rather than calculate the similarity using each pair of articles and then take the average to represent the overall similarity. 4) Different from other unsupervised approach, we integrate several clustering algorithms by the voting method in ensemble learning and use the clustering ensemble algorithm to further improve the performance of our method.
Problem formulation
In a publication retrieval system, each author name e may correspond to multiple authors {a1, a2, a3, . . . , ak(e)} Each a i of them is the same name author of e. The number of the same name is the ambiguity of the author e. The degree of ambiguity is represented by k (e). Each paper d has a collection of authors {a1, a2, . . . , a m }. Assuming that the name of a i is e, the remaining authors are called coauthors of paper d, denoted as co(d).
Given an author name e, and a set of publications p (e) written by e, the name disambiguation problem is to split p (e) into different clusters {C1, C2, C3, . . . , C K } to achieve this goal that all the publications in C j belong to the author a i and the publications written by a i are clustered in C i .
The real number of paper clusters in Section 4 is represented as K, and the value of K is equal to the degree of ambiguity. The number of connected subgraphs in Section 5 is denoted as N sub . The details of these notations are illustrated in Table 1.
Notation table
Notation table
In this section, we introduce out proposed

Overview of the MCCE algorithm.
Our solution divides the name disambiguation task into two processes: In the first step, we extract meta clusters from all the publications based on a heuristic rule and two kinds of strong association relationships we defined. In the second step, we represent meta clusters as vectors by using the space vector model, and finally cluster the meta clusters and replace the meta clusters with corresponding publications of each meta clusters to complete the disambiguationprocess.
In the previous related work, all publications are regarded as vectors of multidimensional space, and each publication is regarded as a point in the multidimensional space, and the name disambiguation of all publications is realized by using a clustering method. However, in fact, the publication vector data in the multidimensional vector space is very sparse. Therefore, it is difficult to find the correct cluster center and correctly cluster each publication, which results to the low accuracy of this kind of method.
Through the analysis of the publication data, it is found that there are two publications with the same coauthors, two publications from the same organization, and two publications published in the same venue. Directly clustered by traditional clustering algorithms, these publications with strong association relationships may be separated due to the sparsity of publication data vectors in multidimensional vector space. By clustering these publications with strong association relationships together directly, the possibility of separation of publications with strong association relationships is reduced, which makes it possible to improve accuracy.
However, in some special cases, there are authors of the same name who are from the same organization and publish publications at the same venue, and even more specially, they also come from the same research group, namely they have the same coauthors. This can cause the failure of the name disambiguation algorithm, because it is difficult to have an algorithm that separates these almost completely similar authors. Therefore, we define cluster the publications together under certain condition which is defined as heuristic rule.
The strong association relationships and heuristic rule we defined in the meta cluster finding process are introduced in detail in Section 5.
After completing the meta cluster finding process, we cluster the meta clusters. First, the space vector model is used to obtain the space vector corresponding to every meta cluster and the word frequency-inverse document frequency(TF-IDF) model is used to represent the weight of each dimension in the vector. Then, in general, a clustering algorithm to perform clustering to complete the name disambiguation process. In particular, at this point, we need replace each meta cluster using the corresponding publications to obtain the publication cluster list.
Taking into account that different clustering algorithms have different advantages, this paper realizes a clustering ensemble algorithm by using the voting method which is one combining strategy in ensemble learning. The clustering ensemble algorithm can further improve the performance of name disambiguation algorithm.
The implementation of clustering ensemble is introduced in Section 6.
Meta cluster finding method
Meta cluster refers to cluster of publications with strong association relationships.
In the meta cluster finding process, we treat each publication as a node, and all the publication nodes constitute an undirected graph. In the initial state, all nodes are not connected by edges, and each node is regarded as a subgraph. If there is a strong association relationship between the two publications, an edge should be established between the two nodes. After several edges are established, several connected subgraphs containing one or multiple nodes are formed. Then, it is checked whether the heuristic rule is satisfied. If it is satisfied, each subgraph is regarded as a meta cluster and the graph at this time is used as an input to detect the next strong association relationship. The key point of the process is to set up a heuristic rule and determine whether there is a strong association relationship betweentwo publications.
The heuristic rule used in this paper is defined as follows: after connecting different nodes according to a strong association relationship, the number of connected subgraphs should be larger than the real number of clusters of the publications (N sub > K) On the contrary,if the heuristic rule is not satisfied (N sub < K), this strong association relationship is abandoned to use to find meta clusters and the graph is restored to the initial state. Specially, if the two numbers are equal (N sub = K), the name disambiguation task are completed directly without performing the clustering process.
The pseudo-code of meta finding process of the proposed
In this paper, two kinds of strong association relationships are defined.
The first strong association relationship is that two publications share same coauthors. When the authors’ names of two publications are the same, the probability that their coauthors still share the same name is very small. In fact, in most cases, coauthors of the same-name author’s publications are no longer duplicated. Therefore, if two publications have coauthors with the same name, the two documents have strong association relationship.
The second strong association relationship is defined on the basis of the first strong association relationship. For two meta clusters with the first strong correlation, if the publications in the two meta clusters have the same organization attributes and venue attributes, then two meta clusters have association relationship and should be merged. This paper defines the second strong association relationship is based on the fact that an author may have different academic cooperation circles at different times, that is, there is no the first strong association relationship between these publications, but the publications still belong to the same expert. In addition, publications from the same author have same organizations and belong to the same research area(thus, have same venues).
Using the defined heuristic rule and these two strong association relationships, the final publication graph can be obtained, and each connected subgraph is regarded as a final meta cluster. In particular, subgraphs with only a single publication node are considered special meta clusters respectively.
Clustering ensemble method
In this paper, the voting method in ensemble learning is used to realize the ensemble of various clustering algorithms. The implementation of clustering ensemble can be divided into the following two processes:
The first step is the re-labeling process. Clustering the same data set with different clustering algorithms may produce similar clustering results, but the labels of clustering results are usually different. Taking the dataset D = {x1, x2, . . . , x7} as an example, the clustering label of the first clustering algorithm is [1, 1], and the clustering label of the second clustering algorithm is [3, 1]. Although both algorithms cluster the third and fourth samples, the labels of the two clustering results are different. In fact, the label 3 of the first clustering result is equivalent to the label 1 of the second clustering result. Therefore, the first step of clustering ensemble requires re-labeling of the results of different clustering algorithms. Referring to the work of [15], this paper uses Hungarian algorithm [16] to realize the re-labeling process of clustering results.
The second step is merging process of different clustering results. After realizing the re-labeling of different clustering results, the clustering labels of all clustering results are unified. Then, the cluster labels of each sample are statistically processed. According to the plurality voting method, the clustering labels with the highest frequency are used as the final cluster labels of the sample, which is used as the clustering result of the clustering ensemble algorithm.
If the predicted output of the learner h
i
on the sample
The pseudo-code of clustering ensemble process of the proposed
Experimental datasets
In order to evaluate the MCCE method, similar to those work [1, 13], we chose an open dataset Arnetminer [6] to perform our experiments. The statistics for some name references in Arnetminer are shown in Table 2. In the table, we show the total number of all publications and the total number of different authors corresponding to these name references.
Statistics of the Arnetminer data set
Statistics of the Arnetminer data set
Through the analysis of the data, we found that although the titles of the two publications are similar in meaning, the word forms that reflect this meaning are not necessarily identical. For example: ‘algorithm’ and ‘algorithms’, ‘envolution’ and ‘envolutionary’. Therefore, unlike previous work, we have rooted the title of the paper, turning all the words in the title of the paper into corresponding word roots, and at the same time filtering out the meaningless words to improve the clustering accuracy of space vector models. As for venue feature, we filter out some common phrases, such as: ‘proceedings of...’, ‘journal of...’, which only retains words that reflect the specific subject of the venue’s paper. For the organization features, we unify the abbreviation and full name forms of the organization, and use the full name form of the organization.After data preprocessing, we can get the name disambiguation results by using the MCCE method proposed in this paper.
To evaluate the performance of our method and compare it to other baseline methods, we use pairwise precision, pairwise recall, and pairwise F1, as in these publications [1, 13]. In detail, any two publications annotated with the same label are called the correct pair, and any two publications are predicted with the same label by an algorithm but are labeled differently actually (if they are grouped in the same cluster, we also call them the same label) are called a wrong pair. These evaluation measures are as the follows:
We used the complex network analysis package NetworkX libraries to perform meta cluster finding experiments, then use the clustering algorithms in the scikit-learn libraries to complete the clustering process based on the space vector model. Finally, the Hungarian algorithm provided in the Munkres libraries is used to realize the re-labeling process of different clustering results and the integration of clustering results by the relative majority voting method. In the experiment, we set the real author number as the termination condition of the cluster iteration, normalize the vector using L2-norm and use the cosine similarity method as the similarity measure between connected subgraphs. If the vectors of two meta clusters c1 and c2 are represented as
Last, each connected subgraph in the clustering result is replaced with each publication node in each connected subgraph and then the final clustering result is obtained. In addition, we use the hierarchical clustering algorithm based on the space vector model(baseline method) and the hierarchical clustering algorithm based on the space vector model and meta clusters(HAC with meta clusters) as two comparison methods.
The meta clusters finding process is aimed to prevent them from being separated in the final clustering process due to the sparsity of the vectors by clustering together publications with strong association relationships directly. Therefore, the ideal meta clusters finding result should be that the disambiguation accuracy of the meta cluster data is particularly high, and the disambiguation recall may be very low. Because only the publications with strong association relationships are clustered together.
Taking the publication data of the author Bin Yu in the Arnetminer dataset as an example, the results of the meta cluster finding using the first strong association relationship are shown in Fig. 2. In Fig. 2, connecting the blue line between the two nodes means that there is a common coauthor at least between the two nodes.

Meta cluster finding using the coauthor features.
Then, based on the results of the first period of the meta cluster finding process, the results of the meta cluster finding using the second strong association relationship are shown in Fig. 3. In Fig. 3, the green dashed line connecting the two connected subgraphs indicates that the two connected subgraphs have the same main organization and a common venue at least.

Meta cluster finding using the organization and venue features.
The evaluation results of these two periods of the meta cluster finding process are shown in Table 3, which conforms our expectations for the results of the meta cluster finding process.
The evaluation results of two periods of meta cluster finding
The contribution contrast of two periods of meta cluster finding on the name disambiguation task is shown in Fig. 4.

The contribution contrast histogram of two periods of meta cluster finding.
After the meta clusters are obtained, clustering the meta clusters is carried out using clustering ensemble algorithm. The Pairwise F1 value of the experimental results is shown in Table 4. Figure 5 shows the contrast effect of Pairwise F1 value of three methods by histogram.
The evaluation results (Paiwirse F1) of the three method

The pairwise F1 contrast histogram between the three methods.
As can be seen from Table 4 and Fig. 5, on average, the HAC with meta clusters increases Pairwise F1 value by 20%, the clustering ensemble with meta clusters(MCCE) further increases Pairwise F1 value by 1%.
The implementation details of the cluster ensemble algorithm is as follows. In this paper, four clustering algorithms, namely hierarchical clustering algorithm, Gauss mixture clustering algorithm, K-means clustering algorithm and spectral clustering algorithm, are used. Then, the clustering results of these four clustering algorithms are integrated by re-labeling and voting method. Finally, the name disambiguation results of the clustering ensemble algorithm can be obtained.
The above experiments are in good agreement with the dataset used in [13]. Figure 2 of [13] shows the experimental results of ADANA method and four methods proposed in this paper. The figure is shown in Fig. 6.

Pairwise F1 value contrast histogram of different methods (From [13]).
In the figure above, the Pairwise F1 value of the right four methods, namely SA-Cluster [11], DISTINCT [10], CONSTRAINT [12], HAC [9] is less than 81%, which means the MCCE method performs better. However, the Pairwise F1 value of the ADANA method reaches 89.2%. This is because the ADANA method in [13] not only uses the three features used in this paper, but also uses another two features, namely whether two publications appear simultaneously on the personal homepage of an expert(
This work presented a novel name disambiguation method based on Meta Clusters and Clustering Ensemble(MCCE). This new method allows to dynamically adjust the details of the method according to the specific situation of the problem, such as using different strong association relationships to discover meta clusters, and using different clustering methods to achieve clustering ensemble. The experimental results show that our method has satisfactory performance compared with the clustering-based baseline method. Our approach is just a method framework which can also use more other features for better disambiguation. In our future work, we will introduce the CoHomepage feature and the Citation feature to further improve the MCCE method.
Footnotes
Acknowledgement
This work is supported by the Fundamental Research Funds for the Central Universities.
