Abstract
Multi-label learning has attracted significant attention from machine learning and data mining over the last decade. Although many multi-label classification algorithms have been devised, few research studies focus on multi-assignment clustering (MAC), in which a data instance can be assigned to multiple clusters. The MAC problem is practical in many application domains, such as document clustering, customer segmentation and image clustering. Additionally, specifying the number of clusters is always a difficult but critical problem for a certain class of clustering algorithms. Hence, this work proposes a nonparametric multi-assignment clustering algorithm called multi-assignment Chinese restaurant process (MACRP), which allows the model complexity to grow as more data instances are observed. The proposed algorithm determines the number of clusters from data, so it provides a practical model to process massive data sets. In the proposed algorithm, we devise a novel prior distribution based on the similarity graph to achieve the goal of multi-assignment, and propose a Gibbs sampling algorithm to carry out posterior inference. The implementation in this work uses collapsed Gibbs sampling and compares with several methods. Additionally, previous evaluation metrics used by multi-label classification are inappropriate for MAC, since label information is unavailable. This work further devises an evaluation metric for MAC based on the characteristics of clustering and multi-assignment problems. We conduct experiments on two real data sets, and the experimental results indicate that the proposed method is competitive and outperforms the alternatives on most data sets.
Introduction
Multi-label learning has attracted significant attention from machine learning and data mining over the last decade [1, 40]. One of the reasons is that multi-label learning problems naturally arise in many application domains such as text categorization, music classification, and bioinformatics. For example, a news document may involve several topics such as technology, science, and politics. Similarly, in bioinformatics, a gene may be multi-functional, associating with the functions of “metabolism”, “transcription” and “protein synthesis”. In music information retrieval, a single music sound may be characterized by both “dreamy” and “cheerful”.
Existing multi-label research studies focus on multi-label classification problems, in which each data instance is associated with a set of class labels. Given a label set
The results of MAC and multi-label classification are both associated with a set of class labels, but label information is unavailable for MAC. Clustering is an important task when exploring massive data sets. Additionally, it is difficult to define explicit labels in certain application domains. For example, customer segmentation is an important topic in retail and telecommunications industries, since the result can be used for targeting market. However, it is difficult to define the labels of the customers, so customer segmentation traditionally belongs to clustering problem. Meanwhile, MAC is more appropriate than single label clustering in customer micro-segmentation, since each customer should be naturally assigned to several segments. Besides customer segmentation, MAC problems arise in many application domains such as text clustering and image clustering.
Currently, many clustering algorithms can achieve the goal of MAC by giving clustering results associated with probabilities or fuzziness. Two typical clustering algorithms are Gaussian of mixture models (GMM) and Fuzzy C-Means (FCM) [3], both allowing each data instance to belong to two or more clusters with different degrees of membership. The results can be transformed into MAC outcomes by using a threshold value to filter the cluster assignments. However, using this approach to achieve the goal of MAC is inappropriate owing to three main reasons. First, the objective function of the above algorithms do not consider cluster correlations. Next, it is difficult to define the threshold value for the multi-assignment transformation. Finally, specifying the number of clusters in advance of learning process is necessary for GMM and FCM, but the setting of this parameter is crucial to clustering performance of GMM and FCM. However, when one is confronted with massive data sets, and the goal is to explore structure or the latent information behind the data sets, it is difficult to determine the number of clusters precisely. In many applications, new classes may not exist at the time of training, they may exist but are not known, or their existence may be known but samples are simply not obtainable [10].
The Bayesian nonparametric models provide an alternative to this problem, allowing the model complexity to grow as more data instances are observed. The Bayesian analysis models can place a prior probability distribution on the parameters of the distribution generating the underlying data, giving a way to incorporate prior knowledge information into the analysis. The Dirichlet process (DP) [13] provides a distribution on distributions, and is a stochastic process frequently used as a prior in Bayesian nonparametric statistics. Moreover, the DP has many attractive properties, and has been widely used in practice [16, 34, 32, 27].
This study proposes a MAC algorithm based on Chinese restaurant process (CRP) called multi-assignment Chinese restaurant process (MACRP), in which we design a novel prior distribution to consider cluster correlations. The proposed method is a nonparametric method, allowing the model to determine the number of clusters from the data. The proposed multi-assignment CRP uses the multi-assignment prior along with the likelihood of data instances to model the posterior distribution of cluster assignment. This work gives a collapsed Gibbs sampling algorithm, which is efficient to inference posterior probability distribution of the cluster assignment, and compares with several methods in the experiments.
Current evaluation metrics for multi-label learning focus on classification problems, and various metrics have been proposed in the literature, including Hamming Loss [31], accuracy [7], and precision and recall [17]. Existing evaluation metrics are inappropriate for MAC, since label information is unavailable in clustering applications. Hence, this work further devises a evaluation metric for MAC based on the characteristics of clustering and multi-assignment problems.
The main contribution of this paper is to devise a non-parametric multi-assignment clustering algorithm, and devise a new evaluation metric for MAC. The proposed algorithm allows the model complexity to grow as more data instances are observed. The nonparametric design yields a flexible model in dealing with massive data sets. Additionally, this work proposes a new evaluation metric for MAC problems. We use several algorithms to compare with the proposed algorithm on two data sets, and the experimental results indicate that the proposed algorithm is competitive on the two data sets.
The rest of this paper is organized as follows. Section 2 presents the survey on multi-label learning and introduces the Dirichlet process mixture models. Section 3 presents the proposed MACRP algorithm. Next, Section 4 summarizes the results of several experiments. Section 5 analyzes and discusses the experimental results. Conclusions are finally drawn in Section 6.
Related work
This section provides the survey of multi-label learning. Besides, the proposed algorithm is a Bayesian nonparametric method, explaining why we introduce Dirichlet process mixture models in the following section.
Multi-label classification
Existing multi-label classification methods usually fall into one of two main categories: algorithm adaption and problem transformation [35]. The algorithm adaption methods extend specific learning algorithms to deal with multi-label problems directly. For example, AdaBoost.MH and AdaBoost.MR [30] are two extensions of AdaBoost for multi-label data. Clare and King [8] developed re-sampling strategies and modified the algorithm C4.5 to deal with multi-label classification problems presented in bioinformatic data. Conversely, the problem transformation methods transform the multi-label problems into a series of single label sub-problems. One of the advantages of problem transformation over algorithm adaption is that problem transformation can use any off-the-shelf classifier to deal with the sub-problems. Representative algorithms include binary relevance (BR) [7], label ranking [15] and label power-set [38]. The BR and label power-set both rely on classification technique to tackle multi-label classification problems. Given
Although the number of classifiers is linear, the correlations between labels are not considered in the BR. Effective exploitation of the label correlation information is deemed to be crucial for the success of multi-label learning techniques. Recently, many approaches have been proposed to incorporate label correlations into the model to further improve classification performance [9, 18, 20]. Compared to the BR, the power-set approach considers label correlations by incorporating pairwise, or potentially even higher order, label interactions. It is apparent that the number of combinations is enormous, so the key challenge of learning from large-scale multi-label data lies in the overwhelming size of output space [44]. Hsu et al. [19] proposed to use compressed sensing to reduce the dimension of the labels, and learn a predictor of these reduced labels. Bi and Kwok [4] used randomized sampling procedure to select a small subset of class labels that can approximately span the original label space. Tai and Lin [33] proposed a hypercube view to model multi-label problems and devised a multi-label classification algorithm called principal label space transformation (PLST) to capture key correlations between labels before learning. The random
Multi-assignment clustering
Compared with multi-label classification, few research studies focus on MAC problems. One of the reasons is that label information is unavailable in clustering applications, and specifying the number of clusters is not a straightforward task. Additionally, existing evaluation metrics are inappropriate to evaluate MAC problems. However, assigning a data instance to multiple clusters is reasonable and practical in many application domains. Recently, Frank et al. [14] proposed a multi-assignment algorithm, which is abbreviated as B-MAC in this work, to cluster Boolean data from generative viewpoint, in which multiple clusters can simultaneously generate a data instance. This study uses Bayesian nonparametric approach to devise a MAC algorithm, and the proposed algorithm can be applied to more application domains than B-MAC.
Dirichlet process mixture models
Dirichlet process (DP) mixture models are the cornerstone of nonparametric Bayesian statistics [6]. Let
The stick-breaking construction provides a constructive definition of the DP, and indicates that the mixing proportion decreases exponentially quickly, so only a small number of components will be used to model the data. Besides stick-breaking construction, one can further analyze the DP using different representations. Marginalizing out
The CRP gets the name from the following metaphor. Imagine a Chinese restaurant with an infinite number of tables each with infinite capacity, and a sequence of
As shown in Eq. (1), the CRP is a nonparametric model, meaning that the number of occupied tables grows as more customers enter the restaurant. The larger
It is infeasible to perform exact inference of a DP mixture model, so many approximation inference methods have been developed over the past few years. The development of Markov chain methods provides a systematic approach to the computation of the posterior distribution of the parameters [11, 42, 24, 12, 26, 21]. Neal [26] presented several Markov chain methods for sampling from distribution of a DP mixture model. The Markov chain Monte Carlo (MCMC) sampling methods are nondeterministic approaches, and their convergence can be difficult to diagnose. Besides MCMC, variational inference provides an alternative, deterministic methodology for approximating likelihoods and posteriors [6, 41].
Notation
This section describes the notation used throughout this work. As mentioned above, the MACRP is a multi-assignment clustering algorithm, so label information is unavailable during the learning process. Meanwhile, this work uses the terminologies used by CRP to describe the notation. For example, a table is correspondent to a cluster in the MACRP. The
Note that the number of possible tables from CRP metaphor is infinite, and this work introduces a variable
Multi-assignment prior
This works focuses on MAC problems, and proposes a multi-assignment prior based on similarity graph to encode the correlation between clusters, which allows the posterior distribution to consider the cluster correlations elegantly. Consider an undirected graph
We then introduce a
where
Illustration of a similarity graph.
Exploiting the label correlations between labels is critical to multi-label learning. In order to explain the connection between the label correlations and the proposed multi-assignment prior, we use a toy example to illustrate the connection. Figure 1 shows a similarity graph, which comprises four vertices and six edges. Note that the entries on the diagonal are the same. Then, using the weight matrix
Besides the prior distribution listed in Eq. (3), the likelihood
As shown in Eq. (5), the cluster assignments allows the model to create a new cluster, indicating that the proposed method is a nonparametric method. Besides the posterior sampling for all existing cluster and a new cluster, the model has to determine possible clusters that generate the data instance
Note that
This work performs posterior inference by using Markov chain Monte Carlo (MCMC) methods, which are the most widely used posterior inference methods in Bayesian nonparametric models. The MCMC constructs a Markov chain on the hidden variables that has the desired distribution as its equilibrium distribution. Then one can obtain samples from the posterior distribution by drawing samples from the Markov chain. Gibbs sampling is a simple and widely applicable MCMC algorithms, and it can be viewed as a special case of the Metropolis-Hastings algorithm. Gibbs sampling is applicable when the joint distribution is unknown, but the conditional distribution of each variable is known. The Markov chain is constructed by considering the conditional distribution of each hidden variable given the others and the observations.
The posterior distribution of the indicator variable
Multi-assignment Chinese Restaurant Process with Gibbs Sampling Algorithm
Get a data
The marginalization of some variables from a joint distribution always reduces the variance due to the Rao-Blackwell Theorem [22]. In a conjugate context, one can integrate analytically over
Algorithm 2 shows the collapsed sampling of MACRP, and this work implements Algorithm 2 in the experiments. There are
Multi-assignment Chinese Restaurant Process with Collapsed Gibbs Sampling Algorithm
Get a data
In the experiments, we conduct experiments on two data sets, which are text documents. Documents are conventionally modeled by multinomial distribution over words, and this work follows the same setting. Meanwhile, we use Dirichlet distribution to model the parameters of the multinomial distribution. The experiments use several methods to compare with the proposed algorithm on two data sets listed below.
The medical dataset is collected from the Cincinnati Children’s Hospital Medical Center’s (CCHMC) Department of Radiology [28]. It contains 978 data instances and follows HIPAA standards which include three steps: disambiguation, anonymization, and data scrubbing. Ueda and Saito [39] categorized real Web pages linked from the “yahoo.com” domain, which comprises 14 top-level categories, and each category is classified into a number of second-level subcategories. They maked 14 independent text categorization problems by focusing on the second-level categories. This work uses eight of them in the experiments, including Arts, Computers, Education, Entertainment, Health, Recreation, Reference, and Science.
Although several multi-label classification evaluation metrics have been proposed over the last decade, they are inappropriate for MAC. One of the reasons is that label information is unavailable for the data instances in clustering applications. The B-MAC [14] used a evaluation measure called generalization error, which is based on hamming distance, but it is not appropriate in this work since the generalization error only applies to application domains of Boolean data. Therefore, this work proposes a new metric for evaluating MAC problems based on pairwise information between each pair of data instances as used by
Notation of evaluation metric for MAC
Notation of evaluation metric for MAC
True Positives (TP) The clustering algorithms predict the two data instances in a pair intersect, and data corpus also has them in intersection. Multi-assignment True Positives (MTP) The measurement of TP can be used to evaluate the correctness of predicted positive pairs, but it is insufficient in MAC problems. More detailed conditions should be considered for MAC, explaining why we further devise a MTP measurement to discriminate the degree of TP under different conditions. For the pairs that satisfy TP condition, we further consider the difference between the number of correct intersections and the number of intersections for predicted cluster assignment pairs. Let First, this work introduces
Then, we use the absolute difference between the real size and expected size of predicting cluster assignments to represent the losses of prediction results as presented in Eq. (13).
Next we score the two data instances using Eq. (4.1). If CL equals to zero, then the result is the best, and TPS will be one. Conversely, a large value of CL causes the TPS to approach zero. The final score of MTP is the average of
False Positives (FP) The clustering algorithms predict the two data instances in a pair intersect, but data corpus has them in no intersection. True Negatives (TN) The clustering algorithms predict the two data instances in a pair have no intersection, and data corpus also has them in no intersection. False Negatives (FN) The clustering algorithms predict the two data instances in a pair have no intersection, but data corpus has them in intersection.
Similar to traditional information retrieval definition, Eq. (4.1) shows the formulas of precision, recall and
Evaluation example
To explain how the above evaluation works, this work uses a toy example with available label information listed in Table 2 to demonstrate the evaluation process. Note that the label information is absent in clustering applications, and we can only consider whether two data instances are in the same cluster or not.
Pair 1 ( Pair 2 ( The cluster assignments of the two data instances have no intersection in correct cluster assignments, but intersect in predicted cluster assignments. Thus, the number of FP increases by one. Pair 3 ( The cluster assignments of the two data instances have intersection in correct cluster assignments, but no intersection in predicted cluster assignments. Thus, the number of FN increases by one. Pair 4 ( The cluster assignments of the two data instances have no intersection in correct cluster assignments and predicted cluster assignments. Thus, the number of TN increase by one. Pair 5 ( The number of TP increases one, and Pair 6 ( The number of TP increases by one, and Pair 7 ( The number of FN increases by one. Pair 8 ( The number of TP increases one, and Pair 9 ( The number of FN increases by one. Pair 10 ( The number of TP increases by one, and
Finall, we can get the results, where precision
This work conducts experiments on medical and yahoo data sets, and applies several algorithms to the two data sets to compare with the proposed method. The proposed method is an unsupervised learning, and is a nonparametric method, which allows the model complexity to grow as more data instances are observed.
As for the comparison methods, this work uses the algorithms provided by Mulan [37], which is an open-source Java library for learning from multi-label data sets. We use three multi-label classification methods, including binary relevance (BR), multi-label
We repeat each experiment five times and use the average and standard deviation of the results to present the experimental result, in which mean plus or minus two standard deviations is presented in the tables. Note that the results of those multi-label classification methods in Mulan library have no standard deviation because all the results are the same in five-time experiments. The multi-label classification methods require training phases to learn classification models, and then use the learned models to classify testing data. Thus, the experiments on multi-label classification methods split the data sets into training data and testing data. Additionally, the proposed method is a nonparametric method, which allows the model to detect new clusters in the leaning process. To further investigate the detection on new clusters, the training set does not comprise complete cluster set, and the experimental settings are presented in Table 3. The experiments were conducted on a personal computer with Intel Core i7 CPU and 16 GB RAM. The proposed method is implemented using MATLAB. As mentioned above, the proposed method requires to specify hyperparameters
Experimental settings of multi-assignment clustering
Experimental settings of multi-assignment clustering
Experimental results (mean
As shown in Table 4, the proposed algorithm outperforms other methods on most data sets. Additionally, the proposed method is an unsupervised method, indicating that it does not require labeled data for model training. The proposed MACRP algorithm significantly outperforms B-MAC algorithm, and this result may be explained by the fact that MACRP considers correlations between clusters. Additionally, MACRP can naturally handle different data types, while B-MAC algorithm can only be applied to Boolean data. Besides MAC experiments, this work further analyzes the proposed algorithm in this section.
Performance evaluation with different threshold values.
The proposed method uses the formula listed in Eq. (6) to determine the cluster set, so we investigate the parameter
Number of clusters detected with different threshold values.
Performance evaluation with different values of 
As shown in Eq. (11), the proposed method uses a concentration parameter
Number of clusters detected with different values of 
Performance evaluation using different cluster selection method.
This work uses the proposed formula listed in Eq. (6) to determine the final cluster set. The proposed scheme possesses a good property, namely it can shrink to single cluster algorithm or expand to use all clusters as cluster set by adjusting a parameter
Number of clusters detected using different cluster selection method.
The experimental results are presented in Figs 6 and 7. The experimental results indicate that the new cluster selection scheme can achieve the best performance when the experiments set
This work focuses on MAC problems, and devises a nonparametric model, which considers cluster correlation in the prior. Central to the proposed algorithm is using the proposed multi-assignment prior along with the likelihood to perform posterior of the cluster assignment. Besides multi-assignment clustering, the proposed method is a nonparametric model, which is flexible to deal with big data problems with weaker assumptions and allows the model to adapt to observed data. The implementation of the proposed algorithm is Matlab and the experimental results on two real data sets have shown that the proposed MACRP is flexible owing to the characteristic of nonparametric model. Note that the algorithm should be implemented with other languages and platforms, such as Apache Spark, to process big data problems. Additionally, previous evaluation metrics on multi-label classification are inappropriate for MAC problems, explaining why we propose a new evaluation metric in this paper. This work focuses on clustering, and the future work is to devise an algorithm to automatically generate appropriate labels or tags for the multi-assignment results. Another research direction is to devise variational algorithms to deal with MAC problems.
Footnotes
Acknowledgments
This work was supported in part by Ministry of Science and Technology, Taiwan, under Grant No. MOST 104-2218-E-009-028.
Appendix
The posterior inference with collapsed Gibbs sampling on the cluster assignments listed in Eq. (11) can be further represented as Eq. (Appendix), where
We use
