Nonparametric multi-assignment clustering

Abstract

Multi-label learning has attracted significant attention from machine learning and data mining over the last decade. Although many multi-label classification algorithms have been devised, few research studies focus on multi-assignment clustering (MAC), in which a data instance can be assigned to multiple clusters. The MAC problem is practical in many application domains, such as document clustering, customer segmentation and image clustering. Additionally, specifying the number of clusters is always a difficult but critical problem for a certain class of clustering algorithms. Hence, this work proposes a nonparametric multi-assignment clustering algorithm called multi-assignment Chinese restaurant process (MACRP), which allows the model complexity to grow as more data instances are observed. The proposed algorithm determines the number of clusters from data, so it provides a practical model to process massive data sets. In the proposed algorithm, we devise a novel prior distribution based on the similarity graph to achieve the goal of multi-assignment, and propose a Gibbs sampling algorithm to carry out posterior inference. The implementation in this work uses collapsed Gibbs sampling and compares with several methods. Additionally, previous evaluation metrics used by multi-label classification are inappropriate for MAC, since label information is unavailable. This work further devises an evaluation metric for MAC based on the characteristics of clustering and multi-assignment problems. We conduct experiments on two real data sets, and the experimental results indicate that the proposed method is competitive and outperforms the alternatives on most data sets.

Keywords

Multi-assignment clustering; Chinese restaurant process (CRP); Non-parametric Bayesian

1. Introduction

Multi-label learning has attracted significant attention from machine learning and data mining over the last decade [1, 40]. One of the reasons is that multi-label learning problems naturally arise in many application domains such as text categorization, music classification, and bioinformatics. For example, a news document may involve several topics such as technology, science, and politics. Similarly, in bioinformatics, a gene may be multi-functional, associating with the functions of “metabolism”, “transcription” and “protein synthesis”. In music information retrieval, a single music sound may be characterized by both “dreamy” and “cheerful”.

Existing multi-label research studies focus on multi-label classification problems, in which each data instance is associated with a set of class labels. Given a label set $\mathcal{L}$ , the prediction on data instance $\mathbf{x}_{i}$ is $\mathrm{Y}$ , where $\mathrm{Y}\subseteq\mathcal{L}$ . Traditional classification is thus the special case of multi-label classification with $|\mathrm{Y}|=1$ . Several multi-label classification algorithms have been developed recently, and they can be generally divided into two categories, namely problem transformation methods and algorithm adaption methods [35]. Binary relevance (BR) is a typical problem transformation method: it uses a one-against-all approach to decompose the original multi-label problem into $|\mathcal{L}|$ binary classification problems. Conversely, the algorithm adaption methods extend specific learning algorithms to deal with multi-label problems directly. However, few previous research studies directly focus on multi-assignment clustering (MAC).

The results of MAC and multi-label classification are both associated with a set of class labels, but label information is unavailable for MAC. Clustering is an important task when exploring massive data sets. Additionally, it is difficult to define explicit labels in certain application domains. For example, customer segmentation is an important topic in the retail and telecommunications industries, since the result can be used for target marketing. However, it is difficult to define labels for customers, so customer segmentation is traditionally treated as a clustering problem. Meanwhile, MAC is more appropriate than single-assignment clustering in customer micro-segmentation, since each customer naturally belongs to several segments. Besides customer segmentation, MAC problems arise in many application domains such as text clustering and image clustering.

Currently, many clustering algorithms can achieve the goal of MAC by giving clustering results associated with probabilities or fuzziness. Two typical clustering algorithms are Gaussian mixture models (GMM) and Fuzzy C-Means (FCM) [3], both allowing each data instance to belong to two or more clusters with different degrees of membership. The results can be transformed into MAC outcomes by using a threshold value to filter the cluster assignments. However, this approach is inappropriate for MAC for three main reasons. First, the objective functions of the above algorithms do not consider cluster correlations. Next, it is difficult to define the threshold value for the multi-assignment transformation. Finally, GMM and FCM require the number of clusters to be specified before learning, and their clustering performance depends crucially on this setting. When one is confronted with massive data sets and the goal is to explore the structure or latent information behind them, it is difficult to determine the number of clusters precisely. In many applications, new classes may not exist at the time of training, they may exist but are not known, or their existence may be known but samples are simply not obtainable [10].

The Bayesian nonparametric models provide an alternative to this problem, allowing the model complexity to grow as more data instances are observed. The Bayesian analysis models can place a prior probability distribution on the parameters of the distribution generating the underlying data, giving a way to incorporate prior knowledge information into the analysis. The Dirichlet process (DP) [13] provides a distribution on distributions, and is a stochastic process frequently used as a prior in Bayesian nonparametric statistics. Moreover, the DP has many attractive properties, and has been widely used in practice [16, 34, 32, 27].

This study proposes a MAC algorithm based on the Chinese restaurant process (CRP) called multi-assignment Chinese restaurant process (MACRP), in which we design a novel prior distribution to consider cluster correlations. The proposed method is nonparametric, allowing the model to determine the number of clusters from the data. The proposed multi-assignment CRP uses the multi-assignment prior along with the likelihood of data instances to model the posterior distribution of the cluster assignment. This work gives a collapsed Gibbs sampling algorithm, which is efficient for inferring the posterior distribution of the cluster assignment, and compares it with several methods in the experiments.

Current evaluation metrics for multi-label learning focus on classification problems, and various metrics have been proposed in the literature, including Hamming Loss [31], accuracy [7], and precision and recall [17]. Existing evaluation metrics are inappropriate for MAC, since label information is unavailable in clustering applications. Hence, this work further devises an evaluation metric for MAC based on the characteristics of clustering and multi-assignment problems.

The main contributions of this paper are a non-parametric multi-assignment clustering algorithm and a new evaluation metric for MAC. The proposed algorithm allows the model complexity to grow as more data instances are observed, and this nonparametric design yields a flexible model for dealing with massive data sets. We compare the proposed algorithm with several algorithms on two data sets, and the experimental results indicate that the proposed algorithm is competitive on both.

The rest of this paper is organized as follows. Section 2 presents the survey on multi-label learning and introduces the Dirichlet process mixture models. Section 3 presents the proposed MACRP algorithm. Next, Section 4 summarizes the results of several experiments. Section 5 analyzes and discusses the experimental results. Conclusions are finally drawn in Section 6.

2. Related work

This section surveys multi-label learning. Because the proposed algorithm is a Bayesian nonparametric method, it also introduces Dirichlet process mixture models.

2.1 Multi-label classification

Existing multi-label classification methods usually fall into one of two main categories: algorithm adaption and problem transformation [35]. The algorithm adaption methods extend specific learning algorithms to deal with multi-label problems directly. For example, AdaBoost.MH and AdaBoost.MR [30] are two extensions of AdaBoost for multi-label data. Clare and King [8] developed re-sampling strategies and modified the algorithm C4.5 to deal with multi-label classification problems presented in bioinformatic data. Conversely, the problem transformation methods transform the multi-label problems into a series of single label sub-problems. One of the advantages of problem transformation over algorithm adaption is that problem transformation can use any off-the-shelf classifier to deal with the sub-problems. Representative algorithms include binary relevance (BR) [7], label ranking [15] and label power-set [38]. The BR and label power-set both rely on classification technique to tackle multi-label classification problems. Given $\mathcal{L}$ class labels, the BR independently trains binary classifiers $H:\mathbf{X}\rightarrow\{l,\neg l\}$ , each of which focuses on a different label $l\in\mathcal{L}$ . Conversely, the power-set considers each different set of labels that exists in the combinations of the labels, namely, it learns classifiers $H:\mathbf{X}\rightarrow\mathbb{P}(\mathcal{L})$ , in which $\mathbb{P}(\mathcal{L})$ is the power set of $\mathcal{L}$ . Intuitively, the multi-label classification can be reduced to a ranking problem, in which the topmost labels are more related with the new data instance. However, post-processing is required to transform the ranking of labels into a set of labels.
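The two problem transformation schemes described above can be illustrated with a minimal Python sketch. The function names and the toy label sets are hypothetical and only show how the training targets are derived; no classifier is trained here.

```python
from typing import Dict, FrozenSet, List, Sequence, Set, Tuple


def binary_relevance_targets(
    label_sets: Sequence[Set[str]], labels: Sequence[str]
) -> Dict[str, List[int]]:
    """Binary relevance: one 0/1 target vector per label (one-against-all)."""
    return {l: [1 if l in y else 0 for y in label_sets] for l in labels}


def label_powerset_targets(
    label_sets: Sequence[Set[str]],
) -> Tuple[List[int], Dict[FrozenSet[str], int]]:
    """Label power-set: every distinct label set becomes one class id."""
    class_ids: Dict[FrozenSet[str], int] = {}
    targets: List[int] = []
    for y in label_sets:
        key = frozenset(y)
        if key not in class_ids:
            class_ids[key] = len(class_ids)  # next unused class id
        targets.append(class_ids[key])
    return targets, class_ids


# Toy data (hypothetical): four documents with their topic sets.
label_sets = [
    {"science", "technology"},
    {"politics"},
    {"science", "technology"},
    {"science"},
]
labels = ["politics", "science", "technology"]

br = binary_relevance_targets(label_sets, labels)
lp, classes = label_powerset_targets(label_sets)
```

Binary relevance yields one binary problem per label and ignores co-occurrence, whereas the power-set encoding keeps co-occurring labels together at the cost of one class per observed label combination.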

Although the number of classifiers is linear, the correlations between labels are not considered in the BR. Effective exploitation of the label correlation information is deemed to be crucial for the success of multi-label learning techniques. Recently, many approaches have been proposed to incorporate label correlations into the model to further improve classification performance [9, 18, 20]. Compared to the BR, the power-set approach considers label correlations by incorporating pairwise, or potentially even higher order, label interactions. It is apparent that the number of combinations is enormous, so the key challenge of learning from large-scale multi-label data lies in the overwhelming size of output space [44]. Hsu et al. [19] proposed to use compressed sensing to reduce the dimension of the labels, and learn a predictor of these reduced labels. Bi and Kwok [4] used randomized sampling procedure to select a small subset of class labels that can approximately span the original label space. Tai and Lin [33] proposed a hypercube view to model multi-label problems and devised a multi-label classification algorithm called principal label space transformation (PLST) to capture key correlations between labels before learning. The random $k$ -label [36] ensembles multiple power-set classifiers, each trained on a random subset of the actual labels.

2.2 Multi-assignment clustering

Compared with multi-label classification, few research studies focus on MAC problems. One of the reasons is that label information is unavailable in clustering applications, and specifying the number of clusters is not a straightforward task. Additionally, existing evaluation metrics are inappropriate for evaluating MAC problems. However, assigning a data instance to multiple clusters is reasonable and practical in many application domains. Recently, Frank et al. [14] proposed a multi-assignment algorithm, abbreviated as B-MAC in this work, to cluster Boolean data from a generative viewpoint, in which multiple clusters can simultaneously generate a data instance. This study uses a Bayesian nonparametric approach to devise a MAC algorithm, and the proposed algorithm can be applied to more application domains than B-MAC.

2.3 Dirichlet process mixture models

Dirichlet process (DP) mixture models are the cornerstone of nonparametric Bayesian statistics [6]. Let $G$ be a random measure drawn from a DP, parameterized by a base distribution $G_{0}$ and a positive concentration parameter $\alpha$ . Suppose the data instance $\mathbf{x}_{i}$ is drawn from a mixture of distributions of the form $F(\theta)$ with the mixing distribution over $\theta$ being $G$ . This gives the following DP mixture models:

$\displaystyle G\sim DP(\alpha,G_{0})$

$\displaystyle\theta_{i}|G\sim G$

$\displaystyle\mathbf{x}_{i}|\theta_{i}\sim F(\theta_{i})$

The stick-breaking construction provides a constructive definition of the DP, and indicates that the mixing proportions decrease exponentially quickly, so only a small number of components will be used to model the data. Besides the stick-breaking construction, one can further analyze the DP using different representations. Marginalizing out $G$ , the conditional distribution of $\theta_{i}$ given $\theta_{1},\ldots,\theta_{i-1}$ follows a Polya urn scheme [5], in which $\theta_{i}$ has positive probability of being equal to one of the previous draws, namely the DP exhibits a self-reinforcing property. Let the $N$ samples $\theta_{1},\ldots,\theta_{N}$ take on $k$ distinct values, then the $k$ distinct values of the variables define a partition of the sequence $1,\ldots,N$ . The induced distribution over partitions is a Chinese restaurant process (CRP) [2, 29]. Note that the Polya urn scheme and the CRP both make it clear that a DP imposes a clustering structure on the observations with a self-reinforcing property, often expressed as "the rich get richer".

The CRP gets the name from the following metaphor. Imagine a Chinese restaurant with an infinite number of tables each with infinite capacity, and a sequence of $N$ customers enter the restaurant and sit down. The first customer comes in the restaurant and sits at the first table. The $i$ th customer sits at an occupied table $j$ with probability proportional to the number of already seated customers $m_{j}$ , or sits at a new table with probability proportional to $\alpha$ . Let $z_{i}$ indicate the table associated with $i$ th customer, then the predictive distribution of $z_{i}$ is presented in Eq. (1), in which $z_{1:i-1}$ denotes the tables from $z_{1}$ to $z_{i-1}$ , and $k$ indicates the number of occupied tables.

$\displaystyle P(z_{i}=j|z_{1:i-1},\alpha)\propto\left\{\begin{array}{l l}\frac{m_{j}}{i-1+\alpha}&\quad\text{if}\,j\leqslant k\\ \frac{\alpha}{i-1+\alpha}&\quad\text{if}\,j=k+1\end{array}\right.$ (1)

As shown in Eq. (1), the CRP is a nonparametric model, meaning that the number of occupied tables grows as more customers enter the restaurant. The larger $m_{j}$ is, the higher the probability that table $j$ will grow. Therefore, the CRP has the clustering property, explaining why we use the CRP as the base model of MACRP. Additionally, the concentration parameter $\alpha$ determines how likely a customer is to sit at a new table, namely generating a new cluster component. Many variations of the CRP have been proposed over the last decade. For example, Liu et al. [23] proposed an online learning algorithm called Online CRP, which requires label information to adjust the model parameters. Note that MACRP and Online CRP are entirely different algorithms. First, MACRP is a batch learning algorithm, while Online CRP is an online learning algorithm. Second, MACRP is proposed to tackle multi-assignment clustering problems, while Online CRP addresses single-label classification problems.
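The seating rule in Eq. (1) can be simulated with a short Python sketch. The function names are hypothetical, tables are indexed from 0 rather than 1, and the code covers only the standard CRP prior, not the multi-assignment extension proposed later.

```python
import random
from typing import List


def crp_probabilities(counts: List[int], alpha: float) -> List[float]:
    """Eq. (1): seating probabilities for the k occupied tables plus a new one."""
    total = sum(counts) + alpha  # equals i - 1 + alpha for the i-th customer
    return [m / total for m in counts] + [alpha / total]


def sample_crp(n_customers: int, alpha: float, rng: random.Random) -> List[int]:
    """Seat customers one by one and return the table index of each customer."""
    counts: List[int] = []       # m_j for every occupied table
    assignments: List[int] = []  # z_i for every customer
    for _ in range(n_customers):
        probs = crp_probabilities(counts, alpha)
        table = rng.choices(range(len(probs)), weights=probs)[0]
        if table == len(counts):  # the last index stands for a new table
            counts.append(0)
        counts[table] += 1
        assignments.append(table)
    return assignments


seating = sample_crp(n_customers=20, alpha=1.0, rng=random.Random(0))
```

Larger values of `alpha` make the last entry of `crp_probabilities` larger, so more tables tend to be opened for the same number of customers.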

It is infeasible to perform exact inference of a DP mixture model, so many approximation inference methods have been developed over the past few years. The development of Markov chain methods provides a systematic approach to the computation of the posterior distribution of the parameters [11, 42, 24, 12, 26, 21]. Neal [26] presented several Markov chain methods for sampling from distribution of a DP mixture model. The Markov chain Monte Carlo (MCMC) sampling methods are nondeterministic approaches, and their convergence can be difficult to diagnose. Besides MCMC, variational inference provides an alternative, deterministic methodology for approximating likelihoods and posteriors [6, 41].

3. Multi-assignment Chinese restaurant process (MACRP)

3.1 Notation

This section describes the notation used throughout this work. As mentioned above, the MACRP is a multi-assignment clustering algorithm, so label information is unavailable during the learning process. This work uses the terminology of the CRP to describe the notation. For example, a table corresponds to a cluster in the MACRP. The $i$ th customer is analogous to the $i$ th data instance $\mathbf{x}_{i}$ , and customers sitting at the same table indicate that the corresponding data instances belong to the same cluster. Meanwhile, the dish at table $j$ is analogous to the parameter of cluster $j$ , denoted as $\theta_{j}$ , and the number of customers sitting at table $j$ corresponds to the number of data instances belonging to cluster $j$ , denoted as $m_{j}$ . Thus, the preference of customer $\mathbf{x}_{i}$ for the dish served at table $j$ is represented as $H(\mathbf{x}_{i},\theta_{j})$ , which is the likelihood of data instance $\mathbf{x}_{i}$ belonging to cluster $j$ . Hereafter, following CRP terminology, the terms table and cluster are used interchangeably in this work.

Note that the number of possible tables from CRP metaphor is infinite, and this work introduces a variable $k$ to indicate the number of occupied tables. The base distribution and concentration parameters are $G_{0}$ and $\alpha$ , respectively. Finally, this study introduces $\mathbf{x}_{-i}$ to represent all the customers except customer $i$ , and $\mathbf{x}_{-i,j}$ represents all the customers in table $j$ except customer $i$ . The same representation is applied to $z$ .

3.2 Multi-assignment prior

This work focuses on MAC problems, and proposes a multi-assignment prior based on a similarity graph to encode the correlation between clusters, which allows the posterior distribution to take the cluster correlations into account. Consider an undirected graph $G=(V,E)$ with vertex set $V=\{v_{1},\ldots,v_{k}\}$ , in which each edge between $v_{i}$ and $v_{j}$ carries a non-negative weight $w_{ij}\geqslant 0$ . The weighted adjacency matrix of the graph is the matrix $\mathbf{W}=(w_{ij})_{i,j=1,\ldots,k}$ . If $w_{ij}>0$ , then $v_{i}$ and $v_{j}$ are connected. Otherwise, they are disconnected. In the graph, each vertex denotes a cluster, and an edge between two vertices represents the similarity of the two clusters. Without loss of generality, we simply use $i$ and $j$ to represent cluster $i$ and cluster $j$ , respectively, while $v_{i}$ and $v_{j}$ denote their corresponding vertices in the graph. Note that $w_{ij}$ denotes the weight of the edge between $v_{i}$ and $v_{j}$ , and this work determines the value of $w_{ij}$ in terms of $\theta_{i}$ and $\theta_{j}$ , which are the parameters of clusters $i$ and $j$ , respectively. Because the graph is undirected, $\mathbf{W}$ is a symmetric matrix, and the degree of a vertex $v_{i}$ is defined as

$\displaystyle d_{i}=\sum\limits_{j=1}^{k}w_{ij}.$ (2)

We then introduce a $k\times k$ diagonal matrix $\mathbf{D}$ with the degrees $d_{1},\ldots,d_{k}$ on the diagonal. This work proposes to use a graph to model MAC problems, and uses edges to model the correlation between clusters. Based on the similarity graph, this work devises a novel prior distribution that is called multi-assignment prior to consider the similarity between clusters during the learning process. The $i$ th subsequent customer sits at an occupied table, or at the next unoccupied table according to the following distribution:

$\displaystyle P(j\in z_{i}|z_{1:i-1},\alpha)\propto\left\{\begin{array}{l l}d_{j}&\quad\text{if}\,j\leqslant k\\ \alpha&\quad\text{if}\,j=k+1\end{array}\right.$ (3)

where $d_{j}$ is the degree of vertex $j$ , namely table $j$ . Note that the similarity graph prior is probabilistic even though the degrees mean the summation of weights. The $i$ th customer sits at an occupied table $j$ with probability proportional to degree of table $j$ , or sits at a new table with probability proportional to $\alpha$ .

Figure 1. Illustration of a similarity graph.

Exploiting label correlations is critical to multi-label learning. We use a toy example to illustrate the connection between label correlations and the proposed multi-assignment prior. Figure 1 shows a similarity graph, which comprises four vertices and six edges. Note that the entries on the diagonal are the same. Then, using the weight matrix $\mathbf{W}$ as listed in Eq. (4), the degrees, such as $d_{1}=w_{11}+w_{12}+w_{13}+w_{14}$ and $d_{2}=w_{21}+w_{22}+w_{23}+w_{24}$ , can be obtained. As shown in Fig. 1, vertex 1 and vertex 2 are close to each other, indicating that $w_{12}$ and $w_{21}$ are large as compared with other edges. Large values of $w_{12}$ and $w_{21}$ contribute more weight to $d_{1}$ and $d_{2}$ than to $d_{3}$ and $d_{4}$ . Thus, using $d_{j}$ as the prior can model the correlation among clusters.

$\displaystyle\mathbf{W}=\left[\begin{array}{cccc}w_{11}&\mathbf{w_{12}}&w_{13}&w_{14}\\ \mathbf{w_{21}}&w_{22}&w_{23}&w_{24}\\ w_{31}&w_{32}&w_{33}&w_{34}\\ w_{41}&w_{42}&w_{43}&w_{44}\end{array}\right]$ (4)
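As a minimal sketch of Eqs. (2) and (3), the following Python code computes the vertex degrees from a weight matrix and normalizes them into a prior over the four existing tables plus one new table. The numeric weights and the value of $\alpha$ are hypothetical, chosen only so that $w_{12}=w_{21}$ is the largest off-diagonal entry as in the toy example. Normalizing by the sum is an assumption, since Eq. (3) is stated only up to proportionality.

```python
from typing import List


def degrees(W: List[List[float]]) -> List[float]:
    """Eq. (2): the degree d_i is the sum of row i of the weight matrix."""
    return [sum(row) for row in W]


def multi_assignment_prior(W: List[List[float]], alpha: float) -> List[float]:
    """Eq. (3), normalized: prior over the k existing tables plus a new one."""
    d = degrees(W)
    total = sum(d) + alpha
    return [dj / total for dj in d] + [alpha / total]


# Hypothetical symmetric weights; clusters 1 and 2 are the most similar pair.
W = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]

prior = multi_assignment_prior(W, alpha=0.4)
# Degrees are 2.2, 2.4, 1.6 and 1.4, so tables 1 and 2 receive the
# largest prior mass (0.275 and 0.3), and the new table receives 0.05.
```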

Besides the prior distribution listed in Eq. (3), the likelihood $P(\mathbf{x}_{i}|\theta_{j})$ is the same as $H(\mathbf{x}_{i},\theta_{j})$ mentioned above, since the likelihood is used to measure how likely $\mathbf{x}_{i}$ belongs to cluster $j$ . The combination of multi-assignment prior with the likelihood provides a way to estimate the posterior of the cluster assignment, and it can be formulated as the following distribution:

$\displaystyle P(j\in z_{i}|z_{1:i-1},\mathbf{x}_{i},\Theta,G_{0},\alpha)\propto\left\{\begin{array}{l l}d_{j}\cdot H(\mathbf{x}_{i},\theta_{j})&\text{if}\,j\leqslant k\\ \alpha\cdot\int H(\mathbf{x}_{i},\theta_{j})dG_{0}(\theta_{j})&\text{if}\,j=k+1\end{array}\right.$ (5)

3.3 Multi-cluster assignment

As shown in Eq. (5), the cluster assignment allows the model to create a new cluster, indicating that the proposed method is a nonparametric method. Besides the posterior sampling for all existing clusters and a new cluster, the model has to determine the possible clusters that generate the data instance $\mathbf{x}_{i}$ . This work devises a scheme to assign multiple clusters to each data instance. The model predicts the cluster assignments $z_{i}$ for data instance $\mathbf{x}_{i}$ according to Eq. (6), where $j$ ( $1\leqslant j\leqslant k+1$ ) is the cluster assignment and $\eta$ is defined in Eq. (7). The value of $\eta$ is dynamic and determined by a parameter $\epsilon$ ( $0\leqslant\epsilon\leqslant 1$ ) together with $\rho_{\max}$ and $\rho_{\min}$ as listed in Eqs. (8) and (9), which represent the probabilities of the most and the least likely cluster assignments for the data instance $\mathbf{x}_{i}$ .

$\displaystyle z_{i}=\{j|P(j\in z_{i}|z_{1:i-1},\mathbf{x}_{i},\Theta,G_{0},\alpha)\geqslant\eta\}$ (6)

$\displaystyle\eta=\rho_{\max}-(\rho_{\max}-\rho_{\min})\cdot\epsilon$ (7)

$\displaystyle\rho_{\max}=\max_{j=1,\ldots,k+1}{P(j\in z_{i}|z_{1:i-1},\mathbf{x}_{i},\Theta,G_{0},\alpha)}$ (8)

$\displaystyle\rho_{\min}=\min_{j=1,\ldots,k+1}{P(j\in z_{i}|z_{1:i-1},\mathbf{x}_{i},\Theta,G_{0},\alpha)}$ (9)

Note that $\epsilon$ controls the cardinality of the predicted cluster assignments $z_{i}$ . When $\epsilon$ is $0$ , the threshold $\eta$ equals $\rho_{\max}$ and $z_{i}$ reduces to the single most likely cluster. Conversely, when $\epsilon$ is $1$ , the threshold equals $\rho_{\min}$ and the predicted cluster assignments $z_{i}$ of $\mathbf{x}_{i}$ comprise all clusters.
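The thresholding scheme of Eqs. (6) to (9) can be sketched in Python as follows. The function name and the example probabilities are hypothetical; the input is assumed to be the normalized posterior over the $k+1$ candidate clusters, and clusters are numbered from 1 as in the text.

```python
from typing import List, Sequence


def select_clusters(posterior: Sequence[float], epsilon: float) -> List[int]:
    """Return the 1-based indices j whose posterior probability is >= eta."""
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must lie in [0, 1]")
    rho_max = max(posterior)  # Eq. (8)
    rho_min = min(posterior)  # Eq. (9)
    # Eq. (7), written as a convex combination. This is algebraically equal to
    # rho_max - (rho_max - rho_min) * epsilon and is exact at epsilon = 0 and 1.
    eta = (1.0 - epsilon) * rho_max + epsilon * rho_min
    return [j for j, p in enumerate(posterior, start=1) if p >= eta]  # Eq. (6)


# Hypothetical posterior over three existing clusters and one new cluster.
posterior = [0.5, 0.25, 0.125, 0.125]

single = select_clusters(posterior, epsilon=0.0)      # only the most likely cluster
partial = select_clusters(posterior, epsilon=0.75)    # clusters 1 and 2
everything = select_clusters(posterior, epsilon=1.0)  # all k + 1 clusters
```

Because $\eta$ is defined relative to $\rho_{\max}$ and $\rho_{\min}$ of each data instance, the same $\epsilon$ can yield different numbers of assignments for different instances.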

3.4 Multi-assignment posterior inference

This work performs posterior inference by using Markov chain Monte Carlo (MCMC) methods, which are the most widely used posterior inference methods in Bayesian nonparametric models. MCMC constructs a Markov chain on the hidden variables that has the desired distribution as its equilibrium distribution. Then one can obtain samples from the posterior distribution by drawing samples from the Markov chain. Gibbs sampling is a simple and widely applicable MCMC algorithm, and it can be viewed as a special case of the Metropolis-Hastings algorithm. Gibbs sampling is applicable when the joint distribution is not known explicitly or is difficult to sample from directly, but the conditional distribution of each variable is known. The Markov chain is constructed by considering the conditional distribution of each hidden variable given the others and the observations.

The posterior distribution of the indicator variable $z_{i}$ conditional on the other variables is listed in Eq. (5). As for $\theta_{j}$ , the conditional distribution of $\theta_{j}$ is presented in Eq. (10), in which $\mathbf{x}_{i}$ represents the $i$ th data instance and $\mathcal{L}(\mathbf{x}_{i}|\theta_{j})$ is the likelihood of $\mathbf{x}_{i}$ given $\theta_{j}$ . Algorithm 1 shows the proposed method with Gibbs sampling algorithm.

$\displaystyle P(\theta_{j}|\theta_{j,-i},\mathbf{x}_{i},G_{0},\alpha)\propto G_{0}(\theta_{j})\mathcal{L}(\mathbf{x}_{i}|\theta_{j})$ (10)

Algorithm 1. Multi-assignment Chinese Restaurant Process with Gibbs Sampling Algorithm

Input: The dispersion parameter $\alpha$ and base distribution $G_{0}$

    $k\leftarrow$ the cluster number obtained from training algorithm
    for Gibbs Iteration do
        for $i\leftarrow 1$ to $N^{test}$ do
            Get a data $\mathbf{x}_{i}$
            Sample $z_{i}$ according to Eq. (5)
            Sample a new $\theta_{z_{i}}$ according to Eq. (10)
            if $k+1\in z_{i}$ then
                $k\leftarrow k+1$
            end
        end
    end

3.5 Posterior inference with collapsed Gibbs sampling

By the Rao-Blackwell theorem [22], marginalizing some variables out of a joint distribution does not increase the variance of the resulting estimator and typically reduces it. In a conjugate context, one can integrate analytically over $\theta_{j}$ , eliminating them from the algorithm to simplify the sampling process. Then, for each data instance $\mathbf{x}_{i}$ , we only need to sample $z_{i}$ using Eq. (11), in which $F(\theta_{j})_{-i}$ is the posterior probability of $\theta_{j}$ conditional on $\mathbf{x}_{-i,j}$ and $G_{0}$ . Since the Dirichlet distribution is the conjugate prior of the multinomial distribution, Eq. (11) can be represented in a simple form as presented in the Appendix.
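To illustrate why conjugacy removes the need to sample $\theta_{j}$ , the following Python sketch evaluates the standard Dirichlet-multinomial predictive probability of a word-count vector given the word counts already assigned to a cluster. This is the textbook closed form rather than the expression from the Appendix, which is not reproduced here. The function name and the symmetric Dirichlet parameter `beta` are assumptions.

```python
import math
from typing import Sequence


def log_dm_predictive(
    x: Sequence[int], cluster_counts: Sequence[int], beta: float
) -> float:
    """Log probability of word counts x with the multinomial parameter integrated out.

    cluster_counts holds the per-word counts of the data already in the cluster,
    and beta is a symmetric Dirichlet prior parameter (an assumption here).
    """
    a = [beta + c for c in cluster_counts]  # Dirichlet posterior parameters
    n = sum(x)
    a_sum = sum(a)
    # Multinomial coefficient n! / prod_v x_v!
    log_p = math.lgamma(n + 1) - sum(math.lgamma(v + 1) for v in x)
    # Gamma(A) / Gamma(A + n)
    log_p += math.lgamma(a_sum) - math.lgamma(a_sum + n)
    # prod_v Gamma(a_v + x_v) / Gamma(a_v)
    log_p += sum(math.lgamma(av + xv) - math.lgamma(av) for av, xv in zip(a, x))
    return log_p


# A cluster that has seen word 0 three times and word 1 once (hypothetical).
p_word0 = math.exp(log_dm_predictive([1, 0], [3, 1], beta=1.0))  # (3 + 1) / (4 + 2)
p_word1 = math.exp(log_dm_predictive([0, 1], [3, 1], beta=1.0))  # (1 + 1) / (4 + 2)
```

For an empty cluster, `cluster_counts` is all zeros and the same function gives the marginal likelihood under the prior alone, which plays the role of the new-table term.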

Algorithm 2 shows the collapsed sampling of MACRP, and this work implements Algorithm 2 in the experiments. There are $k$ clusters, so the sampling process repeats $k$ times to estimate the posterior probability for each of the $k$ clusters. Note that $k$ is not a fixed number, and it varies as more data instances are observed. Besides the existing clusters, the model should consider the probability of assigning $\mathbf{x}_{i}$ to a new cluster, which is labeled as $k+1$ in Eq. (11). The model selects the most probable cluster from the $k+1$ clusters and calls it $z_{i}$ in the algorithm. When the predicted cluster is available, the model updates the sufficient statistics of $z_{i}$ with the addition of $\mathbf{x}_{i}$ as shown in Lines 5–6. If the table $k+1$ is in $z_{i}$ , we let the number of clusters, namely $k$ , increase. This process is listed in Lines 7–9.

Algorithm 2. Multi-assignment Chinese Restaurant Process with Collapsed Gibbs Sampling Algorithm

Input: The dispersion parameter $\alpha$ and base distribution $G_{0}$

    1  $k\leftarrow$ the cluster number obtained from training algorithm
    2  for Gibbs Iteration do
    3      for $i\leftarrow 1$ to $N^{test}$ do
    4          Get a data $\mathbf{x}_{i}$
    5          Sample $z_{i}$ according to Eq. (11)
    6          Update the sufficient statistics of table $z_{i}$ on joining $\mathbf{x}_{i}$ to table $z_{i}$
    7          if $k+1\in z_{i}$ then
    8              $k\leftarrow k+1$
    9          end
    10     end
    11 end

4. Experiments

We conduct experiments on two data sets, both of which consist of text documents. Documents are conventionally modeled by a multinomial distribution over words, and this work follows the same setting. Meanwhile, we use a Dirichlet distribution to model the parameters of the multinomial distribution. The experiments compare the proposed algorithm with several methods on the two data sets listed below.

• The medical dataset is collected from the Cincinnati Children’s Hospital Medical Center’s (CCHMC) Department of Radiology [28]. It contains 978 data instances and follows HIPAA standards, which involve three steps: disambiguation, anonymization, and data scrubbing.
• Ueda and Saito [39] categorized real Web pages linked from the “yahoo.com” domain, which comprises 14 top-level categories, and each category is classified into a number of second-level subcategories. They made 14 independent text categorization problems by focusing on the second-level categories. This work uses eight of them in the experiments, including Arts, Computers, Education, Entertainment, Health, Recreation, Reference, and Science.

4.1 Evaluation measurements

Although several multi-label classification evaluation metrics have been proposed over the last decade, they are inappropriate for MAC. One of the reasons is that label information is unavailable for the data instances in clustering applications. B-MAC [14] used an evaluation measure called generalization error, which is based on Hamming distance, but it is not appropriate in this work since the generalization error only applies to application domains with Boolean data. Therefore, this work proposes a new metric for evaluating MAC problems based on pairwise information between each pair of data instances, as used by the $F_{1}$ cluster evaluation measure [25]. The MAC problem is more complicated than single-assignment clustering, so this work introduces additional notation as listed in Table 1.

Table 1
Notation of evaluation metric for MAC

Notation	Meaning
MTP	The multi-assignment true positives (MTP) is to evaluate the predicting result between $i$ th data and $j$ th data.
NEL	The expected number of clusters predicted by the algorithm.
CL	The difference between expected number of clusters and the number of clusters predicted by algorithm is defined as cluster loss (CL).
TPS	The true positives score used to score the prediction result.

• True Positives (TP)

The clustering algorithm predicts that the cluster assignments of the two data instances in a pair intersect, and their assignments in the data corpus also intersect.

• Multi-assignment True Positives (MTP)

The measurement of TP can be used to evaluate the correctness of predicted positive pairs, but it is insufficient in MAC problems. More detailed conditions should be considered for MAC, explaining why we further devise an MTP measurement to discriminate the degree of TP under different conditions. For the pairs that satisfy the TP condition, we further consider the difference between the number of correct intersections and the number of intersections for predicted cluster assignment pairs. Let $Y$ denote the set of correct cluster assignments in the data corpus, and $Z$ denote the set of predicted cluster assignments. For the $i$ th data, we use $Y_{i}$ and $Z_{i}$ to represent its correct and predicted cluster assignments, respectively. For the pair comprising the $i$ th and $j$ th data, the proposed evaluation metric calculates the multi-assignment true positives (MTP) score to evaluate the prediction result. The calculation of the MTP is presented as follows.

First, this work introduces $\text{NEL}_{i}$ and $\text{NEL}_{j}$ as listed in Eq. (12) to denote the expected number of predicted cluster assignments of the $i$ th and $j$ th data, in which we use the ratio of $|Z_{i}\cap Z_{j}|$ to $|Y_{i}\cap Y_{j}|$ to examine whether clusters should split or combine. If clusters should split, then more cluster assignments are expected in the prediction result.

$\displaystyle\text{NEL}_{i}=\frac{|Z_{i}\cap Z_{j}|}{|Y_{i}\cap Y_{j}|}\times|Y_{i}|$

$\displaystyle\text{NEL}_{j}=\frac{|Z_{i}\cap Z_{j}|}{|Y_{i}\cap Y_{j}|}\times|Y_{j}|$ (12)

Then, we use the absolute difference between the actual size and the expected size of the predicted cluster assignments to represent the loss of the prediction result, as presented in Eq. (13).

$\displaystyle\text{CL}_{i}=\text{abs}(|Z_{i}|-\text{NEL}_{i})$

$\displaystyle\text{CL}_{j}=\text{abs}(|Z_{j}|-\text{NEL}_{j})$ (13)

Next, we score the two data instances using Eq. (14). If CL equals zero, the result is the best, and TPS is one. Conversely, a larger value of CL moves the TPS toward zero, and the TPS is set to zero once CL exceeds NEL. The final MTP score is the average of $\text{TPS}_{i}$ and $\text{TPS}_{j}$ .

$\displaystyle\text{TPS}_{i}=\left\{\begin{array}{ll}\frac{\text{NEL}_{i}-\text{CL}_{i}}{\text{NEL}_{i}}&\quad\text{if}\,\text{CL}_{i}\leqslant\text{NEL}_{i}\\ 0&\quad\text{if}\,\text{CL}_{i}>\text{NEL}_{i}\end{array}\right.$

$\displaystyle\text{TPS}_{j}=\left\{\begin{array}{ll}\frac{\text{NEL}_{j}-\text{CL}_{j}}{\text{NEL}_{j}}&\quad\text{if}\,\text{CL}_{j}\leqslant\text{NEL}_{j}\\ 0&\quad\text{if}\,\text{CL}_{j}>\text{NEL}_{j}\end{array}\right.$

$\displaystyle\text{MTP}=\frac{\text{TPS}_{i}+\text{TPS}_{j}}{2}$ (14)

• False Positives (FP)

The clustering algorithm predicts that the cluster assignments of the two data instances in a pair intersect, but their correct cluster assignments in the data corpus do not intersect.

• True Negatives (TN)

The clustering algorithm predicts that the cluster assignments of the two data instances in a pair do not intersect, and their correct cluster assignments in the data corpus do not intersect either.

• False Negatives (FN)

The clustering algorithm predicts that the cluster assignments of the two data instances in a pair do not intersect, but their correct cluster assignments in the data corpus intersect.

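Similar to the traditional information retrieval definitions, Eq. (15) shows the formulas for precision, recall and $F_{1}$ , where MTP denotes the sum of the MTP scores over all TP pairs, and TP, FP and FN denote the numbers of pairs in each category.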

$\displaystyle\text{Precision}=\frac{\text{MTP}}{\text{TP}+\text{FP}}$

$\displaystyle\text{Recall}=\frac{\text{MTP}}{\text{TP}+\text{FN}}$

$\displaystyle F_{1}=\frac{2\times\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}$ (15)
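As a minimal sketch of Eqs (12)–(14), the following Python function computes the MTP score of a single pair from its correct and predicted assignment sets. The function name `mtp_score` and the use of Python sets are illustrative assumptions and are not taken from the original MATLAB implementation. The function presumes that the pair already satisfies the TP condition, so both intersections are non-empty.

```python
def mtp_score(y_i, y_j, z_i, z_j):
    """MTP of one true-positive pair, following Eqs (12)-(14).

    y_i, y_j are the correct assignment sets; z_i, z_j the predicted ones.
    Assumes the pair is a TP, i.e. both intersections are non-empty.
    """
    ratio = len(z_i & z_j) / len(y_i & y_j)

    def tps(y, z):
        nel = ratio * len(y)        # Eq. (12): expected number of assignments
        cl = abs(len(z) - nel)      # Eq. (13): cluster loss
        return (nel - cl) / nel if cl <= nel else 0.0  # Eq. (14)

    return (tps(y_i, z_i) + tps(y_j, z_j)) / 2


# Correct {1, 2, 3} vs predicted {1, 2}, paired with correct {1} vs predicted {1}.
print(round(mtp_score({1, 2, 3}, {1}, {1, 2}, {1}), 3))  # 0.833
```

In this call the ratio of intersections is one, so the first instance is expected to have three assignments but has two (TPS of 2/3), while the second matches its expectation exactly (TPS of one).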

Table 2

Evaluation example

	Correct cluster assignments	Predicted cluster assignments
$\mathbf{x}_{1}$	2,3	3,4
$\mathbf{x}_{2}$	1,2	2,3
$\mathbf{x}_{3}$	1	2,4
$\mathbf{x}_{4}$	1,2,3	1,2
$\mathbf{x}_{5}$	1	1

To explain how the above evaluation works, this work uses a toy example with available label information listed in Table 2 to demonstrate the evaluation process. Note that the label information is absent in clustering applications, and we can only consider whether two data instances are in the same cluster or not.

• Pair 1 ( $\mathbf{x}_{1}$ and $\mathbf{x}_{2}$ )

The cluster assignments of the two data instances intersect in both the correct and the predicted cluster assignments, so the pair belongs to TP. Therefore, the number of TP increases by one. Additionally, we have to compute the MTP, which is one here. The result can be obtained from the above formulas, in which $\text{NEL}_{1}=2$ , $\text{NEL}_{2}=2$ , $\text{CL}_{1}=0$ , and $\text{CL}_{2}=0$ .

• Pair 2 ( $\mathbf{x}_{1}$ and $\mathbf{x}_{3}$ )

The cluster assignments of the two data instances have no intersection in the correct cluster assignments, but intersect in the predicted cluster assignments. Thus, the number of FP increases by one.

• Pair 3 ( $\mathbf{x}_{1}$ and $\mathbf{x}_{4}$ )

The cluster assignments of the two data instances intersect in the correct cluster assignments, but have no intersection in the predicted cluster assignments. Thus, the number of FN increases by one.

• Pair 4 ( $\mathbf{x}_{1}$ and $\mathbf{x}_{5}$ )

The cluster assignments of the two data instances have no intersection in either the correct or the predicted cluster assignments. Thus, the number of TN increases by one.

• Pair 5 ( $\mathbf{x}_{2}$ and $\mathbf{x}_{3}$ )

The number of TP increases by one, and $\text{MTP}=0.5$ , since $\text{NEL}_{2}=2$ , $\text{NEL}_{3}=1$ , $\text{CL}_{2}=0$ , and $\text{CL}_{3}=1$ give $\text{TPS}_{2}=1$ and $\text{TPS}_{3}=0$ .

• Pair 6 ( $\mathbf{x}_{2}$ and $\mathbf{x}_{4}$ )

The number of TP increases by one, and $\text{MTP}=0.333$ , since $\text{NEL}_{2}=1$ , $\text{NEL}_{4}=1.5$ , $\text{CL}_{2}=1$ , and $\text{CL}_{4}=0.5$ give $\text{TPS}_{2}=0$ and $\text{TPS}_{4}=0.667$ .

• Pair 7 ( $\mathbf{x}_{2}$ and $\mathbf{x}_{5}$ )

The number of FN increases by one.

• Pair 8 ( $\mathbf{x}_{3}$ and $\mathbf{x}_{4}$ )

The number of TP increases by one, and $\text{MTP}=0.333$ , since $\text{NEL}_{3}=1$ , $\text{NEL}_{4}=3$ , $\text{CL}_{3}=1$ , and $\text{CL}_{4}=1$ give $\text{TPS}_{3}=0$ and $\text{TPS}_{4}=0.667$ .

• Pair 9 ( $\mathbf{x}_{3}$ and $\mathbf{x}_{5}$ )

The number of FN increases by one.

• Pair 10 ( $\mathbf{x}_{4}$ and $\mathbf{x}_{5}$ )

The number of TP increases by one, and $\text{MTP}=0.833$ , since $\text{NEL}_{4}=3$ , $\text{NEL}_{5}=1$ , $\text{CL}_{4}=1$ , and $\text{CL}_{5}=0$ give $\text{TPS}_{4}=0.667$ and $\text{TPS}_{5}=1$ .

Finally, we obtain precision $=$ 0.5, recall $=$ 0.375, and $F_{1}$ $=$ 0.4286 from the totals TP $=$ 5, MTP $=$ 3, FN $=$ 3, and FP $=$ 1.
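The toy example can be reproduced with a short Python sketch that enumerates all ten pairs of Table 2, classifies each pair as TP, FP, TN or FN, accumulates the MTP scores of the TP pairs, and applies Eq. (15). The names `TABLE_2`, `tps` and `evaluate` are illustrative assumptions, and degenerate cases (for example, no predicted positive pairs) are not handled.

```python
from itertools import combinations

# Table 2: instance -> (correct assignments, predicted assignments)
TABLE_2 = {
    1: ({2, 3}, {3, 4}),
    2: ({1, 2}, {2, 3}),
    3: ({1}, {2, 4}),
    4: ({1, 2, 3}, {1, 2}),
    5: ({1}, {1}),
}


def tps(y, z, ratio):
    """True positives score of one instance, Eqs (12)-(14)."""
    nel = ratio * len(y)
    cl = abs(len(z) - nel)
    return (nel - cl) / nel if cl <= nel else 0.0


def evaluate(data):
    """Pairwise evaluation: pair counts, summed MTP, precision, recall, F1."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    mtp = 0.0
    for a, b in combinations(sorted(data), 2):
        (y_a, z_a), (y_b, z_b) = data[a], data[b]
        true_hit, pred_hit = y_a & y_b, z_a & z_b
        if true_hit and pred_hit:
            counts["TP"] += 1
            ratio = len(pred_hit) / len(true_hit)
            mtp += (tps(y_a, z_a, ratio) + tps(y_b, z_b, ratio)) / 2
        elif pred_hit:
            counts["FP"] += 1
        elif true_hit:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    precision = mtp / (counts["TP"] + counts["FP"])   # Eq. (15)
    recall = mtp / (counts["TP"] + counts["FN"])
    f1 = 2 * precision * recall / (precision + recall)
    return counts, mtp, precision, recall, f1


counts, mtp, precision, recall, f1 = evaluate(TABLE_2)
print(counts)  # {'TP': 5, 'FP': 1, 'TN': 1, 'FN': 3}
print(round(mtp, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# 3.0 0.5 0.375 0.4286
```

Only set intersections are used, which matches the remark that cluster labels themselves are not comparable between the correct and predicted assignments.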

4.2 Document analysis experiments

This work conducts experiments on the medical and Yahoo data sets, and applies several algorithms to the two data sets for comparison with the proposed method. The proposed method is an unsupervised, nonparametric learning method, which allows the model complexity to grow as more data instances are observed.

As for the comparison methods, this work uses the algorithms provided by Mulan [37], which is an open-source Java library for learning from multi-label data sets. We use three typical multi-label classification methods: binary relevance (BR), multi-label $k$ nearest neighbor (MLkNN) [43] and random $k$ -labelsets (RAkEL) [36]. Besides traditional multi-label classification methods, we also compare with a MAC algorithm called B-MAC, which was proposed by Frank et al. [14].

We repeat each experiment five times and report the mean and standard deviation of the results; the tables present the mean plus or minus two standard deviations. Note that the results of the multi-label classification methods in the Mulan library have no standard deviation because all five runs give the same result.

The multi-label classification methods require a training phase to learn classification models, and then use the learned models to classify testing data. Thus, the experiments on multi-label classification methods split the data sets into training data and testing data. Additionally, the proposed method is a nonparametric method, which allows the model to detect new clusters in the learning process. To further investigate the detection of new clusters, the training set does not contain the complete cluster set; the experimental settings are presented in Table 3.

The experiments were conducted on a personal computer with an Intel Core i7 CPU and 16 GB RAM. The proposed method is implemented in MATLAB. As mentioned above, the proposed method requires specifying the hyperparameters $\alpha$ and $\epsilon$ , in which $\alpha$ determines how likely a data instance is to be assigned to a new cluster, and $\epsilon$ determines the cluster sets obtained from the posterior inference results. Although the proposed method does not require domain experts to specify the number of clusters, they can verify the clustering outcomes and suggest adjustments to the values of $\alpha$ and $\epsilon$ . The experiments set $\alpha$ to 0.01 and $\epsilon$ to 0.5. Table 4 presents the experimental results.
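The "mean plus or minus two standard deviations" reporting convention described above can be sketched in a few lines of Python. The helper name `summarize` and the five run values are hypothetical, and the sample standard deviation (dividing by n - 1) is an assumption, since the text does not state which estimator is used.

```python
from statistics import mean, stdev


def summarize(scores):
    """Format repeated-run scores as 'mean ± 2 × std' (sample std assumed)."""
    return f"{mean(scores):.4f} ± {2 * stdev(scores):.4f}"


# Hypothetical scores from five runs (illustrative only, not from the paper).
runs = [0.30, 0.32, 0.31, 0.33, 0.34]
print(summarize(runs))
```

A deterministic method that returns the same score in every run would give a standard deviation of zero, which is why the Mulan columns of Table 4 report a single number.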

Table 3
Experimental settings of multi-assignment clustering

Data set	Number of data	Number of all labels	Number of unknown labels	Percentage of training data
medical	978	45	11	64.42%
Arts	7,484	26	6	62.25%
Computers	12,444	33	7	62.94%
Education	12,030	33	7	62.33%
Entertainment	12,730	21	6	76.79%
Health	9,205	32	6	75.32%
Recreation	12,828	22	5	56.49%
Reference	8,027	33	8	59.96%
Science	6,428	40	13	52.10%

Table 4

Experimental results (mean $\pm$ 2 $\times$ std)

	BR	MLkNN	RAkEL	MACRP	B-MAC
medical	0.1923	0.2094	0.1013	0.3205 $\pm$ 0.0730	0.1512 $\pm$ 0.0052
Arts	0.1978	0.0025	0.0710	0.1965 $\pm$ 0.0078	0.1393 $\pm$ 0.0705
Computers	0.1731	0.3525	0.3029	0.2912 $\pm$ 0.0179	0.1702 $\pm$ 0.0625
Education	0.1787	0.0432	0.1573	0.1849 $\pm$ 0.0107	0.1407 $\pm$ 0.0415
Entertainment	0.2324	0.0570	0.0719	0.2425 $\pm$ 0.0226	0.1491 $\pm$ 0.0378
Health	0.1872	0.2684	0.2616	0.3429 $\pm$ 0.0425	0.1902 $\pm$ 0.0736
Recreation	0.2479	0.0013	0.0835	0.1906 $\pm$ 0.0041	0.1402 $\pm$ 0.0178
Reference	0.1223	0.2889	0.2052	0.3680 $\pm$ 0.0458	0.1311 $\pm$ 0.0296
Science	0.1250	0.1042	0.1089	0.1997 $\pm$ 0.0218	0.1238 $\pm$ 0.0325

5. Discussion and analysis

As shown in Table 4, the proposed algorithm achieves the best score on six of the nine data sets; BR scores higher on Arts and Recreation, and MLkNN scores higher on Computers. Additionally, the proposed method is unsupervised, so it does not require labeled data for model training. The proposed MACRP algorithm also outperforms the B-MAC algorithm on all nine data sets, and this result may be explained by the fact that MACRP considers correlations between clusters. Moreover, MACRP can naturally handle different data types, while the B-MAC algorithm can only be applied to Boolean data. Besides the MAC experiments, this work further analyzes the proposed algorithm in this section.

Figure 2.

Performance evaluation with different threshold values.

5.1 The effect of $\epsilon$

The proposed method uses the formula listed in Eq. (6) to determine the cluster set, so we investigate the parameter $\epsilon$ involved in Eq. (6). We conduct experiments on the medical data set using different values of $\epsilon$ . Figures 2 and 3 show the experimental results, including the evaluation performance and the number of clusters detected by the proposed method. The best $F_{1}$ value appears where $\epsilon$ equals 0.5. When $\epsilon$ is larger than 0.5, the performance drops as expected, since a larger value of $\epsilon$ results in more clusters. Conversely, when $\epsilon$ becomes smaller, the impact of $\epsilon$ on the performance is minor. When $\epsilon$ is zero, the model reduces to a single-cluster model, namely one that assigns each data instance to a single cluster.

Figure 3.

Number of clusters detected with different threshold values.

Figure 4.

Performance evaluation with different values of $\alpha$ .

5.2 The effect of concentration parameter $\alpha$

As shown in Eq. (11), the proposed method uses a concentration parameter $\alpha$ to control the probability of creating a new cluster. Therefore, this work further conducts experiments on the medical data set to analyze the impact of $\alpha$ on the performance and on the number of clusters generated by the model. The experimental results are presented in Figs 4 and 5. We repeat the experiment five times, and use the average as the experimental result. As shown in Fig. 5, when the value of $\alpha$ becomes larger, it is easier to create new clusters. The result conforms to the posterior distribution listed in Eq. (11). More clusters also mean that the chance of creating useless clusters increases, leading to performance degradation as presented in Fig. 4.
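To illustrate why a larger $\alpha$ makes new clusters more likely, the sketch below uses the standard single-assignment Chinese restaurant process rather than the multi-assignment prior of Eq. (11): when $i$ customers are already seated, the next customer opens a new table with probability $\alpha/(\alpha+i)$ . The function names, the listed values of $\alpha$ , and the choice of 978 customers (the size of the medical data set in Table 3) are illustrative assumptions. The sketch covers only the prior, whereas the number of clusters found by MACRP also depends on the likelihood.

```python
import random


def expected_tables(alpha, n):
    """Expected number of tables after n customers in a standard CRP."""
    return sum(alpha / (alpha + i) for i in range(n))


def simulate_crp(alpha, n, seed=0):
    """Seat n customers one by one and return the list of table sizes."""
    rng = random.Random(seed)
    tables = []
    for i in range(n):  # i customers are already seated
        if rng.random() < alpha / (alpha + i):
            tables.append(1)  # open a new table
        else:
            # join an existing table with probability proportional to its size
            j = rng.choices(range(len(tables)), weights=tables)[0]
            tables[j] += 1
    return tables


for alpha in (0.01, 0.1, 1.0, 10.0):
    sizes = simulate_crp(alpha, 978)
    print(alpha, round(expected_tables(alpha, 978), 2), len(sizes))
```

The expected number of tables grows monotonically with $\alpha$ , which is the qualitative behaviour that the experiments on the number of clusters are probing.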

Figure 5.

Number of clusters detected with different values of $\alpha$ .

Figure 6.

Performance evaluation using different cluster selection method.

5.3 Using different cluster selection method

This work uses the proposed formula listed in Eq. (6) to determine the final cluster set. The proposed scheme has a useful property: by adjusting a single parameter $\epsilon$ , where $0\leqslant\epsilon\leqslant 1$ , it can shrink to a single-cluster algorithm or expand to use all clusters as the cluster set. When $\epsilon$ equals zero, $\eta$ is $\rho_{\max}$ , indicating that only the cluster with the maximum posterior probability is selected. Conversely, when $\epsilon$ is one, $\eta$ becomes $\rho_{\min}$ , indicating that all of the clusters are selected. Besides the proposed scheme, this work conducts experiments with a different scheme as listed in Eq. (16), where the definitions and settings for $\eta$ , $\rho_{\max}$ , and $\epsilon$ are the same as in Eq. (6).

$\displaystyle\eta=\rho_{\max}-\rho_{\max}\cdot\epsilon,$ (16)

Figure 7.

Number of clusters detected using different cluster selection method.

The experimental results are presented in Figs 6 and 7. They indicate that the new cluster selection scheme achieves its best performance when $\epsilon$ is set to 0.5. However, the number of clusters fluctuates between 29 and 51 when the value of $\epsilon$ ranges from 0.6 to 0.8. Conversely, using the proposed scheme listed in Eq. (6) leads to a stable result as presented in Fig. 3. In addition, adjusting $\epsilon$ alone controls the final cluster sets, so Eq. (6) provides a reasonable and explainable cluster selection scheme.
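The two thresholding schemes can be compared with a small Python sketch. Eq. (6) is not reproduced in this section, so `eta_interpolated` is only one assumed form, a linear interpolation that matches the two endpoints stated above ( $\rho_{\max}$ at $\epsilon=0$ and $\rho_{\min}$ at $\epsilon=1$ ), and it may differ from the actual Eq. (6). The rule that a cluster is selected when its posterior probability reaches $\eta$ is likewise an assumption, and the probabilities in `rho` are hypothetical. Only `eta_eq16` follows a formula given in the text, Eq. (16).

```python
def eta_interpolated(rho, eps):
    """Assumed threshold: rho_max at eps = 0, rho_min at eps = 1."""
    return (1 - eps) * max(rho) + eps * min(rho)


def eta_eq16(rho, eps):
    """Alternative threshold of Eq. (16): rho_max - rho_max * eps."""
    return max(rho) - max(rho) * eps


def select(rho, eta):
    """Assumed selection rule: keep clusters whose posterior reaches eta."""
    return [j for j, p in enumerate(rho) if p >= eta]


rho = [0.4, 0.25, 0.22, 0.13]  # hypothetical posterior probabilities
for eps in (0.0, 0.5, 1.0):
    print(eps, select(rho, eta_interpolated(rho, eps)), select(rho, eta_eq16(rho, eps)))
```

Under these assumptions both schemes agree at the endpoints, while for intermediate $\epsilon$ the threshold of Eq. (16) is lower because it ignores $\rho_{\min}$ , so it tends to admit more clusters.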

6. Conclusion

This work focuses on MAC problems, and devises a nonparametric model that considers cluster correlation in the prior. Central to the proposed algorithm is the use of the proposed multi-assignment prior along with the likelihood to perform posterior inference on the cluster assignments. Besides supporting multi-assignment clustering, the proposed method is a nonparametric model, which makes weaker assumptions, adapts to the observed data, and is therefore flexible enough to deal with big data problems. The proposed algorithm is implemented in MATLAB, and the experimental results on two real data sets show that the proposed MACRP is flexible owing to the characteristics of the nonparametric model. Note that the algorithm should be implemented with other languages and platforms, such as Apache Spark, to process big data problems. Additionally, previous evaluation metrics for multi-label classification are inappropriate for MAC problems, which is why we propose a new evaluation metric in this paper. This work focuses on clustering, and future work is to devise an algorithm that automatically generates appropriate labels or tags for the multi-assignment results. Another research direction is to devise variational algorithms to deal with MAC problems.

Acknowledgments

This work was supported in part by Ministry of Science and Technology, Taiwan, under Grant No. MOST 104-2218-E-009-028.

Appendix

The posterior inference with collapsed Gibbs sampling on the cluster assignments listed in Eq. (11) can be further represented as Eq. (17), where $\beta$ is the parameter of the Dirichlet prior, the definition of $C(\beta)$ is presented in Eq. (18), $N(\mathbf{x}_{i})$ is the word count of data $\mathbf{x}_{i}$ , and $N_{Y=j}(\mathbf{x}_{-i})$ is the word count of those data in the $j$ th table except data $\mathbf{x}_{i}$ .

$\displaystyle P(j\in z_{i}|z_{-i},\mathbf{x}_{i},\Theta,G_{0},\alpha)\propto\left\{\begin{array}{ll}d_{j}\int H(\mathbf{x}_{i},\theta_{j})\mathrm{d}F(\theta_{j})_{-i}&\quad\text{if}\,j\leqslant k\\ \alpha\int H(\mathbf{x}_{i},\theta_{j})\mathrm{d}G_{0}(\theta_{j})&\quad\text{if}\,j=k+1\end{array}\right.$

$\displaystyle\quad=\left\{\begin{array}{ll}d_{j}\frac{C(N(\mathbf{x}_{i})+N_{Y=j}(\mathbf{x}_{-i})+\beta)}{C(N_{Y=j}(\mathbf{x}_{-i})+\beta)}&\quad\text{if}\,j\leqslant k\\ \alpha\frac{C(N(\mathbf{x}_{i})+\beta)}{C(\beta)}&\quad\text{if}\,j=k+1\end{array}\right.$ (17)

$\displaystyle C(\beta)=\int\prod_{j=1}^{m}\theta_{j}^{\beta_{j}-1}\mathrm{d}\theta=\frac{\prod_{j=1}^{m}\Gamma(\beta_{j})}{\Gamma(\beta\bullet)},\quad\text{where}\,\beta\bullet=\sum_{j=1}^{m}\beta_{j}$ (18)

We use $\mathbf{x}$ to represent a single data instance, and $\theta$ is the cluster parameter. Note that the Dirichlet distribution is the conjugate prior of the multinomial distribution, so the integrals in Eq. (17) can be evaluated in closed form as in Eq. (19), in which $P(\mathbf{x}|\theta)$ is the $H(\mathbf{x}_{i},\theta_{j})$ in Eq. (11) and $P(\theta|\beta)$ represents $G_{0}(\theta_{j})$ .

$\displaystyle\int P(\mathbf{x}|\theta)P(\theta|\beta)\mathrm{d}\theta=\int\left(\prod_{j=1}^{m}\theta_{j}^{N_{j}(\mathbf{x})}\right)\left(\frac{1}{C(\beta)}\prod_{j=1}^{m}\theta_{j}^{\beta_{j}-1}\right)\mathrm{d}\theta=\frac{1}{C(\beta)}\int\left(\prod_{j=1}^{m}\theta_{j}^{N_{j}(\mathbf{x})+\beta_{j}-1}\right)\mathrm{d}\theta=\frac{C(N(\mathbf{x})+\beta)}{C(\beta)}$ (19)
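The closed form in Eq. (19) is a ratio of normalizing constants, which is usually evaluated in log space to avoid overflow of the Gamma function. The following Python sketch does this with `math.lgamma`; the function names `log_c` and `log_marginal`, and the representation of word counts as plain lists, are illustrative assumptions rather than the original MATLAB code.

```python
from math import exp, lgamma


def log_c(beta):
    """log C(beta) = sum_j log Gamma(beta_j) - log Gamma(sum_j beta_j), Eq. (18)."""
    return sum(lgamma(b) for b in beta) - lgamma(sum(beta))


def log_marginal(counts, beta):
    """log of C(N(x) + beta) / C(beta), the closed form in Eq. (19)."""
    posterior = [n + b for n, b in zip(counts, beta)]
    return log_c(posterior) - log_c(beta)


# One occurrence of the first of two words under a uniform Dirichlet(1, 1) prior.
print(round(exp(log_marginal([1, 0], [1.0, 1.0])), 6))  # 0.5
```

The same helper covers the existing-cluster case of Eq. (17) by passing $N_{Y=j}(\mathbf{x}_{-i})+\beta$ as the second argument, since that ratio has the same form with an updated Dirichlet parameter.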

References

1. Adeli-Mosabbeb and Fathy, Distributed matrix completion for large-scale multi-label classification, Intell. Data Anal. 18(6) (2014), 1137–1151.

2. Aldous, Exchangeability and related topics, In École d’Été St Flour 1983, Springer-Verlag, 1985, pp. 1–198. Lecture Notes in Math. 1117.

3. Bezdek J.C., Pattern Recognition with Fuzzy Objective Function Algorithms, Kluwer Academic Publishers, Norwell, MA, USA, 1981.

4. and Kwok, Efficient multi-label classification with many labels, In Dasgupta and Mcallester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), JMLR Workshop and Conference Proceedings, Vol. 28, May 2013, pp. 405–413.

5. Blackwell and MacQueen J.B., Ferguson distributions via Polya urn schemes, The Annals of Statistics 1(2) (1973), 353–355.

6. Blei D.M. and Jordan M.I., Variational inference for dirichlet process mixtures, Bayesian Analysis 1(1) (2006), 121–144.

7. Boutell M.R., Luo, Shen and Brown C.M., Learning multi-label scene classification, Pattern Recognition 37(9) (2004), 1757–1771.

8. Clare and King R.D., Knowledge discovery in multi-label phenotype data, In Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery, PKDD ’01, London, UK, 2001, pp. 42–53. Springer-Verlag.

9. Dembczynski, Cheng and Hüllermeier, Bayes optimal multilabel classification via probabilistic classifier chains, In ICML, 2010, pp. 279–286.

10. Dundar, Akova and Rajwa, Bayesian nonexhaustive learning for online discovery and modeling of emerging classes, In ICML, 2012.

11. Escobar M.D., Estimating normal means with a dirichlet process prior, Journal of the American Statistical Association 89(425) (1994), 268–277.

12. Escobar M.D. and West, Bayesian density estimation and inference using mixtures, Journal of the American Statistical Association 90(430) (1995), 577–588.

13. Ferguson T.S., A bayesian analysis of some nonparametric problems, The Annals of Statistics (1973), 209–230.

14. Frank, Streich A.P., Basin and Buhmann J.M., Multi-assignment clustering for boolean data, J. Mach. Learn. Res 13(1) (Feb. 2012), 459–489.

15. Fürnkranz, Hüllermeier, Loza Mencía and Brinker, Multilabel classification via calibrated label ranking, Mach. Learn 73(2) (Nov. 2008), 133–153.

16. Gelfand A.E., Kottas and MacEachern S.N., Bayesian nonparametric spatial modeling with Dirichlet process mixing, Journal of the American Statistical Association 100(471) (2005), 1021–1035.

17. Godbole and Sarawagi, Discriminative methods for multi-labeled classification, In PAKDD, 2004, pp. 22–30.

18. Hariharan, Zelnik-Manor, Vishwanathan S.V.N. and Varma, Large scale max-margin multi-label classification with priors, In ICML, 2010, pp. 423–430.

19. Hsu, Kakade, Langford and Zhang, Multi-label prediction via compressed sensing, In NIPS, 2009, pp. 772–780.

20. Huang S.-J. and Zhou Z.-H., Multi-label learning by exploiting label correlations locally, In Hoffmann and Selman, editors, AAAI. AAAI Press, 2012.

21. Ishwaran and James L.F., Gibbs sampling methods for stick-breaking priors, Journal of the American Statistical Association 96(453) (2001), 161–173.

22. Kay S.M., Fundamentals of statistical signal processing: estimation theory, Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1993.

23. Liu C.-L., Tsai T.-H. and Lee C.-H., Online chinese restaurant process, In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, New York, NY, USA, 2014, pp. 591–600. ACM.

24. MacEachern S.N., Estimating normal means with a conjugate style Dirichlet process prior, Communications in Statistics – Simulation and Computation 23(3) (1994), 727–741.

25. Manning C.D., Raghavan and Schütze, Introduction to Information Retrieval, Cambridge University Press, New York, NY, USA, 2008.

26. Neal R.M., Markov chain sampling methods for dirichlet process mixture models, Journal of Computational and Graphical Statistics (2000), 249–265.

27. Olszewski, Employing kullback-leibler divergence and latent dirichlet allocation for fraud detection in telecommunications, Intell. Data Anal 16(3) (2012), 467–485.

28. Pestian J.P., Brew, Matykiewicz, Hovermale D.J., Johnson, Cohen K.B. and Duch, A shared task involving multi-label classification of clinical free text, In Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing, BioNLP ’07, Stroudsburg, PA, USA, 2007, pp. 97–104. Association for Computational Linguistics.

29. Pitman, Combinatorial Stochastic Processes: Preliminaries, Chapter 0, Springer-Verlag, 2006, 1–11.

30. Schapire R.E., A Brief Introduction to Boosting, In IJCAI ’99: Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, San Francisco, CA, USA, 1999, pp. 1401–1406. Morgan Kaufmann Publishers Inc.

31. Schapire R.E. and Singer, Boostexter: A boosting-based system for text categorization, Mach. Learn 39(2–3) (May 2000), 135–168.

32. Sun and Garibaldi J.M., A comparative study of novel robust clustering algorithms, Intell. Data Anal 16(6) (2012), 969–992.

33. Tai and Lin H.-T., Multilabel classification with principal label space transformation, Neural Comput 24(9) (Sept. 2012), 2508–2542.

34. Teh Y.W., Jordan M.I., Beal M.J. and Blei D.M., Hierarchical dirichlet processes, Journal of the American Statistical Association 101(476) (2006), 1566–1581.

35. Tsoumakas and Katakis, Multi-label classification: An overview, IJDWM 3(3) (2007), 1–13.

36. Tsoumakas, Katakis and Vlahavas, Random k-labelsets for multilabel classification, IEEE Trans. on Knowl. and Data Eng 23(7) (July 2011), 1079–1089.

37. Tsoumakas, Spyromitros-Xioufis, Vilcek and Vlahavas, Mulan: A java library for multi-label learning, Journal of Machine Learning Research 12 (2011), 2411–2414.

38. Tsoumakas and Vlahavas, Random k-labelsets: An ensemble method for multilabel classification, In Proceedings of the 18th European Conference on Machine Learning, ECML ’07, Berlin, Heidelberg, 2007, 406–417. Springer-Verlag.

39. Ueda and Saito, Parametric mixture models for multi-labeled text, In Advances in Neural Information Processing Systems 15, MIT Press, 2003, pp. 721–728.

40. Vateekul, Kubat and Sarinnapakorn, Hierarchical multi-label classification with svms: A case study in gene function prediction, Intell. Data Anal 18(4) (2014), 717–738.

41. Wainwright M.J. and Jordan M.I., Graphical models, exponential families, and variational inference, Found. Trends Mach. Learn 1(1–2) (Jan. 2008), 1–305.

42. West, Müller and Escobar M.D., Hierarchical priors and mixture models, with applications in regression and density estimation, Aspects of Uncertainty: A Tribute to D. V. Lindley (1994), 363–386.

43. Zhang M.-L. and Zhou Z.-H., Ml-knn: A lazy learning approach to multi-label learning, Pattern Recogn 40(7) (2007), 2038–2048.

44. Zhang M.-L. and Zhou Z.-H., A review on multi-label learning algorithms, IEEE Transactions on Knowledge and Data Engineering 99(PrePrints) (2013), 1.
