Multi-label classification of documents using fine-grained weights and modified co-training

Abstract

This paper uses multinomial naïve Bayes to improve multi-label classification methods in two ways. First, we use the value weighting method, a new fine-grained weighting method, to calculate the weights of the feature values. Second, we employ a co-training method to incorporate the dependencies among the class values. The results of our experiments show that the proposed approach outperforms other state-of-the-art methods.

Keywords

Multi-label classification, multinomial naïve Bayes, fine-grained weights, co-training

1. Introduction

Correctly classifying documents into particular categories remains a challenging task due to the vast number of features in a given dataset. In particular, multi-label text classification problems have received considerable attention because they allow each document to be associated with multiple class labels. In this paper, we explore the multi-label text classification problem and address two relevant issues.

The first issue is to improve the performance of multi-label text classification tasks. The multinomial naïve Bayes (MNB) algorithm is the most common choice for multi-label text classification problems. An MNB classifier is an efficient, reliable text classifier, and many researchers regard it as the standard naïve Bayes text classifier. However, its performance is not as good as that of some other learning methods, such as support vector machines and boosting. In this paper, we first investigate the reasons behind the poor performance of MNB. Then, to improve the performance of the MNB method, we propose the value weighting method for MNB learning, which implements a new weighting paradigm.

The second issue in multi-label classification is to make effective use of the dependencies between class labels. One common approach is the binary relevance method (BR), which treats each class as a separate binary classification problem [17]. When building the classifiers, BR does not directly model the correlations that exist between labels in the training data, resulting in too many or too few predicted labels [17]. In other words, this model ignores the dependencies between the labels. However, in many real-world tasks, labels are highly interdependent, and therefore the key to successful multi-label learning is to effectively exploit the dependencies between different labels.

To capture the dependencies between labels, this paper presents a new method that uses co-training to learn two separate classifiers over the feature and/or label sets. The co-training algorithm is used to capture the dependencies between the class labels, and doing so improves the classification performance. Furthermore, we compare the performance of the proposed model, which combines the value weighting method and co-training, with that of other state-of-the-art multi-label classifiers.

2. Related work

Text classifiers based on naïve Bayes have been extensively studied in the literature. In particular, there has been much research on the use of the MNB model for text classification.

McCallum and Nigam [15] compared the classification performance between the multi-variate Bernoulli model and the multinomial model, and they showed that the multi-variate Bernoulli model performs well with small vocabulary sizes, but the multinomial model usually performs even better at larger vocabulary sizes.

Rennie et al. [18] introduced the complement naïve Bayes (CNB) for skewed training data. CNB estimates the parameters using data from all classes except the currently estimated class. Furthermore, they demonstrated that MNB can achieve better accuracy by adopting a TF-IDF representation, which is traditionally used in information retrieval.

Schneider [22] addressed the problems of a naïve Bayesian text classifier and showed that they can be solved by some simple corrections. In particular, he removed duplicate words in a document to account for the burstiness phenomenon in text.

Zhang and Zhou [27] defined ML-kNN based on the kNN algorithm. They first identified the k nearest neighbors of a test instance, obtained the label sets of these neighbors, and then predicted the label set of the test instance using the maximum a posteriori principle. In recent years, there has been a significant amount of study of multi-label classification problems, motivated by emerging applications. As mentioned above, the key to successful multi-label learning is how to effectively exploit dependencies between the different labels.

Ghamrawi and McCallum [8] proposed two undirected graphical models that directly parameterize label dependencies in multi-label classification. The first is Collective Multi-Label classifier (CML), which jointly learns parameters for each pair of labels. The second is Collective Multi-Label with Features classifier (CMLF), which learns parameters for feature-label-label triples.

McCallum [16] defined a probabilistic generative model according to which, each label generates different words. Based on this model, a multi-label document is produced by a mixture of the word distributions of its labels.

Read et al. [17] introduced a classifier chain. The classifier chain is a binary relevance-based method that consists of binary classifiers linked in a chain, which overcomes predicted labels from BR that contain too many or too few labels or labels that never co-occur in practice.

M. Zhang and K. Zhang [28] used a Bayesian network structure to encode the conditional dependencies of the labels as well as the feature set. With the help of this network, multi-label learning is decomposed into a series of single-label classification problems.

Dembczynski et al. [6] distinguished two types of label dependence, namely conditional and marginal dependence. Subsequently, they presented three scenarios in which one of these types of dependence may boost the predictive performance of a classifier. In this regard, a close connection with loss minimization was also established, showing that the benefit of exploiting label dependence also depends on the type of loss to be minimized.

Song and Lee [23] improved the performance of multinomial naïve Bayes classification of documents by using a new fine-grained weighting method, called the value weighting method. Unlike traditional feature weighting methods that assign a single weight to each word feature, the fine-grained weighting method assigns a different weight to each word frequency. They showed that the new weighting method could significantly improve the performance of multinomial naïve Bayes learning.

This paper is an extension of Song and Lee’s work [23]. In a multi-label classification problem, class labels are known not to be independent of each other. For example, if a certain document is classified into Sports, it is also likely to be classified into Entertainment. As such, these label dependencies are utilized in this paper by using the co-training method. While previous work in [23] improves multi-label classification of documents using a fine-grained weight method, this paper improves its performance even further by utilizing the interrelationships among class labels.

Two mutually disjoint feature sets, label value set and word feature set, are defined and label dependencies are represented as the features in the label value set. By combining fine-grained weighting and co-training, we could improve the performance of multi-label classification in documents.

3. Improving the performance of multinomial naïve Bayes in document classification tasks

In this paper, we assume that documents are generated according to a multinomial event model. Thus each document can be represented as a vector ${\bf{x}}=(f_{1},\dots,f_{|V|})$ of word counts, where $|V|$ is the vocabulary size and each $f_{t}$ indicates how often the $t$-th word $W_{t}$ occurs in ${\bf{x}}$. Given model parameters $p(W_{t}|c)$ and $p(c)$, and assuming independence of the words, the most probable class value for a document ${\bf{x}}$ is computed as

$c^{*}_{\textit{MNB}}({\bf{x}})=\textit{arg max}_{c}p(c)\prod_{t=1}^{|V|}p(W_{t}|c)^{f_{t}}$ (1)

where $p(W_{t}|c)$ is the conditional probability that a word $W_{t}$ occurs in a document ${\bf{x}}$ given the class value $c$, and $p(c)$ is the prior probability that a document with class label $c$ occurs in the document collection [21].
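As a minimal sketch of Eq. (1), the following Python evaluates the decision rule in log space, which avoids numerical underflow when many probabilities are multiplied. The function name `mnb_predict` and the toy parameter values are illustrative assumptions and are not taken from the paper.

```python
import math

def mnb_predict(counts, priors, cond_probs):
    """Return the class maximizing log p(c) + sum_t f_t * log p(W_t | c)."""
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for t, f_t in enumerate(counts):
            if f_t > 0:  # a word with zero count contributes a factor of 1
                score += f_t * math.log(cond_probs[c][t])
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy model: two classes and a three-word vocabulary (made-up numbers).
priors = {"sports": 0.5, "politics": 0.5}
cond_probs = {
    "sports": [0.7, 0.2, 0.1],
    "politics": [0.1, 0.3, 0.6],
}
print(mnb_predict([3, 1, 0], priors, cond_probs))
```

In this toy model, a document dominated by the first word is assigned to the first class, since that word is far more probable under it.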

MNB provides reasonable prediction performance and is easy to implement. However, it has some unrealistic assumptions that affect the overall performance of the classification. The first is that all features are equally important in MNB learning. Since this assumption is rarely true in real-world applications, the predictions provided by MNB are sometimes poor. The second is that the probabilities for word occurrence are independent of document length, and due to these assumptions, the parameters are clearly dominated by the word count coming from long documents. The performance of the MNB can thus be improved by mitigating these assumptions. The following section describes these issues in further detail.

3.1 Fine-grained weighting method in text classification

Since the assumption that all features are equally important hardly holds true in real-world applications, there have been some attempts to relax this assumption in machine learning methods. Feature weighting in naïve Bayesian approaches is one method to ease this assumption, and it assigns a continuous-valued weight to each feature. The feature weighted naïve Bayesian method involves a much larger search space than feature selection, and it is generally known to improve the performance of naïve Bayesian learning [12].

The MNB classification is a special form of feature weighted naïve Bayesian learning. The MNB classification with feature weighting is represented as follows.

$\displaystyle c^{*}_{\textit{MNB-FW}}({\bf{x}})=\textit{arg max}_{c}p(c)\prod_{t=1}^{|V|}p(W_{t}|c)^{w_{t}}$ (2)

The basic idea of feature weighting is that the more important a feature is, the higher its weight is. In feature weighting naïve Bayes, each word $W_{t}$ has its own weight $w_{t}$ .

In traditional MNB Eq. (1), the frequency ( $f_{t}$ ) of each word $W_{t}$ plays the role of the significance of the word. Therefore, the basic assumption in MNB is that when a certain word appears frequently in a document, the importance of the word grows in proportion to its occurrence. Each word is given a weight, which is the frequency of the word in the document.

Kim et al. [11] proposed a feature weighting scheme using information gain. The information gain for a word given a class, which becomes the weight of the word, is calculated as follows

$\displaystyle w_{t}=f_{t}\sum_{c}\sum_{W_{i}\in\{W_{t},\overline{W}_{t}\}}p(c,W_{i})\log\frac{p(c,W_{i})}{p(c)p(W_{i})}$

where $p(c,W_{i})$ is the number of documents with word $W_{i}$ and class label $c$ divided by the total number of documents, and $p(W_{i})$ is the number of documents with word $W_{i}$ divided by the total number of documents, respectively.
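The information gain weight above can be sketched in Python as follows. This is an illustrative sketch only: the function names `information_gain` and `ig_weight`, and the representation of a corpus as a list of presence flags plus a list of class labels, are assumptions made for the example.

```python
import math

def information_gain(has_word, labels):
    """Mutual information between word presence/absence and the class label.

    has_word[d] is True if document d contains the word; labels[d] is its class.
    """
    n = len(labels)
    gain = 0.0
    for c in set(labels):
        p_c = labels.count(c) / n
        for present in (True, False):
            p_w = has_word.count(present) / n
            p_cw = sum(1 for h, y in zip(has_word, labels)
                       if h == present and y == c) / n
            if p_cw > 0:  # a zero joint probability contributes nothing
                gain += p_cw * math.log(p_cw / (p_c * p_w))
    return gain

def ig_weight(f_t, has_word, labels):
    """Weight w_t = f_t * information gain of the word."""
    return f_t * information_gain(has_word, labels)
```

A word whose presence perfectly separates two balanced classes has an information gain of $\log 2$, while a word that is independent of the class has a gain of zero.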

In this paper, we propose a new method in which weights are assigned in a more fine-grained way, treating each occurrence of a word differently in terms of its importance. When a certain word appears for the first time in a document, it is very important with respect to the classification of the document. However, when the same word has already appeared many times, say 100, the next occurrence has virtually no importance. The probability of the second occurrence is much higher than that of the first occurrence. Since MNB treats the significance of each occurrence of a word equally, the MNB model does not take this phenomenon into account. In this paper, we investigate whether assigning different weights to each word count can improve the performance of the classification.

In order to implement the fine-grained weighting, we first discretize the term frequencies of each word. The discretization task converts a continuous term frequency $f_{t}$ to a categorical word frequency bin $a_{tj}$, which represents the $j$-th discretized value of the term frequency of the $t$-th word. In other words, instead of assigning a weight to each word feature (as in MNB), we assign a weight to each word frequency bin. After that, the weights of these word frequency bins are automatically calculated from the training data.

We refer to this method as the value weighting method. As we can see, unlike the current feature weighting methods, the value weighting method calculates a weight for each word frequency bin. The value weighting method in MNB can be defined as follows

$\displaystyle c^{*}_{\textit{MNB-VW}}({\bf{x}})=\textit{arg max}_{c}p(c)\prod_{a_{tj}\in{\bf{x}}}p(a_{tj}|c)^{w_{tj}}$ (3)

where $w_{tj}$ represents the weight of word frequency bin $a_{tj}$. Note that each word frequency bin is assigned its own weight.
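The decision rule of Eq. (3) can be sketched in log space as follows. The function name `mnb_vw_predict` and the nested-list layout of the probabilities and weights are assumptions made for this example, and the numbers are made up.

```python
import math

def mnb_vw_predict(bins, priors, bin_probs, weights):
    """Pick the class maximizing log p(c) + sum w_tj * log p(a_tj | c).

    bins[t] is the index j of the frequency bin observed for word t.
    bin_probs[c][t][j] is p(a_tj | c); weights[t][j] is w_tj.
    """
    scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for t, j in enumerate(bins):
            score += weights[t][j] * math.log(bin_probs[c][t][j])
        scores[c] = score
    return max(scores, key=scores.get)

# One word with two frequency bins and two classes (made-up numbers).
priors = {0: 0.5, 1: 0.5}
bin_probs = {0: [[0.9, 0.1]], 1: [[0.4, 0.6]]}
weights = [[0.5, 1.5]]
print(mnb_vw_predict([1], priors, bin_probs, weights))
```

A weight larger than one amplifies the evidence carried by a bin, and a weight near zero makes the bin almost irrelevant to the decision.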

3.2 Calculating value weights of word frequency bins

This section describes the method used to calculate the weights of the frequency bins. We use an information-theoretic method to assign the weights to each word frequency bin. The basic assumption of the value weighting method is that when a certain word frequency bin is observed, it gives a certain amount of information about the target class. The more information a word frequency bin provides about the target class, the more important the bin becomes. The critical part is how to define or select a proper measure of the amount of information.

In this paper, the amount of information that a certain word frequency bin contains is defined as the difference between a priori probability distribution and a posteriori distribution of class label under the word frequency bin. We employ Kullback-Leibler measure (Kullback and Leibler, 1951) to calculate the difference between these distributions. The Kullback-Leibler measure (denoted as KL) for a word frequency bin $a_{tj}$ is defined as

$\displaystyle\textit{KL}(C|a_{tj})=\sum_{c}p(c|a_{tj})\log\frac{p(c|a_{tj})}{p(c)}$ (4)

The formula $\textit{KL}(C|a_{tj})$ is the average mutual information between the events $c$ and $a_{tj}$ with the expectation taken with respect to a posteriori probability distribution of $C$ . It can be used as a proper weighting function, so the value weight for a word frequency bin $a_{tj}$ is defined as

$w_{tj}=\frac{1}{Z_{t}}\sum_{c}p(c|a_{tj})\textit{log}\frac{p(c|a_{tj})}{p(c)}$ (5)

where $Z_{t}$ is a normalization constant given as $Z_{t}=\frac{1}{|a_{t}|}\sum_{j|t}w_{tj}$ . The $|a_{t}|$ represents the number of word frequency bins in word feature $t$ .
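A minimal sketch of Eqs. (4) and (5) is given below. It assumes that $Z_{t}$ is the mean of the unnormalized KL values over the bins of word $t$, so that the weights of one word average to one; this reading of the normalization, the uniform fallback when no bin is informative, and the function names are assumptions of the sketch.

```python
import math

def kl_to_prior(posterior, prior):
    """KL(C | a_tj) = sum_c p(c | a_tj) * log(p(c | a_tj) / p(c))."""
    return sum(p * math.log(p / prior[c])
               for c, p in posterior.items() if p > 0)

def value_weights(posteriors, prior):
    """Weights w_tj for the bins of one word t.

    posteriors[j] maps each class c to p(c | a_tj).
    Assumption: Z_t is the mean of the unnormalized KL values, so the
    weights of one word average to 1.
    """
    kls = [kl_to_prior(post, prior) for post in posteriors]
    z_t = sum(kls) / len(kls)
    if z_t == 0:  # no bin is informative; fall back to uniform weights
        return [1.0] * len(kls)
    return [kl / z_t for kl in kls]

prior = {"a": 0.5, "b": 0.5}
posteriors = [{"a": 0.5, "b": 0.5}, {"a": 0.9, "b": 0.1}]
print(value_weights(posteriors, prior))
```

In this toy example the first bin leaves the class distribution unchanged and receives zero weight, while the second bin shifts the distribution strongly and receives all of the weight mass.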

4. Multi-label text classification

4.1 Binary relevance method (BR)

Many of the proposed methods tackle multi-label problems by first transforming a multi-label problem into a set of independent binary classification problems and then employing ranking or thresholding schemes for the overall multi-label classification.

One of them is the binary relevance method (BR), which treats each class as a separate binary classification problem. The BR method has several advantages: it is theoretically simple, intuitive, and generally has low computational complexity. However, it results in too many or too few predicted labels [17], because this model ignores the dependencies between the labels.

For the BR model, suppose ${\bf{L}}=\{1,\dots,L\}$ is the set of possible labels, and each document ${\bf{x}}$ is associated with a subset of these labels represented by the class label vector ${\bf{c}}=\{c_{1},\dots,c_{L}\}$, where $c_{i}$ is the $i$-th class value with a binary value $0$ or $1$. For each label $c_{l}$, the BR model trains an independent binary classifier, which decides whether or not a new instance is assigned the label $c_{l}$. Applying the same principle of the value weighting method to the BR classifier, the classification of ${\bf{x}}$ using BR with the value weighting method is defined as follows

$c_{l}^{\textit{BR}}({\bf{x}})=\textit{arg max}_{c_{l}\in\{0,1\}}p(c_{l})\prod_{a_{tj}\in{\bf{x}}}p(a_{tj}|c_{l})^{w_{tj}}$ (6)

where the classifier $c_{l}^{\textit{BR}}$ is a function that assigns the $l$-th class label to a document by using the BR method.

An obvious drawback of such methods is that they ignore the interdependencies among multiple labels. In many applications, strong co-occurrences and interdependencies exist among multiple class labels. Capturing the dependencies among class labels during classification is thus expected to lead to an improvement in classification performance.

Many methods with this motivation have been proposed in the literature to exploit label correlations. Some approaches resort to external knowledge such as existing label hierarchies [3, 20, 4, 14] or label correlation matrices [10]. Some approaches exploit graphical models to capture the label dependencies and conduct structured classification, including those using Bayesian networks [26, 19]. These methods have more expressive power to capture label dependencies than the BR method does. However, even though they require more complicated learning than binary classification models, they still cannot fully exploit label dependencies due to the representational limitations of graphical models. Besides, it is almost impossible to find optimal parameter values in these graphical structures.

4.2 Modified co-training method for multi-label learning

In this paper, we propose a new method to effectively utilize the dependencies of the labels. We employ a co-training method as a tool to capture the label dependencies. The co-training method is a semi-supervised learning technique that requires two views of the data. It assumes that each instance is described using two different feature sets providing different, complementary information of the instance.

The advantage of using co-training in multi-label classification is that the label dependencies can be effectively exploited by classifiers, and people have used a number of explicit tools to effectively represent dependencies between the different labels. However, in our co-training model, the class dependencies are implicitly represented as classification boundaries in a feature space as a result of learning. The dependencies are not explicitly explainable, but are treated implicitly in a form of black box. Since the proposed co-training method takes advantage of the expressive power of learning algorithms, the dependencies are represented in a more effective and fine-grained manner than the explicit expression tools.

Multi-label documents provide a natural environment for co-training classification learning. In this paper, we divide the features of multi-label documents into two different views. One view consists of the class labels transformed into binary features, while the other consists of the traditional features generated from words in the documents. These feature sets are called the Label Value Set (LVS) and the Word Feature Set (WFS), respectively. Assuming $c_{i}$ is the $i$-th class value, the first feature set, LVS, is defined as follows.

Definition 1: Label Value Set (LVS) Label Value Set is defined as $\{c_{i}|i=1,2,\cdots,L\}$ where $L$ is the number of class labels.

Notice that the way LVS is represented is different from that of a traditional feature set. Each feature in LVS is actually a value (label) of the feature (class). In other words, each of these label values becomes a separate binary feature. For example, suppose there are eight possible labels, and a document is assigned the labels $c_{2}$, $c_{3}$ and $c_{6}$. The label set $\{c_{2},c_{3},c_{6}\}$ is represented as the following eight binary features $\{0,1,1,0,0,1,0,0\}$.
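The encoding in this example can be sketched in a few lines of Python. The helper name `encode_label_set` is hypothetical and is used only for illustration.

```python
def encode_label_set(label_indices, num_labels):
    """Turn 1-based label indices into a binary label value vector."""
    active = set(label_indices)
    return [1 if i in active else 0 for i in range(1, num_labels + 1)]

print(encode_label_set({2, 3, 6}, 8))  # [0, 1, 1, 0, 0, 1, 0, 0]
```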

The second feature set of the co-training method is called the Word Feature Set (WFS), which is defined as follows.

Definition 2: Word Feature Set (WFS) Suppose $a_{i}$ is the word frequency bin of the $i$-th word, which is converted from the frequency of the $i$-th word. The Word Feature Set is defined as $\{a_{i},c_{j}|\forall i,j=1,2,\cdots,L\}=\{a_{i}|\forall i\}\cup\textit{LVS}$ .

WFS consists of two parts: (1) the word frequency bins derived from words in the document, and (2) the Label Value Set. As described in the previous section, the term frequency of each word is converted into a word frequency bin, and we then apply the value weighting method to multinomial naïve Bayes learning. These word frequency bins are the first component of WFS.

There are subtle differences between traditional co-training and the co-training proposed in this paper. The main difference is that in our co-training method, the word feature set (WFS) contains the second feature set (LVS). In a traditional co-training setting, the features are divided into two disjoint feature sets. However, in our co-training model, the two feature sets are not completely disjoint with each other because the Label Value Set ${\bf{c}}=\{c_{1},c_{2},\dots,c_{L}\}$ is shared by both classifiers. In other words, LVS becomes a subset of WFS.

LVS is included in WFS for the following reasons. In traditional co-training, unlabeled data plays the role of the messenger passing predicted label information from one classifier to another, and different classifiers teach each other by using unlabeled data. The main topic of this paper is not semi-supervised learning, and thus we assume that no unlabeled data are available. Therefore, we need a kind of messenger that transfers the learning results from one classifier to another. LVS takes the place of the unlabeled data as this messenger. When one classifier classifies a document with multiple labels, the results are stored in LVS and are fed to the other classifier as features.

4.3 Building base classifiers

There are two classifiers in our co-training model: Dependency Classifier (DC) and Feature Classifier (FC).

Dependency Classifier, denoted $c^{D}_{l}({\bf{c}})$, is the base classifier associated with LVS. It regards the label set as independent features and performs the classification task using only the label information. Therefore, in DC, LVS provides both the input features and the target feature. To predict the binary value of a certain label $c_{i}$, DC is learned using all ${\bf{c}}$ values except $c_{i}$, and this process is repeated for every class label $c_{i}$. DC is learned using the label value set only, and it is defined as follows.

Definition 3: Dependency Classifier (DC) Dependency Classifier, denoted $c^{D}_{l}({\bf{c}})$ , is defined as

$\displaystyle c^{D}_{l}({\bf{c}})=\textit{arg max}_{c_{l}\in\{0,1\}}p(c_{l})\prod_{c_{kj}\in{\bf{c}},k\neq l}p(c_{kj}|c_{l})^{w_{kj}}$ (7)

where $c_{kj}$ represents the $j$-th value of the $k$-th class, which is binary, and $w_{kj}$ represents the weight of $c_{kj}$.

The value weighted MNB described in Eq. (3) is used again as the basic algorithm for the Dependency Classifier. Since Eq. (3) is based on the value weighting method, we need a formula to calculate the weights of the class values. A class takes on a binary value, and each binary class value might have a different importance with respect to the target class due to the class dependencies. The approach that we used to calculate the weights of the feature values is used again to calculate the weights of the class values. The importance of a class value $c_{kj}$ with respect to another class $c_{m}$ is defined as the amount of information that the $c_{kj}$ value gives about $c_{m}$. The difference between the a priori probability distribution and the a posteriori probability distribution of the class label becomes the amount of information for the class label, and it is measured using the Kullback-Leibler measure. By assigning different weights to each label value, we can discriminate the embedded significance among the label values.

We can apply the same value weighting method used in the weights of word frequency bins to calculate the weight of the class value by using the KL measure as follows

$w_{kj}=\frac{1}{Z_{k}}\sum_{c_{m}\in{\bf{c}},m\neq k}p(c_{m}|c_{kj})\log\frac{p(c_{m}|c_{kj})}{p(c_{m})}$ (8)

where $Z_{k}$ is a normalization constant given as $Z_{k}=\frac{1}{2}\sum_{j|k}w_{kj}$ .

The second classifier in the modified co-training method is the Feature Classifier (FC). The Feature Classifier is also built on the value weighted MNB. For each class label $c_{l}$, it predicts the relevance of the $l$-th label using the Word Feature Set (WFS). Since WFS consists of class labels as well as regular word features, FC is a product of two components: the term computed from the regular word features is multiplied by the term computed from the class labels. Given a document ${\bf{x}}$ and its label set ${\bf{c}}$, FC is defined as follows.

Definition 4: Feature Classifier (FC) Feature Classifier, denoted $c^{F}_{l}({\bf{x,c}})$ , is defined as

$c^{F}_{l}({\bf{x,c}})=\textit{arg max}_{c_{l}\in\{0,1\}}p(c_{l})\prod_{c_{kj}\in{\bf{c}},k\neq l}p(c_{kj}|c_{l})^{w_{kj}}\prod_{a_{tj}\in{\bf{x}}}p(a_{tj}|c_{l})^{w_{tj}}$

where $w_{tj}$ represents the weight of the word frequency bin $a_{tj}$ and $w_{kj}$ represents the weight of class value $c_{kj}$ .

Notice that the first two factors of FC, the prior and the product over class values, are identical to the formula of DC itself, and thus the prediction results of DC are automatically forwarded to FC.

Algorithm 1 Multi-label Co-Training

Input: Training data $\{({\bf{x}}_{1},{\bf{c}}_{1}),\dots,({\bf{x}}_{N},{\bf{c}}_{N})\}$

Test instance; predicted label set

For each class $c$, learn the Binary Relevance Classifier and predict the class value $\hat{c}$

while $\hat{\bf{c}}$ changes do

    For each class $c$, learn the Dependency Classifier and predict the class value $\hat{c}$

    For each class $c$, learn the Feature Classifier and predict the class value $\hat{c}$

end while

Output: Predicted label set $\hat{\bf{c}}$ of the test instance

4.4 Multi-label co-training method

The modified co-training method first predicts the labels by using the BR classifier $c_{l}^{\textit{BR}}({\bf{x}})$ of Eq. (6), and the predicted label values are fed to DC. Note that at the beginning of the method, we use the BR classifier ( $c_{l}^{\textit{BR}}({\bf{x}})$ ) instead of FC ( $c^{F}_{l}({\bf{x,c}})$ ) because the values of ${\bf{c}}$ are not yet available. After that, the method iterates the following two procedures. First, for each class label, the DC performs a classification task using the labels previously predicted by FC (or by BR in the first iteration). Second, the FC is trained by using both the word features and the predicted labels from DC. These classifiers iteratively train each other and predict labels until their results converge.

Figure 1 illustrates the co-training process with an example. $c^{\textit{BR}}_{l}$ , $c^{D}_{l}$ and $c^{F}_{l}$ represent the $l$-th BR classifier, DC and FC respectively. BR initially performs the classification task, and then DC and FC alternately perform the classification task. The classified labels $\hat{\bf c}$ are used by the next classifier as additional features. When $\hat{\bf c}$ no longer changes, the labels $\hat{\bf c}$ produced by FC are chosen as the final output. Algorithm 1 describes the pseudo code of the proposed method.
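The alternation described above can be sketched as a small Python loop. This sketch covers only the prediction-time alternation for a single test instance and assumes the three classifiers are already available as callables; the names `br_predict`, `dc_predict`, and `fc_predict`, the stub rules, and the `max_iter` safety cap are assumptions of the example and are not part of Algorithm 1.

```python
def co_train_predict(x, br_predict, dc_predict, fc_predict, max_iter=50):
    """Alternate DC and FC until the predicted label vector stops changing.

    br_predict(x) -> list of 0/1 labels (initial prediction, Eq. 6)
    dc_predict(labels) -> list of 0/1 labels (uses labels only)
    fc_predict(x, labels) -> list of 0/1 labels (uses words and labels)
    max_iter is a safety cap added for this sketch.
    """
    labels = br_predict(x)
    for _ in range(max_iter):
        dc_labels = dc_predict(labels)
        fc_labels = fc_predict(x, dc_labels)
        if fc_labels == labels:  # converged: FC output is the final answer
            break
        labels = fc_labels
    return labels

# Stub classifiers with hand-written rules, for illustration only.
def br_stub(x):
    return [1, 0, 0]

def dc_stub(labels):
    # Toy dependency: label 1 is switched on whenever label 0 is on.
    return [labels[0], 1 if labels[0] else labels[1], labels[2]]

def fc_stub(x, labels):
    # Toy word evidence: the word "goal" switches label 2 on.
    return [labels[0], labels[1], 1 if "goal" in x else 0]

print(co_train_predict(["goal"], br_stub, dc_stub, fc_stub))
```

With these stubs, the first pass adds the dependent label and the word-driven label, and the second pass reproduces the same vector, so the loop stops.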

5. Experimental evaluation

In order to evaluate the performance of each proposed method, we divide the experiment into two subsections. First, the performance of the value weighting method is compared to that of the MNB. Second, we compare co-training with other multi-label algorithms.

5.1 Effect of value weighting method

In this section, we describe how we conducted the experiments to measure the effects of the value weighting method. We then present the empirical results obtained using these methods. We used four text datasets to conduct our empirical text classification study. All these datasets have been widely used in text classification and are publicly available. Table 1 provides a brief description of each dataset.

The “New3” [25] dataset contains a collection of news stories and “Ohsumed” [25] is a dataset of medical articles. The “Amazon” [7] dataset is derived from the customer reviews in the Amazon Commerce Website for authorship identification. “CNAE-9” [7] is a dataset of free text business descriptions of Brazilian companies. The continuous features in datasets were discretized using equal distance method with 5 bins.

Table 1
Test datasets for the value weighting method

Dataset	Documents	Labels	Features
New3	3204	6	13196
Ohsumed	1003	10	3183
Amazon	1500	50	10000
CNAE-9	1080	9	857

Figure 1. Example of the co-training process.

Single-label classification was conducted in this experiment, so the value weighting method (MNB-VW) was used without employing the co-training model. The proposed MNB-VW is compared with regular naïve Bayes (NB) and multinomial naïve Bayes (MNB).

We used Weka software to run NB and MNB. Even though MNB-VW is an iterative method, its computational cost was not high, and it approached its optimal weight values quickly. In most cases, the algorithm converged within 10–20 iterations.

Table 2

Accuracies of the methods

Dataset	NB	MNB	MNB-VW
New3	0.8620	0.8846	0.9104
Ohsumed	0.9402	0.9860	0.9571
Amazon	0.9760	0.9826	0.9893
CNAE-9	0.9602	0.9783	0.9787

Table 2 shows the accuracies of these methods. MNB-VW shows the best accuracy on three of the four datasets (New3, Amazon, and CNAE-9), while MNB is best on Ohsumed, and MNB-VW always performs better than the NB method. These results indicate that assigning weights to each word frequency bin can improve the performance of naïve Bayesian learning in document classification.

The average value weights used in the above experiment are shown in Fig. 2. We calculated the average value over all features in each dataset. The number of bins is 5, so every feature has 5 discretized word frequency bins.

Figure 2. Average value weights of each dataset.

When the term frequency is zero, it is discretized into the first word frequency bin. The remaining term frequencies are discretized by equal distance, and the discretized values are assigned in order of frequency. The highest term frequencies are discretized to the fifth word frequency bin.
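One possible reading of this discretization is sketched below. The text does not state the exact interval boundaries, so the sketch assumes that the range from just above zero up to the maximum observed term frequency is split into equal-width intervals; the function name `frequency_bin` and the parameter `max_tf` are hypothetical.

```python
import math

def frequency_bin(tf, max_tf, num_bins=5):
    """Map a term frequency to a 1-based bin index.

    Bin 1 is reserved for tf == 0. The range (0, max_tf] is split into
    num_bins - 1 equal-width intervals (an assumption of this sketch).
    """
    if tf == 0:
        return 1
    width = max_tf / (num_bins - 1)
    # Values above max_tf are clipped into the last bin.
    return min(num_bins, 1 + math.ceil(tf / width))

print([frequency_bin(tf, max_tf=8) for tf in (0, 1, 3, 8)])
```

With `max_tf=8` and five bins, each nonzero bin covers a width of two, so the lowest nonzero counts fall into the second bin and the maximum count falls into the fifth.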

Figure 2 shows that the second word frequency bin has the highest average weight except for the CNAE-9 dataset and that the low word frequency bins generally have a higher weight than the high word frequency bins. The CNAE-9 dataset shows the opposite result since it has abnormally high term frequencies.

This means that higher weights are generally assigned to low term frequencies. In other words, the event in which a word occurs only a few times carries more significance than the event in which the word occurs many times.

5.2 Effect of multi-label co-training method

The proposed co-training method was tested on 5 different benchmark multi-label text datasets [13]. All class variables of the datasets are binary. The features of the Eur-Lex and RCV1 datasets have numeric variables, and the features of the rest of the datasets are binary.

The numeric features in the datasets were discretized using the equal distance method, and the bin number of the Eur-Lex dataset is 6 while the bin number of the RCV1 dataset is 3. Since the term frequency intervals of each dataset are different, different bin numbers are used. All datasets are originally partitioned into training and testing datasets. The details of the datasets are summarized in Table 3.

Table 3
Multi-label text datasets used in the experiments

Dataset	Train/Test	Labels	Features
Medical	645/333	45	1449
Enron	1123/579	53	1001
Delicious	12920/3185	983	500
Eur-Lex	17413/1935	201	5000
RCV1	3000/3000	101	47236

We considered four classifiers: binary relevance (BR), classifier chain (CC) [17], ensemble of classifier chains (ECC) [17], and value weighting with co-training (MNB-VWCO).

The classifier chain (CC) is a multi-label classification method designed to overcome the label independence assumption of BR. CC consists of L classifiers linked in a chain, such that each classifier uses the classes predicted by the previous classifiers as additional features. The ensemble of classifier chains (ECC) combines several classifier chains that differ in the order of the labels. Using previously predicted labels as additional features is very similar to our co-training model, so we compare the co-training algorithm with CC and ECC.

Since all algorithms are meta-learners, we used MNB-VW as the base learner for all of them. CC and ECC are implemented according to [17], with minor exceptions. The permutations of the labels in all chain-based algorithms were generated at random, and the ensemble size in ECC was set to 10 [5]. The computational complexity of the algorithm was not high, and in most cases, the algorithm reached its final state within 10–15 iterations.
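The chaining idea can be illustrated with a short sketch. It is not the implementation used in the experiments: the base learner here is a tiny Bernoulli naive Bayes stand-in (the paper uses MNB-VW), the class names `TinyBernoulliNB` and `ClassifierChain` are hypothetical, and training each link on the true values of the earlier labels is an assumption that follows the usual classifier chain formulation.

```python
import math


class TinyBernoulliNB:
    """Stand-in base learner for binary features (Laplace smoothing)."""

    def fit(self, X, y):
        self.stats = {}
        n_features = len(X[0])
        for c in (0, 1):
            rows = [x for x, yi in zip(X, y) if yi == c]
            prior = (len(rows) + 1) / (len(X) + 2)
            probs = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                     for j in range(n_features)]
            self.stats[c] = (prior, probs)
        return self

    def predict_one(self, x):
        def log_posterior(c):
            prior, probs = self.stats[c]
            return math.log(prior) + sum(
                math.log(p if v else 1 - p) for v, p in zip(x, probs))
        return 1 if log_posterior(1) > log_posterior(0) else 0


class ClassifierChain:
    """One binary classifier per label, linked in the given label order."""

    def __init__(self, order):
        self.order = list(order)

    def fit(self, X, Y):
        self.models = []
        for pos, label in enumerate(self.order):
            earlier = self.order[:pos]
            # Augment the features with the values of the earlier labels.
            X_aug = [list(x) + [y[k] for k in earlier] for x, y in zip(X, Y)]
            target = [y[label] for y in Y]
            self.models.append(TinyBernoulliNB().fit(X_aug, target))
        return self

    def predict_one(self, x):
        x_aug = list(x)
        predicted = {}
        for label, model in zip(self.order, self.models):
            predicted[label] = model.predict_one(x_aug)
            # The prediction becomes an extra feature for the next link.
            x_aug.append(predicted[label])
        return [predicted[k] for k in sorted(predicted)]


# Toy data: label 0 copies feature 0, and label 1 always equals label 0.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
Y = [[1, 1], [1, 1], [0, 0], [0, 0]]
chain = ClassifierChain(order=[0, 1]).fit(X, Y)
print(chain.predict_one([1, 0]))  # [1, 1]
```

An ensemble in the spirit of ECC would train several such chains with different random label orders and combine their outputs; how the outputs are combined is not specified in this section.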

For comparison purposes, we used three different multi-label performance measures [17]:

• Hamming loss: each label assignment is evaluated as a separate binary decision.

$\textit{Hamming Loss}=1-\frac{1}{NL}\sum_{i=1}^{N}\sum_{l=1}^{L}1(c_{l}^{i}=\hat{c}_{l}^{i})$

• 0/1 loss: the predicted set of labels $\hat{\bf c}$ must match the true set of labels ${\bf c}$ exactly.

$\textit{0/1 Loss}=1-\frac{1}{N}\sum_{i=1}^{N}1({\bf c}^{i}=\hat{\bf c}^{i})$

• Accuracy: the multi-label accuracy measure introduced by Godbole and Sarawagi [9].

$\textit{Accuracy}=\frac{1}{N}\sum_{i=1}^{N}\frac{|{\bf c}^{i}\wedge\hat{\bf c}^{i}|}{|{\bf c}^{i}\vee\hat{\bf c}^{i}|}$

Here N is the number of test instances, L is the number of labels, and $1(\cdot)$ is the indicator function.
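The three measures can be computed directly from binary label vectors, as in the following minimal sketch. The function names are illustrative, label sets are assumed to be encoded as lists of 0/1 values of equal length, and scoring an instance with no true and no predicted labels as a perfect match is an assumption that the formulas above leave open.

```python
def hamming_loss(true, pred):
    """Fraction of individual label assignments that are wrong."""
    n, n_labels = len(true), len(true[0])
    agree = sum(t == p
                for t_row, p_row in zip(true, pred)
                for t, p in zip(t_row, p_row))
    return 1 - agree / (n * n_labels)


def zero_one_loss(true, pred):
    """Fraction of instances whose predicted label set is not exact."""
    exact = sum(list(t) == list(p) for t, p in zip(true, pred))
    return 1 - exact / len(true)


def multilabel_accuracy(true, pred):
    """Mean ratio |c AND c_hat| / |c OR c_hat| over all instances."""
    total = 0.0
    for t_row, p_row in zip(true, pred):
        inter = sum(t & p for t, p in zip(t_row, p_row))
        union = sum(t | p for t, p in zip(t_row, p_row))
        # Assumption: an empty union counts as a perfect match.
        total += inter / union if union else 1.0
    return total / len(true)


true = [[1, 0, 1], [0, 1, 0]]
pred = [[1, 0, 0], [0, 1, 0]]
print(round(hamming_loss(true, pred), 3))  # 0.167
print(zero_one_loss(true, pred))           # 0.5
print(multilabel_accuracy(true, pred))     # 0.75
```

The example shows why 0/1 loss is the strictest of the three: a single wrong label in the first instance costs half of the 0/1 score but only one sixth of the Hamming score.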

The tests were performed with the original train and test splits. For each dataset, we provide a ranking of the algorithms; the results are shown in Table 4, with ranks in parentheses.

Table 4
Results of the four classifiers under each performance measure (rank in parentheses)

Dataset	BR	CC	ECC	MNB-VWCO
Hamming loss
Medical	0.016(4)	0.016(3)	0.015(2)	0.011(1)
Enron	0.200(3)	0.212(4)	0.059(1)	0.084(2)
Delicious	0.176(4)	0.174(2)	0.176(3)	0.165(1)
Eur-lex	0.024(4)	0.024(4)	0.024(2)	0.024(2)
Rcv1	0.018(4)	0.018(1)	0.018(4)	0.018(2)
Ave.Rank	3.8	2.8	2.4	1.6
0/1 loss
Medical	0.967(3)	0.966(2)	0.965(1)	0.982(4)
Enron	1.000(4)	1.000(4)	1.000(4)	0.999(1)
Delicious	1.000(4)	1.000(4)	1.000(4)	0.998(1)
Eur-lex	0.893(4)	0.889(3)	0.886(2)	0.886(2)
Rcv1	0.783(4)	0.775(2)	0.774(1)	0.777(3)
Ave.Rank	3.8	3.0	2.4	2.2
Accuracy
Medical	0.968(4)	0.968(3)	0.968(3)	0.978(1)
Enron	0.681(3)	0.652(4)	0.887(1)	0.862(2)
Delicious	0.706(4)	0.711(2)	0.708(3)	0.744(1)
Eur-lex	0.952(4)	0.952(3)	0.952(1)	0.952(3)
Rcv1	0.963(4)	0.964(1)	0.963(4)	0.963(2)
Ave.Rank	3.8	2.6	2.4	1.8

As can be seen from this table, the three measures lead to similar conclusions. BR is the only algorithm that does not incorporate label dependencies, so it is not surprising that it under-performs the others. ECC extends CC to overcome its dependence on the label order, so it is also reasonable that ECC performs better than CC. The major outcome of the experiment is that MNB-VWCO achieves the best average rank on all three measures, which suggests that it incorporates label dependencies more effectively than CC or ECC.

For MNB-VWCO, the average rank on Hamming loss (1.6) is better than on the other evaluation measures. Hamming loss is a label-based evaluation measure rather than a label set-based one. Since MNB-VWCO treats the task as multiple binary classification problems, it is well suited to Hamming loss. Conversely, 0/1 loss is a label set-based measure, and it is where MNB-VWCO has its worst average rank (2.2).
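The average ranks in Table 4 are plain means of the per-dataset ranks. As a small worked check, the sketch below recomputes the Hamming loss row from the ranks listed in the table (in the order Medical, Enron, Delicious, Eur-lex, Rcv1); the helper name `average_ranks` is illustrative.

```python
# Per-dataset ranks copied from the Hamming loss block of Table 4.
hamming_ranks = {
    "BR": [4, 3, 4, 4, 4],
    "CC": [3, 4, 2, 4, 1],
    "ECC": [2, 1, 3, 2, 4],
    "MNB-VWCO": [1, 2, 1, 2, 2],
}


def average_ranks(ranks):
    """Mean rank of each algorithm over all datasets, to one decimal."""
    return {name: round(sum(r) / len(r), 1) for name, r in ranks.items()}


averages = average_ranks(hamming_ranks)
best = min(averages, key=averages.get)
print(averages)  # {'BR': 3.8, 'CC': 2.8, 'ECC': 2.4, 'MNB-VWCO': 1.6}
print(best)      # MNB-VWCO
```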

6. Conclusions and future work

In this paper, the multinomial naive Bayes algorithm is used to improve multi-label document classification in a number of ways. The value weighting method, a new and more fine-grained weighting method, is proposed for multinomial naive Bayes learning. Unlike traditional weighting methods, it assigns a different weight to each feature value. We have also employed a new co-training method to incorporate the dependencies among the class values.

The results of the experiments show that both the value weighting and the modified co-training methods are successful and perform better than their counterpart algorithms in most cases. This work therefore suggests that the performance of multinomial naive Bayes in multi-label text classification can be improved by using value weighting and co-training approaches.

In future work, it would be interesting to use two different classifiers as the base classifiers in the co-training method. In that case, each classifier could read the entire feature space, and thus the performance of each classifier might increase accordingly. Another direction is to employ a Bayesian network as the base classifier of the Dependency Classifier (DC). A Bayesian network would give the DC more expressive power and might thereby improve its performance.

Acknowledgments

This work was supported by the Korea Research Foundation (KRF) grant funded by the Korea government (MEST) (No. 2017R1A2A2A05069662) and by the Technology Innovation Program: Industrial Strategic Technology Development Program (No: 11073162) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea).

References

1. Arias, Gamez J.A., Nielsen T.D. and Puerta J.M., A pairwise class interaction framework for multilabel classification, Probabilistic Graphical Models (2014), 17–32.

2. Bielza and Larranaga, Multidimensional classification with Bayesian networks, International Journal of Approximate Reasoning 52(6) (2011), 705–727.

3. Cai and Hofmann, Hierarchical document categorization with support vector machines, in Proceedings of the 13th ACM International Conference on Information and Knowledge Management, 2004, pp. 78–87.

4. Cesa-Bianchi, Gentile and Zaniboni, Hierarchical classification: Combining Bayes with SVM, in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 177–184.

5. Dembczynski, Cheng and Hullermeier, Bayes optimal multilabel classification via probabilistic classifier chains, Proceedings ICML, 2010.

6. Dembczyński, Waegeman, Cheng and Hüllermeier, On label dependence and loss minimization in multi-label classification, Machine Learning 88 (2012), 5–45.

7. Frank and Asuncion, UCI machine learning repository, 2010.

8. Ghamrawi and McCallum, Collective multi-label classification, in Inter. Conf. on Inform. and Know. Manage, 2005.

9. Godbole and Sarawagi, Discriminative methods for multi-labeled classification, in The 8-th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2004.

10. Hariharan, Zelnik-Manor, Vishwanathan S.V.N. and Varma, Large scale max-margin multi-label classification with priors, in Proceedings of the 27th International Conference on Machine Learning, 2010.

11. Kim, Han, Rim and Myaeng, Some effective techniques for naïve Bayes text classification, IEEE Transactions on Knowledge and Data Engineering 18(11) (2006), 1457–1466.

12. Lee C.-H., Gutierrez and Dou, Calculating feature weights in naïve Bayes with Kullback-Leibler measure, in 11th IEEE International Conference on Data Mining, 2011.

13. Mulan, Mulan: A Java library for multi-label learning, 2012.

14. Jin, Muller, Zhai and Lu, Multi-label literature classification based on the gene ontology graph, Bioinformatics 9(1) (2008).

15. McCallum and Nigam, A comparison of event models for naive Bayes text classification, in AAAI-98 Workshop on Learning for Text Categorization, 1998.

16. McCallum, Multi-label text classification with a mixture model trained by EM, in AAAI99 Workshop on Text Learning, 1999.

17. Read, Pfahringer, Holmes and Frank, Classifier chains for multi-label classification, ECML/PKDD (2009), 254–269.

18. Rennie, Shih, Teevan and Karger, Tackling the poor assumptions of naive Bayes text classifiers, in Proceedings of the 20th International Conference on Machine Learning (ICML), 2003, pp. 616–623.

19. Rodriguez and Lozano, Multiple-objective learning of multi-dimensional Bayesian classifiers, in Inter. Conf. on Hybrid Intelligent Systems, 2008.

20. Rousu, Saunders, Szedmak and Shawe-Taylor, Learning hierarchical multi-category text classification models, in Proceedings of the 22nd International Conference on Machine Learning, 2005, pp. 774–751.

21. Schneider K.M., A new feature selection score for multinomial naïve Bayes text classification based on KL-divergence, in Proceedings of the 42nd Meeting of the Association of Computational Linguistics (ACL), 2004, pp. 186–189.

22. Schneider, Techniques for improving the performance of naïve Bayes for text classification, LNCS 3406 (2005), 682–693.

23. Song S.-H. and Lee C.-H., Improving multi-label classification of documents using fine-grained weights, in IEA/AIE 2015: The 28th International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, 2015.

24. Sucar L.E., Bielza, Morales E.F., Hernandez-Leal, Zaragoza J.H. and Larranaga, Multilabel classification with Bayesian network-based chain classifiers, Pattern Recognition Letters 41 (2014), 14–22.

25. TunedIT, Tunedit data mining blog, 2012.

26. de Waal and van der Gaag, Inference and learning in multi-dimensional Bayesian network classifiers, in Proc. of Euro. Conf. on Symb. and Quant. Appr. to Reason. with Uncertain, 2007.

27. Zhang M.-L. and Zhou Z.-H., A k-nearest neighbor based algorithm for multi-label classification, in Granular Computing, 2005 IEEE International Conference on, Vol 2, 2005, pp. 718–721.

28.