Multi-label learning vector quantization for semi-supervised classification

Abstract

Because reliably labeled data are expensive and time-consuming to acquire, exploiting unlabeled instances to improve classification accuracy is a problem of considerable practical importance. Semi-supervised classification, which fills the gap between supervised and unsupervised learning, is designed to take advantage of unlabeled data within a regular supervised learning procedure for classification tasks. In this paper we propose a self-learning framework that first pre-learns a classification model using the labeled data, then predicts the unlabeled instances in the form of soft class labels, and finally re-learns a model based on the enlarged training data. Two multi-label Learning Vector Quantization Neural Networks (LVQ-NNs), namely multi-label online LVQ-NN (mLVQo) and multi-label batch LVQ-NN (mLVQb), are proposed to work with the soft labels of training instances. The experiments demonstrate that the semi-supervised models using multi-label LVQ-NN as the base classifier can produce better generalization accuracy than the supervised counterpart.

Keywords

Semi-supervised classification, self-training, soft label, entropy, multi-label learning vector quantization

1. Introduction

Machine learning is a field of computer science concerned with learning knowledge from data with respect to a task and a performance measure. From the perspective of data, machine learning tasks can be divided into three categories: supervised learning, unsupervised learning, and semi-supervised learning. Supervised learning exploits labeled data with known inputs and desired outputs, and infers a mapping function that describes the relation between the input features and the target. The mapping function can be represented as a mathematical formula, decision tree, decision rule, neural network, or other structure, and applied to new data to produce a prediction in response to a query. Typical examples of supervised learning include face recognition, medical diagnosis, image classification, risk assessment, natural language processing, and fault detection. Widely studied supervised learning problems are binary classification, multi-class classification, multi-label classification, ranking, and real-valued prediction. Classification is a typical and integral supervised machine learning task that separates the data into distinct classes using adequate labeled instances. A diversity of classification methods have been proposed in the literature, including decision trees, multilayer perceptron neural networks, support vector machines, and nearest neighbor methods.

Unsupervised learning concerns raw data without the desired output, and aims to discover hidden patterns under specific assumptions about the structural properties of the investigated data. A diverse array of clustering methods, such as hierarchical clustering, partitional clustering, density-based clustering, and the self-organizing feature map, have been developed to detect the clusters embedded in data.

Because labeled data are difficult to acquire (e.g., high cost in time and effort, technical problems), decision makers usually face a small number of labeled instances and a large quantity of unlabeled instances in real-world classification tasks. How to exploit the unlabeled instances to improve classification accuracy during the learning procedure has therefore become an attractive problem of significant practical importance. Semi-supervised learning (SSL) is devoted to making use of the abundant unlabeled data along with the limited labeled data in the context of supervised learning. It falls between supervised learning and unsupervised learning and usually works through a direct combination of these strategies.

In this paper we propose a semi-supervised classification framework that makes use of the unlabeled instances to enhance the generalization capability of the supervised counterpart. The Learning Vector Quantization Neural Network (LVQ-NN) (for pre-learning) and its extension to multi-labeled data (for re-learning) are used as the base classifiers. First, a regular LVQ-NN classifier is constructed from the available labeled data, and the unlabeled instances are classified using this initial classifier. Afterwards, an assessment procedure selects the highly confident unlabeled instances, i.e., those whose entropy value does not exceed a preset threshold. The selected instances supplement the labeled data in the subsequent re-learning phase. However, the prediction of unlabeled instances is sometimes unreliable, and mislabeling can therefore degrade the accuracy of the subsequent classification. The motivation of this paper is to perform a soft (fuzzy) labeling of the unlabeled instances that assigns each unlabeled instance a membership value for each class, resulting in a weight (membership) matrix. In other words, each instance has multiple class labels with different weights. A new classifier capable of handling the soft-labeled data along with the weight matrix is then re-learned on the enlarged training data set using the proposed multi-label LVQ-NN algorithms. Two algorithms, called multi-label online LVQ-NN (mLVQo) and multi-label batch LVQ-NN (mLVQb), are presented to handle the multi-label instances during the learning phase. Experiments on benchmark data sets from the UCI database and a real-world financial database reveal that the semi-supervised learning approaches can achieve significantly better generalization accuracy than the supervised counterpart.

The rest of this paper is organized as follows. The related research background of semi-supervised learning is introduced in Section 2. Section 3 describes the methodology of a self-training classification framework that utilizes the unlabeled instances to boost the accuracy of supervised learning, and proposes two multi-label learning vector quantization neural networks that work with the soft-labeled training data. Section 4 describes the databases, experimental design, and comparative study of competing algorithms. Finally, we conclude the paper and outline some directions for future research in Section 5.

2. Research background

In many real applications such as remote sensing, medical diagnosis, text and image classification, and speech recognition, the acquisition of reliably labeled data is difficult due to the high cost in labor and time. In such scenarios, where labeled data are rare, learning with partially or weakly labeled data, called Partially Supervised Learning (PSL) or Semi-Supervised Learning (SSL), has become a challenging branch of machine learning and pattern recognition. Broadly, PSL is a machine learning task under weak supervision lying between supervised learning and unsupervised learning. In the past two decades, a variety of approaches have been proposed from different directions, including active learning, semi-supervised classification, learning with fuzzy labels, and semi-supervised clustering [19].

Active learning is essentially a supervised learning procedure that selects the most informative unlabeled instances through uncertainty sampling (a single query) or a committee query, and asks an expert to add the label information [25]. The active learning procedure can be performed in a sequential or pool-based manner [20]. When used for classification tasks, active learning mainly aims at the highest generalization ability with the lowest labeling effort, because an expert is needed as the supervisor to label the informative samples.

Semi-supervised classification (SSC) aims to take advantage of the unlabeled data that can potentially improve the classification during the supervised learning phase. Different from active learning, the labels are generated automatically by the learning machine itself without human intervention. The unlabeled instances are ranked by some confidence measure and the highly confident ones are added to the training set. Then the classifier is re-trained using the augmented training data set. The state-of-the-art SSC algorithms are categorized into self-training SSC [18], SSC with generative models [23], semi-supervised support vector machines [1, 4], graph-based SSC [14], SSC with disagreement [37], and transductive learning [2]. The disagreement-based SSC can be further divided into single-view (one attribute set) and multi-view (several disjoint attribute sets) semi-supervised learning [36]. Nevertheless, these methods rely on the interaction among learners to exploit the disagreement on unlabeled data. In particular, self-training SSC is a promising learning paradigm applicable to any supervised learning algorithm. A recent review of self-training techniques for semi-supervised learning is presented in [28].

Learning with fuzzy labels is a particular form of supervised learning that incorporates the fuzzy (soft) labels of input data into the learning phase. Variants of Multivariate Polynomials (MP), Multi-layer Perceptrons (MLP), Support Vector Machines (SVM), and Radial Basis Function Neural Networks (RBF-NN) have been proposed to handle training sets with soft labels [11, 26]. In [27], the standard Learning Vector Quantization Neural Network (LVQ-NN) was enhanced to work with soft labels in the context of noisy data. Initially both training data and prototypes are assigned soft labels. During learning, the prototype $p^{*}$ closest to the input $x$ is shifted towards or away from the sample depending on the similarity between their labels. If the similarity is higher than the preset threshold, the prototype $p^{*}$ is shifted towards $x$ , otherwise away from $x$ . One drawback of the algorithm is the need for a preset threshold for determining the match between the input and the winner neuron. In this paper we propose two multi-label LVQ-NN algorithms based on the same idea, i.e., using soft-labeled data to avoid the accumulation of mislabeling noise; however, we use a new updating strategy that avoids the problem of threshold specification. The soft labels of instances originate from the prediction of the pre-trained base classifier.

As opposed to semi-supervised classification, semi-supervised clustering is intended to integrate the labeled instances into the clustering process. The prior knowledge is represented as constraints in the form of must-link (instances belonging to the same group) or cannot-link (instances belonging to different groups), and integrated into the error function of clustering [6]. Various semi-supervised clustering algorithms have been proposed based on K-means, hierarchical clustering, hidden Markov random fields (HMRFs), and kernel functions [3, 14, 29]. In [9] a semi-supervised fuzzy c-means algorithm is used to boost the classification incrementally by producing the membership degree of unlabeled data to different classes, creating labels for the unlabeled data, re-training the classifier with labeled and newly labeled data, and re-clustering the remaining unlabeled data until all unlabeled data are labeled. In general, semi-supervised clustering concerns the inherent clustering task, which is outside the scope of this paper.

Table 1 lists the well-known approaches used for semi-supervised classification including the category, method, role of human, methodology, and examples.

Table 1
Some research directions for semi-supervised classification

Category	Methods	Human intervention	Methodology	Examples
Active learning	Single query	Y	Train a single learner and query the unlabeled example of least confidence	Image retrieval [12]
	Committee query	Y	Train multiple learners and query the unlabeled example of most disagreement	Information extraction from research articles [32]
Semi-supervised classification	Self-training SSC	N	Train a single classifier, select most confident unlabeled instances, and re-train the classifier on the augmented training data set	Brain-computer interface [31]; Sentence subjectivity classification [30]
	SSC with generative models	N	Use expectation maximization algorithm to optimize the model parameters	Text classification [8]
	SVM-based SSC	N	Define the decision hyperplane by incorporating the unlabeled data in optimization function	Hyperspectral image classification [22]
	Graph-based SSC	N	Find strong connected components with label propagation over the graph	Natural language processing [24]
	Disagreement-based SSC	Y/N	Train multiple learners and exploit the disagreements among the learners	Intrusion detection [16]; Email classification [17]
	Transductive learning	N	Optimize the prediction on unlabeled instances instead of generalization ability on unseen data	Harmonic analysis [10]
Learning with fuzzy labels	MP, MLP, SVM, RBF-NN, LVQ-NN	N	Adapt the regular learning algorithms to working with soft-labeled training data	Document categorization [34]; House electrical disaggregation [15]

3. Methodology of proposed approach

Given a set of labeled instances: $D_{l}=\{(x_{1},y_{1}),(x_{2},y_{2}),\ldots,(x_{L},y_{L})\}$ , and a set of unlabeled instances: $D_{u}=\{x_{L+1},x_{L+2},\ldots,x_{L+U}\}$ , where $x_{i}\in X,i=1,2,\ldots,L+U$ is a $d$ -dimensional feature vector in the input space $X$ , $y_{i}\in Y,i=1,2,\ldots,L$ is a class label in the output space $Y$ , and usually $L\ll U$ , semi-supervised classification (SSC) is intended to exploit $D_{l}\bigcup D_{u}$ and construct a function $f:X\mapsto Y$ for predicting the class of an unseen instance.

In this paper, we propose an LVQ-NN SSC approach that integrates a fuzzy labeling strategy into the self-learning phase to avoid the accumulation of mislabeling noise.

LVQ-NN is widely applied in a diversity of domains such as image analysis, document categorization, natural language processing, and financial risk assessment. The superiority of LVQ in classification over support vector machines, multivariate statistical methods, and other state-of-the-art learning methods was demonstrated in [35]. Moreover, the neurons of LVQ convey important information regarding the confidence of the predictions for unlabeled instances and thus can be used for self-learning.

Figure 1.

Scheme of the semi-supervised classification using multi-label LVQ-NN as the base classifier.

Fig. 1 outlines the scheme of the proposed semi-supervised classification, along with the experimental design described in detail in Section 4. In the first phase, a standard LVQ-NN classifier is constructed based on the labeled data set ( $D_{l}$ ). Afterwards the unlabeled instances ( $D_{u}$ ) are classified with the pre-learned classifier, resulting in a weight matrix that specifies the possibility (membership) of an instance belonging to each class. Then the assessment procedure selects the highly confident unlabeled instances ( $D_{su}$ ), i.e., those whose entropy value does not exceed a preset threshold, as a supplement to the labeled training data. In the last phase, a new classifier is re-learned based on the enlarged training data set ( $D_{l}\bigcup D_{su}$ ) using the proposed multi-label LVQ-NN algorithms. Finally the performance of the re-trained classifier is evaluated using a separate test data set ( $D_{te}$ ).

3.1 LVQ-NN learning algorithms

The learning vector quantization (LVQ) neural network (NN) is a supervised variant of the self-organizing feature map (SOM), with an array of neurons arranged regularly on the map. Each neuron is associated with a prototype that defines a class region. LVQ-NN is trained in a competitive manner: the best matching unit (BMU) is activated and strengthened (or weakened) with respect to the input depending on the match (or mismatch) of class label between the input and the BMU, so that the class boundary is adjusted accordingly. When the input has the same class label as the BMU, it exerts a positive influence on the prototype (i.e., shifts the prototype closer to the input); otherwise it exerts a negative influence (i.e., shifts the prototype away from the input).

The standard LVQ-NN can be trained in two ways. In the sequential (online) manner, the neuron is updated immediately after each input, whereas in the batch manner, the neurons are updated at the end of each epoch, when all training examples have been processed. The online LVQ-NN and batch LVQ-NN are described in Algorithms 1 and 2 respectively [13].

Algorithm 1: The online LVQ-NN algorithm (LVQo).
Notations: $m_{i}$ : the prototype of neuron $i$; $\textit{Class}(m_{i})$ : class label of neuron $i$; $m$ : number of neurons; $\alpha$ : the learning rate; $\lambda$ : maximal number of iterations
(1)	For $i=1\ldots m$ Initialize $m_{i}$ and $\textit{Class}(m_{i})$
(2)	$t=$ 0
(3)	For $i=1\ldots L$
	(3.1)	Input $x_{i}$ to the map
	(3.2)	For $j=1\ldots m$ Calculate $d(x_{i},m_{j})$
	(3.3)	$m_{p}=\underset{j}{\operatorname{argmin}}\,\,d(x_{i},m_{j})$ //Project $x_{i}$ to $m_{p}$
	(3.4)	if $\textit{Class}(m_{p})=\textit{Class}(x_{i})$ then
	(3.5)	$\qquad m_{p}=m_{p}+\alpha(x_{i}-m_{p})$ //Update $m_{p}$
	(3.6)	else $m_{p}=m_{p}-\alpha(x_{i}-m_{p})$ //Update $m_{p}$
(4)	Assign $\textit{Class}(m_{p})$ for $p=1\ldots m$ by the majority voting principle
(5)	$t=t+1$
(6)	Goto (3) until $t=\lambda$
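The online update of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' MATLAB implementation: the function names (`euclidean`, `bmu_index`, `lvq_online_epoch`), the Euclidean distance, the fixed learning rate, and the toy data are all assumptions made for illustration, and the majority-voting relabeling of step (4) is omitted.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def bmu_index(x, prototypes):
    """Index p of the best matching unit (nearest prototype) for x."""
    return min(range(len(prototypes)), key=lambda j: euclidean(x, prototypes[j]))

def lvq_online_epoch(X, y, prototypes, proto_labels, alpha):
    """One pass over steps (3.1)-(3.6): update the BMU after every input."""
    for x, label in zip(X, y):
        p = bmu_index(x, prototypes)
        # +1 attracts the prototype (labels match), -1 repels it (mismatch)
        sign = 1.0 if proto_labels[p] == label else -1.0
        prototypes[p] = [m + sign * alpha * (xi - m)
                         for m, xi in zip(prototypes[p], x)]
    return prototypes

# Hypothetical toy data: two 1-D classes located near 0 and near 10
X = [[0.0], [1.0], [9.0], [10.0]]
y = [0, 0, 1, 1]
protos = [[2.0], [8.0]]
labels = [0, 1]
for _ in range(20):  # lambda = 20 epochs, an arbitrary choice
    lvq_online_epoch(X, y, protos, labels, alpha=0.1)
```

In this toy run every input matches the label of its BMU, so both prototypes are only attracted and drift towards their class regions.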

Algorithm 2: The batch LVQ-NN algorithm (LVQb).
Notations: $m_{i}$ : the prototype of neuron i; $\textit{Class}(m_{i})$ : class label of neuron i; $\lambda$ : maximal number of iteration
(1)	For $i=1\ldots m$ Initialize $m_{i}$ and $\textit{Class}(m_{i})$
(2)	$t=$ 0
(3)	For $i=1\ldots m$ $V_{i}=\phi$
(4)	For $i=1\ldots L$
	(4.1)	Input $x_{i}$ to the map
	(4.2)	For $j=1\ldots m$ Calculate $d(x_{i},m_{j})$
	(4.3)	$m_{p}=\underset{j}{\operatorname{argmin}}\,\,d(x_{i},m_{j})$ //Project $x_{i}$ to $m_{p}$
	(4.4)	$V_{p}=V_{p}\cup\{x_{i}\}$ //Add $x_{i}$ to the Voronoi set of $m_{p}$
	(4.5)	$s_{ip}=\left\{\begin{array}{rl}1&\text{if}\,\,\textit{Class}(m_{p})=\textit{Class}(x_{i})\\ -1&\text{otherwise}\end{array}\right.$
(5)	For $p=1\ldots m$
	(5.1)	If $\sum_{x_{i}\in V_{p}}s_{ip}>0$ $m_{p}=\frac{\sum_{x_{i}\in V_{p}}s_{ip}x_{i}}{\sum_{x_{i}\in V_{p}}s_{ip}}$ //Update $m_{p}$
	(5.2)	Assign $\textit{Class}(m_{p})$ by the majority voting principle
(6)	$t=t+1$
(7)	Goto (4) until $t=\lambda$
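As a rough illustration of the batch update in Algorithm 2, the sketch below accumulates the signed sums of step (5.1) in one pass. The helper names, the squared Euclidean distance, and the toy data are assumptions; the sketch rebuilds the Voronoi sums from scratch on every call, which is one possible reading of the listing, and it leaves out the relabeling of step (5.2).

```python
def sq_dist(a, b):
    """Squared Euclidean distance (sufficient for finding the nearest prototype)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def lvq_batch_epoch(X, y, prototypes, proto_labels):
    """One epoch: collect signed Voronoi sums, then update each prototype once."""
    m, d = len(prototypes), len(prototypes[0])
    num = [[0.0] * d for _ in range(m)]  # sum of s_ip * x_i per neuron
    den = [0.0] * m                      # sum of s_ip per neuron
    for x, label in zip(X, y):
        p = min(range(m), key=lambda j: sq_dist(x, prototypes[j]))
        s = 1.0 if proto_labels[p] == label else -1.0
        den[p] += s
        for t in range(d):
            num[p][t] += s * x[t]
    # Step (5.1): move a prototype only when its signed count is positive
    return [[v / den[p] for v in num[p]] if den[p] > 0 else list(prototypes[p])
            for p in range(m)]

# Hypothetical toy data: the instance at 2.0 is a class-1 point in neuron 0's region
X = [[0.0], [1.0], [2.0], [9.0], [10.0]]
y = [0, 0, 1, 1, 1]
new_protos = lvq_batch_epoch(X, y, [[1.5], [8.5]], [0, 1])
# neuron 0: (0 + 1 - 2) / (1 + 1 - 1) = -1.0, neuron 1: (9 + 10) / 2 = 9.5
```

The mismatching instance enters with weight -1, so it pushes the first prototype away from itself rather than merely being ignored.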

3.2 Confidence assessment of unlabeled instances

One fundamental aspect of self-learning algorithms is the definition of an assessment measure that estimates the annotation confidence of unlabeled data. The effectiveness of the subsequent classification relies mostly on the quantity and quality of the selected confident examples, that is to say, on which examples are considered confident.

Information entropy, introduced by Claude Shannon [21], is one of the most popular measures for characterizing the uncertainty and irregularity of a system. The larger the entropy, the more disordered and irregular the system; conversely, the smaller the entropy, the more ordered and deterministic the system. It has performed well in evaluating the quality of features in feature selection [5], in evaluating the cutting sets in data discretization [33], etc. Entropy is a simple but effective measure, and moreover it is stable and robust to noise. In this method, information entropy is used as a diversity measure of the Voronoi set corresponding to each neuron. Unlabeled samples with a smaller entropy value have a higher certainty of belonging to one class.

Given an information system $X$ with possible values $\{x_{1},\ldots,x_{k}\}$ and probability function $p(x_{i})$ , the entropy is defined as:

$\textit{Entropy}=-\sum_{i=1}^{k}p(x_{i})\log_{b}p(x_{i})$

where $b$ is the base of the logarithm, commonly $b=2$ . In particular, Shannon's entropy is maximal if all the outcomes are equally likely, that is to say, $X$ achieves the maximal uncertainty when all possible events $x_{i}$ are equiprobable.

When using LVQ-NN as the base classifier, the confidence values for its predictions can be estimated from the Voronoi sets of the neurons. For a map neuron $m_{p}$ , the Voronoi set is defined as the set of labeled instances projected to it via the nearest principle: $V(m_{p})=\{x_{i}\in D_{l}\,|\,m_{p}=\underset{j}{\operatorname{argmin}}\,\,d(x_{i},m_{j})\}$ , where $d$ is the distance metric. Normally the Voronoi set is composed of instances from different classes, forming a hit vector $\{p_{1},p_{2},\ldots,p_{k}\}$ , where $k$ is the number of class labels, and $p_{i}$ is the frequency (percentage) of instances from class $i$ , satisfying $\sum_{i=1}^{k}p_{i}=1$ . The hit vector indicates the confidence of predicting the instances in the Voronoi set to each class.

The entropy of a Voronoi set $V$ can be calculated using the hit vector. Intuitively, the larger the entropy value, the less reliable the neuron is for future classification. In particular, the entropy achieves the maximal value $\log_{2}k$ if $p_{1}=p_{2}=\ldots=p_{k}=1/k$ . On the contrary, it achieves the minimal value 0 if $\exists j_{0},p_{j_{0}}=1$ and $p_{i}=0\,(i\neq j_{0})$ . Given a threshold $\tau\leqslant\log_{2}k$ , the neurons whose entropy value is no more than $\tau$ are selected as confident.
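The entropy-based confidence test can be sketched in a few lines of Python. The function names `entropy` and `is_confident` and the example hit vectors are hypothetical; zero-probability terms are skipped, following the usual convention that $0\log 0=0$.

```python
import math

def entropy(hit, base=2):
    """Shannon entropy of a hit vector; terms with p = 0 contribute nothing."""
    return -sum(p * math.log(p, base) for p in hit if p > 0)

def is_confident(hit, tau):
    """A neuron is treated as confident when its entropy does not exceed tau."""
    return entropy(hit) <= tau

k = 2                            # number of classes
tau = 0.6 * math.log2(k)         # one threshold from a preset list, here 0.6
pure = entropy([1.0, 0.0])       # minimal value: a single-class Voronoi set
mixed = entropy([0.5, 0.5])      # maximal value log2(k) for equiprobable classes
skewed = entropy([0.9, 0.1])     # roughly 0.47 bits
```

With this threshold a neuron with hit vector (0.9, 0.1) counts as confident, whereas one with (0.7, 0.3), whose entropy is about 0.88, does not.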

In the proposed semi-supervised classification framework, the confidence of unlabeled instances is estimated by the entropy of the corresponding Voronoi set on the pre-learned map following the nearest principle. The unlabeled instances $x_{i}\in D_{u}$ are projected to their BMUs on the learned map. Each unlabeled instance $x$ is assigned a weight vector equal to the hit vector of the underlying neuron (i.e., its BMU). The resulting weight matrix, shown in Table 2, gives the membership (weight) of the unlabeled instances with respect to the classes. Let $w(i,j)$ be the weight of instance $x_{i}$ assigned to class $C_{j}$ , satisfying $0\leqslant w(i,j)\leqslant 1$ and $\sum_{j=1}^{k}w(i,j)=1$ . In particular, $\exists j_{0},w(i,j_{0})=1$ if the corresponding Voronoi set contains only instances of a single class $j_{0}$ . It should be noted that only the unlabeled instances projected to a confident neuron are retained for the subsequent training, depending on the threshold. For $x_{i}\in D_{l}$ , $w(i,j_{0})=1$ if $x_{i}$ has class label $j_{0}$ , and $w(i,j)=0$ for $j\neq j_{0}$ . Finally the new training set is enlarged to $D_{l}\bigcup D_{su}=\{x_{1},x_{2},\ldots,x_{L+SU}\}$ along with a $(L+SU)\times k$ weight matrix $W$ , where $D_{su}$ denotes the $SU$ selected unlabeled instances with a confident prediction using the pre-learned classifier.

Table 2
Weight matrix of instances

	$C_{1}$	$C_{2}$	$\ldots$	$C_{k}$
$x_{1}$	$w(1,1)$	$w(1,2)$	$\ldots$	$w(1,k)$
$x_{2}$	$w(2,1)$	$w(2,2)$	$\ldots$	$w(2,k)$
$\vdots$	$\vdots$	$\vdots$	$\ddots$	$\vdots$
$x_{n}$	$w(n,1)$	$w(n,2)$	$\ldots$	$w(n,k)$
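The construction of the weight rows for $D_{su}$ described in this subsection can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names, the squared Euclidean distance, the integer class indices starting at 0, and the toy map are assumptions. In particular, unlabeled instances falling on a zero-hit neuron are simply skipped here, a choice the text does not specify.

```python
import math
from collections import Counter

def bmu(x, prototypes):
    """Index of the nearest prototype (squared Euclidean distance)."""
    return min(range(len(prototypes)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(x, prototypes[j])))

def hit_vectors(X_l, y_l, prototypes, k):
    """Per-neuron class frequencies over the Voronoi sets of the labeled data."""
    counts = [Counter() for _ in prototypes]
    for x, label in zip(X_l, y_l):
        counts[bmu(x, prototypes)][label] += 1
    hits = []
    for c in counts:
        n = sum(c.values())
        # None marks a zero-hit neuron (no labeled instance projected to it)
        hits.append([c[j] / n for j in range(k)] if n else None)
    return hits

def entropy(hit):
    """Shannon entropy (base 2) of a hit vector."""
    return -sum(p * math.log2(p) for p in hit if p > 0)

def select_confident(X_u, prototypes, hits, tau):
    """Return (instance, weight row) pairs whose BMU has entropy <= tau."""
    selected = []
    for x in X_u:
        hit = hits[bmu(x, prototypes)]
        if hit is not None and entropy(hit) <= tau:
            selected.append((x, hit))
    return selected

# Hypothetical pre-learned 1-D map with three neurons and two classes
prototypes = [[0.0], [5.0], [10.0]]
X_l, y_l = [[0.0], [1.0], [4.0], [6.0], [10.0]], [0, 0, 0, 1, 1]
hits = hit_vectors(X_l, y_l, prototypes, k=2)
D_su = select_confident([[0.5], [5.2], [9.0]], prototypes, hits, tau=0.6)
```

Here the middle neuron receives one labeled instance of each class, so its entropy is 1 and the unlabeled instance at 5.2 is rejected, while the other two instances inherit the one-hot hit vectors of their BMUs.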

3.3 Multi-label LVQ-NN algorithms

The standard form of LVQ-NN is designed for training data with a definite class, i.e., each instance has a unique label. In this paper we extend the LVQ-NN algorithms to multi-label instances using a weight matrix denoting the membership of each instance w.r.t. the class labels. In other words, the definite class is replaced with a fuzzy class specification along with the membership values. Accordingly, two multi-label learning vector quantization algorithms, namely mLVQo (online multi-label LVQ-NN) and mLVQb (batch multi-label LVQ-NN), are presented. Both are designed to adjust the class boundary by updating the prototypes in response to the training data. The mLVQo algorithm is described in Algorithm 3. For each input, the distance between the input and each neuron is calculated and the BMU is obtained. The difference from the standard online LVQ-NN is that the weight matrix of instances is incorporated into the updating of prototypes, so that a given input has an accumulative influence on the BMU with respect to its class memberships. To improve on the efficiency of online learning, we also propose a batch version of multi-label LVQ-NN, shown in Algorithm 4.

Algorithm 3: The online multi-label LVQ-NN algorithm (mLVQo).
Notations: $m_{i}$ : the prototype of neuron $i$; $\textit{Class}(m_{i})$ : class label of neuron $i$; $w(i,l)$ : weight matrix of training instances; $\alpha$ : the learning rate; $\lambda$ : maximal number of iterations; $k$ : number of class labels
(1)	For $i=1\ldots m$ Initialize $m_{i}$ and $\textit{Class}(m_{i})$
(2)	$t=$ 0
(3)	For $i=1\ldots L+SU$
	(3.1)	Input $x_{i}$ to the map
	(3.2)	For $j=1\ldots m$ Calculate $d(x_{i},m_{j})$
	(3.3)	$m_{p}=\underset{j}{\operatorname{argmin}}\,\,d(x_{i},m_{j})$ //Project $x_{i}$ to $m_{p}$
	(3.4)	for $l=1\ldots k$
		(3.4.1)	if $\textit{Class}(m_{p})=l$ then
		(3.4.2)	$\qquad m_{p}=m_{p}+\alpha w(i,l)(x_{i}-m_{p})$
		(3.4.3)	else $m_{p}=m_{p}-\alpha w(i,l)(x_{i}-m_{p})$
(4)	Assign $\textit{Class}(m_{p})$ for $p=1\ldots m$ by the majority voting principle
(5)	$t=t+1$
(6)	Goto (3) until $t=\lambda$
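The inner loop (3.4) of Algorithm 3 can be sketched in Python as below. The function name `mlvqo_update`, the integer class indices starting at 0, and the numbers are assumptions for illustration. Because the weights of an instance sum to 1, the $k$ sequential shifts amount, to first order in $\alpha$, to a single step of size $\alpha(2w(i,c)-1)$ towards $x_{i}$ , where $c=\textit{Class}(m_{p})$ ; this observation follows from the update rule and is not stated in the original text.

```python
def mlvqo_update(m_p, x, w_i, proto_class, alpha):
    """Steps (3.4.1)-(3.4.3): one signed, weighted shift of the BMU per class label."""
    m = list(m_p)
    for l, w in enumerate(w_i):
        # attract with weight w(i,l) if l is the neuron's class, repel otherwise
        sign = 1.0 if proto_class == l else -1.0
        m = [mj + sign * alpha * w * (xj - mj) for mj, xj in zip(m, x)]
    return m

# A one-hot weight row reproduces the standard online LVQ update
hard_match = mlvqo_update([0.0], [1.0], [1.0, 0.0], proto_class=0, alpha=0.1)
hard_mismatch = mlvqo_update([0.0], [1.0], [0.0, 1.0], proto_class=0, alpha=0.1)
# An evenly split soft label almost cancels out: 2 * 0.5 - 1 = 0 to first order
soft_even = mlvqo_update([0.0], [1.0], [0.5, 0.5], proto_class=0, alpha=0.1)
```

A confidently labeled instance therefore moves the prototype almost as strongly as a hard label would, while an ambiguous one barely moves it, which is how the soft labels limit the effect of mislabeling.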

Algorithm 4: The batch multi-label LVQ-NN algorithm (mLVQb).
Notations: $m_{i}$ : the prototype of neuron i; $\textit{Class}(m_{i})$ : class label of neuron i; $w(i,l)$ : weight matrix of training instances, $\lambda$ : maximal number of iteration, k: number of class labels
(1)	For $i=1\ldots m$ Initialize $m_{i}$ and $\textit{Class}(m_{i})$
(2)	$t=$ 0
(3)	For $i=1\ldots m$ $V_{i}=\phi$
(4)	For $i=1\ldots L+SU$
	(4.1)	Input $x_{i}$ to the map
	(4.2)	For $j=1\ldots m$ Calculate $d(x_{i},m_{j})$
	(4.3)	$m_{p}=\underset{j}{\operatorname{argmin}}\,\,d(x_{i},m_{j})$ //Project $x_{i}$ to $m_{p}$
	(4.4)	$V_{p}=V_{p}\cup\{x_{i}\}$ //Add $x_{i}$ to the Voronoi set of $m_{p}$
	(4.5)	$s_{ip}=0$; for $l=1\ldots k$
		(4.5.1)	$s_{ip}=\left\{\begin{array}{ll}s_{ip}+w(i,l)&\text{if}\,\,\textit{Class}(m_{p})=l\\ s_{ip}-w(i,l)&\text{otherwise}\end{array}\right.$
(5)	For $p=1\ldots m$
	(5.1)	If $\sum_{x_{i}\in V_{p}}s_{ip}>0$ $m_{p}=\frac{\sum_{x_{i}\in V_{p}}s_{ip}x_{i}}{\sum_{x_{i}\in V_{p}}s_{ip}}$ //Update $m_{p}$
	(5.2)	Assign $\textit{Class}(m_{p})$ by the majority voting principle
(6)	$t=t+1$
(7)	Goto (4) until $t=\lambda$
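A minimal Python sketch of one epoch of Algorithm 4 is given below, under the same assumptions as before (hypothetical function names, squared Euclidean distance, class indices starting at 0, toy data, no relabeling step). The accumulated coefficient $s_{ip}$ is computed starting from 0, and since each weight row sums to 1 it equals $2w(i,c)-1$ with $c=\textit{Class}(m_{p})$ .

```python
def signed_weight(w_i, proto_class):
    """s_ip: +w(i,l) for the neuron's own class, -w(i,l) for every other class."""
    return sum(w if l == proto_class else -w for l, w in enumerate(w_i))

def mlvqb_epoch(X, W, prototypes, proto_labels):
    """One epoch: accumulate weighted Voronoi sums, then update each prototype."""
    m, d = len(prototypes), len(prototypes[0])
    num = [[0.0] * d for _ in range(m)]  # sum of s_ip * x_i per neuron
    den = [0.0] * m                      # sum of s_ip per neuron
    for x, w_i in zip(X, W):
        p = min(range(m),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, prototypes[j])))
        s = signed_weight(w_i, proto_labels[p])
        den[p] += s
        for t in range(d):
            num[p][t] += s * x[t]
    # Step (5.1): update only when the accumulated coefficient is positive
    return [[v / den[p] for v in num[p]] if den[p] > 0 else list(prototypes[p])
            for p in range(m)]

# Hypothetical toy data: hard labels for labeled rows, soft labels for D_su rows
X = [[0.0], [1.0], [2.0], [8.0], [10.0]]
W = [[1.0, 0.0], [1.0, 0.0], [0.25, 0.75], [0.0, 1.0], [0.25, 0.75]]
new_protos = mlvqb_epoch(X, W, [[1.0], [9.0]], [0, 1])
# neuron 0: (0 + 1 - 0.5 * 2) / 1.5 = 0.0, neuron 1: (8 + 0.5 * 10) / 1.5
```

The softly labeled instance at 2.0 enters neuron 0 with coefficient -0.5 rather than -1, so it repels the prototype less than a hard mismatch would.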

4. Experiments and results

In the experiments, we carry out an empirical study of the proposed SSC approaches built on the two multi-label LVQ-NN algorithms respectively. The performance is compared with the regular supervised learning counterpart in terms of prediction accuracy.

4.1 Experimental design

The experiments are conducted in the MATLAB environment as follows:

Step 1: Given an experimental data set $D$ , it is divided into two subsets by holdout (70% for training and 30% for performance validation). The training set is subsequently partitioned into two parts, labeled ( $D_{l}$ ) and unlabeled ( $D_{u}$ ) instances, according to a ratio $R=L/(L+U)$ . By varying the ratio from 5% to 50%, a series of labeled data sets is generated, with the remaining training instances left unlabeled. Finally we have three subsets: $D_{l}$ (labeled data set), $D_{u}$ (unlabeled data set), and $D_{te}$ (test data set).
Step 2: For each generated training data set, the semi-supervised classifiers are constructed using the proposed multi-label LVQ-NN algorithms. The threshold $\tau$ for selecting the confident unlabeled instances is chosen from the preset list $\{0.2,0.4,0.6,0.8,1\}\times\log_{2}k$ with respect to the best accuracy, where $k$ is the number of classes.
Step 3: The test data are input to the re-learned classifier, and the performance is evaluated in terms of prediction accuracy.
Step 4: The process is repeated 10 times with random data division, and the average results are calculated.
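The data division of Step 1 and the threshold list of Step 2 can be sketched in Python as follows. The experiments themselves were run in MATLAB; the function names, the seeded shuffle, the rounding of subset sizes, and the absence of class stratification are assumptions of this sketch.

```python
import math
import random

def split_indices(n, ratio, seed=0, train_frac=0.7):
    """Holdout split into labeled, unlabeled and test indices, R = L / (L + U)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)      # random data division
    n_train = round(train_frac * n)       # 70% of the data for training
    n_lab = round(ratio * n_train)        # labeled share of the training set
    return idx[:n_lab], idx[n_lab:n_train], idx[n_train:]

def threshold_grid(k):
    """Candidate thresholds {0.2, 0.4, 0.6, 0.8, 1} * log2(k)."""
    return [f * math.log2(k) for f in (0.2, 0.4, 0.6, 0.8, 1.0)]

# French data set: 1200 instances, two classes, ratio R = 20%
D_l, D_u, D_te = split_indices(1200, ratio=0.20)
taus = threshold_grid(2)
```

For the French data set at $R=$ 20% this gives 168 labeled, 672 unlabeled and 360 test instances, matching the sizes quoted in the captions of Figs. 2 and 3. Repeating the call with ten different seeds would correspond to Step 4.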

4.2 Data sets

In the experiments, 9 data sets from the UCI database [7], covering different domains and varying in the number of classes, features, and instances, are used for performance comparison. Another benchmark data set studied in the experiments is French, a financial data set of small and medium-sized companies described in Table 3. It was used to predict the status (healthy or distressed) of a company over a period of years. A balanced subset composed of 600 distressed companies and 600 healthy ones is used in the experiments. The properties of the experimental data sets are summarized in Table 4. Each categorical feature is converted to several binary features, with one binary feature per categorical value in the new data set.

Table 3
Financial ratios of French database

Variable description
$x_{1}$ -	Number of Employees Previous year	$x_{16}$ -	Cashflow/Turnover
$x_{2}$ -	Capital Employed/Fixed Assets	$x_{17}$ -	Working Capital/Turnover days
$x_{3}$ -	Financial Debt/Capital Employed	$x_{18}$ -	Net Current Assets/Turnover days
$x_{4}$ -	Depreciation of Tangible Assets	$x_{19}$ -	Working Capital Needs/Turnover
$x_{5}$ -	Working Capital/Current Assets	$x_{20}$ -	Export
$x_{6}$ -	Current ratio	$x_{21}$ -	Added Value per Employee in k EUR
$x_{7}$ -	Liquidity Ratio	$x_{22}$ -	Total Assets Turnover
$x_{8}$ -	Stock Turnover days	$x_{23}$ -	Operating Profit Margin
$x_{9}$ -	Collection Period days	$x_{24}$ -	Net Profit Margin
$x_{10}$ -	Credit Period days	$x_{25}$ -	Added Value Margin
$x_{11}$ -	Turnover per Employee k EUR	$x_{26}$ -	Part of Employees
$x_{12}$ -	Interest/Turnover	$x_{27}$ -	Return on Capital Employed
$x_{13}$ -	Debt Period days	$x_{28}$ -	Return on Total Assets
$x_{14}$ -	Financial Debt/Equity	$x_{29}$ -	EBIT Margin
$x_{15}$ -	Financial Debt/Cashflow	$x_{30}$ -	EBITDA Margin

Table 4

Data set description (N: numerical feature, C: categorical feature)

Data sets	No. instances	No. features (N/C)	No. classes
French	1200	30/0	2
Transfusion	748	4/0	2
Wpbc	198	33/0	2
Haberman	306	3/0	2
Ionosphere	351	34/0	2
German	1000	3/21	2
Credit	690	6/9	2
Echocardiogram	131	6/1	2
Wine	178	13/0	3
Iris	150	4/0	3

4.3 Results and discussion

It is known that the quantity and quality of the confident instances selected from the unlabeled data set are vital factors for the final classification. In the proposed framework they depend on the entropy value and a preset threshold. Figure 2 shows the projection of the labeled data set ( $D_{l}$ ) on the pre-learned map in the first phase. The distance matrix information is extracted from the U-matrix, and modified by knowledge of zero-hit (interpolative) units. The histogram is then calculated as the frequency of instances of different class labels projected to the same neuron. In this example (the French data set with two classes), it is a 2-d vector corresponding to the two classes. Three visualizations are shown: the color code with the number of hits on each unit, the pie chart of the histogram, and the neuron labels assigned by the majority voting principle. Neurons colored both red and blue in the pie-chart histogram are less confident than those colored only red (bankrupt) or only blue (healthy).

Figure 2.

Visualizations of learned map in the first phase (French data set, $|D_{l}|=$ 168 and $|D_{u}|=$ 672 at $R=$ 20%).

As mentioned, the entropy value indicates the confidence in labeling an instance in $D_{u}$ . In Fig. 3, the entropy value is visualized for each map neuron with the colormap on the right, taking the French data set ( $R=$ 20%) as an example. Zero-hit neurons are shown in white, meaning that no labeled instance is projected to them in the first phase. The values shown at the positions of the neurons indicate the number of unlabeled instances ( $D_{u}$ ) projected to the underlying neuron. By adjusting the threshold, a varying number of unlabeled instances are selected as the confident instances ( $D_{su}$ ) to enlarge the training data set for the subsequent re-learning. The smaller the threshold, the fewer unlabeled instances are considered confident. In particular, if the threshold is set to the maximum (i.e., $\log_{2}k$ , where $k$ is the number of classes), all unlabeled instances are retained for the re-learning phase. However, the optimal value is difficult to set in practice. In this paper, a grid search strategy is used to find the best one from a list of preset values.

Figure 3.

Visualization of entropy value on the map (French data set, $|D_{l}|=$ 168, and $|D_{u}|=$ 672).

Figure 4.

Performance comparison in terms of accuracy.

Table 5

Performance comparison of the two multi-label LVQ-NN algorithms in terms of average accuracy, with the standard deviation in brackets. The best results for each dataset and ratio are underlined, and * denotes statistical significance at the 5% level w.r.t. the regular LVQ-NN.

Ratio	Iris			French
	LVQ-NN	mLVQb	mLVQo	LVQ-NN	mLVQb	mLVQo
5%	0.798(0.11)	0.804(0.10)	0.762(0.19)	0.800(0.05)	0.824(0.03)*	0.826(0.03)*
10%	0.891(0.05)	0.911(0.06)*	0.913(0.06)*	0.799(0.02)	0.832(0.02)*	0.832(0.02)*
15%	0.898(0.06)	0.904(0.10)	0.900(0.10)	0.834(0.04)	0.845(0.03)	0.845(0.03)
20%	0.924(0.05)	0.933(0.03)	0.936(0.04)	0.851(0.02)	0.863(0.02)*	0.863(0.02)*
25%	0.911(0.03)	0.947(0.02)*	0.947(0.03)*	0.851(0.03)	0.863(0.01)*	0.865(0.01)*
30%	0.949(0.02)	0.967(0.03)*	0.958(0.02)	0.844(0.02)	0.861(0.02)*	0.864(0.01)*
35%	0.940(0.03)	0.964(0.03)*	0.969(0.02)*	0.843(0.02)	0.872(0.01)*	0.870(0.01)*
40%	0.938(0.03)	0.956(0.03)*	0.960(0.03)*	0.860(0.02)	0.874(0.01)*	0.869(0.02)*
45%	0.942(0.03)	0.951(0.03)	0.953(0.02)	0.861(0.02)	0.872(0.02)*	0.871(0.01)*
50%	0.938(0.03)	0.956(0.02)*	0.953(0.03)*	0.874(0.02)	0.874(0.02)	0.874(0.02)
	Transfusion			Wpbc
	LVQ-NN	mLVQb	mLVQo	LVQ-NN	mLVQb	mLVQo
5%	0.689(0.06)	0.732(0.04)*	0.729(0.04)*	0.695(0.09)	0.720(0.07)*	0.697(0.06)
10%	0.708(0.04)	0.729(0.04)*	0.729(0.04)*	0.632(0.10)	0.692(0.08)*	0.707(0.07)*
15%	0.707(0.03)	0.737(0.04)*	0.733(0.04)*	0.610(0.11)	0.676(0.08)*	0.702(0.08)*
20%	0.714(0.04)	0.745(0.03)*	0.748(0.03)	0.690(0.06)	0.741(0.04)*	0.725(0.05)
25%	0.716(0.04)	0.736(0.04)*	0.731(0.04)	0.636(0.05)	0.705(0.04)*	0.707(0.07)*
30%	0.714(0.02)	0.748(0.02)*	0.742(0.02)*	0.661(0.05)	0.736(0.03)*	0.731(0.06)*
35%	0.733(0.02)	0.753(0.02)*	0.752(0.02)*	0.692(0.09)	0.753(0.05)*	0.759(0.05)*
40%	0.724(0.03)	0.739(0.02)*	0.733(0.02)	0.680(0.06)	0.746(0.03)*	0.749(0.05)*
45%	0.732(0.03)	0.758(0.03)*	0.752(0.03)*	0.690(0.04)	0.746(0.03)*	0.746(0.06)*
50%	0.750(0.03)	0.766(0.03)*	0.762(0.02)	0.693(0.07)	0.766(0.04)*	0.749(0.04)*
	Haberman			Ionosphere
	LVQ-NN	mLVQb	mLVQo	LVQ-NN	mLVQb	mLVQo
5%	0.605(0.13)	0.638(0.14)	0.653(0.13)*	0.772(0.08)	0.798(0.07)	0.780(0.08)
10%	0.635(0.05)	0.678(0.07)*	0.676(0.07)*	0.750(0.10)	0.782(0.08)	0.744(0.11)
15%	0.644(0.09)	0.693(0.05)*	0.695(0.05)*	0.795(0.05)	0.830(0.05)*	0.828(0.05)*
20%	0.631(0.05)	0.697(0.04)*	0.688(0.05)*	0.788(0.09)	0.839(0.06)*	0.827(0.06)*
25%	0.619(0.06)	0.654(0.05)*	0.662(0.05)*	0.830(0.04)	0.856(0.03)*	0.850(0.05)
30%	0.657(0.07)	0.686(0.07)*	0.700(0.06)*	0.827(0.05)	0.870(0.04)*	0.858(0.04)*
35%	0.633(0.07)	0.682(0.07)*	0.679(0.06)*	0.863(0.05)	0.873(0.04)	0.873(0.04)
40%	0.626(0.05)	0.677(0.05)*	0.677(0.05)*	0.872(0.03)	0.872(0.04)	0.870(0.04)
45%	0.656(0.05)	0.708(0.08)*	0.688(0.07)*	0.847(0.06)	0.885(0.03)*	0.874(0.04)*
50%	0.624(0.07)	0.678(0.06)*	0.671(0.05)*	0.822(0.06)	0.863(0.04)*	0.849(0.05)*
	German			Credit
	LVQ-NN	mLVQb	mLVQo	LVQ-NN	mLVQb	mLVQo
5%	0.672(0.04)	0.699(0.03)*	0.691(0.02)*	0.769(0.06)	0.815(0.04)*	0.827(0.04)*
10%	0.681(0.03)	0.688(0.03)	0.697(0.02)	0.782(0.04)	0.826(0.03)*	0.830(0.03)*
15%	0.678(0.03)	0.707(0.02)*	0.696(0.02)*	0.786(0.06)	0.814(0.04)*	0.809(0.05)*
20%	0.681(0.05)	0.709(0.03)*	0.701(0.02)	0.801(0.03)	0.833(0.03)*	0.840(0.03)*
25%	0.694(0.02)	0.698(0.02)	0.692(0.03)	0.815(0.03)	0.849(0.02)*	0.851(0.02)*
30%	0.692(0.03)	0.701(0.03)*	0.691(0.04)	0.825(0.03)	0.836(0.03)*	0.843(0.02)*
35%	0.687(0.03)	0.702(0.02)*	0.699(0.02)*	0.833(0.03)	0.849(0.02)*	0.849(0.02)*
40%	0.699(0.03)	0.712(0.03)*	0.696(0.03)	0.819(0.04)	0.843(0.02)*	0.848(0.03)*
45%	0.693(0.03)	0.707(0.02)*	0.691(0.03)	0.818(0.02)	0.844(0.02)*	0.851(0.02)*
50%	0.694(0.03)	0.707(0.02)*	0.696(0.03)	0.839(0.03)	0.853(0.02)*	0.854(0.02)*
	Echocardiogram			Wine
	LVQ-NN	mLVQb	mLVQo	LVQ-NN	mLVQb	mLVQo
5%	0.538(0.17)	0.554(0.15)	0.567(0.14)	0.830(0.13)	0.858(0.14)*	0.832(0.19)
10%	0.641(0.07)	0.646(0.07)	0.651(0.07)	0.900(0.11)	0.925(0.07)	0.940(0.04)
15%	0.638(0.06)	0.692(0.08)*	0.692(0.08)*	0.913(0.04)	0.942(0.03)*	0.938(0.04)*
20%	0.595(0.06)	0.651(0.07)*	0.662(0.08)*	0.898(0.09)	0.945(0.02)*	0.953(0.03)*
25%	0.641(0.11)	0.682(0.07)*	0.677(0.08)	0.925(0.05)	0.943(0.03)*	0.945(0.02)*
30%	0.592(0.08)	0.672(0.06)*	0.723(0.07)*	0.923(0.04)	0.949(0.03)*	0.947(0.04)*
35%	0.603(0.08)	0.633(0.08)	0.664(0.07)*	0.934(0.04)	0.957(0.03)	0.949(0.04)
40%	0.610(0.06)	0.674(0.07)*	0.685(0.06)*	0.911(0.04)	0.932(0.03)*	0.942(0.04)*
45%	0.610(0.07)	0.677(0.07)*	0.705(0.06)*	0.951(0.03)	0.960(0.02)	0.955(0.03)
50%	0.569(0.09)	0.656(0.09)*	0.687(0.08)*	0.938(0.04)	0.951(0.03)	0.955(0.02)*

Table 5 reports the performance of the two multi-label LVQ-NN algorithms in the context of semi-supervised classification. For easy comparison, the results on these datasets are also shown in Fig. 4. The prediction accuracy, averaged over ten runs for ratio values varying from 5% to 50%, is compared with the regular batch LVQ-NN (which does not use unlabeled instances) as the baseline. In general, almost all classifiers achieve better accuracy as the size of the labeled data increases. The two multi-label LVQ-NN algorithms match or exceed the prediction accuracy of the baseline in nearly all settings, with statistical significance at the 5% level in most cases; the few exceptions, all involving mLVQo (for example, on Iris at 5% and on German at 25%), fall within one standard deviation of the baseline. For clean datasets such as Iris and Wine, the self-training paradigm, by exploiting the unlabeled data, achieves satisfactory prediction accuracy with fewer labeled examples than the regular supervised classifier. When the original dataset contains much noise, as Wpbc does, increasing the size of the labeled data does not result in better generalization capability. For example, when the ratio $R$ is increased from 5% to 50% for the Wpbc dataset, the baseline accuracy on the test data shows no apparent improvement (it always stays below 70%). However, the self-learning classifier raises the generalization accuracy up to 76.6%, owing to the selection of highly confident unlabeled instances and the correction of some mislabeled instances in the original dataset. This gives some evidence that using the unlabeled data as valuable background information during supervised learning is a promising way to improve classification performance when labeled data is scarce but unlabeled data is abundant.
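The Wpbc figures quoted above can be checked directly against Table 5. The short sketch below copies the Wpbc columns of the table (ratios 5% to 50%) and computes the per-ratio gain of each multi-label variant over the baseline; the variable names are chosen for this example only.

```python
# Wpbc accuracies from Table 5, ratios 5%, 10%, ..., 50%
ratios = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
baseline = [0.695, 0.632, 0.610, 0.690, 0.636, 0.661, 0.692, 0.680, 0.690, 0.693]
mlvqb = [0.720, 0.692, 0.676, 0.741, 0.705, 0.736, 0.753, 0.746, 0.746, 0.766]
mlvqo = [0.697, 0.707, 0.702, 0.725, 0.707, 0.731, 0.759, 0.749, 0.746, 0.749]

# Accuracy gain over the supervised baseline at each ratio
gain_b = [b - a for a, b in zip(baseline, mlvqb)]
gain_o = [o - a for a, o in zip(baseline, mlvqo)]

mean_gain_b = sum(gain_b) / len(gain_b)
mean_gain_o = sum(gain_o) / len(gain_o)

print("best baseline accuracy:", max(baseline))
print("best mLVQb accuracy:   ", max(mlvqb))
print("mean gain of mLVQb:    ", round(mean_gain_b, 4))
print("mean gain of mLVQo:    ", round(mean_gain_o, 4))
```

The baseline never reaches 70% on Wpbc, while both multi-label variants improve on it at every ratio, with mLVQb peaking at 0.766 for $R=$ 50%.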

5. Conclusions and future work

In many scenarios, label annotation is a difficult, expensive, time-consuming, labor-intensive, and error-prone task that requires a great deal of human effort and experience, whereas unlabeled data is abundant and easily obtained. Because a small amount of labeled data and a large amount of unlabeled data often coexist in practice, semi-supervised classification, which aims to leverage the unlabeled examples, has received increasing attention in recent decades. Among the variety of approaches, self-learning algorithms are attractive: they pre-learn the class of the unlabeled data automatically and select the instances of high confidence for the re-learning phase. In this paper, we proposed two multi-label LVQ-NN algorithms that work with soft-labeled data. Both algorithms are based on the same idea, namely that an input with multiple labels has an accumulative influence on the best-matching neuron according to its membership in each class. The multi-label LVQ-NN algorithms are integrated into a semi-supervised classification framework and tested on several real databases from diverse domains. The results show that incorporating unlabeled instances of high confidence, in the form of soft class labels, during the supervised learning procedure improves the generalization performance.

Future work will address some limitations of this empirical study. Firstly, in the current framework the most confident instances are selected with respect to a preset threshold. In a deeper study, the best threshold value will be optimized automatically through strategies such as cross-validation and global optimization. Secondly, the proposed framework is inherently a wrapper applicable to any classification method. In further studies, we will investigate the performance of the framework when state-of-the-art supervised learning algorithms serve as the base classifier. Further comparative experiments with competing SSC solutions will also be carried out.
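To make the wrapper idea and the threshold search concrete, the following sketch wires an arbitrary soft-label classifier into a two-phase self-training loop and picks the entropy threshold on a held-out validation set. It is an illustration under stated assumptions, not the method evaluated in this paper: the base learner is a simple soft nearest-centroid classifier standing in for LVQ-NN, every class is assumed to appear in the labeled data, and all names and the synthetic data are hypothetical.

```python
import numpy as np

class SoftCentroidClassifier:
    """Stand-in base learner (NOT LVQ-NN): one centroid per class, fitted on
    soft labels, with memberships from a softmax over negative distances."""

    def fit(self, X, Y_soft):
        # Membership-weighted mean of the inputs for every class
        weights = Y_soft / Y_soft.sum(axis=0, keepdims=True)
        self.centroids_ = weights.T @ X
        return self

    def predict_proba(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        e = np.exp(-(d - d.min(axis=1, keepdims=True)))
        return e / e.sum(axis=1, keepdims=True)

def entropy(P):
    """Base-2 Shannon entropy of each row, with 0 * log2(0) taken as 0."""
    safe = np.where(P > 0, P, 1.0)
    return -(P * np.log2(safe)).sum(axis=1)

def self_train(make_model, X_l, y_l, X_u, n_classes, threshold):
    Y_l = np.eye(n_classes)[y_l]                # one-hot labels for D_l
    model = make_model().fit(X_l, Y_l)          # phase 1: pre-learn on D_l
    P_u = model.predict_proba(X_u)              # soft labels for D_u
    keep = entropy(P_u) <= threshold + 1e-12    # confident subset D_su
    X_aug = np.vstack([X_l, X_u[keep]])
    Y_aug = np.vstack([Y_l, P_u[keep]])
    # Phase 2: re-learn on the enlarged, soft-labeled training set
    return make_model().fit(X_aug, Y_aug), int(keep.sum())

def grid_search_threshold(make_model, X_l, y_l, X_u, X_val, y_val, n_classes, grid):
    """Return (threshold, validation accuracy) of the best grid value."""
    best = None
    for t in grid:
        model, _ = self_train(make_model, X_l, y_l, X_u, n_classes, t)
        acc = float((model.predict_proba(X_val).argmax(axis=1) == y_val).mean())
        if best is None or acc > best[1]:
            best = (t, acc)
    return best

# Synthetic two-class data: few labeled points, many unlabeled ones
rng = np.random.default_rng(0)
X0 = rng.normal(loc=-2.0, size=(60, 2))
X1 = rng.normal(loc=2.0, size=(60, 2))
X_l = np.vstack([X0[:5], X1[:5]])
y_l = np.array([0] * 5 + [1] * 5)
X_u = np.vstack([X0[5:40], X1[5:40]])
X_val = np.vstack([X0[40:], X1[40:]])
y_val = np.array([0] * 20 + [1] * 20)

grid = [0.1, 0.3, 0.5, 0.7, 1.0]
best_t, best_acc = grid_search_threshold(
    SoftCentroidClassifier, X_l, y_l, X_u, X_val, y_val, 2, grid)
_, n_all = self_train(SoftCentroidClassifier, X_l, y_l, X_u, 2, np.log2(2))
_, n_none = self_train(SoftCentroidClassifier, X_l, y_l, X_u, 2, -1.0)
print(best_t, best_acc, n_all, n_none)
```

Any classifier exposing the same `fit` and `predict_proba` interface could replace the stand-in. Replacing the single validation split with k-fold cross-validation would be one way to follow the direction outlined above.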

Acknowledgments

This work was supported by national funds through the Beijing National Science Foundation (9182017), National Natural Science Foundation of China (11601129), and Introduction of Talent Research Fund of Henan Polytechnic University (Y2017-1).

