Abstract
To improve the performance of semi-supervised learning, a safe semi-supervised classification algorithm based on an active learning sampling strategy is proposed. First, an active learning sampling method based on uncertainty and representativeness is designed. A weighted combination of uncertainty and representativeness is used to select informative and representative unlabeled samples, which are then provided to semi-supervised learning. Second, a label prediction method based on grouping verification is designed. Each unlabeled sample selected by active learning is prelabeled with every candidate pseudo-label in turn. The pseudo-labeled sample is added to the labeled sample set, which is then grouped, trained on and tested. The error corresponding to each pseudo-label is calculated, and the pseudo-label that yields the lowest error is selected as the candidate label of the unlabeled sample. Third, a security verification method is designed. A candidate label is accepted as the final label of the unlabeled sample, and thus used to expand the labeled sample set, only if it makes the classification error lower than before. Iterations are repeated until a certain precision is met. Finally, the classifier is trained using the final labeled set. Experiments are carried out on semi-supervised benchmark datasets and UCI datasets, and the results show that the proposed algorithms are effective.
Introduction
Semi-supervised learning [1, 2] is a machine learning approach that lies between supervised learning and unsupervised learning. Its purpose is to make full use of a large number of unlabeled samples to make up for the lack of labeled samples. Semi-supervised learning is divided into semi-supervised clustering and semi-supervised classification. The main aim of semi-supervised classification is to study how to use a large number of unlabeled samples to help train a supervised classifier when labeled samples are insufficient.
Active learning and semi-supervised learning face the same problems and challenges. Both use labeled samples to build highly accurate classifiers while reducing the workload of manually labeling unlabeled samples. What they have in common is that both label unlabeled samples and thereby expand the number of labeled samples [3, 4]. The major differences between them are as follows: (1) A semi-supervised learning algorithm labels the unlabeled samples automatically, whereas in active learning the unlabeled samples are labeled manually by experts; (2) Active learning screens the unlabeled samples, and only the samples with abundant information are labeled.
For semi-supervised learning, the classification model is computed from a small number of labeled samples and a large number of unlabeled samples. The labeled samples are mainly used to initialize the classification model, and the performance of this initial model affects the performance of the final classification model. For a semi-supervised classification algorithm, it is therefore necessary to draw labeled samples according to the data distribution and to label, as far as possible, the unlabeled samples with abundant information, which is an effective way to improve the accuracy of the initial classification model [5, 6]. This calls for combining active learning with semi-supervised learning, with active learning selecting highly informative samples to provide to semi-supervised learning.
Some scholars have combined active learning with semi-supervised learning and proposed several improved semi-supervised learning algorithms. Embedding active learning into semi-supervised learning, known as active semi-supervised learning, can effectively improve the generalization performance of the classifier. Zhang [7] applied co-training to select the most reliable instances according to two criteria, high confidence and nearest neighbor, for boosting the classifier, and exploited the most informative instances with human annotation to improve classification performance. He [8] combined active learning and semi-supervised learning to obtain confident and sufficient labeled training data for multivariate time series classification. Chang [9] proposed a new active learner that learns pairwise constraints, known as must-link and cannot-link constraints, combining semi-supervised clustering with active learning. Reitmaier [6] proposed transductive active learning, in which a completely labeled data pool is provided in each active learning cycle. A generative model is used to label all samples not yet labeled by an expert, and the resulting pool is used to train the classifier within the active learning process. Yan [10] introduced active learning into the process of semi-supervised learning: some samples are selected for manual labeling, and these newly labeled samples are used to optimize semi-supervised learning. Hajmohammadi [11] proposed a learning model that combines uncertainty-based active learning with semi-supervised self-training to incorporate unlabeled sentiment documents from the target language and thereby improve the performance of cross-lingual methods.
However, existing semi-supervised classification algorithms combined with active learning still leave considerable room for improvement. The main problems are as follows: (1) The samples selected by active learning are either uncertain or representative, but not both, so they cannot provide effective unlabeled samples for semi-supervised learning. (2) They do not make full use of semi-supervised learning to label the unlabeled samples automatically; the unlabeled samples selected by active learning are all labeled manually by experts. (3) In the process of semi-supervised learning, noise may be introduced that reduces classification performance.
To improve the performance of the semi-supervised classifier, a safe semi-supervised classification algorithm based on an active learning sampling strategy (S3CA-AL for short) is proposed in this paper. S3CA-AL is divided into three modules: active learning sampling, label prediction and security verification. (1) In the active learning sampling module, an active learning sampling method based on uncertainty and representativeness (AL-ur for short) is designed. A weighted algorithm is used to select informative and representative unlabeled samples, which are provided to semi-supervised learning. (2) In the label prediction module, a label prediction method based on grouping verification (LP-gv for short) is designed. The pseudo-label that yields the lowest classifier error is selected as the candidate label. (3) In the security verification module, a security verification method (SV for short) is designed. The candidate label is verified, and only a label that makes the classification error rate lower than before is selected as the final label of the unlabeled sample. The three modules are combined, and iterations are repeated until a certain precision is met. Finally, experiments are carried out to verify the effectiveness of the algorithm.
The remainder of the present study is structured as follows: In Section 2, methods of previous studies on active learning and semi-supervised learning are summarized. In Section 3, our proposed algorithm is presented. In Section 4, the experiment and the discussion are presented. In Section 5, the conclusion is given.
Related work
Active learning
Active learning was first proposed by Professor Angluin [12] of Yale University in the United States. The main idea is to use unlabeled samples to help train the classifier. Some unlabeled samples are selected and labeled, and then added to the labeled sample set to train the classifier. The classifier is then used to select unlabeled samples again, and the process is iterated.
The iterative process of the active learning algorithm is as follows [12]: (1) The classifier is trained on the labeled sample set. (2) The classifier is used to classify the unlabeled samples. (3) According to the classification results, the sampling engine selects unlabeled samples, which are given to experts for labeling. (4) The newly labeled samples are added to the labeled sample set for the next round of training. The algorithm terminates when the labeling cost or the generalization accuracy of the classifier reaches a given threshold [13–15]. The algorithm description is shown in Algorithm 1.
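The four steps above can be sketched as a pool-based loop. This is a minimal illustration and not the paper's Algorithm 1: the nearest-centroid classifier, the margin-based uncertainty measure, the `oracle` callable that stands in for the human expert, and the fixed query `budget` are all assumptions made for the example.

```python
import math

def centroid_fit(X, y):
    """Stand-in classifier (assumption): one mean vector per class."""
    groups = {}
    for xi, yi in zip(X, y):
        groups.setdefault(yi, []).append(xi)
    return {c: [sum(col) / len(rows) for col in zip(*rows)]
            for c, rows in groups.items()}

def margin(model, x):
    """Gap between the two smallest centroid distances; a small gap means uncertain."""
    d = sorted(math.dist(x, c) for c in model.values())
    return d[1] - d[0] if len(d) > 1 else float("inf")

def active_learning(L_X, L_y, U_X, oracle, budget):
    """Query the oracle for the most uncertain sample, `budget` times at most."""
    L_X, L_y, U_X = list(L_X), list(L_y), list(U_X)
    for _ in range(budget):
        if not U_X:
            break
        model = centroid_fit(L_X, L_y)                    # step 1: train on L
        # steps 2-3: score the unlabeled pool and pick the least certain sample
        idx = min(range(len(U_X)), key=lambda i: margin(model, U_X[i]))
        x = U_X.pop(idx)
        L_X.append(x)                                     # step 4: add the newly
        L_y.append(oracle(x))                             # labeled sample to L
    return centroid_fit(L_X, L_y), L_X, L_y
```

Here the stopping rule is a fixed labeling cost; a stopping rule based on generalization accuracy would replace the `budget` check.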
Active learning differs from semi-supervised learning in that semi-supervised learning uses only the existing data, without any external information, whereas active learning queries experts. In active learning, the classifier usually picks out some unlabeled samples and obtains their labels from experts, then uses these newly labeled samples to train a classifier again. This process is repeated until a satisfactory classifier is obtained. Research on active learning usually focuses on reducing the number of queries, so as to obtain better classification results with less labeled data [14, 15].
In active learning, there are two strategies for selecting and querying samples. One strategy is based on the information content of samples, that is, on whether a sample can reduce the uncertainty of the classification model in statistical learning. The other strategy is based on the representativeness of samples, that is, on whether a sample can represent the whole data set. Most active learning techniques select samples based on only one of these two strategies. Huang [16] put forward a systematic way to combine the two selection strategies, so as to select unlabeled samples that are both informative and representative for querying.
Semi-supervised learning
The basic idea of semi-supervised learning is to use model assumptions about the data distribution to build a learner that labels unlabeled samples. At present, three model assumptions are commonly used in semi-supervised learning: the Cluster Assumption [17, 18], the Manifold Assumption [19, 20] and the Local & Global Consistency Assumption [21]. Semi-supervised classification uses a large amount of unlabeled data to expand the training set of the classification algorithm and make up for the lack of labeled data. Semi-supervised classification mainly includes disagreement-based methods, generative methods, discriminative methods and graph-based methods.
Disagreement-based semi-supervised classification exploits unlabeled data by using multiple classifiers. During learning, the unlabeled data serves as a platform for exchanging information between the classifiers. The original disagreement-based algorithm, also called the standard collaborative training (co-training) algorithm, was developed by A. Blum and T. Mitchell [22] in 1998. They assumed that the data set had two sufficient and redundant views satisfying the following conditions: first, each attribute set was sufficient to describe the problem; second, each attribute set was conditionally independent of the other given the class label. The co-training process iterated until a stop condition was reached. The description of the algorithm is shown in Algorithm 2.
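The two-view idea can be illustrated with a simplified sketch. It is not the paper's Algorithm 2 nor the exact procedure of [22]: the nearest-centroid classifier, the distance-margin confidence, the choice of one sample per view per round, and the names `view1`, `view2` and `rounds` are assumptions made for the example.

```python
import math

def fit(X, y, view):
    """Nearest-centroid classifier restricted to the feature indices in `view`."""
    groups = {}
    for xi, yi in zip(X, y):
        groups.setdefault(yi, []).append([xi[j] for j in view])
    return {c: [sum(col) / len(rows) for col in zip(*rows)]
            for c, rows in groups.items()}

def predict_with_confidence(model, x, view):
    """Return (label, confidence); confidence is the distance margin between classes."""
    z = [x[j] for j in view]
    ranked = sorted((math.dist(z, c), label) for label, c in model.items())
    conf = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else 0.0
    return ranked[0][1], conf

def co_training(L_X, L_y, U_X, view1, view2, rounds):
    """Each view's classifier labels its most confident sample for the shared set."""
    L_X, L_y, U_X = list(L_X), list(L_y), list(U_X)
    for _ in range(rounds):
        for view in (view1, view2):
            if not U_X:
                break
            model = fit(L_X, L_y, view)
            scored = [predict_with_confidence(model, x, view) for x in U_X]
            best = max(range(len(U_X)), key=lambda i: scored[i][1])
            L_X.append(U_X.pop(best))       # the pseudo-labeled sample becomes
            L_y.append(scored[best][0])     # training data for the other view
    return fit(L_X, L_y, view1), fit(L_X, L_y, view2), L_X, L_y
```

The sample labeled confidently by one view is added to the shared labeled set, so the classifier on the other view is retrained with it in the next step.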
The generative method [23] assumes that samples and class labels are generated by a set of probability distributions with a certain structural relationship. Given the class prior distribution P(y) and the class conditional distribution P(x|y), samples are drawn repeatedly as y ∼ P(y) and x ∼ P(x|y). From these distributions, the labeled sample set L and the unlabeled sample set U are generated. The posterior distribution P(y|x) is then obtained by Bayes' theorem, and the class label that maximizes P(y|x) is assigned to x. Graph-based learning [24] has been a very active direction of semi-supervised learning in recent years. The essence of the graph-based approach is label propagation: the sample space is described by a graph, and label information is spread between neighboring points. The discriminative method [25] uses a maximum margin algorithm to learn the decision boundary from labeled and unlabeled samples. The aim is to make the classification hyperplane pass through low-density regions of the data and to maximize the distance between the hyperplane and the nearest samples. Discriminative methods include LDA, generalized discriminant analysis, semi-supervised support vector machines, entropy regularization and KNN.
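For the generative method described above, the labeling rule can be written out explicitly. The following is the standard form of Bayes' theorem applied to the quantities P(y) and P(x|y) named in the text; the symbol \hat{y} for the assigned label is introduced here for illustration only.

```latex
P(y \mid x) = \frac{P(x \mid y)\,P(y)}{\sum_{y'} P(x \mid y')\,P(y')},
\qquad
\hat{y} = \arg\max_{y} P(y \mid x)
```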
Our method
Here, we propose a safe semi-supervised classification algorithm based on an active learning sampling strategy (S3CA-AL for short). It includes three modules: the active learning sampling module, the label prediction module and the security verification module. In the following, we first introduce the overall framework of S3CA-AL and then describe the function and process of each module.
Algorithm framework
The framework of S3CA-AL is shown in Fig. 1. The first step is active learning sampling. The active learning strategy is used to select informative and representative samples from the unlabeled sample set, and these samples serve as candidates to be labeled by the semi-supervised learning algorithm. The second step is label prediction. A label prediction method based on grouping verification is used to predict a label for each candidate sample. The third step is security verification. The predicted label from the second step is verified a second time, and it is confirmed as the final label of the unlabeled sample only if it makes the classifier error rate lower than before. The fourth step is the addition of new samples. The new label and its corresponding sample are added to the training set L, which is used in the next round of active learning and semi-supervised learning. Iterations are repeated until a certain precision is met. Finally, the final classifier is obtained by training the supervised classification algorithm on the final labeled set L.

Algorithm framework.
Active learning reduces the space of unlabeled samples and provides semi-supervised learning with highly informative unlabeled samples, which shortens the time of semi-supervised learning and improves its performance. In turn, semi-supervised learning extracts information from unlabeled samples and improves the reliability of the labeled sample information available to active learning. The whole process is completed automatically, without manual participation.
In active learning, there are two strategies for selecting and querying samples. One is based on the information content of samples, that is, on whether a sample can reduce the uncertainty of the classification model in statistical learning. The other is based on the representativeness of samples, that is, on whether a sample can represent the whole data set. Most active learning techniques select samples based on only one of these two strategies.
Here the two selection strategies are combined, and an active learning sampling method based on uncertainty and representativeness (AL-ur for short) is proposed. A weighted algorithm combines uncertainty and representativeness, so that unlabeled samples that are both informative and representative are selected for labeling. The weighted algorithm is defined in formula (1):
where V(x) is the voting entropy of sample x, that is, its uncertainty; R(x) is the representativeness of sample x; W(x) is the weighted value of uncertainty and representativeness; and α is the weighting parameter, with α ∈ [0, 1].
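With these definitions, one form consistent with the description is a convex combination of the two terms. This is an assumption: the text fixes only that α ∈ [0, 1] weighs uncertainty against representativeness, not which term α multiplies.

```latex
W(x) = \alpha\, V(x) + (1 - \alpha)\, R(x), \qquad \alpha \in [0, 1]
```

Under this reading, α = 1 selects purely by uncertainty and α = 0 purely by representativeness.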
The sample density [24] proposed by Zhu is used to measure the representativeness of the samples. It is formally defined in formula (2).
In the process of active learning, the representativeness of samples is taken into account. If a sample is an outlier and its representativeness is too small, it will not be selected as a candidate sample even if its uncertainty is large. Through active learning sampling, highly informative and representative unlabeled samples are selected and labeled by the label prediction module, and the labeled sample set is expanded.
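The selection rule described in this module can be sketched as follows. This is a hedged illustration only: vote entropy is computed in its usual committee form (the entropy of the class votes cast by several classifiers), the weighted score is assumed to be a convex combination with α on the uncertainty term, and the representativeness values are taken as precomputed inputs because the density formula is not reproduced here. The function names are hypothetical.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy of committee votes for one sample; zero when all members agree."""
    total = len(votes)
    return -sum((n / total) * math.log(n / total)
                for n in Counter(votes).values())

def weighted_score(v, r, alpha):
    """Assumed weighting: alpha on uncertainty, (1 - alpha) on representativeness."""
    return alpha * v + (1 - alpha) * r

def select_samples(votes_per_sample, representativeness, alpha, k):
    """Return the indices of the k unlabeled samples with the largest score."""
    scores = [weighted_score(vote_entropy(v), r, alpha)
              for v, r in zip(votes_per_sample, representativeness)]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

For instance, with votes `[[0, 1], [1, 1], [0, 1]]`, representativeness `[0.9, 0.9, 0.1]` and α = 0.5, the third sample is as uncertain as the first but behaves like an outlier, so the first sample is ranked highest. In practice the two terms may need to be brought to a comparable scale before weighting.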
In the label prediction module, a method of label prediction based on grouping verification (LP-gv for short) is proposed. The main idea of the algorithm is as follows. First, the unlabeled sample selected by active learning is prelabeled. The sample with its pseudo-label is added to the labeled sample set. Then, grouping, training and testing are carried out on the labeled set, and the error corresponding to each pseudo-label is calculated. Finally, the pseudo-label with the lowest classification error is selected as the candidate label of the unlabeled sample and passed to the next step for verification.
The algorithm description is shown in Algorithm 3. First, a sample is selected from the unlabeled sample set U. One pseudo-label is generated for each classification category, and each is used in turn to label the selected sample. Second, the selected sample and its pseudo-label are added to the labeled set to form a new training set. This set is divided into multiple groups. One group acts as the validation set and the other groups are used as the training set. The classifier is trained on the training set and tested on the validation set. Cross validation is repeated so that each group acts as the validation set once, and the average of the resulting error rates is calculated. Third, the classifier error is calculated for each pseudo-label, and the pseudo-label with the lowest error is selected as the final predicted label of the sample.
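The grouping-verification procedure can be sketched as below. It is a minimal illustration under stated assumptions, not the paper's Algorithm 3: a nearest-centroid classifier stands in for the base learner, groups are formed by taking every `n_groups`-th sample, and the function names are hypothetical.

```python
import math

def fit(X, y):
    """Stand-in classifier (assumption): one mean vector per class."""
    groups = {}
    for xi, yi in zip(X, y):
        groups.setdefault(yi, []).append(xi)
    return {c: [sum(col) / len(rows) for col in zip(*rows)]
            for c, rows in groups.items()}

def predict(model, x):
    return min(model, key=lambda c: math.dist(x, model[c]))

def cv_error(X, y, n_groups):
    """Mean validation error when each group serves as the validation set once."""
    errors = []
    for g in range(n_groups):
        val = [i for i in range(len(X)) if i % n_groups == g]
        trn = [i for i in range(len(X)) if i % n_groups != g]
        if not val or not trn:
            continue
        model = fit([X[i] for i in trn], [y[i] for i in trn])
        wrong = sum(predict(model, X[i]) != y[i] for i in val)
        errors.append(wrong / len(val))
    return sum(errors) / len(errors)

def predict_label_by_grouping(L_X, L_y, x, classes, n_groups=3):
    """Try every pseudo-label for x and keep the one with the lowest mean error."""
    errs = {c: cv_error(list(L_X) + [x], list(L_y) + [c], n_groups)
            for c in classes}
    best = min(classes, key=lambda c: errs[c])
    return best, errs[best]
```

A wrong pseudo-label tends to raise the cross-validation error, either because the mislabeled sample is itself misclassified when held out or because it distorts the model trained on the remaining groups.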
Security verification module
To improve the accuracy and safety of newly labeled samples for the classifier, a security verification method (SV for short) is proposed to judge the candidate label obtained above. Only a candidate label that meets the judging criterion can be used as the final label of the unlabeled sample and added to the labeled set L.
where $e_t$ is the classification error of the classifier in iteration t, and $e_{t-1}$ is the classification error of the classifier in iteration t-1.
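With these definitions, the acceptance condition described in this section amounts to requiring that the error after adding the candidate be smaller than the error before. A minimal statement, assuming a strict inequality, is:

```latex
e_t < e_{t-1}
```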
The judging condition is given in formula (3). Suppose that the candidate label and its corresponding sample are added to the labeled set L to form L'. The training set L is used to train the classifier, which is evaluated on the test set to obtain the classification error e. Likewise, the training set L' is used to train the classifier, which is evaluated on the test set to obtain the classification error e'. Whether formula (3) is satisfied is then checked.
If formula (3) is satisfied, the candidate sample is added to L as a new sample; otherwise the candidate sample is discarded. This guarantees that the error rate of the current round is lower than that of the previous round whenever a new sample is added to L, so the classifier evolves in the direction of higher classification accuracy. It also prevents newly added samples from degrading classifier performance and ensures that the classification algorithm is safe and reliable.
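The accept-or-discard rule can be sketched as follows. This is an illustration under assumptions: a nearest-centroid classifier stands in for the base learner, samples are stored as `(x, y)` pairs, the comparison is strict, and the function names are hypothetical.

```python
import math

def error_rate(train, test):
    """Train a nearest-centroid classifier on (x, y) pairs; return its test error."""
    groups = {}
    for x, y in train:
        groups.setdefault(y, []).append(x)
    centroids = {c: [sum(col) / len(rows) for col in zip(*rows)]
                 for c, rows in groups.items()}
    wrong = sum(min(centroids, key=lambda c: math.dist(x, centroids[c])) != y
                for x, y in test)
    return wrong / len(test)

def security_verification(L, candidate, test):
    """Keep the candidate only if the error with it is strictly lower than without."""
    e_prev = error_rate(L, test)                 # error when training on L
    e_new = error_rate(L + [candidate], test)    # error when training on L'
    if e_new < e_prev:
        return L + [candidate], e_new            # accept: L is expanded
    return L, e_prev                             # reject: candidate is discarded
```

Because a candidate is kept only when the error decreases, the error measured on the evaluation set never increases from one accepted sample to the next.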
Experimental platform and data set
The proposed approaches are evaluated on three semi-supervised benchmark data sets (digit1, g241c and coil) and six UCI data sets. The number of samples ranges from 270 to 1500, and the dimensionality ranges from 8 to 241. The data sets are listed in Table 1.
The data set
Here, the task is a binary classification problem, in which samples are divided into normal and abnormal. Four quantities are typically used to evaluate the results.
True Positive (TP): It is judged to be a positive sample, and in fact it is a positive sample. True Negative (TN): It is judged to be a negative sample and in fact it is a negative sample. False Negative (FN): It is judged to be a negative sample, but in fact it is a positive sample. False Positive (FP): It is judged to be a positive sample, but in fact it is a negative sample.
Because the data set is balanced, the classification accuracy is used as an important evaluation index to directly evaluate the entire prediction performance of the proposed method. Based on these four metrics above, the classification accuracy is defined using the following formula (4):
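In terms of the four quantities defined above, classification accuracy takes its standard form, the fraction of all samples that are classified correctly:

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
```

As an illustrative calculation with made-up counts, TP = 40, TN = 45, FP = 5 and FN = 10 give an accuracy of 85/100 = 0.85.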
In order to verify the effectiveness of the proposed algorithm, the experiment is divided into the following four categories.
Results and analysis of experiment 1
In the experiment,
The experimental results are shown in Table 2. The accuracy of LP-gv is higher than that of SVM, decision tree and logistic regression, indicating that the proposed LP-gv algorithm has better classification performance.
Comparison of accuracy (mean-std.)
In this experiment, a semi-supervised experiment is simulated.
The selected samples are divided into two parts: the training set and the test set. The ratio of training samples to test samples is set to 2 : 1. The training set is further divided into labeled samples and unlabeled samples. According to the proportion λ of labeled samples in the training set, three types of semi-supervised classification data sets are constructed. In the first type of experiment, labeled samples account for 5% of the training set (λ = 5%). In the second, labeled samples account for 10% of the training set (λ = 10%). In the third, labeled samples account for 15% of the training set (λ = 15%). The three kinds of experiments are carried out to verify the performance of the proposed algorithm. The algorithm terminates when the unlabeled sample set is empty. The experiments are repeated 30 times, and the classification algorithms are compared by their mean performance.
The results of the experiment are shown in Tables 3, 4 and 5, which give the semi-supervised classification results when the proportion of labeled samples is 5%, 10% and 15%, respectively.
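The data preparation described above can be sketched as an index split. This is a minimal illustration: the random shuffling, the seed, the rounding of the labeled count and the function name are assumptions, and no class stratification is applied because none is stated.

```python
import random

def make_semi_supervised_split(n_samples, labeled_ratio, seed=0):
    """Split indices 2:1 into train/test, then mark a fraction of train as labeled."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    n_train = (2 * n_samples) // 3                      # training : test = 2 : 1
    train, test = idx[:n_train], idx[n_train:]
    n_labeled = max(1, round(labeled_ratio * n_train))  # lambda share of the train set
    return train[:n_labeled], train[n_labeled:], test
```

For example, with 300 samples and λ = 5% this yields 10 labeled, 190 unlabeled and 100 test indices; repeating the call with 30 different seeds would mirror the 30 repetitions of the experiment.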
Comparison of accuracy (mean-std.) λ= 5%
Comparison of accuracy (mean-std.) λ= 10%
Comparison of accuracy (mean-std.) λ= 15%
It can be seen from Tables 3–5 that: (1) As the number of labeled samples increases, the accuracy of semi-supervised learning improves. This holds for SVM, S3VM, S4VM and our algorithm SSC-LP-SV, and indicates that the more labeled samples there are, the stronger the classification ability of the resulting classifier. (2) SSC-LP-SV uses a cross grouping method to label unlabeled samples and extend the labeled set, and verifies each label twice. Compared with SVM, S3VM and S4VM, SSC-LP-SV has higher accuracy. This indicates that SSC-LP-SV is an effective semi-supervised classification algorithm, which extends the number of labeled samples well and improves accuracy.
It can also be seen in Table 3 that the accuracy of the S3VM classification algorithm declines because of the noise it introduces. No such reduction in accuracy appears for SSC-LP-SV in the experiment. This is because SSC-LP-SV labels each sample with the candidate label of minimum error, so that the algorithm evolves in the direction of improving classifier performance. SSC-LP-SV is therefore a safe semi-supervised classification algorithm.
In this experiment, the data sets are selected in the same way as in experiment 2, and the semi-supervised classification process is simulated. The algorithm terminates when the unlabeled sample set is empty. The experiment is repeated 30 times, and the average value is taken as the result.
In the experiment, algorithm S3CA-AL, which uses active learning sampling, is compared with algorithm SSC-LP-SV, which does not, so as to judge whether combining active learning with semi-supervised learning improves classification performance rather than degrading it.
The results of the experiment are shown in Tables 6, 7 and 8. Among them, Table 6 shows the semi-supervised classification result when the percent of labeled sample is 5%. Table 7 shows the semi-supervised classification result when the percent of labeled sample is 10%. Table 8 shows the semi-supervised classification result when the percent of labeled sample is 15%.
Comparison of accuracy (mean-std.) 5%
Comparison of accuracy (mean-std.) 10%
Comparison of accuracy (mean-std.) 15%
The experimental results show that, although only a small part of the samples is used to train the classifier, the accuracy of S3CA-AL is very close to that of SSC-LP-SV, and in some cases higher. For example, the accuracy on the data sets isolet and optdigits in Table 7 is improved, and the accuracy on the data sets digit1, g241c and optdigits in Table 8 is also improved. This is because S3CA-AL uses active learning to select highly informative and representative samples to label, which can improve classification performance to a certain extent. While reducing the scale of semi-supervised learning, it maintains the accuracy of the algorithm and even improves it to some extent.
In the experiment, the proposed active learning algorithm
In the experiment, labeled samples are continually added to the training set. As the number of labeled samples increases, the accuracy of each of the three semi-supervised classification algorithms is calculated and the trend of accuracy is examined.
Compared with the increase rate and stability value of the accuracy among three algorithms, we can find that
Conclusion
In this paper, active learning and semi-supervised learning are combined to perform data classification. Active learning is used to select informative and representative unlabeled samples. Semi-supervised learning is used to expand the number of labeled samples. The whole process is completed automatically by the algorithm, and no manual intervention is needed. In the semi-supervised learning process, the safety of the classifier is emphasized.
Several simulation experiments are carried out on the proposed algorithm and its modules. The experiments show that the proposed algorithm can effectively reduce the number of samples to be processed and makes it easier to label the unlabeled samples. The results also show that the proposed algorithm has good classification performance.
Acknowledgments
This work was supported by the Natural Science Basic Research Plan in Shaanxi Province of China (No. 2015JM6347), the Science and Technology Research Project of Shangluo University (No. 14SKY026) and the Scientific Research Project of the Education Department in Shaanxi Province of China (No. 09JK424).
