Abstract
The problem of initialization of active learning is considered in this paper. Especially, this paper studies the problem in an imbalanced data scenario, which is called as class-imbalance active learning cold-start. The novel method is two-stage clustering-based active learning cold-start (ALCS). In the first stage, to separate the instances of minority class from that of majority class, a multi-center clustering is constructed based on a new inter-cluster tightness measure, thus the data is grouped into multiple clusters. Then, in the second stage, the initial training instances are selected from each cluster based on an adaptive candidate representative instances determination mechanism and a clusters-cyclic instance query mechanism. The comprehensive experiments demonstrate the effectiveness of the proposed method from the aspects of class coverage, classification performance, and impact on active learning.
Introduction
Active learning (AL) is an effective machine learning technique to achieve an accurate predictive model with fewer labeling costs. In the past few decades, many researchers focused on designing new AL algorithms. However, little attention has been paid to the initialization problem of AL. The initialization of AL will encounter an embarrassing dilemma that all the data is unlabeled, and particularly the unlabeled data is potentially imbalanced. In this paper, the above situation is termed as class-imbalance AL cold-start. In this circumstance, most of the existing AL algorithms are far from being satisfactory without a feasible initialization approach. On the one hand, the supervised AL approaches cannot start properly without an initial training set; on the other hand, the unsupervised AL approaches mostly seldom consider the class skewed distribution scenario, thus the instances in the minority classes tend to be left out in the AL process. Therefore, it is a crucial job to construct an effective initialization method to alleviate the dilemma of class-imbalance AL cold-start.
Supervised active learning is the most popular branch in the AL community, which includes uncertainty sampling [5], query-by-committee [18], expected error reduction [13], etc. A common feature of this kind of active learning is classifier dependence, i.e., they need some initial labeled instances to activate the basic classifiers. However, in practice, the few initially labeled instances are not always guaranteed. Random sampling is a common approach to start supervised active learning. But, it is not suitable for potentially imbalanced data. According to Lewis and Gale [5], it may be a long period of random selection before instances of the minority class are stumbled upon. Moreover, although a few clustering-based AL initialization methods have been designed [11, 19, 23, 25], they are not flexible and not applicable to potentially imbalanced data. The unsupervised active learning techniques, such as clustering-based active learning [21], experimental design methods [24, 14], etc., they typically query the representative instances by exploiting the intrinsic distribution of the data, thus can be used to populate the AL initial training set. However, those methods usually perform well on balanced data set and seldom consider the potentially class-imbalance scenario.
However, imbalanced data is common in many practical applications, such as anomaly detection, credit fraud recognition, and so on [9]. AL users may frequently suffer from the class-imbalance AL cold-start problem. Therefore, it is worthwhile to research how to populate a high-quality and representative initial training set with fewer queries in the class-imbalance AL cold-start scenario. This paper concentrates on this problem in pool-based AL setting and proposes a two-stage clustering-based AL cold-start method (ALCS). The so-called two stages are the pre-clustering stage and instances selection stage.
In our method, the first task is to separate the instances of minority classes from that of majority classes as much as possible and group them into different clusters. To accomplish this, in the pre-clustering stage, a multi-center clustering is built to group the data, which is based on a newly designed inter-cluster tightness measure and the density peak clustering algorithm (DPC, refers to [2]). The second task in our method is to populate the initial training set by selecting and labeling the critical representative instances in each cluster. Therefore, in the second stage, a cluster-adaptive candidate representative instance determination method is designed to find the most representative instances in each cluster. Besides, a clusters-cyclic instance query mechanism that combines an impure cluster splitting strategy is introduced to select the initial training instances sequentially from the candidate representative instance set. The main contributions of this paper are as follows:
A multi-center clustering is constructed to group the potentially imbalanced data. This promotes the instances selection procedure to achieve a higher class coverage with fewer queries. A cluster-adaptive candidate representative instance determination method is designed, which enables our model to discover the most representative instances in an arbitrary cluster. A clusters-cyclic instance query mechanism is built to label the critical representative instances sequentially, which avoids the sampling bias between the clusters. Besides, a strategy of impure cluster splitting is incorporated into this procedure to promote the model to label more instances close to the classification boundary. The quality of the selected initial training set is evaluated comprehensively from the aspects of class coverage, classification performance, and impact on active learning.
The rest of this paper is organized as follows. Section 2 reviews the related work about active learning initialization. The proposed method is presented in Section 3. Section 4 presents the experimental results. Finally, conclusions and future work are discussed in Section 5.
In the community of AL, random sampling is the common initialization method to start the AL process [23, 25], owing to the assumption that random sampling is likely to build the initial training set with the same prior distribution as that of the whole original data. However, if the data is potentially skewed, it will be a long period of random selection to obtain a preferable AL initial training set. Besides, random sampling does not guarantee to select a most representative subset [19].
Clustering techniques have been utilized to populate the AL initial training set in a more target way [16, 11, 19, 23, 25]. In the related work, K-means type algorithms are commonly used and the instances located in the center of the clusters are selected as the initial training instances. Although the clustering-based initialization methods are more effective than random sampling, there are two inherent problems that prevent those methods from being widely used. On the one hand, the clustering-based methods do not support incremental initial training instances selection, whereas the selected single batch of instances does not necessarily guarantee to start the active learning process; on the other hand, the number of clusters is generally predefined subjectively and only the center instances of the clusters are selected, however, one single instance is generally not enough to represent a cluster well.
The unsupervised active learning techniques typically do not depend on classifiers, thus can be used to obtain the AL initial training set. Clustering-based AL is a branch of unsupervised AL techniques. However, the most existing clustering-based AL methods do not consider the class-imbalance scenario and lack a cluster-adaptive mechanism to determine how many representative instances need to query in an arbitrary cluster. Dasgupta and Hsu [22] built an active learning scheme based on the hierarchical clustering, in which the training instances are randomly selected from each cluster, but failed to consider the representativeness of the selected instances. The density peaks clustering based active learning (ALEC) by Wang et al. [21] is the recently proposed clustering-based AL method, in which the cluster centers in the DPC algorithm are selected as the training instances. However, ALEC assigns an identical fixed query budget to each cluster, which is inflexible and has no cluster-adaptive ability.
The experimental design method is another branch of unsupervised AL techniques and usually refers to as early active learning [7]. Yu et al. [14, 15] proposed a transductive experimental design (TED) from the perspective of data reconstruction. In the TED, the instance subset which enables to minimize the average expected square predictive error of the learned function is selected as the training set. Cai et al. [4] proposed the MAED method by introducing a manifold adaptive mechanism into the TED scheme and has yielded impressive results. Nie et al. [7] proposed a robust active learning approach (RRSS) by using the structured sparsity inducing norms to relax the TED’s NP-hard objective to a convex formulation. Much previous work has been done by researchers based on experimental design, however, most of the works do not solve the challenge of early active learning, i.e., the class-imbalance AL cold-start problem. Recently, Wang et al. [24] proposed a balanced active learning model (FISTA-LOCP) with an exclusive sparsity norm. To the best of our knowledge, FISTA-LOCP is the only experimental design method that takes into account the potentially imbalanced data.
The proposed two-stage clustering-based AL cold-start method
As mentioned in Section 1, the proposed method ALCS consists of two stages, i.e., the pre-clustering stage and the instances selection stage, which are shorted as stage 1 and stage 2, respectively. In the pre-clustering stage, a multi-center clustering is built based on the DPC algorithm and a newly designed inter-clusters tightness measure to obtain a partition of the potentially imbalanced data. In the instances selection stage, a cluster-adaptive representative instance determination method is designed to obtain the candidate representative instances from each cluster. Then, the critical representative instances are sequentially selected based on a clusters-cyclic query mechanism.
Framework of the proposed ALCS.
The framework of ALCS is depicted in Fig. 1. Suppose
In this stage, the input unlabeled data is clustered by applying a multi-center clustering strategy. Therefore, the instances of the minority classes can be separated from that of the majority classes as much as possible [17, 27], which enables the subsequent instances selection procedure to achieve a higher class coverage as soon as possible. In the multi-center clustering, the data is first grouped into
In the construction of multi-center clustering, the DPC algorithm is employed to obtain the multiple clusters, owing to the DPC algorithm has the advantages of finding clusters of any shape, grouping instances without iteration, and having few input parameters. The key idea of the DPC is that each instance
In Eq. (1),
Based on the “larger
Since the DPC algorithm is sensitive to the input parameter
where
It should be noted that calculate the optimal number of initial multiple cluster centers
After obtaining
.
The tightness measure between two clusters
In Eq. (5),
[htp] Pre-clustering algorithm[1] Dataset
Candidate representative instances determination for each cluster
In this subsection, a cluster-adaptive representative instance determination mechanism is built. Through this mechanism, the most representative instances in each cluster can be found. Generally, the cluster centers (the instances with larger
An example of candidate representative instances determination. (a) is a original 2-dimensional dataset from the reference [2]; (b) presents the 
Suppose there are
For example,
The
Based on Eq. (8), the
[htbp] Candidate representative instances determination algorithm[1] Current cluster
After the candidate representative instances in each cluster are determined, the procedure of initial training instances selection is conducted through a clusters-cyclic instances query mechanism. In this mechanism, at each round, in each cluster, only one unlabeled candidate representative instance with the largest
In Eq. (9),
Furthermore, to improve the classification performance of the learning model based on the selected initial training set, an impure cluster splitting strategy is incorporated into the clusters-cyclic instance query mechanism. Such that, once a cluster is found to be impure, i.e., the cluster in which the labeled instances belong to different classes, it will be split into two smaller clusters. Then, the candidate representative instances corresponding to the split clusters will be recalculated via Algorithm 2. This promotes the model to select more training instances close to the classification boundary. The impure cluster is split by applying the DPC algorithm with the labeled instances as the cluster centers, and the sub-clusters with the same cluster index are merged. The procedure of instances selection is summarized in Algorithm 3.2.2.
[htp] Instances selection algorithm[1] Cluster set
Suppose there are
The time complexity of ALCS depends on the following aspects: First, in the pre-clustering stage: (a) obtain the initial clusters by DPC takes
Experiments
In this section, the experiments are conducted on a bunch of public data sets. The information of the used data sets is summarized in Table 1. The imbalance ratio “IR” represents the ratio of the number of instances in the most majority class to the number of instances in the most minority class. The IR of the nine data sets ranges from 1.00 to 439.60. Obviously, there are 7 imbalanced data sets and 2 balanced data sets. R15 and Aggregation are from the reference [2], Jain is from the reference [10], and the other six data sets are from UCI repository [12]. There are both binary classification and multi-classification in the data sets, and the information of class distribution is also shown in Table 1. For simplicity, the names of Wine Quality, Robot Navigation, and Electrical Grid are abbreviated as Wine, Robot, and Electrical in the following. It should be noted that the experiments do not involve discrete features encoding and dimension reduction, and all the data sets were not normalized or standardized.
Information of the nine data sets used in the experiments
Information of the nine data sets used in the experiments
In the experiments, five methods are compared with the proposed method ALCS:
Random (random sampling) is the baseline method, which arbitrarily selects a subset of instances from the unlabeled instances pool to inquire labels. HSAL [22] is the hierarchical clustering-based active learning method, which selects and labels the instances based on the hierarchical cluster structure and the error estimation of the pruning. ALEC [21] is the density peak clustering-based active learning method, in which the critical instances are selected and labeled based on the master tree and the clustering structure. MAED [4] is an experimental design method, which incorporates a manifold adaptive mechanism into the TED scheme, and tends to select the most representative and discriminative data points for labeling. FISTA-LOCP [24] is an experimental design method based on structure sparsity, which leverages an exclusive sparsity norm to cope with the problem of instances selection in imbalanced data.
In order to verify the effectiveness of the proposed method, the quality of the selected initial training set is evaluated comprehensively from the aspects of class coverage, classification performance, and impact on active learning.
The class coverage of the selected initial training set is critical to active learning initialization. Generally, a higher class coverage of the initial training set leads to a better active learning effect. In this paper, the class coverage is measured by the following two metrics DCC and MNIC.
.
The degree of class coverage (DCC) for initial instances selection is defined as:
In Eq. (10),
.
The minimal number of label inquiries for all classes covered (MNIC) in the set of initial selected instances is defined as:
The metric MNIC represents the minimal number of label inquiries when all of the classes are covered in the selected initial training set. The smaller this metric value, the better the method on initial training set selection.
Furthermore, the metrics classification accuracy (Acc) and F1-macro (F1) score are involved in the experiments, since it makes sense to evaluate the quality of the selected initial training set from the perspective of classification performance. To measure the impact of the selected initial training set on the active learning process, the area under the learning curve (ALC) is a feasible metric [6]. The metric ALC is usually used to measure the performance of an active learning algorithm, which is also applicable to measure the initialization impact on standard active learning. Thus the metrics ALC on Acc (ALC-Acc) and ALC on F1-macro-score (ALC-F1-macro) are involved in the experiments.
Experimental results on class coverage
Curves of the metric DCC in initial training instances selection based on the proposed ALCS and five competing methods with the number of queries ranges from 
To compare the performance of the different methods on metric DCC, for each data set, all the data is used as the input to execute each method 10 times. The initial instances are selected with the number of inquires
Minimum number of label inquiries under all classes covered (MNIC) on the nine data sets based on the six different methods
In the comparison on the metric MNIC, each method is conducted 10 times on the whole data of each data set. The results are shown in Table 2 (with the best results in boldface). Needs to note that, Ecoli and Wine Quality are high imbalanced data sets, where Ecoli includes two small classes with each contains only 2 instances (proportion is 0.5%). Wine Quality also includes two small classes with 5 and 20 instances (proportions are 0.1% and 0.4%) insides, respectively. Since none of the six methods can achieve class cover on Ecoli and Wine with a reasonable number of queries, thus the MNIC results on these two data sets are obtained at
The quality of the initial training sets selected by different methods is further compared in the task of classification. The selected initial training sets are utilized to train classifiers and compared based on the results of Acc and F1-macro-score. In the experiment, the basic learner logistics regression model from the Python toolbox “Scikit-Learn” with
The experimental results are summarized in Table 3 (with the best results in boldface). The comprehensive results, i.e., the average results (Mean) and the average rank (Avg Rank) are also shown in the table. From the results in Table 3, it can be observed that ALCS outperforms the competing methods on most data sets except Ecoli and Wine quality. But, ALCS still performs better on the comprehensive result. The reason for ALCS not outstanding on Ecoli and Wine quality is that ALCS tends to assign almost the same sampling opportunities to majority and minority classes, while the classification performance in imbalanced data is likely to be dominated by the number of labeled majority class instances. In summary, ALCS can not only perform well in class coverage but also classification on both imbalanced and balanced data for initial training set selection.
Experimental results on initialization impact on active learning
In this subsection, the impact of the selected initial training set on standard active learning is compared based on the metric area under the learning curve (ALC). In the experiment, each compared method offers
The results show that different initialization methods bring different impacts on standard active learning, and the proposed ALCS does provide a better initialization effect for active learning on most data sets. For data set Wine, since ALCS tends to label more instances of minority classes, which leads the uncertainty sampling approach labels more instances of minority classes, thus ALCS does not perform outstandingly on this data set. The comprehensive results, i.e., Mean and Avg Rank, further demonstrate that ALCS is a more effective solution to cope with the class-imbalance AL cold-start problems.
Statistical analysis of the experimental results
In order to detect whether there are significant differences in performance among the results of the different methods, the Friedman test [20] is conducted on each performance measure, i.e., MNIC, Acc, F1-macro, ALC-Acc, and ALC-F1-macro, at a confidence level of
Furthermore, to detect whether a competing method performs equivalently or significantly different from the control method (i.e., the proposed method ALCS), the Wilcoxon signed-rank test [8] is conducted on each performance measure at a confidence level of
Classification performance (Acc and F1-macro) of the logistic regression classifiers which were trained on the selected
initial training set by the six different methods (%)
Classification performance (Acc and F1-macro) of the logistic regression classifiers which were trained on the selected
Results of ALC on classification accuracy and F1-macro-score based on the standard active learning models (Uncertainty Sampling) initialized by the six different methods
Initial training instances selection results based on ALCS on three 2-dimensional datasets.
initial training instances selection. For metric Acc, only the p-values with respect to Random Sampling and FISTA-LOCP are less than
In this subsection, in order to visually display the ability of initial training set selection of ALCS on imbalanced data, the initial instances selection results on three 2-dimensional imbalanced data sets (Jain [10], Aggregation [2] and Three Blobs) based on ALCS are exhibited. For the data set Three Blobs, the class distribution is [800, 50, 50]. The results are shown in Fig. 4, with each data set displays the queried instances on the number of queries
Evidently, when
Conclusion and future work
Aiming at the class-imbalance AL cold-start problem, this paper proposes a two-stage clustering-based AL cold-start method (ALCS). In the pre-clustering stage, a multi-center clustering is constructed based on a new inter-clusters tightness measure and the DPC algorithm to separate the instances of minority classes from that of majority classes in potentially imbalanced data. In the instances selection stage, an adaptive mechanism is designed to find the candidate representative instances in each cluster. Then, a clusters-cyclic instance query mechanism and an impure clusters splitting strategy are introduced to select the initial training instances from the candidate representative instance set. Experiments verify the effectiveness of our method on both imbalanced data sets and balanced data sets from the perspectives of class coverage, classification performance, and impact on active learning.
There are three works merit further investigation: (1) Incorporate a distance metric learning into ALCS is expected to achieve a more effective initial training set selection method. (2) The ALCS method depends on the number of nearest neighbor
Footnotes
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 61876027, 61533020 and 61751312.
