Abstract
Many real-world applications, such as those related to sensors, allow collecting large amounts of inexpensive unlabeled sequential data. However, the use of supervised machine learning methods is frequently hindered by the high costs involved in gathering labels for such data. These methods assume the availability of a considerable amount of labeled data to build an accurate classification model. To overcome this bottleneck, active learning methods are designed to selectively label the most informative examples instead of requesting all true labels. Although active learning has been widely used in many problems, most of the methods consider the presence of labeled data or some prior knowledge about the problem, as the number of classes. Differently, in this paper, we are interested in the realistic scenario where the active learning is performed from scratch on a fully unlabeled dataset and with the absence of any classifier or prior knowledge about the data. In general, the methods that consider fully unlabeled data use random sampling to select examples to label. The goal of this work is to show a broad experimental evaluation with different unsupervised active learning methods to select examples from fully unlabeled sequential data. We evaluated methods based on clustering algorithms and centrality measures from graphs for instance selection and the performance of supervised and semi-supervised learning algorithms in the classification task. Given our evaluation on a benchmark of sequential data and in a case study of insect species classification, we indicated the sampling based on hierarchical clustering or k-Means. These methods present a statistically significantly better performance to the popular random sampling. In addition, they are simple algorithms and readily available in many software packages.
Keywords
Introduction
In many real-world applications, such as sensors and measurement devices, massive amounts of unlabeled sequential data are available. However, learning a classification procedure in these applications require the labeling of a considerable portion of data, since supervised machine learning algorithms assume the availability of plenty labeled data. Unfortunately, the labeling procedure can be expensive, time consuming or highly dependent of a domain expert. In cases where abundant unlabeled data are cheap but acquiring their labels is costly, active learning methods can drastically reduce the effort of the domain expert by performing an instance selection to find a minor portion of data to be labeled in a huge amount of unlabeled data.
Active learning studies how to selectively label the most informative examples instead of requiring the true labels for all instances. The active learning methods attempt to overcome the labeling bottleneck by making queries in the form of unlabeled instances to be labeled by an oracle (e.g., a human annotator). In this way, the active learner aims to achieve high accuracy using as few labeled instances as possible, minimizing the cost of obtaining labeled data [51].
Generally, an active learner begins with a small set of labeled instances, selects a few informative instances from a pool of unlabeled data based on past knowledge, and query labels from an oracle [53]. Some examples of query strategies to select the most informative or representative data to be labeled are the sampling based on uncertainty, query-by-committee, expected model change, expected error reduction, variance reduction, and density-weighted methods (a comprehensible review about these methods is presented by Settles [51]). We noticed here, that all these strategies depend on the existence of an initial classification model. Thus, these approaches require a set of labeled examples in order to obtain an initial model that will provide the information to start the unlabeled instance selection task.
Differently from the broadly used setting, we are interested in a more realistic scenario in which we start the active learning process on a fully unlabeled dataset and without any prior knowledge about the data. Therefore, we cannot assume the existence of an initial classifier that provides useful information about unlabeled examples or the knowledge of the number of classes of a problem for careful tuning the parameters of an algorithm. Although active learning has been widely used in many application domains, their use from a completely unlabeled set has not received much attention. According to a review on 206 papers from top-tier conferences such as NIPS, ICCV, CVPR, ICML, UAI, ECML and journals such as Machine Learning, Pattern Recognition, Data Mining and Knowledge Discovery performed by Hu et al. [28] in 2010, over 94% of researchers, used a randomly selected initial training set or failed to specify their criteria. Fewer than 6% used a targeted approach to populate their initial training set. In a less systematic review in the last years on conferences such as SDM, ICDM, and KDD, and top-tier journals, we observed that this scenario did not change significantly. We believe that such practice is due to the lack of a more in depth study which shows that simple methods can be a better choice rather than the random sampling. Thus, this study wants to contribute in this direction with a wide review and experimental evaluation. Given the increasing popularity of real-world applications that generate or process sequential data, our evaluation will focus on time series data.
For a concrete example of a real application that needs active learning methods on fully unlabeled data from scratch, consider the problem presented by Souza et al. [61] in which the use of a laser sensor is proposed to perform an online classification of flying insects. Insect classification is an important problem since the automatic identification of the species can be used to build devices to control harmful species such as disease vectors and agricultural pests. The sensor generates a huge amount of sequential data in an inexpensive way; however, the process of labeling an initial training set is expensive and dependent of a well-trained expert. For convenience, the authors assume an initially labeled dataset previously collected in a laboratory to start a classifier that incrementally updates over time. However, this assumption has some drawbacks, including the lack of knowledge of possible insect species that will cross the laser in the field and the fact that the environmental conditions of the laboratory may differ from the field. Frequently, these conditions affect the behavior of the insects and consequently the measured data from the sensor. Thus, the current model can be inefficient in real conditions.
To mitigate this problem, a simple practical solution is to place the sensor in the field conditions to collect data for a period of time. We can use the collected data to build an initial classifier that will be used to process new incoming data in an online fashion. As all the collected data are unlabeled, an expert can analyze a subset of the signals and provide the respective labels. We can sample the most representative examples using the techniques discussed in this paper to reduce the domain expert effort to label the initial training set. We return to this application in Section 4.3, where we perform a case study.
There are many other examples in which unlabeled data are inexpensive to collect, but costly to label. For instance, in medicine, the non-invasive electrocardiogram (ECG) and electroencephalogram (EEG) procedures can record massive amounts of low-cost data; however, data labeling is a more tedious process that requires a trained clinician. In the industry, for instance in electric power plants or automotive manufacturing, huge amounts of sensor data are collected to monitor the fabrication process. However, a supervised classification task such as fault detection requires a domain expert to revise the data in order to identify the moment that precedes a fault.
In this paper, we provide a broad experimental evaluation of active learning methods to build an initial training set from fully unlabeled data. The evaluation considers a combination of active learning strategies to select instances, a different number of instances to be selected, and machine learning algorithms. Given the growth of popularity of sequential data, we carried out our experimental evaluation on 24 time series datasets from the UCR benchmark repository from different application domains such as biology, economy, entomology, and medicine, and in a promising case study of insect classification by laser sensor. The contributions of this paper come from the analysis of the results provided by this extensive empirical evaluation. The main contributions of this paper are summarized follow:
Evaluation and analysis considering the most representative and relevant active learning strategies for instance selection. Most of past works of literature considered a reduced number of methods. In this paper, we considered the most used techniques based on tabular and graph data to define the importance of instances in a dataset and sampling. Even more, differently from the broadly used setting, we consider a more realistic scenario where the data are fully unlabeled and the user does not have any prior knowledge about the data, as the number of classes; Evaluation and analysis on sequential data. Time series are ubiquitous in almost every human activity. Time-oriented data are present in many application domains such as industry, astronomy, medicine, biology, economy, signal processing, among others. Moreover, some types of data such as text or image can be converted to time series. Consequently, the analysis of the time series data has attracted much attention and effort from several researchers around the world. To the best of our knowledge, most of the works in active learning are mostly concerned with textual data and no substantial work has evaluated different methods for sequential data. In addition, we discuss our results in an important application for public health and agriculture, related to insect species classification; Evaluation and analysis considering different amounts of labeled data. We present a trade-off between the number of labeled instances and classification performance. This allows guiding the user about how much instances must be labeled in practical situations; A comparison among semi-supervised learning techniques. Since the sampling methods consider the selection of a portion of data to be labeled, another portion of unlabeled data is discarded and usually not used to learn a classification model. Thus, we consider the use of semi-supervised learning to label the remaining data without requesting the oracle. We verify if the use of this unlabeled data can improve the classification performance achieved by supervised learning techniques.
With these contributions we are able to answer the following questions:
Are there simple and effective alternatives to the random sampling in order to select examples to build an initial training set? Among the evaluated alternatives which one stands out? Can the unlabeled examples (not selected for the initial model) improve the classification performance of the initial training set?
The remaining of the paper is organized as follows. In Section 2, we present the background and related work about time series classification, active learning and unsupervised active learning methods for instance selection. In Section 3 we discuss our experimental setup to conduct the evaluation. The results are presented and discussed in Section 4. Finally, in Section 5 we discuss our main conclusions about the wide evaluation performed in this paper regarding different methods to data sampling selection and active learning in an unsupervised way for time series classification.
In this section, we present the background and related work about the three main concepts about this paper: time series classification, active learning and unsupervised active learning methods for instance selection.
Time series classification
A time series
The task of time series classification has attracted a great interest of the researchers in the past two decades. Many algorithms have been proposed for time series classification, including Decision Trees [46], Neural Networks [41], Hidden Markov Models [34], first-order logic rules with boosting [47], and SVM [67]. However, the literature indicates that a simple One-Nearest Neighbor (1NN) algorithm, with a proper distance measure, presents very good results and frequently outperforming more complex classification algorithms [20, 68]. In a wide experimental evaluation performed by Bagnall and Lines [4] for time series classification, the authors compared 1NN against C4.5, Random Forest, Rotation Forest, Naive Bayes, Bayesian networks and Support Vector Machines with linear and quadratic kernels, and they confirmed that 1NN is hard to beat.
The 1NN is a lazy classifier (no training is required), that consists of assigning to a query time series
The Euclidean distance is probably the most known and used distance to compare time series. It measures the similarity between time series considering observations at the exact same time index
For problems with few training cases, an elastic distance measure such as DTW or LCSS is often superior to Euclidean distance, but as the number of series increases “the accuracy of elastic measures converge with that of the Euclidean distance” [20]. In particular, in this work we consider the 1NN classifier with Euclidean distance in our evaluation given their competitive results compared to more complex methods and other advantages. ED is easy to implement and indexable with any access method and, in addition, this distance measure is parameter-free.
Given the high costs to obtain labeled data in many domains, active learning techniques that require the actual class label for a reduced amount of data have been proposed in the literature to overcome this bottleneck. Active learning aims to select the most informative unlabeled instances and request to an oracle for their labels to retrain a learning algorithm.
In general, active learning literature considers three main approaches [51]:
Membership query synthesis: the model itself generates some synthetic instances to be labeled rather than using real unlabeled instances [2];
Pool-based: given a set of unlabeled instances, the method chooses some representative examples to request their labels and update the current model [36];
Stream-based selective sampling: the unlabeled instances are presented in a stream manner and the method decides online whether or not to query for its label [77].
The pool-based approach is the most studied setting. It has been applied on several application domains such as the automatic classification of texts [64], images [27], videos [69], and music retrieval [39]. Typically, pool-based methods consider a database
Thus, the most important task of active learning methods is to perform good choices in the instance selection process. A common approach is to use evaluation measures to estimate the example’s utility and select the one with maximal utility values. Examples of utility metrics (or sampling selection strategies) are uncertainty sampling [36], query by committee [55], expected model change [54], expected error reduction [49], and density weighted methods [52].
In uncertainty sampling, the instances are selected according to the uncertainty of label prediction. In query by committee, there is a committee of models trained on the currently labeled data. For each unlabeled instance, the committee votes for the label and instances with the largest disagreements on the votes are selected. In expected model change, the instances that cause most change in the current model are selected. In expected error reduction, the instances are selected in order to reduce the expected error of model as much as possible. In density weighted methods, the selected instances must be both uncertain and representative in order to decrease the effect of outliers. This last strategy aims to valid the problems caused by outliers, especially in uncertainty sampling and query by committee strategies.
Algorithm 2.2 presents a general view of the pool-based active learning process. An initial model
[H] General process of active learning methods [25].InputInputOutputOutput
The main difference between the stream-based and pool-based methods is that the former scans the data sequentially and decides whether to label or not an instance individually; whereas the latter evaluates and ranks the entire unlabeled data to decide. However, both settings assume the presence of an initial labeled data
Some approaches such as [37] and [50], perform the active learning from scratch as proposed in this work, i.e., without initial labeled data or classification model. These approaches are based on unsupervised criteria obtained from clusters where samples lying near cluster centers and near borders of clusters are expected to represent the most informative ones regarding the distribution characteristics of the classes. However, in the clustering phase, these methods have an important assumption in their approaches: they know a prior the number of classes of the problem. Thus, the number of clusters is set to the number of classes present in the dataset.
In this work, we are interested in the scenario in which we start the active learning process on a fully unlabeled dataset and with the absence of an initial model/classifier. Although active learning has been widely used in many problems, their use to aid in the labeling of an initial training data or to build an initial classifier has not received much attention. Most of the recent works adopt random sampling to generate an initial training set. However, sampling based on clustering algorithms have presented better results [28, 31, 43, 70]. In other direction, some works have used centrality measures from graph to extract information about unlabeled data and use this information to select instances for the active learning process [3, 35, 38]. However, these two approaches are commonly evaluated separately on different datasets. To give a direct comparison of these approaches, different methods are discussed, evaluated and compared in this work in a wide experimental setting.
To the best of our knowledge, the most similar work to ours in the literature are [28, 31, 3]. However, they performed their evaluations in a reduced number of methods and their evaluations are mostly concerned with textual data. In contrast, we perform a wide comparison of active learning methods using time series data. Time series are ubiquitous in almost every human activity. Time-oriented data are present in many application domains such as industry, astronomy, medicine, biology, economy, signal processing, among others. Moreover, some types of data such as text [1] or image [33] can be converted to time series. Consequently, the analysis of time series data has attracted much attention and effort from several researchers around the world.
In this section, we present a brief overview of the main methods that can be used to select examples from unlabeled data to perform the active learning process. We begin presenting the Random Sampling and the Farthest-First Traversal algorithms. Next, we introduce the clustering algorithms evaluated in this work and how these algorithms can be used to select unlabeled examples. Finally, we present graph-based centrality measures that can be used to extract information from unlabeled data.
Random sampling
Due to simplicity, random sampling is probably the most popular method to select data to be labeled by the oracle for generation of an initial training set in classification problems. According to Zhu et al. [73], this practice is based on the assumption that random sampling will be likely to build the initial training set with the same prior data distribution of the whole data. However, this situation seldom occurs in real-world applications due to the small size of initial training set which is typically used. An important and evident advantage of random sampling, is their constant complexity time to select examples.
Farthest-First Traversal
In order to select a set of diverse examples in a dataset to be labeled, the Farthest-First Traversal algorithm finds
Sampling based on clustering
The goal of data clustering, also known as cluster analysis, is to discover the natural grouping of a set of patterns, points, or objects in an unsupervised manner [30]. Given a representation of
Given an unlabeled dataset
In this work, we evaluated five different cluster algorithms to sampling data: k-Means, k-Medoids, Gaussian Mixture Models, Hierarchical Agglomerative and Density Peak. The choice of these five clustering methods, have considered the popularity, ease of implementation and the performance achieved by the methods in different works of literature. In the next sections we present a general overview of these algorithms in the context of sampling data for an initial training set.
k-Means and k-Medoids
The use of clustering algorithms for selective sampling was firstly presented by Zhu et al. [73] to solve problems in natural language processing such as word sense disambiguation and text classification. They used the k-Means algorithm to build a representative initial training dataset for active learning.
The k-Means is a well-known clustering algorithm that aims to divide
The k-Medoids is similar to k-Means except that the centroids (medoids) must be a data point of the dataset that is being clustered. k-Medoids is more robust than
The Gaussian Mixture Model (GMM) is a model-based clustering algorithm which uses a finite mixture of parametric multivariate Gaussian distributions to model a dataset. Each Gaussian distribution of the GMM represents a cluster of data points. An individual distribution used to model a specific cluster is often referred to as a component distribution. Formally, a GMM is a weighted sum of
where
with mean vector
The complete Gaussian Mixture Model is parameterized by the mean vectors, covariance matrices, and mixture weights from all components. These parameters are collectively represented by
For our purpose to build an initial training data, we consider the number of components to be estimated as the number of clusters.
Hierarchical clustering algorithms create a hierarchy of groups in a bottom-up (or agglomerative) process by merging small groups or in a top-down (or divisive) process by dividing large groups into smaller ones. The data clustering can be visualized in a variety of scales by creating a cluster tree or dendrogram.
Due to the simplicity and popularity, we choose the agglomerative clustering in this work. The algorithm starts by computing the similarity matrix between all pairs of examples to be clustered. In the next step, the algorithm iteratively selects two clusters with the largest affinity under a certain measure to merge, until some stopping criteria is reached. After preliminary experiments, we choose the Ward’s (minimum variance criterion) [66] along with the Euclidean distance to compute the similarity between clusters.
To select
The Density Peak is a recently proposed clustering algorithm based on density [45]. This is an alternative algorithm to density based algorithms such as Mean Shift [26] and DBSCAN [21]. Density Peak presents competitive results with state-of-the-art algorithms in data with arbitrary shape without requiring the number of clusters as input.
The algorithm has the assumption that cluster centers are surrounded by neighbors with lower local density and that they are at a relatively large distance from any points with a higher local density. Thus, for each instance
With these two quantities computed for all examples,
To use the Density Peak algorithm together with active learning techniques, we sort the examples in a decreasing order considering
In the last years, various graph-based algorithms, mainly for semi-supervised learning (SSL), have been proposed in literature [76, 10]. They rely on a similarity-based graph built from a dataset, in which the objects represent the examples and the undirected edges represent the similarity between these examples. Besides semi-supervised learning, the graph topology also allows to extract properties from the dataset and from each object/example in an unsupervised manner [42]. Properties extracted from the objects allow us to measure the importance of an object in the graph. These measures are called centrality measures. In this article, we consider that the central objects of a graph correspond to representative examples that will be selected and labeled by experts. In next sections, we present details about graph construction and centrality measures evaluated in our experiments.
Formally, a graph is defined by
There are a few different ways to build a graph based on similarity between objects. The most common are [76]:
Fully connected graph: in this graph, every pair of objects is linked by an edge. Usually the edge weights between an object representing an example
where
In this paper we use Mutual
We use centrality measures to provide scores to the graph objects. The objects are sorted in a descending order considering the generated centrality scores, and the top
Is this paper we consider three centrality measures which are the basis for others and have been used in literature for active learning [3, 42]: i) Degree, ii) PageRank, and iii) Betweenness.
Degree is the simplest centrality measure. The degree score of an object
Although the degree is a simple centrality measure, it is insightful about the importance of objects in a graph. For instance, in social networks, objects with high degree score have a bigger influence in a higher number of people than objects with low degree. In the scientific domain, papers with a large number of citations are more influential than papers with low degree score [42].
Centrality scores can also be computed by importance of propagation scores, as performed by PageRank [8]. The idea behind PageRank is that the importance of an object is given by the importance of linked objects. In other words, an object is important (or central) if its linked objects are also important. The PageRank’s score for an object
in which
Another way to compute centrality measures is considering the importance of an object in the geodesic path among all objects of a graph. A geodesic path is the shortest path, in terms of number of edges traversed, between a specified pair of vertices. The Betweenness measure compute centrality scores through the sum of the number of times that an object
where
In this section, we present and compare the time and space complexity of the active learning algorithms presented in the previous section. We divide the complexity analysis into two groups due the characteristics and the notations to be used:
First group: composed by the Farthest-First Traversal and Clustering-based algorithms ( Second group: composed by centrality measure algorithms based on graphs (Degree, PageRank, Betweenness).
The time and space complexity of the first group of algorithms are summarized in Table 1. In this table, we consider
Time and space complexity of first group of algorithms.
The time and space complexity of the second group of algorithms are summarized in Table 2. In this table, we consider
Time and space complexity of second group of algorithms.
In this section, we present a description of the datasets evaluated in our experiments and a general overview of our evaluation framework considering the scenarios of supervised and semi-supervised learning.
Datasets description
Due to the fact that time-oriented data are present in many application domains and the possibility of conversion of another type of data into time series, we choose this sequential data to perform our experimental evaluation. To facilitate the execution of experiments and the direct comparison of our results with other methods, we used the UCR Time Series Archive [13]. These datasets have standard partitions of training and test sets which are widely accepted in the literature of time series classification. Thus, in our experimental setup, we selected examples from the training set of each dataset to perform the active learning process and we evaluated their performance in the test set. This setup allows evaluating the performance of classification using a reduced amount of labeled data and a direct comparison against the accuracy achieved by a classifier with all training data.
Many datasets from UCR have a training set with a reduced size. As one of the major motivations to perform the active learning process is to reduce the human annotation effort by selecting some examples in a massive amount of data, we decided to select 24 popular datasets that have more than 150 instances on the training set. For datasets with less than 150 instances, we believe that all instances can be labeled with minor effort. A description of these data with respect to the number of classes, the size of training and test sets, the length of series, the accuracy achieved by the 1-NN classifier with Euclidean distance using all training data in a holdout procedure and domain, are presented in Table 3. We also present the imbalance ratio of each dataset, which considers the ratio between the number of instances of the majority and minority classes [23]. We can observe that the datasets are quite varied in their characteristics.
Description of datasets evaluated
Description of datasets evaluated
We can observe in Table 3, that many of datasets are obtained from figure shapes. This means that a two-dimensional shape of an image was converted to a single-dimensional time series by calculating the distance between a central point and the object contour. In Fig. 1 we can observe an example of this conversion given the datasets OSULeaf and Yoga, respectively.
Examples of figure shape datasets. In these datasets, given a figure, the distances from contour to a central point are measured resulting in a series in which ordinate values are the distances measured and abscissa values are increasing clockwise angle values around the contour. (a) OSULeaf; (b) Yoga.
Another popular data domain in our evaluation are the series measured by accelerometers. One example is the uWaveGestureLibrary dataset. This dataset consists of the performances of the gestures shown in Fig. 2 by eight participants. To collect data, the participant held a Nintendo Wii remote and repeated each of the eight gestures ten times. As each accelerometer has three synchronous measures for three axes (X, Y, and Z), at the end we considered 3 different datasets: uWaveGestLib-X, uWaveGestLib-Y, and uWaveGestLib-Z.
Gesture vocabulary of datasets uWaveGestLib-X (on x-axis), uWaveGestLib-Y (on y-axis), and uWaveGestLib-Z (on z-axis) [56]. The dot denotes the begin and the arrow the end of the gesture.
We consider only the access to a pool with
A general view of the instance selection process considered in the experimental evaluation.
From the provided unlabeled data
Using the Farthest-First Traversal algorithm; Using the random sampling; Constructing a graph using the unlabeled data and extracting the centrality measures Betweenness, Degree and PageRank. After computing these measures, we sort them in a descending order and the Running clustering algorithms such as k-Means, Agglomerative Hierarchical or Gaussian Mixture Models. In this case, we consider two possibilities to select the most representative examples to be labeled: i) we choose examples only near the cluster centers or ii) near the cluster centers and near the cluster borders. In the first case, the number of clusters
As pointed by Kang et al. [31], an active learning strategy which iteratively outputs individual instances to be label by the user at each step of the algorithm, is time-consuming and requires a full-time annotator during the whole process. In addition, annotators frequently perform better by categorizing examples in groups, since the oracle will be able to compare different examples and process them in any order. Thus, we consider that the selected instances by the algorithm are presented to the expert all at once and non-iteratively.
In order to cover a wide variation for each dataset, we vary the portion of selected examples that will be labeled by the oracle from 5% to 95% of training size. Thus, for small datasets such as Haptics (with 155 training examples), we started our evaluation considering only 8 examples to be labeled by the oracle. On the other hand, for datasets as NonInvasiveFetalECG Thorax (with 1800 training examples), we started our evaluation considering 90 examples to be labeled by the oracle.
After the labeling process, we have two pools of examples. The first one has
A general view of the experimental evaluation. We consider two scenarios: 
More details about the evaluations showed in Fig. 4 are discussed following:
Evaluation using supervised learning: Using only the examples from
Evaluation using semi-supervised learning: Using the oracle-labeled set
In general, inductive semi-supervised learning can be performed through two steps: i) transductive semi-supervised learning to assign labels to unlabeled instances; and ii) extraction of a classification model considering the labels of instances after transductive learning. Algorithms, such as Self-Training, GFHF, and LLGC performs transductive learning, i.e., they classify the unlabeled instances. Thus, strategies to extract classification model from the set of labeled instances after transductive learning are required.
In transductive learning performed by Self-Training, initially the labeled instances in
[H] Self-Training.InputInputOutputOutput
The extraction of a classification model after a transductive learning based on tabular data, such as Self-Training, is straightforward [76]. In case of Self-Training, the classification model induced considering all instances, i.e., when all unlabeled instances were labeled by the algorithm, is used to classify new instances.
Graph-based algorithms can also be used to perform transductive learning. The first step is to build a graph considering the instances of the dataset. The approach presented in Section 2.3.4 can be used to perform such task. In our experiments, we considered Mutual
We used the algorithms GFHF and LLGC to perform transductive learning on graphs. Both algorithms have an iterative version in which the labels of instances are iteratively propagated through the neighboring instances. The strength of label propagation is proportional to the strength of a connection considering all connections of a node (GFHF) or all connections of a pair of nodes (LLGC). Besides, LLGC allows to change the labels of labeled instances during the classification process according to a user-defined value of
Both LLGC and GFHF produce a weight vector
[H] Gaussian Fields and Harmonic Functions.InputInputOutputOutput
[H] Learning with Local and Global Consistency.InputInputOutputOutput
After obtain the
where
In this section, we present the results regarding the two previously discussed evaluations presented in Fig. 4. In the first evaluation, the initial training set had only data selected by the active learning algorithms. In the second evaluation, we used inductive semi-supervised learning algorithms to label the remaining unlabeled data that were not selected. Given some common patterns found in our results and to avoid showing a great number of plots, we selected a reduced number of cases for discussion. However, all the results are included in the online supplementary material for this paper [60].
Evaluation using supervised learning
We begin our evaluation by classifying the test data using a model induced from a training set with
As previously discussed, we evaluated two main approaches for sampling data based on clustering. In the first approach, examples near the cluster centers and near the cluster borders were selected. Thus, we evaluated a different number of border examples to be selected. We considered
Average classification accuracy achieved by a classifier that uses a training set sampled by clustering algorithms using 1 example near from cluster centers and 3 examples from the cluster borders
Average classification accuracy achieved by a classifier that uses a training set sampled by clustering algorithms using 1 example near from cluster centers and 3 examples from the cluster borders
The average classification accuracy achieved by the sampling approach that considers only examples near the cluster centers are presented in Table 5. In this table, we also compare the method which achieved the best mean accuracy in Table 4, the
General results of evaluated methods for each dataset considering the average accuracy for 5% to 95% of labeled data.
Results in Table 5 show that for all 24 datasets, the best results are achieved in general by k-Means and Hierarchical clustering. Only in the StarLightCurves dataset, the best result was achieved by the Farthest First Traversal algorithm. We also present the number of times that each method outperformed random sampling, the baseline considered in this work. Although k-Means and Hierarchical clustering present better results, k-Medoids achieved the highest number of wins against random sampling. From the results presented in Table 5, we can also note that considering only examples near the cluster centers slightly surpass the accuracy obtained considering both examples near cluster centers and cluster borders (presented in Table 4).
We performed the Friedman test with the Nemenyi post-hoc with 95% as confidence level to statistically compare the results of different algorithms. In Fig. 5 we present the critical difference diagram for ranked accuracies that illustrates the test.
Critical difference diagram considering the average results of each algorithm.
In this diagram, the algorithms are sorted according to their average ranking. The algorithms connected by a line do not present statistically significant differences among them [19]. Thus, we can note in the diagram from Fig. 5 that k-Means is ranked in first place. In contrast, random sampling, the most used method in the literature, is ranked only in the 5th among 10 evaluated methods. There is no statistically significant difference among k-Means, Hierarchical clustering, k-Medoids and Gaussian Mixture Models. However, k-Means and Hierarchical clustering are superior to random sampling with statistically significant difference.
In Fig. 5, we consider the average results achieved by the algorithms with 5% to 95% of labeled data. In Fig. 6 we show that this ranking can slightly change if we consider a different number of labeled examples. In the example, we consider 20% and 50% of labeled data.
Critical difference diagrams considering the results achieved by the algorithms with 20% and 50% of training set labeled. (a) 20% of labeled data; (b) 50% of labeled data.
In the three presented cases, the methods that provided the worst performance are same: Degree centrality measure, Farthest First Traversal, Betweenness centrality measure, and Density Peak clustering algorithm. Thus, we do not recommend these methods to select initial data on fully unlabeled data. Based on our results, we can recommend k-Means and Hierarchical clustering rather than the random sampling. k-Means is computationally more efficient than Hierarchical clustering. However, k-Means results present some variability due to the non-deterministic nature of the algorithm.
To present a general view of results for all datasets in a paired comparison against random sampling, we show a graphical representation in which each dataset is represented by a point and the
General results of k-Means, Hierarchical clustering, and k-Medoids compared to random sampling considering the average of results from 5% to 95% of labeled data. (a) Random 
We can note in the three illustrations of Fig. 7, that almost all points are situated above the main diagonal (except Haptics dataset in
Similar results of k-Means and Hierarchical clustering methods. (a) FaceAll; (b) FacesUCR; (c) TwoPatterns.
Analyzing the results of methods based on clustering varying the amount of labeled data, we can observe that in many cases as the datasets FaceAll, FacesUCR and TwoPatterns, both algorithms k-Means and Hierarchical clustering presented very similar results as observed in Fig. 8. In these figures, the inferior x-axis represents a percentage of training data which is selected by the algorithm to be labeled, the superior x-axis represents the respective values in absolute terms and in y-axis presents the accuracies in a holdout evaluation using the test data. Obviously, with 100% of labeled training data all the algorithms converge to the same result which can be observed in the last column from Table 3.
Datasets with inconclusive results in the evaluation on test set. (a) ChlorineConcentration; (b) Haptics.
Among all the 24 datasets, only in ChlorineConcentration and Haptics the random sampling achieved competitive results as presented in Fig. 9. However, we can note that for ChlorineConcentration, all methods present very similar results. For Haptics, due to some characteristic of the data distribution, Hierarchical clustering does not perform very well. It is also interesting to note that with only 25% (30 examples) of the labeled data, the algorithm k-Medoids achieves almost the same result considering 100% (or 155 examples) of labeled data.
Datasets with worst results considering the Density Peak clustering and Farthest-First Traversal algorithms. (a) SwedishLeaf; (b) Synthetic Control.
For the Density Peak and Farthest-First algorithms, we highlighted the poor results on the SwedishLeaf and Synthetic Control in Fig. 10. Although these methods present poor results in many datasets, the behavior is more evident in these two.
The results achieved by the Farthest-First Traversal are justified given that the algorithm is susceptible to noise and outliers. In addition, even with the absence of noise or outliers, the algorithm is more likely to select border examples in the feature space. This characteristic becomes a problem when the dataset has many classes and/or the classes are not well separated in the feature space.
In relation to the Density Peak, we noted in our experiments that the cutoff distance
Considering only the results achieved by methods based on centrality measures and varying the amount of labeled data, we can note that the random sampling is very competitive. As previously discussed, considering 20% and 50% of the labeled data or the average of results, the measures Degree, and Betweenness are in the top 3 worst methods among the 10 methods evaluated in this paper. The PageRank centrality measure presents better results than the others, but its results are comparable to the random sampling. For example, in the datasets SwedishLeaf and Yoga, the random sampling presents a slightly better accuracy than the competitors methods, although the PageRank measure presents a similar performance in some cases, as presented in Fig. 11.
Datasets in which random sampling slightly outperforms the results achieved by methods based on centrality measures. (a) SwedishLeaf; (b) Yoga.
The differences in the results are even less evident in datasets as ChlorineConcentration, Fish, Synthetic Control and uWaveGestureLibrary Y. As we can observe in Fig. 12, for ChlorineConcentration and uWaveGestureLibrary Y, the results of random sampling and PageRank are very similar, although the PageRank shows a slightly better performance at the beginning with less labeled data. However, for the Synthetic Control dataset, the Betweenness measure slightly outperforms the results achieved by random sampling and PageRank.
Datasets where methods based on centrality measures slightly outperformed the random sampling. (a) ChlorineConcentration; (b) Synthetic Control; (c) uWaveGestureLibrary Y.
From the results achieved from graph-based methods, we can conclude that it is safer to avoid these methods to select an initial data to be labeled. These methods present similar results to the results achieved by random sampling with more effort and only rarely present slightly better results. We believe that the poor results of these methods are justified by the fact that when we map the tabular representation to a graph-based representation, we are changing the view of the data, and the use of centrality measures extracts the importance of instances in the graph topology, not in the tabular data. Thus, the most important instances in one view may be noisy in another view.
Datasets where semi-supervised learning improves the quality of initial training set. (a) Cricket-X; (b) TwoPatterns; (c) Haptics; (d) StarLightCurves.
At this point we are in condition to answer two of the three research questions posed earlier:
Question 1: Are there simple and effective alternatives to the random sampling in order to select examples to build an initial training set? Answer: Yes. Considering our experimental evaluation, we have that the simple methods based on clustering algorithms such as k-Means, Hierarchical and k-Medoids are able to provide an initial training data better than the random sampling in terms of predictive power. In general, these algorithms are easy to understand and their implementations are easily found online. Thus, the use of random sampling can be replaced by a method based on clustering that generally presents better results.
Question 2: Among the evaluated alternatives which one stands out? Answer: We can note that the best results are achieved by the sampling based on clustering methods which select examples near the cluster centers. In particular, by k-Means and Hierarchical clustering. However, we recommend the use of Hierarchical clustering since it is a deterministic algorithm. In general, this method outperforms the random sampling. Regarding the sampling based on centrality measures, PageRank provides better results among the three evaluated measures. However, its results are not consistently better than the random sampling. The Farthest-First Traversal presents the worst results together with the Density Peak clustering algorithm and they are usually outperformed by random sampling.
The use of active learning methods to build an initial training data is justified by the huge amount of data that can be easily collected in many real-world applications, but can not be fully labeled due to the high costs. In the evaluation previously discussed, we consider the performance of a classifier that selects only
Datasets where semi-supervised learning deteriorates the quality of initial training set. (a) Cricket-Y; (b) SwedishLeaf; (c) Synthetic Control; (d) uWaveGestureLibrary Y.
According to our experiments, the best algorithms to select examples to be labeled in the active learning process were k-Means and Hierarchical clustering. However, the Hierarchical clustering has the advantage of being deterministic. Thus, we use the labeled data selected by the Hierarchical clustering to evaluate if a higher amount of data in training set using unlabeled data can improve the results obtained by instance selection. We present a comparison of accuracies achieved by the classifier that uses only the selected examples and those that also use the unlabeled data in the training set after the inductive semi-supervised learning process.
Due to the parameter settings of each evaluated semi-supervised learning algorithm, the results present some variability. For this reason, we will present the mean accuracy and the standard deviation achieved by the algorithms varying the values of the parameters and the amount of labeled data.
Of a total of 24 datasets evaluated in this work, we note that the SSL improves the quality of classification in approximately 8 datasets. From these datasets, we present the results of 4 of them (Cricket-X, Haptics, StarLightCurves, and TwoPatterns) in Fig. 13. In Fig. 13b and Fig. 13d, we can observe that the Self-Training algorithm can improve slightly the quality of the training set by the use of unlabeled data to complement the labeled data. On other hand, Self-Training achieves poor or equal results compared to the use of only labeled data on the Cricket-X (Fig. 13a) and Haptics (Fig. 13c) datasets. In the Cricket-X dataset, the best results are achieved by the GFHF algorithm, and in Haptics dataset, the best results are achieved by the LLGC algorithm.
Curiously, we note in some datasets that the use SSL to label the remaining unlabeled data in the training set was responsible for deteriorating the quality of initial training set. Thus, in these cases, the best choice is the use of only labeled data by the oracle. To better illustrate this problem, we present the results of datasets Cricket-Y, SwedishLeaf, Synthetic Control and uWaveGestureLibrary-Y in Fig. 14. In all of these datasets, we can note that the Self-Training presents slightly worse results than the use of only labeled data. However, LLGC and GFHF present considerably worse results.
This behavior of Self-Training algorithm occurs when the initial most confident instances are wrongly classified in the first iterations. Thus, a poor classification model is generated in the first iteration and the classification errors are propagated through the next iterations [76].
Illustration of laser sensor to capture information about insects and an example of audio signal collected. (a) Sensor; (b) Signal.
Graph-based algorithms tend to obtain higher classification performances than Self-Training in several application domains [48, 76, 72, 75]. However, in many datasets, they present poor results than supervised learning algorithms [48]. In our evaluation, i.e, considering sequential data, we also verified that they tend to obtain poor classification performances than the obtained by supervised learning.
At this point we are in condition to answer the last research question made earlier in this work:
Question 3: Can the unlabeled examples (not selected for the initial model) improve the classification performance of the initial training set? Answer: Of a total of 24 datasets evaluated, we note that the SSL algorithms can slightly improve the results of only one-third of the cases. Among the three evaluated SSL algorithms (GFHF, LLGC, and Self-Training), we note that the Self-Training presented the best results. However, in many datasets the SSL algorithms were responsible for deteriorating the quality of the training set. Thus, besides presenting an additional cost in the building process of a labeled training set, we do not have evidence that the SSL algorithms can improve the quality of the training data.
In our case study, we present a real application of insect species classification using laser sensors in a data stream environment [61]. There are two main motivations to classify insects species using the laser sensor: i) provide estimates of insects’ population that can aid traditional methods of control such as the spreading of insecticides and larvicides with less waste and ii) the laser can be used on intelligent traps that attract insects using allures and capture only specific target species as disease vectors and agricultural pests as proposed in [58].
Basically, the sensor uses a low-powered planar laser source that is pointed at an array of phototransistors as presented in Fig. 15a. When a flying insect crosses the laser, its wings partially occlude the light, causing small variations in the light captured by the phototransistors. These variations are recorded as an audio signal, as the example presented in Fig. 15b. In this example, we can note an audio segment with one-second length where most of the time we have background noise and a higher amplitude signal that lasts for tenths of a second that constitute an insect passage. From the audio data with this passage, it is possible to extract discriminant features for each species. In this work, we use the Mel-Frequency Cepstral Coefficients (MFCC) as recommended in our previous evaluation [59]. MFCCs are popular features in various application domains, particularly speech and speaker recognition [71] as well as musical instruments classification [63].
In controlled experiments in the laboratory, we can consider that we know a subset of species and we have an initial training data to evaluate discriminant features and classifiers. However, in the use of the sensor in field conditions, we do not have the knowledge of potential species at the specific region where the sensor will be placed. Even if it were possible to have knowledge of these species to collect data in the laboratory previously, the environmental conditions in the laboratory are possibly different from the field conditions. It is known that climate variations such as temperature [62], air pressure [9] and humidity [40] can influence the insects’ metabolisms and consequently the measured data. Thus, the previously collected data in the laboratory cannot reflect the current data observed in field conditions.
We propose in this paper a simple and practical solution to collect an initial training data and build an accurate classifier for the sensor of insects using a subset of examples. Our idea is to place the sensor in field conditions to collect data for a period of time and use active learning methods to select some of the data. An expert will analyze the selected signals and provide their respective labels.
Given the general results achieved in our experimental evaluation, in this case study, we only consider the three best methods based on clustering: k-Means, Hierarchical, and k-Medoids to select the data to be labeled. We also evaluated the popular method random sampling as a baseline method. We use the same dataset evaluated in [61], which contains two species of flies (Drosophila melanogaster and Musca domestica) and three species of mosquitoes (Culex quinquefasciatus, Culex tarsalis and Aedes aegypti) distributed in a total of 5,325 examples collected over a period of seven days. In our experiments, we considered a setup time of one day. In other words, the sensor collects data over one day and from this subset of unlabeled data, we select some of the data to present to an expert that will provide the respective labels. Then, we use this initial training set to evaluate the classification accuracy for the next six days. One day of data collection constitutes the first 1,083 examples of the dataset. Thus, the test is performed over the 4,242 remaining examples.
The accuracies achieved by the evaluated methods considering a different amount of labeled data are presented in Fig. 16.
Accuracies of methods based on clustering to select initial data to classify insects.
We can note that the results presented in Fig. 16 are according to those presented in our experimental evaluation in the benchmark datasets. We noted that Hierarchical clustering achieved better or equal results to the k-Means. These results are considerably better than random sampling. Among the methods based on clustering, k-Medoids presented the worst results, but still better than random sampling.
It is also interesting to note that the Hierarchical clustering algorithm to sampling data presents an accuracy of 0.874 with 35% (or 379 examples) of labeled examples. This result is very competitive if we consider that with 100% (or 1083 examples) of training data labeled, we have a very similar accuracy of 0.877. Thus, we noted that the use of methods based on clustering as Hierarchical clustering to select data to be labeled allows to reduce drastically the cost to build an accurate classifier.
Critical difference diagram for the results achieved by methods to select initial data to classify insect species.
Given the results presented in Fig. 16, we performed the Friedman test to compare the results statistically. The critical difference diagram is presented in Fig. 17. We can note that Hierarchical clustering is the first in the ranking without a significant difference to k-Means. We also observe no significant difference among the results of k-Means and k-Medoids, and the results of k-Medoids and random sampling. However, Hierarchical and k-Means are statistically better than random sampling.
Finally, we evaluated if the use of semi-supervised learning algorithms can improve the quality of the training set labeled by the oracle after the sampling based on Hierarchical clustering. Once again, we chose the data selected by Hierarchical clustering due to the best achieved results, as previously discussed. The results achieved by the algorithms GFHF, LLGC and Self-Training are presented in Fig. 18.
Results of semi-supervised learning algorithms for insects classification.
Similarly to the previously presented results for benchmark data, we can note that the SSL algorithms are responsible for deteriorating the quality of training set labeled by the oracle.
Although active learning has been widely used in many problems, the majority of methods from literature consider the presence of an initially labeled dataset or a trained classification model. Thus, previously labeled data are useful to rank unlabeled data according to their importance. Then, the user must label just the top ranked instances and the training data will be enriched with those instances. However, in some real applications, there is the need to perform the active learning procedure at the beginning of the life-cycle of the problem on a fully unlabeled dataset and with the absence of a classification model.
In this fully unlabeled scenario, most works in literature randomly sample examples to be labeled by an oracle. Although it is a simple and popular approach, we question in this paper if this practice is the most effective for this problem. We believe that such practice is popular due to the lack of a more in-depth study which shows that other simple methods can be a better choice.
In this direction, we evaluated in this paper 9 different methods based on clustering algorithms and centrality measures from graphs to select the most informative examples of a fully unlabeled dataset to be labeled by an oracle. Based on our wide experimental evaluation performed on 24 time series datasets and one case study of insects’ classification, we found that some simple approaches present better accuracy than the random sampling. The best results are achieved by the sampling based on clustering methods, in particular, by k-Means and Hierarchical clustering. These methods are simple and readily available in many software packages. Thus, the random sampling method can be easily replaced by one of these methods in order to obtain better results. On the other hand, the Density Peak clustering algorithm and the three evaluated centrality measures from graphs (Degree, Betweenness, and PageRank) have performed worse than the simple random sampling.
After the selection of instances to be labeled by an oracle to build an initial training set, we also evaluated if the use semi-supervised learning algorithm can help to improve the quality of the training set by the addition of more examples. Unfortunately, we note that in addition to present an additional cost in the building process of a labeled training set, we do not have evidence that the SSL algorithms can consistently improve the quality of the training set.
Footnotes
Acknowledgments
The authors would like to thank São Paulo Research Foundation (FAPESP) for the support in the grants #2011/17698-5, #2011/12823-6, and #2014/08996-0 and National Counsel of Technological and Scientific Development (CNPq) for the support in the grants #446330/2014-0, and #303083/2013-1.
