Unsupervised active learning techniques for labeling training sets: An experimental evaluation on sequential data

Abstract

Many real-world applications, such as those related to sensors, allow collecting large amounts of inexpensive unlabeled sequential data. However, the use of supervised machine learning methods is frequently hindered by the high costs involved in gathering labels for such data. These methods assume the availability of a considerable amount of labeled data to build an accurate classification model. To overcome this bottleneck, active learning methods are designed to selectively label the most informative examples instead of requesting all true labels. Although active learning has been widely used in many problems, most of the methods assume the presence of labeled data or some prior knowledge about the problem, such as the number of classes. In contrast, in this paper we are interested in the realistic scenario where active learning is performed from scratch on a fully unlabeled dataset, in the absence of any classifier or prior knowledge about the data. In general, the methods that consider fully unlabeled data use random sampling to select examples to label. The goal of this work is to present a broad experimental evaluation of different unsupervised active learning methods for selecting examples from fully unlabeled sequential data. We evaluated methods based on clustering algorithms and graph centrality measures for instance selection, as well as the performance of supervised and semi-supervised learning algorithms in the classification task. Based on our evaluation on a benchmark of sequential data and on a case study of insect species classification, we recommend sampling based on hierarchical clustering or k-Means. These methods perform statistically significantly better than the popular random sampling. In addition, they are simple algorithms that are readily available in many software packages.

Keywords

Unsupervised active learning; training set labeling; clustering; centrality measures; sequential data

1. Introduction

In many real-world applications, such as sensors and measurement devices, massive amounts of unlabeled sequential data are available. However, learning a classification procedure in these applications requires labeling a considerable portion of the data, since supervised machine learning algorithms assume the availability of plenty of labeled data. Unfortunately, the labeling procedure can be expensive, time consuming, or highly dependent on a domain expert. In cases where abundant unlabeled data are cheap but acquiring their labels is costly, active learning methods can drastically reduce the effort of the domain expert by performing instance selection to find a small portion of data to be labeled within a huge amount of unlabeled data.

Active learning studies how to selectively label the most informative examples instead of requiring the true labels for all instances. The active learning methods attempt to overcome the labeling bottleneck by making queries in the form of unlabeled instances to be labeled by an oracle (e.g., a human annotator). In this way, the active learner aims to achieve high accuracy using as few labeled instances as possible, minimizing the cost of obtaining labeled data [51].

Generally, an active learner begins with a small set of labeled instances, selects a few informative instances from a pool of unlabeled data based on past knowledge, and queries their labels from an oracle [53]. Some examples of query strategies to select the most informative or representative data to be labeled are sampling based on uncertainty, query-by-committee, expected model change, expected error reduction, variance reduction, and density-weighted methods (a comprehensive review of these methods is presented by Settles [51]). Note that all these strategies depend on the existence of an initial classification model. Thus, these approaches require a set of labeled examples in order to obtain an initial model that will provide the information to start the unlabeled instance selection task.

Unlike the broadly used setting, we are interested in a more realistic scenario in which we start the active learning process on a fully unlabeled dataset and without any prior knowledge about the data. Therefore, we cannot assume the existence of an initial classifier that provides useful information about unlabeled examples, nor knowledge of the number of classes of a problem, which would allow careful tuning of an algorithm's parameters. Although active learning has been widely used in many application domains, its use on a completely unlabeled set has not received much attention. According to a review of 206 papers from top-tier conferences such as NIPS, ICCV, CVPR, ICML, UAI, and ECML and journals such as Machine Learning, Pattern Recognition, and Data Mining and Knowledge Discovery, performed by Hu et al. [28] in 2010, over 94% of researchers used a randomly selected initial training set or failed to specify their criteria. Fewer than 6% used a targeted approach to populate their initial training set. In a less systematic review of recent years covering conferences such as SDM, ICDM, and KDD, and top-tier journals, we observed that this scenario has not changed significantly. We believe that this practice is due to the lack of a more in-depth study showing that simple methods can be a better choice than random sampling. Thus, this study aims to contribute in this direction with a wide review and experimental evaluation. Given the increasing popularity of real-world applications that generate or process sequential data, our evaluation focuses on time series data.

For a concrete example of a real application that needs active learning methods on fully unlabeled data from scratch, consider the problem presented by Souza et al. [61], in which a laser sensor is proposed to perform online classification of flying insects. Insect classification is an important problem, since the automatic identification of species can be used to build devices to control harmful species such as disease vectors and agricultural pests. The sensor generates a huge amount of sequential data in an inexpensive way; however, the process of labeling an initial training set is expensive and dependent on a well-trained expert. For convenience, the authors assume an initially labeled dataset previously collected in a laboratory to start a classifier that is incrementally updated over time. However, this assumption has some drawbacks, including the lack of knowledge of which insect species will cross the laser in the field and the fact that the environmental conditions of the laboratory may differ from those of the field. Frequently, these conditions affect the behavior of the insects and, consequently, the data measured by the sensor. Thus, the current model can be inefficient in real conditions.

To mitigate this problem, a simple practical solution is to place the sensor in the field conditions to collect data for a period of time. We can use the collected data to build an initial classifier that will be used to process new incoming data in an online fashion. As all the collected data are unlabeled, an expert can analyze a subset of the signals and provide the respective labels. We can sample the most representative examples using the techniques discussed in this paper to reduce the domain expert effort to label the initial training set. We return to this application in Section 4.3, where we perform a case study.

There are many other examples in which unlabeled data are inexpensive to collect, but costly to label. For instance, in medicine, the non-invasive electrocardiogram (ECG) and electroencephalogram (EEG) procedures can record massive amounts of low-cost data; however, data labeling is a more tedious process that requires a trained clinician. In the industry, for instance in electric power plants or automotive manufacturing, huge amounts of sensor data are collected to monitor the fabrication process. However, a supervised classification task such as fault detection requires a domain expert to revise the data in order to identify the moment that precedes a fault.

In this paper, we provide a broad experimental evaluation of active learning methods to build an initial training set from fully unlabeled data. The evaluation considers a combination of active learning strategies to select instances, different numbers of instances to be selected, and machine learning algorithms. Given the growing popularity of sequential data, we carried out our experimental evaluation on 24 time series datasets from the UCR benchmark repository, covering different application domains such as biology, economy, entomology, and medicine, and on a promising case study of insect classification by laser sensor. The contributions of this paper come from the analysis of the results provided by this extensive empirical evaluation. The main contributions of this paper are summarized as follows:

•
Evaluation and analysis considering the most representative and relevant active learning strategies for instance selection. Most previous works in the literature considered a reduced number of methods. In this paper, we considered the most used techniques based on tabular and graph data to define the importance of instances in a dataset and to sample them. Moreover, unlike the broadly used setting, we consider a more realistic scenario where the data are fully unlabeled and the user does not have any prior knowledge about the data, such as the number of classes;
•
Evaluation and analysis on sequential data. Time series are ubiquitous in almost every human activity. Time-oriented data are present in many application domains such as industry, astronomy, medicine, biology, economy, and signal processing, among others. Moreover, some types of data such as text or image can be converted to time series. Consequently, the analysis of time series data has attracted much attention and effort from researchers around the world. To the best of our knowledge, most works in active learning are concerned with textual data and no substantial work has evaluated different methods for sequential data. In addition, we discuss our results in an important application for public health and agriculture, related to insect species classification;
•
Evaluation and analysis considering different amounts of labeled data. We present a trade-off between the number of labeled instances and classification performance. This helps guide the user on how many instances must be labeled in practical situations;
•
A comparison among semi-supervised learning techniques. Since the sampling methods consider the selection of a portion of data to be labeled, another portion of unlabeled data is discarded and usually not used to learn a classification model. Thus, we consider the use of semi-supervised learning to label the remaining data without requesting the oracle. We verify if the use of this unlabeled data can improve the classification performance achieved by supervised learning techniques.

With these contributions we are able to answer the following questions:

1.
Are there simple and effective alternatives to the random sampling in order to select examples to build an initial training set?
2.
Among the evaluated alternatives which one stands out?
3.
Can the unlabeled examples (not selected for the initial model) improve the classification performance of the initial training set?

The remainder of the paper is organized as follows. In Section 2, we present the background and related work on time series classification, active learning, and unsupervised active learning methods for instance selection. In Section 3, we discuss our experimental setup. The results are presented and discussed in Section 4. Finally, in Section 5, we discuss our main conclusions from the wide evaluation performed in this paper regarding different methods for unsupervised data sampling and active learning for time series classification.
2. Background & related work

In this section, we present the background and related work on the three main concepts of this paper: time series classification, active learning, and unsupervised active learning methods for instance selection.

2.1 Time series classification

A time series $S_{i}=s_{1},s_{2},\ldots,s_{l}$ , is an ordered sequence of $l$ real-valued variables obtained through repeated measurements over time. Given a set of $m$ unlabeled time series from a set $D^{U}$ , the task of time series classification is to map each time series to one of the predefined classes. In order to do so, a dataset $D^{L}$ with $n$ labeled time series is required.

The task of time series classification has attracted great interest from researchers in the past two decades. Many algorithms have been proposed for time series classification, including Decision Trees [46], Neural Networks [41], Hidden Markov Models [34], first-order logic rules with boosting [47], and SVM [67]. However, the literature indicates that a simple One-Nearest Neighbor (1NN) algorithm, with a proper distance measure, presents very good results and frequently outperforms more complex classification algorithms [20, 68]. In a wide experimental evaluation performed by Bagnall and Lines [4] for time series classification, the authors compared 1NN against C4.5, Random Forest, Rotation Forest, Naive Bayes, Bayesian networks, and Support Vector Machines with linear and quadratic kernels, and they confirmed that 1NN is hard to beat.

The 1NN is a lazy classifier (no training is required), that consists of assigning to a query time series $Q$ , the label of the most similar time series $S_{i}$ from the labeled training set $D^{L}$ according to a distance measure. There are over a dozen distance measures for similarity of time series data in the literature, e.g., Euclidean distance (ED) [22], Dynamic Time Warping (DTW) [32], Longest Common Subsequence (LCSS) [65], Edit Distance with Real Penalty (ERP) [11], Edit Distance on Real sequence (EDR) [12], DISSIM [24], Complexity-Invariant Distance (CID) [6], Recurrence Patterns Compression Distance (RPCD) [57].

The Euclidean distance is probably the best known and most widely used distance for comparing time series. It measures the similarity between time series considering observations at the exact same time index $j$ according to Eq. (1), where we consider two time series $S$ and $Q$, both with the same length $l$.

$\textit{ED(S, Q)}=\sqrt{\sum_{j=1}^{l}(s_{j}-q_{j})^{2}}$ (1)

For problems with few training cases, an elastic distance measure such as DTW or LCSS is often superior to Euclidean distance, but as the number of series increases “the accuracy of elastic measures converge with that of the Euclidean distance” [20]. In particular, in this work we consider the 1NN classifier with Euclidean distance in our evaluation, given its competitive results compared to more complex methods and its other advantages: ED is easy to implement, indexable with any access method and, in addition, parameter-free.
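As an illustration of the classifier described above, the following is a minimal Python sketch of 1NN with the Euclidean distance of Eq. (1). The function names and the toy series are hypothetical and chosen only for this example; they are not taken from the paper's experimental code.

```python
import math


def euclidean_distance(s, q):
    """Eq. (1): Euclidean distance between two series of equal length l."""
    if len(s) != len(q):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((sj - qj) ** 2 for sj, qj in zip(s, q)))


def classify_1nn(query, labeled_set):
    """Return the label of the labeled series closest to `query`.

    `labeled_set` is a list of (series, label) pairs standing in for D^L.
    No training step is needed: the labeled set itself is the model.
    """
    _, nearest_label = min(
        labeled_set, key=lambda pair: euclidean_distance(query, pair[0])
    )
    return nearest_label


# Toy data for illustration only (not from the UCR benchmark).
train = [([0.0, 0.0, 0.0], "flat"), ([0.0, 1.0, 2.0], "rising")]
predicted = classify_1nn([0.1, 0.9, 2.1], train)
```

Because the classifier only stores the labeled series, its quality depends directly on which examples end up in the initial training set, which is the selection problem studied in this paper.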

2.2 Active learning

Given the high cost of obtaining labeled data in many domains, active learning techniques that require the actual class label for only a reduced amount of data have been proposed in the literature to overcome this bottleneck. Active learning aims to select the most informative unlabeled instances and request their labels from an oracle in order to retrain a learning algorithm.

In general, active learning literature considers three main approaches [51]:

1.
Membership query synthesis: the model itself generates some synthetic instances to be labeled rather than using real unlabeled instances [2];
2.
Pool-based: given a set of unlabeled instances, the method chooses some representative examples to request their labels and update the current model [36];
3.
Stream-based selective sampling: the unlabeled instances are presented in a stream manner and the method decides online whether or not to query for its label [77].

The pool-based approach is the most studied setting. It has been applied in several application domains such as the automatic classification of texts [64], images [27], videos [69], and music retrieval [39]. Typically, pool-based methods consider a database $D$ comprising a small labeled subset of instances $D^{L}=\{(\vec{x}_{1},y_{1}),(\vec{x}_{2},y_{2}),\ldots,(\vec{x}_{n},y_{n})\}$ with $n$ instances and a large unlabeled subset of instances $D^{U}=\{(\vec{x}_{n+1},y_{n+1}),(\vec{x}_{n+2},y_{n+2}),\ldots,(\vec{x}_{n+m},y_{n+m})\}$ with $m$ examples, so that $D=D^{L}\cup D^{U}$, where $\vec{x}_{i}$ is a $d$-dimensional data point, $y_{i}$ denotes the class label of $\vec{x}_{i}$, and each pair $(\vec{x}_{i},y_{i})\in D^{U}$ has an unknown class label $y_{i}$. The main goal of active learning techniques is to improve the quality of the labeled subset $D^{L}$, in terms of predictive power, by adding some examples from $D^{U}$ accompanied by their respective actual class labels given by an oracle.

Thus, the most important task of active learning methods is to perform good choices in the instance selection process. A common approach is to use evaluation measures to estimate the example’s utility and select the one with maximal utility values. Examples of utility metrics (or sampling selection strategies) are uncertainty sampling [36], query by committee [55], expected model change [54], expected error reduction [49], and density weighted methods [52].

In uncertainty sampling, the instances are selected according to the uncertainty of label prediction. In query by committee, there is a committee of models trained on the currently labeled data. For each unlabeled instance, the committee votes for the label, and the instances with the largest disagreement among the votes are selected. In expected model change, the instances that cause the most change in the current model are selected. In expected error reduction, the instances are selected in order to reduce the expected error of the model as much as possible. In density weighted methods, the selected instances must be both uncertain and representative in order to decrease the effect of outliers. This last strategy aims to avoid the problems caused by outliers, especially in the uncertainty sampling and query by committee strategies.

Algorithm 2.2 presents a general view of the pool-based active learning process. An initial model $\Theta$ is generated from the labeled data $D^{L}$. In the next steps, all the unlabeled data from $D^{U}$ are evaluated according to a utility measure $\mathcal{U}$ using the previously generated model. The example with the highest utility is selected and its label is requested from an oracle. This instance is then added to the labeled subset $D^{L}$ and the model $\Theta$ is updated. This process is repeated until the training set reaches $n$ examples.

Algorithm 2.2: General process of active learning methods [25].

Input: $D^{L}$ - Initial labeled subset
Input: $D^{U}$ - Unlabeled subset
Input: $n$ - Size of training set
Output: $\Theta$ - Model

$\Theta\leftarrow$ learn a model based on $D^{L}$
while $|D^{L}|\leqslant n$ do
    for each instance $\vec{x}_{i}\in D^{U}$ do
        $u_{i}\leftarrow\mathcal{U}(\vec{x}_{i},\Theta)$
    end for
    $\vec{x}_{*}\leftarrow\mathrm{argmax}_{i}(u_{i})$
    $y_{*}\leftarrow requestLabel(\vec{x}_{*})$
    $D^{L}\leftarrow D^{L}\cup\{\vec{x}_{*},y_{*}\}$
    $D^{U}\leftarrow D^{U}\setminus\{\vec{x}_{*},y_{*}\}$
    $\Theta\leftarrow$ update the model based on $D^{L}$
end while
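The general pool-based process can be sketched in Python as follows. This is only an illustrative sketch: the callables `learn`, `utility`, and `request_label` are hypothetical placeholders for the model, the utility measure $\mathcal{U}$, and the oracle, and the loop stops once the labeled set holds $n$ examples, following the prose description of the algorithm. The toy utility (distance to the nearest labeled point) is an assumption used as a crude stand-in for uncertainty.

```python
def pool_based_active_learning(labeled, unlabeled, n, learn, utility, request_label):
    """Grow the labeled set to n examples by querying the most useful instance."""
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    model = learn(labeled)
    while len(labeled) < n and unlabeled:
        # Score every unlabeled instance with the current model.
        scores = [utility(x, model) for x in unlabeled]
        best = max(range(len(unlabeled)), key=scores.__getitem__)
        # Move the highest-utility instance from the pool to the labeled set.
        x_star = unlabeled.pop(best)
        y_star = request_label(x_star)
        labeled.append((x_star, y_star))
        model = learn(labeled)
    return model, labeled


# Hypothetical stand-ins for this toy example only.
def learn(labeled):
    return list(labeled)  # the "model" is just the labeled examples


def utility(x, model):
    return min(abs(x - xl) for xl, _ in model)  # far from labeled data = useful


def oracle(x):
    return "pos" if x >= 0 else "neg"


model, labeled = pool_based_active_learning(
    [(0.5, "pos")], [-3.0, -1.0, 0.4, 2.0], n=3,
    learn=learn, utility=utility, request_label=oracle,
)
```

Note that the sketch cannot start without at least one labeled example, since the toy utility is computed from the labeled data. This mirrors the dependence on an initial labeled subset that motivates the unsupervised setting studied here.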

The main difference between the stream-based and pool-based methods is that the former scans the data sequentially and decides individually whether or not to label each instance, whereas the latter evaluates and ranks the entire unlabeled set before deciding. However, both settings assume the presence of initial labeled data $D^{L}$, which is used to generate a classification model $\Theta$. In short, the utility measures are based on information generated by the classification model or from labeled data. Although the authors of [29] proposed an unsupervised approach, they needed previously labeled examples, randomly chosen, to start the active learning process.

Some approaches, such as [37] and [50], perform active learning from scratch as proposed in this work, i.e., without initial labeled data or a classification model. These approaches are based on unsupervised criteria obtained from clusters, where samples lying near cluster centers and near cluster borders are expected to be the most informative ones regarding the distribution characteristics of the classes. However, in the clustering phase, these methods make an important assumption: they know a priori the number of classes of the problem. Thus, the number of clusters is set to the number of classes present in the dataset.

In this work, we are interested in the scenario in which we start the active learning process on a fully unlabeled dataset and in the absence of an initial model/classifier. Although active learning has been widely used in many problems, its use to aid in the labeling of an initial training set or to build an initial classifier has not received much attention. Most recent works adopt random sampling to generate an initial training set. However, sampling based on clustering algorithms has presented better results [28, 31, 43, 70]. In another direction, some works have used centrality measures from graphs to extract information about unlabeled data and use this information to select instances for the active learning process [3, 35, 38]. However, these two approaches are commonly evaluated separately on different datasets. To give a direct comparison of these approaches, different methods are discussed, evaluated, and compared in this work in a wide experimental setting.

To the best of our knowledge, the works most similar to ours in the literature are [28, 31, 3]. However, they performed their evaluations on a reduced number of methods and are mostly concerned with textual data. In contrast, we perform a wide comparison of active learning methods using time series data. Time series are ubiquitous in almost every human activity. Time-oriented data are present in many application domains such as industry, astronomy, medicine, biology, economy, and signal processing, among others. Moreover, some types of data such as text [1] or image [33] can be converted to time series. Consequently, the analysis of time series data has attracted much attention and effort from researchers around the world.
2.3 Unsupervised active learning methods for instance selection

In this section, we present a brief overview of the main methods that can be used to select examples from unlabeled data to perform the active learning process. We begin presenting the Random Sampling and the Farthest-First Traversal algorithms. Next, we introduce the clustering algorithms evaluated in this work and how these algorithms can be used to select unlabeled examples. Finally, we present graph-based centrality measures that can be used to extract information from unlabeled data.

2.3.1 Random sampling

Due to its simplicity, random sampling is probably the most popular method to select data to be labeled by the oracle for the generation of an initial training set in classification problems. According to Zhu et al. [73], this practice is based on the assumption that random sampling is likely to build an initial training set with the same prior data distribution as the whole data. However, this situation seldom occurs in real-world applications due to the small size of the initial training set that is typically used. An important and evident advantage of random sampling is its constant time complexity to select examples.

2.3.2 Farthest-First Traversal

In order to select a set of diverse examples in a dataset to be labeled, the Farthest-First Traversal algorithm finds $b$ examples that are far from each other. The algorithm begins by choosing a random example (or the example closest to the center of the data) and iteratively chooses the example farthest from the current set. The algorithm has been used in active learning methods as a selection strategy in traditional scenarios where an initial training set is present [5]. However, this algorithm can also be used to build an initial training set, as presented in [28] and evaluated in this paper. Notice that, as pointed out by Hu et al. [28], the algorithm can be susceptible to noise and outliers. In terms of complexity, this method takes quadratic time to find the farthest set of examples.
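The traversal can be illustrated with the short Python sketch below. It assumes examples are given as coordinate tuples compared with the Euclidean distance, and it takes a fixed `start` index instead of a random example so that the result is reproducible; both choices, as well as the function name, are assumptions made for this illustration.

```python
import math


def farthest_first_traversal(points, b, start=0):
    """Return the indices of b examples chosen to be far from each other."""
    if not 0 < b <= len(points):
        raise ValueError("b must be between 1 and the number of points")
    selected = [start]
    # Distance from every point to its closest already-selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < b:
        # The next example is the one farthest from the current selection.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


# Toy data for illustration only.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 0.0)]
chosen = farthest_first_traversal(points, b=3)
```

In this toy run the second pick is the extreme point at $x=10$, which illustrates why the method tends to select outliers when they are present.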

2.3.3 Sampling based on clustering

The goal of data clustering, also known as cluster analysis, is to discover the natural grouping of a set of patterns, points, or objects in an unsupervised manner [30]. Given a representation of $n$ objects, a clustering algorithm finds $k$ groups based on a similarity measure so that the similarities between objects in the same group are higher than the similarities between objects in different groups. Thus, the structure found in the data can be useful to select representative examples. Basically, active learning methods based on sampling by clustering select the examples near cluster centers and near cluster borders as the representatives to be labeled. As clustering algorithms typically require the number of groups $k$, this value is set according to the amount of data that we want to label.

Given an unlabeled dataset $D^{U}$ and a bucket with $b$ examples from $D^{U}$ that will be selected to be labeled, we evaluate two clustering-based approaches to sample data. The first one is based on [37, 50], but without knowledge of the number of classes of a dataset to set the number of clusters. Thus, given a minimum number of border examples ($be$) that we want to select from each obtained cluster, plus one example near each cluster center, the clustering phase is performed with $k=\lfloor b/(be+1)\rfloor$ clusters. For example, if we want to select $20$ examples ($b=20$) with a minimum number of border examples $be=3$, we run the clustering algorithm with $k=5$ and, for each cluster, select one example near the center and $3$ from the border. If $b=21$, the number of clusters is still $k=5$ and one example near each cluster center is selected; for one group, $4$ border examples are selected, and for the remaining $4$ groups, $3$ border examples are selected. The second evaluated approach is based on the works [3, 28, 31]. In this approach, we set the number of clusters to be found by the algorithm to the number of examples to be labeled, i.e., $k=b$. Thus, for each obtained cluster, only the example near the cluster center is selected to be labeled by the active learner.
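The arithmetic of the first approach can be checked with a small Python sketch. The function name is hypothetical, and the rule for spreading leftover examples over clusters when more than one is left is an assumption: the text only specifies the case $b=21$, $be=3$, where a single extra border example goes to one group.

```python
def border_sampling_plan(b, be):
    """Return (k, border_counts) for the center-plus-border strategy.

    k = floor(b / (be + 1)) clusters; each cluster contributes one example
    near its center and at least `be` examples from its border.
    """
    k = b // (be + 1)
    if k < 1:
        raise ValueError("b is too small for the requested number of border examples")
    leftover = b - k * (be + 1)
    # Assumption: leftover examples are spread as evenly as possible,
    # starting from the first clusters.
    base, extra = divmod(leftover, k)
    border_counts = [be + base + (1 if i < extra else 0) for i in range(k)]
    return k, border_counts


k20, borders20 = border_sampling_plan(20, 3)
k21, borders21 = border_sampling_plan(21, 3)
```

In both cases the number of center examples ($k$) plus the border examples adds up to $b$, so the labeling budget is used exactly.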

In this work, we evaluated five different clustering algorithms to sample data: k-Means, k-Medoids, Gaussian Mixture Models, Hierarchical Agglomerative, and Density Peak. The choice of these five clustering methods considered their popularity, ease of implementation, and the performance they achieved in different works in the literature. In the next sections, we present a general overview of these algorithms in the context of sampling data for an initial training set.

k-Means and k-Medoids

The use of clustering algorithms for selective sampling was first presented by Zhu et al. [73] to solve problems in natural language processing such as word sense disambiguation and text classification. They used the k-Means algorithm to build a representative initial training dataset for active learning.

The k-Means is a well-known clustering algorithm that aims to divide $n$ points into $k$ clusters and uses prototypes (centroids) to represent the clusters that minimize the sum of squared error (SSE) function. SSE is the sum of the squared differences between each observation and its group’s mean. If all cases within a cluster are identical, the SSE would then be equal to 0. Typically, the k-Means algorithm starts with $k$ initial centroids randomly chosen, then it assigns each data point to the nearest centroid according to the Euclidean distance, updates the cluster centroids, and repeats the process until the positions of $k$ centroids do not change.

k-Medoids is similar to k-Means except that the cluster prototypes (medoids) must be data points of the dataset being clustered. k-Medoids is more robust than k-Means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean.
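As an illustration of sampling with $k=b$ clusters, the following Python sketch runs a plain k-Means loop and returns, for each cluster, the index of the example closest to the centroid. It is a simplified sketch under stated assumptions, not the implementation used in the experiments: the function name, the optional `init` argument (indices of the initial centroids), and the `seed` default are hypothetical.

```python
import math
import random


def kmeans_center_sampling(points, k, init=None, max_iter=100, seed=0):
    """Cluster `points` into k groups and pick the example nearest each centroid."""
    if init is None:
        # Random initial centroids, drawn from the data (assumed initialization).
        init = random.Random(seed).sample(range(len(points)), k)
    centroids = [tuple(points[i]) for i in init]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid under the Euclidean distance.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no example changed cluster, so the centroids are stable
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    selected = []
    for c in range(k):
        members = [i for i, a in enumerate(assignment) if a == c]
        if members:
            selected.append(min(members, key=lambda i: math.dist(points[i], centroids[c])))
    return selected


# Toy data for illustration only: two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
to_label = kmeans_center_sampling(points, k=2, init=[0, 3])
```

Returning the data point nearest to each centroid, rather than the centroid itself, matters because the oracle can only label real examples; with k-Medoids this step is unnecessary, since the medoids are already data points.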

Gaussian Mixture Model

The Gaussian Mixture Model (GMM) is a model-based clustering algorithm which uses a finite mixture of parametric multivariate Gaussian distributions to model a dataset. Each Gaussian distribution of the GMM represents a cluster of data points. An individual distribution used to model a specific cluster is often referred to as a component distribution. Formally, a GMM is a weighted sum of $k$ component Gaussian densities as given by Eq. (2) [44]:

$p(x|\lambda)=\sum_{i=1}^{k}w_{i}g(x|\mu_{i},\Sigma_{i}),$ (2)

where $x$ is a $d$-dimensional data point, $w_{i}$, $i=1,\ldots,k$, are the mixture weights, and $g(x|\mu_{i},\Sigma_{i})$, $i=1,\ldots,k$, are the component Gaussian densities. Each component density is a $d$-variate Gaussian function as defined by Eq. (3).

$g(x|\mu_{i},\Sigma_{i})=\frac{1}{(2\pi)^{d/2}|\Sigma_{i}|^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu_{i})^{T}\Sigma_{i}^{-1}(x-\mu_{i})\right\},$ (3)

with mean vector $\mu_{i}$ and covariance matrix $\Sigma_{i}$ . The mixture weights satisfy the constraint $\sum_{i=1}^{k}w_{i}=1$ .

The complete Gaussian Mixture Model is parameterized by the mean vectors, covariance matrices, and mixture weights from all components. These parameters are collectively represented by $\lambda=\{w_{i},\mu_{i},\Sigma_{i}\}$ , with $i=1,\ldots,k$ . Given a dataset, the $\lambda$ parameters can be estimated by the maximum likelihood criterion using the Expectation-Maximization (EM) Algorithm [18].

For our purpose of building an initial training set, we set the number of components to be estimated equal to the number of clusters.
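To make Eqs. (2) and (3) concrete, the sketch below evaluates the mixture density in Python. For brevity it assumes diagonal covariance matrices, so that $|\Sigma_{i}|$ is the product of the per-dimension variances and the quadratic form reduces to a weighted sum of squares. This restriction and the function names are assumptions of the example, and parameter estimation with EM is not shown.

```python
import math


def gaussian_density_diag(x, mean, variances):
    """Eq. (3) for a diagonal covariance matrix with the given variances."""
    d = len(x)
    det = math.prod(variances)  # determinant of a diagonal matrix
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, variances))
    norm = (2 * math.pi) ** (d / 2) * math.sqrt(det)
    return math.exp(-0.5 * quad) / norm


def gmm_density(x, weights, means, variances):
    """Eq. (2): weighted sum of k component Gaussian densities."""
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("mixture weights must sum to 1")
    return sum(
        w * gaussian_density_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    )


# Two identical 1-D standard normal components with equal weights
# (toy parameters for illustration only).
density_at_zero = gmm_density([0.0], [0.5, 0.5], [[0.0], [0.0]], [[1.0], [1.0]])
```

With two identical components the mixture reduces to a single standard normal, so the density at zero equals $1/\sqrt{2\pi}$, which gives a quick sanity check of the formulas.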

Hierarchical clustering

Hierarchical clustering algorithms create a hierarchy of groups in a bottom-up (or agglomerative) process by merging small groups or in a top-down (or divisive) process by dividing large groups into smaller ones. The data clustering can be visualized in a variety of scales by creating a cluster tree or dendrogram.

Due to its simplicity and popularity, we chose agglomerative clustering in this work. The algorithm starts by computing the similarity matrix between all pairs of examples to be clustered. In the next step, the algorithm iteratively selects the two clusters with the largest affinity under a certain measure and merges them, until some stopping criterion is reached. After preliminary experiments, we chose Ward's method (minimum variance criterion) [66] along with the Euclidean distance to compute the similarity between clusters.

To select $k$ clusters, the tree built in agglomerative clustering is pruned so as to retain $k$ clusters corresponding to the $k$ lowest level branches in the hierarchy. The examples closest to the center and/or to the border of these clusters are then selected to be labeled by the oracle and included in the initial training set.

Density Peak

The Density Peak is a recently proposed clustering algorithm based on density [45]. This is an alternative algorithm to density based algorithms such as Mean Shift [26] and DBSCAN [21]. Density Peak presents competitive results with state-of-the-art algorithms in data with arbitrary shape without requiring the number of clusters as input.

The algorithm assumes that cluster centers are surrounded by neighbors with lower local density and that they are at a relatively large distance from any points with a higher local density. Thus, for each instance $i$ the algorithm computes a local density $\rho_{i}$ and a distance $\delta_{i}$ from examples of higher density. $\rho_{i}$ is equal to the number of examples that are closer than a cutoff distance $d_{c}$ to the example $i$. $\delta_{i}$ is the minimum distance between the example $i$ and any other example $j$ with higher density $\rho_{j}$.

With these two quantities computed for all examples, $\delta_{i}$ is plotted as a function of $\rho_{i}$. In this decision graph, the cluster centers appear isolated from the remaining data. After the cluster centers are found, each remaining example is assigned to the same cluster as its nearest neighbor of higher density.

To use the Density Peak algorithm for active learning, we sort the examples in decreasing order of $\gamma_{i}=\rho_{i}\cdot\delta_{i}$. We select the top $k$ examples to present to the expert for labeling. As this algorithm does not allow the number of clusters $k$ to be defined, the approach that selects examples from the border of clusters was not evaluated in our experiments. To the best of our knowledge, this paper is the first to evaluate the Density Peak clustering algorithm in the active learning scenario.
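A minimal sketch of this ranking, under stated assumptions, is given below. The function name `density_peak_ranking` and the toy points are hypothetical. The text does not say how $\delta_{i}$ is set for an example with no denser neighbor; the sketch assumes the convention of the original Density Peak proposal, which assigns such an example its maximum distance to any other example.

```python
import math


def density_peak_ranking(data, d_c):
    """Rank examples by gamma_i = rho_i * delta_i (largest first)."""
    n = len(data)
    dist = [[math.dist(a, b) for b in data] for a in data]
    # rho_i: number of other examples closer than the cutoff distance d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # Assumption: with no denser example, use the maximum distance.
        delta.append(min(higher) if higher else max(dist[i]))
    gamma = [r * d for r, d in zip(rho, delta)]
    order = sorted(range(n), key=lambda i: gamma[i], reverse=True)
    return order, gamma


# Hypothetical toy data: two small squares, each with a central point.
points = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05],
    [3.0, 3.0], [3.1, 3.0], [3.0, 3.1], [3.1, 3.1], [3.05, 3.05],
]
order, gamma = density_peak_ranking(points, d_c=0.12)
top_k = order[:2]  # examples that would be shown to the expert for k = 2
```

On this toy data the two central points (indices 4 and 9) receive by far the largest $\gamma$ values, so they are the first examples selected.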

2.3.4 Sampling based on centrality measures from graphs

In recent years, various graph-based algorithms, mainly for semi-supervised learning (SSL), have been proposed in the literature [76, 10]. They rely on a similarity-based graph built from a dataset, in which the objects represent the examples and the undirected edges represent the similarity between these examples. Besides semi-supervised learning, the graph topology also allows properties to be extracted from the dataset and from each object/example in an unsupervised manner [42]. Properties extracted from the objects allow us to measure the importance of an object in the graph. These measures are called centrality measures. In this article, we consider that the central objects of a graph correspond to representative examples that will be selected and labeled by experts. In the next sections, we present details about the graph construction and the centrality measures evaluated in our experiments.

Graph construction

Formally, a graph is defined by $G=\langle\mathcal{O},\mathcal{E},\mathcal{W}\rangle$ , in which $\mathcal{O}$ represents the set of objects, $\mathcal{E}$ represents the set of edges, and $\mathcal{W}$ represents the weights of edges. We represent the edge between an object $o_{i}$ and an object $o_{j}$ by $e_{o_{i},o_{j}}$ and its weight by $w_{o_{i},o_{j}}$ .

There are a few different ways to build a graph based on similarity between objects. The most common are [76]:

• Fully connected graph: in this graph, every pair of objects is linked by an edge. Usually, the weight of the edge between the object $o_{i}$, which represents an example $\overrightarrow{x_{i}}$, and the object $o_{j}$, which represents an example $\overrightarrow{x_{j}}$, is given by Eq. (4):

$w_{o_{i},o_{j}}=\exp\left(-\frac{\textit{dist}(o_{i},o_{j})}{2\sigma^{2}}\right),$ (4)

where $\textit{dist}(o_{i},o_{j})$ is a distance function between objects $o_{i}$ and $o_{j}$, and $\sigma$ is a parameter that controls the bandwidth of the Gaussian function;
• $k$-Nearest Neighbor ($k$-NN) Graph: in this graph, each object is linked only to its $k$ nearest neighbors. Edge weights are equal to 1 in the case of an unweighted graph, or, in the case of weighted graphs, to the value of a distance function or to the value given by Eq. (4). A commonly used variant of the $k$-NN graph is the Mutual $k$-NN Graph. In such a graph, an object $o_{i}$ is linked to an object $o_{j}$ if $o_{i}$ is one of the $k$ nearest neighbors of $o_{j}$ and $o_{j}$ is one of the $k$ nearest neighbors of $o_{i}$;
• $\epsilon$NN Graph: in this graph, an object $o_{i}$ is linked to an object $o_{j}$ if $||o_{i}-o_{j}||\leqslant\epsilon$. Edge weights can be defined in the same way as in $k$-NN graphs.

In this paper, we use the Mutual $k$-NN strategy to generate graphs for two reasons: i) it tends to provide better classification performance than other strategies both for time series [15] and for other data domains [14], and ii) we empirically verified that centrality measures extracted from graphs generated by other strategies provided results inferior to those obtained using Mutual $k$-NN.
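The Mutual $k$-NN construction with the weights of Eq. (4) can be sketched as below. This is an illustrative sketch only: the function name `mutual_knn_graph`, the adjacency-dictionary representation, the default $\sigma=1$, and the toy series are assumptions, and the Euclidean distance is used as $\textit{dist}$.

```python
import math


def mutual_knn_graph(data, k, sigma=1.0):
    """Build a weighted Mutual k-NN graph as a dict of dicts.

    graph[i][j] holds the weight of the edge between objects i and j.
    An edge exists only if i is among the k nearest neighbors of j
    and j is among the k nearest neighbors of i.
    """
    n = len(data)
    dist = [[math.dist(a, b) for b in data] for a in data]
    neighbors = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist[i][j])
        neighbors.append(set(others[:k]))
    graph = {i: {} for i in range(n)}
    for i in range(n):
        for j in neighbors[i]:
            if i in neighbors[j]:
                # Gaussian weight of Eq. (4).
                graph[i][j] = math.exp(-dist[i][j] / (2 * sigma ** 2))
    return graph


# Hypothetical one-dimensional toy examples.
toy = [[0.0], [1.0], [2.5], [10.0]]
graph = mutual_knn_graph(toy, k=1)
```

With $k=1$, only objects 0 and 1 choose each other as nearest neighbors, so the graph has a single edge and objects 2 and 3 remain isolated. This illustrates why very small values of $k$ can leave a Mutual $k$-NN graph sparsely connected.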

Centrality measures

We use centrality measures to provide scores to the graph objects. The objects are sorted in a descending order considering the generated centrality scores, and the top $k$ ranked objects are selected to be labeled by an expert.

In this paper, we consider three centrality measures that are the basis for others and have been used in the literature for active learning [3, 42]: i) Degree, ii) PageRank, and iii) Betweenness.

Degree is the simplest centrality measure. The degree score of an object $o_{i}$ in a weighted undirected graph is the sum of edge weights of the object $o_{i}$ , i.e.,

$\textit{Degree}(o_{i})=\displaystyle\sum_{e_{o_{i},o_{j}}\in\mathcal{E}}w_{o_{i},o_{j}}$ (5)

Although the degree is a simple centrality measure, it is informative about the importance of objects in a graph. For instance, in social networks, objects with a high degree score influence a larger number of people than objects with a low degree score. In the scientific domain, papers with a large number of citations are more influential than papers with a low degree score [42].

Centrality scores can also be computed by propagating importance scores, as performed by PageRank [8]. The idea behind PageRank is that the importance of an object is given by the importance of the objects linked to it. In other words, an object is important (or central) if its linked objects are also important. The PageRank score for an object $o_{i}$ in an unweighted undirected graph is given by Eq. (6):

$\textit{PageRank}(o_{i})=(1-\lambda)+\lambda\displaystyle\sum_{e_{o_{i},o_{j}}\in\mathcal{E}}\frac{w_{o_{i},o_{j}}}{\displaystyle\sum_{e_{o_{j},o_{k}}\in\mathcal{E}}w_{o_{j},o_{k}}}\textit{PageRank}(o_{j}),$ (6)

in which $\lambda$ is a damping factor that can be set between 0 and 1. Equation (6) is applied iteratively until convergence, i.e., until the scores of the objects do not change significantly or until a fixed number of iterations is reached.

Another way to compute centrality is to consider the importance of an object in the geodesic paths among all objects of a graph. A geodesic path is the shortest path, in terms of the number of edges traversed, between a specified pair of vertices. The Betweenness measure computes centrality scores as the number of times that an object $o_{i}$ is contained in the geodesic paths among all pairs of objects in a graph [42]. Formally, Betweenness scores are given by Eq. (7):

$\textit{Betweenness}(o_{i})=\displaystyle\sum_{o_{j},o_{k}\in\mathcal{N}}gp^{o_{i}}_{o_{j},o_{k}},$ (7)

where $gp^{o_{i}}_{o_{j},o_{k}}$ is equal to 1 if $o_{i}$ is in the geodesic path between the objects $o_{j}$ and $o_{k}$, and 0 otherwise. According to [42], objects with high betweenness centrality may have considerable influence within a network by virtue of their control over the information passing between other objects of the graph.
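As an illustration of Eqs. (5) and (6), the following sketch computes Degree and PageRank scores on a small graph stored as an adjacency dictionary. The function names, the default $\lambda=0.85$, the stopping tolerance, and the star-shaped toy graph are assumptions made for this example; Betweenness is omitted for brevity, and the sketch assumes that every object has at least one edge.

```python
def degree(graph):
    """Eq. (5): sum of the weights of the edges incident to each object."""
    return {i: sum(nbrs.values()) for i, nbrs in graph.items()}


def pagerank(graph, lam=0.85, iters=200, tol=1e-10):
    """Eq. (6) applied iteratively until the scores stop changing."""
    strength = degree(graph)
    score = {i: 1.0 for i in graph}
    for _ in range(iters):
        new = {}
        for i, nbrs in graph.items():
            # Each neighbor j passes on a share of its score proportional
            # to the weight of the edge (i, j) among all edges of j.
            spread = sum(w / strength[j] * score[j] for j, w in nbrs.items())
            new[i] = (1 - lam) + lam * spread
        change = max(abs(new[i] - score[i]) for i in graph)
        score = new
        if change < tol:
            break
    return score


# Hypothetical toy graph: object 0 is linked to objects 1, 2 and 3.
star = {0: {1: 1.0, 2: 1.0, 3: 1.0}, 1: {0: 1.0}, 2: {0: 1.0}, 3: {0: 1.0}}
deg = degree(star)
scores = pagerank(star)
# Objects sorted in descending order of score; the top ones go to the expert.
ranking = sorted(scores, key=scores.get, reverse=True)
```

Both measures place the hub (object 0) first in this toy graph, which matches the intuition that central objects are the representative examples to label.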
2.4 Time and space complexity of instance selection methods

In this section, we present and compare the time and space complexity of the active learning algorithms presented in the previous section. We divide the complexity analysis into two groups due to the characteristics of the algorithms and the notation used:

• First group: composed of the Farthest-First Traversal and clustering-based algorithms ($k$-Means and $k$-Medoids, Gaussian Mixture Model, Hierarchical clustering, and Density Peak);
• Second group: composed of centrality measure algorithms based on graphs (Degree, PageRank, Betweenness).

The time and space complexity of the first group of algorithms are summarized in Table 1. In this table, we consider $I$ as the number of iterations required for convergence by the algorithms $k$ -Means and GMM, $n$ as the number of data points, $k$ as the number of clusters, and $d$ as the number of dimensions. In terms of time, the most efficient algorithms are Farthest-First Traversal and Density Peak, with quadratic time. In terms of space, the most efficient algorithm is the Gaussian Mixture Models clustering.

Table 1
Time and space complexity of first group of algorithms. $I$ is the number of iterations required for convergence, $n$ is the number of data points, $k$ is the number of clusters, $d$ is the number of dimensions

Algorithm	Time-complexity	Space-complexity
Farthest-First Traversal	$O(n^{2})$	$O(dn)$
$k$-Means	$O(Ikdn)$	$O(d(n+k))$
$k$-Medoids	$O(k(n-k)^{2})$	$O(d(n+k))$
Hierarchical agglomerative (Ward's)	$O(n^{2}\log n)$	$O(n^{2})$
GMM	$O(Ind^{2})$	$O(kd)$
Density Peak	$O(n^{2})$	$O(n^{2})$

The time and space complexity of the second group of algorithms are summarized in Table 2. In this table, we consider $I$ as the number of iterations required for convergence by the calculation of PageRank measure, $o$ as the number of graph objects (data points), $e$ as the number of edges and $\overline{e}$ as the average number of edges per object. In terms of space, we can see that all measures have the same complexity. In terms of time, Degree is the most efficient measure.

Table 2
Time and space complexity of second group of algorithms. $I$ is the number of iterations required for convergence, $o$ is the number of graph objects (data points), $e$ is the number of edges and $\overline{e}$ is the average number of edges per object. Time and space complexity presented in this table consider graphs represented by adjacency lists

Algorithm	Time-complexity	Space-complexity
Degree	$O(o\overline{e})$	$O((oe)+o)$
PageRank	$O(Io\overline{e})$	$O((oe)+o)$
Betweenness	$O((o+e)\log o)$	$O((oe)+o)$

3. Experimental setup

In this section, we present a description of the datasets evaluated in our experiments and a general overview of our evaluation framework considering the scenarios of supervised and semi-supervised learning.

3.1 Datasets description

Because time-oriented data are present in many application domains, and because other types of data can be converted into time series, we chose this type of sequential data for our experimental evaluation. To facilitate the execution of experiments and the direct comparison of our results with other methods, we used the UCR Time Series Archive [13]. These datasets have standard partitions into training and test sets that are widely accepted in the time series classification literature. Thus, in our experimental setup, we selected examples from the training set of each dataset to perform the active learning process, and we evaluated the resulting performance on the test set. This setup allows us to evaluate the classification performance obtained with a reduced amount of labeled data and to compare it directly against the accuracy achieved by a classifier trained with all training data.

Many UCR datasets have a small training set. As one of the major motivations for active learning is to reduce the human annotation effort by selecting some examples from a massive amount of data, we decided to select 24 popular datasets that have more than 150 instances in the training set. For datasets with fewer than 150 instances, we believe that all instances can be labeled with minor effort. Table 3 describes these datasets with respect to the number of classes, the size of the training and test sets, the length of the series, the accuracy achieved by the 1-NN classifier with Euclidean distance using all training data in a holdout procedure, and the domain. We also present the imbalance ratio of each dataset, i.e., the ratio between the number of instances of the majority and minority classes [23]. We can observe that the datasets are quite varied in their characteristics.

Table 3
Description of datasets evaluated

Dataset	Number of classes	Train instances	Test instances	Series length	Imbalance ratio	Holdout accuracy	Domain
50words	50	450	455	270	18.167	0.631	Figure shape
Adiac	37	390	391	176	1.450	0.611	Figure shape
ChlorineConc.	3	467	3840	166	2.307	0.650	Measurement
Cricket-X	12	390	390	300	1	0.574	Accelerometer
Cricket-Y	12	390	390	300	1	0.644	Accelerometer
Cricket-Z	12	390	390	300	1	0.620	Accelerometer
FaceAll	14	560	1690	131	6.813	0.714	Figure shape
FacesUCR	14	200	2050	131	6.813	0.769	Figure shape
Fish	7	175	175	463	1	0.783	Figure shape
Haptics	5	155	308	1092	1.2821	0.370	Accelerometer
MedicalImages	10	381	760	99	25.826	0.684	Histogram
NonInvFetECG-Th1	42	1800	1965	750	1.307	0.829	ECG
NonInvFetECG-Th2	42	1800	1965	750	1.307	0.880	ECG
OSULeaf	6	200	242	427	2.553	0.517	Figure shape
StarLightCurves	3	1000	8236	1024	4.008	0.849	Measurement
SwedishLeaf	15	500	625	128	1	0.789	Figure shape
Synt. Control	6	300	300	60	1	0.880	Synthetic
Two Patterns	4	1000	4000	128	1.087	0.910	Synthetic
uWaveGestLib-X	8	896	3582	315	1.002	0.739	Accelerometer
uWaveGestLib-Y	8	896	3582	315	1.002	0.662	Accelerometer
uWaveGestLib-Z	8	896	3582	315	1.002	0.650	Accelerometer
Wafer	3	1000	6164	152	8.402	0.995	Measurement
WordsSynonyms	25	267	638	270	16.667	0.618	Figure shape
Yoga	2	300	3000	426	1.157	0.830	Figure shape

As Table 3 shows, many of the datasets are obtained from figure shapes. This means that the two-dimensional shape in an image was converted to a one-dimensional time series by calculating the distance between a central point and the object contour. Fig. 1 shows an example of this conversion for the datasets OSULeaf and Yoga.

Figure 1.

Examples of figure shape datasets. In these datasets, given a figure, the distances from contour to a central point are measured resulting in a series in which ordinate values are the distances measured and abscissa values are increasing clockwise angle values around the contour. (a) OSULeaf; (b) Yoga.

Another popular data domain in our evaluation is the series measured by accelerometers. One example is the uWaveGestureLibrary dataset. This dataset consists of the performances of the gestures shown in Fig. 2 by eight participants. To collect the data, each participant held a Nintendo Wii remote and repeated each of the eight gestures ten times. As each accelerometer provides three synchronous measures for three axes (X, Y, and Z), we ended up with 3 different datasets: uWaveGestLib-X, uWaveGestLib-Y, and uWaveGestLib-Z.

Figure 2.

Gesture vocabulary of datasets uWaveGestLib-X (on x-axis), uWaveGestLib-Y (on y-axis), and uWaveGestLib-Z (on z-axis) [56]. The dot denotes the beginning and the arrow the end of the gesture.

3.2 General framework

We assume access only to a pool of $n$ time series instances $D^{U}=\{(\overrightarrow{x_{1}},y_{1}),\ldots,(\overrightarrow{x_{n}},y_{n})\}$ whose respective labels $y_{i}$ are unknown. The main goal is to select a bucket with $b\%$ of the instances from the pool of candidates, which will be labeled by the oracle to build a feasible training set. For each dataset originally divided into training and test partitions, we consider that the unlabeled instances $D^{U}$ come from the training partition, with the sizes given in Table 3. The algorithms are evaluated using the data from the test partition. Figure 3 shows a general view of the instance selection process.

Figure 3.

A general view of the instance selection process considered in the experimental evaluation.

From the provided unlabeled data $D^{U}$, we consider four main ways to select the $b\%$ of examples that will be labeled by the oracle:

Using the Farthest-First Traversal algorithm;

Using the random sampling;

Constructing a graph using the unlabeled data and extracting the centrality measures Betweenness, Degree, and PageRank. After computing these measures, we sort the examples in descending order of score, and the top $b\%$ of examples are selected as the most informative examples;

Running clustering algorithms such as k-Means, Agglomerative Hierarchical, or Gaussian Mixture Models. In this case, we consider two possibilities to select the most representative examples to be labeled: i) we choose examples only near the cluster centers or ii) near the cluster centers and near the cluster borders. In the first case, the number of clusters $k$ considered by each algorithm is equal to the number of examples that will be labeled ($b$). In the second case, the number of clusters $k$ is given by $k=\lfloor b/(be+1)\rfloor$, where $be$ is the minimum number of border examples that will be selected from each cluster. For example, if we want to label 100 examples ($b=100$) and choose at least 3 border examples from each group ($be=3$), the clustering algorithm will find 25 clusters. Thus, the oracle will label 25 examples near the cluster centers plus 75 border examples (3 examples from each cluster).
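The arithmetic of the second case can be checked with a short sketch. The function name `cluster_budget` is hypothetical. The third returned value is the part of the budget left over when $b$ is not a multiple of $be+1$; how that remainder is distributed is not specified in the text, so the sketch only reports it.

```python
def cluster_budget(b, be):
    """Split a labeling budget b into cluster centers and border examples.

    Returns (k, borders, leftover), where k = floor(b / (be + 1)) is the
    number of clusters (one center each), borders = k * be is the minimum
    number of border examples, and leftover is the unassigned remainder.
    """
    k = b // (be + 1)
    borders = k * be
    leftover = b - k - borders
    return k, borders, leftover


# Example from the text: b = 100 and be = 3.
k, borders, leftover = cluster_budget(100, 3)
```

For $b=100$ and $be=3$ this gives 25 clusters, 75 border examples, and no remainder, in agreement with the example above.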

As pointed out by Kang et al. [31], an active learning strategy that iteratively outputs individual instances to be labeled by the user at each step of the algorithm is time-consuming and requires a full-time annotator during the whole process. In addition, annotators frequently perform better when categorizing examples in groups, since the oracle is able to compare different examples and process them in any order. Thus, we consider that the instances selected by the algorithm are presented to the expert all at once and non-iteratively.

In order to cover a wide range for each dataset, we vary the portion of selected examples that will be labeled by the oracle from 5% to 95% of the training set size. Thus, for small datasets such as Haptics (with 155 training examples), we started our evaluation with only 8 examples to be labeled by the oracle. On the other hand, for datasets such as NonInvasiveFetalECG Thorax (with 1800 training examples), we started our evaluation with 90 examples to be labeled by the oracle.

After the labeling process, we have two pools of examples. The first one has the $b\%$ labeled examples ($L1$), and the second pool ($R$) consists of the $r\%$ remaining examples from the unlabeled data $D^{U}$ that were not selected to be processed by the oracle. From the sets $L1$ and $R$, we perform two evaluations, as presented in Fig. 4.

Figure 4.

A general view of the experimental evaluation. We consider two scenarios: $a)$ the former considers an evaluation using only the data labeled by the oracle (supervised learning) and $b)$ the latter considers the use of labeled data by the oracle together with the remaining data labeled by semi-supervised algorithms. (a) Evaluation using supervised learning; (b) Evaluation using semi-supervised learning.

The evaluations shown in Fig. 4 are detailed below:

Evaluation using supervised learning: Using only the examples from $L1$, we train a classifier and evaluate its performance on the unseen test data $T$. The test data $T$ come from the standard partition defined for each dataset, as shown in Table 3. More specifically, in this evaluation, we used the One-Nearest Neighbor classifier to classify the unseen examples. This setting allows us to analyze whether active learning has selected good examples to build an accurate initial classifier;

Evaluation using semi-supervised learning: Using the oracle-labeled set $L1$, we classify the remaining unlabeled examples $R$ using semi-supervised learning (SSL) algorithms such as Gaussian Fields and Harmonic Functions (GFHF) [75], Learning with Local and Global Consistency (LLGC) [72], and Self-Training [76]. LLGC and GFHF are simple, and they are the two best graph-based algorithms for semi-supervised learning. Besides, they provide a satisfactory trade-off between the number of parameters and the classification performance [48, 14, 7]. We also carried out experiments with the Self-Training approach, since it is one of the most natural semi-supervised learning algorithms for tabular data (vector space model) [76]. The use of semi-supervised learning algorithms allows us to analyze whether unlabeled data can improve the classification performance. In this scenario, all the initially unlabeled data $D^{U}$ end up labeled in a set $L2$. In order to analyze whether an initial training set with more data can improve the classification accuracy, we train a model with the set $L1\cup L2$ and evaluate its performance on the unseen test data $T$. More details about the evaluated algorithms and their settings are presented in the next section.
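The supervised evaluation can be illustrated with a minimal One-Nearest Neighbor sketch using the Euclidean distance. The function names and the tiny train/test sets are hypothetical and serve only to show how the accuracy on $T$ is obtained from a labeled set such as $L1$.

```python
import math


def one_nn_predict(train_x, train_y, query):
    """Label of the training example closest to `query` (Euclidean)."""
    nearest = min(range(len(train_x)),
                  key=lambda i: math.dist(train_x[i], query))
    return train_y[nearest]


def accuracy(train_x, train_y, test_x, test_y):
    """Fraction of test examples whose predicted label is correct."""
    hits = sum(one_nn_predict(train_x, train_y, x) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_x)


# Hypothetical oracle-labeled set L1 and test set T.
l1_x = [[0.0, 0.0], [1.0, 1.0]]
l1_y = ["a", "b"]
t_x = [[0.1, 0.2], [0.9, 0.8], [0.4, 0.4]]
t_y = ["a", "b", "b"]
acc = accuracy(l1_x, l1_y, t_x, t_y)
```

In this toy case the third test example lies closer to the example labeled "a", so it is misclassified and the accuracy is 2/3. The same routine applies unchanged when the training set is $L1\cup L2$.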

In general, inductive semi-supervised learning can be performed in two steps: i) transductive semi-supervised learning to assign labels to unlabeled instances; and ii) extraction of a classification model considering the labels of the instances after transductive learning. Algorithms such as Self-Training, GFHF, and LLGC perform transductive learning, i.e., they classify the unlabeled instances. Thus, strategies to extract a classification model from the set of labeled instances after transductive learning are required.

In the transductive learning performed by Self-Training, the labeled instances in $L1$ are initially used to induce a classification model through supervised learning. The model is used to classify the unlabeled instances in $R$, and the most confidently classified examples are added to the set of labeled instances $L1$. Then, the classification model is retrained considering the new set of labeled instances in $L1$. This process is repeated until all unlabeled examples in $R$ have been added to the set of labeled examples $L1$, forming the $L2$ labeled set. In this paper, we used the One-Nearest Neighbor classifier to classify the unlabeled instances. Algorithm 4 presents the pseudo code of the Self-Training algorithm [48].

Algorithm: Self-Training.
Input: $\mathcal{D}^{L}$ - set of labeled instances
       $\mathcal{D}^{U}$ - set of unlabeled instances
       $\mathbf{X}$ - instance values
       $\mathbf{Y}$ - instance labels
       $S$ - number of unlabeled instances to be included as labeled instances at each iteration
Output: $\mathbf{F}(\mathcal{D}^{U})$
$\mathcal{D}^{R}=\mathcal{D}^{U}$ /* Copy of the instances in $\mathcal{D}^{U}$, which is used in the iterative process */
repeat
    $Classification\_Model=Inductive\_Learning(\mathcal{D}^{L},\mathbf{X},\mathbf{Y})$ /* Classification model induction considering labeled instances */
    $\mathbf{F}(\mathcal{D}^{R})=Classification\_Model(\mathcal{D}^{R},\mathbf{X})$ /* Classification confidences for unlabeled instances */
    $\mathcal{C}=Most\_Confident(\mathbf{F}(\mathcal{D}^{R}),S)$ /* $\mathcal{C}$ contains the $S$ unlabeled instances with the highest classification confidences */
    foreach $d_{i}\in\mathcal{C}$ do
        $\mathbf{f}_{d_{i}}=Arg\_Max(\mathbf{f}_{d_{i}})$ /* Defining the class of $d_{i}$ considering the arg-max value of $\mathbf{f}_{d_{i}}$ */
    $\mathcal{D}^{R}=\mathcal{D}^{R}-\mathcal{C}$ /* Remove from $\mathcal{D}^{R}$ the instances in $\mathcal{C}$ */
    $\mathcal{D}^{L}=\mathcal{D}^{L}\cup\mathcal{C}$ /* Insert in $\mathcal{D}^{L}$ the instances in $\mathcal{C}$ */
until $\mathcal{D}^{R}=\emptyset$

Extracting a classification model after transductive learning based on tabular data, such as Self-Training, is straightforward [76]. In the case of Self-Training, the classification model induced from all instances, i.e., once all unlabeled instances have been labeled by the algorithm, is used to classify new instances.
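The Self-Training loop with a One-Nearest Neighbor base classifier can be sketched as follows. This is a hedged illustration rather than the authors' code: the function name `self_training_1nn` and the toy data are hypothetical, and the confidence of a prediction is assumed here to be higher when the distance to the nearest labeled instance is smaller, since the text does not define the confidence measure.

```python
import math


def self_training_1nn(labeled_x, labeled_y, unlabeled_x, s=1):
    """Label all unlabeled instances, s instances per iteration.

    Assumption: the most confident predictions are those with the
    smallest distance to their nearest labeled instance.
    """
    lx, ly = list(labeled_x), list(labeled_y)
    remaining = list(range(len(unlabeled_x)))
    assigned = {}
    while remaining:
        scored = []
        for idx in remaining:
            nearest = min(range(len(lx)),
                          key=lambda t: math.dist(unlabeled_x[idx], lx[t]))
            d = math.dist(unlabeled_x[idx], lx[nearest])
            scored.append((d, idx, ly[nearest]))
        scored.sort()  # smallest distance (highest confidence) first
        for _, idx, label in scored[:s]:
            assigned[idx] = label
            lx.append(unlabeled_x[idx])   # insert into the labeled set
            ly.append(label)
            remaining.remove(idx)         # remove from the unlabeled set
    return [assigned[i] for i in range(len(unlabeled_x))]


# Hypothetical toy data: two labeled instances and five unlabeled ones.
l1_x, l1_y = [[0.0], [10.0]], ["a", "b"]
r_x = [[1.0], [2.0], [3.0], [9.0], [8.0]]
l2_labels = self_training_1nn(l1_x, l1_y, r_x, s=1)
```

Because newly labeled instances join the labeled set, labels spread outward from the initial examples one step at a time; the final model is simply the 1-NN classifier over all labeled instances.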

Graph-based algorithms can also be used to perform transductive learning. The first step is to build a graph from the instances of the dataset. The approach presented in Section 2.3.4 can be used for this task. In our experiments, we considered Mutual $k$-NN graphs since they present the best results for semi-supervised time series classification [16]. We consider $k=\{5,15,25\}$. According to [74], $k$-NN graphs with small $k$ tend to perform better empirically. Thus, we started with $k=5$ to ensure some connections in a Mutual $k$-NN graph, and this value is increased in steps of 10 until $k=25$.

We used the algorithms GFHF and LLGC to perform transductive learning on graphs. Both algorithms have an iterative version in which the labels of instances are iteratively propagated through the neighboring instances. The strength of label propagation is proportional to the strength of a connection considering all connections of a node (GFHF) or all connections of a pair of nodes (LLGC). Besides, LLGC allows the labels of labeled instances to change during the classification process according to a user-defined value of $\alpha$, in which $0\leqslant\alpha\leqslant 1$. In our evaluation, we considered $\alpha=\{0.1,0.3,0.5,0.7,0.9\}$. We disregarded $\alpha=0$ since this cancels the influence of neighboring objects during transductive classification, and $\alpha=1$ since this makes LLGC similar to GFHF. In Algorithm 4 we present the pseudo code of GFHF and in Algorithm 4 we present the pseudo code of LLGC.

Both LLGC and GFHF produce a weight vector $\mathbf{f}$ for each graph object as a result of transductive classification. This vector contains the weight or the relevance score of each object for each class and they are used to label the instances.

Algorithm: Gaussian Fields and Harmonic Functions.
Input: $\mathcal{D}^{L}$ - set of labeled objects
       $\mathcal{D}^{U}$ - set of unlabeled objects
       $\mathbf{W}$ - edge weights
       $\mathbf{Y}$ - object labels
Output: $\mathbf{F}(\mathcal{D}^{U})$
$\mathbf{D}=diag(\mathbf{W}\cdot\mathbf{I}_{|\mathcal{D}|})$ /* diag(…) is the matrix diagonal operator */
$\mathbf{P}\leftarrow(1/\mathbf{D})\cdot\mathbf{W}$
repeat
    $\mathbf{F}(\mathcal{D})\leftarrow\mathbf{P}\cdot\mathbf{F}(\mathcal{D})$
    $\mathbf{F}(\mathcal{D}^{L})\leftarrow\mathbf{Y}(\mathcal{D}^{L})$
until convergence or fixed number of iterations

Algorithm: Learning with Local and Global Consistency.
Input: $\mathcal{D}^{L}$ - set of labeled objects
       $\mathcal{D}^{U}$ - set of unlabeled objects
       $\mathbf{W}$ - edge weights
       $\mathbf{Y}$ - object labels
       $\alpha$ - LLGC's parameter to attenuate differences of the class information of labeled instances in consecutive iterations
Output: $\mathbf{F}(\mathcal{D}^{U})$
$\mathbf{D}=diag(\mathbf{W}\cdot\mathbf{I}_{|\mathcal{D}|})$
$\mathbf{S}=\mathbf{D}^{-1/2}\cdot\mathbf{W}\cdot\mathbf{D}^{-1/2}$
repeat
    $\mathbf{F}(\mathcal{D})\leftarrow\alpha\cdot\mathbf{S}\cdot\mathbf{F}(\mathcal{D})+(1-\alpha)\cdot\mathbf{Y}(\mathcal{D})$
until convergence or fixed number of iterations
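The two iterative procedures can be sketched in pure Python as below. This is a minimal illustration under stated assumptions: the function names, the fixed number of iterations used instead of a convergence test, the one-hot encoding of $\mathbf{Y}$ (all-zero rows for unlabeled objects), and the four-object path graph are choices made for this example, and every object is assumed to have at least one edge.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def gfhf(w, y, labeled, iters=200):
    """Iterative GFHF: propagate with P = D^-1 W, then clamp labeled rows."""
    n = len(w)
    deg = [sum(row) for row in w]
    p = [[w[i][j] / deg[i] for j in range(n)] for i in range(n)]
    f = [row[:] for row in y]
    for _ in range(iters):
        f = matmul(p, f)
        for i in labeled:
            f[i] = y[i][:]  # labeled objects keep their original labels
    return f


def llgc(w, y, alpha=0.5, iters=200):
    """Iterative LLGC: F <- alpha * S * F + (1 - alpha) * Y."""
    n = len(w)
    deg = [sum(row) for row in w]
    s = [[w[i][j] / (deg[i] * deg[j]) ** 0.5 for j in range(n)]
         for i in range(n)]
    f = [row[:] for row in y]
    for _ in range(iters):
        sf = matmul(s, f)
        f = [[alpha * sf[i][c] + (1 - alpha) * y[i][c]
              for c in range(len(y[0]))] for i in range(n)]
    return f


def predict(f):
    """Class with the largest score for each object."""
    return [max(range(len(row)), key=lambda c: row[c]) for row in f]


# Hypothetical path graph 0 - 1 - 2 - 3; objects 0 and 3 are labeled.
w = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
y = [[1, 0], [0, 0], [0, 0], [0, 1]]
f_gfhf = gfhf(w, y, labeled={0, 3})
f_llgc = llgc(w, y, alpha=0.5)
```

On this path graph both algorithms assign object 1 to the class of object 0 and object 2 to the class of object 3. For GFHF the scores of object 1 converge to (2/3, 1/3), the harmonic solution in which each unlabeled score is the average of its neighbors' scores.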

After obtaining the $\mathbf{f}$ vectors of all instances in a network, a mechanism to extract a classification model from this network can be used. Delalleau et al. [17] propose a straightforward approach to classify new instances through a weighted linear function of the $\mathbf{f}$ vectors of the graph objects. The weight vector of a new instance $\vec{x}_{i}$ is

$\mathbf{f}_{\vec{x}_{i}}=\frac{\displaystyle\sum_{\vec{x}_{j}\in D^{L}\cup D^{U}}w_{\vec{x}_{i},\vec{x}_{j}}\cdot\mathbf{f}_{\vec{x}_{j}}}{\displaystyle\sum_{\vec{x}_{j}\in D^{L}\cup D^{U}}w_{\vec{x}_{i},\vec{x}_{j}}}$ (8)

where $w_{\vec{x}_{i},\vec{x}_{j}}$ is given by Eq. (4). The new instance $\vec{x}_{i}$ is classified according to the class corresponding to the $\arg\max_{c_{j}}f_{\vec{x}_{i},c_{j}}$ .
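Equation (8) amounts to a weighted average of the $\mathbf{f}$ vectors, which the following sketch makes explicit. The function name `classify_new`, the default $\sigma=1$, the Euclidean distance used inside Eq. (4), and the toy $\mathbf{f}$ vectors are assumptions for illustration.

```python
import math


def classify_new(x_new, data, f, sigma=1.0):
    """Eq. (8): weighted average of the f vectors of all graph objects.

    Returns the predicted class index and the weight vector of x_new.
    """
    # Weights between the new instance and every object, as in Eq. (4).
    weights = [math.exp(-math.dist(x_new, x) / (2 * sigma ** 2))
               for x in data]
    total = sum(weights)
    n_classes = len(f[0])
    f_new = [sum(wt * fj[c] for wt, fj in zip(weights, f)) / total
             for c in range(n_classes)]
    label = max(range(n_classes), key=lambda c: f_new[c])
    return label, f_new


# Hypothetical objects with f vectors produced by transductive learning.
data = [[0.0], [1.0], [9.0], [10.0]]
f = [[1.0, 0.0], [0.8, 0.2], [0.1, 0.9], [0.0, 1.0]]
label_a, f_a = classify_new([0.5], data, f)
label_b, f_b = classify_new([9.5], data, f)
```

A new instance near the first two objects receives class 0 and one near the last two receives class 1, because nearby objects dominate the weighted average.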

4. Results and discussion

In this section, we present the results regarding the two previously discussed evaluations presented in Fig. 4. In the first evaluation, the initial training set had only data selected by the active learning algorithms. In the second evaluation, we used inductive semi-supervised learning algorithms to label the remaining unlabeled data that were not selected. Given some common patterns found in our results and to avoid showing a great number of plots, we selected a reduced number of cases for discussion. However, all the results are included in the online supplementary material for this paper [60].

4.1 Evaluation using supervised learning

We begin our evaluation by classifying the test data using a model induced from a training set with the $b\%$ selected examples (Fig. 4a). We start our analysis by presenting the general results achieved by methods based on clustering and methods based on centrality measures from graphs.

As previously discussed, we evaluated two main approaches for sampling data based on clustering. In the first approach, examples near the cluster centers and near the cluster borders were selected. For this approach, we evaluated different numbers of border examples to be selected, namely $be=\{3,5,7,9,11\}$. In general, the best results were achieved with $be=3$. Table 4 presents the classification accuracy averaged over $5\%$ to $95\%$ of labeled data, using 3 examples from the border of each cluster found by the clustering algorithms. The results achieved with other values of $be$ are presented in the supplementary material. In the second approach, only examples near the cluster centers were considered. In Table 4, we highlight in bold the best result achieved by an algorithm for a given dataset. In general, we can note that the $k$-Means clustering algorithm presents the best results for most of the datasets.

Table 4
Average classification accuracy achieved by a classifier that uses a training set sampled by clustering algorithms using 1 example near the center and 3 examples from the border of each cluster

Dataset	k-Means	k-Medoids	Hierarchical	GMM
50words	0.564	0.548	0.567	0.542
Adiac	0.487	0.488	0.457	0.482
ChlorineConc.	0.557	0.557	0.557	0.554
Cricket-X	0.465	0.451	0.469	0.463
Cricket-Y	0.511	0.505	0.479	0.519
Cricket-Z	0.494	0.478	0.492	0.474
FaceAll	0.625	0.611	0.617	0.618
FacesUCR	0.623	0.620	0.613	0.603
Fish	0.676	0.669	0.654	0.683
Haptics	0.376	0.376	0.373	0.374
MedicalImages	0.641	0.636	0.625	0.628
NonInvFetECG1	0.775	0.773	0.769	0.773
NonInvFetECG2	0.833	0.831	0.829	0.827
OSULeaf	0.471	0.478	0.496	0.470
StarLightCurves	0.823	0.819	0.805	0.820
SwedishLeaf	0.680	0.686	0.650	0.685
Synt. Control	0.839	0.847	0.841	0.835
Two Patterns	0.799	0.791	0.798	0.786
WordsSynonyms	0.513	0.520	0.509	0.504
uWaveGestLib-X	0.714	0.703	0.716	0.709
uWaveGestLib-Y	0.616	0.617	0.613	0.618
uWaveGestLib-Z	0.624	0.621	0.617	0.623
Wafer	0.993	0.992	0.993	0.994
Yoga	0.739	0.738	0.735	0.738
Mean	0.643	0.640	0.636	0.638

The average classification accuracy achieved by the sampling approach that considers only examples near the cluster centers is presented in Table 5. In this table, we also include the method that achieved the best mean accuracy in Table 4, the $k$-Means algorithm with $3$ border examples from each cluster.

Table 5

General results of evaluated methods for each dataset considering the average accuracy for 5% to 95% of labeled data.

Dataset	Random	$k$-Means (w/ borders)	$k$-Medoids	Hierar.	$k$-Means	Density Peak	GMM	Farth. First	Betw.	Degree	PageRank
50words	0.547	0.564	0.566	0.582	0.573	0.529	0.563	0.482	0.519	0.502	0.532
Adiac	0.493	0.487	0.514	0.507	0.517	0.404	0.513	0.427	0.442	0.413	0.474
ChlorineConc.	0.549	0.557	0.549	0.550	0.552	0.533	0.551	0.537	0.518	0.530	0.550
Cricket-X	0.462	0.465	0.472	0.499	0.498	0.486	0.469	0.437	0.430	0.429	0.455
Cricket-Y	0.524	0.511	0.547	0.542	0.557	0.542	0.532	0.476	0.514	0.488	0.518
Cricket-Z	0.491	0.494	0.506	0.508	0.517	0.504	0.498	0.453	0.472	0.460	0.488
FaceAll	0.620	0.625	0.639	0.660	0.660	0.447	0.639	0.557	0.589	0.566	0.608
FacesUCR	0.615	0.623	0.634	0.653	0.657	0.579	0.625	0.588	0.604	0.574	0.601
Fish	0.685	0.676	0.719	0.722	0.727	0.694	0.704	0.621	0.678	0.647	0.677
Haptics	0.361	0.376	0.366	0.341	0.355	0.368	0.361	0.310	0.362	0.351	0.364
MedicalImages	0.614	0.641	0.620	0.640	0.630	0.603	0.617	0.609	0.552	0.557	0.611
NonInvFetECG1	0.775	0.775	0.787	0.794	0.796	0.772	0.788	0.711	0.732	0.696	0.777
NonInvFetECG2	0.835	0.833	0.847	0.850	0.852	0.836	0.842	0.777	0.758	0.721	0.831
OSULeaf	0.454	0.471	0.460	0.477	0.472	0.432	0.468	0.429	0.433	0.403	0.442
StarLightCurves	0.835	0.823	0.827	0.834	0.834	0.8367	0.831	0.840	0.806	0.784	0.831
SwedishLeaf	0.691	0.680	0.701	0.717	0.722	0.653	0.716	0.551	0.628	0.559	0.659
Synt. Control	0.833	0.839	0.857	0.888	0.882	0.528	0.868	0.801	0.858	0.784	0.840
Two Patterns	0.779	0.799	0.798	0.827	0.821	0.772	0.796	0.765	0.772	0.742	0.770
uWaveGestLib-X	0.703	0.714	0.712	0.722	0.719	0.695	0.712	0.648	0.693	0.654	0.702
uWaveGestLib-Y	0.619	0.616	0.625	0.630	0.633	0.621	0.623	0.603	0.607	0.563	0.627
uWaveGestLib-Z	0.618	0.624	0.621	0.629	0.628	0.615	0.621	0.578	0.611	0.591	0.619
Wafer	0.989	0.993	0.991	0.994	0.992	0.984	0.992	0.993	0.961	0.965	0.991
WordsSynonyms	0.511	0.513	0.531	0.553	0.544	0.482	0.526	0.454	0.516	0.480	0.503
Yoga	0.753	0.739	0.767	0.773	0.776	0.731	0.771	0.747	0.734	0.697	0.740
Mean	0.639	0.643	0.652	0.662	0.663	0.610	0.651	0.599	0.616	0.589	0.633
Wins against random sampling		15	23	22	22	8	22	2	3	0	7

Results in Table 5 show that, across the 24 datasets, the best results are in general achieved by k-Means and Hierarchical clustering. Only in the StarLightCurves dataset was the best result achieved by the Farthest First Traversal algorithm. We also present the number of times that each method outperformed random sampling, the baseline considered in this work. Although k-Means and Hierarchical clustering present better mean results, k-Medoids achieved the highest number of wins against random sampling. From the results presented in Table 5, we can also note that considering only examples near the cluster centers slightly surpasses the accuracy obtained when considering both examples near cluster centers and cluster borders (presented in Table 4).
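The "wins against random sampling" row can be recomputed directly from the table. The minimal sketch below does so for three rows copied from Table 5 and a subset of the methods. Counting a win only when the accuracy is strictly higher than the baseline is an assumption, since the text does not state how ties were handled; the function name `wins_against` is hypothetical.

```python
# Three rows copied from Table 5, restricted to a few methods.
table5 = {
    "50words":         {"Random": 0.547, "k-Medoids": 0.566, "Hierar.": 0.582, "k-Means": 0.573, "Degree": 0.502},
    "Haptics":         {"Random": 0.361, "k-Medoids": 0.366, "Hierar.": 0.341, "k-Means": 0.355, "Degree": 0.351},
    "StarLightCurves": {"Random": 0.835, "k-Medoids": 0.827, "Hierar.": 0.834, "k-Means": 0.834, "Degree": 0.784},
}

def wins_against(table, baseline="Random"):
    """Count, per method, the datasets where it strictly beats the baseline."""
    wins = {}
    for row in table.values():
        for method, acc in row.items():
            if method == baseline:
                continue
            # Assumption: a tie with the baseline does not count as a win.
            wins[method] = wins.get(method, 0) + int(acc > row[baseline])
    return wins

wins = wins_against(table5)
print(wins)
```

Applied to all 24 rows, the same function gives the per-method counts under the strict-inequality assumption; rounding in the published accuracies may make a few comparisons appear as ties.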

We performed the Friedman test with the Nemenyi post-hoc test at a 95% confidence level to statistically compare the results of the different algorithms. In Fig. 5, we present the critical difference diagram for ranked accuracies that illustrates the test.

Figure 5.

Critical difference diagram considering the average results of each algorithm.

In this diagram, the algorithms are sorted according to their average ranking. Algorithms connected by a line do not present statistically significant differences among them [19]. Thus, we can note in the diagram from Fig. 5 that k-Means is ranked in first place. In contrast, random sampling, the most used method in the literature, is ranked only 5th among the 10 evaluated methods. There is no statistically significant difference among k-Means, Hierarchical clustering, k-Medoids and Gaussian Mixture Models. However, k-Means and Hierarchical clustering are superior to random sampling with a statistically significant difference.
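The two quantities behind a critical difference diagram are the average rank of each method over the datasets and the Nemenyi critical difference, $CD = q_{\alpha}\sqrt{k(k+1)/(6N)}$ for $k$ methods and $N$ datasets. The sketch below computes both on a toy input of three Table 5 rows. The value $q_{0.05} = 3.164$ for 10 methods is quoted from memory of the table in [19] and should be checked there before use; the function names are hypothetical.

```python
import math

def average_ranks(scores):
    """scores[d][m] = accuracy of method m on dataset d; rank 1 is best.
    Tied methods share the mean of the ranks they occupy."""
    n_methods = len(scores[0])
    totals = [0.0] * n_methods
    for row in scores:
        order = sorted(range(n_methods), key=lambda m: -row[m])
        i = 0
        while i < n_methods:
            j = i
            while j + 1 < n_methods and row[order[j + 1]] == row[order[i]]:
                j += 1
            mean_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
            for m in order[i:j + 1]:
                totals[m] += mean_rank
            i = j + 1
    return [t / len(scores) for t in totals]

def nemenyi_cd(q_alpha, n_methods, n_datasets):
    """Critical difference: CD = q_alpha * sqrt(k (k + 1) / (6 N))."""
    return q_alpha * math.sqrt(n_methods * (n_methods + 1) / (6.0 * n_datasets))

# Toy input: Random, Hierar., k-Means on 50words, Haptics, StarLightCurves (Table 5).
ranks = average_ranks([
    [0.547, 0.582, 0.573],
    [0.361, 0.341, 0.355],
    [0.835, 0.834, 0.834],
])
# Assumption: q_alpha = 3.164 for alpha = 0.05 and 10 methods (verify against [19]).
cd = nemenyi_cd(3.164, n_methods=10, n_datasets=24)
print(ranks, round(cd, 3))
```

Two methods are then considered significantly different when their average ranks differ by at least the critical difference, which is what the connecting lines in the diagram encode.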

In Fig. 5, we consider the average results achieved by the algorithms with 5% to 95% of labeled data. In Fig. 6, we show that this ranking can change slightly if we consider a different amount of labeled examples, taking 20% and 50% of labeled data as examples.

Figure 6.

Critical difference diagrams considering the results achieved by the algorithms with 20% and 50% of training set labeled. (a) 20% of labeled data; (b) 50% of labeled data.

In the three presented cases, the methods that provided the worst performance are the same: the Degree centrality measure, Farthest First Traversal, the Betweenness centrality measure, and the Density Peak clustering algorithm. Thus, we do not recommend these methods for selecting initial data from fully unlabeled data. Based on our results, we can recommend k-Means and Hierarchical clustering rather than random sampling. k-Means is computationally more efficient than Hierarchical clustering. However, k-Means results present some variability due to the non-deterministic nature of the algorithm.

To present a general view of the results for all datasets in a paired comparison against random sampling, we show a graphical representation in which each dataset is represented by a point whose $x$ and $y$ coordinates are the accuracies obtained by the two rival methods, respectively. We focus on the comparison of random sampling against k-Means, Hierarchical clustering, and k-Medoids. Figure 7 presents the average of the results from 5% to 95% of labeled examples, as presented in Table 5.

Figure 7.

General results of k-Means, Hierarchical clustering, and k-Medoids compared to random sampling considering the average of results from 5% to 95% of labeled data. (a) Random vs. $k$-Means; (b) Random vs. Hierarchical; (c) Random vs. $k$-Medoids.

We can note in the three illustrations of Fig. 7 that almost all points are situated above the main diagonal (except the Haptics dataset for $k$-Means and Hierarchical Agglomerative clustering). Points above the diagonal correspond to datasets where the rival method is better than random sampling. k-Means and Hierarchical clustering present very similar results. In Fig. 7c, we can note that the results of k-Medoids are slightly worse than those of k-Means and Hierarchical clustering, but still better than random sampling.

Figure 8.

Similar results of k-Means and Hierarchical clustering methods. (a) FaceAll; (b) FacesUCR; (c) TwoPatterns.

Analyzing the results of the clustering-based methods while varying the amount of labeled data, we can observe that in many cases, such as the datasets FaceAll, FacesUCR and TwoPatterns, k-Means and Hierarchical clustering presented very similar results, as observed in Fig. 8. In these figures, the lower x-axis represents the percentage of training data selected by the algorithm to be labeled, the upper x-axis represents the respective values in absolute terms, and the y-axis presents the accuracies in a holdout evaluation using the test data. With 100% of labeled training data, all the algorithms necessarily converge to the same result, which can be observed in the last column of Table 3.

Figure 9.

Datasets with inconclusive results in the evaluation on test set. (a) ChlorineConcentration; (b) Haptics.

Among all 24 datasets, only in ChlorineConcentration and Haptics did random sampling achieve competitive results, as presented in Fig. 9. However, we can note that for ChlorineConcentration, all methods present very similar results. For Haptics, due to some characteristic of the data distribution, Hierarchical clustering does not perform very well. It is also interesting to note that with only 25% (30 examples) of the labeled data, k-Medoids achieves almost the same result as with 100% (or 155 examples) of labeled data.

Figure 10.

Datasets with worst results considering the Density Peak clustering and Farthest-First Traversal algorithms. (a) SwedishLeaf; (b) Synthetic Control.

For the Density Peak and Farthest-First Traversal algorithms, we highlight the poor results on the SwedishLeaf and Synthetic Control datasets in Fig. 10. Although these methods present poor results in many datasets, the behavior is more evident in these two.

The results achieved by the Farthest-First Traversal can be explained by the algorithm's susceptibility to noise and outliers. In addition, even in the absence of noise or outliers, the algorithm is more likely to select border examples in the feature space. This characteristic becomes a problem when the dataset has many classes and/or the classes are not well separated in the feature space.
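The tendency to pick outliers and border examples follows directly from the greedy selection rule. The sketch below is a minimal farthest-first traversal on hypothetical 2-D points with Euclidean distance (the distance measure and starting point used in the experiments are not restated here, so both are assumptions): two compact groups plus one isolated point.

```python
def farthest_first(points, budget, start=0):
    """Greedy farthest-first traversal: repeatedly pick the point whose
    distance to the closest already-selected point is largest."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    selected = [start]
    # Distance from every point to its nearest selected point so far.
    nearest = [dist(p, points[start]) for p in points]
    while len(selected) < budget:
        nxt = max(range(len(points)), key=lambda i: nearest[i])
        selected.append(nxt)
        nearest = [min(d, dist(p, points[nxt])) for d, p in zip(nearest, points)]
    return selected

# Hypothetical data: two compact groups and one isolated point (index 6).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (1.0, 1.0), (1.2, 1.0), (1.0, 1.1),
          (5.0, 5.0)]
picked = farthest_first(points, budget=3)
print(picked)
```

In this toy case the isolated point is selected immediately after the starting point, and the third selection is the member of the second group that lies farthest from the first group, that is, a border example rather than a central one.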

Regarding the Density Peak algorithm, we noted in our experiments that the cutoff distance $d_{c}$ has great influence on its behavior. Although the authors show in their experiments that the results are consistent when varying $d_{c}$ [45], we noted that setting this parameter to an intermediate value, $d_{c}=0.5$, leads to considerable variation in our results. In many datasets, the results achieved by the Density Peak are the worst in comparison with the remaining methods, while in other datasets the results are competitive with other methods. Therefore, in general, we believe that this algorithm can achieve competitive results if the parameter $d_{c}$ is well tuned.
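The sensitivity to $d_{c}$ can be seen in how the two per-example quantities of the Density Peak algorithm are computed. The sketch below assumes the common cutoff-kernel formulation (local density $\rho_i$ = number of examples closer than $d_{c}$; $\delta_i$ = distance to the nearest example of higher density), which may differ in detail from the variant used in the experiments. The 1-D data and the two cutoff values are hypothetical.

```python
def density_peak_scores(points, d_c):
    """Return (rho, delta) per point: rho counts neighbours closer than d_c,
    delta is the distance to the nearest point with a higher rho."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n = len(points)
    d = [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    rho = [sum(1 for j in range(n) if j != i and d[i][j] < d_c) for i in range(n)]
    delta = []
    for i in range(n):
        higher = [d[i][j] for j in range(n) if rho[j] > rho[i]]
        # Points of maximal density get their largest distance to any point.
        delta.append(min(higher) if higher else max(d[i]))
    return rho, delta

# Hypothetical 1-D data: a group of three points and a group of two.
points = [(0.0,), (0.2,), (0.4,), (3.0,), (3.2,)]
rho_small, _ = density_peak_scores(points, d_c=0.3)
rho_large, _ = density_peak_scores(points, d_c=5.0)
print(rho_small, rho_large)
```

With the small cutoff, the middle point of the first group stands out as the densest; with the large cutoff, every point has the same density and the ranking carries no information, which illustrates why a poorly chosen $d_{c}$ can degrade the selection.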

Considering only the results achieved by methods based on centrality measures and varying the amount of labeled data, we can note that random sampling is very competitive. As previously discussed, considering 20% and 50% of the labeled data or the average of results, the Degree and Betweenness measures are among the 3 worst of the 10 methods evaluated in this paper. The PageRank centrality measure presents better results than the others, but its results are comparable to random sampling. For example, in the datasets SwedishLeaf and Yoga, random sampling presents a slightly better accuracy than the competing methods, although the PageRank measure presents a similar performance in some cases, as presented in Fig. 11.

Figure 11.

Datasets in which random sampling slightly outperforms the results achieved by methods based on centrality measures. (a) SwedishLeaf; (b) Yoga.

The differences in the results are even less evident in datasets such as ChlorineConcentration, Fish, Synthetic Control and uWaveGestureLibrary Y. As we can observe in Fig. 12, for ChlorineConcentration and uWaveGestureLibrary Y, the results of random sampling and PageRank are very similar, although PageRank shows a slightly better performance at the beginning, with less labeled data. For the Synthetic Control dataset, however, the Betweenness measure slightly outperforms random sampling and PageRank.

Figure 12.

Datasets where methods based on centrality measures slightly outperformed the random sampling. (a) ChlorineConcentration; (b) Synthetic Control; (c) uWaveGestureLibrary Y.

From the results achieved by graph-based methods, we can conclude that it is safer to avoid these methods when selecting initial data to be labeled. These methods require more effort than random sampling, present similar results, and only rarely present slightly better results. We believe that the poor results of these methods are explained by the fact that when we map the tabular representation to a graph-based representation, we are changing the view of the data, and the centrality measures capture the importance of instances in the graph topology, not in the tabular data. Thus, the most important instances in one view may be noisy in another view.
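The change of view can be illustrated with a toy example. The sketch below builds an undirected $k$-nearest-neighbour graph and computes degree centrality. The graph construction is an assumption made for illustration, since this section does not restate how the graphs were built, and the 1-D data are hypothetical: two tight pairs and one lone point lying between them.

```python
def knn_graph(points, k):
    """Undirected k-nearest-neighbour graph as a dict of neighbour sets."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n = len(points)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist(points[i], points[j]))
        for j in others[:k]:
            adj[i].add(j)
            adj[j].add(i)  # symmetrise: keep the edge if either end chose it
    return adj

def degree_centrality(adj):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(adj)
    return {i: len(nbrs) / (n - 1) for i, nbrs in adj.items()}

# Hypothetical data: two tight pairs and a lone point (index 2) between them.
points = [(0.0,), (0.1,), (1.0,), (1.9,), (2.0,)]
centrality = degree_centrality(knn_graph(points, k=2))
top = max(centrality, key=centrality.get)
print(top, centrality)
```

Here the lone point between the two groups becomes the most central node of the graph, although in the original feature space it is the least representative example of either group.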

Figure 13.

Datasets where semi-supervised learning improves the quality of initial training set. (a) Cricket-X; (b) TwoPatterns; (c) Haptics; (d) StarLightCurves.

At this point, we are in a position to answer two of the three research questions posed earlier:

• Question 1: Are there simple and effective alternatives to random sampling for selecting examples to build an initial training set?

• Answer: Yes. In our experimental evaluation, simple methods based on clustering algorithms such as k-Means, Hierarchical and k-Medoids provide initial training data with better predictive power than random sampling. In general, these algorithms are easy to understand and their implementations are easily found online. Thus, random sampling can be replaced by a method based on clustering that generally presents better results.

• Question 2: Among the evaluated alternatives, which one stands out?

• Answer: The best results are achieved by sampling based on clustering methods that select examples near the cluster centers, in particular k-Means and Hierarchical clustering. We recommend the use of Hierarchical clustering since it is a deterministic algorithm. In general, this method outperforms random sampling. Regarding the sampling based on centrality measures, PageRank provides the best results among the three evaluated measures. However, its results are not consistently better than random sampling. The Farthest-First Traversal presents the worst results together with the Density Peak clustering algorithm, and they are usually outperformed by random sampling.
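As an illustration of sampling near cluster centers with a deterministic hierarchical method, the sketch below clusters hypothetical 2-D points and returns one example per cluster. Several choices are assumptions made for the sketch and are not taken from the experiments: average linkage, Euclidean distance, one cluster per requested label, and "center" taken as the cluster mean.

```python
def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def agglomerative(points, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        return sum(dist(points[i], points[j]) for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the pair of clusters with the smallest average pairwise distance.
        a, b = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def select_near_centers(points, budget):
    """Pick, from each of `budget` clusters, the example closest to the cluster mean."""
    dim = len(points[0])
    chosen = []
    for members in agglomerative(points, budget):
        center = tuple(sum(points[i][d] for i in members) / len(members) for d in range(dim))
        chosen.append(min(members, key=lambda i: dist(points[i], center)))
    return sorted(chosen)

# Hypothetical data: two groups of three points each.
points = [(0.0, 0.0), (0.2, 0.0), (0.1, 0.05),
          (3.0, 3.0), (3.2, 3.0), (3.1, 3.05)]
to_label = select_near_centers(points, budget=2)
print(to_label)
```

The returned indices are the examples that would be shown to the expert. Because no random initialization is involved, repeated runs on the same data select the same examples.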

4.2 Evaluation using semi-supervised learning

The use of active learning methods to build initial training data is justified by the huge amount of data that can be easily collected in many real-world applications but cannot be fully labeled due to the high costs. In the evaluation previously discussed, we consider the performance of a classifier trained on only the $b\%$ of the unlabeled data $D^{U}$ selected by the active learning methods to be labeled by an expert. All the remaining data $R$ from the unlabeled data $D^{U}$ are discarded. In this section, we evaluate whether the labeled data together with the unlabeled data can help to build a more accurate classifier, based on the semi-supervised learning literature [10]. To label the data from $R$ using the subset of labeled instances $L1$, we classify the instances using the semi-supervised algorithms GFHF, LLGC, and Self-Training. This evaluation is represented in Fig. 4b.

Figure 14.

Datasets where semi-supervised learning deteriorates the quality of initial training set. (a) Cricket-Y; (b) SwedishLeaf; (c) Synthetic Control; (d) uWaveGestureLibrary Y.

According to our experiments, the best algorithms to select examples to be labeled in the active learning process were k-Means and Hierarchical clustering. However, Hierarchical clustering has the advantage of being deterministic. Thus, we use the labeled data selected by Hierarchical clustering to evaluate whether enlarging the training set with unlabeled data can improve the results obtained by instance selection. We present a comparison of the accuracies achieved by the classifier that uses only the selected examples and those that also use the unlabeled data in the training set after the inductive semi-supervised learning process.

Due to the parameter settings of each evaluated semi-supervised learning algorithm, the results present some variability. For this reason, we will present the mean accuracy and the standard deviation achieved by the algorithms varying the values of the parameters and the amount of labeled data.

Of the 24 datasets evaluated in this work, we note that SSL improves the quality of classification in approximately 8 datasets. From these datasets, we present the results of 4 (Cricket-X, Haptics, StarLightCurves, and TwoPatterns) in Fig. 13. In Fig. 13b and Fig. 13d, we can observe that the Self-Training algorithm can slightly improve the quality of the training set by using unlabeled data to complement the labeled data. On the other hand, Self-Training achieves worse or equal results compared to the use of only labeled data on the Cricket-X (Fig. 13a) and Haptics (Fig. 13c) datasets. In the Cricket-X dataset, the best results are achieved by the GFHF algorithm, and in the Haptics dataset, the best results are achieved by the LLGC algorithm.

Curiously, we note in some datasets that the use of SSL to label the remaining unlabeled data in the training set deteriorated the quality of the initial training set. Thus, in these cases, the best choice is to use only the data labeled by the oracle. To better illustrate this problem, we present the results of the datasets Cricket-Y, SwedishLeaf, Synthetic Control and uWaveGestureLibrary-Y in Fig. 14. In all of these datasets, we can note that Self-Training presents slightly worse results than the use of only labeled data. However, LLGC and GFHF present considerably worse results.

This behavior of the Self-Training algorithm occurs when the initial most confident instances are wrongly classified in the first iterations. Thus, a poor classification model is generated in the first iterations and the classification errors are propagated through the next iterations [76].
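The propagation effect can be reproduced with a very small self-training loop. The sketch below assumes a 1-nearest-neighbour base learner that, at each iteration, labels the single unlabeled example closest to the current labeled set; this is an illustrative variant and not necessarily the configuration used in the experiments. The 1-D data are hypothetical.

```python
def self_training_1nn(labeled, unlabeled):
    """Self-training with a 1-NN base learner: at each step, label the
    unlabeled example closest to the labeled set and add it to that set."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    labeled = list(labeled)     # [(features, label), ...]
    pending = list(unlabeled)   # [features, ...]
    while pending:
        # Most confident prediction = smallest distance to a labeled example.
        x, (_, y) = min(((u, l) for u in pending for l in labeled),
                        key=lambda ul: dist(ul[0], ul[1][0]))
        labeled.append((x, y))
        pending.remove(x)
    return labeled

# Hypothetical data: a chain of unlabeled points leading from class "A" towards class "B".
seed = [((0.0,), "A"), ((10.0,), "B")]
chain = [(1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,), (7.0,), (8.0,)]
result = dict(self_training_1nn(seed, chain))
print(result[(8.0,)])
```

The point at 8.0 is much closer to the "B" seed than to the "A" seed, yet it ends up labeled "A" because each early decision extends the "A" region one step further. If the true class boundary lay anywhere before that point, every later label along the chain would inherit the mistake.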

Figure 15.

Illustration of laser sensor to capture information about insects and an example of audio signal collected. (a) Sensor; (b) Signal.

Graph-based algorithms tend to obtain higher classification performance than Self-Training in several application domains [48, 76, 72, 75]. However, in many datasets, they present poorer results than supervised learning algorithms [48]. In our evaluation, i.e., considering sequential data, we also verified that they tend to obtain poorer classification performance than that obtained by supervised learning.

At this point, we are in a position to answer the last research question posed earlier in this work:

• Question 3: Can the unlabeled examples (not selected for the initial model) improve the classification performance of the initial training set?

• Answer: Of the 24 datasets evaluated, we note that the SSL algorithms can slightly improve the results in only one-third of the cases. Among the three evaluated SSL algorithms (GFHF, LLGC, and Self-Training), Self-Training presented the best results. However, in many datasets the SSL algorithms deteriorated the quality of the training set. Thus, the SSL algorithms add cost to the process of building a labeled training set, and we do not have evidence that they can improve the quality of the training data.

4.3 A case study of laser sensor to classify insects

In our case study, we present a real application of insect species classification using laser sensors in a data stream environment [61]. There are two main motivations for classifying insect species using the laser sensor: i) to provide estimates of insect populations that can aid traditional methods of control, such as the spreading of insecticides and larvicides, with less waste, and ii) to enable intelligent traps that attract insects using lures and capture only specific target species, such as disease vectors and agricultural pests, as proposed in [58].

Basically, the sensor uses a low-powered planar laser source that is pointed at an array of phototransistors, as presented in Fig. 15a. When a flying insect crosses the laser, its wings partially occlude the light, causing small variations in the light captured by the phototransistors. These variations are recorded as an audio signal, such as the example presented in Fig. 15b. In this example, we can note a one-second audio segment in which most of the time we have background noise, plus a higher-amplitude signal lasting tenths of a second that constitutes an insect passage. From the audio data with this passage, it is possible to extract discriminant features for each species. In this work, we use the Mel-Frequency Cepstral Coefficients (MFCC), as recommended in our previous evaluation [59]. MFCCs are popular features in various application domains, particularly speech and speaker recognition [71] as well as musical instrument classification [63].

In controlled experiments in the laboratory, we can assume that we know a subset of species and have initial training data to evaluate discriminant features and classifiers. However, when the sensor is used in field conditions, we do not know the potential species in the specific region where the sensor will be placed. Even if it were possible to know these species and collect data in the laboratory beforehand, the environmental conditions in the laboratory are possibly different from the field conditions. It is known that climate variations such as temperature [62], air pressure [9] and humidity [40] can influence the insects' metabolisms and consequently the measured data. Thus, data previously collected in the laboratory may not reflect the current data observed in field conditions.

We propose in this paper a simple and practical solution to collect initial training data and build an accurate classifier for the insect sensor using a subset of examples. Our idea is to place the sensor in field conditions to collect data for a period of time and use active learning methods to select some of the data. An expert will analyze the selected signals and provide their respective labels.

Given the general results achieved in our experimental evaluation, in this case study, we only consider the three best methods based on clustering: k-Means, Hierarchical, and k-Medoids to select the data to be labeled. We also evaluated the popular method random sampling as a baseline method. We use the same dataset evaluated in [61], which contains two species of flies (Drosophila melanogaster and Musca domestica) and three species of mosquitoes (Culex quinquefasciatus, Culex tarsalis and Aedes aegypti) distributed in a total of 5,325 examples collected over a period of seven days. In our experiments, we considered a setup time of one day. In other words, the sensor collects data over one day and from this subset of unlabeled data, we select some of the data to present to an expert that will provide the respective labels. Then, we use this initial training set to evaluate the classification accuracy for the next six days. One day of data collection constitutes the first 1,083 examples of the dataset. Thus, the test is performed over the 4,242 remaining examples.

The accuracies achieved by the evaluated methods considering a different amount of labeled data are presented in Fig. 16.

Figure 16.

Accuracies of methods based on clustering to select initial data to classify insects.

We can note that the results presented in Fig. 16 are in agreement with those presented in our experimental evaluation on the benchmark datasets. Hierarchical clustering achieved results better than or equal to those of k-Means. These results are considerably better than random sampling. Among the methods based on clustering, k-Medoids presented the worst results, but still better than random sampling.

It is also interesting to note that sampling with the Hierarchical clustering algorithm presents an accuracy of 0.874 with 35% (or 379 examples) of labeled examples. This result is very competitive if we consider that with 100% (or 1,083 examples) of the training data labeled, we have a very similar accuracy of 0.877. Thus, the use of methods based on clustering, such as Hierarchical clustering, to select data to be labeled makes it possible to drastically reduce the cost of building an accurate classifier.
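The sizes involved in this protocol follow from simple arithmetic on the figures reported for the case study (5,325 examples in total, the first 1,083 forming the one-day setup period). The short sketch below makes the chronological split and the labeling budget explicit; the function name `setup_split` is hypothetical.

```python
def setup_split(n_total, n_setup, labeled_fraction):
    """Split a stream chronologically and size the labeling budget."""
    n_test = n_total - n_setup                  # examples after the setup period
    budget = round(labeled_fraction * n_setup)  # examples shown to the expert
    return n_setup, n_test, budget

n_setup, n_test, budget = setup_split(n_total=5325, n_setup=1083, labeled_fraction=0.35)
print(n_setup, n_test, budget)
```

With a 35% budget, the expert labels 379 of the 1,083 setup examples, and the classifier is then evaluated on the 4,242 examples collected over the following six days.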

Figure 17.

Critical difference diagram for the results achieved by methods to select initial data to classify insect species.

Given the results presented in Fig. 16, we performed the Friedman test to compare the results statistically. The critical difference diagram is presented in Fig. 17. We can note that Hierarchical clustering is first in the ranking, without a significant difference to k-Means. We also observe no significant difference between the results of k-Means and k-Medoids, nor between the results of k-Medoids and random sampling. However, Hierarchical clustering and k-Means are statistically better than random sampling.

Finally, we evaluated if the use of semi-supervised learning algorithms can improve the quality of the training set labeled by the oracle after the sampling based on Hierarchical clustering. Once again, we chose the data selected by Hierarchical clustering due to the best achieved results, as previously discussed. The results achieved by the algorithms GFHF, LLGC and Self-Training are presented in Fig. 18.

Figure 18.

Results of semi-supervised learning algorithms for insects classification.

Similarly to the previously presented results for benchmark data, we can note that the SSL algorithms are responsible for deteriorating the quality of training set labeled by the oracle.

5. Conclusions

Although active learning has been widely used in many problems, the majority of methods from the literature consider the presence of an initially labeled dataset or a trained classification model. Thus, previously labeled data are used to rank unlabeled data according to their importance. Then, the user must label just the top-ranked instances and the training data will be enriched with those instances. However, in some real applications, there is the need to perform the active learning procedure at the beginning of the life cycle of the problem, on a fully unlabeled dataset and in the absence of a classification model.

In this fully unlabeled scenario, most works in the literature randomly sample examples to be labeled by an oracle. Although it is a simple and popular approach, we question in this paper whether this practice is the most effective for this problem. We believe that such practice is popular due to the lack of a more in-depth study showing that other simple methods can be a better choice.

In this direction, we evaluated in this paper 9 different methods based on clustering algorithms and centrality measures from graphs to select the most informative examples of a fully unlabeled dataset to be labeled by an oracle. Based on our wide experimental evaluation performed on 24 time series datasets and one case study of insect classification, we found that some simple approaches present better accuracy than random sampling. The best results are achieved by sampling based on clustering methods, in particular k-Means and Hierarchical clustering. These methods are simple and readily available in many software packages. Thus, random sampling can be easily replaced by one of these methods in order to obtain better results. On the other hand, the Density Peak clustering algorithm and the three evaluated centrality measures from graphs (Degree, Betweenness, and PageRank) performed worse than simple random sampling.

After the selection of instances to be labeled by an oracle to build an initial training set, we also evaluated whether the use of semi-supervised learning algorithms can help to improve the quality of the training set by the addition of more examples. Unfortunately, these algorithms add cost to the process of building a labeled training set, and we do not have evidence that they can consistently improve the quality of the training set.

Acknowledgments

The authors would like to thank the São Paulo Research Foundation (FAPESP) for the support through grants #2011/17698-5, #2011/12823-6, and #2014/08996-0, and the National Council for Scientific and Technological Development (CNPq) for the support through grants #446330/2014-0 and #303083/2013-1.

References

Amancio

D.R.

, Probing the topological properties of complex networks modeling short written texts, PloS one 10(2) (2015), e0118394.

Angluin

, Queries revisited, Theoretical Computer Science 313(2) (2004), 175–194.

Araujo

and Zhao

, Detecting and labeling representative nodes for network-based semi-supervised learning, in: Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2013, pp. 1729–1736.

Bagnall

and Lines

, An experimental evaluation of nearest neighbour time series classification, Technical Report CMP-C14-01, Department of Computer Science, University East Anglia, Norwich, United Kingdom, 2014.

Baram

El-Yaniv

and Luz

, Online choice of active learning algorithms, Journal of Machine Learning Research 5 (2004), 255–291.

Batista

G.E.A.P.A.

Keogh

E.J.

Tataw

O.M.

and Souza

V.M.A.

, CID: an efficient complexity-invariant distance for time series, Data Mining and Knowledge Discovery 28(3) (2014), 634–669.

Breve

F.A.

Zhao

Quiles

M.G.

Pedrycz

and Liu

, Particle competition and cooperation in networks for semi-supervised learning, IEEE Transactions on Knowledge and Data Engineering 24(9) (2012), 1686–1698.

Brin

and Page

, The anatomy of a large-scale hypertextual web search engine, Computer Networks 30 (1998), 107–117.

Chadwick

L.E.

and Williams

C.M.

, The effects of atmospheric pressure and composition on the flight of drosophila, The Biological Bulletin 97(2) (1949), 115–137.

10.

Chapelle

Schölkopf

and Zien

, Semi-Supervised Learning, MIT Press, 2006.

11.

Chen

and Ng

, On the marriage of lp-norms and edit distance, in: Proceedings of the International Conference on Very large data bases (VLDB), 2004, pp. 792–803.

12.

Chen

Özsu

M.T.

and Oria

, Robust and fast similarity search for moving object trajectories, in: Proceedings of the International Conference on Management of data (SIGMOD), 2005, pp. 491–502.

13.

Chen

Keogh

Begum

Bagnall

Mueen

and Batista

, The ucr time series classification archive, July 2015. www.cs.ucr.edu/ẽamonn/time_series_data/.

14.

de Sousa

C.A.R.

Rezende

S.O.

and Batista

G.E.A.P.A.

, Influence of graph construction on semi-supervised learning, in: Proceedings of the European Conference Machine Learning and Knowledge Discovery in Databases (ECML/PKDD), 2013, pp. 160–175.

15.

de Sousa

C.A.R.

Souza

V.M.A.

and Batista

G.E.A.P.A.

, Time series transductive classification on imbalanced data sets: An experimental study, in: Proceedings of the International Conference on Pattern Recognition (ICPR), 2014, pp. 3780–3785.

16.

de Sousa

C.A.R.

Souza

V.M.A.

and Batista

G.E.A.P.A.

, An experimental analysis on time series transductive classification on graphs, in: Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2015, pp. 1–8.

17.

Delalleau

Bengio

and Le-Roux

, Efficient non-parametric function induction in semi-supervised learning, in: Proceedings of the International Workshop on Artificial Intelligence and Statistics (AISTATS), 2005, pp. 96–103.

18.

Dempster

A.P.

Laird

N.M.

and Rubin

D.B.

, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society 39 (1977), 1–38.

19.

Demšar

, Statistical comparisons of classifiers over multiple data sets, The Journal of Machine Learning Research 7 (2006), 1–30.

20.

Ding

Trajcevski

Scheuermann

Wang

and Keogh

, Querying and mining of time series data: Experimental comparison of representations and distance measures, Proceedings of the VLDB Endowment 1(2) (2008), 1542–1552.

21.

Ester

Kriegel

Sander

and Xu

, A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proceedings of the International Conference on Knowledge Discovery and Data Mining (SIGKDD), 96 (1996), pp 226–231.

22.

Faloutsos

Ranganathan

and Manolopoulos

, Fast subsequence matching in time-series databases, in: Proceedings of the International Conference on Management of Data (SIGMOD), 1994, pp. 1–11.

23.

Fernández

García

del Jesus

M.J.

and Herrera

, A study of the behaviour of linguistic fuzzy rule based classification systems in the framework of imbalanced data-sets, Fuzzy Sets and Systems 159(18) (2008), 2378–2398.

24.

Frentzos

Gratsias

and Theodoridis

, Index-based most similar trajectory search, in Proceedings of the International Conference on Data Engineering (ICDE), 2007, pp. 816–825.

25.

Zhu

and Li

, A survey on instance selection for active learning, Knowledge and Information Systems 35(2) (2013), 249–283.

26.

Fukunaga

and Hostetler

, The estimation of the gradient of a density function, with applications in pattern recognition, IEEE Transactions on Information Theory 21(1) (1975), 32–40.

27.

Hoi

S.C.H.

Jin

Zhu

and Lyu

M.R.

, Batch mode active learning and its application to medical image classification, in: Proceedings of the International Conference on Machine Learning (ICML), 2006, pp. 417–424.

28.

Mac-Namee

and Delany

S.J.

, Off to a good start: Using clustering to select the initial training set in active learning, in: Proceedings of the International Florida Artificial Intelligence Research Society Conference (FLAIRS), 2010, pp. 26–31.

29.

W.M.

Xie

and Maybank

, Unsupervised active learning based on hierarchical graph-theoretic clustering, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 39(5) (2009), 1147–1161.

30.

Jain

A.K.

, Data clustering: 50 years beyond k-means, Pattern Recognition Letters 31(8) (2010), 651–666.

31.

Kang

Ryu

K.R.

and Kwon

, Using cluster-based sampling to select initial training set for active learning in text classification, in: Advances in Knowledge Discovery and Data Mining, Springer 2004, pp. 384–388.

32. Keogh and Ratanamahatana C.A., Exact indexing of dynamic time warping, Knowledge and Information Systems 7(3) (2005), 358–386.

33. Keogh, Wei, Lee and Vlachos, LB_Keogh supports exact indexing of shapes under rotation invariance with arbitrary representations and distance measures, in: Proceedings of the International Conference on Very Large Data Bases (VLDB), 2006, pp. 882–893.

34. Kim, Smyth and Luther, Modeling waveform shapes with random effects segmental hidden Markov models, in: Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2004, pp. 309–316.

35. Klein D.J., Centrality measure in graphs, Journal of Mathematical Chemistry 47(4) (2010), 1209–1223.

36. Lewis D.D. and Gale W.A., A sequential algorithm for training text classifiers, in: Proceedings of the International ACM Conference on Research and Development in Information Retrieval (SIGIR), 1994, pp. 3–12.

37. Lughofer, Hybrid active learning for reducing the annotation effort of operators in classification systems, Pattern Recognition 45(2) (2012), 884–896.

38. Macskassy S.A., Using graph-based metrics with empirical risk minimization to speed up active learning on networked data, in: Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2009, pp. 597–606.

39. Mandel M.I., Poliner G.E. and Ellis D.P.W., Support vector machine active learning for music retrieval, Multimedia Systems 12(1) (2006), 3–13.

40. Mellanby, Humidity and insect metabolism, Nature 138 (1936), 124–125.

41. Nanopoulos, Alcock and Manolopoulos, Feature-based classification of time-series data, International Journal of Computer Research 10(3) (2001), 49–61.

42. Newman, Networks: An Introduction, Oxford University Press, Inc., 2010.

43. Nguyen H.T. and Smeulders, Active learning using pre-clustering, in: Proceedings of the International Conference on Machine Learning (ICML), 2004, p. 79.

44. Reynolds, Gaussian mixture models, in: Encyclopedia of Biometrics, Springer, 2015, pp. 827–832.

45. Rodriguez and Laio, Clustering by fast search and find of density peaks, Science 344(6191) (2014), 1492–1496.

46. Rodríguez J.J. and Alonso C.J., Interval and dynamic time warping-based decision trees, in: Proceedings of the ACM Symposium on Applied Computing (SAC), 2004, pp. 548–552.

47. Rodríguez J.J., Alonso C.J. and Boström, Learning first order logic time series classifiers: Rules and boosting, in: Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery (ECML/PKDD), 2000, pp. 299–308.

48. Rossi R.G., Lopes A.A. and Rezende S.O., Optimization and label propagation in bipartite heterogeneous networks to improve transductive classification of texts, Information Processing & Management 52(2) (2016), 217–257.

49. Roy and McCallum, Toward optimal active learning through sampling estimation of error reduction, in: Proceedings of the International Conference on Machine Learning (ICML), 2001, pp. 441–448.

50. Saito P.T.M., Suzuki C.T.N., Gomes J.F., de Rezende P.J. and Falcão A.X., Robust active learning for the diagnosis of parasites, Pattern Recognition, 2015.

51. Settles, Active learning literature survey, University of Wisconsin, Madison, 2010, p. 65.

52. Settles and Craven, An analysis of active learning strategies for sequence labeling tasks, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2008, pp. 1070–1079.

53. Settles, Craven and Friedland, Active learning with real annotation costs, in: Proceedings of the NIPS Workshop on Cost-Sensitive Learning, 2008, pp. 1–10.

54. Settles, Craven and Ray, Multiple-instance active learning, in: Advances in Neural Information Processing Systems, 2008, pp. 1289–1296.

55. Seung H.S., Opper and Sompolinsky, Query by committee, in: Proceedings of the Workshop on Computational Learning Theory, 1992, pp. 287–294.

56. Shokoohi-Yekta, Wang and Keogh, On the Non-Trivial Generalization of Dynamic Time Warping to the Multi-Dimensional Case, in: Proceedings of the SIAM International Conference on Data Mining (SDM), 2015, pp. 289–297.

57. Silva D.F., Souza V.M.A. and Batista G.E.A.P.A., Time series classification using compression distance of recurrence plots, in: Proceedings of the International Conference on Data Mining (ICDM), 2013, pp. 687–696.

58. Silva D.F., Souza V.M.A., Batista G.E.A.P.A., Keogh and Ellis D.P.W., Applying machine learning and audio analysis techniques to insect recognition in intelligent traps, in: Proceedings of the International Conference on Machine Learning and Applications (ICMLA), Vol. 1, 2013, pp. 99–104.

59. Silva D.F., Souza V.M.A., Ellis D.P.W., Keogh E.J. and Batista G.E.A.P.A., Exploring low cost laser sensors to identify flying insect species, Journal of Intelligent & Robotic Systems (2014), 1–18.

60. Souza V.M.A., Rossi R.G., Rezende S.O. and Batista G.E.A.P.A., Online supplementary material, http://sites.labic.icmc.usp.br/vsouza/SM_IDA.pdf, 2016.

61. Souza V.M.A., Silva D.F. and Batista G.E.A.P.A., Classification of data streams applied to insect recognition: Initial results, in: Proceedings of the Brazilian Conference on Intelligent Systems (BRACIS), 2013, pp. 76–81.

62. Taylor L.R., Analysis of the effect of temperature on insects in flight, The Journal of Animal Ecology (1963), 99–117.

63. Terasawa, Slaney and Berger, The thirteen colors of timbre, in: Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2005, pp. 323–326.

64. Tong and Koller, Support vector machine active learning with applications to text classification, Journal of Machine Learning Research 2 (2002), 45–66.

65. Vlachos, Kollios and Gunopulos, Discovering similar multidimensional trajectories, in: Proceedings of the International Conference on Data Engineering (ICDE), 2002, pp. 673–684.

66. Ward J.H., Jr., Hierarchical grouping to optimize an objective function, Journal of the American Statistical Association 58(301) (1963), 236–244.

67. and Chang E.Y., Distance-function design and fusion for sequence data, in: Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM), 2004, pp. 324–333.

68. Keogh, Shelton, Wei and Ratanamahatana C.A., Fast time series classification using numerosity reduction, in: Proceedings of the International Conference on Machine Learning (ICML), 2006, pp. 1033–1040.

69. Yan, Yang and Hauptmann, Automatically labeling video data using multi-class active learning, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2003, pp. 516–523.

70. Yuan, Han, Guan, Lee and Lee, Initial training data selection for active learning, in: Proceedings of the International Conference on Ubiquitous Information Management and Communication (IMCOM), 2011, p. 5.

71. Zhen, Liu and Chi, On the importance of components of the MFCC in speech and speaker recognition, Acta Scientiarum Naturalium 37(3) (2001), 371–378.

72. Zhou, Bousquet, Lal, Weston and Schölkopf, Learning with local and global consistency, Advances in Neural Information Processing Systems 16(16) (2004), 321–328.

73. Zhu, Wang, Yao and Tsou B.K., Active learning with sampling by uncertainty and density for word sense disambiguation and text classification, in: Proceedings of the International Conference on Computational Linguistics (COLING), 2008, pp. 1137–1144.

74. Zhu, Semi-supervised learning literature survey, Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005.

75. Zhu, Ghahramani and Lafferty, Semi-supervised learning using Gaussian fields and harmonic functions, in: Proceedings of the International Conference on Machine Learning (ICML), Vol. 3, 2003, pp. 912–919.

76. Zhu and Goldberg A.B., Introduction to Semi-Supervised Learning, Morgan and Claypool Publishers, 2009.

77. Zhu, Zhang, Lin and Shi, Active learning from data streams, in: Proceedings of the International Conference on Data Mining (ICDM), 2007, pp. 757–762.

