Abstract
The dilemma between stability and plasticity is crucial in machine learning, especially when non-stationary input distributions are considered. This issue can be addressed by continual learning in order to alleviate catastrophic forgetting. This strategy has been previously proposed for supervised and reinforcement learning models. However, little attention has been devoted to unsupervised learning. This work presents a dynamic learning rate framework for unsupervised neural networks that can handle non-stationary distributions. In order for the model to adapt to the input as it changes its characteristics, a varying learning rate that does not merely depend on the training step but on the reconstruction error has been proposed. In the experiments, different configurations for classical competitive neural networks, self-organizing maps and growing neural gas with either per-neuron or per-network dynamic learning rate have been tested. Experimental results on document clustering tasks demonstrate the suitability of the proposal for real-world problems.
Keywords
Introduction
Humans are able to adapt to the changing environments around them in an efficient and robust way through a sequence of experiences that allow them to incrementally learn what they need from these non-stationary environments and retain vital information for the future. This feature has not been emulated by machine learning models yet. They include a wide variety of methods that have been successfully applied to such diverse tasks as job scheduling [1], natural disaster prediction [2], material production [3, 4], structure inspection and defect detection [5, 6, 7, 8], biomedical tasks [9, 10, 11] and economic ones [12, 13, 14]. Furthermore, they excel in tasks that are dependent on image processing and analysis [15, 16, 17] due to the improvement of deep learning models. However, machine learning training is typically based on a data set that is assumed to maintain the same data distribution over time.
In recent times, continuous learning has emerged as one of the areas of study with the greatest potential for endowing learning systems with the ability to work with non-stationary data distributions [18, 19]. It is essential for the learning models to be sufficiently plastic to incorporate more current knowledge while avoiding forgetting information acquired at the beginning of their training, which in its most extreme case would lead to a situation known as catastrophic forgetting [20, 21], in which the models would continually discard information learned in the past and would only retain the latest data presented to them. However, it is not easy to find a compromise between plasticity and the stability necessary for the model to converge to a steady state that maintains the meaningful information extracted throughout the whole training [22]. Although multiple continuous learning models based on supervised and reinforcement learning have been proposed [23, 24, 25, 26], the development of unsupervised models that attempt to cluster input data whose distribution changes over time is still a challenging field.
Unsupervised learning methods do not need an input data set labeled by a human expert. They usually attempt to detect similar features or common patterns that allow them to group the input data into different clusters or categories [27]. Knowledge about the data similarity within each cluster can be used to find association rules between features [28]. Furthermore, each cluster information may be useful for representing data that belongs to that cluster in a reduced form [29]. For that purpose, only the most significant features are retained. Those that provide little or no information are removed. The present work is focused on a specific paradigm within unsupervised learning: competitive learning. This paradigm is represented by a family of classical neural networks where neurons compete to represent data samples. The early stages of the training process greatly influence the information learned by the neural network. A change in the distribution of data when the network is practically trained does not greatly affect its final configuration. In order to learn those new data properties, the neural networks would have to be retrained. The aim of this work is to propose a framework in which competitive networks could learn continuously. Within this family, we experiment with three network architectures: competitive neural network, self-organizing map, and growing neural gas. Each new model has been applied to the problem of online document clustering and its performance has been tested.
The remainder of the paper is structured as follows. Section 1.1 comprises works related to competitive neural networks and online document clustering. The competitive and self-organized continuous learning models are presented in Section 2, while Section 3 exposes our proposal to adapt them to continual learning, as well as the metrics used to evaluate the performance. Section 4 aims to present the results of the experiments. Subsequently, Section 5 is devoted to discussing the results. Finally, Section 6 presents some conclusions.
Related works
A competitive neural network is designed to perform vector quantization of the input space once it is trained [30]. Despite being relatively simple models, competitive networks have achieved a good performance in applications with high-dimensional data such as images or video sequences. Color image segmentation methods where competitive models are an essential component are presented in [31, 32]. Regarding the analysis of video sequences, [33] proposes the use of rival penalized competitive learning [34] to categorize the body postures of human beings appearing in the video frames, which allows the system to extract behavioral patterns. Finally, a probabilistic competitive neural model to compress multispectral data by estimating each cluster’s intrinsic dimension is presented in [33].
Self-organizing maps can be considered a refinement of competitive learning networks, organizing the neurons into a graph representing a low-dimensional grid. Specialized variants of self-organizing maps have been adapted for many different engineering tasks, such as foreground detection in video sequences [35], brain-computer interfaces [36], intrusion detection in sensor networks [37], monitoring of industrial processes [38], or object detection in controlled environments [39].
A growing neural gas is another kind of competitive learning model where neurons are organized into a graph whose topology and size are dynamic. The number of neurons and the connections among them may be increased or decreased as it is required to adapt the number of clusters accordingly to the distribution of the training data. This model and various variants have also been applied to a range of engineering tasks, such as temporal video segmentation [40], foreground detection in video sequences [41], automatic recognition of relevant landmarks in medical images [42], modeling of biodiversity data [43], 3D structure reconstruction [44], or vehicle classification [45].
The aim of this work is to provide a dynamic learning rate framework for unsupervised learning models in order to adapt them for continual learning. This way, these models can learn from non-stationary distributions in a continual way, as in many real-world problems, addressing the stability and plasticity dilemma. This framework is tested with an array of experiments to cluster CORA [46], a dataset of scientific articles. While this dataset has been used to test supervised continual learning for Graph Neural Networks using the dataset’s citation network [47, 48, 49], our approach is the first application to continual learning with unsupervised competitive learning using the vocabulary metadata in the dataset. Other works for online clustering of documents use probabilistic models [50], various online variants of K-means clustering [51, 52], or fuzzy clustering [53], but no competitive learning approaches such as the self-organizing map o the growing neural gas, whose update rules update neurons (clusters) not in isolation but in neighborhoods, therefore providing a mechanism to avoid individual clusters to get stuck on low local optima [54]. Besides online document clustering, there are also works developing online clustering of images or image patches [55, 56, 57, 58], online clustering of sound samples [59], or the more generic multi-armed bandit problem [60].
Preliminaries
The three unsupervised learning algorithms used in this paper are competitive learning, Kohonen’s Self-Organizing Map (SOM), and Growing Neural Gas (GNG). They are briefly reviewed in this Section to present the notation that will be used throughout the paper.
Competitive neural networks
In a competitive learning network composed of
Then, only the winning neuron
In that rule,
where
Typically, the time steps in the training phase are grouped into a number of epochs: in each epoch, all samples from the training dataset are presented (in random order) to the training network. After the training phase, the weights are fixed, and each new input vector is assigned to the cluster corresponding to the neuron with the closest weight vector. If the training was successful, the weights of the neurons are distributed so that they represent the cluster centroids of a Voronoi tesselation. Please note that other competitive learning networks such as SOM and GNG share most of the basic setup described here for competitive networks.
The Self-Organizing Map (SOM) is an unsupervised clustering technique designed to learn a low-dimensional representation of higher-dimensional data. This representation is frequently a 2D neuron grid (but the dimensionality and size of the grid may be changed), and it aims to preserve the topology of the input data. Each neuron represents a cluster of input samples. Formally, in a SOM with a 2D grid,
For each neuron
Since the SOM is a self-organizing neural network, not only is the winner updated as in the competitive neural network, but also the rest of the neurons are adjusted, for
where
According to these training dynamics, please note that the neurons are not necessarily placed in the data’s higher dimensional space as a smooth 2D grid. At the beginning of the training, the neurons’ positions are initialized by taking the mean of a random sample of data points; the resulting initial embedding is typically a random assemblage around the dataset’s centroid. Each neuron will influence (and be influenced by) the position of their topological neighbors, and for well-behaved datasets, this will translate into the 2D grid being embedded as a smooth sheet modeling the data’s distribution, but this will not always be the case, generally speaking.
The Growing Neural Gas (GNG) is another self-organizing neural network. The main difference with the SOM is that it learns a dynamic graph with a variable number of neurons and connections instead of a two-dimensional fixed topology. This set of connections will be noted
After incrementing the age of all connections emanating from
Then the weights of the nearest unit
where every neuron
The GNG can create or remove neurons and connections. Therefore, the connection between
where
Our goal is to adapt the models presented in Section 2 to make them able to learn continuously: instead of training ahead of time to learn a specific clustering and then performing inference, the neural network should be able to adapt and change the configuration of its weight vectors to represent a different input distribution if it is required, in order to be able to learn continuously. Accordingly, the training regime should be changed so that it can be run in parallel to inference. Therefore, we propose to make the learning rate a function based on the reconstruction error.
Regarding our methodology, first, we present the modifications performed to adapt the unsupervised learning algorithms shown in Section 2 to continual learning. Second, the multi-learning rate approach is introduced, as a further adaptation to continual learning. Finally, the clustering metrics used to measure the performance are given.
Adaptation to continual learning
To achieve the goal of adapting competitive models to train indefinitely, we propose a different scaling strategy during training: instead of using a learning rate
The core idea is that if the input distribution changes, the
This new way to update the learning rate is intended to be used from the beginning of the training, with no warm-up period using the standard way described in the previous section. Different monotonic function types are considered in the experiments:
Linear: Quadratic: Inverse: Constant:
While a constant learning rate is not adaptive, it is included for comparison with the other ones. In the context of the updating rule, it only makes sense for
Note that, as described above, the
Clustering performance measures
Next, we review the clustering performance measures that will be used in this work.
The first one is the Mean Squared Error (MSE), which measures the average of the squares of the errors, that is, the average squared difference between the estimated values (weight vectors) and the actual values (input samples), lower is better:
where
The Davies-Bouldin index [61] is a well-known measure which favors compact and well-separated clusters [62, 63]. It is given by (lower is better):
where
and
The original Dunn index [64] tries to identify sets of clusters that are compact, with a small variance between members of the cluster, and well separated, where the means of different clusters are sufficiently far apart, as compared to the within-cluster variance. This index has been improved in several ways to make it more robust and computationally efficient. The particular version that we will consider here is one of those advocated in [65] (higher is better):
The Calinski-Harabasz criterion [66] measures the similarity of a sample with respect to its own cluster as compared to the separation between clusters. The distances from the cluster’s centroid
Silhouette values are often used to assess the quality of a clustering [67, 68]. They are a measure of how similar an input sample is to its own cluster (cohesion) compared to other clusters (separation). Let
Configuration values used in the experiments for the competitive learning networks modified for online learning. The first one is the type of competitive learning network. All other ones are specific to our proposal for online learning in these networks. See Section 4.2 for details
Results for competitive learning network, window size 
Results for self-organizing map, 
Results for growing neural gas, 
Results for competitive learning network, 
Results for self-organizing map, 
Results for growing neural gas, 
In this section, the computational experiments that we have carried out are described, and their results are reported.
Dataset
The proposed model has been tested to compute clusterings of the CORA dataset [46]. This dataset has 2708 samples, each consisting of metadata about a scientific article on machine learning: its citation network, its sub-field, and the presence of words within the article from a standardized vocabulary of 1433 words, obtained by stemming the text corpus, removing stop words and filtering out words appearing less than ten times [46] with absolute frequency less than 10. From this metadata, only the vocabulary data is used to compute clusterings of the dataset, considering each sample (each article) as a Boolean vector of length 1433, each value signifying the presence or absence of each word in the vocabulary. While not directly used in the training processes (since competitive learning networks are unsupervised), the dataset also assigns one of seven possible sub-fields to each paper. The sub-fields are genetic algorithms, reinforcement learning, theory of machine learning, rule learning, case-based learning, probabilistic methods, and neural networks. Sub-fields are used to compute accuracy (see next section). The dataset also includes the paper’s citation network, but citation information is not used at all in this work.
Setup
All experiments were done using the Matlab programming environment in a computer with an Intel i7 processor; because of the nature of competitive learning networks, memory requirements are modest and mostly dependent on the size of the training dataset. An array of experiments with different configurations has been run to test the proposed model. In this subsection, all the tested parameters and training modes are described.
As described in Section 3, a variety of performance metrics are used to measure the goodness of the clusterings: accuracy, Calinski-Harabasz criterion, Silhouette coefficient, Dunn’s index, mean squared error, Davies-Bouldin index, and (for self-organizing maps and growing neural gas) topographic error. Accuracy is computed with respect to the sub-fields of the samples in each cluster’s receptive field. Please note that all models are unsupervised: the sub-field of each sample is not part of the training data, and the model does not directly predict the sub-field. Instead, each cluster’s sub-field is considered to be the mode (i.e., the sub-field with the highest frequency) among the samples in its receptive field, and the cluster accuracy is the ratio of the mode (the most frequent value) to the size of the receptive field. Then, clustering accuracy in each experiment is computed as the mean of the accuracy values for each cluster.
In each experiment, a network is trained to cluster the data, with networks of three types: simple competitive learning, self-organizing map, and growing neural gas. All networks have 25 neurons (the self-organizing map is configured as a 5
As described in Section 3, four different functions (linear, quadratic, inverse, constant) are used to determine the learning rate, all with one parameter,
For the first three functions, 40 values, logarithmically spaced between 0.001 and 5. For the constant function, 30 values, logarithmically spaced between 0.001 and 1.
For the first three functions, experiments are also performed for three different sizes for the time window used to compute
Two different regimes are tested for the learning rate:
A single learning rate for all neurons: the A different learning rate for each neuron: each neuron has its own time window, and consequently its own
Finally, two different training regimes are tested:
A classical (regular) training regime, with the network training on the whole dataset during four epochs, each epoch corresponding to. Clustering performance metrics are computed after training, considering the whole dataset to compute the metrics. A continual learning regime, with the dataset randomly divided into ten separate batches of approximately 208 samples each, training the network in sequence with each batch during four mini-epochs. After each of the four mini-epochs, clustering performance metrics are computed considering just the samples for the current batch (not the whole dataset), and the training for the next batch starts without resetting the neurons. This setup simulates an environment with continual learning, where the network is concurrently trained and used to predict the clustering of incoming samples.
All combinations of the previously described options and parameter values are tested (see Table 1 for a summary of these parameters) to make sure we carry out an unbiased search over the parameter space. In particular, three different types of competitive learning networks are tested, with three well-known unsupervised learning models, in order to cover a wide range of architectural concepts in the underlying competitive models: competitive networks as an example of a model with no topology and a fixed number of neurons, SOMs as an example of a model with a fixed topology and a fixed number of neurons, and GNGs as an example of a model with dynamic topology and a variable number of neurons. A total of 100 runs are performed for every possible configuration, resulting in a total of
Summarized performance results by learning rate configuration for regular training experiments (without batches). Each row summarizes results across all performance metrics for a specific configuration of the learning rate function and learning rate mode. Within each row, each column labeled from A. to T.E. represents results for each performance metric: how many times (for all possible window sizes and network architectures) that specific learning rate configuration beats other learning rate configurations for that performance metric. The last column (all) is the sum of the previous columns. Please see the text for a more detailed description
Summarized performance results by learning rate configuration for batch training experiments. Please compare with Table 2
Figures from 1 to 6 show the curves for the mean values of each performance metric as a function of the parameter
Note also that the figures do not represent all experiments, only those with window size
For experiments without batches (Figs 1–3), best
For experiments with batch training (Figs 4–6), in contrast, it is best
Tables 2 and 3 distill the information in the plots of the previously discussed figures, summarized to highlight the relative performance of each learning rate function (linear, quadratic, inverse) with different learning rate modes (either a single learning rate or multiple ones) with respect to using a constant learning rate. Please note that each metric’s mean value for each configuration (with a specific value for parameter
Learning rate mode (either single or multi). Learning rate function (either linear, quadratic, inverse, or constant, but note that constant is independent of learning rate mode). Window size ( Network architecture (either competitive, self-organizing map, or growing neural gas).
To contextualize the significance of these best mean values: in the plots in Figs 1–6, the best mean values correspond to the top of each curve’s range along the Y axis for the first four metrics, and the bottom of the range for the other metrics (but the plots are only for
For each performance metric, the mean and standard deviation values from 100 different runs of several clustering strategies. Three classical clustering techniques (K-means clustering, K-medoids clustering, and DBSCAN), three competitive learning techniques with linearly decreasing learning rate (competitive network, SOM, GNG), and one configuration of our online method (GNG with
To build Tables 2 and 3, we add up all the times a configuration with a specific combination of learning rate mode and function wins over configurations with other combinations. Please note that 100 experiments are run for each configuration, and we use the mean value across the 100 runs to measure the number of times each combination wins over others. Table 2 summarizes results for regular training mode (without batches), and Table 3 does the same for batch training mode. There are nine different combinations of window size and network architecture, so for each metric, there are nine possible “wins” to be distributed among the seven possible combinations of learning rate mode and function (rows in Tables 2 and 3). Consequently, each column in these tables adds up to 9, except in the case of the topographic error: it only applies to the self-organizing map and growing neural gas, so each column adds up to 6.
Two conclusions can be gleaned from these tables:
Varying the learning rate Using a different learning rate for each neuron seems to be better than using a single learning rate for all neurons, as measured by accuracy, Calinski-Harabasz criterion, Silhouette coefficient, Dunn’s index, mean squared error, and Davies-Bouldin index. For all these metrics, configurations with multiple learning rates win (having the best mean value) more times than configurations with a single learning rate. In the case of topographic error, there is a draw in the number of wins.
Representation of two specific experiments, one with a self-organizing map (up) and the other with growing neural gas (down). In both cases, the topology of the network is depicted, with each neuron showing the mode of the sample sub-field for all samples in each neuron’s receptive field. If the neural network has learned a good clustering of the dataset, adjacent neurons will tend to have similar modes. Sub-field acronyms are GA (Genetic Algorithms), ReL (Reinforcement Learning), T (Theory), RuL (Rule Learning), CB (Case-Based), PM (Probabilistic Methods), and NN (Neural Networks).
Additionally, Table 4 presents a comparison between the results of one of the configurations tested in the experiments and various other clustering techniques, always with 25 clusters, since the number of clusters is 25 for the experiments with competitive networks and SOM, and at most 25 with GNG: three classical clustering techniques (K-means, K-medoids, DBSCAN) and the three competitive learning networks tested in the experiments, but with classical learning rate decay. The learning rate decays linearly from an initial value of 0.4 for the first half of the training; then, it is set at a low value (0.01) for the second half. A total of 100 experiments are run for each clustering technique: the values in the table are the mean and standard deviation over these 100 runs for each metric. Please note that for the first four performance results, larger values are better, and vice versa for the last three. While many configurations of our method perform better than the other six techniques in some metric (in isolation), we compare them against just one configuration. To select this configuration, we first filter the ones that are better than the other techniques in accuracy and MSE (since we consider these to be the most relevant metrics). This results in a still sizeable number of configurations. From these, we select the configurations with a minimal sum of ranks, where the configuration’s rank for each metric is its ordinal when all competing values in the same value are sorted from better to worse, considering directly the configurations with globally minimal rank results in configurations that do not prioritize accuracy and MSE. From these, we select for the table the one with the best accuracy: GNG network, with history size
As a graphical example of the accuracy of the clustering for experiments, we also show in Fig. 7 specific examples of two experiments, one with a self-organizing map and the other with a growing neural gas. They are plotted to show how well the topology of the network maps to sample classes (defining the class of each sample as the sub-field of the underlying article). For each experiment, the resulting neural network is represented, showing for each neuron the mode of the sample class for the samples in the neuron’s receptive field. As can be seen, there are clear clusters of sample class modes, for example, with the growing neural gas.
To adapt competitive learning models to continual learning, we have defined the learning rate as a function of the error rate during training. In this way, if the model has converged for a particular probability distribution, but the incoming samples drift to another one, then the learning rate can change dynamically to address this situation.
Thus, adapting the learning rate to the amount of training error has been shown to produce better clusterings. Furthermore, in the context of continual learning, for many of the metrics used to measure clustering performance, Table 3 shows that using multiple learning rates (i.e., a different learning rate adapted to the particular history of training errors for each neuron) can produce better results than using a single learning rate for the whole model. Table 2 shows that multiple learning rates are also better in the context of regular training regimes, even if the effect seems less intense. However, these results should be put into context: as it can be seen from Figs 1–6, it is apparent that, for most metrics, the differences in average peak performance between different learning rates and different functions
Looking at Figs 1–6 for instances of configurations with multiple learning rates outperforming their counterparts with a single learning rate, this happens more frequently for the metrics Silhouette, Dunn’s index and Davies-Bouldin, but not for Calinski-Harabasz. All these four metrics measure intra-cluster compactness and inter-cluster separation, but in different ways. In particular, the first three compute the values finding maxima and minima of various terms (the fourth doesn’t), so their values depend more on samples being extreme in some respect. For example, Dunn’s index just measures the ratio between the smallest inter-cluster distance and the largest intra-cluster distances. This is evidence that using multiple learning rates can lead to clusterings with better-behaving clusters at the extreme range of various ways to measure intra- and inter-cluster distances.
It is also remarkable that the inverse function (i.e., defining the learning rate as an inverse function of training errors) can be better for some performance metrics, most notably the Calinski-Harabasz criterion when using multiple learning rates in Table 2, and more broadly for multiple metrics in Table 3. This seemingly counterintuitive result can be explained by outliers. In the presence of outliers in the input distribution, an inverse correspondence keeps the units (neurons) that represent the outliers relatively static because such units have large training errors, as the outliers are far away from the other samples. Consequently, these units do not migrate to the large central clusters of the distribution, thereby keeping the outliers well-represented. As demonstrated in the experiments, this mechanism can greatly enhance the overall performance of the studied clustering algorithms.
Conclusions
In this work, a novel framework for continual learning has been proposed. It is aimed at dealing with non-stationary inputs for unsupervised learning models. A dynamic learning rate is designed, which depends on the current reconstruction error. It can be applied either per neuron or per network. This method can be employed for any application of unsupervised neural networks. In particular, we have chosen a scientific article unsupervised clustering problem. For this real-life problem, our approach with multiple learning rates dependent on individual neuron error rates has demonstrated that it outperforms the classic constant learning rate scheme according to a wide range of unsupervised clustering performance measures. Moreover, the number of recent samples taken into account for computing the error rates seem not to be too relevant. In fact, the best configuration found in our experiments was a GNG network with a small history size
While this work has focused on applying our proposed approach to online competitive learning to document classification, there are other active research areas where our proposal can also work, such as online clustering of images or image patches [55, 56, 57, 58], online clustering of sound samples [59], or the more generic multi-armed bandit problem [60]. Regarding future work on the model itself, there are a number of lines of future work amenable to our results. Taking advantage of the relatively inexpensive computational cost of competitive learning, negative correlation learning in dynamic ensembles [70] can provide a performance boost by tuning a clustering model with multiple sub-networks whose results are combined using a negative correlation learning rule. Also, incoming samples can be mapped into another multidimensional space where they are easier to classify, with a suitable procedure such as Neural Dynamic Classification [71]. Another possibility is to keep our proposed approach to online competitive learning as an initial unsupervised step used to feed a supervised learning approach in order to model the resulting partitions of the input space in a more effective way using a Finite Element Machine [72] or to avoid catastrophic forgetting in the latter supervised step [73].
Footnotes
Acknowledgments
This work is partially supported by the Ministry of Science, Innovation and Universities of Spain under grant RTI2018-094645-B-I00, project name Automated detection with low-cost hardware of unusual activities in video sequences. It is also partially supported by the Autonomous Government of Andalusia (Spain) under project UMA18-FEDERJA-084, project name Detection of anomalous behavior agents by deep learning in low-cost video surveillance intelligent systems. It is also partially supported by the Autonomous Government of Andalusia (Spain) under project UMA20-FEDERJA-108, project name Detection, characterization and prognosis value of the non-obstructive coronary disease with deep learning. All of them include funds from the European Regional Development Fund (ERDF). It is also partially supported by the University of Malaga (Spain) under grants B1-2019_01, project name Anomaly detection on roads by moving cameras, and B1-2019_02, project name Self-Organizing Neural Systems for Non-Stationary Environments. The authors thankfully acknowledge the computer resources, technical expertise and assistance provided by the SCBI (Supercomputing and Bioinformatics) center of the University of Málaga. They also gratefully acknowledge the support of NVIDIA Corporation with the donation of two Titan X GPUs. The authors also thankfully acknowledge the grant of the Universidad de Málaga and the Instituto de Investigación Biomédica de Málaga – IBIMA.
