Abstract
We introduce a new inductive bias for learning in dynamic event-based human systems. This is intended to partially address the issue of deep learning in chaotic systems. Instead of fitting the data to polynomial expansions expressive enough to approximate the generative functions, or inducing a universal approximator to learn the patterns and inductive bias, we assume only that the relationship between the input features and output classes changes over time. We embed this assumption through a form of dynamic contrastive learning in pre-training, where pre-training labels contain information about the class labels and time periods. We do this by extending and integrating two separate forms of contrastive learning. We note that this approach is not equivalent to inserting an extra feature containing the time period into the input data, because the input data cannot contain the label. We illustrate the approach on a recently designed learning algorithm for event-based graph time-series classification, and demonstrate its value on real-world data.
Introduction
The No Free Lunch (NFL) theorem essentially states that no one learner is superior to any other learner a priori over all tasks [45]. In practice, however, only a subset of tasks are probable for any given domain. For example, it was discovered that imperceptible perturbations to an image (i.e., a fooling image) could cause a state-of-the-art deep neural network to misclassify the image [42]. Similarly, “rubbish” examples [34] could be created from the ImageNet [12] database, in which the image contained no recognizable object, yet was classified with very high confidence by the AlexNet DNN [24]. While this did trigger valuable discussions on adversarial uses of artificial intelligence and the non-smooth decision boundaries of deep nets [42, 13], it did not call into question their broad use for image recognition because adversarial images are highly improbable. In the language of NFL, fooling and rubbish images belong to a set of tasks that is highly unlikely to occur. The analogy naturally extends to other learning algorithms and tasks. The algorithms are indeed subject to NFL, but it is highly improbable that they will actually encounter the fooling and rubbish tasks that would cause them to fail catastrophically.
However, this is not the case for human systems that exhibit high levels of competition, or natural selection. A task change is not a byproduct of simple data drift, nor a random walk, but often the result of agents changing their strategies in order to adapt to or in anticipation of new conditions. The shift is intentional, and often unpredictable. Hence, we need a learning system that assumes the current task will change. One standard approach has been regularization. Another common approach is to assume that the same task persists, but that the bias of the available learner is too strong to learn the very complicated task, and the solution consists of applying a different learner. The deep learning approach converts temporal data into an embedding that represents patterns that persist over time, i.e., invariant data structure. However, for chaotic systems, the longer the time period considered, the less persistent structure one will encounter, which offsets deep learning’s strength of leveraging large datasets. These approaches reduce inductive bias. Yet, the issue may not be a matter of reducing bias, but of incorporating properties of human evolutionary systems into the inductive bias. We propose one such approach here.
We assume that the relationship between the input features and output classes changes over time, and embed this assumption through contrastive learning (CL) in pre-training. The CL mechanism is designed to simulate data shift that occurs over state representations of human systems due to adaptation in an evolutionary environment, i.e., non-repeating states. Specifically, we devise a type of dynamic contrastive learning based on modifications of two current CL methods. First, we employ a pre-training label that is essentially a Cartesian product of the time specification and the value of the original label. This extends the use of time period as a pre-training label in time-contrastive learning (TCL) [19] to also represent label class. Second, we extend supervised contrastive learning (SupCon) [23] by adjusting the dependencies in the SupCon method to modify or eliminate forward or backward dependencies. We argue that this novel approach clusters samples according to the pre-training labels in a way that regular learning may not. If it is the case that the input-output relationship changes over time, one would need to differentiate the data according to time. While the traditional training process differentiates according to class label, with the expectation that the changing input-output relationship will be approximated in the latent state, differentiating according to time first allows for a cleaner hyperplane to be approximated. The approach also creates an interesting latent space in that the embedding does not reproduce relationships among the data strictly on the basis of past states. Rather, the latent features represent how the variables may relate to one another based on how the more recent states differ from the states of the distant past.
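To make the first modification concrete, the following is a minimal sketch of how a pre-training label could be built as the Cartesian product of time period and class label. The function name `make_pretraining_labels`, the integer encoding, and the two-period toy data (0 for distant past, 1 for recent past) are assumptions made for illustration and are not taken from our implementation.

```python
from itertools import product


def make_pretraining_labels(periods, classes):
    """Map each (time period, class label) pair to one integer pre-training label.

    Hypothetical helper: the integer ids follow the sorted Cartesian product
    of the distinct periods and the distinct class labels.
    """
    pairs = product(sorted(set(periods)), sorted(set(classes)))
    pair_to_id = {pair: idx for idx, pair in enumerate(pairs)}
    labels = [pair_to_id[(p, c)] for p, c in zip(periods, classes)]
    return labels, pair_to_id


# Toy data (made up): period 0 = distant past, period 1 = recent past.
periods = [0, 0, 1, 1, 1]
classes = [0, 1, 0, 1, 1]
labels, mapping = make_pretraining_labels(periods, classes)
print(labels)  # [0, 1, 2, 3, 3]
```

Under this encoding, two samples with the same class label but different time periods receive different pre-training labels, so they are contrasted rather than pulled together.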
Interestingly, the bi-directional tie that our approach essentially cuts out of contrastive learning is the very element that makes deep learning so effective in natural language processing, for example, but ineffective for the kind of event-based evolutionary human systems, such as geo-political applications, considered here.
To summarize, we extend TCL and SupCon, then integrate the two, to produce dynamic contrastive learning (DyCon). The modifications, while simple in terms of formula and code, are non-trivial in the sense that one would need some sort of theoretical motivation to make them. They deal with selective masking in the denominator of the SupCon function, while most contrastive learning work deals with selective masking in the numerator. Furthermore, contrasting a given class label from two different time periods is counter-intuitive unless one assumes that evolutionary pressures induce agents that mix their strategies.
We illustrate our approach on GT-CHES, an event-based graph time-series classification method for learning within human evolutionary systems [20]. GT-CHES uses a feature set that models the underlying structure of human networks and the dynamic nature of such structure. It has been shown to give promising results in the context of early violence detection [21]. We demonstrate how dynamic contrastive learning further enhances the predictive power of GT-CHES on real-world data. The paper is organized as follows. Section 2 discusses related work on time-series classification. Section 3 introduces our proposed dynamic contrastive learning approach. Section 4 details our experimental setup and reports on a number of experiments with several data sets across a number of approaches and measures of performance. Finally, Section 5 concludes the paper and discusses future work.
Related work
A survey of many traditional time-series classification algorithms, including whole-series comparisons, shapelets, dictionary-based methods, and model-based approaches, can be found in [2]. Some recent developments have focused on more advanced uses and/or compositions of these various mechanisms, including symbolic Fourier approximation [39, 38] and random convolutional kernel transform [11], but they introduce no fundamentally different inductive biases.
Deep learning has long been applied to design state-of-the-art solutions to time-series applications. Early on, recurrent neural networks [17] were formulated to process time-series data serially. This approach is more likely to converge efficiently to a suitable solution than a traditional multi-layer perceptron. It maintains its power as an Euler approximation of time-dependent ordinary differential equations [22], but does not overcome the exploding/vanishing gradient problem [3], especially when dealing with chaotic time-series [31]. Convolution was later introduced as a locality inductive bias. This narrowed the search space to finding relationships among local features [28, 48, 41]. More recently, the notion of attention [43] allowed for estimating more global relationships across a time-series – when considering the data point
Analytical approaches have also been used for time-series estimation. These generally start with strong inductive biases and are then extended to become more expressive. Exponential smoothing, for example, has only three parameters (error, trend, and seasonality) [16, 18], which may not be sufficient for chaotic systems. Gaussian processes that include a non-linear component have been suggested in this case [10]. Other more expressive functions have been used to fit time-series data, including Legendre polynomials [44], Fourier expansions [46, 9], and polynomial expansion coefficients [36]. While these avenues of research see progress towards modeling time-series of greater complexity, they still possess a bias for fitting data according to similarity measures, rather than imposing some kind of differentiation or contrast between samples from different time periods.
The notion of contrastive learning has been used for this purpose. Inspired by non-linear independent component analysis, time-contrastive learning [19] was formulated as a way to generate representations of time-series data based on time period. For a given time period, samples from within the time period are considered positive samples while samples from other time periods are considered negative samples. The resulting contrastive learning embeds the samples into a space that discriminates according to time period. Time-contrastive networks [40] take a similar approach. Given multiple views of a sequence (for example, different camera angles recording an event), contrastive learning is applied to uncover what is different over distinct time-steps for the same view, while detecting attributes that do not change over each single viewpoint. Alternatively, one can capture information that is relevant across the entire time series, while discarding information that is specific to a time step, and thus considered to be noise. Contrastive predictive coding [35] filters global structure from local noise by estimating the similarity/difference between different steps in the time series. The dot-product operation in the contrastive learning loss, rather than cross-entropy, offers this particular functionality. Interestingly, it is used in this method to learn mutual information across all time points, while our method does quite the opposite in creating an embedding based on how the time steps diverge. Furthermore, none of the current approaches address the non-repeating state/distribution problem.
Dynamic contrastive learning
We use supervised contrastive learning (SupCon) [23] as the base contrastive learning architecture. Roughly speaking, SupCon maps an input
where
Here, we apply SupCon to event-based data and extend it by incorporating the actual labels
The interested reader can obtain the code from the corresponding author upon request.
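Independently of that code, the following is a minimal sketch of a SupCon-style loss with an optional mask on the denominator terms. It uses the standard SupCon form (normalized embeddings, temperature-scaled dot products, averaging over positives). The `denom_keep` hook, the choice to always retain positives in the denominator, and the example rule that drops samples from later time periods are assumptions made for illustration; they show one possible way to remove forward dependencies and are not the exact DyCon formulation.

```python
import math


def normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supcon_loss(embeddings, labels, temperature=0.1, denom_keep=None):
    """SupCon-style loss with an optional mask on the denominator.

    denom_keep(i, j) -> bool is a HYPOTHETICAL hook deciding whether sample j
    contributes to the denominator for anchor i. Positives are always kept.
    """
    z = [normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        sims = [sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n)]
        denom_idx = [
            j for j in range(n)
            if j != i
            and (j in positives or denom_keep is None or denom_keep(i, j))
        ]
        # Log-sum-exp over the retained denominator terms, stabilized by the max.
        m = max(sims[j] for j in denom_idx)
        log_denom = m + math.log(sum(math.exp(sims[j] - m) for j in denom_idx))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy data (made up): two periods, pre-training labels already combined.
periods = [0, 0, 1, 1]
pretrain_labels = [0, 0, 2, 2]
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

full = supcon_loss(emb, pretrain_labels)
# Assumed rule: an anchor ignores samples from later periods ("forward" terms).
no_forward = supcon_loss(
    emb, pretrain_labels, denom_keep=lambda i, j: periods[j] <= periods[i]
)
```

Because the mask only removes terms from the denominator, the masked loss can never exceed the unmasked one on the same batch; what changes is which samples each anchor is pushed away from.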
Since it has proven effective in event-based applications, such as early violence detection [21], we use the Graph Transformation for Classification in Human Evolutionary Systems (GT-CHES) [20] to illustrate our approach. GT-CHES first converts event data into dynamic graphs. It then extracts network- and game-theoretic measures from each graph, including fairness [25], goodness [25], eigenvector centrality [5, 4], structural balance [15, 7], assortativity [33, 26], and trust propagation [14] (feature subset
We have typically used a bagging ensemble of 15 random forest trees [20]. To accommodate the DyCon loss function and pre-training, we use a fully-connected neural network (FCNN) with 4 layers here. The first two layers act as an encoder block while the last two layers are the classifier block. This distinction between layer blocks is necessary, because the encoder is pre-trained using the DyCon loss function. This process involves a) creating a pre-training label set
Data sets
Our event-based data comes from the Integrated Crises Early Warning System Project (ICEWS) [6], where events have been extracted automatically from English-language news reports, and consist of interactions between socio-political actors for various regions of the world at various times. Events can thus be represented by a quadruple
Prequential training
The time-series training and validation process differs from that of a standard machine learning process. For a standard problem, a method such as cross-validation simulates the permutations of samples acting as members of the training and testing sets. The assumption is that each sample is equally likely to be a training or test sample. For temporal problems, this assumption does not hold since it does not effectively simulate the process of training on data before a time period
Prequential training [32] in time-series validation simulates the real-world training and testing process. To make a prediction at
Experimental setup
Our data set spans January 2001 through March 2014 and is discretized by month. January 2012 will be the first test time period. For each month there is a vector of 37 features and a label for each of the 162 countries in the data set. The label is 1 if the event associated with the task occurred for the corresponding country, and 0 otherwise.
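As an illustration of the monthly discretization, the sketch below enumerates expanding-window train/test splits over the stated date range, training on all months strictly before each test month. The helper names `month_range` and `prequential_splits` are hypothetical, and any forecast horizon or gap between the training window and the test month is deliberately not modeled here.

```python
def month_range(start, end):
    """List (year, month) tuples from start to end, inclusive."""
    year, month = start
    months = []
    while (year, month) <= end:
        months.append((year, month))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


def prequential_splits(months, first_test):
    """Yield (train_months, test_month) pairs with an expanding training window."""
    start = months.index(first_test)
    for t in range(start, len(months)):
        # Train on everything strictly before the test month.
        yield months[:t], months[t]


months = month_range((2001, 1), (2014, 3))
splits = list(prequential_splits(months, first_test=(2012, 1)))
print(len(months), len(splits))  # 159 27
```

With these dates there are 159 monthly periods in total, of which the last 27 (January 2012 through March 2014) serve as test periods.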
Recall that our assumption is that data from the recent past may differ from data from the distant past. That is, the task is changing. Given a time step
All experimental models have the same 4-layer fully-connected architecture. As stated above, the first 2 layers act as an encoder while the last 2 form the classifier block. We use the following models to assess the validity of our approach.
FCNN. The standard 4-layer architecture with no pre-training. This baseline allows us to test whether pre-training is helpful for event-based time-series. It represents the “vanilla” GT-CHES.
Time-contrastive learning (TCL) [19]. FCNN with first two layers pre-trained from assigning label 0 to all members of
DyConI. This version of DyCon allows us to test whether combining time period and label information outperforms using them in separate ways (i.e.,
DyConII. This extended version allows us to test whether eliminating the bi-directional relationship is more effective.
Figure 1: Comparison of training for different architectures. Green nodes represent nodes being trained. Grey nodes are frozen. Red nodes are output nodes. The FCNN in Fig. 1a has no pre-training phase. For the live architecture in Fig. 1b, the first two layers are pre-trained. The entire network is then trained during the training phase. For the freeze architecture in Fig. 1c, the first two layers are pre-trained. These layers are frozen during the training phase. Only the last two layers are trained during that phase.
We would also like to know whether pre-training the encoder results in an embedding that is superior to a random or overfit embedding. To test that hypothesis, we train all models (other than FCNN) in the following two ways:
freeze: The encoder’s weights update during pre-training, but are then frozen. During training, the encoder projects the inputs as it was trained by the contrastive learning mechanism, and the classifier makes predictions based on that embedding.
live: The encoder’s weights update during both pre-training and training. During training, the errors backpropagate through the entire architecture.
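The difference between the two modes can be sketched as a single parameter-update step in which the encoder layers are either skipped or included. The layer names, the plain gradient-descent update, and the flat weight lists are assumptions chosen to keep the sketch self-contained; they do not reflect the actual optimizer or network code.

```python
ENCODER_LAYERS = ("layer1", "layer2")      # pre-trained with the contrastive loss
CLASSIFIER_LAYERS = ("layer3", "layer4")   # trained on the original labels


def sgd_step(params, grads, mode, lr=0.01):
    """Return updated parameters for one supervised training step.

    mode="freeze": only the classifier block is updated.
    mode="live":   all four layers are updated.
    """
    if mode not in ("freeze", "live"):
        raise ValueError(f"unknown mode: {mode}")
    if mode == "freeze":
        trainable = CLASSIFIER_LAYERS
    else:
        trainable = ENCODER_LAYERS + CLASSIFIER_LAYERS
    updated = {}
    for name, weights in params.items():
        if name in trainable:
            updated[name] = [w - lr * g for w, g in zip(weights, grads[name])]
        else:
            updated[name] = list(weights)  # frozen: copied unchanged
    return updated


# Toy parameters and gradients (made up).
layer_names = ENCODER_LAYERS + CLASSIFIER_LAYERS
params = {name: [1.0, 1.0] for name in layer_names}
grads = {name: [0.5, -0.5] for name in layer_names}

frozen = sgd_step(params, grads, mode="freeze")
live = sgd_step(params, grads, mode="live")
```

In the freeze mode the classifier must work with whatever embedding pre-training produced, which is what allows the comparison to say something about the quality of that embedding.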
Figure 1 illustrates which nodes are trained during training and pre-training. The graph transformation features from GT-CHES are fed into the different architectures. The green nodes are trained; the grey nodes frozen; and the red nodes are output nodes.
Finally, we score the methods using 1) area under the receiver operating characteristic curve (AUC), which favors models that identify negative (no event) cases; 2) area under the precision-recall curve (AUPR), which favors models that have high accuracy among their positive predictions; 3) Brier, which favors a bi-modal distribution of predicted probabilities; 4) accuracy; and 5)
As effective performance on one metric may not translate to effective performance on other metrics, and likewise, effective performance on a given label set may not translate to effective performance on other label sets, we use a ranking to allow global comparison. We compute individual rankings for metrics and tasks, as well as an overall ranking, as follows.
Metric Ranking: For each metric, we rank the methods by performance on each data set. This is a percentile ranking: when ranking four methods, the top method scores 100%, the second best 75%, the third 50%, and the last 25%. We are interested in which methods perform well according to the individual metric rankings. For an overall ranking, the percentile rankings are then summed by method and normalized.
Task Ranking: We do a similar ranking by label data set. For each label data set, a ranking of each metric is established. The metric rankings are summed and normalized, providing an overall ranking for the data set.
Overall Ranking: There are two approaches to an overall ranking. First, we sum the overall metric rankings. Second, we sum the overall task rankings. Note that due to the limits of ranking [1], it is possible for the rank orderings to be different for these two approaches.
Table 1 shows the ranking of methods by metrics, sorted by overall ranking.
Table 1: Ranking of models by metrics, sorted by overall ranking
Table 2: Ranking of methods by data set, sorted by overall ranking
The results show that the DyCon methods with weight freezing outperform all other approaches in the rankings. DyConII-freeze obtains the highest rank overall, which suggests that its embedding captures information relevant to adjusting to data shift. For both DyCon methods, the frozen encoder also outperforms the live encoder, which suggests that there is value in locking in the bias and further supports the effectiveness of the imposed bias. During training, freeze updates only 2 layers while live updates all 4, so when freeze performs better, the frozen pre-trained encoder must be better than the non-frozen encoder in live. The null hypothesis is that the frozen encoder block outputs a random or even overfit embedding. If that were the case, however, the classifier block of the freeze model could not outperform the live model, since the classifier block alone is less expressive than the full encoder-classifier model and would be working from a random or overfit embedding.
Table 2 shows the ranking of methods by data set, sorted by overall ranking.
Again, the results show the value of the DyCon methods, with DyConII-freeze obtaining the highest rank overall. There is one significant exception, however. While TCL-freeze ranks lowest on all of the other data sets, it ranks first on ILC, with the best DyCon method coming in only third. As it happens, ILC positive cases never repeat over successive steps. Hence, a positive prediction would necessitate assuming that time steps differ. TCL-freeze is the only method that applies contrastive learning based only on period. Its sister method, TCL-live, which is allowed to train with live gradients, did not perform quite as well, but was the next best method on ILC.
We have introduced dynamic contrastive learning as a novel form of pre-training, where we embed the training data into a higher dimensional space during pre-training with a new set of labels in which the class label is at least partially determined by the time period in which it occurred. This embedding then becomes the input for training against the regular label set. It is noteworthy that the very element that makes deep learning so effective on applications such as natural language processing is the one our proposed approach cuts out of contrastive learning to make it effective on event-based evolutionary human systems, such as geo-political applications. A number of experiments over several data sets and across a variety of methods, using GT-CHES as a testbed, showed that dynamic contrastive learning ranks first over a number of relevant metrics, demonstrating its potential in improving the performance of other time-series algorithms.
While our results are promising, we hope that our work spurs further research into this new type of inductive bias. Two avenues of research worth pursuing as a direct result of this work are as follows. First, our results show that TCL-live and
