Abstract
Recurrent Neural Networks (RNNs) represent a natural paradigm for modeling sequential data such as text written in natural language. RNNs and their variations have long been the architecture of choice in many applications; in practice, however, they require elaborate architectures (such as gating mechanisms) and computationally heavy training processes. In this paper we address the question of whether it is possible to generate sentence embeddings via completely untrained recurrent dynamics, on top of which a simple learning algorithm is applied for text classification. This would make it possible to obtain models that are extremely efficient in terms of training time. Our work investigates the extent to which this approach can be used by analyzing the results on different tasks. Finally, we show that, within certain limits, it is possible to build extremely efficient models for text classification that remain competitive in accuracy with state-of-the-art reference models.
Introduction
When dealing with natural language in the form of written text, a convenient way of representing the input that also somehow resembles how humans read is simply as a sequence of words or characters. In fact, in the context of neural networks the de-facto standard for Natural Language Processing (NLP) has long been that of Recurrent Neural Networks (RNNs) and their variations [3, 41]. After being properly trained, RNNs have the peculiar ability of transforming a whole input sequence into a final vector of a given, fixed size. In the context of NLP, where the input is usually a sequence of words, this final vector often encompasses the semantics of the input, enclosed in a fixed size representation that is called sentence embedding. The sentence embedding can then be conveniently used to feed non-recurrent machine learning models.
The typical methods for tuning the parameters of a RNN, such as backpropagation through time, suffer from the well-known problems of vanishing and exploding gradients, which can make these kinds of networks difficult to train in the presence of long input sequences [5]. Because of this limitation, several approaches have gained popularity for their ability to avoid or alleviate the problems associated with the propagation of the gradient during training. Long Short-Term Memory (LSTM) [22] and Gated Recurrent Unit (GRU) [13] are two notable examples of architectural variations of RNNs, in which gating mechanisms are employed to make the network selectively remember and forget by regulating the flow of information through each time step, helping to alleviate the vanishing of the gradient.
Recently, the Transformer model [47] has gained traction for its characteristic non-recurrent architecture, which makes it possible to process input sequences using only self-attention mechanisms. In these kinds of models, training is still performed by backpropagation. Increasingly often, transfer learning techniques are used to share knowledge between models by first training a task-independent language model over a large variety of text corpora (for example by employing an autoencoder or a classifier with a next-step prediction task) and then using these learned parameters as part of potentially many different specialized models.
The development of these techniques has been driven mainly by the opportunity to obtain strong predictive performance, while less attention has been paid to computational performance. In fact, these techniques (and their tendency to require hundreds of millions of parameters) can lead to a significant cost in terms of training time [39]. What may seem only a matter of training speed actually has several kinds of repercussions, such as economic availability, financial costs, and environmental impact. As an example, it has been estimated that training a Transformer model can produce about 87 kg of CO2 when considering commonly used hardware and cloud computing services, with a financial cost between $289 and $981 in US dollars [44]. It then becomes important to carefully rethink whether the current methodology is the only possible path for achieving state-of-the-art results, or whether radically different approaches can instead be used to obtain a higher level of efficiency.
There exist works that focus on performance by relying on non-neural architectures (for example tree kernels [52, 53]), but in this paper we investigate this research question by proposing an approach for text classification that is based on the widely used RNNs. Differently from the typical RNN approaches, however, we make use of the efficient training methodology that characterizes Echo State Networks (ESNs) [23, 24] from the framework of Reservoir Computing (RC) [35, 49]. In particular we propose the use of advances in the architectural setup of ESNs [18] to produce, by means of randomly initialized and untrained weights, a fixed-size embedding for the input text which can then be used for classification tasks. Since this kind of network is largely kept untrained, we are able to achieve a strikingly fast training process. We experimentally assess the feasibility and the performance of our approach by evaluating it on different text classification tasks and comparing the results against our own baseline and state-of-the-art models from the literature. We will also present and evaluate variations of the main model (for example by including an attention mechanism); moreover, we will investigate whether it is possible to use the concept of reservoir transfer to further improve the efficiency of our methodology by reusing parts of a model in the context of tasks that are different from the original one.
This paper is structured as follows. We briefly introduce the characteristics and advantages of the ESN model in Section 2, where we also address advances in the design of recurrent connections. In Section 3 we present the proposed models, which we then validate on the text classification datasets described in Section 4. Our experiments and methodology are reported in Section 5, while a discussion of the results is presented in Section 6. Finally, in Section 7 we draw the conclusions of this study.
Background
Since our work is framed in the context of RC, and in particular of ESNs, this section will introduce the reader to the main concepts of this framework.
ESNs are variants of RNNs, and as such they represent a useful and effective method for modeling sequences. Architecturally, ESNs and RNNs have many similarities, while the differences among these models are mostly limited to different training methodologies.
The typical and widespread method for training RNNs and their variants involves the use of gradient descent algorithms such as backpropagation through time. Unfortunately, these kinds of methods have well-known weaknesses, such as the issues associated with the propagation of the gradient (as briefly discussed in Section 1) and the fact that they can be costly to run. In contrast, radically different approaches from the RC paradigm involve the stable initialization of the recurrent dynamics, so that the training of the parameters in the recurrent part of the network can be avoided altogether. ESNs [23, 24] represent one instance of this approach. In ESNs, an untrained dynamical system with randomly initialized parameters, called the reservoir, is used to compute the state of the network at each time step. Then, the output is typically extracted from the state of the reservoir by means of simple linear regression [35] (even though other kinds of approaches can be used): ESNs are thus an efficient approach to modeling and training RNNs.
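To give a concrete sense of why a linear readout is cheap to fit, the following minimal sketch solves a regularized least-squares (ridge) problem in closed form, using only the Python standard library. The function name `ridge_readout`, the argument layout and the regularization symbol `lam` are illustrative assumptions, not the authors' implementation; a practical version would rely on an optimized linear algebra library.

```python
def ridge_readout(states, targets, lam):
    """Solve (S^T S + lam * I) W = S^T Y for W.

    states:  n state vectors (lists of d floats), one per training example
    targets: n target rows (e.g. one-hot labels), each a list of k floats
    lam:     ridge regularization strength (assumed > 0)
    returns: d x k weight matrix as a list of rows
    """
    n, d, k = len(states), len(states[0]), len(targets[0])
    # Build the augmented system [S^T S + lam * I | S^T Y].
    aug = []
    for i in range(d):
        row = [sum(s[i] * s[j] for s in states) + (lam if i == j else 0.0)
               for j in range(d)]
        row += [sum(states[m][i] * targets[m][c] for m in range(n))
                for c in range(k)]
        aug.append(row)
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(d):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    # The right-hand block now holds the solution W.
    return [row[d:] for row in aug]


# Toy example: two orthogonal "states" and scalar targets.
weights = ridge_readout([[1.0, 0.0], [0.0, 1.0]], [[1.0], [2.0]], lam=1.0)
```

No iterative optimization is involved: the cost is dominated by building and solving one linear system whose size depends on the state dimension.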
To set the stage for a mathematical description, let us consider a generic RNN model (this also applies to ESNs). For the sake of notation, in what follows let us denote with T the length of a generic input sequence. Similarly, whenever a specific sequence i is considered, let us denote its length with T_i. Moreover, we will denote with symbols N_U, N_R and N_Y respectively the size of the input layer, the number of hidden recurrent units (i.e. the size of the state vectors), and the number of output units in the RNN. Any generic RNN, given an input sequence composed of vectors
The trajectory of an ESN in the state space is ruled by a non-linear equation which depends on the input and on the previous most recent state. In the case of leaky-integrator neurons [25] and hyperbolic tangent activation functions, the state at time t is computed as per the following equation:
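As an illustration of this kind of state update, the following minimal sketch assumes the commonly used leaky-integrator form x(t) = (1 - a) x(t-1) + a tanh(W_in u(t) + W x(t-1)). The names `w_in`, `w_hat` and `leak` are assumptions introduced here for illustration, and any bias term is omitted; the sketch is not meant to reproduce the exact notation of Equation 1.

```python
import math


def leaky_esn_step(x_prev, u, w_in, w_hat, leak):
    """One leaky-integrator update of the reservoir state.

    x_prev: previous state (list of N_R floats)
    u:      current input vector (list of N_U floats)
    w_in:   input-to-reservoir weights (N_R rows of N_U floats)
    w_hat:  reservoir-to-reservoir weights (N_R rows of N_R floats)
    leak:   leaking rate a in (0, 1]
    """
    n_r = len(x_prev)
    x_new = []
    for i in range(n_r):
        pre = sum(w_in[i][j] * u[j] for j in range(len(u)))
        pre += sum(w_hat[i][j] * x_prev[j] for j in range(n_r))
        x_new.append((1.0 - leak) * x_prev[i] + leak * math.tanh(pre))
    return x_new


def run_reservoir(inputs, w_in, w_hat, leak):
    """Feed a whole sequence through the reservoir, starting from a zero state."""
    x = [0.0] * len(w_hat)
    states = []
    for u in inputs:
        x = leaky_esn_step(x, u, w_in, w_hat, leak)
        states.append(x)
    return states
```

None of the weights are updated while the sequence is processed, which is what keeps this part of the model free of any training cost.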
In contrast to what would typically happen in a RNN, in an ESN the values in the weight matrices
Intuitively, a properly initialized reservoir (i.e. one that exhibits the ESP) will push similar sequences towards close regions in the state space. Similarity is implicitly suffix-based, and this organization of the state space is at the core of the intrinsic discrimination capabilities of ESNs [20, 45].
After an input sequence has been fed into the ESN, the states produced by the reservoir can be used to compute the final output of the network. Given the typically high dimensionality of the non-linear reservoir, from Cover’s theorem it follows that the states of the ESN are with high probability linearly separable [14]. Hence, we can use a simple linear layer (“readout”) to perform the classification. In that case, the output
We can remain consistent with the definition of ESN but also try to further simplify the computation in order to reduce the usage of resources. Specifically, the high-level description of the evolution of the dynamical system can remain the one of Equation 1, but the structure of the reservoir-to-reservoir matrix
This particular configuration of the reservoir-to-reservoir matrix is called “multi-ring” from its characteristic pattern of connections between the recurrent units, as illustrated in Fig. 1. A multi-ring reservoir has several advantages, like the fact that large reservoirs will have a very efficient state transition cost. In fact, the matrix-vector multiplication

Representation of a multi-ring reservoir. On the left, the permutation matrix
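The efficiency argument for this kind of reservoir can be sketched as follows, under the assumption that the reservoir-to-reservoir matrix is a permutation matrix scaled by a single factor (called `rho` here). With that structure, the recurrent term reduces to a reordering of the previous state, so it costs a number of operations linear in N_R instead of quadratic. The function names are hypothetical, and the dense version is included only to check the equivalence.

```python
import random


def random_permutation(n_r, seed=0):
    """Draw a random permutation of the N_R reservoir units."""
    rng = random.Random(seed)
    perm = list(range(n_r))
    rng.shuffle(perm)
    return perm


def recurrent_term_sparse(x, perm, rho):
    """Compute W x in O(N_R), where W[i][perm[i]] = rho and all other entries are 0."""
    return [rho * x[j] for j in perm]


def recurrent_term_dense(x, perm, rho):
    """Same product through an explicit N_R x N_R matrix, for comparison only."""
    n_r = len(x)
    w = [[rho if perm[i] == j else 0.0 for j in range(n_r)] for i in range(n_r)]
    return [sum(w[i][j] * x[j] for j in range(n_r)) for i in range(n_r)]
```

A permutation of the units decomposes into disjoint cycles, which is one way to read the "rings" in the name of this configuration.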
We have considered two different possibilities for exploiting the properties of RC for text classification. In both cases, we propose the use of a bidirectional recurrent architecture [8, 42] to transform the input into a useful intermediate representation. In the following sections we present in detail the models that we designed, all of which adopt an ESN for the recurrent module but use different implementations for the readout. Specifically, the first model (Bi-ESN) uses a simpler readout component and is described in Section 3.1. The second model (Bi-ESN-Att), which includes a self-attention mechanism, is presented in Section 3.2. The basic versions of Bi-ESN and Bi-ESN-Att are designed to be applied to classification tasks that exhibit a single sentence as input. In Section 3.3, we show how to extend our models for handling tasks in which a classification must be performed about the interaction between two different input sentences.
Bi-ESN
We define Bi-ESN as a pure RC model for the classification of textual input. In this model, based on a leaky ESN and previously introduced in [18], the recurrent component is implemented by a bidirectional orthogonal (multi-ring) reservoir. The state vector produced by the reservoir is intended to serve as a fixed size embedding of the whole input text, or sentence, as illustrated in Fig. 2. In practice, this sentence embedding

Architecture of Bi-ESN. On the bottom, the input words (or tokens) are transformed to real-valued vectors via pretrained word embeddings, then they are fed through the bidirectional reservoir of the leaky ESN. A sentence embedding is extracted from the state of the reservoir by concatenating the final states for each direction, and then fed to the linear layer for classification. The only parts of the model that undergo training are represented by a dark, solid background: in this case, only the final linear layer.
In order to train the linear classifier it is necessary to collect the sentence embeddings for the data and then apply the ridge regression algorithm. If we denote with
The Bi-ESN from Section 3.1 is a model that completely fits the RC framework, in the sense that we can distinguish an untrained recurrent reservoir and a linear readout. Here we describe Bi-ESN-Att, another RC model which, however, uses a more sophisticated readout that in principle should be able to capture more information from the states produced by the reservoir. In short, this model (previously introduced in [18]) applies a self-attention mechanism to a bidirectional multi-ring ESN, thus combining techniques from both RC and gradient descent training. As shown in Fig. 3, Bi-ESN-Att is designed to make use of all the states produced by the reservoir, in both the forward and backward directions, and not just the final ones as is the case for Bi-ESN. In fact, at each time step the states from the forward and backward directions are concatenated to produce a new sequence where each item is a vector of size 2N_R. Since this sequence may be large in terms of memory, each item also goes through a linear layer with the purpose of reducing the vector dimensionality to N_D. If

Architecture of Bi-ESN-Att. On the bottom, the input words (or tokens) are transformed to real-valued vectors via pretrained word embeddings, then they are fed through the bidirectional leaky ESN. For each time step, the corresponding forward and backward states are concatenated (shaded boxes) and then are fed to a linear layer that performs dimensionality reduction. Finally, the self-attention mechanism selects the most important states, which are merged together via a weighted sum and then fed to a linear classifier. The only parts of the model that undergo training are represented by a dark, solid background.
Once the sequence
As before, let T be the length of the input sequence. Let also
As it can be noticed from Equations 11 and 12, all parameters are independent from the length of the sequence and, in fact, are shared between the time steps. After the attention scores
Note that all free parameters can be trained end-to-end by gradient descent and backpropagation. Note also that here the gradient only flows through a short path of fixed length, as opposed to the arbitrary unfolded depths that can occur in standard RNNs. Thanks to this, the model does not suffer from the vanishing gradient problem.
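The merging step of the attention readout, in which the states are combined via a weighted sum, can be sketched as follows. This is a minimal illustration that takes the per-step attention scores as given and assumes they are normalized with a softmax; the scoring function itself and the names `softmax` and `attention_pool` are not taken from the model definition.

```python
import math


def softmax(scores):
    """Normalize raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(states, scores):
    """Weighted sum of T state vectors, using one attention weight per time step."""
    weights = softmax(scores)
    dim = len(states[0])
    return [sum(w * h[i] for w, h in zip(weights, states)) for i in range(dim)]


# With equal scores the pooled vector is the plain average of the states.
pooled = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

The same two functions work for any sequence length T, and the result always has the dimensionality of a single state.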
In addition to the problem of classifying a single sentence, another typical kind of task in NLP is the one of textual entailment recognition, in which two sentences are compared to identify whether the meaning of the first one entails (or contradicts) the meaning of the second one. In the experiments we will take the opportunity to analyze our approach with respect to a textual entailment recognition task: this will allow us to demonstrate how our method and architecture can be adapted to different input formats, and at the same time to highlight the current limitations of our approach when dealing with instances of NLP problems for which the bias of our models may not be adequate (as it will be discussed in Section 6.3). For these reasons, in this section we show how each of the previously presented models can be extended to the case of textual entailment recognition tasks.
Bi-ESN. In the case of Bi-ESN, the extension is straightforward: a sentence embedding is produced for each of the two inputs (premise and hypothesis), and the two are then combined for classification. In particular, the premise and the hypothesis are processed independently by the same model to produce two sentence embeddings
Bi-ESN-Att. The extension is also straightforward in the case of Bi-ESN-Att: as before, we use the same ESN to produce state vectors independently for the premise and the hypothesis. The only difference from the Bi-ESN-Att described in Section 3.2 is the self-attention layer, which must be adapted to handle the states from both the premise and the hypothesis. We follow the approach proposed in [34], which by using shared weights computes vectors
In this section we briefly describe the major characteristics of the datasets used in our experiments. Specifically, in Section 4.1 we describe the TREC dataset which supports the Question Classification task. In Section 4.2 we describe another text classification dataset (SMS Spam Collection) that deals with very short messages with highly unbalanced classes. Finally, in Section 4.3 we describe a natural language inference dataset that we use for validating our models on a task that involves a pair of input sentences instead of just one sentence.
TREC
A commonly used benchmark for evaluating Natural Language Processing systems is the TREC dataset for Question Classification [33]. The TREC dataset deals with the task of classifying a number of input sentences, written in English, into one of 6 classes that indicate their broad topic (i.e. whether they ask about an abbreviation, a location, a number, a human being, a description or an entity). While the dataset also contains more detailed fine-grained classes, here we only focus on the 6 commonly used coarse-grained classes.
To support our model validation methodology, the dataset has been split into three folds: training, validation and test. The test fold is directly provided by the authors of the dataset [33] and contains 500 labeled questions. The other fold provided by the authors of the dataset, composed of 5452 labeled questions, was used partly for training and partly for validation. Specifically, we have split this fold by the commonly used “80/20 rule”, where 80% of the instances (chosen at random) are used for training and the other 20% for validation. This yields a training set of 4362 questions and a validation set of 1090 questions, with similar class distributions between the two sets (we did not perform an explicit stratification).
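The arithmetic of this split can be checked with a short sketch. The shuffling seed and the choice of rounding 80% of 5452 to the nearest integer are assumptions made here; they are consistent with the reported fold sizes but are not stated in the text.

```python
import random


def split_80_20(examples, seed=0):
    """Random, non-stratified 80/20 split into training and validation folds."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = round(0.8 * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Stand-in for the 5452 labeled questions of the original training fold.
train_fold, validation_fold = split_80_20(range(5452))
```

Since no stratification is applied, the class proportions in the two folds are only approximately equal and depend on the random shuffle.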
We have performed tokenization of the input questions, so that we could assign a word embedding to each token. In particular, we represented each token by a pretrained FastText embedding vector for the English language, with 300 dimensions [21]. Whenever a word that does not have a corresponding embedding in FastText is encountered, we use a random vector of the same shape instead. This vector is different for each missing word. While the NLP community is going towards context-sensitive word embeddings (see for example BERT [17]), in the current setting we chose FastText for its relative efficiency.
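The handling of missing words can be sketched as a lookup with a random fallback. The class name `EmbeddingLookup`, the uniform distribution in [-1, 1] and the choice of caching one vector per missing word (so that repeated occurrences share it) are assumptions made for this illustration; the text only states that each missing word receives its own random vector of the same shape.

```python
import random


class EmbeddingLookup:
    """Pretrained word vectors with a per-word random fallback for missing words."""

    def __init__(self, pretrained, dim=300, seed=0):
        self.pretrained = pretrained  # dict: word -> list of `dim` floats
        self.dim = dim
        self._rng = random.Random(seed)
        self._fallback = {}

    def __call__(self, word):
        if word in self.pretrained:
            return self.pretrained[word]
        if word not in self._fallback:
            # Assumed distribution: uniform in [-1, 1] for every component.
            self._fallback[word] = [self._rng.uniform(-1.0, 1.0)
                                    for _ in range(self.dim)]
        return self._fallback[word]


# Toy vocabulary with a single known word.
lookup = EmbeddingLookup({"who": [0.0] * 300})
```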
SMS Spam Collection
The SMS Spam Collection (v.1) dataset [2] is a relatively small corpus composed of 5574 SMS messages written in English, each of which is annotated as either spam or not-spam (ham). Because of the nature of the corpus, messages are very short, with a maximum length of 910 characters for this dataset.
Following the same methodology as [2], the dataset has been split into two parts, where the last 70% of the messages (3900 examples) was used as a test set. The other 30% of the messages has been further divided into a training set composed of the first 1000 messages and a small validation set, used for model selection, composed of the remaining 674 messages. In order to preserve the same training and test data as [2], we did not perform any preliminary shuffling of the order of the examples. However, the ratio of the two classes of examples is roughly the same across all three dataset folds. In particular, the classes are heavily unbalanced, since the whole dataset is composed of 4827 negative examples (not-spam) and just 747 positive examples (spam). As in Section 4.1, each message is tokenized and each token is represented by a pretrained FastText embedding vector for English, with 300 dimensions.
Given the strong imbalance of the binary labels in this particular dataset, when evaluating the performance of our models we also compute, in addition to the accuracy, the F1 and Matthews Correlation Coefficient (MCC) scores. The former ranges from 0 (worst) to 1 (best), while the latter ranges from -1 (worst) to 1 (best). The MCC score is particularly informative in the case of highly unbalanced binary labels.
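To illustrate why accuracy alone can be misleading here, the sketch below computes the three scores from the entries of a binary confusion matrix, using their standard definitions. Returning 0 when a denominator is zero is a common convention adopted here as an assumption. The example applies the functions to a degenerate classifier that labels every message as not-spam, using the class counts of the whole corpus.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Accuracy, F1 and MCC from true/false positive/negative counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return accuracy, f1, mcc


# Always predicting "not-spam": 747 spam messages missed, 4827 ham messages correct.
acc, f1, mcc = binary_metrics(tp=0, fp=0, fn=747, tn=4827)
```

In this example the accuracy is about 0.87 even though no spam message is detected, while both F1 and MCC are 0.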
Stanford Natural Language Inference
The Stanford Natural Language Inference corpus (SNLI) [10] is a large annotated dataset containing pairs of sentences written in English. It is used for the task of natural language inference, or textual entailment recognition, in which it must be understood whether the meaning of the first sentence (the premise) entails the meaning of the second one (the hypothesis). In the case of SNLI, for each pair of sentences there are three possible answers: entailment, contradiction, or neutral.
From a total of 570k pairs of sentences, the dataset is split in three folds whose sizes are provided by the authors of the corpus. In particular, after the split we get a training set of 549,367 examples, a validation set of 9842 examples and a test set of 9824 examples. Following the methodology in [10], these numbers do not include the examples whose label is controversial.
The binary parse trees included in the corpus are used to extract a tokenization of the sentences. Then, just as in Sections 4.1 and 4.2, each token is represented by a pretrained FastText embedding vector for English with 300 dimensions. The input to our models is then a pair of sequences of word embeddings: one for the premise and one for the hypothesis.
Experiments
All our experiments were performed on a single NVIDIA Tesla V100 with 16 GB of memory. The source code for the models and their training procedure is developed in PyTorch [36], which provides convenient automatic differentiation (for the models that make use of it). All source code and related instructions will be made freely available online in order to allow third parties to easily reproduce our experiments. In addition to the RC-based models that we have described in Section 3, we also implemented a standard bidirectional GRU (Bi-GRU) that we use for comparison purposes in the analysis of accuracy and efficiency.
Training method. Most of the models that we use for our experimental comparison are trained by mini-batched gradient descent using the Adam algorithm [29] and cross entropy as the loss function. On the other hand, the simple linear readout that characterizes RC-based models allowed us to train Bi-ESN by closed-form solution with ridge regression. Thanks to the fast training process made possible by ridge regression, we were also able to cheaply compute an ensemble out of 10 instances of Bi-ESN, all identical except for their random initialization. The output of the ensemble model is computed by simply averaging the output scores of the predictions of the 10 individual instances, then taking as final prediction the class corresponding to the highest averaged score.
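The ensemble rule described above (average the output scores, then pick the class with the highest average) can be sketched in a few lines. The function name `ensemble_predict` and the toy scores are illustrative only.

```python
def ensemble_predict(scores_per_model):
    """Average per-class scores over the ensemble members and pick the best class.

    scores_per_model: one list of class scores per ensemble member
    returns: (index of the predicted class, list of averaged scores)
    """
    n_models = len(scores_per_model)
    n_classes = len(scores_per_model[0])
    averaged = [sum(member[c] for member in scores_per_model) / n_models
                for c in range(n_classes)]
    predicted = max(range(n_classes), key=lambda c: averaged[c])
    return predicted, averaged


# Three members, two classes: two members weakly prefer class 1,
# one member strongly prefers class 0.
predicted_class, averaged_scores = ensemble_predict(
    [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]]
)
```

Averaging scores differs from majority voting: in the toy example two of the three members prefer class 1, yet the averaged scores favor class 0 because the remaining member is much more confident.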
In summary, the four models trained on each of the three tasks of Section 4 are: Bi-ESN, a pure RC model offering extremely fast training; the Bi-ESN ensemble, a combination of ESNs for improving predictive performance; Bi-ESN-Att, which explores the trade-off between predictive performance and computational efficiency; and Bi-GRU, a common baseline for recurrent neural networks.
Validation method. Initially, all models are trained on the training fold of the dataset, and hyperparameter tuning is performed by evaluating the accuracy on the validation fold. After the best hyperparametrization has been found, the models are retrained on a new fold composed of the union of the examples in the training and validation folds in order to get a final estimate of the performance. To even out the effects of the random initialization of the parameters, all measurements of the test performance have been obtained by repeating the training process 10 times, with different random initializations each time, and averaging the results. The standard deviations have also been computed in order to confirm the robustness of the results with respect to different randomly initialized weights.
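The repeated-runs part of this protocol can be sketched as follows. The callable `fake_train_and_evaluate` is a hypothetical stand-in for a full training and test run, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say which estimator was used.

```python
import random
import statistics


def repeated_evaluation(train_and_evaluate, n_runs=10, base_seed=0):
    """Train n_runs times with different seeds; return mean and std of test accuracy."""
    accuracies = [train_and_evaluate(base_seed + run) for run in range(n_runs)]
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def fake_train_and_evaluate(seed):
    """Hypothetical stand-in for a real run: returns a noisy 'accuracy' near 0.90."""
    return 0.90 + random.Random(seed).uniform(-0.01, 0.01)


mean_accuracy, std_accuracy = repeated_evaluation(fake_train_and_evaluate)
```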
Details on model selection. Regarding both Bi-ESN and Bi-ESN-Att, we selected the number of recurrent units N_R within the interval [500, 10000]. For Bi-ESN-Att, the additional hyperparameter N_D has been selected in {128, 256, 512}. The values ω and ρ for the initialization of respectively the input-to-reservoir matrix
Results
In Figure 6 we report the performance achieved by our models on the three tasks defined in Section 4, compared against other models in the literature. The accuracy on the TREC and SMS datasets is reported in Figures 6a and 6b for all three of our proposed models. These models, which are all based on a RC approach and thus exploit completely untrained recurrent dynamics, compete very well against the majority of the other models in the literature. In fact, on the SMS dataset they even slightly surpass the state of the art. On the TREC dataset, even though the models are very competitive in general, two peaks can be spotted in correspondence to CNN rnd and U T +CNNw2v: we discuss why we believe these peaks not to be very significant in Section 6.2, along with a critical analysis of the performance on the SNLI task in Section 6.3.

Extension of the Bi-ESN model in Fig. 2 to the case of a textual entailment recognition task. The two input sentences are processed in parallel by the same bidirectional ESN to produce the two sentence embeddings

Extension of the Bi-ESN-Att model in Fig. 3 to the case of a textual entailment recognition task. The two input sentences are processed in parallel by the same bidirectional ESN, then each resulting state is fed to a layer that performs dimensionality reduction. A self-attention mechanism inspired by [34] extracts the important states independently from the premise and the hypothesis sides, then computes multiplicative interactions between them. Finally, the resulting vector is fed to a linear classifier. The only parts of the model that undergo training are represented by a dark, solid background.

Comparison of the predictive performance reached by our models against the state of the art. On the y axis, the accuracy achieved by our models is reported in black, while gray bars are used for the models in the literature.
What is remarkable is that the competitive results of our models come with the extremely low training cost of Bi-ESN, as can be observed in the measurements reported in Tables 1, 2, and 4. In fact, Bi-ESN turns out to be more than 60 times faster than Bi-GRU in terms of training time on both TREC and SMS (requiring just 6 seconds on the former dataset, and less than a second on the latter), without any significant decrease in predictive accuracy despite all the untrained weights. Even the ensemble model, which requires separately training 10 differently initialized classifiers, is still highly competitive against the Bi-GRU in terms of training time (and for engineering purposes it could trivially be improved even further by parallelizing the different instances).
Predictive accuracy and training time (with standard deviations) on the TREC dataset after model selection
Predictive accuracy and training time (with standard deviations) on the SNLI dataset after model selection
Predictive accuracy and training time (with standard deviations) on the SNLI dataset, with hyperparameters from the QC models
Attention mechanism. We decided to include the Bi-ESN-Att model in our analysis in order to understand whether the fading memory property of the RC models was too strong to retain the important information in the input from the start to the end of each sequence. Our results show that adding an attention mechanism on top of the ESN sometimes led to a gain in predictive performance with respect to Bi-ESN (see the results in Table 1 and Table 3), but even when present this gain was rather limited. In the case of the Question Classification task, one could say that this may be due to the relative simplicity of the TREC dataset, which exhibits short sentences with a relatively simple structure. In fact, many sentences start with “Who is”, “How many”, “Where is”, “When did”, and so on. The bidirectional architecture (and in particular the backward direction), then, seems sufficient to capture these important features in the data. This intuition is illustrated in Figure 7, which displays the attention scores assigned by the attention mechanism to the words of a sample of correctly classified sentences. However, the same line of reasoning cannot be applied to either the SNLI dataset or the SMS dataset. These results then seem to indicate that a properly initialized ESN is able to retain all the important information in the input sentences (wherever it is located) so that it can be exploited by the classifier. In other words, in our case the classifier is able to rely only on the final point in the state space trajectory because no extra useful information is encoded in the intermediate states of that same trajectory.

Visualization of the intensity of the attention scores
The defining characteristic of a RC-based approach like our own is that the reservoir of the model is not subject to training. This means that the fitness of a reservoir to a specific task is determined mainly by its hyperparameters, as opposed to the individual values of its weights. It is then interesting to investigate whether a reservoir optimized on a given task can be successfully transferred to different tasks of the same class (in our case text classification) without the need to perform another hyperparameter search. To this end, we retrained our models (including the Bi-GRU) on the SMS and SNLI tasks by using the same hyperparameters as the corresponding models that were previously trained on the TREC dataset. Note that this allows for a very lightweight transfer since the hyperparameters are the only values that are transferred: in the destination model, the actual values within the reservoir matrices are randomly regenerated from scratch on the basis of those hyperparameters.
Comparing Table 2 and Table 3 shows the effect of introducing reservoir transfer on the SNLI task. Remarkably, no loss in predictive accuracy has occurred and, in fact, in the second case the accuracy of the RC models is even slightly higher and gets closer to that of the GRU. Similar results are found with the SMS task, as can be observed by comparing Table 4 and Table 5. It then appears that, when using the proposed methodology, the model selection phase can be drastically simplified by selecting a good reservoir on a given task and then reusing it multiple times for different purposes.
Performance on the SMS Spam Collection dataset after model selection
Performance on the SMS Spam Collection dataset, with hyperparameters from the QC models
Figure 6 reports the performance of our models alongside that of several different models in the literature under similar experimental conditions. Regarding the third-party models in the plots, it is worth pointing out that other works exist in which a slightly higher accuracy is reached. For example, on the TREC dataset some authors apply SVM [16] or KDA [15], and reach an accuracy of 95.0% and 94.3% respectively. Similarly, on the SMS dataset there exist several works [4, 40] in which accuracies up to 99.44% are obtained. However, we intentionally include in our plots only those models whose reported results allow a fair comparison of generalization capability with our own. In particular, we only include those works in which the model selection methodology is comparable to our own in terms of an analogous separation between model selection and model assessment, on validation and test set respectively.
On the other hand, the experimental methodology used in the Question Classification task for CNN rnd [12] (reported in Figure 6a) is compatible with our own. However, we highlight the fact that CNN rnd should have an architecture identical to the one previously introduced in [28], yet the authors do not provide an explanation for the very large increase in accuracy with respect to the original paper. Finally, we note that the two models U T and U T +CNNw2v make use of network weights that are pre-trained on several additional large text corpora, whereas in our case the training is limited to the data within the dataset itself.
Critical analysis on SNLI
As for the results on the SNLI task (Figure 6c), a significantly higher level of accuracy is reached by several competing models. While the predictive performance of the RC-based models is considerably higher than chance (which would be about 33%), this highlights the limitations of our approach in its current form. In particular, the difficulties introduced by the Markovian bias that characterizes ESNs [20] seem to be amplified when two different sentences are compared, as the semantic information associated with the final states of the reservoir will be skewed towards the final parts of the sentences. The problem is alleviated by the use of leaky-integrator neurons and a bidirectional architecture, but to ensure a higher predictive performance it may be necessary to employ very large reservoirs that increase the memory capacity of the network [19]. Unfortunately, a large reservoir leads to an increase in the number of parameters in the classifier, which in turn makes the readout more costly to train. In summary, for the RC-based models that we are proposing there exists a trade-off between predictive performance and efficiency: we have intentionally shown the existence of tasks in Natural Language Processing where it may still be convenient to use a fully trained model like the GRU. However, we have also demonstrated that there exist a number of tasks (shown in our experiments through Question Classification and SMS Spam detection instances) that can instead benefit from the extremely high training efficiency that characterizes our approach.
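The trade-off discussed above can be made concrete with a small sketch. It uses the standard leaky-integrator state update, x(t) = (1 - a) x(t-1) + a tanh(W_in u(t) + W x(t-1)), and concatenates the final states of a forward and a backward pass. Everything else is an assumption for illustration rather than our experimental setup: the dense random reservoir (instead of the multi-ring topology), the helper names, the leaking rate and spectral radius, the placeholder word vectors, and the linear readout whose parameters are counted.

```python
import numpy as np


def random_reservoir(n_units, input_dim, rng, spectral_radius=0.9):
    """Hypothetical helper: dense random reservoir with a set spectral radius."""
    w_in = rng.uniform(-1.0, 1.0, size=(n_units, input_dim))
    w = rng.uniform(-1.0, 1.0, size=(n_units, n_units))
    w *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w)))
    return w_in, w


def final_state(inputs, w_in, w, leaking_rate):
    """Run the leaky-integrator update over a sequence; return the last state."""
    x = np.zeros(w.shape[0])
    for u in inputs:
        x = (1.0 - leaking_rate) * x + leaking_rate * np.tanh(w_in @ u + w @ x)
    return x


def bidirectional_embedding(inputs, fwd, bwd, leaking_rate):
    """Concatenate the final states of a forward and a reversed-input pass."""
    return np.concatenate([
        final_state(inputs, *fwd, leaking_rate),
        final_state(inputs[::-1], *bwd, leaking_rate),
    ])


def readout_param_count(embedding_dim, n_classes):
    """Weights plus biases of an assumed linear readout."""
    return embedding_dim * n_classes + n_classes


rng = np.random.default_rng(0)
sentence = rng.normal(size=(12, 50))  # 12 placeholder word vectors of size 50

sizes = {}
for n_units in (100, 400):
    fwd = random_reservoir(n_units, 50, rng)
    bwd = random_reservoir(n_units, 50, rng)
    emb = bidirectional_embedding(sentence, fwd, bwd, leaking_rate=0.3)
    sizes[n_units] = (emb.size, readout_param_count(emb.size, n_classes=3))
```

Under these assumptions the embedding size, and with it the number of readout parameters, grows linearly with the number of reservoir units, which is the cost side of the trade-off; the reservoir itself remains untrained regardless of its size.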
Conclusion
In the field of Natural Language Processing it is increasingly easy to find sophisticated machine learning architectures that require large amounts of time and computational resources to be properly trained. These models can often reach surprising predictive performance even on complex tasks; however, they can be overkill in other situations. For this reason, we decided to investigate whether a highly efficient neural network model could compete against the typical approaches. For our analysis we followed the RC framework by proposing the novel use of a bidirectional multi-ring ESN, possibly associated with a self-attention mechanism [18].
To determine the applicability of our approach, we have selected three datasets and we have compared the predictive performance of our models against the state-of-the-art neural networks in the literature.
First, the experimental comparison with the existing literature allowed us to pinpoint and highlight the limitations of our approach by means of the results on the challenging SNLI task.
Then, we took the chance to evaluate the contribution of an attention mechanism on top of our RC-based models. In this regard, we have discovered that the final point in the trajectory of the state dynamics seems sufficient for capturing the information that is present in all the individual states of the trajectory, provided that the ESN is properly initialized. In fact, we have only found marginal improvements by adding the attention mechanism.
We have introduced the concept of reservoir transfer, which allows reservoirs to be reused without performing additional hyperparameter searches. In this regard, we have experimentally shown how a reservoir with hyperparameters that have been selected on the Question Classification task can be transferred to the SMS and SNLI tasks without any loss in predictive performance with respect to the use of task-specific reservoirs.
Most importantly, we have demonstrated the extremely high computational efficiency of our approach with respect to a common and simple fully-trained model like a bidirectional GRU. On the Question Classification and SMS tasks, we have shown how our RC-based models are able to remain competitive against the state-of-the-art in terms of accuracy, while requiring drastically lower training times. In fact, our Bi-ESN has trained more than 60 times faster than a bidirectional GRU. When associated with the technique of reservoir transfer, our methodology allows an even higher computational efficiency across the whole process, from the hyperparameter search up to the final trained model.
There are a number of ideas that have been left out of this study and that would be worth investigating to gain a deeper understanding of the further implications of this work. First, the space of the sentence embeddings could be analyzed in order to discover whether it displays linguistic regularities. This would be especially useful to shed light on how to improve the performance on tasks such as SNLI, which heavily depend on the interaction between different sentence embeddings. For example, it may be the case that additional learned transformations on the two individual sentence embeddings are needed before they are joined for classification. Artificial tasks such as Masked Language Modeling [17] can also be quite informative about the performance of the RC approaches and are worth exploring. Moreover, several techniques could be employed to improve the richness of the information contained within the embeddings. Apart from more complex initialization strategies that take into consideration the grammatical structure of the sentence, more advanced kinds of pre-trained lexical embeddings could be used without any alteration to the proposed models. Furthermore, a more advanced tokenization strategy such as SentencePiece [31] may bring additional benefits to the predictive performance.
In conclusion, our results show that Reservoir Computing methods are able to offer compelling value with respect to the typical and costly full backpropagation training of recurrent neural networks, and thus they should be taken into consideration when facing resource constraints. From our analysis, focused on text classification tasks, it emerges that these methods can be adopted in this context with potentially drastic advantages in terms of computational cost. With the considerable amounts of data that increasingly characterize Natural Language Processing problems, our approach represents a first step towards a more efficient use of the available resources.
