Abstract
In this paper, we show that a contextual neural network with artificial neurons performing a conditional aggregation of signals can be trained by the generalized backpropagation algorithm. To allow this algorithm to be used for training contextual neural networks, we derive appropriate generalized delta rules. Our approach is based on an introduced generalized representation of the aggregation function in an ordered groups space and on the division of its attention function into binary scan-path and contribution functions. The advantage of the proposed representation is that it clarifies the description of the aggregation process by using Stark’s scan-path theory and allows us to achieve results independent of the actual form of the attention functions used during aggregation. As such, the proposed solution is valid for the whole presented family of conditional aggregation functions and is a considerable extension of the previously reported results. In particular, the obtained results are valid for the introduced exemplary attention functions, which illustrate the performed calculations. Moreover, the presented solution can be further extended by considering real-valued, non-binary contribution functions inside ordered aggregation functions. Its possible applications in large deep neural networks and energy-limited systems are especially promising.
Introduction
Recent developments in the field of artificial neural networks show that its methods provide valuable solutions to real-life problems in both science and industry [18, 45]. At the same time, numerous research reports present a growing number of improvements that can further enhance the usability of artificial neural networks [6, 11]. One of the explored directions is contextual neural networks, which limit, independently for each input vector, the amount of input data processed to compute their outputs [33, 37]. Models of this kind can have several useful properties in comparison to non-contextual solutions, such as selective attention, increased accuracy of outputs for input vectors not used during training, and reduced time and energy used for data processing [52]. Regardless of the results already achieved, the main open problems related to contextual neural networks remain the definition of a practical model of such structures and the formulation of a theoretically justified, effective algorithm for building and training contextual models.
In this paper we deal with the problem of formulating a general model of a contextual neuron. Its importance lies in the fact that the properties of such neurons allow for bottom-up creation of contextual neural networks. With such an approach one should be able to extend known neural network architectures (e.g. feedforward, recurrent or convolutional) with contextual properties by using contextual neurons in place of non-contextual ones. Thus, the neuron model we search for should also lead to a generalized method of training contextual neural networks that contain such neurons.
Early attempts to create models of contextual neural networks emerged from observations of selective attention mechanisms in humans and other primates [1, 54]. It became evident that biological neural networks can generate and use a set of rules that for every possible situation define parts of perception space that should be observed and processed depending on the context with highest, medium or lowest priority [15, 42].
The authors of the first contextual neural models assumed that to realize dynamic changes of the region of interest, neural networks should be governed by a specialized external element that uses a predefined algorithm of selection of inputs to activate [2, 7] or various artificial mechanisms for dynamic network architecture changes [14, 48], which in fact also were realizations of some constant embedded strategies of context processing. Such solutions allowed producing specialized tools that were useful in tasks such as face recognition, objects tracking or even pointing interesting objects for observation by cosmic probes [10, 28].
However, the mentioned models were not universal solutions, and they were unfounded in the light of neurobiological observations and earlier experimental evidence for the distributed nature of context processing [8, 38]. This gave a strong impulse to search further for basic contextual mechanisms at the level of single synapses and neurons [35, 53]. As a result, researchers concentrated on exploring new models of artificial neurons, especially those with input signal aggregation methods other than the simple weighted sum of input values known from the perceptron neuron model. This is due to the fact that the aggregation function is the only part of the neuron transfer function that has full access to all information available through the neuron’s inputs; after the aggregation process, most of the information about the sources of particular signals and the context-related dependencies between them is lost.
For the above reasons, a number of neuron models were proposed along with definitions of two basic types of aggregation functions that allow us to express the dependencies between inputs of such neurons. The first family of those aggregation functions comprises the so-called higher-order neuron models, which use variations of the polynomial aggregation function that include all or only chosen terms of the right-hand side of Equation (1), where N is the number of neuron inputs, and
However, with an increasing number of neuron inputs, this leads to an exponential explosion of the number of higher-order terms. Thus, to restrict the complexity of the aggregation function, various limited versions of higher-order aggregation functions were analyzed, such as the Sigma-Pi [4], Clusteron [3], Compensatory [36] and Cubic [22] functions. Their common feature is capturing contextual dependencies between inputs by using multiplication and defining clusters of related inputs through a given selection of higher-order terms and values of dedicated parameters.
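To make this growth concrete, the short sketch below counts the product terms of a higher-order aggregation. It assumes, purely for illustration, one weighted term for every non-empty subset of distinct inputs up to a chosen order; the function name `count_higher_order_terms` is hypothetical and does not come from the cited works.

```python
from math import comb
from typing import Optional


def count_higher_order_terms(n_inputs: int, max_order: Optional[int] = None) -> int:
    """Count product terms built from distinct inputs, orders 1..max_order.

    Assumption: one term per non-empty subset of inputs; a bias term
    is not counted. With max_order = n_inputs the count is 2**N - 1.
    """
    if max_order is None:
        max_order = n_inputs
    return sum(comb(n_inputs, order) for order in range(1, max_order + 1))


if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        full = count_higher_order_terms(n)
        second_order = count_higher_order_terms(n, max_order=2)
        print(f"N={n}: all orders={full}, up to order 2={second_order}")
```

Restricting the order, as the limited variants mentioned above do, replaces the exponential count with a polynomial one.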
The approach to limit the number of parameters of the contextual models induced exploration of the second family of neurons which use nonlinear aggregation functions. Examples of such functions are: Product Unit [40]:
Spratling-Hayes function [39], in which nonlinearity as well as dependencies between the neuron’s input connections are induced by the minimum function, the generalized-mean neuron model [43] and trigonometric aggregation [29]. Those solutions have only a few additional parameters and are flexible, which makes them applicable to many problems [44], but there is still no explicit rule describing how to properly choose the values of their parameters for a given problem.
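As a point of reference, a product unit is commonly written as a product of inputs raised to weight-valued exponents. The sketch below uses that common form as an assumption, since it may differ in detail from the formula used in [40], and it is restricted to positive inputs to avoid complex values.

```python
import math


def product_unit(weights, inputs):
    """Product-unit aggregation in its common form: prod_i x_i ** w_i.

    Assumption: all inputs are positive, so every power is real-valued.
    """
    if any(x <= 0 for x in inputs):
        raise ValueError("this sketch assumes strictly positive inputs")
    return math.prod(x ** w for w, x in zip(weights, inputs))


if __name__ == "__main__":
    # The weights act as exponents, so a zero weight removes an input.
    print(product_unit([1.0, 2.0, 0.0], [3.0, 2.0, 5.0]))
```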
On this foundation emerged the Sigma-if model [34], being a representative of the third type of contextual neurons using conditional multi-step aggregation functions which implement dependencies between inputs without higher-order multiplicative terms by realization of Stark’s scan-path theory of attention [12]. The Sigma-if aggregation function can be presented in the form:
The above solution is a direct generalization of the perceptron aggregation and introduces only a few additional parameters whose values are easy to select with simple rules [32]. Most interestingly, it was also shown that a feedforward neural network of neurons using the Sigma-if aggregation can be trained with the generalized backpropagation (GBP) algorithm, which uses the self-consistency paradigm known from physics to solve nonlinear context relations between inputs of the neurons [16, 31]. With this procedure, multistep aggregation allows us to create classifiers that not only can have better generalization properties than analogous multilayer perceptron (MLP) networks, but also use contextual relations between inputs to dynamically reduce the set of input values read in for processing and to build distributed selective attention abilities [32].
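The multistep, conditional character of the Sigma-if aggregation can be illustrated with a minimal sketch. It assumes that inputs are assigned to groups by a grouping vector, that groups are read in increasing index order, and that reading stops as soon as the accumulated activation reaches the aggregation threshold φ*; the function and parameter names are hypothetical.

```python
def sigma_if_aggregate(weights, inputs, grouping, threshold):
    """Accumulate group activations step by step until the threshold is met.

    grouping[i] is the 1-based group index of input i. Returns the pair
    (activation, k_star), where k_star is the number of groups read.
    Assumption: the stop condition is activation >= threshold.
    """
    n_groups = max(grouping)
    activation = 0.0
    for k in range(1, n_groups + 1):
        # Partial activation of step k: weighted sum over inputs of group k.
        activation += sum(
            w * x for w, x, g in zip(weights, inputs, grouping) if g == k
        )
        if activation >= threshold:
            return activation, k
    return activation, n_groups


if __name__ == "__main__":
    w = [0.9, 0.8, 0.1, 0.2]
    theta = [1, 1, 2, 2]
    # Strong signals in group 1: the second group is never read.
    print(sigma_if_aggregate(w, [1, 1, 1, 1], theta, threshold=0.6))
    # Silent group 1: the neuron has to read group 2 as well.
    print(sigma_if_aggregate(w, [0, 0, 1, 1], theta, threshold=0.6))
```

With a single group the loop reduces to the ordinary weighted sum of the perceptron.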
While the properties of neural networks with the Sigma-if function are interesting, it is only one of many different possible methods of multistep conditional aggregation of signals. Thus, it is important to find a general description of this family of aggregation functions and check if the GBP algorithm can be used to train neural networks based on them. If this would be the case, it could lead to a definition of a practical, general model of conditional contextual neural networks and formulation of a theoretically justified, effective algorithm for training such models.
In this study, we propose a set of new, selected methods of conditional multistep aggregation of input signals within contextual neural networks. Next we formulate generalized representation of the family of those functions based on Stark’s scan-path theory. This description is later used to show that the GBP algorithm can be used not only for the Sigma-if based neural networks but for the whole family of conditional contextual neural networks.
The rest of this paper is organized as follows. In Section 2, selected new conditional aggregation functions are proposed as well as a formula describing the whole family of such functions. Then in Section 3 generalized representation of conditional aggregation functions in the ordered groups space is introduced to simplify further analyses of the GBP algorithm. This generalized representation is next used in Section 4 to derive GBP formula for weights update for feedforward neural networks using conditional aggregation functions. Finally in Section 5, results of experiments are presented showing that the GBP algorithm can be successfully used to construct contextual neural networks with the proposed conditional aggregation functions that solve selected UCI machine learning benchmark classification problems.
It was suggested in the previous section that conditional contextual neural networks can be instantiated with a wide set of aggregation functions that represent different methods of contextual input data processing. Thus for the neuron with transfer function given by the classical equation
The k* value is the lowest value of the data accumulation step for which the neuron has enough data to form the proper output value for a given input vector x, which is defined by the selected aggregation condition, and the partial activation of the neuron accumulated from group number g is determined as a weighted sum of input signals gated by the appropriate Kronecker delta:
The formulation of Equation (6) is different from Equation (3) because the latter relates to a special case of the Sigma-if neuron in which indexes of inputs groups are equal to indexes of consecutive aggregation steps and the summation can be done over the aggregation steps index k. In the general formula this must be changed to aggregation over the groups index g with help of an additional function Γ which describes which inputs groups should be used to form neuron activation in a given step of the aggregation:
Whereas Γ is defined for any input vector
Looking at the graphical representation of the Sigma-if attention function, one can easily construct many other attention functions for creation of contextual neural networks that potentially will have properties different than those observed for the Sigma-if neural network. Within this paper, we propose four new attention functions:
CFA (Constant Field of Attention), in which the input space is analyzed by moving a window of a predefined size (Fig. 2a); OCFA (Overlapping Constant Field of Attention), as in CFA, but the input space regions overlap in consecutive aggregation steps (Fig. 1b); PRDFA (Predefined Random Dynamic Field of Attention), in which the order of aggregation of input groups is set pseudo-randomly on neuron initialization and is not changed during training (Fig. 2b); RDFA (Random Dynamic Field of Attention), in which the order of aggregation of input groups is set pseudo-randomly at the beginning of each aggregation.
Formal definitions of the proposed attention functions are as follows:
The proposed functions were selected to cover various types of input data aggregation methods. They include both random and non-random aggregation as well as represent usage of constant and dynamic field of attention. There are at least two reasons for this selection. Firstly, it is interesting how using such Γ functions in Equation (6) will change the influence of aggregation on the neural networks properties in comparison to the MLP and Sigma-if models. But the most important is that chosen Γ functions represent solutions which were not covered by the GBP method in [32] and should be taken into consideration if one wants to check if and how the GBP algorithm can be used for training conditional contextual neural networks with the whole family of possible aggregation methods.
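The qualitative differences between the four attention functions can be sketched as schedules that say which groups are visited and in what order. The code below is only an illustration built from the verbal descriptions; the window stride of the CFA and OCFA variants and all names are assumptions, not the formal definitions.

```python
import random


def cfa_schedule(n_groups, window):
    """Non-overlapping windows of groups (assumed stride equal to window)."""
    return [
        list(range(start, min(start + window, n_groups + 1)))
        for start in range(1, n_groups + 1, window)
    ]


def ocfa_schedule(n_groups, window, stride=1):
    """Overlapping windows of groups (assumed stride smaller than window)."""
    return [
        list(range(start, start + window))
        for start in range(1, n_groups - window + 2, stride)
    ]


class PRDFAOrder:
    """Group order drawn once at neuron initialization, then kept fixed."""

    def __init__(self, n_groups, seed=None):
        self.order = list(range(1, n_groups + 1))
        random.Random(seed).shuffle(self.order)

    def __call__(self):
        return list(self.order)


class RDFAOrder:
    """Group order redrawn at the beginning of every aggregation."""

    def __init__(self, n_groups, seed=None):
        self.n_groups = n_groups
        self.rng = random.Random(seed)

    def __call__(self):
        order = list(range(1, self.n_groups + 1))
        self.rng.shuffle(order)
        return order


if __name__ == "__main__":
    print(cfa_schedule(6, 2))
    print(ocfa_schedule(6, 2))
    fixed, fresh = PRDFAOrder(5, seed=1), RDFAOrder(5, seed=1)
    print(fixed(), fixed())  # identical orders
    print(fresh(), fresh())  # independently drawn orders
```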
The general conditional aggregation formula given by Equation (6) is a simple way of describing conditional aggregation schemes of both structured and unstructured (e.g. random) character. It is also useful, as it allows us to easily create new aggregation methods by defining various attention functions. Unfortunately, for many attention functions such a form of the aggregation function would be hard to use within formal analyses of the GBP algorithm, as it would generate irreducible, problematic expressions in partial derivatives. Thus, to calculate the delta rule for a feedforward neural network of neurons with a general conditional aggregation function, an alternative representation is needed that is usable for theoretical analyses. To ensure the generality of the representation proposed below, it is illustrated with the most complicated case, the random Γ_PRDFA attention function.
One possible solution to the above problem is to observe that for any attention function Γ its group indexes can be reordered in such a way that the resulting ordered attention function Γ′ describes input groups ordered according to the sequence of their aggregation. The nature of such a transformation and its example for the Γ_PRDFA attention function are presented in Fig. 3.
Thus in general the conditional aggregation within ordered space of the g’ index has the form:
This change can be viewed only as a formal modification as the Equation (15) keeps the structure of Equation (6) and does not change the sequence of data aggregation. But in practice such reordering of groups indexes allows for further division of ordered attention function into two sub functions:
Even if for the purposes of this paper we assume that both scan-path S and contribution Z are simple binary functions, they represent two distinct natural components of the attention function. The first one shows which groups are used to form a neuron’s output, and the latter concentrates the information about the group’s importance in the given steps of the aggregation process − which in general does not have to be binary.
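The effect of the reordering can be checked numerically with a small sketch. It assumes a cumulative aggregation that reads groups in a given order until a threshold is reached, and it relabels every group by its position in that order; the helper names are hypothetical. The result for the original labels read in scan order is the same as for the relabeled groups read as 1, 2, 3, and so on.

```python
def to_ordered_groups(grouping, order):
    """Relabel each group g by its 1-based position g' in the scan order."""
    rank = {g: pos for pos, g in enumerate(order, start=1)}
    return [rank[g] for g in grouping]


def aggregate_in_order(weights, inputs, grouping, order, threshold):
    """Read groups in `order` until the accumulated activation >= threshold.

    Returns (activation, number_of_groups_read). The cumulative stop
    rule is an assumption made for this illustration.
    """
    activation = 0.0
    for step, g in enumerate(order, start=1):
        activation += sum(
            w * x for w, x, gi in zip(weights, inputs, grouping) if gi == g
        )
        if activation >= threshold:
            return activation, step
    return activation, len(order)


if __name__ == "__main__":
    w = [0.5, 0.4, 0.3, 0.2]
    x = [1, 1, 1, 1]
    theta = [1, 2, 3, 3]
    order = [3, 1, 2]  # e.g. a pseudo-random scan order
    theta_ordered = to_ordered_groups(theta, order)
    print(theta_ordered)
    print(aggregate_in_order(w, x, theta, order, 0.6))
    print(aggregate_in_order(w, x, theta_ordered, [1, 2, 3], 0.6))
```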
Using further Equation (15), one can also notice that for the aggregation step k* there exist two boundary values g* and common for scan-path, contribution and ordered attention functions, such that:
With the help of those values it is easy to see that aggregation value
It is worth underlining that the obtained general formula for the conditional aggregation function in the ordered groups space is common for all attention functions. In particular, one can easily see that the same result will be achieved for other attention functions proposed in the previous section. Thus, it is the representation of the searched conditional aggregation function which can be used to define a wide family of contextual neurons with different attention functions.
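In the ordered groups space the aggregation reduces to a sum over the inputs whose ordered group index lies between the two boundary values. The sketch below illustrates this under the assumption that both boundaries are inclusive; `g_min` and `g_max` are hypothetical names for the boundary values.

```python
def ordered_aggregate(weights, inputs, ordered_grouping, g_min, g_max):
    """Weighted sum over inputs whose ordered group index is in [g_min, g_max].

    Assumption: both boundaries are inclusive.
    """
    return sum(
        w * x
        for w, x, g in zip(weights, inputs, ordered_grouping)
        if g_min <= g <= g_max
    )


if __name__ == "__main__":
    w = [0.5, 0.4, 0.3, 0.2]
    x = [1, 1, 1, 1]
    theta_ordered = [2, 3, 1, 1]
    # Groups 1..2 are active; the input in group 3 is ignored.
    print(ordered_aggregate(w, x, theta_ordered, g_min=1, g_max=2))
    # A window that covers only group 3.
    print(ordered_aggregate(w, x, theta_ordered, g_min=3, g_max=3))
```

The attention function enters this computation only through the two boundaries, which is what makes the representation independent of its particular form.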
But what is equally important, such a form of the general conditional aggregation function hides a potentially complex nature of an attention function and is almost identical to the aggregation function of the Sigma-if neuron written in the space of aggregation steps index k (see Equation 3). The latter fact becomes obvious after analyzing relation of the attention, ordered attention and scan-path functions of the Sigma-if neuron – in this special, simple case all those functions are mutually identical and contribution function is equal one. Thus in particular the Sigma-if aggregation steps index k is equal to its ordered attention index g′. In such a special case, Equation (21) can be used without unnecessary complications to perform calculation of the generalized delta rule for the general conditional aggregation function. Still, it requires confirmation if the same can be done in a general case – for all attention functions.
It has been shown in the previous section that the general formula of a conditional aggregation function given by Equation (21) can be used to build contextual neurons for a wide set of attention functions given by Equation (16). We can use such neurons within a feedforward artificial neural network to form a contextual neural network. In turn, the obtained contextual neural networks can be trained with the GBP algorithm due to the similarity of Equation (21) to Equation (3). Still, to be able to do that, we need to find the form of the generalized delta rule in the ordered groups space for Equation (21).
It would be convenient to present the derivation of the GBP generalized delta rule for a conditional contextual neural network alongside the generalized delta rule of the backpropagation algorithm for the MLP network. However, since the backpropagation algorithm is well known, we only refer to its existing, easily available descriptions [13, 32]. To simplify the further parts of the derivation, we also use the set of symbols given below, which is common to the mentioned works.
Let’s consider the general case of a multilayer feedforward neural network with a full network of connections between neurons in adjacent layers, and a non-decreasing and differentiable activation function in individual neurons. To establish the symbols we assume that every μ-th learning pattern is a pair containing the input vector x zμ and the corresponding output vector y zμ . Simultaneously, consecutive layers of the network are numbered with index m and values from 1 to M, where layer m consists of n m neurons. Consequently, the weight of the connection between the j-th neuron in m-th layer and the i-th neuron of the previous layer is written as (in case of the double lower indices, the left subscript is the number of the neuron within the layer of the number indicated in the superscript, while the right subscript is the number of the neuron input). Using this convention and until otherwise noted weights of connections between neurons in layers m and m + 1 are denoted as , where l is the index of a neuron in a layer m + 1. Similarly, the values of the aggregation function φ and activation function F (φ) for j-th neuron of the m-th layer are denoted as and , respectively, while for the i-th neuron of the input layer, which by definition realizes an identity transfer function, is equal . The maximal and minimal boundaries of an ordered attention function for the j-th neuron of the m-th layer are denoted, respectively, as and .
Using the above notation and by assuming that all neurons of the network hidden layers are using the conditional contextual aggregation given by Equation (21), we get the output values of the neurons in the form
From the above, we can directly write that
Recalling now Equation (22), we can start calculating the missing right-hand side part of Equation (26):
The iterator q is introduced in Equation (27) instead of j to denote the q-th neuron in the m-th layer because within this formula the value of j needs to be constant as an index of . Hence, after expanding the sum over g’ and performing the differentiation of the right-hand side, the above equation takes the form:
Then, by factoring out the common weight term, we can write:
However, the sum of Kronecker deltas appearing on the right-hand side of Equation (29) may take only two values: one when the j-th input of the l-th neuron belongs to one of the groups active during signals aggregation for the vector μ, and zero if it is not. In the first case, the component of the grouping vector assigned to the j-th input connection is within the boundary values and , and in the second one it is outside this range. This allows us to conclude that:
Lastly, by applying to Equation (26) the derivative calculated in this way, one can determine the formula for the output error of the j-th neuron in the m-th hidden layer of conditional contextual neural network (based on Equation (24)):
Expression (31) differs from the corresponding formula for the MLP network only by a presence of the two Heaviside functions. Thus, as in the case of the Sigma-if network, due to this change, when not all inputs of the contextual neuron are involved in determining its output value, the related error is propagated only by the connections that were previously used. This behavior is fully consistent with the idea of the backpropagation algorithm. A neuron’s input connections inactive during the aggregation of the signals, even despite non-zero weights, do not make any contribution to the activation of a neuron, and, as a result, do not influence the neuron’s output error values.
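The masking effect of the two Heaviside functions can be illustrated with a plain-Python sketch of the error propagation to a hidden layer. The data layout, the function names and the convention H(0) = 1 (so that the boundary groups count as active) are assumptions made for this example.

```python
def heaviside(z):
    """H(z) = 1 for z >= 0 and 0 otherwise (convention assumed here)."""
    return 1.0 if z >= 0 else 0.0


def hidden_deltas(next_deltas, next_weights, next_grouping, next_bounds, act_derivs):
    """Errors of hidden neurons, propagated only through active connections.

    next_deltas[l]      : error of neuron l in layer m + 1
    next_weights[l][j]  : weight of the connection from hidden neuron j to l
    next_grouping[l][j] : ordered group index of input j of neuron l
    next_bounds[l]      : (g_min, g_max) used by neuron l for this pattern
    act_derivs[j]       : derivative of the activation of hidden neuron j
    """
    deltas = []
    for j, d_act in enumerate(act_derivs):
        total = 0.0
        for l, delta_l in enumerate(next_deltas):
            g_min, g_max = next_bounds[l]
            g = next_grouping[l][j]
            # The product of two step functions is 1 only inside [g_min, g_max].
            mask = heaviside(g - g_min) * heaviside(g_max - g)
            total += delta_l * next_weights[l][j] * mask
        deltas.append(d_act * total)
    return deltas


if __name__ == "__main__":
    # One output neuron that read only its first group for this pattern.
    print(hidden_deltas([1.0], [[0.5, 0.25]], [[1, 2]], [(1, 1)], [1.0, 1.0]))
    # The same neuron after reading both groups: the plain MLP rule.
    print(hidden_deltas([1.0], [[0.5, 0.25]], [[1, 2]], [(1, 2)], [1.0, 1.0]))
```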
Finally, to determine the general rule of modification of weights in the contextual network with conditional aggregation functions, the following derivative requires consideration:
However, it is easy to notice the similarity between Equations (33) and (27). Thus by analogy, without unnecessary transformations we get:
As a result, the generalized delta rule specifying the change of the weight value of the i-th input of the j-th neuron in the m-th contextual network layer takes the form:
Lastly, after taking into account the relevant formulas for errors of different elements of the contextual network with conditional aggregation function, generalized delta rule for the output layer of its neurons is given by:
Equations (35–37) could be further simplified by changing appropriate products of Heaviside functions to dedicated rectangular functions, but it does not change the result and we leave it to the interested reader. Heaviside functions appearing in Equation (35) can be viewed as a mechanism that counteracts unnecessary modifications of the network structure in those parts which are not used for determining the output values of individual neurons for a given training vector. Thus, both in the hidden and output layer weights of connections that were inactive during the process of input signals accumulation are not modified.
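A minimal sketch of the resulting weight update for a single contextual neuron is given below. It assumes that the neuron's error term already carries the sign of the correction, that the update is the usual product of the training constant, the error and the input signal, and that the two Heaviside functions act as an inclusive mask on the ordered group index; all names are hypothetical.

```python
def weight_updates(eta, delta, prev_outputs, ordered_grouping, g_min, g_max):
    """Weight changes of one neuron; inactive connections get exactly zero.

    eta              : training constant
    delta            : error term of the neuron (sign convention assumed)
    prev_outputs[i]  : signal on input i (output of the previous layer)
    ordered_grouping : ordered group index of every input
    g_min, g_max     : boundaries of the groups read for this pattern
    """
    return [
        eta * delta * u if g_min <= g <= g_max else 0.0
        for u, g in zip(prev_outputs, ordered_grouping)
    ]


if __name__ == "__main__":
    # Bipolar inputs, three groups, only groups 1 and 2 were read.
    print(weight_updates(0.05, 2.0, [1.0, -1.0, 1.0], [1, 2, 3], 1, 2))
```

The weight of the third input stays unchanged because its group was not read, in line with the remark above about connections that were inactive during signal accumulation.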
To verify the results of our theoretical analysis in practice, we used the generalized weight modification formulas obtained in the previous section to check the effects of applying the GBP algorithm to train feedforward contextual neural networks with the proposed aggregation methods CFA, OCFA, RDFA and PRDFA. In each case, a fully connected neural network with a single hidden layer of ten contextual neurons and bipolar coding of inputs was trained to solve four selected benchmark problems given by the Wine, Votes, Heart and Sonar datasets from the UCI ML repository. The neural network architecture, basic parameter values and considered problems were selected so that we could compare the new results obtained for contextual neurons built with the proposed aggregation methods with the previous results for Sigma-if and MLP neural networks presented in [32]. The basic properties of the selected datasets are given in Table 1.
For the above reasons, no bias weights and no momentum were used during training, and the training constant η was 0.05. In a given neural network all contextual neurons had the same values of the aggregation threshold φ*, the number of groups K and the input window size. Values of those parameters were chosen for each experiment from the following sets: φ* from the set {0.5, 0.6, 0.7} and the number of groups K from the set {2, 3, 5, 7, 9, 11, 13}. The maximal number of training epochs was 500. Additionally, after initial experiments we limited the considered attention window sizes to the values from the set {1, 3, 5, 7, 9, 11}.
Each combination of the above parameter values was used to perform 4-fold cross-validation repeated 5 times, and for each trained neural network the level of generalization and the average activity of internal network connections were measured. Then the average results for the best networks obtained in each cross-validation were compared with the analogous properties of Sigma-if and MLP networks.
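The size of the resulting parameter search can be sketched by enumerating the listed value sets. The code assumes a full Cartesian product of all three sets, which may overstate the search for aggregation methods that do not use an attention window; the variable names are hypothetical.

```python
from itertools import product

# Value sets as listed in the text.
thresholds = (0.5, 0.6, 0.7)              # aggregation threshold
group_counts = (2, 3, 5, 7, 9, 11, 13)    # number of groups K
window_sizes = (1, 3, 5, 7, 9, 11)        # attention window size

# Assumption: every combination of the three parameters is evaluated.
grid = list(product(thresholds, group_counts, window_sizes))

if __name__ == "__main__":
    print(len(grid))   # number of parameter combinations
    print(grid[0])     # first combination: (threshold, K, window size)
```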
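The evaluation protocol can be sketched as a generator of index splits for repeated k-fold cross-validation. The sketch uses plain shuffling without class stratification and a fixed seed, both of which are assumptions; the source does not state how the folds were drawn.

```python
import random


def repeated_kfold(n_samples, n_folds=4, n_repeats=5, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated k-fold cross-validation.

    Assumption: plain random folds, reshuffled for every repetition.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Fold f takes every n_folds-th shuffled index, starting at f.
        folds = [indices[f::n_folds] for f in range(n_folds)]
        for f in range(n_folds):
            test_idx = folds[f]
            train_idx = [
                i for k, fold in enumerate(folds) if k != f for i in fold
            ]
            yield train_idx, test_idx


if __name__ == "__main__":
    splits = list(repeated_kfold(n_samples=10))
    print(len(splits))  # 5 repetitions of 4 folds
    print(splits[0])
```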
It can be seen in Table 2 that for almost all considered benchmark problems neural networks with aggregation methods introduced in Section 2 achieve generalization levels similar to the results for Sigma-if neural network and better than for MLP. The only exception is the Sonar problem for which CFA, OCFA and PRDFA aggregation methods generate results better than noted for Sigma-if and MLP.
The obtained results indicate that weights modification formulas obtained for the proposed groups space can be effectively used to train neural networks with contextual neurons. It can be also concluded that various conditional aggregation functions for given problems can lead to different results, also better than in the basic case of the Sigma-if neuron. This is due to the fact that different attention functions set different limits on neurons’ attention fields and on their abilities of perception of contextual relationships within the data. Thus, some attention functions can lead to better results for a given problem than others. On the other hand, independently from the aggregation method actually used within the neurons, the GBP algorithm was able to successfully generate usable contextual neural networks with considerably high average level of generalization. This can suggest that each non-trivial attention function can be used to construct neurons useful in contextual neural networks.
Contrary to the above, the selection of an aggregation function can have a large impact on the average activity of internal network connections. Table 3 shows that the levels of average activity of internal network connections for the proposed aggregation functions are considerably lower than the analogous results for Sigma-if and MLP networks. For the Sonar problem, the CFA aggregation method decreased the average internal connection activity about two times, and its RDFA analogue almost three times, in comparison to the results for the Sigma-if model. A similar considerable decrease of connection activity was noted for the OCFA method for all considered problems. One can also observe that the average decrease of connection activity for the PRDFA and RDFA aggregation functions is lower than for the OCFA and CFA methods. This indicates that one should carefully select a neuron’s aggregation method if the level of activity of internal network connections is important for the given solution. It can also be expected that non-random attention functions can lead to a higher reduction of the level of connection activity than their pseudo-random versions.
On the basis of the above, we can conclude that the analysis of experimental results supports the presented theoretical considerations. The GBP algorithm is able to successfully train contextual neural networks with various forms of the attention function. Also, the proposed conditional aggregation methods are useful, as they allow us to achieve a higher average decrease of internal network connection activity than MLP and Sigma-if neural networks while obtaining similar (in some cases better) classification results. Such neuron models can be especially useful within large systems that use convolutional neural networks and deep learning, because this can lead to a considerable reduction of the energy used by such systems both during training and further usage. Still, there is a need for further experiments on the various forms of conditional aggregation functions and on their applicability in different neural networks.
Conclusions
In this paper, we have shown that a contextual neural network with neurons performing a conditional signals aggregation can be trained by the GBP algorithm with the use of derived generalized delta rules. The result is independent from the actual form of the attention functions used during the aggregation. Thus it is valid for the whole presented family of conditional aggregation functions and constitutes a considerable extension of the previous result for the Sigma-if neural network [32]. In order to achieve this, we have introduced a generalized representation of the aggregation function in an ordered groups space and division of its attention function to binary scan-path and contribution functions. Presented calculations have been illustrated with examples of four new conditional aggregation functions and with results of experiments in which GBP has been successfully used to train contextual neural networks based on those functions.
Obtained results open a wide field of research on the properties of new conditional aggregation functions which can be easily defined by the use of the introduced idea of an ordered attention function. Further work can include also considerations of contextual neural networks with conditional aggregation functions in which the contribution function is not binary and analyzing them against a broader set of benchmark and real life problems.
Highly promising is a potential application of presented contextual neurons within recent deep neural networks because this can lead to a considerable reduction of the energy cost of their usage both during training and further utilization – also when neural networks are run on specialized and GPU hardware. This especially includes large convolutional deep learning solutions for image [54], audio [51] and video [6] processing currently used by leading IT companies as well as energy-limited systems that could benefit from application of deep neural networks (e.g. mobile androids [50], intelligent sensors, autonomous vehicles [6, 49] and flying drones [11]). Moreover, according to the literature of the subject and to the best of our knowledge such low-level contextual neurons are currently not used in the mentioned types of systems.
Acknowledgments
We thank four anonymous reviewers for their valuable comments that improved the quality of this paper.
