Notes on the generalized backpropagation algorithm for contextual neural networks with conditional aggregation functions

Abstract

In this paper, we show that a contextual neural network with artificial neurons performing a conditional aggregation of signals can be trained by the generalized backpropagation algorithm. To allow this algorithm to be used for training contextual neural networks, we derive appropriate generalized delta rules. Our approach is based on a newly introduced generalized representation of the aggregation function in an ordered groups space and on the division of its attention function into binary scan-path and contribution functions. The advantage of the proposed representation is that it clarifies the description of the aggregation process by using Stark’s scan-path theory and allows us to achieve results independent of the actual form of the attention functions used during aggregation. As such, the proposed solution is valid for the whole presented family of conditional aggregation functions and is a considerable extension of the previously reported results. In particular, the obtained results are valid for the introduced exemplary attention functions, which illustrate the performed calculations. Moreover, the presented solution can be further extended by considering real-valued, non-binary contribution functions inside ordered aggregation functions. Its possible applications in large deep neural networks and energy-limited systems are especially promising.

Keywords

Contextual neural networks, conditional aggregation, classification, dynamic inputs selection, selective attention

1 Introduction

Recent developments in the field of artificial neural networks show that the methods it provides bring very valuable solutions to real-life problems in both science and industry [18, 45]. At the same time, numerous research reports present a growing number of improvements that can further enhance the usability of artificial neural networks [6, 11]. One of the explored directions is contextual neural networks, which, independently for each input vector, limit the amount of input data processed to compute their outputs [33, 37]. Models of this kind can have several useful properties in comparison to non-contextual solutions, such as selective attention, increased accuracy of outputs for input vectors not used during training, and a reduction of the time and energy used for data processing [52]. Regardless of the results already achieved, the main open problems related to contextual neural networks remain the definition of a practical model of such structures and the formulation of a theoretically justified, effective algorithm for building and training contextual models.

Within this paper we deal with the problem of formulating a general model of a contextual neuron. Its importance lies in the fact that the properties of such neurons allow for bottom-up creation of contextual neural networks. With such an approach one should be able to add contextual properties to known neural network architectures (e.g. feedforward, recurrent or convolutional) by using contextual neurons in place of non-contextual ones. Thus, the neuron model we search for should also make it possible to derive a generalized method of training contextual neural networks that contain neurons of this kind.

Early attempts to create models of contextual neural networks emerged from observations of selective attention mechanisms in humans and other primates [1, 54]. It became evident that biological neural networks can generate and use a set of rules that, for every possible situation, define the parts of perception space that should be observed and processed, depending on the context, with highest, medium or lowest priority [15, 42].

The authors of the first contextual neural models assumed that, to realize dynamic changes of the region of interest, neural networks should be governed by a specialized external element that uses a predefined algorithm for selecting the inputs to activate [2, 7], or by various artificial mechanisms for dynamic changes of the network architecture [14, 48], which in fact were also realizations of some constant, embedded strategies of context processing. Such solutions made it possible to produce specialized tools that were useful in tasks such as face recognition, object tracking or even pointing out interesting objects for observation by space probes [10, 28].

However, the mentioned models were not universal solutions and were not well founded in the light of neurobiological observations and earlier experimental evidence for the distributed nature of context processing [8, 38]. This gave a strong impulse to search further for basic contextual mechanisms at the level of single synapses and neurons [35, 53]. As a result, researchers concentrated on exploring new models of artificial neurons, especially those with methods of input signal aggregation other than the simple weighted sum of input values known from the perceptron model. This is because the aggregation function is the only part of the neuron transfer function that has full access to all the information available through the neuron’s inputs; after the aggregation process, most of the information about the sources of particular signals and the context-related dependencies between them is lost.

For the above reasons, a number of neuron models were proposed along with definitions of two basic types of aggregation functions that allow us to express the dependencies between the inputs of such neurons. The first family of aggregation functions comprises the so-called higher-order neuron models, which use variations of the polynomial aggregation function that include all or only chosen terms of the right-hand side of Equation (1), where N is the number of neuron inputs, and x and w are the input vector and term weight matrix (of dimension N and N (N + 1), respectively). In comparison to the perceptron neuron, such an extension of the aggregation function not only increases the information capacity of the resulting models [47], but also improves their ability to learn geometrically invariant properties of input vectors [9, 46].

$\begin{matrix} φ_{Poly} (x, w) & = & w_{0} + \sum_{p = 1}^{N} w_{p}^{1} x_{p} + \sum_{p, p^{'} = 1}^{N} w_{p p^{'}}^{2} x_{p} x_{p^{'}} + \dots \\ & & + \sum_{p, p^{'}, \dots, p^{N} = 1}^{N} w_{p p^{'} \dots p^{N}}^{N} x_{p} x_{p^{'}} \dots x_{p^{N}}, \end{matrix}$ (1)

However, as the number of a neuron’s inputs increases, this leads to an exponential explosion of the number of higher-order terms. Thus, to restrict the complexity of the aggregation function, various limited versions of higher-order aggregation functions were analyzed, such as the Sigma-Pi [4], Clusteron [3], Compensatory [36] and Cubic [22] functions. Their common feature is that they capture contextual dependencies between inputs by using multiplication, and define clusters of related inputs through a given selection of higher-order terms and the values of dedicated parameters.
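To give a rough sense of this growth, the following Python sketch counts the terms of Equation (1) in two ways. The function names are illustrative and are not part of the original model. The first count treats every index tuple of a sum as a separate weight; the second assumes that products differing only in the order of their factors share one weight.

```python
from math import comb


def index_tuple_terms(n_inputs: int, order: int) -> int:
    """Weights in the sum of a given order when every index tuple is counted."""
    return n_inputs ** order


def distinct_product_terms(n_inputs: int, order: int) -> int:
    """Distinct products of `order` inputs (repetitions allowed, order ignored)."""
    return comb(n_inputs + order - 1, order)


def total_terms(n_inputs: int, per_order) -> int:
    """Bias term plus the terms of all orders 1..N, as in Equation (1)."""
    return 1 + sum(per_order(n_inputs, s) for s in range(1, n_inputs + 1))


if __name__ == "__main__":
    for n in (2, 5, 10):
        print(n,
              total_terms(n, index_tuple_terms),
              total_terms(n, distinct_product_terms))
```

Even under the more economical second count, ten inputs already give 184756 terms, which illustrates why restricted variants of the polynomial aggregation were studied.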

The effort to limit the number of parameters of contextual models led to the exploration of a second family of neurons, which use nonlinear aggregation functions. Examples of such functions are the Product Unit [40]: $φ_{PU} (x, w) = w_{0} + \prod_{p = 1}^{N} (w_{p})^{x_{p}},$ (2)

the Spratling-Hayes function [39], in which nonlinearity as well as dependencies between the neuron’s input connections are induced by the minimum function, the generalized-mean neuron model [43] and trigonometric aggregation [29]. Those solutions have only a few additional parameters and are flexible, which makes them applicable to many problems [44], but there is still no explicit rule describing how to properly choose the values of their parameters for a given problem.

On this foundation emerged the Sigma-if model [34], a representative of the third type of contextual neurons, which use conditional multi-step aggregation functions. These functions implement dependencies between inputs without higher-order multiplicative terms by realizing Stark’s scan-path theory of attention [12]. The Sigma-if aggregation function can be presented in the form: $φ_{Σ - if} (w, x) = \sum_{k = 1}^{k^{*}} \sum_{i = 1}^{N} w_{i} x_{i} δ (k, θ_{i} (w)),$ (3) where the inputs grouping vector θ declares in which aggregation step a given input should be used, δ is the Kronecker delta and k^* is the minimal number of the aggregation step k for which the neuron has enough data to form a proper output value for a given input vector x, which is defined by the aggregation condition: $Q_{Sigma - if} (k) = ((\sum_{i = 0}^{k} Δ φ_{i}) \geq φ^{*}) \lor (k = K)$ (4) dependent on the number of defined groups K and the constant activation threshold value φ^*.
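The multi-step character of Equations (3) and (4) may be easier to follow in code. The sketch below is only an illustration under stated assumptions: the grouping vector θ is passed in explicitly instead of being computed from the weights, groups are numbered from 1 to K, and the partial activation of step 0 in Equation (4) is taken to be zero. The function name `sigma_if` and its parameters are hypothetical.

```python
def sigma_if(weights, inputs, theta, n_groups, threshold):
    """Return (phi, k_star) for a Sigma-if style aggregation.

    theta[i] is the (1-based) aggregation step in which input i is read.
    Aggregation stops at the first step k for which the accumulated
    activation reaches `threshold`, or at the last step k == n_groups.
    """
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    phi = 0.0
    for k in range(1, n_groups + 1):
        # Partial activation of step k: inputs whose group index equals k.
        phi += sum(w * x for w, x, t in zip(weights, inputs, theta) if t == k)
        if phi >= threshold or k == n_groups:
            return phi, k


if __name__ == "__main__":
    w = [0.5, 0.5, 0.25, 0.25]
    x = [1.0, 1.0, 1.0, 1.0]
    theta = [1, 1, 2, 2]
    print(sigma_if(w, x, theta, n_groups=2, threshold=0.6))  # stops after step 1
    print(sigma_if(w, x, theta, n_groups=2, threshold=1.2))  # needs both steps
```

In the first call the inputs of the second group are never read, which is the source of the data reduction discussed in the text.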

The above solution is a direct generalization of the perceptron aggregation and introduces only a few additional parameters whose values are easy to select with simple rules [32]. Most interestingly, it was also shown that a feedforward neural network of neurons using the Sigma-if aggregation can be trained with the generalized backpropagation (GBP) algorithm, which uses the self-consistency paradigm known from physics to solve nonlinear context relations between the inputs of the neurons [16, 31]. With this procedure, multistep aggregation allows us to create classifiers that not only can have better generalization properties than analogous multilayer perceptron (MLP) networks, but also use contextual relations between inputs to dynamically reduce the set of input values read in for processing and to build distributed selective attention abilities [32].

While the properties of neural networks with the Sigma-if function are interesting, it is only one of many possible methods of multistep conditional aggregation of signals. Thus, it is important to find a general description of this family of aggregation functions and to check whether the GBP algorithm can be used to train neural networks based on them. If this were the case, it could lead to a definition of a practical, general model of conditional contextual neural networks and to the formulation of a theoretically justified, effective algorithm for training such models.

In this study, we propose a set of new, selected methods of conditional multistep aggregation of input signals within contextual neural networks. Next, we formulate a generalized representation of this family of functions based on Stark’s scan-path theory. This description is later used to show that the GBP algorithm can be used not only for Sigma-if based neural networks but for the whole family of conditional contextual neural networks.

The rest of this paper is organized as follows. In Section 2, selected new conditional aggregation functions are proposed, together with a formula describing the whole family of such functions. Then, in Section 3, a generalized representation of conditional aggregation functions in the ordered groups space is introduced to simplify further analyses of the GBP algorithm. This generalized representation is next used in Section 4 to derive the GBP weight update formula for feedforward neural networks using conditional aggregation functions. Finally, in Section 5, results of experiments are presented showing that the GBP algorithm can be successfully used to construct contextual neural networks with the proposed conditional aggregation functions that solve selected UCI machine learning benchmark classification problems.

2 Conditional aggregation functions for contextual neurons

It was suggested in the previous section that conditional contextual neural networks can be instantiated with a wide set of aggregation functions that represent different methods of contextual input data processing. Thus, for the neuron with the transfer function given by the classical equation $u (x, w) = F (φ (w, x)),$ (5) where F is the neuron activation function (e.g. sigmoid), and x and w are N-dimensional input and weight vectors, respectively, the general formula for the function of conditional aggregation of data from a maximum of K defined groups of inputs can be written as $φ (w, x) = \sum_{g = 1}^{K} Γ (k^{*}, g) \cdot Δ φ (g, w, x) .$ (6)

The k^* value is the lowest value of the data accumulation step for which the neuron has enough data to form a proper output value for a given input vector x, which is defined by the selected aggregation condition. The partial activation of the neuron accumulated from the group with index g is determined as a weighted sum of input signals gated by the appropriate Kronecker delta: $Δ φ (g, w, x) = \sum_{i = 1}^{N} w_{i} x_{i} δ (g, θ_{i} (w)),$ (7) where θ is the N-dimensional inputs grouping vector defined by the selected grouping function. For clarity and without loss of generality, within this text the definition of groups and the calculation of the grouping vector are done as in the case of the Sigma-if neuron, and the aggregation condition has a form analogous to Equation (4):

$\begin{matrix} Q (k) & = & ((\sum_{g = 0}^{K} Γ (k, g) \cdot Δ φ (g)) \geq φ^{*}) \\ & & \lor (k = K) . \end{matrix}$ (8)

The formulation of Equation (6) is different from Equation (3) because the latter relates to the special case of the Sigma-if neuron, in which the indexes of input groups are equal to the indexes of consecutive aggregation steps and the summation can be done over the aggregation step index k. In the general formula this must be changed to aggregation over the group index g with the help of an additional function Γ, which describes which input groups should be used to form the neuron activation in a given step of the aggregation: $Γ : N^{2} \to \{0, 1\}$ (9)

Although Γ is defined for any input vector x, it depicts the aggregation process for the case in which the aggregation condition is false for all steps k < K. Within this paper, Γ is called an attention function, which can be understood as an element describing which inputs the neuron’s attention is directed at in a given aggregation step. In the case of the Sigma-if aggregation, the attention function takes the form $Γ_{Sigma - if} (k, g) = H (k - g)$ (10) where H is the Heaviside step function, and can be presented graphically as in Fig. 1a.
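As an informal illustration of Equations (6) to (8), the sketch below evaluates a conditional aggregation in which the attention function is passed in as an ordinary Python callable. It is a minimal sketch, not a reference implementation. It assumes that H(0) = 1, that groups and steps are numbered from 1, and that the grouping vector is given explicitly; all function names are hypothetical.

```python
def heaviside(v):
    # Assumption: H(0) = 1, so that group g = k is visible in step k.
    return 1 if v >= 0 else 0


def gamma_sigma_if(k, g):
    """Sigma-if attention function of Equation (10)."""
    return heaviside(k - g)


def partial_activation(g, weights, inputs, theta):
    """Equation (7): weighted sum of the inputs assigned to group g."""
    return sum(w * x for w, x, t in zip(weights, inputs, theta) if t == g)


def conditional_aggregation(weights, inputs, theta, n_groups, threshold, gamma):
    """Return (phi, k_star) following Equations (6) and (8)."""
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    for k in range(1, n_groups + 1):
        # Activation visible in step k under the attention function gamma.
        phi = sum(gamma(k, g) * partial_activation(g, weights, inputs, theta)
                  for g in range(1, n_groups + 1))
        if phi >= threshold or k == n_groups:
            return phi, k


if __name__ == "__main__":
    w = [0.5, 0.5, 0.25, 0.25]
    x = [1.0, 1.0, 1.0, 1.0]
    theta = [1, 1, 2, 2]
    print(conditional_aggregation(w, x, theta, 2, 1.2, gamma_sigma_if))
    # A different attention function: only the group of the current step.
    print(conditional_aggregation(w, x, theta, 2, 1.2,
                                  lambda k, g: 1 if g == k else 0))
```

Swapping the `gamma` argument is enough to change which groups contribute in each step, which mirrors how the general formula separates the attention function from the rest of the aggregation.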

Looking at the graphical representation of the Sigma-if attention function, one can easily construct many other attention functions for the creation of contextual neural networks that will potentially have properties different from those observed for the Sigma-if neural network. Within this paper, we propose four new attention functions:

CFA (Constant Field of Attention), in which the input space is analyzed by moving a window of a predefined size – Fig. 2a,

OCFA (Overlapping Constant Field of Attention), as in CFA but with input space regions that overlap in consecutive aggregation steps – Fig. 1b,

PRDFA (Predefined Random Dynamic Field of Attention), in which the order of aggregation of input groups is set pseudo-randomly at neuron initialization and is not changed during training – Fig. 2b,

RDFA (Random Dynamic Field of Attention), in which the order of aggregation of input groups is set pseudo-randomly at the beginning of each aggregation.

Formal definitions of the proposed attention functions are as follows: $Γ_{CFA} = H (2 k - g) H (g - 2 k - 1 + wnd)$ (11) $Γ_{OCFA} = H (k - g) H (g - k - 1 + wnd)$ (12) $Γ_{PRDFA} = \left\{ \begin{matrix} 1 : & g \in r_{g} : r_{g} = \{i : i = {rnd}_{0} (K)\} \land r_{g} ⊄ ⋃_{j = 1}^{k - 1} r_{j} \land | r_{g} | = wnd \\ 0 : & g \notin r_{g} \end{matrix} \right.$ (13) $Γ_{RDFA} = \left\{ \begin{matrix} 1 : & g \in r_{g} : r_{g} = \{i : i = rnd (K)\} \land r_{g} ⊄ ⋃_{j = 1}^{k - 1} r_{j} \land | r_{g} | = wnd \\ 0 : & g \notin r_{g} \end{matrix} \right.$ (14) where rnd() and rnd₀() are functions returning random numbers from the set {1 … K}, and the second of them always returns, for a given neuron, the same sequence of numbers as during its initialization. The constant width of the attention window wnd, the aggregation step number k and the group index g are omitted on the left-hand side of the equations for clarity.
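For the two deterministic definitions, Equations (11) and (12), the resulting attention patterns can be tabulated directly. The sketch below does so under the assumption that H(0) = 1 and that steps and groups are numbered from 1. The chosen values K = 6 and wnd = 2 and all function names are only illustrative.

```python
def heaviside(v):
    # Assumption: H(0) = 1.
    return 1 if v >= 0 else 0


def gamma_cfa(k, g, wnd):
    """Equation (11), evaluated literally."""
    return heaviside(2 * k - g) * heaviside(g - 2 * k - 1 + wnd)


def gamma_ocfa(k, g, wnd):
    """Equation (12), evaluated literally."""
    return heaviside(k - g) * heaviside(g - k - 1 + wnd)


def attention_matrix(gamma, n_steps, n_groups, wnd):
    """Rows are aggregation steps k, columns are group indexes g."""
    return [[gamma(k, g, wnd) for g in range(1, n_groups + 1)]
            for k in range(1, n_steps + 1)]


if __name__ == "__main__":
    for name, gamma in (("CFA", gamma_cfa), ("OCFA", gamma_ocfa)):
        print(name)
        for row in attention_matrix(gamma, n_steps=3, n_groups=6, wnd=2):
            print(" ", row)
```

With these settings the CFA rows select the disjoint pairs of groups {1, 2}, {3, 4} and {5, 6}, while the OCFA window advances by one group per step, so consecutive windows overlap.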

The proposed functions were selected to cover various types of input data aggregation methods. They include both random and non-random aggregation, and represent the use of both constant and dynamic fields of attention. There are at least two reasons for this selection. Firstly, it is interesting how using such Γ functions in Equation (6) will change the influence of aggregation on the properties of neural networks in comparison to the MLP and Sigma-if models. Most importantly, however, the chosen Γ functions represent solutions which were not covered by the GBP method in [32] and should be taken into consideration if one wants to check whether and how the GBP algorithm can be used for training conditional contextual neural networks with the whole family of possible aggregation methods.

3 Generalized representation of conditional aggregation functions in the ordered groups space

The general conditional aggregation formula given by Equation (6) is a simple way of describing conditional aggregation schemes of both structured and unstructured (e.g. random) character. It is also useful as it allows us to easily create new aggregation methods by defining various attention functions. Unfortunately, for many attention functions such a form of the aggregation function would be hard to use within formal analyses of the GBP algorithm, as it would generate irreducible, problematic expressions in partial derivatives. Thus, to calculate the delta rule for a feedforward neural network of neurons with a general conditional aggregation function, an alternative representation is needed that is usable for theoretical analyses. To ensure the generality of the representation proposed below, it will be illustrated with the most complicated case of the random Γ_PRDFA attention function.

One possible solution to the above problem is to observe that for any attention function Γ its group indexes can be reordered in such a way that the resulting ordered attention function Γ′ describes input groups ordered according to the sequence of their aggregation. The nature of such a transformation and an example for the Γ_PRDFA attention function are presented in Fig. 3.

Thus, in general, the conditional aggregation within the ordered space of the g′ index has the form: $φ (w, x) = \sum_{g^{'} = 1}^{K} Γ^{'} (k^{*}, g^{'}) \cdot Δ φ (g^{'}, w, x) .$ (15)

This change can be viewed as only a formal modification, as Equation (15) keeps the structure of Equation (6) and does not change the sequence of data aggregation. In practice, however, such a reordering of group indexes allows for a further division of the ordered attention function into two sub-functions: $Γ^{'} (k, g^{'}) = S (k, g^{'}) \cdot Z (k, g^{'}),$ (16) such that $S, Z : N^{2} \to \{0, 1\},$ (17) where the scan-path function S (k, g′) represents the evolution of the contextual neuron scan path during consecutive steps of the aggregation process, and the contribution function Z (k, g′) depicts how much the data analyzed in the previous aggregation steps influences the value of the partial activation of the current step of aggregation. Visualizations of those two functions for the $Γ_{PRDFA}^{'}$ from Fig. 3b are presented in Fig. 4.

Even if for the purposes of this paper we assume that both scan-path S and contribution Z are simple binary functions, they represent two distinct natural components of the attention function. The first one shows which groups are used to form a neuron’s output, and the latter concentrates the information about the group’s importance in the given steps of the aggregation process − which in general does not have to be binary.

Using Equation (15) further, one can also notice that for the aggregation step k^* there exist two boundary values g^* and $g_{0}^{*}$ common to the scan-path, contribution and ordered attention functions, such that: $\left\{ \begin{matrix} g^{*} = Max (g^{'} : Γ^{'} (k^{*}, g^{'}) = 1) \\ g_{0}^{*} = Min (g^{'} : Γ^{'} (k^{*}, g^{'}) = 1) \end{matrix} \right.$ (18)

With the help of those values it is easy to see that the aggregation value $φ = \sum_{g^{'} = 1}^{K} S (k^{*}, g^{'}) \cdot Z (k^{*}, g^{'}) \cdot Δ φ (g^{'})$ (19) is equal to $φ (w, x) = \sum_{g^{'} = 1}^{g^{*}} Z (k^{*}, g^{'}) \cdot Δ φ (g^{'}, w, x)$ (20) because the scan-path function S is equal to zero above g^* and to one at and below this value, and can therefore be removed. By analogy, for the binary contribution function, which is equal to zero below $g_{0}^{*}$, the aggregation can finally be presented as $φ (w, x) = \sum_{g^{'} = g_{0}^{*}}^{g^{*}} Δ φ (g^{'}, w, x) .$ (21)
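The reduction from Equation (19) to Equation (21) can be checked numerically on a small example. The sketch below assumes that, in the ordered groups space, the groups active in step k^* form one contiguous block between $g_{0}^{*}$ and g^*, which is what the binary forms of S and Z described above imply. The helper names and the numbers are hypothetical.

```python
def boundaries(gamma_row):
    """Equation (18): return (g0_star, g_star) for one row of the ordered
    attention function, where gamma_row[g - 1] = Gamma'(k*, g)."""
    active = [g for g, v in enumerate(gamma_row, start=1) if v == 1]
    if not active:
        raise ValueError("no active group in this aggregation step")
    return min(active), max(active)


def scan_path(g_star, n_groups):
    """Binary S: one up to and including g_star, zero above it."""
    return [1 if g <= g_star else 0 for g in range(1, n_groups + 1)]


def contribution(g0_star, n_groups):
    """Binary Z: zero below g0_star, one from g0_star upwards."""
    return [1 if g >= g0_star else 0 for g in range(1, n_groups + 1)]


def aggregate_full(gamma_row, partial):
    """Equation (19): sum over all K groups gated by S * Z."""
    return sum(v * d for v, d in zip(gamma_row, partial))


def aggregate_reduced(gamma_row, partial):
    """Equation (21): plain sum between the two boundaries."""
    g0_star, g_star = boundaries(gamma_row)
    return sum(partial[g0_star - 1:g_star])


if __name__ == "__main__":
    row = [0, 1, 1, 1, 0]                    # Gamma'(k*, g') for K = 5
    partial = [0.5, 0.25, -0.5, 1.0, 2.0]    # partial activations per group
    print(boundaries(row))
    print(aggregate_full(row, partial), aggregate_reduced(row, partial))
```

Both sums agree here because the gating by S and Z only removes groups outside the two boundaries; the form of the original, unordered attention function no longer appears in the reduced sum.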

It is worth underlining that the obtained general formula for the conditional aggregation function in the ordered groups space is common to all attention functions. In particular, one can easily see that the same result will be achieved for the other attention functions proposed in the previous section. Thus, this is the sought representation of the conditional aggregation function, which can be used to define a wide family of contextual neurons with different attention functions.

Equally importantly, such a form of the general conditional aggregation function hides the potentially complex nature of an attention function and is almost identical to the aggregation function of the Sigma-if neuron written in the space of the aggregation step index k (see Equation 3). The latter fact becomes obvious after analyzing the relation between the attention, ordered attention and scan-path functions of the Sigma-if neuron: in this special, simple case all those functions are mutually identical and the contribution function is equal to one. Thus, in particular, the Sigma-if aggregation step index k is equal to its ordered attention index g′. In such a special case, Equation (21) can be used without unnecessary complications to calculate the generalized delta rule for the general conditional aggregation function. Still, it remains to be confirmed whether the same can be done in the general case, for all attention functions.

4 GBP delta rule for contextual neural networks with conditional aggregation functions

It has been shown in the previous section that the general formula of a conditional aggregation function given by Equation (21) can be used to build contextual neurons for a wide set of attention functions given by Equation (16). We can use such neurons within a feedforward artificial neural network to form a contextual neural network. In turn, the obtained contextual neural networks can be trained with the GBP algorithm due to the similarity of Equation (21) to Equation (3). Still, to be able to do that, we need to find the form of the generalized delta rule in the ordered groups space for Equation (21).

It would be convenient to show the derivation of the GBP generalized delta rule for a conditional contextual neural network in comparison with the generalized delta rule of the backpropagation algorithm for the MLP network. However, because the backpropagation algorithm is common knowledge, we will only refer to its existing, easily available descriptions [13, 32]. To simplify further parts of the derivation, we will also use the set of symbols given below, which is common to the mentioned works.

Let us consider the general case of a multilayer feedforward neural network with full connections between neurons in adjacent layers, and a non-decreasing, differentiable activation function in individual neurons. To establish the symbols, we assume that every μ-th learning pattern is a pair containing the input vector x^zμ and the corresponding output vector y^zμ. Consecutive layers of the network are numbered with index m taking values from 1 to M, where layer m consists of n_m neurons. Consequently, the weight of the connection between the j-th neuron in the m-th layer and the i-th neuron of the previous layer is written as $w_{j, i}^{m}$ (in the case of double lower indices, the left subscript is the number of the neuron within the layer indicated in the superscript, while the right subscript is the number of the neuron input). Using this convention, and unless otherwise noted, the weights of connections between neurons in layers m and m + 1 are denoted as $w_{l, j}^{m + 1}$, where l is the index of a neuron in layer m + 1. Similarly, the values of the aggregation function φ and activation function F (φ) for the j-th neuron of the m-th layer are denoted as $φ_{j}^{m μ}$ and $u_{j}^{m μ}$, respectively, while for the i-th neuron of the input layer, which by definition realizes an identity transfer function, $u_{i}^{1 μ}$ is equal to $x_{i}^{z μ}$. The maximal and minimal boundaries of an ordered attention function for the j-th neuron of the m-th layer are denoted, respectively, as $g_{j}^{* m μ}$ and $g_{0 j}^{* m μ}$.

Using the above notation and assuming that all neurons of the network hidden layers use the conditional contextual aggregation given by Equation (21), we get the output values of the neurons in the form $u_{j}^{m μ} = F (\sum_{g^{'} = g_{0 j}^{* m μ}}^{g_{j}^{* m μ}} \sum_{i = 1}^{n_{m - 1}} w_{j, i}^{m} u_{i}^{(m - 1) μ} δ (g^{'}, θ_{j, i}^{m})),$ (22) the network output error for a given μ-th training vector written formally as $ξ_{μ} = \frac{1}{2} \sum_{j = 1}^{n_{M}} {(y_{j}^{z μ} - u_{j}^{M μ})}^{2}$ (23) and the error created in the j-th neuron of the m-th layer: $δ_{j}^{m μ} = - \frac{\partial ξ_{μ}}{\partial φ_{j}^{m μ}} = - \frac{\partial ξ_{μ}}{\partial u_{j}^{m μ}} F^{'} (φ_{j}^{m μ}) .$ (24)

From the above, we can directly write that $\frac{\partial ξ_{μ}}{\partial u_{j}^{M μ}} = - (y_{j}^{z μ} - u_{j}^{M μ}),$ (25) and that $\frac{\partial ξ_{μ}}{\partial u_{j}^{m μ}} = \sum_{l = 1}^{n_{m + 1}} (\frac{\partial ξ_{μ}}{\partial φ_{l}^{(m + 1) μ}} \frac{\partial φ_{l}^{(m + 1) μ}}{\partial u_{j}^{m μ}}) .$ (26)

Recalling now Equation (22), we can start calculating the missing right-hand side part of Equation (26):

$\begin{matrix} \frac{\partial φ_{l}^{(m + 1) μ}}{\partial u_{j}^{m μ}} & = & \frac{\partial}{\partial u_{j}^{m μ}} \sum_{g^{'} = g_{0 l}^{* (m + 1) μ}}^{g_{l}^{* (m + 1) μ}} \sum_{q = 1}^{n_{m}} \\ & & (w_{l, q}^{m + 1} \cdot u_{q}^{m μ} δ (g^{'}, θ_{l, q}^{m + 1})) . \end{matrix}$ (27)

The iterator q is introduced in Equation (27) instead of j to denote the q-th neuron in the m-th layer, because within this formula the value of j needs to be constant as the index of $u_{j}^{m μ}$. Hence, after expanding the sum over g′ and differentiating the right-hand side, the above equation takes the form:

$\begin{matrix} \frac{\partial}{\partial u_{j}^{m μ}} (\sum_{q = 1}^{n_{m}} w_{l, q}^{m + 1} u_{q}^{m μ} δ (g_{0 l}^{* (m + 1) μ}, θ_{l, q}^{m + 1}) + \dots \\ + \sum_{q = 1}^{n_{m}} w_{l, q}^{m + 1} u_{q}^{m μ} δ (g_{l}^{* (m + 1) μ}, θ_{l, q}^{m + 1})) \\ = w_{l, j}^{m + 1} δ (g_{0 l}^{* (m + 1) μ}, θ_{l, j}^{m + 1}) + \dots \\ + w_{l, j}^{m + 1} δ (g_{l}^{* (m + 1) μ}, θ_{l, j}^{m + 1}) . \end{matrix}$ (28)

Then, by factoring out the common weight term, we can write: $\frac{\partial φ_{l}^{(m + 1) μ}}{\partial u_{j}^{m μ}} = w_{l, j}^{m + 1} \sum_{g^{'} = g_{0 l}^{* (m + 1) μ}}^{g_{l}^{* (m + 1) μ}} δ (g^{'}, θ_{l, j}^{m + 1}) .$ (29)

However, the sum of Kronecker deltas appearing on the right-hand side of Equation (29) may take only two values: one when the j-th input of the l-th neuron belongs to one of the groups active during signal aggregation for the vector μ, and zero when it does not. In the first case, the component of the $θ_{l, j}^{m + 1}$ grouping vector assigned to the j-th input connection is within the boundary values $g_{0 l}^{* (m + 1) μ}$ and $g_{l}^{* (m + 1) μ}$, and in the second one it is outside this range. This allows us to conclude that:

$\begin{matrix} \frac{\partial φ_{l}^{(m + 1) μ}}{\partial u_{j}^{m μ}} = w_{l, j}^{m + 1} H (g_{l}^{* (m + 1) μ} - θ_{l, j}^{m + 1}) \\ \cdot H (θ_{l, j}^{m + 1} - g_{0 l}^{* (m + 1) μ}) \end{matrix}$ (30)
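The step from Equation (29) to Equation (30) rests on the sum of Kronecker deltas over a range being equal to a product of two Heaviside functions. The short sketch below checks this identity exhaustively for small integer values. It assumes H(0) = 1, which is needed for the two boundary groups to count as active; the function names are illustrative.

```python
def kronecker(a, b):
    return 1 if a == b else 0


def heaviside(v):
    # Assumption: H(0) = 1, so that the boundary groups are included.
    return 1 if v >= 0 else 0


def delta_sum(theta, g0_star, g_star):
    """Sum of Kronecker deltas from Equation (29)."""
    return sum(kronecker(g, theta) for g in range(g0_star, g_star + 1))


def heaviside_mask(theta, g0_star, g_star):
    """Product of Heaviside functions from Equation (30)."""
    return heaviside(g_star - theta) * heaviside(theta - g0_star)


if __name__ == "__main__":
    n_groups = 7
    agree = all(
        delta_sum(t, g0, g1) == heaviside_mask(t, g0, g1)
        for g0 in range(1, n_groups + 1)
        for g1 in range(g0, n_groups + 1)
        for t in range(1, n_groups + 1)
    )
    print("identity holds for all tested cases:", agree)
```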

Lastly, by applying the derivative given by Equation (30) to Equation (26), one can determine the formula for the output error of the j-th neuron in the m-th hidden layer of a conditional contextual neural network (based on Equation (24)):

$\begin{matrix} δ_{j}^{m μ} & = & F^{'} (φ_{j}^{m μ}) \sum_{l = 1}^{n_{m + 1}} δ_{l}^{(m + 1) μ} w_{l, j}^{m + 1} \\ & & \cdot H (g_{l}^{* (m + 1) μ} - θ_{l, j}^{m + 1}) \\ & & \cdot H (θ_{l, j}^{m + 1} - g_{0 l}^{* (m + 1) μ}) . \end{matrix}$ (31)

Expression (31) differs from the corresponding formula for the MLP network only by the presence of the two Heaviside functions. Thus, as in the case of the Sigma-if network, when not all inputs of the contextual neuron are involved in determining its output value, the related error is propagated only through the connections that were previously used. This behavior is fully consistent with the idea of the backpropagation algorithm. Input connections of a neuron that are inactive during the aggregation of the signals, even if their weights are non-zero, do not make any contribution to the activation of the neuron and, as a result, do not influence the neuron’s output error values.
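A small numerical sketch may help to see the effect of the two Heaviside functions in the hidden-layer error. It assumes H(0) = 1, so the product of the two functions becomes a simple range test on the grouping value. The function name, argument names and numbers are hypothetical and serve only as an illustration of the formula.

```python
def hidden_delta(f_prime_j, deltas_next, weights_next, theta_next,
                 g0_next, g_next):
    """Error of hidden neuron j, following the masked sum described above.

    For every neuron l of the next layer the lists hold:
      deltas_next[l]  - its error,
      weights_next[l] - the weight of its connection from neuron j,
      theta_next[l]   - the group to which that connection belongs,
      g0_next[l], g_next[l] - its lower and upper attention boundaries.
    """
    total = 0.0
    for d, w, t, g0, g1 in zip(deltas_next, weights_next, theta_next,
                               g0_next, g_next):
        mask = 1 if g0 <= t <= g1 else 0   # product of the two Heavisides
        total += d * w * mask
    return f_prime_j * total


if __name__ == "__main__":
    args = dict(f_prime_j=0.25, deltas_next=[0.5, -1.0],
                weights_next=[2.0, 4.0], theta_next=[1, 3])
    # The second neuron stopped aggregating after group 2, so its
    # connection from neuron j (group 3) was never read.
    print(hidden_delta(g0_next=[1, 1], g_next=[2, 2], **args))
    # With all groups read, the result matches the ordinary MLP formula.
    print(hidden_delta(g0_next=[1, 1], g_next=[3, 3], **args))
```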

Finally, to determine the general rule of modification of weights in the contextual network with conditional aggregation functions, the following derivative requires consideration: $Δ w_{j, i}^{m} = - η \frac{\partial ξ_{μ}}{\partial w_{j, i}^{m}} = - η \frac{\partial ξ_{μ}}{\partial φ_{j}^{m μ}} \frac{\partial φ_{j}^{m μ}}{\partial w_{j, i}^{m}},$ (32) where η is the learning factor and the rightmost partial derivative, taken over the connection weights between layers m and m - 1 in the groups space, is given by the following formula:

$\begin{matrix} \frac{\partial φ_{j}^{m μ}}{\partial w_{j, i}^{m}} & = & \frac{\partial}{\partial w_{j, i}^{m}} \sum_{g^{'} = g_{0 j}^{* m μ}}^{g_{j}^{* m μ}} \sum_{q = 1}^{n_{m - 1}} (w_{j, q}^{m} \\ & & \cdot u_{q}^{(m - 1) μ} δ (g^{'}, θ_{j, q}^{m})) . \end{matrix}$ (33)

However, it is easy to notice the similarity between Equations (33) and (27). Thus by analogy, without unnecessary transformations we get:

$\begin{matrix} \frac{\partial φ_{j}^{m μ}}{\partial w_{j, i}^{m}} & = & u_{i}^{(m - 1) μ} \cdot H (g_{j}^{* m μ} - θ_{j, i}^{m}) \\ \cdot H (θ_{j, i}^{m} - g_{0 j}^{* m μ}) . \end{matrix}$ (34)

As a result, the generalized delta rule specifying the change of the weight value of the i-th input of the j-th neuron in the m-th contextual network layer takes the form:

$\begin{matrix} Δ w_{j, i}^{m} & = & η δ_{j}^{m μ} u_{i}^{(m - 1) μ} \cdot H (g_{j}^{* m μ} - θ_{j, i}^{m}) \\ \cdot H (θ_{j, i}^{m} - g_{0 j}^{* m μ}) . \end{matrix}$ (35)

Lastly, after taking into account the relevant formulas for errors of different elements of the contextual network with conditional aggregation function, generalized delta rule for the output layer of its neurons is given by:

$\begin{matrix} Δ w_{j, i}^{M} & = & η u_{i}^{(M - 1) μ} \cdot H (g_{j}^{* M μ} - θ_{j, i}^{M}) \\ & & \cdot H (θ_{j, i}^{M} - g_{0 j}^{* M μ}) F^{'} (φ_{j}^{M μ}) (y_{j}^{z μ} - u_{j}^{M μ}), \end{matrix}$ (36) while its counterpart for the hidden layers is:

$\begin{matrix} Δ w_{j, i}^{m} & = & η u_{i}^{(m - 1) μ} \cdot H (g_{j}^{* m μ} - θ_{j, i}^{m}) \\ \cdot H (θ_{j, i}^{m} - g_{0 j}^{* m μ}) F^{'} (φ_{j}^{m μ}) \\ \cdot \sum_{l = 1}^{n_{m + 1}} (δ_{l}^{(m + 1) μ} w_{l, j}^{m + 1} H (g_{l}^{* (m + 1) μ} - θ_{l, j}^{m + 1}) \\ \cdot H (θ_{l, j}^{m + 1} - g_{0 l}^{* (m + 1) μ})) . \end{matrix}$ (37)

Equations (35)–(37) could be further simplified by replacing the appropriate products of Heaviside functions with dedicated rectangular functions, but this does not change the result and we leave it to the interested reader. The Heaviside functions appearing in Equation (35) can be viewed as a mechanism that prevents unnecessary modifications of the network structure in those parts which are not used for determining the output values of individual neurons for a given training vector. Thus, in both the hidden and output layers, the weights of connections that were inactive during the process of input signal accumulation are not modified.
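The masking effect of Equation (35) on the weight changes of a single neuron can be sketched as follows. As before, H(0) = 1 is assumed, so the product of the two Heaviside functions reduces to a range test. The function name and the example values, including the learning factor, are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def weight_updates(eta, delta_j, prev_outputs, theta_j, g0_star, g_star):
    """Weight changes of neuron j for one training vector (Equation (35)).

    prev_outputs[i] is the output of neuron i of the previous layer and
    theta_j[i] is the group of the corresponding input connection.
    Connections whose group lies outside [g0_star, g_star] get no update.
    """
    return [
        eta * delta_j * u * (1 if g0_star <= t <= g_star else 0)
        for u, t in zip(prev_outputs, theta_j)
    ]


if __name__ == "__main__":
    updates = weight_updates(eta=0.5, delta_j=0.5,
                             prev_outputs=[1.0, -1.0, 1.0],
                             theta_j=[1, 2, 3],
                             g0_star=1, g_star=2)
    print(updates)  # the third connection (group 3) is left unchanged
```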

5 Experiments

To verify the results of our theoretical analysis in practice, we used the generalized weight modification formulas obtained in the previous section to check the effects of applying the GBP algorithm to train feedforward contextual neural networks with the proposed aggregation methods CFA, OCFA, RDFA and PRDFA. In each case, a fully connected neural network with a single hidden layer of ten contextual neurons and bipolar coding of inputs was trained to solve four selected benchmark problems given by the Wine, Votes, Heart and Sonar datasets from the UCI ML repository. The neural network architecture, basic parameter values and considered problems were selected so that we could compare the new results obtained for contextual neurons built with the proposed aggregation methods with the previous results for Sigma-if and MLP neural networks presented in [32]. The basic properties of the selected datasets are given in Table 1.

For the above reasons, no bias weights and no momentum were used during training, and the training constant η was set to 0.05. In a given neural network, all contextual neurons had the same values of the aggregation threshold φ*, the number of groups K and the input window size. Values of those parameters were chosen for each experiment from the following sets: φ* from the set {0.5, 0.6, 0.7} and the number of groups K from the set {2, 3, 5, 7, 9, 11, 13}. The maximal number of training epochs was 500. Additionally, after initial experiments we limited the considered attention window sizes to values from the set {1, 3, 5, 7, 9, 11}.
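As a rough illustration of the size of this search space, the parameter sets above can be enumerated as follows. This is only a sketch: the dictionary keys are hypothetical names, and it assumes that every value of φ*, K and the window size is combined with every other, which the text does not state explicitly for each aggregation method.

```python
from itertools import product

# Parameter sets quoted in the text; the key names are illustrative only.
PHI_STAR = (0.5, 0.6, 0.7)            # aggregation threshold
NUM_GROUPS = (2, 3, 5, 7, 9, 11, 13)  # number of groups K
WINDOW_SIZES = (1, 3, 5, 7, 9, 11)    # attention window sizes

def parameter_grid():
    """Return every (phi*, K, window) combination as a dictionary."""
    return [
        {"phi_star": phi, "K": k, "window": w}
        for phi, k, w in product(PHI_STAR, NUM_GROUPS, WINDOW_SIZES)
    ]

grid = parameter_grid()
```

Under this full-grid assumption there are 3 · 7 · 6 = 126 settings per aggregation method and dataset.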

Each combination of the above parameter values was used to perform 4-fold cross-validation repeated 5 times, and for each trained neural network the level of generalization and the average activity of internal network connections were measured. Then the average results for the best networks obtained in each cross-validation were compared with the analogous properties of Sigma-if and MLP networks.
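The evaluation protocol can be sketched with the standard library alone. The function below is a hypothetical helper, not the code used in the experiments. It assumes plain (non-stratified) folds and a fresh shuffle for each repetition, neither of which is specified in the text.

```python
import random

def repeated_kfold(n_samples, n_folds=4, n_repeats=5, seed=0):
    """Return (repeat, fold, train_idx, test_idx) tuples for repeated k-fold CV."""
    rng = random.Random(seed)
    splits = []
    for repeat in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # new random partition for every repetition
        folds = [order[k::n_folds] for k in range(n_folds)]
        for k, test_idx in enumerate(folds):
            # Training set: all folds except the one held out for testing.
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            splits.append((repeat, k, train_idx, test_idx))
    return splits

# Example with an arbitrary dataset size chosen only for illustration.
splits = repeated_kfold(n_samples=22)
```

Each parameter setting would then be trained and tested once per returned split, giving 20 networks per setting.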

It can be seen in Table 2 that for almost all considered benchmark problems, neural networks with the aggregation methods introduced in Section 2 achieve generalization levels similar to those of the Sigma-if neural network and better than those of MLP. The only exception is the Sonar problem, for which the CFA, OCFA and PRDFA aggregation methods give results better than those noted for both Sigma-if and MLP.

The obtained results indicate that the weight modification formulas obtained for the proposed groups space can be effectively used to train neural networks with contextual neurons. It can also be concluded that, for a given problem, various conditional aggregation functions can lead to different results, including results better than in the basic case of the Sigma-if neuron. This is because different attention functions set different limits on the neurons' attention fields and on their ability to perceive contextual relationships within the data. Thus, some attention functions can lead to better results for a given problem than others. On the other hand, independently of the aggregation method actually used within the neurons, the GBP algorithm was able to generate usable contextual neural networks with a considerably high average level of generalization. This suggests that each non-trivial attention function can be used to construct neurons useful in contextual neural networks.

Contrary to the above, the selection of an aggregation function can have a huge impact on the average activity of internal network connections. In Table 3 one can see that the levels of average activity of internal network connections for the proposed aggregation functions are considerably lower than the analogous results for Sigma-if and MLP networks. For the Sonar problem, the CFA aggregation method decreased the average internal connections activity about two times, and its RDFA analogue almost three times, in comparison to the results for the Sigma-if model. A similar considerable decrease of connections activity was noted for the OCFA method for all considered problems. One can also observe that the average decrease of connections activity for the PRDFA and RDFA aggregation functions is lower than for the OCFA and CFA methods. This indicates that one should carefully select a neuron's aggregation method if the level of activity of internal network connections is important for the given solution. It can also be expected that non-random attention functions can lead to a higher reduction of the level of connections activity than their pseudo-random versions.

On the basis of the above, we can conclude that the analysis of experimental results supports the presented theoretical considerations. The GBP algorithm is able to successfully train contextual neural networks with various forms of the attention function. Also, the proposed conditional aggregation methods are useful, as they allow us to achieve a higher average decrease of internal network connections activity than MLP and Sigma-if neural networks while obtaining similar (in some cases better) classification results. Such neuron models can be especially useful within large systems that use convolutional neural networks and deep learning, because this can lead to a considerable reduction of the energy used by such systems both during training and in further usage. Still, there is a need for further experiments on the various forms of conditional aggregation functions and on their applicability in different neural networks.

6 Conclusions

In this paper, we have shown that a contextual neural network with neurons performing a conditional aggregation of signals can be trained by the GBP algorithm with the use of the derived generalized delta rules. The result is independent of the actual form of the attention functions used during the aggregation. Thus it is valid for the whole presented family of conditional aggregation functions and constitutes a considerable extension of the previous result for the Sigma-if neural network [32]. In order to achieve this, we have introduced a generalized representation of the aggregation function in an ordered groups space and a division of its attention function into binary scan-path and contribution functions. The presented calculations have been illustrated with examples of four new conditional aggregation functions and with results of experiments in which GBP has been successfully used to train contextual neural networks based on those functions.

The obtained results open a wide field of research on the properties of new conditional aggregation functions, which can be easily defined using the introduced idea of an ordered attention function. Further work can also include consideration of contextual neural networks with conditional aggregation functions in which the contribution function is not binary, and their analysis against a broader set of benchmark and real-life problems.

Highly promising is a potential application of the presented contextual neurons within recent deep neural networks, because this can lead to a considerable reduction of the energy cost of their usage both during training and further utilization, also when neural networks are run on specialized and GPU hardware. This especially includes large convolutional deep learning solutions for image [54], audio [51] and video [6] processing currently used by leading IT companies, as well as energy-limited systems that could benefit from the application of deep neural networks (e.g. mobile androids [50], intelligent sensors, autonomous vehicles [6, 49] and flying drones [11]). Moreover, according to the literature of the subject and to the best of our knowledge, such low-level contextual neurons are currently not used in the mentioned types of systems.

Acknowledgments

We thank four anonymous reviewers for their valuable comments that improved the quality of this paper.

References

1. Treisman A.M., Contextual cues in selective listening, Quarterly Journal of Experimental Psychology 12 (1960), 242–248.
2. Olshausen, Anderson and Van Essen, A Neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information, The Journal of Neuroscience 13 (1993), 4700–4719.
3. Mel B.W., The clusteron: Toward a simple abstraction for a complex neuron, Advances in Neural Information Processing Systems 4 (1992), 35–42.
4. Mel B.W., The sigma-pi column: A model of associative learning in cerebral cortex. Technical Report CNS Memo 6, Computation and Neural Systems Program, California Institute of Technology (1990).
5. Anderson and Van Essen, Shifter Circuits: A computational strategy for dynamic aspects of visual processing, Proceedings of National Academy of Sciences, USA 84, 1987, pp. 6297–6301.
6. Chen, Seff, Kornhauser and Xiao, DeepDriving: Learning affordance for direct perception in autonomous driving, Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), IEEE, 2015, pp. 2722–2730.
7. Koch and Ullman, Shifts in selective visual attention: Towards the underlying neural circuitry, Human Neurobiology 4 (1985), 219–227.
8. Privitera and Stark L.W., Algorithms for defining visual region-of-interest: Comparison with Eye Fixations, IEEE Transactions on Pattern Analysis and Machine Intelligence 22(9) (2000), 970–982.
9. Giles C.L. and Maxwell, Learning, invariance, and generalization in high order neural networks, Applied Optics 26(23) (1994), 4972–4978.
10. Privitera C.M., Azzariti and Stark L.W., Locating regions-of-interest for the Mars Rover expedition, Journal of Remote Sensing 21(17) (2000), 3327–3347.
11. Maturana and Scherer, 3D Convolutional Neural Networks for landing zone detection from LiDAR, Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2015, pp. 3471–3478.
12. Noton and Stark, Scanpaths in saccadic eye movements while viewing and recognizing patterns, Vision Research 11 (1971), 929–942.
13. Rumelhart, Hinton and McClelland, A general framework for parallel distributed processing, in: Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Foundations, Rumelhart and McClelland, eds, The MIT Press, 1986, pp. 45–76.
14. Niebur, Koch and Rosin, An oscillation-based model for the neural basis of attention, Vision Research 33 (1993), 2789–2802.
15. Spelke, Hirst and Neisser, Skills of divided attention, Cognition 4 (1976), 215–230.
16. Cherry E.C., Some experiments on the recognition of speech, with one and with two ears, Journal of the Acoustical Society of America 25(5) (1953), 975–979.
17. Ferguene and Toumi F.F., Dynamic external force feedback loop control of a robot manipulator using a neural compensator – application to the trajectory following in an unknown environment, International Journal of Applied Mathematics and Computer Science 19 (2009), 113–126.
18. Tapiador F.J., Kidd, Hsu K.L. and Marzano, Neural networks in satellite rainfall estimation, Meteorological Applications 11(1) (2004), 83–91.
19. Hager and Toyama, Incremental focus of attention for robust visual tracking, International Journal of Computer Vision 35(1) (1999), 45–63.
20. Houghton and Tipper S.P., Inhibitory mechanisms of neural and cognitive control: Applications to selective attention and Sequential Action, Brain and Cognition 30 (1996), 20–43.
21. Kucera and Francis W.M., Computational analysis of present-day American English, Brown University Press, 1967.
22. Bukovsky, Zeng-Guang Hou, Bila and Gupta M.M., Foundation of notation and classification of nonconventional static and dynamic neural units, Proceedings of 6th IEEE International Conference on Cognitive Informatics, 2007, pp. 401–407.
23. Rybak I.A., Gusakova V.I., Golovan A.V., Podladchikova L.N. and Shevtsova N.A., A model of attention-guided visual perception and recognition, Vision Research 38 (1998), 2387–2400.
24. Korbicz, Obuchowicz and Uciński, Unidirectional networks, in: Artificial Neural Networks: Foundations and Applications, Bolc, ed., Akademicka Oficyna WydawniczaLJ, Poland, 1994, pp. 35–58.
25. Neely J.H., Semantic priming and retrieval from lexical memory: Roles of inhibitionless spreading activation and limited capacity attention, Journal of Experimental Psychology: General 106 (1977), 226–254.
26. Tsotsos J.K., Culhane and Cutzu, From foundational principles to a hierarchical selection circuit for attention, in: Visual Attention and Cortical Circuits, Braun, Koch and Davis, eds, MIT Press, Cambridge, MA, USA, 2001, pp. 285–306.
27. Tsotsos J.K., Culhane, Wai, Lai, Davis and Nuflo, Modeling visual attention via selective tuning, Artificial Intelligence 78(1-2) (1995), 507–547.
28. Yamada and Cottrell G.W., A model of scan paths applied to face recognition, Proceedings of 17th Ann Cognitive Science Conference, 1995, pp. 55–60.
29. Zhang, Simoff J.S. and Zhang J.C., Trigonometric polynomial higher order neural network group models and weighted kernel models for financial data simulation and prediction, Artificial Higher Order Neural Networks for Economics and Business, IGI Global, 2009, pp. 484–503.
30. Fonseca L.R.C., Jimenez J.L., Leburton J.P. and Martin R.M., Self-consistent calculation of the electronic structure and electron-electron interaction in self-assembled InAs-GaAs quantum dot structures, Physical Review B 57 (1998), 4017–4026.
31. Clauss, Bayerl and Neumann, Evaluation of regions-of-interest based attention algorithms using a probabilistic measure, Proceedings of the 5th Workshop Dynamic Perception, IOS Press, 2004, pp. 227–232.
32. Huk, Backpropagation generalized delta rule for the selective attention Sigma-if artificial neural network, International Journal of Applied Mathematics and Computer Science 22(2) (2012), 449–459.
33. Huk, Learning distributed selective attention strategies with the Sigma-if neural network, in: Advances in Computer Science and IT, Akbar and Hussain, eds, InTech, Vukovar, 2009, pp. 209–232.
34. Huk, Manifestation of selective attention in Sigma-if neural network, Proceedings of the International Multiconference on Computer Science and Information Technology IMCSIT/AAIA'07, 2nd International Symposium Advances in Artificial Intelligence and Applications, 2007, pp. 225–236.
35. Renninger, Sequential information maximization can explain eye movements in an object learning task, Journal of Vision 4(8) (2004), 744–744a.
36. Sinha, Gupta M.M. and Nikiforuk P.N., A compensatory wavelet neuron model, Proceedings of IFSA World Congress and 20th NAFIPS International Conference, vol. 3, 2001, pp. 1372–1377.
37. Szczepanik and Jóźwiak, Biometric security systems for mobile devices based on fingerprint recognition algorithm, Proceedings of the Second International Conference on Mobile Services, Resources, and Users, 2012, pp. 62–67.
38. Eckstein M.P., Beutter B.R. and Stone L.S., Quantifying the performance limits of human saccadic targeting in visual search, Perception 30 (2001), 1389–1401.
39. Spratling M.W. and Hayes, Learning synaptic clusters for nonlinear dendritic processing, Neural Processing Letters 11(1) (2000), 17–27.
40. Durbin and Rumelhart D.E., Product units: A computationally powerful and biologically plausible extension to backpropagation networks, Neural Computation 1(1) (1990), 133–142.
41. VanRullen and Koch, Visual selective behavior can be triggered by a feedforward process, Journal of Cognitive Neuroscience 15(2) (2003), 209–217.
42. Shiffrin R.M. and Schneider, Controlled and automatic human information processing: Perceptual learning, automatic attending and a general theory, Psychological Review 84 (1997), 127–190.
43. Yadav R.N., Kumar, Kalra P.K. and John, Multilayer neural networks using generalized-mean neuron model, Proceedings of IEEE International Symposium on Communications and Information Technology ISCIT 2004 1 (2004), 93–97.
44. Yadav R.N., Kalra P.K. and John, Neural network learning with generalized-mean based neuron model, Soft Computing – A Fusion of Foundations, Methodologies and Applications 10(3) (2006), 257–263.
45. Albarqouni, Baur, Achilles, Belagiannis, Demirci and Navab, AggNet: Deep learning from crowds for mitosis detection in breast cancer histology images, IEEE Transactions on Medical Imaging 35(5) (2016), 1313–1321.
46. Perantonis S.J. and Lisboa P.J., Translation, rotation, and scale invariant pattern recognition by high-order neural networks and moment classifiers, IEEE Trans Neural Networks 3 (1992), 241–251.
47. Venkatesh S.S. and Baldi, Programmed interactions in higher order neural networks: Maximal capacity, Journal of Complexity 7 (1991), 316–337.
48. Pelc, A formal model of an artificial neural network used to store and recognize the semantics of some sentences of natural language, Proceedings of the Fifth International Conference on Neural Information Processing ICONIP'98, IOA Press, 1998, pp. 21–23.
49. Hoang and Jo, Path planning for autonomous vehicle based on heuristic searching using online images, Vietnam Journal of Computer Science 2(2) (2015), 109–120.
50. Chen, Qu, Zhou, Weng, Wang and Fu, Door recognition and deep learning algorithm for visual based robot navigation, Proceedings of the 2014 IEEE International Conference on Robotics and Biomimetics (ROBIO), IEEE, 2014, pp. 1793–1798.
51. Xiao-Lei and DeLiang, Boosting contextual information for deep neural network based voice activity detection, IEEE/ACM Transactions on Audio, Speech, and Language Processing 24(2) (2016), 252–264.
52. Zuo, Shuai, Wang, Liu, Wang and Chen, Learning contextual dependence with convolutional hierarchical recurrent neural networks, IEEE Transactions on Image Processing 25(7) (2016), 2983–2996.
53. Lee, An information-theoretic framework for understanding saccadic behaviors, Advances in Neural Processing Systems 12 (2000), 834–840.
54. Tsal, Movements of attention across the visual field, Journal of Experimental Psychology: Human Perception and Performance 9(4) (1983), 523–530.