Wavelets as activation functions in Neural Networks

Abstract

Traditionally, a few activation functions have been considered in neural networks, including bounded functions such as threshold, sigmoidal and hyperbolic tangent, as well as unbounded functions such as ReLU, GELU and Softplus, among others used in deep learning, but the search for new activation functions is still an open research area. In this paper, wavelets are reconsidered as activation functions in neural networks, and the performance of Gaussian-family wavelets (first, second and third derivatives) is studied together with other functions available in Keras-Tensorflow. Experimental results show how the combination of these activation functions can improve performance and support the idea of extending the list of activation functions to wavelets, which can be made available in high-performance platforms.

Keywords

deep learning; neural network; activation functions; wavelets; Keras-Tensorflow

1 Introduction

Seminal works on artificial neural networks were inspired by models of the human brain composed of inputs, synaptic connections (weights), activation functions and outputs. In a supervised approach, a first main goal is to calculate the weights such that the training error, obtained by comparing the desired outputs with the corresponding approximated outputs, decreases when the neural system is excited with the corresponding inputs. A second and equally important goal is to obtain a minimal testing error. Early stopping prevents overfitting by detecting the minimum of the test error curve. From this point of view, machine learning takes a subtly different approach from the purely mathematical approximation problem, where the lowest training error can be reached at the cost of weak generalization. In machine learning, it is required not only to approximate but also to generalize in the best way [21].

Neural networks, as a connectionist approach, use weights to store and distribute information. This provides fault tolerance: despite small variations in the weight values or noise in the inputs, correct outputs can still be obtained within an acceptable range of values. Under this connectionist approach, weights seem more relevant for storing and processing information than the activation function, which is apparently sufficient to define an activation threshold. In other words, no complex information is stored or processed within the activation function, since it primarily performs a decision task. In this sense, some bounded and non-decreasing activation functions, such as threshold or sigmoidal, produce essentially the same results. When a gradient-based algorithm is introduced to train neural networks, it is decisive to have differentiable activation functions with analytical expressions. Functions with singularities, such as threshold, do not fulfill this condition, which is a reason to consider other functions, including sigmoidal (logistic), arctangent and hyperbolic tangent.

McCulloch and Pitts’ model of a neuron considers a threshold function [23], and the Perceptron Convergence Theorem guarantees convergence for any linearly separable set of samples [2]. For example, for two binary inputs there are 16 different boolean functions, and a single perceptron is unable to classify the XOR (0-1-1-0) and XNOR (1-0-0-1) output patterns, since they represent non-linearly separable problems. However, a model that uses two perceptrons followed by a single output neuron succeeds in these cases [24]. An interpretation for the pattern “0-1-1-0” could be that one neuron is able to classify the pattern as “0-1-1-1”, but since its fourth output is 1, a second neuron is activated to disable the former, and the right output “0-1-1-0” is obtained through the third neuron. Thus, the connections and collaboration between neurons provide data processing.
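The interpretation above can be sketched in a few lines of Python. The weights are hand-picked for illustration (an assumption, not values taken from the cited works): one threshold unit produces the pattern 0-1-1-1, a second unit fires only for the input (1, 1), and the output unit is active when the first unit fires and the second does not.

```python
def step(v):
    """Threshold activation: 1 if the activation potential is positive, else 0."""
    return 1 if v > 0 else 0


def xor_network(x1, x2):
    # Hidden neuron 1 (hypothetical weights): produces the pattern 0-1-1-1.
    h_or = step(-0.5 + x1 + x2)
    # Hidden neuron 2 (hypothetical weights): fires only for the input (1, 1).
    h_and = step(-1.5 + x1 + x2)
    # Output neuron: the second hidden neuron inhibits the first one.
    return step(-0.5 + h_or - h_and)


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_outputs = [xor_network(x1, x2) for x1, x2 in inputs]
print(xor_outputs)  # [0, 1, 1, 0]
```

Negating the output weights would give the XNOR pattern 1-0-0-1 in the same way.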

Cybenko’s Universal Approximation Theorem [8] provides mathematical support for the claim that a neural network with a single hidden layer can approximate functions with finite energy using monotone non-decreasing and bounded activation functions. This theorem presents a sufficient, but not necessary, condition to obtain an approximate solution. Furthermore, it does not directly show how many neurons are required nor how to obtain the weight values; therefore, more than one hidden layer can be used. Typically, one or two fully connected hidden layers are available in software implementations, and the problem lies in discovering, experimentally or with heuristics, how many neurons should be used in each hidden layer. Gradient-based optimization techniques are often applied to calculate the optimal weight values [10, 12].

In deep learning, more than two layers are included, although for different purposes such as data preprocessing, dimension reduction, feature extraction and classification. Data preprocessing layers aim to replace the “classical” preprocessing step before the classification task (for example, the Fourier transform to calculate the frequency components) and instead use raw data, letting the filtering layer produce distinctive features. A filtering task means convolving the input data with filters (kernels), so this kind of layer is called convolutional [26]. Some activation functions used with convolutional layers are ReLU [30] and Leaky-ReLU [3]. Although these piecewise-linear functions are not bounded and have a singular point at zero, one justification for using them is that their derivatives can be computed piecewise, which is fast and simple; a stronger reason is that they do not cause vanishing gradients in backpropagation algorithms. Note that these functions do not fit the boundedness condition of the Universal Approximation Theorem, yet they have shown good results in deep learning [30]. Some other functions have been proposed for deep learning, such as ELU [6], GELU [7], Mish [17] and Swish, among others that involve exponential components. In particular, Swish was found experimentally after testing many combinations of linear and exponential components, and its performance can be higher than that of the original functions [22].

In this paper, motivated and supported by wavelet theory and some previous works [4, 18], some wavelets have been included in Keras-Tensorflow [22, 27] and are now available as activation functions. Wavelets are basis functions of vector spaces, with short oscillations and fast decay. Multiresolution analysis states how to decompose functions via linear superpositions of translated and dilated versions of a mother wavelet [11]. Wavelets make it possible to analyze functions with transient elements using fewer coefficients (weights) than other functions such as the sines and cosines of Fourier analysis. A related approach is the Short-Time Fourier transform, which analyzes local sections by defining an observation window; a Gaussian window is typically preferred since it has good time and frequency localization, according to the Heisenberg principle [25].

Certainly, the use of wavelets as activation functions is not a novel concept, since there are previous works on the matter. For example, in [29] a wavelet network was presented as an alternative to feedforward neural networks, where the basic idea is to replace neurons by “wavelons”. In this approach, wavelet networks are presented as a generalization of radial basis function networks with one hidden layer that use wavelets as activation functions. Additionally, genetic algorithms have been applied in [20], and particle swarm optimization with gradient descent algorithms in [31], for wavelet network training. Concerning wavelet convolutional neural networks, in [13] the improvement was to replace sigmoid activations with $\cos (1.75 x) \, e^{- \frac{x^{2}}{2}}$ wavelets in the convolution layer. Although backpropagation-type algorithms have also been proposed, neural networks with wavelet activation functions have not been as widely used as the non-decreasing functions previously mentioned. In fact, wavelets are not part of the “standard” libraries for deep learning. A possible reason is that wavelets have fast decay, which promotes vanishing gradients. As previously pointed out, this work reconsiders wavelets as activation functions and promotes their use in deep learning layers, given properties such as feature extraction, sparsity and the ability to define different classification regions. We therefore present experiments to study their performance when they are combined with other functions available in Keras and Tensorflow.

The rest of the article is organized as follows: Section 2 presents a study of two-input boolean functions with single neurons that use wavelets as activation functions, and discusses some of their properties. Section 3 introduces the wavelets implemented in Keras and Tensorflow, and Section 4 describes experimental results for some case studies on images. Finally, conclusions and future work are presented according to the experimental results.

2 Wavelets as activation functions

In this section, a single neuron model is studied together with the wavelet reconstruction formula as inspiration to propose the wavelet neuron model. The aim is to show how wavelet activation functions may provide additional processing to synaptic connections (weights), given their fast decay and their bounded and oscillatory behavior, which allows them to deal with transient signals, in contrast to non-decreasing functions such as the unbounded ReLU and the Heaviside step [5].

Although this kind of oscillatory and fast-decaying behavior may not be fully supported by biological inspiration, from an approximation point of view the reconstruction formula for the continuous wavelet transform provides a formal and powerful mathematical tool. In this sense, a function $f(x)$ with finite energy can be written as [11]: $f (x) = \frac{1}{C_{ψ}} \int \int W_{a, b} \, ψ \left( \frac{x - b}{a} \right) db \, \frac{da}{a^{2}}$ (1) where $W_{a,b}$ are the wavelet coefficients, the admissibility condition is satisfied with $C_{ψ} = 2 π \int \frac{| \hat{ψ} (ξ) |^{2}}{| ξ |} d ξ < \infty$ (2) and $ψ \left( \frac{x - b}{a} \right)$ is the mother wavelet with translation $b$ and scaling factor $a$, both real numbers, with $a > 0$. Equations (1) and (2) mean that it is possible to approximate a function $f(x)$ by linear combinations of wavelet functions.

Specifically, for a single neuron, the input X is multiplied by W to get an activation potential v, which is translated and dilated before the wavelet is evaluated to produce the neuron output. Also, equations (1) and (2) can be used to propose neural networks where the wavelets $ψ \left( \frac{x - b}{a} \right)$ are activation functions and the weights $W_{a,b}$ are optimized not only to minimize the approximation error but also to maximize generalization, as described previously.

At this point, it is explored how wavelets can approximate boolean functions. For this purpose, consider two binary inputs (0,1). There are 16 output patterns 0000, 0001, ..., 1111 and some of them have well-known names, for example 0111 corresponds to the OR operation, 0001 is known as the AND operation, 0110 as XOR, and 1001 as XNOR.

For comparison purposes, the individual behavior of a single neuron is explored with the ReLU and threshold activation functions and then with triangular and Haar wavelets [15]. Since there are two inputs x₁, x₂, plus the unitary bias input x₀, the input vector is X = (x₀, x₁, x₂) and the synaptic weight vector is W = (w₀, w₁, w₂), where w₀ is the bias weight. The activation potential is: $v = w_{0} + w_{1} x_{1} + w_{2} x_{2}$ (3) Wavelets are used with dilation “a” and translation “b” parameters, so these are considered part of our neuron model, as shown in Figure 1.

Fig. 1

Neuron with weights, translation, and dilation parameters.

In this case, the neuron output is obtained by evaluating $ψ \left( \frac{v - b}{a} \right)$, where v is the activation potential of equation (3); the same applies to the other activation functions already mentioned. The next step is to find synaptic weight values W = (w₀, w₁, w₂), as well as dilation “a” and translation “b” parameter values, that approximate the 16 gates with their desired outputs D. For this purpose, a genetic algorithm was used to optimize these free parameters by minimizing the RMS error between D and the neuron output.

The rectified linear unit (ReLU) function is defined as: $ReLU (x) = \max (0, x) .$ (4) Table 1, column 2 shows the approximation error (RMS) for the 16 binary functions using ReLU as activation function. Note that there are six rows for which the desired outputs are not reproduced. Values of 0.25 or greater are written in bold, since they are considered failed outputs.
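To make the error measure concrete, the following Python sketch evaluates a single ReLU neuron on the four input pairs and computes the RMS error. It assumes that RMS means the square root of the mean squared error over the four input combinations (the text does not spell this out), and the weights are hypothetical values chosen by hand rather than the output of the genetic algorithm. With these choices the sketch gives 0.25 for the pattern 0111 and 0.5 for 0110, which agrees with the corresponding entries of Table 1, column 2.

```python
import math

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def relu(x):
    return max(0.0, x)


def neuron_outputs(weights, activation, a=1.0, b=0.0):
    """Evaluate activation((v - b) / a) for the four binary input pairs."""
    w0, w1, w2 = weights
    return [activation((w0 + w1 * x1 + w2 * x2 - b) / a) for x1, x2 in INPUTS]


def rms_error(desired, outputs):
    """Root of the mean squared difference (assumed definition of RMS)."""
    squared = [(d - y) ** 2 for d, y in zip(desired, outputs)]
    return math.sqrt(sum(squared) / len(squared))


# Pattern 0111 (OR) with hypothetical weights from a least-squares linear fit.
rms_or = rms_error([0, 1, 1, 1], neuron_outputs((0.25, 0.5, 0.5), relu))
# Pattern 0110 (XOR) with a constant output of 0.5 (hypothetical weights).
rms_xor = rms_error([0, 1, 1, 0], neuron_outputs((0.5, 0.0, 0.0), relu))
print(rms_or, rms_xor)
```

The same `neuron_outputs` helper accepts any other activation function, so it can serve as the objective inside a parameter search.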

Table 1

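RMS approximation error for the 16 boolean functions, with parameters optimized by a genetic algorithm, using a single neuron with ReLU, threshold, triangular, modified triangular (triangular2) and Haar activation functions

Boolean function	ReLU RMS	Threshold RMS	Triangular RMS	Triangular2 RMS	Haar RMS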
0000	0	0	0	0	0
0001	0.0003	0	0.00084	0.0002	0
0010	0.0009	0	0.00080	0.0002	0
0011	0	0	0.000009	0.0003	0
0100	0.0008	0	0.00032	0.0008	0
0101	0.0008	0	0.00431	0	0
0110	0.5	0.5	0.00084	0.0007	0
0111	0.25	0	0.25001	0.0001	0
1000	0.0001	0	0.00076	0.0009	0
1001	0.5	0.5	0.00059	0	0
1010	0.0009	0	0.00039	0.0009	0
1011	0.25	0	0.25002	0.001	0
1100	0.0009	0	0.00067	0.0003	0
1101	0.25	0	0.25029	0.001	0
1110	0.25	0	0.25004	0.0009	0
1111	0.001	0	0.00052	0	0

Table 1, column 3 shows the results for the threshold function H(x) as activation function. Two cases (in bold) are not solved: 0110 and 1001. This result is not surprising, since it is consistent with the behavior of a perceptron with a threshold function on the XOR and XNOR cases, which are non-linearly separable.
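The result for the threshold function can be reproduced with a small exhaustive search. The sketch below is an illustration only: the weight grid is a hypothetical choice (half-integer biases and weights in {-1, 0, 1}), and it replaces the genetic algorithm with brute force. It collects every output pattern that a single threshold neuron can produce on that grid and lists the patterns that never appear.

```python
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def threshold(v):
    return 1 if v > 0 else 0


def realizable_patterns():
    """Output patterns produced by one threshold neuron on a small weight grid."""
    found = set()
    # Hypothetical grid: half-integer biases avoid the ambiguous case v == 0.
    for w0, w1, w2 in product((-1.5, -0.5, 0.5, 1.5), (-1, 0, 1), (-1, 0, 1)):
        bits = (threshold(w0 + w1 * x1 + w2 * x2) for x1, x2 in INPUTS)
        found.add("".join(str(bit) for bit in bits))
    return found


all_patterns = {"".join(bits) for bits in product("01", repeat=4)}
missing = sorted(all_patterns - realizable_patterns())
print(missing)  # ['0110', '1001']
```

Only the XOR and XNOR patterns are missing, which matches the two failed rows in column 3.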

Given a function with support [0, 2] defined as: $φ (x) = \begin{cases} x, & \text{if } 0 \leq x \leq 1 \\ 2 - x, & \text{if } 1 < x < 2 \\ 0, & \text{elsewhere} \end{cases}$ (5) a triangular wavelet function [11] can be defined as: $ψ (x) = - 0.5 \, φ (2 x) + φ (2 x - 1) - 0.5 \, φ (2 x - 2)$ (6) This wavelet is a first-order spline [16]. Since ReLU is piecewise linear, a triangular wavelet can be expressed as a linear combination of translated and dilated ReLU functions, and it can then be used as an activation function. Again, a genetic algorithm can be applied to optimize the free parameters. Table 1, column 4 shows the RMS for the 16 binary functions using the triangular wavelet as activation function. Note that, in this case, four rows (0111, 1011, 1101 and 1110, RMS in bold) do not reproduce the desired outputs. Thus, the triangular wavelet defined in equation (6) is not able to solve all 16 cases, but a modified version can. The modified version (triangular2) is defined as: $ϕ (x) = \begin{cases} - 0.5 x - 0.5, & \text{if } - 1 \leq x < 0 \\ 4 x - 0.5, & \text{if } 0 \leq x < 0.5 \\ - 4 x + 3.5, & \text{if } 0.5 \leq x < 1 \\ 0.5 x - 1, & \text{if } 1 \leq x < 2 \\ 0, & \text{elsewhere} \end{cases}$ (7) and Table 1, column 5 shows the results for this modified triangular wavelet, where in all cases the RMS does not exceed $10^{-3}$.
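The relationship between the triangular wavelet and ReLU can be checked numerically. The sketch below assumes the decomposition φ(x) = ReLU(x) - 2 ReLU(x - 1) + ReLU(x - 2) for the function of equation (5); this identity is not written out in the text and is used here only as an illustration. The wavelet of equation (6) is then built from that ReLU-based function.

```python
def relu(x):
    return max(0.0, x)


def hat(x):
    """Function of equation (5) written as a combination of three shifted ReLUs."""
    return relu(x) - 2.0 * relu(x - 1.0) + relu(x - 2.0)


def hat_piecewise(x):
    """Direct piecewise form of equation (5), used only as a cross-check."""
    if 0.0 <= x <= 1.0:
        return x
    if 1.0 < x < 2.0:
        return 2.0 - x
    return 0.0


def triangular_wavelet(x):
    """Triangular wavelet of equation (6)."""
    return -0.5 * hat(2 * x) + hat(2 * x - 1) - 0.5 * hat(2 * x - 2)


grid = [i / 8 for i in range(-8, 25)]  # x from -1 to 3 in steps of 0.125
max_gap = max(abs(hat(x) - hat_piecewise(x)) for x in grid)
samples = [triangular_wavelet(x) for x in (0.5, 1.0, 1.5)]
print(max_gap, samples)  # 0.0 [-0.5, 1.0, -0.5]
```

Both forms agree on the grid, and the wavelet shows one positive peak at x = 1 between two negative lobes.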

The Haar wavelet, defined as: $Haar (x) = \begin{cases} 1, & \text{if } 0 < x < 0.5 \\ - 1, & \text{if } 0.5 \leq x < 1 \\ 0, & \text{elsewhere} \end{cases}$ (8) can be written in terms of unit step functions with dilations and translations, as follows: $Haar (x) = H (x) - 2 H (x - 0.5) + H (x - 1) .$ (9) So, we can conceive the Haar wavelet as a linear combination of translated and dilated threshold functions. Table 1, column 6 shows the RMS values for the 16 binary functions using the Haar wavelet as activation function. Every row successfully reproduces the 4 outputs for the input combinations (0,0), (0,1), (1,0) and (1,1), and in all cases the approximation error is zero.
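As an illustration of how a single Haar neuron can reach zero error on a non-linearly separable pattern, the sketch below builds the Haar wavelet from unit steps as in equation (9) and evaluates one neuron on the XOR pattern 0110. The weights, dilation and translation are hypothetical values chosen by hand, not the parameters found by the genetic algorithm. The step function is assumed to take the value 1 at x = 0, so this form differs from equation (8) only at the single point x = 0.

```python
def heaviside(x):
    """Unit step H(x); the value at x = 0 is taken as 1 (an assumption)."""
    return 1.0 if x >= 0 else 0.0


def haar(x):
    """Haar wavelet built from shifted unit steps, as in equation (9)."""
    return heaviside(x) - 2.0 * heaviside(x - 0.5) + heaviside(x - 1.0)


def haar_neuron(x1, x2, weights, a, b):
    """Single neuron: activation potential v, then Haar((v - b) / a)."""
    w0, w1, w2 = weights
    v = w0 + w1 * x1 + w2 * x2
    return haar((v - b) / a)


# Hypothetical parameters chosen by hand for the XOR pattern 0110.
params = dict(weights=(0.0, 1.0, 1.0), a=1.0, b=0.75)
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_outputs = [haar_neuron(x1, x2, **params) for x1, x2 in inputs]
print(xor_outputs)  # [0.0, 1.0, 1.0, 0.0]
```

The activation potentials 0, 1 and 2 are mapped to -0.75, 0.25 and 1.25, so only the middle value falls inside the positive half of the wavelet.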

So far, it has been illustrated how triangular and Haar wavelets can define different classification regions to deal successfully with boolean functions. A relationship between these wavelets and the threshold and ReLU functions was also presented.

3 Wavelets for neural networks implemented in Keras-Tensorflow

In the previous section, piecewise-defined wavelets were used, and a genetic algorithm optimized the parameters of a single neuron with interesting results. This section involves Keras [14] and Tensorflow [27] using the ReLU, GELU, Swish and Mish functions, as well as some wavelets for data classification. In fact, three wavelets were implemented in Keras-Tensorflow as activation functions. Specifically, “Gaussian-family wavelets” have been included, which can be expressed as the n-th derivative of the Gaussian function $G (x) = e^{- \frac{x^{2}}{2}}$ [28], in this way: $D_{g}^{(n)} (x) = \frac{d^{(n)}}{{dx}^{(n)}} G (x) = \frac{d^{(n)}}{{dx}^{(n)}} e^{- \frac{x^{2}}{2}} .$ (10) All these Gaussian-family wavelets have short support, and for n > 1 they have n vanishing moments, which means: $\int x^{p} D_{g}^{(n)} (x) dx = 0, \text{ for } p = 0, \dots, n - 1 .$ (11) This property is interesting in machine learning, since it could provide sparsity of coefficients to build efficient neural network architectures.

This paper only considers the cases n = 1, 2, 3, which correspond to the first, second and third derivatives of the Gaussian function, but the same approach can be used to include more wavelets for n > 3. Their mathematical expressions are shown in Eq. (12), Eq. (13) and Eq. (14) respectively, so: $D_{g}^{(1)} (x) = - x e^{- \frac{x^{2}}{2}}$ (12) The second derivative $D_{g}^{(2)}$ of the Gaussian function is known as the “Mexican Hat”, hence: $D_{g}^{(2)} (x) = (1 - x^{2}) e^{- \frac{x^{2}}{2}}$ (13) and the third derivative $D_{g}^{(3)}$ of the Gaussian function is: $D_{g}^{(3)} (x) = (- 3 x + x^{3}) e^{- \frac{x^{2}}{2}}$ (14) In the next section, three experiments are described with these three wavelets applied together with ReLU, GELU, Swish and Mish in Keras-Tensorflow.
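The vanishing-moment property of equation (11) can be checked numerically for the three wavelets. The sketch below is written in plain Python for illustration and is not the Keras-Tensorflow implementation used in the experiments; the integration range and step count are arbitrary assumptions. The functions follow equations (12) to (14) exactly as written. Note that (13) and (14) use the Mexican Hat sign convention, that is, they are the negatives of the literal second and third derivatives of G(x), which does not affect the vanishing moments.

```python
import math


def d1(x):
    """Equation (12): -x exp(-x^2 / 2)."""
    return -x * math.exp(-x * x / 2.0)


def d2(x):
    """Equation (13), Mexican Hat: (1 - x^2) exp(-x^2 / 2)."""
    return (1.0 - x * x) * math.exp(-x * x / 2.0)


def d3(x):
    """Equation (14): (x^3 - 3x) exp(-x^2 / 2)."""
    return (x ** 3 - 3.0 * x) * math.exp(-x * x / 2.0)


def moment(f, p, lo=-12.0, hi=12.0, steps=24000):
    """Midpoint-rule estimate of the integral of x^p f(x) over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        total += (x ** p) * f(x)
    return total * h


# Moments p = 0 .. n-1 for each wavelet; all should be close to zero.
wavelets = {1: d1, 2: d2, 3: d3}
moments = {n: [moment(f, p) for p in range(n)] for n, f in wavelets.items()}
largest = max(abs(m) for values in moments.values() for m in values)
print(largest)
```

In Keras, the `activation` argument of a layer accepts a callable, so analogous functions written with tensor operations could be passed to a layer directly.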

4 Experiments and results

Three experiments are described and some results are discussed.

A first experiment compares how the performance of a basic deep learning architecture changes when the Gaussian-family wavelets of equations (12), (13) and (14) are included as activation functions. A second experiment uses k-fold cross-validation to gather more evidence of the performance with a different dataset and more layers than in Experiment 1. Specifically, Experiment 2 deals with 5-layer architectures where the wavelets are placed in different layers. The performance of these architectures is compared with that of others that use the Swish, GELU, Mish and ReLU functions exclusively or in combination. The last experiment compares architectures with 10 layers, in both “pure” and “mixed” versions of activation functions including wavelets. This led to a nomenclature that names these combinations in a compact way, with no subscripts or superscripts.

4.1 Experiment 1

The case study is the MNIST dataset, with a training set of 60,000 examples and a test set of 10,000 examples [1]. The network uses Conv2D, MaxPooling2D and Flatten layers and was trained with the Adam optimizer for 25 iterations. After the Conv2D layer, Mish, Swish, GELU, ReLU, $D_{g}^{(1)}$, $D_{g}^{(2)}$ (MexHat) or $D_{g}^{(3)}$ was applied as activation function. The last layer uses a SoftMax activation function. The loss function was sparse_categorical_crossentropy, with the default learning rate.

Results of Experiment 1. Tables 2 to 5 show numerical results that summarize the loss and accuracy metrics for training and testing. The names D1, D2 and D3 refer to networks with wavelets $D_{g}^{(1)}$, $D_{g}^{(2)}$ and $D_{g}^{(3)}$, respectively. In Figures 2 and 3, these data are plotted for training loss and accuracy, respectively. Figures 4 and 5 illustrate the corresponding testing loss and accuracy.

Table 2
Training: Loss values for several activation functions

Iteration	D1	D2	D3	Mish	Swish	GELU	ReLU
1	0.2402	0.2496	0.2371	0.2599	0.2658	0.2679	0.2755
2	0.086	0.0849	0.0829	0.0902	0.0951	0.0957	0.1041
3	0.0556	0.0514	0.0521	0.0619	0.0611	0.0644	0.0677
4	0.0382	0.0345	0.0367	0.0436	0.0446	0.0467	0.0483
5	0.0258	0.0242	0.0274	0.0343	0.0343	0.0352	0.0388
6	0.0194	0.0171	0.0199	0.0267	0.0265	0.0291	0.0331
7	0.0151	0.014	0.0152	0.0207	0.022	0.0246	0.0272
8	0.0115	0.0098	0.0129	0.0179	0.0176	0.0189	0.0217
9	0.0092	0.0099	0.0103	0.0158	0.0161	0.0185	0.0187
10	0.0075	0.0093	0.0104	0.0127	0.014	0.0147	0.0168
11	0.0072	0.0072	0.0107	0.0115	0.0125	0.0124	0.0166
12	0.0081	0.0073	0.0085	0.0121	0.0102	0.0116	0.0132
13	0.0073	0.0075	0.0069	0.0097	0.0091	0.0115	0.0123
14	0.004	0.0063	0.0092	0.0092	0.0091	0.0115	0.0114
15	0.0035	0.0054	0.0081	0.0085	0.0091	0.009	0.0092
16	0.0034	0.0044	0.0067	0.0074	0.0075	0.0102	0.0102
17	0.0067	0.0043	0.0058	0.0075	0.0071	0.0073	0.0103
18	0.0098	0.0053	0.006	0.0075	0.0059	0.0077	0.0101
19	0.0046	0.0046	0.0078	0.0077	0.0077	0.0064	0.0084
20	0.0023	0.0035	0.0053	0.0058	0.006	0.0072	0.0083
21	0.0031	0.0028	0.0041	0.0058	0.0052	0.0076	0.0063
22	0.0036	0.0042	0.004	0.0048	0.0055	0.0076	0.0089
23	0.0038	0.0065	0.0054	0.0069	0.0057	0.0068	0.0063
24	0.0035	0.004	0.0062	0.0062	0.0061	0.0061	0.0074
25	0.0046	0.0033	0.0058	0.0064	0.0065	0.0055	0.0083

Table 3

Training accuracy for several activation functions

Iteration	D1	D2	D3	Mish	Swish	GELU	ReLU
1	0.9341	0.9305	0.9285	0.9234	0.9222	0.9221	0.9196
2	0.9768	0.9772	0.9752	0.9735	0.9714	0.9717	0.9688
3	0.9841	0.9866	0.9842	0.9818	0.9812	0.9805	0.9794
4	0.9904	0.9914	0.9887	0.9868	0.9865	0.9851	0.9859
5	0.9937	0.9939	0.9919	0.9897	0.9888	0.9893	0.988
6	0.9953	0.9962	0.9942	0.9916	0.9915	0.9911	0.9894
7	0.9965	0.9966	0.9956	0.9935	0.9931	0.9918	0.9912
8	0.9974	0.998	0.9962	0.9947	0.9945	0.9939	0.9928
9	0.9979	0.9978	0.9972	0.9948	0.9948	0.994	0.9938
10	0.9987	0.9977	0.9968	0.9961	0.9952	0.9949	0.9945
11	0.9987	0.9984	0.9967	0.9965	0.9959	0.9959	0.9941
12	0.9981	0.9984	0.9977	0.996	0.9967	0.9962	0.9954
13	0.9982	0.9981	0.9979	0.9968	0.997	0.9958	0.9958
14	0.9993	0.9984	0.9971	0.997	0.9968	0.9957	0.9961
15	0.9994	0.9986	0.9976	0.9973	0.9968	0.9969	0.9969
16	0.9994	0.999	0.9979	0.9977	0.9974	0.9965	0.9964
17	0.9981	0.999	0.9983	0.9977	0.9977	0.9977	0.9965
18	0.9972	0.9987	0.9982	0.9976	0.998	0.9974	0.9966
19	0.999	0.9986	0.9973	0.9976	0.9972	0.998	0.9972
20	0.9996	0.9992	0.9983	0.9981	0.998	0.9972	0.9971
21	0.9993	0.9995	0.9987	0.9983	0.9981	0.9976	0.9979
22	0.9992	0.9989	0.9989	0.9985	0.9981	0.9975	0.9969
23	0.999	0.9982	0.9982	0.9977	0.9981	0.9978	0.9978
24	0.9992	0.9989	0.998	0.998	0.9978	0.9978	0.9977
25	0.9988	0.9991	0.9983	0.9979	0.9978	0.9982	0.9971

Table 4

Testing loss values for several activation functions

Iteration	D1	D2	D3	Mish	Swish	GELU	ReLU
1	0.1029	0.0967	0.0959	0.1008	0.0975	0.1063	0.1099
2	0.0679	0.0711	0.068	0.065	0.0693	0.0694	0.0669
3	0.0576	0.0555	0.0548	0.0504	0.0607	0.0558	0.0594
4	0.0534	0.0521	0.0548	0.0475	0.0493	0.0494	0.0533
5	0.0505	0.0457	0.0508	0.0597	0.0538	0.0478	0.0491
6	0.0453	0.041	0.0548	0.053	0.0457	0.0466	0.0496
7	0.0485	0.0435	0.0495	0.045	0.0508	0.0468	0.048
8	0.0466	0.0443	0.0529	0.0454	0.0492	0.048	0.0541
9	0.0479	0.0455	0.0471	0.0489	0.0535	0.0512	0.0453
10	0.0496	0.0481	0.0561	0.0517	0.047	0.0494	0.0497
11	0.0514	0.0467	0.0537	0.0498	0.0585	0.051	0.0485
12	0.0542	0.0493	0.051	0.0603	0.0543	0.0556	0.0455
13	0.0537	0.0449	0.0543	0.0489	0.0547	0.0607	0.0495
14	0.0481	0.0483	0.0645	0.0495	0.0628	0.0567	0.0554
15	0.0467	0.0606	0.0597	0.05	0.0533	0.0574	0.0487
16	0.0521	0.0445	0.0591	0.0509	0.0522	0.0534	0.0565
17	0.064	0.0516	0.0541	0.0499	0.0588	0.0542	0.0604
18	0.0496	0.0515	0.0686	0.0513	0.0564	0.0514	0.0536
19	0.0563	0.051	0.0675	0.0599	0.0542	0.0566	0.0557
20	0.0513	0.0481	0.0585	0.0523	0.0565	0.0544	0.0536
21	0.0572	0.0484	0.0579	0.0619	0.0629	0.0616	0.0587
22	0.0552	0.056	0.0645	0.063	0.0601	0.0621	0.06
23	0.0575	0.052	0.0854	0.0636	0.0692	0.0626	0.0604
24	0.0568	0.0501	0.0716	0.0578	0.072	0.0615	0.0657
25	0.0596	0.0522	0.0693	0.0589	0.0688	0.0624	0.0654

Table 5

Testing accuracy values for several activation functions

Iteration	D1	D2	D3	Mish	Swish	GELU	ReLU
1	0.9707	0.9719	0.9695	0.9684	0.9704	0.9678	0.9672
2	0.979	0.979	0.9782	0.9791	0.9797	0.9781	0.979
3	0.9821	0.9836	0.982	0.9832	0.9815	0.9799	0.9809
4	0.9831	0.9843	0.9809	0.9838	0.983	0.9836	0.9831
5	0.9848	0.9865	0.9833	0.9787	0.9838	0.9846	0.9833
6	0.9853	0.9868	0.9828	0.982	0.9857	0.985	0.984
7	0.9845	0.9866	0.9848	0.9853	0.9844	0.9863	0.9849
8	0.9837	0.9857	0.9835	0.9854	0.9858	0.9855	0.9848
9	0.9849	0.986	0.9845	0.9861	0.9854	0.9854	0.9862
10	0.9848	0.9852	0.9827	0.9859	0.9864	0.986	0.9855
11	0.9832	0.9856	0.9847	0.9865	0.9839	0.9857	0.9868
12	0.9837	0.9852	0.9851	0.9846	0.9856	0.9848	0.9867
13	0.985	0.9861	0.9843	0.986	0.9866	0.9859	0.9861
14	0.9851	0.985	0.9819	0.9858	0.9842	0.9857	0.9858
15	0.9859	0.9833	0.9834	0.9867	0.9858	0.9857	0.9873
16	0.9842	0.987	0.9836	0.9865	0.9866	0.9864	0.9855
17	0.9809	0.9843	0.9843	0.9865	0.9863	0.9869	0.9853
18	0.9856	0.9854	0.9836	0.9866	0.9869	0.9867	0.9857
19	0.9835	0.9853	0.9833	0.9856	0.9871	0.9864	0.9867
20	0.9848	0.9862	0.9829	0.9883	0.9864	0.9872	0.9871
21	0.9848	0.9866	0.9843	0.9848	0.9869	0.9855	0.9866
22	0.9848	0.9842	0.9843	0.9865	0.9856	0.9861	0.9849
23	0.9833	0.9859	0.9793	0.9855	0.9857	0.9859	0.987
24	0.9845	0.986	0.9813	0.987	0.9847	0.9873	0.985
25	0.9826	0.985	0.9833	0.9868	0.985	0.9863	0.9864

Fig. 2

Plot training loss values for several activation functions including wavelets.

Fig. 3

Plot training accuracy values for several activation functions including wavelets.

Fig. 4

Plot testing loss values for several activation functions including wavelets.

Fig. 5

Plot testing accuracy values for several activation functions including wavelets.

Table 2 lists the numerical values of the training loss. Note that from the first iteration the wavelet functions reach lower loss values than the other functions, so they show a competitive behavior. Figure 2 plots the training loss, where $D_{g}^{(2)}$ (orange), followed by $D_{g}^{(3)}$ (gray) and $D_{g}^{(1)}$ (blue), reaches lower values than the others.

Also concerning the training step, Table 3 lists the numerical values of the training accuracy, where the networks with the three wavelets achieve higher values than the functions of the last four columns; see, for example, the second and last rows. Figure 3 plots the training accuracies: $D 2 \equiv D_{g}^{(2)}$ lies above the others, and the numerical data in Table 3 show a similar behavior for $D 1 \equiv D_{g}^{(1)}$ and $D 3 \equiv D_{g}^{(3)}$, at least in the first 10 iterations.

Table 4 lists the numerical values of the testing loss. The lowest value, 0.041 (row 6), is reached by the $D_{g}^{(2)}$ wavelet. Figure 4 shows the testing loss values, and although there is no full dominance of $D_{g}^{(2)}$ (orange) and $D_{g}^{(3)}$ (gray), they are competitive. In this plot, $D_{g}^{(2)}$ reaches the lowest value at iteration 6, where overfitting seems to start.

Table 5 lists the numerical values of the testing accuracy. In the first 6 iterations, the maximum, 0.9868, is reached by $D_{g}^{(2)}$. The highest value is reached by Mish after 20 iterations. For comparison purposes, the maximum of each column is highlighted in bold. Figure 5 shows the testing accuracy, and although the highest value, 0.9883, is achieved by Mish at iteration 20, $D_{g}^{(2)}$ is above the others from iterations 3 to 7. Moreover, Mish has a low value at iteration 5. In this experiment, $D_{g}^{(3)}$ does not show a good performance, since its accuracy oscillates after iteration 6.
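The observation that the testing loss of the $D_{g}^{(2)}$ network reaches its minimum at iteration 6 is the kind of point that early stopping looks for. The short sketch below locates it from the D2 column of Table 4; the values are copied from that table, while the helper function name is an illustrative assumption.

```python
# Testing loss of the D2 network, copied from Table 4 (iterations 1 to 25).
d2_test_loss = [
    0.0967, 0.0711, 0.0555, 0.0521, 0.0457, 0.0410, 0.0435, 0.0443, 0.0455,
    0.0481, 0.0467, 0.0493, 0.0449, 0.0483, 0.0606, 0.0445, 0.0516, 0.0515,
    0.0510, 0.0481, 0.0484, 0.0560, 0.0520, 0.0501, 0.0522,
]


def best_iteration(losses):
    """Return the 1-based iteration with the lowest loss and that loss value."""
    index = min(range(len(losses)), key=losses.__getitem__)
    return index + 1, losses[index]


best_iter, best_loss = best_iteration(d2_test_loss)
print(best_iter, best_loss)  # 6 0.041
```

After iteration 6 the testing loss never returns to this value, while the training loss in Table 2 keeps decreasing.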

4.2 Experiment 2

Experiment 2 uses the CIFAR dataset with 60,000 samples, 10 iterations and 10-fold cross-validation. Cross-validation was applied to get a broader perspective on the performance of network architectures with 5 layers: three Conv2D-MaxPooling2D layers and two Dense layers with Swish, GELU, Mish or ReLU activation functions, followed by a single Dense layer with a Softmax function. These architectures are abbreviated as “5S” (5 Swish), “5G” (5 GELU), “5M” (5 Mish) and “5R” (5 ReLU). To include wavelets, modified versions were obtained by replacing the first activation function with $D_{g}^{(1)}$, $D_{g}^{(2)}$ or $D_{g}^{(3)}$, complemented with ReLU in the next layers; they are abbreviated as “1D14R”, “1D24R” and “1D34R”, respectively. The option of four ReLU layers followed by $D_{g}^{(1)}$, $D_{g}^{(2)}$ or $D_{g}^{(3)}$ was also explored, abbreviated as “4R1D1”, “4R1D2” and “4R1D3”, respectively.
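The 10-fold splitting of the 60,000 samples can be sketched as follows. This is a minimal illustration under stated assumptions: the folds are contiguous and unshuffled, and the function name is hypothetical; the text does not say whether shuffling or stratification was used in the experiment.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(60000, 10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))  # 10 54000 6000
```

Each architecture is then trained on the 54,000 training indices of a fold and evaluated on the remaining 6,000, which gives one accuracy value per fold.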

Results of Experiment 2. The results are presented in Table 6 and Figure 6 as boxplots of the accuracy over the 10 folds. The first 4 columns of Table 6 consider only the Swish, GELU, Mish or ReLU functions separately. Note that the ReLU case (column 4) has a lower performance than Swish, GELU and Mish: although it provides a maximum of 72.58, it also has the lowest value (69.72). The last 6 columns of Table 6 involve wavelets and ReLU functions, and the ReLU performance is enhanced, as in column 5, which uses $D_{g}^{(3)}$ before 4 ReLU (1D34R), with a minimum value of 70.22. Also, column 6 of Table 6, with the $D_{g}^{(1)}$ wavelet (1D14R), slightly improves the ReLU performance.

Table 6
Accuracy for 10-crossfolding with CIFAR dataset and 5 layers

Neural Network  5S      5G      5M      5R      1D34R   1D14R   4R1D1   4R1D2   4R1D3
Time (sec)      697     912     1028    464     950     847     639     710     695
Fold
1               70.78   71.93   72.15   70.92   71.42   71.97   71.07   71.68   71.55
2               71.36   70.70   72.27   72.58   73.38   72.73   72.85   72.00   72.37
3               72.25   72.30   72.32   70.48   72.30   71.38   72.38   71.48   72.40
4               72.53   70.53   73.10   71.57   71.82   73.80   71.93   69.98   73.22
5               71.60   71.66   71.52   71.75   71.63   69.87   72.57   71.67   72.05
6               71.79   71.33   71.98   71.20   72.70   71.45   72.03   70.97   73.27
7               70.95   72.30   71.48   71.10   72.52   71.47   71.58   72.50   72.72
8               71.64   70.68   72.27   69.72   70.22   71.00   72.58   71.25   72.30
9               71.29   70.68   72.03   71.47   71.45   71.45   72.43   70.88   69.80
10              72.21   71.70   72.33   71.47   70.90   71.20   72.40   70.28   70.70

Fig. 6

Boxplots for 10-crossfolding for several architectures with 5 layers including wavelet activation functions.

The most consistent case across the 10 folds is achieved with four ReLU layers followed by $D_{g}^{(1)}$ (4R1D1), shown in column 7 of Table 6, with a minimum of 71.07 and a maximum of 72.85. The architecture that differs least from 5R (column 4) is 4R1D2 (column 8), which uses four ReLU layers and a $D_{g}^{(2)}$ (Mexican hat) wavelet.
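The comparison above can be reproduced from the per-fold values of Table 6. The following sketch summarizes three of the columns; measuring "consistency" as the range (maximum minus minimum) over the folds is an assumption made here for illustration, and the helper name `summarize` is hypothetical.

```python
from statistics import mean

# Per-fold accuracies copied from Table 6 (columns 5R, 4R1D1 and 4R1D2).
TABLE6 = {
    "5R":    [70.92, 72.58, 70.48, 71.57, 71.75, 71.20, 71.10, 69.72, 71.47, 71.47],
    "4R1D1": [71.07, 72.85, 72.38, 71.93, 72.57, 72.03, 71.58, 72.58, 72.43, 72.40],
    "4R1D2": [71.68, 72.00, 71.48, 69.98, 71.67, 70.97, 72.50, 71.25, 70.88, 70.28],
}


def summarize(folds):
    """Return minimum, maximum, range and mean of a list of fold accuracies."""
    lo, hi = min(folds), max(folds)
    return {"min": lo, "max": hi, "range": hi - lo, "mean": mean(folds)}


if __name__ == "__main__":
    for name, folds in TABLE6.items():
        s = summarize(folds)
        print(f"{name:6s} min={s['min']:.2f} max={s['max']:.2f} "
              f"range={s['range']:.2f} mean={s['mean']:.3f}")
```

With these values, 4R1D1 has a narrower range than 5R, and the mean of 4R1D2 is the closest of the three to the mean of 5R.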

4.3 Experiment 3

Experiment 3 considers 10 layers with 11 combinations of activation functions and an output layer with SoftMax. Each combination is named by a sequence of integers, each followed by the kind of layer: "S" is Swish, "M" is Mish, "G" is Gelu, "R" is ReLU and "Dn" is the n-th Gaussian derivative. For example, 1D19R denotes one layer with the first Gaussian derivative followed by 9 layers with ReLU as activation function, plus an output layer with SoftMax (not written). Hence, the architectures are: 1) 10S, 2) 10G, 3) 10M, 4) 10R, 5) 1D19R, 6) 1D39R, 7) 8R1D31R, 8) 9R1D3, 9) 8R1D11R, 10) 9R1D1 and 11) 4R1D35R.
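The naming scheme can be made explicit with a small parser that expands a name into its per-layer activation list. This is an illustrative sketch only: the function name `expand_architecture` is hypothetical, and it assumes that the derivative order n in "Dn" is a single digit from 1 to 3, as in the architectures listed above.

```python
import re

# One token is a repetition count followed by a layer kind:
# S (Swish), G (Gelu), M (Mish), R (ReLU) or Dn (n-th Gaussian derivative).
_TOKEN = re.compile(r"(\d+)(D[1-3]|[SGMR])")


def expand_architecture(name):
    """Expand a name such as '8R1D31R' into a list of per-layer activations."""
    layers = []
    pos = 0
    for match in _TOKEN.finditer(name):
        if match.start() != pos:
            raise ValueError(f"unexpected text in {name!r} at position {pos}")
        count, kind = int(match.group(1)), match.group(2)
        layers.extend([kind] * count)
        pos = match.end()
    if pos != len(name) or not layers:
        raise ValueError(f"cannot parse architecture name {name!r}")
    return layers


if __name__ == "__main__":
    for name in ["10S", "1D19R", "8R1D31R", "4R1D35R"]:
        layers = expand_architecture(name)
        print(name, len(layers), layers)
```

Each of the 11 names listed above expands to exactly 10 hidden layers under this reading; the SoftMax output layer is not part of the name.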

Results of Experiment 3. The results are shown in Table 7 and Figure 7. The first column of Table 7 enumerates the 10 folds. Initially, Experiment 3 compares the "pure" architectures Swish, Gelu and Mish (10S, 10G and 10M) with ReLU (10R) in the first columns of Table 7. The network 10R reaches a better average performance, although its lowest value, 98.61, is below the 98.67 of Mish (highlighted in bold in columns 3 and 4). When ReLU layers are combined with the wavelets $D_{g}^{(1)}$ or $D_{g}^{(3)}$, the performance can be enhanced, as in the case of 9R1D3 (considered the "third best" network), which preserves the same lowest value of 98.61, reaches a better average performance and achieves a maximum of 99.3 instead of the 99.17 of 10R.

Table 7
Accuracy for 10-crossfolding with CIFAR dataset and 10 layers

Neural Network  10S      10G      10M      10R      1D19R    1D39R    8R1D31R  9R1D3    8R1D11R  9R1D1    4R1D35R
Time (sec)      (1117)   (1287)   (1192)   (966)    (1132)   (1087)   (694)    (971)    (850)    (992)    (868)
Fold            1        2        3        4        5        6        7        8        9        10       11
1               98.65    98.77    98.65    98.94    98.95    98.94    98.99    98.61    99.05    99.07    99.01
2               98.94    98.62    98.67    99.00    98.97    99.03    99.09    99.24    98.99    99.01    99.01
3               98.81    99.01    98.85    98.76    99.09    98.90    99.21    99.27    99.09    99.24    98.87
4               98.97    98.97    99.01    99.04    99.04    99.16    99.29    99.09    99.15    98.96    99.27
5               98.58    98.60    98.74    98.67    98.47    98.81    98.46    98.81    98.90    98.66    98.47
6               98.77    98.90    99.15    98.96    98.25    98.84    98.85    99.30    99.01    99.03    98.85
7               99.01    99.00    98.98    99.17    99.05    98.57    99.00    99.27    98.97    99.40    99.14
8               99.04    99.18    99.05    99.10    99.15    99.30    99.24    99.06    99.21    99.31    99.14
9               98.67    98.69    98.74    98.61    98.81    99.09    98.51    98.66    99.00    98.97    99.22
10              98.92    98.97    98.88    98.97    98.97    98.87    99.29    99.01    99.10    98.80    99.00

Fig. 7

Boxplots for 10-crossfolding for several architectures with 10 layers including wavelet activation functions.

We remark that the results for $D_{g}^{(2)}$ were not included in Table 7 and Figure 7, since its performance was considerably lower, with an average accuracy of around 55%.

1D19R behaves similarly to 10R, but it has a minimum of 98.25 (Table 7, column 5).

1D39R reaches a maximum of 99.3, greater than the 99.17 of 10R, but its minimum of 98.57 is lower than the 98.61 of 10R.

8R1D31R also reaches a high value, equal to 99.29, and a competitive average performance, even though its worst case is 98.46.

A peculiar behaviour is shown by 8R1D11R (the "second best" network), which maintains a consistent performance across all 10 folds. In fact, its lowest value is similar to the average performance of 10S, 10G, 10M and 10R in the first 4 columns of Table 7, and it reaches a maximum of 99.21, greater than the 99.04 of Swish, 99.18 of Gelu, 99.15 of Mish and 99.17 of ReLU.

Another interesting architecture, considered "the best" in this experiment, is 9R1D1, with a minimum of 98.66, a maximum of 99.4 and a very competitive average performance. Except for 8R1D11R, it outperforms the rest of the networks, as shown in Figure 7.

Concerning the execution time, Table 7 shows, just below each architecture name, the average time in seconds required for each of the 10 folds. Note that the best architectures (9R1D3, 8R1D11R and 9R1D1) required 971, 850 and 992 seconds respectively, which is similar to the 966 seconds of the ReLU case (10R) and lower than the times for 10S, 10G and 10M, i.e., 1117, 1287 and 1192 seconds respectively. In other words, in Experiment 3 the inclusion of wavelets increases the performance while keeping a competitive execution time.

5 Conclusions

The search for new activation functions for neural networks remains an open research area. Based on the experimental results, the following contributions of this paper regarding wavelets as activation functions can be highlighted.

Firstly, a neuron model with translation and dilation parameters has been introduced, and several activation functions have been studied over all 16 two-input Boolean functions: non-decreasing functions such as ReLU, unit step, and GELU, as well as the Haar, triangular, and Gaussian-derivative wavelets. An important observation is that a single wavelet neuron can deal with all 16 functions via evolutionary optimization. This supports the idea that wavelets can model separation regions different from those of "basic functions". Here the term "basic functions" reflects that some wavelets can be conceived as "composed functions", such as the Haar wavelet in terms of Heaviside functions and the triangular wavelet in terms of ReLU functions. This motivates reflection on the importance of considering wavelets as activation functions, even though they do not behave like other widely used activation functions.

Secondly, the experimental results support the idea that activation functions may take on part of the data processing when used in deep learning layers, which complements the connectionist approach where synaptic weights store distributed information and provide fault tolerance. Whereas other research approaches propose activation functions on the basis of simplicity, differentiability, easy and fast computation and non-vanishing gradients, or are guided by computer experiments that search for new activation functions, our approach relies on wavelet theory.

Thirdly, three Gaussian-derivative wavelets were implemented on the Keras-Tensorflow platform, and experiments on an image database were run using gradient-based optimization methods. Certainly, not all wavelet functions consistently improve the performance, but some of them improve it when combined with ReLU while remaining highly competitive, which makes this research area promising for future work.

Since the wavelet functions were implemented for Keras-Tensorflow, which are widely used frameworks, the applications are manifold and include all areas where deep learning is already applied. In particular, according to wavelet theory, the contributions are expected to be relevant when dealing with data that exhibit non-stationary behavior: financial time series, edge detection in image processing, speech recognition with localized and non-periodic noise, and anomaly detection, among other applications.

Finally, the source code to include wavelet activation functions in Keras-Tensorflow is available [19].
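To illustrate the first conclusion, the sketch below shows (a) the Haar wavelet written as a composition of shifted Heaviside steps and (b) a single wavelet neuron that reproduces XOR, a Boolean function that a single neuron with a monotonic activation cannot separate. Several points are assumptions made for illustration: the neuron is taken as $\psi((w \cdot x - t)/d)$ with translation $t$ and dilation $d$, the Mexican hat is taken with the sign convention $(1-u^{2})e^{-u^{2}/2}$, all function names are hypothetical, and the parameters are hand-picked rather than obtained by the evolutionary optimization used in the paper.

```python
import numpy as np


def haar_from_heaviside(x):
    """Haar wavelet on [0, 1) as a sum of shifted Heaviside steps (H(0) = 1)."""
    return (np.heaviside(x, 1.0)
            - 2.0 * np.heaviside(x - 0.5, 1.0)
            + np.heaviside(x - 1.0, 1.0))


def mexican_hat(u):
    """Mexican hat, taken here as (1 - u^2) * exp(-u^2 / 2) (assumed sign)."""
    return (1.0 - np.square(u)) * np.exp(-0.5 * np.square(u))


def wavelet_neuron(inputs, weights, translation, dilation):
    """Single neuron: psi((w . x - translation) / dilation)."""
    u = (inputs @ weights - translation) / dilation
    return mexican_hat(u)


if __name__ == "__main__":
    # The four input combinations of a two-input Boolean function.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    # Hand-picked parameters: the wavelet is centered on x1 + x2 = 1.
    y = wavelet_neuron(X, weights=np.array([1.0, 1.0]),
                       translation=1.0, dilation=1.0)
    print((y > 0.5).astype(int))  # prints [0 1 1 0], the XOR truth table
```

The neuron fires only when the weighted sum falls inside the central lobe of the wavelet, which is what allows one unit to isolate the band between the inputs (0,0) and (1,1).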

References

1. Activation Functions Explained – GELU, SELU, ELU, ReLU and more. https://mlfromscratch.com/activationfunctions-explained (2019), accessed 24 April 2021.

2. Novikoff A.B.J., On convergence proofs on perceptrons, Proceedings of the Symposium on the Mathematical Theory of Automata 12 (1962), 615–622.

3. Maas A.L., Hannun A.Y. and Ng A.Y., Rectifier nonlinearities improve neural network acoustic models, in: Proceedings of the ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.

4. Romero, Herrera and Landassuri V.M., Aproximación de funciones con EPWavenets, Research in Computing Science 93(5) (2015), 95–109.

5. Chui, An introduction to wavelets, Academic Press, 1992.

6. Clevert D.A., Unterthiner and Hochreiter, Fast and accurate deep network learning by exponential linear units (ELUs), in: International Conference on Learning Representations, 2016.

7. Hendrycks and Gimpel, Gaussian error linear units (GELUs), arXiv, 2020.

8. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals and Systems 2(4) (1989), 303–314.

9. Romero, Aproximación de funciones con redes neuronales y algoritmos evolutivos, Universidad Autónoma del Estado de México, México, 2016. http://ri.uaemex.mx/handle/20.500.11799/62920

10. Shunichi, Backpropagation and stochastic gradient descent method, Neurocomputing 5(4-5) (1993), 185–196.

11. Daubechies, Ten Lectures on Wavelets, Society for Industrial and Applied Mathematics, 1992.

12. Goodfellow, Bengio and Courville, Deep Learning, MIT Press, 2016.

13. Liu J.W., Zuo F.L., Guo Y.X., Li T.Y. and Chen J.M., Research on improved wavelet convolutional wavelet neural networks, Applied Intelligence 51 (2021), 4106–4126.

14. Keras, https://keras.io (2015), accessed 24 April 2021.

15. Soman K.P., Ramachandran K.I. and Resmi N.G., PHI Learning, 2010.

16. Schumaker, Spline functions: Basic theory, Cambridge University Press, 2007.

17. Diganta, Mish: A self regularized non-monotonic activation function, arXiv, 2020.

18. Herrera and González, Neuronas artificiales con wavelets paramétricos, Research in Computing Science 147(5) (2018), 333–342.

19. Herrera, Repository: Keras + Tensorflow + Wavelets. http://ia.azc.uam.mx, 2021.

20. Cristea, Tuduce and Cristea, Time Series Prediction With Wavelet Neural Networks, in: 5th Seminar on Neural Network Applications in Electrical Engineering, Belgrade, Yugoslavia, 2000.

21. Mello R.F. and Ponti M.A., Machine Learning: A Practical Approach on the Statistical Learning Theory, Springer, Cham, 2018.

22. Prajit, Barre and Le Quoc, Searching for activation functions, arXiv, 2017.

23. Chakraverty, Sahoo D.M. and Mahato N.R., McCulloch–Pitts Neural Network Model, in: Concepts of Soft Computing, Springer, Singapore, 2019.

24. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall PTR, 2nd edition, 1998.

25. Mallat, A Wavelet Tour of Signal Processing: The sparse way, Academic Press, 3rd edition, 2009.

26. Ravichandiran, Hands-On Deep Learning Algorithms with Python: Master Deep Learning algorithms with math by implementing them from scratch, Packt Publishing, 2019.

27. TensorFlow: Large-scale machine learning on heterogeneous systems. http://www.tensorflow.org (2015), accessed 24 April 2021.

28. Jieqing and Ping, Marr-type wavelets of high vanishing moments, Applied Mathematical Letters 20(11) (1115), 1115–1121.

29. Zhang and Benveniste, Wavelet networks, IEEE Transactions on Neural Networks 3(6) (1992), 889–898.

30. Glorot, Bordes and Bengio, Deep sparse rectifier neural networks, in: Proceedings of the Journal of Machine Learning Research, 2011.

31. Chen, Yang and Dong, Time-Series Prediction Using a Local Linear Wavelet Neural Network, Neurocomputing 69 (2006), 449–465.
