Abstract
Detecting and classifying objects correctly remains a constant challenge: recognizing them at different scales and in different scenarios, sometimes cropped or badly lit, is not an easy task. Convolutional neural networks (CNNs) have become a widely applied technique because they are fully trainable and well suited to feature extraction. However, the growing number of CNN applications constantly pushes for improvements in their accuracy. Initially, those improvements involved large datasets, augmentation techniques, and complex algorithms, methods that may have a high computational cost. Nevertheless, feature extraction is known to be the heart of the problem. As a result, other approaches combine different technologies to extract better features and improve accuracy without requiring more powerful hardware. In this paper, we propose a hybrid pooling method that incorporates multiresolution analysis within the CNN layers to reduce the feature map size without losing details. To prevent relevant information from being lost during downsampling, an existing pooling method is combined with the wavelet transform, keeping those details "alive" and enriching the later stages of the CNN. Obtaining higher-quality features improves CNN accuracy. To validate this study, ten pooling methods, including the proposed model, are tested on four benchmark datasets. The results are compared with four of the evaluated methods, which are also considered state of the art.
Introduction
Research continually explores new techniques to improve the performance of Convolutional Neural Networks (CNNs). Since deep learning was first proposed, there have been many recommendations for increasing CNN accuracy, ranging from enlarging training datasets and reducing batches [3] to pretrained networks, skip connections [27], deeper architectures [13], and complex connections [12]. CNNs can extract a huge number of features, which is the main reason they are used in so many applications. The training stage is where all key patterns are learned. This capability is built by updating the CNN parameter values, and the network accuracy improves as more relevant patterns are extracted during training.
Nevertheless, increasing the effectiveness of CNN architectures is not only about learning patterns but also about extracting high-quality features and, moreover, keeping the right ones. Feature extraction is therefore still an open research field. To address this challenge, some approaches have focused on different feature extraction techniques to identify key patterns. For example, [4] and [18] used spectral analysis to analyze heart sounds: the first uses the wavelet transform to produce scalograms that train a CNN, while the second uses denoising autoencoders to extract features from spectrograms for the same purpose. In [30], Principal Component Analysis is used to extract a feature vector and reduce its dimensionality, and the images are then classified with a multi-scale CNN. Statistical methods are also common. The work in [11] focused on detecting camouflaged objects. First, the image is divided into pieces of equal size called subblocks, and each subblock is decomposed to two levels using the Discrete Wavelet Transform with the Daubechies Db2 wavelet. The resulting coefficients are then used to extract features statistically. These features feed the final stage of the algorithm, which detects and locates the objects. In general, the main idea is to do more with less.
However, feature extraction takes place not only in the convolution layers but also in the pooling layers. According to [33] and [34], the pooling process enlarges the receptive field and decreases the computational cost. On the other hand, it also causes valuable information to be lost, specifically high-frequency patterns. Furthermore, according to [36] and [37], conventional CNNs lose most of the spectral information and can be considered a deficient form of Multi-Resolution Analysis (MRA). This problem can be overcome by incorporating a well-structured MRA such as the Wavelet Transform (WT). According to [22] and [23], the WT is particularly useful for preserving details because it captures both frequency and location features. The CNN feature extraction process can thus be enriched from the inside to obtain a more useful set of patterns, allowing the network to reach its optimal state in fewer training iterations.
During the last decade, many studies have incorporated MRA into their models to obtain such high-quality features. MRA has been used along with CNNs to increase image resolution, restore images, remove image artifacts, or extract features that are then fed into a CNN. Moreover, some of those frameworks have successfully blended wavelet techniques within the CNN layers to obtain useful patterns, applying them to nearly every element of the CNN. For example, the WT has been used for size reduction and pooling [5, 34] and [41], in residual networks and skip connections [17, 36] and [37], as an activation function in [26], and even in conventional neural networks [2] and [29], to mention just a few.
In this work, initially inspired by [41], we propose combining conventional pooling methods with the wavelet transform within the CNN layers to reduce the feature map size without losing details. The contributions of this work can be summarized as follows. Existing pooling methods are combined with the wavelet transform to improve CNN accuracy. The lifting scheme is implemented as part of the CNN layers to apply the wavelet transform. Pooling combinations not previously considered are evaluated. Only the approximation coefficients are needed in the proposed model. State-of-the-art accuracy is achieved in classifying four benchmark datasets.
This paper is organized as follows. Section 2 presents related work that uses the wavelet transform with CNN models. Section 3 briefly describes the lifting scheme and the most common pooling methods. Section 4 presents the proposed model. Section 5 reports the characteristics of the experiments and their results. Finally, Section 6 presents the conclusions.
Related work
CNNs are becoming more capable in image analysis applications thanks to the increasing number of layers and complex connections. However, adding more layers to increase performance produces more intricate models and a higher computational cost. Eventually, deeper networks may start losing accuracy. As mentioned earlier, these are attempts to improve the feature extraction process. MRA, on the other hand, keeps all the information contained in the signal, distributed among its coefficients across the different resolution levels. Each level contains specific features: higher resolutions show more details, and lower resolutions show only the strongest features.
In recent years, studies applying MRA to CNNs have become more common. Some of these studies are presented in Table 1, where the first two columns indicate the publication year and the application. The next column describes the relationship between the presented models and CNNs, which is also the objective of each study. The remaining columns describe the kind of wavelet solution implemented. For example, the fourth column shows that the Discrete Wavelet Transform (DWT), Continuous Wavelet Transform (CWT), Lifting Scheme Wavelet Transform (LSWT), Packet Wavelet Transform (PWT), and Fast Wavelet Transform (FWT) are the only wavelet transforms used in the analyzed studies. The last three columns complete the information on each model by adding the wavelet family selected, the number of wavelet coefficients applied, and the levels of decomposition. The wavelets used are not only families such as Daubechies or Coiflet but also adaptive wavelets, which are obtained through optimization algorithms. The 2D wavelet transforms produce four sets of coefficients, and all analyzed models require the approximation coefficients. However, not all studies use the detail coefficients; in [20], the detail coefficients are used only partially. The last column indicates the number of decomposition levels applied in each model. These studies introduce new frameworks with interesting results, such as [26], which replaces the sigmoid activation function of the CNN with a scaling function of the wavelet transform to classify handwritten digits. Other models use adaptive lifting schemes that learn the wavelet configuration by incorporating a small CNN as part of their internal blocks. This model is then embedded in a larger CNN to detect scenes and classify textures [31] or to compress images [20]. Other approaches also detect scenes, classify handwritten digits [39], categorize faces by gender [1], or identify emotions [40].
However, instead of blending both techniques in one framework, these models use wavelet coefficients as inputs for multiple CNNs. Solutions such as [19] and [42] do the same as the previous two models but additionally feed the CNN output to the inverse wavelet transform to produce super-resolution images. Replacing pooling layers is the most common use of wavelets within CNNs, as in [5, 34], and [41]. It appears in most of the previous applications, as well as in image restoration and lung tumor detection, among many others. For one-dimensional signals, as in [4] and [35], the WT is used to produce scalograms, which are 2D signals, that serve as input images for the CNN, transforming time-frequency data into 2D spatial information.
Wavelet transforms used to improve CNNs
Note: This table presents some of the most recent research as examples and is not intended to be a survey.
The aim of this study is to extract more features by combining MRA with existing pooling methods to improve CNN accuracy. By taking advantage of the lifting scheme, which fits the CNN layer format well, MRA is integrated as a block of layers. This block is combined with one of the existing pooling methods to complete the proposed model. It is important to note that no hybrid proposal using the WT considers existing pooling methods; they simply replace them.
The lifting scheme
The lifting scheme, an MRA technique, is used to design wavelets and apply the DWT. It is considered the second-generation wavelet transform and was developed by [43]. The lifting scheme replaces the finite filters with basic convolution operations, cutting the required arithmetic operations by half, as explained in [38] and [9]. The DWT applies low-pass and high-pass filters to the signal sequentially. The lifting scheme, in contrast, splits the signal into two halves, called the odd and even samples, and then applies some simple operations across the divided signal, as shown in Figure 1. The generalized lifting scheme, developed by [25], is based on the original lifting scheme but generalized to overcome the restriction of the scheme structure. The original lifting scheme has four steps. First, the signal is split into odd and even samples, which means that both groups are downsampled by half (↓2), as in Eq. (1) and Eq. (2).
Second, the prediction step applies an operator P() to the even samples and combines the result with the odd samples to obtain small variations known as details, as in Eq. (3). In this case, the Haar wavelet is used, and the prediction operator multiplies all even values by -1, so each detail is the difference between an odd sample and its even neighbor.
Third, the update step applies an operator U() that adjusts the low-frequency section by adding back some of the removed energy to obtain average values, as in Eq. (4). In this case, the update operator scales the details by the constant U. Finally, the last step normalizes the low- and high-frequency subbands by multiplying them by normalization constants.

The lifting scheme decomposes a signal into high- and low-frequency subbands, where [↓2] indicates that the signal is downsampled by half and P = -1.
To reconstruct the original signal, the wavelet transform can be inverted by applying the same steps in reverse order with the operations undone.
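The four steps above, together with their inversion, can be sketched in a few lines of plain Python. This is only an illustrative sketch for the Haar case: the function names are hypothetical, and the update factor (1/2) and normalization constants (√2 and 1/√2) are the standard Haar lifting values, assumed here because the text does not state them explicitly.

```python
import math

# Assumed standard Haar lifting constants (not stated explicitly in the text).
UPDATE = 0.5
N1 = math.sqrt(2.0)        # normalization of the low-frequency (approximation) subband
N2 = 1.0 / math.sqrt(2.0)  # normalization of the high-frequency (detail) subband


def haar_lifting_forward(signal):
    """Apply one level of Haar lifting to an even-length sequence."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    even = signal[0::2]  # split step: even-indexed samples
    odd = signal[1::2]   # split step: odd-indexed samples
    # Predict step: P = -1, so each detail is odd + (-1) * even.
    detail = [o - e for o, e in zip(odd, even)]
    # Update step: add a fraction of the details back to obtain pairwise averages.
    approx = [e + UPDATE * d for e, d in zip(even, detail)]
    # Normalization step.
    return [N1 * a for a in approx], [N2 * d for d in detail]


def haar_lifting_inverse(approx, detail):
    """Undo the forward steps in reverse order to recover the signal."""
    a = [x / N1 for x in approx]
    d = [x / N2 for x in detail]
    even = [ai - UPDATE * di for ai, di in zip(a, d)]  # undo update
    odd = [di + ei for di, ei in zip(d, even)]         # undo predict
    signal = [0.0] * (2 * len(even))
    signal[0::2] = even                                # merge (undo split)
    signal[1::2] = odd
    return signal
```

For the input `[4, 6, 10, 12]`, the approximation subband holds the scaled pairwise means and the detail subband the scaled pairwise differences; passing both to `haar_lifting_inverse` recovers the input up to floating-point rounding.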
When the lifting scheme is applied to an image, a 2D version is used. It simply repeats the same block two more times, once for the low-frequency and once for the high-frequency coefficients. The first block splits the signal in one direction (horizontal), while the other two blocks divide the signal in the other direction (vertical). The following equations repeat the lifting scheme steps for the vertical blocks [21]. The approximation and detail coefficients are split into odd and even samples, as shown in Eqs. (5)-(8).
where $d_e$ and $d_o$ are the even and odd samples of the detail coefficients, and $c_e$ and $c_o$ are the even and odd samples of the approximation coefficients, respectively. The prediction step applies Eq. (9) and Eq. (10) for the vertical blocks.
where HH and LH are the high-high and low-high frequency coefficients, respectively. Eq. (11) and Eq. (12) give the vertical update step.
where HL and LL are the high-low and low-low frequency coefficients, respectively. As before, the last step normalizes the low- and high-frequency subbands by multiplying them by the normalization constants.
At the end of the decomposition process, four sets of coefficients are obtained. They are also known as approximations (LL), verticals (HL), horizontals (LH), and diagonals (HH), as shown in Figure 2.

The 2D lifting scheme decomposes an image into four subbands: approximations (LL), horizontals (LH), verticals (HL), and diagonals (HH).
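As a rough illustration of the 2D decomposition, the sketch below applies a 1D Haar lifting block along the rows (horizontal pass) and then along the columns of each result (vertical pass). The helper names are hypothetical, the constants are the standard Haar lifting values (assumed, since the text does not list them), and the subband naming follows the description above: LL and LH come from the approximation branch, HL and HH from the detail branch.

```python
import math

SQRT2 = math.sqrt(2.0)  # assumed Haar normalization constant


def lift_1d(seq):
    """One Haar lifting block: returns (approximation, detail)."""
    even, odd = seq[0::2], seq[1::2]
    detail = [o - e for o, e in zip(odd, even)]           # predict (P = -1)
    approx = [e + 0.5 * d for e, d in zip(even, detail)]  # update (assumed U = 1/2)
    return [SQRT2 * a for a in approx], [d / SQRT2 for d in detail]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def lift_rows(matrix):
    """Apply the 1D block to every row; returns (low rows, high rows)."""
    lows, highs = [], []
    for row in matrix:
        approx, detail = lift_1d(row)
        lows.append(approx)
        highs.append(detail)
    return lows, highs


def haar_lifting_2d(image):
    """Decompose an image with even height and width into LL, LH, HL, HH."""
    c, d = lift_rows(image)                 # horizontal block
    ll_t, lh_t = lift_rows(transpose(c))    # vertical block on approximations
    hl_t, hh_t = lift_rows(transpose(d))    # vertical block on details
    return tuple(transpose(band) for band in (ll_t, lh_t, hl_t, hh_t))
```

Each subband has half the height and half the width of the input. With these constants the transform is orthonormal, so the sum of squares over the four subbands equals that of the image.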
The pooling layer is mainly used for subsampling. This method summarizes a region into a single value, reducing the dimensions of the input. The most widely known pooling methods are max pooling [28] and average pooling [14]. Max pooling takes the highest value of a region $R_{ij}$, while the region slides with a stride of S × S (in this case S = 2) over the entire image to reduce its feature map. The equation for max pooling is given in Eq. (13).
where I is the input image, (i, j) are the dimensions of the pooling region, $F_{max}$ is the resulting output, (x, y) are its dimensions, and (a, b) are indexes that run over all the elements of the region.
Similarly to max pooling, average pooling reduces the feature map by calculating the average value of each region. The general equation for average pooling is shown in Eq. (14).
where $F_{avg}(x, y)$ is the output, and $|R_{ij}|$ is the number of elements in the pooling region.
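Max and average pooling as described for Eqs. (13) and (14) can be sketched as follows. The function names are hypothetical, and the feature map is represented as a plain list of rows for illustration only; the region size and stride default to the 2 × 2 setting used in this study, with no padding.

```python
def pool2d(feature_map, reducer, size=2, stride=2):
    """Slide a size x size region over the map and reduce each region to one value."""
    rows, cols = len(feature_map), len(feature_map[0])
    output = []
    for x in range(0, rows - size + 1, stride):
        out_row = []
        for y in range(0, cols - size + 1, stride):
            # Collect all elements (a, b) of the current pooling region.
            region = [feature_map[x + a][y + b]
                      for a in range(size) for b in range(size)]
            out_row.append(reducer(region))
        output.append(out_row)
    return output


def max_pool(feature_map, size=2, stride=2):
    """Keep the highest value of each region."""
    return pool2d(feature_map, max, size, stride)


def avg_pool(feature_map, size=2, stride=2):
    """Keep the mean value of each region (sum divided by the region size)."""
    return pool2d(feature_map, lambda r: sum(r) / len(r), size, stride)
```

On a 4 × 4 input both functions return a 2 × 2 map; max pooling keeps the peak of each region, while average pooling smooths it, which is the low-pass behavior discussed later in the text.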
Two probabilistic pooling methods are mixed pooling [15] and stochastic pooling [32]. Mixed pooling selects between max and average pooling at random during training. The general equation for mixed pooling is presented in Eq. (15).
where λ is a random value of either 0 or 1.
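A minimal sketch of mixed pooling is given below. It assumes the usual form in which the output is λ times the max-pooled value plus (1 − λ) times the average-pooled value. The function name is hypothetical, and drawing λ once per call is an assumption made for illustration; the text does not specify how often λ is resampled.

```python
import random


def mixed_pool(feature_map, rng, size=2, stride=2):
    """Mixed pooling sketch: lambda = 1 selects max pooling, lambda = 0 average pooling."""
    lam = rng.randint(0, 1)  # assumption: one draw per call
    rows, cols = len(feature_map), len(feature_map[0])
    output = []
    for x in range(0, rows - size + 1, stride):
        out_row = []
        for y in range(0, cols - size + 1, stride):
            region = [feature_map[x + a][y + b]
                      for a in range(size) for b in range(size)]
            mean_value = sum(region) / len(region)
            out_row.append(lam * max(region) + (1 - lam) * mean_value)
        output.append(out_row)
    return output, lam
```

Passing a seeded `random.Random` instance as `rng` makes the choice of λ reproducible between runs.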
On the other hand, stochastic pooling randomly selects a value from the region according to its probability, which is computed from its magnitude.
Eq. (16) normalizes the value of each element within the pooling region to obtain the probabilities p(i, j).
where R(i, j) denotes the pooling region $R_{ij}$.
The general form of stochastic pooling, $F_{stoch}(x, y)$, is shown in Eq. (17).
where the function P() selects a sample from the multinomial distribution created with the probabilities obtained from Eq. (16).
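The two stages of stochastic pooling (normalizing the region into probabilities, then sampling one element) might look like the sketch below. The function name is hypothetical. The code assumes non-negative activations, as after a ReLU layer, and returns 0 for an all-zero region; both are illustrative choices not taken from the text.

```python
import random


def stochastic_pool(feature_map, rng, size=2, stride=2):
    """Stochastic pooling sketch for non-negative activations."""
    rows, cols = len(feature_map), len(feature_map[0])
    output = []
    for x in range(0, rows - size + 1, stride):
        out_row = []
        for y in range(0, cols - size + 1, stride):
            region = [feature_map[x + a][y + b]
                      for a in range(size) for b in range(size)]
            if any(v < 0 for v in region):
                raise ValueError("this sketch assumes non-negative activations")
            total = sum(region)
            if total == 0:
                out_row.append(0.0)  # assumption: empty region pools to zero
                continue
            # Normalize each element by the region sum to get its probability.
            probabilities = [v / total for v in region]
            # Sample one element of the region according to those probabilities.
            out_row.append(rng.choices(region, weights=probabilities, k=1)[0])
        output.append(out_row)
    return output
```

Larger activations are more likely to be selected, but smaller non-zero ones still have a chance, which is what distinguishes this method from max pooling.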
In CNNs, feature loss in the pooling layers is more relevant than in the convolution layers. The first convolution layers extract colors and edges, and the following convolutions group and mix that information to construct shapes and patterns. The pooling process, in contrast, may eliminate some of that information while downsampling: it selects one value and ignores the rest. Depending on the pooling method, that value can be the maximum, the average, or a randomly selected one. For example, average pooling loses relevant information by acting like a low-pass filter, generalizing features when it computes the mean value. Max pooling, in contrast, focuses on magnitude by keeping only the highest value. Although it loses some information, its output keeps frequency details that depend entirely on the shape of the input signal. This behavior allows the CNN to improve accuracy, which is why max pooling was selected for this study over the other methods.
The lifting scheme is also included in the model because it fits well within the CNN architecture, as in [20] and [31]. It is also an MRA technique that preserves frequency and spatial information when reducing dimensionality without any loss. Even if only the approximation coefficients are used, the lifting scheme can extract relevant features from images to improve CNN performance. According to [39], this subband holds most of the image energy and structural information and is the one most similar to the original image. The proposed model is a pooling method that incorporates both the max pooling technique and the 2D lifting scheme in parallel (see Figure 3). Since the model is embedded within the CNN, all its blocks are also network layers. The model is constructed by incorporating two lifting scheme architectures identical to the one presented in Figure 1, including the downsampling process (↓2) and the parameter values P, U, and N1. However, only the approximation outputs are connected to the next blocks, which are both leaky ReLU layers [10]. These layers are activation functions that prevent negative values from being lost without giving them too much weight. The general form of the leaky ReLU activation function, $F_{lReLU}(x)$, is defined in Eq. (18).

The proposed model is a pooling method that incorporates both the max pooling technique and the 2D lifting scheme in parallel. The lifting scheme used is for the Haar case, but it extracts only the LL coefficients.
where x is the input and negative inputs are multiplied by a small positive scale factor.
At the same time, the max pooling block performs downsampling with a 2 × 2 pooling region and a 2 × 2 stride, without padding, as defined previously in Eq. (13).
As the last step, the output $A_{mn}$ produced by the lifting scheme (passed through a leaky ReLU layer) and the output $B_{ij}$ generated by max pooling are combined using a concatenation layer [24]. Both arrays are defined in Eqs. (20) and (21).
where A is a matrix that contains real numbers, m and n are indexes that indicate the array dimensions.
where B is a matrix that contains real numbers, i and j are indexes that indicate the array dimensions.
The concatenation layer is frequently used in CNNs to join two or more blocks of information into one. In this case, the layer links both feature maps together to enrich the pooling process despite the downsampling applied to both outputs. The resulting array $C_{xyz}$ is defined in Eq. (22).
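Putting the pieces together, the hybrid pooling block can be sketched as two parallel branches whose outputs are concatenated. This is an illustrative sketch rather than the authors' Matlab implementation. The function names are hypothetical, the leaky ReLU scale of 0.01 is an assumed value, the Haar constants are the standard ones, and the concatenation is assumed to stack the two outputs along the channel dimension.

```python
import math

LEAKY_SCALE = 0.01  # assumed scale for negative inputs; not given in the text


def haar_ll(channel):
    """LL subband via two Haar lifting passes, keeping only approximations."""
    def approx_1d(seq):
        even, odd = seq[0::2], seq[1::2]
        detail = [o - e for o, e in zip(odd, even)]  # predict (P = -1)
        # Update (assumed U = 1/2) and normalize (assumed N1 = sqrt(2)).
        return [math.sqrt(2.0) * (e + 0.5 * d) for e, d in zip(even, detail)]

    rows_low = [approx_1d(row) for row in channel]                # horizontal pass
    cols_low = [approx_1d(list(col)) for col in zip(*rows_low)]   # vertical pass
    return [list(row) for row in zip(*cols_low)]


def leaky_relu(channel):
    """Keep positive values; scale negative values instead of discarding them."""
    return [[v if v >= 0 else LEAKY_SCALE * v for v in row] for row in channel]


def max_pool_2x2(channel):
    """Max pooling with a 2 x 2 region, stride 2, and no padding."""
    return [[max(channel[x][y], channel[x][y + 1],
                 channel[x + 1][y], channel[x + 1][y + 1])
             for y in range(0, len(channel[0]) - 1, 2)]
            for x in range(0, len(channel) - 1, 2)]


def hybrid_pool(channels):
    """Return the lifting branch followed by the max pooling branch, per channel."""
    branch_a = [leaky_relu(haar_ll(ch)) for ch in channels]  # A: LL + leaky ReLU
    branch_b = [max_pool_2x2(ch) for ch in channels]         # B: max pooling
    return branch_a + branch_b                               # C: concatenation
```

Under these assumptions the block halves the height and width of each feature map while doubling the number of channels, so the next layer sees both a low-pass summary and the local maxima of every region.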
All experiments used Matlab R2020b with the Deep Learning Toolbox, and the CNN used was the MatConvNet network also reported in [41]. Only one change was made to the CNN: the local response normalization (LRN) layer is used instead of the batch normalization (BN) layer for the practicality of the applications. Four benchmark datasets are used for training and testing, as shown in Figure 4. All networks are trained using stochastic gradient descent with an initial learning rate of 0.001, a minibatch size of 64, and a maximum of 20 epochs. All tests are run on a 64-bit operating system (Windows 10). The CPU was an AMD Ryzen 7 4800H with Radeon Graphics @ 2.90 GHz, with 16.0 GB of RAM, and the GPU was an NVIDIA GeForce GTX 1050. This study used the Haar wavelet as the basis because it simplifies the lifting scheme implementation.

Four benchmark datasets are used in this study. a) MNIST, b) CIFAR-10, c) SVHN, and d) KDEF.
The MNIST dataset [44] contains a large number of grayscale images of handwritten single digits divided into 10 classes, each 28x28 pixels in size. The training set consists of 60,000 images and the testing set of 10,000 images.
The network structure used for this dataset consists of the following layers (see Figure 5): four convolutions, three normalizations, one ReLU, one softmax, one classification, and two pooling. The highlighted blocks are the pooling layers, indicating where the different methods are evaluated within the CNN.

Architecture of the CNN used to test the MNIST dataset. The highlighted blocks indicate the pooling layers.
The KDEF dataset [16] contains emotional faces of 70 people, 35 females and 35 males (between 20 and 30 years old). The set covers seven basic emotions (afraid, angry, disgusted, happy, neutral, sad, and surprised) and five poses (full left and right profiles, half left and right profiles, and straight). The color images are resized to 128x128 pixels because of memory and time constraints, as in [41].
This dataset uses the largest network structure of this study because its images are also the largest of the four datasets, and the network requires more layers to reduce the dimensions of the input. However, this network is configured for 7 classes, while the others are configured for 10. The CNN architecture consists of the following layers (see Figure 6): five convolutions, four normalizations, four ReLU, two dropout, one softmax, one classification, and four pooling. The two dropout layers are used to prevent the network from overfitting.

Architecture of the CNN used to test the KDEF dataset. The highlighted blocks indicate the pooling layers.
The CIFAR-10 dataset [6] consists of 10 classes of objects (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck), and the color images are 32x32 pixels. The training set has 50,000 images and the testing set has 10,000 images. The last dataset is made of street view house numbers (SVHN) [45] categorized into 10 classes, one for each digit. The training set consists of 73,257 digits and the testing set of 26,032; in addition, the dataset contains 531,131 extra images, 30,000 of which were used for validation as in [41]. The color images are 32x32 pixels.
The architecture of the network used for the CIFAR-10 and SVHN datasets is similar to the previous one, but because of the increase in size and complexity, the CNN is made of the following layers (see Figure 7): five convolutions, three normalizations, four ReLU, two dropouts, one softmax, one classification, and three pooling. The only difference between the CNNs for the two datasets is the number of filters used. CIFAR-10 uses 32 in the first three convolution blocks and 64 in the fourth convolution block, while SVHN doubles them. SVHN uses images with more complex patterns and colors than CIFAR-10 and needs more filters to capture those features.

Structure of the CNN used to test the CIFAR-10 and SVHN datasets. The asterisk in the convolution blocks is replaced by 32 and 64 for the CIFAR-10 case, and by 64 and 128 for the SVHN CNN. The highlighted blocks indicate the pooling layers.
The pooling methods used in this study (see Table 2) include four single methods: Average (Avg), Maximum (Max.), Mixed (Mix.), and Stochastic (Stoch.), which are the most frequently cited methods for CNNs. The next four methods are hybrids, formed by concatenating the LL output of the lifting scheme (Lift.) with each of the four single methods. Finally, the last method concatenates the four lifting scheme coefficients for the next layer.
Accuracy Results for the Pooling Methods
*Proposed Model.
Every CNN architecture has pooling layers, indicated in Figures 5 to 7. Each pooling method has been implemented in those layers, using one single method for the entire network. Every method has been tested on all four datasets with a 2 × 2 region, a 2 × 2 stride, and no padding.
The test results are summarized in Table 2. The four datasets are arranged one per row, and the ten pooling methods are shown horizontally to facilitate comparison among them. The proposed model is indicated by an asterisk.
Although the hybrid models perform well, average pooling and max pooling alone exceed some of them. On the other hand, the proposed model performs as well as or better than all the other tested models, except on the SVHN dataset, where it is nevertheless very close to the model with the highest accuracy.
One might expect the Lift.2 model, which uses all the lifting scheme coefficients, to reach the highest performance, given that it contains all the details. The detail coefficients are obtained by a high-pass filter, average pooling acts like a low-pass filter, and mixed pooling and stochastic pooling select arbitrary values. Max pooling, however, is unlike any of them, and its performance depends directly on the signal shape. Moreover, max pooling is well known for helping to prevent CNNs from overfitting, and it behaves more like a sliding filter. Combining this behavior with a fixed low-pass filter not only yields features at different frequencies but also prevents the network from memorizing values, achieving higher results.
The hybrid pooling model outperformed traditional pooling methods in CNNs and also produced better results than using the WT alone. The proposed model outperforms all others on the MNIST, CIFAR-10, and KDEF datasets and achieves competitive results on the SVHN dataset. The performance of max pooling still depends on the dataset and the CNN architecture. However, hybrid approaches can improve the accuracy of existing methods.
Finally, implementing the wavelet transform based on the lifting scheme as a ready-to-use block may encourage the development of more models that effectively exploit the capabilities of the WT and CNNs to create new solutions.
Future work includes the use of different wavelets from the Daubechies and Coiflet families, which may produce relevant results; the design of a pooling method that dynamically selects the wavelet coefficients in order to take advantage of all available details; and an analysis of the frequency response of max pooling by modifying the shape of the input signal.
Acknowledgments
The first author gratefully acknowledges the financial support from the Mexican National Council for Science and Technology (CONACYT) and the Universidad de las Americas Puebla (UDLAP), Mexico. The authors would like to thank the anonymous reviewers for their helpful comments.
