Abstract
Hyperparameter optimization is a crucial step in the implementation of any machine learning model. This optimization process includes regularly modifying the hyperparameter values of the model in order to minimize the testing error. A deep neural learning model hyperparameter optimization process includes optimizing both the model parameters and architecture. Optimizing a model’s parameters involves deciding the values of parameters, such as learning rate and batch size. Optimizing architectural hyperparameters includes deciding the shape of the deep neural learning model, i.e., the number of layers of individual types and the number of neurons in a certain layer. The state-of-the-art hyperparameter optimization methods don’t optimize the position of the hyperparameter within the model architecture. In this work, we study the effect of changing a hyperparameter within the deep learning model architecture. Thus, we propose an
Keywords
Introduction
Deep neural networks are utilized successfully in different application fields such as image classification [18], healthcare [32], and self-driving cars [21]. The performance achieved by these techniques is reported to be superior to that of other machine learning techniques [9, 10]. For instance, in [17], the authors proposed a rectifier neural network for image classification that can surpass the reported human-level performance.
Building a successful deep neural learning model is a complex task. Deep neural networks have complex architectures: they consist of a number of stacked layers that automate the process of feature extraction from the input data. The major deep neural network architectures include convolutional neural networks (CNNs), recurrent neural networks, and recursive neural networks. The performance of a deep learning model depends on a set of hyperparameter values, e.g. the number of layers, batch size, weight initialization, and Dropout rate [19, 27]. Choosing the correct hyperparameter values for a deep learning model is a challenging task. For a given learning model, the hyperparameter optimization task consists of searching for the set of hyperparameter values that minimizes the generalization error E.
In this paper, we address the problem of hyperparameters optimization within one deep learning architecture, CNN, as CNN is considered one of the most complex deep neural network architectures. Tuning a CNN model is a big burden in deep learning model construction, as it may have an abundance of hyperparameters and millions of parameters.
For any deep learning model, the hyperparameters to be optimized can be classified into two classes, namely, value-based hyperparameters (e.g. learning rate and momentum) and architectural hyperparameters (e.g. depth of the network and choice of connectivity per layer). Besides, hyperparameter optimization methods can be classified according to the search approach. The most widely used search approaches are grid search, random search, evolutionary algorithms, reinforcement learning, and sequential model-based global optimization (SMBO), e.g. Bayesian optimization and the tree-structured Parzen estimator (TPE). From a practical perspective, there is an abundance of hyperparameter optimization tools, e.g. hyperopt [4] and Optunity [8].
Much research has been conducted for optimizing the hyperparameters values. In [26], the authors utilized chaining derivatives backward through the entire training procedure to optimize hyperparameters values such as step-size and momentum schedules and weight initialization distributions. In [3], the hyperparameters are optimized using a random search. Random search can achieve the hyperparameter optimization task within a small fraction of the computation time in comparison to the brute-force (grid search) method. Bayesian optimization is used to select promising hyperparameters based on the results of the previous values, as proposed in [14].
Architectural hyperparameters have received considerable attention as well. In [23], the authors provided a survey of deep neural network architecture search, with a focus on CNN architectures. In [36], the authors used an evolutionary method, genetic programming, to decide on a deep learning model structure that maximizes the validation accuracy. Recently, reinforcement learning has been adapted to optimize neural network architectures, as proposed in [2, 31].
Architectural hyperparameter optimization covers the number of layers, the number of neurons per layer, and the connectivity type of each layer. The position of a hyperparameter can also affect the performance of the CNN model. For instance, whether to place a batch normalization (BN) layer after or before the activation layer depends on the problem at hand. In [18], the authors built a model by adopting a BN layer right after each convolution layer and before the activation layer. In contrast, several practical tests have shown that a BN layer placed after an activation layer performs better [11]. For instance, research was conducted to combine brain magnetic resonance imaging (MRI) scans with methylation data from The Cancer Genome Atlas to predict the methylation state of the regulatory regions in cancer patients [16]. To achieve this goal, the authors proposed a CNN model in which a BN layer is used after the activation layer.
To the best of the authors’ knowledge, the state-of-the-art methods don’t optimize the position of hyperparameters. This research gap motivates the current work to answer questions such as "What is the best position of a BN layer? Is it better to place a BN layer after or before the activation layer?"
In this work, we propose ArchPosOpt; the proposed method adds a new dimension to the existing hyperparameter optimization techniques, namely the position of the hyperparameter within the model architecture. In this context, hyperparameter position optimization refers to finding the places of BN and dropout layers within the model architecture so that the model error rate is minimized. We propose extending the grid search cross-validation (GridSearchCV) function of the scikit-learn package [30], and the random search and TPE of the hyperopt library [4].
To evaluate this contribution, we compare the results of two scenarios. The first scenario searches for hyperparameter values only, using the three tools: grid search, random search, and TPE. The model architecture of the first scenario follows the recommended positions of the BN and Dropout layers, as suggested in the original papers proposing these two concepts. The second scenario, on the other hand, searches for the optimal positions of the BN and Dropout layers using the proposed method, which extends the three tools for that purpose. The models of the two scenarios were trained from scratch on two standard datasets, CIFAR-10 [24] and Dogs vs. Cats [1]. The former and latter datasets represent multi-class and binary image classification problems, respectively.
The contributions of this work are as follows: To the best of the authors’ knowledge, the proposed method ArchPosOpt is the first tool to search for the optimal positions of the hyperparameters within the deep neural network architecture. We extended three well-known hyperparameter optimization tools, GridSearchCV, randomSearch, and TPE, to search for the optimal position of the hyperparameters. We conducted a set of experiments to evaluate the performance of the proposed method. In addition, we tested the effect of seven different hyperparameters separately on the deep neural network.
This paper is structured as follows. In Section 2, we review CNNs and discuss the related work of hyperparameter optimization. In Section 3, we elaborate on the proposed method. In Section 4, we expose a set of experiments to evaluate the proposed method and the effect of tuning seven different hyperparameters on a CNN model. Finally, the paper is concluded in Section 5.
Background and related work
Background
CNNs are one of the most interesting deep learning architectures. Their architecture is biologically inspired by the organization of the animal visual cortex [20]. CNNs have shown strong performance on two-dimensional data such as images and videos. Thus, they achieve state-of-the-art results on many problems, e.g. image classification. The beginning of CNNs is related to LeNet [25], which was introduced by Yann LeCun in 1998. At that time, LeNet was used in tasks such as character recognition.
A CNN performs three operations: convolution followed by a non-linearity, pooling (subsampling), and classification. These run on the three main layer types: convolutional, pooling, and fully connected layers, respectively. CNN training has two stages, a feed-forward stage and a backward stage.
In the first stage, the input is propagated through each layer using its weights and biases to produce an output, and the loss between the expected and the actual values is computed. In the backward stage, the chain rule is used to compute the gradient of each parameter and update all parameters, and then another forward stage begins with the updated parameters. After a sufficient number of iterations, the network has learned, i.e., the loss has reached its minimum.
Convolutional layers are used mainly as feature extractors from the given input. They are powerful because they exploit the spatial relationship among the input components: they learn important features of the input from small squares of the given data and reduce the number of parameters by sharing them within the same feature map. After extracting all local features, the location relationship between features can be determined. After every convolution operation, a non-linear operation is applied. The most common function for non-linear operations is the rectified linear unit (ReLU). The ReLU output is computed by the following equation, output = max(0, input).
The pooling layer plays an important role in reducing the dimensionality of the feature map. Although a lot of information is lost, a pooling layer keeps the important information. The pooling layer is also invariant to small translations because its computation depends on neighboring pixels. Maximum (max) and average pooling are the two commonly used methods. In the case of max-pooling, for example with a 2×2 window, the largest value is selected from the window. Max-pooling has been shown to be better than average pooling based on a detailed theoretical analysis [7]. In addition, max-pooling has faster convergence and better generalization [33].
A fully connected (FC) layer is the final layer of any CNN. The term “Fully Connected” means that all the nodes in two consecutive layers are fully connected. Convolutional and pooling layers represent the input data as high-level features. These high-level features are used as inputs to the FC layer to classify the input data into a predefined set of classes, based on the training dataset used.
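As a minimal illustration of the 2×2 pooling window described above, the following Python sketch applies max and average pooling to a small feature map. The helper name pool2x2 and the sample values are assumptions made for this example only; a stride of 2 with non-overlapping windows is also assumed.

```python
def pool2x2(feature_map, reduce_fn):
    """Apply a non-overlapping 2x2 pooling window (stride 2) to a 2-D list."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - 1, 2):
        pooled_row = []
        for c in range(0, cols - 1, 2):
            # Collect the four neighboring values covered by the window.
            window = [
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            ]
            pooled_row.append(reduce_fn(window))
        pooled.append(pooled_row)
    return pooled


# Illustrative 4x4 feature map (values are made up).
feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]

max_pooled = pool2x2(feature_map, max)                             # [[6, 2], [2, 9]]
avg_pooled = pool2x2(feature_map, lambda w: sum(w) / len(w))       # [[3.5, 1.0], [1.0, 6.0]]
```

Both variants reduce the 4×4 map to 2×2; max-pooling keeps the strongest response in each window, whereas average pooling smooths it.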
Related work
In the literature, a number of attempts have been made to find the hyperparameter configuration that minimizes the model error. These efforts can be classified based on the hyperparameter types into two classes, value-based (e.g. learning rate and momentum) and architectural-based (e.g. depth of the network and choice of connectivity per layer) parameters. Another classification of these efforts can be established based on the search method of the optimization task. The most widely used search methods are grid search, random search, Bayesian optimization, and sub-optimal methods, e.g., evolutionary algorithms, reinforcement learning, random forests, and TPE.
Value-based hyperparameter optimization methods focus on tuning the values of one or more hyperparameters. The value-based hyperparameters include learning rate, batch size, activation functions, pooling variants, dropout rates, etc.
For instance, in [29], research was conducted to explore one hyperparameter of CNNs, the activation function. The authors performed a set of experiments and tested several activation functions to figure out which activation function is most suitable for each experiment/case. Similarly, the authors of [37] compared the variants of the most used activation function, ReLU. Other research studied the relation between two hyperparameters, batch size and learning rate [34]. The authors recommended increasing the batch size instead of decreasing the learning rate. Finally, the most common hyperparameters are considered in [28], such as the selection of pooling variants (stochastic, max, average, mixed), classifier design (convolutional, FC), learning rate, and batch size.
Architectural-based hyperparameter optimization methods try to find the architecture that achieves the best performance. The architectural-based hyperparameter optimization task includes optimizing the number of the layers, number of units per layer, number of filters per layer, etc. The decisions regarding the selection of such architectural options used to be reached manually through significant human expertise; this approach is inefficient especially for large and complex deep neural networks. Thus, the area of automating CNN architecture search becomes critical [23].
The grid search (brute force) method is widely used for hyperparameter optimization, especially for models with limited search space. The Python scikit-learn package includes an implementation of the grid search method, GridSearchCV function [30]. GridSearchCV takes a list of hyperparameter values and trains the model. Then, it returns the best hyperparameters values after searching all possible options. For example, to optimize the Dropout rate, GridSearchCV takes a list of Dropout rates, e.g., 0.2, 0.5, 0.7, 0.9, trains different CNN models with every possible value, and then returns the model with the value that achieves the best accuracy. The method guarantees to find the optimal solution within the specified search space, but its main disadvantage is the intensive computational time, as it tries all possible hyperparameter configurations. One possible solution for the problem is to parallelize the GridSearchCV function.
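The exhaustive behavior described above can be sketched in a few lines of plain Python. This is an illustration of the grid search idea only, not the scikit-learn GridSearchCV implementation; the function toy_validation_accuracy is a hypothetical stand-in for training a CNN and returning its validation accuracy, and its scores are made up.

```python
from itertools import product


def grid_search(search_space, evaluate):
    """Evaluate every combination in search_space and return the best one."""
    names = list(search_space)
    best_config, best_score = None, float("-inf")
    for values in product(*(search_space[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_validation_accuracy(config):
    # Hypothetical stand-in for "train a CNN, return validation accuracy".
    # It peaks at a Dropout rate of 0.5 purely for illustration.
    return 1.0 - abs(config["dropout_rate"] - 0.5)


# The list of Dropout rates used as an example in the text.
space = {"dropout_rate": [0.2, 0.5, 0.7, 0.9]}
best_config, best_score = grid_search(space, toy_validation_accuracy)
```

Because every combination is evaluated, the cost grows with the product of the option counts, which is the computational drawback noted above.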
An alternative to the grid search method is the random search technique; it is available in the Python scikit-learn package [30] and in the hyperopt package [4] as well. The main advantages of random search are that it is trivially parallel, its implementation is very simple, and it enables the model designer to set the number of combinations to be considered. This number should be feasible for the model designer in terms of computational power and time.
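A minimal sketch of random search over a finite search space is given below. It is an assumption-laden illustration rather than the scikit-learn or hyperopt implementation: configurations are sampled without replacement from the full grid, the option names are chosen for this example, and toy_validation_accuracy returns made-up scores in place of real training.

```python
import random
from itertools import product


def random_search(search_space, evaluate, n_iter, seed=0):
    """Evaluate n_iter randomly chosen configurations and return the best."""
    names = list(search_space)
    all_configs = [
        dict(zip(names, values))
        for values in product(*(search_space[name] for name in names))
    ]
    rng = random.Random(seed)
    # The designer-chosen budget n_iter caps the number of trained models.
    sampled = rng.sample(all_configs, k=min(n_iter, len(all_configs)))
    scored = [(evaluate(config), config) for config in sampled]
    best_score, best_config = max(scored, key=lambda pair: pair[0])
    return best_config, best_score, len(sampled)


def toy_validation_accuracy(config):
    # Hypothetical scores, for illustration only (not experimental results).
    score = 0.6
    score += 0.1 if config["optimizer"] == "Adam" else 0.0
    score += 0.1 if config["weight_init"] == "glorot_normal" else 0.0
    score += 0.05 if config["dropout_rate"] == 0.5 else 0.0
    return score


space = {
    "dropout_rate": [0.5, 0.7],
    "optimizer": ["Adam", "SGD"],
    "weight_init": ["random_normal", "glorot_normal"],
}
best_config, best_score, n_evaluated = random_search(
    space, toy_validation_accuracy, n_iter=4
)
```

Each sampled configuration is evaluated independently of the others, which is why this strategy parallelizes trivially.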
In [13], the authors utilized Bayesian optimization to find the optimal deep neural network architecture for a given dataset by considering a set of previous datasets and their corresponding machine learning models. Other research combined Bayesian optimization with a bandit strategy to optimize the number of layers and units per layer [12].
Bayesian optimization-based methods assume a prior distribution of the loss function. Then, these methods tune the hyperparameters by updating this prior distribution based on the previous iterations/ observations. Bayesian optimization converges slowly at the beginning, and then the convergence rate consistently increases. The main problem for Bayesian optimization is that it is a sequential strategy and can’t be parallelized. Thus, for problems with huge computational time per iteration, it is not the best choice.
In [36], the authors utilized Cartesian genetic programming to tune the CNN structure and connectivity. The network encoded by the Cartesian genetic programming representation was trained to classify the images of the CIFAR-10 dataset. Then, the error on the validation set was used as the fitness of the architecture, which drives the optimization of the CNN architectures. Another evolutionary algorithm is used to achieve the same goal in [38]. The authors proposed using particle swarm optimization (PSO) for tuning the hyperparameter values of a CNN model for image classification on the CIFAR-10 and CIFAR-100 datasets. The mechanism of evolutionary algorithm-based hyperparameter optimization is sequential; thus, these methods can’t be parallelized.
In [5], the authors proposed utilizing the TPE for the purpose of hyperparameter optimization task. The proposed method outperformed both manual and random search on tuning the hyperparameter values and architectural options. TPE can be exploited more when we compute the models with different sets of hyperparameters in parallel instead of serially. Parallelizing TPE makes search less efficient, though the running time is faster [6]. The core idea of parallelizing TPE is to use stochasticity of draws to help move from one iteration to the next iteration.
From a practical perspective, there is an abundance of hyperparameter optimization tools. Hyperopt is a Python library implementing two solvers for hyperparameter optimization, random search and TPE [4]. Another Python library is Optunity; this library provides several solvers, i.e., grid search, random search, PSO, and TPE [8], where PSO is the default solver of this library.
Based on this discussion, existing hyperparameter optimization methods do not address the position of the hyperparameter within the model. In the following section, we discuss the proposed work to bridge this research gap.
ArchPosOpt: the proposed method
The ultimate goal of the ArchPosOpt method is to search for the optimal positions of the hyperparameters within the model architecture. These optimal positions should minimize the error rate of the model. The idea of ArchPosOpt can be understood through a flowchart, as illustrated in Fig. 1. Besides, ArchPosOpt should search for the optimal hyperparameter values, like the other existing hyperparameter optimization methods.

ArchPosOpt flowchart
The flowchart of ArchPosOpt is depicted in Fig. 1. First, ArchPosOpt receives the model architecture as a string that encodes the possible values of each hyperparameter and the possible positions of the hyperparameters. The input model architecture contains each layer type followed by the possible options of this layer, surrounded by square brackets. For instance, the string "Conv[32,64]" refers to a convolutional layer with two possible options: 32 units or 64 units. The layers are separated by a delimiter within the input string. For instance, if the model architecture includes a convolutional layer and then an activation layer with the ReLU function, the input model architecture should be "Conv[32,64]+Act[ReLU]". There are many layer types, e.g. Conv, Pooling, FC, Pos, etc.
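To make the input format concrete, the following sketch parses an architecture string such as "Conv[32,64]+Act[ReLU]" into layer types and option lists. The function parse_architecture and its regular expression are hypothetical and only follow the format described above; treating a token without brackets as a layer with no options is an additional assumption.

```python
import re

# A token is a layer name optionally followed by bracketed, comma-separated options.
TOKEN_PATTERN = re.compile(r"^(?P<layer>[A-Za-z]+)(?:\[(?P<options>[^\]]*)\])?$")


def parse_architecture(arch_string, delimiter="+"):
    """Split an architecture string into (layer_type, options) pairs."""
    parsed = []
    for token in arch_string.split(delimiter):
        match = TOKEN_PATTERN.match(token.strip())
        if match is None:
            raise ValueError(f"Malformed token: {token!r}")
        raw = match.group("options")
        # Tokens without brackets are assumed to carry no options.
        options = [opt.strip() for opt in raw.split(",")] if raw is not None else []
        parsed.append((match.group("layer"), options))
    return parsed


layers = parse_architecture("Conv[32,64]+Act[ReLU]")
# layers == [("Conv", ["32", "64"]), ("Act", ["ReLU"])]
```

Options are kept as strings here; converting them to numbers or layer objects is left to the model generation step.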
For the second input, the ArchPosOpt should receive the number of iterations. This input defines the number of options selected from the search space, N. Thus, the ArchPosOpt will generate N different variations of the model architecture. In the case of GridSearch, N equals all possible model variations. The last input is the search method; ArchPosOpt supports three methods, as indicated in Fig. 1.
In the flowchart, the ArchPosOpt first step is to generate the search space. Then, the search method selects a value for each hyperparameter; step 2. The ArchPosOpt utilizes a new hyperparameter called "Pos", and this hyperparameter can be used to select the optimal position of a hyperparameter, e.g. Dropout or BN layers. Then, ArchPosOpt generates a model using the hyperparameter values with the help of Algorithm 3. Then, the generated model is trained and its accuracy and configurations are saved, steps 4 and 5, respectively. Finally, once the N iterations are complete, the model variation with the highest accuracy is reported in terms of model layers, hyperparameters’ values, and hyperparameters’ positions.
The example of Fig. 1-ex helps in understanding the steps of the ArchPosOpt flowchart. Fig. 1-ex shows an input example of ArchPosOpt. In this example, the delimiter is the ’+’ symbol; this delimiter separates the layers within the input string.
In Fig. 1-ex, the model has a convolutional layer with two possible configurations, 32 and 64 units. This is the first hyperparameter value and it has two options. Then, the convolutional layer is followed by an activation layer with two possible options, ReLU and PReLU. This is the second hyperparameter value. After the activation layer, ArchPosOpt should decide whether it is better to add a BN layer, a dropout layer with a rate equal to 0.25, or a dropout layer with a rate equal to 0.5. This is the first hyperparameter position in the model. Then, the input indicates that the model should have another convolutional layer with two possible configurations; 64 and 128 units. The model has a pooling layer followed by another hyperparameter position. This hyperparameter position may be a BN layer, a dropout layer with a rate equal to 0.5 or 0.75, or no layer at all. The option "–" means this position has no layer; it is treated as though it doesn’t exist in the model. Finally, the model should include an FC layer as the last layer. Besides, the number of iterations equals five. Thus, only five model variations will be checked. The selected search method of this example is TPE, the third input.
In Fig. 1-ex step 1, the search space is generated. The input model has three hyperparameter values and two hyperparameter positions. There are 96 possible model variations; this is the result of multiplying the numbers of options of each hyperparameter value and position, 2 × 2 × 3 × 2 × 4 = 96. In step 1, c_param1 and c_param2 refer to the options of the first and second convolutional layers, respectively. Similarly, h_param1 and h_param2 refer to the options of the first and second hyperparameter positions, respectively. Besides, a_param refers to the activation layer option. Step 2 shows an example of selected values from the search space. Finally, step 3 generates a model based on step 2 values. Steps 5 to 7 are not included in Fig. 1-ex, as they are very clear in the flowchart, Fig. 1.
The pseudo-code of ArchPosOpt’s third step is listed in Algorithm 3. The algorithm receives a string as an input, strArchit. This string represents the model in terms of layers, hyperparameter values, and hyperparameter positions. The string consists of consecutive tokens separated by delimiters. Each token represents a layer/position with its parameters. The algorithm reads the input string token by token, lines 1 to 15. Once a token is read, a corresponding layer is added to the model, lines 3 to 13. Besides, Algorithm 3 extracts the parameter values for each layer from the input string. These parameters are denoted in the algorithm based on the type of the layer, e.g., a_param and c_param for activation layer and convolutional layer parameters, respectively.
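The size of this search space can be checked with a short sketch. The parameter names follow the text (c_param1, a_param, h_param1, c_param2, h_param2); the string encoding of the position options, such as "Dropout(0.25)" and an ASCII "-" for the empty position, is an assumption made for this example.

```python
from itertools import product
from math import prod

search_space = {
    "c_param1": [32, 64],                                      # first convolutional layer
    "a_param": ["ReLU", "PReLU"],                              # activation layer
    "h_param1": ["BN", "Dropout(0.25)", "Dropout(0.5)"],       # first position
    "c_param2": [64, 128],                                     # second convolutional layer
    "h_param2": ["BN", "Dropout(0.5)", "Dropout(0.75)", "-"],  # second position
}

# Number of model variations: product of the option counts.
n_variations = prod(len(options) for options in search_space.values())

# Explicit enumeration, as an exhaustive search would do it.
all_variations = [
    dict(zip(search_space, values))
    for values in product(*search_space.values())
]
# n_variations == len(all_variations) == 96
```

With five iterations, as in the example, only 5 of these 96 variations would be trained.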
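The token-by-token reading described above can be sketched as follows. This is not Algorithm 3 itself: the function generate_model, the "Dropout:0.25" option syntax, the ASCII "-" for an empty position, and the tuple-based layer descriptions (in place of real Keras layers) are all assumptions made to keep the example self-contained.

```python
def generate_model(str_archit, delimiter="+"):
    """Read the architecture string token by token and collect layer specs."""
    model = []
    for token in str_archit.split(delimiter):
        layer_type, _, rest = token.strip().partition("[")
        param = rest.rstrip("]") or None  # None when the token has no brackets
        if layer_type == "Pos":
            if param in (None, "-"):
                continue  # empty position: no layer is added to the model
            kind, _, rate = param.partition(":")
            # A position holds either a BN layer or a Dropout layer with a rate.
            model.append((kind, float(rate)) if rate else (kind,))
        elif layer_type == "Conv":
            model.append(("Conv", int(param)))   # c_param: number of units
        elif layer_type == "Act":
            model.append(("Act", param))         # a_param: activation function
        else:
            model.append((layer_type,))          # e.g. Pooling, FC
    return model


# One selected variation (hypothetical encoding of step 2 output).
selected = "Conv[64]+Act[ReLU]+Pos[Dropout:0.25]+Conv[128]+Pooling+Pos[-]+FC"
model = generate_model(selected)
```

In a real implementation, each tuple would be replaced by the corresponding layer object of the deep learning library in use.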
In this section, we discuss the datasets used and the setup of the proposed experiments. Then, we discuss the CNN model architecture that was utilized in all the experiments. Besides, we evaluate the results of the proposed method ArchPosOpt. Additionally, we present the results for each hyperparameter, and discuss the effect of tuning this hyperparameter on the proposed CNN model. The last subsection provides a list of recommendations for selecting the hyperparameter values and positions.
Datasets and setup
All experiments are conducted using the Python programming language and the Keras library with the TensorFlow backend. The experiments are performed on a computer with two 2.3 GHz Intel 8-core processors, a Tesla K80 GPU, and a P100 NVIDIA GPU. The utilized OS is 64-bit Linux.
The experiments include two datasets of image classification tasks. The first dataset is the standard CIFAR-10 [24] for the categorical (multi-class) classification problem, and the second is Dogs vs. Cats [1] for the binary classification problem. CIFAR-10 is a standard colored images dataset for multi-class image classification problems. It consists of 60,000 images divided into 10 classes; every class consists of 6,000 images. The dimensions of every image are 32×32. The dataset is divided into training and test sets of 50,000 and 10,000 images, respectively. CIFAR-10 is saved in five different batches for training and one batch for testing; each batch has 10,000 images. The Dogs vs. Cats dataset contains 25,000 color images with a fixed size of 72×72 in two classes. The training and test sets consist of 20,000 and 5,000 images, respectively.
Model Architecture
In all the experiments, we used a standard CNN architecture, which is described in Fig. 3. We call this model the

ArchPosOpt example inputs and model generation

The utilized model architecture of the experiments
Experimental setup
The experiments of this section are performed on the P100 GPU. This section evaluates the decision of selecting the place of Dropout and BN layers. There are two possible scenarios for making this decision. First, the CNN model designer follows the positions suggested by the original papers. Second, the CNN model designer uses the trial and error approach to find the best positions of these two layers. Thus, the second scenario can be automated by using the proposed method ArchPosOpt. In other words, we consider finding best positions of these two layers as a hyperparameter, the main idea of our work.
We followed the first scenario to evaluate the performance of the grid search, random search, and TPE methods. This is achieved by placing each of the BN and Dropout layers within the plain model presented in Section 4.2. More specifically, we placed a Dropout layer after the FC layer of the plain model, presented in Section 4.2, as suggested by the paper that proposed the Dropout layer [35]. Then, we placed a BN layer before the activation layer and after each convolutional layer, as suggested by the paper that proposed the concept of BN layer [22]. Thus, these three methods (grid search, random search and TPE) should search for the hyperparameter values only.
The search space of the first scenario includes two possible values of dropout rates (0.5 and 0.7), two possible learning rate (LR) methods (Adam and SGD), and two possible weight initialization methods (random normal and glorot normal). Thus, the search space includes 8 possible combinations.
We followed the second scenario to evaluate the performance of the proposed method ArchPosOpt. We used the same plain model, presented in Section 4.2, and additionally searched for the positions of BN and Dropout layers within the CNN model architecture that minimize the model error rate; this is an extra hyperparameter to be optimized.
The search space of the second scenario includes all the hyperparameters of the first scenario plus three possible positions. These positions of the binary classification problem are 1) applying Dropout then BN layers after the FC layer and applying a BN layer before the activation layer and after each convolutional layer (first scenario configurations), 2) applying a Dropout layer after the FC layer only, and 3) applying a Dropout then a BN layer after the FC layer only. These positions of the multi-class classification problem are 1) applying a Dropout layer then a BN layer after the FC layer and applying a BN layer before the activation layer and after each convolutional layer (first scenario configurations), 2) applying a Dropout layer after the FC layer only, and 3) applying a Dropout after the FC layer and after each pooling layer.
The second scenario includes four hyperparameters, two options of the dropout rates (0.5 and 0.7), two possible LR methods (Adam and SGD), two possible weight initialization methods (random normal and glorot normal), and three options of the hyperparameter positions. Thus, the second scenario search space has 24 possible combinations.
Results
For the first scenario, Table 2 lists the search time and the accuracy of the best CNN model found by the different methods. The reported search time is the elapsed time to finish the evaluation of n different models (in hours). The value of n is given between brackets for the random and TPE methods. For grid search, n equals the search space size. The first row reports the results of the first scenario model using the grid search method. The second to fourth rows report the results of the first scenario using the random search method with 2, 4, and 6 evaluations, respectively. Similarly, the last three rows report the results of the first scenario using the TPE method with 2, 4, and 6 evaluations, respectively. The best accuracy of Table 2 was achieved by the grid search method.
A list of tested hyperparameters with the possible values.
Performance of the best model of first scenario
The results of ArchPosOpt are reported in Table 3; the reported search times are in hours. The best reported accuracy of the second scenario is found using the grid search method.
Performance of the best model of second scenario
From Tables 2 and 3, the grid search found the CNN model with the best accuracy rate. This comes at the cost of search time; the grid search method has the longest search time. Of note, the grid search and random search methods can be parallelized. Thus, running the experiments on two GPUs can cut the search time for these two methods almost in half.
Finally, Table 4 lists the best model results of the plain model, including no BN or Dropout layers, the best result of optimizing the hyperparameters values, and optimizing the hyperparameters values and positions, respectively. Apparently, ArchPosOpt achieved the highest accuracy. These results emphasize the worthiness of considering the positions of the Dropout and BN layers as an architectural hyperparameter that led to better accuracy in the test case.
Performance of the best models
This section includes the evaluation of seven different hyperparameters. The configurations of the seven experiments are listed in Table 1; the experiments of this section are performed on the Tesla K80 GPU. The reported times are in seconds.
Batch Size (BS)
Fig. 4 depicts the validation accuracy against the number of epochs for different batch sizes; the validation accuracy rates of the CIFAR-10 and Dogs vs. Cats datasets are shown in the top and bottom sub-figures, respectively. First, we consider the average loss of a mini-batch as an approximation of the expected loss over the data distribution. In these experiments, we fixed all the hyperparameter values except the BS value. As shown in Fig. 4, the lowest validation accuracy rates are obtained with the larger BS values. This matches the intuition that the number of iterations (updates) increases as the batch size decreases.

Batch size results: validation accuracy vs. number of epochs, top: CIFAR-10 and bottom: Dogs vs. Cats
The number of updates can be calculated as follows: updates = numberOfEpochs × (numberOfTrainingSamples / batchSize).
In other words, checking the direction several times is better even if the direction is less precise. The best batch size values, that achieve the best performance in our tests, are 64 for the Dogs vs. Cats dataset and 256 for the CIFAR-10 dataset. Thus, when training using larger batch sizes, the learning rate should be adjusted as discussed in [15, 34].
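The relation between batch size and the number of weight updates can be illustrated numerically. The sketch below assumes the usual convention that one update is performed per mini-batch and that a final partial batch is kept (hence the ceiling); the function name updates_per_run is hypothetical. The training set size of 50,000 is the CIFAR-10 value given in the dataset description.

```python
import math


def updates_per_run(n_samples, batch_size, n_epochs):
    """Number of weight updates: one per mini-batch, for every epoch."""
    return n_epochs * math.ceil(n_samples / batch_size)


cifar10_train_size = 50_000
updates_per_epoch = {
    batch_size: updates_per_run(cifar10_train_size, batch_size, n_epochs=1)
    for batch_size in (16, 64, 256, 1024)
}
# {16: 3125, 64: 782, 256: 196, 1024: 49}
```

Under these assumptions, a batch size of 16 yields roughly 64 times as many updates per epoch as a batch size of 1,024.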
The epoch running time is the time used when the entire dataset is passed forward and backward through the CNN model. Thus, the epoch running time of the larger batch sizes is less than the small batch sizes. For instance, in CIFAR-10, when the BS value equals 16, the epoch time was 63 seconds. In contrast, when the BS equals 1,024, the epoch running time was 24 seconds. The overall convergence time of the larger batch sizes is bigger than smaller ones. This is because increasing the batch size makes SGD oscillate around the minimum. Thus, using an appropriate batch size between 16 and 1,024 is better for performance and achieving convergence faster.
Table 5 lists the results of both the traditional and state-of-the-art weight initialization methods. The experiments started by setting the weights to zeros, ones, or constant values. The results for all three settings are the same: the model cannot learn. Thus, it is not recommended to start a CNN model with any of these three weight initialization methods. This is because using the same initial value for all weights makes the network symmetric. In other words, each neuron computes the same output, and the same gradient is computed for each neuron in the back-propagation process.
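The symmetry problem can be reproduced on a toy network. The sketch below is not the CNN used in the experiments; it assumes a hypothetical one-hidden-layer tanh network with a single linear output and a squared-error loss, and all names and values are illustrative.

```python
import math
import random


def hidden_gradients(w_hidden, w_out, x, target):
    """Gradient of the squared error w.r.t. the hidden-layer weights.

    w_hidden: one row of input weights per hidden neuron.
    w_out:    one output weight per hidden neuron.
    """
    # Forward pass: tanh hidden layer, single linear output unit.
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    y = sum(wo * hi for wo, hi in zip(w_out, h))
    # Backward pass.
    dy = 2.0 * (y - target)
    grads = []
    for hi, wo in zip(h, w_out):
        dpre = dy * wo * (1.0 - hi * hi)  # derivative of tanh is 1 - tanh^2
        grads.append([dpre * xi for xi in x])
    return grads


x, target = [0.5, -1.0, 2.0], 1.0

# Constant initialization: every hidden neuron receives the same gradient.
constant = hidden_gradients([[0.3] * 3 for _ in range(4)], [0.3] * 4, x, target)

# Zero initialization: the hidden-layer gradient vanishes entirely.
zeros = hidden_gradients([[0.0] * 3 for _ in range(4)], [0.0] * 4, x, target)

# Random initialization breaks the symmetry.
rng = random.Random(0)
rand = hidden_gradients(
    [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(4)],
    [rng.uniform(-0.5, 0.5) for _ in range(4)],
    x,
    target,
)

print(constant[0] == constant[1], rand[0] == rand[1])
```

Because identical neurons receive identical updates, they remain identical after every step, so the layer behaves like a single neuron regardless of its width.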
Weight initialization methods results (best results in bold).
Note: all experiments were conducted using the same batch size of 128.
As can be seen, the network started learning with all other initialization methods and achieved comparable results. Thus, we examined the convergence time and accuracy rates to compare these methods. Weights drawn from normal or uniform distributions with a mean of zero and a fixed standard deviation (i.e., random normal, random uniform, truncated normal) yield lower accuracy rates and longer convergence times.
Performance improves when the weights are drawn from a truncated normal distribution with zero mean and a standard deviation computed from the number of input units, output units, or their average in the weight tensor, or from a uniform distribution whose limit is based on the number of units. The best accuracy rates (bold numbers) are achieved using weights drawn from a truncated normal distribution, but this requires longer training times. In contrast, weights drawn from uniform distributions achieve lower accuracy rates than those drawn from normal distributions, but they converge faster.
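The idea of scaling the initial weights by the number of units can be sketched as follows. The function names and the `mode` and `scale` parameters are hypothetical and chosen for illustration; the formulas are the standard variance-scaling ones, in which `scale=1` with `fan_avg` gives the Glorot (Xavier) scheme and `scale=2` with `fan_in` gives the He scheme. Whether these match the exact settings of Table 5 is an assumption.

```python
import math
import random


def fan_value(fan_in, fan_out, mode):
    """Number of units the variance is scaled by."""
    if mode == "fan_in":
        return float(fan_in)
    if mode == "fan_out":
        return float(fan_out)
    if mode == "fan_avg":
        return (fan_in + fan_out) / 2.0
    raise ValueError("mode must be 'fan_in', 'fan_out' or 'fan_avg'")


def scaled_std(fan_in, fan_out, mode="fan_in", scale=1.0):
    """Standard deviation of a zero-mean normal initializer."""
    return math.sqrt(scale / fan_value(fan_in, fan_out, mode))


def uniform_limit(fan_in, fan_out, mode="fan_avg", scale=1.0):
    """Limit of U(-limit, limit); its variance is limit**2 / 3."""
    return math.sqrt(3.0 * scale / fan_value(fan_in, fan_out, mode))


def sample_uniform(fan_in, fan_out, rng, mode="fan_avg", scale=1.0):
    """Draw a fan_in x fan_out weight matrix from the scaled uniform distribution."""
    limit = uniform_limit(fan_in, fan_out, mode, scale)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]


weights = sample_uniform(4, 3, random.Random(0))
print(uniform_limit(300, 100), scaled_std(50, 10, scale=2.0))
```

Wider layers therefore start with smaller weights, which keeps the variance of the activations roughly constant from layer to layer.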
All the Dropout experiments use a standard Dropout layer with probabilities of zero, 0.2, 0.5, or 0.7. First, a model without Dropout is used; it achieved a good classification result on both datasets, but the validation accuracy did not exceed 85.67%. This is because the model overfitted the data: the training accuracy of the model without any Dropout layer was 100%. FC layers fully connect the nodes of two consecutive layers and therefore hold most of the trainable parameters in the network. Thus, adding a Dropout layer after the FC layers is a natural way to counter overfitting. In the case of binary classification, the best validation accuracy rates are achieved when we apply the Dropout layers after the FC layers, as shown in Table 6.
Using Dropout layer in different positions and values (best in bold).
Note: all experiments were conducted using the same batch size of 128.
Increasing or decreasing the dropout rate cannot be linked consistently to the training time. In Table 6, increasing the dropout rate decreases the training time for some models (rows 5 to 7), but this relation does not hold for the other rows of the table. The same is true of the validation accuracy. This is because the model drops neurons at random, so the effective architecture of the model changes in every iteration.
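The random change of architecture from one iteration to the next can be seen in a minimal sketch of a Dropout layer. It assumes the common "inverted dropout" formulation, in which the kept activations are rescaled during training; the function name and signature are illustrative.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(activations)  # Dropout is the identity at inference time
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
layer_output = [1.0] * 1000

# Two consecutive training iterations draw two different masks.
first = dropout(layer_output, 0.5, rng)
second = dropout(layer_output, 0.5, rng)
inference = dropout(layer_output, 0.5, rng, training=False)

print(sum(v == 0.0 for v in first), sum(v == 0.0 for v in second))
```

Each iteration thus trains a different thinned sub-network, which is one reason why training time and accuracy need not vary monotonically with the dropout rate.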
For multi-class image classification on the CIFAR-10 dataset, applying a Dropout layer after the FC and pooling layers achieved the best validation accuracy rate, as shown in Table 6. With this configuration, the validation accuracy was 81.65%, which is better than the model with Dropout layers by 2%. On the other hand, training this model was three times slower than training the model without any Dropout. Applying Dropout after the convolutional, pooling, and fully connected layers at the same time has a negative impact on model performance.
The original BN paper [22] recommended adding a BN layer before the non-linearity layers. Thus, the experiments started by adding a BN layer before and after the activation function layers. In the binary classification problem, the model's overall performance was lower than that of the plain model, and it overfitted the training data: the training accuracy was nearly 100%. Thus, adding a BN layer after the activation functions does not have the regularization effect that adding a Dropout layer has, as listed in Table 7.
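The two BN positions compared here differ in what the next layer receives. The following sketch assumes a single scalar feature, training-mode batch statistics, the default scale and shift (γ = 1, β = 0), and a ReLU activation; it is only meant to illustrate the ordering, not the model used in the experiments.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over the mini-batch, then scale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((v - mean) ** 2 for v in batch) / len(batch)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in batch]


def relu(values):
    return [max(0.0, v) for v in values]


pre_activations = [-2.0, 0.5, 1.0, 4.5]  # hypothetical layer outputs for a batch of 4

bn_before = relu(batch_norm(pre_activations))  # layer -> BN -> activation
bn_after = batch_norm(relu(pre_activations))   # layer -> activation -> BN

print(bn_before)
print(bn_after)
```

With BN before the activation, the next layer receives non-negative values; with BN after the activation, it receives zero-centered values that can be negative.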
BN methods validation results (best in bold).
Note: all experiments were conducted using the same batch size of 128.
In contrast, a BN layer increases the model validation accuracy in the multi-class classification problem. In both two-class and multi-class image classification, adding a BN layer negatively affected the convergence time. Thus, using BN alone is not sufficient to solve the overfitting problem.
For the two-class classification, adding a Dropout layer and then a BN layer after the FC layers seems to be a promising approach to address the overfitting problem. The results showed that this combination achieved a better overall performance than the plain model or the plain model with BN layers only. On the other hand, the model with Dropout and BN layers after the FC layers converges three times slower than the plain model.
In addition, BN and Dropout with a rate of 0.5 are used after every pooling layer in the model. In binary classification, this combination had a negative impact on the model's overall performance, as the model overfitted the training dataset. In the multi-class classification problem, it had a positive impact on the model's overall performance, especially when the layers have the following order: pooling, BN, and Dropout.
Adding a BN layer improved the accuracy rates for both the Dogs vs. Cats and CIFAR-10 datasets by 3% in comparison to the plain model. However, it slowed down convergence by about three times. Dropout and BN were also used interchangeably after the pooling and FC layers. This combination decreased model performance in the binary classification problem and increased the convergence time. On the other hand, it had a positive impact on the overall performance in the categorical classification problem.
The recent learning rate (LR) optimization methods are tested to assess their impact on training deep learning models. The experiments started using the default values of the learning rate, β1, β2, and the exponentially weighted average decay factor for most learning rate optimization methods. The results of these experiments are presented in Table 8 and Fig. 5.

LR results: validation accuracy vs. number of epochs, top: Dogs vs. Cats and bottom: CIFAR-10
LR optimization methods results (best in bold).
For the SGD method, a small LR value of 0.001 is used to test the impact of the learning rate on model behavior. Using a small LR decreased the overall performance and increased the convergence time. When a learning rate 10 times larger is used, the model's overall performance increases by a considerable margin of 4% and convergence is three times faster in the binary classification problem.
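Why a small learning rate slows convergence can be illustrated on a toy problem. The sketch below applies plain (deterministic) gradient descent to the one-dimensional function f(w) = w², which is an assumption made for illustration and unrelated to the CNN loss surface; only the two learning rates, 0.001 and 0.01, are taken from the experiments.

```python
def steps_to_converge(lr, w0=1.0, tol=1e-3, max_steps=100_000):
    """Gradient descent on f(w) = w**2; returns the steps needed until |w| < tol."""
    w = w0
    for step in range(1, max_steps + 1):
        w -= lr * 2.0 * w  # the gradient of w**2 is 2w
        if abs(w) < tol:
            return step
    return max_steps


slow = steps_to_converge(0.001)  # the small LR from the experiments
fast = steps_to_converge(0.01)   # a learning rate 10 times larger

print(slow, fast)
```

On this toy function, the larger rate needs roughly ten times fewer steps. In a real network the gain is smaller and is bounded by the stability of the optimizer, which is consistent with the three- to four-fold speed-ups reported in this section.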
In the multi-class classification problem, the larger learning rate increased the model accuracy by only a small margin, but it decreased the convergence time by about four times. In binary classification, RMSprop achieved the best overall performance compared to the other methods, but it had the longest convergence time. In the multi-class classification problem, Adamax achieved the highest validation accuracy rate.
In Table 9, the sigmoid function achieved the best results in the binary classification problem in comparison to all other output activation functions. The sigmoid function also obtained reasonable validation accuracy in the multi-class classification problem. The softmax function, which is mostly used for multi-class classification problems, failed to learn in the binary problem, but, as expected, it achieved the best validation accuracy in the multi-class classification problem. Neither the tanh nor the ReLU function learned in either the binary or the categorical problem. Notably, the tanh function started learning in binary classification until it reached an accuracy near 77.5%, and then the accuracy suddenly dropped to zero. The ReLU function started learning slowly in the categorical problem, but it did not reach a reasonable accuracy.
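One possible reading of the softmax failure in the binary problem is sketched below. It assumes that the binary model ends in a single output unit, which this section does not state; under that assumption, softmax normalizes over one value and always returns 1, so the output cannot depend on the input.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Single output unit (assumed binary setup): softmax is constant, sigmoid is not.
for z in (-3.0, 0.0, 3.0):
    print(z, sigmoid(z), softmax([z]))

# Several output units (multi-class setup): softmax yields a distribution.
print(softmax([1.0, 2.0, 3.0]))
```

The same view suggests why tanh and ReLU are poor output activations: their ranges, (-1, 1) and [0, ∞), do not correspond to probabilities.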
Output layers activation functions results (best in bold).
Note: all experiments were conducted using the same batch size of 128.
To evaluate the performance of hidden activation functions, we tested the proposed model using a set of different activation functions for the hidden layers. This set included sigmoid, tanh, ReLU, LeakyReLU, Parametric ReLU (PReLU), ThresholdedReLU, exponential linear unit (ELU), and scaled ELU (SELU). The definitions of these activation functions are listed in Table 10.
Activation functions
The experiments started using the traditional sigmoid function. As shown in Table 11, the sigmoid function failed to learn on both the Dogs vs. Cats and CIFAR-10 datasets. In addition, the deprecated tanh function was tested; unlike the sigmoid function, tanh obtained more reasonable results on both datasets. The most used hidden-layer activation function, ReLU, was then tested. ReLU achieved the best overall performance in both binary and multi-class classification problems.
Hidden layers activation functions results (best in bold)
Note: all experiments were conducted using the same batch size of 128.
LeakyReLU, a modified version of ReLU that gives some flexibility to values less than zero by multiplying them by 0.01, achieved a lower overall performance than ReLU with the same convergence time in the binary problem. In the categorical problem, LeakyReLU converged four times faster than ReLU.
PReLU makes the slope α a learnable parameter. It achieves an overall performance close to that of the ReLU function and takes half of ReLU's time to converge in the binary problem, but it failed to learn in the categorical problem. ThresholdedReLU failed to learn in the binary problem, but in the categorical problem it achieved an accuracy 1% lower than ReLU and took 75% of ReLU's time to converge. ELU achieved lower validation accuracy rates than ReLU in both classification problems and took nearly the same time to converge. Finally, SELU failed to learn on both the binary and categorical datasets.
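For reference, the ReLU variants discussed above can be written as scalar functions. This is a sketch under stated assumptions: the LeakyReLU slope of 0.01 comes from the text, while the ThresholdedReLU threshold θ = 1.0 and the ELU α = 1.0 are assumed default values, and the SELU constants are the ones commonly used in the literature.

```python
import math

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def relu(x):
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    """Negative inputs are multiplied by a small fixed slope."""
    return x if x > 0 else alpha * x


def prelu(x, alpha):
    """Same form as LeakyReLU, but alpha is learned during training."""
    return x if x > 0 else alpha * x


def thresholded_relu(x, theta=1.0):
    """Passes the input only above the threshold theta (assumed default)."""
    return x if x > theta else 0.0


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def selu(x):
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))


for f in (relu, leaky_relu, thresholded_relu, elu, selu):
    print(f.__name__, [round(f(v), 4) for v in (-2.0, 0.5, 2.0)])
```

All of these functions agree with ReLU for large positive inputs (up to the SELU scale factor) and differ mainly in how they treat negative or small inputs.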
In the conducted experiments, most hyperparameters are tested to draw conclusions about how they affect the performance of CNN models for the image classification problem. Table 12 lists a set of recommendations to consider in order to start the learning process properly. In addition, a visual recommendation of the hyperparameter positions within a CNN model architecture is depicted in Fig. 6. ArchPosOpt optimizes the positions of the Dropout and BN layers. Thus, in Fig. 6 there are three possible hyperparameter positions after each layer of the CNN model (the layers shown in blue).

Architectural recommendations of the hyperparameter positions
Recommendations for both classification problems
In Fig. 6, a layer marked with the character 'x' should be dropped from the CNN model. According to the conducted experiments, there is one recommendation of the hyperparameter positions for each of the two problems, binary and multi-class image classification, with one recommendation per line. For instance, the recommended hyperparameter positions for binary image classification are two-fold. First, it is recommended to place an activation layer after each convolutional layer; second, the FC layer(s) should be followed by an activation layer and then a Dropout layer. No BN layer is included in the model with the best accuracy for binary classification. The model recommended by ArchPosOpt is less complex, with fewer layers, and more accurate than the model that follows the recommendation of the original BN paper, which is to place a BN layer after each convolutional layer.
Similarly, the architectural recommendation for multi-class image classification is to place a Dropout layer after each pooling layer and after the FC layer of the CNN model. Comparing the CNN model recommended by ArchPosOpt for multi-class classification with the CNN model that follows the positions suggested in the original papers proposing BN and Dropout, the former is about 2% more accurate than the latter.
This observation emphasizes that utilizing the proposed tool ArchPosOpt may yield better results than placing BN and Dropout layers in the standard positions within the CNN model.
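The two recommended layouts can be written down as ordered lists of layer names. The sketch below is illustrative only: the function name, the number of convolutional blocks, the presence of a pooling layer after every convolutional layer, and the activation placement in the multi-class layout are all assumptions that are not specified in this section.

```python
def recommended_layout(problem, conv_blocks=2, fc_layers=1):
    """Return the layer order suggested by the recommendations (hypothetical helper)."""
    if problem not in ("binary", "multiclass"):
        raise ValueError("problem must be 'binary' or 'multiclass'")
    layers = []
    for _ in range(conv_blocks):
        # An activation layer follows each convolutional layer.
        layers += ["Conv", "Activation", "Pooling"]
        if problem == "multiclass":
            layers.append("Dropout")  # Dropout after each pooling layer
    for _ in range(fc_layers):
        # FC layers are followed by an activation layer and then a Dropout layer.
        layers += ["FC", "Activation", "Dropout"]
    layers.append("Output")
    return layers


print(recommended_layout("binary"))
print(recommended_layout("multiclass"))
```

Neither layout contains a BN layer, which reflects the recommendations stated in the text rather than a general rule for other datasets or architectures.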
Hyperparameter optimization is a crucial task within the process of designing a deep neural network model. This optimization process includes setting values for hyperparameters and choosing from a set of architectural options. In this vein, we extend the set of architectural options to include the position of the hyperparameters, which affects model performance. The proposed method, ArchPosOpt, extends three existing hyperparameter optimization tools, grid search, random search, and TPE, to search for the optimal position of the hyperparameters. The proposed method was evaluated through a set of image classification experiments on two datasets, one for binary classification and one for multi-class classification. The extended versions of the three aforementioned tools, utilizing the ArchPosOpt approach, found models achieving higher accuracy than the original tools. In addition, seven different hyperparameters were studied through a comprehensive set of experiments to find the best practice for using them in image classification with a CNN-based model. Finally, we provided a list of recommendations for choosing the optimal hyperparameter values and positions for these two datasets. While the proposed idea is used here to optimize the architecture of CNN models, it can be generalized to other deep learning architectures.
Acknowledgements
The research was partially funded by the National Natural Science Foundation of China (Grant Nos. 61772182, 61472126, 61602170, 61750110531, 61672215, U1613209).
