Optimizing deep neural networks hyperparameter positions and values

Abstract

Hyperparameter optimization is a crucial step in the implementation of any machine learning model. This optimization process involves repeatedly modifying the hyperparameter values of the model in order to minimize the testing error. For a deep neural learning model, hyperparameter optimization covers both the model parameters and the architecture. Optimizing a model’s parameters involves deciding the values of parameters such as the learning rate and batch size. Optimizing architectural hyperparameters involves deciding the shape of the deep neural learning model, i.e., the number of layers of each type and the number of neurons in each layer. The state-of-the-art hyperparameter optimization methods don’t optimize the position of the hyperparameter within the model architecture. In this work, we study the effect of changing the position of a hyperparameter within the deep learning model architecture. Thus, we propose an architectural position optimization (ArchPosOpt) method for model architectural hyperparameter optimization. ArchPosOpt extends three different hyperparameter optimization techniques, namely grid search, random search, and Tree-structured Parzen Estimator (TPE), to include a new dimension of the hyperparameter optimization problem: the hyperparameter position. We show through a set of experiments that the position of the hyperparameters matters for model performance, as do the hyperparameter values.

Keywords

Deep neural networks hyperparameter optimization CNN architectural optimization hyperparameter position

1 Introduction

Deep neural networks are utilized successfully in different application fields such as image classification [18], healthcare [32], and self-driving cars [21]. The achieved performance of these techniques is superior to that of all other machine learning techniques [9, 10]. For instance, in [17], the authors proposed a rectifier neural network for image classification that can surpass the reported human-level performance.

Building a successful deep neural learning model is a complex task. Deep neural networks have complex architectures: they consist of a number of stacked layers that automate the process of feature extraction from the input data. The major deep neural network architectures include convolutional neural networks (CNNs), recurrent neural networks, and recursive neural networks. The performance of a deep learning model depends on a set of hyperparameter values, e.g., the number of layers, batch size, weight initialization, and Dropout rate [19, 27]. The correct choice of these hyperparameter values is a challenging task. For a given learning model, the hyperparameter optimization task involves searching for the set of hyperparameter values that minimizes the generalization error E.

In this paper, we address the problem of hyperparameter optimization within one deep learning architecture, the CNN, as it is considered one of the most complex deep neural network architectures. Tuning a CNN model is a big burden in deep learning model construction, as it may have an abundance of hyperparameters and millions of parameters.

For any deep learning model, the hyperparameters to be optimized can be classified into two classes, namely, value-based hyperparameters (e.g., learning rate and momentum) and architectural hyperparameters (e.g., depth of the network and choice of connectivity per layer). Besides, hyperparameter optimization can be classified according to the search approach. The most widely used search approaches are grid search, random search, evolutionary algorithms, reinforcement learning, and sequential model-based global optimization (SMBO), e.g., Bayesian optimization and the tree-structured Parzen estimator (TPE). From a practical perspective, there is an abundance of hyperparameter optimization tools, e.g., hyperopt [4] and Optunity [8].

Much research has been conducted on optimizing hyperparameter values. In [26], the authors chained derivatives backward through the entire training procedure to optimize hyperparameter values such as step-size and momentum schedules and weight initialization distributions. In [3], the hyperparameters are optimized using random search, which can achieve the hyperparameter optimization task within a small fraction of the computation time required by the brute-force (grid search) method. Bayesian optimization is used to select promising hyperparameters based on the results of the previous values, as proposed in [14].

Architectural hyperparameters have received attention as well. In [23], the authors provided a survey of deep neural network architecture search, with a focus on CNN architectures. In [36], the authors used an evolutionary method, genetic programming, to decide on a deep learning model structure that maximizes the validation accuracy. Recently, reinforcement learning has been adapted to optimize neural network architectures, as proposed in [2, 31].

Architectural hyperparameter optimization covers the number of layers, the number of neurons per layer, and the connectivity type of each layer. The position of a hyperparameter can also affect the performance of the CNN model. For instance, whether to place a batch normalization (BN) layer after or before the activation layer depends on the problem at hand. In [18], the authors built a model by adopting a BN layer right after each convolution layer and before the activation layer. In contrast, several practical tests have shown that a BN layer placed after an activation layer performs better [11]. For instance, research was conducted to combine brain magnetic resonance imaging (MRI) scans with methylation data from The Cancer Genome Atlas to predict the methylation state of the regulatory regions in cancer patients [16]. To achieve this goal, the authors proposed a CNN model in which a BN layer is used after the activation layer.

To the best of the authors’ knowledge, the state-of-the-art methods don’t optimize the position of hyperparameters. This research gap motivates the current work to answer questions such as "What is the best position of a BN layer? Is it better to place a BN layer after or before the activation layer?"

In this work, we propose ArchPosOpt, a method that extends existing hyperparameter optimization techniques with a new dimension: the position of the hyperparameter within the model architecture. In this context, hyperparameter position optimization refers to finding the places of BN and dropout layers within the model architecture so that the model error rate is minimized. We propose extending the grid search cross-validation function (GridSearchCV) of the scikit-learn package [30], and the random search and TPE of the hyperopt library [4].

To evaluate this contribution, we compare the results of two scenarios. The first scenario searches for hyperparameter values only, using the three tools: grid search, random search, and TPE. The model architecture of the first scenario follows the recommended positions of BN and Dropout layers, as suggested in the original papers proposing these two concepts. The second scenario, on the other hand, searches for the optimal positions of BN and Dropout layers using the proposed method, which extends the three tools for that purpose. The models of the two scenarios were trained from scratch on two standard datasets, CIFAR-10 [24] and Dogs vs. Cats [1], which represent multi-class and binary image classification problems, respectively.

The contributions of this work are as follows:

To the best of the authors’ knowledge, the proposed method ArchPosOpt is the first tool to search for the optimal positions of the hyperparameters within the deep neural network architecture.

We extended three well-known hyperparameter optimization tools, GridSearchCV, randomSearch, and TPE, to search for the optimal position of the hyperparameters.

We conducted a set of experiments to evaluate the performance of the proposed method. In addition, we tested the effect of seven different hyperparameters separately on the deep neural network.

This paper is structured as follows. In Section 2, we review CNNs and discuss the related work on hyperparameter optimization. In Section 3, we elaborate on the proposed method. In Section 4, we present a set of experiments to evaluate the proposed method and the effect of tuning seven different hyperparameters on a CNN model. Finally, the paper is concluded in Section 5.

2 Background and related work

2.1 Background

CNNs are one of the most interesting deep learning architectures. Their architecture is biologically inspired by the organization of the animal visual cortex [20]. CNNs have shown strong performance on two-dimensional data such as images and videos, and they achieve state-of-the-art results on many problems, e.g., image classification. The origin of CNNs goes back to LeNet [25], which was introduced by Yann LeCun in 1998. At that time, LeNet was used in tasks such as character recognition.

A CNN has three operations: convolution followed by non-linearity, pooling (subsampling), and classification. These operations run on three main layer types: convolutional, pooling, and fully connected layers, respectively. CNN training has two stages, a feed-forward stage and a backward stage.

In the feed-forward stage, the input is propagated through each layer using the current weights and biases to produce an output, and the loss between the expected and the actual values is computed. In the backward stage, the chain rule is used to compute the gradient of the loss with respect to each parameter, and all parameters are updated; then another forward stage begins with the updated parameters. After a sufficient number of iterations, the network is considered trained once the loss reaches its minimum.

Convolutional layers are used mainly as feature extractors from the given input. They are powerful because they exploit the spatial relationship among input components: each layer learns important features from small square patches of the input, and sharing weights within the same feature map reduces the number of parameters. After all local features are extracted, the location relationship between features can be determined. After every convolution operation, a non-linear operation is applied. The most common non-linear function is the rectified linear unit (ReLU), whose output is computed as output = max(0, input).
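As a minimal illustration of the ReLU equation above, the following sketch applies output = max(0, input) element-wise. The function name relu and the sample values are hypothetical and chosen only for this example.

```python
def relu(x: float) -> float:
    """Rectified linear unit: keeps positive inputs, clamps the rest to zero."""
    return max(0.0, x)


# Apply ReLU element-wise to a small, made-up row of a feature map.
feature_row = [-2.0, -0.5, 0.0, 1.5, 3.0]
activated = [relu(v) for v in feature_row]
print(activated)
```

Negative entries are set to zero while positive entries pass through unchanged.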

The pooling layer plays an important role in mitigating the curse of dimensionality of the feature map. Although a lot of information is lost, a pooling layer keeps the important information. The pooling layer also provides a degree of translation invariance because its computation depends on neighboring pixels. Maximum (max) and average pooling are the two commonly used methods. In max-pooling with, for example, a 2×2 window, the largest value in the window is selected. A detailed theoretical analysis [7] has shown that max-pooling is better than average pooling. In addition, max-pooling has faster convergence and better generalization [33].
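The difference between the two pooling methods can be seen on a small example. The sketch below is a plain-Python illustration, not the implementation used in the experiments; the helper pool2d and the 4×4 feature map are assumptions made for this example, and the windows are taken as non-overlapping.

```python
from statistics import mean
from typing import Callable, List

Matrix = List[List[float]]


def pool2d(fmap: Matrix, size: int, reduce: Callable[[List[float]], float]) -> Matrix:
    """Apply non-overlapping size x size pooling using the given reduction."""
    rows, cols = len(fmap), len(fmap[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            # Collect the values inside the current window.
            window = [fmap[r + i][c + j] for i in range(size) for j in range(size)]
            pooled_row.append(reduce(window))
        pooled.append(pooled_row)
    return pooled


# Made-up 4x4 feature map.
feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]
max_pooled = pool2d(feature_map, 2, max)   # keeps the largest value per window
avg_pooled = pool2d(feature_map, 2, mean)  # keeps the mean value per window
print(max_pooled)
print(avg_pooled)
```

Both variants reduce the 4×4 map to 2×2; max-pooling keeps the strongest response in each window, whereas average pooling smooths it.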

A fully connected (FC) layer is the final layer of a CNN. The term “Fully Connected” means that every node in one layer is connected to every node in the next layer. Convolutional and pooling layers represent the input data as high-level features. These high-level features are used as inputs to the FC layer to classify the input data into a predefined set of classes, based on the training dataset.

2.2 Related work

In the literature, a number of attempts have been made to search for the hyperparameter configuration that minimizes the model error. These efforts can be classified based on the hyperparameter type into two classes, value-based (e.g., learning rate and momentum) and architectural-based (e.g., depth of the network and choice of connectivity per layer) hyperparameters. Another classification of these efforts can be established based on the search method of the optimization task. The most widely used search methods are grid search, random search, Bayesian optimization, and sub-optimal methods, e.g., evolutionary algorithms, reinforcement learning, random forests, and TPE.

Value-based hyperparameter optimization methods focus on tuning one or more hyperparameter values. The value-based hyperparameters include learning rate, batch size, activation functions, pooling variants, dropout rates, etc.

For instance, in [29], research was conducted to explore one hyperparameter of CNNs, the activation function. The authors performed a set of experiments and tested several activation functions to figure out which activation function is most suitable for each case. Similarly, the authors of [37] compared the variants of the most used activation function, ReLU. Other research studied the relation between two hyperparameters: batch size and learning rate [34]. The authors recommended increasing the batch size instead of decreasing the learning rate. Finally, the most common hyperparameters are considered in [28], such as the selection of pooling variants (stochastic, max, average, mixed), classifier design (convolutional, FC), learning rate, and batch size.

Architectural-based hyperparameter optimization methods try to find the architecture that achieves the best performance. The architectural-based hyperparameter optimization task includes optimizing the number of the layers, number of units per layer, number of filters per layer, etc. The decisions regarding the selection of such architectural options used to be reached manually through significant human expertise; this approach is inefficient especially for large and complex deep neural networks. Thus, the area of automating CNN architecture search becomes critical [23].

The grid search (brute force) method is widely used for hyperparameter optimization, especially for models with limited search space. The Python scikit-learn package includes an implementation of the grid search method, GridSearchCV function [30]. GridSearchCV takes a list of hyperparameter values and trains the model. Then, it returns the best hyperparameters values after searching all possible options. For example, to optimize the Dropout rate, GridSearchCV takes a list of Dropout rates, e.g., 0.2, 0.5, 0.7, 0.9, trains different CNN models with every possible value, and then returns the model with the value that achieves the best accuracy. The method guarantees to find the optimal solution within the specified search space, but its main disadvantage is the intensive computational time, as it tries all possible hyperparameter configurations. One possible solution for the problem is to parallelize the GridSearchCV function.
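The exhaustive behavior described above can be sketched without any library. The example below is not GridSearchCV itself; it is a plain-Python illustration in which toy_accuracy is a hypothetical stand-in for training a CNN and returning its validation accuracy, and the batch_size dimension is an assumption added to make the grid two-dimensional.

```python
from itertools import product


def toy_accuracy(dropout_rate: float, batch_size: int) -> float:
    """Hypothetical stand-in for 'train a CNN and return validation accuracy'."""
    return 0.9 - (dropout_rate - 0.5) ** 2 - abs(batch_size - 64) / 1000


grid = {
    "dropout_rate": [0.2, 0.5, 0.7, 0.9],  # Dropout rates from the example above
    "batch_size": [32, 64],                # assumed second dimension
}

# Exhaustive search: every combination in the grid is evaluated exactly once.
names = list(grid)
results = []
for values in product(*(grid[n] for n in names)):
    config = dict(zip(names, values))
    results.append((toy_accuracy(**config), config))

best_score, best_config = max(results, key=lambda item: item[0])
print(len(results), best_config)
```

The number of evaluations is the product of the option counts (here 4 × 2 = 8), which is why the cost grows quickly as hyperparameters are added.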

An alternative to the grid search method is the random search technique, which is available in the Python scikit-learn package [30] and in the hyperopt package [4]. The main advantages of random search are that it is trivially parallel, its implementation is very simple, and it enables the model designer to set the number of combinations to be considered. This number should be feasible for the model designer in terms of computational power and time.
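A minimal sketch of random search with a fixed evaluation budget is given below. It does not use the scikit-learn or hyperopt implementations; the names search_space, sample_config, and random_search, as well as the listed options, are assumptions made for illustration.

```python
import random

# Made-up search space with three categorical hyperparameters.
search_space = {
    "dropout_rate": [0.2, 0.5, 0.7, 0.9],
    "batch_size": [16, 32, 64, 128],
    "optimizer": ["Adam", "SGD"],
}


def sample_config(space: dict, rng: random.Random) -> dict:
    """Draw one configuration by picking each hyperparameter independently."""
    return {name: rng.choice(options) for name, options in space.items()}


def random_search(space: dict, n_iter: int, seed: int = 0) -> list:
    """Return n_iter sampled configurations (sampling with replacement)."""
    rng = random.Random(seed)
    return [sample_config(space, rng) for _ in range(n_iter)]


candidates = random_search(search_space, n_iter=5)
for config in candidates:
    print(config)
```

Because the draws are independent of each other, the sampled configurations could be trained on separate workers, and the budget n_iter is chosen by the model designer rather than dictated by the size of the search space.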

In [13], the authors utilized Bayesian optimization to find the optimal deep neural network architecture for a given dataset by considering a set of previous datasets and their corresponding machine learning models. Other research combined Bayesian optimization with a bandit strategy to optimize the number of layers and units per layer [12].

Bayesian optimization-based methods assume a prior distribution of the loss function. Then, these methods tune the hyperparameters by updating this prior distribution based on the previous iterations/ observations. Bayesian optimization converges slowly at the beginning, and then the convergence rate consistently increases. The main problem for Bayesian optimization is that it is a sequential strategy and can’t be parallelized. Thus, for problems with huge computational time per iteration, it is not the best choice.

In [36], the authors utilized Cartesian genetic programming to tune the CNN structure and connectivity. Each architecture, represented using Cartesian genetic programming, was trained to classify the images of the CIFAR-10 dataset. The validation error was then used as the fitness of the architecture, which drives the optimization of the CNN architectures. Another evolutionary algorithm is used to achieve the same goal in [38]. The authors proposed using particle swarm optimization (PSO) for tuning the hyperparameter values of a CNN model for image classification on the CIFAR-10 and CIFAR-100 datasets. The mechanism of evolutionary algorithm-based hyperparameter optimization is sequential; thus, these methods can’t be parallelized.

In [5], the authors proposed utilizing TPE for hyperparameter optimization. The proposed method outperformed both manual and random search in tuning the hyperparameter values and architectural options. TPE can be exploited further by evaluating models with different sets of hyperparameters in parallel instead of serially. Parallelizing TPE makes the search less efficient, though the running time is shorter [6]. The core idea of parallelizing TPE is to use the stochasticity of draws to help move from one iteration to the next.

From a practical perspective, there is an abundance of hyperparameter optimization libraries. Hyperopt is a Python library implementing two solvers for hyperparameter optimization, random search and TPE [4]. Another Python library is Optunity; this library provides several solvers, including grid search, random search, PSO, and TPE [8], where PSO is the default solver.

Based on this discussion, existing hyperparameter optimization methods do not address the position of a hyperparameter within the model. In the following section, we discuss the proposed work to bridge this research gap.

3 ArchPosOpt: the proposed method

The ultimate goal of the ArchPosOpt method is to search for the optimal positions of the hyperparameters within the model architecture. These optimal positions should minimize the error rate of the model. Besides, ArchPosOpt should search for the optimal hyperparameter values, like the other existing hyperparameter optimization methods. The idea of ArchPosOpt can be understood through the flowchart illustrated in Fig. 1.

Fig.1

ArchPosOpt flowchart

The flowchart of ArchPosOpt is depicted in Fig. 1. As its first input, ArchPosOpt receives the model architecture as a string that encodes the possible values of each hyperparameter and the possible positions of the hyperparameters. The input string contains each layer type followed by the possible options of this layer, surrounded by square brackets. For instance, the string "Conv[32,64]" refers to a convolutional layer with two possible options: 32 units or 64 units. The layers are separated by a delimiter within the input string. For instance, if the model architecture includes a convolutional layer and then an activation layer with the ReLU function, the input model architecture should be "Conv[32,64]+Act[ReLU]". There are many layer types, e.g., Conv, Pooling, FC, Pos, etc.
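The input format described above can be parsed with a few lines of code. The following is a sketch under stated assumptions, not the parser of ArchPosOpt: the names TOKEN_PATTERN and parse_architecture are hypothetical, options are assumed to be comma-separated, and a layer written without brackets is assumed to have no options.

```python
import re
from typing import List, Tuple

# A token is a layer type optionally followed by bracketed options, e.g. "Conv[32,64]".
TOKEN_PATTERN = re.compile(r"^(?P<layer>[A-Za-z]+)(?:\[(?P<options>.*)\])?$")


def parse_architecture(spec: str, delimiter: str = "+") -> List[Tuple[str, List[str]]]:
    """Split an architecture string into (layer type, list of options) pairs."""
    parsed = []
    for token in spec.split(delimiter):
        match = TOKEN_PATTERN.match(token.strip())
        if match is None:
            raise ValueError(f"Malformed token: {token!r}")
        options = match.group("options")
        parsed.append((match.group("layer"), options.split(",") if options else []))
    return parsed


print(parse_architecture("Conv[32,64]+Act[ReLU]"))
```

Each pair lists one layer and its candidate options, which is the information needed to build the search space in the next step. Options that themselves contain a comma or the delimiter would need a more careful tokenizer than this sketch provides.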

For the second input, the ArchPosOpt should receive the number of iterations. This input defines the number of options selected from the search space, N. Thus, the ArchPosOpt will generate N different variations of the model architecture. In the case of GridSearch, N equals all possible model variations. The last input is the search method; ArchPosOpt supports three methods, as indicated in Fig. 1.

In the flowchart, the ArchPosOpt first step is to generate the search space. Then, the search method selects a value for each hyperparameter; step 2. The ArchPosOpt utilizes a new hyperparameter called "Pos", and this hyperparameter can be used to select the optimal position of a hyperparameter, e.g. Dropout or BN layers. Then, ArchPosOpt generates a model using the hyperparameter values with the help of Algorithm 3. Then, the generated model is trained and its accuracy and configurations are saved, steps 4 and 5, respectively. Finally, once the N iterations are complete, the model variation with the highest accuracy is reported in terms of model layers, hyperparameters’ values, and hyperparameters’ positions.

The example of Fig. 2 eases understanding the steps of ArchPosOpt’s flowchart. It shows an example input of ArchPosOpt. In this example, the delimiter is the ’+’ symbol; this delimiter separates the layers within the input string.

In Fig. 2, the model has a convolutional layer with two possible configurations, 32 and 64 units. This is the first hyperparameter value and it has two options. Then, the convolutional layer is followed by an activation layer with two possible options, ReLU and PReLU. This is the second hyperparameter value. After the activation layer, ArchPosOpt should decide whether it is better to add a BN layer, a dropout layer with a rate equal to 0.25, or a dropout layer with a rate equal to 0.5. This is the first hyperparameter position in the model. Then, the input indicates that the model should have another convolutional layer with two possible configurations: 64 and 128 units. The model has a pooling layer followed by another hyperparameter position. This hyperparameter position may be a BN layer, a dropout layer with a rate equal to 0.5 or 0.75, or no layer at all. The option "–" means this position has no layer; it is treated as though it doesn’t exist in the model. Finally, the model should include an FC layer as the last layer. Besides, the number of iterations equals five. Thus, only five model variations will be checked. The selected search method of this example is TPE, the third input.

In Fig. 2, step 1, the search space is generated. The input model has three hyperparameter values and two hyperparameter positions. There are 96 possible model variations; this is the result of multiplying the numbers of options of each hyperparameter value and position, 2 × 2 × 3 × 2 × 4 = 96. In step 1, c_param1 and c_param2 refer to the options of the first and second convolutional layers, respectively. Similarly, h_param1 and h_param2 refer to the options of the first and second hyperparameter positions, respectively. Besides, a_param refers to the activation layer option. Step 2 shows an example of selected values from the search space. Finally, step 3 generates a model based on the step 2 values. Steps 5 to 7 are not included in Fig. 2, as they are very clear in the flowchart, Fig. 1.
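The size of the example search space can be checked with a short computation. The option counts below are taken from the description of the example; the dictionary layout and the variable names total_variations and budget are assumptions made for this sketch.

```python
from math import prod

# Number of options per hyperparameter value or position in the example.
option_counts = {
    "c_param1": 2,  # first Conv layer: 32 or 64 units
    "a_param": 2,   # activation: ReLU or PReLU
    "h_param1": 3,  # first position: BN, Dropout(0.25), or Dropout(0.5)
    "c_param2": 2,  # second Conv layer: 64 or 128 units
    "h_param2": 4,  # second position: BN, Dropout(0.5), Dropout(0.75), or no layer
}

total_variations = prod(option_counts.values())
budget = 5  # number of iterations requested in the example
print(total_variations, f"{budget / total_variations:.1%} of the space is evaluated")
```

With five iterations, only a small fraction of the 96 variations is trained, which is where the choice of search method matters.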

The pseudo-code of ArchPosOpt’s third step is listed in Algorithm 3. The algorithm receives a string as an input, strArchit. This string represents the model in terms of layers, hyperparameter values, and hyperparameter positions. The string consists of consecutive tokens separated by delimiters. Each token represents a layer/position with its parameters. The algorithm reads the input string token by token, lines 1 to 15. Once a token is read, a corresponding layer is added to the model, lines 3 to 13. Besides, the algorithm extracts the parameter values for each layer from the input string. These parameters are denoted in the algorithm based on the type of the layer, e.g., a_param and c_param for activation layer and convolutional layer parameters, respectively.

Algorithm 1 Pseudo-code to generate a model

Input: strArchit: a string of tokens that describes the model architecture in terms of layers, hyperparameters’ values and positions.
Output: a model
1: token ← strArchit.NextToken()
2: while token ≠ NULL do
3:   if token equals "Conv" then
4:     add a Conv[c_param] layer to the model
5:   else if token equals "Act" then
6:     add an Act[a_param] layer to the model
7:   else if token equals "Pool" then
8:     add a Pooling[pool_param] layer to the model
9:   else if token equals "FC" then
10:     add an FC[fc_param] layer to the model
11:   else if token equals "Pos" then
12:     add a layer[h_param] to the model
13:   end if
14:   token ← strArchit.NextToken()
15: end while
16: return model
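The pseudo-code can be sketched in Python as follows. This is an illustrative translation under stated assumptions, not the authors’ implementation: the model is represented as a framework-neutral list of (layer kind, parameter) tuples instead of Keras layers, the input is assumed to be a concrete string in which one option has already been selected per token, and the ASCII hyphen "-" is assumed to stand for the "no layer" option. The name generate_model and the example string are hypothetical.

```python
from typing import List, Tuple

Layer = Tuple[str, str]  # (layer kind, parameter), a framework-neutral description


def generate_model(str_archit: str, delimiter: str = "+") -> List[Layer]:
    """Translate a concrete architecture string into an ordered list of layers."""
    model: List[Layer] = []
    for token in str_archit.split(delimiter):
        name, _, rest = token.partition("[")
        param = rest.rstrip("]")
        if name in ("Conv", "Act", "Pool", "FC"):
            model.append((name, param))
        elif name == "Pos":
            # "-" means the position stays empty, so no layer is added.
            if param != "-":
                kind, _, arg = param.partition("(")
                model.append((kind, arg.rstrip(")")))
        else:
            raise ValueError(f"Unknown layer type: {name!r}")
    return model


selected = "Conv[64]+Act[ReLU]+Pos[BN]+Conv[128]+Pool[2]+Pos[Dropout(0.5)]+Pos[-]+FC[10]"
for layer in generate_model(selected):
    print(layer)
```

In a real setup, each tuple would be mapped to the corresponding layer object of the deep learning framework before training.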

4 Experimental results and discussion

In this section, we describe the datasets used and the setup of the experiments. Then, we discuss the CNN model architecture that was utilized in all the experiments. Besides, we evaluate the results of the proposed method ArchPosOpt. Additionally, we present the result of each hyperparameter and discuss the effect of tuning this hyperparameter on the proposed CNN model. The last subsection frames a list of recommendations for selecting the hyperparameter values and positions.

4.1 Datasets and setup

All experiments are conducted using the Python programming language and the Keras library with a TensorFlow backend. The experiments are performed on a computer with two 2.3 GHz Intel 8-core processors, an NVIDIA Tesla K80 GPU, and an NVIDIA P100 GPU. The operating system is 64-bit Linux.

The experiments include two datasets of image classification tasks. The first dataset is the standard CIFAR-10 [24] for the categorical (multi-class) classification problem, and the second is Dogs vs. Cats [1] for the binary classification problem. CIFAR-10 is a standard color image dataset for multi-class image classification problems. It consists of 60,000 images divided into 10 classes; every class consists of 6,000 images. The dimensions of every image are 32×32. The dataset is divided into training and test sets of 50,000 and 10,000 images, respectively. CIFAR-10 is stored in five batches for training and one batch for testing; each batch has 10,000 images. The Dogs vs. Cats dataset contains 25,000 color images with a fixed size of 72×72 in two classes. The training and test sets consist of 20,000 and 5,000 images, respectively.

4.2 Model Architecture

In all the experiments, we used a standard CNN architecture, which is described in Fig. 3. We call this model the plain model, as it has no Dropout or BN layer. The input sizes are 32×32 for the CIFAR-10 dataset and 72×72 for the Dogs vs. Cats dataset. The numbers of units in convolutional layers one to four are 64, 128, 256, and 512, respectively. The filter size is 3×3, with "same" padding. A max-pooling layer with pool size 2×2 is used after every convolutional layer. The numbers of nodes in the FC1 and FC2 layers are 512 and 1,024, respectively. Finally, the output layer produces the class of the input image. For all the experiments, we used the early stopping technique with a patience value of 30. Thus, the training of a model stopped when the accuracy didn’t improve for 30 successive epochs.
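The spatial size of the feature maps in the plain model can be traced with a short computation. The sketch below rests on assumptions not stated in the text: convolutions use stride 1 so that "same" padding preserves the spatial size, pooling uses stride 2 without padding so that odd sizes are rounded down, and the last feature map is flattened before FC1. The function name feature_map_sizes is hypothetical.

```python
def feature_map_sizes(input_size: int, num_blocks: int = 4, pool: int = 2) -> list:
    """Spatial size after each Conv('same' padding) + MaxPool(pool x pool) block."""
    sizes = []
    size = input_size
    for _ in range(num_blocks):
        # 'same' padding keeps the size through the convolution (assumed stride 1);
        # unpadded pooling then floor-divides it by the pool size.
        size //= pool
        sizes.append(size)
    return sizes


filters = [64, 128, 256, 512]  # units per convolutional layer, from the text
for name, side in (("CIFAR-10", 32), ("Dogs vs. Cats", 72)):
    sizes = feature_map_sizes(side)
    flattened = sizes[-1] ** 2 * filters[-1]  # values entering FC1 if flattened
    print(name, sizes, flattened)
```

Under these assumptions, the 32×32 inputs shrink to 2×2 and the 72×72 inputs to 4×4 after the fourth block.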

Fig.2

ArchPosOpt example inputs and model generation

Fig.3

The utilized model architecture of the experiments

4.3 ArchPosOpt performance evaluation

4.3.1 Experimental setup

The experiments of this section are performed on the P100 GPU. This section evaluates the decision of selecting the place of Dropout and BN layers. There are two possible scenarios for making this decision. First, the CNN model designer follows the positions suggested by the original papers. Second, the CNN model designer uses a trial-and-error approach to find the best positions of these two layers. The second scenario can be automated by using the proposed method ArchPosOpt. In other words, we treat the positions of these two layers as a hyperparameter to be optimized, which is the main idea of our work.

We followed the first scenario to evaluate the performance of the grid search, random search, and TPE methods. This is achieved by placing each of the BN and Dropout layers within the plain model presented in Section 4.2. More specifically, we placed a Dropout layer after the FC layer of the plain model, presented in Section 4.2, as suggested by the paper that proposed the Dropout layer [35]. Then, we placed a BN layer before the activation layer and after each convolutional layer, as suggested by the paper that proposed the concept of BN layer [22]. Thus, these three methods (grid search, random search and TPE) should search for the hyperparameter values only.

The search space of the first scenario includes two possible values of dropout rates (0.5 and 0.7), two possible learning rate (LR) methods (Adam and SGD), and two possible weight initialization methods (random normal and glorot normal). Thus, the search space includes 8 possible combinations.

We followed the second scenario to evaluate the performance of the proposed method ArchPosOpt. We used the same plain model, presented in Section 4.2, and treated the positions of BN and Dropout layers within the CNN model architecture as an extra hyperparameter to be optimized so that the model error rate would be minimized.

The search space of the second scenario includes all the hyperparameters of the first scenario plus three possible positions. For the binary classification problem, these positions are 1) applying Dropout then BN layers after the FC layer and applying a BN layer before the activation layer and after each convolutional layer (first scenario configurations), 2) applying a Dropout layer after the FC layer only, and 3) applying a Dropout then a BN layer after the FC layer only. For the multi-class classification problem, these positions are 1) applying a Dropout layer then a BN layer after the FC layer and applying a BN layer before the activation layer and after each convolutional layer (first scenario configurations), 2) applying a Dropout layer after the FC layer only, and 3) applying a Dropout layer after the FC layer and after each pooling layer.

The second scenario includes four hyperparameters, two options of the dropout rates (0.5 and 0.7), two possible LR methods (Adam and SGD), two possible weight initialization methods (random normal and glorot normal), and three options of the hyperparameter positions. Thus, the second scenario search space has 24 possible combinations.
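The two search spaces can be written down and counted explicitly. In the sketch below, the dictionary keys and the integer labels 1 to 3 for the position options are hypothetical names chosen for illustration; the option lists themselves follow the text.

```python
from itertools import product

first_scenario = {
    "dropout_rate": [0.5, 0.7],
    "optimizer": ["Adam", "SGD"],
    "weight_init": ["random_normal", "glorot_normal"],
}
# The second scenario adds one categorical dimension: the layer-position option.
second_scenario = {**first_scenario, "position": [1, 2, 3]}


def enumerate_configs(space: dict) -> list:
    """List every combination of the options in a search space."""
    names = list(space)
    return [dict(zip(names, combo)) for combo in product(*space.values())]


print(len(enumerate_configs(first_scenario)), len(enumerate_configs(second_scenario)))
```

Treating the position as one more categorical hyperparameter triples the search space while leaving the search methods themselves unchanged.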

4.3.2 Results

For the first scenario, Table 2 lists the search time in hours and the accuracy of the best CNN model found by each method. The reported search time is the elapsed time to finish the evaluation of n different models. The value of n is given in parentheses for the random search and TPE methods. For grid search, n equals the search space size. The first row reports the results of the first scenario model using the grid search method. The second to fourth rows report the results of the first scenario using the random search method with 2, 4, and 6 evaluations, respectively. Similarly, the last three rows report the results of the first scenario using the TPE method with 2, 4, and 6 evaluations, respectively. The best accuracy of Table 2 was achieved by the grid search method.

Table 1
A list of tested hyperparameters with the possible values.

Hyper-parameter	Variants
Batch Size	16,32,64,128,256,512,1024
Weight initialization	Zeros, ones, constant, random, normal, random uniform, truncated normal, variance scaling, orthogonal, Glorot normal/uniform, He normal/uniform, Lecun normal
Dropout(dropout rate)	[Option 1: Drop(.2) after every FC], [Option 2: Drop(.5) after every FC], [Option 3: Drop(.7) after every FC], [Option 4: Drop(.2) after every Conv], [Option 5: Drop(.5) after every Conv], [Option 6: Drop(.7) after every Conv], [Option 8:Drop(.2) after every Pooling], [Option 9:Drop(.5) after every Pooling], [Option 10: Drop(.7) after every Pooling], [Option 11: Drop(.5) after every (Pooling+FC)], [Option 12: Drop(.2) after every (Conv+Pooling+FC)]
BN	[Option 1: a BN layer after FC+Act], [Option 2:BN between the FC and Act. layers], [Option 3: FC+Act. then a BN+Drop(0.5) layers], [Option 4: FC+Act. layers then Drop(0.5)+BN layers], [Option 5:FC then BN layer, Act. layer then Drop(0.5)], [Option 6: Drop(0.5)+BN after each Pool layer], [Option 7: BN+Drop(0.5) after each Pool layer], [Option 8: Drop(0.5)+BN after FC layer], [Option 9: BN+Drop(0.5) after FC layer], [Option 10: Drop(0.5)+BN after each Pooling layer and after each FC layer], [Option 11: BN+Drop(0.5) after each Pooling layer and after each FC layer]
Learning Rate Optimization Methods	Adadelta, Adagrad, Adam, Adamax, Nadam, RMSprop, and Stochastic Gradient Descent (SGD).
Hidden Activation Functions	ELU, SELU, ReLU, PReLU, LeakyReLU, ThresholdedReLU
Output Activation Functions	Sigmoid, Softmax, Tanh

Table 2

Performance of the best model of first scenario

Method	Dogs vs. Cats		CIFAR-10
	Search time (hr)	Acc.	Search time (hr)	Acc.
Grid search	10.2	88.6%	6.2	79.9%
Random search (2)	2.6	87.2%	1.6	75.6%
Random search (4)	5.2	88.3%	3.1	76.7%
Random search (6)	7.6	88.3%	4.6	78.6%
TPE (2)	2.7	86.1%	1.7	18.0%
TPE (4)	5.1	86.1%	3.2	76.5%
TPE (6)	7.6	88.4%	4.5	78.9%

The results of ArchPosOpt are reported in Table 3; the reported search times are in hours. The best reported accuracy of the second scenario is found using the grid search method.

Table 3

Performance of the best model of second scenario

Method	Dogs vs. Cats		CIFAR-10
	Search time(hr)	Acc.	Search time(hr)	Acc.
Grid search	30.2	88.9%	18.6	81.7%
Random search (10)	12.52	87.2%	7.7	79.7%
Random search (15)	18.95	87.9%	11.7	70.5%
Random search (20)	25.3	88.3%	15.5	81.2%
TPE (10)	12.72	87.1%	7.7	78.9%
TPE (15)	18.75	88.2%	11.6	80.2%
TPE (20)	25.2	88.3%	15.4	81.5%

From Tables 2 and 3, grid search found the CNN model with the best accuracy. This comes at the cost of search time: grid search has the longest search time. Of note, the evaluations of the grid search and random search methods can be run in parallel. Thus, running the experiments on two GPUs can cut the search time of these two methods almost in half.
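The difference between the two search strategies over a joint space of values and positions can be sketched as follows. This is a minimal illustration, not the authors' implementation: the search space, the function names, and the `toy_score` objective are all hypothetical, and `toy_score` merely stands in for training a CNN and measuring its validation accuracy.

```python
import itertools
import random

# Hypothetical search space: each configuration pairs a value with a position.
SPACE = {
    "dropout_rate": [0.2, 0.5, 0.7],
    "dropout_position": ["after_conv", "after_pool", "after_fc"],
}

def grid_configs(space):
    """Enumerate every combination (grid search evaluates all of them)."""
    keys = list(space)
    return [dict(zip(keys, combo)) for combo in itertools.product(*space.values())]

def random_configs(space, n, seed=0):
    """Sample n distinct configurations (random search evaluates only these)."""
    rng = random.Random(seed)
    return rng.sample(grid_configs(space), n)

def search(configs, evaluate):
    """Return the best configuration and its score."""
    best = max(configs, key=evaluate)
    return best, evaluate(best)

def toy_score(cfg):
    """Made-up stand-in for 'train a model, return validation accuracy'."""
    bonus = {"after_conv": 0.0, "after_pool": 0.5, "after_fc": 1.0}
    return bonus[cfg["dropout_position"]] - abs(cfg["dropout_rate"] - 0.5)

all_cfgs = grid_configs(SPACE)
best_cfg, best_score = search(all_cfgs, toy_score)
print(len(all_cfgs), best_cfg, best_score)
```

Because each configuration is evaluated independently of the others in both strategies, the list of configurations can be split across devices, which is why the search time of these two methods shrinks with the number of GPUs.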

Finally, Table 4 lists the best results of the plain model (with no BN or Dropout layers), of optimizing the hyperparameter values only (scenario 1), and of optimizing both the hyperparameter values and positions (scenario 2, ArchPosOpt). ArchPosOpt achieved the highest accuracy on both datasets. These results support treating the positions of the Dropout and BN layers as an architectural hyperparameter, since doing so led to better accuracy in these test cases.

Table 4

Performance of the best models

Method	Dogs vs. Cats		CIFAR-10
	Acc.	Epochs	Acc.	Epochs
Plain model	85.7%	39	74.1%	33
scenario 1	88.6%	120	79.9%	95
scenario 2	88.9%	60	81.7%	105

4.4 Hyperparameters tuning

This section evaluates seven different hyperparameters. The configurations of the seven experiments are listed in Table 1. The experiments of this section were performed on a Tesla K80 GPU, and the reported times are in seconds.

4.4.1 Batch Size (BS)

Fig. 4 depicts the validation accuracy against the number of epochs for different batch sizes; the validation accuracy rates of the CIFAR-10 and Dogs vs. Cats datasets are shown in the top and bottom sub-figures, respectively. First, we consider the average loss of a mini-batch as an approximation of the expected loss over the data distribution. In these experiments, we fixed all the hyperparameter values except the BS value. As shown in Fig. 4, the lowest validation accuracy rates are associated with the larger BS values. This matches the intuition that the number of iterations, or weight updates, increases as the batch size decreases.

Fig.4

Batch size results: validation accuracy vs. number of epochs, top: CIFAR-10 and bottom: Dogs vs. Cats

The number of updates can be calculated as follows: $updates = numberOfEpochs \times \frac{N}{batchSize}$, where N is the number of training data points. For instance, when the BS value equals 64, Stochastic Gradient Descent (SGD) computes the gradient after every 64 data points. Then, SGD checks its search direction and updates the weights. This process is repeated 312 times per epoch for the Dogs vs. Cats dataset. But, if the BS value equals 1,024, then SGD updates the weights only 19 times per epoch for the same dataset.
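As a quick check of this arithmetic, the following Python sketch computes the number of weight updates. The training-set size of 20,000 images is an assumption inferred from the counts quoted in the text (it is not stated there), and the function names are illustrative.

```python
def updates_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of full mini-batches (weight updates) in one pass over the data."""
    return num_samples // batch_size

def total_updates(num_epochs: int, num_samples: int, batch_size: int) -> int:
    """updates = numberOfEpochs * N / batchSize, counting full batches only."""
    return num_epochs * updates_per_epoch(num_samples, batch_size)

N_TRAIN = 20_000  # assumption: inferred from the quoted counts, not given in the text

for bs in (64, 1024):
    print(bs, updates_per_epoch(N_TRAIN, bs))
```

Under this assumption, a batch size of 64 gives 312 updates per epoch and a batch size of 1,024 gives 19, which matches the figures quoted above.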

In other words, checking the direction many times is better even if each direction is less precise. The batch sizes that achieved the best performance in our tests are 64 for the Dogs vs. Cats dataset and 256 for the CIFAR-10 dataset. When training with larger batch sizes, the learning rate should be adjusted as discussed in [15, 34].
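One common form of this adjustment is the linear scaling rule described in [15]: when the batch size is multiplied by k, the learning rate is multiplied by k as well. The sketch below only illustrates that rule; the base learning rate and batch size are hypothetical values, not settings from these experiments, and [15] additionally uses a warm-up phase that is omitted here.

```python
def scaled_learning_rate(base_lr: float, base_batch_size: int, batch_size: int) -> float:
    """Linear scaling rule: scale the learning rate by the batch-size ratio."""
    return base_lr * batch_size / base_batch_size

# Hypothetical reference point: lr = 0.01 tuned at a batch size of 64.
for bs in (64, 256, 1024):
    print(bs, scaled_learning_rate(0.01, 64, bs))
```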

The epoch running time is the time needed to pass the entire dataset forward and backward through the CNN model once. The epoch running time of the larger batch sizes is shorter than that of the smaller ones. For instance, in CIFAR-10, when the BS value equals 16, the epoch time was 63 seconds. In contrast, when the BS equals 1,024, the epoch running time was 24 seconds. However, the overall convergence time of the larger batch sizes is longer than that of the smaller ones. This is because increasing the batch size makes SGD oscillate around the minimum. Thus, choosing an appropriate batch size between 16 and 1,024 gives better performance and faster convergence.

4.4.2 Weight initialization

Table 5 lists the results of both the traditional and state-of-the-art weight initialization methods. The experiments started by setting the weights to zeros, ones, or a constant value. The results for all three are the same: the model cannot learn. Thus, it is not recommended to start a CNN model with any of these three weight initialization methods. This is because using the same initial value for all weights makes the network symmetric. In other words, each neuron in a layer computes the same output, and the same gradient is computed for each of them in the back-propagation process.
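The symmetry argument can be illustrated on a tiny two-layer network. The sketch below is not the CNN used in the experiments; the network shape, the tanh activation, and the constant 0.5 are arbitrary choices made for illustration. It shows that with a constant initialization every hidden unit receives exactly the same gradient, whereas random initialization breaks the tie.

```python
import numpy as np

def hidden_gradients(w1, w2, x, y):
    """Gradient of 0.5 * sum((pred - y)**2) w.r.t. the first-layer weights."""
    h = np.tanh(x @ w1)               # hidden activations, shape (batch, hidden)
    pred = h @ w2                     # network output, shape (batch, 1)
    err = pred - y                    # derivative of the loss w.r.t. pred
    dh = (err @ w2.T) * (1 - h ** 2)  # back-propagate through tanh
    return x.T @ dh                   # shape (inputs, hidden)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))

# Constant initialization: every hidden unit starts with identical weights.
g_const = hidden_gradients(np.full((3, 4), 0.5), np.full((4, 1), 0.5), x, y)
# Random initialization: hidden units start out different.
g_rand = hidden_gradients(rng.normal(size=(3, 4)), rng.normal(size=(4, 1)), x, y)

# Compare every gradient column with the first one.
print("constant init, identical columns:", np.allclose(g_const, g_const[:, :1]))
print("random init, identical columns:", np.allclose(g_rand, g_rand[:, :1]))
```

Since identical gradients keep identical weights identical after every update, the hidden units never differentiate, which is consistent with the chance-level accuracies in the first three rows of Table 5.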

Table 5
Weight initialization methods results (best results in bold).

	Binary				Categorical
Method	Acc.	Loss	Epochs	Time	Acc.	Loss	Epochs	Time
zeros	50.0%	0.69	14	1129	10.0%	2.30	120	312
ones	50.0%	7.97	11	867	10.0%	1.19	11	275
constant	50.0%	0.69	11	864	10.0%	2.30	12	313
Random Normal	85.51%	0.80	56	4888	76.46%	1.05	36	1012
Random Uniform	86.69%	0.66	46	3677	77.42%	1.43	63	1730
Truncated Normal	86.43%	0.75	60	4738	77.18%	1.34	47	1314
Variance Scaling	87.31%	0.78	39	3560	77.81%	1.26	36	1006
Orthogonal	87.83%	0.80	47	4230	77.31%	1.15	30	925
Glorot normal	88.69%	0.80	56	5713	77.68%	1.28	39	1178
Glorot uniform	87.67%	0.79	50	4730	77.55%	1.33	48	1360
He normal	86.93%	0.79	41	3679	77.94%	1.33	39	1172
He uniform	86.47%	0.78	47	4375	77.13%	1.36	39	1174

Note: all experiments were conducted using the same batch size of 128.

As noted, the network started learning with all other initialization methods and achieved comparable results. Thus, we examined the convergence times and accuracy rates to compare these methods. Weights drawn from normal or uniform distributions with zero mean and a fixed standard deviation (i.e., random normal, random uniform, truncated normal) have lower accuracy rates and longer convergence times.

Performance improves when the weights are drawn from a truncated normal distribution with zero mean and a standard deviation computed from the number of input units, output units, or their average in the weight tensor, or from a uniform distribution whose limits are based on the number of units. The best accuracy rates (bold numbers) are achieved using weights drawn from a truncated normal distribution, but this requires longer training times. In contrast, weights drawn from uniform distributions achieve lower accuracy rates than those drawn from normal distributions, but they converge faster.
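For reference, the scale parameters of these initializers can be written down directly. The sketch below uses the standard textbook definitions of the Glorot, He, and LeCun schemes; the function names and the example layer size are hypothetical, and library implementations may apply small corrections (for example for truncation) that are not shown.

```python
import math

def glorot_normal_std(fan_in: int, fan_out: int) -> float:
    """Target standard deviation: sqrt(2 / (fan_in + fan_out))."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def glorot_uniform_limit(fan_in: int, fan_out: int) -> float:
    """Weights drawn from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def he_normal_std(fan_in: int) -> float:
    """Target standard deviation: sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)

def he_uniform_limit(fan_in: int) -> float:
    """Weights drawn from U(-limit, limit) with limit = sqrt(6 / fan_in)."""
    return math.sqrt(6.0 / fan_in)

def lecun_normal_std(fan_in: int) -> float:
    """Target standard deviation: sqrt(1 / fan_in)."""
    return math.sqrt(1.0 / fan_in)

# Hypothetical fully connected layer with 512 inputs and 256 outputs.
fan_in, fan_out = 512, 256
print(glorot_normal_std(fan_in, fan_out), glorot_uniform_limit(fan_in, fan_out))
print(he_normal_std(fan_in), he_uniform_limit(fan_in), lecun_normal_std(fan_in))
```

A uniform distribution on (-limit, limit) has variance limit²/3, so the normal and uniform variant of each scheme target the same weight variance and differ only in the shape of the distribution.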

4.4.3 Dropout

All the Dropout experiments are conducted using a standard Dropout layer with a rate of zero, 0.2, 0.5, or 0.7. First, a model without Dropout is used; it achieved a good classification result on both datasets, but the validation accuracy did not exceed 85.67%. This is because the model overfitted the data; the training accuracy of the model without any Dropout layer was 100%. FC layers connect every node of one layer to every node of the next; as a result, they hold most of the trainable parameters in the network. Thus, adding a Dropout layer after the FC layers is a natural way to counter overfitting. In the case of binary classification, the best validation accuracy rates are achieved when we apply the Dropout layers after the FC layers, as shown in Table 6.

Table 6
Using Dropout layer in different positions and values (best in bold).

	Binary				Categorical
Method	Acc. %	Loss	Epochs	Time	Acc. %	Loss	Epochs	Time
No Dropout in the model	85.67	0.94	39	3618.97	74.1	1.88	33	949.82
Drop(.2) after every FC	87.09	0.90	47	4298.30	76.59	1.53	42	1138.74
Drop(.5) after every FC	88.27	0.75	72	6231.73	77.74	1.25	37	1012.42
Drop(.7) after every FC	88.87	0.74	60	5500.48	77.52	1.09	36	995.75
Drop(.2) after every Conv	78.25	0.80	38	3384.98	63.57	1.42	35	957.36
Drop(.5) after every Conv	52.38	0.72	22	2007.50	10.27	2.97	12	312.63
Drop(.7) after every Conv	49.99	5.71	11	928.92	10.0	2.30	11	273.81
Drop(.2) after every Pool	81.01	0.90	33	3218.36	74.04	1.41	24	668.77
Drop(.5) after every Pool	84.61	0.39	43	3597.12	78.12	0.75	43	1155.94
Drop(.7) after every Pool	67.97	0.58	26	2273.00	10.0	2.30	11	276.22
Drop(.5) after every Pool, Drop(.5) after every FC	87.67	0.47	159	13450.18	81.65	0.57	105	2869.16
Drop(.2) after every Conv, Drop(.5) after every Pool, Drop(.5) after every FC	55.54	0.73	30	2749.32	44.1	1.68	81	2556.78

Note: all experiments were conducted using the same batch size of 128.

The dropout rate cannot be linked directly to the training time. From Table 6, we can notice that for some models, increasing the dropout rate decreases the training time (rows 5 to 7). This relation does not hold for the other rows of the table. The same observation applies to the validation accuracy. This is because the model randomly drops out neurons, so the effective architecture of the model changes in every iteration.
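The random masking behind this behavior can be sketched with a few lines of NumPy. This is a generic "inverted dropout" illustration under the usual definition of the technique, not the layer implementation used in the experiments; the function name and the example input are hypothetical.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Zero each unit with probability `rate`; rescale survivors by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return x  # at inference time the layer is the identity
    keep = 1.0 - rate
    mask = rng.random(x.shape) < keep  # a fresh random mask on every call
    return x * mask / keep

rng = np.random.default_rng(0)
activations = np.ones(10)
print(dropout(activations, 0.5, rng))  # one random sub-network
print(dropout(activations, 0.5, rng))  # a different sub-network on the next call
```

Each training step therefore updates a different randomly thinned network, which is one reason the number of epochs to convergence need not vary monotonically with the dropout rate.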

For multi-class image classification on the CIFAR-10 dataset, applying a Dropout layer after both the FC and pooling layers achieved the best validation accuracy rate, as shown in Table 6. With this configuration, the validation accuracy was 81.65%, better than the model with Dropout layers by 2%. On the other hand, this configuration trained about three times slower than the model without any Dropout. Applying Dropout after the convolutional, pooling, and fully connected layers at the same time has a negative impact on the model performance.

4.4.4 Batch normalization

The original BN paper [22] recommended adding a BN layer before the non-linearity layers. Thus, the experiments started by adding a BN layer before and after the activation function layers. In the binary classification problem, the overall model performance was lower than that of the plain model, and the model overfitted the training data: the training accuracy was nearly 100%. Thus, adding a BN layer after the activation functions does not have the regularizing effect that adding a Dropout layer has, as listed in Table 7.

Table 7
BN methods validation results (best in bold).

		Binary				Categorical
Method	BN layer location	Acc.	Loss	Epoch	Time	Acc.	Loss	Epoch	Time
Conv/FC+ BN+Act	After Conv, FC and before Act	83.01%	0.75	49	4552	78.45%	1.03	65	1962
Conv/FC+Act+ BN	After Conv, FC and Act	84.91%	0.67	60	5444	78.89%	1.08	74	2159
Conv+Act+ BN+ FC + Act + BN + Drop	After Conv, FC, and Act and before Dropout	87.75%	0.81	84	6480	78.39%	1.54	116	3244
Conv + Act + BN + FC + Act + Drop + BN	After Conv, FC, Act and Dropout	88.29%	0.67	68	5434	78.16%	1.42	99	2788
Conv+ BN + Act + FC+ BN + Act+ Dropout(0.5)	After Conv, FC, and before Act and Dropout	88.62%	0.37	120	9674	79.94%	0.82	95	2762
Pool+ Dropout(0.5) +BN	After Pooling, Dropout	78.41%	1.37	151	12172	76.81%	1.36	138	3689
Pooling + BN + Drop(0.5)	After Pool and before Dropout	78.29%	1.45	78	6368	78.88%	1.45	155	4181
FC + Drop(0.5) + BN	After FC and Dropout	88.73%	0.64	103	8198	77.60%	1.30	67	1885
FC + BN + Drop(0.5)	After FC and before Dropout	81.43%	1.54	107	8323	77.44%	1.70	120	3229
Pooling + Drop(0.5) + BN + FC + Drop(0.5) + BN	After Pooling, Dropout and after FC, Dropout	79.59%	0.49	67	5526	75.6%	0.76	65	1910
Pooling + BN + Drop(0.5) + FC + BN + Drop(0.5)	After Pooling before Dropout and after FC before Dropout	72.7%	1.92	111	9041	78.8%	0.74	185	5483

Note: all experiments were conducted using the same batch size of 128.

In contrast, a BN layer increases the model validation accuracy in the multi-class classification problem. In both two-class and multi-class image classification, adding a BN layer negatively affected the convergence time. Thus, using BN alone is not sufficient to solve the overfitting problem.

For the two-class classification, adding a Dropout layer and then a BN layer after the FC layers seems to be a promising approach to address the overfitting problem. The results showed that this combination achieved better overall performance than the plain model or the plain model with BN layers only. On the other hand, the model with Dropout and BN layers after the FC layers converges about three times slower than the plain model.

In addition, BN and Dropout with a rate of 0.5 are used after every pooling layer in the model. In the binary classification case, this combination had a negative impact on the model’s overall performance, as the model overfitted the training dataset. In the multi-class classification case, it had a positive impact on the model’s overall performance, especially with the following order: pooling, BN, and Dropout.

Adding a BN layer resulted in accuracy rates about 3% better than the plain model for both the Dogs vs. Cats and CIFAR-10 datasets. However, it slowed down convergence by about three times. Dropout and BN are also used together after both the pooling and FC layers, in either order. This combination decreased model performance and increased the convergence time in the binary classification problem. On the other hand, it had a positive impact on the overall performance in the categorical classification problem.
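To make the notion of "position" concrete, the first five BN options of Table 1 can be written as alternative orderings of the same fully connected block. The sketch below only manipulates layer names; the dictionary, the helper functions, and the choice of two FC blocks are hypothetical and do not reproduce the authors' model-building code.

```python
# Orderings of one fully connected (FC) block, keyed by the BN option numbers
# of Table 1. All names below are hypothetical and used for illustration only.
FC_BLOCK_ORDERINGS = {
    1: ["FC", "Act", "BN"],
    2: ["FC", "BN", "Act"],
    3: ["FC", "Act", "BN", "Dropout(0.5)"],
    4: ["FC", "Act", "Dropout(0.5)", "BN"],
    5: ["FC", "BN", "Act", "Dropout(0.5)"],
}

def build_head(option, n_fc_blocks=2):
    """Repeat the chosen FC block ordering to form the classifier head."""
    layers = []
    for _ in range(n_fc_blocks):
        layers.extend(FC_BLOCK_ORDERINGS[option])
    return layers

def same_layers_different_order(a, b):
    """True when two options use identical layers and differ only in position."""
    first, second = FC_BLOCK_ORDERINGS[a], FC_BLOCK_ORDERINGS[b]
    return sorted(first) == sorted(second) and first != second

print(build_head(4, n_fc_blocks=1))
print(same_layers_different_order(3, 4), same_layers_different_order(3, 5))
```

Options 3, 4, and 5 contain exactly the same layers, yet the corresponding rows of Table 7 report different accuracies and convergence times, which is the effect that position optimization targets.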

4.4.5 Learning rate

The recent learning rate (LR) optimization methods are tested to assess their impact on training deep learning models. The experiments started with the default values of the learning rate, β1, β2, and the exponentially weighted average decay factor for most learning rate optimization methods. The results of these experiments are shown in Table 8 and Fig. 5.

Fig.5

LR results: validation accuracy vs. number of epochs, top: Dogs vs. Cats and bottom: CIFAR-10

Table 8

LR optimization methods results (best in bold).

	Binary				Categorical
Method	Config.	Acc.	Loss	Epoch	Time	Acc.	Loss	Epoch	Time
SGD	lr value =.001	83.91%	0.49	98	7664	77.34%	1.23	137	3836
	momentum = 0.9
	lr value = 0.01	87.89%	0.59	31	2456	77.91%	1.17	32	929
	momentum= 0.9
RMSprop	lr value = 0.001	88.41%	0.65	58	4832	10.0%	1.19	17	470
	rho = 0.9
Adagrad	lr value = 0.01	50.0%	7.97	11	908	76.49%	1.13	22	635
Adadelta	lr value = 1.0	87.63%	0.56	30	3138	77.32%	1.42	28	860
	rho=0.95
Adam	lr value = 0.001	86.99%	0.75	32	3083	76.3%	1.16	19	578
	β1= 0.9
	β2= 0.999
Adamax	lr value = 0.002	87.37%	0.64	21	1935	77.78%	1.29	22	669
	β1= 0.9
	β2= 0.999
Nadam	lr value = 0.002	50.0%	0.69	11	891	71.90%	1.21	29	894
	β1= 0.9
	β2= 0.999

For the SGD method, a small LR value of 0.001 is used to test the impact of the learning rate on model behavior. Using a small LR decreased the overall performance and increased the convergence time. When a learning rate 10 times larger is used, the overall model performance increases by a considerable margin of 4% and convergence is three times faster in the binary classification problem.

In the multi-class classification problem, the larger learning rate increased the model accuracy by only a small margin, but it reduced the convergence time by a factor of about four. In binary classification, RMSprop achieved the best overall performance compared to the other methods, but it had the longest convergence time among the adaptive methods. In the multi-class classification problem, Adamax achieved the highest validation accuracy rate among the adaptive methods (77.78%), slightly below SGD with a learning rate of 0.01 (77.91%).
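The effect of the learning rate on the number of updates can be illustrated with SGD with momentum on a one-dimensional quadratic. This toy problem is an assumption made purely for illustration and says nothing about the CNN loss surface; the function name, starting point, and tolerance are hypothetical, while the learning rates and the momentum of 0.9 mirror the SGD rows of Table 8.

```python
def sgd_momentum_steps(lr, momentum=0.9, w0=1.0, tol=1e-3, max_steps=100_000):
    """Count updates until |w| < tol when minimizing f(w) = 0.5 * w**2."""
    w, v = w0, 0.0
    for step in range(1, max_steps + 1):
        grad = w                      # derivative of 0.5 * w**2
        v = momentum * v - lr * grad  # velocity update
        w = w + v                     # weight update
        if abs(w) < tol:
            return step
    return max_steps

slow = sgd_momentum_steps(lr=0.001)
fast = sgd_momentum_steps(lr=0.01)
print("steps with lr=0.001:", slow)
print("steps with lr=0.01:", fast)
```

On this toy objective the larger learning rate reaches the tolerance in far fewer updates, which is qualitatively in line with the shorter convergence times reported for SGD with a learning rate of 0.01.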

4.4.6 Output activation functions

In Table 9, the sigmoid function achieved the best results in the binary classification problem in comparison to all other output activation functions. The sigmoid function obtains reasonable validation accuracy in the multi-class classification problem as well. The softmax function, which is mostly used for multi-class classification problems, failed to learn in the binary problem, but, as expected, it achieved the best validation accuracy in the multi-class classification problem. Neither the tanh nor the ReLU function learned in either the binary or the categorical problem. It is worth noting that the tanh function started learning in binary classification until it reached an accuracy near 77.5%, and then the accuracy suddenly dropped to zero. In addition, the ReLU function started learning slowly in the categorical problem, but it did not reach a reasonable accuracy.

Table 9
Output layers activation functions results (best in bold).

	Binary				Categorical
Function	Acc.	Loss	Epochs	Time	Acc.	Loss	Epochs	Time
Sigmoid	88.42%	0.90	125	10259	77.75%	1.77	150	3846
SoftMax	50.0%	7.97	31	2308	78.28%	1.57	66	1747
Tanh	0.0%	8.06	44	3586	10.0%	1.19	31	764
ReLU	50.0%	0.90	31	10259	14.78%	2.23	65	1580

Note: all experiments were conducted using the same batch size of 128.
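One possible explanation for the softmax result in the binary problem, offered here as an assumption because the text does not state the size of the output layer, is that the binary model ends in a single output unit. Softmax normalizes over the units of a layer, so over a single unit it always returns 1 regardless of the input, whereas sigmoid still varies with the input. The sketch below demonstrates this property with NumPy; the helper names and the example logits are illustrative.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis, shifted by the maximum for numerical stability."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([[-3.0], [0.0], [4.0]])  # three samples, one output unit each
print("softmax:", softmax(logits).ravel())  # constant output
print("sigmoid:", sigmoid(logits).ravel())  # depends on the logit
```

A constant prediction of one class would yield 50% accuracy on a balanced two-class dataset, which is consistent with the SoftMax row of Table 9, although other causes cannot be ruled out from the text alone.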

4.4.7 Hidden activation functions

To evaluate the performance of hidden activation functions, we tested the proposed model using a set of different activation functions in the hidden layers. This set included sigmoid, tanh, ReLU, LeakyReLU, Parametric ReLU (PReLU), ThresholdedReLU, exponential linear unit (ELU), and scaled ELU (SELU). The equations of these activation functions are listed in Table 10.

Table 10
Activation functions

Function	Equation
sigmoid	$y = \frac{1}{1 + e^{-x}}$
softmax	$y_{i} = \frac{e^{x_{i}}}{\sum_{j} e^{x_{j}}}$
tanh	$y = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
ReLU	$y = \begin{cases} x & x > 0 \\ 0 & \text{otherwise} \end{cases}$
LReLU	$y = \begin{cases} x & x \geq 0 \\ 0.01 x & x < 0 \end{cases}$
PReLU	$y = \begin{cases} x & x \geq 0 \\ \alpha x & x < 0 \end{cases}$
ThresholdedReLU	$y = \begin{cases} x & x > \theta \\ 0 & \text{otherwise} \end{cases}$
ELU	$y = \begin{cases} x & x > 0 \\ \alpha (\exp(x) - 1) & x \leq 0 \end{cases}$
SELU	$y = \lambda \begin{cases} x & x > 0 \\ \alpha (\exp(x) - 1) & x \leq 0 \end{cases}$
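The piecewise definitions in Table 10 translate directly into NumPy. The following is a minimal sketch of the ReLU family only; the default values for the threshold θ and for the ELU α are assumptions (the table does not fix them), and the SELU constants are the standard published values rather than numbers taken from this paper.

```python
import numpy as np

# Standard SELU constants (not stated in Table 10).
SELU_ALPHA = 1.6732632423543772
SELU_LAMBDA = 1.0507009873554805

def relu(x):
    return np.where(x > 0, x, 0.0)

def leaky_relu(x, slope=0.01):
    return np.where(x >= 0, x, slope * x)

def prelu(x, alpha):
    """Same form as LeakyReLU, but alpha is learned during training."""
    return np.where(x >= 0, x, alpha * x)

def thresholded_relu(x, theta=1.0):  # theta = 1.0 is an assumed default
    return np.where(x > theta, x, 0.0)

def elu(x, alpha=1.0):  # alpha = 1.0 is an assumed default
    # np.minimum keeps exp() from overflowing on large positive inputs.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def selu(x):
    return SELU_LAMBDA * elu(x, SELU_ALPHA)

x = np.array([-2.0, 0.0, 0.5, 3.0])
for fn in (relu, leaky_relu, thresholded_relu, elu, selu):
    print(fn.__name__, fn(x))
```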

The experiments started with the traditional sigmoid function. As shown in Table 11, the sigmoid function failed to learn on both the Dogs vs. Cats and CIFAR-10 datasets. In addition, the older tanh function was tested; unlike the sigmoid function, tanh obtained reasonable results on both datasets. Next, the most widely used hidden layer activation function, ReLU, was tested. ReLU achieved the best overall performance in both binary and multi-class classification problems.

Table 11

Hidden layers activation functions results (best in bold)

	Binary				Categorical
Function	Acc.	Loss	Epochs	Time	Acc.	Loss	Epochs	Time
Sigmoid	50.0%	0.69	31	2213	10.0%	2.30	54	1268
Tanh	80.89%	1.26	62	4653	71.51%	1.95	41	1007
ReLU	88.47%	0.93	114	9929	78.58%	1.75	165	4198
LeakyReLU	86.55%	1.05	114	9948	76.59%	1.48	43	1206
PReLU	87.71%	0.77	54	5149	10.0%	2.30	38	945
ThresholdedReLU	50.00%	0.69	32	2537	77.57%	1.69	101	2830
ELU	82.17%	1.26	107	9192	73.58%	2.21	84	2231
SELU	50.00%	7.97	31	2577	10.0%	2.30	46	1281

Note: all experiments were conducted using the same batch size of 128.

LeakyReLU, a modified version of ReLU that gives some flexibility to values less than zero by multiplying them by 0.01, achieved lower overall performance than ReLU with a convergence time equal to that of ReLU in the binary problem. In the categorical problem, LeakyReLU converged about 3.5 times faster than ReLU (1,206 versus 4,198 seconds in Table 11).

PReLU defines the variable α to be a learnable parameter. It achieves an overall performance close to that of the ReLU function and takes about half of ReLU’s time to converge in the binary problem, but it failed to learn in the categorical problem. ThresholdedReLU failed to learn in the binary problem, but in the categorical problem it achieved an accuracy 1% lower than ReLU while taking about two-thirds of ReLU’s time to converge. ELU achieved lower validation accuracy rates than ReLU in both classification problems and takes nearly the same time to converge. Finally, SELU failed to learn on both the binary and categorical datasets.

4.5 Recommendations

In the conducted experiments, most hyperparameters are tested to draw conclusions about how they affect the performance of CNN models for the image classification problem. Table 12 lists a set of recommendations to consider in order to start the learning process properly. In addition, a visual recommendation of the hyperparameter positions within a CNN model architecture is depicted in Fig. 6. ArchPosOpt optimizes the positions of the Dropout and BN layers. Thus, in Fig. 6 there are three possible hyperparameter positions after each layer of the CNN model (the layers shown in blue).

Fig.6

Architectural recommendations of the hyperparameter positions

Table 12

Recommendations for both classification problems

Parameter	Recommendations
Batch Size	It is better to use a BS equal to 64 for Dogs vs. Cats and 256 for CIFAR-10.
Weight Initialization	It is not recommended to initialize weights with zeros, ones, or constants; it is preferred to draw weights from a truncated normal distribution. For the Dogs vs. Cats dataset, it is recommended to use Glorot normal or Lecun normal, but they take 1.5 times longer than their uniform counterparts. For the CIFAR-10 dataset, it is recommended to use Lecun normal or He normal to initialize the weights.
Dropout	Using Dropout increases the performance but also increases the convergence time. It is not recommended to use Dropout after convolutional layers. For the Dogs vs. Cats dataset, it is recommended to use a Dropout layer after the FC layers with a dropout rate of 0.5 or 0.7. For the CIFAR-10 dataset, we recommend using Dropout after the pooling layers with a rate of 0.5. We also recommend combining Dropout after both the pooling and FC layers.
BN	For both datasets, it is recommended to use BN before the activation functions and Dropout after the fully connected layers. We also recommend using BN after Dropout in the FC layers.
Learning Rate Opt. Methods	It is recommended to use RMSprop for the Dogs vs. Cats dataset and AdaMax for the CIFAR-10 dataset as optimization methods.
Output Activation Functions	It is recommended to use the sigmoid function in the output layer for the Dogs vs. Cats dataset and the SoftMax function for the CIFAR-10 dataset.
Hidden Activation Functions	It is not recommended to use sigmoid or tanh, as they saturate. We recommend using ReLU or one of its variants. Among the variants, it is recommended to use PReLU for the Dogs vs. Cats (binary) dataset and ThresholdedReLU for the CIFAR-10 (categorical) dataset.

In Fig. 6, a layer marked with the character ‘x’ should be dropped from the CNN model. According to the conducted experiments, there is one recommendation of the hyperparameter positions for each of the two problems, binary and multi-class image classification, with one recommendation per line. For instance, the recommended hyperparameter positions for binary image classification are two-fold. First, it is recommended to place an activation layer after each convolutional layer; second, the FC layer(s) should be followed by an activation layer and then a Dropout layer. There is no BN layer in the model with the best accuracy for binary classification. The model recommended by ArchPosOpt is less complex, with fewer layers, and more accurate than a model that follows the recommendation of the original BN paper to place a BN layer after each convolutional layer.

Similarly, the architectural recommendation for multi-class image classification is to place a Dropout layer after each pooling layer and after the FC layer of the CNN model. Compared with a CNN model that follows the positions suggested in the original papers proposing BN and Dropout, the CNN model recommended by ArchPosOpt for multi-class classification has about 2% higher accuracy.

This observation emphasizes that using the proposed tool, ArchPosOpt, may yield better results than placing the BN and Dropout layers in the standard positions within the CNN model.

5 Conclusions

Hyperparameter optimization is a crucial task within the process of designing a deep neural network model. This optimization process includes setting values to hyperparameters and choosing from a set of architectural options. In this vein, we extend the set of architectural options to include the position of the hyperparameters, which affects model performance. The proposed method, ArchPosOpt, extends three existing hyperparameter optimization tools, grid search, random search, and TPE, to search for the optimal position of the hyperparameters. The proposed method was evaluated through a set of image classification experiments on two datasets: one binary classification and one multi-class classification. The extended versions of the three aforementioned tools, using the ArchPosOpt approach, found models achieving higher accuracy than the original tools. In addition, seven different hyperparameters were studied through a comprehensive set of experiments to find the best practice for using them in image classification with a CNN-based model. Finally, we provided a list of recommendations for choosing the optimal hyperparameter values and positions for these two datasets. While the proposed idea is used here to optimize the architecture of CNN models, it can be generalized to other deep learning architectures.

Acknowledgements

The research was partially funded by the National Natural Science Foundation of China (Grant Nos. 61772182, 61472126, 61602170, 61750110531, 61672215, U1613209).

References

1. Dogs vs. Cats. https://www.kaggle.com/c/dogs-vs-cats.

2. Baker, Gupta, Naik and Raskar, Designing neural network architectures using reinforcement learning, arXiv preprint arXiv:1611.02167, 2016.

3. Bergstra and Bengio, Random search for hyperparameter optimization, Journal of Machine Learning Research 13(Feb) (2012), 281–305.

4. Bergstra, Yamins and Cox D.D., Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms, In Proceedings of the 12th Python in Science Conference, Citeseer, 2013, pp. 13–20.

5. Bergstra, Yamins and Cox D.D., Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures, 2013.

6. Bergstra J.S., Bardenet, Bengio and Kégl, Algorithms for hyper-parameter optimization, In Advances in Neural Information Processing Systems, 2011, pp. 2546–2554.

7. Boureau Y.-L., Ponce and Cun Y.L., A theoretical analysis of feature pooling in visual recognition, In Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 111–118.

8. Claesen, Simm, Popovic and Moor B.D., Hyperparameter tuning in Python using Optunity, In Proceedings of the International Workshop on Technical Computing for Machine Learning and Mathematical Engineering, volume 1, 2014, p. 3.

9. Duan, Li and Li, An ensemble CNN2ELM for age estimation, IEEE Transactions on Information Forensics and Security 13(3) (2018), 758–772.

10. Duan, Li, Yang and Li, A hybrid deep learning CNN–ELM for age and gender classification, Neurocomputing 275 (2018), 448–461.

11. Ducha-Aiki. ducha-aiki/caffenet-benchmark.

12. Falkner, Klein and Hutter, BOHB: Robust and efficient hyperparameter optimization at scale, arXiv preprint arXiv:1807.01774, 2018.

13. Feurer, Klein, Eggensperger, Springenberg, Blum and Hutter, Efficient and robust automated machine learning, In Advances in Neural Information Processing Systems, 2015, pp. 2962–2970.

14. Feurer, Springenberg J.T. and Hutter, Initializing Bayesian hyperparameter optimization via metalearning, In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

15. Goyal, Dollár, Girshick, Noordhuis, Wesolowski, Kyrola, Tulloch, Jia and He, Accurate, large minibatch SGD: Training ImageNet in 1 hour, arXiv preprint arXiv:1706.02677, 2017.

16. Han and Kamdar M.R., MRI to MGMT: Predicting methylation status in glioblastoma patients using convolutional recurrent neural networks, In Pac Symp Biocomput, World Scientific, volume 23, 2018, pp. 331–42.

17. He, Zhang, Ren and Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, In Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.

18. He, Zhang, Ren and Sun, Deep residual learning for image recognition, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

19. Henderson, Islam, Bachman, Pineau, Precup and Meger, Deep reinforcement learning that matters, In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

20. Hubel D.H. and Wiesel T.N., Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex, The Journal of Physiology 160(1) (1962), 106–154.

21. Huval, Wang, Tandon, Kiske, Song, Pazhayampallil, Andriluka, Rajpurkar, Migimatsu and Cheng-Yue, et al., An empirical evaluation of deep learning on highway driving, arXiv preprint arXiv:1504.01716, 2015.

22. Ioffe and Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, In International Conference on Machine Learning, 2015, pp. 448–456.

23. Jaafra, Laurent J.L., Deruyver and Naceur M.S., A review of meta-reinforcement learning for deep neural networks architecture search, arXiv preprint arXiv:1812.07995, 2018.

24. Krizhevsky and Hinton, Learning multiple layers of features from tiny images, Technical report, Citeseer, 2009.

25. Cun Y.L., Bottou, Bengio and Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86(11) (1998), 2278–2324.

26. Maclaurin, Duvenaud and Adams, Gradient-based hyperparameter optimization through reversible learning, In International Conference on Machine Learning, 2015, pp. 2113–2122.

27. Melis, Dyer and Blunsom, On the state of the art of evaluation in neural language models, arXiv preprint arXiv:1707.05589, 2017.

28. Mishkin, Sergievskiy and Matas, Systematic evaluation of convolution neural network advances on the ImageNet, Computer Vision and Image Understanding 161 (2017), 11–19.

29. Nwankpa, Ijomah, Gachagan and Marshall, Activation functions: Comparison of trends in practice and research for deep learning, arXiv preprint arXiv:1811.03378, 2018.

30. Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss and Dubourg, et al., Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12(Oct) (2011), 2825–2830.

31. Pham, Guan M.Y., Zoph, Le Q.V. and Dean, Efficient neural architecture search via parameter sharing, arXiv preprint arXiv:1802.03268, 2018.

32. Ravì, Wong, Deligianni, Berthelot, Andreu-Perez, Lo and Yang G.-Z., Deep learning for health informatics, IEEE Journal of Biomedical and Health Informatics 21(1) (2017), 4–21.

33. Scherer, Müller and Behnke, Evaluation of pooling operations in convolutional architectures for object recognition, In Artificial Neural Networks–ICANN 2010, Springer, 2010, pp. 92–101.

34. Smith S.L., Kindermans P.-J., Ying and Le Q.V., Don’t decay the learning rate, increase the batch size, arXiv preprint arXiv:1711.00489, 2017.

35. Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research 15(1) (2014), 1929–1958.

36. Suganuma, Shirakawa and Nagao, A genetic programming approach to designing convolutional neural network architectures, In Proceedings of the Genetic and Evolutionary Computation Conference, ACM, 2017, pp. 497–504.

37. Xu, Wang, Chen and Li, Empirical evaluation of rectified activations in convolutional network, arXiv preprint arXiv:1505.00853, 2015.

38. Yamasaki, Honma and Aizawa, Efficient optimization of convolutional neural networks using particle swarm optimization, In 2017 IEEE Third International Conference on Multimedia Big Data (BigMM), IEEE, 2017, pp. 70–73.