EvoAAA: An evolutionary methodology for automated neural autoencoder architecture search

Abstract

Machine learning models work better when curated features are provided to them. Feature engineering methods have been usually used as a preprocessing step to obtain or build a proper feature set. In late years, autoencoders (a specific type of symmetrical neural network) have been widely used to perform representation learning, proving their competitiveness against classical feature engineering algorithms. The main obstacle in the use of autoencoders is finding a good architecture, a process that most experts confront manually. An automated autoencoder symmetrical architecture search procedure, based on evolutionary methods, is proposed in this paper. The methodology is tested against nine heterogeneous data sets. The obtained results show the ability of this approach to find better architectures, able to concentrate most of the useful information in a minimized encoding, in a reduced time.

Keywords

Representation learning autoencoder evolutionary methods

1. Introduction

Intelligent appliances based on machine learning systems [1] can be found in many everyday tasks. They are in charge of filtering spam email [2], detecting fraudulent transactions [3], recommending new products to buyers [4] and many other jobs with disparate complexity degrees [5, 6, 7, 8, 9, 10, 11]. To do that, ML methods need to extract knowledge from raw data. The usefulness of that knowledge mostly depends on the quality of data features. The goal is to produce descriptive and/or predictive models, depending on the task at hand.

Choosing a curated set of attributes, or building a new one from the original features, tends to produce better results [12] than using raw variables, although performance of ML methods could be also affected by learning metrics [13]. Hence the interest in feature engineering techniques in late years, including well-known preprocessing procedures [14] such as feature selection [15, 16] and feature extraction [17]. Representation learning (REPL) [18, 19] is a term tightly linked to perform feature engineering relying on deep learning techniques [20, 21].

Deep learning algorithms have a plethora of applications [22, 23]. Autoencoders [24, 25, 26] are a modern general-purpose DL-based family of tools for facing REPL. An AE is an unsupervised symmetric neural network [27] aimed to build an encoding that maximizes the reconstruction of data patterns. The obtained representation can be applied to many different tasks [28], including visualization, anomaly detection, hashing and noise removing. However, finding the proper AE architecture for each data set and function is not a trivial process. Usually, it is a challenge that experts have to deal with.

In this study EvoAAA, an automated methodology for designing AE architectures maximizing their reconstruction power when used with a specific data set, is proposed. The job is confronted as a hard optimization problem [29], unfeasible to solve by means of exhaustive search. Our hypothesis is that evolutionary methods [30] would be able to find near-optimal AE architectures in a reasonable time. It is a first approach to perform NAS specifically for AEs.

To verify the competitiveness of EvoAAA a thorough experimentation, including nine data sets and five search methods, three of them based on evolutionary optimization, is conducted. The architectures found through this approach are way better than those retrieved by exhaustive search.

Table 1
Summary of terminology and acronyms

Term/acronym	Description
AE	Autoencoder
ANN	Artificial Neural Network
Chromosome	Codification which represents the set of parameters of an individual in the population
CNN	Convolutional Neural Network
Cross-over	Operator to produce a new chromosome from two or more individuals (parents) by mixing their genes
DE	Differential Evolution
DL	Deep Learning
Elitism	Technique which preserves the best individuals among generations in an EM
EM	Evolutionary Method
ES	Evolution Strategy
EvoAAA	Evolutionary Methods for Automated Autoencoder Architecture search
Fitness	Quality measure linked to each individual
GA	Genetic Algorithm
Gene	Each value in a chromosome
Generation	Each one of the iterations in an evolutionary method
Individual	A potential solution in the search space denoted by a chromosome and the corresponding fitness value
LDA	Linear Discriminant Analysis
LLE	Locally Linear Embedding
ML	Machine Learning
MLP	Multi-Layer Perceptron
MSE	Mean Squared Error
Mutation	Operator to produce a new chromosome altering genes in an individual
NAS	Network Architecture Search
PCA	Principal Component Analysis
Population	Group of individuals in a generation
REPL	Representation Learning
SGD	Stochastic Gradient Descent

1.1 Problem formulation

That AEs are effective tools for REPL is a known fact [24, 25, 26], having proved their superiority against classical feature engineering methods such as PCA, LDA, ISOMAP and LLE [31]. They are also a tool closely related to nonstandard learning [32] problems. An AE is a symmetrical ANN [27], so it shares many of the characteristics of any ANN.

Training an ANN implies adjusting the weights that connect their neurons, using the backpropagation method [33] and any variation of the gradient descent algorithm such as SGD [34, 35]. The ANN architecture, i.e. number of layers, amount of units per layer, activation functions, etc., is set in advance, prior to the training process.

An inadequate ANN architecture could produce bad output results, regardless of the weights learned through the training process. If the ANN is too simple, adjusting too few parameters, it will be not able to learn. On the contrary, too complex ANNs suffer from overfitting [36] due to existing enough parameters to memorize the whole training data. These same problems also affect other kinds of ANNs [13], including AEs. Achieving a balanced architecture, the point where the ANN extracts enough information from seen data to generalize well while processing future never seen patterns, is still an unsolved problem. As a result, dozens of papers which contribute the design of ANNs to tackle specific problems [37, 38] are published every year.

The interest is in choosing a proper AE architecture to process a specific data set, so that the AE is able to learn an optimum representation of the data. Until now the design of AEs has been in charge of human experts. It is not an easy task to automate, since it is a hard combinatorial problem [29]. In fact, finding a good architecture can be seen as an optimization problem. This is a field where EMs [30] have shown their efficacy in the past.

Our proposal is a formulation to code any AE architecture so that it can be evolved by means of evolutionary approaches. The goal is to reduce the reconstruction error as much as possible. This way, the AE encoding will concentrate the maximum general-purpose information, rather than an encoding aimed to improve class separability or any other specific goal. Regarding evolutionary techniques, they will be used as the tool to optimize a set of parameters. The classical terminology in this field, summarized in Table 1 along with the acronyms appearing in the text, will be used.

1.2 Literature review

Feature engineering is a manual or automated task aimed to obtain a set of features better than the original one. Feature selection [14] consists in choosing a subset of attributes while maintaining most useful information in the data. It can be manually performed by an expert in the field, but mostly is faced with automated methods based on feature correlation [15] and mutual information [16]. By contrast, feature extraction methods transform the original data features to produce a new, usually reduced, set of attributes. Popular algorithms to do this are PCA [39] and LDA [40], whose mathematical foundations are relatively easy to understand.

More advanced studies work with the hypothesis that the distribution of variables in the original data lies along a lower-dimensional space, usually known as manifold. A manifold space works with the parameters that produce the data points in the original high-dimensional space. Finding this embedded space is the task of manifold learning [41] algorithms. Unlike PCA or LDA, manifold methods apply non-linear transformations, so they fall into the non-linear dimensionality reduction [42] category.

Autoencoders, as detailed in [27], are ANNs having a symmetric architecture, as shown in Fig. 1. The input and output layers have as many units as features there are in the data. Inner layers usually have fewer units, so that a more compact representation of the information hold in the data is produced. The goal is to reconstruct the input patterns into the output as faithfully as possible.

Figure 1.

Classic architecture for an AE. Black nodes denote a 2-variable encoding layer. Dark gray nodes are intermediate hidden layers. Light gray ones are the input (left) and output layers.

Although AEs are mainly used to perform feature fusion [27], searching the manifold in which the parameters to rebuild the data are found, they have many other practical applications [28, 43]. A properly configured AE can project data of any dimensionality into 2 or 3 dimensions so that patterns can be graphically visualized [44]. AEs can be used to detect anomalies [45, 46], training them to faithfully reconstruct normal patterns. When anomalous data enters the AE, it produces a high output error denoting that these patterns do not follow the known distribution. Another interesting application of AEs, specifically of denoising AEs, is data noise removing. This kind of AE has been used to successfully denoise images [47] and speech [48]. Usually the loss function of the AE has to be adapted to the specific task to be faced, so that the obtained encoding promotes separability, topology preservation or any other desired characteristic. When only maximum performance is pursued while reconstructing the input patterns, the AE will produce a general purpose encoding in its inner layer.

AEs can be configured with a variable amount of inner layers, each of them having different lengths. The proper architecture will mostly depend on the complexity of the patterns to be reconstructed and the restrictions imposed by the encoding layer. These restrictions prevent the AE from simply copying the input onto the output [27], for instance by reducing the number of neurons in the encoding layer, forcing the output of most neurons to be zero (sparse representation), etc.

Finding the best parameters to tune a machine learning model is an uphill battle. Performing a grid search through an internal validation process is an usual approach. However, it is useful only for limited sets of parameters taking known ranges of values. Disparate search algorithms, metaheuristics [49] and optimization approaches [50], aimed to perform combinatorial optimization, could be applied. These go from relatively simple search methods [51, 52, 53, 54] such as A* or IDA to more advanced and complex approaches such as memetic algorithms [55], including classic means as simulated annealing [56] or tabu search [57]. Evolutionary methods [58, 59] have also been used to face optimization problems [60, 61, 62, 63, 64] for a long time. Many of these algorithms are based on the behavior of certain populations, such as ant colonies [65] and particle swarms [66]. In the following, we focus specifically on approaches based on natural evolution [67].

In evolutionary algorithms [68] the search space is defined by the chromosome. It is made up of several genes, each of them representing a specific trait or dimension. The values taken by the genes in a chromosome are limited, and each combination is a potential solution (a point in the search space). These are also known as individuals. Usually a set of these are randomly generated, producing an initial population. Each individual is assigned a fitness value that represents the goodness of the solution. From this point, a number of iterations are run consisting in the same steps. Firstly, certain individuals of the population are selected for reproduction based on their fitness. Then, a set of operators are applied to selected individuals in order to create new ones. Common operators are crossing, that mixes genes of two individuals to produce a new chromosome, and mutation, which randomly changes the value of one or more genes. Lastly, the whole population is evaluated by computing the new fitness values and the population is updated, usually replacing old individuals with new ones, although the best overall solutions can be kept (elitism) whether they are new or not. Three popular evolutionary methods are genetic algorithms [69], differential evolution [70] and evolution strategy [71]. The first follows the methodology just described of selection, crossing, mutation and evaluation. Although GAs can be seen as an old technique, they are still in wide use [50, 62, 72] due to their simplicity and good performance. The second method produces new individuals from differences between existing ones, while the third one focuses on evolving only a few individuals using the mutation operator as only tool.

Evolutionary methods are also suitable for combinatorial optimization, and they have been used for instance for support vector machines [73] and more recently for deep learning networks [74]. Even though EMs have been already used to optimize ANNs, many of the proposals have been focused on learning the weights linked to each connection. This is known as conventional neuroevolution [75, 76, 77, 78] and its main foundation is to use an EM instead of the traditional gradient descent algorithm to optimize weights. Therefore, a fixed network architecture is the base for all the population. The only difference among individuals is the set of weight matrices connecting each layer, values optimized by the EM. Neuroevolutive algorithms able to also evolve the network topology appeared later [79], being aimed many of them to optimize MLPs, although proposals to adjust other kinds of ANNs also exist. There are different approaches to construct the ANN architecture, being one of the most popular subnetworks composition [80]. A recent survey in that matter can be found in [81].

Figure 2.

Components to be optimized during AE construction and the methods used for it. The numbers inside circles indicate the usual optimization order: firstly an architecture is set, then a set of hyperparameters is chosen, lastly the weights are adjusted.

More recently, the fusion of different techniques to fully automate the process of choosing a proper model structure and hyperparameters, adjusting the weights, etc., has given birth to a new field known as AutoML [82, 83, 84]. AutoML tools can be based on EMs, but also on Bayesian techniques and other optimization algorithms. Existing AutoML tools are mostly aimed to aid in the design of MLPs and CNNs [85, 86], maybe the two most popular kinds of ANNs. Essentially, these tools have a limited set of cells or blocks that they can combine to define the ANN topology. Depending on the task at hand a specific strategy is followed to choose these blocks. For instance, the AutoKeras tool [87] defines tasks that allow the automated design of ANNs for image, text and structured data classification. The ANN architecture is made up of predefined blocks such as ImageBlock, TextBlock or StructuredDataBlock, with some adjustable parameters. As a consequence, these AutoML tools have a coarse granularity and only can be used to build some specific types of ANNs, mainly supervised learners.

By contrast, in the present proposal EMs are used to evolve the architecture of AEs, a kind of ANN with some specific aspects such as its symmetric layout or its unsupervised learning approach. The AE layout is generated with a fine granularity, as described below, instead of using predefined blocks, and aiming to produce an ANN that generates a manifold of the data rather than a prediction model such as AutoKeras does. The weights are learned through the usual back-propagation algorithm. Therefore, the proposal is not a neuroevolution method since weights will still be learn through the usual gradient descent algorithm. The EvoAAA procedure proposed here could be a piece in an AutoML chain.

The rest of this paper is structured as follows: in Section 2 the different aspects to optimize while building an AE are outlined and the proposed architecture search methodology is presented. Section 3 describes the experimental framework, provides the analysis of performance results and discuss them. Lastly, conclusions are drawn in Section 4.

2. Autoencoder architecture search with EvoAAA

This section presents the proposal to face the problem formulated in Subsection 1.1, outlining the aspects that have to be taken into account while building an AE and detailing the methodology to follow.

2.1 Components to be optimized during AE construction

The process to build a new AE, adjusted to produce a good representation from input patterns, implies several steps. Each one is linked to the optimization of a certain component. A summary of the procedure, components and methods is shown in Fig. 2. The components are tuned in the following order:

1.
Architecture: The structure of the AE has to be set, deciding the amount of layers it will be made of, the number of units per layer, which activation functions will be used in each unit, etc. For years, this has been manually done by experts. The conventional trial and error procedure is also a frequent approach, readjusting the AE architecture after the three building steps have been completed and the AE performance evaluated. During this step specific restrictions, as the symmetrical structure of the network, have to be taken into account.
2.
Hyper-parameters: Having decided on the design of the AE, the following step would be choosing the proper values for several hyperparameters. These are not part of the network structure nor are they part of the weights configuration. They are in charge of controlling the tuning of weights during the training process. The most common hyperparameters are the learning rate, the batch size and the number of epochs the network is trained. Although the values could be manually picked, usually a grid search algorithm is used. Once more, trial and error using part of the training patterns as validation, allows to find the best configuration.
3.
Weights: Once the architecture of the network and its hyperparameters have been set, the last action would be adjusting the weights that connect every unit in each layer with all the units in the following one. Unlike what happens for finding a proper network structure and good hyperparameters, there is a solid mathematical background related to how these weights should be tuned. Derivatives allow to know the contribution of each connection to the global committed error, so that small adjustments can be made in the correct direction. This is the foundation of the well-known gradient descent method.

Figure 3.
Before an AE can be exploited each architecture has to be tested with multiple hyperparameters combinations, and the weights of everyone of these configurations have to be adjusted.

Figure 4.
Chromosome genes, name and interval of values they can get.

The three previous steps are iterative in a nested way, as shown in Fig. 3. Usually several different architectures will be tested. For each architecture, the first step (noted as 1 inside a circle) would be getting different sets of hyperparameters to be tried. Adjusting the weights of each configuration, a procedure that is iterative by itself, would be the following step. The third stage, after an architecture has been fixed, a set of hyperparameters is chosen and the weights are adjusted, is evaluating the model. The output of this action would lead us to step 4 or step 6, depending on whether a certain criterion is satisfied or not. In the first case additional sets of hyperparameters, generally obtained by grid search, would be used. If all potential combinations have been tried, a step backward (noted as 5) would be altering the AE architecture. Step 6 marks the end of the process, having an AE ready to be exploited in the system.
2.2 The EvoAAA proposal

As previously stated, there are well-known procedures for steps 2 and 3 (see Fig. 2) of the optimization process: grid search for finding the hyperparameters (step 2) and gradient descent for weight adjusting (step 3). On the contrary, finding a proper AE architecture is still an open problem.

The experience acquired while working with a certain data set could not be applicable when data change occurs. The amount of combinations is potentially infinite, so automating this process by means of exhaustive search is unfeasible. Most experts follow some heuristics to choose the number of layers and units, depending on the quantity of variables they have to deal with, while aspects as activation functions are statically assigned.

Seeking a way of automating the AEs design process, a suitable architecture could be found through a simple search if the number of combinations is restricted beforehand. It would be similar to the grid search procedure followed for hyperparameters. This approach would allow to choose the best structure among a bag of predesigned ones. However, it does not offer any guarantee over the performance of these predesigned models. Other straightforward search heuristics could be adopted, such as adding/deleting layers/units from a base structure as long as the performance of the AE improves.

The EvoAAA approach proposes to solve the task through EMs, since they have amply demonstrated their ability to solve disparate optimization problems. Specifically, it aims to use population based algorithms to evolve a set of AE architectures over time. To do so, a way to code all possible AE configurations is introduced. It will be the chromosome representation that the EMs will work with.

Table 2
Purpose of each gene and description of their values

Name	Purpose	Values
Type	Sets the type of AE to be used	1) Basic, 2) Denoising, 3) Contractive, 4) Robust, 5) Sparse, 6) Variational
Layers	Number of additional layers in coder/decoder	0) Only a encoding layer, 1–3) Additional layers in both encoder and decoder
Units	Sets the number of units per layer, with $f$ being the amount of features in the dataset	The first integer (gen 3) configures the number of units in the outer layer, while the last one (gen 6) sets the encoding length.
Activation	Activation function to use in each layer, both for the coder and decoder	1) linear, 2) sigmoid, 3) tanh, 4) relu, 5) selu, 6) elu, 7) softplus, 8) softsign
Out act.	Activation function for the output layer	1) linear, 2) relu, 3) elu, 4) softplus
Loss	Loss function to evaluate during fitting	1) Mean squared error, 2) Mean absolute error, 3) Mean absolute percentage error, 4) Binary crossentropy, 5) Cosine proximity

2.2.1 Chromosome representation

Evolutionary algorithms usually work with binary or real-valued genes. A set of genes builds a chromosome or individual of the population. In EvoAAA each chromosome will code the complete architecture of an AE. However, an integer gene representation is used rather than binary or real-valued genes.

The chromosome will be made up of 15 genes, as shown in Fig. 4. The number of each gene is shown above, their names inside and just below the range of values that can be assigned to them. The purpose of each gene, as well as the meaning of its values, are portrayed in Table 2. The main characteristics of this AE encoding are the following:

•
Different types of representations can be learned with AEs depending on the imposed restrictions. Some of these restrictions are linked to the type of AE [27], designed to induce sparsity, learn from noisy samples, etc. The first gen in the chromosome allows to choose among six different AE types (see details in Table 2).
•
The AEs will have an encoding layer and up to six additional hidden layers that have to be taken in pairs: 2, 4 or 6. The value of the second gen, between 0 and 3, indicates the number of pairs of hidden layers. Therefore, the simplest AE would have only 3 layers, the input one, the encoding one and output one, while the most complex would be made up of 9 layers in total.
•
Genes 3 to 6 state the number of units to have in the hidden layers. The last value is associated to the innermost layer, so it sets the encoding length. The other three values are linked to each layer pair, from outer to inner. The number of features in the data, noted as $f$ , will limit the amount of units in any layer. An additional restriction is that inner layers cannot be larger than outer ones.
•
With the exception of the input layer, which is limited to transferring values to the next one, all other layers in the AE use an activation function. Although it is common for all units of an AE to have the same activation function, there is nothing that restricts the use of different functions. The proposed encoding uses 7 genes (7 to 13) to choose the activation functions to be applied in each inner layer, allowing eight different options (see Table 2). The output layer is treated independently with gene 14, since only activation functions producing positive output are allowed.
•
The last gene in the chromosome can take values from 1 to 5, stating the loss function to be internally evaluated while training the AE. The meaning of these values is provided in Table 2.

Through the restrictions imposed in genes 3 to 6, the resulting AEs would be always undercomplete [27]. This means that the learned representation will be more compact than the original one, with a vector containing less values. In addition, the loss function evaluated while adjusting the weights will exclusively focus on reconstruction error. AEs can be trained to improve class separability, reduce the data complexity and other goals. In this case the main objective is to have a reconstruction of data patterns as good as possible. This will force the inner layer to concentrate as much useful, general purpose information as possible, so that the AE can be later used in any kind of task.
2.2.2 Autoencoder complexity penalization

When designing an AE for learning new representations the main interest will be in the raw performance. In an AE the performance is usually measured as the error produced while reconstructing data patterns from the learned encoding. Nonetheless, the time needed to obtain the encoding is also a factor to consider. This is the motivation to include a penalization factor in EvoAAA: $\alpha$ . It will be used in the fitness function of the evolutionary method in order to assess the goodness of each AE configuration.

The fitness function, that will decide the quality of the solutions, will be computed as shown in Eq. (1), where trainloss is the reconstruction loss produced by the AE with training data, Layers is the number of additional hidden layers (gen 2), Units encoding is the size of the encoding layer (gen 6), and $\alpha$ is a coefficient setting the level of penalization applied according to the complexity of the AE.

$\displaystyle\textit{fitness }=\textit{train loss}+\alpha(\textit{Layers}% \times\textit{Units encoding})$ (1)

Therefore, AEs having a similar reconstruction performance but simpler architecture (see Subsection 3.5.9 for additional details) will be preferred over the more complex ones. The bias to obtain AEs with less layers and a shorter encoding is modulated with the $\alpha$ value. This is the only parameter needed by EvoAAA.

2.2.3 Search space

The reason for using a complex evolutionary algorithm to find a proper AE architecture is that the search space is huge. So, a strategy is needed to pick a good solution without having to explore much of this space. But, how large is the search space assuming the chromosome previously described?

Excluding genes 3–6, whose values would vary depending on the number of features in the dataset, there are more than a billion combinations see Eq. (2.2.3).

$\displaystyle\textit{Type}\times\textit{Layers}\times\textit{Act}^{7}\times% \textit{Act out}\times\textit{Loss}=$

(2) $\displaystyle\quad 6\times 4\times 8^{7}\times 4\times 5=1∼{}006∼{}632∼{}960$

Working with very small datasets, those having a few dozens of attributes only, the amount of combinations will grow to several billions. We would face trillions of solutions or even more for high-dimensional datasets. Evaluating all those solutions to find the best one is currently unfeasible. As a result, looking for an optimal AE architecture would not be always possible by brute force. However, we could find good enough solutions through optimization mechanisms based on evolutionary strategies.
3. Algorithmic framework of EvoAAA

The EvoAAA methodology is a general evolutionary proposal to search good AE architectures. It can be instantiated using any population based evolutionary algorithm. In this study three specific instances of EvoAAA are tested, one with a genetic algorithm as underlying search method, another one relying on differential evolution and the third one using evolution strategy. The three of them will be compared against each other and also versus two baseline search methods, a simple exhaustive search and a random search. Nine different data sets will be used in the conducted experimentation.

3.1 Evolutionary search algorithms

We propose instantiating EvoAAA three times using three different evolutionary algorithms. All of them will use the former chromosome representation. These three approaches are:

•
Genetic algorithm (GA). A classical genetic algorithm [69], in which a population of individuals evolves through a crossover operator, to give rise to new ones, and to which a mutation operator is applied with a certain probability.
•
Evolution strategy (ES). An aggressive solution-seeking algorithm [71], working with a few individuals who give rise to new ones exclusively through mutation.
•
Differential evolution (DE). A population-based optimization algorithm [70, 88] in which new individuals (agents) are produced from the differences between two random agents with respect to another taken as reference.

Table 3
Main parameters of the evolutionary algorithms

Parameter GA ES DE

Population size 50 4 150

Iterations 100 500 30

Prob. mutation 1/15 1/15 NA

Prob. cross-over 1.0 NA 0.5

Elitism (individuals) 5 NA NA

Termination cost 0 0 0

Table 3 summarizes the main parameters used to run these methods. For GA and ES, each gene in the chromosome is mutated with a probability of 1/15, a value based on the chromosome length itself. Elitism is used in the GA to preserve the tenth percent of individuals having better fitness. The ES is intrinsically elitist, choosing only the best candidates among the union of new mutated individuals and the old population. For GA, in each iteration the best 5 individuals are chosen and preserved. Then, two random parents are picked up from the remaining individuals by using roulette wheel probability. A multi-point crossover operator is applied, being the crossing points established according to the diagram in Fig. 4. This allows the two individuals acting as parents to interchange several of their genes to produce offspring. This way, new individuals are produced until the population size is met, replacing all the old individuals. Lastly, the mutation operator is used with some individuals according to the probability previously stated. As can be seen, this is an aggressive scheme that maximizes the exploration in such a huge search space. For DE, the DE/local-to-best/1 optimization approach is followed. The population size is obtained from the representation length (15 genes $\times$ 10). Cross-over probability, as well as other specific parameters, has been set following the recommendations in [70, 89]. The number of iterations for each method has been adjusted so that a similar amount of evaluations is made.
3.2 Data sets

Table 4
Datasets used in the experimental study

Dataset	Type	Variables	Instances	Source
Cifar10	Integer	1 024	60 000	[90]
Delicious	Binary	983	16 105	[91]
Fashion	Integer	784	70 000	[92]
Glass	Real	9	214	[93]
Ionosphere	Real	34	351	[94]
Mnist	Integer	784	70 000	[95]
Semeion	Binary	256	1 593	[96]
Sonar	Real	60	208	[97]
Spect	Binary	22	267	[98]

In order to compare the performance of the three instances of EvoAAA along with the exhaustive and random search approaches, nine data sets are used. Their main traits are detailed in Table 4. The last column provides the origin reference for each one of them. The criteria followed to choose them have been:

•

Attribute type: An AE is a specialized ANN that, through a series of computations, reconstructs the input values. These have to be numeric, and three cases are considered: real values, integer values, and binary values. The goal is to evaluate how the AEs performance changes depending on the type of attributes they have to reconstruct.

•

Number of attributes: Half of the data sets have several hundreds of attributes, even more than a thousand in the case of cifar10. The other five are not that large, with only a handful of variables. This way the behavior of found AEs while working with a few versus a lot of variables will be analyzed.

Since an AE is an unsupervised representation learning method, class attributes have been removed for all the data sets. The number of variables indicated in Table 4 is the effective amount of them being processed.

3.3 Restrictions and evaluation

Five AE architectures will be obtained for each data set, EvoAAA-Gen: the instance based on a genetic algorithm, EvoAAA-Evo: the instance using evolutionary strategy as underlying search method, EvoAAA-Dif: the one based on differential evolution, and the two produced by the exhaustive (Exh) as well as random search (Ran). The same restrictions and evaluation procedure are applicable to all of them:

•
Performance evaluation: The common mean squared error metric is used to assess the performance of the AEs. This is an unbounded measure, which depends on the original range of values. The lower the MSE the better performance the AE has. This value is the trainloss factor in Eq. (1).
•
Termination cost: As shown in Table 3, the three evolutionary algorithms have been configured with 0 as termination cost. This means that the search will stop if a configuration returning $\textit{MSE}=$ 0 is found, but not before unless the maximum runtime or maximum number of iterations are reached.
•
Maximum runtime: The five search approaches will be limited to running for 24 hours. After that the best solution found until then is returned as preferred AE structure. In practice, this limitation will impede the exhaustive search from going through all the possible AE configurations, limiting as well the amount of combinations tested by the random search.

Aside from the MSE, the amount of architectures tested by each approach is also recorded during execution, as well as each individual solution. The number of combinations is limited by both the maximum runtime and the population size and iterations of the evolutionary methods.
3.4 Experiments

Table 5
Summary of results. For each combination dataset-optimization strategy the amount of evaluated individuals, number of layers and encoding length of the AE, its complexity and MSE (including the penalization factor) are provided

Dataset	Method	Individuals	Layers	Coding length	Complexity	Error (MSE) $\downarrow$
Cifar10	EvoAAA-Dif	2 537	1	38	0.004	0.0137
	EvoAAA-Evo	4 001	1	52	0.005	0.0133
	Exh	1 739	1	1	0.000	0.0395
	EvoAAA-Gen	1 115	1	39	0.004	0.0131
	Ran	2 101	3	5	0.002	0.0260
Delicious	EvoAAA-Dif	4 650	2	19	0.004	0.0124
	EvoAAA-Evo	4 001	1	39	0.004	0.0119
	Exh	5 837	1	1	0.000	0.0165
	EvoAAA-Gen	1 401	1	32	0.003	0.0102
	Ran	5 901	2	10	0.002	0.0134
Fashion	EvoAAA-Dif	2 604	1	256	0.026	444.8169
	EvoAAA-Evo	755	3	61	0.012	565.4571
	Exh	1 494	1	1	0.000	4 180.2370
	EvoAAA-Gen	439	5	156	0.047	1 881.0680
	Ran	1 501	2	35	0.007	782.3572
Glass	EvoAAA-Dif	4 650	2	3	0.001	24.8785
	EvoAAA-Evo	4 001	3	3	0.001	5.4093
	Exh	35 368	1	1	0.000	29.2589
	EvoAAA-Gen	4 505	1	5	0.000	0.4473
	Ran	35 301	3	7	0.002	1.2705
Ionosphere	EvoAAA-Dif	4 650	2	23	0.005	0.0740
	EvoAAA-Evo	4 001	3	15	0.003	0.0931
	Exh	29 905	1	1	0.000	0.2099
	EvoAAA-Gen	2 302	3	15	0.003	0.0917
	Ran	29 901	3	23	0.007	0.1049
Mnist	EvoAAA-Dif	2 754	1	162	0.016	192.5133
	EvoAAA-Evo	732	3	332	0.066	431.6960
	Exh	1 491	1	1	0.000	4 104.1400
	EvoAAA-Gen	403	1	156	0.016	254.7397
	Ran	1 401	2	35	0.007	611.5036
Semeion	EvoAAA-Dif	4 650	1	136	0.014	0.0459
	EvoAAA-Evo	4 001	1	134	0.013	0.0355
	Exh	19 861	1	1	0.000	0.1940
	EvoAAA-Gen	4 505	1	110	0.011	0.0376
	Ran	19 901	2	132	0.026	0.0604
Sonar	EvoAAA-Dif	4 650	1	29	0.003	0.0142
	EvoAAA-Evo	4 001	3	17	0.003	0.0162
	Exh	34 557	1	1	0.000	0.0462
	EvoAAA-Gen	4 505	3	7	0.001	0.0139
	Ran	35 301	4	8	0.003	0.0145
Spect	EvoAAA-Dif	4 650	2	13	0.003	0.0959
	EvoAAA-Evo	4 001	3	14	0.003	0.0834
	Exh	30 740	1	1	0.000	0.1829
	EvoAAA-Gen	4 505	5	15	0.004	0.0703
	Ran	30 701	3	14	0.004	0.1145

The 45 experiments1 (9 data sets times 5 search approaches) have been executed in the same hardware, a PC having 64 GB RAM with an NVidia RTX-2080 GPU, and using the following software configuration: Arch Linux [99], CUDA 10.1 [100], cudnn 7.6 [101], Tensorflow 1.14 [102], Keras 2.24 [103] and ruta 1.1 [104]. The RMSProp [105] optimizer of Tensorflow has been used. Since it relies on an adaptive learning rate, optimizing this hyper-parameter is not necessary. The default batch size value in Keras has been chosen, 32. Lastly, the training of the AEs is made in 20 epochs. Although a higher number of epochs, such as 100 or even more, is usual while trying to perform a final optimization of a representation, in this case the main interest is in comparing the search procedures rather than in obtaining fully optimized AEs. Reducing the number of epochs allows to try more architectures in a certain time interval. Regarding how the data instances are used, 20% of them are reserved from the beginning to assess AE performance, computing the MSE. The remainder 80% patterns are given to Keras to train each AE architecture. All the informed results correspond to test data, the initially reserved 20% of instances.

A summary of results is provided in Table 5. For each data set the columns indicate its name, the search method, the amount of individuals (AE configurations) tested, the number of hidden layers and encoding length of the best individual, its complexity (assuming that $\alpha=$ 0.0001) and the MSE obtained with test data. For these last two columns the lower the value the better the AE would be, having less complexity and superior reconstruction performance. The best configuration (lower MSE) for each data set has been highlighted in bold. These values include the complexity penalization, real best MSEs are provided in Table 6. Average error values, computed from all the evaluated solutions, and best values are presented in it.

The number of tested individuals is limited by the population size and iterations, as well as the maximum runtime, for the three EvoAAA instances. In addition all the invalid configurations produced during the exploration are discarded before they enter the training and testing phase. The exhaustive and random strategies only generate valid AE architectures, and the only limitation is the running time.

Table 6

Average (top-half) and best (bottom-half) performance (raw MSE, lower values are better) by data set and search strategy. Values of the best performer algorithm for each dataset are highlighted in bold

	Cifar10	Delicious	Fashion	Glass	Ionosphere	Mnist	Semeion	Sonar	Spect
EvoAAA-Dif (Avg.)	0.0256	0.0186	8979.9337	604.3130	0.2719	5073.9444	0.2348	0.0404	0.2575
EvoAAA-Evo (Avg.)	0.0085	0.0086	891.1958	562.3818	0.1152	616.6640	0.0322	0.0223	0.1145
Exh (Avg.)	0.0498	0.0195	18734.2379	611.4017	0.3687	9646.3325	0.2836	0.1331	0.3213
EvoAAA-Gen (Avg.)	0.0097	0.0075	2852.7007	345.4758	0.1227	3323.4105	0.0405	0.0157	0.1006
Rnd (Avg.)	0.0461	0.0206	8557.0936	606.7591	0.3303	4997.0538	0.2886	0.0761	0.2910
EvoAAA-Dif (Best)	0.0020	0.0131	444.7913	563.1547	0.0694	192.4971	0.0323	0.0106	0.0924
EvoAAA-Evo (Best)	0.0080	0.0080	565.4449	316.0380	0.0963	431.6296	0.0221	0.0128	0.0806
Exh (Best)	0.0394	0.0164	4180.2370	576.7005	0.2098	4104.1400	0.1939	0.0461	0.1828
EvoAAA-Gen (Best)	0.0090	0.0055	1881.0210	0.4468	0.0872	254.7241	0.0266	0.0115	0.0658
Rnd (Best)	0.0047	0.0021	782.3502	564.4037	0.0980	564.2114	0.0339	0.0114	0.1098

Table 7

Performance ranking of the tested methods. For each dataset the performance of the methods is ranked and indicated between 1 (best peformer) and 5 (worst one). The last row provides the average ranking for each algorithm

Dataset	EvoAAA-Dif	EvoAAA-Evo	Exh	EvoAAA-Gen	Ran
Cifar10	1	3	5	4	2
Delicious	4	3	5	2	1
Fashion	1	2	5	4	3
Glass	3	2	5	1	4
Ionosphere	1	3	5	2	4
Mnist	1	3	5	2	4
Semeion	3	1	5	2	4
Sonar	1	4	5	3	2
Spect	3	2	5	1	4
Average	2.00	2.56	5.00	2.33	3.11

3.5 Results and discussion

The results shown in Table 5 can be analyzed from several perspectives, raw performance: the lower MSE the better; size of explored space: although the more inspected individuals could be considered the better, the ability to find good solutions exploring less space has to be also taken into account, and solution complexity: the fewer layers and shorter encoding length the better. In addition, other aspects such as running times, convergence, etc., can be studied.

3.5.1 Analysis of performance

It is easy to see that the highest error values always correspond to the exhaustive search approach. MSE is unbounded and depends on the original attribute values. So, it does not allow to make comparisons among different data, but only between the five methods for each data set. In general, the error committed by Exh is one or two orders of magnitude above EvoAAA-Dif, EvoAAA-Gen and EvoAAA-Evo.

Comparing the three EvoAAA instances, EvoAAA-Gen is the best performer in 5 out of 9 datasets, with EvoAAA-Dif in a close second position gaining 3 out of 9 cases although the differences against EvoAAA-Evo are almost negligible in some cases. The biggest difference can be found with the fashion dataset, where the EvoAAA-Dif approach achieves one-quarter of the error shown by EvoAAA-Gen while using a simpler architecture. This analysis is made taking into account the complexity of AEs, i.e. the MSE is increased by the penalization factor.

In addition to best values, it would also be interesting to analyze the average behavior of the search strategies. For doing so, Table 6 shows for each data set (columns) and method (rows) the average and minimum (best) MSE2 achieved in each case. It can be observed that the best MSE obtained by the exhaustive and random search is far from the average MSE of EvoAAA-Dif, EvoAAA-Gen and EvoAAA-Evo. This leads to the conclusion that even a few iterations with EvoAAA would achieve a better result than 24 h of exhaustive or random search. Internal differences between best and average values for each search method tend to be minimal in most cases (cifar10, delicious, ionosphere, semeion and sonar). In general, these differences seem lower in the case of EvoAAA-Evo than with EvoAAA-Gen or EvoAAA-Dif. For instance, average and best values for fashion, glass and mnist are closer in the former case than in the latter. This suggests that EvoAAA-Evo would be preferable if only a few iterations are affordable.

If we strictly focus on the best values achieved by each approach, these returned at the end of all iterations without complexity penalization, EvoAAA-Dif stands out over the other algorithms. It has 5 best values, against 2 for EvoAAA-Gen, 1 for EvoAAA-Evo, and 1 for the random search. To assess the overall performance of each method a ranking is provided in Table 7, with the last row showing the average rank. As can be seen, EvoAAA-Dif is the best performer, closely followed by EvoAAA-Gen and EvoAAA-Evo. The rank difference between random search and exhaustive search is considerable. The bias in the exhaustive search, which tries consecutive configurations until the runtime is over, implies less opportunities to explore the solution space than the random search.

By applying a Friedman statistical test over the best MSE values (bottom half of Table 6) corresponding to each method/dataset pair a $\textit{p-value}=$ 0.0004248178 was obtained. This means that there are statistically significant differences among the evaluated optimization strategies. To elucidate among which methods there are significant differences a post-hoc procedure based on Friedman test is conducted, obtaining the exact p-values provided in Table 8. This information is summarized in the critical distance plot shown in Fig. 5. As can be observed, there is a significant difference between exhaustive search and the evolutionary-based methods. However, from an statistical point of view, differential evolution, evolutionary strategy and genetic algorithm are equivalent, since their performance is quite similar. Notwithstanding, both the raw p-values and the critical distance plot confirm that in average the EvoAAA-Dif alternative is ahead from the other options, achieving better AE configurations.

Table 8
Raw $p$ A-values obtained by the post-hoc procedure based on Friedman’s test

	DE	ES	EX	GA	RN
DE	–	0.4561	0.0001	0.6547	0.1360
ES	0.4561	–	0.0010	0.7656	0.4561
EX	0.0001	0.0010	–	0.0003	0.0113
GA	0.6547	0.7656	0.0003	–	0.2967
RN	0.1360	0.4561	0.0113	0.2967	–

Figure 5.

Critical distance plot comparing the five optimization methods.

3.5.2 Performance comparison against PCA/kPCA

Table 9
Performance comparison (MSE) between the best and worst AE configuration against PCA and kPCA

	AE configurations
Dataset	Best		Worst		PCA	kPCA
Cifar10	0.0020	(Dif)	0.0394	(Exh)	1.5792	0.7905
Delicious	0.0021	(Rnd)	0.0164	(Exh)	1.0438	0.5927
Fashion	444.7913	(Dif)	4180.2370	(Exh)	13418.6200	13338.9400
Glass	0.4458	(Gen)	576.7005	(Exh)	618.0143	614.9599
Ionosphere	0.0694	(Dif)	0.2098	(Exh)	1.4472	3.4753
Mnist	192.4971	(Dif)	4104.1400	(Exh)	11690.6000	7311.2490
Semeion	0.0221	(Evo)	0.1939	(Exh)	1.3263	0.3480
Sonar	0.0106	(Dif)	0.0461	(Exh)	1.2088	0.8512
Spect	0.0658	(Gen)	0.1828	(Exh)	1.3208	3.4119

Besides comparing how the AE configurations produced by each optimization method perform against each other, it would also be of interest to know how they compare with respect to classical dimensionality reduction methods. The best known algorithm in the latter group is PCA [39], a statistical method able to transform the original set of variables into a reduced set of linearly uncorrelated components. kPCA (Kernel PCA) [106] is a more powerful version of PCA, allowing non-linear transformations.

To apply PCA the stats R package has been used. Original values are centered and scaled, as recommended, and a predefined number of principal components is not set.3 Kernel PCA is applied with the implementation provided by the kernlab R package. The default Radial Basis Function kernel is used, and the number of principal components to return is set to the amount of dimensions in the original data.

In order to apply PCA and kPCA each dataset has been split into training and test partitions following a 10 folds cross validation scheme, obtaining an average MSE for each case. These values are shown in Table 9 along with the results of the best and worst AE configurations (taken from Table 6), so that they can be easily analyzed.

As can be observed, even the worst AE configurations produced by the exhaustive search are able to reconstruct not seen data far better than PCA. In general, the differences between reconstruction error AE vs PCA are remarkable. This result is consistent with previously published studies [107, 31]. Reconstruction from the results returned by kPCA shows slight improvements over PCA in most cases, but they are not close to that produced by AEs.

3.5.3 Analysis of explored space

As can be seen in Table 5, in general the exhaustive and random search approaches explore a much larger space of solutions than the evolutionary methods. In some cases, such as glass, ionosphere, semeion, sonar and spect, these approaches examine up to 15 times more configurations than EvoAAA. This is due to that these search algorithms devote all their time in analyzing candidate solutions while the evolutionary methods have other tasks to do, such as individuals selection, crossing, mutation, etc. However, neither random nor exhaustive search never achieve the best performance when AE complexity is taken into account. On the contrary, the MSE for these methods is always higher. Exhaustive search gets stuck in simpler architectures, as stated by its complexity values, despite its usually longer run times (analyzed later) that always go to 24 hours. This behavior is due to the way the exhaustive search has been implemented, trying all possible solutions allowed by the representation in Fig. 4 from right to left as is usually done when an interval of values is going to be traversed from beginning to end. Random search has the ability to explore all the solution space, but it is a non-guided approach. In a way the strategies followed by the exhaustive and random search are antagonistic. The former chooses to exploit the local space, trying all the possible solutions of a very reduced area. The second one jumps all over the space of solutions, without taking advantage of past configurations to exploit the locality of the most promising solutions.

The amount of solutions explored by the EvoAAA-Dif, EvoAAA-Gen and EvoAAA-Evo methods is quite similar while working with small datasets, such as glass, semeion, sonar or spect. By contrast, the EvoAAA-Evo and EvoAAA-Dif approaches are able to inspect more candidate structures when larger datasets are used. This could be due to the simpler procedure of EvoAAA-Evo to produce its offspring with respect to EvoAAA-Gen, since crossing is not necessary and the population size is smaller, and to the lower number of iterations conducted by EvoAAA-Dif with respect to EvoAAA-Gen.

3.5.4 Analysis of solution complexity

Another fact to take into account while comparing the different search procedures is the complexity of found architectures. Theoretically, simpler architectures achieving a similar performance would be preferable to more complex ones.

As can be stated by looking at the sixth column in Table 5, the lowest complexity is always that of the Exh approach. As said before, the exhaustive search is stuck in small architectures with thousands of combinations for activation functions and loss functions (see Fig. 4). However, these are not the best solutions as the MSE values demonstrate.

As would be expected, the random search also produces disparate AE configurations. Sometimes are simpler than those produced by the EvoAAA methods and sometimes are more complex.

Despite some exception, such as fashion and spect, EvoAAA-Gen usually produces simpler AEs with fewer layers and a more compact encoding than EvoAAA-Dif and EvoAAA-Evo. Therefore, at first sight the EvoAAA-Gen search seems to be the best choice, as it provides simpler AEs with better reconstruction capability. However, this comes with a cost as explained below.

3.5.5 Analysis of running times

The running times recorded during the experiments for each configuration are provided in Table 10. As in Table 5, for each data set there are five columns corresponding to the five search approaches. Analyzing this information the following conclusions can be drawn:

Table 10
Running times for each algorithm/dataset until reaching the stopping criteria

	Running times $\downarrow$
Dataset	EvoAAA-Dif	EvoAAA-Evo	Exh	EvoAAA-Gen	Rnd
Cifar10	24 h	24 h	24 h	24 h	24 h
Delicious	18 h 4 m.	17 h 54 m.	24 h	6 h 51 m.	24 h
Fashion	24 h	24 h	24 h	24 h	24 h
Glass	1 h 41 m.	1 h 20 m.	24 h	2 h 20 m.	24 h
Ionosphere	1 h 33 m.	0 h 44 m.	24 h	1 h 59 m.	24 h
Mnist	24 h	24 h	24 h	24 h	24 h
Semeion	2 h 28 m.	1 h 32 m.	24 h	5 h 0 m.	24 h
Sonar	1 h 24 m.	1 h 10 m.	24 h	4 h 13 m.	24 h
Spect	1 h 31 m.	0 h 59 m.	24 h	5 h 57 m.	24 h

•

The exhaustive and random strategies always reach the limit of 24 hours without being able to examine enough configurations to find a good solution, although the random approach is close to the evolutionary methods in some cases.

•

For the larger data sets (cifar10, fashion and mnist) none of the five approaches finish the search process, running out of time.

•

In general, EvoAAA-Dif and EvoAAA-Evo only need a fraction of the time used by EvoAAA-Gen to find AE architectures slightly more complex but with a similar performance.

From this analysis a simple guideline can be drawn, choose the EvoAAA-Dif instance if performance is the main and only goal, but consider the EvoAAA-Evo instance if running time is important while sacrificing a bit of reconstruction power. For large data sets the latter approach can save many hours in obtaining a good-enough AE architecture.

3.5.6 Solutions explored over time

The following aspect to be analyzed is how each approach explores the solution space. For doing so, the plots in Fig. 6 to Fig. 13 show the MSE (Y axis) of the solutions examined through time (X axis).4 Since not all methods take the same running time, there are differences among the X axis for a given data set.