Abstract
Machine learning models work better when curated features are provided to them. Feature engineering methods have been usually used as a preprocessing step to obtain or build a proper feature set. In late years, autoencoders (a specific type of symmetrical neural network) have been widely used to perform representation learning, proving their competitiveness against classical feature engineering algorithms. The main obstacle in the use of autoencoders is finding a good architecture, a process that most experts confront manually. An automated autoencoder symmetrical architecture search procedure, based on evolutionary methods, is proposed in this paper. The methodology is tested against nine heterogeneous data sets. The obtained results show the ability of this approach to find better architectures, able to concentrate most of the useful information in a minimized encoding, in a reduced time.
Introduction
Intelligent appliances based on machine learning systems [1] can be found in many everyday tasks. They are in charge of filtering spam email [2], detecting fraudulent transactions [3], recommending new products to buyers [4] and many other jobs with disparate complexity degrees [5, 6, 7, 8, 9, 10, 11]. To do that, ML methods need to extract knowledge from raw data. The usefulness of that knowledge mostly depends on the quality of data features. The goal is to produce descriptive and/or predictive models, depending on the task at hand.
Choosing a curated set of attributes, or building a new one from the original features, tends to produce better results [12] than using raw variables, although performance of ML methods could be also affected by learning metrics [13]. Hence the interest in feature engineering techniques in late years, including well-known preprocessing procedures [14] such as feature selection [15, 16] and feature extraction [17]. Representation learning (REPL) [18, 19] is a term tightly linked to perform feature engineering relying on deep learning techniques [20, 21].
Deep learning algorithms have a plethora of applications [22, 23]. Autoencoders [24, 25, 26] are a modern general-purpose DL-based family of tools for facing REPL. An AE is an unsupervised symmetric neural network [27] aimed to build an encoding that maximizes the reconstruction of data patterns. The obtained representation can be applied to many different tasks [28], including visualization, anomaly detection, hashing and noise removing. However, finding the proper AE architecture for each data set and function is not a trivial process. Usually, it is a challenge that experts have to deal with.
In this study EvoAAA, an automated methodology for designing AE architectures maximizing their reconstruction power when used with a specific data set, is proposed. The job is confronted as a hard optimization problem [29], unfeasible to solve by means of exhaustive search. Our hypothesis is that evolutionary methods [30] would be able to find near-optimal AE architectures in a reasonable time. It is a first approach to perform NAS specifically for AEs.
To verify the competitiveness of EvoAAA a thorough experimentation, including nine data sets and five search methods, three of them based on evolutionary optimization, is conducted. The architectures found through this approach are way better than those retrieved by exhaustive search.
Summary of terminology and acronyms
Summary of terminology and acronyms
That AEs are effective tools for REPL is a known fact [24, 25, 26], having proved their superiority against classical feature engineering methods such as PCA, LDA, ISOMAP and LLE [31]. They are also a tool closely related to nonstandard learning [32] problems. An AE is a symmetrical ANN [27], so it shares many of the characteristics of any ANN.
Training an ANN implies adjusting the weights that connect their neurons, using the backpropagation method [33] and any variation of the gradient descent algorithm such as SGD [34, 35]. The ANN architecture, i.e. number of layers, amount of units per layer, activation functions, etc., is set in advance, prior to the training process.
An inadequate ANN architecture could produce bad output results, regardless of the weights learned through the training process. If the ANN is too simple, adjusting too few parameters, it will be not able to learn. On the contrary, too complex ANNs suffer from overfitting [36] due to existing enough parameters to memorize the whole training data. These same problems also affect other kinds of ANNs [13], including AEs. Achieving a balanced architecture, the point where the ANN extracts enough information from seen data to generalize well while processing future never seen patterns, is still an unsolved problem. As a result, dozens of papers which contribute the design of ANNs to tackle specific problems [37, 38] are published every year.
The interest is in choosing a proper AE architecture to process a specific data set, so that the AE is able to learn an optimum representation of the data. Until now the design of AEs has been in charge of human experts. It is not an easy task to automate, since it is a hard combinatorial problem [29]. In fact, finding a good architecture can be seen as an optimization problem. This is a field where EMs [30] have shown their efficacy in the past.
Our proposal is a formulation to code any AE architecture so that it can be evolved by means of evolutionary approaches. The goal is to reduce the reconstruction error as much as possible. This way, the AE encoding will concentrate the maximum general-purpose information, rather than an encoding aimed to improve class separability or any other specific goal. Regarding evolutionary techniques, they will be used as the tool to optimize a set of parameters. The classical terminology in this field, summarized in Table 1 along with the acronyms appearing in the text, will be used.
Literature review
Feature engineering is a manual or automated task aimed to obtain a set of features better than the original one. Feature selection [14] consists in choosing a subset of attributes while maintaining most useful information in the data. It can be manually performed by an expert in the field, but mostly is faced with automated methods based on feature correlation [15] and mutual information [16]. By contrast, feature extraction methods transform the original data features to produce a new, usually reduced, set of attributes. Popular algorithms to do this are PCA [39] and LDA [40], whose mathematical foundations are relatively easy to understand.
More advanced studies work with the hypothesis that the distribution of variables in the original data lies along a lower-dimensional space, usually known as manifold. A manifold space works with the parameters that produce the data points in the original high-dimensional space. Finding this embedded space is the task of manifold learning [41] algorithms. Unlike PCA or LDA, manifold methods apply non-linear transformations, so they fall into the non-linear dimensionality reduction [42] category.
Autoencoders, as detailed in [27], are ANNs having a symmetric architecture, as shown in Fig. 1. The input and output layers have as many units as features there are in the data. Inner layers usually have fewer units, so that a more compact representation of the information hold in the data is produced. The goal is to reconstruct the input patterns into the output as faithfully as possible.
Classic architecture for an AE. Black nodes denote a 2-variable encoding layer. Dark gray nodes are intermediate hidden layers. Light gray ones are the input (left) and output layers.
Although AEs are mainly used to perform feature fusion [27], searching the manifold in which the parameters to rebuild the data are found, they have many other practical applications [28, 43]. A properly configured AE can project data of any dimensionality into 2 or 3 dimensions so that patterns can be graphically visualized [44]. AEs can be used to detect anomalies [45, 46], training them to faithfully reconstruct normal patterns. When anomalous data enters the AE, it produces a high output error denoting that these patterns do not follow the known distribution. Another interesting application of AEs, specifically of denoising AEs, is data noise removing. This kind of AE has been used to successfully denoise images [47] and speech [48]. Usually the loss function of the AE has to be adapted to the specific task to be faced, so that the obtained encoding promotes separability, topology preservation or any other desired characteristic. When only maximum performance is pursued while reconstructing the input patterns, the AE will produce a general purpose encoding in its inner layer.
AEs can be configured with a variable amount of inner layers, each of them having different lengths. The proper architecture will mostly depend on the complexity of the patterns to be reconstructed and the restrictions imposed by the encoding layer. These restrictions prevent the AE from simply copying the input onto the output [27], for instance by reducing the number of neurons in the encoding layer, forcing the output of most neurons to be zero (sparse representation), etc.
Finding the best parameters to tune a machine learning model is an uphill battle. Performing a grid search through an internal validation process is an usual approach. However, it is useful only for limited sets of parameters taking known ranges of values. Disparate search algorithms, metaheuristics [49] and optimization approaches [50], aimed to perform combinatorial optimization, could be applied. These go from relatively simple search methods [51, 52, 53, 54] such as A* or IDA to more advanced and complex approaches such as memetic algorithms [55], including classic means as simulated annealing [56] or tabu search [57]. Evolutionary methods [58, 59] have also been used to face optimization problems [60, 61, 62, 63, 64] for a long time. Many of these algorithms are based on the behavior of certain populations, such as ant colonies [65] and particle swarms [66]. In the following, we focus specifically on approaches based on natural evolution [67].
In evolutionary algorithms [68] the search space is defined by the chromosome. It is made up of several genes, each of them representing a specific trait or dimension. The values taken by the genes in a chromosome are limited, and each combination is a potential solution (a point in the search space). These are also known as individuals. Usually a set of these are randomly generated, producing an initial population. Each individual is assigned a fitness value that represents the goodness of the solution. From this point, a number of iterations are run consisting in the same steps. Firstly, certain individuals of the population are selected for reproduction based on their fitness. Then, a set of operators are applied to selected individuals in order to create new ones. Common operators are crossing, that mixes genes of two individuals to produce a new chromosome, and mutation, which randomly changes the value of one or more genes. Lastly, the whole population is evaluated by computing the new fitness values and the population is updated, usually replacing old individuals with new ones, although the best overall solutions can be kept (elitism) whether they are new or not. Three popular evolutionary methods are genetic algorithms [69], differential evolution [70] and evolution strategy [71]. The first follows the methodology just described of selection, crossing, mutation and evaluation. Although GAs can be seen as an old technique, they are still in wide use [50, 62, 72] due to their simplicity and good performance. The second method produces new individuals from differences between existing ones, while the third one focuses on evolving only a few individuals using the mutation operator as only tool.
Evolutionary methods are also suitable for combinatorial optimization, and they have been used for instance for support vector machines [73] and more recently for deep learning networks [74]. Even though EMs have been already used to optimize ANNs, many of the proposals have been focused on learning the weights linked to each connection. This is known as conventional neuroevolution [75, 76, 77, 78] and its main foundation is to use an EM instead of the traditional gradient descent algorithm to optimize weights. Therefore, a fixed network architecture is the base for all the population. The only difference among individuals is the set of weight matrices connecting each layer, values optimized by the EM. Neuroevolutive algorithms able to also evolve the network topology appeared later [79], being aimed many of them to optimize MLPs, although proposals to adjust other kinds of ANNs also exist. There are different approaches to construct the ANN architecture, being one of the most popular subnetworks composition [80]. A recent survey in that matter can be found in [81].
Components to be optimized during AE construction and the methods used for it. The numbers inside circles indicate the usual optimization order: firstly an architecture is set, then a set of hyperparameters is chosen, lastly the weights are adjusted.
More recently, the fusion of different techniques to fully automate the process of choosing a proper model structure and hyperparameters, adjusting the weights, etc., has given birth to a new field known as AutoML [82, 83, 84]. AutoML tools can be based on EMs, but also on Bayesian techniques and other optimization algorithms. Existing AutoML tools are mostly aimed to aid in the design of MLPs and CNNs [85, 86], maybe the two most popular kinds of ANNs. Essentially, these tools have a limited set of cells or blocks that they can combine to define the ANN topology. Depending on the task at hand a specific strategy is followed to choose these blocks. For instance, the AutoKeras tool [87] defines tasks that allow the automated design of ANNs for image, text and structured data classification. The ANN architecture is made up of predefined blocks such as ImageBlock, TextBlock or StructuredDataBlock, with some adjustable parameters. As a consequence, these AutoML tools have a coarse granularity and only can be used to build some specific types of ANNs, mainly supervised learners.
By contrast, in the present proposal EMs are used to evolve the architecture of AEs, a kind of ANN with some specific aspects such as its symmetric layout or its unsupervised learning approach. The AE layout is generated with a fine granularity, as described below, instead of using predefined blocks, and aiming to produce an ANN that generates a manifold of the data rather than a prediction model such as AutoKeras does. The weights are learned through the usual back-propagation algorithm. Therefore, the proposal is not a neuroevolution method since weights will still be learn through the usual gradient descent algorithm. The EvoAAA procedure proposed here could be a piece in an AutoML chain.
The rest of this paper is structured as follows: in Section 2 the different aspects to optimize while building an AE are outlined and the proposed architecture search methodology is presented. Section 3 describes the experimental framework, provides the analysis of performance results and discuss them. Lastly, conclusions are drawn in Section 4.
This section presents the proposal to face the problem formulated in Subsection 1.1, outlining the aspects that have to be taken into account while building an AE and detailing the methodology to follow.
Components to be optimized during AE construction
The process to build a new AE, adjusted to produce a good representation from input patterns, implies several steps. Each one is linked to the optimization of a certain component. A summary of the procedure, components and methods is shown in Fig. 2. The components are tuned in the following order:
Architecture: The structure of the AE has to be set, deciding the amount of layers it will be made of, the number of units per layer, which activation functions will be used in each unit, etc. For years, this has been manually done by experts. The conventional trial and error procedure is also a frequent approach, readjusting the AE architecture after the three building steps have been completed and the AE performance evaluated. During this step specific restrictions, as the symmetrical structure of the network, have to be taken into account. Hyper-parameters: Having decided on the design of the AE, the following step would be choosing the proper values for several hyperparameters. These are not part of the network structure nor are they part of the weights configuration. They are in charge of controlling the tuning of weights during the training process. The most common hyperparameters are the learning rate, the batch size and the number of epochs the network is trained. Although the values could be manually picked, usually a grid search algorithm is used. Once more, trial and error using part of the training patterns as validation, allows to find the best configuration. Weights: Once the architecture of the network and its hyperparameters have been set, the last action would be adjusting the weights that connect every unit in each layer with all the units in the following one. Unlike what happens for finding a proper network structure and good hyperparameters, there is a solid mathematical background related to how these weights should be tuned. Derivatives allow to know the contribution of each connection to the global committed error, so that small adjustments can be made in the correct direction. This is the foundation of the well-known gradient descent method.
Before an AE can be exploited each architecture has to be tested with multiple hyperparameters combinations, and the weights of everyone of these configurations have to be adjusted.
Chromosome genes, name and interval of values they can get.
The three previous steps are iterative in a nested way, as shown in Fig. 3. Usually several different architectures will be tested. For each architecture, the first step (noted as 1 inside a circle) would be getting different sets of hyperparameters to be tried. Adjusting the weights of each configuration, a procedure that is iterative by itself, would be the following step. The third stage, after an architecture has been fixed, a set of hyperparameters is chosen and the weights are adjusted, is evaluating the model. The output of this action would lead us to step 4 or step 6, depending on whether a certain criterion is satisfied or not. In the first case additional sets of hyperparameters, generally obtained by grid search, would be used. If all potential combinations have been tried, a step backward (noted as 5) would be altering the AE architecture. Step 6 marks the end of the process, having an AE ready to be exploited in the system.
As previously stated, there are well-known procedures for steps 2 and 3 (see Fig. 2) of the optimization process: grid search for finding the hyperparameters (step 2) and gradient descent for weight adjusting (step 3). On the contrary, finding a proper AE architecture is still an open problem.
The experience acquired while working with a certain data set could not be applicable when data change occurs. The amount of combinations is potentially infinite, so automating this process by means of exhaustive search is unfeasible. Most experts follow some heuristics to choose the number of layers and units, depending on the quantity of variables they have to deal with, while aspects as activation functions are statically assigned.
Seeking a way of automating the AEs design process, a suitable architecture could be found through a simple search if the number of combinations is restricted beforehand. It would be similar to the grid search procedure followed for hyperparameters. This approach would allow to choose the best structure among a bag of predesigned ones. However, it does not offer any guarantee over the performance of these predesigned models. Other straightforward search heuristics could be adopted, such as adding/deleting layers/units from a base structure as long as the performance of the AE improves.
The EvoAAA approach proposes to solve the task through EMs, since they have amply demonstrated their ability to solve disparate optimization problems. Specifically, it aims to use population based algorithms to evolve a set of AE architectures over time. To do so, a way to code all possible AE configurations is introduced. It will be the chromosome representation that the EMs will work with.
Purpose of each gene and description of their values
Purpose of each gene and description of their values
Evolutionary algorithms usually work with binary or real-valued genes. A set of genes builds a chromosome or individual of the population. In EvoAAA each chromosome will code the complete architecture of an AE. However, an integer gene representation is used rather than binary or real-valued genes.
The chromosome will be made up of 15 genes, as shown in Fig. 4. The number of each gene is shown above, their names inside and just below the range of values that can be assigned to them. The purpose of each gene, as well as the meaning of its values, are portrayed in Table 2. The main characteristics of this AE encoding are the following:
Different types of representations can be learned with AEs depending on the imposed restrictions. Some of these restrictions are linked to the type of AE [27], designed to induce sparsity, learn from noisy samples, etc. The first gen in the chromosome allows to choose among six different AE types (see details in Table 2). The AEs will have an encoding layer and up to six additional hidden layers that have to be taken in pairs: 2, 4 or 6. The value of the second gen, between 0 and 3, indicates the number of pairs of hidden layers. Therefore, the simplest AE would have only 3 layers, the input one, the encoding one and output one, while the most complex would be made up of 9 layers in total. Genes 3 to 6 state the number of units to have in the hidden layers. The last value is associated to the innermost layer, so it sets the encoding length. The other three values are linked to each layer pair, from outer to inner. The number of features in the data, noted as With the exception of the input layer, which is limited to transferring values to the next one, all other layers in the AE use an activation function. Although it is common for all units of an AE to have the same activation function, there is nothing that restricts the use of different functions. The proposed encoding uses 7 genes (7 to 13) to choose the activation functions to be applied in each inner layer, allowing eight different options (see Table 2). The output layer is treated independently with gene 14, since only activation functions producing positive output are allowed. The last gene in the chromosome can take values from 1 to 5, stating the loss function to be internally evaluated while training the AE. The meaning of these values is provided in Table 2.
Through the restrictions imposed in genes 3 to 6, the resulting AEs would be always undercomplete [27]. This means that the learned representation will be more compact than the original one, with a vector containing less values. In addition, the loss function evaluated while adjusting the weights will exclusively focus on reconstruction error. AEs can be trained to improve class separability, reduce the data complexity and other goals. In this case the main objective is to have a reconstruction of data patterns as good as possible. This will force the inner layer to concentrate as much useful, general purpose information as possible, so that the AE can be later used in any kind of task.
When designing an AE for learning new representations the main interest will be in the raw performance. In an AE the performance is usually measured as the error produced while reconstructing data patterns from the learned encoding. Nonetheless, the time needed to obtain the encoding is also a factor to consider. This is the motivation to include a penalization factor in EvoAAA:
The fitness function, that will decide the quality of the solutions, will be computed as shown in Eq. (1), where trainloss is the reconstruction loss produced by the AE with training data, Layers is the number of additional hidden layers (gen 2), Units encoding is the size of the encoding layer (gen 6), and
Therefore, AEs having a similar reconstruction performance but simpler architecture (see Subsection 3.5.9 for additional details) will be preferred over the more complex ones. The bias to obtain AEs with less layers and a shorter encoding is modulated with the
The reason for using a complex evolutionary algorithm to find a proper AE architecture is that the search space is huge. So, a strategy is needed to pick a good solution without having to explore much of this space. But, how large is the search space assuming the chromosome previously described?
Excluding genes 3–6, whose values would vary depending on the number of features in the dataset, there are more than a billion combinations see Eq. (2.2.3).
Working with very small datasets, those having a few dozens of attributes only, the amount of combinations will grow to several billions. We would face trillions of solutions or even more for high-dimensional datasets. Evaluating all those solutions to find the best one is currently unfeasible. As a result, looking for an optimal AE architecture would not be always possible by brute force. However, we could find good enough solutions through optimization mechanisms based on evolutionary strategies.
The EvoAAA methodology is a general evolutionary proposal to search good AE architectures. It can be instantiated using any population based evolutionary algorithm. In this study three specific instances of EvoAAA are tested, one with a genetic algorithm as underlying search method, another one relying on differential evolution and the third one using evolution strategy. The three of them will be compared against each other and also versus two baseline search methods, a simple exhaustive search and a random search. Nine different data sets will be used in the conducted experimentation.
Evolutionary search algorithms
We propose instantiating EvoAAA three times using three different evolutionary algorithms. All of them will use the former chromosome representation. These three approaches are:
Genetic algorithm (GA). A classical genetic algorithm [69], in which a population of individuals evolves through a crossover operator, to give rise to new ones, and to which a mutation operator is applied with a certain probability. Evolution strategy (ES). An aggressive solution-seeking algorithm [71], working with a few individuals who give rise to new ones exclusively through mutation. Differential evolution (DE). A population-based optimization algorithm [70, 88] in which new individuals (agents) are produced from the differences between two random agents with respect to another taken as reference.
Main parameters of the evolutionary algorithms
Table 3 summarizes the main parameters used to run these methods. For GA and ES, each gene in the chromosome is mutated with a probability of 1/15, a value based on the chromosome length itself. Elitism is used in the GA to preserve the tenth percent of individuals having better fitness. The ES is intrinsically elitist, choosing only the best candidates among the union of new mutated individuals and the old population. For GA, in each iteration the best 5 individuals are chosen and preserved. Then, two random parents are picked up from the remaining individuals by using roulette wheel probability. A multi-point crossover operator is applied, being the crossing points established according to the diagram in Fig. 4. This allows the two individuals acting as parents to interchange several of their genes to produce offspring. This way, new individuals are produced until the population size is met, replacing all the old individuals. Lastly, the mutation operator is used with some individuals according to the probability previously stated. As can be seen, this is an aggressive scheme that maximizes the exploration in such a huge search space. For DE, the DE/local-to-best/1 optimization approach is followed. The population size is obtained from the representation length (15 genes
Datasets used in the experimental study
Datasets used in the experimental study
In order to compare the performance of the three instances of EvoAAA along with the exhaustive and random search approaches, nine data sets are used. Their main traits are detailed in Table 4. The last column provides the origin reference for each one of them. The criteria followed to choose them have been:
Attribute type: An AE is a specialized ANN that, through a series of computations, reconstructs the input values. These have to be numeric, and three cases are considered: real values, integer values, and binary values. The goal is to evaluate how the AEs performance changes depending on the type of attributes they have to reconstruct. Number of attributes: Half of the data sets have several hundreds of attributes, even more than a thousand in the case of cifar10. The other five are not that large, with only a handful of variables. This way the behavior of found AEs while working with a few versus a lot of variables will be analyzed.
Since an AE is an unsupervised representation learning method, class attributes have been removed for all the data sets. The number of variables indicated in Table 4 is the effective amount of them being processed.
Five AE architectures will be obtained for each data set, EvoAAA-Gen: the instance based on a genetic algorithm, EvoAAA-Evo: the instance using evolutionary strategy as underlying search method, EvoAAA-Dif: the one based on differential evolution, and the two produced by the exhaustive (Exh) as well as random search (Ran). The same restrictions and evaluation procedure are applicable to all of them:
Performance evaluation: The common mean squared error metric is used to assess the performance of the AEs. This is an unbounded measure, which depends on the original range of values. The lower the MSE the better performance the AE has. This value is the trainloss factor in Eq. (1). Termination cost: As shown in Table 3, the three evolutionary algorithms have been configured with 0 as termination cost. This means that the search will stop if a configuration returning Maximum runtime: The five search approaches will be limited to running for 24 hours. After that the best solution found until then is returned as preferred AE structure. In practice, this limitation will impede the exhaustive search from going through all the possible AE configurations, limiting as well the amount of combinations tested by the random search.
Aside from the MSE, the amount of architectures tested by each approach is also recorded during execution, as well as each individual solution. The number of combinations is limited by both the maximum runtime and the population size and iterations of the evolutionary methods.
Summary of results. For each combination dataset-optimization strategy the amount of evaluated individuals, number of layers and encoding length of the AE, its complexity and MSE (including the penalization factor) are provided
Summary of results. For each combination dataset-optimization strategy the amount of evaluated individuals, number of layers and encoding length of the AE, its complexity and MSE (including the penalization factor) are provided
The 45 experiments1 (9 data sets times 5 search approaches) have been executed in the same hardware, a PC having 64 GB RAM with an NVidia RTX-2080 GPU, and using the following software configuration: Arch Linux [99], CUDA 10.1 [100], cudnn 7.6 [101], Tensorflow 1.14 [102], Keras 2.24 [103] and ruta 1.1 [104]. The RMSProp [105] optimizer of Tensorflow has been used. Since it relies on an adaptive learning rate, optimizing this hyper-parameter is not necessary. The default batch size value in Keras has been chosen, 32. Lastly, the training of the AEs is made in 20 epochs. Although a higher number of epochs, such as 100 or even more, is usual while trying to perform a final optimization of a representation, in this case the main interest is in comparing the search procedures rather than in obtaining fully optimized AEs. Reducing the number of epochs allows to try more architectures in a certain time interval. Regarding how the data instances are used, 20% of them are reserved from the beginning to assess AE performance, computing the MSE. The remainder 80% patterns are given to Keras to train each AE architecture. All the informed results correspond to test data, the initially reserved 20% of instances.
A summary of results is provided in Table 5. For each data set the columns indicate its name, the search method, the amount of individuals (AE configurations) tested, the number of hidden layers and encoding length of the best individual, its complexity (assuming that
The number of tested individuals is limited by the population size and iterations, as well as the maximum runtime, for the three EvoAAA instances. In addition all the invalid configurations produced during the exploration are discarded before they enter the training and testing phase. The exhaustive and random strategies only generate valid AE architectures, and the only limitation is the running time.
Average (top-half) and best (bottom-half) performance (raw MSE, lower values are better) by data set and search strategy. Values of the best performer algorithm for each dataset are highlighted in bold
Performance ranking of the tested methods. For each dataset the performance of the methods is ranked and indicated between 1 (best peformer) and 5 (worst one). The last row provides the average ranking for each algorithm
The results shown in Table 5 can be analyzed from several perspectives, raw performance: the lower MSE the better; size of explored space: although the more inspected individuals could be considered the better, the ability to find good solutions exploring less space has to be also taken into account, and solution complexity: the fewer layers and shorter encoding length the better. In addition, other aspects such as running times, convergence, etc., can be studied.
Analysis of performance
It is easy to see that the highest error values always correspond to the exhaustive search approach. MSE is unbounded and depends on the original attribute values. So, it does not allow to make comparisons among different data, but only between the five methods for each data set. In general, the error committed by Exh is one or two orders of magnitude above EvoAAA-Dif, EvoAAA-Gen and EvoAAA-Evo.
Comparing the three EvoAAA instances, EvoAAA-Gen is the best performer in 5 out of 9 datasets, with EvoAAA-Dif in a close second position gaining 3 out of 9 cases although the differences against EvoAAA-Evo are almost negligible in some cases. The biggest difference can be found with the fashion dataset, where the EvoAAA-Dif approach achieves one-quarter of the error shown by EvoAAA-Gen while using a simpler architecture. This analysis is made taking into account the complexity of AEs, i.e. the MSE is increased by the penalization factor.
In addition to best values, it would also be interesting to analyze the average behavior of the search strategies. For doing so, Table 6 shows for each data set (columns) and method (rows) the average and minimum (best) MSE2 achieved in each case. It can be observed that the best MSE obtained by the exhaustive and random search is far from the average MSE of EvoAAA-Dif, EvoAAA-Gen and EvoAAA-Evo. This leads to the conclusion that even a few iterations with EvoAAA would achieve a better result than 24 h of exhaustive or random search. Internal differences between best and average values for each search method tend to be minimal in most cases (cifar10, delicious, ionosphere, semeion and sonar). In general, these differences seem lower in the case of EvoAAA-Evo than with EvoAAA-Gen or EvoAAA-Dif. For instance, average and best values for fashion, glass and mnist are closer in the former case than in the latter. This suggests that EvoAAA-Evo would be preferable if only a few iterations are affordable.
If we strictly focus on the best values achieved by each approach, these returned at the end of all iterations without complexity penalization, EvoAAA-Dif stands out over the other algorithms. It has 5 best values, against 2 for EvoAAA-Gen, 1 for EvoAAA-Evo, and 1 for the random search. To assess the overall performance of each method a ranking is provided in Table 7, with the last row showing the average rank. As can be seen, EvoAAA-Dif is the best performer, closely followed by EvoAAA-Gen and EvoAAA-Evo. The rank difference between random search and exhaustive search is considerable. The bias in the exhaustive search, which tries consecutive configurations until the runtime is over, implies less opportunities to explore the solution space than the random search.
By applying a Friedman statistical test over the best MSE values (bottom half of Table 6) corresponding to each method/dataset pair a
Raw
A-values obtained by the post-hoc procedure based on Friedman’s test
Raw
Critical distance plot comparing the five optimization methods.
Performance comparison (MSE) between the best and worst AE configuration against PCA and kPCA
Performance comparison (MSE) between the best and worst AE configuration against PCA and kPCA
Besides comparing how the AE configurations produced by each optimization method perform against each other, it would also be of interest to know how they compare with respect to classical dimensionality reduction methods. The best known algorithm in the latter group is PCA [39], a statistical method able to transform the original set of variables into a reduced set of linearly uncorrelated components. kPCA (Kernel PCA) [106] is a more powerful version of PCA, allowing non-linear transformations.
To apply PCA the stats R package has been used. Original values are centered and scaled, as recommended, and a predefined number of principal components is not set.3 Kernel PCA is applied with the implementation provided by the kernlab R package. The default Radial Basis Function kernel is used, and the number of principal components to return is set to the amount of dimensions in the original data.
In order to apply PCA and kPCA each dataset has been split into training and test partitions following a 10 folds cross validation scheme, obtaining an average MSE for each case. These values are shown in Table 9 along with the results of the best and worst AE configurations (taken from Table 6), so that they can be easily analyzed.
As can be observed, even the worst AE configurations produced by the exhaustive search are able to reconstruct not seen data far better than PCA. In general, the differences between reconstruction error AE vs PCA are remarkable. This result is consistent with previously published studies [107, 31]. Reconstruction from the results returned by kPCA shows slight improvements over PCA in most cases, but they are not close to that produced by AEs.
As can be seen in Table 5, in general the exhaustive and random search approaches explore a much larger space of solutions than the evolutionary methods. In some cases, such as glass, ionosphere, semeion, sonar and spect, these approaches examine up to 15 times more configurations than EvoAAA. This is due to that these search algorithms devote all their time in analyzing candidate solutions while the evolutionary methods have other tasks to do, such as individuals selection, crossing, mutation, etc. However, neither random nor exhaustive search never achieve the best performance when AE complexity is taken into account. On the contrary, the MSE for these methods is always higher. Exhaustive search gets stuck in simpler architectures, as stated by its complexity values, despite its usually longer run times (analyzed later) that always go to 24 hours. This behavior is due to the way the exhaustive search has been implemented, trying all possible solutions allowed by the representation in Fig. 4 from right to left as is usually done when an interval of values is going to be traversed from beginning to end. Random search has the ability to explore all the solution space, but it is a non-guided approach. In a way the strategies followed by the exhaustive and random search are antagonistic. The former chooses to exploit the local space, trying all the possible solutions of a very reduced area. The second one jumps all over the space of solutions, without taking advantage of past configurations to exploit the locality of the most promising solutions.
The amount of solutions explored by the EvoAAA-Dif, EvoAAA-Gen and EvoAAA-Evo methods is quite similar while working with small datasets, such as glass, semeion, sonar or spect. By contrast, the EvoAAA-Evo and EvoAAA-Dif approaches are able to inspect more candidate structures when larger datasets are used. This could be due to the simpler procedure of EvoAAA-Evo to produce its offspring with respect to EvoAAA-Gen, since crossing is not necessary and the population size is smaller, and to the lower number of iterations conducted by EvoAAA-Dif with respect to EvoAAA-Gen.
Analysis of solution complexity
Another fact to take into account while comparing the different search procedures is the complexity of found architectures. Theoretically, simpler architectures achieving a similar performance would be preferable to more complex ones.
As can be stated by looking at the sixth column in Table 5, the lowest complexity is always that of the Exh approach. As said before, the exhaustive search is stuck in small architectures with thousands of combinations for activation functions and loss functions (see Fig. 4). However, these are not the best solutions as the MSE values demonstrate.
As would be expected, the random search also produces disparate AE configurations. Sometimes are simpler than those produced by the EvoAAA methods and sometimes are more complex.
Despite some exception, such as fashion and spect, EvoAAA-Gen usually produces simpler AEs with fewer layers and a more compact encoding than EvoAAA-Dif and EvoAAA-Evo. Therefore, at first sight the EvoAAA-Gen search seems to be the best choice, as it provides simpler AEs with better reconstruction capability. However, this comes with a cost as explained below.
Analysis of running times
The running times recorded during the experiments for each configuration are provided in Table 10. As in Table 5, for each data set there are five columns corresponding to the five search approaches. Analyzing this information the following conclusions can be drawn:
Running times for each algorithm/dataset until reaching the stopping criteria
Running times for each algorithm/dataset until reaching the stopping criteria
The exhaustive and random strategies always reach the limit of 24 hours without being able to examine enough configurations to find a good solution, although the random approach is close to the evolutionary methods in some cases. For the larger data sets (cifar10, fashion and mnist) none of the five approaches finish the search process, running out of time. In general, EvoAAA-Dif and EvoAAA-Evo only need a fraction of the time used by EvoAAA-Gen to find AE architectures slightly more complex but with a similar performance.
From this analysis a simple guideline can be drawn, choose the EvoAAA-Dif instance if performance is the main and only goal, but consider the EvoAAA-Evo instance if running time is important while sacrificing a bit of reconstruction power. For large data sets the latter approach can save many hours in obtaining a good-enough AE architecture.
The following aspect to be analyzed is how each approach explores the solution space. For doing so, the plots in Fig. 6 to Fig. 13 show the MSE (Y axis) of the solutions examined through time (X axis).4 Since not all methods take the same running time, there are differences among the X axis for a given data set.
MSE of the solutions explored by the algorithm through the running time.
As can be seen, the exhaustive search (Fig. 6) keeps a high error rate from start to end. For fashion and mnist, whose attributes are quite similar, good and bad solutions are mixed over time (see Fig. 7). For the remainder data sets the search does not seem able to improve much as new solutions are evaluated.
MSE of the solutions explored by the algorithm through the running time.
As can be observed in Fig. 8, the random search approach has a similar behavior to the exhaustive search, although it has larger fitness variability (the lines are less condensed) among the explored individuals. It is due to its ability to jump over all the solution space, instead of going through every possible combination of parameters.
MSE of the solutions explored by the algorithm through the running time.
The behavior of the EvoAAA-Gen and EvoAAA-Evo candidates is similar, as shown in Figs 9 and 10. Both start with lower error rates than the exhaustive and random strategies, and then improve as the running time goes by. This advance in the quality of solutions is not highlighted in the plots, as the Y axis is kept fixed to ease the comparison among the five approaches.
MSE of the solutions explored by the algorithm through the running time.
MSE of the solutions explored by the algorithm through the running time.
The behavior shown by the EvoAAA-Evo and EvoAAA-Gen methods while working with the glass data set is anomalous as can be observed in Figs 11 and 12. As stated in Table 4, this is a data set with only 9 attributes and a handful of instances. It is probably the hardest case for an AE, since there is no much room to reduce the data representation nor enough patterns to learn it.
MSE of the solutions explored by the algorithm through the running time.
MSE of the solutions explored by the algorithm through the running time.
MSE of the solutions explored by the algorithm through the running time.
By contrast with EvoAAA-Gen and EvoAAA-Evo, the EvoAAA-Dif algorithm seems to have a wider exploration range in most cases, probably due to its larger population. Figure 13 shows the quality of individuals evaluated through time by this search approach.
In Figs 14 to 16 the X axis has been changed from time to number of explored solutions, while the Y axis shows the error of the best solution found until now. The Y axis scale is kept fixed for each dataset, but the X axis scale changes since each approach examines a different amount of candidates. The goal is to analyze the improvement achieved by each optimization strategy as they explore more solution space. All plots are available in the repository.
As might be expected, all five methods find better solutions as the number of possible architectures examined grows. However, both the exhaustive method and random search show some rungs as the search progresses, getting sometimes stuck for a long time in the same error level. An example of this behavior can be seen in Fig. 14.
Improvement achieved as solutions are explored (how MSE decreases as the number of tested combinations grows).
The improvements achieved by EvoAAA-Gen and EvoAAA-Evo seem more progressive until they reach their minimum level (see Fig. 15). In addition, this lower error rate is achieved after exploring a lower number of potential solutions.
Improvement achieved as solutions are explored (how MSE decreases as the number of tested combinations grows).
The behavior of the EvoAAA-Dif algorithm is somehow a mix of the previous ones, with a continuous improvement of results and starting with a lower error but having certain similarities with the random strategy.
Improvement achieved as solutions are explored (how MSE decreases as the number of tested combinations grows).
The following step is to analyze the methods’ speed of convergence. In this case there are nine plots available in the repository, one per data set. One of them, shown in Fig. 17, is taken as reference for the following analysis. This way the same X and Y scales are shared by the five optimization strategies. The X axis corresponds to running time. Only a portion of the time spent is represented, otherwise the lines that depict the EvoAAA instances would occupy only a small portion of the area, to the left. The lines that continue to the right mean that better values were found later, whereas those that do not reach the X limit indicate the best value achieved is in the plot (e.g. random search with the mnist dataset). From these plots the following conclusions can be extracted:
Speed of convergence for each EvoAAA approach against exhaustive and random search.
In general, the exhaustive approach (thick solid line) is stuck in a high error rate most of the time. The only exception is the glass data set, with this search method almost chasing the three EvoAAA strategies. The random search (thin solid line) behaves aimlessly as it would be expected. Sometimes it shows a slow progress over time similar to that of the exhaustive strategy (e.g. cifar10 and delicious) while other times it finds a good solution very quickly and then no better solutions are found (e.g. fashion and mnist). It is a strategy that could provide a good result in short time but without guarantees. In general, the three EvoAAA instances converge quickly to a lower error value than the two baseline strategies, then stabilize and keep improving at a slower rate. In most occasions the ES approach (dotted line) reaches its optimum (lowest error value) before the GA (thick dashed line) does, then finishes the execution. On the contrary, the GA method keeps improving for longer. Due to this behavior, it is able to beat ES in many cases. The DE strategy (thin dashed line) usually starts with worse solutions than GA and ES, but in most cases it converges faster, particularly with the most complex datasets such as cifar10, fashion and mnist. It is able to get the best result in a fraction of the time with respect to the other alternatives in some cases.
On the basis of this analysis, it would be possible to adjust the parameters of the DE and ES algorithms, increasing the population or their number of iterations, in order to run longer and keep improving as the GA does. They would presumably achieve results comparable to those of the GA method for some small datasets (e.g. sonar, glass and spect), or even better with DE.
To finish this analysis, how the penalization factor linked to the AEs complexity influences the obtained results is scrutinized. For doing so, the DE strategy has been used over the sonar dataset with
The goal of the
Loss change through time for different 
To start with this analysis, Fig. 18 shows how the MSE changes through time, as the architectures are explored by the DE algorithm, for the considered
Loss vs AE complexity for different 
In order to better appreciate the extent to which performance degrades as
Loss and AE configuration obtained with each
The exact MSE values and configuration for each considered
As a result of this analysis,
AEs are a useful tool to face representation learning and perform dimensionality reduction. However, finding the AE architecture that fits better every case is a difficult task. Internal cross-validation is an usual approach for tuning hyperparameters. However, the solution space is huge if we want to also adjust the AE architecture. Therefore, more powerful methods to face this problem would be needed.
In this paper we have proposed EvoAAA, an evolutionary based approach to find the best AE architecture for each data set. First, a way of encoding the AE architecture within a chromosome has been proposed. It is broad enough to consider most AE variations, including different amounts of layers and units, individual activation functions per layer and several loss functions. Then, three search methods have been planned, one based on differential evolution, another one founded on a genetic algorithm and the other on an evolutionary strategy. Lastly, a thorough experimentation and analysis have been conducted, comparing the results of the three EvoAAA strategies against two different baselines, exhaustive and random search. Moreover, the reconstruction performance of the obtained AEs have been contrasted to that of classical algorithms such as PCA and Kernel PCA.
Overall, it has been demonstrated that the proposed methodology is able to find a good AE structure for each data set. It may not be the optimal one, as more advanced search algorithms and optimization strategies could improve these results, but it is better than the ones randomly chosen or found through an exhaustive exploration if results in a reasonable time are needed. The conducted experiments demonstrate that EvoAAA is a competitive procedure to accomplish the job.
Footnotes
The R code to reproduce these experiments, as well as the datasets used in them, are available to download from
Although a fair comparison would demand to limit the number of principal components to be used by PCA while reconstructing, choosing a similar amount to that used by the AEs, we have not imposed this restriction.
The analysis described here has been made with the overall results, although only some illustrative cases are graphically shown. Plots corresponding to all combinations of dataset
Acknowledgments
The authors are grateful to the seven anonymous reviewers and the editor for helping us in improving our work. This work was partially supported by the project TIN2015-68854-R (FEDER Founds) of the Spanish Ministry of Economy and Competitiveness.
