Abstract
It is well-known that tuning a metaheuristic is a critical task because the performance of a metaheuristic and the quality of its solutions depend on its parameter values. However, finding a good parameter setting is a time-consuming task. In this work, we apply the upper confidence bound (UCB) algorithm to automate offline tuning in a (1 + 1)-evolution strategy. Preliminary results show that our proposed approach is a less costly method.
Introduction
Metaheuristics solve a wide umbrella of optimization problems, covering areas from wellness [22] and engineering [11] to health [19] and arts [3]. The exploration and exploitation phases in metaheuristics are related to their parameter values. The parameter setting task specifies a suitable value for any parameter for a metaheuristic to improve the way it can move through the search space. Commonly, there are two approaches to setting parameter values in a metaheuristic [6]: online and offline parameter tuning methods. The first method consists of establishing the parameter values before the run of the metaheuristic. Then the parameter values remain fixed during the run. On the other hand, the second method changes parameter values on the fly of the run. The above suggests that the metaheuristic changes the parameter values according to the course of the run.
The no-free lunch (NFL) theorem [24] indicates that no optimization algorithm solves all optimization problems. The above implies that researchers and end users need not only to adapt the selected metaheuristic to achieve a good solution for a specific problem but also to perform the parameter setting task every time they face new problems [6]. A study in [1] reports that developing a new metaheuristic takes up only 10% of the time required, while the remaining 90% is spent on setting parameter values. In consequence, the tuning task is a time-consuming and tedious process.
All the above motivates us to propose a meta-optimizer based on UCB1 [2] and orthogonal arrays to automate offline tuning. The term meta-optimizer (tuner) was first coined by Mercer and Sampson [12]. It refers to the fact that the offline tuning problem is also an optimization problem that searches for the optimal set of parameters for the metaheuristic. We focus on offline parameter tuning method due to like work in [6] mentions, the advantage of dealing with it is its universality. In other words, if one can develop an efficient tuning method, it can serve to tune various metaheuristics.
UCB1 is part of the branch of machine learning (ML) algorithms called reinforcement learning (RL) algorithms. The integration of ML algorithms into the offline tuning problem has been reported in studies such as [6, 9], and [21].
To verify the effectiveness of our proposed meta-optimizer, we applied it to tune a (1 + 1)-evolution strategy [23] for solving the positioning of unmanned aerial vehicles (UAVs) [16].
The remainder of the work is organized as follows. Section 2 describes the UCB1 algorithm and the (1 + 1)-evolution strategy. Consequently, Section 3 gives details about our proposed meta-optimizer. Next, Section 4 shows preliminary results. Finally, Section 5 concludes this work and gives future directions.
Related work
In recent years, there has been an increasing number of proposed automated algorithm configuration techniques to tune metaheuristics through a computer-intensive search process. They can be classified as follows:
According to [25], the most relevant parameter-tuning techniques are:
ParamILS performs an iterated local search in the parameter space [8] and treats the problem as a task that only considers the configuration of categorical variables. Therefore, the numerical parameters need to be discretized, which can be achieved by generating a series of discrete values for each parameter using prior or random knowledge. For initialization, ParamILS requires an initial configuration as input and generates a small number of random configurations.
The SMAC configurator implements a surrogate model-based search of the parameter space [7]. Unlike ParamILS, SMAC handles numerical and categorical parameters natively, meaning it does not discretize numerical parameters.
The I-Race package [20] implements configuration procedures using an iterative mechanism that involves (i) generating candidate algorithm configurations through a probabilistic mechanism, (ii) selecting the best configurations through runs, and (iii) updating the probabilistic model used to generate candidate configurations, focusing on the elite candidate configurations.
REVAC [18] is a heuristic method that can be described as an evolutionary algorithm, working with a population of m parameter vectors. This population is updated by selecting parent vectors, which are recombined and mutated to produce a second vector that is inserted into the population.
EVOCA [13] is an evolutionary tuner that works with a set of parameters to be calibrated. The population size is calculated based on the number of parameters and their domains. It includes a set of relevant values for each parameter independently in its initial population. A set of well-distributed values is selected for each tuned parameter, and those values are combined in calibrations using a Latin hypercube design.
After conducting a thorough review of the literature, it became clear that there was a lack of exploration regarding reinforcement learning procedures. In the upcoming section, we will outline our bold approach of utilizing the UCB1 algorithm to fine-tune a metaheuristic.
Upper confidence bound 1
UCB1 is one of the RL algorithms to solve the multi-armed bandit (MAB) problem. Briefly, the MAB problem aims to find which machine in a casino will give us the maximum average reward. To do so, an agent performs an action a (pull an arm from the gambling machine) at each time step t to receive a reward. This simple idea has many practical applications as identifying a suitable advertisement banner.
The maximum average reward is defined as follows [15]:
UCB1 performs considering the principle of optimism in the face of uncertainty. It selects the best action a taking into account a confidence interval. In other words, UCB1 chooses an arm whose UCB1 value is the highest to explore [14] as follows:
Next, in Algorithm 1, we describe the steps to select the best arm according to UCB1 policy. Algorithm 1 is divided into two phases: the initialization and the selection. The first stage is performed from STEP 2 to STEP 7 which is to play at each arm once. The count vector keeps track of the number of times the i - th arm is pulled. The sum _ rewards vector keeps the cumulative rewards obtained from the i - th arm. The Q vector stores the average rewards obtained from the i - th arm.
After, Algorithm 1 performs the selection phase from STEP 9 to STEP 19. This stage identifies Q (a*). To do so, Algorithm 1 computes the UCB1 policy for each arm at each iteration. This is shown from STEP 10 to STEP 13. In STEP 10, num _ of _ arms refers to the total number of arms to play. Then, in STEP 14, Algorithm 1 chooses the arm to pull at each iteration according to UCB1 policy. Once, Algorithm 1 identifies the arm to pull at each iteration according to (3); the count vector, the Q vector, and the sum _ rewards vector are updated for that arm. Then, when Algorithm 1 achieves the total number of iterations (num _ of _ rounds in Algorithm 1), it takes the highest Q value from Q vector as (1) indicates. The above is shown in STEP 19.
1:
2: for i = 1, 2, …, num _ of _ arms
3: Obtain reward from arm i and keep it in variable reward _ arm
4: count [i] = count [i] +1
5: sum _ rewards [i] = sum _ rewards [i] + reward _ arm
6: Q [i] = sum _ rewards [i]/count [i]
7:
8:
9: for i = 1, 2, …, num _ of _ rounds
10: for arm = 1, 2, …, num _ of _ arms
11:
12: ucb1 [arm] = Q [arm] + upper _ bound
13:
14: Select the arm that has the maximum value in ucb1 vector and obtain the reward from this arm
15: Update the count vector of that arm
16: Update the sum _ rewards vector with the reward from that arm
17: Update the Q vector for that arm
18:
19: Select the arm which has a maximum Q value
In subsequent sections, we refer to arms by using the term bandits.
We aim to tune offline a (1 + 1)-evolution strategy to locate in a disaster zone a set of unmanned aerial vehicles (UAVs) to provide voice services to ground users (victims and first responders) as Fig. 1 shows.

Optimization problem.
On the other hand, Fig. 2 deploys the idea behind the (1 + 1)-evolution strategy. The flow chart begins by requesting the step size σ, the step size adaptation z, and the number of iterations t. Then, a parent solution X is randomly generated. This parent solution X is evaluated. The parent solution X is mutated, and a child X’ is generated. Subsequently, the feasibility of the child solution X’ is evaluated. If the child’s solution is feasible, it is evaluated in the objective function. Otherwise, it is penalized with zero. The fitness of the parent solution X and the child solution X’ is compared. If the child solution X’ is better than the parent solution X, the child solution X’ becomes the parent solution for the next generation. Then, the success counter is increased. Otherwise, the parent solution X is retained as is for the next generation, and the failure counter is increased. Next, it is checked whether it is time to update the step size σ. If it is time to update the step size σ, there are three options: increase it (σ/c), decrease it (σ · c), or keep it the same (σ). If the stopping condition (number of iterations) is met, the last parent individual is the solution to the problem. Otherwise, another iteration of the algorithm begins from the step of evaluating the parent solution X.

Flow chart of (1 + 1)-evolution strategy.
(1 + 1)-evolution strategy returns as a solution the geographic coordinates to achieve the maximum probability of successfully providing voice services to ground users.
We identify by previous experiments that the parameters that most influence the exploration/exploitation trade-off in the (1 + 1)-evolution strategy are: the step size σ and the step size adaptation z. In short, σ controls mutation, and z indicates how many iterations elapse to adapt σ. Table 1 shows the parameters used for the (1 + 1)-evolution strategy.
Parameters used for the (1 + 1)-evolution strategy
The (1 + 1)-evolution strategy applied to locate UAVs, uses the one-fifth success rule [23]. This rule suggests that the step size σ should be increased if more than 20% of the new solutions are successful, meaning that the offspring is better than the parent. On the other hand, if less than 20% of the new solutions are successful, the step size σ should be decreased.
The Fig. 3 depicts the methodology. It can be seen that in step 1, a set of configurations is defined for use in the calibration of any metaheuristic, using the Taguchi methodology (steps 1 to 5, in [10]).

Methodology to tune a metaheuristic utilizing orthogonal design and UCB1.
Before using this methodology, it is necessary to identify the parameters of interest to calibrate in the metaheuristic. Based on this, we define the different configurations to be evaluated. The parameter combinations form orthogonal arrays proposed by Taguchi; these are designs that possess the property of orthogonality, which is also found in factorial designs. These arrays are complete factorial, fractional, or mixed designs, depending on the number of factors under investigation in a particular case.
Taguchi named these arrays as follows: L a (b) C , where: a represents the number of tests or experimental conditions to be carried out. In other words, the number of rows or lines in the array. b represents the different levels at which each factor will be taken. C is the number of independent effects that can be analyzed, meaning the number of columns in the array.
Once the arrays are obtained, each bandit will be associated with a parameter configuration, as shown in step 2 of Fig. 3. In this way, when the stopping condition is met in the UCB1 algorithm, we will have the configuration (see step 3 of Fig. 3) that is suitable for the problem under study (see step 4 of Fig. 3).
Orthogonal design is an experimental design commonly used to test the effectiveness of different configurations on metaheuristic algorithms. Each configuration is made up of a set of parameters. The idea behind orthogonal design is to create a set of combinations, where each combination is unique and ensures that the effect of one experiment does not interfere with the effect of any other experiment. This means that each experiment is independent and produces reliable results. Table 2 illustrates the application of orthogonal design by presenting nine experimental designs that are used to test the effectiveness of nine different configurations. The parameters z and σ are the ones that most impact the evolution strategy according to prior simulation results.
Taguchi orthogonal array L93(2)
Taguchi orthogonal array L93(2)
The steps involved in typical RL algorithm for tuning the (1 + 1)-evolution strategy [14] are:
First, the agent interacts with the environment by performing an action that comprises the execution of the (1 + 1)-evolution strategy with one set of parameters (z and σ) from Table 2 to evaluate their effect on the algorithm performance.
The agent performs an action and moves from one state to another. An agent on the UCB1 algorithm has a single state, so this step does not apply.
The agent will receive a reward based on the action performed. In this research, the reward is the probability of a successful voice service.
Based on the reward the agent will understand whether the action was good or bad. In this research, our interest is in the trending of actions to select the most relevant orthogonal design, i.e., the best configuration (bandit) to tune the (1 + 1)-evolution strategy.
If the action was good, i.e., if the agent received a positive reward, the agent will prefer to use the successful orthogonal design (exploitation). Otherwise, the agent will try another orthogonal design to gain a positive reward (exploration).
The UCB1 algorithm’s stop condition is met when a bandit reaches a minimum of 34 calls. The winning arm will then determine the configuration for the (1 + 1)-evolution strategy and its associated problem.
The simulations were executed on a desktop computer. The specifications of the computer are as follows: AMD Ryzen 2600 processor, (2 × 8 dual channel) 16 GB of Crucial Ballistix DDR4 memory running at 3000 MHz, a 480 GB Crucial brand SSD, and an 8 GB GDDR5 AMD Radeon Rx 570 video card from Gigabyte. All components are mounted on a B450 AORUS ELITE Motherboard. The operating system is 64-bit Windows 10. The code was developed in Python language.
As shown in Fig. 4, there is a clear breakdown of the total number of times that each bandit was selected. Bandit 1 [5, 0.0001] appears to have poor exploration capability, mainly because of its small step size σ. This small step size leads to a constant fluctuation due to the one-fifth success rule in the (1 + 1)-evolution strategy, resulting in a shaky movement. On the other hand, Bandit 5 [10, 0.001] has a longer review time but has better results due to its increased exploration capabilities when its step size increases.

Final values for the accumulated number of bandit selections.
The performance of each configuration is shown in Fig. 5. It is noticeable that Bandit 5 maintains high values throughout the iterations, while Bandit 1 steadily obtains low values over time. It is possible to uncover where the UCB1 algorithm explores or exploits in the set of configurations. Bandits 5 and 6 are more often called, making an exploitation process while the rest of the configurations share similar values of calls along the iterations as shown in Fig. 6, here again, Bandit 1 has the least number of calls (nine).

Exploration-exploitation of UCB1 on the set of experiments.

Average probability of successful voice services Q(a) by iteration on bandits.
The box plots depicted in Fig. 7 collectively reveal five distinct sets of bandits with the most closely matched 3rd quartile values, after excluding outliers from their respective ranges. For a comprehensive data overview from Fig. 7, please refer to Table 3.

Average probability of successful voice services per bandit.
Summary of Experimental Designs on the (1 + 1)-evolution strategy performance
The first set has the Bandit 1, characterized by a minimal value of zero and a mean of 0.1942. Its 3rd quartile value, Q (a) =0.2092, signifies that 75% of the total data generated by the (1 + 1)-evolution strategy using this configuration falls below this threshold. The second set comprises Bandit 2 and Bandit 4, their mean is 0.4012 and 0.4759 respectively; here the 3rd quartile is 0.5072 and 0.5038 respectively, increasing substantially their value compared to the first set. In the third set, we find Bandit 7 and Bandit 8, their mean are 0.5638 and 0.5687, values higher than the previous set; having similar behavior when we compare the 3rd quartile of this set (0.6303 and 0.5939) with the second set. The fourth set includes Bandit 3, Bandit 6, and Bandit 9; their Q(a) value mean are 0.6867, 0.6045, and 0.6379, and their values in the 3er quartile are 0.6918, 0.6716, and 0.6386 respectively. Finally, the fifth set has the best bandit configuration with the 50% of total data produced by the (1 + 1)-evolution strategy falling below 0.7128 and up to 0.7015; while their maximal value is 0.7528. This configuration gives a Q(a*) = 0.7063.
The time used by the UCB1 algorithm for the exploration-exploitation process is irrelevant compared with the time used by the (1 + 1)-evolution strategy algorithm. Figure 8 shows the time consumed by the (1 + 1)-evolution strategy on executing 500 iterations and sending a Q value related to the i - th bandit.

Time on seconds to provide a solution by the (1 + 1)-evolution strategy.
The most time-consuming configuration was the Bandit 5, this Bandit was selected 34 times, finishing with a Q(a) of 0.7063 as the best configuration suggested by the UCB1 algorithm. In contrast, the most effective configuration considered time-consuming was the configuration on Bandit 9, this bandit was selected 27 times concluding with a Q(a) of 0.6351 as the fourth best. Since we aim to maximize the probability of successful voice services, we take the configuration suggested by Bandit 5.
Once we finish analyzing the experimental results, we gather the following conclusions: Using statistical techniques to tune metaheuristics can be advantageous. However, this task is challenging if a human expert performs it. To address this challenge, the UCB1 algorithm and the DOE technique provide an automated approach to tune metaheuristics. Utilizing the UCB1 algorithm’s exploitation-exploration process for early stopping when a trending configuration emerges, can significantly reduce the time and resources required for algorithm tuning.
We plan to find a suitable termination condition to stop UCB1 for future work. Another future direction includes applying our proposed meta-optimizer to tune toy optimization problems. Finally, we plan to apply Thompson sampling as another approach to tune a metaheuristic.
Footnotes
Acknowledgments
We thank Daniel Sánchez-Bautista for coding the UCB1 algorithm. Gabriela L. Rodríguez-Cortés gratefully acknowledges the scholarship from CONAHCyT to pursue her graduate studies. M.A. Cosío-León and Anabel Martínez-Vargas extend their thanks to the Centro Nacional de Supercómputo at IPICYT for providing essential insights into the UCB1 algorithm through the Diploma in Applied Artificial Intelligence.
