Adaptive Geometry Based Meta-Learning for Multi-Objective Combinatorial Optimization Problems

Abstract

In recent years, neural heuristics leveraging deep reinforcement learning have exhibited considerable promise in addressing multi-objective combinatorial optimization problems (MOCOPs). Nonetheless, attaining both high learning efficiency and high solution quality remains challenging. To address this issue, we propose a novel multi-objective optimization algorithm grounded in information geometry and machine learning principles, which integrates adaptive gradient descent with meta-reinforcement learning techniques to effectively tackle MOCOPs. Specifically, we present a meta-learning framework that enhances model performance in multi-objective combinatorial optimization through tensor reformulation, preconditioned gradient descent, and entropy regularization. Experimental results demonstrate that the proposed method yields performance improvements across several classic multi-objective combinatorial optimization problems, including the Multi-objective Traveling Salesman Problem (MOTSP), the Multi-objective Capacitated Vehicle Routing Problem (MOCVRP), and the Multi-objective Knapsack Problem (MOKP).

Keywords

Riemannian manifold; meta-learning; deep reinforcement learning; multi-objective combinatorial optimization

Introduction

When addressing practical applications, it is common to encounter problems involving multiple objectives that need to be optimized simultaneously, but the decision variables are subject to discrete choices, often in the form of combinatorial structures. These problems are referred to as Multi-Objective Combinatorial Optimization Problems (MOCOPs). Typically, this type of problem is defined as follows:

\min f(x) = [f_{1}(x), f_{2}(x), \dots, f_{N}(x)], \quad n = 1, 2, \dots, N,

(1)

\text{s.t.} \left\{ \begin{matrix} g(x) = [g_{1}(x), g_{2}(x), \dots, g_{p}(x)] \leq 0 \\ h(x) = [h_{1}(x), h_{2}(x), \dots, h_{m}(x)] = 0 \\ x = [x_{1}, x_{2}, \dots, x_{d}, \dots, x_{D}] \\ x_{d, \min} \leq x_{d} \leq x_{d, \max} \quad (d = 1, 2, \dots, D) \end{matrix} \right.

(2)

where $x$ is the decision variable in a $D$-dimensional space; $f(x)$ is the objective function, and $N$ is the number of objectives to be optimized simultaneously. Each $f_{n}(x)$ represents the $n$-th objective function, which is typically in conflict with the others. $g(x)$ represents the inequality constraints, while $h(x)$ represents the equality constraints. The constraints form the feasible region of the problem, where $x_{d, \min}$ and $x_{d, \max}$ denote the lower and upper limits of the search range of variable $x_{d}$, respectively.

In the context of MOCOPs, the decision space is discrete, meaning that the decision variables $x$ often represent discrete choices, such as the selection of items, paths, or schedules. This is in contrast to traditional multi-objective optimization problems (MOPs), where the decision space is typically continuous. MOCOPs are frequently encountered in practical fields like logistics, vehicle routing, and scheduling, where the optimization aims to find a balance between conflicting objectives, such as minimizing cost, maximizing efficiency, and reducing environmental impact.

Thus, (1) represents a general formulation for a multi-objective combinatorial optimization problem, and (2) outlines constraints arising from discrete decision spaces, guiding the search for Pareto optimal solutions across competing objectives.

A fundamental distinction between MOPs and single-objective problems (SOPs) lies in their solution structures: MOPs admit a set of Pareto optimal solutions (whose objective vectors form the Pareto front), whereas SOPs typically yield a single global optimum or multiple equivalent optima. A solution $x$ is Pareto optimal if and only if there exists no other feasible solution $y$ such that $f_{i}(y) \leq f_{i}(x)$ for all objectives $i \in \{1, 2, \dots, N\}$ and $f_{j}(y) < f_{j}(x)$ for at least one objective $j \in \{1, 2, \dots, N\}$.

For minimization problems, $x$ dominates $y$, written $x ≺ y$, when no objective component of $x$ is worse (i.e., larger in value) than the corresponding component of $y$ and at least one objective component of $x$ is strictly smaller. A solution is termed Pareto-optimal only when it is not dominated by any other solution in the solution set. In MOCOPs, the obtained set of non-dominated solutions is referred to as the Pareto set, and the collection of objective vectors corresponding to the solutions in the Pareto set is termed the Pareto front (PF). Solutions on the PF cannot be improved in any objective without worsening at least one other. The primary goal in solving MOCOPs is to identify the Pareto-optimal solutions.
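The dominance relation and the Pareto front described above can be illustrated with a minimal sketch. The function names `dominates` and `pareto_front` and the sample objective vectors are hypothetical and chosen only for illustration; the quadratic scan is the simplest possible filter, not an efficient one.

```python
from typing import List, Sequence, Tuple


def dominates(fx: Sequence[float], fy: Sequence[float]) -> bool:
    """True if objective vector fx dominates fy under minimization."""
    no_worse = all(a <= b for a, b in zip(fx, fy))
    strictly_better = any(a < b for a, b in zip(fx, fy))
    return no_worse and strictly_better


def pareto_front(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep the points that no other point dominates (O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical bi-objective values (both objectives are minimized).
candidates = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0), (2.0, 3.5)]
front = pareto_front(candidates)
print(front)
```

Here `(3.0, 4.0)` and `(2.0, 3.5)` are both dominated by `(2.0, 3.0)`, so only three points remain on the front.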

MOCOPs are widely studied in the computational intelligence community. Classic approaches include Pareto optimization, the weighted method, the $ε$-constraint method, evolutionary algorithms, and neighborhood search-based algorithms. Pareto optimization constructs the Pareto front by searching for solutions that are not dominated by other solutions, as in NSGA-II; the weighted method combines multiple objectives into a single objective through weights; the $ε$-constraint method converts one or more objective functions into constraints and optimizes the remaining objective; evolutionary algorithms, such as NSGA-II and MOEA, use population evolution to search the solution space in parallel, making them suitable for complex nonlinear problems. Neighborhood search algorithms such as simulated annealing (Bertsimas & Tsitsiklis, 1993) introduce randomness and a cooling mechanism to escape local optima, while tabu search (Gendreau & Potvin, 2005) uses historical information (such as the tabu list) to prevent repeated visits to the same regions of the solution space. Iterated local search (Lourenço et al., 2003) improves the current solution by perturbing it and repeatedly applying local search. Current mainstream algorithms, particularly evolutionary algorithms, do not leverage gradient information in discrete optimization, and even for continuously differentiable problems their use of gradient information remains inefficient. Some studies (Kang et al., 2023; Wang et al., 2024) have shown that introducing gradient guidance strategies during training can improve the quality of the trained model. For example, Kang et al. (2023) highlighted that when a parameter space exhibits latent structure, a Riemannian metric is associated with that parameter space.

In this article, we propose a novel multi-objective optimization algorithm that builds on information geometry theory and machine learning tools. The algorithm introduces the concept of a Riemannian manifold and, during training, adaptively adjusts the gradient direction based on the geometric structure of the training samples to direct the evolution of the solution. Our main contributions can be summarized as follows:

We present a novel meta-reinforcement learning method, meta-reinforcement learning with gradient geometry adaptive tuning (GMRL), for preconditioned gradient descent, which facilitates geometric adaptive learning during meta-learning.

We propose a more efficient fine-tuning method to effectively address all sub-problems. By incorporating dynamic learning rate adjustment and entropy regularization, the model has been enhanced in terms of convergence speed, training stability, and strategy exploration ability.

Our experimental results on the few-shot learning benchmark task demonstrate that GMRL outperforms the state-of-the-art meta-reinforcement learning (MRL) family.

Furthermore, our experimental results on three classical MOCOPs confirm the effectiveness of our design.

Background

Multi-objective combinatorial optimization problems (MOCOPs) necessitate distinct methodological approaches due to their inherent computational complexity. The current research consensus delineates two foundational paradigms: exact methods and heuristic approaches. Exact methods guarantee the identification of Pareto-optimal solutions through rigorous mathematical formulations, such as multi-objective branch-and-bound and dynamic programming. However, their exponential time complexity limits their applicability to large-scale problems. Heuristic methods can be further categorized into four subclasses: single-solution metaheuristics, including simulated annealing (Bertsimas & Tsitsiklis, 1993) and tabu search (Gendreau & Potvin, 2005), which iteratively refine a solution toward optimality; population-based metaheuristics, such as NSGA-II (Deb et al., 2002) and MOEA/D (Zhang & Li, 2007), which leverage population diversity for global exploration; hybrid strategies, which combine complementary algorithmic strengths to enhance search efficiency and solution quality; and learning-based optimizers, such as neural combinatorial optimization (NCO) and reinforcement learning, which employ data-driven approaches to generate Pareto sets.

Over the last decade, advancements in neural network architectures have introduced innovative approaches to solving MOCOPs, giving rise to NCO algorithms. These methods leverage neural networks to automatically learn heuristics for solving combinatorial optimization problems, requiring minimal domain-specific expertise while often delivering high-quality solutions quickly. As a result, NCO has gained significant traction. Research in this domain is divided into two primary categories: end-to-end methods and improvement-based methods. End-to-end approaches aim to generate solutions independently, whereas improvement-based approaches integrate auxiliary algorithms to boost performance. This study concentrates on the end-to-end methodology.

Zhang et al. (2021) proposed using deep reinforcement learning to address multi-objective combinatorial optimization problems and introduced MODRL. Lin et al. (2022) introduced PMOCO, a method based on multi-objective reinforcement learning that conditions on preferences to generate an approximate Pareto solution set, improving solution quality, speed, and model efficiency for multi-objective combinatorial optimization problems.

In 2017, the emergence of the second-order meta-learning algorithm MAML and its derivatives provided a new approach for incorporating deep reinforcement learning into problem-solving. Integrating meta-learning into algorithms has been shown to improve their generalization ability. Zhang et al. (2023) presented Meta-DRL (MDRL), a deep reinforcement learning algorithm based on the first-order meta-learning algorithm Reptile, which was subsequently improved upon by EMNH (Chen et al., 2023b).

In recent years, the emergence and development of large language models (LLMs), such as ChatGPT, have also injected new vitality into neural combinatorial optimization. Recent studies have explored integrating LLMs with evolutionary computation (EC) to automate heuristic generation (Chen et al., 2023a; Meyerson et al., 2023; Yang et al., 2024). A notable example is FunSearch (Romera-Paredes et al., 2024), which frames Automatic Heuristic Design (AHD) as a functional search problem. In this approach, heuristics are represented as programs, and an evolutionary framework employs an LLM to iteratively improve the quality of the generated functions. While FunSearch has demonstrated success across various tasks, it is computationally intensive and requires significant resources to produce high-quality heuristics. To address these limitations, Liu et al. (2024b) introduced Evolution of Heuristics (EoH), an evolutionary paradigm that combines LLMs and EC to streamline automatic heuristic design.

Beyond the aforementioned directions, researchers have explored integrating multi-task learning into combinatorial optimization. Ibarz et al. (2022) and Reed et al. (2022) developed generalist agents capable of tackling various tasks, including COPs. Wang and Yu (2023) proposed a multi-task learning framework utilizing separate encoders and decoders for combinatorial optimization. However, their method struggles with complex Vehicle Routing Problems (VRPs) and requires adjustments to handle unseen problems. Liu et al. (2024a) introduced an approach for cross-problem generalization in VRPs that treats VRP variants as combinations of shared attributes, which enables solving multiple VRP variants simultaneously through an end-to-end multi-task learning framework.

Preliminary

Reptile

Reptile is a gradient-based meta-learning algorithm introduced by Nichol et al. (2018) at OpenAI. Its core idea is to enable a model (e.g., a neural network) to adapt quickly to new tasks by repeatedly sampling a task, applying several gradient updates to the model's trainable parameters starting from a shared initialization, and moving that initialization toward the adapted parameters. The trainable parameters are typically the weights and biases of each layer in the network. They control the transformations between layers (e.g., linear mappings in fully connected layers or convolutional filters in CNNs) and ultimately determine the model's output for a given input. Unlike some other meta-learning algorithms, Reptile does not rely on second-order gradient information, which makes it computationally efficient. By training the model on multiple tasks using stochastic gradient descent (SGD) and incrementally updating the shared initial parameters (i.e., the starting point for task-specific fine-tuning) after each training session, Reptile allows the model to rapidly generalize across tasks.

The training process can be summarized as follows:

Initialize Model Parameters: Start by randomly initializing the model parameters $θ$ . This initialization serves as the base parameters for subsequent task training.

Sample a Task: Randomly sample a task $T_{i}$ from the task distribution. Each task could represent a distinct problem, such as different classification or regression tasks.

Inner-loop Optimization: For the sampled task $T_{i}$ , perform several gradient descent updates on the task-specific objective using SGD.
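The three steps above, followed by the outer interpolation step of Reptile (Nichol et al., 2018), can be sketched on a toy problem. Everything specific here is an assumption made for illustration: each task is a scalar quadratic loss $0.5 (θ - c)^{2}$ with a task-specific optimum $c$, and the names `inner_sgd` and `reptile` as well as the step sizes are hypothetical.

```python
import random


def inner_sgd(theta: float, c: float, lr: float = 0.1, steps: int = 5) -> float:
    """Inner loop: a few gradient steps on the task loss 0.5 * (theta - c)**2."""
    for _ in range(steps):
        grad = theta - c          # derivative of the toy task loss
        theta -= lr * grad
    return theta


def reptile(meta_iters: int = 2000, meta_lr: float = 0.1, seed: int = 0) -> float:
    rng = random.Random(seed)
    theta = rng.uniform(-1.0, 1.0)            # initialize model parameters
    for _ in range(meta_iters):
        c = rng.uniform(2.0, 4.0)             # sample a task (its optimum c)
        adapted = inner_sgd(theta, c)         # inner-loop optimization
        theta += meta_lr * (adapted - theta)  # move initialization toward adapted weights
    return theta


theta_meta = reptile()
print(theta_meta)
```

Because the task optima are drawn uniformly from $[2, 4]$, the learned initialization settles near $3$, the point from which any sampled task is reached in few inner steps. No second-order derivatives are needed, only the difference between adapted and initial parameters.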

Reformulation of Tensors

This study employs unfolding to reshape the gradient tensor of a convolution kernel into matrix form. Tensor unfolding (expansion), also referred to as matricization or flattening, rearranges the elements of an $N$-th order tensor $X \in R^{I_{1} \times \dots \times I_{N}}$ into a matrix (Kolda & Bader, 2009). Here, $I_{1}, I_{2}, \dots, I_{N}$ denote the sizes of the tensor along each dimension. The mode-$n$ unfolding of the tensor $X$ can be formally described as follows:

X \to X_{[n]} \in R^{I_{n} \times I_{M}}, \quad I_{M} = \prod_{k \neq n} I_{k},

(3)

where $I_{M} = \prod_{k \neq n} I_{k}$ is the product of the sizes of all dimensions except the $n$-th. For instance, consider a three-dimensional weight tensor $W \in R^{C_{in} \times K_{h} \times K_{w}}$, where:

$C_{i n}$ denotes the number of input channels,

$K_{h}$ denotes the height of the convolution kernel,

$K_{w}$ denotes the width of the convolution kernel.

The three unfoldings of $W$, one per mode, are then:

$W_{1} \in R^{C_{i n} \times (K_{h} K_{w})}$ ,

$W_{2} \in R^{K_{h} \times (C_{i n} K_{w})}$ ,

$W_{3} \in R^{K_{w} \times (C_{i n} K_{h})}$ .

Each form corresponds to a different mode of expansion, allowing for flexibility in how the tensor is reshaped into a matrix. Its significance is as follows:

Faster computations. Converts complex tensor operations into simpler matrix calculations; accelerates the gradient update process (e.g., by generating the preconditioning matrix).

Adapts to geometric structures. Different unfolding methods (e.g., $W_{1}, W_{2}, W_{3}$ ) capture distinct patterns in data; this helps the model dynamically adjust learning strategies.

Preserves key relationships. Maintains critical connections (e.g., between input channels and spatial features); this allows flexible handling of diverse tasks.
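The unfoldings $W_{1}, W_{2}, W_{3}$ described above can be sketched with NumPy. The helper names `unfold` and `fold` and the tensor sizes are hypothetical. The column ordering used here is NumPy's row-major order, which may differ from the ordering convention in Kolda & Bader (2009); the matrix shapes and the invertibility of the operation are unaffected.

```python
import numpy as np


def unfold(tensor: np.ndarray, mode: int) -> np.ndarray:
    """Mode-`mode` unfolding: rows index the chosen axis, columns all other axes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def fold(matrix: np.ndarray, mode: int, shape: tuple) -> np.ndarray:
    """Inverse of `unfold` for a tensor with the given original shape."""
    moved_shape = (shape[mode],) + tuple(s for i, s in enumerate(shape) if i != mode)
    return np.moveaxis(matrix.reshape(moved_shape), 0, mode)


# Hypothetical kernel sizes: 4 input channels, 3 x 3 kernel.
C_in, K_h, K_w = 4, 3, 3
W = np.arange(C_in * K_h * K_w, dtype=float).reshape(C_in, K_h, K_w)

W1, W2, W3 = (unfold(W, mode) for mode in range(3))
print(W1.shape, W2.shape, W3.shape)
```

Each unfolding keeps one axis as the row index and merges the remaining axes into columns, so `fold(unfold(W, mode), mode, W.shape)` recovers the original tensor for every mode.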

Riemannian Manifold

An $n$ -dimensional Riemannian manifold is defined by a manifold $M$ and a Riemannian metric $g : M \to R^{n \times n}$ . The Riemannian metric $g$ is a positive-definite bilinear form defined at each point $p \in M$ , denoted as $g_{p}$ (Lee, 2012). Specifically, $g_{p}$ provides an inner product on the tangent space $T_{p} M$ of the manifold at point $p$ , expressed as $g_{p} (u, v)$ , where $u, v \in T_{p} M$ . This metric enables the definition of various geometric concepts on the manifold, such as geodesics and curvature. Geodesics represent locally shortest paths, while curvature describes the bending properties of the manifold. For example, the unit sphere $S^{n}$ is a Riemannian manifold with constant positive curvature $+ 1$ , and the hyperbolic space $H^{n}$ has constant negative curvature $- 1$ . The framework of Riemannian manifolds is extensively used to study and analyze the geometry of non-Euclidean spaces.

Preconditioned Gradient Descent (PGD)

PGD is a method for minimizing the empirical risk, defined as the average loss of the model over the training dataset. It updates the model parameters by gradient steps while incorporating a preconditioner matrix $P$ that reshapes the geometry of the parameter space. For a model with parameters $θ$ and a task $τ = \{D_{τ}^{tr}, D_{τ}^{val}\}$, defined by a training dataset $D_{τ}^{tr}$ and a validation dataset $D_{τ}^{val}$, the gradient update with preconditioning can be expressed as:

θ_{τ, k + 1} = θ_{τ, k} - α P \nabla_{θ_{τ, k}} L_{τ} (θ_{τ, k}; D_{τ}^{t r}),

(4)

k = 0, 1, \dots, \quad \text{and} \quad θ_{τ, 0} = θ,

(5)

where $L_{τ}(θ_{τ, k}; D_{τ}^{tr})$ denotes the empirical loss function for task $τ$ and parameters $θ_{τ, k}$. When $P = I$, this formulation reduces to standard Gradient Descent (GD). To incorporate second-order information, $P$ can be chosen as the inverse Fisher information matrix $F^{-1}$, which yields Natural Gradient Descent (NGD) (Amari, 1998). Alternatively, choosing the inverse Hessian matrix $H^{-1}$ gives Newton's method (Lee, 2012). Adaptive gradient methods, which build diagonal preconditioners from historical gradients, are another approach to constructing $P$ (Kang et al., 2023). These preconditioning techniques mitigate the effects of pathological curvature and enhance optimization efficiency (Amari et al., 2020).
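The effect of the preconditioner in Equation (4) can be seen on a small ill-conditioned quadratic. This is only an illustrative sketch: the loss $\frac{1}{2} θ^{\top} H θ$, the matrix `H`, the step sizes, and the function name `pgd` are assumptions and are not taken from the experiments.

```python
import numpy as np


def pgd(grad_fn, theta0, P, alpha, steps):
    """Iterate theta <- theta - alpha * P @ grad(theta), as in Equation (4)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - alpha * P @ grad_fn(theta)
    return theta


# Toy loss 0.5 * theta^T H theta with very different curvature per axis.
H = np.diag([1.0, 100.0])


def grad(theta):
    return H @ theta


theta0 = np.array([1.0, 1.0])

# P = I: plain gradient descent; the step size is limited by the steep axis.
plain = pgd(grad, theta0, np.eye(2), alpha=0.01, steps=50)

# P = H^{-1}: a Newton step, which rescales both axes to the same curvature.
newton = pgd(grad, theta0, np.linalg.inv(H), alpha=1.0, steps=1)

print(np.linalg.norm(plain), np.linalg.norm(newton))
```

With $P = I$ the flat direction is still far from the optimum after 50 steps, whereas with $P = H^{-1}$ a single step reaches the minimizer, which is the kind of curvature correction that preconditioning is meant to provide.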

Methodology

Our meta-reinforcement learning algorithm with gradient geometry adaptive tuning (GMRL) comprises a meta-learning process with a gradient geometry adaptive adjustment function (GAA), an efficient fine-tuning process, and an inference process, as shown in Figure 1. GMRL is general: it incorporates the first-order gradient-based Reptile algorithm (Nichol et al., 2018) as the core component for meta-learning and employs the widely used neural solver POMO (Kwon et al., 2020) as the foundational model. In the meta-learning process, the number of iterations for training the model $θ$ is set to $T_{m}$, and a multi-task mode is used to accelerate training. In addition, we design the gradient geometry adaptive regulator to improve data utilization efficiency. The specific details of each design are described below.

Figure 1.

The overall framework of GMRL.

Adaptive Adjustment of Gradient Geometry Meta-learning

In the meta-learning process, the procedure first initializes a random meta-model $θ$, which is then trained for $T_{m}$ iterations. In each iteration, $\tilde{N}$ weight vectors are randomly sampled from the given distribution. Each weight vector defines a specific subproblem, and DRL is used to update the parameters of the corresponding submodel. In other words, the submodels are derived from the meta-model and trained under the guidance of their weight vectors for $T_{u}$ update steps. Next, the parameter differences between each submodel and the meta-model are calculated and averaged over the $\tilde{N}$ submodels to obtain the mean update. During these updates, gradient prediction and guidance are carried out through GAA (see Section “GAA: Gradient Geometry Adaptive Adjustment”). Finally, the parameters of the meta-model are adjusted by scaling the mean update with the meta-learning step size, as illustrated in Algorithm 1.
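The outer loop described above can be sketched as follows. This is not Algorithm 1 itself: the DRL update of each submodel is replaced by plain gradient descent on a toy weighted-sum loss, GAA is omitted, and the meta step size is assumed to decay linearly. The names `train_submodel` and `meta_train`, the two target points, and all numeric settings are hypothetical.

```python
import numpy as np


def train_submodel(theta, weight, targets, lr=0.1, steps=5):
    """T_u plain gradient steps on a weighted-sum toy loss (stand-in for DRL)."""
    sub = theta.copy()
    for _ in range(steps):
        # Gradient of sum_m weight[m] * ||sub - targets[m]||^2.
        grad = 2.0 * sum(w * (sub - t) for w, t in zip(weight, targets))
        sub -= lr * grad
    return sub


def meta_train(targets, T_m=500, n_tilde=2, eps0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=targets[0].shape)       # random meta-model
    for t in range(T_m):
        eps = eps0 * (1.0 - t / T_m)                # assumed linear decay of the meta step size
        # Sample n_tilde weight vectors, one subproblem each.
        weights = rng.dirichlet(np.ones(len(targets)), size=n_tilde)
        diffs = [train_submodel(theta, w, targets) - theta for w in weights]
        theta = theta + eps * np.mean(diffs, axis=0)  # scaled mean update
    return theta


# Two toy objectives: squared distance to each of these points.
targets = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
theta_meta = meta_train(targets)
print(theta_meta)
```

In this toy setting every weighted-sum subproblem has its optimum on the segment between the two targets, so the meta-model settles near the middle of that segment, a starting point from which each subproblem is reached in a few update steps.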

GAA: Gradient Geometry Adaptive Adjustment

We consider an $L$-layer neural network $f_{θ}(\cdot)$ parameterized by $θ = \{W^{1}, \dots, W^{l}, \dots, W^{L}\}$. In the typical Reptile setup for a task $τ \sim p(T)$, the parameters $W^{l}$ are updated using the following gradient formulation:

W_{τ, K}^{l} \leftarrow W_{τ, 0}^{l} - α \cdot \sum_{k = 0}^{K - 1} G_{τ, k}^{l} \quad \text{s.t.} \quad W_{τ, 0}^{l} = W^{l},

(6)

where $G_{τ, k}^{l} = \nabla_{W_{τ, k}^{l}} L_{τ}^{in}(θ_{τ, k}; D_{τ}^{tr})$ represents the gradient with respect to $W_{τ, k}^{l}$ and $α$ is the learning rate. In GAA, the gradient tensor undergoes a transformation. First, it is reshaped into a matrix (see Section “Reformulation of Tensors”). Then, the meta-parameter $ϕ = \{M^{l}\}_{l = 1}^{L}$, which modifies the singular values of the gradient matrix, is introduced. The diagonal matrix $M^{l}$ with positive entries is defined as follows:

M^{l} = \begin{cases} diag(Sp(m_{1}^{l}), \dots, Sp(m_{C_{in}}^{l})) & \text{if } C_{in} \leq K_{h} K_{w} \\ diag(Sp(m_{1}^{l}), \dots, Sp(m_{K_{h} K_{w}}^{l})) & \text{if } K_{h} K_{w} \leq C_{in} \end{cases},

(7)

where $m_{i}^{l} \in R$ and $Sp(x) = \frac{1}{2} \log(1 + \exp(2x))$. The diagonal matrix $M^{l}$ is constructed based on the input size and kernel dimensions. If the number of input channels $C_{in}$ is less than or equal to the total number of kernel elements $K_{h} K_{w}$, the diagonal entries are $Sp(m_{1}^{l}), \dots, Sp(m_{C_{in}}^{l})$. Otherwise, if $K_{h} K_{w} \leq C_{in}$, the diagonal entries are $Sp(m_{1}^{l}), \dots, Sp(m_{K_{h} K_{w}}^{l})$. This matrix is applied to the gradient matrix through the following transformation:

{\tilde{G}}_{τ, k}^{l} = U_{τ, k}^{l} (M^{l} \cdot Σ_{τ, k}^{l}) {V_{τ, k}^{l}}^{\top},

(8)

where $G_{τ, k}^{l} = U_{τ, k}^{l} Σ_{τ, k}^{l} {V_{τ, k}^{l}}^{\top}$ is the singular value decomposition (SVD) of $G_{τ, k}^{l}$. The transformed gradient matrix ${\tilde{G}}_{τ, k}^{l}$ is then reshaped back to its original tensor format using inverse unfolding. Subsequently, the GAA preconditioned gradient descent step is expressed as:

W_{τ, K}^{l} \leftarrow W_{τ, 0}^{l} - α \cdot \sum_{k = 0}^{K - 1} {\tilde{G}}_{τ, k}^{l} \quad \text{s.t.} \quad W_{τ, 0}^{l} = W^{l},

(9)

where ${\tilde{G}}_{τ, k}^{l}$ incorporates the influence of the meta-parameter $ϕ$. The comprehensive process is outlined in Algorithm 2.
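The transformation in Equations (7) and (8) can be sketched with NumPy. This is a hedged illustration rather than Algorithm 2: it handles a single layer, uses the first-mode unfolding, treats the meta-parameters `m` as given numbers instead of learning them, and assumes the standard SVD convention $G = U Σ V^{\top}$. The function names `softplus2` and `gaa_precondition` are hypothetical.

```python
import numpy as np


def softplus2(x):
    """Sp(x) = 0.5 * log(1 + exp(2x)), evaluated in a numerically stable way."""
    return 0.5 * np.logaddexp(0.0, 2.0 * np.asarray(x, dtype=float))


def gaa_precondition(grad_tensor, m):
    """Rescale the singular values of the unfolded gradient by Sp(m)."""
    c_in = grad_tensor.shape[0]
    G = grad_tensor.reshape(c_in, -1)                  # C_in x (K_h * K_w) matrix
    U, S, Vt = np.linalg.svd(G, full_matrices=False)   # G = U @ diag(S) @ Vt
    S_tilde = softplus2(m) * S                         # diagonal of M^l times Sigma
    G_tilde = (U * S_tilde) @ Vt                       # U @ diag(S_tilde) @ Vt
    return G_tilde.reshape(grad_tensor.shape)          # inverse unfolding


rng = np.random.default_rng(0)
grad = rng.normal(size=(4, 3, 3))        # hypothetical gradient, C_in=4, K_h=K_w=3
m = rng.normal(size=min(4, 3 * 3))       # one meta-parameter per singular value
grad_tilde = gaa_precondition(grad, m)
print(grad_tilde.shape)
```

Since every $Sp(m_{i}^{l})$ is positive, the inner product between the original and the transformed gradient equals $\sum_{i} Sp(m_{i}^{l}) σ_{i}^{2}$ and is therefore positive, so the transformed gradient remains a descent direction. Choosing $m_{i}^{l}$ such that $Sp(m_{i}^{l}) = 1$ leaves the gradient unchanged.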

Theorem 1.

Let ${\tilde{G}}_{τ, k}^{l} \in R^{m \times n}$ be the gradient matrix of layer $l$ at the $k$-th inner step, transformed by the meta-parameter $M^{l}$ for task $τ$. The preconditioner GAA induced by ${\tilde{G}}_{τ, k}^{l}$ is a Riemannian metric that depends on the task-specific parameter $θ_{τ, k}$.

Theorem 1 formally shows that GAA depends on the task-specific parameters $θ_{τ, k}$ (Kang et al., 2023). In contrast, previous studies considered non-adaptive preconditioners $P(ϕ)$ (Lee & Choi, 2018; Li et al., 2017; Rajasegaran et al., 2020; von Oswald et al., 2021; Zhao et al., 2020), which are kept static (Figure 2 (b)). Other works adapt the preconditioner to the inner step $k$, giving $P(k; ϕ)$ (Figure 2 (c)) (Rajasegaran et al., 2020), or to the individual task, giving $P(D_{τ}^{tr}; ϕ)$ (Figure 2 (d)) (Simon et al., 2020). GAA can be considered the state-of-the-art adaptive preconditioner because it depends on $θ_{τ, k}$ (Figure 2 (e)) and is fully adaptive, i.e., task-specific and path-dependent (Kang et al., 2023; Zhao et al., 2020).

Figure 2.

Diagram of MAML and PGD-MAML family.

If the parameter space possesses an inherent geometric structure, the conventional gradient $\nabla L$ may not align with the steepest descent direction (Amari, 1998). To define this direction accurately within the parameter space, a Riemannian metric $g(ω)$, represented as a positive definite matrix for each parameter $ω$, is required. This metric modifies the descent direction to $- g(ω)^{-1} \nabla L$, reflecting the underlying geometry of the space (Amari, 1998). When a preconditioner matrix serves as the Riemannian metric, it characterizes the geometry of the parameter space and enables optimization along the true steepest descent path. Kang et al. (2023) demonstrated, as stated in Theorem 1, that GAA acts as a Riemannian metric for individual parameters, which provides theoretical support for steepest descent learning within its parameter space. GAA integrates two core components: a unitary matrix of gradients $U_{τ, k}^{l}$, which captures task-specific and path-dependent geometric details, and a meta-parameter $M_{meta}$, which incorporates geometric information shared across tasks. Together, these factors make GAA more expressive than constant metrics, such as that of a unit sphere with curvature $+1$. While GAA guarantees the properties of a Riemannian metric, its effectiveness relies on how well it corresponds to the true geometry of the parameter space. If GAA deviates significantly from that geometry, its utility may diminish. Therefore, ensuring that the meta-learned GAA aligns closely with the actual parameter space is critical for achieving efficient optimization.

Efficient Fine-tuning

Once the meta-model is trained, it can be fine-tuned with many gradient updates to derive a custom submodel for a given weight vector. In the fine-tuning stage, we first introduce a dynamic learning rate adjustment mechanism, which lets the model use different learning rates in different training stages and thereby improves training efficiency and stability. Second, we introduce entropy regularization, which enhances exploration and enables better generalization. The detailed update is shown in Equation (10).

θ^{'} = θ - α \cdot \nabla (\frac{1}{N} \sum_{i = 1}^{N} (A (s_{i}, a_{i}) \cdot \log π (a_{i} | s_{i}, θ)) - λ H (π (a_{i} | s_{i}, θ))),

(10)

where $H (π (a_{i} | s_{i}))$ is the entropy of the policy, which is used to encourage exploration by maximizing the uncertainty of the action distribution $π (a | s, θ)$ . $λ$ is the coefficient of entropy regularization, which controls the trade-off between exploration and exploitation. $α$ is the learning rate.
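A single update of the form in Equation (10) can be sketched for a softmax policy over a handful of discrete actions in one state. Several points are assumptions made for illustration: the policy is a bare vector of logits rather than a neural solver, $A$ is read as a cost-based advantage (sampled cost minus a baseline, so that descending $A \log π$ lowers the probability of costlier actions), and the names `loss_grad` and `fine_tune_step` are hypothetical.

```python
import numpy as np


def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()


def entropy(p):
    return float(-(p * np.log(p)).sum())


def loss(theta, actions, advantages, lam):
    """mean(A * log pi(a)) - lam * H(pi) for a state-independent softmax policy."""
    p = softmax(theta)
    pg = np.mean([adv * np.log(p[a]) for a, adv in zip(actions, advantages)])
    return float(pg - lam * entropy(p))


def loss_grad(theta, actions, advantages, lam):
    """Analytic gradient of `loss` with respect to the logits."""
    p = softmax(theta)
    grad = np.zeros_like(theta)
    for a, adv in zip(actions, advantages):
        one_hot = np.eye(len(theta))[a]
        grad += adv * (one_hot - p)                  # d log pi(a) / d theta
    grad /= len(actions)
    grad_entropy = -p * (np.log(p) + entropy(p))     # d H / d theta
    return grad - lam * grad_entropy


def fine_tune_step(theta, actions, advantages, alpha=0.1, lam=0.01):
    """One gradient-descent step on the entropy-regularized loss."""
    return theta - alpha * loss_grad(theta, actions, advantages, lam)


theta = np.array([2.0, 0.0, -1.0])
# With zero advantages only the entropy term acts on the policy.
theta_next = fine_tune_step(theta, [0], [0.0], alpha=0.5, lam=0.5)
print(entropy(softmax(theta)), entropy(softmax(theta_next)))
```

In this example the entropy term alone pushes the logits toward a flatter distribution, which is the exploration effect that $λ$ controls; with nonzero advantages the two terms trade off against each other.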

Experiments

We conducted computational experiments to assess performance on the MOTSP, the MOCVRP, and the MOKP. Following Chen et al. (2023b) and Lin et al. (2022), we evaluated instances of varying sizes: n = 20, 50, 100 for MOTSP/MOCVRP and n = 50, 100, 150 for MOKP. The experiments were executed on a system equipped with a 13th Gen Intel Core i9-13900KF processor and an RTX 4090 GPU.

Problem Setting

Our experiments focus on MOTSP (Lust & Teghem, 2010a), MOCVRP (Lacomme et al., 2006), and MOKP (Bazgan et al., 2009), applying a unified model setup across all tasks with variations in input sizes and masking methods specific to each problem. The core policy model is based on an attention-based encoder (Kool et al., 2018).

To generate the training and evaluation datasets, we create random problem instances with controlled characteristics to ensure diversity and assess model performance under varying difficulty levels. For MOTSP, instances are generated with uniformly distributed node coordinates on a $[0, 100]^{2}$ plane, and the two objectives (e.g., distance and time) are defined with correlation coefficients ranging from $- 0.8$ (high conflict) to $0.8$ (low conflict) to explicitly control the objective conflict level.

For MOCVRP, customer locations are randomly generated in a unit square and scaled to $[0, 100]^{2}$ , while customer demands follow a normalized distribution $d_{i} \sim U [0.01, 0.51]$ . Capacity constraint tightness is intrinsically controlled through demand aggregation: total instance demand is normalized to 1.0 against a fixed vehicle capacity of 1.0 , creating natural packing efficiency challenges. This induces $ρ \in [0.6, 0.95]$ constraint tightness levels without explicit capacity tuning. Vehicle usage is minimized rather than fixed, with maximum routes bounded by customer count.

For MOKP, item weights and two profit values per item are generated such that the pairwise correlation between objectives can be set to desired levels (conflicting or harmonious). Knapsack capacity tightness is controlled by setting the capacity to between $50\%$ and $80\%$ of the sum of the weights of all items, ensuring non-trivial constraint satisfaction.

For training, we generate 10,000 random problem instances per epoch and run the model for 3000 iterations, progressively reducing the training dataset size. The optimization process employs the Adam optimizer with an initial learning rate of $η = 10^{-4}$, subject to a decay rate of $10^{-6}$.

Hyperparameters

To ensure a fair comparison, the parameter settings are consistent with EMNH (Chen et al., 2023b). The meta-learning rate, denoted as $ϵ$, is gradually reduced from the initial value $ϵ_{0} = 1$ to $0$. For optimization, the Adam optimizer uses a fixed learning rate of $10^{-4}$. Key parameters are configured as $B = 64$, $T_{m} = 3000$, $T_{u} = 100$, and $\tilde{N} = M$, where $M$ is the number of objectives. The $N$ weight vectors for PF construction follow the method in Kool et al. (2018), with $N = C_{H + M - 1}^{M - 1}$. When $M = 2$ and $M = 3$, $H$ is set to $100$ ($N = 101$) and $13$ ($N = 105$), respectively.
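The count $C_{H + M - 1}^{M - 1}$ is the number of weight vectors whose $M$ components are non-negative multiples of $1/H$ summing to one. The sketch below enumerates such vectors with a stars-and-bars construction. That the weight vectors are generated exactly this way is an assumption inferred from the count formula, and the function name `simplex_lattice` is hypothetical.

```python
from itertools import combinations
from math import comb


def simplex_lattice(M: int, H: int):
    """All weight vectors with M components k/H (k a non-negative integer) summing to 1."""
    weights = []
    # Stars and bars: place M-1 bars among H+M-1 slots; gaps between bars are the counts.
    for bars in combinations(range(H + M - 1), M - 1):
        prev, counts = -1, []
        for b in bars:
            counts.append(b - prev - 1)
            prev = b
        counts.append(H + M - 2 - prev)
        weights.append(tuple(c / H for c in counts))
    return weights


bi = simplex_lattice(M=2, H=100)
tri = simplex_lattice(M=3, H=13)
print(len(bi), comb(100 + 2 - 1, 2 - 1))
print(len(tri), comb(13 + 3 - 1, 3 - 1))
```

Both counts agree with the values quoted above, $N = 101$ for $M = 2$ and $N = 105$ for $M = 3$.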

Metrics

The performance of GMRL is assessed based on solution quality, primarily using hypervolume (HV) and Gap (Lin et al., 2022). HV measures the volume of the objective space that is dominated by the obtained solution set and bounded by a reference point, so a larger HV is better, while Gap measures the relative difference in HV compared to our approach. Additionally, Time represents the duration required to solve 200 randomly generated test instances.
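For two objectives, HV and Gap can be computed with a short sweep. The sketch assumes minimization, a user-chosen reference point, and the gap definition $(HV_{best} - HV) / HV_{best}$; the reference point and sample points are hypothetical, and any normalization of HV used in the tables is not reproduced here.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref` (both objectives minimized)."""
    # Only points that improve on the reference point in both objectives contribute.
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    hv, best_f2 = 0.0, ref[1]
    for f1, f2 in pts:                 # sweep in ascending order of the first objective
        if f2 < best_f2:               # point is non-dominated among those seen so far
            hv += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return hv


def gap(hv, hv_best):
    """Relative HV difference with respect to the best method."""
    return (hv_best - hv) / hv_best


front = [(1.0, 5.0), (2.0, 3.0), (4.0, 1.0), (3.0, 4.0)]
print(hypervolume_2d(front, ref=(6.0, 6.0)))
print(gap(0.6260, 0.6272))
```

The dominated point `(3.0, 4.0)` adds no area. Applying `gap` to the rounded HV values 0.6260 and 0.6272 from Table 1 gives roughly 0.19%, in line with the Gap reported there for MTNCO on MOTSP with n = 20.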

Baselines

We compare against three strong baselines that utilize WS (weighted sum) scalarization to ensure a fair comparison. These include the cutting-edge meta-learning-based neural heuristics MDRL (Zhang et al., 2023) and EMNH (Chen et al., 2023b), as well as the advanced multi-task learning-based neural heuristic MTNCO (Liu et al., 2024a). These three frameworks rely on problem-specific heuristics to generate and explore feasible solutions across various tasks. Each neural heuristic adopts POMO as the foundational model for solving single-objective subproblems, with identical training data sizes and 3000 iterations applied across the three multi-objective combinatorial optimization problems.

Experimental Results

The performance on MOTSP, MOCVRP, and MOKP across various scales is detailed in Tables 1 to 3. These tables present key metrics, including average HV, Gap, and the total runtime required to solve 200 randomly generated test instances. To determine the significance of observed differences, a Wilcoxon rank-sum test was applied at a 1% significance level. In addition, to compare the generalization ability of the proposed algorithm, we designed three experiments. First, models were trained on MOCVRP-50 by EMNH and GMRL respectively, and each model was fine-tuned with EMNH’s fine-tuning method to solve MOCVRP-20 and MOCVRP-100. The results are shown in Table 4. Second, a model was trained on MOCVRP-50 by EMNH and fine-tuned by EMNH and GMRL respectively to solve MOCVRP-20 and MOCVRP-100. The results are shown in Table 5. Finally, to compare the ability of the algorithms to solve practical problems, GMRL was tested on three benchmark instances, KroAB100, KroAB150, and KroAB200 (Lust & Teghem, 2010b). The comparison with other algorithms is shown in Table 6. The best results and those not significantly different from them are highlighted in bold, while the second-best results and those not significantly different from them are underlined. Methods annotated with “-Aug” denote inference results enhanced with instance augmentation, as described in Lin et al. (2022). For every method, the tables report the HV and the Gap relative to GMRL-Aug. During the experiments, the training data size of all algorithms is gradually decreased by a factor of 10, and the results are reported in batches.

Table 1.

Results of 200 Random Instances of MOCOPs at Original Scale.

Method	MOTSP (n $=$ 20)			MOTSP (n $=$ 50)			MOTSP (n $=$ 100)
Method	HV $↑$	Gap $↓$	Time	HV $↑$	Gap $↓$	Time	HV $↑$	Gap $↓$	Time
MTNCO	0.6260	0.19%	1.09s	0.6351	0.89%	3.93s	0.6947	1.07%	11.21s
MDRL	0.6271	0.01%	1.20s	0.6360	0.75%	3.30s	0.6966	0.51%	12.68s
EMNH	0.6271	0.01%	1.21s	0.6360	0.75%	3.26s	0.6966	0.51%	12.68s
GMRL	0.6271	0.01%	1.20s	0.6362	0.72%	3.15s	0.6968	0.48%	12.62s
MDRL-Aug	0.6271	0.01%	27.62s	0.6406	0.03%	144.80s	0.7019	0.04%	747.80s
EMNH-Aug	0.6271	0.01%	27.60s	0.6406	0.03%	144.78s	0.7020	0.03%	747.81s
GMRL-Aug	0.6272	0.00%	27.50s	0.6408	0.00%	144.16s	0.7022	0.00%	747.16s

Method	MOKP (n $=$ 50)			MOKP (n $=$ 100)			MOKP (n $=$ 150)
Method	HV $↑$	Gap $↓$	Time	HV $↑$	Gap $↓$	Time	HV $↑$	Gap $↓$	Time
MTNCO	0.3552	0.25%	3.68s	0.4523	0.29%	10.60s	0.3055	0.29%	44.21s
MDRL	0.3530	0.87%	4.00s	0.4531	0.11%	11.02s	0.3061	0.10%	45.60s
EMNH	0.3560	0.03%	4.04s	0.4534	0.04%	11s	0.3063	0.03%	45.50s
GMRL	0.3561	0.00%	3.79s	0.4536	0.00%	11s	0.3064	0.00%	45.97s

Method	MOCVRP (n $=$ 20)			MOCVRP (n $=$ 50)			MOCVRP (n $=$ 100)
Method	HV $↑$	Gap $↓$	Time	HV $↑$	Gap $↓$	Time	HV $↑$	Gap $↓$	Time
MTNCO	0.4271	0.84%	1.90s	0.4055	1.37%	4.93s	0.3963	2.65%	13.21s
MDRL	0.4273	0.79%	2.70s	0.4060	1.00%	5.30s	0.4041	0.74%	18.68s
EMNH	0.4280	0.49%	2.63s	0.4076	0.61%	5.51s	0.4057	0.39%	18.56s
GMRL	0.4282	0.44%	2.12s	0.4078	0.56%	5.56s	0.4059	0.32%	18.51s
MDRL-Aug	0.4291	0.23%	6.02s	0.4082	0.46%	24.00s	0.4071	0.02%	120.80s
EMNH-Aug	0.4297	0.09%	6.09s	0.4096	0.12%	24.00s	0.4072	0.00%	120.52s
GMRL-Aug	0.4301	0.00%	5.77s	0.4101	0.00%	24.05s	0.4072	0.00%	120.49s

Table 2.

Results of 200 Random Instances of Small Scale MOCOPs.

Method	MOTSP (n $=$ 20)			MOTSP (n $=$ 50)			MOTSP (n $=$ 100)
Method	HV $↑$	Gap $↓$	Time	HV $↑$	Gap $↓$	Time	HV $↑$	Gap $↓$	Time
MTNCO	0.6250	0.35%	1.05s	0.6271	1.75%	2.93s	0.6847	1.48%	11.21s
MDRL	0.6265	0.11%	1.20s	0.6288	1.49%	3.30s	0.6859	1.31%	12.65s
EMNH	0.6265	0.11%	1.21s	0.6289	1.47%	3.22s	0.6860	1.30%	12.71s
GMRL	0.6269	0.05%	1.05s	0.6289	1.47%	3.11s	0.6866	1.21%	12.61s
MDRL-Aug	0.6271	0.51%	27.60s	0.6379	0.06%	143.60s	0.6945	0.07%	743.80s
EMNH-Aug	0.6272	0.00%	27.54s	0.6379	0.06%	143.57s	0.6946	0.06%	743.03s
GMRL-Aug	0.6272	0.00%	27.52s	0.6383	0.00%	143.44s	0.6950	0.00%	743.02s

Method	MOKP (n $=$ 50)			MOKP (n $=$ 100)			MOKP (n $=$ 150)
Method	HV $↑$	Gap $↓$	Time	HV $↑$	Gap $↓$	Time	HV $↑$	Gap $↓$	Time
MTNCO	0.3551	0.31%	3.67s	0.4521	0.33%	8.63s	0.3045	0.62%	40.21s
MDRL	0.3531	0.87%	4.02s	0.4530	0.13%	11.02s	0.3061	0.10%	45.60s
EMNH	0.3561	0.03%	3.80s	0.4534	0.04%	12.00s	0.3063	0.03%	44.61s
GMRL	0.3562	0.00%	3.84s	0.4536	0.00%	12.23s	0.3064	0.00%	44.30s

Method	MOCVRP (n $=$ 20)			MOCVRP (n $=$ 50)			MOCVRP (n $=$ 100)
Method	HV $↑$	Gap $↓$	Time	HV $↑$	Gap $↓$	Time	HV $↑$	Gap $↓$	Time
MTNCO	0.4245	1.11%	1.89s	0.4017	1.57%	3.71s	0.3987	1.60%	13.47s
MDRL	0.4253	0.91%	2.10s	0.4033	1.18%	5.56s	0.4006	1.13%	18.51s
EMNH	0.4265	0.63%	2.28s	0.4043	0.93%	5.53s	0.4021	0.77%	18.62s
GMRL	0.4267	0.58%	2.10s	0.4046	0.86%	5.89s	0.4029	0.58%	18.59s
MDRL-Aug	0.4281	0.26%	6.02s	0.4066	0.37%	24.00s	0.4036	0.39%	120.80s
EMNH-Aug	0.4289	0.07%	6.09s	0.4077	0.10%	24.13s	0.4048	0.10%	120.07s
GMRL-Aug	0.4292	0.00%	5.82s	0.4081	0.00%	24.46s	0.4052	0.00%	120.84s

Table 3.

Results of 200 Random Instances of Extremely Small Scale MOCOPs.

Method	MOTSP (n $=$ 20)			MOTSP (n $=$ 50)			MOTSP (n $=$ 100)
Method	HV $↑$	Gap $↓$	Time	HV $↑$	Gap $↓$	Time	HV $↑$	Gap $↓$	Time
MTNCO	0.6238	0.54%	1.06s	0.6221	2.08%	3.93s	0.6743	1.79%	11.21s
MDRL	0.6255	0.27%	1.18s	0.6228	1.97%	3.20s	0.6757	1.59%	12.55s
EMNH	0.6257	0.24%	1.22s	0.6234	1.87%	3.14s	0.6757	1.59%	12.54s
GMRL	0.6257	0.24%	1.20s	0.6246	1.68%	3.11s	0.6759	1.56%	12.52s
MDRL-Aug	0.6269	0.05%	27.61s	0.6345	0.13%	143.63s	0.6862	0.06%	748.80s
EMNH-Aug	0.6270	0.03%	27.60s	0.6347	0.09%	144.48s	0.6862	0.06%	750.94s
GMRL-Aug	0.6272	0.00%	27.50s	0.6353	0.00%	144.38s	0.6866	0.00%	750.59s

Method	MOKP (n $=$ 50)			MOKP (n $=$ 100)			MOKP (n $=$ 150)
Method	HV $↑$	Gap $↓$	Time	HV $↑$	Gap $↓$	Time	HV $↑$	Gap $↓$	Time
MTNCO	0.3551	0.31%	3.67s	0.4520	0.35%	8.60s	0.3043	0.55%	40.21s
MDRL	0.3531	0.87%	4.03s	0.4532	0.09%	11.02s	0.3060	0.13%	45.60s
EMNH	0.3560	0.06%	4.04s	0.4532	0.09%	12.13s	0.3064	0.00%	43.61s
GMRL	0.3562	0.00%	3.79s	0.4536	0.00%	12.03s	0.3064	0.55%	42.63s

Method	MOCVRP (n $=$ 20)			MOCVRP (n $=$ 50)			MOCVRP (n $=$ 100)
Method	HV $↑$	Gap $↓$	Time	HV $↑$	Gap $↓$	Time	HV $↑$	Gap $↓$	Time
MTNCO	0.4218	1.52%	1.92s	0.3960	2.32%	3.79s	0.3872	3.13%	13.45s
MDRL	0.4233	1.17%	5.60s	0.3983	1.75%	5.56s	0.3916	2.03%	18.51s
EMNH	0.4241	0.98%	5.71s	0.3996	1.43%	5.51s	0.3945	1.30%	19.00s
GMRL	0.4245	0.89%	6.09s	0.4001	1.31%	5.72s	0.3954	1.08%	19.09s
MDRL-Aug	0.4271	0.28%	6.02s	0.4034	0.49%	24.00s	0.3966	0.76%	120.80s
EMNH-Aug	0.4277	0.14%	5.95s	0.4046	0.20%	24.02s	0.3990	0.18%	125.56s
GMRL-Aug	0.4283	0.00%	5.32s	0.4054	0.00%	24.86s	0.3997	0.00%	125.90s

Table 4.

Comparison Results of Generalization Ability of Two Algorithm Models.

Step	MOCVRP (n $=$ 20)				MOCVRP (n $=$ 100)
	EMNH		GMRL		EMNH		GMRL
	HV $↑$	Gap $↓$	HV $↑$	Gap $↓$	HV $↑$	Gap $↓$	HV $↑$	Gap $↓$
10 Step	0.4231	0.09%	0.4235	0.00%	0.3941	0.18%	0.3948	0.00%
20 Step	0.4237	0.20%	0.4245	0.00%	0.3939	0.43%	0.3956	0.00%
50 Step	0.4240	0.16%	0.4247	0.00%	0.3964	0.10%	0.3968	0.00%
80 Step	0.4247	0.02%	0.4248	0.00%	0.3969	0.08%	0.3972	0.00%
100 Step	0.4248	0.02%	0.4249	0.00%	0.3969	0.08%	0.3972	0.00%

Table 5.

Comparison Results of Generalization Ability of Two Fine-Tuning Methods.

Step	MOCVRP (n $=$ 20)				MOCVRP (n $=$ 100)
	EMNH		GMRL		EMNH		GMRL
	HV $↑$	Gap $↓$	HV $↑$	Gap $↓$	HV $↑$	Gap $↓$	HV $↑$	Gap $↓$
10 Step	0.4231	0.05%	0.4233	0.00%	0.3941	0.10%	0.3945	0.00%
20 Step	0.4237	0.12%	0.4242	0.00%	0.3939	0.25%	0.3949	0.00%
50 Step	0.4240	0.12%	0.4245	0.00%	0.3964	0.08%	0.3967	0.00%
80 Step	0.4247	0.00%	0.4247	0.00%	0.3969	0.05%	0.3971	0.00%
100 Step	0.4248	0.00%	0.4248	0.00%	0.3969	0.08%	0.3972	0.00%

Table 6.

Results of Generalization Capability on Benchmark Instances.

Method	KroAB100			KroAB150			KroAB200
Method	HV $↑$	Gap $↓$	Time	HV $↑$	Gap $↓$	Time	HV $↑$	Gap $↓$	Time
MTNCO	0.6887	1.46%	4.53s	0.6853	0.67%	13.23s	0.7248	1.96%	101.21s
MDRL	0.6890	1.42%	5.42s	0.6865	0.49%	14.20s	0.7264	1.61%	112.56s
EMNH	0.6909	1.00%	4.83s	0.6866	0.48%	13.89s	0.7272	1.50%	108.39s
GMRL	0.6914	0.93%	4.90s	0.6874	0.36%	14.87s	0.7274	1.48%	106.15s
MDRL-Aug	0.6959	0.27%	61.37s	0.6886	0.19%	253.14s	0.7341	0.57%	238.80s
EMNH-Aug	0.6970	0.13%	59.91s	0.6888	0.16%	237.98s	0.7362	0.28%	233.33s
GMRL-Aug	0.6979	0.00%	59.97s	0.6899	0.00%	247.14s	0.7383	0.00%	233.53s

Analysis of Results

According to the results in Tables 1 to 3, GMRL achieves the best or tied-best HV among the non-augmented neural heuristics on all three problems. With instance augmentation, GMRL-Aug further improves solution quality: it outperforms the other baselines on MOTSP with n $=$ 50 and n $=$ 100 and also performs well on MOCVRP. The entire training process of GMRL and EMNH on MOCVRP-20 in the three experimental stages is depicted in Figure 3. In the first stage, illustrated in Figure 3(a), the HV of both algorithms reaches a high level in the early-to-middle period of training and grows only slowly afterwards, which leaves little room to distinguish the two methods. The second stage, shown in Figure 3(b), and the third stage, shown in Figure 3(c), mitigate this issue by reducing the size of the training data, and therefore reflect the characteristics of each algorithm more clearly. Figure 4 presents the results produced by models trained on training sets of various scales. Figures 3 and 4 show that the advantage of GMRL becomes increasingly prominent as the size of the training data decreases, indicating that GMRL has higher data utilization and training efficiency than the other neural heuristics.

Figure 3.

Comparison of MOCVRP-20 in Three Phases.

Figure 4.

Comparison of MOCVRP-50 at Different Scales.

In terms of generalization capability, Table 4 shows that models trained by GMRL achieve better performance and faster convergence in cross-problem-size generalization. The results in Table 5 further indicate that GMRL's fine-tuning method also achieves higher performance and faster convergence in cross-problem-size generalization. A comparison of the two tables reveals that the Global Adaptation Algorithm (GAA) plays a more crucial role than Efficient Fine-tuning (EF) in achieving these results. Furthermore, GMRL exhibits its largest performance lead when the number of fine-tuning steps is 20. When the number of fine-tuning steps is relatively large (e.g., 80 and 100 steps), the difference in HV between the two algorithms becomes small, and both achieve favorable results.

According to the results in Table 6, GMRL-Aug outperforms the compared algorithms on all three benchmark instances. Notably, its performance lead widens progressively as the problem scale increases. This demonstrates that GMRL has an enhanced capability for tackling complex real-world problems.

Therefore, the improvements brought by GMRL are more meaningful under the following conditions:

Scenarios with scarce data or high annotation costs: When it is difficult to obtain large-scale training data, the high data utilization rate of GMRL makes it more practical.

Scenarios that require rapid model adaptation: In cross-problem-scale generalization tasks (as shown in Tables 4 and 5), GMRL converges faster, which is crucial for applications that require rapid deployment to problems of different scales.

Robustness and potential in large-scale applications: As shown in Table 6, the performance advantage of GMRL becomes more apparent as the problem size increases (from KroAB100 to KroAB200), indicating its potential for solving complex real-world problems.

GMRL-Aug enhances performance by running multiple inferences and aggregating results, though this inevitably increases computation time. This trade-off makes it especially suitable for: (1) applications where solution quality outweighs computational costs (e.g., high-value logistics optimization), (2) offline planning (with flexible time constraints), (3) establishing performance upper bounds for the method.
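
The aggregation step is not detailed in this section. One common choice, assumed here purely for illustration, is to pool the objective vectors produced by all inference runs and keep only the non-dominated ones. The sketch below does this for two minimized objectives and computes an unnormalized 2D hypervolume against a hypothetical reference point, so its values are not on the same scale as the HV columns in the tables.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float]


def nondominated(points: Iterable[Point]) -> List[Point]:
    """Keep points not dominated by any other (both objectives minimized)."""
    front: List[Point] = []
    best_f2 = float("inf")
    # After sorting by f1, a point is non-dominated exactly when its f2
    # is strictly smaller than every f2 seen so far.
    for f1, f2 in sorted(set(points)):
        if f2 < best_f2:
            front.append((f1, f2))
            best_f2 = f2
    return front


def hypervolume_2d(points: Iterable[Point], ref: Point) -> float:
    """Area dominated by the front and bounded by the reference point."""
    area = 0.0
    prev_f2 = ref[1]
    for f1, f2 in nondominated(points):
        if f1 >= ref[0] or f2 >= prev_f2:
            continue  # the point adds no area inside the reference box
        area += (ref[0] - f1) * (prev_f2 - f2)
        prev_f2 = f2
    return area


# Hypothetical objective vectors from two inference runs on one instance.
run_a = [(1.0, 4.0), (3.0, 3.0)]
run_b = [(2.0, 2.5), (4.0, 1.0)]
ref_point = (5.0, 5.0)  # assumed reference point

merged_front = nondominated(run_a + run_b)
hv_a = hypervolume_2d(run_a, ref_point)
hv_b = hypervolume_2d(run_b, ref_point)
hv_merged = hypervolume_2d(merged_front, ref_point)
```

Because the merged front dominates at least the region covered by each individual run, its hypervolume can never be smaller than that of any single run. The cost is one inference pass per run, which matches the runtime increase of the "-Aug" rows.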

Conclusion

In this paper, we propose the Gradient-adaptive Meta-Reinforcement Learning (GMRL) algorithm, which improves data efficiency and training stability through tensor reparameterization and preconditioned gradient optimization. Extensive experiments on multi-objective combinatorial optimization problems (MOTSP, MOCVRP, and MOKP) and on standard benchmarks demonstrate that GMRL achieves competitive solution quality, particularly in data-scarce scenarios and cross-scale generalization tasks, where it also converges faster. The enhanced GMRL-Aug variant further improves performance, albeit with higher computational demands, which is a worthwhile trade-off for precision-sensitive offline applications. Although GMRL, as a neural heuristic, does not theoretically guarantee exact Pareto optimality, its data efficiency and robustness represent a significant advancement. Future directions include: (1) extending GMRL to constrained MOCOPs with complex feasibility conditions, (2) developing efficient augmentation strategies to reduce computational costs, and (3) investigating synergy with large language models for initial solution generation and heuristic refinement.

Acknowledgments

This research is supported by the National Natural Science Foundation of China (Grant No. 62576239), the University Natural Science Research Project of Anhui Province (Grant No. 2023AH040056), the Natural Science Research Project of Anhui Province (Graduate Research Project, Grant No. YJS20210463), the 2023 New Era Education Quality Engineering Project of Anhui Province (Graduate Research Project, Grant No. 2023xscx086), the Funding Plan for Scientific Research Activities of Academic and Technical Leaders and Reserve Candidates in Anhui Province (Grant No. 2021H264), the Top Talent Project of Disciplines (Majors) in Colleges and Universities in Anhui Province (Grant No. gxbjZD2022021), the University Synergy Innovation Program of Anhui Province, China (Grant No. GXXT-2022-033), and the Graduate Innovation Fund of Huaibei Normal University (Grant No. CX2023043).

ORCID iDs

Fangzhen Ge

Mingshi Wang

Authors contribution statement

Mingshi Wang: Conceptualization of this study, Methodology, Software, Writing. Fangzhen Ge: Supervision, Writing - Review & Editing, Funding acquisition. Debao Chen: Visualization, Investigation. Longfeng Shen: Supervision. Huaiyu Liu: Data curation.

Funding

The author(s) received no financial support for the research, authorship and/or publication of this article.

Declaration of Conflicting Interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

References

Amari, S. I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276. https://doi.org/10.1162/089976698300017746

Amari, S. I., Grosse, Nitanda, & Suzuki (2020). When does preconditioning help or hurt generalization? arXiv preprint arXiv:2006.10732.

Bazgan, Hugot, & Vanderpooten (2009). Solving efficiently the 0–1 multi-objective knapsack problem. Computers & Operations Research, 36(1), 260–279. https://doi.org/10.1016/j.cor.2007.09.009

Bertsimas & Tsitsiklis (1993). Simulated annealing, volume 8. Dordrecht: Institute of Mathematical Statistics. https://doi.org/10.1214/ss/1177011077

Chen & Dohan (2023a). Evoprompting: Language models for code-level neural architecture search. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt & S. Levine (Eds.), Advances in neural information processing systems (Vol. 36, pp. 7787–7817). https://proceedings.neurips.cc/paper_files/paper/2023/file/184c1e18d00d7752805324da48ad25be-Paper-Conference.pdf

Chen, Wang, Zhang, Cao, & Chen (2023b). Efficient meta neural heuristic for multi-objective combinatorial optimization. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt & S. Levine (Eds.), Advances in neural information processing systems (Vol. 36, pp. 56825–56837). https://proceedings.neurips.cc/paper_files/paper/2023/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf

Deb, Pratap, Agarwal, & Meyarivan (2002). A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II, Volume 6. New York, NY: IEEE. https://doi.org/10.1109/4235.996017

Gendreau & Potvin, J. Y. (2005). Tabu search. In E. K. Burke & G. Kendall (Eds.), Search Methodologies: Introductory tutorials in optimization and decision support techniques (pp. 165–186). Boston, MA: Springer US. ISBN 978-0-387-28356-2. https://doi.org/10.1007/0-387-28356-0_6

Ibarz, Kurin, Papamakarios, Nikiforou, Bennani, Csordás, Dudzik, A. J., Bošnjak, Vitvitskyi, Rubanova, Deac, Bevilacqua, Ganin, Blundell, & Veličković (2022). A generalist neural algorithmic learner. In B. Rieck & R. Pascanu (Eds.), Proceedings of the first learning on graphs conference, Proceedings of Machine Learning Research (Vol. 198, pp. 2:1–2:23). https://proceedings.mlr.press/v198/ibarz22a.html

Kang, Hwang, Kim, & Rhee (2023). Meta-learning with a geometry-adaptive preconditioner. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 16080–16090). https://arxiv.org/abs/2304.01552

Kolda, T. G., & Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51(3), 455–500. https://doi.org/10.1137/07070111X

Kool, Van Hoof, & Welling (2018). Attention, learn to solve routing problems! arXiv preprint arXiv:1803.08475. https://arxiv.org/abs/1803.08475

Kwon, Y. D., Choo, Kim, Yoon, Gwon, & Min (2020). Pomo: Policy optimization with multiple optima for reinforcement learning. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan & H. Lin (Eds.), Advances in neural information processing systems (Vol. 33, pp. 21188–21198). https://proceedings.neurips.cc/paper_files/paper/2020/file/f231f2107df69eab0a3862d50018a9b2-Paper.pdf

Lacomme, Prins, & Sevaux (2006). A genetic algorithm for a bi-objective capacitated arc routing problem. Computers & Operations Research, 33(12), 3473–3493. https://doi.org/10.1016/j.cor.2005.02.017

Lee, J. M. (2012). Smooth manifolds. New York, NY: Springer New York. ISBN 978-1-4419-9982-5, pp. 1–31. https://doi.org/10.1007/978-1-4419-9982-5_1

Lee & Choi (2018). Gradient-based meta-learning with learned layerwise metric and subspace. In J. Dy & A. Krause (Eds.), Proceedings of the 35th international conference on machine learning, Proceedings of Machine Learning Research (Vol. 80, pp. 2927–2936). https://proceedings.mlr.press/v80/lee18a.html

Zhou & Chen (2017). Meta-sgd: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835. https://arxiv.org/abs/1707.09835

Lin, Yang, & Zhang (2022). Pareto set learning for neural multi-objective combinatorial optimization. arXiv preprint arXiv:2203.15386. https://doi.org/10.48550/arXiv.2203.15386

Liu, Lin, Wang, Zhang, Xialiang, & Yuan (2024a). Multi-task learning for routing problem with cross-problem zero-shot generalization. In Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining KDD ’24 (pp. 1898–1908). New York, NY, USA: Association for Computing Machinery. ISBN 9798400704901. https://doi.org/10.1145/3637528.3672040

Liu, Xialiang, Yuan, Lin, Luo, Wang, & Zhang (2024b). Evolution of heuristics: Towards efficient automatic algorithm design using large language model. In Forty-first international conference on machine learning. https://openreview.net/forum?id=BwAkaxqiLB

Lourenço, H. R., Martin, O. C., & Stützle (2003). Iterated local search. In F. Glover & G. A. Kochenberger (Eds.), Handbook of metaheuristics (pp. 320–353). Boston, MA: Springer US. ISBN 978-0-306-48056-0. https://doi.org/10.1007/0-306-48056-5_11

Lust & Teghem (2010a). The multiobjective traveling salesman problem: A survey and a new approach. Berlin, Heidelberg: Springer Berlin Heidelberg. ISBN 978-3-642-11218-8, pp. 119–141. https://doi.org/10.1007/978-3-642-11218-8_6

Lust & Teghem (2010b). Two-phase pareto local search for the biobjective traveling salesman problem. Journal of Heuristics, 16(3), 475–510. https://doi.org/10.1007/s10732-009-9103-9

Meyerson, Nelson, M. J., Bradley, Gaier, Moradi, Hoover, A. K., & Lehman (2023). Language model crossover: Variation through few-shot prompting. arXiv preprint arXiv:2302.12170. https://arxiv.org/abs/2302.12170

Nichol, Achiam, & Schulman (2018). On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999. https://arxiv.org/abs/1803.02999

Rajasegaran, Khan, Hayat, Khan, F. S., & Shah (2020). Meta-learning the learning trends shared across tasks. arXiv preprint arXiv:2010.09291. https://arxiv.org/abs/2010.09291

Reed, Zolna, Parisotto, Colmenarejo, S. G., Novikov, Barth-Maron, Gimenez, Sulsky, Kay, Springenberg, J. T., Eccles, Bruce, Razavi, Edwards, Heess, Chen, Hadsell, Vinyals, Bordbar, & De Freitas (2022). A generalist agent. arXiv preprint arXiv:2205.06175. https://arxiv.org/abs/2205.06175

Romera-Paredes, Barekatain, Novikov, Balog, Kumar, M. P., Dupont, Ruiz, F. J. R., Ellenberg, J. S., Wang, Fawzi, Kohli, & Fawzi (2024). Mathematical discoveries from program search with large language models. Nature, 625(7995), 468–475. https://doi.org/10.1038/s41586-023-06924-6

Simon, Koniusz, Nock, & Harandi (2020). On modulating the gradient for meta-learning. In A. Vedaldi, H. Bischof, T. Brox & J. M. Frahm (Eds.), Computer Vision – ECCV 2020 (pp. 556–572). Cham: Springer International Publishing. ISBN 978-3-030-58598-3.

von Oswald, Zhao, Kobayashi, Schug, Caccia, Zucchet, & Sacramento (2021). Learning where to learn: Gradient sparsity in meta and continual learning. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang & J. W. Vaughan (Eds.), Advances in neural information processing systems (Vol. 34, pp. 5250–5263). https://proceedings.neurips.cc/paper_files/paper/2021/file/2a10665525774fa2501c2c8c4985ce61-Paper.pdf

Wang (2023). Efficient training of multi-task combinatorial neural solver with multi-armed bandits. arXiv preprint arXiv:2305.06361. https://arxiv.org/abs/2305.06361

Wang, Dai, & Liu (2024). Adagc: A novel adaptive optimization algorithm with gradient bias correction. Expert Systems with Applications, 256, 124956. https://doi.org/10.1016/j.eswa.2024.124956

Yang, Zhao, Zhu, Zhou, Jia, & Zan (2024). Zhongjing: Enhancing the chinese medical capabilities of large language model through expert feedback and real-world multi-turn dialogue. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17), 19368–19376. https://doi.org/10.1609/aaai.v38i17.29907

Zhang (2007). Moea/d: A multiobjective evolutionary algorithm based on decomposition. IEEE Transactions on Evolutionary Computation, 11(6), 712–731. https://doi.org/10.1109/TEVC.2007.892759

Zhang, Wang, Zhang, & Zhou (2021). Modrl/d-el: Multiobjective deep reinforcement learning with evolutionary learning for multiobjective optimization. In 2021 International joint conference on neural networks (IJCNN) (pp. 1–8). https://doi.org/10.1109/IJCNN52387.2021.9534083

Zhang & Wang (2023). Meta-learning-based deep reinforcement learning for multiobjective optimization problems. IEEE Transactions on Neural Networks and Learning Systems, 34(10), 7978–7991. https://doi.org/10.1109/TNNLS.2022.3148435

Zhao, Kobayashi, Sacramento, & von Oswald (2020). Meta-learning via hypernetworks. In Proceedings of the 4th workshop on meta-learning at NeurIPS 2020 (MetaLearn 2020). s.l.: NeurIPS. https://doi.org/10.3929/ethz-b-000465883