Combining decomposition and graph capsule network for multi-objective vehicle routing optimization

Abstract

To alleviate urban congestion, improve vehicle mobility, and raise logistics delivery efficiency, this paper establishes a practical graph-based multi-objective, multi-constraint logistics delivery model and proposes a solution framework that combines a decomposition strategy with deep reinforcement learning (DRL). First, taking into account practical constraints in urban logistics distribution such as customer distribution, vehicle load limits, and time windows, a multi-constraint, multi-objective mathematical model is established with the goals of minimizing the total route length, the total cost, and the makespan of the delivery routes. Second, based on the decomposition strategy, a DRL framework built on a Graph Capsule Network (G-Caps Net) is designed to optimize urban logistics delivery routes. The framework takes the node information of the VRP as input in the form of a 2D graph, modifies the graph attention capsule network by considering multi-layer features, edge information, and residual connections between layers in the graph structure, and uses the length (norm) of the output capsule vector in place of a computed probability. The network is then trained with the REINFORCE algorithm with a rollout baseline, and a 2-opt local search strategy and a sampling search strategy are used to improve solution quality. Finally, the proposed method is evaluated on standard benchmark instances of different scales. The experimental results show that the constructed model and solution framework can improve logistics delivery efficiency. The method achieves the best overall performance among the compared methods, including state-of-the-art ones, and shows considerable potential for practical engineering use.

Keywords

Urban logistics distribution; multi-objective optimization; deep reinforcement learning; decomposition strategy; graph capsule network; attention mechanism

1. Introduction

The Vehicle Routing Problem (VRP) is a typical combinatorial optimization problem (COP) and is NP-hard. In practical logistics distribution, delivery efficiency is limited by vehicle load (capacity) and customer time window constraints. The VRP with capacity and time window constraints (CVRPTW) is therefore also NP-hard, and it has broader practical application prospects. When solving CVRPTW, several objectives and costs are generally considered. The cost can be the number of vehicles or routes, the total distance traveled by all vehicles, the total travel time, or customer satisfaction. When costs are diversified, all of them should be optimized simultaneously [1]. CVRPTW is therefore a multi-objective optimization problem (MOP), and exploring efficient solution methods for it has important theoretical and practical significance for the development of logistics supply chains.

In recent years, methods for solving the vehicle routing problem have emerged in large numbers. Most of them establish a mathematical model and optimize vehicle routes by defining different types of variables, constraint functions, and objective functions. The commonly used methods are mainly exact algorithms and heuristics [2]. Exact algorithms establish a mathematical model for a specific problem and solve it with mathematical methods, which guarantees that the optimal solution is found. However, because the problem is NP-hard, they are difficult to apply to instances with more than 50 customers [3]. Heuristic search algorithms are derived from optimization algorithms; their basic idea is to return a feasible solution to the COP within an acceptable amount of computation. Common heuristic search algorithms include Ant Colony Optimization (ACO) [4], the Genetic Algorithm (GA) [5], and Particle Swarm Optimization (PSO) [6]. Compared with exact algorithms, heuristics are more robust and practical on large-scale VRPs, but they usually have to be designed for a specific problem and require professional domain knowledge [7], and they struggle to find good solutions to large-scale problems in polynomial time. Exact algorithms produce optimal solutions and meta-heuristics produce near-optimal ones, but neither scales to problems with hundreds or potentially thousands of customers, and there is still room to improve solving efficiency. Quickly solving such large-scale problems with reasonable accuracy therefore remains an open challenge, which motivates learning-based methods that find near-optimal solutions efficiently.

In recent years, a growing number of studies have applied DRL to COPs [8] and achieved breakthrough progress. DRL combines deep learning (DL) and reinforcement learning (RL). It uses the representation power of deep learning to fit the action-value function or the policy function of an RL agent, which effectively solves the problem of storing action values and state values when the action space and state space are too large. DRL is mainly applied to sequential decision-making tasks. The Go programs AlphaGo [9] and AlphaGo Zero [10] combined DRL models with Monte Carlo tree search (MCTS) to defeat the world Go champion, which pushed theoretical and applied research on DRL to new heights and opened up new ideas for using DRL to solve the VRP [11]. Selecting decision variables in the discrete decision space of the VRP is naturally similar to sequential decision-making in RL, and the “offline training, online decision-making” character of DRL makes it possible to solve the VRP online in real time. Using DRL to solve traditional COPs is therefore a promising choice. Compared with traditional combinatorial optimization algorithms (COA), DRL-based COAs offer advantages such as fast solving speed and strong generalization ability, and they have become a research hotspot in recent years. Table 1 summarizes DRL-based methods for solving the VRP and related problems.

Table 1
VRP solution method based on DRL.

Method	Type	Training scale	Heuristic type	DL model	DRL algorithm
Nazari et al. [12]	TSP, CVRP	20, 50, 100	Construction + sampling	RNN (Pointer net) + ATT	Reinforce
NeuRewriter [13]	CVRP	20, 50, 100	Improvement (swap)	RNN + ATT	A2C
L2I [14]	CVRP	20, 50, 100	Improvement (hybrid)	ATT	Reinforce
NLNS [15]	CVRP, SDVRP	20, 50, 100	Improvement (ruin and repair)	ATT	A2C
MDAM [16]	TSP, CVRP, SDVRP	20, 50, 100	Construction + beam search	Transformer (AM)	Reinforce
POMO [17]	TSP, CVRP	20, 50, 100	Construction + augmentation	Transformer (AM)	POMO
Zhao et al. [18]	CVRP, VRPTW	20, 50, 100	Construction + local search	RNN + ATT	A2C
MAAM [19]	VRPSTW	20, 50, 100, 150	Construction + sampling	Transformer (AM)	MARL
Wu et al. [20]	TSP, CVRP	20, 50, 100	Improvement (2-opt, swap, insert)	Modified Transformer	A2C
Zhang et al. [21]	MOVRPTW	50	Construction + sampling	Transformer (AM)	Reinforce + EL
ANN-AM [22]	TDTSPTW	20, 40	Construction + sampling	Transformer (AM) + RNN	Reinforce
LCP [23]	TSP, CVRP, PCTSP	20, 50, 100	Construction + improvement	AM/Pointer net	Reinforce
P-MOCO [24]	MOTSP, MOVRP	20, 50, 100	Construction + augmentation	RNN + ATT	A2C
EAS [25]	TSP, CVRP	100	Construction + active search	Transformer (POMO)	POMO + IL

The literature analysis above shows that research on the VRP and its variants has mainly focused on heuristic methods. Although DRL research on COPs has developed rapidly, the field still has the following limitations:

(1) Heuristic methods usually adopt the idea of “partitioning before optimization”, in which different partitions are optimized separately, so the overall correlation between partitions is lost.

(2) Current research on DRL methods for COPs mainly focuses on solving the TSP, the VRP, and similar problems through the interaction between a learning agent and its environment, while research on solving CVRPTW remains relatively scarce.

(3) Graph neural networks (GNNs), as a powerful tool for processing non-Euclidean data and extracting graph information, have been studied extensively in recent years. However, when GNNs and RL are used to solve COPs, the dependencies between edges in the graph structure are not considered, and the information in multi-constraint, multi-objective COPs is not mined deeply enough.

Based on the above analysis, this paper uses G-Caps Net to build an end-to-end DRL framework, for two reasons. The VRP has a graph structure [26], so its nodes can be embedded using graph/network information, and the Graph Capsule Network (G-Caps Net) is effective on spatial distribution problems [27], so it can be used to model graph COPs. In this framework, the node information of CVRPTW is input to the model in the form of a 2D graph, and the graph attention capsule network (G-AT Caps Net) is modified by considering multi-level features, edge information, and residual connections between layers in the graph structure [28]. Finally, the next node is selected, through a search strategy such as greedy search or sampling, based on the length (norm) of the output capsule vector rather than on a probability distribution.

The network framework designed in this paper is called the Residual Edge Graph Attention Capsule Network (Res-E-G-AT-Caps Net). The model considers edge features as well as node features (edge information mainly refers to the weighted distance between nodes). First, the node and edge information (such as weights) in the MOCVRPTW graph is fused and updated, and richer feature information is captured through the multi-head attention (MHA) mechanism. All features are encoded by the primary capsule layer, followed by a residual connection and batch normalization (BN). The features are then extracted by the convolutional capsule layer, again followed by a residual connection and BN. Finally, the fully connected capsule layer outputs the individual capsules (as vectors), and the length of each capsule vector represents the probability of selecting the corresponding node. To handle multiple objectives, all objectives are normalized and integrated, that is, the reward function combines them as a weighted sum [29]. The entire network is optimized with the REINFORCE algorithm with a rollout baseline.

The main contributions of this paper are summarized as follows:

(1) For the multi-objective optimization of MOCVRPTW, the multi-objective problem is transformed, following the idea of decomposition, into a set of scalar sub-problems, and all sub-problems are optimized collaboratively through parameter transfer.

(2) A Res-E-G-AT-Caps Net framework is proposed to capture the graph structure information of MOCVRPTW in depth. G-Caps Net encodes and extracts the node and edge features of the graph and captures the association between the local position of each graph node and the global structure. Residual connections are also used to reduce the feature loss caused by information filtering.

(3) A loss function suitable for multi-objective problems is designed. The REINFORCE algorithm with a rollout baseline is used to train the policy network, which speeds up and stabilizes training. Different local search strategies are then applied to further improve solution quality.

(4) The proposed framework can be extended to other graph COPs, is simple to implement, and offers high efficiency and strong generalization ability. Its efficiency is evaluated on a randomly generated MOCVRPTW instance dataset, and its generalization ability is verified on standard test instances.

2. Problem description and model building

2.1 Problem description

Generally, CVRPTW can be described as follows. A distribution center in a known region dispatches several vehicles, which must complete multiple logistics orders in sequence and without repetition, provided that the customer demand on each route does not exceed the vehicle’s carrying capacity. In addition, each order has a specific delivery time window; if a delivery vehicle arrives earlier or later than this window, a time penalty cost is added to the final total transportation cost (that is, the time window is soft). Under these constraints, the distribution routes should be planned to minimize the total delivery cost, where the cost can be the number of vehicles or routes, the total distance traveled by all vehicles, the total travel time, and so on. Because several such costs are optimized at once, CVRPTW is a multi-objective optimization problem, namely MOCVRPTW.

In this paper, MOCVRPTW is described as follows. A distribution region contains one distribution center with a known location, from which several vehicles depart. The delivery demands of the customers in the region are served in sequence and without repetition, each vehicle has a maximum load, and each customer node has a soft delivery time window. Under these constraints, the vehicle routes are planned to optimize the total transportation distance, the total logistics delivery cost, and the makespan.

MOCVRPTW depends on many factors, and its mathematical model is complex and has numerous constraints. To facilitate modeling, this study makes the following assumptions:

(1) Only a single logistics distribution center is considered.
(2) Each delivery vehicle starts from the distribution center and returns to it after completing all of its customer deliveries.
(3) All vehicles have the same capacity and driving speed and are subject to a maximum travel distance.
(4) Each vehicle serves only one route.
(5) The demand and location coordinates of each customer are known and fixed, and all customer nodes are connected.
(6) Each customer node may be visited only once, so a customer node that has already been served is no longer a candidate for any other vehicle.
(7) Temporary vehicle breakdowns and incorrect deliveries during the delivery process are not considered.

2.2 Model of MOCVRPTW

MOCVRPTW is modeled on a graph $G (V, E)$, where $V$ is the set of customer nodes together with the distribution center, and the edge set $E \subset V \times V$ represents the paths connecting customers and the distribution center. Let the distribution center be $D_{0}$ and the customer set be $Cus = \{1, 2, 3, \dots, n\}$, where $n$ is the total number of customers and $q_{i}$, $i \in Cus$, is the demand of customer $i$. The set of all nodes is $V = \{D_{0}\} \cup Cus$, the set of vehicles is $K = \{1, 2, 3, \dots, m\}$, where $m$ is the number of vehicles, and the maximum capacity of each vehicle is $Q$. For all $i \in V$, $q_{i} \leqslant Q$. The distance between any two nodes (distribution center or customer) is the Euclidean distance $Dis (i, j) = \sqrt{(x_{i} - x_{j})^{2} + (y_{i} - y_{j})^{2}}$, $i, j \in V$, where $(x_{i}, y_{i})$ are the coordinates of node $i$. The time window of customer node $i$ is $[t_{s i}, t_{e i}]$ with $t_{s i} < t_{e i}$, and its service time is $s_{i}$. The weight of each edge is the distance $Dis (i, j)$ between nodes $i$ and $j$, and the driving time of a vehicle on this edge is $t_{i j} = \frac{Dis (i, j)}{v}$, where $v$ is the driving speed of the vehicle.
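The edge weights and travel times defined above can be precomputed once per instance. The following is a minimal sketch under the assumption that node 0 stands for the distribution center $D_{0}$; the function names and the toy coordinates are hypothetical and only illustrate the two formulas.

```python
import math

def distance_matrix(coords):
    """Pairwise Euclidean distances Dis(i, j); index 0 is assumed to be the depot D_0."""
    return [[math.hypot(xi - xj, yi - yj) for (xj, yj) in coords]
            for (xi, yi) in coords]

def travel_time_matrix(dist, speed):
    """Travel times t_ij = Dis(i, j) / v for a common vehicle speed v."""
    return [[d / speed for d in row] for row in dist]

# Hypothetical toy instance: depot at the origin plus two customers.
coords = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
dist = distance_matrix(coords)
times = travel_time_matrix(dist, speed=2.0)
```

Both matrices are symmetric with a zero diagonal, which matches the assumption that all customer nodes are connected.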

The mathematical model of MOCVRPTW is as follows:

Minimize total transportation distance:

\begin{aligned} f_{1} = \min \sum_{k \in K} \sum_{i, j \in V} Dis (i, j) \times x_{i j k} \end{aligned}

(1)

Minimize total transportation costs:

\begin{aligned} f_{2} = \min \mathit{cost} = C_{1} + C_{2} + C_{3} \end{aligned}

(2)

where $C_{1}$, $C_{2}$, and $C_{3}$ are, respectively, the fixed vehicle cost, the transportation cost, and the time delay cost of the logistics distribution process.

\begin{aligned} C_{1} & = \sum_{m \in K} \sum_{i, j \in V} ({v t f}_{m}) S_{i j} \end{aligned}

(3)

\begin{aligned} C_{2} & = \sum_{m \in K} \sum_{i, j \in V} ({v t c}_{m}) D i s (i, j) \end{aligned}

(4)

\begin{aligned} C_{3} & = α (\sum_{m \in K} \sum_{i \in S^{m}} \sqrt{{(x_{i} - \frac{\sum_{i \in S^{m}} x_{i}}{| S^{m} |})}^{2} + {(y_{i} - \frac{\sum_{i \in S^{m}} y_{i}}{| S^{m} |})}^{2}}) \end{aligned}

(5)

where $S_{i j} \in \{0, 1\}$ is a decision variable, ${v t f}_{m}$ is the fixed cost of vehicle $m$, ${v t c}_{m}$ is its unit transportation rate, $α$ is the time delay cost per unit distance, and $S^{m}$ is the set of nodes served on the $m$-th route, that is, the set of nodes that the $m$-th vehicle is responsible for. $| S^{m} |$ is the number of nodes in this set.

Minimize makespan:

\begin{aligned} f_{3} = \min \left( \max_{k \in K} T (R_{k}) \right) \end{aligned}

(6)

where $T (R_{k})$ is the total time of route $R_{k}$, consisting of the travel time, waiting time, and service time over all edges of $R_{k}$.
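As an illustration of the distance and makespan objectives, the sketch below evaluates $f_{1}$ and $f_{3}$ for a given set of routes. It assumes that node 0 is the distribution center, that every route starts and ends there, that vehicles leave at time 0, and that an early vehicle simply waits until the window opens. The cost objective $f_{2}$ is omitted because its parameters are instance specific. All function names and the toy data are hypothetical.

```python
import math

def route_time(route, coords, windows, service, speed):
    """T(R_k): travel + waiting + service time of one depot-to-depot route."""
    t, prev = 0.0, 0                        # the vehicle leaves the depot at time 0
    for node in route:
        t += math.dist(coords[prev], coords[node]) / speed   # travel time
        t = max(t, windows[node][0])        # wait if the vehicle is early
        t += service[node]                  # service time s_i
        prev = node
    return t + math.dist(coords[prev], coords[0]) / speed    # return to the depot

def route_length(route, coords):
    """Length of one route including the legs from and to the depot."""
    stops = [0] + list(route) + [0]
    return sum(math.dist(coords[a], coords[b]) for a, b in zip(stops, stops[1:]))

def objectives(routes, coords, windows, service, speed):
    """Return (f1, f3): total distance and makespan of a set of routes."""
    f1 = sum(route_length(r, coords) for r in routes)
    f3 = max(route_time(r, coords, windows, service, speed) for r in routes)
    return f1, f3

# Hypothetical toy instance with three customers and two vehicles.
coords = [(0, 0), (3, 4), (6, 8), (0, 5)]
windows = [(0, math.inf), (0, 10), (12, 20), (0, 10)]
service = [0, 1, 1, 1]
f1, f3 = objectives([[1, 2], [3]], coords, windows, service, speed=1.0)
```

In this toy instance the first vehicle reaches customer 2 at time 11 and waits until the window opens at 12, so waiting time contributes to the makespan but not to the total distance.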

The relevant constraints and decision variables are as follows:

Every customer node has the same number of incoming and outgoing edges:

\begin{aligned} \sum_{i \in V} x_{i j k} = \sum_{i \in V} x_{j i k}, \forall j \in C u s, k \in K \end{aligned}

(7)

Each customer node is served by exactly one vehicle:

\begin{aligned} \sum_{k \in K} \sum_{j \in V} x_{i j k} = 1, \forall i \in C u s \end{aligned}

(8)

No vehicle may exceed its maximum capacity:

\begin{aligned} \sum_{i \in C u s} q_{i} \sum_{j \in V} x_{i j k} ⩽ Q, \forall k \in K \end{aligned}

(9)

Path decision variables:

\begin{aligned} x_{i j k} \in \{0, 1\}, \forall i, j \in V, k \in K \end{aligned}

(10)

Indicator of whether the delivery task of vehicle $k$ at node $i$ has a time requirement:

\begin{aligned} Z_{i k} \in \{0, 1\}, \forall k \in K, i \in V \end{aligned}

(11)

The departure time of the vehicle is 0:

\begin{aligned} b_{D_{0} k} = 0, \forall k \in K \end{aligned}

(12)

Based on the above, a customer node $i$ is a feasible candidate for vehicle $k$ only if the following conditions hold. After completing service at its current customer node, vehicle $k$ must be able to serve customer node $i$ within the time window $[t_{s i}, t_{e i}]$. If vehicle $k$ is not currently serving any customer, it departs from the distribution center and should be able to reach customer node $i$ before $t_{s i}$. If vehicle $k$ arrives at customer node $i$ before $t_{s i}$, it must wait until $t_{s i}$ before starting service. In addition, the total demand of all customer nodes on the vehicle’s route must not exceed the maximum capacity of the vehicle.
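The candidate conditions above can be sketched as a filter over unvisited customers. This is a simplified illustration, not the paper's exact masking rule: it assumes that node 0 is the distribution center and treats the window end $t_{e i}$ as a hard cut-off, whereas with soft time windows a late arrival would instead incur a penalty. The function name and the toy data are hypothetical.

```python
import math

def feasible_customers(current, clock, load_left, visited,
                       coords, demand, windows, speed):
    """Return the customers that the vehicle could serve next.

    A customer is kept when it is unvisited, its demand fits the remaining
    capacity, and the vehicle can arrive no later than the window end t_e.
    Early arrivals are allowed because the vehicle can wait until t_s.
    """
    candidates = []
    for i in range(1, len(coords)):         # node 0 is the depot
        if i in visited or demand[i] > load_left:
            continue
        arrival = clock + math.dist(coords[current], coords[i]) / speed
        if arrival <= windows[i][1]:
            candidates.append(i)
    return candidates

# Hypothetical toy instance: the vehicle is at the depot at time 0 with 6 units left.
coords = [(0, 0), (3, 4), (6, 8), (0, 5)]
demand = [0, 4, 7, 3]
windows = [(0, 100), (0, 10), (12, 20), (0, 4)]
candidates = feasible_customers(0, 0.0, 6, set(), coords, demand, windows, speed=1.0)
```

Here customer 2 is excluded by the capacity condition and customer 3 by the time window, so only customer 1 remains a candidate.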

2.3 Solving multi-objective models based on decomposition strategy

Based on the analysis in Section 2.2, CVRPTW is a multi-objective optimization problem (MOP). A reliable way to solve an MOP is the decomposition strategy, which decomposes the problem into a set of sub-problems and models each sub-problem separately. The decomposition strategy [30] is good at maintaining the distribution of solutions, and because each sub-problem is optimized using information from adjacent sub-problems, it helps avoid falling into local optima.

This paper combines the decomposition strategy with DRL to solve MOCVRPTW. Specifically, the vehicle routing problem is explicitly decomposed into a set of scalar sub-problems that are solved collaboratively. Solving each scalar sub-problem tends to yield a Pareto optimal (PO) solution, and once all scalar sub-problems are solved, the expected Pareto front (PF) is obtained.

The decomposition strategy decomposes MOCVRPTW into multiple single-objective sub-problems, with each sub-problem optimizing one objective function while keeping the other objective functions fixed. Because the weight vectors of two adjacent sub-problems are close to each other, the two sub-problems are likely to have very similar optimal solutions [31]. This paper uses the Weighted Sum Method (WSM) [29] to decompose the MOP into multiple single-objective sub-problems and then uses DRL to solve each of them. Each sub-problem is modeled with a graph capsule network, and a neighborhood-based parameter transfer strategy [32] is used to optimize the sub-problems collaboratively.
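A minimal sketch of the weighted-sum decomposition is given below. It assumes uniformly spaced weight vectors on the simplex and Euclidean distance between weight vectors as the neighborhood measure; neither choice is stated in the text, and the function names are hypothetical. In a parameter transfer scheme, the network of a sub-problem would be initialized from the trained parameters of one of its neighbors; only the neighbor lookup is shown here.

```python
import math

def weight_vectors(h):
    """All (w1, w2, w3) with entries in {0, 1/h, ..., 1} that sum to 1."""
    return [(a / h, b / h, (h - a - b) / h)
            for a in range(h + 1) for b in range(h + 1 - a)]

def scalarize(objectives, weights):
    """Weighted-sum value of one sub-problem: sum_i w_i * f_i."""
    return sum(w * f for w, f in zip(weights, objectives))

def neighbours(weights, index, k):
    """Indices of the k sub-problems whose weight vectors are closest to `index`."""
    order = sorted(range(len(weights)),
                   key=lambda j: math.dist(weights[index], weights[j]))
    return [j for j in order if j != index][:k]

# Hypothetical example: three objectives (distance, cost, makespan), step 1/4.
weights = weight_vectors(4)
value = scalarize((10.0, 20.0, 30.0), (0.5, 0.25, 0.25))
```

With a step of 1/4 this produces 15 scalar sub-problems, and each sub-problem shares most of its weight mass with its nearest neighbors, which is what makes transferring parameters between them plausible.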

3. Model description

Solving MOCVRPTW with DRL requires defining the state space, action space, and reward function and constructing a DRL model. The model is optimized through training and testing to obtain the best vehicle routing solution. The overall framework is shown in Figure 1. First, MOCVRPTW is formulated as a reinforcement learning problem. Second, a Res-E-G-AT-Caps Net model is designed and trained with a policy gradient reinforcement learning method. Finally, different action selection and search strategies are used to obtain higher-quality solutions.

Figure 1.

Overall framework for solving MOCVRPTW based on DRL.

3.1 Multi-agent reinforcement learning

MOCVRPTW is modeled as a Markov Decision Process (MDP) $(S, A, R, γ)$, where $S$ is the state space, $A$ is the set of actions, $R$ is the reward function, and $γ$ is the discount factor. The state space, action space, state transition, and reward function defined below give the reinforcement learning formulation of MOCVRPTW.

State: The state $S = \{S_{g}, S_{a}\}$ consists of the global state $S_{g}$ and the agent state $S_{a} = (S_{a 1}, S_{a 2}, \dots, S_{a m})$. The global state $S_{g}$ is the overall graph feature information output by the encoder and is static. The agent state $S_{a}$ collects the states of all agents (i.e. vehicles). The state of a single agent $k$ is $S_{a k} = \{{last}_{k}, {rest}_{k}\}$, where ${last}_{k}$ is the feature of the node (i.e. customer node) that agent $k$ selected in the previous step and ${rest}_{k}$ is the remaining capacity of the vehicle of agent $k$. $S_{a k}$ is dynamic and changes over time.

Action: The action space of multi-agent reinforcement learning is the joint action space $A^{t} = \{A_{k}^{t}\}$ of all agents, where $k = 1, 2, 3, \dots, m$. The action $A_{k}^{t}$ is the node selected by agent $k$ at the current time step $t$, chosen from the customer nodes that have not yet been visited and the distribution center.

State transition: At time step $t$, agent $k$ selects action $A_{k}^{t}$, and its state transitions to $S_{a k}^{t + 1} = \{S_{a k}^{t} * A_{k}^{t}\}$, where the symbol “*” indicates that the node selected by the action is appended to the current state. Once the complete joint action $A^{t}$ has been formed, the full state transitions from $S^{t}$ to $S^{t + 1}$.
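The per-agent transition can be sketched as follows. This is an illustrative simplification: it stores the index of the last node rather than its feature vector, keeps the partial route only for bookkeeping, and assumes that the depot is node 0 with zero demand. The class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentState:
    last: int            # last_k: node selected in the previous step (0 = depot)
    rest: float          # rest_k: remaining capacity of the vehicle
    route: tuple = (0,)  # nodes visited so far; bookkeeping added for illustration

def step(state, action, demand):
    """Apply action A_k^t (a node index) and return the next agent state."""
    if demand[action] > state.rest:
        raise ValueError("selected node exceeds the remaining capacity")
    return AgentState(last=action,
                      rest=state.rest - demand[action],
                      route=state.route + (action,))

# Hypothetical example: a vehicle with capacity 10 serves customer 2 (demand 7).
demand = [0.0, 4.0, 7.0]
s0 = AgentState(last=0, rest=10.0)
s1 = step(s0, 2, demand)
```

The state is immutable, so each transition returns a new object and earlier states remain available, which is convenient when several candidate actions are explored from the same state during sampling.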

Reward: The objective functions of MOCVRPTW are minimizing the total distance, minimizing the total cost, and minimizing the makespan. The three objectives are normalized [33], the normalized values are combined as a weighted sum, and the negative of this weighted sum is used as the cumulative return:

\begin{aligned} R = - \sum_{k = 1}^{m} \sum_{t = 1}^{T - 1} \left[ Dis^{*} (A_{k}^{t}, A_{k}^{t + 1}) + \mathit{cost}^{*} (A_{k}^{t}, A_{k}^{t + 1}) + T (R_{k})^{*} (A_{k}^{t}, A_{k}^{t + 1}) \right] \end{aligned}

(13)

The parameterized stochastic policy $π_{θ}$ selects the action $A^{t}$ at each time step $t$ according to the probability vector $p_{θ}$ output by the policy network, until the terminal state is reached (i.e. all customer nodes have been visited). The final output of the policy is a complete node selection sequence $π = \{π_{1}, π_{2}, \dots, π_{T}\}$, where $T$ is the length of the sequence. By the chain rule, the probability that the stochastic policy $π_{θ}$ outputs the complete sequence $π$ for instance $s$ is:

\begin{aligned} P (π | s) = \prod_{t = 1}^{T} p_{θ} (π_{t} | s_{t - 1}, π_{t - 1}) \end{aligned}

(14)
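In practice the product in Eq. (14) is usually accumulated in log space, since policy gradient methods such as REINFORCE work with $\log P (π | s)$. The sketch below assumes that the per-step distributions were recorded while decoding one sequence; the function name and the probabilities are hypothetical.

```python
import math

def sequence_log_prob(step_probs, tour):
    """log P(pi | s): sum over steps of log p_theta(pi_t | ...), Eq. (14) in log space."""
    return sum(math.log(probs[node]) for probs, node in zip(step_probs, tour))

# Hypothetical per-step distributions recorded while decoding one tour.
step_probs = [
    {1: 0.6, 2: 0.3, 3: 0.1},   # first choice among customers 1..3
    {2: 0.75, 3: 0.25},         # customer 1 is no longer available
    {3: 1.0},                   # only customer 3 remains
]
tour = [1, 2, 3]
log_p = sequence_log_prob(step_probs, tour)
```

Exponentiating `log_p` recovers the product of the three step probabilities, and summing logarithms avoids numerical underflow on long sequences.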

3.2 Model architecture

Determining the order in which customer nodes are visited is key to efficient logistics delivery. The visiting sequence has to be analyzed comprehensively with respect to factors such as vehicle load and time window constraints in the current and next states. A complete sequence of nodes visited by a vehicle is a series of interrelated sequential data with clear local and global correlations, so capturing both kinds of correlation is important for overall delivery efficiency.

This paper proposes the Residual Edge Graph Attention Capsule Network (Res-E-G-AT-Caps Net) model for constructing the sequences of nodes visited by the vehicles. The overall framework of the model is shown in Figure 2. The network uses a graph attention model as the encoder and a fully connected capsule layer as the decoder.

Figure 2.

Res-E-G-AT-Caps Net model.

3.2.1 Fusion feature representation based on multi attention mechanism

The fused feature representation combines node and edge features into a more comprehensive graph feature representation. Nodes are the customer nodes and the distribution center, and edges are the paths between any two nodes. Node attribute features mainly include customer demand, the delivery time window of each node, and location coordinates. The edge feature is the Euclidean distance between two nodes.

(1) Embedding layer

The embedding layer, a commonly used layer in DL models, converts high-dimensional discrete inputs into low-dimensional continuous vectors that the model can process. To embed different types of data, each symbol is mapped to a unique code, and a trainable matrix converts each code into the corresponding embedding vector.

With the 2D graph $G (V, E)$ as input, each node and edge is assigned a unique identifier (such as a number), converted to the corresponding integer, and then mapped to its embedding vector. The embedding vectors of all nodes and edges are arranged in order into a matrix, which is used as the input of the capsule network.

The input node features are $n_{i}$ (node coordinates, demand, and time window), and the input edge features are the Euclidean distances $e_{i j}$, $i, j \in V$. The embedding of nodes and edges is:

\begin{aligned} x_{i} & = B N (A_{0} n_{i} + b_{0}), i \in V \end{aligned}

(15)

\begin{aligned} {\hat{e}}_{i j} & = B N (A_{1} e_{i j} + b_{1}), i, j \in V \end{aligned}

(16)

where $A_{0}$ and $A_{1}$ are learnable weight matrices, $b_{0}$ and $b_{1}$ are learnable bias vectors, and $B N (\cdot)$ denotes batch normalization (BN) [34].
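The node embedding of Eq. (15) can be sketched in plain Python as an affine map followed by normalization. This is a simplified illustration: the normalization uses the statistics of the nodes of a single instance and has no learnable scale or shift, which is an assumption rather than the paper's exact BN configuration. The same form applies to the edge features $e_{i j}$ with $A_{1}$ and $b_{1}$. The feature layout and all names are hypothetical.

```python
import math
import random

def linear(x, A, b):
    """Affine map A x + b for one feature vector x."""
    return [sum(a * v for a, v in zip(row, x)) + bi for row, bi in zip(A, b)]

def batch_norm(rows, eps=1e-5):
    """Normalize each output channel to zero mean and unit variance over the rows."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c) + eps)
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def embed(features, A, b):
    """x_i = BN(A n_i + b) for every node feature vector n_i."""
    return batch_norm([linear(f, A, b) for f in features])

# Hypothetical node features: (x, y, demand, t_s, t_e), scaled to [0, 1].
features = [
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.3, 0.4, 0.2, 0.1, 0.5],
    [0.6, 0.8, 0.5, 0.4, 0.9],
    [0.0, 0.5, 0.3, 0.0, 0.4],
]
rng = random.Random(0)
A0 = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(4)]   # maps 5 inputs to 4 channels
b0 = [0.0] * 4
x = embed(features, A0, b0)
```

After normalization every embedding channel has zero mean over the nodes, so no single raw feature scale dominates the later attention scores.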

(2) Multiple attention mechanism

Because edge information and node information are correlated, integrating edge information into the node information and updating each node accordingly fuses the features better, as shown in Figure 3. In addition, residual connections effectively fuse high-level and low-level features and minimize the loss of detailed features caused by deepening the network. As the network gets deeper, the residual block can comprehensively extract the spatial and channel features of the data. This enhances the expressive power of the capsules and allows the model to transmit more information with a small number of primary capsules.

Figure 3.

Node and edge information fusion process.

After information fusion and residual connection, the final embedded feature vector of each node is:

\begin{aligned} h_{i} = \sum_{j = 1}^{n} α_{i j} W_{1} x_{j} + x_{i} \end{aligned}

(17)

where $α_{i j}$ is the attention coefficient, that is, the weight of node $j$ relative to node $i$, $i, j \in V$, and $W_{1}$ is a learnable weight matrix.

\begin{aligned} α_{i j} = \frac{\exp (σ (g^{T} [W (x_{i} \| x_{j} \| {\hat{e}}_{i j})]))}{\sum_{z = 1}^{n} \exp (σ (g^{T} [W (x_{i} \| x_{z} \| {\hat{e}}_{i z})]))} \end{aligned}

(18)

where $(\cdot)^{T}$ is the transpose operator, $\cdot \| \cdot$ is the concatenation operator, $g$ and $W$ are a learnable weight vector and weight matrix, respectively, and $σ (\cdot)$ is the LeakyReLU activation function.
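Equations (17) and (18) can be sketched in plain Python as follows. The sketch assumes a fully connected graph, a LeakyReLU slope of 0.2, and randomly initialized parameters, none of which is specified in the text; all names and sizes are hypothetical.

```python
import math
import random

def matvec(M, v):
    """Matrix-vector product for plain nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def leaky_relu(z, slope=0.2):
    return z if z > 0 else slope * z

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def edge_attention(x, e_hat, W, g, W1):
    """Return (h, alpha): fused node features (Eq. 17) and coefficients (Eq. 18)."""
    n = len(x)
    h, alpha = [], []
    for i in range(n):
        # score_ij = LeakyReLU(g^T W [x_i || x_j || e_hat_ij]); "+" concatenates lists
        scores = [leaky_relu(sum(gk * zk for gk, zk in
                                 zip(g, matvec(W, x[i] + x[j] + e_hat[i][j]))))
                  for j in range(n)]
        a = softmax(scores)
        msg = [0.0] * len(x[i])
        for j in range(n):
            msg = [m + a[j] * w for m, w in zip(msg, matvec(W1, x[j]))]
        h.append([m + xi for m, xi in zip(msg, x[i])])   # residual term + x_i
        alpha.append(a)
    return h, alpha

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d, d_e, d_att = 4, 3, 2, 5                 # toy sizes chosen for illustration
x = rand_matrix(n, d, rng)
e_hat = [rand_matrix(n, d_e, rng) for _ in range(n)]
W = rand_matrix(d_att, 2 * d + d_e, rng)
g = [rng.uniform(-1, 1) for _ in range(d_att)]
W1 = rand_matrix(d, d, rng)
h, alpha = edge_attention(x, e_hat, W, g, W1)
```

Each row of `alpha` sums to one, and because of the residual term the output reduces to the input embedding when the message part vanishes.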

MHA [18] computes attention in several subspaces of different dimensions, which helps to capture feature information from different aspects. This paper uses $M = 8$ attention heads, that is, attention is computed separately in 8 subspaces of dimension $\dim (h) / M = 16$:

\begin{aligned} q_{m} & = W_{m}^{Q} \times h \end{aligned}

(19)

\begin{aligned} k_{m} & = W_{m}^{K} \times h \end{aligned}

(20)

\begin{aligned} v_{m} & = W_{m}^{V} \times h \end{aligned}

(21)

Where, $q_{m}$ , $k_{m}$ , $v_{m}$ are the query, key, value calculated on the dimension space. $W_{m}^{Q}$ , $W_{m}^{K}$ , $W_{m}^{V}$ are the corresponding network parameters.

Calculate the scaling point multiplication value $u_{i j}^{m}$ between $q_{i, m}$ and $k_{j, m}$ on each attention head dimension space, and use the SoftMax function to normalize $u_{i j}^{m}$ to the attention score $a_{i j}^{m} \in [0, 1]$ . Perform dot product operation on the attention score $a_{i *}^{m}$ and the corresponding $v_{i, m}$ to obtain the feature $h_{i, m}^{^{'}}$ of each attention head subspace. Finally, fuse all the features of the attention head dimension space into complete node features, namely:

\begin{aligned} u_{i j}^{m} & = \begin{cases} \frac{q_{i, m}^{T} \times k_{j, m}}{\sqrt{\dim (k_{j, m})}}, & \text{if } j \text{ is adjacent to } i; \\ - \infty, & \text{otherwise}. \end{cases} \end{aligned}

(22)

\begin{aligned} a_{i j}^{m} & = SoftMax (u_{i j}^{m}) = \frac{e^{u_{i j}^{m}}}{\sum_{j^{'}} e^{u_{i j^{'}}^{m}}} \end{aligned}

(23)

\begin{aligned} h_{i, m}^{'} & = \sum_{j} a_{i j}^{m} \times v_{j, m} \end{aligned}

(24)

\begin{aligned} M H A (h_{i}) & = \sum_{m = 1}^{M} W_{m}^{O} \times h_{i, m}^{^{'}} \end{aligned}

(25)

Where, $W_{m}^{O}$ is the network parameter for multi head attention feature fusion.
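The per-head computation in Eqs. (22) to (24) can be sketched as follows. This is a simplified single-head illustration in plain Python, not the paper's code; the function name and the boolean adjacency matrix `adj` are hypothetical, and the sketch assumes every node has at least one adjacent node (otherwise the SoftMax is undefined).

```python
import math

def masked_attention_head(q, k, v, adj):
    """One attention head. q, k, v: lists of n vectors.
    adj[i][j] is True when node j is adjacent to node i."""
    n, d = len(q), len(k[0])
    out, scores = [], []
    for i in range(n):
        # Eq. (22): scaled dot product, -inf for non-adjacent nodes.
        u = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
             if adj[i][j] else -math.inf
             for j in range(n)]
        # Eq. (23): SoftMax; exp(-inf) = 0 removes the masked nodes.
        mx = max(u)
        ex = [math.exp(val - mx) for val in u]
        total = sum(ex)
        a = [val / total for val in ex]
        # Eq. (24): attention-weighted sum of the value vectors.
        out.append([sum(a[j] * v[j][c] for j in range(n))
                    for c in range(len(v[0]))])
        scores.append(a)
    return out, scores
```

The full MHA output of Eq. (25) would then apply one such head per $m$ and sum the projected head outputs $W_{m}^{O} \times h_{i, m}^{'}$.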

3.2.2 Feature extraction and node prediction based on Caps Net

The fusion feature representation of node and edge data, as shallow and intuitive features, is not sufficient to express hidden relationships. Therefore, it is necessary to extract deep level correlation features between nodes from the input shallow features through a node feature extraction model. Caps Net [35], [36] uses a vector composed of a set of neurons to store comprehensive features such as entity attributes and spatial relationships, which has stronger representation ability and robustness. Caps Net has innovatively proposed a dynamic routing mechanism based on capsule vector structure to quickly extract effective features from a large amount of information. In capsule networks, low-level capsules depict local features of objects, high-level capsules represent overall features, and dynamic routing mechanisms are used to extract effective features from low-level capsules and update high-level capsules.

This article constructs a feature extraction and node prediction model based on Caps Net, which automatically learns the relationship between local and global nodes through the transformation matrix between capsules. This part is a three-layer capsule network, including convolutional layer, primary capsule layer, and fully connected capsule layer.

(1) Convolutional layer

This layer is a standard convolutional layer that uses convolution windows of different widths to extract deep local features at different positions of the vector matrix formed by the fusion feature representation of the previous stage. Each convolution window is connected to a local area of the feature representation layer, and the weighted sum of the local information is passed to the activation function to generate the output value of the convolution layer. This layer can extract deep level node and edge features from the vector matrix, has the advantage of translation invariance, and can capture the associations between state data in parallel.

Let the convolutional layer have $k$ convolutional kernels with a stride of 1, where the $i$-th convolutional kernel is $w_{i} \in R^{c \times 2 d}$, $c$ is the window width of the convolutional kernel, used to identify local features, and $2 d$ is the dimension of the vector matrix input to the convolutional layer. Each convolution kernel slides over the feature matrix from top to bottom to perform convolution operations, and the feature map $m_{i}$ generated by the $i$-th convolution kernel is:

\begin{aligned} m_{i} = f (w_{i} \cdot l_{i : i + c - 1} + b_{i}) \in R^{n - c + 1} \end{aligned}

(26)

Where, $l_{i : i + c - 1}$ is the embedding of $c$ consecutive features, $b_{i}$ is the bias term, and $f$ is the ReLU function, which can alleviate the phenomenon of gradient explosion/vanishing.

By arranging the obtained $m_{i}$ in order, the output feature matrix of the convolutional layer can be obtained:

\begin{aligned} M = [m_{1}, m_{2}, \dots, m_{k}] \in R^{(n - c + 1) \times k} \end{aligned}

(27)
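To make the shapes in Eqs. (26) and (27) concrete, the sketch below slides each kernel over the rows of the input matrix with stride 1 and stacks the resulting feature maps as columns. It is an illustrative pure-Python version under simple assumptions (one scalar bias per kernel, hypothetical function names), not the authors' implementation.

```python
def conv_feature_map(L, w, b):
    """L: n x D input matrix (rows are fused features); w: c x D kernel;
    b: scalar bias. Returns a ReLU feature map of length n - c + 1."""
    n, c = len(L), len(w)
    m = []
    for t in range(n - c + 1):
        s = b
        for r in range(c):
            # Weighted sum over the window covering rows t .. t + c - 1.
            s += sum(wv * lv for wv, lv in zip(w[r], L[t + r]))
        m.append(max(0.0, s))  # ReLU activation
    return m

def conv_layer(L, kernels, biases):
    """Apply k kernels and stack the maps column-wise: (n - c + 1) x k."""
    maps = [conv_feature_map(L, w, b) for w, b in zip(kernels, biases)]
    return [list(row) for row in zip(*maps)]
```

For an input with $n = 5$ rows and window width $c = 2$, each kernel yields a feature map of length $n - c + 1 = 4$, so two kernels give a $4 \times 2$ output matrix.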

(2) Primary capsule layer

The features contained in the primary capsules all come from the output of the convolutional layer, and the number of capsules depends on the size of the convolutional feature map. The primary capsule layer is the first capsule layer in the model. It replaces the scalar neurons of a CNN with vector-valued capsules and also replaces pooling operations, so that pooling does not damage the local position information and overall sequence structure of node data. The primary capsule combines features extracted from the same location by abstracting high-level features from low-level features, effectively capturing the spatiotemporal relationship between local positions and the overall structure of the data. This layer can extract different attributes of a certain feature in a node, such as the time when the customer node receives service.

The primary capsule layer encapsulates different attributes of the row vectors $M_{i}^{^{'}} (i = 1, 2, \dots, n - c + 1)$ output by the convolutional layer, where $M_{i}^{^{'}}$ is the $i$-th row vector of $M$ (a single row of the convolutional output). The primary capsule layer can convert $M$ into a capsule matrix through a transformation matrix.

Assuming the dimension of the primary capsule is $l_{1}$ , and the transformation matrix of the $i$-th primary capsule is $z_{i} \in R^{l \times k}$ , each transformation matrix can be operated on with $M_{i}^{^{'}}$ to generate feature maps $p_{i}$ one by one:

\begin{aligned} p_{i} = g (z_{i} \cdot M_{i}^{^{'}} + e_{i}) \in R^{n - c + 1} \end{aligned}

(28)

Where, $e_{i}$ is the bias term, and $g$ is the nonlinear activation function.

Since each capsule includes $l_{1}$ transformation matrices, the output of each capsule is $u_{i} \in R^{(n - c + 1) \times l_{1}}$ . Rearrange the outputs of all capsules to obtain the output feature matrix of the primary capsule layer:

\begin{aligned} U = [u_{1}, u_{2}, \dots, u_{q}] \in R^{(n - c + 1) \times l_{1} \times q} \end{aligned}

(29)

(3) Fully connected capsule layer

The final layer of the model is a fully connected capsule layer, where each capsule is connected to the primary capsule. The capsule matrix obtained from the primary capsule is multiplied by the transformation matrix to obtain a prediction vector, and then dynamic routing is used to generate the final category capsule:

\begin{aligned} V = [v_{1}, v_{2}, \dots, v_{j}] \in R^{j \times l_{2}} \end{aligned}

(30)

Where, $v_{j} \in R^{l_{2}}$ represents the $j$ -th class capsule.

The output of a capsule is also a vector, and since each capsule can be considered as a vector representing a specific entity, the modulus of the capsule vector can be used to represent the probability of an individual being selected.

3.2.3 Dynamic routing iterative update mechanism

The vector capsule in Caps Net encapsulates information from the convolved features, and forms a primary capsule through linear changes. Each capsule in the primary capsule layer will transfer the features upward through the weighting of the coupling coefficient, and the mapped new vector will get a high-level capsule through the nonlinear activation function Squash. The calculation process between capsules at different levels mainly includes two stages: matrix transformation and dynamic routing. Dynamic routing refers to the iterative and cyclic transfer of features from low-level capsules between adjacent capsule layers, and the formation of more abstract high-level capsules. The primary capsule layer uses dynamic routing mechanism to learn features in the data, and transmits the predicted feature information to the fully connected capsule layer in the upper layer. If the prediction results are consistent, the upper layer capsule is activated. Compared to the process of weighted summation and nonlinear transformation required by scalar neurons, capsule networks require matrix transformation to balance the spatial and hierarchical relationships of the identified objects.

Figure 4.

Dynamic routing mechanism.

The dynamic routing structure relationship (as well as the relationship between the primary capsule layer and the fully connected capsule layer) is shown in Figure 4. This process first involves performing a matrix transformation on each primary capsule to obtain a prediction vector:

\begin{aligned} u_{j q} = u_{q} \cdot w_{q j} \end{aligned}

(31)

Where, $u_{q}$ is the output of the primary capsule and $w_{q j}$ is the transformation matrix.

Afterwards, calculate the weighted sum of the prediction vectors:

\begin{aligned} S_{j} = \sum_{q} c_{q j} \cdot u_{j q} \end{aligned}

(32)

Where, $c_{q j}$ is the coupling coefficient, which can be calculated through dynamic routing algorithms, representing the connection weights between capsules. For capsule $q$ , the sum of all weights $c_{q j}$ is 1.

Since there is a vector transformation between capsules, the squashing function Squash is used as the activation function to compress and redistribute $S_{j}$ , and transform the modulus of $S_{j}$ to the range of 0 $\sim$ 1, namely:

\begin{aligned} v_{j} = \frac{∥ S_{j} ∥^{2}}{1 + ∥ S_{j} ∥^{2}} \cdot \frac{S_{j}}{∥ S_{j} ∥} \end{aligned}

(33)

Where, $v_{j}$ is the output vector of the $j$ -th capsule in the fully connected capsule layer, and its length is also the probability of the node being selected. The first factor of this equation is the compression term, which constrains the length of $v_{j}$ to be less than 1. The second factor normalizes $S_{j}$ to a unit vector, so that $v_{j}$ keeps the direction of $S_{j}$ . Therefore, the compression function only changes the length of $S_{j}$ and does not change the direction of $S_{j}$ .
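A small numerical sketch of the Squash function in Eq. (33) is given below. The small constant `eps` is an assumption added here to avoid division by zero for a zero vector; it is not part of the equation.

```python
import math

def squash(s, eps=1e-9):
    """Squash activation: keeps the direction of s and maps its length
    into [0, 1) via ||s||^2 / (1 + ||s||^2)."""
    norm_sq = sum(x * x for x in s)
    norm = math.sqrt(norm_sq)
    scale = norm_sq / (1.0 + norm_sq) / (norm + eps)
    return [scale * x for x in s]
```

For example, an input of length 5 such as `[3.0, 4.0]` is mapped to a vector of length $25 / 26 \approx 0.96$ that still points in the same direction, while very short inputs are shrunk towards zero.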

The dynamic routing algorithm iteratively learns the nonlinear mapping relationship between the prediction layer and the fully connected layer, using the SoftMax function to continuously update the coupling coefficient $c_{q j}$ :

\begin{aligned} c_{q j} & = \frac{\exp (b_{q j})}{\sum_{k} \exp (b_{q k})} \end{aligned}

(34)

\begin{aligned} b_{q j} & \leftarrow b_{q j} + u_{j q} \cdot v_{j} \end{aligned}

(35)

Where, $b_{q j}$ represents the log prior probability of connecting capsule $q$ to capsule $j$ , which can be set to 0 initially.

The similarity between the primary capsule prediction vector $u_{j q}$ and the output vector $v_{j}$ of the fully connected capsule is determined by their inner product; $b_{q j}$ is updated iteratively with this similarity, and the coupling coefficient $c_{q j}$ is then updated in turn.
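The routing loop described by Eqs. (32) to (35) can be sketched as below. This is a minimal illustration rather than the authors' code: the number of routing iterations (3) is an assumption, and `u_hat[q][j]` is a hypothetical name for the prediction vector from primary capsule $q$ to output capsule $j$.

```python
import math

def squash(s, eps=1e-9):
    norm_sq = sum(x * x for x in s)
    return [norm_sq / (1.0 + norm_sq) / (math.sqrt(norm_sq) + eps) * x
            for x in s]

def dynamic_routing(u_hat, iterations=3):
    """u_hat[q][j]: prediction vector from primary capsule q to capsule j.
    Returns the output capsules v and the last coupling coefficients c."""
    Q, J, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * J for _ in range(Q)]  # routing logits, initialised to 0
    for _ in range(iterations):
        # Eq. (34): SoftMax over output capsules for each primary capsule.
        c = []
        for q in range(Q):
            mx = max(b[q])
            ex = [math.exp(val - mx) for val in b[q]]
            total = sum(ex)
            c.append([val / total for val in ex])
        # Eqs. (32)-(33): weighted sum of predictions, then Squash.
        v = []
        for j in range(J):
            s = [sum(c[q][j] * u_hat[q][j][d] for q in range(Q))
                 for d in range(dim)]
            v.append(squash(s))
        # Eq. (35): raise the logit where prediction and output agree.
        for q in range(Q):
            for j in range(J):
                b[q][j] += sum(x * y for x, y in zip(u_hat[q][j], v[j]))
    return v, c
```

When several primary capsules make consistent predictions for the same output capsule, their coupling coefficients towards that capsule grow over the iterations, which is the "routing by agreement" behaviour described above.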

3.2.4 Loss function

When using the capsule network to predict the customer node access sequence in the VRP, the commonly used loss function is the Cross-entropy Loss function [34], which can be used to compare the difference between the predicted results and the real results, so as to adjust the network parameters and make the prediction results more accurate.

Specifically, when predicting the order of customer node visits in VRP, the predicted result at each step can be represented as a probability distribution, where each node has a certain probability of being predicted as the next node. The actual result can then be represented as a one-hot encoded vector, where only one element is 1 and the rest are 0, which corresponds to the position of the next node. For example, if the access order is [0, 2, 1, 3], then its one-hot codes are [1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1], and the cross entropy loss function can then be used to compare the difference between the model output and the real one-hot code.

Assuming that $y_{i j}$ represents the true result at step $i$ (1 if node $j$ is the true next node and 0 otherwise), and $p_{i j}$ represents the predicted probability that the node at step $i$ is node $j$ , the cross entropy loss function can be expressed as:

\begin{aligned} L = - \sum_{i} \sum_{j} y_{i j} \log (p_{i j}) \end{aligned}

(36)

Where, because only one element in each one-hot vector of the real result is 1, the inner sum $\sum_{j} y_{i j} \log (p_{i j})$ reduces to the log probability that the model assigns to the true node at step $i$ . The loss is therefore small when the predicted distribution concentrates on the true nodes, and large when the predicted results are uncertain or wrong.

In the training process, the network parameters are adjusted by minimizing the cross entropy loss function, so that the prediction results are closer to the real results.
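The one-hot example above can be checked numerically with the standard form of the cross entropy, $- \sum_{i} \sum_{j} y_{i j} \log (p_{i j})$ with one-hot targets $y_{i j}$. The sketch below is illustrative only; the helper names and the small constant `eps` (added to avoid $\log 0$) are assumptions.

```python
import math

def one_hot(index, size):
    return [1.0 if k == index else 0.0 for k in range(size)]

def cross_entropy(targets, probs, eps=1e-12):
    """Sum over steps i and nodes j of -y_ij * log(p_ij)."""
    return -sum(y * math.log(p + eps)
                for t_row, p_row in zip(targets, probs)
                for y, p in zip(t_row, p_row))

# Access order from the text and its one-hot codes.
order = [0, 2, 1, 3]
targets = [one_hot(i, 4) for i in order]

# A uniform prediction assigns probability 1/4 to every node at each step.
uniform = [[0.25] * 4 for _ in order]
loss_uniform = cross_entropy(targets, uniform)  # equals 4 * log(4)
```

A perfect prediction (probability 1 on the true node at each step) gives a loss close to zero, whereas the uniform prediction gives $4 \log 4$.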

3.2.5 Mask strategy

In the process of constructing the solution, nodes that do not meet the conditions must be masked during node prediction at each time step. A node is masked if it is a customer node that has already been selected (to avoid repeated selection), if the remaining vehicle capacity cannot meet the demand of the customer node, if the arrival time at the customer node is not within the service time window of the node, or if selecting it would choose the distribution center twice in a row (i.e. a depot node $π_{t - 1} \in \{1, 2, \dots, k\}$ was selected at time $t - 1$ ).

Specifically, in Eq. (22), the attention coefficient of the above cases is set to $- \infty$ as a mask, and the attention coefficient is then normalized with the SoftMax activation function in Eq. (23).
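The masking rules above can be sketched as a feasibility check followed by a masked SoftMax. This is a simplified illustration: all argument names are hypothetical, the time-window test follows the text literally (arrival must fall inside the window, waiting is not modelled), and at least one node is assumed to remain unmasked.

```python
import math

def build_mask(visited, demand, capacity_left, arrival, tw,
               is_depot, prev_is_depot):
    """Return a list of booleans; True means the node is masked."""
    mask = []
    for j in range(len(demand)):
        if is_depot[j]:
            # The depot may not be selected twice in a row.
            mask.append(prev_is_depot)
        else:
            start, end = tw[j]
            mask.append(visited[j]
                        or demand[j] > capacity_left
                        or not (start <= arrival[j] <= end))
    return mask

def masked_softmax(scores, mask):
    # Masked scores are set to -inf so that their probability is exactly 0.
    u = [-math.inf if m else s for s, m in zip(scores, mask)]
    mx = max(u)
    ex = [math.exp(val - mx) for val in u]
    total = sum(ex)
    return [val / total for val in ex]
```

Because $\exp (- \infty) = 0$, masked nodes receive zero probability and can never be selected, while the remaining probabilities still sum to 1.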

4. DRL algorithm based on policy gradient

The DRL algorithm based on policy gradient is a reinforcement learning algorithm that uses deep neural networks to learn policies. Its main idea is to optimize the policy function by maximizing cumulative rewards, while using gradient descent method to update policy parameters. Specifically, the training process of this algorithm is mainly divided into the following steps:

Step 1: Initialize the parameters of the policy function.

Step 2: At each time step, use the current policy function to interact with the environment and generate a sequence of states, actions, and rewards.

Step 3: Calculate the rewards for each time step and use these rewards to calculate the gradient of each state-action pair.

Step 4: Use the gradient descent method to update the parameters of the policy function to better match the goal of maximizing cumulative rewards.

Step 5: Repeat steps 2–4 until the policy function converges or the preset number of training iterations is reached.

In practical applications, DRL algorithms based on policy gradient usually adopt improvement techniques, such as baselines, importance sampling and truncated returns, to improve the efficiency and stability of the algorithm. In addition, this algorithm can also be combined with other reinforcement learning algorithms, such as value-based learning and DQN, to achieve better performance.

4.1 Strategy network training methods

The REINFORCE algorithm with a rollout baseline is a policy-based reinforcement learning algorithm that estimates the policy gradient from the cumulative return of a single agent and trains the single-agent policy. This article uses this algorithm to train multi-agent joint policies: it estimates the policy gradient from the cumulative return of the joint policy and trains the multi-agent policy network. For a given instance $s$ , policy network $θ$ outputs the action probability vector $p_{θ} (π_{t} | s)$ for each step of all agents, and then outputs the joint action $π_{t} = sample (p_{θ} (π_{t} | s))$ by sampling. The reference (baseline) network $θ^{b l}$ outputs the joint action $π_{t}^{b l} = greedy (p_{θ^{b l}} (π_{t} | s))$ by greedy selection based on the action probability vector $p_{θ^{b l}} (π_{t} | s)$ that it outputs. The expected cumulative return $L (θ | s) = E_{p_{θ} (π | s)} [R (π)]$ of the policy is evaluated with the Monte Carlo method, where $R (π)$ is the cumulative return of policy $π = \{π_{1}, π_{2}, \dots, π_{T}\}$ .

Calculate the policy gradient using the REINFORCE algorithm with a baseline and update the policy network parameters using a gradient descent method, namely:

\begin{aligned} \nabla_{θ} L (θ | s) & = - E_{p_{θ} (π | s)} [(R (π) - R (π^{b l})) \times \nabla_{θ} \log p_{θ} (π | s)] \end{aligned}

(37)

\begin{aligned} θ & = Adam (θ, \nabla_{θ} L (θ | s)) \end{aligned}

(38)

Using the benchmark network $θ^{b l}$ to evaluate the difficulty of instance $s$ can effectively reduce the variance of the gradient during network training. The benchmark network is updated in a rollout manner. At the end of each training round, the strategy network $θ$ is compared with the benchmark network $θ^{b l}$ . If a $t$ -test at significance level $α = 0.05$ shows that the solutions output by the strategy network are significantly better than those of the benchmark network, the benchmark network parameters are replaced by those of the strategy network, $θ^{b l} \leftarrow θ$ .
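The two ingredients described above, the baseline-corrected gradient weight and the significance test that decides when to replace the benchmark network, can be sketched as follows. Several assumptions are made here and are not taken from the paper: solution quality is treated as a cost (lower is better), the test is a paired one-sided $t$-test, and the critical value 1.645 is the large-sample normal approximation for $α = 0.05$.

```python
import math

def surrogate_loss(costs, baseline_costs, log_probs):
    """Batch average of (cost - baseline cost) * log-probability.
    Its gradient w.r.t. the log-probabilities has the REINFORCE form."""
    adv = [c - b for c, b in zip(costs, baseline_costs)]
    return sum(a * lp for a, lp in zip(adv, log_probs)) / len(adv)

def paired_t_statistic(a, b):
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)
    if var == 0.0:
        # Degenerate case: all differences are identical.
        return math.copysign(math.inf, mean) if mean != 0 else 0.0
    return mean / math.sqrt(var / n)

def should_replace_baseline(policy_costs, baseline_costs, t_crit=1.645):
    """Replace the benchmark network only if the policy network's costs
    are significantly lower (t_crit is an assumed critical value; small
    samples would need the larger Student-t value)."""
    return paired_t_statistic(policy_costs, baseline_costs) < -t_crit
```

Subtracting the greedy baseline cost means that a sampled solution only pushes the policy when it is better or worse than what the benchmark network already achieves on the same instance, which is what reduces the gradient variance.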

4.2 Action selection and local search strategy

The strategy network that has undergone multiple rounds of learning has good decision-making ability, and the action selection strategy selects actions based on the probability vector output by the network. This article adopts two different action selection strategies:

The greedy action selection strategy fully trusts the policy network, with each step based on the output value of the fully connected capsule layer, selecting the action with the maximum module length.

The sampling action selection strategy treats the output values of the fully connected capsule layer as a sampling distribution over module lengths and samples actions from this distribution. Therefore, this strategy does not always select the action with the maximum module length; actions with smaller module lengths can also be selected.
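The difference between the two action selection strategies can be illustrated with a few lines of Python. The sketch assumes that the capsule module lengths are turned into a sampling distribution by dividing by their sum; the paper does not state the exact normalisation, so this is a hypothetical choice.

```python
import random

def greedy_action(lengths):
    """Select the action whose capsule has the maximum module length."""
    return max(range(len(lengths)), key=lambda j: lengths[j])

def sample_action(lengths, rng=random):
    """Sample an action with probability proportional to module length
    (normalising by the sum is an assumption made for this sketch)."""
    total = sum(lengths)
    probs = [l / total for l in lengths]
    return rng.choices(range(len(lengths)), weights=probs, k=1)[0]
```

With module lengths `[0.1, 0.7, 0.2]`, the greedy strategy always returns action 1, while the sampling strategy returns action 1 most of the time but occasionally selects the other two.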

In the process of network training, the benchmark network $θ^{b l}$ serves as a “critic” for evaluating the difficulty of each batch of training instances, and a greedy action selection strategy lets it quickly obtain effective “evaluation indicators”. As an “actor”, the strategy network $θ$ needs its decision-making ability to be evaluated effectively; a sampling action selection strategy makes it possible to estimate the expected value of solution quality, that is, the decision-making ability of the “actor”. By using different action selection strategies, $θ^{b l}$ and $θ$ can effectively improve learning efficiency and model performance.

The trained model can quickly solve MOCVRPTW using a greedy action selection strategy, but there is some room for improvement for some difficult instances, such as route crossing problems in the sub loop and overconfidence behavior of the greedy action selection strategy, which leads to missing actions with a high selection probability (but not the highest). In order to improve the quality of the solution, this article adopts two local search strategies:

2-opt search. In response to the problem of route crossing in the sub loop, 2-opt local search takes each sub loop of the model output solution as the initial solution, and optimizes all sub loops through 2-opt operation to improve the overall quality of the solution.

Sampling search. The model uses a sampling action selection strategy to repeatedly solve the same instance to obtain multiple complete solutions, and takes the best one. If $s$ is the number of complete solution samples, the best solution is $π^{*} = \arg \min \{L (π_{1}), L (π_{2}), \dots, L (π_{s})\}$ , which avoids overconfidence behavior in the policy network. Because the model solves instances quickly, repeated sampling does not consume excessive time.
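The two local search strategies can be sketched as follows. This is a simplified illustration under stated assumptions: routes are lists of node indices that start and end at the depot, the cost is plain Euclidean length, and the 2-opt move does not re-check time windows (a full implementation would need to reject moves that violate them). The function names and the `sample_solution` callback are hypothetical.

```python
import math
import random

def route_length(route, coords):
    """Total Euclidean length of a route given as a list of node indices."""
    return sum(math.hypot(coords[a][0] - coords[b][0],
                          coords[a][1] - coords[b][1])
               for a, b in zip(route, route[1:]))

def two_opt(route, coords):
    """Improve one sub loop (depot ... depot) by reversing segments
    until no reversal shortens the route."""
    best = route[:]
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 2):
            for j in range(i + 1, len(best) - 1):
                cand = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if route_length(cand, coords) < route_length(best, coords) - 1e-12:
                    best, improved = cand, True
    return best

def sampling_search(sample_solution, cost, s=128, seed=0):
    """Draw s complete solutions and keep the one with the lowest cost."""
    rng = random.Random(seed)
    best, best_cost = None, math.inf
    for _ in range(s):
        sol = sample_solution(rng)
        c = cost(sol)
        if c < best_cost:
            best, best_cost = sol, c
    return best, best_cost
```

On four nodes at the corners of a unit square, the crossing route `[0, 1, 2, 3, 0]` (node 1 diagonal to the depot) is uncrossed by 2-opt into a route of length 4. The default `s=128` mirrors the number of samples used in the experiments.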

5. Experiment and result analysis

Since the paper deals with logistics and optimization problems, it is important to consider potential ethical implications. This paper mainly considers environmental impact and fairness in distribution. The dataset used in this study is a standard dataset, and the simulated experimental hardware conditions are consistent. Special cases have been explained in the model assumptions, and the environmental impact is fair. In addition, according to the model assumption, the distribution of customer nodes and delivery volume are known and fixed, and the distribution is also fair. Therefore, the ethical factors present in this study have no direct impact on the experimental simulation results.

5.1 Experimental setting

The overall framework of the model was implemented in PyTorch with Python 3.7. The policy network model was trained on a single GPU (2080ti, 10G graphics memory) and tested on a Windows 7 operating system running on an Intel Core i7-3630QM 2.40GHz with 8GB RAM. Meanwhile, comparative experiments were conducted on all multi-objective optimization algorithms on the standard software platform PlatEMO [37] written in MATLAB.

This article verifies the feasibility and effectiveness of the proposed framework through experiments, which include two parts: training and testing. As an unsupervised learning method, only model input and reward function are required during the training process, without the need for the best tour as a label.

Training data: To verify the performance of the proposed model in this article, the models were trained on customer node examples with scales of 25, 50, and 100, respectively, where the maximum capacity of vehicles for customer nodes 25, 50, and 100 is 50, 100, and 200, respectively. The coordinates of nodes are randomly generated in [0, 100] $\times$ [0, 100]. In addition, the starting TW is randomly generated on [0, 200], where the width of TW has a Gaussian distribution. The coordinates of the distribution center are randomly generated on [30, 60].

In the model training phase, for the 25 customer and 50 customer scale problems, the number of training rounds (epoch) is set to 100, the number of training batches per round is set to 2500, and the number of examples per batch is set to 512. For the problem of 100 customer sizes, due to the limitation of graphics memory size, the number of cases per batch is set to 128. In the case testing phase, for problems of different sizes, 10000 sets of cases are tested under their corresponding distributions. Once the model has been trained, it can be used to directly output Pareto Front (PF).

Testing data: This article considers the Solomon benchmark data-set [38] of CVRPTW consisting of 25, 50, and 100 customers, which is classified based on vehicle capacity and customer location (C, R, RC). Test each category based on vehicle capacity and customer location type. The number of training rounds (epochs) for each model is 100. Compare the obtained results with the selected comparison framework/algorithm on the objective function established in Section 2.2.

Parameter settings: The embedding dimension of node information is 128, and the embedding dimension of edge information is 16. The Adam optimizer [39] is used to optimize the network parameters, and the learning rate is set to $1 \times 10^{-4}$. The number of samples $s$ in the sampling search is set to 128. In addition, the weights of the first sub-problem are initialized using the Xavier method [40], and the weights of subsequent sub-problems are generated by the introduced neighborhood based parameter transfer strategy.

Method comparison: Because this article solves a three-objective CVRPTW and the constructed model is a DRL model, both multi-objective optimization algorithms and DRL methods are selected for comparison.

This article selects classic INSGA-II [41] and MOEA/D [42] multi-objective evolutionary algorithms, MARDAM [43] (multi-agent routing deep attention mechanism) method, PtrNet [15] method, and RE-GAT [44] method.

The parameter settings in the comparison algorithm are consistent with the corresponding literature. All comparison methods are executed under the same conditions, which means using the same starting and ending criteria, the same number of starting search points, the same dataset, and the same hardware to run the algorithm. All methods are run in their optimal environment, taking the average of the optimal results.

5.2 Comparison of training processes

Figure 5 shows the learning curves of the Res-E-G-AT-Caps Net model with MARDAM [43], PtrNet [15], and RE-GAT [44] on 100 customer node instances. From Figure 5, it can be seen that the learning curve of the Res-E-G-AT-Caps Net model oscillates more smoothly, and the final convergence result is also better than other models. On the one hand, this is because the graph capsule model proposed in this article for MOCVRPTW can quickly converge to local optima in the network. On the other hand, the interaction between multi-agent systems enables the current agent to consider joint action optimization when selecting nodes, which helps to reduce gradient variance and accelerate model training.

Figure 5.

Learning curves of each model.

Table 2

Output for Solomon dataset (CVRPTW with 25 customers).

		R1-25					C1-25					RC1-25
Method	Type	$f_{1}$	$f_{2}$	$f_{3}$	Gap	Time	$f_{1}$	$f_{2}$	$f_{3}$	Gap	Time	$f_{1}$	$f_{2}$	$f_{3}$	Gap	Time
INSGA-II [41]	H	795.7	8012.7	198.7	–	19.89 s	518.6	5478.5	121.4	–	14.76 s	678.8	6783.5	145.2	–	17.23 s
MOEA/D [42]	H	821.3	8892.4	208.8	–	21.34 s	597.4	5992.3	132.1	–	16.52 s	703.5	7021.4	156.2	–	18.42 s
MARDAM [43]	RL, G	658.8	5378.3	143.2	0.58%	2.35 s	404.8	3219.5	117.4	0.43%	0.60 s	459.2	4012.8	132.3	0.52%	1.72 s
PtrNet [15]	RL, G	659.3	5342.4	144.6	8.03%	–	408.4	3320.5	118.3	6.12%	–	461.6	4045.1	138.5	7.43%	–
PtrNet [15]	RL, BS	640.7	5102.6	129.8	4.92%	–	387.5	3014.8	102.4	3.17%	–	419.7	3873.4	117.5	4.12%	–
RE-GAT [44]	RL, G	626.1	4873.4	112.5	2.78%	2 s	365.3	2894.9	98.4	1.26%	2 s	401.2	3498.5	102.3	1.93%	2 s
RE-GAT [44]	RL, S	619.4	4537.6	103.8	1.58%	15 m	348.9	2698.3	89.6	0.83%	15 m	387.3	3109.3	94.6	1.02%	15 m
Ours (2-opt)	RL, 2-opt	620.2	4542.5	104.6	1.62%	3 s	350.6	2701.5	90.2	0.85%	3 s	388.7	3110.4	95.1	1.05%	3 s
Ours (Sample)	RL, S	610.6	4317.5	94.7	1.44%	12 m	340.3	2603.7	84.3	0.75%	12 m	363.9	2987.4	88.3	0.89%	12 m

Table 3

Output for Solomon dataset (CVRPTW with 50 customers).

		R1-50					C1-50					RC1-50
Method	Type	$f_{1}$	$f_{2}$	$f_{3}$	Gap	Time	$f_{1}$	$f_{2}$	$f_{3}$	Gap	Time	$f_{1}$	$f_{2}$	$f_{3}$	Gap	Time
INSGA-II [41]	H	1897.3	15892	198.4	–	34.2 s	1672.4	14328	190.3	–	20.1 s	1744.8	13205	184.5	–	25.6 s
MOEA/D [42]	H	2012.6	17873	225.6	–	45.7 s	1822.3	16273	210.4	–	31.3 s	1924.5	15982	201.4	–	30.2 s
MARDAM [43]	RL, G	1219.8	10229	167.3	1.26%	2.71 s	1077.8	7326.8	143.2	0.97%	2.24 s	1119.8	6309.6	156.7	1.03%	3.0 s
PtrNet [15]	RL, G	1221.4	10242	169.5	9.78%	–	1082.4	7338.9	146.3	7.22%	–	1121.3	6324.5	159.3	8.36%	–
PtrNet [15]	RL, BS	1201.2	8932.4	134.5	5.46%	–	983.6	6623.7	110.2	3.21%	–	1002.6	5673.9	134.6	4.25%	–
RE-GAT [44]	RL, G	1084.3	7621.6	110.4	4.05%	7 s	963.2	5219.8	88.6	2.17%	7 s	987.3	5412.7	99.3	3.26%	7 s
RE-GAT [44]	RL, S	1054.6	6987.4	98.7	1.54%	1 h	945.3	5012.7	79.4	0.97%	1 h	965.2	5203.4	84.3	1.02%	1 h
Ours (2-opt)	RL, 2-opt	1055.3	6989.2	99.1	1.56%	8 s	946.4	5015.3	80.2	1.02%	8 s	966.4	5205.1	85.2	1.12%	8 s
Ours (sample)	RL, S	1002.4	6602.3	83.2	0.65%	34 m	893.1	4209.5	72.2	0.45%	34 m	902.5	4874.3	78.6	0.53%	34 m

Table 4

Output for Solomon dataset (CVRPTW with 100 customers).

		R1-100					C1-100					RC1-100
Method	Type	$f_{1}$	$f_{2}$	$f_{3}$	Gap	Time	$f_{1}$	$f_{2}$	$f_{3}$	Gap	Time	$f_{1}$	$f_{2}$	$f_{3}$	Gap	Time
INSGA-II [41]	H	4906.5	46542	263.5	–	1.5 h	4562.2	43219	234.6	–	1.5 h	4705.1	44984	245.3	–	1.5 h
MOEA/D [42]	H	5090.7	47983	267.4	–	1.5 h	4628.8	42308	236.8	–	1.5 h	4813.4	45273	251.2	–	1.5 h
MARDAM [43]	RL, G	2703.4	23972	210.1	2.64%	3 m	2398.3	22981	178.4	2.12%	3 m	2417.6	15622	189.7	2.43%	3 m
PtrNet [15]	RL, G	1723.5	16120	213.4	10.12%	–	1342.6	15023	182.3	8.34%	–	1459.6	11098	192.3	9.12%	–
PtrNet [15]	RL, BS	1696.8	15238	192.3	8.39%	–	1317.2	14763	165.5	6.23%	–	1418.4	10987	176.1	7.13%	–
RE-GAT [44]	RL, G	1669.4	14863	189.1	6.68%	17 s	1202.5	14293	162.1	4.48%	17 s	1401.3	10827	163.2	5.53%	17 s
RE-GAT [44]	RL, S	1616.7	14563	184.4	3.25%	4 h	1198.6	14108	156.7	1.19%	4 h	1367.7	10532	159.8	2.25%	4 h
Ours (2-opt)	RL, 2-opt	1621.6	14219	143.2	0.63%	18 s	1276.9	14121	117.3	0.49%	18 s	1379.5	10586	121.2	0.53%	18 s
Ours (sample)	RL, S	1573.2	11082	141.7	0.45%	42 m	1193.3	12097	112.4	0.29%	42 m	1228.1	9739.8	116.4	0.31%	42 m

5.3 Analysis of the calculation results of the test set

The Res-E-G-AT-Caps Net model, combined with the 2-opt search strategy and the sampling search strategy, is compared with the multi-objective optimization algorithms INSGA-II [41] and MOEA/D [42], and with the DRL methods PtrNet [15], MARDAM [43], and RE-GAT [44]. Among them, INSGA-II [41] and MOEA/D [42] are multi-objective optimization algorithms based on decomposition strategy, MARDAM is a sequential multi-agent model with an actor-critic structure, PtrNet is an end-to-end solution, and RE-GAT is a graph attention network model that considers residual structure.

Tables 2–4 show the test results of the DRL framework proposed in this article and of the comparison methods on datasets of different scales (25/50/100), together with the average running time over all test cases. From the tables, it can be seen that the DRL framework proposed in this article outperforms the compared algorithms/methods in all objective functions. To be precise, the solution quality on the test cases of 25 customer nodes and 50 customer nodes is similar to that of the comparison algorithms, but the solution time is much shorter than that of other algorithms. On the scale problem of 100 customer nodes, both the solution quality and time are better than other algorithms. The DRL model with 2-opt local search strategy and sampling strategy outperforms other algorithms in terms of solution quality and time for three scale problems.

As shown in Table 2, testing on the 25-customer instances of different types (C, R, RC) in the Solomon data-set shows that the Res-E-G-AT-Caps Net model with a 2-opt local search strategy and sampling strategy obtains the best solutions among the compared methods. Taking the R-type data-set as an example, the results obtained by the sampling strategy shorten the total transportation distance by about 1.57%, the total transportation cost by about 5.21%, and the maximum makespan by about 10.45% compared to the solutions obtained by the 2-opt local search strategy. In addition, compared to RE-GAT (Sample), the best of the other learning-based methods, the sampling strategy reduces the total transportation distance by about 1.44%, the total transportation cost by about 5.09%, and the maximum makespan by about 9.61%.
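The percentages quoted for the R-type data-set can be reproduced from the entries of Table 2. The check below suggests that the reductions are expressed relative to the sampling-strategy value (the smaller number); this is inferred from the table values and is not stated explicitly in the text.

```python
def relative_reduction(reference, improved):
    """Percentage by which `improved` is lower than `reference`,
    expressed relative to the improved (smaller) value."""
    return 100.0 * (reference - improved) / improved

# R1-25 entries from Table 2: Ours (2-opt) vs. Ours (Sample).
distance = relative_reduction(620.2, 610.6)
cost = relative_reduction(4542.5, 4317.5)
makespan = relative_reduction(104.6, 94.7)
```

Rounded to two decimals, these three values match the 1.57%, 5.21%, and 10.45% reported for the comparison with the 2-opt strategy.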

As shown in Table 3, testing on the 50-customer instances of different types (C, R, RC) in the Solomon data-set shows that the Res-E-G-AT-Caps Net model with a 2-opt local search strategy and sampling strategy obtains the best solutions among the compared methods. Taking the C-type data-set as an example, the results obtained by the sampling strategy reduce the total transportation distance by about 5.96%, the total transportation cost by about 19.14%, and the maximum makespan by about 11.22% compared to the solutions obtained by the 2-opt local search strategy. In addition, compared to RE-GAT (Sample), the best of the other learning-based methods, the sampling strategy reduces the total transportation distance by about 5.84%, the total transportation cost by about 19.08%, and the maximum makespan by about 9.97%.

As shown in Table 4, testing on the 100-customer instances of different types (C, R, RC) in the Solomon data-set shows that the Res-E-G-AT-Caps Net model with a 2-opt local search strategy and sampling strategy obtains the best solutions among the compared methods. Taking the RC data-set as an example, the results obtained by the sampling strategy are about 12.32% shorter in total transportation distance, 8.68% lower in total transportation cost, and about 4.12% shorter in maximum makespan compared to the solutions obtained by the 2-opt local search strategy. In addition, compared to RE-GAT (Sample), the best of the other learning-based methods, the sampling strategy reduces the total transportation distance by about 11.36%, the total transportation cost by about 8.13%, and the maximum makespan by about 37.18%.

Figure 6.

PF obtained by all comparison methods on 3 objective 25 customer instances.

Figure 7.

PF obtained by all comparison methods on 3 objective 50 customer instances.

From Tables 2–4, it can be seen that the sampling strategy obtains the best solution among all methods. In addition, Res-E-G-AT-Caps Net outperforms or approaches all comparison algorithms/models in terms of computational speed, and compared to heuristic algorithms, the model proposed in this article has better experimental results. By synthesizing experimental results on datasets of different sizes and types, it can also be seen that the model performs best on C-type data, followed by RC type, and finally R type. This is because the training data consists of randomly generated examples, and examples with certain rules make the model more efficient.

The Pareto front (PF) [45] is the mapping into the objective space of the solution set composed of all non-dominated solutions. The more uniformly distributed the PF, the better the solution performance of the algorithm. The PFs obtained by all comparative algorithms/models in the various settings are shown in Figures 6 to 8. It can be seen that the model trained in this article can be efficiently applied to three-objective VRPs with different numbers of customers. Although the model was trained on randomly generated examples, it still performs well on other types of datasets.
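As a minimal sketch of how the non-dominated solutions that make up a PF can be extracted from a set of candidate solutions, the following Python snippet filters objective vectors under minimization. The function names `dominates` and `pareto_front` and the sample (distance, cost, makespan) triples are illustrative assumptions, not values or code from the experiments.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if `a` dominates `b` when every objective is minimized."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep only the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (distance, cost, makespan) triples, for illustration only.
solutions = [
    (10.0, 5.0, 3.0),
    (9.0, 6.0, 3.0),   # dominated by (9.0, 6.0, 2.5)
    (11.0, 6.0, 4.0),  # dominated by (10.0, 5.0, 3.0)
    (9.0, 6.0, 2.5),
]
front = pareto_front(solutions)
print(front)
```

This pairwise filter needs a number of comparisons that grows quadratically with the number of candidate solutions, which is acceptable for the small solution sets plotted in a PF figure.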

Figure 8.

PF obtained by all comparison methods on 3 objective 100 customer instances.

Table 5

HV values obtained by all comparison methods on different numbers of customer instances.

		CVRPTW-25			CVRPTW-50			CVRPTW-100
Method	Type	R	C	RC	R	C	RC	R	C	RC
INSGA-II	H	28731	30358	29320	4.21E+05	4.43E+05	4.36E+05	8.76E+05	8.96E+05	8.83E+05
MOEA/D	H	27904	29201	28102	4.02E+05	4.20E+05	4.11E+05	8.42E+05	8.63E+05	8.55E+05
MARDAM	RL, G	36943	38426	37920	4.97E+05	5.02E+05	5.00E+05	9.23E+05	9.44E+05	9.35E+05
PtrNet	RL, G	44907	46385	45328	5.17E+05	5.32E+05	5.22E+05	1.01E+06	1.23E+06	1.12E+06
PtrNet	RL, BS	49225	52290	51126	5.66E+05	5.90E+05	5.76E+05	1.12E+06	1.46E+06	1.29E+06
RE-GAT	RL, G	57314	59034	58293	6.68E+05	6.84E+05	6.72E+05	1.32E+06	1.54E+06	1.44E+06
RE-GAT	RL, S	59603	62109	61082	6.97E+05	7.12E+05	7.05E+05	1.38E+06	1.57E+06	1.47E+06
Ours (2-opt)	RL, 2-opt	63407	64327	63653	7.34E+05	7.66E+05	7.44E+05	1.45E+06	1.71E+06	1.58E+06
Ours (sample)	RL, S	65218	67329	66782	7.86E+05	7.98E+05	7.90E+05	1.65E+06	1.96E+06	1.83E+06

In addition, the hypervolume (HV) indicator [45] is used to measure the size of the hypervolume enclosed by the obtained PF and a reference point, and to verify the distribution characteristics of the solutions in the objective space. The larger the HV, the better the algorithm performance. Table 5 lists the HV values of all comparative algorithms/methods. From Table 5, it can be seen that the Res-E-G-AT-Caps Net model performs best on all instances.
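To make the HV indicator concrete, the sketch below computes the exact hypervolume for the simpler two-objective minimization case. It is an illustration under stated assumptions: the function name `hypervolume_2d`, the sample front, and the reference point are hypothetical, the reference point used in the experiments is not given in this section, and the three-objective values in Table 5 would require a more general algorithm.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]


def hypervolume_2d(front: Iterable[Point], ref: Point) -> float:
    """Area dominated by `front` and bounded by `ref` (both objectives minimized)."""
    # Only points that improve on the reference point in both objectives contribute.
    pts = sorted(p for p in front if p[0] < ref[0] and p[1] < ref[1])
    area = 0.0
    best_f2 = ref[1]
    for f1, f2 in pts:  # ascending in the first objective
        if f2 < best_f2:  # dominated points add no new area and are skipped
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area


# Hypothetical front and reference point, for illustration only.
sample_front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (3.0, 3.0)]
hv = hypervolume_2d(sample_front, ref=(4.0, 4.0))
print(hv)  # 6.0
```

A front that lies closer to the ideal point, or that covers the objective space more evenly, encloses a larger area with the same reference point, which is why a larger HV indicates better performance.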

In order to display the experimental results of different methods more clearly, qualitative analysis was conducted on all methods. Taking CVRPTW-100 as an example, Figure 9 shows the final vehicle routing maps under different comparison methods.

Figure 9.

Routing diagrams of different methods under CVRPTW-100.

Each color in Figure 9 represents one route. From Figure 9, it can be seen that on the 100-customer data-set, the routes obtained by the method proposed in this paper are more uniformly distributed. Together with Tables 2–4, this indicates that the proposed method is efficient.

5.4 Framework calculation time complexity analysis

This framework solves vehicle routing problems through offline training and online testing. This section evaluates how the runtime of the proposed model in the training and testing stages grows with the graph size (i.e., the number of nodes) when solving VRP. In the training phase, the training time depends not only on the graph size of the problem but also on the amount of training data and the batch size. Without loss of generality, this paper uses 10000 training instances and a fixed batch size of 128 to measure the running time of a single epoch as the number of nodes increases from 1 to 100.

In this article, INSGA-II and MOEA/D are heuristic search algorithms. The complexity of INSGA-II is $O(mn^{2})$, where $m$ is the number of objectives and $n$ is the population size. The complexity of MOEA/D is lower than that of INSGA-II. The GAT-based model and its improved version have the same runtime complexity as the model proposed in this article, namely $O(n)$. The model proposed in this article shows significant improvements in both runtime and the gap to the optimal solution.

5.5 Generalization comparison experiment

To verify the transfer performance of the Res-E-G-AT-Caps Net model, the trained model was applied to instances with different numbers of customers (150 and 200, taking R-type instances as examples) and different numbers of objectives (2 and 5). The 2-objective results are shown in Figure 10. It can be seen from Figure 10 that the Res-E-G-AT-Caps Net model proposed in this paper is significantly superior to the other algorithms/DRL-based models on all 150- and 200-customer instances. In addition, Table 6 presents the HV values for different numbers of objectives and different instance scales. From Table 6, it can be seen that the model proposed in this paper performs best on all instances.

To further demonstrate the effectiveness of the proposed framework, this paper uses the actual operational data of the urban logistics distribution studies in References [46] and [47] (named Case 1 and Case 2 here) as the experimental data-set. In addition, 3S-MMDEA [48] and HF-VRP-DL [49] are selected as further comparison methods. Table 7 compares the methods in terms of the objective function values, the gap, and the time needed to obtain the results.

Figure 10.

PF obtained by all comparison methods on a 2-objective instance.

Table 6

HV values obtained by all comparison methods for different numbers of objectives and customers.

		2-objective		5-objective
Method	Type	150 customers	200 customers	150 customers	200 customers
INSGA-II	H	18563	31525	8.02E+10	1.34E+11
MOEA/D	H	19023	32617	8.12E+10	1.59E+11
MARDAM	RL, G	21059	42176	8.89E+10	1.78E+11
PtrNet	RL, G	34094	44583	1.02E+11	3.01E+11
PtrNet	RL, BS	34185	45072	1.15E+11	3.12E+11
RE-GAT	RL, G	39043	47925	1.24E+11	3.34E+11
RE-GAT	RL, S	40122	48231	1.47E+11	3.45E+11
Ours (2-opt)	RL, 2-opt	41309	49226	1.55E+11	4.23E+11
Ours (sample)	RL, S	42095	50102	1.63E+11	4.57E+11

Table 7

Comparison of all methods and indicators on Case 1 and Case 2.

		Case 1					Case 2
Method	Type	$T$	$f$	$T(TS_{g})$	Gap	Time	$T$	$f$	$T(TS_{g})$	Gap
MOEA/D	H	8612.8	99608	192.23	–	1.5 h	34972.4	210834	192.23	–
3S-MMDEA	H	8230.3	92083	181.2	–	1 h	32093.5	192073	159.76	–
HF-VRP-DL	DL, G	7947.5	87025	173.7	0.0068	2 min	29407.7	176525	121.65	0.0301
PtrNet (greedy)	RL, G	7498.6	85984	166.8	0.0059	–	28102.5	169833	102.36	0.0287
PtrNet (beam search)	RL, BS	7347.8	84632	156.7	0.0055	–	27409.7	162346	99.39	0.0246
RE-GAT (greedy)	RL, G	7198.4	82127	151.5	0.0049	20 s	25101.5	141108	83.98	0.0196
RE-GAT (sample)	RL, S	7125.5	81497	149.3	0.0041	55 s	25103.6	141072	84.48	0.0187
Ours (2-opt)	RL, 2-opt	7012.4	80301	147.1	0.0032	20 s	24948.5	139965	83.32	0.0023
Ours (sample)	RL, S	6985.3	80189	146.9	0.0028	55 s	24896.4	138921	81.02	0.0018

From Table 7, it can be seen that the Res-E-G-AT-Caps Net model with the 2-opt local search strategy and the sampling strategy obtains the best solutions among all compared methods. Taking the Case 1 data-set as an example, the sampling strategy yields a total transportation distance about 0.39% shorter, a total transportation cost about 0.14% lower, and a maximum makespan about 0.14% shorter than the solutions obtained by the 2-opt local search strategy. In addition, relative to RE-GAT (Sample), the best of the other learning-based methods, the proposed method reduces the total transportation distance by about 1.97%, the total transportation cost by about 1.60%, and the maximum makespan by about 1.61%.
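The percentages quoted for Case 1 can be reproduced from the Table 7 entries with a relative-reduction formula, as in the short sketch below. It assumes that the columns $T$, $f$, and $T(TS_{g})$ correspond to total distance, total cost, and maximum makespan in that order, which is consistent with the reported figures; the helper name `reduction_pct` is illustrative.

```python
def reduction_pct(baseline: float, improved: float) -> float:
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


# Case 1 values copied from Table 7, in the column order (T, f, T(TS_g)).
ours_2opt = (7012.4, 80301.0, 147.1)
ours_sample = (6985.3, 80189.0, 146.9)
regat_sample = (7125.5, 81497.0, 149.3)

vs_2opt = [round(reduction_pct(b, s), 2) for b, s in zip(ours_2opt, ours_sample)]
vs_regat = [round(reduction_pct(b, s), 2) for b, s in zip(regat_sample, ours_sample)]

print(vs_2opt)   # [0.39, 0.14, 0.14]
print(vs_regat)  # [1.97, 1.6, 1.61]
```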

6. Conclusion

This article proposes a multi-agent deep reinforcement learning framework to improve the efficiency of solving MOCVRPTW. Unlike the traditional “grouping before planning” approach, this paper models MOCVRPTW as a Markov decision process based on decomposition and designs the Res-E-G-AT-Caps Net model, an improved graph attention capsule network that jointly considers node and edge information and introduces a residual mechanism. The agents of this model use high-level feature information to learn and coordinate their actions, plan vehicle routes from the perspective of the overall problem, and solve MOCVRPTW quickly once the model has been trained offline. To verify the feasibility and effectiveness of the proposed framework, numerical experiments were conducted on publicly available benchmark instances, and the results were compared with existing learning-based methods and meta-heuristic algorithms. The results indicate that the proposed framework is feasible and efficient for solving vehicle routing problems of different scales.

The MOCVRPTW considered in this article is an extension of VRP in a static environment. In actual logistics and distribution, however, the transportation environment changes constantly, so orders arrive dynamically. Therefore, future research will focus on building end-to-end deep reinforcement learning frameworks for dynamic environments.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant No. 61806006, the China Postdoctoral Science Foundation under Grant No. 2019M660149, the Graduate Innovation Foundation of Jiangsu Province under Grant No. KYLX16_0781, the 111 Project under Grant No. B12018, and the PAPD of Jiangsu Higher Education Institutions.

Ethical and informed consent for data used

The experiments described in this article do not involve ethical issues.

Written informed consent was obtained from all the participants prior to the enrollment (or for the publication) of this study (or case report).

Author contribution statement

Haifei Zhang: Conceptualization, Software, Formal analysis, Investigation, Data Curation, Writing-Original Draft, Visualization.

Hongwei Ge: Methodology, Validation, Investigation, Resources, Writing-Review & Editing, Supervision, Funding acquisition.

Ting Li: Software, Data Curation.

Lujie Zhou: Validation, Methodology.

Shuzhi Su: Validation, Funding acquisition.

Yubing Tong: Writing-Review & Editing.

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability and access

The datasets generated or analyzed during this study are available from the corresponding author on reasonable request.

References

1. Gevaers, Eddy, Vanelslander, Cost Modelling and Simulation of Last-mile Characteristics in an Innovative B2C Supply Chain Environment with Implications on Urban Areas and Cities, Procedia – Social and Behavioral Sciences 125 (2014), 398–411.

2. Zhang, Yang, Tong, Review of vehicle routing problems: Models, classification and solving algorithms, Archives of Computational Methods in Engineering 29(1) (2022), 195–221.

3. Sharma, Monika, A literature survey on multi depot vehicle routing problem, International Journal for Scientific Research Development 3(4) (2015), 1752–1757.

4. Zhou, Chen, Deng, Parameter adaptation-based ant colony optimization with dynamic hybrid mechanism, Engineering Applications of Artificial Intelligence 105139 (2022).

5. Zhen, Wang, et al., Multi-depot multi-trip vehicle routing problem with time windows and release dates, Transportation Research Part E 135 (2020), 1–21.

6. Xue, Tang, Pang, Liu A.X., Self-adaptive parameter and strategy based particle swarm optimization for large-scale feature selection problems with multiple classifiers, Applied Soft Computing 88 (2020), 1–12.

7. Bengio, Lodi, Prouvost, Machine learning for combinatorial optimization: A methodological tour d’horizon, European Journal of Operational Research 290(2) (2021), 405–421.

8. Wang, Tang, Deep reinforcement learning for transportation network combinatorial optimization: A survey, Knowledge-Based Systems 233 (2021), 107526.

9. Silver, Huang, Maddison C.J., et al., Mastering the game of Go with deep neural networks and tree search, Nature 529(7587) (2016), 484–489.

10. Tang, Shao, Zhao, et al., Recent progress of deep reinforcement learning: From AlphaGo to AlphaGo Zero, Control Theory & Applications 34(12) (2017), 1529–1546.

11. Zhao, Mao, Zhao, et al., A hybrid of deep reinforcement learning and local search for the vehicle routing problems, IEEE Transactions on Intelligent Transportation Systems 99 (2020), 1–11.

12. Nazari, Oroojlooy, Snyder L.V., et al., Reinforcement Learning for Solving the Vehicle Routing Problem, Advances in Neural Information Processing Systems, 2018, 9861–9871.

13. Chen, Tian, Learning to perform local rewriting for combinatorial optimization, Advances in Neural Information Processing Systems 32 (2019), 6281–6292.

14. Zhang and Yang, A Learning-based Iterative Method for Solving Vehicle Routing Problems, in: Proceedings of the International Conference on Learning Representations, 2020.

15. André, Tierney, Neural Large Neighborhood Search for the Capacitated Vehicle Routing Problem, in: Proceedings of European Conference on Artificial Intelligence, 2020.

16. Xin, Song, Cao, et al., Multi-Decoder Attention Model with Embedding Glimpse for Solving Vehicle Routing Problems, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2020.

17. Kwon Y.D., Choo, Kim, et al., POMO: Policy optimization with multiple optima for reinforcement learning, Advances in Neural Information Processing Systems 33 (2020), 21188–21198.

18. Zhao, Mao, Zhao, et al., A hybrid of deep reinforcement learning and local search for the vehicle routing problems, IEEE Transactions on Intelligent Transportation Systems 99 (2020), 1–11.

19. Zhang, et al., Multi-vehicle routing problems with soft time windows: A multi-agent reinforcement learning approach, Transportation Research Part C – Emerging Technologies 121 (2020), 102861.

20. Wu, Song, Cao, et al., Learning improvement heuristics for solving routing problems, IEEE Transactions on Neural Networks and Learning Systems 33(9) (2021), 5057–5069.

21. Zhang, Wang, Zhang, et al., MODRL/D-EL: Multiobjective Deep Reinforcement Learning with Evolutionary Learning for Multiobjective Optimization, in: Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), IEEE, 2021, pp. 1–8.

22. et al., Solving time-dependent traveling salesman problem with time windows with deep reinforcement learning, in: Proceedings of the 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2021, pp. 558–563.

23. Kim, Park, Kim, Learning collaborative policies to solve NP-hard routing problems, Advances in Neural Information Processing Systems 34 (2021), 10418–10430.

24. Lin, Yang, Zhang, Pareto Set Learning for Neural Multi-objective Combinatorial Optimization, in: Proceedings of the International Conference on Learning Representations, 2022.

25. Hottung, Kwon Y.D., Tierney, Efficient active search for combinatorial optimization problems, in: Proceedings of the International Conference on Learning Representations, 2022.

26. Revin, Potemkin V.A., Balabanov N.R., et al., Automated machine learning approach for time series classification pipelines using evolutionary optimization, Knowledge-Based Systems, 2023.

27. Gao, Pan, et al., FACapsnet: A fusion capsule network with congruent attention for cyberbullying detection, Neurocomputing, 2023.

28. Velickovic, Cucurull, Casanova, Romero, Lio, Bengio, Graph attention networks, in: Proceedings of the International Conference on Learning Representations, 2017.

29. Abdullahi, Reyes-Rubiano, Ouelhadj, et al., Modelling and multi-criteria analysis of the sustainability dimensions for the green vehicle routing problem, European Journal of Operational Research 292 (2021).

30. Vinyals, Babuschkin, Czarnecki W.M., et al., Grandmaster level in StarCraft II using multi-agent reinforcement learning, Nature 575(7782) (2020), 350–354.

31. K.W., Zhang, Wang, Deep reinforcement learning for multi-objective optimization, IEEE Transactions on Cybernetics 51(6) (2021), 3103–3114.

32. Fellows, Mahajan, Rudner, et al., VIREL: A Variational Inference Framework for Reinforcement Learning, Neural Information Processing Systems, 2019, 7120–7134.

33. Fang, Chen, et al., Reinforcement Learning With Multiple Relational Attention for Solving Vehicle Routing Problems, IEEE Transactions on Cybernetics, 2021. doi: 10.1109/TCYB.2021.3089179.

34. Ioffe, Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift, in: Proceedings of the International Conference on Machine Learning, 2015, pp. 448–456.

35. Sabour, Frosst, Hinton, Dynamic routing between capsules, in: Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, USA, 2017, pp. 3859–3869.

36. Yang, Zhao, Chen, et al., Investigating the transferring capability of capsule networks for text classification, Neural Networks 118 (2019), 247–261.

37. http://bimk.ahu.edu.cn/index.php?s=/Index/Software/index.html.

38. Solomon M.M., Algorithms for the vehicle routing and scheduling problem with time window constraints, Operations Research 35(2) (1987), 254–265.

39. Kingma D.P., Ba J.L., Adam: A method for stochastic optimization, in: Proceedings of the 3rd International Conference on Learning Representations, San Diego, 2015, pp. 1–11.

40. Glorot, Bengio, Understanding the difficulty of training deep feed forward neural networks, in: Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.

41. Srivastava, Singh, Mallipeddi, NSGA-II with objective-specific variation operators for multi objective vehicle routing problem with time windows, Expert Systems with Applications 176 (2021), 114779.

42. Wang, Dai, Zhao, et al., Multi-objective optimization of hexahedral pyramid crash box using MOEA/D-DAE algorithm, Applied Soft Computing 118 (2022).

43. Bono, Dibangoye J.S., Simonin, et al., Solving multi-agent routing problems using deep attention mechanisms, IEEE Transactions on Intelligent Transportation Systems 22(12) (2021), 7804–7813.

44. Lei, Guo, Wang, et al., Solve routing problems with a residual edge-graph attention neural network, Neurocomputing 508 (2022), 79–98.

45. Ryoji, Hisao, A review of evolutionary multimodal multiobjective optimization, IEEE Transactions on Evolutionary Computation 24(1) (2020).

46. Wang, X.L., M.Z., et al., Two-echelon logistics distribution region partitioning problem based on a hybrid particle swarm optimization-genetic algorithm, Expert Systems with Applications 42(12) (2015), 5019–5031.

47. Zhang H.F., Ge H.W., Yang J.M., Su S.Z., Tong Y.B., Combining affinity propagation with differential evolution for three-echelon logistics distribution optimization, Applied Soft Computing 131C(109878) (2022).

48. Zhang H.F., Ge H.W., Su S.Z., Tong Y.B., Three-stage multi-modal multi-objective differential evolution algorithm for vehicle routing problem with time windows, Intelligent Data Analysis 28(2) (2024), 485–506. doi: 10.3233/IDA-227410.

49. Fadda, Mancini, Serra, et al., The heterogeneous fleet vehicle routing problem with draft limits, Computers & Operations Research 149(Jan.) (2023), 1–10.

Method	Type	Training scale	Heuristic type	DL model	DRL algorithm
Nazari et al. [12]	TSP, CVRP	20, 50, 100	Construction + sampling	RNN (Pointer net) + ATT	Reinforce
NeuRewriter [13]	CVRP	20, 50, 100	Improvement (swap)	RNN + ATT	A2C
L2I [14]	CVRP	20, 50, 100	Improvement (hybrid)	ATT	Reinforce
NLNS [15]	CVRP, SDVRP	20, 50, 100	Improvement (ruin and repair)	ATT	A2C
MDAM [16]	TSP, CVRP, SDVRP	20, 50, 100	Construction + beam search	Transformer (AM)	Reinforce
POMO [17]	TSP, CVRP	20, 50, 100	Construction + augmentation	Transformer (AM)	POMO
Zhao et al. [18]	CVRP, VRPTW	20, 50, 100	Construction + local search	RNN + ATT	A2C
MAAM [19]	VRPSTW	20, 50, 100, 150	Construction + sampling	Transformer (AM)	MARL
Wu et al. [20]	TSP, CVRP	20, 50, 100	Improvement (2-opt, swap, insert)	Modified Transformer	A2C
Zhang et al. [21]	MOVRPTW	50	Construction + sampling	Transformer (AM)	Reinforce + EL
ANN-AM [22]	TDTSPTW	20, 40	Construction + sampling	Transformer (AM) + RNN	Reinforce
LCP [23]	TSP, CVRP, PCTSP	20, 50, 100	Construction + improvement	AM/Pointer net	Reinforce
P-MOCO [24]	MOTSP, MOVRP	20, 50, 100	Construction + augmentation	RNN + ATT	A2C
EAS [25]	TSP, CVRP	100	Construction + active search	Transformer (POMO)	POMO + IL