Abstract
In order to alleviate urban congestion, improve vehicle mobility, and improve logistics delivery efficiency, this paper establishes a practical multi-objective and multi constraint logistics delivery mathematical model based on graphs, and proposes a solution algorithm framework that combines decomposition strategy and deep reinforcement learning (DRL). Firstly, taking into account the actual multiple constraints such as customer distribution, vehicle load constraints, and time windows in urban logistics distribution regions, a multi constraint and multi-objective urban logistics distribution mathematical model was established with the goal of minimizing the total length, cost, and maximum makespan of urban logistics distribution paths. Secondly, based on the decomposition strategy, a DRL framework for optimizing urban logistics delivery paths based on Graph Capsule Network (G-Caps Net) was designed. This framework takes the node information of VRP as input in the form of a 2D graph, modifies the graph attention capsule network by considering multi-layer features, edge information, and residual connections between layers in the graph structure, and replaces probability calculation with the module length of the capsule vector as output. Then, the baseline REINFORCE algorithm with rollout is used for network training, and a 2-opt local search strategy and sampling search strategy are used to improve the quality of the solution. Finally, the performance of the proposed method was evaluated on standard examples of problems of different scales. The experimental results showed that the constructed model and solution framework can improve logistics delivery efficiency. This method achieved the best comprehensive performance, surpassing the most advanced distress methods, and has great potential in practical engineering.
Keywords
Introduction
Vehicle Routing Problem (VRP) is a typical combinatorial optimization problems (COP), and also an NP hard problem. In the actual logistics distribution process, the delivery efficiency is limited by the constraints of vehicle loading (Capacity) and customer time window (Time Window). Therefore, VRP (CVRPTW) considering capacity and time window constraints also belongs to the NP hard problem, which has a wider practical application prospect. When solving CVRPTW, different objectives and costs are generally considered. The cost can be the number of vehicles/routes, total distance traveled by all vehicles, total travel time, etc., or customer satisfaction. In the case of diversified costs, all costs should be optimized simultaneously [1]. Therefore, CVRPTW is a multi-objective optimization problem (MOP), and exploring efficient solutions for such problems has important theoretical and practical significance for the development of logistics supply chains.
In recent years, research on solving the optimal vehicle routing method is emerging in endlessly. Most of the methods need to establish mathematical models to complete the vehicle path optimization by defining different types of variables, constraint functions and objective functions. The commonly used methods mainly include exact algorithms and heuristics [2]. Among them, exact search algorithms are to establish a corresponding mathematical model for a specific problem, and then use mathematical methods to solve it, which can definitely find the optimal solution of the problem. However, due to its NP hard nature, it is difficult to apply to solving problems with more than 50 customers [3]. The heuristic search algorithms are proposed based on the optimization algorithm. Its basic idea is to give a feasible solution to the COP to be solved within an acceptable range. Common heuristic search algorithms mainly include Ant Colony Optimization (ACO) algorithm [4], Genetic Algorithm (GA) [5] and Particle Swarm Optimization (PSO) algorithm [6]. Compared with the exact search algorithm, the heuristic search algorithm has better robustness and feasibility when dealing with large-scale VRP problems, but it usually needs to be designed for specific problems and professional domain knowledge [7], and it is difficult to find a better solution to large-scale problems in polynomial time. Although exact search algorithm produce the optimal solution, while meta heuristic based methods produce near optimal solutions, these methods cannot be extended to problems involving hundreds to potential thousands of customers, and there is still room for further improvement in solving efficiency. Therefore, quickly solving such large-scale problems with reasonable accuracy remains an open challenge, which requires learning based methods. Therefore, it is particularly important to introduce learning methods to efficiently find near optimal solutions.
In recent years, more and more researches have applied DRL to solving COP [8], and have made breakthrough progress. DRL is a combination of deep learning (DL) and reinforcement learning (RL). It utilizes the powerful representation ability of deep learning to fit the action value function or action strategy function of intelligent agents in reinforcement learning, effectively solving the storage problem of action value and state value when the action space and state space are too large. DRL is mainly applied to sequential decision-making tasks. The AlphaGo [9] and AlphaGo Zero [10] Go algorithms use the DRL model combined with Monte Carlo Tree (MCT) search to successfully defeat the world Go champion, pushing the theoretical and applied research of DRL to new heights, and opening up new ideas for using DRL to solve VRP [11]. The optimal selection of decision variables in the discrete decision space of VRP has a natural similarity to the function of RL sequential decision-making, and the characteristics of DRL’s “off-line training, online decision-making” make it possible for VRP to solve in real-time online. Therefore, using DRL method to solve traditional COP is a good choice. Compared with the traditional combinatorial optimization algorithm (COA), the DRL based COA has a series of advantages, such as fast solving speed, strong generalization ability, and so on. This kind of method is a research hotspot in recent years. Table 1 provides a summary of methods for solving VRP and its related problems based on DRL.
VRP solution method based on DRL.
VRP solution method based on DRL.
Based on the above literature analysis, current research on VRP and its related variants mainly focuses on heuristic methods. However, with the development of DRL research in COP, there are still the following limitations in this field:
Heuristic methods usually adopt the idea of “partitioning before optimization”, where different partitions are optimized separately, resulting in a lack of overall correlation. At present, research on DRL methods in COP mainly focuses on solving TSP, VRP, and other problems through the interaction between intelligent agent learning and the environment, while research on solving CVRPTW problems is relatively lacking. GNN, as a powerful tool for processing non Euclidean data and obtaining graphical information, has received extensive research in recent years. However, when using GNN and RL to solve COP, not only did the dependency relationships between edges in the graph structure not be considered, but also the information mining in multi constraint and multi-objective COP was not deep enough.
Based on the above analysis, since the VRP has a graph structure [26], that is, nodes can be embedded through graph/network information, and the Graph Capsule Network (G-Caps Net) has a strong effect in solving spatial distribution problems [27], G-Caps Net can be used to model graph COP, so this paper uses G-Caps Net to build an end-to-end DRL framework. In this framework, the node information of CVRPTW is input into the model in the form of a 2D graph, and then the graph attention capsule network (G-AT Caps Net) is modified by considering multi-level features, edge information, and residual connections between layers in the graph structure [28]. Finally, the next node is selected based on the module length of the output capsule instead of the probability distribution through search strategies such as greedy search or sampling methods.
The network framework designed in this article is called Residual Edge Graph Attention Capsule Networks (Res-E-G-AT-Caps Net). The Res-E-G-AT-Caps Net model considers not only node information features but also edge information features (edge information mainly refers to the weighted distance between nodes). Firstly, the information representing nodes and edges (such as weights) in the MOCVRPTW graph is fused and updated, and richer feature information is captured through the multi head attention (MHA) mechanism. All features are encoded through the primary capsule layer, and residual concatenation and batch normalization (BN) processing are performed. Then, all features are extracted through the convolutional capsule layer, followed by residual concatenation and BN processing, Finally, the individual capsules (as a vector) are output through the fully connected capsule layer, and the module length of the capsule vector represents the probability of node selection. When dealing with multiple targets, the integration method is used to normalize all targets, that is, the reward function calculates all targets in a weighted sum [29]. Finally, use the baseline REINFORCE algorithm with rollout to optimize the entire network.
The main contributions of this paper can be summarized below:
For the multi-objective optimization of MOCVRPTW, based on the idea of decomposition, the multi-objective is transformed into a set of scalar sub objective optimizations, and collaborative optimization is carried out on all sub objectives based on the idea of parameter transfer. A Res-E-G-AT-Caps Net framework is proposed to deeply capture the graph structure information of the MOCVRPTW problem. G-Caps Net is used to encode and extract features of node and edge information in the graph, and to capture the local position global structure association relationship of graph nodes. Simultaneously consider residual connections to reduce feature loss caused by information filtering. A loss function suitable for multi-objective problems is designed. The baseline REINFORCE algorithm with rollout is used to train the strategy network to speed up and stabilize the training of the model. Different local search strategies are used to improve the model and further improve the quality of the solution. The proposed framework can be extended to different graph COP solutions, with simple model implementation, high efficiency, and strong generalization ability. The efficiency of the proposed framework was evaluated on a randomly generated MOCVRPTW instance dataset, and its generalization ability was verified on standard test cases.
Problem description
Generally, CVRPTW can be described as follows: For a distribution center within a known region, there are several vehicles originating from the distribution center that need to complete multiple logistics orders in an orderly and non repetitive manner, and if customer demand does not exceed the vehicle’s carrying capacity. In addition, these orders have limited specific delivery time windows, if the delivery vehicles are earlier or later than this time window range, a certain time penalty cost (i.e. soft time window) needs to be added to the final total transportation cost. Under the above constraints, the logistics distribution route should be reasonably planned to minimize the total delivery cost (the cost here can be the number of vehicles/routes, the total distance traveled by all vehicles, total travel time, etc.). Therefore, CVRPTW is a multi-objective optimization problem, namely MOCVRPTW.
In this paper, a MOCVRPTW can be described as: In a certain distribution region, there is a distribution center with a known location and several vehicles originating from the distribution center. The logistics distribution needs of different customers in the distribution region are completed in an orderly and non repetitive manner, and each vehicle has a maximum load constraint. Each customer node has a distribution soft time window constraint. Under the above constraints, the vehicle travel route is reasonably planned to ultimately achieve the optimization of total transportation distance, the total of logistics delivery cost and makespan.
MOCVRPTW is related to multiple factors, and the mathematical model established is very complex with numerous constraints. In order to facilitate modeling, the following assumptions were made in this study:
Only considering the logistics distribution of a single logistics distribution center. The vehicles for logistics distribution must start from the distribution center and return to the distribution center after completing all customer order delivery tasks. All vehicles have the same capacity and driving speed, and have a maximum distance limit. Each vehicle only completes the delivery of one route. The demand and location coordinates of each customer are known and fixed, and all customer nodes are connected. If each customer node is only allowed to be accessed once, then customer nodes that have already been served by the vehicle will not be considered as candidates for viable customer nodes of other vehicles. During the delivery process of vehicles, the impact caused by temporary vehicle malfunctions or incorrect delivery of goods will not be considered temporarily.
Consider MOCVRPTW as a graph
The mathematical model of MOCVRPTW is as follows:
Minimize total transportation distance:
Minimize total transportation costs:
Where,
Where:
Minimize makespan:
Where,
The relevant constraints and decision conditions are as followed:
All customer nodes have equal number of edges in and out:
All customer nodes have and only have a unique vehicle service:
All vehicles must not exceed their maximum capacity:
Path decision variables:
Is there a time requirement for the delivery task:
The departure time of the vehicle is 0:
Based on the above analysis, for a feasible candidate customer node
Based on the analysis in Section 2.2, CVRPTW is a multi-objective optimization problem (MOP). When solving MOP, decomposing it into a set of sub-problems and modeling each sub-problem separately, known as the decomposition strategy, is a reliable method. The decomposition strategy [30] has great advantages in maintaining the distribution of solutions, and by analyzing the information of adjacent problems to optimize, it can avoid falling into local optima.
This article combines the decomposition strategy with DRL to solve MOCVRPTW. Specifically, the entire vehicle routing problem is explicitly decomposed into a set of scalar quantum problems and solved using collaborative methods. Finding optimization sub-problems tends to guide Pareto Optimal (PO) solutions. When all scalar optimization problems are solved, the expected Pareto Frontier (PF) can be obtained.
The decomposition strategy decomposes MOCVRPTW into multiple single objective sub-problem, with each sub-problem optimizing one objective function while keeping the other objective functions fixed. Since the weight vectors of two adjacent sub-problem are adjacent, these two adjacent sub-problem may have very close optimal solutions [31]. This article uses the Weighted Sum Method (WSM) [29] to decompose the MOP into multiple single objective sub-problems, and then uses DRL to solve each sub-problem. Each sub-problem is modeled as a graph capsule network, and then uses a neighborhood based parameter transfer strategy [32] to solve optimization collaboratively among sub-problems.
Model description
The DRL process for solving MOCVRPTW requires defining the state space, action space, and reward function, and constructing a DRL model. Through training and testing, the model performance is optimized to ultimately obtain the best vehicle routing solution. The overall framework for solving MOCVRPTW based on DRL is shown in Figure 1. Firstly, define the reinforcement learning form of MOCVRPTW. Secondly, design a Res-E-G-AT-Caps Net network model based on value strategy gradient reinforcement learning method to train the network model. Finally, different action selection strategies and search strategies are used to obtain higher quality solutions.

Overall framework for solving MOCVRPTW based on DRL.
Modeling MOCVRPTW as Markov Decision Process (MDP)
State: State
Action: The multi-agent reinforcement learning action space is the joint action space
State transition: After the current time step
Reward: For MOCVRPTW, the objective functions include minimizing total distance, minimizing total cost, and minimizing maximum completion time. The three objective functions are normalized [33], and then the normalized results of the three objectives are weighted and summed. The negative number of the weighted sum is used as the cumulative return, resulting in:
The parameterized random policy
Mining the access order of customer nodes in the logistics delivery process is an important means to achieve efficient logistics delivery. The access sequence of customer nodes needs to be comprehensively analyzed based on factors such as vehicle load and time window constraints in the current and next states. A complete sequence of vehicle access nodes is a series of interrelated sequential data, with obvious local and global correlations. Therefore, the inherent and overall correlation of vehicle access node sequences is of great significance for the overall efficiency of logistics delivery.
This article proposes a Residual Edge Graph Attention Capsule Networks (Res-E-G-AT-Caps Net) model for constructing vehicle access node sequences. The overall framework of the model is shown in Figure 2. This network includes a graph attention model as an encoder and a fully connected capsule layer as a decoder.

Res-E-G-AT-Caps Net model.
The fusion feature representation based on nodes and edges refers to the combination of node and edge features to form a more comprehensive image feature representation. Nodes mainly refer to customer nodes and distribution centers, while edges refer to the path length between any two nodes. Node attribute features mainly include information such as customer demand, delivery time windows of each node, and location coordinates. Edge features are the Euclidean distance between any two nodes.
(1) Embedding layer
Embedding layer, as a commonly used layer in DL models, is mainly used to convert discrete inputs at high latitudes into continuous vectors at low dimensions for model processing. For embedding different types of data, it is only necessary to map each symbol to a unique encoding and use a trainable matrix to convert each number into a corresponding embedding vector.
Using 2D graph
The input node features
Where,
(2) Multiple attention mechanism
Because edge information and node information have certain relevance, integrating edge information into node information and updating each node information can better integrate features, as shown in Figure 3. On the other hand, residual connections can effectively fuse high and low features, minimizing the loss of detailed features caused by network deepening. The residual block can comprehensively extract the spatial and channel features of the data as it gradually deepens the network depth. This enhances the expression ability of the capsule, allowing the model to transmit more information with a small amount of primary capsules.

Node and edge information fusion process.
After information fusion and residual connection, the final embedded feature vector of each node is:
Where,
Where,
MHA [18] is an attention mechanism in multiple different dimensional spaces, and the multi head attention mechanism helps to obtain feature information from different dimensions. The number of attention heads used in this article is
Where,
Calculate the scaling point multiplication value
Where,
The fusion feature representation of node and edge data, as shallow and intuitive features, is not sufficient to express hidden relationships. Therefore, it is necessary to extract deep level correlation features between nodes from the input shallow features through a node feature extraction model. Caps Net [35], [36] uses a vector composed of a set of neurons to store comprehensive features such as entity attributes and spatial relationships, which has stronger representation ability and robustness. Caps Net has innovatively proposed a dynamic routing mechanism based on capsule vector structure to quickly extract effective features from a large amount of information. In capsule networks, low-level capsules depict local features of objects, high-level capsules represent overall features, and dynamic routing mechanisms are used to extract effective features from low-level capsules and update high-level capsules.
This article constructs a feature extraction and node prediction model based on Caps Net, which automatically learns the relationship between local and global nodes through the transformation matrix between capsules. This part is a three-layer capsule network, including convolutional layer, primary capsule layer, and fully connected capsule layer.
(1) Convolutional layer
This layer is a standard convolutional layer that extracts the deep local features of the vector matrix formed by the fusion feature representation of the previous stage at different positions through different scales of convolutional window widths. The convolution layer uses the convolution window to connect with the local area of the feature presentation layer, and the weighted sum of the local information is transferred to the activation function to generate the final output value of the convolution layer. This layer can extract deep level node and edge features from the vector matrix, with the advantage of translation invariance, and can parallel capture the associations between state data.
Let the convolutional layer have
Where,
By arranging the obtained
(2) Primary capsule layer
The features contained in the primary capsules are all from the output of the convolutional layer, and the number of capsules depends on the size of the convolutional feature map. The primary capsule layer is the first capsule layer in the model, which replaces the scalar neurons of CNN with the capsules of vector neurons and also replaces pooling operations, avoiding pooling operations from damaging the local position information and overall sequence structure of node data. The primary capsule combines features extracted from the same location by abstracting low-level features from high-level features, effectively capturing the spatiotemporal relationship between local positions and the overall structure of the data. This layer can extract different attributes of a certain feature in a node, such as the time when the customer node receives service.
The primary capsule layer encapsulates different attributes of the vector matrix
Assuming the dimension of the primary capsule is
Where,
Since each capsule includes
(3) Fully connected capsule layer
The final layer of the model is a fully connected capsule layer, where each capsule is connected to the primary capsule. The capsule matrix obtained from the primary capsule is multiplied by the transformation matrix to obtain a prediction vector, and then dynamic routing is used to generate the final category capsule:
Where,
The output of a capsule is also a vector, and since each capsule can be considered as a vector representing a specific entity, the modulus of the capsule vector can be used to represent the probability of an individual being selected.
The vector capsule in Caps Net encapsulates information from the convolved features, and forms a primary capsule through linear changes. Each capsule in the primary capsule layer will transfer the features upward through the weighting of the coupling coefficient, and the mapped new vector will get a high-level capsule through the nonlinear activation function Squash. The calculation process between capsules at different levels mainly includes two stages: matrix transformation and dynamic routing. Dynamic routing refers to the iterative and cyclic transfer of features from low-level capsules between adjacent capsule layers, and the formation of more abstract high-level capsules. The primary capsule layer uses dynamic routing mechanism to learn features in the data, and transmits the predicted feature information to the fully connected capsule layer in the upper layer. If the prediction results are consistent, the upper layer capsule is activated. Compared to the process of weighted summation and nonlinear transformation required by scalar neurons, capsule networks require matrix transformation to balance the spatial and hierarchical relationships of the identified objects.

Dynamic routing mechanism.
The dynamic routing structure relationship (as well as the relationship between the primary capsule layer and the fully connected capsule layer) is shown in Figure 4. This process first involves performing a matrix transformation on each primary capsule to obtain a prediction vector:
Where,
Afterwards, calculate the prediction vector:
Where,
Since there is a vector transformation between capsules, the extrusion function Sqash is used as the activation function to compress and redistribute
Where,
The dynamic routing algorithm iteratively learns the nonlinear mapping relationship between the prediction layer and the fully connected layer, requiring the use of the
Where,
Determine the similarity between the vectors based on the inner product of the primary capsule prediction vector
When using the capsule network to predict the customer node access sequence in the VRP, the commonly used loss function is the Cross-entropy Loss function [34], which can be used to compare the difference between the predicted results and the real results, so as to adjust the network parameters and make the prediction results more accurate.
Specifically, when predicting the order of customer node visits in VRP, the predicted results of each node can be represented as a probability distribution, where each node has a certain probability of being predicted as the next node. Then, the actual results can be represented as a one-hot encoded vector, where only one element’s vector is 1 and the rest are 0, which corresponds to the position of the next node. For example, if the access order is [0, 2, 1, 3], then its unique heat codes are [1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1], and then you can use the cross entropy loss function to compare the difference between the model output and the real one-hot code.
Assuming that
Where, the first term
In the training process, the network parameters are adjusted by minimizing the cross entropy loss function, so that the prediction results are closer to the real results.
In the process of constructing the solution, it is necessary to mask nodes that do not meet the conditions during node prediction at each time point, that is, to avoid repeatedly selecting customer nodes, if the remaining vehicle capacity does not meet the customer node, if the time to arrive at the customer node is not within the service time window of the node, and if the distribution center is selected twice in a row (i.e. selecting depot node
Specifically, in Eq. (19), the attention coefficient of the above case is set to
DRL algorithm based on policy gradient
The DRL algorithm based on policy gradient is a reinforcement learning algorithm that uses deep neural networks to learn policies. Its main idea is to optimize the policy function by maximizing cumulative rewards, while using gradient descent method to update policy parameters. Specifically, the training process of this algorithm is mainly divided into the following steps: Step 1: Initialize the parameters of the policy function. Step 2: At each time step, use the current policy function to interact with the environment and generate a sequence of states, actions, and rewards. Step 3: Calculate the rewards for each time step and use these rewards to calculate the gradient of each state action pair. Step 4: Use gradient descent method to update the reference of the strategy function to better match the goal of maximizing cumulative rewards. Step 5: Repeat steps 2–4 until the strategy function converges or reaches the preset number of training times.
In practical applications, DRL algorithm based on policy gradient usually adopts some improved technologies, such as baseline, importance sampling and truncation return, to improve the efficiency and stability of the algorithm. In addition, this algorithm can also be combined with other reinforcement learning algorithms, such as value strategy learning and DQN, to achieve better performance.
Strategy network training methods
The REINFORCE algorithm with Rollout baseline is mainly based on a reinforcement learning algorithm for policies, which estimates policy gradients by calculating the cumulative return of a single agent and trains the single agent strategy. This article uses this algorithm to train multi-agent joint strategies, estimates the strategy gradient by calculating the cumulative return of the joint strategy, and trains the multi-agent strategy network. For a given instance
Calculate the policy gradient using the REINFORCE algorithm with a baseline and update the policy network parameters using a gradient descent method, namely:
Using benchmark network
The strategy network that has undergone multiple rounds of learning has good decision-making ability, and the action selection strategy selects actions based on the probability vector output by the network. This article adopts two different action selection strategies:
The greedy action selection strategy fully trusts the policy network, with each step based on the output value of the fully connected capsule layer, selecting the action with the maximum module length. The sampling action selection strategy uses the output value of the fully connected capsule layer as the sampling module length distribution, and selects actions based on this distribution. Therefore, this strategy does not always select the action with the maximum module length, but instead selects the corresponding action with different module lengths.
In the process of network training, benchmark network
The trained model can quickly solve MOCVRPTW using a greedy action selection strategy, but there is some room for improvement for some difficult instances, such as route crossing problems in the sub loop and overconfidence behavior of the greedy action selection strategy, which leads to missing actions with a high selection probability (but not the highest). In order to improve the quality of the solution, this article adopts two local search strategies:
2-opt search. In response to the problem of route crossing in the sub loop, 2-opt local search takes each sub loop of the model output solution as the initial solution, and optimizes all sub loops through 2-opt operation to improve the overall quality of the solution. Sampling search. The model uses a sampling action selection strategy to repeatedly solve the same instance to obtain multiple complete solutions, and takes the optimal solution. If
Experiment and result analysis
Since the paper deals with logistics and optimization problems, it is important to consider potential ethical implications, this paper mainly considers environmental impact and fairness in distribution. The dataset used in this study is a standard dataset, and the simulated experimental hardware conditions are consistent. Special cases have been explained in the model assumptions, and the environmental impact is fair. In addition, according to the model assumption, the distribution of customer nodes and delivery volume are known and fixed, and the distribution is also fair. Therefore, the ethical factors present in this study have no direct impact on the experimental simulation results.
Experimental setting
Using PyTorch to implement the overall framework of the model, the policy network model was trained on a single GPU (2080ti, 10G graphics memory), and tested on a Windows7 operating system running on Intel Core i7-3630QM 2.40GHz, 8GB RAM. The model was implemented using Python 3.7. Meanwhile, comparative experiments were conducted on all multi-objective optimization algorithms on the standard software platform PlatEMO [37] written in MATLAB.
This article verifies the feasibility and effectiveness of the proposed framework through experiments, which include two parts: training and testing. As an unsupervised learning method, only model input and reward function are required during the training process, without the need for the best tour as a label.
Training data: To verify the performance of the proposed model in this article, the models were trained on customer node examples with scales of 25, 50, and 100, respectively, where the maximum capacity of vehicles for customer nodes 25, 50, and 100 is 50, 100, and 200, respectively. The coordinates of nodes are randomly generated in [0, 100]
In the model training phase, for the 25 customer and 50 customer scale problems, the number of training rounds (epoch) is set to 100, the number of training batches per round is set to 2500, and the number of examples per batch is set to 512. For the problem of 100 customer sizes, due to the limitation of graphics memory size, the number of cases per batch is set to 128. In the case testing phase, for problems of different sizes, 10000 sets of cases are tested under their corresponding distributions. Once the model has been trained, it can be used to directly output Pareto Front (PF).
Testing data: This article considers the Solomon benchmark data-set [38] of CVRPTW consisting of 25, 50, and 100 customers, which is classified based on vehicle capacity and customer location (C, R, RC). Test each category based on vehicle capacity and customer location type. The number of training rounds (epochs) for each model is 100. Compare the obtained results with the selected comparison framework/algorithm on the objective function established in Section 2.2.
Parameter settings: The embedding dimension of node information is 128, and the embedding dimension of edge information is 16. The Adam optimizer [39] is used to optimize the network parameters, and the learning rate is set to 1
Method comparison: Due to the consideration of CVRPTW solutions for three objectives in this article, and the constructed model being a DRL class model, comparisons with relevant multi-objective optimization algorithms and DRL methods are considered when comparing methods.
This article selects classic INSGA-II [41] and MOEA/D [42] multi-objective evolutionary algorithms, MARDAM [43] (multi-agent routing deep attention mechanism) method, PtrNet [15] method, and RE-GAT [44] method.
The parameter settings in the comparison algorithm are consistent with the corresponding literature. All comparison methods are executed under the same conditions, which means using the same starting and ending criteria, the same number of starting search points, the same dataset, and the same hardware to run the algorithm. All methods are run in their optimal environment, taking the average of the optimal results.
Comparison of training processes
Figure 5 shows the learning curves of the Res-E-G-AT-Caps Net model with MARDAM [43], PtrNet [15], and RE-GAT [44] on 100 customer node instances. From Figure 5, it can be seen that the learning curve of the Res-E-G-AT-Caps Net model oscillates more smoothly, and the final convergence result is also better than other models. On the one hand, this is because the graph capsule model proposed in this article for MOCVRPTW can quickly converge to local optima in the network. On the other hand, the interaction between multi-agent systems enables the current agent to consider joint action optimization when selecting nodes, which helps to reduce gradient variance and accelerate model training.

Learning curves of each model.
Output for Solomon dataset (CVRPTW with 50 customers).
Output for Solomon dataset (CVRPTW with 100 customers).
Compare the Res-E-G-AT-Caps Net model and the combination of 2-opt search strategy and sampling search strategy with INSGA-II [41] and MOEA/D [42] in multi-objective optimization algorithms, and compare them with PtrNet [15], MARDAM [43], and RE-GAT [41] in DRL methods. Among them, INSGA-II [41] and MOEA/D [42] are multi-objective optimization algorithms based on decomposition strategy, MARDAM is a sequential multi-agent model with actor critical performance, PtrNet is an end-to-end solution, and RE-GAT is a graph attention network model that considers residual structure.
Tables 2–4 shows the test results of the DRL framework proposed in this article compared to other comparison methods on datasets of different scales (25/50/100), and also shows the average operation time of all test cases. From the table, it can be seen that the DRL framework proposed in this article outperforms the compared algorithms/methods in all objective functions. To be precise, the solution quality on the test cases of 25 customer nodes and 50 customer nodes is similar to that of the comparison algorithm, but the solution time is much faster than other algorithms. On the scale problem of 100 customer nodes, both the solution quality and time are better than other algorithms. The DRL model with 2-opt local search strategy and sampling strategy outperforms other algorithms in terms of solution quality and time for three scale problems.
As shown in Table 2, by testing 25 different types of customer nodes (C, R, RC) on the Solomon data-set, it can be seen that: The Res-E-G-AT-Caps Net model with a 2-opt local search strategy and sampling strategy can obtain the optimal solution. Taking the R-type data-set as an example, the results obtained by the sampling strategy shorten the total transportation distance by about 1.57%, the total transportation cost by about 5.21%, and the maximum makespan by about 10.45% compared to the solutions obtained by the 2-opt local search strategy. In addition, compared to other learning based methods compared, the optimal RE-GAT (Sample) method has reduced the total transportation distance by about 1.44%, the total transportation cost by about 5.09%, and the maximum makespan by about 9.61%.
As shown in Table 3, by testing 50 different types of customer nodes (C, R, RC) on the Solomon data-set, it can be seen that: The Res-E-G-AT-Caps Net model with a 2-opt local search strategy and sampling strategy can obtain the optimal solution. Taking the C-type data-set as an example, the results obtained by the sampling strategy reduced the total transportation distance by about 5.96%, the total transportation cost by about 19.14%, and the maximum makespan by about 11.22% compared to the solutions obtained by the 2-opt local search strategy. In addition, compared to other learning based methods compared, the optimal RE-GAT (Sample) method has reduced the total transportation distance
by about 5.84%, the total transportation cost by about 19.08%, and the maximum makespan by about 9.97%.
As shown in Table 4, by testing 100 different types of customer nodes (C, R, RC) on the Solomon data-set, it can be seen that: The Res-E-G-AT-Caps Net model with a 2-opt local search strategy and sampling strategy can obtain the optimal solution, taking the RC data-set as an example, the results obtained by the sampling strategy are about 12.32% shorter in total transportation distance, 8.68% lower in total transportation cost, and about 4.12% shorter in maximum makespan compared to the solutions obtained by the 2-opt local search strategy. In addition, compared to other learning based methods compared, the optimal RE-GAT (Sample) method has reduced the total transportation distance by about 11.36%, the total transportation cost by about 8.13%, and the maximum makespan by about 37.18%.

PF obtained by all comparison methods on 3 objective 25 customer instances.

PF obtained by all comparison methods on 3 objective 50 customer instances.
From Tables 2–4, it can be seen that the sampling strategy can obtain the optimal solution given by all methods. In contrast, Res-E-G-AT-Caps Net outperforms or approaches all comparison algorithms/models in terms of computational speed. In addition, compared to heuristic algorithms, the model proposed in this article has better experimental results. On the other hand, by synthesizing experimental results on datasets of different sizes and types, it can be seen that C-type data performs the best, followed by RC type, and finally R type. This is because the trained data is randomly generated examples, and examples with certain rules make the model more efficient.
The more uniformly distributed the Patero Front (PF) [45] mapping of the solution set composed of all non dominated solutions in the target space, the better the solution performance of the algorithm. The PF performance obtained by all comparative algorithms/models in various situations is shown in Figures 6 to 8. It can be seen that the model trained in this article can be efficiently applied to three target VRPs with different numbers of cities. Although the model was trained on randomly generated examples, it still performs well on other types of datasets.

PF obtained by all comparison methods on 3 objective 100 customer instances.
HV values obtained by all comparison methods on different numbers of customer instances.
In addition, High Volume (HV) [45] is used to evaluate the size of the super volume formed by the obtained PF and its reference point, and to verify the distribution characteristics of the solution in the target space. The larger the HV, the higher the algorithm performance. Table 5 provides HV performance indicators for all comparative algorithms/methods. From Table 5, it can be seen that the Res-E-G-AT-Caps Net model performs best in all examples.
In order to display the experimental results of different methods more clearly, qualitative analysis was conducted on all methods. Taking CVRPTW-100 as an example, Figure 9 shows the final vehicle routing maps under different comparison methods.

Routing diagrams of different methods under CVRPTW-100.
Each color in Figure 9 represents a routing. From Figure 9, it can be seen that on the data-set of 100 customers, the routing distribution obtained by the method proposed in this paper is more uniform. Combining Tables 2–4, it can be seen that the method proposed in this paper has a certain degree of efficiency.
This framework solves vehicle routing problems through offline training and online testing. Evaluate the relationship between the proposed model’s increase in graph size (i.e. nodes) and runtime during the training and testing stages by solving VRP. For the training phase, the training time not only depends on the graph size of the problem, but also on the quantity and batch size of the training data. Without loss of generality, this paper uses 10000 training instances and the same batch size (the batch size is 128) to test the running time when the number of nodes in a single epoch increases from 1 to 100 instances.
In this article, INSGA-II and MOEA/D belong to heuristic search algorithms. The algorithm complexity of INSGA-II is
Generalization comparison experiment
To verify the migration performance of the Res-E-G-AT-Caps Net model, the model was migrated and solved on different customer sizes (150 and 200, taking R-type instances as examples) and different target numbers (2-objective and 5-objective). The experimental results of 2-objective are shown in Figure 10. It can be seen from Figure 10 that the Res-E-G-AT-Caps Net model proposed in this paper is significantly superior to other algorithms/DRL-based models on all 150 and 200 customer instances. In addition, Table 6 presents the values of HV for different number of objective functions and different scale examples. From Table 6, it can be seen that the model proposed in this paper performs best in all examples.
In order to further demonstrate the effectiveness of the proposed framework, this paper used the actual operational basic data of urban logistics distribution research in References [46], [47] (named Case 1 and Case 2 here) as the experimental data-set. Meanwhile, 3S-MMDEA [48] and HF-VRP-DL [49] are further selected as methods for comparison. Table 7 presents the experimental comparison results of different methods in different objective functions, gap rates, and the final time of obtaining results.

PF obtained by all comparison methods on a 2-objective instance.
HV values obtained by all comparison methods on different objective numbers and number of customer instances.
Comparison of all comparison methods and indicators on Case1 and Case 2.
From Table 7, it can be seen that: The Res-E-G-AT-Caps Net model with a 2-opt local search strategy and sampling strategy can obtain the optimal solution. Taking the Case 1 data-set as an example, the results obtained by the sampling strategy are about 0.39% shorter in total transportation distance, 0.14% lower in total transportation cost, and about 0.14% shorter in maximum makespan compared to the solutions obtained by the 2-opt local search strategy. In addition, compared to other learning based methods compared, the optimal RE-GAT (Sample) method has reduced the total transportation distance by about 1.97%, the total transportation cost by about 1.60%, and the maximum makespan by about 1.61%.
This article proposes a multi-agent based deep reinforcement learning framework to improve the solving efficiency of MOCVRPTW. Unlike the traditional approach of “grouping before planning”, this paper models the Markov decision process for MOCVRPTW based on decomposition, designs an improved graph attention capsule network, considers node and edge information comprehensively, and introduces residual mechanism to design the Res-E-G-AT-Caps Net model. The multi-agent of this model utilizes high-level feature information to learn and cooperate with each other’s actions, planning vehicle paths from the overall problem, and quickly solving MOCVRPTW through offline training of the model. To verify the feasibility and effectiveness of the proposed framework, numerical experiments were conducted on publicly available standard examples and compared with existing learning based methods and meta heuristic algorithms. The calculation results indicate that the proposed framework is feasible and efficient for solving vehicle routing problems of different scales.
The MOCVRPTW considered in this article is a VRP expansion problem in a static environment, while in the actual logistics and distribution process, the transportation environment is usually constantly changing, which means it will face the situation of dynamic order arrival. Therefore, future research will focus on building end-to-end deep reinforcement learning frameworks in dynamic environments.
Footnotes
Acknowledgments
This work is supported by the National Natural Science Foundation of China under Grant 61806006, China Postdoctoral Science Foundation under Grant No. 2019M660149, Graduate Innovation Foundation of Jiangsu Province under Grant No. KYLX16_0781, the 111 Project under Grants No. B12018, and PAPD of Jiangsu Higher Education Institutions.
Ethical and informed consent for data used
The experiments mentioned in this article do not involve ethical experiments.
Written informed consent was obtained from all the participants prior to the enrollment (or for the publication) of this study (or case report).
Authors contribution statement
Haifei Zhang: Conceptualization, Software, Formal analysis, Investigation, Data Curation, Writing-Original Draft, Visualization.
Hongwei Ge: Methodology, Validation, Investigation, Resources, Writing-Review & Editing, Supervision, Funding acquisition.
Ting Li: Software, Data Curation.
Lujie Zhou: Validation, Methodology.
Shuzhi Su: Validation, Funding acquisition.
Yubing Tong: Writing-Review & Editing.
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Data availability and access
The datasets generated or analyzed during this study are available from the corresponding author on reasonable request.
