Abstract
Nested Rollout Policy Adaptation (NRPA) is a Monte Carlo search algorithm that learns a playout policy in order to solve a single player game. In this paper we apply NRPA to the vehicle routing problem. This problem is important for large companies that have to manage a fleet of vehicles on a daily basis. Real problems are often too large to be solved exactly. The algorithm is applied to standard problem of the literature and to the specific problems of EDF (Electricité De France, the main French electric utility company). These specific problems have peculiar constraints. NRPA gives better result than the algorithm previously used by EDF.
Keywords
Introduction
Monte Carlo Tree Search (MCTS) has been successfully applied to many games and problems [9]. Nested Monte Carlo Search (NMCS) [10] is an algorithm that works well for puzzles. It biases its playouts using lower level playouts. At level zero NMCS adopts a uniform random playout policy. Online learning of playout strategies combined with NMCS has given good results on optimization problems [43]. Other applications of NMCS include Single Player General Game Playing [36], Cooperative Pathfinding [7], Software testing [41], heuristic Model-Checking [42], the Pancake problem [8], Games [16], Cryptography [25] and the RNA inverse folding problem [38].
Online learning of a playout policy in the context of nested searches has been further developed for puzzles and optimization with Nested Rollout Policy Adaptation (NRPA) [45]. NRPA has found new world records in Morpion Solitaire and crosswords puzzles. Edelkamp, Cazenave and co-workers have applied the NRPA algorithm to multiple problems. They have optimized the algorithm for the Traveling Salesman with Time Windows (TSPTW) problem [18,26]. Other applications deal with 3D Packing with Object Orientation [28], the physical traveling salesman problem [29], the Multiple Sequence Alignment problem [30], Logistics [14,27], Graph Coloring [15] and Inverse Folding [12]. The principle of NRPA is to adapt the playout policy so as to learn the best sequence of moves found so far at each level.
Electricité De France (EDF) is the main French electric utility company. Each year, eleven millions of services have to be carried out by EDF technicians at customers places (problem fixing, meter replacement, energetic diagnosis, business prospects). A technician drive by car more than 22,000 kilometers per year. There are about 10,000 EDF technicians servicing customers, thus the total distance travelled by these technicians on a yearly basis is around 220,000,000 kilometers. Each service is accomplished by a technician having the required skills within a time window determined during the appointment booking. As a result, numerous Vehicle Routing Problems with Time Windows (VRPTW) have to be solved on a daily basis (one problem per catchment area). Any of these VRPTW involves quite a big search space, since EDF technicians of a single catchment area have to achieve every day hundreds of visits at customers places. The objective of the work described in this paper was to adapt the NRPA algorithm to the specifics of EDF problem.
Related approaches that apply Artificial Intelligence techniques to transportation include agent based approaches [5], Monte Carlo Search approaches [1,27,34,47] and learning of simulations of urban traffic [23]. Our work is close to the Monte-Carlo Search approach applied to the Capacitated Vehicle Routing Problem with Time Windows (CVRPTW).
We now give the outline of the paper. The next section briefly describes the state of the art of Vehicle Routing. The third section describe the problems of EDF and the current solver. The fourth section details nrpaDQ, the NRPA algorithm that we implemented, focusing on the proposed improvements from standard NRPA. The fifth section gives experimental results. In Section 5.1 we provide numerical results obtained with nrpaDQ for various instances of Vehicle Routing, both from literature (Solomon instances) and from real-life problems. In Sections 5.2 to 5.5 we present the results we obtained for some variations of nrpaDQ that we tested on the same set of instances. Finally, the last section concludes.
The vehicle routing problem
The Vehicle Routing Problem (VRP) is one of the most studied optimization problems. It was first stated in 1959 in [24]. Basically, it consists of finding an optimal route for a number of vehicles that are used to deliver goods or services to a set of customers, taking into account a set of constraints. In the simpler variant of the problem, all vehicles start from a single depot and end their route in this depot. The objective function to be minimized may combine up to three criteria (namely the number of customers that are not serviced, the number of vehicles used, the total distance travelled by the vehicles), each criterion having an associated weight. The VRP can be formalized as a graph problem: let

Example of VRP graph.
The VRP is of the non deterministic polynomial time hard type (NP-Hard), which implies that so far we do not know of a general method which is able to optimally solve any instance of the problem in polynomial time. As a result, the VRP is considered very difficult. Nowadays exact methods can solve to optimality quite challenging instances of the problem (for instance more than one hundred visits with Branch and Price methods).
The VRP is a real challenge for delivery companies operating a fleet of vehicles. It has given rise to a number of variations. The CVRP (Capacitated VRP) is defined as a VRP with a demand associated to each customers (e.g., the number of parcels they have purchased) and each vehicle has a limited capacity. The VRP with time windows (VRPTW) implies to serve each customer within a given time window (possibly different for each customer). The capacitated VRP with time windows (CVRPTW) combines the characteristics of the two previous variants. Figure 2 gives an illustration of CVRPTW graph with 4 vertices: one depot and 3 customers. Each customer both consumes a quantity q and has a specific duration of service, which must be scheduled in a given time-window.

Example of CVRPTW graph.

Example of feasible route in a CVRPTW problem.

Example of unfeasible route for customer time window reason.
In this example we assume the that the capacity of a vehicle is 100.
Fig. 3 gives an example of feasible route.
Fig. 4 gives an example of unfeasible route with the same visited customers but with a different order: vehicle arrives at node 1 after a 50 duration including travel and first appointment duration (20 + 20 + 10, in bold characters in the figure) which is too late to schedule a duration of 30 for the second service within the
Fig. 5 gives an example of unfeasible route because of capacity reason. With the hypothesis of 100 capacity vehicle, the sum of the quantity needed for the visited customers (60 + 60, in bold characters in the figure) outsizes the 100 capacity of the vehicle.
Fig. 6 gives an example of unfeasible route because of the time window of the depot. The duration of the route is the sum of service duration and travel duration (110 = 10 + 30 + 20 + 40 + 10, in bold characters in the figure) can not be scheduled within the

Example of unfeasible route for capacity reason (capacity of vehicle = 100).

Example of unfeasible route for depot time window reason.
Although the previous variations are the most widely studied because of their great practical importance, many other extensions of the basic VRP have also been proposed. Among them, let us just mention two of them. First, the dynamic VRP (DVRP) where vehicles may be dynamically re-routed during their route in order to fulfill new customers orders. Second, the VRP with Pickup and Delivery (VRPPD) where a fleet of vehicles must satisfy transportation requests (e.g., picking parcels up at some places and delivering them at given locations).
A wide range of methods have been used to solve the VRP. These methods may be broken down into three sub-groups, namely exact methods, heuristic methods and metaheuristic methods. These methods are presented hereafter.
Exact methods. These methods tend to find an optimal solution for the VRP. As mentioned above, they are thus used to solve instances of the VRP of consequent size. Among the most studied exact methods for the VRP and variations we may mention the Branch-and-Cut and the Branch-and-Price algorithms [33], the column-generation algorithm [4] and the set-partitioning method [2].
These methods can also provide a bound of optimum, especially in relaxing some of the most complicated constraints. For example, in suppressing the conventional subtour elimination constraint.
Heuristic methods. Not in a comprehensive manner, let us mention two interesting approaches. First the cluster-first, route-second heuristic [31]. Second, the savings heuristic [21] [3]. For the local search approaches we see in [50] a heuristic projection pool with powerful insertion and guided local search strategies. The local search described in [44] gives reference solutions on several instances of the benchmark VRP Solomon instances tested in this work.
Metaheuristic methods. The most commonly used metaheuristics for solving VRP and variations are the particle swarm optimization [35], simulated annealing [20], genetic algorithm [39] [49] and Tabu search [22,40]. Among the best algorithms for solving the VRP Solomon instances tested in this work, we can cite genetic algorithm described in [6], evolutionary algorithm detailed in [37] or Tabu search proposed in [48].
The EDF capacitated vehicle routing problem with time windows
In this section we first explain the optimization problem encountered at EDF and then describe the current EDF approach to the problem.
Description of the EDF problems
The Vehicle Routing Problem modeled here includes time windows, which is a classical feature of VRP problems. It means that:
Each technician has an availability time window. Each route starts at the beginning of this time window and must end before the end of this time window.
Each appointment has a time window and a duration. A route visiting the appointment must start after the beginning of the time window and end (including the appointment duration) before the end of the time window
If a route arrives at an appointment location before the beginning of its time windows, it is possible to add waiting time before starting the service.
The problem also includes capacities, another classical feature of VRP problems:
Each vehicle has several stock capacities (6 per vehicle for the EDF problem). Vehicle stocks are initialized with initial capacity at the beginning of each route.
For any stock, each operation will consume a stock quantity.
For each route, for each capacity, the sum of the quantity used by the operation of the route must not be greater than the capacity of the stock.
Several peculiar features are taken into account in EDF modelization.
Taking into account an off-duty time window for lunch break. No place is imposed for it. It can interrupt a trip but not a service to a customer.
Taking into account specific skills for visits. For each visit, only a subset of technicians is skilled to carry out the technical operations.
Journey distance and journey duration are not proportional, because vehicle mean speed depends on the journey itself. Consequently, 2 different distance matrices are provided.
Distance matrices are not symmetric: the distance between two points depends on the direction of the route.
Another difference with academic data is that on real problems, there is not always enough people to carry out the whole set of visits. Consequently, we must first maximize the number of visits that will be done. Some of the appointments have a high priority, and because of customers satisfaction, it is not possible to cancel it. Other appointments have less priority, because they consist for example in a technical operation that can be postponed, with no corresponding customer appointment. Moreover, network preventive maintenance visits have less priority than troubleshooting activities. In order to evaluate a solution, a lexicographic objective function is taken into account:
Maximization of the number of achieved visits of high priority. Maximization of the number of achieved visits of low priority. Minimization of the economic function, taking into account number of technicians used (ponderated with a proportional daily wage) and number of kms (ponderated with a proportional km cost).
Thus, when comparing two solutions, we first compare the number of visits of high priority that are achieved in each solution.if these numbers are not equal then we know which solution is best. If they are equal, then we compare the number of visits of low priority, if again these numbers are equal, then we compare the third criterion of the two solutions, that is the economic function. The problem is solved each day for determining the technicians routes of the next day, with a computing time which must not last more than an few hours.
Description of current EDF approach
Current approach for the problem is based on stochastic greedy algorithm and variable neighborhood search. The stochastic greedy algorithm is based on [46] algorithm. First phase consist in creating routes and insert in each route the visits that increases the less the travelled distance, until there is no room for new visits. For each insertion, the place in the route is chosen to minimize the increase of kilometers. If no insertion is possible, a new route is created. Then the process is repeated, until no visit and technician is left.
The local search consist of 2 kinds of movements:
Small movements consist in choosing randomly an appointment, and finding the best place to move the appointment, i.e., the best place in all the existing routes to reinsert the appointment. Large movements consist in choosing randomly a route, then entirely destroying it and trying to reinsert all its appointments in the other routes. If no improvement is possible, the movement is canceled.
Advantages of such a method is that it computes rapidly good solutions, very often acceptable by operators because greedy algorithm choices, based on distance, are humanly intuitive.
Nested rollout policy adaptation for vehicle routing
In this section we start with explaining the NRPA algorithm. Then we give our modelling of the CVRPTW problem for NRPA. We finish the section describing how weight are heuristically initialized in order to speed-up convergence of the algorithm.
Description of NRPA
An effective combination of nested levels of search [10] and of policy learning has been proposed with the NRPA algorithm [45]. NRPA holds world records for Morpion Solitaire and crosswords puzzles.
NRPA is given in Algorithm 3. The principle is to learn weights for the possible actions so as to bias the playouts. The playout algorithm is given in Algorithm 1. It performs Gibbs sampling, choosing the actions with a probability proportional to the exponential of their weights.

The playout algorithm
The adaptive rollout policy is a policy parameterized by weights on each action. During the playout phase, action is sampled according to this weights. The playout algorithm is given in Algorithm 1. It uses Gibbs sampling, each move is associated to a weight. A move is coded as an integer that gives the index of its weight in the policy array of floats. The algorithm starts with initializing the sequence of moves that it will play (line 2). Then it performs a loop until it reaches a terminal states (lines 3–6). At each step of the playout it calculates and stores in the z variable the sum of all the exponentials of the weights of the possible moves (lines 7–10) and chooses a move proportional to its probability given by the softmax function (line 11). Then it plays the chosen move and adds it to the sequence of moves (lines 12–13). Then, the policy is adapted on the best current sequence found, by increasing the weight of the best actions.

The Adapt algorithm
The Adapt algorithm is given in Algorithm 2. Basically, a policy is an array of floats which associates each move (coded by an integer) to its weight (coded by a float). The Adapt algorithm starts with saving in the local variable
In NRPA, each nested level takes as input a policy, and returns a sequence. Inside the level, the algorithm makes many recursive calls to lower levels, providing weights, getting sequences and adapting the weights on those sequences. In the end, the algorithm returns the best sequence found in that level. At the lowest level, the algorithm simply makes a rollout.
The NRPA algorithm is given in Algorithm 3. At level zero it simply performs a playout (lines 2–3). At greater levels it performs N iterations and for each iteration it calls itself recursively to get a score and a sequence (lines 4–7). If it finds a new best sequence for the level it keeps it as the best sequence (lines 8–11). Then it adapts the policy using the best sequence found so far at the current level (line 12).
NRPA balances exploitation by adapting the probabilities of playing moves toward the best sequence of the level, and exploration by using Gibbs sampling at the lowest level. It is a general algorithm that has proven to work well for many optimization problems.
Playout policy adaptation has also been used for games such as Go [32] or various other games with success [11].

The NRPA algorithm
There are several design choices when implementing NRPA. The first choice is to decide how to code the possible moves. For the vehicle routing problem we choose to code a move as a starting node, an arrival node and a vehicle. More precisely, a solution to the problem is an ordered sequence of visits, including special visits (SV) that represent the fact that the corresponding technician is at the depot. So a sequence solution always starts and ends with a SV. If there is a SV between two visits within the sequence, that means the end of a route for a technician, and the beginning of a route for another technician (with no chronological ordering of these two routes that will be carried out simultaneously).
Another design choice is to define the score of a playout. Score includes number of non visited customers multiplied by a great penalization number, number of used vehicles multiplied by an rather great weight, and number of kilometers. We reject movements that do not respect the time windows in the playout algorithm: all solutions respect both customers time windows and vehicle time windows.
If there were more objectives, a lexicographic ordering of the objectives would be the best way to represent the scores of a playout. Playouts could then be compared simply and easily.
Initialization of the weights
In NRPA weights are uniformly initialized to 0.0. We propose to heuristically initialize the weights of NRPA in order to speedup convergence. The greater the distance from the current city to another city, the less likely the other city is a good choice. Standard NRPA starts with an uniform policy with all weights set to 0.0 and does not distinguishes between close and far cities. The heuristic initialization of the weights initializes each weight with a value proportional to the inverse of the distance. This initialization is only used for the first iteration of NRPA.
The quantile heuristic
The objective of the Quantile Heuristic is to penalize movements from bad solutions. Solution scores for all playouts need to be stored. When a new playout is computed, if its score is worse than a defined quantile, then its movements weights are decreased. Efficient implementation is needed to limit the increase of computation time needed to sort the solution scores and compute the quantile. Sequence of algorithm is described in Algorithm 4. The adapt function is the same as for good solutions, with negative coefficient α, as described in Algorithm 5. The coefficient α used for bad solutions can be different from the one used to adapt policy to good solutions. In the rest of the paper we call nrpaDQ our implementation of NRPA with both the Distance heuristic and the Quantile heuristic.

The NRPA algorithm with quantiles

The adapt algorithm to bad solutions
The operational process of EDF consists in providing a solution to the CVRPTW within around two hours. Each problem is solved during the night so as to provide solutions for the next day. In this section, the parameters used when testing the 3 variations of NRPA are 3 levels,
The current solver of EDF, Opturn first perform a greedy algorithm, then goes on with a local search. These procedures are both parallelized on 6 threads. At the first stage, each thread involves 3,000 iterations of greedy search. The best among the 6 solutions is used as a starting point for the 6 simultaneous local searches, each performing 3,000 iterations. These values were chosen during the tuning of Opturn after many experiments that showed that no improvements are obtained with bigger values. As a result, the running time of Opturn is smaller than the running time of the NRPA.
The running time for NRPA and nrpaD is approximately 7,000 seconds and up to 8,000 seconds when using the Quantile heuristic. The running time of Opturn is approximately 5,000 seconds. Importantly, these run times are compatible with the operational process as they last approximately two hours.
The different algorithms tested on the 56 standard instances
The different algorithms tested on the 56 standard instances
(Continued)
Summary of the different algorithms tested on the 56 standard instances
Results for the different algorithms tested on the EDF instances
Summary for the different algorithms tested on the EDF instances
Table 1 gives results on the literature instances. It compares 4 algorithms:
NRPA: Standard NRPA (3 levels) without heuristics nrpaD: NRPA (3 levels) with Distance initialization heuristic nrpaDQ: NRPA (3 levels) with Distance initialization heuristic and Quantile heuristic Opturn: current EDF solver, based on Solomon and LNS heuristic.
To compare the results of the 4 variants, the lexicographical approach takes into account first the number of vehicles used and then the kilometers.
V: number of needed vehicles
Km: sum of the Kms of the routes
Table 2 gives a summary of runs on the standard problems. The Km and the Vehicles (V) columns give the average of objective function values on the Solomon instances. The δ value is the number of problems that are solved better than with standard NRPA minus the number of problems that are solved worse. Comparison is lexicographic, based on number of vehicles, and if equals, on the number of Kms. Logically, the δ value is 0 for the reference (that is, the NRPA variant).
We observe that nrpaDQ scores 55 while Opturn scores −50. This is a great improvement over the current solver. However, it should be stressed that the current solver was optimized on the EDF problems. The Distance heuristic improves a lot over standard NRPA. The Distance + Quantile heuristic only improves slightly over the Distance heuristic alone.
Table 3 and Table 4 give the results of the 4 algorithms on the EDF instances. For each algorithm, the 3 components of objective function are displayed:
Missed: number of missed appointments
V: number of needed vehicles
Km: sum of the Kms of the routes
Table 3 compares the results of runs of the different algorithms on the EDF problems. For each instance we compare the number of realized technical operations because on real EDF instances there are not necessarily enough vehicles to achieve all the operations. We also measure the number of vehicles and the number of kilometers, as in classical VRP problem instances. To compare the results of the 4 variants, the lexicographical approach takes into account first the achieved operations, then the number of vehicles used and finally the kilometers.
Table 4 gives the average values for the whole set of EDF instances.
The average number of missed visits is smaller with NRPA than with Opturn. It is the same for all NRPA variants. The number of vehicles used for the operations is also slightly smaller and this is the same for all NRPA variants. The average number of kilometers is greater with standard NRPA than with Opturn, however it is smaller with nrpaD and nrpaDQ than with Opturn. With respect to the lexicographic objective function described in Section 4.1, we can conclude that nrpaD and nrpaDQ improves on the current solver for the real-life EDF instances.
The nrpaDQ with Distance and Quantile heuristics provides solutions with the same values for the 3 components of the objective function. These two heuristics perform better in average than the standard NRPA variant as for the number of traveled kilometers: 324 vs 363, that is a 10 percents improvement. If we compare with Opturn results, the improvement account for around 5 percents. This is still significant from a practical point of view because of the huge amount of kilometers travelled by the technicians on a yearly basis: about 220 millions of kilometers.

Mean value of objective functions on Solomon instances for different numbers of levels.

Mean value of objective functions on EDF instances for different numbers of levels.
Value of objective functions on EDF instances for different numbers of levels
As described at the beginning of Section 5, the operational process implies finding a solution to the CVRPTW within about two hours. This is why the experiments described above used number of levels
In this section as well as in the 3 next sections, we are looking for an overall improvement of the algorithm behavior on all the tested instances. Therefore, from now on, for Solomon instances, we will focus on the average value of the objective functions of the whole set of instances. In contrast, for EDF instances we will give results for each instance along with a summarized view through the mean values of the objective functions. The lexicographic objective function is summed up with a one million weight for missed appointments, a one thousand weight for the number of used vehicles, and a unitary weight for kilometers. Figures 7, 8 and Table 5 give the results of the 7 different runs with the parameter settings described above (
The choice of nrpaDQ with 4 levels appears to be the best setting, although the 3-level variation performs also very well as compared to the others. For
nrpaDQ with more playouts
Another way of trying to improve the quality of the solutions lies in increasing N, the number of iterations at each level. Of course this has a significant impact on the running time. We ran our experiment with
Figure 9, Table 6 and Fig. 10 give the results of the 6 different runs described above (

Mean value of objective functions on Solomon instances for different numbers of playouts at each level (
Value of objective functions one EDF instances for different numbers of playouts at each level (

Mean value of objective functions on EDF instances for different numbers of playouts at each level (
Not surprisingly, the quality of results clearly improves when increasing the number of playouts up to 175. However,
In Section 5 we ran our experiments taking into account the common computer resources that can be accessed at an operational level. However, it is interesting to check whether being able to use more threads during the search would improve the quality of the provided solutions. Thus in this section we use root parallelization [13,19]. The principle of root parallelization is to run several threads independently with different seeds and to join the corresponding results when the thinking time is elapsed. We run nrpaDQ on a number of threads varying from 1 to 12, while keeping
In Fig. 11, Table 7, and Fig. 12 we present the results with the number of threads T varying from 1 to 12.
Logically, the more threads are used, the better the results.

Mean value of objective functions on Solomon instances for different numbers of threads (
Value of objective functions on EDF instances for different numbers of threads (

Mean value of objective functions on EDF instances for different numbers of threads (

Mean value of objective functions on Solomon instances for the stabilized approach (
Value of objective functions on EDF instances for the stabilized approach (

Mean value of objective functions on EDF instances for the stabilized approach (
Stabilized NRPA [17] is a simple improvement of NRPA. The principle is to play P playouts at level 1 before each call to the adapt function. The number of calls to the adapt function at level 1 is still N, the number of iteration of upper levels. So at level 1,
In Fig. 13, Table 8, Fig. 14 we present the results of a nrpaDQ with 1 thread (1,000,000 playouts), the results of a stabilized nrpaDQ with a period of
Conclusion
We have presented the NRPA algorithm and its application to the CVRPTW problems. This problem is important for EDF, a company that plans numerous operations every day on the French electrical network. We have given the modelization used to address the CVRPTW problem and two heuristics to improve on standard NRPA: the Distance heuristic and the Quantile heuristic. The Distance heuristic improves a lot on standard NRPA. The Quantile heuristic in combination with Distance heuristic is a slight improvement over NRPA with only Distance heuristic. We also compared nrpaDQ, our implementation of NRPA with the heuristics, to the current EDF solver Opturn. On standard instances nrpaDQ is much better. On the operational EDF instances it is still better even though Opturn was tuned for these instances, while nrpaDQ is a general algorithm that uses a Distance heuristic that works for all kinds of VRP problems.
As we have seen from the examples of EDF instances, NRPA with heuristics performs better than the current solver. The percentage of kilometers saved by heuristic NRPA is greater than 5% of the total number of kilometers. EDF agents drive hundreds of millions of kilometers each year. The use of nrpaDQ could save millions of kilometers each year and reduce the carbon footprint of EDF by hundreds of tons of
Complementary experiments have also been carried out in order to test alternative parameter settings. The purpose of these tests was to check whether, regardless of operational constraints (computer resources and running time limitations) the nrpaDQ was capable of providing us with better solutions. It turns out that the answer is yes. Though the corresponding variations could not be operationally implemented yet for practical reasons, they give promising leads for future improvements.
Footnotes
Acknowledgement
This work was supported in part by the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR19-P3IA-0001 (PRAIRIE 3IA Institute).
