Abstract
With the continuous expansion of city scale and the advancement of transportation technology, route recommendations have become an increasingly common concern in academic and engineering circles. Research on route recommendation technology can significantly satisfy the travel demands of residents and city operations, thereby promoting the construction of smart cities and the development of intelligent transportation. However, most current route recommendation methods focus on generating a route satisfying a single objective attribute and fail to comprehensively consider other types of objective attributes or user preferences to generate personalized recommendation routes. This study proposes a multi-objective route recommendation method based on the reinforcement learning algorithm Q-learning, that comprehensively considers multiple objective attributes, such as travel time, safety risk, and COVID-19 risk, and generates recommended routes that satisfy the requirements of different scenarios by combining user preferences. Simultaneously, to address the problem that the Q-learning algorithm has low iteration efficiency and easily falls into the local optimum, this study introduces the dynamic exploration factor σ and initializes the value function in the road network construction process. The experimental results show that, when compared to other traditional route recommendation algorithms, the recommended path generated by the proposed algorithm has a lower path cost, and based on its unique Q-value table search mechanism, the proposed algorithm can generate the recommended route almost in real time.
Keywords
Introduction
Route recommendation research has always been a hot topic in the fields of computers and transportation, and it is crucial to urban transportation planning and smart city construction [1–5]. The traditional approach is to find the shortest geographical route connecting a given origin and destination, namely, the traveling salesman problem [6–8]. In fact, user preferences play a major role in the generation process of recommended routes [9]. In addition to considering the shortest route as the recommendation objective, the route recommendation algorithm can also consider other attributes as the recommendation objective to satisfy the travel requirements of different user preferences. For example, the route recommended by the route recommendation algorithm aimed at a short travel time or distance can reduce the user’s travel time and cost. The route recommendation algorithm aimed at city sightseeing can provide tourists with scenic tour routes. However, these route recommendation algorithms typically only consider a single attribute as the objective and cannot consider multiple attributes comprehensively. Therefore, multi-objective route recommendation algorithms have emerged that can comprehensively consider the multiple preferences of users and calculate the route to meet the multiple requirements of users.
However, existing multi-objective route recommendation algorithms cannot provide routes that balance multiple objective attributes. For example, when following the user preferences of fuel consumption with less weight, travel time, and environmental index with more weight, the existing multi-objective route recommendation algorithms may suggest a long detour route, which will significantly increase the vehicle’s fuel consumption and travel time. This extreme case was not desired by the user. Simultaneously, most existing multi-objective route recommendation algorithms tend to focus on common indicators, such as travel time, travel distance, and fuel consumption, while ignoring the safety difference between different routes. Traffic accidents on urban roads not only cause personal injury and property damage but also cause serious traffic congestion. In particular, during peak hours, delays caused by traffic jams and accidents on urban roads result in enormous socioeconomic losses. Traffic accidents have become one of the leading causes of death worldwide, with China having one of the highest numbers of traffic deaths. In some cases, route safety must be prioritized. For example, tourists pay more attention to the safety of the route recommended by the system when traveling in a new city. For school bus commutes, schools and parents are more concerned about the safety of the route than the length of the trip. In these cases, high route security is the primary goal. In addition, the global spread of COVID-19 (coronavirus disease 2019) has become the current background, and research on the COVID-19 epidemic has become a research hotspot in various fields. However, few studies have been conducted on route generation algorithms that can avoid high-risk areas of COVID-19.
Reinforcement learning is a very popular research topic in the field of artificial intelligence in recent years. Reinforcement learning is mainly composed of agent, environment, state, action and reward. It aims to simulate the decision-making and learning process of agents in the constantly changing environment, with the goal of maximizing the cumulative reward value of agents. After the agent performs an action, the environment will transition to a new state, for which the environment will give the agent a reward value (positive or negative). The agent then performs the new action according to a certain strategy based on the new state and the reward from environmental feedback. The agent interacts with the environment repeatedly in this manner until the agent receives the last state in the environment.
The process of an agent interacting with the environment and expecting the maximum reward value to reach the final state is similar to the process of a car seeking the best route to the end. The reward feedback mechanism of reinforcement learning can also be used as an evaluation index for the quality of multi-objective optimization results in the route recommendation process. Therefore, based on reinforcement learning theory, this study proposes a multi-objective route recommendation algorithm [6] called MORRBQ (Multi-objective Route Recommendation Method Based on Q-learning). Route security and COVID-19 risk were added as the consideration factors of the system; that is, the three objective attributes of short travel time, low safety risk, and low COVID-19 risk were considered simultaneously, and the route recommendation was made based on user preferences. Simultaneously, the MORRBQ algorithm introduces a dynamic exploration factor σ and initializes the value function in the road network construction process, which is used to solve the defects of the Q-learning algorithm, such as low iteration efficiency and ease of falling into the local optimum.
The main contributions of this study are as follows: In addition to travel time, safety risk and COVID-19 risk are added as the consideration objectives for the proposed route recommendation algorithm, which is used to facilitate users to assign weights to different objective attributes to obtain personalized recommendation routes. A reward table training algorithm that considers different user preferences is proposed. A Q-value table initialization algorithm is proposed, which shortens the agents’ iteration time and makes the agents’ choice of road nodes and actions clearer. A multi-objective route recommendation algorithm based on a dynamic exploration factors is proposed to avoid falling into the local optimum. A series of experiments are conducted that verify the effectiveness and practicality of the proposed algorithms, and the experimental results show the work performed in this study to be more effective than existing route recommendation methods.
The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 presents the definitions used in this paper. Section 4 introduces the algorithm framework and the content of the multi-objective route recommendation algorithm. Section 5 presents the experimental results and analyses. Finally, Section 6 concludes the paper and outlines future work.
Related work
Extensive research has been conducted on route recommendations resulting in many achievements. Existing route recommendation methods can be divided into two categories: single-objective and multi-objective route recommendations.
Single-objective route recommendation
Research on single-objective route recommendations has a long history. The commonly used Dijkstra algorithm is used to search for the shortest route, which corresponds to optimizing a single objective and typically only considers one-dimensional costs such as distance, time, and speed [7]. Research by some scholars [5, 6] avoids unnecessary braking of vehicles by using the optimal speed curve to design the control system to improve fuel efficiency on the route from origin to destination and achieve the goal of saving fuel resources. However, the above scholar’s research has a problem of low route recommendation accuracy. To address this problem, Cheng et al. [10] proposed a travel route recommendation algorithm based on interest theme and distance matching. The algorithm first obtains the user’s real historical travel footprints through analysis and then extracts the user’s interest theme preference and distance matching based on the user’s stay time in each scenic spot. Finally, an optimal trip calculation method is designed based on the trip time limit, start point, and end point. Compared to the traditional algorithm, the proposed algorithm has a higher accuracy and recall rate. Wang et al. [11] combined A* algorithm and recursive neural network based on attention mechanism, proposed a travel cost prediction algorithm for estimating candidate locations to destination locations, to recommend travel routes with lower travel costs for users. Zhou et al. [12] proposed a hybrid route prediction and recommendation method to solve route recommendations under a mixed traffic mode. Feng et al. [13] proposed a recommendation algorithm based on DBSCAN using passenger hotspots for profitable taxi cruising route recommendation. Jia et al. [14] proposed an algorithm that combined the weighted shortest route and deep learning to recommend context-aware routes for users. Studies by the above scholars did not consider user preferences. Therefore, Silva et al. [15] proposed a route recommendation method based on user behavior, which mined user preferences through the historical data of user trips and recommended appropriate routes to users. Ji et al. [16] studied the dynamic taxi route recommendation problem as a sequential decision problem and designed an adaptive deep reinforcement learning method that fused the extracted spatio-temporal features through a deep policy network to carry out effective route recommendation. However, these studies did not consider the traffic characteristics of the routes. The literatures [17, 18] deal with single-objective route selection by analyzing different traffic characteristics, with the goal of reducing travel time, fuel consumption, and exposure to air pollutants. Kang et al. [19] considered the road slope of a specific intersection and used a dynamic programming method to determine the optimal speed curve to reduce the fuel consumption. In the above studies, some methods did not consider user preferences, whereas others that considered user preferences did not combine different preferences (such as travel time and fuel consumption) to achieve multi-objective route recommendations. Similarly, some of these scholars’ studies did not consider user preferences, and some methods that considered user preferences still did not combine different preferences.
Multi-objective route recommendation
Sarker et al. [20] proposed a multi-objective route-recommendation algorithm that considered three different attributes: fuel consumption, travel time, and air quality. Based on the reinforcement learning algorithm and user preferences, the algorithm can generate a personalized recommended route in time. Guo et al. [21] proposed a method to predict the travel time and fuel consumption of different roads using GPS trajectories, and recommended a route for users that comprehensively considered time and fuel efficiency. Zheng et al. [22] proposed a variable neighborhood search algorithm and a hybrid particle swarm optimization genetic optimization algorithm to recommend top-K routes. Huang et al. [23] proposed a flexible multi-task deep travel route planning framework, which constructed heterogeneous networks through the relationship between POI and users and integrated auxiliary information for more effective planning. Based on the particle swarm optimization algorithm, Malik et al. [24] designed a route optimization objective function considering five parameters: distance, road congestion, weather conditions, route popularity, and user preference, which were used to predict the optimal route to the next tourist attraction. Rani et al. [25] combined the K-means clustering method with the traveling salesman problem-solving method to generate travel itineraries by considering both distance and travel time to recommend the optimal route. The above multi-objective route recommendation methods do not consider safety risks and other factors, and because of the high time complexity of the optimization solution calculation, they cannot provide near-real-time optimal routes.
In summary, research in the field of route recommendation mainly focuses on route recommendation methods based on a single objective and route recommendation methods based on multiple objectives. Existing single-objective route recommendation methods cannot consider the multi-preference requirements of users, whereas most multi-objective route recommendation methods do not consider some novel objective attributes, such as safety risk attributes and COVID-19 risk attributes. Owing to the constant variation of the novel coronavirus and continuous spread of epidemic viruses, it has become an urgent concern for users whether the generated recommended routes can avoid high-risk areas. Based on the research status of the above, this study proposes a multi-objective route recommendation method based on reinforcement learning, which considers the travel time and safety risk, and the COVID-19 risk is one of the objective attributes. Simultaneously, it satisfies the demand of user preferences and meets the recommended route under the condition of different scenarios to realize personalized recommendation.
Problem description
This section provides background information and problem description.
Road network model
(1) Road Network
In this study, a road network is abstracted as a network model composed of finite nodes and edges, which is represented by G(V, E) where V is the set composed of the starting and ending points of each road section and is represented by

Cost of road segment.
(2) Route
The set of a series of edges connecting the origin and destination is called a route in the road network, and can be expressed as follows:
(3) Total Route Cost
The total route cost refers to the sum of the travel time, safety risk, and COVID-19 risk costs for each section of the route. The higher the value, the higher is the cost of the corresponding attribute of the path. The total path cost is calculated as follows:
(1) Travel time
The minimization of travel time is a common optimization goal in various route recommendation algorithms. In this study, the travel time consumption value of the recommended route is calculated based on the length of each road section in the New York Road network data and the average traffic speed of the city road. The traffic time consumption value of each road section in the road network is stored in the traffic time matrix
(2) Safety Risk Index
The occurrence of urban road traffic accidents is inevitable and is affected by certain factors, and the frequency of accidents varies by section. In this model, the probability of traffic accidents, namely, the safety risk index, is approximated by calculating the frequency of traffic accidents for each road section in the historical data. The safety risk index of each road section in the road network is stored in the safety risk matrix S, where si,j (si,j ∈ S) is expressed as the safety risk index of the road section ei,j. As shown in Fig. 2, the frequency of traffic accidents in Brooklyn, New York is relatively high, with an annual accident rate of 18,000 on average. Staten Island has a low accident rate, with fewer than 4,000 accidents annually.

Traffic accidents distribution in New York.
(3) COVID-19 Risk Index
The COVID-19 epidemic has spread and covered the entire world, which has become the background of the current era and an issue that must be paid attention to in social production activities that concern peoples’ lives and safety. Therefore, it is of great research value and social significance to recommend routes to avoid COVID-19 risk areas. Based on public historical data, this study calculated the COVID-19 risk index of each road section based on the time attribute and longitude and latitude coordinates of the COVID-19 risk areas. The COVID-19 risk index of each road section in the road network is stored in the COVID-19 risk matrix C, where ci,j (ci,j ∈ C) represents the COVID-19 risk index on the road section ei,j.
(1) Environment
In this study, the environment, also known as the road network environment, is defined as the updated road network. Each edge ei,j has three different attribute values, which are stored in the form of [w
t
ti,j, w
s
si,j, w
c
ci,j], representing the predicted value of the traffic time, safety risk index, and COVID-19 risk index under the given user preference
(2) Agent
Under the given road network environment, source node and destination node, the route recommendation algorithm as an agent selects the most appropriate action (i.e., selects an appropriate road section) based on the predicted attributes of each road section and generates an optimal route from the source node to destination node after a series of calculations and selections.
(3) State Space
A state is a summary of the current environment denoted by s. The state space is the set of all possible states denoted by S, where s ∈ S. In this study, the state space is a set of road network environments composed of finite edges and nodes, and each edge contains all the objective attribute values (travel time, safety risk index, and COVID-19 risk index).
(4) Action Space
An action is the choice made by the agent, denoted by a. The action space is the set of all actions denoted by A, where a ∈ A. The action space defines the set of actions (agent selects an edge) that can be selected by the agent in the current state, that is, given node v i , source node v st , destination node v end , and the attribute values of different edges.
Optimal route
(1) Optimal route under a single objective.
The optimal route under a single objective consists of a limited set of road segments (edges) that connect the source and destination node set by the user, which satisfies the optimal result under a certain objective (e.g., the shortest route between two points is to achieve the goal of minimizing the total distance cost).
(2) Optimal route under multi-objective optimization
Given the road network G = (V, E), each edge has three objective attribute values: road length, safety risk index, and COVID-19 risk index, an optimal route connecting the source node and destination node set by the user is generated, and the total cost of the three objective attributes on this route is minimized.
Multi-objective route recommendation algorithm based on q-learning
Based on the Q-learning algorithm, this study proposes a multi-objective route recommendation algorithm based on Q-learning (MORRBQ) to find the optimal route from the source node to the destination node satisfying the multi-objective constraints. As a classical algorithm in the field of reinforcement learning, compared with other traditional methods, the Q-learning algorithm can quickly find the optimal route and save a significant amount of computation time during route selection.
The MORRBQ algorithm calculates the reward value of each road section based on the three objective attributes (travel time, safety risk index, and COVID-19 risk index) and forms a reward matrix. This algorithm uses a route recommendation algorithm as the agent. In the execution process of the algorithm, first, the state and action are constructed into a Q-value table based on the reward matrix and Behrman equation. The Q-value is Q (s t , a t ), which refers to the expected value of the reward that can be obtained by taking action a t under state s t at a certain time t. Subsequently, the Dijkstra algorithm is used to pregenerate the shortest route connecting the source node and destination node, and the nodes passing through the route are used to initialize the Q-value table. Then, the dynamic exploration factor σ and value function are used to update the Q-value table. The environment provides feedback on the corresponding expected reward value Q (s t , a t ), according to the action taken by the agent. The agent is trained several times to obtain the updated Q-value table. Finally, the agent selects the optimal route from the source node to the destination node by selecting an action that maximizes the Q-value after receiving the state each time according to the Q-value table. The complete framework of the algorithm is shown in Fig. 3.

Multi-objective route recommendation algorithm framework.
The algorithm used in this study used three datasets from different sources. First, it must delete abnormal records in the dataset and fill in the missing items. To eliminate the influence of dimensional and numerical size scale differences between datasets, the data should be normalized. y
k
represents the value of the objective attribute k. The formula for data normalization is shown in Equation (3).
Travel time prediction
Travel time is one of the most commonly used optimization objectives for route recommendation. In this study, the ratio of the length of each road section to the average speed of vehicles was used as the predicted value of the travel time of this road section, and the calculation formula is expressed in Equation (4).
The safety risk index refers to the probability of traffic accidents occurring on a road section. In route recommendations, users are more inclined to choose road sections with fewer traffic accidents occurred in the past. In existing studies on safety route recommendation, most researchers adopt a calculation method based on the occurrence density of traffic accidents, that is, within a certain radius, the ratio between the number of accidents and area or the average distance between the midpoint of a road section and the location of traffic accidents in the area. The density-based safety risk index calculation method increases the computational cost and time consumption, which is not conducive to the timely feedback of route recommendation results. In the data pre-processing stage, the route recommendation algorithm proposed in this study uses a random forest to classify the safety risk index of each road section. According to the historical data on traffic accidents, the safety evaluation of road sections is divided into four categories. Values from 1 to 4 represent the safety risks from low to high. The safety risk index of road segment (edge) ei,j is represented by si,j, and the safety risk index of road segment can be directly used as a component of the reward value of MORRBQ algorithm. After assigning the safety risk index to each road section in the data pre-processing stage, the computation time and memory consumption of the system can be significantly reduced.
COVID-19 risk prediction
A novel function of the proposed multiobjective route recommendation method is to avoid COVID-19 risk areas when creating routes. Based on the current location of the road (the midpoint coordinates of the road) and the longitude and latitude coordinates of the location of COVID-19 cases, the COVID-19 risk index ci,j was calculated and added to the reward value. The calculation method for ci,j is to draw a circle with the midpoint of the road section as the center and the length of the road section as the radius. The average distance between the midpoint of the road section and each COVID-19 case within the circle was calculated and divided by the length of the road section. All roads were assigned a COVID-19 risk index value, calculated as follows:
The reward value is the immediate reward for choosing a particular action (e.g., edge ei,j) from a finite set of actions. To determine the immediate reward of the selected edge, the given user preference and all predicted values of the road segment (travel time, safety, and COVID-19 risk) are combined, and the direct distance between the current node v
i
and destination node v
end
(i.e., Euclidean distance) is considered to ensure that the vehicle drives to the destination as quickly as possible. Therefore, the agent creates a reward table based on the road network, destination node v
end
, and given user preferences, namely
The reward table is in the form of an adjacency matrix,
where v
end
denotes the destination node. Assume that the user’s preference is

Road network topology.
Reward table instance
where row represents state v i , in which the agent is in, and column represents the actions that the agent can take in the state located at node v i . For example, in the state of node v2, the reward value of the agent taking action to reach node v3 was –0.191.
Using appropriate strategies to select the optimal action is a key point in the field of force-change learning. The Q-learning algorithm uses ɛ-greedy strategies to plan the search and update the environment. In general, ɛ is a fixed hyperparameter, which may cause the agent to not sufficiently explore the environment in the early stage of iterative computation, resulting in a failure to generate the optimal route and failure to converge in the later stage. To solve this problem, a dynamic exploration factor σ was proposed in this study. The size of σ is dynamically adjusted based on the number of iterations during the execution of the algorithm. The specific function expression of σ is given by Equation (6).
The process of agent exploration is constantly searching for the optimal strategy, and its operational efficiency will improve with a reduction in the exploration process. Combined with the Dijkstra algorithm, the value function Q(s, a) of Q-learning is initialized using the shortest route result obtained by the algorithm. The initialization of the Q-value table can not only effectively shorten the iteration time of the algorithm but also make the agents’ selection of road nodes and actions clearer. The specific process of the Q-value table initialization is presented in Algorithm 3.
Q-value table update and model training
As shown in Fig. 5, the agent builds a matrix Q-value table of the same order as the reward table, which stores the Q-value corresponding to each state in the environment, and every action taken after receiving the state. From the Q-value table, the agent can take the action corresponding to the largest Q-value after receiving any state. To achieve this goal, the agent used the Behrman equation (i.e., the value function) to update the Q-value table, as shown in Equation (7).

Q-value table update and model training.
The Q-value table must be updated several times (the number of iterations is denoted by D). In each iteration, any node was randomly selected as the initial state. Then, based on the value of the dynamic exploration factor σ and the size of the random number generated at each step, the agent selects the action a t and executes it in the current state s t , obtains the new state st+1 and the current reward rt+1, and updates the value of Q (s t , a t ) in the table simultaneously. Iterate until the destination node is reached. The entire algorithm is a process of constantly updating the Q-value table, and then determining which action to take in a certain state is the best choice according to the updated value. A detailed description of the Q-value table construction and multiobjective route recommendation is presented in Algorithm 4.
After training, the system uses the updated Q-value table to select the optimal route from the source node. Specifically, it is based on the current node to find different adjacent nodes, from the Q-value of the update table to retrieve the Q-value, on the edge of the adjacent to the current node, based on certain rules from all possible edges, choosing to have the largest Q-value to reach the next node, repeating the steps until it reaches the destination node, and generating the multi-objective optimal route.
The main operation of the multi-objective optimal route selection is to find the Q-value table and select the action with the largest Q-value in the retrieval process. This search operation significantly reduces the complexity of the multi-objective optimization problem and reduces the solution time to near real time.
Algorithm complexity analysis
Time complexity analysis
The time complexity of Algorithm 1 primarily depends on two factors: (1) the number of sections (edges) in the road network and (2) the size of the COVID-19 case list within the radius of each road section. Therefore, the time complexity of Algorithm 1 is O (k * m2), where k denotes the number of COVID-19 cases within the radius of each section and m denotes the number of sections (edges) in the road network. Because k ⪡ m, the time complexity of Algorithm 1 is O (m2).
The time complexity of Algorithm 2 primarily depends on the computation cost of the reward value between nodes; therefore, the time complexity of Algorithm 2 is O (n2), where n denotes the number of nodes in the road network.
The time complexity of Algorithm 3 primarily depends on the search of the shortest route; therefore, the time complexity of Algorithm 3 is O (n2), where n denotes the number of nodes in the road network.
The time complexity of Algorithm 4 primarily depends on two factors: (1) the number of training sessions of the Q-value table and (2) the size of the Q-value table. Therefore, the time complexity of Algorithm 3 is O (c * n2), where c denotes the number of training sessions in the Q-value table, c is a constant, and n denotes the number of nodes in the road network. Because c ⪡ n, the time complexity of Algorithm 3 is O (n2).
Space complexity analysis
The space complexity of Algorithm 1 primarily depends on the storage cost of two parts of the data: (1) the number of road segments (edges) in the road network and (2) the number of COVID-19 cases within the radius of each road section. Therefore, the space complexity of Algorithm 1 is O (k * m2), where k denotes the number of COVID-19 cases within the radius of each road section and m denotes the number of edges in the road network. Because k ⪡ m, the space complexity of Algorithm 1 is O (m2).
The space complexity of Algorithm 2 primarily depends on the matrix of the space overhead reward, and the space complexity of Algorithm 4 primarily depends on the Q-value table space overhead reward. The size of the matrix and Q-value table are decided by the road network node number. Therefore, the space complexity of Algorithms 2 and 4 is O (n2), where n denotes the number of nodes in the road network.
The space complexity of Algorithm 3 primarily depends on the storage overhead of the road network G(V, E); therefore, the space complexity of Algorithm 3 is O (n2), where n denotes the number of nodes in the road network.
Experiment
Datasets
Real-world geotagged datasets were used in this experiment, including New York road network data, New York traffic accident data, and New York COVID-19 data.
(1) Road Network data
This dataset was derived from the New York City road network data published by OpenStreetMap, as shown in Fig. 6. It consists of more than one million data records, including the longitude and latitude coordinates of the starting and ending points of the road segments, the length of the road segments, and other characteristics, and the detailed description is shown in Table 2. For example, the data record (2, 886.4, 40.082561, –74.548430, 40.081576, –74.554403) indicates that the total length of the road segment numbered 2 is 886.4 m, and the longitude and latitude coordinates of the starting and ending points are (–74.548430, 40.082561) and (–74.554403, 40.081576).

Road network in New York.
Information of Road network in New York
In this study, 20 source-destination node pairs were randomly selected from the New York City road network for the multi-objective route recommendation experiment, and the Euclidean distance between the selected source-destination node pairs should be at least 5 km.
(2) Traffic Accident Data
The dataset is derived from the publicly released traffic accident data from the New York City Police Department and records the traffic accident data of New York City from 2015 to 2017, including a total of 477732 records, the time of the accident, the accident location latitude and longitude coordinates, the number of casualties, and severe level characteristics, including the severity of the traffic accident, ranging from 1 to 4. A higher value indicates a higher accident severity. The composition and recording features of this dataset are presented in Table 3.
Information of traffic accidents in New York
(3) COVID-19 Data
The dataset was from 578 COVID-19 cases published by the New York City government from October 21 to December 27, 2020, and includes features such as the time of COVID-19 risk areas, latitude and longitude coordinates, number of medical visits, and number of deaths. The feature compositions and partial records of this dataset are presented in Table 4.
Information of COVID-19 in New York
To verify the superiority of the MORRBQ algorithm in the multi-objective route recommendation problem, this study conducted relative comparative experiments between the proposed algorithm and three other traditional route recommendation algorithms. The four algorithms are described as follows:
(1) MORRBQ Algorithm
The MORRBQ algorithm is a route recommendation algorithm proposed in this study. It is implemented based on the reinforcement learning algorithm Q-learning, which introduces the initialization operation of the Q-value table and dynamic search factor σ. In the training phase of the model, 10000 iterations were performed to generate the Q-value table for the experimental phase. In each iteration, a source-destination pair is randomly selected from the set of nodes and edges, and a Q-learning algorithm is used to update the Q-value table accordingly. To achieve the optimal experimental effect, after comparing the experimental results obtained under different parameter settings, γ= 0.7 and α= 0.6 were set in this study.
(2) MORR Algorithm
The MORR algorithm [20] is also a multi-objective route recommendation algorithm based on the Q-learning algorithm, and is the main comparison algorithm of the MORRBQ algorithm proposed in this study. However, the MORR algorithm does not change the core of the Q-learning algorithm, and it also considers common objective attributes (travel time, fuel consumption, and air quality).
(3) PreGo Algorithm.
The PreGo algorithm [24] is a dynamic routing algorithm that suggests the best route by retrieving real-time traffic congestion conditions while minimizing the values of two objective attributes (travel time and fuel consumption) with the same priority. The PreGo algorithm uses the dynamic A* algorithm as an optimization scheme to solve the route problem of binocular objects with the same priority. In the experimental process, the PreGo algorithm was implemented based on the steps of the multipreference shortest-route algorithm in the literature [14].
(4) MORP Algorithm
The MORP algorithm [26] uses a multi-objective optimization scheme to determine the optimal route from the source node to destination node. It considers travel time, fuel consumption, and air quality equally when finding the best route. However, because of the multidimensional hypergrid search for the optimal solution, the MORP algorithm takes longer time (depending on the map size) to provide the optimal solution.
Experiment Settings
The experimental environment is AMD Ryzen 5 processor, CPU 3.00 GHz, operating platform is Windows 10, this study uses Python programming language to achieve the relevant algorithms in the experiment.
User preference settings
To reflect the daily requirements of users, three different settings of user preference weights were considered in the experiment, as follows:
Scenario 1:
Scenario 2:
Scenario 3:
Evaluation indicators
To quantitatively evaluate the performance of the algorithm, this study uses the running time, convergence time, convergence episodes, reward value variation trend, exploration steps, and route cost as the evaluation indices of the algorithm performance.
(1) Running time
To evaluate the running time of the algorithm, the average of the total running time of the algorithm on multiple routes was calculated, as shown in Equation (8).
(2) Convergence time and episodes.
To measure the convergence speed of the algorithm, the convergence time and number of convergence episodes are used as evaluation indices, and the convergence time can reflect the execution speed of the route recommendation algorithm. The number of convergent episodes refers to the number of episodes iterated by the agent when the algorithm reached the convergence state. The smaller the number of episodes, the higher the learning efficiency of the agent.
(3) Change trend of the reward of the algorithm
In the process of algorithm execution, with the advancement of iteration, the size and variation trend of the reward value can effectively reflect the excellent degree of the algorithm.
(4) Exploration
The number of exploration steps is the number of actions that the agent must perform to reach the destination node in each round. At the same time interval, the fewer the number of exploration steps of the algorithm, the more training episodes the agent can carry out, and the more environmental information it can obtain.
(5) Route Cost
Route cost refers to the sum of the travel time, safety risk, and COVID-19 risk costs of the route on each road section. The calculation method is shown in Equation (9).
Time performance
Table 5 shows the time performance of the four algorithms, which were repeated several times, and the average value was taken as the experimental output data. The time performance of the algorithm was evaluated using the running time of the algorithm, time required to achieve convergence, and number of episodes required to achieve convergence.
Time performance comparison of the 4 algorithms
Time performance comparison of the 4 algorithms
As can be observed from the results in the table, compared with the other three route recommendation algorithms, the time required for the MORRBQ algorithm to achieve convergence and the number of episodes required to achieve convergence are significantly less because the initialization operation of the Q-value table in the proposed method significantly saves computing time and improves the convergence speed of the algorithm. However, the PreGo and MORP algorithms require a significant amount of time to obtain an optimal solution.
The reward value realizes the connection between the goal and algorithm by making the task goal concrete and nmerical. In a specific application, with the advancement of exploration steps, the size and variation trend of the reward value can effectively reflect the degree of excellence of the algorithm.
As shown in Fig. 7, the agent obtains a higher reward value at the initial stage of exploration because the MORRBQ algorithm adopts Dijkstra’s algorithm to initialize the Q-value table function. Compared with the other three algorithms, the reward value of the MORRBQ algorithm increased significantly and converged to zero faster. Moreover, the small fluctuation in the reward value growth curve of the MORRBQ algorithm also indicates that it is more directional and effectively speeds up the exploration process of the agent.

The change trend of reward value of the 4 algorithms.
The cost of the algorithm in the optimization process can be expressed as the number of steps required for convergence. The peak and falling speeds of the exploration steps reflect the optimization ability and stability of the algorithm.
As shown in Fig. 8, because the Dijkstra algorithm is used to initialize the value function during road network construction, the peak value of the exploration steps of the MORRBQ algorithm is considerably smaller than that of the other three algorithms, and the number of episodes required to reach the convergence state is also considerably smaller than that of the other algorithms. Compared with the MORR algorithm, the MORRBQ algorithm has higher stability after reaching the convergence state. Overall, the dynamic exploration factor σ enhances the exploration ability of the MORRBQ algorithm at the initial stage of iteration and assists the algorithm to improve the final convergence degree when more environmental feedback information is obtained. Compared with other algorithms, the MORRBQ algorithm shows stronger stability performance and exploration ability.

Change trend of exploration steps of the 4 algorithms.
Figures 12, and 14 show the route cost comparison results of the recommended route generated by the MORRBQ, PreGo, and MORP algorithms, respectively, for different objective attributes under three user preference setting scenarios.

Route generation results of the 4 algorithms in Scenario 1.

Route generation results of the 4 algorithms in Scenario 2.

Route generation results of the 4 algorithms in Scenario 3.
As can be observed from Fig. 9, the route generated by the PreGo algorithm has higher travel time, safety risk, and COVID-19 risk than other methods because the PreGo algorithm discards some of the best routes when using the dynamic A* algorithm to find a solution. The MORP algorithm uses an adaptive ɛ-constrained optimization solution and considers different attributes in isolation.

Route cost of the 4 algorithms on different target attributes in Scenario 1.
As shown in Fig. 10, in Scenario 2, the experimental results of the MORRBQ and MORR algorithms were better than those of the other two algorithms, and both the PreGo and MORP algorithms did not consider any user preferences. The route recommendation algorithm of the MORRBQ algorithm considers the preferences of a given user for different objective attributes, and the setting based on the given user preferences reduces the travel time in Scenario 2 and reduces the safety risk.

Route cost of the 4 algorithms on different target attributes in Scenario 2.
As shown in Fig. 11, in Scenario 3, although the route generated by the PreGo algorithm has a low travel time, it has the highest consumption cost in the two objective attributes of safety and COVID-19 risks, which fails to satisfy the setting conditions of Scenario 3; that is, it only considers safety and COVID-19 risks and does not require the shortest travel time. The route generated by the MORR algorithm has a high travel-time cost. The route generated by the MORRBQ algorithm has low safety and COVID-19 risks, which satisfy the user preference settings in Scenario 3.

Route cost of the 4 algorithms on different target attributes in Scenario 3.
In conclusion, the PreGo algorithm generated routes with longer travel time, higher safety risk, and higher COVID-19 risk. The MORP algorithm provides the Pareto optimal route but does not consider the user’s preference for different objective attributes. Moreover, the MORP and PreGo algorithms require a significant amount of computing time. The results of the MORR algorithm are better than those of the PreGo and MORP algorithms; however, the ɛ-greedy strategy of the MORR algorithm easily falls into a local optimum in the process of finding the route. It can be seen that the route generated by the MORRBQ route recommendation algorithm proposed in this study can consider user preferences for different route attributes and provide route suggestions in near real time.
Figures 12–14 show the recommended routes generated by the four algorithms considering the travel time, safety risk, and COVID-19 risk attributes simultaneously in different scenarios. The red, yellow, green, and black routes are generated by the MORRBQ, MORR, PreGo, and MORP algorithms, respectively. Point A is the source node (starting point), point B is the destination node (end point), and the black marked point is the location of COVID-19 cases. Considering the traffic accident data of New York City as the data source of the thermal map, it can intuitively show the road segments with high traffic accidents, that is, a high safety risk. The brighter the color, the higher is the safety risk of the road segment.
As can be observed from Fig. 12, under the setting conditions of Scenario 1, the routes generated by the MORRBQ and MORR algorithms all avoid road sections with high traffic accidents (i.e., high safety risks) and high COVID-19 risk, and the routes generated by the MORP algorithm do not avoid areas with high safety risks. However, the route generated by the PreGo algorithm has high safety and COVID-19 risks.
The route results generated by the four algorithms in Scenario 2 are shown in Fig. 13. The route generated by the MORRBQ algorithm avoids road sections with high safety risk and has a short traffic time, which satisfies the weight allocation set in Scenario 2. However, the routes generated by the MORR and MORP algorithms did not avoid areas with high safety risks, and the routes generated by the PreGo algorithm consumed the maximum time among the four algorithms.
Figure 14 shows the route results generated by the three algorithms in Scenario 3. It can be observed that the MORRBQ algorithm considers the optimization objectives of safety and COVID-19 risks, and the generated route has low safety and COVID-19 risks. Although the route generated by the MORR algorithm has a low risk of COVID-19, it does not avoid areas with high safety risks. The route generated by the MORP algorithm does not avoid the areas with high safety and COVID-19 risks. Although the route travel time generated by PreGo algorithm is shorter, its safety and COVID-19 risks are the highest compared with the other two algorithms.
Conclusion
Based on the reinforcement learning algorithm, this study proposes a multi-objective route recommendation algorithm, MORRBQ, which integrates user preferences. This algorithm not only adopts the common route recommendation optimization objective, which is the travel time, but also adopts the safety and COVID-19 risk objective attributes. Different objective attributes and their weight allocation will result in different route recommendation results. High-weighted travel time attribute will enable the agent to generate a short distance-prior route, high-weighted security risk attribute will enable the agent to generate a safety-first route, and high-weighted COVID-19 risk attribute will enable the agent to generate a route to avoid epidemic areas. Compared with other algorithms, MORRBQ algorithm can balance the impact of travel time, safety risk and COVID-19 risk, and generate personalized recommendation routes while following user preferences. The initialization operation of the Q-value table can effectively shorten the iteration time of the algorithm and clarify the agents’ choice of road node and action. The introduction of the dynamic exploration factor σenhances the exploration ability of the MORRBQ algorithm at the initial stage of iteration and assists the algorithm to improve the final convergence degree when more environmental feedback information is obtained. A series of experimental results on real road network data show that the recommended route generated by the MORRBQ algorithm has a lower route cost than other algorithms, and because of the unique Q-value table search mechanism of the MORRBQ algorithm, it can generate route recommendation results almost in real time. In future work, we will consider the application of a reinforcement learning algorithm in scenic spot recommendation and organically combine it with the route recommendation work in this study.
Footnotes
Acknowledgments
The authors would like to thank the reviewers for their useful comments and suggestions for this paper. This work was supported by the National Natural Science Foundation of China (Grant Nos. 61702010, 61972439), the Anhui Provincial Natural Science Foundation of China (Grant No. 2208085MF164), and the University Natural Science Research Program of Anhui Province (Grant No. KJ2021A0125).
