Abstract
Path planning in an unknown environment is a basic task for mobile robots to complete tasks. As a typical deep reinforcement learning, deep Q-network (DQN) algorithm has gained wide popularity in path planning tasks due to its self-learning and adaptability to complex environment. However, most of path planning algorithms based on DQN spend plenty of time for model training and the learned model policy depends only on the information observed by sensors. It will cause poor generalization capability for the new task and time waste for model retraining. Therefore, a new deep reinforcement learning method combining DQN with prior knowledge is proposed to reduce training time and enhance generalization capability. In this method, a fuzzy logic controller is designed to avoid the obstacles and help the robot avoid blind exploration for reducing the training time. A target-driven approach is used to address the lack of generalization, in which the learned policy depends on the fusion of observed information and target information. Extensive experiments show that the proposed algorithm converges faster than DQN algorithm in path planning tasks and the target can be reached without retraining when the path planning task changes.
Introduction
In recent years, intelligent mobile robots have been widely used, and almost all industries require robots to perform specific tasks, including domestic service [1], disaster relief [2], military operations [3] and so on. To accomplish specific task, path planning in an unknown environment is essential for many types of mobile robots, such as rescue robots [4], agricultural robots [5]. The robot must reach the work area from other position before performing the task. Not only do robots need to handle specific tasks, but they must also be able to plan a collision-free path between current position and work area with information observed by sensors. Hence, path planning in an unknown environment is a fundamental task for mobile robots.
Robot path planning aims to reach a predefined destination without any collision. Many scholars have done related research on path planning. Numerous solutions have been developed, such as Dijkstra algorithm [6], A* algorithm [7], flower pollination algorithm [8], grid-based approach [9], artificial potential field method [10] and fuzzy logic method [11–13]. Although these traditional methods are feasible in some cases, there still have some limitations. For the map-based methods, path search cannot be performed without global map information. Some methods based on local information can search for path in the absence of global information, but these methods still have some drawbacks. For example, under the joint action of attraction and repulsion, the artificial potential field method may cause the robot to fall into the local minimum potential energy region. For some rule-based methods, the rules may need to be regenerated when environments change due to lack of self-learning ability.
With the rise of machine learning and artificial intelligence, reinforcement learning, a type of machine learning, is applied to many challenging areas to solve difficult problems. It uses the principle of reward and penalty to learn strategies when the agent interacts with the environment. As a traditional reinforcement learning, Q-learning is widely used in path planning task. Smart et al. [14] applied Q-learning to robot navigation for the first time and designed a simple reward function to help the robot find a solution. Low et al. [15] proposed an improved Q-learning to solve the path planning, which use the flower pollination algorithm to initialize the Q-table so as to improve the convergence rate of Q-learning. However, when the size of the work environment of path planning task increases, a larger state space is generated and a longer time for updating the matrix of Q-value are required. It means that Q-learning is inefficient in solving the problems with high dimensional state space. Jaradat et al. [16] proposed a new definition of state space to limit the number of states and combined it with Q-learning for mobile robot path planning. The limitation of the state number does not solve the “curse of dimensionality” caused by large state space. Therefore, some researchers combine deep learning with Q-learning to solve the problem. They use a neural network to approximate the optimal action-value function so as to avoid the creation of the matrix of Q-value. This combination of deep learning and reinforcement learning is called deep reinforcement learning (DRL). Duguleana et al. [17] combined Q-learning with neural network to solve the path planning problem in dynamic environment and verified the effectiveness of the proposed method in a real robot. Yi et al. [18] combined fuzzy reasoning, neural network and Q-learning to solve the navigation problem of mobile robot, and proved its effectiveness in experiments.
Path planning task in an unknown environment requires the robot to constant self-study with obtained information to search for a path to the target. Hence, deep reinforcement learning approach is well suited to this task. Deep learning enables an agent to perceive the environment, while reinforcement learning enables an agent to learn the best strategies to deal with difficult problems [19]. Deep Q-network (DQN) is a typical deep reinforcement learning algorithm [20]. It has successfully been applied in many domains such as energy management [21], job shop scheduling [22] and image steganalysis [23]. It can effectively solve path planning problems in an unknown environment. Lei et al. [24] proposed a deep reinforcement learning path planning method in unknown environment and tested its effectiveness in actual environment with ROS framework, but the training of the model took a lot of time. Cheng et al. [25] proposed a concise deep reinforcement learning obstacle avoidance algorithm with the powerful deep Q-network architecture to solve the obstacle avoidance and path planning of unmanned marine vessels. Wu et al. [26] proposed a lazy training method for deep Q-network used to reduce the training time of neural network and verified its effectiveness in path planning tasks.
However, the deep Q-network requires a lot of exploration in the early stage of training and the agent is more likely to collide with obstacles when using the random exploration method. Excessive collision data is not helpful for model training and results in a slower convergence rate. Furthermore, the DQN path planning algorithm lacks generalization when the next action of the agent depends only on the observed environmental information. Thus, the model needs to be retrained to adapt to a new situation (a new starting point, destination, or work environment).
To address the slow convergence rate and the lack of generalization ability of DQN, this paper presents a new deep reinforcement learning path planning algorithm, which combines DQN with the prior knowledge that can guide the robot to avoid obstacles by choosing appropriate behaviors. In this method, a fuzzy rule base is built to introduce the prior obstacle avoidance knowledge. In the exploration phase, agent selects actions in two ways, random selection and fuzzy logic rules. Therefore, more effective obstacle avoidance experience is acquired for neural network training so as to reduce the training time. Furthermore, a target-driven approach is used for model training, which directly uses the target information and the information observed by the sensors as the input of network to train the model. In this manner, the policy learned depends on the target position and the current state of the agent, rather than just the current state of the agent. Hence, there is no need to re-train the network for changed starting positions or new targets.
The rest of this paper is structured as follows. In section 2, the modeling of mobile robot is constructed. In section 3, the path planning algorithm proposed is introduced. In section 4, the simulation environment and simulation model are created, and the simulation experiment and experimental results are provided. In section 5, the conclusion of this paper is given.
Modeling of mobile robot
The autonomous mobile robot in this work is a four-wheeled, differentially driven, skid-steering robot. The differential speed between left wheels and right wheels produces lateral velocities which yield to the effect of side skidding [27]. The kinematic structure and equipped sensors of robot are shown in Fig. 1, where XOY is the global coordinate system and X′O′Y′ is the local coordinate system of the robot. In the structure, W means the distance between the left and right wheels. When the robot steers, the point R CC about which the robot rotates is called the instantaneous center of curvature. At any time, the position of robot can be described by its current coordinates (R x , R y ) and direction θ. The intelligent mobile robot is equipped with five ultrasonic range sensors (u1, u2, u3, u4, u5) and a compass sensor CS used to obtain the direction of the mobile robot. The angle between the directions of each ultrasonic sensor is 30 degrees. The three main ultrasonic sensors (u1, u2, u3) cover a range of 15 degrees and 4 meters of distance and two auxiliary ultrasonic sensors (u4, u5) are located on the side, each covering a range of 15 degrees and a distance of 2.5 meters.

The kinematic structure and equipped sensors of the intelligent mobile robot.
The movement of the robot is realized by differential drive mode. With this configuration, the left front wheel and the left rear wheel of the robot have the same linear velocity v
l
, and the right front wheel and the right rear wheel have the same linear velocity v
r
. According to the model, the relationship between the forward linear velocity v and the steering angular velocity ω of the robot and the linear velocity of wheels can be obtained.
When the robot steers, the instantaneous radius of curvature R
r
, the distance between the center of the robot and R
CC
, was calculated to be:
At time t, the position of the robot at next time t + dt can be predicted:
where (R CCx , R CCy ) is the coordinate of the instantaneous center of curvature.
Reinforcement learning
Reinforcement learning is a potential method in the field of artificial intelligence. It is a reward-guided learning approach in which the agent interacts with the environment and get reward feedback. The basic principle of reinforcement learning algorithm is shown in Fig. 2. It aims to learn a strategy with maximizing the cumulative rewards of the agent from the environment [28]. In the process of interacting with the environment, an action will be strengthened if it receives a positive feedback. Otherwise, it will be weakened.

Framework of reinforcement learning.
Q-learning is an effective model-free reinforcement learning, which can work well in the absence of global information. It obtains the optimal strategy by iteratively optimizing the Q table storing Q values [29]. The calculation of Q value is related to the immediate reward r after performing the action a at state s and the maximum Q value of next state s′:
where Q (s, a) is Q value for state s and action a, and the update formula of the Q table is as follows:
In the update formula, Q
new
(s, a) means the updated Q value, α is the learning rate of the algorithm and γ is discounter factor. In each update, the highest Q value
However, in the path planning task, the time required for Q-learning to complete the task increases exponentially as the state space increases [30]. It means that Q-learning is unsuitable for the tasks of high dimensional state space. With the combination of deep learning and reinforcement learning, deep reinforcement learning can solve the problems that were previously intractable. DQN is a typical deep reinforcement learning algorithm, which combines the traditional Q-learning with deep learning. Compared with Q-learning algorithm, DQN has three features: DQN uses a neural network to approximate the optimal action-value function for solving the “curse of dimensionality” caused by large state space. DQN uses the experience replay mechanism to provide enough training samples for network updates, and eliminates the correlations in the observation sequences and smooths over changes in the data distribution through randomizing over the data. DQN reduces the correlations between the Q values and the target values through periodically updating the Q values towards the target values. However, DQN still has the problems of slow convergence and lack of generalization in path planning task, which results in plenty of time spent on model training
To solve the shortcomings of DQN in solving path planning, a reinforcement learning path planning algorithm guided by prior obstacle avoidance knowledge is proposed. The framework of the proposed method is shown in the Fig. 3.

Framework of the proposed path planning algorithm.
The DQN algorithm requires a lot of exploration in the early stage of training, and the acquired data is stored in the memory for neural network training. Neural networks are more likely to learn better by providing it with effective data [31]. Therefore, obtaining excessive collision data results in a slower convergence rate. To speed up the convergence of the algorithm, a fuzzy rule base is adopted in this study to introduce the prior knowledge of obstacle avoidance. Obstacle avoidance knowledge can help the agent gain more effective obstacle avoidance experience and avoid blind exploration so as to improve the convergence rate of the algorithm. In this method, the prior obstacle avoidance knowledge is introduced to guide agent in the exploration phase of the algorithm. The greedy policy is used to deal with exploration and exploitation trade-off. The agent has the possibility of 1 - ɛ to explore the environment, and the possibility of ɛ to exploit the best action from learned experience. In the exploitation stage, the action with the maximum Q value is selected through the neural network. In the exploration stage, the action is selected through two methods: random selection and using prior knowledge of obstacle avoidance. The probability of action selection through prior obstacle avoidance knowledge is p, and the probability of random selection is 1 - p.
There is a lack of generalization to new path planning task when the input of neural network is only environmental information observed by the agent. In such case, the destination information is hardcoded in neural network parameters. Thus, the network parameters are required to be updated when the path planning task changes. It means that the model must be retrained to adapt to a new situation. In order to improve the generalization ability of the model and enable it to handle different path planning tasks without retraining, a target-driven approach is used in this method to train the model. In this way, the target information is directly used as the input of the neural network, that is, the input consists of the current state s of the robot observed by the sensors and the goal information g representing the state of target relative to the current position of the robot.
At each time-step t, the obtained experiences e
t
= (s
t
, g
t
, a
t
, r
t
, st+1, gt+1) are stored in the memory D
t
= {e1, . . . , e
t
} containing a certain number of experiences. In the learning process, Q-learning updates is applied on samples of experience (s, g, a, r, s′, g′) ∼ U (D), selected evenly at random from the data memory. There are two neural networks, behavior network and target network, used in our algorithm. The output of the behavior network used to calculate the predicted value is Q (s, g, a ; θ), and the output of target network for calculating the target value is Q′ (s, g, a ; θ-), where θ and θ- is the parameter of the two networks. The Q-learning update at iteration i uses the following loss function:
In this study, the state information is the distance information measured by sensors, and the goal information is the position of the target relative to the mobile robot. If the agent reaches the target position or collides with an obstacle, it is considered to be in a terminated state. The total pseudocode of the proposed algorithm is like below Algorithm 1.
By this method, the intelligent robot completes the task of obstacle avoidance and path planning while improving the convergence rate and generalization ability of algorithm. After training, an adaptive path planning model in unknown environment is obtained.
As previously mentioned, prior obstacle avoidance knowledge can reduce the randomness of the exploration process and provide more effective experience for the neural network. In this section, a fuzzy logic controller for obstacle avoidance is designed to introduce prior knowledge. Fuzzy logic controller can imitate human mental process, avoid the difficulty of building mathematical models and complete complex tasks through simplified design. The structure of the fuzzy logic controller is shown in Fig. 4. The details of input and output and their membership function design are described as below.

The fuzzy logic controller of obstacle avoidance.
The complexity of fuzzy rule base and solving rate are positively correlated with the dimension of input. In the obstacle avoidance task, the performance difference between the controller with five ultrasonic sensor information inputs and the controller whose input is three ultrasonic sensor information is not significant. For comprehensive consideration, only the distance information of the middle three ultrasonic sensors (u1, u2, u3) are used as the input of fuzzy logic controller.
The distance information collected by the three ultrasonic sensors ranged from zero to four meters. Therefore, the universe of discourse is defined in the region [0, 4]. When the robot moves, the obstacle information in front is more important. Therefore, three linguistic values are defined for the distance information of front ultrasonic sensor, ‘Near’, ‘Far’ and ‘Very far’. Since the main function of the left and right sensors is auxiliary obstacle avoidance, only two linguistic values, ‘Near’ and ‘Far’ are defined. The input membership functions are shown in Fig. 5, each of which is composed of trapezoidal membership functions. The horizontal axis means the distance value of ultrasonic sensor and the vertical axis is the membership degree of fuzzy sets.

Input membership functions: (a) membership function of front ultrasonic sensor data, (b) membership function of ultrasonic sensor data on both sides.
The forward linear velocity of the robot is set to remain constant when the robot moves, and obstacle avoidance behavior is mainly achieved by changing the steering angular velocity. Thus, the output of the fuzzy logic controller corresponds to the steering angular velocity of the robot. A negative output corresponds to a left turn, and a positive output corresponds to a right turn.
For the output variable, the universe of discourse is defined in the region [-2, 2], in radians per second. It is classified either as ‘FasterLeft’, ‘Left’, ‘Forward’, ‘Right’ and ‘FasterRight’, respectively correspond to fast left turn, left turn, straight turn, right turn, and fast right turn. As shown in Fig. 6, the out-put membership function is composed of trapezoidal membership functions and triangular membership functions. The horizontal axis represents the steering angular velocity and the vertical axis represents the membership degree.

Output membership function.
In this study, five actions are predefined for the mobile robot, which correspond to five fuzzy sets of the output universe. Therefore, the output steering signal needs to be mapped to the action of the agent. For each output signal, its degree of membership to each fuzzy set are calculated and the action corresponding to the fuzzy set with the highest degree of membership is selected.
According to the number of inputs and the number of fuzzy subsets of each input, (3*2*2 = 12) fuzzy rules are set to guide the robot avoid obstacles. The “If –Then” manner is used to represent the relationship between input and output [32]. The constructed fuzzy rule base for fuzzy logic controller is described in Table 1.
Control rule table of the designed fuzzy controller
Through this rule base and the defined membership function, the robot can avoid obstacles and obtain more effective data for model training so as to reduce the training time of the model.
A deep reinforcement learning task contains various components, such as state space, reward function and neural network. The details of each of these components are described as below.
As previously mentioned, the input of neural network consists of the state observed by the sensor and the goal information. For this policy, a state will be determined by the five readings obtained by these ultrasonic sensors. The goal information contains the distance d from the current position of the robot to the target point and the direction θ′ of the target point relative to the local coordinate system of the robot, as show in the Fig. 7. The relative distance d was calculated to be:
Target point position relative to the robot local coordinates.
where (R
x
, R
y
) and (T
x
, T
y
) are the current coordinates of robot and the target point relative to the global coordinate system. The relative direction θ′ was calculated to be:
For the mobile robot, five actions are considered: moving forward, turning left, turning right, faster turning left and faster turning right. The forward linear velocity v and steering angular velocity ω corresponding to the five groups of actions are shown in Table 2.
Robot’s actions and corresponding velocity
The duration of each action is one second. After each action, the state and goal information are re-read and the next action is selected
Neural network can be used to develop a policy from s t to a t . In this paper, a three-layer neural network is used, including input layer and output layer, to learning policy and approximate the optimal action-value function. The structure of the neural network is presented in Fig. 8.

Neural network structure.
The size of the input data is 7, corresponding to the distance information observed by the five ultrasonic sensors, the distance from the robot to the target and the direction of the target relative to the local coordinate system of the robot. There are 128 neurons in the hidden layer and the Rectified Linear Unit (ReLU) [33] is used as the activation function of the hidden layer. The size of the output data is 5, representing the Q value of each action. When the action is selected through the neural network, the action with the maximum Q value is selected according to the current state and goal information.
Reinforcement learning algorithms require an instruction-centered feedback signal, which makes it difficult to design an appropriate reward function. In a path planning task, the reward signal must consider two subtasks: on the one hand, the mobile robot must develop an obstacle avoidance subtask to driving safely through the environment. On the other hand, the mobile robot needs to get to the target as quickly as possible. For these two subtasks, the following reward function are designed:
When colliding with an obstacle or wall, the agent gets a negative reward directly:
When the agent successfully reaches the target position, a large reward is directly given:
In other cases, the obtained reward R
else
is determined by both the reward R
safe
of keep safe and the reward R
close
of close to target.
When the agent does not collide with obstacles or reach the target, the states of the agent was clustered into two types: Safe States –S (When the distance of obstacles detected by ultrasonic sensor is far away) and Non-Safe States –NS (When the distance of obstacle detected by ultrasonic sensor is relatively close). To reduce the possibility of collision with obstacles, the R
safe
is defined as follows:
where n is the step number and dmin is the minimum distance of five ultrasonic sensors.
In order to reach the target as quickly as possible, the Rclose is defined as follows:
Therefore, the total reward function is as follow:
The effectiveness of proposed method is evaluated by a series of path planning tasks, where the mobile robot is trapped in an unknown environment and search a collision-free path through constant self-learning.
Experimental Setup
Several virtual simulation environments for path planning tasks and a simulation robot corresponding to the structure described in section 2 are created by Gazebo simulator to train our model. Gazebo is a physical simulation platform that supports simulation of various robots, sensors, and environment models. It has multiple physics engines and supports advanced dynamics simulation. It also contains multiple built-in sensors and robot models, and has various built-in plugins and functional interfaces [34]. In this paper, python language is used to implement the path planning algorithm and call the built-in interface of the Gazebo simulator to control the robot and obtain robot information. The simulation robot model is shown in Fig. 9. In all experiments, the same parameters of robot model are used without any prior knowledge about the environment. The next action can only be judged by the sensor information and the calculated goal information. To implement the model more easily, Tensorflow, an open source software library developed by researchers and engineers working on the Google Brain Team, is used. Throughout our experiments, a NVIDIA GeForce GTX 1080Ti GPU is used for training. Through multiple tests, the parameters in Table 3 were used in our experiment.

Simulation robot built in Gazebo.
Training parameters and values
The ɛ-greedy policy was used as the action selection policy. The initial value of ɛ is 1, and after a certain number of pre-training, the ɛ begins to decrease. It drops by 0.0025 per step and stays the same when it drops to 0.01. For the proposed algorithm, two sets of experiments are conducted to evaluate its convergence rate and generalization performance.
Rate of convergence
In this section, a series of comparative tests are conducted in three environments of the same size with different number of obstacles, by using the DQN and our method. As shown in Fig. 10, the length and width of the three environments are 20 meters. The first environment is the simplest, with 10 obstacles. It contains only bar and cylinder obstacles and each obstacle is far away from the others. The second environment consists of 22 obstacles and the obstacle types contain bar, cube and cylinder. The third environment is the most complex. It contains 35 obstacles, including bar obstacles, cube obstacles and cylinder obstacles of different sizes, and the location of obstacles is very disorderly. In each environment, the position of the robot is the starting position and the red triangle represents the destination.

Virtual simulation environment built in Gazebo: (a) first environment, (b) second environment, (c) third environment.
In three environments, the DQN algorithm and our method are used for path planning tasks and the convergence rates of them are compared. Each model was trained for 600 epochs. Figure 11 recorded the cumulative reward per episode and average reward of the two algorithms. As shown in Fig. 11, the agent lack experience in how to reach the target in the early stage of training. Therefore, the agent gets a low reward value. As the training progresses, the agent constantly learns how to avoid obstacles and reach the target position. Thus, the times of reaching the target and the reward value increase accordingly. By comparing the reward curves of the same algorithm in different environments, it can be seen that in a simpler environment, the cumulative reward curve is more stable and the agent can get a high reward faster. The reason is that the robot is more likely to collide with obstacles in a complex environment and cannot reach the target.

The reward curves of two methods in three environments: (a)–(c) reward curves using the prior knowledge-based DQN in the first environment, the second environment and the third environment, (d)–(f) reward curves using DQN in the first environment, the second environment and the third environment.
The comparison of the average rewards of the two algorithms is recorded in Fig. 12. Obviously, in the three environments with different number of obstacles, our method has a higher rate of convergence, because the model guided by obstacle avoidance knowledge can learn faster about how to avoid obstacles and reach the target point. The number of training epochs that the agent first reached the target point in different environments are recorded in Fig. 13. As shown in Fig. 13, a consistent trend of increasing number of training epochs required by agent, as the environment gets more complex. In spite of this, it can be seen that the training epochs required by our method to reach the target is less than that of DQN in all environments.

The comparison of average rewards in the three environments: (a) average rewards in first environment, (b) average rewards in second environment, (c) average rewards in third environment.

Comparison of the required training epochs for first reaching the target.
In order to further verify the validity of the proposed method, after training a certain epoch, the models trained by DQN and prior knowledge-based DQN were compared and tested. As can be seen from Fig. 13, the number of training epochs that the agent first reached the target by DQN in three environments are 86, 96, 148 respectively. Therefore, models with 100, 100 and 150 trained epochs are selected to test. In each environment, the corresponding model was tested 50 times. The starting and ending points of tests are consistent with that of Fig. 10, and the results are recorded in Table 4. As shown in the table, the models trained after a certain epoch by both methods can reach the target point. However, the DQN models has significantly lower success rates and more step count to reach the target. This indicates that the model trained by the proposed algorithm requires less training epochs. To compare behavior difference of the model trained by the two methods, after training all epochs, the same tests were carried out on the models trained by these two methods. The results of multiple tests are recorded in Table 5, and the paths chosen by different models when the target point is successfully reached are recorded in Fig. 14. As can be seen from Table 5, after enough training epochs, the success rate and average moving step count of the models trained by DQN and proposed method are not significantly different. And it can be seen from Fig. 14 that the paths to the target point selected by the DQN models and the prior knowledge-based DQN models are basically the same. This proves that the prior obstacle avoidance knowledge can shorten the model training time, but has little effect on the model that has already converged.
Performance comparison of models
Performance comparison of models after training all epochs

The paths searched by different models in three environments: (a)–(c) the paths searched by the prior knowledge-based DQN model in the first environment, the second environment and the third environment, (d)–(f) the paths searched by DQN model in the first environment, the second environment and the third environment.
To sum up, the results indicate the validity of prior obstacle avoidance knowledge. The proposed method can help mobile robot obtain more effective obstacle avoidance experience in path planning tasks due to the guidance of prior knowledge. Therefore, it has a faster rate of convergence than DQN.
In this section, the generalization performance of the proposed method is discussed and analyzed. First, three sets adaptive of models corresponding to the three environments in Fig. 10 are obtained, each of which was trained 600 epochs in the corresponding environment by DQN and prior knowledge-based DQN. During training, the starting and ending points are consistent with the starting and ending points marked in Fig. 10. Then, two tests are performed to test the generalization performance of the trained models.
In the first test, the focus is on the generalization capability to new target and new starting point. Thus, the test environment for each model should be the same as the training environment of that model, changing only the starting and ending positions. The starting point and target point were reselected to test whether the robot could reach the target point with the trained model. As shown in Fig. 15, five starting positions and ten destinations were selected for each environment, where the blue circle represents the starting point and the red triangle represents the target point. Thus, a total of (5*10 = 50) path planning tests per trained model. In each environment, the model corresponding to the environment is used for testing. The test results of the trained models in their respective environments are illustrated in Table 6. The success rate was used to measure the performance. Collision with an obstacle or failure to reach the target within 100 steps is considered a failure. Therefore, the agent either reaches its target quickly or fails. As shown in Table 6, in all environments, the model trained by both DQN and prior knowledge-based DQN can reach the target without retraining after changing the target position and starting position, and it has a high success rate from new starting points to new targets. This proves that the model trained in target-driven approach is generalizable to new target and new starting point, whether or not has prior knowledge of obstacle avoidance.

Test scenarios for first generalization performance test: (a) first environment, (b) second environment, (c) third environment.
The results of the first generalization performance test
In the second test, the generalization capability of trained models to new environment was concerned. Thus, a new virtual environment was used to test the performance of the three sets of trained models. As shown in Fig. 16, five starting positions and ten destinations were selected in the new environment. Therefore, each model was required 50 path planning tests in the new environment. The performance of the models in the new environment are recorded in Table 7. It can be seen that the model trained in the first environment performed worst in the new environment, with a success rate of only 0.38 and 0.42. Compared to the model trained in the first environment, the model trained in the second environment performed better, with a success rate of 0.64 and 0.68. The model trained in the third environment performed significantly better in the new environment than the other two models, with a highest success rate of 0.86 and 0.82. The results showed that models trained in more complex environments had higher success rates. This indicates that the model trained in complex environments are more adaptable to new environment.

Test scenarios for second generalization performance test.
The results of the second generalization performance test
In summary, the test results prove the generalization of the model trained and the effectiveness of the target-driven approach. The model trained by our method has high generalization capability to new starting points and targets, and the more complex the environment, the stronger the generalization ability of the trained model to a new environment.
To verify the effectiveness of the proposed method in path planning task, a set of comparative tests were conducted. A hybrid path planning method was used to compare with the proposed method. The hybrid method is implemented by the obstacle avoidance knowledge in Section 3 and the heuristic knowledge of goal-directed knowledge, moves in a straight line in the direction of the target. In this method, the robot will approach the target point along a straight line as long as possible while ensuring safety, and while the sensors detect a close obstacle follow the obstacle avoidance knowledge. In order to test the performance of knowledge-based DQN and hybrid path planning method, six experiments with different environments were designed. Three of the environments are the same as the environment in Fig. 10, except that the target point changes. The other three environments are rebuilt in Gazebo, namely fourth environment, fifth environment and sixth environment. The collision-free paths searched by two methods in different environments were recorded in Fig. 17 and Fig. 18, and the moving step count to the target were recorded in Table 8.

The paths searched of two methods: (a)–(c) the paths searched using the prior knowledge-based DQN in the first environment, the second environment and the third environment, (d)–(f) the paths searched using the hybrid path planning method in the first environment, the second environment and the third environment.

The path search results of two methods: (a)–(c) the paths searched using the prior knowledge-based DQN in the fourth environment, the fifth environment and the sixth environment, (d)–(f) the paths searched using the hybrid path planning method in the fourth environment, the fifth environment and the sixth environment.
Comparison of moving step count to the target
In Figs. 17 18, the blue circle represents the starting point, the white triangle represents the end point, and the red curve represents the trajectory of robot. Some information can be summarized from Figs. 17, 18 and Table 8. In the fourth environment, both methods can reach the target, and the trajectories and moving step count of two methods were almost the same. From the result, the hybrid approach performs as well as prior knowledge-based DQN in this scenarios. In the first environment, the trajectories of two methods are slightly different. The trajectory of prior knowledge-based DQN is slightly smoother and total moving step count of prior knowledge-based DQN was fewer than that of hybrid method. In the second environment, fifth environment and sixth environment, the paths searched of two methods are quite different. It can be seen that the trajectories obtained by prior knowledge-based DQN in these environments are smoother and requires fewer moving steps. It indicates that our method can obtain a relatively optimized path while reaching the target point. In the third environment, the proposed method can still reach the target point without collision. However, the robot used the hybrid path planning method was trapped in a loop and constantly switching between goal-directed knowledge and obstacle avoidance knowledge. In summary, the results prove that our method can carry out path planning tasks well in different environments.
In this paper, a deep reinforcement learning path planning algorithm based on prior knowledge is proposed. To speed up the convergence of the algorithm, a fuzzy logic controller is used to introduce prior obstacle avoidance knowledge in the exploration phase of the algorithm. For the problem of poor generalization, a target-driven approach is used for model training, which trains the model with the target information and the state information of the mobile robot as the input of network.
The proposed method is evaluated through a series of experiments. A simulation robot and virtual simulation environments for the experiments were built in Gazebo simulator. Extensive experiments have demonstrated the effectiveness of our method in path planning. Furthermore, these experiments proved that the proposed method converges faster than DQN in the same experimental environment. When the target position, the starting position of the robot, or even the environment changes, the model can reach the target without retraining.
In the future work, to further improve the algorithm performance, more prior knowledge will be used to guide the mobile robot to perform the task. To perform more complex path planning experiments, more special types of obstacles will be considered in the environment, such as puddles and lawns, rather than just rigid obstacles. And more types of sensors will be equipped for a comprehensive perception of environmental information, such as image sensors and LIDAR. Furthermore, the method will be further improved and applied in the field of multi-robot path planning.
Footnotes
Acknowledgments
The authors would like to acknowledge funding support from the National Natural Science Foundation Committee (NSFC) of China (grant no. 51905397) and The Fundamental Research Funds for the Central Universities and (WUT:2018III069GX), (WUT:2019III071GX), as well as the contributions from all collaborators with in the projects mentioned.
