Abstract
We present an approach to autonomous drone racing inspired by how a human pilot learns a race track. Human pilots fly around the track multiple times to familiarise themselves with it and to find key points that allow them to complete it without risk of collision. This paper proposes a three-stage approach: exploration, navigation, and refinement. Our approach does not require prior knowledge about the race track, such as the number of gates, their positions, and their orientations. Instead, in the exploration stage we use a trained neural pilot called DeepPilot, which returns basic flight commands from camera images in which a gate is visible, to navigate an unknown race track, and a Single Shot Detector to visually detect the gates and identify points of interest. These points are then used in the navigation stage as waypoints in a flight controller to enable faster flight around the entire race track. Finally, in the refinement stage, we use the methodology developed in stages 1 and 2 to generate novel data to re-train DeepPilot, which then produces more realistic manoeuvres when the drone has to cross a gate. In this sense, similar to the original work, rather than generating examples by flying a full track, we use small tracks of three gates to discover effective waypoints to be followed by the waypoint controller. This produces novel training data for DeepPilot without human intervention. By training with this new data, DeepPilot significantly improves its performance, doubling its flight speed w.r.t. its original version. Also, for this stage 3, we required
Introduction
Autonomous Drone Racing (ADR) poses the problem of developing an artificial pilot capable of flying on a race track autonomously and, as an ultimate goal, competing against a human pilot aiming to beat them [20]. The way human pilots tackle this challenge is intriguing. It has been shown that human pilots do not seek an optimal flight. Instead, the flight policy seems simple: fly as quickly as possible towards the next gate, and once closer to it, manoeuvre the drone such that the gate can be crossed without colliding with it. For the latter, the pilot pays attention to potential areas of collision at the gate and reduces the speed in the case the next gate requires a significant turn in the flight direction [23]; all of that by only observing live video transmitted from a camera on board the drone to head-mounted glasses worn by the pilot. To achieve outstanding performance, pilots fly several laps on the race track to get familiar with it and to learn good places to increase/decrease speed.
Inspired by the learning stage performed by humans, some approaches have proposed using Deep Learning (DL) to train Convolutional Neural Networks (CNNs) as neural pilots that regress flight commands from camera images [26]. DeepPilot is one such approach, in which a CNN is taught basic flight commands given a set of consecutive images depicting a gate. Following a reactive approach, the flight commands aim to align the drone w.r.t. the gate’s centre for further crossing. Although the model can pilot the drone through a race track effectively, its flight speed is low. Nevertheless, a noticeable feature of DeepPilot is its ability to maintain a forward direction once the gate is no longer observed. This happens in what is known as the blind spot zone of the gate [4], which is the flight zone where the drone has to fly forwards only to cross the gate; otherwise, it could hit the side frame of the gate.
Therefore, in the effort toward developing an autonomous artificial pilot for ADR, in this work, we present an approach that requires no information about the race track except the drone’s position in the arena. In a real scenario, it is reasonable to expect that positioning could be provided by GPS, a motion capture system [9] or visual SLAM [19]. In the first stage, our approach employs two well-known methods used in ADR: 1) an artificial pilot that has been trained to fly towards a gate and cross it [26]; 2) a gate detector based on the Single Shot Detector [1]. These two CNN-based methods are easy and inexpensive to train compared to state-of-the-art DL methods requiring a large amount of data. We combine them to automatically discover the entry and exit 3D points of the blind spot zone. The entry point marks when the drone is ready to cross the gate; the exit point marks when the gate has been crossed and it is safe for the drone to change direction. Thus, in this first stage, the drone flies autonomously throughout the race track using the neural pilot (DeepPilot) and the gate detector to discover the drone’s last position before entering the blind spot and its first position after safely exiting the blind spot. For the second stage, the discovered waypoints are used by a flight controller seeking to fly at a faster speed. Finally, the third stage learns from the information provided by the flight controller to improve the performance of the neural pilot.
The aforementioned strategy has been evaluated in the RotorS simulator implemented for Gazebo [10]. We compared it against the performance of a human pilot on the same race track with very similar time results on average. We show experiments on the flight performance when our approach performs one lap only to discover the enter/exit blind spot waypoints and compare it to refined waypoints obtained after completing several laps to get a group of enter/exit waypoints from which we use the average position.
This paper has been organised as follows to convey our approach: Section 2 discusses the related work; Section 3 describes in more detail our approach; Section 4 presents our experimental framework; and finally, Section 5 outlines our conclusions and future work.
Related work
In recent years, autonomous drone racing (ADR) has gained significant importance in the scientific community, as it represents an ideal test bed for evaluating advances in autonomous aerial navigation. Since 2016, competitions such as the IEEE IROS Autonomous Drone Racing (ADR) [20,21], the AlphaPilot [7], and Microsoft Game of Drones [18], where the goal is to develop an artificial pilot that can autonomously execute the same task as a human pilot, have been designed. Therefore artificial pilots in a racing environment must be able to perceive, reason, plan, and act within seconds to reach the goal.
In [27], the authors analyse the methods and strategies used in ADR. They highlight four main modules: perception, localisation, planning, and control algorithms from the point of view of the optimal flight. The perception module is responsible for detecting and localising the gates that the drone needs to navigate through. However, perception difficulties arise during the race, such as blurred images, occlusions, partial views of the target to be traversed, susceptibility to lighting conditions and gate overlap.
Deep learning techniques have proven to be a successful solution for gate detection in ADR. Researchers such as [1,11,12,20,32] have trained models to identify the gates in diverse conditions, enabling drones to navigate the race course more efficiently and effectively. Furthermore, these techniques have outperformed traditional computer vision techniques, such as colour segmentation or line and corner detection, resulting in better performance and results.
Another perception problem is the limited field of view of the cameras, which can prevent knowing the precise location of the drone. For example, when the drone is approaching a gate, the camera may be unable to determine if it has crossed the gate or is obstructing it. To address this issue, some authors propose using complementary information. For instance, the authors in [4] used the accelerations obtained by the Inertial Measurement Unit (IMU) and a Kalman filter to estimate when the drone left the gate. Other approaches detect the gate’s base using LiDAR sensors [11] or optical flow [12]. These methods help the drone pass through the gate without collision.
Concerning flight path planning, it is necessary to know the positions of the gates and to have a state estimator for the drone. The most common vision-based localisation systems in robotics are Simultaneous Localisation and Mapping (SLAM) [22] and Visual Odometry (VO) [5,6]. However, these systems are computationally expensive, resulting in position estimates at less than 15 Hz on embedded systems [9,19,22]. Alternatively, authors such as [20] have implemented IMU-based position estimation methods using the Kalman Filter [15,16], or neural networks combined with the Extended Kalman Filter, to obtain position estimates at a higher frequency than vision-based localisation systems. For example, the authors of [29] have implemented convolutional neural networks that associate a 2D image with a 3D pose for drone localisation within a known environment. This model provides the drone’s pose at a frequency of 65 Hz using an intelligent camera, with an average error of 0.25 m in translation and 2.0° in heading compared to ORB-SLAM.
Since the methods mentioned earlier for pose estimation accumulate errors, some authors consider modifying the trajectory planning by the position of the gate [1,11,12,28], the relative position [2–4,14,15], the actions [20], the speed [25,26,31], or even the direction [13,14] of the drone.
In contrast to visual localisation or pose estimation methods, motion capture systems do not accumulate errors and provide positions up to 500 Hz, which allows agile, high-speed flights [8,9]. For example, the authors of [9] present a solution using 36 VICON (motion capture system) and calculate the optimal trajectory in time. Furthermore, the authors give the gate positions (waypoints) and pre-calculate the optimal flight path by concatenating the waypoints for a single turn several times.
However, the authors in [24] affirm that in the solution presented by Foehn in [9], the drone cannot explore and learn from experience to improve the visibility of gates by choosing an entry angle like human pilots. Instead, the drone can only perform a flight if it knows a sequence of waypoints. Motivated by this, we propose a three-step strategy to train a neural pilot effectively. The first step consists of exploring the race track to identify the waypoints. The second step executes a point tracker using the points of interest found in step one as a reference. Finally, using the information from step two (images and flight signals), we train a new model to improve the performance of the neural pilot.
Methodology
This section presents a three-stage strategy to train a neural pilot effectively: 1) the exploration stage, 2) the navigation stage and 3) the refinement stage. In the exploration stage, the drone navigates the race track guided by a neural pilot combined with a gate detector to obtain a set of waypoints, as shown in Fig. 1. Next, the navigation stage executes a waypoint controller using the points discovered in stage one and the drone’s global position provided by Gazebo to complete the race track quickly, see Fig. 4. Finally, the refinement stage utilises the information gathered in stage two, including images and flight signals, to create a new dataset aimed at improving the performance of DeepPilot [26]. This new dataset was used to train the DeepPilot model further, as shown in Fig. 5.

Schematic view of the first stage of our approach, which consists of exploring the race track to discover 3D points corresponding to the drone’s position when it is about to enter the gate and after it has exited the gate during the crossing. We use a gate detector to discover the waypoints while DeepPilot, a neural pilot, flies the drone autonomously on a race track.
During this stage, we utilise a neural pilot to control the drone on the race track and a gate detector to identify the entry and exit point of the blind spot. To obtain a set of 3D points in meters, we use the global position of the drone provided by the simulator. Finally, to define the successful entry and exit points of the blind spot, we implement a set of rules described in Section 3.1.3.
Neural pilot
DeepPilot is a neural pilot that guides a drone on a racetrack without having information such as gate positions and orientations. DeepPilot comprises three branches to obtain the flight signals in parallel. Each branch consists of four convolutional layers, three inception modules, one fully connected layer and one regressor to each flight signal value. DeepPilot associates a set of images with a flight signal to produce translational and rotational motion as a tuple
The authors utilised a dataset comprising 10,334 mosaic images to train the model in recognising seven fundamental movements: right, left, up, down, right rotation, left rotation, and forward displacement. Each mosaic image consists of 6 frames, effectively acting as a memory that captures a discernible motion trend for the neural network to learn the appropriate flight commands. The assigned labels for each mosaic image span a range from 0.9 to 0.1 for roll, 1 to 0.1 for pitch, 0.1 to 0.05 for yaw, and 0.1 to 0.05 for altitude. Additionally, the authors introduced noise filters to the output values to dampen spikes in the flight signals, which tend to induce oscillatory behaviour and jerkiness.
Gate detector
Because DeepPilot does not identify the gates explicitly, we implemented a deep-learning-based gate detector to indicate the position of the gates in the image. We use a reduced variant of the Single Shot Detector (SSD) network [17], called SSD7, to minimise the training and search time. This network is a fast multi-category object detector that combines predictions on multiple feature maps of different sizes, producing detections at high and low levels of the image by applying convolution filters. The SSD7 network architecture is based on a VGG16 image classification network and an auxiliary structure comprising seven convolution layers, as shown in Fig. 3. The auxiliary structure obtains multi-scale feature maps and convolutional predictions to determine the bounding box displacement and the box’s position relative to the location of each feature map.
The dataset used for gate identification includes images captured in various real-world environments (indoors and outdoors) and images from the Gazebo simulator [1,4]. The trained model is designed to detect gates in different orientations, heights, skew angles, and even in cases where gates overlap, all under diverse lighting conditions. Including real-world environments in our dataset adds diversity and facilitates the practical implementation of the strategy proposed in this work on a real drone in future work.
Additionally, we utilised the gate detector to eliminate noise filters from the flight commands generated by the original DeepPilot model employed in the navigation stage. We also addressed four failure scenarios during flight [31]: 1) slow forward translation, 2) errors in roll values observed at the image edges, 3) roll oscillations at the front of the gate, and 4) slow rotation when the gates are tilted.
Waypoints discovery
To discover the effective waypoints, we use the DeepPilot network [26] to navigate autonomously in the racetrack and a Single Shot Detector (SSD) network [1,4] to detect the gates during the flight. Also, we used the global position of the drone provided by the simulator to obtain a 3D point in meters.
We obtained two 3D points for each gate, which indicate the entrance and exit of the blind spot zone. We define the blind spot zone as the space near the gate where the drone can no longer tell whether it has crossed the gate or is still in the middle of it. Hence, we used a gate detector to identify when the drone stopped tracking the gate (the entry of the blind spot) and when the drone crossed the gate completely (the exit of the blind spot), as shown in Fig. 2. Furthermore, we designed the following rules to add or remove a waypoint:
If the gate detection is active, the algorithm evaluates whether the area of the current detection is larger than the largest registered area. If so, the largest area is updated to the area of the current detection, as shown in Fig. 2(a) and 2(b).
If there is no gate detection, a counter (cnt) is initialised to zero. cnt indicates the number of consecutive frames in which the gate has not been detected. The aim is to find the blind zone, i.e. the point at which the drone no longer sees the edge of the gate because it is in the centre, as shown in Fig. 2(d).
While the gate detection is not active, cnt increments by one; if a detection occurs, cnt is reset to zero.
If cnt reaches a set number of frames, the algorithm adds the 3D coordinates of the drone to a list to mark the entrance to the blind spot, as shown in Fig. 2(d).
Once the entry to the blind spot is identified, we implement a safety measure that prevents a waypoint from being placed in the centre of the gate. For this, we set a specific time for the drone to leave the blind spot: if a waypoint were placed in the centre of the gate, then during waypoint navigation the drone could turn towards the next gate without first leaving the blind spot area, increasing the risk of collision, as shown in Fig. 2(e).
Additionally, we implemented two actions to eliminate a false positive if one has been detected. Therefore, to add the exit from the blind spot area, the gate area must be smaller than 75% of the size of the largest recorded area, as shown in Fig. 2(f).
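The rules above can be read as a small per-frame state machine. The following Python sketch is one possible interpretation and not the authors' implementation: the class name BlindSpotTracker, the frame thresholds lost_frames and exit_delay, and the choice to measure the safety time in frames are all assumptions made for illustration. Only the 75% area ratio comes from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


@dataclass
class BlindSpotTracker:
    lost_frames: int = 5       # hypothetical: missed frames that mark the entry
    exit_delay: int = 10       # hypothetical: frames to wait before adding an exit
    exit_ratio: float = 0.75   # from the text: exit needs area < 75% of largest area
    largest_area: float = 0.0
    cnt: int = 0
    since_entry: Optional[int] = None  # None until an entry has been recorded
    entries: List[Point] = field(default_factory=list)
    exits: List[Point] = field(default_factory=list)

    def update(self, area: Optional[float], position: Point) -> None:
        """Process one frame; `area` is the detected box area, or None if no gate."""
        if self.since_entry is None:
            # Approaching a gate: track the largest detection, count missed frames.
            if area is not None:
                self.largest_area = max(self.largest_area, area)
                self.cnt = 0
            elif self.largest_area > 0.0:
                self.cnt += 1
                if self.cnt == self.lost_frames:
                    self.entries.append(position)
                    self.since_entry = 0
        else:
            # Inside the blind spot: wait, then accept a clearly smaller detection.
            self.since_entry += 1
            if (self.since_entry >= self.exit_delay and area is not None
                    and area < self.exit_ratio * self.largest_area):
                self.exits.append(position)
                self.largest_area, self.cnt, self.since_entry = area, 0, None
```

In this sketch, update would be called once per camera frame with the bounding-box area from the gate detector and the drone's global position; the entries and exits lists then hold the candidate waypoints.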

Example of waypoint discovery. We use the DeepPilot network [26] to guide a drone on a race track autonomously and a Single Shot Detector to visually identify the gates. The blue point represents the entry, and the red represents the exit drone positions in the blind spot zone. The waypoint controller uses these waypoints to perform a much faster flight later.

Gate detector implemented to identify the blind spot zone. We used the SSD7 network based on VGG16 and an auxiliary structure comprising seven convolution layers. The network input is an image of 120 x 160 pixels, and the output is a vector, which includes class ID and a bounding box.

Schematic view of the second stage of our approach, which uses the waypoints discovered in the first stage to fly the drone through the track effectively and without collisions.
Our waypoint controller uses the drone’s current position obtained by the Gazebo simulator at
Where sign is defined as:
To obtain
We implement a proportional controller to correct the yaw angle and obtain
Additionally, we implement a proportional-integral controller for roll and height. For roll, we define a reference of zero to maintain the drone close to the direction vector, as shown in Eq. (3), where
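As a rough illustration of the controller structure described above (a proportional term for yaw, proportional-integral terms for roll and height), the following Python sketch may help. It is not the authors' controller: the gains, the clipping of commands to [-1, 1], the sign conventions, the use of atan2 for the desired heading, and the use of the signed lateral distance to the line between waypoints as the roll error are all assumptions, and every name is hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


def wrap_angle(angle: float) -> float:
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def clip(value: float, limit: float = 1.0) -> float:
    """Saturate a command to [-limit, limit] (assumed command range)."""
    return max(-limit, min(limit, value))


@dataclass
class PI:
    """Proportional-integral controller with hypothetical gains."""
    kp: float
    ki: float
    integral: float = 0.0

    def step(self, error: float, dt: float) -> float:
        self.integral += error * dt
        return clip(self.kp * error + self.ki * self.integral)


def yaw_command(position: Vec3, yaw: float, target: Vec3, kp: float = 1.0) -> float:
    """Proportional yaw correction towards the target waypoint."""
    desired = math.atan2(target[1] - position[1], target[0] - position[0])
    return clip(kp * wrap_angle(desired - yaw))


def cross_track_error(position: Vec3, start: Vec3, target: Vec3) -> float:
    """Signed lateral distance in the xy plane from the line start -> target.

    Positive values mean the drone is to the left of the direction vector.
    """
    dx, dy = target[0] - start[0], target[1] - start[1]
    px, py = position[0] - start[0], position[1] - start[1]
    norm = math.hypot(dx, dy)
    return (dx * py - dy * px) / norm if norm > 0.0 else 0.0
```

With a roll reference of zero, the roll command in this sketch would be PI(...).step(0.0 - cross_track_error(...), dt), and the height command would use a second PI instance on the difference between the waypoint height and the drone height.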

Schematic view of our module to improve the DeepPilot performance. In the module, we associated the mosaic image and the flight signals provided by the waypoint controller to obtain a new dataset.
DeepPilot has been shown to allow a drone to navigate a race track with randomly placed gates at different orientations and heights. However, it has two problems. The first problem is that the estimated flight signals must have noise filters on the output values to smooth out the spikes in the flight signals, which produce oscillatory behaviour and jerkiness. The second problem is that, when changing the platform, for example, to the Bebop 2 simulated in RotorS [10], DeepPilot provides the wrong proportion of the signal value. That is, the estimated flight signals correspond to the correct flight signals to send to the drone (roll angle, pitch angle, yaw rate, altitude rate) but are incorrectly scaled. To solve this problem, there are two options. The first is to adjust a set of constants that act as gains to improve the performance of the artificial pilot, and the second is to provide more examples in the training set.
In [31], we proposed a strategy using a gate detector to tune the gains automatically. Unfortunately, although this strategy improves the performance of DeepPilot and allows it to complete the track without collisions, it performs jerky and mechanical movements. Therefore, the best option is to provide a new training set that teaches DeepPilot to move smoothly and naturally. However, generating a dataset from human pilot flights would involve long training periods and an imbalance in training data, as pilots do not fly drones at the same speed. Therefore, in the interest of learning flight cues that ensure that the drone does not collide, we propose the following:
Design runway sections that are generally found on a race track, such as straight sections, zig–zag sections, curves and elevation changes, as shown in Fig. 6.
Use the exploration stage described in Section 3.1 to find the entry and exit points of the blind zone (waypoints).
Use the navigation stage described in Section 3.2 to navigate as fast as possible using the discovered waypoints and the point tracker.
Generate a new dataset in which the image mosaic is associated with corresponding flight signals, including roll angle, pitch angle, yaw rate, and altitude rate; see Fig. 5.

Examples of track sections used to improve DeepPilot performance. The sections include: (a) straight segments with gates at 2, 2.5 and 3 meters height; (b) right zig–zag segments with gates at 2, 2.5 and 3 meters height; (c) left zig–zag segments with gates at 2, 2.5 and 3 meters height; (d) right curve segments with gates at 2, 2.5 and 3 meters height; (e) left curve segments with gates at 2, 2.5 and 3 meters height; and (f–h) straight segments with elevation changes. These sections were designed to provide diverse flight signals for the drone’s deep learning algorithm to learn and ensure safe navigation.
Each track covers an area of 40 m × 6 m. We replicated each section three times to provide examples with gates of different heights. As shown in Fig. 6, we also designed tracks by interleaving the heights. For instance, Fig. 6(f) depicts a straight section with three gates: gates 1 and 3 have a height of 2 m, while the second gate has a height of 2.5 m. The following section, Fig. 6(g), also has three gates, with gates 1 and 3 having a height of 2.5 m. Finally, the last section is a straight segment where the height of the gates increases gradually, with gate 1 having a height of 2 m, gate 2 having a height of 2.5 m, and the last gate having a height of 3 m, see Fig. 6(h).
To ensure that the drone did not collide, we performed ten runs of the exploration stage to obtain the entry and exit points of the blind zone, and the average of these waypoints was used as a reference in the navigation stage (Fig. 4). In Fig. 7, we display the waypoints discovered during the exploration stage on the training tracks. After that, we captured the new dataset by associating each image mosaic with a flight signal during the navigation stage (Fig. 5). Table 1 shows the number of images collected from each track segment, which together compose the new dataset. In total, 3,476 images were collected from 18 race track segments. Therefore, we use 66.36% fewer mosaics than the original dataset to improve DeepPilot’s performance.
Table 2 compares the flight commands used in the original DeepPilot dataset against the flight command values of the new dataset recorded in the refinement stage. The comparison shows that the range of the flight commands is smaller in the new dataset. For example, the range in roll goes from 0.2 to 0, in pitch from 1 to 0, in yaw from 1 to 0 and in altitude from 0.3 to 0. This distribution means that the waypoint controller mainly controls pitch and yaw, while height and roll do not require constant change, similar to the behaviour of a human pilot. Because the control signals prioritise yaw and pitch, there is no ambiguity between roll and yaw. This enables a single DeepPilot model to control the flight signals, reducing training time.
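The averaging of waypoints over several exploration runs and the dataset-size figures can be checked with a few lines of Python. This is a minimal sketch: the function names are hypothetical, it assumes every run yields the same number of waypoints in the same order, and the per-section image counts are those reported for Table 1.

```python
from statistics import fmean
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def average_waypoints(runs: Sequence[Sequence[Point]]) -> List[Point]:
    """Average the i-th waypoint across runs, coordinate by coordinate."""
    if len({len(run) for run in runs}) != 1:
        raise ValueError("every run must contain the same number of waypoints")
    return [tuple(fmean(axis) for axis in zip(*same_wp)) for same_wp in zip(*runs)]


def reduction_percent(new_size: int, original_size: int) -> float:
    """Percentage by which new_size is smaller than original_size."""
    return 100.0 * (1.0 - new_size / original_size)


# Image counts per section type (Table 1) and the original dataset size.
per_section = {"straight": 614, "zig-zag": 1127, "curves": 1132, "elevation": 603}
total = sum(per_section.values())
print(total, round(reduction_percent(total, 10334), 2))  # prints: 3476 66.36
```

The per-section counts sum to 3,476 mosaics, which matches the total and the 66.36% reduction relative to the 10,334 mosaics of the original dataset.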

Average waypoints discovered during ten runs of the exploration stage on each track section, corresponding to sections from Fig. 6 (a)–(e). This waypoint list served as a reference for the navigation stage while simultaneously generating the dataset to train the improved DeepPilot model. As a result, the dataset only required 3,476 images, representing a 66.36% reduction compared to the original dataset.
Data distribution that composes the new dataset to improve the performance of DeepPilot. The dataset comprises 614 images from straight lines, 1127 from zig–zag sections, 1132 from curves and 603 from elevation changes. Each image is associated with a flight signal provided by the waypoint controller and has a resolution of
Distribution of ground truth flight command values associated as labels to the images in the original dataset used in [26] to train DeepPilot and the new dataset recorded during a refinement stage
System overview
We used an Alienware R5 laptop to carry out our experiments. This laptop features a Core i7 processor, 32 GB of RAM and an NVIDIA GTX 1070 graphics card. It runs the Ubuntu 20.04 LTS operating system and the Robot Operating System (ROS) Noetic Ninjemys. The SSD7 and DeepPilot networks run on the GPU using the TensorFlow and Keras 2.8.0 frameworks.
We designed our communication architecture using the Robot Operating System (ROS). To create the simulation environment, we utilised RotorS [10], which runs on Gazebo 11. This environment enabled us to receive a live video stream from the onboard camera of the Bebop2 at 30 fps while also allowing us to control the drone’s movements simultaneously. Also, we incorporated the Keyboard node to control the drone manually and initiate or cancel the autonomous flight.
To obtain the blind spot waypoints during the exploration stage, we utilised two nodes simultaneously. The first node, DeepPilot, captures the video stream and generates a mosaic image composed of six frames, updated every five frames to provide temporal information. DeepPilot uses the mosaic image as input to provide four flight signals
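The mosaic input described above can be sketched as a small rolling buffer. This is an illustrative sketch only: it assumes that "updated every five frames" means one out of every five camera frames enters the mosaic, and it assumes a layout of three columns by two rows; neither detail is confirmed here. Frames are represented as plain nested lists to keep the example dependency-free, and the class name MosaicBuilder is hypothetical.

```python
from collections import deque
from typing import Deque, List, Optional

Frame = List[List[int]]  # a greyscale image as rows of pixel values


class MosaicBuilder:
    def __init__(self, frames_per_mosaic: int = 6, stride: int = 5, columns: int = 3):
        self.stride = stride      # assumed: keep one camera frame out of every `stride`
        self.columns = columns    # assumed layout: 3 columns x 2 rows
        self.buffer: Deque[Frame] = deque(maxlen=frames_per_mosaic)
        self.seen = 0

    def push(self, frame: Frame) -> Optional[Frame]:
        """Feed one camera frame; return a mosaic when a new one is available."""
        keep = self.seen % self.stride == 0
        self.seen += 1
        if not keep:
            return None
        self.buffer.append(frame)  # the oldest frame is dropped automatically
        if len(self.buffer) < self.buffer.maxlen:
            return None
        frames = list(self.buffer)
        mosaic: Frame = []
        for start in range(0, len(frames), self.columns):
            group = frames[start:start + self.columns]
            # Join the k-th row of every frame in the group side by side.
            mosaic.extend([sum(rows, []) for rows in zip(*group)])
        return mosaic
```

Under these assumptions, at a 30 fps camera stream a new mosaic would become available six times per second once the buffer is full.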
During the navigation stage, we employed a waypoint controller node to guide the drone through each gate on the race track. This node utilised the previously discovered waypoints and the drone’s global position obtained from the Gazebo simulator. In the refinement stage, we ran the waypoint controller and the dataset collection node simultaneously. The dataset collection node saves a mosaic image associated with the flight signals provided by the waypoint controller. It is important to mention that we obtained a new DeepPilot model using the dataset collected on the training racetracks described in Section 3.3.1. This dataset contains images and flight commands related to areas of the racetrack, in contrast to the original DeepPilot dataset, which was created to centre the drone on the gate.
The Keyboard node is solely used for initiating or cancelling autonomous flights during the experiment stage. Further, before each experiment, no information is provided to the system regarding the race track, including the number of gates or their position, orientation, and height.
Race track description
We have created two race tracks to evaluate our approach. The first test race track spans an area of

The figure displays the first test race track, consisting of five gates placed at two meters with various orientations. The race track extends over

Left figure shows the race track, composed of 18 gates at different heights and orientations in the RotorS simulator. The race track extends over
We have established the following constraints for our experiments: 1) The system is not given any prior information about the track, including its size or shape; 2) The number, size, height, orientation, and positions of the gates are unknown to the system; 3) The drone is only aware of its global position in the arena, which is provided by Gazebo; 4) The vehicle will not have access to external feedback during the exploration stage; 5) In the navigation stage, which is the second step of our approach, the modules from stage 1 are disabled, and only the controller for waypoint navigation is activated; 6) The human pilot performs the same number of laps as the artificial pilot in the exploration stage.
Results
We compared the performance of four pilots on two test race tracks: a human pilot, DeepPilot, a waypoint controller, and an improved version of DeepPilot. The human pilot has extensive experience flying real drones, mostly quadrotors, both indoors and outdoors. Although not a professional drone racer, the pilot is well-versed in flying the drone using the RotorS simulator and has the same number of opportunities (ten) as the artificial pilot to fly on the racetrack used in these experiments. Also, we used the DeepPilot model presented in [26] that uses a gate detector to fine-tune the gains for each flight command to complete the racetrack without requiring a new model [31]. In addition, the waypoint controller is based on the average waypoints discovered in ten runs during the exploration stage. Finally, we present a new DeepPilot model, trained with data obtained from the navigation stage, incorporating several enhancements over the original model.
The metrics used to evaluate a pilot’s performance in autonomous drone racing include the number of gates crossed and time. Additionally, we include the distance recorded and speed per lap. The latter provides information to illustrate the neural pilot’s speed and underlines the scenarios in which the neural pilot may benefit from additional training examples.
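The distance and speed per lap can be derived from a log of timestamped positions. The following Python sketch shows one way to do so; the log format (time in seconds followed by x, y, z in metres) and the function name lap_metrics are assumptions for illustration, not the evaluation code used in the experiments.

```python
import math
from typing import Sequence, Tuple

Sample = Tuple[float, float, float, float]  # (t [s], x [m], y [m], z [m])


def lap_metrics(samples: Sequence[Sample]) -> Tuple[float, float, float]:
    """Return (lap time [s], distance [m], average speed [m/s]) for one lap."""
    # Sum the straight-line distance between consecutive position samples.
    distance = sum(math.dist(a[1:], b[1:]) for a, b in zip(samples, samples[1:]))
    lap_time = samples[-1][0] - samples[0][0] if samples else 0.0
    speed = distance / lap_time if lap_time > 0.0 else 0.0
    return lap_time, distance, speed
```

Because the distance is accumulated along the flown path rather than between gates, a pilot that oscillates or re-centres before each gate records a longer distance for the same track.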

Drone trajectories generated when the drone is flown by a human pilot (magenta), by DeepPilot (green), by the waypoint controller (yellow) and by the improved DeepPilot (cyan blue). We highlight the discovered entry waypoints in blue and the discovered exit waypoints in red. Figure (a) shows the top view of these trajectories. The side view in (b) helps to appreciate that the improved DeepPilot shows stable behaviour and more natural movements, similar to the human pilot and the waypoint controller.
Figure 10(a) illustrates a top view of the first test trajectory, where the objective is to compare the performance of the four pilots in a curved section. This figure shows the entry points, represented in blue, and exit points, represented in red, of the blind zone in a single run and the trajectories executed. Note that DeepPilot (green line) centres the drone to be perpendicular to the gate; once the drone is centred, it crosses and goes to the next gate. In contrast, the human pilot (magenta line), the waypoint controller (yellow line) and the improved version of DeepPilot (cyan blue line) navigate smoothly and naturally. Figure 10(b) shows the side view of this evaluation, where DeepPilot (green line) presents slight oscillations between one gate and another, especially when approaching the gate to cross. The waypoint controller (yellow line) changes its height between gates. On the other hand, the improved version of DeepPilot (cyan blue line) and the human pilot (magenta line) keep their height constant because the view of the gate does not require an elevation change.
In Figs 11–14, we illustrate the changes in the value of the control signals used by the four pilots. For instance, in the roll signal, Fig. 11, the human pilot maintains a value of 0, as they only handle pitch and orientation control. Conversely, DeepPilot maintains ranges between 0 and 0.4 and, in between the gates, sends signals up to 1 to correct the drone’s heading. On the other hand, the waypoint controller makes roll changes to bring the drone closer to the reference point. Finally, it can be observed that the improved version of DeepPilot controls most of the orientation, similar to the human pilot. It is also important to note that it keeps roll ranges of less than 0.8 to perform navigation like the human pilot.
Figure 12 illustrates the pitch signal, which determines the speed at which the pilots move forward. The human pilot maintains the maximum value throughout the entire trajectory, while DeepPilot only sends the drone a maximum signal when crossing the gate. The waypoint controller reduces its forward command when approaching the reference points (waypoints in the blind zone) to account for the vehicle’s inertia. The improved version of DeepPilot improves substantially on the original model, as it maintains signals close to the maximum throughout the race track, enabling faster trajectory completion.
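The slowdown of the waypoint controller near a reference point can be illustrated with a small sketch. It assumes a saturated proportional law on the distance to the waypoint; the gain `kp`, the command range and the function name are hypothetical and are not taken from the controller used in the experiments, which may differ.

```python
def pitch_command(distance_to_waypoint: float,
                  kp: float = 0.5,
                  max_cmd: float = 1.0) -> float:
    """Forward (pitch) command from a saturated proportional law.

    kp and max_cmd are illustrative assumptions, not values from the paper.
    """
    raw = kp * distance_to_waypoint
    # Clamp to the normalised command range [-max_cmd, max_cmd].
    return max(-max_cmd, min(max_cmd, raw))


# Far from the waypoint the command saturates; close to it the command shrinks.
for d in (10.0, 2.0, 1.0, 0.2):
    print(d, pitch_command(d))
```

Under this assumption the command stays at its maximum while the waypoint is far away and decreases only in the last stretch, which is the qualitative behaviour described for the pitch signal of the waypoint controller.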

Fig. 11. Range of values used to control the roll of the drone in the first test race track. The first plot shows the roll commands executed by a human pilot, which remain constant at 0. The second plot corresponds to DeepPilot [30], which adjusts the roll only when approaching the gate to centre the drone. The third plot shows the roll adjustments made by the waypoint controller at the start of a turn. Lastly, the improved DeepPilot modifies the roll when starting the runway and crossing the gates, similar to the waypoint controller.

Fig. 12. Range of values used to control the pitch of the drone in the first test race track. The first plot shows the pitch commands executed by a human pilot, which remain constant at 1. The second plot corresponds to DeepPilot [30], where the pitch is set to a maximum value only when approaching the gate to cross it. The third plot shows the pitch adjustments made by the waypoint controller at the entrance and exit of the blind spot zone. Lastly, the improved DeepPilot maintains a constant pitch value, similar to the human pilot.

Fig. 13. Range of values used to control the yaw signal of the drone in the first test race track. The first plot shows the yaw commands executed by a human pilot, which remain constant at 0 and only change at the start of a curve. The second plot corresponds to DeepPilot [30], which maintains a constant yaw value and only adjusts it when approaching a gate to centre the drone. The third plot shows the yaw adjustments made by the waypoint controller when the drone moves to the next waypoint. Lastly, the improved DeepPilot modifies the yaw signal, similar to the waypoint controller, but only at low speeds.

Fig. 14. Range of values used to control the altitude signal of the drone in the first test race track. The first plot shows the constant altitude value maintained by the human pilot at 0. The second plot corresponds to DeepPilot [30], which changes the altitude signal continuously and sets it to 1 when approaching a gate. Finally, the third plot shows the constant altitude adjustments made by the waypoint controller, similar to the improved DeepPilot.
In Fig. 13, we illustrate the control signals used by the pilots for yaw changes. Notably, the orientation changes made by the human pilot, the waypoint controller, and the improved version of DeepPilot are similar, as they all change direction towards the next gate. The original DeepPilot model does not perform this action: it keeps its yaw signal at 0 during navigation and only makes changes, of up to 0.6, when close to the gate. The altitude control signals are shown in Fig. 14, where the most notable changes are seen with the original DeepPilot model. During the flight, it oscillates between gates, and only when it is directly in front of a gate does it send a control signal of up to 1 to cross it.

Fig. 15. Drone trajectories generated when the drone is flown by a human pilot (magenta), by DeepPilot (green), by the waypoint controller (yellow) and by the improved DeepPilot (blue). The discovered enter waypoint is highlighted in blue and the discovered exit waypoint in red. (a) Top view of these trajectories. (b) Side view, which shows that DeepPilot exhibits more oscillations, whereas the human pilot and the waypoint controller perform very similarly. Note that the improved DeepPilot shows stable behaviour and more natural movements.

Fig. 16. Comparison of the roll commands executed during the second race track. The human pilot, waypoint controller, and improved DeepPilot make similar gradual changes of less than 0.4 in roll. DeepPilot, on the other hand, shows abrupt changes when passing between gates to centre the drone.

Fig. 17. Comparison of the pitch commands executed during the second race track. The human pilot, waypoint controller, and improved DeepPilot maintain a pitch value of over 0.8. In contrast, DeepPilot varies the pitch value and only sets it to 1 when the drone is centred on a gate.

Fig. 18. Comparison of the yaw commands executed during the second race track. The human pilot turns towards the gate, so the yaw value depends on the position of the next gate. DeepPilot varies the yaw value along the racetrack and sets a maximum value when the drone is close to the gate. Finally, the waypoint controller and improved DeepPilot change the yaw value while the drone is in the blind spot.

Fig. 19. Comparison of the altitude commands executed during the second race track. The human pilot maintains a constant altitude, except when the next gate is higher, in which case the pilot sets a maximum height value. The waypoint controller and improved DeepPilot gradually adjust the altitude value as the drone approaches the gate, with a higher adjustment than the human pilot when necessary. In contrast, DeepPilot varies the altitude value along the racetrack and sets a maximum value when the drone is close to the gate.
Figure 15(a) compares the trajectory performed by a human pilot (magenta) with that performed by DeepPilot (green) in the second racetrack. Also shown are the entry (blue) and exit (red) points discovered in the blind spot zone of each gate. These points were discovered in one lap. In this figure, the human pilot follows a more stable trajectory, whereas DeepPilot produces multiple oscillations, especially between gates. The waypoint controller, however, exhibits stable movements, similar to the human pilot’s. In Fig. 16, we compare the roll signal changes performed by the pilots. Here, DeepPilot uses a low range of values to complete the racetrack without collision, which takes more time per lap, while the human pilot, the waypoint controller and the improved DeepPilot guide the drone with higher values and complete the race track faster.
Additionally, we compare the pitch signals that the pilots use to move the drone forward; see Fig. 17. Note that DeepPilot makes continual speed changes that provoke jerks. In contrast, the other pilots maintain a nearly constant speed, which allows them to complete the race track in less time; the waypoint controller, for example, only reduces the pitch signal when the drone is close to the reference.
Concerning the yaw signals (Fig. 18), the improved DeepPilot varies its yaw signal because the model has learned to cross the gate without needing to be perpendicular to it and, once across the gate, it corrects its orientation towards the next gate.
Figure 15(b) helps to appreciate the variations in height for all these approaches, with DeepPilot having the largest oscillations. These changes in height are illustrated in Fig. 19, where the human pilot, the waypoint controller, and the improved DeepPilot keep the altitude signal within a low range and only change it when necessary. In contrast, the original DeepPilot repeatedly increases and decreases the altitude value, which increases the time needed to complete the race track.

Fig. 20. Top view of ten drone trajectories generated by the waypoint controller using the average of the discovered waypoints. The waypoints were discovered during 10 laps of our approach using DeepPilot and the gate detector. They are depicted in light blue and light red, corresponding to the enter and exit waypoints of the blind spot zone. The average waypoints are indicated in blue and red, respectively. A zoom-in view helps to appreciate the discovered enter waypoints (light blue points), the average waypoint (blue point), and the drone trajectories generated by the waypoint controller. Note that the trajectories are very similar.
In Fig. 20, we present a top view of the second race track, highlighting the entry points (light blue) and exit points (light red) of the blind spot identified after ten laps of the exploration stage. Additionally, we display the averaged waypoints representing the entrance and exit points of the blind spot in blue and red, respectively. Drone trajectories obtained from ten runs using the waypoint controller are also depicted. On the right, a zoomed-in square displays the trajectories and, in this case, the various entry points discovered by DeepPilot during the ten laps. The average position is marked in blue. The waypoint controller generates very similar drone trajectories when using the average waypoints. Note that these waypoints are used exclusively for comparing the performance of the waypoint controller against the improved DeepPilot model, which learns from different track sections and not from the entire trajectory.
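As an illustration of how the averaged waypoints can be obtained, the following minimal sketch takes the positions found for the same waypoint over several exploration laps and computes their component-wise mean. The function name and the coordinates are hypothetical; only the averaging step itself is taken from the description above.

```python
from statistics import fmean
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, z) drone position


def average_waypoint(points: List[Point]) -> Point:
    """Component-wise mean of one waypoint observed over several laps."""
    xs, ys, zs = zip(*points)
    return (fmean(xs), fmean(ys), fmean(zs))


# Made-up enter waypoints of a single gate found in three exploration laps.
enter_points = [(1.0, 2.0, 1.5), (1.2, 2.1, 1.5), (0.8, 1.9, 1.5)]
print(average_waypoint(enter_points))
```

The same function would be applied separately to the enter and the exit waypoints of every gate.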
To summarise our results, Table 3 displays the average outcomes from ten testing runs on the first and second test race tracks. We compare the performance of a human pilot, the neural pilot (DeepPilot), the waypoint controller using the discovered waypoints after one lap with our proposed approach, and the waypoint controller using averaged waypoints discovered after ten laps in the exploration stage. Additionally, we report the time, speed, and trajectory achieved by the improved DeepPilot.
During the exploration stage, DeepPilot takes significantly more time than the human pilot, approximately 7.56 times longer on the curve race track and 8.52 times longer on the ellipse race track. However, once the waypoints are discovered after just one lap, the waypoint controller significantly reduces this time, especially on the ellipse race track (7.19 times faster). Furthermore, if more laps are allowed (10 laps) to discover additional waypoints, the waypoint controller utilises the average of these waypoints, resulting in an additional reduction of 1.67 seconds, indicating that the waypoints improve with more laps.
On average, the human pilot flies the drone at speeds of
The strategy employed to identify points of interest and use them to complete the track more efficiently has enabled the generation of a new dataset for training a new DeepPilot model, reducing the time required to complete the track. For instance, on the elliptical track, the original DeepPilot model takes 533.4 seconds to complete a run, while the improved model takes only 206.9 seconds. The figures also show that the improved DeepPilot guides the drone in a manner similar to the waypoint controller despite being trained only on specific track sections.
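The gain on the elliptical track can be checked with a short calculation from the two completion times quoted above. The variable names are illustrative; the times are the only values taken from the text.

```python
original_time_s = 533.4   # original DeepPilot, elliptical track
improved_time_s = 206.9   # improved DeepPilot, elliptical track

# How many times faster the improved model completes the track.
speedup = original_time_s / improved_time_s
# Fraction of the original completion time that is saved.
time_saved = (original_time_s - improved_time_s) / original_time_s

print(f"speed-up: {speedup:.2f}x, time saved: {time_saved:.1%}")
```

The improved model is therefore roughly 2.6 times faster on this track, which corresponds to a reduction of about 61% in completion time.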
Table 3. Average results for 10 runs testing the performance of a human pilot, the neural pilot (DeepPilot), the waypoint controller using the discovered waypoints in 1 lap, and the waypoint controller using averaged waypoints discovered after 10 laps
Inspired by how human pilots train on an unfamiliar race track to familiarise themselves with it, we have presented an approach for autonomous drone racing that improves DeepPilot’s performance. This artificial pilot, trained to regress basic flight commands from camera images in which a gate is observed, can navigate an unknown race track but still does so at low speed. Since the goal of the artificial pilot is to complete the race track faster, we proposed a strategy that allows it to learn from experience to improve its performance, just as human pilots do.
Our strategy consists of three stages: exploration, navigation, and refinement. None of the stages is given information about the gates (i.e. their position, orientation, height, or number); only the drone’s position on the race track is used. In the exploration stage, we use a single-shot detector for the drone to visually detect the gates on the race track and, in combination with the neural pilot, automatically discover the entry and exit positions of what we call the ‘blind spot zone’, where the gate is no longer visible during the crossing. Once these positions are discovered, we move to the navigation stage, where the waypoints found in the exploration stage serve as a reference for a flight controller to perform a much faster flight. Finally, in the refinement stage, we create a new dataset by associating the drone’s camera images with the flight signals obtained from the flight controller to train a new DeepPilot model and improve its performance.
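The discovery of the blind spot zone in the exploration stage can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in this work: it assumes a per-frame log of the drone position and a Boolean flag from the gate detector, takes the enter waypoint as the position where the detection is lost, and takes the exit waypoint as the position where a gate is detected again. The function name and the log format are hypothetical.

```python
from typing import List, Optional, Tuple

Position = Tuple[float, float, float]  # (x, y, z) drone position


def blind_spot_zones(
    log: List[Tuple[Position, bool]],
) -> List[Tuple[Position, Position]]:
    """Return (enter, exit) waypoint pairs from (position, gate_detected) frames."""
    zones: List[Tuple[Position, Position]] = []
    enter: Optional[Position] = None
    prev_visible = False
    for position, visible in log:
        if prev_visible and not visible:
            # Detection just lost: assumed entry of the blind spot zone.
            enter = position
        elif enter is not None and visible and not prev_visible:
            # Detection regained: assumed exit of the blind spot zone.
            zones.append((enter, position))
            enter = None
        prev_visible = visible
    return zones


# Made-up flight log: the gate is lost at x = 2 and a gate is seen again at x = 4.
flight_log = [
    ((0.0, 0.0, 1.0), True),
    ((1.0, 0.0, 1.0), True),
    ((2.0, 0.0, 1.0), False),
    ((3.0, 0.0, 1.0), False),
    ((4.0, 0.0, 1.0), True),
]
print(blind_spot_zones(flight_log))
```

In practice a detector can miss a gate for a few isolated frames, so a real implementation would probably need to filter such dropouts before treating a lost detection as the start of a blind spot zone.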
Our approach enables DeepPilot to learn new movements and complete racetracks faster than the original model. We propose training DeepPilot on various key sections, as depicted in Fig. 6, to enhance its adaptability to diverse scenarios and track types. This increases the neural pilot’s versatility, allowing it to handle unexpected situations effectively. The results demonstrate that the improved DeepPilot navigates naturally and smoothly, resembling the behaviour of human pilots. Furthermore, as the examples show, the new version does not require the drone to be perpendicular to the gate to cross it. Additionally, its control signals prioritise yaw and pitch, eliminating the ambiguity between roll and yaw. Therefore, unlike the original DeepPilot, which required three specialised models, the improved DeepPilot only requires a single model to navigate the track.
In our experiments, we compared its performance to that of a human pilot in the RotorS simulator. Although the human pilot completed the racetrack faster than the other pilots in the experiment, we managed to improve DeepPilot by using only sections of the race track, obtaining behaviour similar to that of the human pilot. For example, it navigates a curved section by controlling only the yaw and pitch signals while keeping the altitude within a constant range. We consider these results encouraging and promising for our proposed approach. As future work, we will test more sophisticated flight controllers that use the discovered waypoints to train a new DeepPilot model. Additionally, we will address the implementation of our approach in a real outdoor environment, where the principal challenges are obtaining precise drone localisation without external sensors and generating a dataset that is robust to variations in texture and illumination.
