Abstract
Waypoints have enhanced the prospect of fully autonomous drone applications. However, Global Positioning System (GPS) spoofing and signal interference are key issues in waypoint-based drone applications. Also, conceptual waypoint-based drone applications require accurate awareness of waypoints based on environmental cues and integration of additional sensing modalities. Additional sensor modalities may overwhelm drones’ processing resources, reducing operational time. This study proposes W-MobileNet, a denoising model for autonomous trajectory trail navigation based on the precision control of a path planner, the denoising capabilities of Wiener filters, and the perceptual knowledge of convolutional neural networks. Integrating the modules of W-MobileNet results in an intuitive drone navigation controller characterized by position, orientation, and speed estimation. Further, a generic loss function based on adaptive weights, which significantly helps models converge faster during training, is proposed. An extensive evaluation in simulated and real-world experiments shows that W-MobileNet is more favorable in precision and robustness than contemporary state-of-the-art models. W-MobileNet has the potential to become one of the standards for autonomous drone applications.
Introduction
For over a decade, the robotics field, which encompasses Unmanned Aerial Vehicles (UAVs), has drastically transformed diverse industrial tasks. A standard UAV on the market is usually equipped with one or more cameras, and its ability to fly gives it a broader view of terrain. UAVs are currently utilized in fields such as search and rescue missions, aerial surveillance, agricultural technology, industrial inspection, military applications, package delivery, and many more [1]. UAVs can operate in indoor and outdoor environments, and they can be piloted either by a human or autonomously. Due to the projected potential, much attention has been channeled to the automation aspect of UAV navigation. The continued development of deep learning is gradually surpassing human performance in particular tasks, and it is only a matter of time before deep learning-based systems perform better than humans in unmanned systems navigation.
The desire to attain this achievement in UAV navigation has prompted extensive research into autonomous UAV navigation, which draws on sensors, knowledge-based models, and navigation algorithms (e.g., state-of-the-art path planners). Most outdoor UAV navigation methods, in obstacle-free or obstacle-populated environments, rely on GPS, which has proven feasible to some extent but not entirely desirable, as threats to GPS such as spoofing are a continuous challenge. Besides, there are GPS non-functional zones, both outdoors and, to a large extent, indoors. Also, GPS is susceptible to strong signal interference from sources such as radio emissions in nearby bands and jammers. Other UAV navigational methods, based on conceptual waypoints that rely on environmental cues, are usually challenged by unstructured features within the environment. Furthermore, in vision-based navigation, motion blur, aliasing, and other elements considered as noise can negatively influence the navigation of the UAV.
This study presents an attempt toward vision-based autonomous UAV navigation that uses conceptual waypoints as a guide. The contributions of the paper are as follows: (i) a framework, W-MobileNet, comprising an adapted MobileNet [2], a variant Wiener filter [3] as a denoiser, and a variant path planner, the three-dimensional Non-Linear Guidance Law (3D-NLGL) [4], is proposed as a UAV navigational controller; (ii) a novel loss function characterized by adaptive weights that facilitates faster model convergence during training is proposed; and (iii) simulated and real-world experiments are conducted, with comprehensive comparisons against some state-of-the-art navigational controllers, to ascertain the practicability of the proposed navigational controller. Experimental results indicate that W-MobileNet offers intuitive and precise navigation for the UAV.
The paper is organized as follows: Section 2 reviews prior research on UAV navigation, followed by the methodology in Section 3. Section 4 describes the experimental setup and the experiments conducted, together with the results and analysis. Section 5 gives the conclusion and feasible future work.
Related works
Most existing UAV navigation works fall into two categories based on the operational environment: obstacle-populated or obstacle-free. Within the operating environment, the navigation type is either waypoint navigation (i.e., following a predefined trail based on beacons or imaginary reference points) or path planning (i.e., exploratory navigation) [5]. The mechanism within each navigation type is usually either map-aided, which relies on proximity sensors to construct maps of the environment, or vision-based, which does not use built maps.
Map-aided autonomous navigation
Frequently cited literature on map-aided navigation includes, but is not limited to, Simultaneous Localization and Mapping (SLAM) [6], Parallel Tracking and Mapping (PTAM) [7], and Structure from Motion (SfM) [8]. These methods primarily utilize data readings from sensors (e.g., infrared, ultrasonic, and optical flow) in constructing a map. SLAM and PTAM build a 3D model/map of the environment based on the information from the sensors to aid a UAV during navigation. Light Detection and Ranging (LIDAR) and Sound Navigation and Ranging (SONAR) are utilized in obstacle-populated environments. Even though some of these methods work, the amount of computation needed to build a 3D map may overwhelm the computing resources of a UAV. An optimal path for UAV localization based on the Kalman filter and data generated by an ultrasonic sensor attached to a UAV was proposed in [9]. The authors estimated the observation density, which played a crucial role in their navigation based on a predefined constructed terrain map. Their aims were altitude, collision avoidance, stability, and anti-drift control. The authors’ findings made it apparent that the computational complexity and cost were high due to the number of heavyweight sensors attached to the UAV. As detailed in their work, three accelerometers and gyroscopes, four infrared sensors, an ultrasonic sensor, one high-speed motor, and a flight computer were used. Cruz et al. capitalized on the capabilities of proximity sensors to sense obstacles within an environment and placed the detected obstacles on a 3D map to aid a UAV during navigation [10]. Other works on 3D map construction for UAV navigation that utilize LIDAR are those of Stubblebine et al. [9], Bachrach et al. [11], Vandapel et al. [12], Bry et al. [13], and Bachrach et al. [14].
Although the reconstruction of a map to aid the UAV in navigating is feasible, especially in obstacle-populated environments, it comes at the cost of extra usage of sensing modalities susceptible to calibration errors and high computational complexity. Again, methods based on reconstructing maps of an environment with fewer environmental features are prone to failure.
Vision-based autonomous navigation
From the viewpoint of vision-based navigation, a UAV was navigated in an obstacle-populated environment based on Reinforcement Learning (RL) and model predictive control [15]. A similar study used RL to guide a UAV through an indoor environment [16]. Again, an RL and shooter model was adopted as a UAV navigation controller [17]. A work building on [17] is that of Chao et al. [18], which utilized a model-free RL technique rather than the standard Deep Neural Networks (DNNs) adopted in [19]. Although the RL approach toward autonomous UAV navigation eliminates the task of data labeling, its drawback is the overloaded states, which can lead to diminishing results. In reality, using the RL method can be costly since its learning process is based on feedback from previous predictions (correct or wrong). Usually, wrong predictions lead to crashes of the UAV and hence to wear and tear. Reference [20] estimated positions and orientations for UAV navigation based on a computationally efficient Convolutional Neural Network (CNN) that utilized transfer learning. Despite the advantage of transfer learning in the authors’ proposition, their method exhibited large baselines, which can cause scale-invariant feature transform localizers to fail drastically. Again, knowledge transferred from one domain may not generalize to other domains with varying features in the data (e.g., initializing a Deep CNN (DCNN) being trained to learn geometric features with ImageNet weights). A YOLO-based DCNN was used to process the video feed of a micro aerial vehicle to trail predefined paths [21]. Although the results section of the authors’ work indicated their method’s success, the generalization capabilities of the learned policy to an unseen environment were not established. Further, an anchor-based YOLO version was used, reducing inference speed and accuracy compared to anchor-free object detectors.
An optimized DCNN compatible with varying image resolutions was proposed to aid a UAV in following a trail in an unstructured environment [22]. The authors acknowledged the limitation of generalization when their optimized DCNN was trained using low-to-high resolution images. Further, the proposition in the paper restricts the UAV to forward, backward, left, and right movements only, at a fixed altitude and speed.
Based on imitation learning, a Recurrent Neural Network (RNN) was trained end-to-end together with a Long Short-Term Memory (LSTM) network to guide a UAV through a room [23]. The authors assert that a pre-trained network requires little training data and also serves as a reasonable basis for training new models; hence much time is saved during training. Moreover, they argue that incorporating a limited time window in the LSTM during training yields better results than training with previous images. Although the limited time window eliminates the correlation problem, it also leads to higher variance and computational complexity, which slows down the training process and reduces inference speed. A DCNN was trained as a supervised image classifier to guide a UAV along a hiker’s trail [24]. The presented approach is limited to discrete orientations and positions, which restricts the movement of the UAV. A two-class multilayer perceptron was proposed as a tracker and detector, which used mapped images from the front camera of a UAV for navigation during the inspection of aerial power lines [25]. The network was trained to separate the background from objects within the environment to aid the UAV in navigating the inspected power line. Since the approach depends on extracting background information for motion estimation, it is challenging to apply in environments with little background information.
Contrary to the numerous deep learning approaches used in UAV navigation, the proposed method in this study aims to achieve precision control and eliminate additional sensing modalities for autonomous UAV navigation. The W-MobileNet framework consists, in sequence, of a denoiser for image restoration, a lightweight DCNN model for feature extraction and inference making, and a path planner to streamline navigational commands. Next, the methodology section elaborates on the proposed method.
Methodology
Data
The Microsoft AirSim simulator [26], built on top of Unreal Engine, provides a realistic synthetic environment and is used for data collection. The environment is set in the Landscape Mountains, characterized by snowy mountains, lakes, forests, and rocky lands. A Pixhawk-4 (PX4), which supports Software-in-the-Loop (SITL), is used together with the Cygwin toolchain to manually fly the quadrotor (UAV) in the virtual environment to collect data (see Fig. 1). The manual flight covers 50 predefined waypoints over a distance of 750 meters, serving as the reference trajectory. Each waypoint has a gateway inspired by the AirSim drone racing lab [27]. The gateways serve two purposes: (i) as an evaluation metric, that is, to check whether the UAV successfully passes through the gateway, and (ii) as a visual aid for the turn anticipation of the UAV, which helps in the tangential interception of a segment of the reference trajectory. A total of 61,274 frames were collected together with flight telemetry and deposited at [28]. Data augmentation is carried out on 20% of the total frames to add slight variance to the data. The details of the data are tabulated in Table 1.

Illustration of manual flight through waypoints/gateways in the Landscape Mountains environment via the Microsoft AirSim simulator. The blue triangle indicates the Field of View (FOV) of the UAV.
Summary of collected synthetic data
Within the RGB color space scheme (channels), pixels within each channel are subject to some perturbation (i.e., an adversarial attack that, in the context of UAV navigation, arises from the swift movement of the UAV while collecting imagery data in an environment). An adversarial attack here is a variety of noise (e.g., additive, Gaussian, or Poisson noise) that interferes with the information within an image. A classical approach to handling such noise is the Wiener filter, which finds a trade-off between noise smoothing and inverse filtering by removing the noise and inverting the blurring effect so as to minimize the mean square error. The Wiener filter in the frequency domain is given as:

Representation of denoised images using the adaptive Wiener filter.
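The local-statistics idea behind an adaptive Wiener filter can be illustrated with a short sketch. The code below is not the variant used in W-MobileNet, whose details are not recoverable from this text; it is the textbook pixel-wise form, in which each pixel is pulled toward its neighbourhood mean by a gain that depends on the local variance relative to an estimated noise variance. The function name `adaptive_wiener`, the window size `k`, the single-channel input, and the border handling (windows clipped at the image edge) are assumptions made for illustration.

```python
def adaptive_wiener(img, k=3, noise=None):
    """Pixel-wise adaptive Wiener filter on a 2-D list of floats (one channel).

    k     : odd window size used for the local mean and variance.
    noise : noise variance; when None it is estimated as the average of the
            local variances (an assumption of this sketch).
    """
    h, w = len(img), len(img[0])
    r = k // 2
    mean = [[0.0] * w for _ in range(h)]
    var = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Neighbourhood clipped to the image bounds (no padding).
            vals = [img[j][i]
                    for j in range(max(0, y - r), min(h, y + r + 1))
                    for i in range(max(0, x - r), min(w, x + r + 1))]
            m = sum(vals) / len(vals)
            mean[y][x] = m
            var[y][x] = sum((v - m) ** 2 for v in vals) / len(vals)
    if noise is None:
        noise = sum(sum(row) for row in var) / (h * w)
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Gain is 0 in flat regions (pure smoothing) and approaches 1
            # where the local variance dominates the noise (detail is kept).
            gain = max(var[y][x] - noise, 0.0) / max(var[y][x], noise, 1e-12)
            out[y][x] = mean[y][x] + gain * (img[y][x] - mean[y][x])
    return out


if __name__ == "__main__":
    # Hypothetical 5 x 5 flat patch with a single noisy spike in the centre.
    patch = [[10.0] * 5 for _ in range(5)]
    patch[2][2] = 100.0
    restored = adaptive_wiener(patch)
    print(round(restored[2][2], 2), restored[0][0])
```

For an RGB frame the same routine would be applied to each channel separately; a production pipeline would more likely rely on an optimized library routine than on nested Python loops.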
In this section, the modifications to the inherited DCNN model are elaborated. MobileNet has a computation time per epoch of 31 seconds and a total of 28 layers comprising depthwise and pointwise convolutions, each followed by batch normalization and ReLU activation. An average pooling layer aggregates the extracted features, which are then fed to a Fully Connected (FC) layer and a softmax layer for classification. The following modifications are introduced, and the justification for the changes resulting in the proposed W-MobileNet is given.
Stem block
The adaptive Wiener filter is fused as a stem block, and as already explained, its purpose is to denoise frames before the modified DCNN of W-MobileNet starts convolving on the training data. It must be noted that the stem block is not affected by backpropagation during training; as such, W-MobileNet can be seen as a framework and not a classical DCNN model.
Spatial Separable convolution
The inherited model (MobileNet) uses depthwise separable convolution, which conceptually is a two-stage operation. First, a channel-wise convolution is performed on each channel (i.e., a 3 × 3 × 1 kernel per RGB channel). Afterward, a 1 × 1 × 3 pointwise convolution linearly combines the outputs of the channel-wise convolution. Although depthwise convolution in MobileNet enormously reduces the computational complexity and inference time, we conjecture that it can be further reduced with little or no depreciation in performance by using spatial separable convolutions in place of some of the depthwise layers. The spatial separable convolution merges the two-stage operation in depthwise convolution (the channel-wise and pointwise stages) into a single stage; this reduces every two (depthwise) layers in the MobileNet architecture to one layer. Since the Keras framework defines both the pointwise and depthwise initializers, regularizers, and constraints of its 2D separable convolution within the same __init__() function, the implementation is feasible.
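The saving obtained from the two-stage factorization described above can be checked with simple arithmetic. The sketch below counts the weights of a standard k × k convolution and of its channel-wise plus pointwise factorization; the layer sizes (32 input channels, 64 output channels) are hypothetical and biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a channel-wise k x k stage plus a 1 x 1 pointwise stage."""
    depthwise = k * k * c_in   # one k x k kernel per input channel
    pointwise = c_in * c_out   # 1 x 1 kernels mixing the channels
    return depthwise + pointwise


if __name__ == "__main__":
    k, c_in, c_out = 3, 32, 64   # hypothetical layer sizes
    full = standard_conv_params(k, c_in, c_out)
    separable = depthwise_separable_params(k, c_in, c_out)
    print(full, separable, separable / full)
```

The ratio of the two counts equals 1/c_out + 1/k², which is why the factorization pays off most for layers with many output channels.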
Shallowing the network
Again, since some repetitive layers in MobileNet do not significantly contribute to the model’s performance, eliminating such layers is reasonable. Therefore, the five repetitive layers with an output shape of 14 × 14 × 512 are reduced to three. Readers are referred to Table 1 in reference [2] for in-depth details. The elimination of two repetitive layers reduces the computational complexity and the inference time.
Branch FC layers
The final modification is three separate branch FC layers that take convolved features from the average pooling layer. Each branch is responsible for estimating one of the three needed navigational control inputs (position, orientation, and speed). Each branch has a softmax layer for classification and a linear regressor for regression. In addition to the modifications, the Swish activation function is utilized [30], while batch normalization is maintained as in MobileNet.
Next, the working principle of W-MobileNet is explained. A 224 × 224 × 3 image goes through the stem block of W-MobileNet (there is no downsampling operation in the stem block). After denoising, the image goes through the second module of the W-MobileNet framework (the modified MobileNet). Here, downsampling takes effect using a stride of 2 within the first convolutional layer and the first four sequential depthwise layers. Lastly, average pooling is applied to the 7 × 7 output, which is converted into a one-dimensional vector and then fed to the three branches of the fully connected layers. During training, W-MobileNet is fed with images and the associated labels, which are sets of positions in North, East, and Down (NED) coordinates, orientations in quaternions (rather than rotation matrices, due to computational complexity and UAV drifts), and speed. An example of the associated labels is as follows:
As suggested in [26], orientation in quaternions within the AirSim simulator is much more stable. A quaternion is a four-component value with one real and three imaginary components. A quaternion can be expressed as a sum of a scalar q0 and a vector q = (q1, q2, q3) as:
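In the usual Hamilton notation, such a sum is written q0 + q1 i + q2 j + q3 k, with i, j, and k the imaginary units. Since the path planner described later works with roll, pitch, and yaw, a small sketch of the standard conversion from a unit quaternion to these angles may help. The function names and the aerospace (yaw-pitch-roll) rotation order are assumptions of this sketch and are not taken from the paper.

```python
import math


def normalize(q):
    """Scale a quaternion (q0, q1, q2, q3) to unit length."""
    n = math.sqrt(sum(c * c for c in q))
    if n == 0.0:
        raise ValueError("zero quaternion has no orientation")
    return tuple(c / n for c in q)


def quaternion_to_euler(q):
    """Return (roll, pitch, yaw) in radians for q = (q0, q1, q2, q3).

    q0 is the scalar part; the yaw-pitch-roll (Z-Y-X) convention is assumed.
    """
    q0, q1, q2, q3 = normalize(q)
    roll = math.atan2(2.0 * (q0 * q1 + q2 * q3),
                      1.0 - 2.0 * (q1 * q1 + q2 * q2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    s = max(-1.0, min(1.0, 2.0 * (q0 * q2 - q3 * q1)))
    pitch = math.asin(s)
    yaw = math.atan2(2.0 * (q0 * q3 + q1 * q2),
                     1.0 - 2.0 * (q2 * q2 + q3 * q3))
    return roll, pitch, yaw


if __name__ == "__main__":
    # A 90-degree rotation about the vertical (z) axis.
    half = math.pi / 4.0
    print(quaternion_to_euler((math.cos(half), 0.0, 0.0, math.sin(half))))
```

Near a pitch of ±90° the Euler representation degenerates, which is one reason labels are often kept in quaternion form and converted only when angles are needed.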

A complete framework of W-MobileNet: (a) denotes the adapted Wiener filters for image restoration, (b) represents the modified MobileNet for feature extraction and inference making, and (c) illustrates the modified 3D-NLGL for streamlining navigational commands. Best seen at a zoom resolution > 140%.
In [4], with some limitations, a two-dimensional (2D) NLGL algorithm is modified to work in a three-dimensional space. Briefly, the NLGL algorithm uses the Virtual Target Point (VTP), which is an imaginary point on the desired path, with L denoting the distance between the UAV’s current position and the desired destination, to move the UAV via periodic updates/iterations. In [4], the author’s modification requires feeding the 3D-NLGL with the current vehicle (UAV) position (x, y, z), the yaw (ψ), the distance L, and the desired velocity V_path of the vehicle in the x body frame of the vehicle using a constant speed. The algorithm then returns velocities in the x, y, and z directions, with z representing the altitude, and the yaw angle ψ for rotation. From the algorithm description given, there is a limitation in orientation (no roll and pitch). Again, a constant speed is used in deriving the velocities in the x, y, and z directions, and the velocity derived in the z direction is substituted directly to get the altitude. Although the approach used in [4] to gain altitude is feasible, flying vehicles desirably gain altitude via orientation, by pitching up/down at some velocity. Based on the limitations outlined, this study extends the 3D-NLGL algorithm in [4] (hereafter 3D-NLGL-X) to output orientations in terms of roll, yaw, and pitch, and velocities in the x, y, and z directions derived from alternating speed. As such, the UAV has six degrees (up, down, left, right, forward, and back) of freedom of movement within a three-dimensional space; see Fig. 4 for visual insight.

Graphical insight into the 3D-NLGL-X.
First, the drag force acting on the UAV is taken into consideration. The drag force acting on the UAV in all weather conditions in the virtual environment is assumed to be calculated as:
From lines 22 to 28 in Algorithm 1, the required orientation (roll ∅_D, pitch ∅_D, and yaw ψ_D), which must be ⩽ π to align the UAV with the selected γVPT, is computed. We found this experimentally efficient as it allows smooth orientation movement in small intermittent angles rather than swift angle rotation. On line 28 of Algorithm 1, the required velocity is computed; it is set proportional to dVPTx,y,z, the projection in xyz space about the distance L, in contrast to the xy plane used in [4]. The return of Algorithm 1, as seen on line 29, is sent as the navigational commands (line 30) to the UAV, where the commands are integrated with the Inertial Measurement Unit (IMU) of the quadrotor (UAV), and the UAV updates its position.
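Algorithm 1 is not reproduced in this text, so the following is only a minimal sketch of the general look-ahead idea behind NLGL-style guidance and not the 3D-NLGL-X itself. It places a virtual target point on a straight path segment at distance L from the vehicle, then derives a yaw, a pitch, and a velocity vector toward it. The function names, the straight-segment path, the NED sign convention (z positive downward), and the omission of roll and drag are all assumptions made for illustration.

```python
import math


def virtual_target_point(p, a, b, L):
    """Point on segment a->b at distance L from p, taking the intersection
    furthest along the segment. If the sphere of radius L around p does not
    reach the line, the closest point on the line is used. The result is
    clamped to the segment, so it may lie nearer or farther than L."""
    d = [b[i] - a[i] for i in range(3)]
    f = [a[i] - p[i] for i in range(3)]
    A = sum(c * c for c in d)
    if A == 0.0:
        raise ValueError("segment end points must differ")
    B = 2.0 * sum(f[i] * d[i] for i in range(3))
    C = sum(c * c for c in f) - L * L
    disc = B * B - 4.0 * A * C
    if disc >= 0.0:
        t = (-B + math.sqrt(disc)) / (2.0 * A)
    else:
        t = -B / (2.0 * A)   # orthogonal projection of p onto the line
    t = max(0.0, min(1.0, t))
    return tuple(a[i] + t * d[i] for i in range(3))


def heading_to(p, target):
    """Yaw and pitch (radians) from p toward target in NED coordinates."""
    dx, dy, dz = (target[i] - p[i] for i in range(3))
    yaw = math.atan2(dy, dx)
    pitch = math.atan2(-dz, math.hypot(dx, dy))   # climbing means negative z
    return yaw, pitch


def velocity_command(p, target, speed):
    """Velocity vector of magnitude `speed` pointing from p to target."""
    delta = [target[i] - p[i] for i in range(3)]
    dist = math.sqrt(sum(c * c for c in delta))
    if dist == 0.0:
        return (0.0, 0.0, 0.0)
    return tuple(speed * c / dist for c in delta)


if __name__ == "__main__":
    pos = (0.0, 0.0, 0.0)
    vtp = virtual_target_point(pos, (0.0, 3.0, 0.0), (10.0, 3.0, 0.0), 5.0)
    print(vtp, heading_to(pos, vtp), velocity_command(pos, vtp, 5.0))
```

Calling these three functions once per control cycle, with the position refreshed from telemetry, gives the periodic update behaviour described for the VTP.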
DCNN/CNN models are trained using optimization algorithms that require loss functions to update the model’s weights via backpropagation so as to minimize the loss associated with the model’s next prediction. In DCNN/CNN regression tasks, the Mean Squared Error (MSE), which computes the average squared difference between a model’s predictions and the ground truth (actual values), is the usual preference. However, MSE is sensitive to outliers in data, since squaring the residual magnifies the error when the difference between the prediction and the actual value is large. As a result, when the weights of a DCNN/CNN model are nearing their optimum during training, an outlier may disrupt the nearly optimal weights due to the magnified error; this makes the MSE less robust than the Mean Absolute Error (MAE), which in turn is less stable. To this end, the Adaptive Weighted Loss (ADWL) is proposed in this study. ADWL is given as:
By definition, ADWL squares the residual between the predicted and the actual values plus or minus the adaptive weight, which is initially set to an initializer l_i = 0.1, …, 0.5 and subsequently replaced with the loss l after each iteration/epoch. Figure 5 shows the losses of the W-MobileNet framework trained with the proposed loss function ADWL and with the existing loss functions: MSE, MAE, Mean Squared Logarithmic Error (MSLE), and Mean Bias Error (MBE). From Fig. 5, using ADWL results in early convergence and a reduction in spikes, which correspond to abrupt changes in the near-optimal weights of W-MobileNet during training. Among the existing cost functions (MSE, MAE, MSLE, and MBE), MSE shows a much more competitive performance; however, the spiky nature of MSE is a challenge.

Loss comparison for ADWL and state-of-the-art loss functions used in training of W-MobileNet.
In Table 2, dummy data is utilized to give further elaboration. It can be seen that the MSE heavily penalizes significant errors, as seen from row 2, whereas ADWL relaxes the penalty on significant errors. On the other hand, MAE does not magnify significant errors (its penalty is directly proportional to the residual). Also, it does not take into consideration the sign of negative residuals, which leads to the less stable nature of the MAE. The MBE tends to run into negatives, as seen from rows 1 and 2, thereby going beneath the global minimum (0). Lastly, the MSLE treats small and large residuals nearly the same, as in rows 1 and 2.
Synopsis analysis of ADWL and existing loss functions on dummy data
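The qualitative behaviour of the existing loss functions discussed above can be reproduced with a few lines of code. The sketch below implements the standard definitions of MSE, MAE, MBE, and MSLE only; ADWL is deliberately not implemented because its equation is not recoverable from this text. The data are hypothetical and are not the values of Table 2, and the sign convention for MBE (prediction minus actual) is an assumption.

```python
import math


def mse(y_true, y_pred):
    """Mean Squared Error."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mbe(y_true, y_pred):
    """Mean Bias Error (prediction minus actual); can be negative."""
    return sum(p - t for t, p in zip(y_true, y_pred)) / len(y_true)


def msle(y_true, y_pred):
    """Mean Squared Logarithmic Error; inputs must be greater than -1."""
    return sum((math.log1p(p) - math.log1p(t)) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    actual = [1.0, 2.0, 3.0, 4.0]          # hypothetical targets
    clean = [1.1, 1.9, 3.2, 3.8]           # small residuals only
    with_outlier = [1.1, 1.9, 3.2, 9.0]    # last prediction is an outlier
    for name, fn in (("MSE", mse), ("MAE", mae), ("MBE", mbe), ("MSLE", msle)):
        print(name, fn(actual, clean), fn(actual, with_outlier))
```

On this toy data the single outlier inflates the MSE by a far larger factor than the MAE, which is the magnification effect attributed to squaring the residual.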
Simulated and real-world experiments with varying evaluation metrics are carried out to evaluate W-MobileNet, which is compared with the reference trajectory in addition to the following comparators: (i) A Multi-Task Regression-based Learning (MTRL) method, which adapts the architecture of the Siamese network and predicts positions and orientations; refer to Fig. 2 in [31] for graphical insight. (ii) An Iterative Learning Controller (ILC) encapsulated with a feedback Proportional-Derivative (PD) controller for stability [32]; this is a non-neural-network method. The ILC is a well-established control strategy for non-linear systems (e.g., robotic systems). Moreover, the ILC is an intelligent control methodology that utilizes historical data to improve its subsequent predictions/actions; this operational principle has similarities with DCNN models, since DCNN models also use historical data prior to predictions. (iii) An ablated W-MobileNet, herein MobileNet-X. MobileNet-X is without the stem block, the modifications to the backbone module, and the integration with the 3D-NLGL-X; hence the difference in performance between W-MobileNet and MobileNet-X denotes the improvements W-MobileNet offers.
Simulated experiment, setup, and W-MobileNet training
Virtual verification is via the Microsoft AirSim platform built on Unreal Engine, which provides a realistic synthetic 3D environment. W-MobileNet is built as a top layer controller on the simulator using AirSim APIs. Figure 6 illustrates the basic structure of the simulation platform.

Schematic of the simulation platform.
Data is resized to 224 × 224 × 3 and pre-processed using channel mean-subtraction to center the data. The channel mean-subtraction allows each feature to have a similar range. As such, a single global learning rate multiplier is enough (i.e., during backpropagation, the gradient does not go out of range). Since W-MobileNet performs both classification and regression, categorical cross-entropy is used as the loss function for classification (Softmax layer in Fig. 3) and the newly proposed loss function ADWL for regression (Linear layer in Fig. 3). Further, equal loss weights are used. 80% of the normalized data is fed to W-MobileNet for training. W-MobileNet is trained for 100 epochs using a batch size of 64 and a learning rate of 0.0001, which controls the step along the gradient; the Adam optimizer is used, which progressively reduces the learning rate over time. During testing, a copy of the training data (the 80% used during training) is heavily augmented and added to the unseen 20% of testing data to attain a whole flight trajectory. To ensure the most diversified testing environment, the UAV is set off in an anticlockwise direction along the trajectory rather than the clockwise direction used during data collection. In addition, environmental factors (weather conditions) are activated in AirSim. The computing resource for the experiment is equipped with an Nvidia GeForce RTX 2070 with a memory of 16 GB, running Windows 10 Pro.
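The channel mean-subtraction step mentioned above can be sketched as follows. This is an illustration on a tiny hypothetical dataset using nested Python lists, not the preprocessing code used in the experiments; the function names and the choice to compute the means over the whole training set are assumptions.

```python
def channel_means(images):
    """Per-channel mean over a dataset.

    images: list of images; each image is a list of rows; each row is a
    list of (r, g, b) pixel tuples.
    """
    totals = [0.0, 0.0, 0.0]
    count = 0
    for img in images:
        for row in img:
            for px in row:
                for c in range(3):
                    totals[c] += px[c]
                count += 1
    return [t / count for t in totals]


def subtract_means(img, means):
    """Center one image by subtracting the dataset channel means."""
    return [[tuple(px[c] - means[c] for c in range(3)) for px in row]
            for row in img]


if __name__ == "__main__":
    # Two hypothetical 1 x 2 RGB images.
    dataset = [
        [[(10, 20, 30), (20, 40, 60)]],
        [[(30, 60, 90), (40, 80, 120)]],
    ]
    means = channel_means(dataset)
    centered = [subtract_means(img, means) for img in dataset]
    print(means, channel_means(centered))
```

The means would normally be computed on the training split only and then reused unchanged for the test frames, so that no test statistics leak into training.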
An extensive evaluation based on four metrics, (i) trajectory performance and time-series analysis, (ii) Hausdorff distance, (iii) cross-track error, and (iv) a quantitative measure (time to complete the trajectory, total distance covered, and the number of gateways successfully passed), is used to evaluate the effectiveness, robustness, reliability, and shortcomings of W-MobileNet and the comparators. The experiment is repeated twice under three weather conditions: (i) clear, (ii) rainy/foggy, and (iii) snowy. The intensity of the rainy/foggy weather is set to 0.55, and that of the snowy weather to 0.75. Throughout the three weather conditions, an alternating wind of 0 m/s to 20 m/s, expressed in the NED directions in AirSim, is maintained.
Trajectory performance and time-series analysis
The trajectory performance evaluates how well each method guides the UAV to follow the reference path. Figures 7a, 7b, and 7c represent trial 1 of the experiment under the three weather conditions (clear, rainy/foggy, and snowy). As seen from trial 1 in Fig. 7a, W-MobileNet mimics the reference trajectory with nearly no track deviation. In clear weather, W-MobileNet exhibits better perceptual knowledge of the environment and can freely navigate with six degrees of freedom; as such, it can follow the reference track effortlessly. Both MTRL and ILC perform relatively better than MobileNet-X, which experiences minimal drifts in clear weather conditions. In the rainy/foggy weather of trial 1, W-MobileNet outperforms the comparators, as seen in Fig. 7b. It can be seen that W-MobileNet gains a slightly higher altitude from the onset. Still, it quickly descends to the optimal altitude and trails the reference trajectory regardless of the environment’s state. Further, navigation under MobileNet-X is challenging, as seen in Fig. 7b. The UAV experiences continuous drifts, circles, and hovers for a short period. The MTRL-based approach initially drifts heavily from the reference trajectory, and about halfway through the track, it starts to converge with the reference trajectory. The same can be said for the ILC method, which improves halfway through the course.

Trajectory performances for the navigational controllers in varying weather conditions in trials 1 and 2, respectively.
Figure 7c depicts the trajectory performance of W-MobileNet in a snowy environment with an intensity of 0.75; this demonstrates the efficiency of the W-MobileNet configuration. Regardless of the harsh nature of the snowy environment, which reduces the visibility of gateways, W-MobileNet denoises the frames, hence attaining a better perception of the environment than the comparators. In Fig. 7c, navigation under the guidance of MobileNet-X, ILC, and MTRL is heavily affected by drifts, circling, and hovering almost throughout the trajectory due to poor perception. Among the comparators, MobileNet-X is unable to complete the entire course. The UAV lands due to long periods of hovering and circling almost at the same place. The conclusion drawn here is that MobileNet-X has an erroneous visual perception of the environment and also fails to generalize; hence its predictions result in the hovering/circling behavior of the UAV. The experiment is repeated for a second round, trial 2. As seen from Figs. 7d, 7e, and 7f, an interpretation similar to that of trial 1 can be deduced; hence further elaboration is not given.
Time-series analysis is used to gain more insight into the performance of each navigational controller (W-MobileNet, MobileNet-X, ILC, and MTRL). For brevity, and because the trial 2 trajectory performances of the navigational controllers are similar to those of trial 1, the time-series analysis is given for trial 1 only; refer to Figs. 8a, 8b, and 8c. In Figs. 8a, 8b, and 8c, from top to bottom, the time series in the x, y, and z directions are visualized for the three environments. Here, the vertical axis denotes the offset in the x, y, and z directions, and the horizontal axis is a normalized time for the recorded data points at each gateway. From the perspective of a stochastic model in time-series analysis, observations closer in time are much more related than observations farther apart. Accordingly, Figs. 8a, 8b, and 8c show that W-MobileNet has a closer relationship with the reference trajectory than the comparators in all the environment scenarios. Hence, W-MobileNet can be said to be an effective and efficient autonomous navigational controller in the setting of UAV flight controllers.

Time series analysis for the UAV navigational controllers in trial 1 of the simulated experiment.
Since the trajectories obtained under the guidance of each navigational controller belong to the same metric space (M, d) and share similar views, a pairwise distance measure, the Hausdorff distance [33, 34], is employed to compute how far each trajectory is contained within the reference trajectory and vice versa. The Hausdorff distance is expressed as:
That is, the set of all points within ɛ of trajectory X. Trajectory Y is also fattened, as in Equation 9. A tabulation of the Hausdorff distances between the reference trajectory and the trajectories obtained under the navigational controllers is given in Table 3. No values for the reference trajectory are provided since it is approximately equal to that of the navigational controllers’ Hausdorff distances.
Results of the Hausdorff distance in the simulated experiment
In Table 3, the lower the Hausdorff distance, the closer the two trajectories being compared. For clarity, the Hausdorff distance between the reference trajectory and W-MobileNet is used to give more insight. From Table 3, under trial 1 of the clear weather environment, the Hausdorff distance between the reference trajectory and that of W-MobileNet is 3.65 m. This implies that when the reference trajectory is thickened or widened by 3.65 m, the trajectory obtained under the W-MobileNet controller is contained within the reference trajectory, and vice versa. As seen from Table 3, trajectories obtained under the W-MobileNet controller are relatively close to the reference trajectory due to precision in control. As such, W-MobileNet attains the lowest Hausdorff distances among the controllers compared.
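For trajectories recorded as finite lists of sampled positions, the smallest thickening that makes each trajectory contain the other equals the larger of the two directed distances, so the measure can be computed directly. The sketch below shows this brute-force computation; the function names and the sample points are hypothetical, and the result is only as fine as the sampling of the two trajectories.

```python
import math


def directed_hausdorff(X, Y):
    """Largest distance from a point of X to its nearest point in Y."""
    return max(min(math.dist(x, y) for y in Y) for x in X)


def hausdorff(X, Y):
    """Symmetric Hausdorff distance between two finite point sets."""
    return max(directed_hausdorff(X, Y), directed_hausdorff(Y, X))


if __name__ == "__main__":
    # Hypothetical sampled positions (metres) of two short trajectories.
    reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    flown = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (2.0, 4.0, 0.0)]
    print(directed_hausdorff(reference, flown),
          directed_hausdorff(flown, reference),
          hausdorff(reference, flown))
```

The two directed distances differ in this example, which shows why both directions must be evaluated: a single stray sample on one trajectory determines the final value. The pairwise search costs O(|X||Y|), which is acceptable for trajectories sampled at gateways but would call for a spatial index on dense logs.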
The Cross-Track-Error (XTE) is defined as the instantaneous perpendicular deviation of the UAV to the left or right side of the reference trajectory (i.e., the XTE determines the lateral position of the UAV relative to the reference trajectory). To compute XTE, a vector
From Fig. 9, the XTE for W-MobileNet under the three environmental conditions is lower than that of the comparators in both repetitions of the experiment. Regardless of the side of the deviation (left or right of the reference trajectory), W-MobileNet converges quickly to the reference trajectory. In all environmental scenarios of the simulated experiment, the XTE for W-MobileNet declines quickly and reaches zero. Occasionally, swift changes in orientation cause the XTE to increase momentarily, but W-MobileNet quickly puts the UAV back on the reference path; hence the XTE for W-MobileNet remains nearly zero. The capability of W-MobileNet to reduce the XTE almost to zero is attributed to the 3D-NLGL-X module.

Comparison of cross-track-error between W-MobileNet and the two comparators for trials 1 and 2 of the simulated experiment.
As explained earlier, the 3D-NLGL-X acts as a path planner, enhancing the predictions from the DCNN part of W-MobileNet. Comparing the XTE of the comparators with that of W-MobileNet in Fig. 9 shows that, under all environmental scenarios, the comparators record higher XTE and take comparatively longer to converge to the reference path. The large XTE observed for ILC is due to gimbal lock, the loss of one degree of freedom in three-dimensional rotation. As a result, the continuous accumulation of orientation errors leads to drifts from the reference trajectory, which is reflected in the various evaluation metrics for the ILC method. In short, since W-MobileNet records the least XTE, it takes less time to complete the entire trajectory. A normalized time is used on the x-axis of the sub-figures in Fig. 9 so that the XTE of the controllers can be compared. The actual times to complete the trajectories are given under the quantitative measure metric in sub-Section 4.2.4.
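As a minimal sketch of how a lateral deviation of this kind can be computed, the example below takes the signed perpendicular distance from the UAV position to the straight line through two consecutive waypoints, projected onto the horizontal plane. The function name, the sign convention (positive when the UAV is to the left of the direction of travel), and the sample coordinates are assumptions for illustration; this is one common formulation and is not necessarily the exact expression used for Fig. 9.

```python
import math


def cross_track_error(start, end, position):
    """Signed lateral distance of `position` from the line start -> end.

    All arguments are (x, y) pairs in the horizontal plane. The result is
    positive when the position lies to the left of the direction of travel
    and negative when it lies to the right.
    """
    path_x, path_y = end[0] - start[0], end[1] - start[1]
    rel_x, rel_y = position[0] - start[0], position[1] - start[1]
    path_length = math.hypot(path_x, path_y)
    if path_length == 0.0:
        raise ValueError("start and end waypoints must differ")
    # z-component of the 2D cross product, normalised by the path length
    return (path_x * rel_y - path_y * rel_x) / path_length


# Hypothetical waypoints and UAV positions in metres.
left_of_path = cross_track_error((0.0, 0.0), (10.0, 0.0), (5.0, 2.0))
right_of_path = cross_track_error((0.0, 0.0), (10.0, 0.0), (5.0, -3.0))
print(left_of_path, right_of_path)
```

A controller that tracks the path well keeps the magnitude of this value close to zero, with the sign indicating the side on which a momentary deviation occurs.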
To analyze the quantitative performance of each navigational controller, the time taken to complete the trajectory, the total distance covered, and the number of gateways completed are used. The distance between two gateways is used to calculate the time taken to travel between them. Assuming the path between two gateways G1 and G2 is straight with positional vectors
Table 4 gives a numerical insight into the performance of each navigational controller. Based on the number of gateways completed in all weather scenarios and repetitions of the experiment, W-MobileNet exhibits much better robustness than MTRL, ILC, and MobileNet-X. W-MobileNet completes at least 35 out of 50 gateways (refer to Table 4, trial 1 of snowy weather) and at most 46 out of 50 gateways (refer to trial 1 of clear weather). In addition, its time taken to complete the trajectory and total distance covered are satisfactory compared to those of MTRL, ILC, and MobileNet-X in both repetitions of the experiment in the clear and rainy/foggy weather scenarios. From Table 4, the most challenging scenario is the snowy weather environment, where the number of gateways completed drops for all navigational controllers compared to clear and rainy/foggy weather. The time and total distance covered are also worse than the results in the clear and rainy/foggy environments, except for MobileNet-X in trial 1 of rainy/foggy weather, where it records a time of 26 min 31 sec and a distance of 919.41 m. The disparities in the duration and total distance reached by the four navigators in the two repetitions of the experiment are ascribed to the airflow the UAV must overcome and to the UAV's behavior (hovering, circling, and deviation from the reference trajectory).
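The timing between two gateways on a straight path can be sketched as follows. The sketch assumes a constant speed over the segment, so that the travel time is the Euclidean distance between the two positional vectors divided by that speed; the function name, the constant-speed assumption, and the sample values are hypothetical and are not taken from Table 4.

```python
import math


def segment_distance_and_time(g1, g2, speed):
    """Straight-line distance (m) and travel time (s) between two gateways.

    `g1` and `g2` are (x, y, z) positions in metres and `speed` is an
    assumed constant speed in metres per second.
    """
    if speed <= 0.0:
        raise ValueError("speed must be positive")
    distance = math.dist(g1, g2)
    return distance, distance / speed


# Hypothetical gateway positions and speed, for illustration only.
distance, duration = segment_distance_and_time((0.0, 0.0, 2.0), (3.0, 4.0, 2.0), 2.0)
print(distance, duration)
```

Summing the per-segment distances and times over consecutive gateways gives totals of the kind reported per controller, under the same straight-path assumption.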
Numerical insight into the performance of each navigational controller for the simulated experiment
A secondary observation, centered on the reliability of each navigational controller, concerns the correlation between the time, the total distance, and the gateways completed. From the clear weather results in Table 4, W-MobileNet completed more gateways in trial 1 than in trial 2. Yet, in trial 2, W-MobileNet has better results for the time and total distance covered than in trial 1. Similarly, in the most challenging scenario, the snowy weather environment, MobileNet-X records what appears to be the best time (10 min 21 sec) during trial 1 (refer to Table 4). However, it fails to complete the trajectory (refer to Fig. 7c for graphical insight), covering 203.35 m and completing only 4 out of 50 gateways. The discrepancies between the time, the total distance covered, and the number of gateways completed across the repetitions of the experiment make it challenging to assess the reliability of each navigational controller. To this end, the mean of each sub-evaluation metric (time, total distance covered, and gateways completed) over trials 1 and 2 in all three weather scenarios is computed. The results are summarized in Table 5. Based on Table 5, it is evident that W-MobileNet is more reliable than MTRL, ILC, and MobileNet-X. It should be noted that the average distance covered by MobileNet-X even falls short of that of the reference trajectory, since it twice failed to complete the entire course (refer to Figs. 7c and 7e).
Summarized results of the numerical performance of the navigational controllers for the simulated experiment
The real-world experiment is designed to answer the following research questions: (i) What are the transferability and generalization capabilities of W-MobileNet from simulation to reality? (ii) Can W-MobileNet be deployed on a physical drone? (iii) How well can W-MobileNet handle real-world trajectory layouts?
Setup
The DJI Tello, a low-cost programmable drone, is used; it runs on an Intel 14-core processor and records 720p video at 30 fps with a 5-megapixel camera. First, two reference trajectories are obtained from an expert user flying the DJI Tello drone on a circular and a rectangular track using the GameSir T1d Controller. These reference trajectories are used as a benchmark to assess the performance of W-MobileNet alongside the comparators (MTRL, ILC, and MobileNet-X). The two trajectories present varying challenges (straight segments, swift elevation changes, and hairpin curves). Fabricated gateways measuring 100 cm × 100 cm were placed at vantage waypoints of the two reference trajectories (refer to Fig. 10c). The fabricated gateways help to visualize the reference trajectories/tracks, which are otherwise not physically marked, and also serve as an evaluation metric.

Trajectory performance by each navigational controller in the real-world experiment. From left to right: (a) circular trajectory performance, (b) rectangular trajectory performance, and (c) a composite scene from the real-world experiment.
All three DCNN models acting as navigational controllers follow a training and testing approach similar to that in sub-Section 4.1. In response to the question of transferability and generalization, the last convolutional layer of W-MobileNet is retrained on the geometric features specific to the real-world trajectories. The learned weights from the simulated experiment (circle trajectory) are used as initializers. The same training modification is applied to the two DCNN comparators. Since the DJI Tello drone is a resource-constrained edge node, the compute resource stated in sub-Section 4.1 is utilized for retraining and testing the navigational controllers. Consequently, the question of deploying W-MobileNet on a physical drone is only partially answered, since the drone acts in the real environment based on navigational commands received from W-MobileNet running on a computer. The last question, which concerns the performance of W-MobileNet in the real environment, is answered using the experimental results, which are evaluated on three axes: (i) trajectory performance, (ii) Hausdorff distance, and (iii) gateways completed. Unless otherwise noted, 6 and 4 fabricated gateways are used in the circle and rectangular trajectories, respectively.
The isometric trajectory performance of the navigators is shown in Figs. 10a and 10b. On a scale of 100%, we estimate that W-MobileNet achieves a track performance rate of about 90% and 93% for the circle and rectangular trajectories, respectively, factoring in all the challenges associated with both reference trajectories. MobileNet-X, an ablated version of W-MobileNet, shows the worst performance.
As shown in Figs. 10a and 10b, the trajectory performance improves considerably owing to the configuration of the W-MobileNet framework, which: (i) denoises the frames, so that significant features are learned, and (ii) integrates navigational estimates with the 3D-NLGL-X, leading to intuitive and precise control of the UAV. The MTRL and ILC methods show comparable, average performance on both reference trajectories; both are better than the MobileNet-X method but lag behind W-MobileNet. Table 6 summarizes the numerical performance of the navigators under the evaluation metrics Hausdorff distance and gateways completed. In addition, a conjectured trajectory performance rate is given in Table 6, and Fig. 10c shows a composite scene from the real-world experiment. Readers can refer to the video in [35] for visual insight into the performance of W-MobileNet. Although we estimate an overall success rate of 91.5% for W-MobileNet over both trajectories, some minor drifts during the real-world experiment are acknowledged. Presumably, the drifts observed in the real-world experiment could be attributed to a calibration error of the DJI Tello drone.
Real-world numerical results for insight into the performance of the navigational controllers
This study implements a new hybrid method for the autonomous vision-based UAV trajectory trail task. The hybrid approach is divided into three main modules: (i) a denoising mechanism, (ii) the perceptual awareness of a DCNN, and (iii) the precision mechanism of a path planner. Integrating these three modules results in a UAV navigational controller termed W-MobileNet, which continuously predicts navigational commands regarding position coordinates, orientation, and speed for the UAV. Further, a novel loss function characterized by a weighting mechanism is utilized in training W-MobileNet, resulting in faster convergence. The pros and cons of the proposed hybrid navigational controller are investigated extensively in simulation and in a real-world experiment utilizing a micro aerial drone. Based on the evaluation metrics, namely (i) trajectory track performance and time series analysis, (ii) Hausdorff distance, (iii) cross-track-error, and (iv) a numerical quantitative measure, the findings of the experiments indicate that W-MobileNet is favorable in terms of precision and robustness over the state-of-the-art methods, a non-neural network method, and an ablated W-MobileNet.
Although the potential of W-MobileNet as a UAV navigational controller takes us a step closer to autonomous UAV applications, scaling the presented approach to applications such as package delivery would still require the partial intervention of a human pilot because the path is unknown in advance. In addition, W-MobileNet lacks a mechanism for handling UAV calibration errors. These two limitations are directions for future research.
Acknowledgment
This work was partly supported by the National Natural Science Foundation of China under Grants 61571099 and 61501098.
The authors thank Lui Zhang and Joshua Offeh Beakoh for aiding the real-world experiments and Priscilla Fosu Sarkodie and Abigail Boamah for extensive proofreading. Also, the authors are grateful for the helpful advice from Dr. Brigther Agyemang, Samuel Osei Agyemang, and Micheal Osei Agyemang.
