Abstract
With the rapid development of unmanned aerial vehicle (UAV) technology and computer vision, real-time object detection in UAV aerial images has become a research hotspot. However, detection in UAV aerial images faces challenges such as widely varying object scales, numerous small objects, and mutual occlusion. To address these issues, this paper proposes the ASM-YOLO model, which replaces the Neck of YOLOv8 with ABiFPN, a feature pyramid network that combines efficient bidirectional cross-scale connections with adaptive feature fusion. Additionally, a Structural Feature Enhancement (SFE) module is introduced to inject features extracted by the Backbone into the Neck, strengthening information exchange between the two. Furthermore, the MPDIoU bounding box loss function replaces the original CIoU bounding box loss function. A series of experiments was conducted on the VisDrone-DET dataset, with YOLOv8s as the baseline. The results show that the proposed model reduces the parameter count and model size by 26.1% and 24.7%, respectively. On the evaluation set, the proposed model improves AP50 and mAP by 7.4% and 4.6%, respectively, compared to the YOLOv8s baseline, validating its practicality and effectiveness. The generalizability of the algorithm was then validated on the DOTA and DIOR datasets, which share similarities with aerial images captured by drones. The experimental results indicate significant improvements on both datasets.
Keywords
Introduction
The integration of UAV aerial image processing with deep learning technology is currently a focal point of scrutiny within the academic community. UAVs, known for their maneuverability and flexibility, find extensive applications in monitoring, particularly in intricate natural environments and terrains. UAV-based monitoring, when compared to conventional methods, presents advantages such as wide coverage, high efficiency, and cost-effectiveness, rendering UAVs an optimal choice in the monitoring domain. Nonetheless, generic object detection models often encounter challenges in accurately identifying objects in low-altitude UAV aerial images. Aerial photographs captured by UAVs differ significantly from ground-based photography, manifesting characteristics such as expansive scenes, diminutive objects (defined as those with pixels less than 32
Although YOLOv8 [3] is an excellent real-time object detection model, its performance is suboptimal when applied directly to real-time detection on drones, particularly for small targets and occlusions. Pursuing high accuracy alone fails to meet real-time requirements [4], while prioritizing real-time performance compromises accuracy [5]. This trade-off makes it difficult to achieve the desired results in both respects. Therefore, this paper proposes a series of improvements to YOLOv8 that aim to satisfy the real-time and accuracy requirements of drone object detection simultaneously. To handle the varying object scales in UAV object detection, we replace the PAN-FPN [6] structure in the original YOLOv8 Neck with ABiFPN, a feature pyramid network with efficient bidirectional cross-scale connections, adaptive feature fusion, and a P2 layer. This adjustment aims to mitigate the common problem of small objects being overlooked in low-altitude aerial images. Additionally, to improve information exchange between the Backbone and Neck networks, we introduce the Structural Feature Enhancement (SFE) module, which injects features from the Backbone into the Neck for additional fusion. The Coordinate Attention (CA) mechanism [7] inside the SFE module lets the model focus on spatial and positional information with long-range dependencies during this exchange, which substantially improves detection accuracy in low-altitude aerial images. Finally, to help the model capture multi-scale variations and small-object information during training, we replace the original YOLOv8 CIoU [8] bounding box loss function with the MPDIoU [9] bounding box loss function.
Ablation experiments were conducted to assess the viability and efficacy of the proposed optimizations. The experiments used the widely recognized VisDrone-DET [10] dataset and compared the proposed model with the YOLOv8s baseline. The proposed model reduces the parameter count and model size by 26.1% and 24.7%, respectively. It also improves Average Precision at an IoU threshold of 0.5 (AP50) by 7.4% and mean Average Precision (mAP) by 4.6% over the YOLOv8s baseline, affirming its practicality and effectiveness. Comparisons with state-of-the-art detection models and with mainstream models of comparable parameter count provide further evidence of the performance of the proposed methods. We also validate the algorithm’s generalizability on the DOTA [11] and DIOR [12] datasets, which share similarities with aerial images captured by drones. The experimental results indicate a significant performance improvement for ASM-YOLO on both datasets.
The remainder of this paper is structured as follows. Section 2 reviews relevant prior research. Section 3 introduces the improved model for aerial image detection and details its structure and functionality. Section 4 describes the evaluation metrics, experimental settings, and parameter configurations, then reports ablation, comparative, and interpretability experiments on the open-source VisDrone-DET dataset, along with generalization validation on the DOTA and DIOR datasets. Section 5 summarizes the findings of the paper and outlines future research directions.
Related works
In recent years, the rapid development of universal object detection technology has become the foundation for research in numerous application domains. This progress not only highlights its significance in providing crucial technical support for the practical implementation of unmanned aerial vehicle (UAV) target detection tasks but also reflects its integral role in the broader domain. Section 2.1 of this paper will offer an in-depth exploration of the evolution of universal object detection. Moving forward, Section 2.2 will shift focus to the practical application scenarios of UAV target detection. The discussions in these two sections aim to contribute to a comprehensive understanding of the intricate interplay between universal object detection technology and the challenges posed by UAV target detection tasks.
Universal object detection model
Universal object detection techniques fall into two primary categories: one-stage and two-stage. One-stage techniques yield results in an end-to-end fashion, transforming the object detection challenge into a global regression problem. These techniques concurrently assign positions and categories to multiple candidate boxes, achieving a distinct separation between objects and backgrounds. Classic one-stage object detection methodologies encompass SSD [13], the YOLO (You Only Look Once) series [14, 15, 16], and RetinaNet [17]. In contrast, two-stage techniques employ heuristic methods or region proposal generation in the first stage to acquire multiple candidate boxes, subsequently subjecting them to filtering, classification, and regression in the second stage. Traditional two-stage object detection approaches include R-CNN [18], Faster R-CNN [19], and Mask R-CNN [20]. While two-stage techniques demonstrate superior detection accuracy on expansive datasets such as MS COCO [1] and PASCAL VOC [21], they often struggle to meet the real-time demands of edge computing architectures. Conversely, one-stage object detection models strike a balance between real-time performance and detection accuracy. Notably, the YOLO series stands out as a highly regarded single-stage object detection model.
The YOLO series of networks typically consists of three core components: Backbone, Neck, and Head. The Backbone extracts essential features from the image, with shallow layers capturing edge and texture features, while deep layers capture object and semantic information. The Neck connects the Backbone and the Head, aggregating and refining features from the Backbone, commonly performing feature fusion to handle information at different scales. As the final component of the YOLO model, the Head is responsible for predictions. It consists of one or more task-specific sub-networks for classification, localization, and, more recently, instance segmentation and pose estimation. The Head utilizes features provided by the Neck to generate predictions for each candidate object. Finally, post-processing steps such as NMS [22] filter out overlapping predictions, retaining only the detections with the highest confidence.
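The NMS post-processing step described above can be sketched in a few lines. This is a minimal greedy variant in plain Python, not the YOLOv8 implementation; the helper names `iou` and `nms` and the default threshold of 0.5 are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float], iou_thr: float = 0.5) -> List[int]:
    """Greedy NMS: return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too strongly with a kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep
```

In practice NMS is applied per class; the sketch omits class handling for brevity.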
In 2023, Ultralytics introduced YOLOv8 for general object detection, incorporating ideas from the ELAN design of YOLOv7 [23] into its Backbone and Neck. YOLOv8 replaced the C3 structure of YOLOv5 [24] with the more gradient-rich C2f structure and adjusted channel numbers for models of different scales. The Head underwent substantial modifications, adopting a decoupled structure that separates the classification and detection heads. It also moved from the conventional Anchor-Based approach to an Anchor-Free one, further refining detection accuracy. Innovations in loss calculation include the TaskAlignedAssigner positive sample assignment strategy and Distribution Focal Loss (DFL) [25], intended to strengthen the model’s robustness and accuracy. Furthermore, YOLOv8 uses the CIoU [8] bounding box loss function. Through these improvements, YOLOv8 has excelled in object detection and become a widely applied model in diverse domains. It is offered in scaled versions, YOLOv8n (nano), YOLOv8s (small), YOLOv8m (medium), YOLOv8l (large), and YOLOv8x (extra large), and supports visual tasks including object detection, segmentation, pose estimation, tracking, and classification.
Object detection in aerial images from unmanned aerial vehicle perspective
Despite notable advancements in general object detection techniques, the task of detecting objects in low-altitude UAV aerial images still presents numerous challenges. These challenges encompass various factors, including the presence of small objects, multi-scale variations, complex backgrounds, occlusions, and lighting conditions. In order to address these issues, literature [26] proposed a sliding window cropping method specifically designed for detecting high-resolution images obtained from remote sensing satellites. The proposed method involves dividing the original captured image into multiple smaller images of a fixed size prior to inputting them into the network. This sequential approach effectively mitigates the problem of information loss that occurs when directly downsizing the original high-resolution image upon network entry. Presently, this method is widely employed to tackle the challenge of detecting small objects within high-resolution images. Furthermore, a recent study outlined in literature [4] introduced the utilization of image upscaling to enhance detection performance in tasks involving small objects. While these two image preprocessing methods offer significant improvements in the accuracy of small object detection within low-altitude UAV aerial images, their deployment in edge computing scenarios undoubtedly entails substantial computational overhead.
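The sliding window cropping idea can be illustrated with a short sketch that computes fixed-size crop windows over a large image. The tile size of 640, the 128-pixel overlap, and the function names are assumptions chosen for illustration and are not taken from [26].

```python
from typing import List, Tuple


def _starts(length: int, tile: int, stride: int) -> List[int]:
    """Start offsets of windows of size `tile` that cover [0, length)."""
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] != length - tile:
        starts.append(length - tile)  # last window sits flush with the border
    return starts


def tile_windows(width: int, height: int, tile: int = 640,
                 overlap_px: int = 128) -> List[Tuple[int, int, int, int]]:
    """Return (x1, y1, x2, y2) crop windows covering the whole image."""
    stride = max(1, tile - overlap_px)
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in _starts(height, tile, stride)
        for x in _starts(width, tile, stride)
    ]
```

Each window would be passed to the detector separately and the resulting boxes shifted back by the window offset before merging. The number of windows grows with image area, which is the computational overhead noted above.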
The method in [27] uses a structure similar to an encoder-decoder architecture to modify the dimensions of feature maps through various types of blocks and convolutional layers. Object detection is performed using five detection layers, and the K-Means algorithm is employed to determine the optimal anchor box values. Common data augmentation techniques, including the selective blacking out of specific regions, are also applied. These techniques have shown a certain level of effectiveness in detecting objects in aerial images captured by UAVs. The literature [28] presents a UAV image object detection method named Multi-Proxy Detection Network with Unified Foreground Packing (UFPMP-Det). To tackle the challenge of detecting densely distributed small objects in UAV-captured scenes, this method introduces Unified Foreground Packing (UFP): background suppression is first achieved through cluster merging, and the resulting sub-regions are then packed into mosaics for individual inference, leading to a significant reduction in overall computational time. Moreover, to mitigate the confusion arising from object similarity and intra-class variations, the method introduces a Multi-Proxy Detection Network (MP-Det). By employing multiple proxies, MP-Det learns to model the distribution of objects and enforces proxy diversification by minimizing the optimal transport loss guided by the Bag-of-Instance-Words (BoIW). The literature [29] introduces MS-YOLOv7, a UAV aerial image object detection model that improves upon YOLOv7 [23]. This model specifically focuses on detecting common large objects and small objects with higher aspect ratios in UAV aerial images. It achieves this by employing multiple detection heads and the CBAM [30] convolutional attention module to extract features at various scales. 
To address the challenge of detecting densely packed objects, the model incorporates Swin Transformer [31] units and introduces a novel pyramid pooling module named SPPFS. Moreover, the model combines SoftNMS [32] and the Mish [33] activation function to enhance the network’s capability to recognize overlapping and occluded objects. The literature [34] addresses the issue of false positives and false negatives in the detection of small objects in UAV aerial images by introducing the Bi-PAN-FPN approach to enhance the Neck component of YOLOv8. To mitigate information loss during long-distance feature transmission and significantly reduce the number of model parameters, the Ghostblock v2 [35] structure is introduced in the backbone, replacing a portion of the original C2f modules in YOLOv8. Additionally, to improve the overall performance of the detection task, the WiseIoU [36] bounding box loss function is employed, along with a dynamic non-monotonic focusing mechanism and an "outlier" evaluation of anchor box quality. This combination enables the detector to take anchor boxes of different qualities into account. The literature [37] presents UAV-YOLOv8, an object detection model specifically designed for UAV aerial scenes, which builds upon the YOLOv8 architecture. Firstly, WiseIoU [36] is employed as the bounding box regression loss, accompanied by a gradient allocation strategy that prioritizes samples of common quality. This strategy effectively enhances the model’s localization capability. Secondly, BiFormer [38], an attention mechanism, is introduced to optimize the backbone network and enhance the model’s focus on crucial information. Lastly, a feature processing module called Focal FasterNet block (FFNB) integrates shallow and deep features, and two new detection scales are proposed. 
The integration of these features through the proposed multi-scale feature fusion network significantly boosts the model’s detection performance and reduces the rate of missed detections for small objects. The literature [39] introduces a remote sensing image object detection method based on Consistency- and Dependence-Guided Knowledge Distillation (CDKD). This approach uses the Spatial- and Channel-oriented Structure Discriminative modules (SCSDM) to extract the discriminative spatial positions and channels that the teacher model focuses on. By eliminating noise and the influence of complex backgrounds, the method enhances the feature representation of the student model. Guided by SCSDM, the paper establishes the consistency and dependence of features between the teacher and student models. This work holds significant reference value for unmanned aerial vehicle image processing and model lightweighting.
Although the methods in [27, 28, 29, 34, 37, 39] have contributed to the development of this field, they still have several limitations. Firstly, these methods primarily focus on achieving high detection accuracy, which they accomplish effectively, but their speed falls short of the requirements for deployment on edge devices, impeding their practical application. Secondly, even where detection accuracy and speed are commendable, the model size has increased during the improvement process, which adds to the burden of deployment: practical applications demand additional computational resources and storage space, limiting the feasibility of these methods. Therefore, when further improving low-altitude UAV aerial image detection methods, it is crucial to reduce the model’s size while simultaneously improving detection accuracy and speed.
Building upon the aforementioned considerations, this paper aims to propose a precise and lightweight method for detecting low-altitude UAV aerial images. Our focal point will be to tackle the aforementioned challenges by adopting novel algorithms and techniques, which enable efficient deployment on edge devices. To enhance detection accuracy, we will introduce novel feature representation methods or refine existing feature extractors. Furthermore, we will investigate approaches to optimize model architecture and parameter settings, thereby improving detection speed and efficiency.
Methods
Structure of the ASM-YOLO
To tackle the object detection task in low-altitude aerial images captured by UAVs, this paper proposes an improved version of YOLOv8 called ASM-YOLO. The overall architecture, as depicted in Figure 1, consists of several components. Firstly, the (a) Backbone structure, responsible for feature extraction, employs the CSPDarknet53 backbone network. Due to its excellent feature extraction capabilities and lower resource consumption, we adhere to its design. The useful feature layers extracted from this structure, namely B2, B3, B4, and B5, are also utilized in the subsequent Neck structure, denoted as P2 to P5 and H2 to H5 in the Neck and Head structures, respectively. Secondly, the (b) SFE structure combines the CA attention mechanism to inject the features extracted by the Backbone into the Neck structure. This process focuses on spatial and positional information. Next, the (c) Neck structure replaces the original PAN-FPN structure with a four-layer ABiFPN. It receives features from layers B2 to B5 of the Backbone and performs adaptive fusion. Lastly, the (d) Head structure follows the decoupled heads approach used in YOLOv8. It consists of separate branches for classification and bounding box regression. The classification task employs Binary Cross-Entropy (BCE) loss, while the regression task adopts MPDIoU loss and DFL. The detailed parts of this network are shown in Figure 2(a) to (i).

Figure 1. The main workflow structure of ASM-YOLO.

Figure 2. The detailed process structure of ASM-YOLO.
In one-stage object detectors, the Neck component typically incorporates a network resembling the Feature Pyramid Network (FPN) [40]. This network integrates high-level features with low-level features to enhance the representation capacity of the latter. Additionally, it assigns objects of different scales to different detection layers, thereby achieving a divide-and-conquer strategy. Over the years, several FPN variants have been introduced, including NAS-FPN [41], BiFPN [42], DetectoRS [43], and AFPN [44]. To address the drastic scale differences and the difficulty of detecting small objects in UAV aerial images, this paper first replaces the Neck component of YOLOv8 with a BiFPN structure. An adaptive feature fusion mechanism is then employed to fuse information from different scales within this structure. Furthermore, a shallow feature layer, P2, is introduced to improve the detection performance for small objects. The improved structure is referred to as ABiFPN, and its architecture is illustrated in Figure 3. The data flow of the ABiFPN structure is shown in Eqs (1) to (6).

Figure 3. Structure of ABiFPN.
Although FPN has achieved effective cross-scale feature fusion, its unidirectional top-down connection structure still limits the expressive power of features. In order to overcome this limitation, BiFPN introduces bottom-up connections as well as bidirectional feature aggregation. Moreover, to address the imbalanced contributions of features at different scales, BiFPN employs a weighted feature fusion strategy that enables the network to learn the significance of each feature, thereby replacing simple addition operations. Additionally, BiFPN considers a single top-down and bottom-up connection as a layer and iteratively stacks this structure to form multiple layers of BiFPN, thereby enhancing feature representation. In terms of network connectivity, BiFPN optimizes the structure by removing single-input nodes, adding connections among nodes at the same scale, and employing other methods to improve the network’s representation capacity without increasing parameters or computational complexity. The ABiFPN utilized in this study adheres to the design principles of BiFPN.
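The weighted fusion strategy of BiFPN can be illustrated with its fast normalized fusion rule, in which each input feature is scaled by a learnable non-negative weight and the result is divided by the sum of the weights. The sketch below uses plain Python lists in place of feature tensors. The function name and the value of `eps` are assumptions, and the adaptive fusion used in ABiFPN may differ from this rule.

```python
from typing import List, Sequence


def fast_normalized_fusion(features: Sequence[Sequence[float]],
                           weights: Sequence[float],
                           eps: float = 1e-4) -> List[float]:
    """Fuse same-length feature vectors: out = sum(w_i * f_i) / (eps + sum(w_i))."""
    if len(features) != len(weights):
        raise ValueError("one weight per input feature is required")
    w = [max(0.0, x) for x in weights]  # ReLU keeps every weight non-negative
    total = sum(w) + eps                # eps avoids division by zero
    n = len(features[0])
    return [sum(wi * f[k] for wi, f in zip(w, features)) / total for k in range(n)]
```

With equal weights the rule reduces to a plain average, while a weight driven to zero removes that input from the fusion. This is how the network can learn the significance of each scale instead of adding features directly.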
The P2 detection layer exhibits significant advantages due to its relatively higher resolution, enabling it to capture fine-grained feature information in the image and enhance the precision of object detection. By performing convolution, pooling, and activation operations at the lower levels of the pyramid network, the P2 layer effectively extracts local features and provides a strong foundation for object detection. Considering that the original BiFPN network’s performance in handling small-sized objects remains unsatisfactory, this paper introduces the P2 detection layer to enhance the detection performance of small objects. The subsequent object detection head utilizes the P2 feature layer for predicting object categories and positions, achieving accurate object detection and localization. Additionally, the P2 feature layer collaborates with other pyramid layers, leveraging multi-scale information and enabling the entire object detection system to adapt to objects of different scales, thereby improving the overall detection performance. The calculation of the parameter count for the modified ABiFPN can be determined using Eqs (7) to (11).
Among them,
Because upsampling only interpolates between neighboring feature values, it adds computation but no learnable parameters.
Among them,
Among them, InputSize is the image size input by the network;
Among them,
The equation above reveals that the excessive number of channels is one of the primary issues associated with the Neck component. Furthermore, previous literature [45] has highlighted the presence of redundancy among different channels and the necessity to increase the channel count for Depthwise Convolution in order to compensate for accuracy degradation and heightened memory access. To address these concerns, this study aims to diminish computational redundancy and alleviate memory pressure. Drawing inspiration from the design principles of Partial Convolution, only a subset of channels is retained for feature extraction, while the channel count of the original P3-P5 layers is standardized to 128. This channel reduction strategy not only reduces the computational burden but also capitalizes on the effectiveness of Partial Convolution by extracting scene features using representative channels, with Pointwise Convolution being utilized to integrate information from all channels. By adopting this approach, optimization is achieved in terms of both computation and memory access, while ensuring effective feature representation. Ultimately, the modification of the structural parameters yields a 26% decrease in the parameter count compared to the original PAN-FPN.
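To make the effect of the channel reduction concrete, the following sketch counts convolution weights for different channel widths. It assumes bias-free convolutions and, for the Partial Convolution block, that one quarter of the channels is convolved before a pointwise convolution mixes all channels. Neither assumption is stated in the text, and the function names are hypothetical.

```python
def conv_params(c_in: int, c_out: int, k: int, bias: bool = False) -> int:
    """Learnable parameters of a standard k x k convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def pconv_block_params(channels: int, k: int = 3, divisor: int = 4) -> int:
    """Partial conv on channels // divisor channels, then a 1 x 1 pointwise conv."""
    c_p = channels // divisor  # assumed share of channels that is convolved
    return conv_params(c_p, c_p, k) + conv_params(channels, channels, 1)


# Interpolation-based upsampling has no learnable weights.
UPSAMPLE_PARAMS = 0
```

Under these assumptions, a 3×3 convolution has 589,824 weights at 256 channels and 147,456 at 128 channels, while a Partial Convolution block at 128 channels has 25,600. Halving the channel count therefore cuts the weights of a standard convolution by a factor of four.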
To tackle the challenge of complex backgrounds in UAV object detection, this study proposes an adaptive feature fusion mechanism within the Neck component of the Feature Pyramid Network (FPN). Feature fusion serves multiple purposes within the FPN network: firstly, it facilitates the generation of complementary multi-scale features; secondly, it enhances the information flow between different network layers; thirdly, it mitigates semantic inconsistencies; fourthly, it improves the detection performance of small objects; and finally, it reduces the overall parameter count. However, the conventional simple stacking method fails to effectively discern the importance of individual feature maps, leading to a decrease in the detection performance of small objects in complex scenes. In order to achieve more intelligent and efficient feature fusion, this research develops an adaptive feature fusion module based on an attention mechanism. This module is capable of discerning the distinct contributions of feature maps at each position to the final object detection task, assigning varying attention weights accordingly. The structure is depicted in Figure 4, it employs 1

Figure 4. Adaptive feature fusion.
The complex and diverse backgrounds present in images captured by low-altitude aerial drones necessitate the introduction of the SFE module in this study to accurately capture the desired information. Figure 2 illustrates the composition of the SFE module, which comprises the Conv structure and the CA module. The primary objective of this module is to inject informative feature maps into the neck component of the network, thereby enhancing the representation capability and abstraction level of the features.
According to the operational flow illustrated in Figure 5, the mechanism initiates by conducting average pooling in both the height and width dimensions on the input features, resulting in a downsampled representation with reduced spatial resolution. Following that, stacked concatenation, convolution, batch normalization, and nonlinear mapping operations are employed to augment the expressive capabilities of the features. Subsequently, the features are partitioned into two parallel stages, where convolution and nonlinear mapping are independently performed in the height and width dimensions. Lastly, the mapping results are merged with the input features through residual connections. This adaptive computation iteratively occurs multiple times across the height, width, and channel dimensions to accentuate the positional and channel-specific information of the features.
The design of this mechanism considers the structural characteristics of the input data and enhances the expressive power of the features through operations such as stacked concatenation, convolution, batch normalization, and nonlinear mapping. Additionally, residual [46] connections are utilized to preserve the information of the input features, thereby preventing information loss. Consequently, when addressing scenarios with long-range dependencies, the incorporation of CA attention facilitates precise filtering of spatial and channel information pertaining to the relevant features. In the context of detecting objects in complex backgrounds of low-altitude aerial images captured by UAVs, this mechanism effectively and accurately locates the objects, ultimately accomplishing the objective of object detection.

Figure 5. Coordinate attention.
In low-altitude aerial images acquired by UAVs, small objects are prominent, so a well-designed loss function is needed to improve the model’s detection performance. YOLOv8 uses the CIoU [8] and DFL [25] loss functions for bounding box regression and binary cross-entropy loss for classification. Nonetheless, CIoU has several limitations. Firstly, it does not consider the balance between hard and easy samples. Secondly, because the aspect ratio enters the loss as a penalty term, the penalty vanishes when the ground truth box and predicted box share the same aspect ratio but differ in width and height, so it fails to capture the true disparity between the two boxes. Lastly, computing CIoU involves inverse trigonometric functions, which increases the computational cost. The computation of CIoU is given by Eq. (12).
In Eq. (12), the symbol IoU denotes the intersection over union ratio between the predicted box and the ground truth box. The term

Figure 6. MPDIoU.
Although the CIoU loss function incorporates the aspect ratio of bounding boxes as a penalty term, which can partially expedite the regression convergence of predicted boxes, it poses a constraint when the width and height of predicted boxes and ground truth boxes demonstrate a linear proportionality. In such cases, the width and height of the predicted boxes cannot simultaneously increase or decrease, impeding the efficacy of the CIoU loss function. To enhance the detection of small objects in low-altitude aerial images captured by UAVs, this study introduces the MPDIoU bounding box loss function as a substitute for the CIoU loss function. The MPDIoU loss function precisely characterizes the position and shape of objects by minimizing the distance between the top-left and bottom-right points of the predicted and ground truth bounding boxes. The specific computation process is delineated by Eqs (13) to (16). The advantages of the MPDIoU loss function lie in its streamlined computation process based on the minimum point distance measurement, rendering it applicable to both overlapping and non-overlapping bounding box regression. By incorporating the MPDIoU loss function, this study surmounts the limitations of the CIoU loss function during the convergence of width and height ratios, thereby amplifying the accuracy and performance of small object detection.
Some of the parameters used in Eqs (13) to (16) are illustrated in Figure 6.
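As a hedged illustration of the difference between the two losses, the sketch below implements the MPDIoU loss as commonly defined (IoU minus the squared top-left and bottom-right corner distances, each normalized by the squared image diagonal) together with the aspect-ratio term of the standard CIoU definition. The corner box format `(x1, y1, x2, y2)` and all function names are assumptions. The code is not the authors' implementation.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def ciou_aspect_term(pred: Box, gt: Box) -> float:
    """Aspect-ratio penalty v of CIoU; it is zero when both ratios match."""
    wp, hp = pred[2] - pred[0], pred[3] - pred[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    return (4.0 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(wp / hp)) ** 2


def mpdiou_loss(pred: Box, gt: Box, img_w: float, img_h: float) -> float:
    """1 - MPDIoU, with corner distances normalized by the image diagonal."""
    d1_sq = (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2  # top-left corners
    d2_sq = (pred[2] - gt[2]) ** 2 + (pred[3] - gt[3]) ** 2  # bottom-right corners
    norm = img_w ** 2 + img_h ** 2
    return 1.0 - (iou(pred, gt) - d1_sq / norm - d2_sq / norm)
```

For a ground truth box `(10, 10, 30, 20)` and a prediction `(10, 10, 50, 30)`, both boxes have a 2:1 aspect ratio, so the CIoU aspect term is zero even though the prediction is twice as large. The MPDIoU loss still penalizes the displaced bottom-right corner.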
Evaluation indicators
The model’s detection performance in this study is evaluated using precision (P), recall (R), average precision at an IoU threshold of 0.5 (AP50), and mean average precision over the IoU threshold range [0.5, 0.95] (mAP). Model resource consumption is assessed by the parameter count (Par), the number of floating-point operations (FLOPs), and the model size (MS). Detection speed is measured by the inference time required for a single image (Latency). The calculations for P, R, and mAP are described in Eqs (17) to (19).
where TP (True Positives) represents the number of positive samples correctly identified as positive, FP (False Positives) represents the number of negative samples incorrectly identified as positive, and FN (False Negatives) represents the number of positive samples incorrectly identified as negative.
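The metric definitions can be sketched directly from these counts. The code below assumes the standard formulas P = TP / (TP + FP) and R = TP / (TP + FN), and treats mAP as the mean of per-class AP values over ten IoU thresholds from 0.5 to 0.95 in steps of 0.05. The AP values themselves (areas under precision-recall curves) are taken as given inputs, and the function names are illustrative.

```python
from typing import List, Sequence


def precision(tp: int, fp: int) -> float:
    """Share of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of actual positives that are detected."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


def iou_thresholds() -> List[float]:
    """IoU thresholds 0.50, 0.55, ..., 0.95 used for mAP."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def mean_average_precision(ap: Sequence[Sequence[float]]) -> float:
    """Mean of ap[c][t], the AP of class c at the t-th IoU threshold."""
    values = [v for per_class in ap for v in per_class]
    return sum(values) / len(values) if values else 0.0
```

AP50 corresponds to averaging only the first threshold column over all classes.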
The experimental data utilized in this study originates from the publicly released VisDrone2021 [10] dataset by the AISKYEYE data mining team at Tianjin University. This dataset exclusively comprises aerial footage captured by UAVs. It encompasses 263 video clips, totaling 179,264 frames, along with 10,209 static images. The data is obtained through cameras affixed to diverse types of UAVs, exhibiting variations across multiple dimensions, including location (spanning 14 distinct cities in China), environment (encompassing both urban and rural areas), objects (encompassing pedestrians, vehicles, and bicycles), and density (encompassing sparse and crowded scenes). The dataset encompasses a wide range of weather and lighting conditions, effectively representing diverse real-life scenarios. The video clips possess a maximum resolution of 3840
The experiments were run on the Ubuntu 20.04 operating system with Python 3.8 and the PyTorch 2.0 deep learning framework. Training was accelerated using an NVIDIA GeForce RTX 3090 GPU (24 GB). The code used in this study was based on Ultralytics YOLOv8.0.114 and subsequently extended. To ensure a fair comparison, none of the experiments in this research used pre-trained weights. Table 1 presents the primary parameter configurations employed during the training phase.
Training parameter table.
The trained model is compared against the YOLOv8s baseline model by conducting training on the VisDrone-DET-train dataset and validation on the VisDrone-DET-val dataset. The training outcomes are depicted in Figure 7, illustrating the performance improvements achieved by the proposed approach.
The effectiveness of the proposed enhancements was demonstrated through ablation experiments conducted on the baseline model using the VisDrone-DET dataset. The experimental findings are presented in Table 2. In this table, ABiFPN signifies the incorporation of the Adaptive Weighted Bi-directional Feature Pyramid Network with a P2 detection layer serving as an improved Neck component within the network. SFE represents the Structural Feature Enhancement module, which facilitates the interaction between the Backbone and Neck networks. MPDIoU denotes the substitution of the original CIoU bounding box loss function of the baseline network with the MPDIoU loss function. The ✓ symbol indicates the adoption of the respective enhancement strategy.

Training results of YOLOv8s and the proposed model: (a) precision; (b) recall; (c) AP50; (d) mAP.
Ablation experiment. The best result is presented in bold.
The experimental findings presented in Table 2 show that each enhancement strategy improves the detection performance of the baseline model on the VisDrone-DET dataset. Replacing the PAN-FPN component with an ABiFPN that incorporates a P2 detection layer yields the largest gain, with increases of 6.8% and 4.1% in the AP50 and mAP metrics, respectively. This improvement can be attributed to the abundance of small objects in scenes captured by drones and to the substantial variation in object scale, both of which benefit from the additional high-resolution detection layer and cross-scale feature fusion. Integrating the SFE module into the baseline network results in a 1.0% increase in AP50 and a 0.7% increase in mAP, which stems from the better use of the features extracted by the backbone. Employing MPDIoU as the bounding box regression loss function leads to a 0.4% increase in AP50 and a 0.2% increase in mAP. However, the ABiFPN structure adds complexity to the model and prolongs the inference time.
To evaluate the effectiveness of the proposed model, comparative experiments were conducted between the enhanced model presented in this paper and several widely recognized models in the field of real-time object detection. The selected models include YOLOv5 (version 7.0), released by Ultralytics in November 2022; YOLOv6 (version 3.0), released by Meituan in February 2023; YOLOv7, released by the original YOLOv4 team in July 2022; Gold-YOLO, released by Huawei in September 2023; and YOLOv8, also released by Ultralytics, in January 2023. The experimental results on the VisDrone-DET dataset are presented in Table 3, where ASM-YOLO achieves the best Recall, AP50, and mAP. Compared with the YOLOv8s baseline network, ASM-YOLO reduces the parameter count and model size by 26.1% and 24.7%, respectively, while the AP50 and mAP metrics improve by 7.4% and 4.6%, respectively. The trade-off is a slight decrease in detection speed.
Comparative experiment table. The best result is presented in bold.
Figure 8 compares ASM-YOLO with the prevailing object detection models on the VisDrone-DET dataset in terms of accuracy, latency, and parameter count, illustrating its strong overall performance.
In this section, we discuss and analyze the experimental results obtained from evaluating the ASM-YOLO model on the VisDrone-DET-val and VisDrone-DET-test-dev datasets. Table 4 presents the performance of ASM-YOLO across 10 categories, as well as the overall recall, AP50, and mAP metrics in the VisDrone-DET dataset. Notably, the detection performance for the car, van, and bus categories shows great promise, achieving AP50 scores of 79.2%, 47.6%, and 62.8%, respectively, on the test set.
Detection Performance of ASM-YOLO on Various Categories in the VisDrone-DET-val and VisDrone-DET-test-dev Datasets.
Notes: (1) Ind. denotes indicators; (2) Ped. denotes pedestrian; (3) Awn. denotes awning-tricycle; (4) val denotes VisDrone-DET-val; (5) test denotes VisDrone-DET-test-dev.

Comparative experiment. The x-axis denotes latency, and the y-axis represents accuracy. The size of the triangular markers indicates the parameter count of each model. (a) Comparison based on the AP50 metric; (b) comparison based on the mAP metric.
The detection performance of the proposed model in practical application scenarios was examined on several complex images selected from the VisDrone-DET dataset, as depicted in Figure 9. Figure 9(a) shows three drone-captured images, arranged from left to right: a nighttime scene, a scene with distant small targets, and an image with a complex background. Figure 9(b) presents the detection results of YOLOv8s. In this set of images, the lower half of the first image exhibits both missed detections and false positives, while the upper half of the second image and the left side of the third image show missed detections of vehicles and pedestrians, respectively. Figure 9(c) displays the detection results of ASM-YOLO, which, compared with Figure 9(b), reduces the observed missed detections and false positives to a certain degree.

Detection Effect: (a) Original Image; (b) YOLOv8s Detection Result; (c) ASM-YOLO Detection Result. Red boxes indicate false positive detections, light blue boxes indicate missed detections.
To examine what the ASM-YOLO model attends to, this study employs the Grad-CAM technique [49] for explanatory analysis. Grad-CAM uses gradient information to locate the regions of the image associated with the target concept, producing visually interpretable heatmaps. Because it relies only on gradient information, it can be applied to a variety of deep learning models without retraining. In practice, Grad-CAM provides clear visual explanations that help in understanding the model’s decision-making process. By applying Grad-CAM, this research obtains a clearer picture of which regions of the image the deep network focuses on. This gradient-based localization is valuable for interpreting the decisions of deep networks and for assessing the reliability and applicability of the network.
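The core Grad-CAM computation can be sketched in a few lines of plain Python. The sketch assumes that the activations of one convolutional layer and the gradients of a scalar score with respect to them have already been obtained; which layer and which score were used for the detection models in this study is not stated here, so both are left to the caller. The function name and the nested-list layout are hypothetical.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap from one layer's feature maps.

    activations, gradients: nested lists of shape [K][H][W], where gradients
    holds the derivative of the chosen score with respect to each activation.
    Returns an [H][W] heatmap scaled to the range [0, 1].
    """
    num_channels = len(activations)
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight: global average of that channel's gradient map.
    weights = [
        sum(sum(row) for row in grad) / (height * width) for grad in gradients
    ]
    # Weighted sum of the feature maps followed by ReLU.
    cam = [
        [
            max(0.0, sum(weights[k] * activations[k][y][x]
                         for k in range(num_channels)))
            for x in range(width)
        ]
        for y in range(height)
    ]
    # Scale to [0, 1] so the map can be overlaid on the image.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam
```

In a real pipeline the resulting map would be upsampled to the input resolution and blended with the original image; that step is omitted here.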

Grad-CAM Visualization: (a) Original Image; (b) Grad-CAM Visualization of YOLOv8s; (c) Grad-CAM Visualization of ASM-YOLO.
Figure 10 illustrates the experimental results. The first row showcases three selected images captured by low-altitude drones. The second row presents the Grad-CAM visualization results of the baseline YOLOv8s model, while the third row exhibits the Grad-CAM visualization results of the proposed model in this study. From Figure 10, it is evident that the baseline YOLOv8s model has limitations in detecting small objects at longer distances, despite performing well in close-range object detection. The proposed model in this study successfully mitigates this drawback of the baseline model. Moreover, this study demonstrates the superior generalization ability of the proposed model compared to the baseline model in the task of detecting low-altitude drone-captured images.
To validate the generality of our improved algorithm, we additionally employed the DOTA dataset and DIOR dataset, which bear a resemblance to the low-altitude unmanned aerial vehicle (UAV) perspective. The DOTA dataset comprises 2,806 aerial images with 188,282 annotated instances. The image sizes range from approximately 800
Next is the DIOR dataset, comprising 23,463 images with dimensions of 800
Experimental results demonstrate that our proposed algorithm not only performs well on the VisDrone-DET dataset but also achieves clear performance improvements on the DOTA and DIOR datasets. As shown in Table 5, our improved algorithm raises the mAP metric by 1.5% and 2.4% on DOTA and DIOR, respectively. These results validate the feasibility and effectiveness of our enhanced algorithm.
Cross-dataset validation experimental results comparison table.
In this study, we propose the ASM-YOLO algorithm as an improvement upon YOLOv8s for object detection in low-altitude UAV imagery. To address the large scale variations and the difficulty of detecting small objects in UAV aerial images, we replace the PAN-FPN structure in YOLOv8 with the ABiFPN structure. By maintaining a fixed channel number of 128 in the ABiFPN structure, the model’s parameter count is reduced by 26.1%, while the AP50 metric improves from 32.6% to 39.4%. The proposed structure therefore reduces the parameter count and improves detection accuracy at the same time. To handle the complex backgrounds in UAV aerial images, which make objects difficult to detect, we introduce the SFE module. This module injects features extracted by the Backbone into the Neck structure, making more efficient use of the Backbone features. Through integration with the CA mechanism, the structure emphasizes positional and directional information during feature injection. This focus on small object features within long-range dependencies enhances the model’s ability to detect small objects against intricate backgrounds. Additionally, the MPDIoU bounding box loss function, which offers better localization performance and faster convergence, further improves the overall detection performance of the model.
In the experiments conducted on the public VisDrone-DET dataset, this paper demonstrates the effectiveness of the proposed algorithm compared with the baseline model YOLOv8s. While the parameter count and model size are reduced by 26.1% and 24.7%, respectively, the AP50 and mAP metrics improve by 7.4% and 4.6%, respectively. Additionally, we validated the generalization of the algorithm on the DOTA and DIOR datasets. Experimental results indicate a notable improvement in the performance of ASM-YOLO on both datasets.
In the future, we plan to further explore the application of drones based on this foundation. Currently, the field of smart transportation still faces several challenges, including issues related to traffic flow prediction [50] and traffic data collection [51, 52]. We intend to leverage the unique advantages of drones, particularly from their aerial perspective, to assist in addressing these challenges and contribute to the development of a more intelligent transportation system. This research direction is poised to provide substantial support for the future advancements in smart transportation.
Footnotes
Acknowledgments
This work was supported in part by the projects of the National Natural Science Foundation of China (62376059), Fujian Provincial Department of Science and Technology (2021Y4019) and Fujian Provincial Audit Office (MSK2306).
